Intermittently on our Spark executors we are seeing blockmgr directories that are not cleaned up after execution, which is filling up the disk. These executors use Mesos dynamic resource allocation, and no single app seems to be the culprit. Sometimes an app will run and be cleaned up, and then on a subsequent run the same AppExecId will run and not be cleaned up. The runs that left folders behind did not show any obvious task failures in the Spark UI during that time frame.
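A quick way to spot the leftovers is something like the following (a sketch: /mnt matches the localDirs in our logs below, and the one-day cutoff is an arbitrary threshold for "left behind"):

    # List blockmgr directories under /mnt that are more than a day old.
    # Path and age threshold are assumptions for illustration.
    find /mnt -maxdepth 1 -type d -name 'blockmgr-*' -mtime +1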
The Spark shuffle service in the AMI is version 2.1.1; the jobs themselves run Spark 2.0.2 in the Mesos sandbox.

In a case where files are cleaned up, spark.log looks like the following:

    18/05/28 14:47:24 INFO ExternalShuffleBlockResolver: Registered executor AppExecId{appId=33d8fe79-a670-4277-b6f3-ee1049724204-8310, execId=95} with ExecutorShuffleInfo{localDirs=[/mnt/blockmgr-b2c7ff97-481e-4482-b9ca-92a5f8d4b25e], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
    ...
    18/05/29 02:54:09 INFO MesosExternalShuffleBlockHandler: Application 33d8fe79-a670-4277-b6f3-ee1049724204-8310 timed out. Removing shuffle files.
    18/05/29 02:54:09 INFO ExternalShuffleBlockResolver: Application 33d8fe79-a670-4277-b6f3-ee1049724204-8310 removed, cleanupLocalDirs = true

In the cases where files are not cleaned up, we never see the "MesosExternalShuffleBlockHandler: Application <appId> timed out. Removing shuffle files." line.

We start the job with "--conf spark.worker.cleanup.enabled=true", but I believe that setting only applies to standalone mode, and we are deploying on Mesos, so I don't think this flag actually does anything.

Thanks,
Jeff
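P.S. For completeness, a minimal sketch of the flags that should matter for shuffle cleanup on Mesos with dynamic allocation (this is not our exact invocation; the master URL and jar are placeholders):

    # Sketch only. spark.worker.cleanup.enabled is included to show the
    # flag we now believe is a no-op outside standalone mode; the other
    # two are the standard dynamic-allocation settings.
    spark-submit \
      --master mesos://zk://<mesos-master>:2181/mesos \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.worker.cleanup.enabled=true \
      <app.jar>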