Intermittently we are seeing blockmgr directories on Spark executors that are 
not cleaned up after execution, and they are filling up the disk.  These 
executors use Mesos dynamic resource allocation, and no single app using an 
executor seems to be the culprit.  Sometimes an app will run and be cleaned 
up, and then on a subsequent run that same AppExecId will run and not be 
cleaned up.  The runs that left folders behind did not have any obvious task 
failures in the Spark UI during that time frame.
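
For reference, the leftovers are blockmgr-* directories like the one in the 
log excerpt below; a check along these lines (the /mnt path is just an 
example layout, not necessarily our exact one) is how they show up piling 
up on the agents:

  find /mnt -maxdepth 1 -type d -name 'blockmgr-*' -mtime +1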

The Spark shuffle service in the AMI is version 2.1.1.
The code is running on Spark 2.0.2 in the Mesos sandbox.

In a case where the files are cleaned up, the spark.log looks like the following:
18/05/28 14:47:24 INFO ExternalShuffleBlockResolver: Registered executor 
AppExecId{appId=33d8fe79-a670-4277-b6f3-ee1049724204-8310, execId=95} with 
ExecutorShuffleInfo{localDirs=[/mnt/blockmgr-b2c7ff97-481e-4482-b9ca-92a5f8d4b25e],
 subDirsPerLocalDir=64, 
shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
...
18/05/29 02:54:09 INFO MesosExternalShuffleBlockHandler: Application 
33d8fe79-a670-4277-b6f3-ee1049724204-8310 timed out. Removing shuffle files.
18/05/29 02:54:09 INFO ExternalShuffleBlockResolver: Application 
33d8fe79-a670-4277-b6f3-ee1049724204-8310 removed, cleanupLocalDirs = true


In a case where the files are not cleaned up, we do not see the corresponding 
"MesosExternalShuffleBlockHandler: Application <appId> timed out. Removing 
shuffle files." line.

We are passing "--conf spark.worker.cleanup.enabled=true" when starting the 
job, but I believe that setting only applies to standalone mode, and since we 
are using the Mesos deployment mode I don't think this flag actually does 
anything.
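
For context, the jobs are launched with dynamic allocation and the external 
shuffle service enabled, roughly along these lines (master URL and app 
details elided; the exact values aren't the point):

  spark-submit \
    --master mesos://<master-url> \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.worker.cleanup.enabled=true \
    ...

If I'm reading the Mesos external shuffle service code right, the "timed out. 
Removing shuffle files." cleanup in the good case is driven by a periodic 
check (spark.shuffle.cleaner.interval, 30s by default) against the driver 
heartbeat, so in the bad case it looks like the service never decides the 
application has gone away.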


Thanks,
Jeff