[ 
https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133887#comment-15133887
 ] 

Iulian Dragos commented on SPARK-12430:
---------------------------------------

The PR I linked to was just merged. Normally this should fix the race 
condition, so please give it a try.

Regarding the reasons behind moving the directory out of {{spark-id...}}, it's 
all in that comment, but here are some pointers:

- a well-behaved framework should only store things in the Mesos sandbox
- the Spark temporary directory is deleted on shutdown (using a shutdown hook, 
a VM-level callback), including everything underneath it, recursively
- when the external shuffle service is enabled, the shuffle files must not be 
deleted, even after the executor exits, because the (external) shuffle service 
reads them and serves them to other executors. The executor may exit early due 
to dynamic allocation, so the shuffle files are moved out of that directory
- when dynamic allocation is disabled, shuffle files are deleted as part of the 
standard shutdown procedure (*not* the VM-level shutdown hook). This part seems 
flaky, and it is what the PR I linked to fixes (Mesos apparently kills the 
executors if the driver exits first).
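The shutdown-hook behavior in the second point can be sketched roughly like 
this (a minimal JVM sketch, not Spark's actual code; the directory names are 
hypothetical stand-ins for the spark-<id> layout):

```scala
import java.io.File
import java.nio.file.Files

object ShutdownHookSketch {
  // Delete a directory and everything underneath it, recursively;
  // this is what happens to the whole temp directory at VM exit.
  def deleteRecursively(f: File): Unit = {
    if (f.isDirectory)
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    f.delete()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for a spark-<id> temp directory.
    val tmpDir = Files.createTempDirectory("spark-demo-").toFile
    new File(tmpDir, "blockmgr-demo").mkdir()
    // The hook fires on normal VM exit or SIGTERM; everything still
    // under tmpDir at that moment, shuffle files included, is deleted.
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = deleteRecursively(tmpDir)
    }))
    println(tmpDir.getAbsolutePath)
  }
}
```

This is why shuffle files that must outlive the executor cannot stay under 
that directory: the hook removes the whole tree unconditionally.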

We might move those files under {{spark-id}} when the external shuffle service 
is disabled, but it seemed simpler to put them in the same place all the time 
and delete or keep them depending on this flag.
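For reference, the flag in question corresponds to these settings in 
{{spark-defaults.conf}} (real Spark configuration keys; the values shown are 
just an example):

```
# Shuffle files outlive the executor only when the external shuffle
# service is on; with it off, they are removed at executor shutdown.
spark.shuffle.service.enabled    true
# Dynamic allocation is what lets executors exit early in the first place.
spark.dynamicAllocation.enabled  true
```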

> Temporary folders do not get deleted after Task completes causing problems 
> with disk space.
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12430
>                 URL: https://issues.apache.org/jira/browse/SPARK-12430
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.1, 1.5.2, 1.6.0
>         Environment: Ubuntu server
>            Reporter: Fede Bar
>
> We are experiencing an issue with automatic /tmp folder deletion after the 
> framework completes. Completing an M/R job using Spark 1.5.2 (same behavior 
> as Spark 1.5.1) over Mesos does not delete some temporary folders, causing 
> free disk space on the server to be exhausted. 
> Behavior of M/R job using Spark 1.4.1 over Mesos cluster:
> - Launched using spark-submit on one cluster node.
> - Following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/*  , 
>  */tmp/spark-#/blockmgr-#*
> - When the task is completed, */tmp/spark-#/* gets deleted along with its 
> */tmp/spark-#/blockmgr-#* sub-folder.
> Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job):
> - Launched using spark-submit on one cluster node.
> - Following folders are created: */tmp/mesos/mesos/slaves/id** * , 
> */tmp/spark-***/ *  ,{color:red} /tmp/blockmgr-***{color}
> - When the task is completed, */tmp/spark-***/ * gets deleted but NOT the 
> shuffle container folder {color:red} /tmp/blockmgr-***{color}
> Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several 
> GB, depending on the job that ran. Over time this causes the disk to fill 
> up, with consequences that we all know. 
> Running a cleanup shell script would probably work, but it is difficult to 
> tell folders in use by a running M/R job apart from stale ones. I did notice 
> similar issues opened by other users marked as "resolved", but none seems to 
> exactly match the behavior above. 
> I really hope someone has insights on how to fix it.
> Thank you very much!
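As a starting point for the cleanup-script idea mentioned in the report, here 
is a hedged Scala sketch (the object and method names are mine, not Spark's) 
that lists leftover blockmgr-* directories under /tmp with their sizes. 
Whether a directory still belongs to a live executor must be checked 
separately before deleting anything:

```scala
import java.io.File

object FindBlockmgrDirs {
  // Total size in bytes of a file or directory tree.
  def dirSize(f: File): Long =
    if (f.isFile) f.length()
    else Option(f.listFiles()).map(_.map(dirSize).sum).getOrElse(0L)

  // blockmgr-* directories directly under `root`, with their sizes.
  def blockmgrDirs(root: File): Seq[(File, Long)] =
    Option(root.listFiles()).getOrElse(Array.empty[File])
      .filter(d => d.isDirectory && d.getName.startsWith("blockmgr-"))
      .map(d => (d, dirSize(d)))
      .toSeq

  def main(args: Array[String]): Unit =
    blockmgrDirs(new File("/tmp")).foreach { case (d, size) =>
      println(s"${d.getAbsolutePath}\t$size bytes")
    }
}
```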



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
