[ https://issues.apache.org/jira/browse/SPARK-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692692#comment-14692692 ]
Pat Ferrel commented on SPARK-4796:
-----------------------------------

Why is this marked resolved? Spark does indeed leave around a lot of files, and unless you go looking you'd never know. It sounds like the only safe way to remove them is to shut down Spark and delete them. I skimmed the issue, so sorry if I missed something. 15G on the MBP and counting :-)

> Spark does not remove temp files
> --------------------------------
>
>                 Key: SPARK-4796
>                 URL: https://issues.apache.org/jira/browse/SPARK-4796
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.1.0
>         Environment: I'm running Spark on Mesos, and the Mesos slaves are
> Docker containers. Spark 1.1.0, elasticsearch spark 2.1.0-Beta3, Mesos
> 0.20.0, Docker 1.2.0.
>            Reporter: Ian Babrou
>
> I started a job that cannot fit into memory and got "no space left on
> device". That was fair, because the Docker containers only have 10 GB of
> disk space and some of it is already taken by the OS.
> But then I found out that when the job failed it didn't release any disk
> space, leaving the container without any free disk space at all.
> Then I decided to check whether Spark removes temp files in any case,
> because many Mesos slaves had /tmp/spark-local-* directories. Apparently
> some garbage stays behind after a Spark task is finished. I attached
> strace to a running job:
>
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/12/temp_8a73fcc2-4baa-499a-8add-0161f918de8a") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/31/temp_47efd04b-d427-4139-8f48-3d5d421e9be4") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/15/temp_619a46dc-40de-43f1-a844-4db146a607c6") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/05/temp_d97d90a7-8bc1-4742-ba9b-41d74ea73c36" <unfinished ...>
> [pid 30212] <... unlink resumed> ) = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/36/temp_a2deb806-714a-457a-90c8-5d9f3247a5d7") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/04/temp_afd558f1-2fd0-48d7-bc65-07b5f4455b22") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/32/temp_a7add910-8dc3-482c-baf5-09d5a187c62a" <unfinished ...>
> [pid 30212] <... unlink resumed> ) = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/21/temp_485612f0-527f-47b0-bb8b-6016f3b9ec19") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/12/temp_bb2b4e06-a9dd-408e-8395-f6c5f4e2d52f") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/1e/temp_825293c6-9d3b-4451-9cb8-91e2abe5a19d" <unfinished ...>
> [pid 30212] <... unlink resumed> ) = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/15/temp_43fbb94c-9163-4aa7-ab83-e7693b9f21fc") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/3d/temp_37f3629c-1b09-4907-b599-61b7df94b898" <unfinished ...>
> [pid 30212] <... unlink resumed> ) = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/35/temp_d18f49f6-1fb1-4c01-a694-0ee0a72294c0") = 0
>
> And after the job is finished, some files are still there:
>
> /tmp/spark-local-20141209091330-48b5/
> /tmp/spark-local-20141209091330-48b5/11
> /tmp/spark-local-20141209091330-48b5/11/shuffle_0_1_4
> /tmp/spark-local-20141209091330-48b5/32
> /tmp/spark-local-20141209091330-48b5/04
> /tmp/spark-local-20141209091330-48b5/05
> /tmp/spark-local-20141209091330-48b5/0f
> /tmp/spark-local-20141209091330-48b5/0f/shuffle_0_1_2
> /tmp/spark-local-20141209091330-48b5/3d
> /tmp/spark-local-20141209091330-48b5/0e
> /tmp/spark-local-20141209091330-48b5/0e/shuffle_0_1_1
> /tmp/spark-local-20141209091330-48b5/15
> /tmp/spark-local-20141209091330-48b5/0d
> /tmp/spark-local-20141209091330-48b5/0d/shuffle_0_1_0
> /tmp/spark-local-20141209091330-48b5/36
> /tmp/spark-local-20141209091330-48b5/31
> /tmp/spark-local-20141209091330-48b5/12
> /tmp/spark-local-20141209091330-48b5/21
> /tmp/spark-local-20141209091330-48b5/10
> /tmp/spark-local-20141209091330-48b5/10/shuffle_0_1_3
> /tmp/spark-local-20141209091330-48b5/1e
> /tmp/spark-local-20141209091330-48b5/35
>
> If I look at my Mesos slaves, the leftovers are mostly "shuffle" files.
> The overall picture for a single node:
>
> root@web338:~# find /tmp/spark-local-20141* -type f | fgrep shuffle | wc -l
> 781
> root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle | wc -l
> 10
> root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle
> /tmp/spark-local-20141119144512-67c4/2d/temp_9056f380-3edb-48d6-a7df-d4896f1e1cc3
> /tmp/spark-local-20141119144512-67c4/3d/temp_e005659b-eddf-4a34-947f-4f63fcddf111
> /tmp/spark-local-20141119144512-67c4/16/temp_71eba702-36b4-4e1a-aebc-20d2080f1705
> /tmp/spark-local-20141119144512-67c4/0d/temp_8037b9db-2d8a-4786-a554-a8cad922bf5e
> /tmp/spark-local-20141119144512-67c4/24/temp_f0e4cc43-6cc9-42a7-882d-f8a031fa4dc3
> /tmp/spark-local-20141119144512-67c4/29/temp_a8bbe2cb-f590-4b71-8ef8-9c0324beddc7
> /tmp/spark-local-20141119144512-67c4/3a/temp_9fc08519-f23a-40ac-a3fd-e58df6871460
> /tmp/spark-local-20141119144512-67c4/1e/temp_d66668ab-2999-48af-a136-84cfd6f5f6cb
> /tmp/spark-local-20141205110922-f78e/0a/temp_7409add5-e6ff-46e5-ae3f-6a4c7b2ddf8f
> /tmp/spark-local-20141205111026-0b53/01/temp_72024c94-7512-4692-8bd1-ef2417143d8c
>
> Conclusions:
> 1. Shuffle files should be removed, but they stay.
> 2. Temp files should always be removed, but they stay.
>
> Maybe we should unlink temp and shuffle files immediately after creation,
> so that they are removed even if Spark fails.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
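The "unlink immediately after creation" idea in the reporter's conclusions relies on standard POSIX semantics: unlinking a path only removes the directory entry, and the kernel reclaims the disk space once the last open file descriptor is closed, so a crashed process can never leak the file. A minimal standalone sketch of that pattern (an illustration of the suggestion, not Spark's actual code):

```python
import os
import tempfile

# Create a temp file, then unlink its path right away. The open file
# descriptor keeps the data alive and readable; the disk space is
# reclaimed automatically when the fd is closed or the process dies,
# even if it dies from a crash or SIGKILL.
fd, path = tempfile.mkstemp(prefix="temp_")
os.unlink(path)  # path is gone immediately; no cleanup pass needed later

with os.fdopen(fd, "w+b") as f:
    f.write(b"shuffle data")  # the fd still works after the unlink
    f.seek(0)
    data = f.read()
```

One trade-off of this approach: the file becomes invisible to other processes, so it only works for files that a single process (or processes inheriting the fd) needs, which is why long-lived shuffle files served to other executors cannot simply be unlinked this way.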