Re: No space left on device exception

2014-03-24 Thread Ognen Duzlevski
Another thing I have noticed is that out of my master + 15 slaves, two slaves always carry a higher inode load. For example, right now I am running an intensive job that takes about an hour to finish, and two slaves have been showing an increase in inode consumption (they are about 10% above the

Re: No space left on device exception

2014-03-24 Thread Ognen Duzlevski
Patrick, correct. I have a 16-node cluster. On 14 machines out of 16, the inode usage was about 50%. On two of the slaves, one had inode usage of 96% and on the other it was 100%. When I went into /tmp on these two nodes, there were a bunch of /tmp/spark* subdirectories, which I deleted. This r

Re: No space left on device exception

2014-03-23 Thread Patrick Wendell
Ognen - just so I understand. The issue is that there weren't enough inodes and this was causing a "No space left on device" error? Is that correct? If so, that's good to know because it's definitely counterintuitive. On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski wrote: > I would love to work

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
I would love to work on this (and other) stuff if I can bother someone with questions offline or on a dev mailing list. Ognen On 3/23/14, 10:04 PM, Aaron Davidson wrote: Thanks for bringing this up; 100% inode utilization is an issue I haven't seen raised before, and this raises another issue wh

Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
Thanks for bringing this up; 100% inode utilization is an issue I haven't seen raised before, and this raises another issue which is not on our current roadmap for state cleanup (cleaning up data which was not fully cleaned up from a crashed process). On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevs

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Bleh, strike that, one of my slaves was at 100% inode utilization on the file system. It was /tmp/spark* leftovers that apparently did not get cleaned up properly after failed or interrupted jobs. Mental note - run a cron job on all slaves and master to clean up /tmp/spark* regularly. Thanks (

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Aaron, thanks for replying. I am very much puzzled as to what is going on. A job that used to run on the same cluster is failing with this mysterious message about not having enough disk space when in fact I can see through "watch df -h" that the free space is always hovering around 3+GB on the

Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
By default, with P partitions (for both the pre-shuffle stage and post-shuffle), there are P^2 files created. With spark.shuffle.consolidateFiles turned on, we would instead create only P files. Disk space consumption is largely unaffected, however, by the number of partitions unless each partition
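
For reference, a minimal sketch of turning shuffle file consolidation on through SparkConf (available as of Spark 0.9); the application name and master URL below are placeholders, not values from this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Consolidate shuffle outputs so far fewer intermediate files are created
    // (roughly P instead of P^2, per Aaron's note), which eases inode pressure.
    val conf = new SparkConf()
      .setAppName("shuffle-consolidation-sketch") // placeholder app name
      .setMaster("spark://master:7077")           // placeholder master URL
      .set("spark.shuffle.consolidateFiles", "true")

    val sc = new SparkContext(conf)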

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
On 3/23/14, 5:49 PM, Matei Zaharia wrote: You can set spark.local.dir to put this data somewhere other than /tmp if /tmp is full. Actually it’s recommended to have multiple local disks and set it to a comma-separated list of directories, one per disk. Matei, does the number of tasks/partitions i

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
On 3/23/14, 5:35 PM, Aaron Davidson wrote: On some systems, /tmp/ is an in-memory tmpfs file system, with its own size limit. It's possible that this limit has been exceeded. You might try running the "df" command to check the free space of "/tmp" or root if tmp isn't listed. 3 GB also seems

Re: No space left on device exception

2014-03-23 Thread Matei Zaharia
You can set spark.local.dir to put this data somewhere other than /tmp if /tmp is full. Actually it’s recommended to have multiple local disks and set it to a comma-separated list of directories, one per disk. Matei On Mar 23, 2014, at 3:35 PM, Aaron Davidson wrote: > On some systems, /tmp/ i
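
A minimal sketch of what Matei describes, pointing spark.local.dir at a comma-separated list of directories via SparkConf; the /mnt paths, application name, and master URL are placeholders for whatever local disks and cluster the nodes actually have:

    import org.apache.spark.{SparkConf, SparkContext}

    // Put shuffle and spill data on dedicated local disks instead of /tmp,
    // one directory per physical disk, comma-separated.
    val conf = new SparkConf()
      .setAppName("local-dir-sketch")                     // placeholder app name
      .setMaster("spark://master:7077")                   // placeholder master URL
      .set("spark.local.dir", "/mnt/spark1,/mnt/spark2")  // placeholder paths

    val sc = new SparkContext(conf)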

Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
On some systems, /tmp/ is an in-memory tmpfs file system, with its own size limit. It's possible that this limit has been exceeded. You might try running the "df" command to check the free space of "/tmp" or root if tmp isn't listed. 3 GB also seems pretty low for the remaining free space of a disk

No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Hello, I have a weird error showing up when I run a job on my Spark cluster. The version of Spark is 0.9 and I have 3+ GB free on the disk when this error shows up. Any ideas what I should be looking for? [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed