Can you try increasing the ulimit -n on your machine?
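For example, something along these lines (a sketch only; the right place to set the limit depends on your OS and on how the workers are launched, and the user name below is a placeholder):

    # check the current per-process open-file limit
    ulimit -n

    # raise it for the shell that launches the worker
    ulimit -n 65536

    # or persist it for the user that runs Spark, e.g. in
    # /etc/security/limits.conf:
    #   spark  soft  nofile  65536
    #   spark  hard  nofile  65536

With hash shuffle, each map task can open one file per reduce partition, so the open-file count grows quickly with partition counts; raising nofile is the usual first step.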
On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier <mps....@gmail.com> wrote:

> Hi Sameer,
>
> I’m still using Spark 1.1.1; I think the default is hash shuffle. No
> external shuffle service.
>
> We are processing gzipped JSON files, and the number of partitions equals
> the number of input files. In my current data set we have ~850 files that
> amount to 60 GB (so ~600 GB uncompressed). We have 5 workers with 8 cores
> and 48 GB RAM each. We extract five different groups of data from this to
> filter, clean, and denormalize (i.e. join) it for easier downstream
> processing.
>
> By the way, this code does not seem to complete at all without using
> coalesce() at a low number; 5 or 10 work great. Everything above that makes
> it very likely to crash, even on smaller datasets (~300 files). But I’m not
> sure if this is related to the above issue.
>
> On 23.02.2015, at 18:15, Sameer Farooqui <same...@databricks.com> wrote:
>
> Hi Marius,
>
> Are you using the sort or hash shuffle?
>
> Also, do you have the external shuffle service enabled (so that the Worker
> JVM or NodeManager can still serve the map spill files after an Executor
> crashes)?
>
> How many partitions are in your RDDs before and after the problematic
> shuffle operation?
>
> On Monday, February 23, 2015, Marius Soutier <mps....@gmail.com> wrote:
>
>> Hi guys,
>>
>> I keep running into a strange problem where my jobs start to fail with
>> the dreaded "Resubmitted (resubmitted due to lost executor)" because of
>> having too many temp files from previous runs.
>>
>> Both /var/run and /spill have enough disk space left, but after a given
>> number of jobs have run, subsequent jobs struggle to complete. There are
>> a lot of failures without any exception message, only the above-mentioned
>> lost executor. As soon as I clear out /var/run/spark/work/ and the spill
>> disk, everything goes back to normal.
>>
>> Thanks for any hint,
>> - Marius
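(For reference, the coalesce() workaround Marius describes would look roughly like the sketch below. The paths, the app name, and the output step are placeholders; only the low partition count, 5-10, comes from the thread. Fewer partitions mean fewer shuffle files and thus fewer open file descriptors per executor.)

    import org.apache.spark.{SparkConf, SparkContext}

    object Denormalize {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("denormalize"))

        // gzip is not splittable, so ~850 .json.gz inputs yield ~850 partitions
        val raw = sc.textFile("hdfs:///input/*.json.gz")

        // collapsing to a handful of partitions keeps the number of shuffle
        // files (and open file descriptors) low; 5 or 10 reportedly works here
        val cleaned = raw.coalesce(10)

        cleaned.saveAsTextFile("hdfs:///output/cleaned")
        sc.stop()
      }
    }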