Yeah, did that already (65k). We also disabled swapping and reduced the memory allocated to Spark (available RAM minus 4 GB). This seems to have resolved the situation.
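In case it helps anyone else, the changes were roughly the following (a sketch, assuming a standalone cluster; the file locations and the 44g figure are illustrative, i.e. 48 GB per worker minus 4 GB):

    # /etc/security/limits.conf -- raise the open-file limit for the Spark user
    spark  soft  nofile  65536
    spark  hard  nofile  65536

    # disable swap on every worker node
    sudo swapoff -a

    # conf/spark-env.sh -- leave ~4 GB per worker to the OS
    SPARK_WORKER_MEMORY=44g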
Thanks!

> On 26.02.2015, at 05:43, Raghavendra Pandey <raghavendra.pan...@gmail.com> wrote:
>
> Can you try increasing the ulimit -n on your machine?
>
> On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier <mps....@gmail.com> wrote:
> Hi Sameer,
>
> I’m still using Spark 1.1.1; I think the default is hash shuffle. No external shuffle service.
>
> We are processing gzipped JSON files, so the number of partitions equals the number of input files. The current data set has ~850 files amounting to 60 GB compressed (~600 GB uncompressed). We have 5 workers with 8 cores and 48 GB RAM each. We extract five different groups of data from this to filter, clean, and denormalize (i.e. join) it for easier downstream processing.
>
> By the way, this code does not seem to complete at all without calling coalesce() with a low number; 5 or 10 work great. Anything above that makes a crash very likely, even on smaller datasets (~300 files). But I’m not sure whether this is related to the above issue.
>
>> On 23.02.2015, at 18:15, Sameer Farooqui <same...@databricks.com> wrote:
>>
>> Hi Marius,
>>
>> Are you using the sort or hash shuffle?
>>
>> Also, do you have the external shuffle service enabled (so that the Worker JVM or NodeManager can still serve the map spill files after an Executor crashes)?
>>
>> How many partitions are in your RDDs before and after the problematic shuffle operation?
>>
>> On Monday, February 23, 2015, Marius Soutier <mps....@gmail.com> wrote:
>> Hi guys,
>>
>> I keep running into a strange problem where my jobs start to fail with the dreaded "Resubmitted (resubmitted due to lost executor)" because of having too many temp files from previous runs.
>>
>> Both /var/run and /spill have enough disk space left, but after a certain number of jobs have run, subsequent jobs struggle to complete. There are a lot of failures without any exception message, only the above-mentioned lost executor. As soon as I clear out /var/run/spark/work/ and the spill disk, everything goes back to normal.
>>
>> Thanks for any hint,
>> - Marius
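P.S. For the archives: the coalesce() workaround mentioned above looks roughly like this (a minimal sketch; the input path and the downstream steps are placeholders, not the actual job):

    // ~850 gzipped JSON files => ~850 input partitions (gzip is not splittable)
    val raw = sc.textFile("hdfs:///data/events/*.json.gz")
    // collapsing to 5-10 partitions before the heavy shuffle stages kept the job stable
    val compacted = raw.coalesce(10)
    // ...then filter, clean, and denormalize (join) the five groups downstream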