Yeah, we did that already (raised it to 65k). We also disabled swapping and reduced 
the amount of memory allocated to Spark (available minus 4 GB). This seems to have 
resolved the situation.
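
For reference, roughly what we changed on the workers besides the ulimit (a sketch 
from memory for our standalone setup with 48 GB nodes, adjust to your own):

    # disable swapping on the worker nodes
    sudo swapoff -a

    # conf/spark-env.sh: leave ~4 GB of the 48 GB to the OS and other daemons
    export SPARK_WORKER_MEMORY=44g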

Thanks!

> On 26.02.2015, at 05:43, Raghavendra Pandey <raghavendra.pan...@gmail.com> 
> wrote:
> 
> Can you try increasing the ulimit -n on your machine?
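> 
> Something like the following; the user name is just a placeholder for whatever 
> account runs the workers/executors:
> 
>     # check the current limit
>     ulimit -n
> 
>     # raise it permanently in /etc/security/limits.conf (re-login required)
>     sparkuser  soft  nofile  65536
>     sparkuser  hard  nofile  65536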
> 
> On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier <mps....@gmail.com> wrote:
> Hi Sameer,
> 
> I’m still using Spark 1.1.1; I believe the default there is the hash shuffle. No 
> external shuffle service.
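> 
> If I read the 1.1 docs correctly, switching would be something like the following 
> in conf/spark-defaults.conf (untested on our side):
> 
>     # use the sort-based shuffle instead of the hash shuffle
>     spark.shuffle.manager            sort
>     # or keep the hash shuffle but merge its intermediate files
>     spark.shuffle.consolidateFiles   true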
> 
> We are processing gzipped JSON files, so the number of partitions equals the 
> number of input files. The current data set has ~850 files amounting to 60 GB 
> compressed (~600 GB uncompressed). We have 5 workers with 8 cores and 48 GB RAM 
> each. From this we extract five different groups of data, which we filter, clean, 
> and denormalize (i.e. join) for easier downstream processing.
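> 
> The reading side is roughly this (a sketch; the path is made up, and sc is the 
> SparkContext from spark-shell):
> 
>     // gzip files are not splittable, so textFile yields one partition per
>     // input file, ~850 in our case
>     val raw = sc.textFile("hdfs:///data/events/*.json.gz")
>     println(raw.partitions.size)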
> 
> By the way, this code does not seem to complete at all without calling coalesce() 
> with a low partition count; 5 or 10 work great. Anything above that makes a crash 
> very likely, even on smaller data sets (~300 files). But I’m not sure whether this 
> is related to the above issue.
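> 
> Concretely, something like this ("events" is just a stand-in for the parsed RDD):
> 
>     // merging down to a handful of partitions before the joins is the only
>     // thing that reliably completes; coalesce without shuffle just merges
>     // existing partitions instead of reshuffling the data
>     val small = events.coalesce(10)
>     println(small.partitions.size)   // 10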
> 
> 
>> On 23.02.2015, at 18:15, Sameer Farooqui <same...@databricks.com> wrote:
>> 
>> Hi Marius,
>> 
>> Are you using the sort or hash shuffle?
>> 
>> Also, do you have the external shuffle service enabled (so that the Worker 
>> JVM or NodeManager can still serve the map spill files after an Executor 
>> crashes)?
>> 
>> How many partitions are in your RDDs before and after the problematic 
>> shuffle operation?
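>> 
>> You can check all three from the driver / spark-shell, roughly like this (the 
>> second arguments are only fallbacks for unset keys, the shuffle-service key only 
>> exists on newer versions, and someRdd is a placeholder):
>> 
>>     sc.getConf.get("spark.shuffle.manager", "hash")               // sort or hash?
>>     sc.getConf.getBoolean("spark.shuffle.service.enabled", false) // external shuffle service?
>>     someRdd.partitions.size                                       // partitions before/after the shuffle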
>> 
>> 
>> 
>> On Monday, February 23, 2015, Marius Soutier <mps....@gmail.com> wrote:
>> Hi guys,
>> 
>> I keep running into a strange problem where my jobs start to fail with the 
>> dreaded "Resubmitted (resubmitted due to lost executor)" error, apparently 
>> because too many temp files have accumulated from previous runs.
>> 
>> Both /var/run and /spill have enough disk space left, but after a certain number 
>> of jobs have run, subsequent jobs struggle to complete. There are a lot of 
>> failures without any exception message, only the above-mentioned lost executor. 
>> As soon as I clear out /var/run/spark/work/ and the spill disk, everything goes 
>> back to normal.
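>> 
>> In case it's relevant: the standalone docs mention worker cleanup settings along 
>> these lines, but I haven't tried them yet, so this is just a sketch:
>> 
>>     # conf/spark-env.sh on the workers: periodically delete application
>>     # directories of finished jobs under the worker's work dir
>>     export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
>>       -Dspark.worker.cleanup.appDataTtl=604800"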
>> 
>> Thanks for any hint,
>> - Marius
>> 
>> 
> 
> 
