Hi Sameer,

I’m still using Spark 1.1.1, where I believe the hash shuffle is the default. 
There’s no external shuffle service.
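
For reference, I believe switching to the sort-based shuffle on 1.1.x would 
look something like this (untested sketch; the app name is just a placeholder). 
As far as I know the external shuffle service only arrived in later releases, 
so it isn’t an option on 1.1.1 anyway:

    import org.apache.spark.{SparkConf, SparkContext}

    // Untested sketch: select the sort-based shuffle instead of the
    // hash-based one that is the default in 1.1.x.
    val conf = new SparkConf()
      .setAppName("denormalize-json")  // placeholder app name
      .set("spark.shuffle.manager", "sort")
    val sc = new SparkContext(conf)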

We are processing gzipped JSON files; since gzip isn’t splittable, the number 
of partitions equals the number of input files. In my current data set we have 
~850 files amounting to 60 GB compressed (so ~600 GB uncompressed). We have 5 
workers with 8 cores and 48 GB RAM each. We extract five different groups of 
data from this to filter, clean, and denormalize (i.e. join) it for easier 
downstream processing.
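
Roughly, the input stage looks like this (the path is a placeholder, and sc is 
the context from the snippet above):

    // One partition per gzipped file, since gzip cannot be split.
    val raw = sc.textFile("hdfs:///path/to/input/*.json.gz")  // placeholder path
    println(raw.partitions.size)  // ~850 for the current data set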

By the way, this code does not seem to complete at all unless I call 
coalesce() with a low partition count; 5 or 10 work great. Anything above that 
makes a crash very likely, even on smaller datasets (~300 files). But I’m not 
sure if this is related to the above issue.
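
The shape of the job is roughly this (illustrative only; the filter stands in 
for the real cleaning logic, and raw is the RDD from the snippet above):

    // Stand-in for the real filter/clean/denormalize steps.
    val cleaned = raw.filter(_.nonEmpty)

    // Without this coalesce the job rarely completes; 5 or 10 work.
    // coalesce(10) narrows partitions without a shuffle, whereas
    // repartition(10) / coalesce(10, shuffle = true) would reshuffle.
    cleaned.coalesce(10).saveAsTextFile("hdfs:///path/to/output")  // placeholder path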


> On 23.02.2015, at 18:15, Sameer Farooqui <same...@databricks.com> wrote:
> 
> Hi Marius,
> 
> Are you using the sort or hash shuffle?
> 
> Also, do you have the external shuffle service enabled (so that the Worker 
> JVM or NodeManager can still serve the map spill files after an Executor 
> crashes)?
> 
> How many partitions are in your RDDs before and after the problematic shuffle 
> operation?
> 
> 
> 
> On Monday, February 23, 2015, Marius Soutier <mps....@gmail.com> wrote:
> Hi guys,
> 
> I keep running into a strange problem where my jobs start to fail with the 
> dreaded “Resubmitted (resubmitted due to lost executor)” because of having 
> too many temp files from previous runs.
> 
> Both /var/run and /spill have enough disk space left, but after a certain 
> number of jobs have run, subsequent jobs struggle to complete. There are a 
> lot of failures without any exception message, only the above-mentioned lost 
> executor. As soon as I clear out /var/run/spark/work/ and the spill disk, 
> everything goes back to normal.
> 
> Thanks for any hint,
> - Marius
> 
> 
