Re: Executor lost with too many temp files

2015-02-26 Thread Marius Soutier
Yeah, did that already (65k). We also disabled swapping and reduced the amount
of memory allocated to Spark (available - 4). This seems to have resolved the
situation.

Thanks!
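
For context, a minimal sketch of the application-side half of such a setup. The
figures are assumptions based on the 48 GB workers mentioned below, not the exact
values used, and the ulimit and swap changes are OS-level steps done on each
worker, outside Spark:

    import org.apache.spark.SparkConf

    // Hypothetical application config: with 48 GB per worker and "available minus 4",
    // roughly 44g would be left for Spark. The OS-level part of the fix
    // (e.g. `ulimit -n 65536` and disabling swap) happens on the workers, not here.
    val conf = new SparkConf()
      .setAppName("my-job")                 // placeholder name
      .set("spark.executor.memory", "44g")  // leave ~4 GB headroom for the OS and daemons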

> On 26.02.2015, at 05:43, Raghavendra Pandey wrote:
> 
> Can you try increasing ulimit -n on your machine?
> 
> On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier wrote:
> Hi Sameer,
> 
> I’m still using Spark 1.1.1; I think the default is hash shuffle. No external
> shuffle service.
> 
> We are processing gzipped JSON files; the number of partitions equals the
> number of input files. In my current data set we have ~850 files that amount
> to 60 GB (so ~600 GB uncompressed). We have 5 workers with 8 cores and 48 GB
> RAM each. We extract five different groups of data from this to filter, clean
> and denormalize (i.e. join) it for easier downstream processing.
> 
> By the way, this code does not seem to complete at all without using
> coalesce() at a low number; 5 or 10 work great. Anything above that makes it
> very likely to crash, even on smaller data sets (~300 files). But I’m not
> sure if this is related to the above issue.
> 
> 
>> On 23.02.2015, at 18:15, Sameer Farooqui wrote:
>> 
>> Hi Marius,
>> 
>> Are you using the sort or hash shuffle?
>> 
>> Also, do you have the external shuffle service enabled (so that the Worker 
>> JVM or NodeManager can still serve the map spill files after an Executor 
>> crashes)?
>> 
>> How many partitions are in your RDDs before and after the problematic 
>> shuffle operation?
>> 
>> 
>> 
>> On Monday, February 23, 2015, Marius Soutier wrote:
>> Hi guys,
>> 
>> I keep running into a strange problem where my jobs start to fail with the
>> dreaded “Resubmitted (resubmitted due to lost executor)” because of having
>> too many temp files from previous runs.
>> 
>> Both /var/run and /spill have enough disk space left, but after a given
>> number of jobs have run, subsequent jobs struggle to complete. There are a
>> lot of failures without any exception message, only the above-mentioned lost
>> executor. As soon as I clear out /var/run/spark/work/ and the spill disk,
>> everything goes back to normal.
>> 
>> Thanks for any hint,
>> - Marius
>> 
>> 
>> 
> 
> 



Re: Executor lost with too many temp files

2015-02-25 Thread Raghavendra Pandey
Can you try increasing ulimit -n on your machine?
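
A rough back-of-the-envelope of why the open-file limit matters here, assuming
Spark 1.1's hash shuffle without consolidation (which writes one file per map
task per reduce task) and the ~850-partition job described below:

    // Assumed figures taken from the job described in this thread.
    val mapTasks     = 850                        // one partition per gzipped input file
    val reduceTasks  = 850                        // by default a shuffle keeps the parent's partition count
    val coresPerNode = 8

    val filesOnDisk = mapTasks * reduceTasks      // ~722,500 shuffle files per shuffle
    val openPerNode = coresPerNode * reduceTasks  // ~6,800 file handles held open while map tasks run

    // Coalescing to ~10 partitions before the shuffle cuts both numbers by a factor
    // of ~85 or more, which matches the observation in this thread that low
    // coalesce() values let the job finish.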

On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier  wrote:

> Hi Sameer,
>
> I’m still using Spark 1.1.1; I think the default is hash shuffle. No
> external shuffle service.
>
> We are processing gzipped JSON files; the number of partitions equals the
> number of input files. In my current data set we have ~850 files that amount
> to 60 GB (so ~600 GB uncompressed). We have 5 workers with 8 cores and 48 GB
> RAM each. We extract five different groups of data from this to filter,
> clean and denormalize (i.e. join) it for easier downstream processing.
>
> By the way, this code does not seem to complete at all without using
> coalesce() at a low number; 5 or 10 work great. Anything above that makes it
> very likely to crash, even on smaller data sets (~300 files). But I’m not
> sure if this is related to the above issue.
>
>
> On 23.02.2015, at 18:15, Sameer Farooqui  wrote:
>
> Hi Marius,
>
> Are you using the sort or hash shuffle?
>
> Also, do you have the external shuffle service enabled (so that the Worker
> JVM or NodeManager can still serve the map spill files after an Executor
> crashes)?
>
> How many partitions are in your RDDs before and after the problematic
> shuffle operation?
>
>
>
> On Monday, February 23, 2015, Marius Soutier  wrote:
>
>> Hi guys,
>>
>> I keep running into a strange problem where my jobs start to fail with the
>> dreaded “Resubmitted (resubmitted due to lost executor)” because of having
>> too many temp files from previous runs.
>>
>> Both /var/run and /spill have enough disk space left, but after a given
>> number of jobs have run, subsequent jobs struggle to complete. There are a
>> lot of failures without any exception message, only the above-mentioned lost
>> executor. As soon as I clear out /var/run/spark/work/ and the spill disk,
>> everything goes back to normal.
>>
>> Thanks for any hint,
>> - Marius
>>
>>
>>
>>
>


Re: Executor lost with too many temp files

2015-02-23 Thread Marius Soutier
Hi Sameer,

I’m still using Spark 1.1.1; I think the default is hash shuffle. No external
shuffle service.

We are processing gzipped JSON files; the number of partitions equals the number
of input files. In my current data set we have ~850 files that amount to 60 GB
(so ~600 GB uncompressed). We have 5 workers with 8 cores and 48 GB RAM each. We
extract five different groups of data from this to filter, clean and denormalize
(i.e. join) it for easier downstream processing.

By the way, this code does not seem to complete at all without using coalesce()
at a low number; 5 or 10 work great. Anything above that makes it very likely to
crash, even on smaller data sets (~300 files). But I’m not sure if this is
related to the above issue.
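
For context, a minimal sketch of the kind of job described above. All paths,
field names, and the key-extraction logic are made up for illustration, and
where exactly the real job calls coalesce() is not stated in the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD operations such as join (Spark 1.1)

    object DenormalizeSketch {
      // Placeholder key extraction; the real job presumably parses the JSON properly.
      def key(line: String): String = line.take(16)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("denormalize-sketch"))

        // Gzipped files are not splittable, so this yields one partition per input
        // file (~850 for the data set described above). Coalescing early keeps the
        // later shuffle from producing hundreds of thousands of intermediate files.
        val raw = sc.textFile("hdfs:///data/input/*.json.gz").coalesce(10)

        // Two of the hypothetical "groups" extracted from the same input.
        val groupA = raw.filter(_.contains("\"type\":\"a\"")).map(l => (key(l), l))
        val groupB = raw.filter(_.contains("\"type\":\"b\"")).map(l => (key(l), l))

        // Denormalize by joining the groups on a shared key.
        groupA.join(groupB).saveAsTextFile("hdfs:///data/output/denormalized")
      }
    }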


> On 23.02.2015, at 18:15, Sameer Farooqui  wrote:
> 
> Hi Marius,
> 
> Are you using the sort or hash shuffle?
> 
> Also, do you have the external shuffle service enabled (so that the Worker 
> JVM or NodeManager can still serve the map spill files after an Executor 
> crashes)?
> 
> How many partitions are in your RDDs before and after the problematic shuffle 
> operation?
> 
> 
> 
> On Monday, February 23, 2015, Marius Soutier wrote:
> Hi guys,
> 
> I keep running into a strange problem where my jobs start to fail with the
> dreaded “Resubmitted (resubmitted due to lost executor)” because of having
> too many temp files from previous runs.
> 
> Both /var/run and /spill have enough disk space left, but after a given
> number of jobs have run, subsequent jobs struggle to complete. There are a
> lot of failures without any exception message, only the above-mentioned lost
> executor. As soon as I clear out /var/run/spark/work/ and the spill disk,
> everything goes back to normal.
> 
> Thanks for any hint,
> - Marius
> 
> 
> 



Re: Executor lost with too many temp files

2015-02-23 Thread Sameer Farooqui
Hi Marius,

Are you using the sort or hash shuffle?

Also, do you have the external shuffle service enabled (so that the Worker
JVM or NodeManager can still serve the map spill files after an Executor
crashes)?

How many partitions are in your RDDs before and after the problematic
shuffle operation?
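
For readers of the archive, these are the configuration knobs behind those
questions on the 1.x line; the values shown are examples only, not settings
taken from this thread:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Spark 1.1 defaults to the hash shuffle ("HASH"); the sort-based shuffle
      // writes far fewer files per map task and became the default in 1.2.
      .set("spark.shuffle.manager", "SORT")
      // Hash shuffle only: consolidate map outputs into one file per reducer per core.
      .set("spark.shuffle.consolidateFiles", "true")
      // External shuffle service, so map output survives an executor crash.
      // Note: this property only exists from Spark 1.2 onward, so it does not
      // apply to the 1.1.1 cluster discussed in this thread.
      .set("spark.shuffle.service.enabled", "true")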



On Monday, February 23, 2015, Marius Soutier  wrote:

> Hi guys,
>
> I keep running into a strange problem where my jobs start to fail with the
> dreaded “Resubmitted (resubmitted due to lost executor)” because of having
> too many temp files from previous runs.
>
> Both /var/run and /spill have enough disk space left, but after a given
> number of jobs have run, subsequent jobs struggle to complete. There are a
> lot of failures without any exception message, only the above-mentioned lost
> executor. As soon as I clear out /var/run/spark/work/ and the spill disk,
> everything goes back to normal.
>
> Thanks for any hint,
> - Marius
>
>
>
>


Executor lost with too many temp files

2015-02-23 Thread Marius Soutier
Hi guys,

I keep running into a strange problem where my jobs start to fail with the
dreaded “Resubmitted (resubmitted due to lost executor)” because of having too
many temp files from previous runs.

Both /var/run and /spill have enough disk space left, but after a given number
of jobs have run, subsequent jobs struggle to complete. There are a lot of
failures without any exception message, only the above-mentioned lost executor.
As soon as I clear out /var/run/spark/work/ and the spill disk, everything goes
back to normal.

Thanks for any hint,
- Marius
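
For anyone hitting the same build-up on a standalone cluster, one possible
mitigation is the worker's periodic cleanup. This is only a sketch: these are
worker-side properties, normally passed via SPARK_WORKER_OPTS in
conf/spark-env.sh rather than set in application code, and the interval/TTL
values shown are simply the defaults:

    // Periodic cleanup of old application directories under the worker's work
    // directory (here /var/run/spark/work). Shuffle/spill files under
    // spark.local.dir are a separate matter: they are removed when an application
    // exits cleanly, but can be left behind when executors crash.
    val sparkWorkerOpts = Seq(
      "-Dspark.worker.cleanup.enabled=true",      // off by default
      "-Dspark.worker.cleanup.interval=1800",     // run the cleaner every 30 minutes
      "-Dspark.worker.cleanup.appDataTtl=604800"  // remove data of apps stopped > 7 days ago
    ).mkString(" ")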


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org