Re: Executor lost with too many temp files

2015-02-26 Thread Marius Soutier
Yeah, did that already (65k). We also disabled swapping and reduced the amount
of memory allocated to Spark (available memory minus 4 GB). This seems to have
resolved the situation.
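
For reference, a rough sketch of that kind of setup on a worker node (standalone
cluster assumed; the values are illustrative, roughly 48 GB minus 4 GB):

    # disable swapping on each worker (persist via /etc/fstab or vm.swappiness)
    sudo swapoff -a

    # conf/spark-env.sh: leave roughly 4 GB to the OS on a 48 GB machine
    export SPARK_WORKER_MEMORY=44g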

Thanks!

 On 26.02.2015, at 05:43, Raghavendra Pandey raghavendra.pan...@gmail.com 
 wrote:
 
 Can you try increasing the ulimit -n on your machine?
 
 On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier mps@gmail.com wrote:
 Hi Sameer,
 
 I’m still using Spark 1.1.1; I think the default is hash shuffle. No external
 shuffle service.
 
 We are processing gzipped JSON files; the number of partitions equals the
 number of input files. In my current data set we have ~850 files that amount
 to 60 GB (so ~600 GB uncompressed). We have 5 workers with 8 cores and 48 GB
 RAM each. We extract five different groups of data from this to filter, clean,
 and denormalize (i.e. join) it for easier downstream processing.
 
 By the way, this code does not seem to complete at all without calling
 coalesce() with a low number; 5 or 10 work great. Everything above that makes
 a crash very likely, even on smaller datasets (~300 files). But I’m not sure
 if this is related to the above issue.
 
 
 On 23.02.2015, at 18:15, Sameer Farooqui same...@databricks.com wrote:
 
 Hi Marius,
 
 Are you using the sort or hash shuffle?
 
 Also, do you have the external shuffle service enabled (so that the Worker 
 JVM or NodeManager can still serve the map spill files after an Executor 
 crashes)?
 
 How many partitions are in your RDDs before and after the problematic 
 shuffle operation?
 
 
 
 On Monday, February 23, 2015, Marius Soutier mps@gmail.com wrote:
 Hi guys,
 
 I keep running into a strange problem where my jobs start to fail with the
 dreaded “Resubmitted (resubmitted due to lost executor)” because of having
 too many temp files from previous runs.
 
 Both /var/run and /spill have enough disk space left, but after a given
 number of jobs have run, subsequent jobs struggle to complete. There are a lot
 of failures without any exception message, only the above-mentioned lost
 executor. As soon as I clear out /var/run/spark/work/ and the spill disk,
 everything goes back to normal.
 
 Thanks for any hint,
 - Marius
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 For additional commands, e-mail: user-h...@spark.apache.org 
 
 
 



Re: Executor lost with too many temp files

2015-02-25 Thread Raghavendra Pandey
Can you try increasing the ulimit -n on your machine?
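
Something along these lines (illustrative; 'sparkuser' is a placeholder for
whatever user runs the Spark daemons, and how you persist the limit depends on
your distro):

    # check and raise the open-file limit for the current shell
    ulimit -n
    ulimit -n 65536

    # persist it, e.g. in /etc/security/limits.conf
    sparkuser  soft  nofile  65536
    sparkuser  hard  nofile  65536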

On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier mps@gmail.com wrote:

 Hi Sameer,

 I’m still using Spark 1.1.1; I think the default is hash shuffle. No
 external shuffle service.

 We are processing gzipped JSON files; the number of partitions equals the
 number of input files. In my current data set we have ~850 files that amount
 to 60 GB (so ~600 GB uncompressed). We have 5 workers with 8 cores and 48 GB
 RAM each. We extract five different groups of data from this to filter, clean,
 and denormalize (i.e. join) it for easier downstream processing.
 
 By the way, this code does not seem to complete at all without calling
 coalesce() with a low number; 5 or 10 work great. Everything above that makes
 a crash very likely, even on smaller datasets (~300 files). But I’m not sure
 if this is related to the above issue.


 On 23.02.2015, at 18:15, Sameer Farooqui same...@databricks.com wrote:

 Hi Marius,

 Are you using the sort or hash shuffle?

 Also, do you have the external shuffle service enabled (so that the Worker
 JVM or NodeManager can still serve the map spill files after an Executor
 crashes)?

 How many partitions are in your RDDs before and after the problematic
 shuffle operation?



 On Monday, February 23, 2015, Marius Soutier mps@gmail.com wrote:

 Hi guys,

 I keep running into a strange problem where my jobs start to fail with the
 dreaded “Resubmitted (resubmitted due to lost executor)” because of having
 too many temp files from previous runs.
 
 Both /var/run and /spill have enough disk space left, but after a given
 number of jobs have run, subsequent jobs struggle to complete. There are a lot
 of failures without any exception message, only the above-mentioned lost
 executor. As soon as I clear out /var/run/spark/work/ and the spill disk,
 everything goes back to normal.

 Thanks for any hint,
 - Marius


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Executor lost with too many temp files

2015-02-23 Thread Sameer Farooqui
Hi Marius,

Are you using the sort or hash shuffle?

Also, do you have the external shuffle service enabled (so that the Worker
JVM or NodeManager can still serve the map spill files after an Executor
crashes)?

How many partitions are in your RDDs before and after the problematic
shuffle operation?
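
If it helps, these are the settings I mean (spark-defaults.conf; property names
from memory, and the external shuffle service only ships from Spark 1.2 on):

    # hash is the default before Spark 1.2, sort from 1.2 on
    spark.shuffle.manager            sort
    # external shuffle service, available from Spark 1.2 on
    spark.shuffle.service.enabled    true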



On Monday, February 23, 2015, Marius Soutier mps@gmail.com wrote:

 Hi guys,

 I keep running into a strange problem where my jobs start to fail with the
 dreaded “Resubmitted (resubmitted due to lost executor)” because of having
 too many temp files from previous runs.
 
 Both /var/run and /spill have enough disk space left, but after a given
 number of jobs have run, subsequent jobs struggle to complete. There are a lot
 of failures without any exception message, only the above-mentioned lost
 executor. As soon as I clear out /var/run/spark/work/ and the spill disk,
 everything goes back to normal.

 Thanks for any hint,
 - Marius


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Executor lost with too many temp files

2015-02-23 Thread Marius Soutier
Hi Sameer,

I’m still using Spark 1.1.1; I think the default is hash shuffle. No external
shuffle service.

We are processing gzipped JSON files; the number of partitions equals the number
of input files. In my current data set we have ~850 files that amount to 60 GB
(so ~600 GB uncompressed). We have 5 workers with 8 cores and 48 GB RAM each. We
extract five different groups of data from this to filter, clean, and
denormalize (i.e. join) it for easier downstream processing.

By the way, this code does not seem to complete at all without calling
coalesce() with a low number; 5 or 10 work great. Everything above that makes a
crash very likely, even on smaller datasets (~300 files). But I’m not sure if
this is related to the above issue.
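
Roughly what the read side looks like (a simplified sketch; the path and app
name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("denormalize"))

    // gzip is not splittable, so this yields one partition per input file (~850 here)
    val raw = sc.textFile("hdfs:///data/input/*.json.gz")

    // without coalescing to a handful of partitions the downstream joins don't
    // complete; 5-10 work, higher values tend to crash
    val coalesced = raw.coalesce(10)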


 On 23.02.2015, at 18:15, Sameer Farooqui same...@databricks.com wrote:
 
 Hi Marius,
 
 Are you using the sort or hash shuffle?
 
 Also, do you have the external shuffle service enabled (so that the Worker 
 JVM or NodeManager can still serve the map spill files after an Executor 
 crashes)?
 
 How many partitions are in your RDDs before and after the problematic shuffle 
 operation?
 
 
 
 On Monday, February 23, 2015, Marius Soutier mps@gmail.com wrote:
 Hi guys,
 
 I keep running into a strange problem where my jobs start to fail with the
 dreaded “Resubmitted (resubmitted due to lost executor)” because of having
 too many temp files from previous runs.
 
 Both /var/run and /spill have enough disk space left, but after a given
 number of jobs have run, subsequent jobs struggle to complete. There are a lot
 of failures without any exception message, only the above-mentioned lost
 executor. As soon as I clear out /var/run/spark/work/ and the spill disk,
 everything goes back to normal.
 
 Thanks for any hint,
 - Marius
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 



Executor lost with too many temp files

2015-02-23 Thread Marius Soutier
Hi guys,

I keep running into a strange problem where my jobs start to fail with the
dreaded “Resubmitted (resubmitted due to lost executor)” because of having too
many temp files from previous runs.

Both /var/run and /spill have enough disk space left, but after a given number
of jobs have run, subsequent jobs struggle to complete. There are a lot of
failures without any exception message, only the above-mentioned lost executor.
As soon as I clear out /var/run/spark/work/ and the spill disk, everything goes
back to normal.
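
(Side note: as far as I understand the docs, the standalone Worker has its own
cleanup settings for old application directories under work/; they go into
SPARK_WORKER_OPTS in conf/spark-env.sh on each worker, roughly like below. I
haven't verified that they would cover this case.)

    # conf/spark-env.sh; interval and TTL are in seconds and only illustrative
    export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
      -Dspark.worker.cleanup.interval=1800 \
      -Dspark.worker.cleanup.appDataTtl=86400"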

Thanks for any hint,
- Marius


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org