Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Shixiong Zhu
There are two cases of "No space left on device" errors:

1. Some tasks that use a lot of temp space cannot run on any node.
2. The free space of the datanodes is not balanced. Some tasks that use a
lot of temp space cannot run on several nodes, but they can run on other
nodes successfully.

Because most of our cases are the second one, we set
spark.scheduler.executorTaskBlacklistTime to 30000 to solve such "No
space left on device" errors. With that setting, if a task fails on some
executor, it won't be scheduled to the same executor again for 30 seconds.
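To make the behavior concrete, here is an illustrative sketch (not Spark's actual scheduler code; the class and method names are invented for illustration) of how a per-executor task blacklist with a 30000 ms window works:

```python
# Illustrative model of spark.scheduler.executorTaskBlacklistTime:
# after a task fails on an executor, that (task, executor) pair is
# skipped until the blacklist window (in milliseconds) has elapsed.
BLACKLIST_TIME_MS = 30000  # 0 would disable blacklisting entirely

class TaskBlacklist:
    def __init__(self, blacklist_time_ms=BLACKLIST_TIME_MS):
        self.blacklist_time_ms = blacklist_time_ms
        # (task_id, executor_id) -> time of last failure, in ms
        self.failures = {}

    def record_failure(self, task_id, executor_id, now_ms):
        self.failures[(task_id, executor_id)] = now_ms

    def is_blacklisted(self, task_id, executor_id, now_ms):
        failed_at = self.failures.get((task_id, executor_id))
        if failed_at is None:
            return False
        return now_ms - failed_at < self.blacklist_time_ms

bl = TaskBlacklist()
bl.record_failure("task-1", "executor-3", now_ms=0)
print(bl.is_blacklisted("task-1", "executor-3", now_ms=10_000))  # True
print(bl.is_blacklisted("task-1", "executor-3", now_ms=40_000))  # False
print(bl.is_blacklisted("task-1", "executor-7", now_ms=10_000))  # False
```

Note this blacklists per (task, executor) pair only for the window, which matches the second case above: the retried task simply gets a chance on a different, less-full node.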


Best Regards,
Shixiong Zhu

2015-03-16 17:40 GMT+08:00 Jianshi Huang jianshi.hu...@gmail.com:

 I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353


 On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hi,

 We're facing "No space left on device" errors from time to time lately.
 The job fails after retries. Obviously, in such cases, retrying won't
 help.

 Sure, it's a problem in the datanodes, but I'm wondering if the Spark
 driver can handle it and decommission the problematic datanode before
 retrying, and maybe dynamically allocate another datanode if dynamic
 allocation is enabled.

 I think there needs to be a class of fatal errors that can't be recovered
 by retries, and it's best if Spark can handle them gracefully.

 Thanks,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/



Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Thanks Shixiong!

Very strange that our tasks were retried on the same executor again and
again. I'll check spark.scheduler.executorTaskBlacklistTime.

Jianshi
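For reference, the setting can be passed per job on the spark-submit command line; a sketch (the application class and JAR names here are placeholders):

```shell
# Keep a failed task off the same executor for 30 s (30000 ms).
# --class and the JAR name are placeholders for illustration.
spark-submit \
  --class com.example.MyApp \
  --conf spark.scheduler.executorTaskBlacklistTime=30000 \
  my-app.jar
```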


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Oh, by default it's set to 0L.

I'll try setting it to 30000 immediately. Thanks for the help!

Jianshi
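If the setting should apply to every job, it could equally go in conf/spark-defaults.conf (a sketch; property name as discussed in this thread):

```
# conf/spark-defaults.conf
# 0 (the default) disables blacklisting; 30000 ms keeps a failed task
# off the same executor for 30 seconds.
spark.scheduler.executorTaskBlacklistTime   30000
```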


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/