There are two cases of "No space left on device" errors:

1. Some tasks that use a large amount of temp space cannot run on any node.
2. The free space across datanodes is not balanced. Some tasks that use a
large amount of temp space cannot run on a few nodes, but they run
successfully on others.

Because most of our cases are the second one, we set
"spark.scheduler.executorTaskBlacklistTime" to 30000 to work around such "No
space left on device" errors. With this setting, if a task fails on an
executor, it won't be scheduled on that same executor again for 30 seconds.


Best Regards,
Shixiong Zhu

2015-03-16 17:40 GMT+08:00 Jianshi Huang <jianshi.hu...@gmail.com>:

> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353
>
>
> On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang <jianshi.hu...@gmail.com>
> wrote:
>
>> Hi,
>>
>> We're facing "No space left on device" errors from time to time lately.
>> The job fails after retries. Obviously, in such cases, retrying won't
>> help.
>>
>> Sure, it's a problem with the datanodes, but I'm wondering if the Spark
>> Driver can handle it and decommission the problematic datanode before
>> retrying. And maybe dynamically allocate another datanode if dynamic
>> allocation is enabled.
>>
>> I think there needs to be a class of fatal errors that can't be recovered
>> by retrying. And it's best if Spark can handle them gracefully.
>>
>> Thanks,
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
