Re: Handling fatal errors of executors and decommission datanodes

Jianshi Huang Mon, 16 Mar 2015 08:35:49 -0700

Thanks Shixiong!

Very strange that our tasks were retried on the same executor again and
again. I'll check spark.scheduler.executorTaskBlacklistTime.
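
For anyone following along, here is a minimal sketch of how I understand a
time-based blacklist of this kind to behave. It is a toy model in Python, not
Spark's scheduler code: the class and method names (ExecutorTaskBlacklist,
record_failure, is_blacklisted, pick_executor) and the task/executor ids are
made up for illustration, and I'm assuming the window is in milliseconds, as
the value 30000 for 30 seconds suggests.

```python
class ExecutorTaskBlacklist:
    """Toy model of a per-task, per-executor, time-based blacklist."""

    def __init__(self, blacklist_ms):
        # Length of the exclusion window in milliseconds (assumed unit).
        self.blacklist_ms = blacklist_ms
        # (task_id, executor_id) -> time of the most recent failure, in ms.
        self._last_failure = {}

    def record_failure(self, task_id, executor_id, now_ms):
        """Remember that task_id failed on executor_id at now_ms."""
        self._last_failure[(task_id, executor_id)] = now_ms

    def is_blacklisted(self, task_id, executor_id, now_ms):
        """True while the failure is more recent than the window."""
        failed_at = self._last_failure.get((task_id, executor_id))
        if failed_at is None:
            return False
        return now_ms - failed_at < self.blacklist_ms

    def pick_executor(self, task_id, executors, now_ms):
        """Return the first executor not excluded for this task, or None."""
        for executor_id in executors:
            if not self.is_blacklisted(task_id, executor_id, now_ms):
                return executor_id
        return None


# Hypothetical scenario: task-7 fails on exec-1 at t=0 with a 30 s window.
blacklist = ExecutorTaskBlacklist(blacklist_ms=30000)
blacklist.record_failure("task-7", "exec-1", now_ms=0)

# One second later the retry skips exec-1 and lands on exec-2.
first_retry = blacklist.pick_executor("task-7", ["exec-1", "exec-2"], now_ms=1000)

# Once the 30 s window has elapsed, exec-1 is eligible again.
later_retry = blacklist.pick_executor("task-7", ["exec-1", "exec-2"], now_ms=30000)
```

In this model a window of 0 never excludes anything, which would match the
repeated retries on one executor if the setting is unset. Note also that the
exclusion is per task: other tasks can still be scheduled on the executor
that failed.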


Jianshi

On Mon, Mar 16, 2015 at 6:02 PM, Shixiong Zhu <zsxw...@gmail.com> wrote:

> There are two cases for "No space left on device":
>
> 1. Some tasks that use a lot of temp space cannot run on any node.
> 2. The free space on the datanodes is not balanced. Some tasks that use a
> lot of temp space cannot run on several nodes, but they run successfully
> on other nodes.
>
> Because most of our cases are the second one, we set
> "spark.scheduler.executorTaskBlacklistTime" to 30000 (milliseconds) to
> work around such "No space left on device" errors. If a task fails on an
> executor, it won't be scheduled on that same executor for the next 30
> seconds.
>
>
> Best Regards,
> Shixiong Zhu
>
> 2015-03-16 17:40 GMT+08:00 Jianshi Huang <jianshi.hu...@gmail.com>:
>
>> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353
>>
>>
>> On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang <jianshi.hu...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We've been hitting "No space left on device" errors from time to time
>>> lately. The job fails after retries. Obviously, in such cases, retrying
>>> won't help.
>>>
>>> Sure, the problem is in the datanodes, but I'm wondering whether the Spark
>>> driver can handle it and decommission the problematic datanode before
>>> retrying, and maybe dynamically allocate another datanode if dynamic
>>> allocation is enabled.
>>>
>>> I think there needs to be a class of fatal errors that can't be
>>> recovered from with retries, and it would be best if Spark handled them
>>> gracefully.
>>>
>>> Thanks,
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>
>


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
