I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353
On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> Hi,
>
> We've been facing "No space left on device" errors from time to time
> lately. The job fails after retries, and obviously in such cases
> retrying won't help.
>
> Sure, the problem is in the datanodes, but I'm wondering whether the
> Spark driver can handle it and decommission the problematic datanode
> before retrying, and maybe dynamically allocate another datanode if
> dynamic allocation is enabled.
>
> I think there needs to be a class of fatal errors that can't be
> recovered from with retries, and it's best if Spark can handle them
> gracefully.
>
> Thanks,
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
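To make the proposal concrete, here is a minimal sketch of what a fatal-error classification with retry short-circuiting could look like. This is purely illustrative: `FATAL_PATTERNS`, `is_fatal`, and `run_with_retries` are hypothetical names, not Spark's actual scheduler API; the only pattern taken from the thread is "No space left on device".

```python
import re

# Hypothetical pattern list of error messages considered fatal, i.e.
# errors for which a retry on the same node cannot succeed. Only the
# "No space left on device" entry comes from the thread above.
FATAL_PATTERNS = [
    r"No space left on device",
]


def is_fatal(message: str) -> bool:
    """Return True if the error message matches a known-fatal pattern."""
    return any(re.search(p, message) for p in FATAL_PATTERNS)


def run_with_retries(task, max_retries=3):
    """Run a task with retries, but give up immediately on fatal errors."""
    last_error = None
    for _ in range(max_retries):
        try:
            return task()
        except IOError as e:
            if is_fatal(str(e)):
                # Fatal error: retrying cannot help, so re-raise at once
                # (in the proposal, this is where the driver would instead
                # decommission the bad datanode before rescheduling).
                raise
            last_error = e  # transient error: retry
    raise RuntimeError("task failed after %d retries" % max_retries) from last_error
```

The point of the sketch is the early `raise` on a fatal match: a disk-full task fails once instead of burning through the retry budget, which mirrors the behavior Jianshi is asking for.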