Chawla,

We hit this issue, too. I worked around it by setting
spark.scheduler.executorTaskBlacklistTime=5000. The problem for us was that
the scheduler was using locality to pick the same executor, even though the
task had already failed there. The executor task blacklist time controls how
long the scheduler avoids an executor for a task that has failed on it, so
retries get scheduled elsewhere. The default was 0, which put the failed
executor back into consideration immediately.
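
For reference, here is roughly how we set it when building the SparkConf
(the app name is just a placeholder); the same property can also be passed
with --conf on spark-submit:

  import org.apache.spark.{SparkConf, SparkContext}

  // Keep a failed task off the executor it failed on for 5 seconds,
  // so retries get scheduled somewhere else (pre-2.1.0 property).
  val conf = new SparkConf()
    .setAppName("example-job")  // placeholder name
    .set("spark.scheduler.executorTaskBlacklistTime", "5000")
  val sc = new SparkContext(conf)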

In 2.1.0 that setting was replaced by spark.blacklist.timeout. I'm not sure
it does exactly the same thing, but its default is 1h instead of 0, and a
non-zero default is better for avoiding exactly what you're seeing.
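
If you are on 2.1.0 or later, I would expect the equivalent to look
something like this. I have not verified that it behaves identically, and I
believe blacklisting also has to be switched on explicitly, so treat this as
a sketch rather than a tested config:

  import org.apache.spark.SparkConf

  // 2.1.0+ blacklist settings -- a sketch, not something we have run.
  val conf = new SparkConf()
    .setAppName("example-job")                // placeholder name
    .set("spark.blacklist.enabled", "true")   // blacklisting is opt-in, as far as I know
    .set("spark.blacklist.timeout", "5s")     // default is 1h; 5s mirrors our old 5000 ms setting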

rb

On Fri, Apr 21, 2017 at 1:32 PM, Chawla,Sumit <sumitkcha...@gmail.com>
wrote:

> I am seeing a strange issue. I had a badly behaving slave that failed the
> entire job.  I have set spark.task.maxFailures to 8 for my job.  It seems
> that all task retries happen on the same slave in case of failure.  My
> expectation was that a task would be retried on a different slave after a
> failure, and that the chance of all 8 retries landing on the same slave
> would be very low.
>
>
> Regards
> Sumit Chawla
>
>


-- 
Ryan Blue
Software Engineer
Netflix
