Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Chawla,Sumit
Thanks a lot @Dongjin, @Ryan

I am using Spark 1.6.  I agree with your assessment, Ryan.  Further
investigation suggested that our cluster was probably at 100% capacity at
that point in time.  Even though tasks were failing on that slave, it kept
accepting them, and the task retries were exhausted well before any of the
other slaves freed up to take those tasks.

Regards
Sumit Chawla


On Mon, Apr 24, 2017 at 9:48 AM, Ryan Blue  wrote:

> Looking at the code a bit more, it appears that blacklisting is disabled
> by default. To enable it, set spark.blacklist.enabled=true.
>
> The updates in 2.1.0 appear to provide much more fine-grained settings for
> this, like the number of tasks that can fail before an executor is
> blacklisted for a stage. In that version, you probably want to set
> spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs
>  and search for
> “blacklist” to see all the options.
>
> rb
>
> On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue  wrote:
>
>> Chawla,
>>
>> We hit this issue, too. I worked around it by setting
>> spark.scheduler.executorTaskBlacklistTime=5000. The problem for us was
>> that the scheduler was using locality to select the executor, even though
>> it had already failed there. The executor task blacklist time controls how
>> long the scheduler will avoid using an executor for a failed task, which
>> will cause it to avoid rescheduling on the executor. The default was 0, so
>> the executor was put back into consideration immediately.
>>
>> In 2.1.0 that setting has changed to spark.blacklist.timeout. I’m not
>> sure if that does exactly the same thing. The default for that setting is
>> 1h instead of 0. It’s better to have a non-zero default to avoid what
>> you’re seeing.
>>
>> rb
>>
>> On Fri, Apr 21, 2017 at 1:32 PM, Chawla,Sumit 
>> wrote:
>>
>>> I am seeing a strange issue. I had a badly behaving slave that failed
>>> the entire job.  I have set spark.task.maxFailures to 8 for my job.  It
>>> seems that all task retries happen on the same slave when a task fails.
>>> My expectation was that a task would be retried on a different slave
>>> after a failure, and that the chance of all 8 retries landing on the
>>> same slave would be very low.
>>>
>>>
>>> Regards
>>> Sumit Chawla
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
Looking at the code a bit more, it appears that blacklisting is disabled by
default. To enable it, set spark.blacklist.enabled=true.

The updates in 2.1.0 appear to provide much more fine-grained settings for
this, like the number of tasks that can fail before an executor is
blacklisted for a stage. In that version, you probably want to set
spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs
 and search for
“blacklist” to see all the options.
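
For example, a minimal setup along those lines might look like the
following (just a sketch using the 2.1.0 property names; the app name and
the limit of one attempt per executor are only illustrative values):

    import org.apache.spark.sql.SparkSession

    // Enable blacklisting, and limit how many times a given task may be
    // attempted on one executor before that executor is blacklisted for
    // that task.
    val spark = SparkSession.builder()
      .appName("blacklist-example")
      .config("spark.blacklist.enabled", "true")
      .config("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
      .getOrCreate()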

rb

On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue  wrote:

> Chawla,
>
> We hit this issue, too. I worked around it by setting
> spark.scheduler.executorTaskBlacklistTime=5000. The problem for us was that
> the scheduler
> was using locality to select the executor, even though it had already
> failed there. The executor task blacklist time controls how long the
> scheduler will avoid using an executor for a failed task, which will cause
> it to avoid rescheduling on the executor. The default was 0, so the
> executor was put back into consideration immediately.
>
> In 2.1.0 that setting has changed to spark.blacklist.timeout. I’m not
> sure if that does exactly the same thing. The default for that setting is
> 1h instead of 0. It’s better to have a non-zero default to avoid what
> you’re seeing.
>
> rb
>
> On Fri, Apr 21, 2017 at 1:32 PM, Chawla,Sumit 
> wrote:
>
>> I am seeing a strange issue. I had a badly behaving slave that failed
>> the entire job.  I have set spark.task.maxFailures to 8 for my job.  It
>> seems that all task retries happen on the same slave when a task fails.
>> My expectation was that a task would be retried on a different slave
>> after a failure, and that the chance of all 8 retries landing on the
>> same slave would be very low.
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
Chawla,

We hit this issue, too. I worked around it by setting
spark.scheduler.executorTaskBlacklistTime=5000. The problem for us was that
the scheduler was using locality to select the executor, even though it had
already failed there. The executor task blacklist time controls how long
the scheduler will avoid using an executor for a failed task, which will
cause it to avoid rescheduling on the executor. The default was 0, so the
executor was put back into consideration immediately.
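
Concretely, that workaround amounts to something like the following (just a
sketch for a pre-2.1.0 job; the 5000 ms value is the one mentioned above,
not a recommendation):

    import org.apache.spark.{SparkConf, SparkContext}

    // Keep a failed task off the executor it just failed on for 5 seconds,
    // so the retry can be scheduled on a different executor instead of
    // going straight back to the same one because of locality.
    val conf = new SparkConf()
      .setAppName("blacklist-workaround")
      .set("spark.scheduler.executorTaskBlacklistTime", "5000")
    val sc = new SparkContext(conf)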

In 2.1.0 that setting has changed to spark.blacklist.timeout. I’m not sure
if that does exactly the same thing. The default for that setting is 1h
instead of 0. It’s better to have a non-zero default to avoid what you’re
seeing.
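
On 2.1.0 the rough equivalent would be something like this (a sketch;
blacklisting itself also has to be enabled, as noted elsewhere in this
thread, and "1h" simply restates the documented default):

    // How long a blacklisted executor or node stays out of consideration
    // before it is put back into the scheduling pool.
    val conf210 = new org.apache.spark.SparkConf()
      .set("spark.blacklist.enabled", "true")
      .set("spark.blacklist.timeout", "1h")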

rb

On Fri, Apr 21, 2017 at 1:32 PM, Chawla,Sumit 
wrote:

> I am seeing a strange issue. I had a badly behaving slave that failed
> the entire job.  I have set spark.task.maxFailures to 8 for my job.  It
> seems that all task retries happen on the same slave when a task fails.
> My expectation was that a task would be retried on a different slave
> after a failure, and that the chance of all 8 retries landing on the
> same slave would be very low.
>
>
> Regards
> Sumit Chawla
>
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Dongjin Lee
Sumit,

I think the post below describes exactly the case you are seeing.

https://blog.cloudera.com/blog/2017/04/blacklisting-in-apache-spark/

Regards,
Dongjin

--
Dongjin Lee

Software developer in Line+.
So interested in massive-scale machine learning.

facebook: http://www.facebook.com/dongjin.lee.kr
linkedin: http://kr.linkedin.com/in/dongjinleekr
github: http://github.com/dongjinleekr
twitter: http://www.twitter.com/dongjinleekr

On 22 Apr 2017, 5:32 AM +0900, Chawla,Sumit , wrote:
> I am seeing a strange issue. I had a badly behaving slave that failed the
> entire job. I have set spark.task.maxFailures to 8 for my job. It seems
> that all task retries happen on the same slave when a task fails. My
> expectation was that a task would be retried on a different slave after a
> failure, and that the chance of all 8 retries landing on the same slave
> would be very low.
>
>
> Regards
> Sumit Chawla
>