Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Chawla,Sumit
Thanks a lot @Dongjin, @Ryan. I am using Spark 1.6. I agree with your assessment, Ryan. Further investigation suggested that our cluster was probably at 100% capacity at that point in time. Even though tasks were failing on that slave, it was still being assigned tasks, and task retries ...

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
Looking at the code a bit more, it appears that blacklisting is disabled by default. To enable it, set spark.blacklist.enabled=true. The updates in 2.1.0 appear to provide much more fine-grained settings for this, like the number of tasks that can fail before an executor is blacklisted for a ...
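
[Editor's note] A rough sketch of Ryan's suggestion, not taken from the thread itself: the snippet below enables blacklisting and sets a few of the task- and stage-level thresholds introduced around Spark 2.1.0. Treat the specific property names and example values as assumptions to verify against the configuration docs for your Spark version.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Enable the blacklisting feature (off by default).
    val conf = new SparkConf()
      .setAppName("blacklist-example")
      .set("spark.blacklist.enabled", "true")
      // Task-level: attempts of one task allowed on a single executor / node
      // before that executor / node is blacklisted for that task.
      .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
      .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
      // Stage-level: distinct failed tasks allowed on an executor, and failed
      // executors allowed on a node, before blacklisting for the whole stage.
      .set("spark.blacklist.stage.maxFailedTasksPerExecutor", "2")
      .set("spark.blacklist.stage.maxFailedExecutorsPerNode", "2")

    val spark = SparkSession.builder().config(conf).getOrCreate()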

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
Chawla, we hit this issue too. I worked around it by setting spark.scheduler.executorTaskBlacklistTime=5000. The problem for us was that the scheduler was using locality to select the executor, even though the task had already failed there. The executor task blacklist time controls how long the ...
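
[Editor's note] A minimal sketch of this workaround for a Spark 1.6 job, assuming the legacy spark.scheduler.executorTaskBlacklistTime property (in milliseconds) is honoured by your build; the 5000 ms value is the one Ryan quotes above, and the app name is made up.

    import org.apache.spark.{SparkConf, SparkContext}

    // Keep the scheduler from re-offering a failed task to the same executor
    // for 5 seconds, so the retry has a chance to land on a different slave.
    val conf = new SparkConf()
      .setAppName("task-blacklist-workaround")
      .set("spark.scheduler.executorTaskBlacklistTime", "5000")

    val sc = new SparkContext(conf)

The same property can also be passed at submit time, e.g. --conf spark.scheduler.executorTaskBlacklistTime=5000.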

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Dongjin Lee
Sumit, I think the post below describes exactly your case: https://blog.cloudera.com/blog/2017/04/blacklisting-in-apache-spark/ Regards, Dongjin -- Dongjin Lee, software developer at Line+, interested in massive-scale machine learning. facebook: ...

What is correct behavior for spark.task.maxFailures?

2017-04-21 Thread Chawla,Sumit
I am seeing a strange issue. I had a badly behaving slave that failed the entire job. I have set spark.task.maxFailures to 8 for my job. It seems that all task retries happen on the same slave in case of failure. My expectation was that the task would be retried on a different slave in case of failure, and ...
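
[Editor's note] For reference, a minimal sketch of the configuration described in this question (the app name is made up): spark.task.maxFailures bounds how many times any single task may fail before the whole job is aborted, but on its own it says nothing about which slave a retry is scheduled on.

    import org.apache.spark.{SparkConf, SparkContext}

    // Allow each task up to 8 attempts before the job is failed.
    // Where a retry runs is still driven by locality preferences, which is
    // why every retry can end up on the same misbehaving slave.
    val conf = new SparkConf()
      .setAppName("max-failures-example")
      .set("spark.task.maxFailures", "8")

    val sc = new SparkContext(conf)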