Sumit, I think the post below describes exactly the case you are hitting:

https://blog.cloudera.com/blog/2017/04/blacklisting-in-apache-spark/
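For reference, here is a minimal sketch of what enabling the blacklisting that the post describes might look like. The property names are the Spark 2.1+ blacklist settings; the specific values are illustrative, not tuned recommendations:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Keep retrying a failed task up to 8 times overall...
val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")
  // ...but blacklist misbehaving executors/nodes so retries move elsewhere.
  .set("spark.blacklist.enabled", "true")
  // Allow at most 1 attempt of a given task per executor...
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
  // ...and at most 2 attempts of that task on any one node.
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")

val spark = SparkSession.builder()
  .config(conf)
  .appName("blacklist-example")
  .getOrCreate()

With something like this in place, once the bad slave has used up its per-node attempt budget for a task, the remaining retries should be scheduled on other nodes instead of failing the whole job.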
Regards,
Dongjin

--
Dongjin Lee
Software developer at Line+. Interested in massive-scale machine learning.
facebook: http://www.facebook.com/dongjin.lee.kr
linkedin: http://kr.linkedin.com/in/dongjinleekr
github: http://github.com/dongjinleekr
twitter: http://www.twitter.com/dongjinleekr

On 22 Apr 2017, 5:32 AM +0900, Chawla,Sumit <sumitkcha...@gmail.com>, wrote:
> I am seeing a strange issue. I had a badly behaving slave that failed the
> entire job. I have set spark.task.maxFailures to 8 for my job. It seems like all
> task retries happen on the same slave in case of failure. My expectation was
> that the task would be retried on a different slave after a failure, and the chance
> of all 8 retries landing on the same slave would be very low.
>
>
> Regards
> Sumit Chawla
>