[ https://issues.apache.org/jira/browse/SPARK-15865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-15865:
------------------------------------

    Assignee: Apache Spark  (was: Imran Rashid)

> Blacklist should not result in job hanging with less than 4 executors
> ---------------------------------------------------------------------
>
>                 Key: SPARK-15865
>                 URL: https://issues.apache.org/jira/browse/SPARK-15865
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.0.0
>            Reporter: Imran Rashid
>            Assignee: Apache Spark
>
> Currently, when you turn on blacklisting with
> {{spark.scheduler.executorTaskBlacklistTime}} but have fewer than
> {{spark.task.maxFailures}} executors, a job can end up "hung" after some
> task failures.
> If a task fails repeatedly (say, due to an error in user code), it is
> blacklisted from the executor it ran on. It then tries another executor
> and fails there as well. However, once it has tried all available
> executors, the scheduler simply stops trying to schedule the task
> anywhere. The job neither fails nor succeeds -- it just waits.
> Eventually, when the blacklist entry expires, the task is scheduled
> again, but that can be quite far in the future, and in the meantime the
> user just observes a stuck job.
> Instead we should abort the stage (and fail any dependent jobs) as soon
> as we detect tasks that cannot be scheduled.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
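The hang described above can be illustrated with a small simulation. This is a hedged sketch, not Spark's actual scheduler code: the function names and the per-task blacklist-as-a-set model are assumptions made for illustration only. It shows why, with fewer executors than {{spark.task.maxFailures}}, the per-executor blacklist runs out of placement options before the failure budget is exhausted:

```python
def run_task(executors, max_failures):
    """Simulate one task that fails on every executor it runs on.

    Returns ("job-failed", n) when max_failures is reached (the normal
    failure path), or ("abort", n) when no non-blacklisted executor
    remains -- the point where, today, the job silently hangs and where
    this issue proposes aborting the stage instead.
    """
    blacklist = set()  # executors this task has already failed on
    for attempt in range(1, max_failures + 1):
        candidates = executors - blacklist
        if not candidates:
            # Nowhere left to schedule the task; without an abort, the
            # scheduler just waits for the blacklist timeout to expire.
            return ("abort", attempt - 1)
        executor = min(candidates)   # pick any available executor
        blacklist.add(executor)      # the task fails there too
    return ("job-failed", max_failures)

# Two executors, maxFailures=4: scheduling stalls after only 2 failures.
print(run_task({"exec-1", "exec-2"}, 4))   # -> ('abort', 2)

# Five executors, maxFailures=4: the task fails 4 times and the job
# fails normally, so the hang never occurs.
print(run_task({f"exec-{i}" for i in range(5)}, 4))  # -> ('job-failed', 4)
```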