GitHub user squito opened a pull request: https://github.com/apache/spark/pull/13603
[SPARK-15865][CORE] Blacklist should not result in job hanging with less than 4 executors

## What changes were proposed in this pull request?

Before this change, when you turn on blacklisting with {{spark.scheduler.executorTaskBlacklistTime}} but have fewer than {{spark.task.maxFailures}} executors, you can end up with a job "hung" after some task failures.

Whenever a taskset is unable to schedule anything in resourceOfferSingleTaskSet, we check whether the last pending task can be scheduled on *any* known executor. If not, the taskset (and any corresponding jobs) are failed.

* Worst case, this is O(numExecutors). But unless many executors are bad, this should be small.
* This does not fail as fast as possible: when a task becomes unschedulable, we keep scheduling other tasks. This is to avoid an O(numPendingTasks) operation.
* Also, it is conceivable this fails too quickly. You may be 1 millisecond away from unblacklisting a place for a task to run, or from acquiring a new executor.

Also, I scratched an itch and tightened up visibility throughout `TaskSetManager`. Can undo that if it clutters this change.

## How was this patch tested?

Added unit tests, ran via Jenkins.
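
As an illustration of the check described in the pull request description, here is a minimal Python sketch. It is not the code in this patch: `BlacklistSketch` and `unschedulable_task` are hypothetical names, the blacklist is modeled as a simple per-(task, executor) failure timestamp, and the behavior with zero known executors is an assumption of the sketch. It only looks at the last pending task, so the cost is O(numExecutors) rather than O(numPendingTasks).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class BlacklistSketch:
    """Hypothetical stand-in for per-task executor blacklisting (not Spark's API)."""
    blacklist_time_ms: int
    # (task_index, executor_id) -> time of the most recent failure, in ms
    last_failure_ms: Dict[Tuple[int, str], int] = field(default_factory=dict)

    def record_failure(self, task_index: int, executor_id: str, now_ms: int) -> None:
        self.last_failure_ms[(task_index, executor_id)] = now_ms

    def is_blacklisted(self, task_index: int, executor_id: str, now_ms: int) -> bool:
        failed_at = self.last_failure_ms.get((task_index, executor_id))
        # An entry stops counting once blacklist_time_ms has elapsed.
        return failed_at is not None and now_ms - failed_at < self.blacklist_time_ms


def unschedulable_task(
    pending_tasks: List[int],
    executors: List[str],
    blacklist: BlacklistSketch,
    now_ms: int,
) -> Optional[int]:
    """Return the last pending task's index if every known executor is
    blacklisted for it, else None. Only one task is inspected."""
    # Assumption of this sketch: with nothing pending, or no known
    # executors at all, we report nothing and let the caller keep waiting.
    if not pending_tasks or not executors:
        return None
    task_index = pending_tasks[-1]
    for executor_id in executors:
        if not blacklist.is_blacklisted(task_index, executor_id, now_ms):
            return None  # at least one place to run; the taskset can progress
    return task_index  # caller would fail the taskset instead of hanging
```

A caller would run this only after an offer round in which nothing could be scheduled, and fail the taskset when a task index comes back. Because the result depends on `now_ms`, the sketch also shows the "fails too quickly" caveat: a check made just before a blacklist entry expires still reports the task as unschedulable.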

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark progress_w_few_execs_and_blacklist

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13603.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13603

----
commit 3f462750c1e99d9357106526ec724cebe639c561
Author: Imran Rashid <iras...@cloudera.com>
Date:   2016-06-10T05:23:03Z

    if all executors have been blacklisted for a ask, abort stage (instead of just hanging)

commit bc80e8ceeee180c37fa9211ee1b85ce0a7a90ac7
Author: Imran Rashid <iras...@cloudera.com>
Date:   2016-06-10T06:43:23Z

    tighten visibility throughout TaskSetManager

----