GitHub user squito opened a pull request:

    https://github.com/apache/spark/pull/13603

    [SPARK-15865][CORE] Blacklist should not result in job hanging with less 
than 4 executors

    ## What changes were proposed in this pull request?
    
    Before this change, if you turn on blacklisting with
    `spark.scheduler.executorTaskBlacklistTime` but have fewer than
    `spark.task.maxFailures` executors, a job can hang after some task
    failures, because a task can end up blacklisted on every executor before
    it reaches the failure limit.
    
    Whenever a taskset is unable to schedule anything in
    `resourceOfferSingleTaskSet`, we check whether the last pending task can be
    scheduled on *any* known executor.  If not, the taskset (and any
    corresponding jobs) is failed.
    * Worst case, this is O(numExecutors).  But unless many executors are bad,
    the check should be cheap.
    * This does not fail as fast as possible -- when a task becomes 
unschedulable, we keep scheduling other tasks.  This is to avoid an 
O(numPendingTasks) operation
    * Also, it is conceivable this fails too quickly.  You may be 1 millisecond 
away from unblacklisting a place for a task to run, or acquiring a new executor.
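
The check described above can be sketched roughly as follows. This is a simplified, self-contained model and not the code in this pull request: `BlacklistSketch`, `isBlacklisted`, `isSchedulable`, `taskToAbortOn`, and the representation of the blacklist as a map from (executor, task index) to an expiry time are all assumptions made for illustration.

```scala
// Hypothetical, simplified model of the check; none of these names are Spark's API.
object BlacklistSketch {
  // (executorId, taskIndex) -> time in ms until which that executor is
  // blacklisted for that task (assumed representation).
  type Blacklist = Map[(String, Int), Long]

  def isBlacklisted(
      blacklist: Blacklist,
      executorId: String,
      taskIndex: Int,
      nowMs: Long): Boolean =
    blacklist.get((executorId, taskIndex)).exists(expiryMs => nowMs < expiryMs)

  // Worst case O(numExecutors); `exists` stops at the first usable executor,
  // so the scan is short unless many executors are blacklisted for this task.
  def isSchedulable(
      taskIndex: Int,
      executors: Iterable[String],
      blacklist: Blacklist,
      nowMs: Long): Boolean =
    executors.exists(e => !isBlacklisted(blacklist, e, taskIndex, nowMs))

  // Only the last pending task is inspected, which avoids an
  // O(numPendingTasks) scan. Returns the task that should cause the taskset
  // to be failed, or None if nothing is pending or that task can still run.
  def taskToAbortOn(
      pendingTasks: Seq[Int],
      executors: Iterable[String],
      blacklist: Blacklist,
      nowMs: Long): Option[Int] =
    pendingTasks.lastOption.filterNot(t => isSchedulable(t, executors, blacklist, nowMs))
}
```

In this model the caller would invoke `taskToAbortOn` only after a round of offers scheduled nothing for the taskset. Because only the last pending task is examined, an unschedulable task earlier in the list is not noticed until it becomes the last one, which matches the "does not fail as fast as possible" caveat above.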
    
    Also, I scratched an itch and tightened up visibility throughout 
`TaskSetManager`.  Can undo that if it clutters this change.
    
    ## How was this patch tested?
    
    Added unit tests, ran via jenkins.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark progress_w_few_execs_and_blacklist

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13603.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13603
    
----
commit 3f462750c1e99d9357106526ec724cebe639c561
Author: Imran Rashid <iras...@cloudera.com>
Date:   2016-06-10T05:23:03Z

    if all executors have been blacklisted for a task, abort stage (instead of
    just hanging)

commit bc80e8ceeee180c37fa9211ee1b85ce0a7a90ac7
Author: Imran Rashid <iras...@cloudera.com>
Date:   2016-06-10T06:43:23Z

    tighten visibility throughout TaskSetManager

----


