GitHub user ericvandenbergfb opened a pull request:

    https://github.com/apache/spark/pull/18427

    [SPARK-21219][scheduler] Fix race condition between adding task to pending 
list and updating black list state.
    
    ## What changes were proposed in this pull request?
    
    There's a race condition in the current TaskSetManager: a failed task 
is added for retry (addPendingTask) and can asynchronously be assigned to an 
executor *before* the blacklist state is updated 
(updateBlacklistForFailedTask). As a result, the task might re-execute on the 
same executor. This is particularly problematic if the executor is shutting 
down, since the retry task immediately becomes a lost task 
(ExecutorLostFailure). Another side effect is that the actual failure reason 
is obscured by the retry task, which never actually executed. Sample logs 
showing the issue are in https://issues.apache.org/jira/browse/SPARK-21219
    
    The fix is to change the ordering of the addPendingTask and 
updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask, so that 
the blacklist is updated before the task is added to the pending list.
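
    The ordering problem can be illustrated with a small, deterministic toy 
model. This is only a sketch and not Spark's actual code: ToyTaskSetManager, 
offer, between_steps and retry_lands_on_failed_executor are hypothetical names, 
and the concurrent resource offer is simulated by a hook that runs between the 
two steps of handle_failed_task.

```python
# Simplified, hypothetical model of the ordering issue; not Spark's actual code.


class ToyTaskSetManager:
    def __init__(self, blacklist_first):
        self.blacklist_first = blacklist_first
        self.pending = []         # task indices waiting to be scheduled
        self.blacklisted = set()  # (executor, task) pairs that must not be paired again
        # Hook that stands in for a resource offer arriving from another thread.
        self.between_steps = lambda: None

    def offer(self, executor):
        """Return a pending task that may run on `executor`, or None."""
        for task in self.pending:
            if (executor, task) not in self.blacklisted:
                self.pending.remove(task)
                return task
        return None

    def handle_failed_task(self, task, executor):
        if self.blacklist_first:
            # Ordering after the fix: blacklist, then re-queue.
            self.blacklisted.add((executor, task))
            self.between_steps()
            self.pending.append(task)
        else:
            # Ordering before the fix: re-queue, then blacklist.
            self.pending.append(task)
            self.between_steps()
            self.blacklisted.add((executor, task))


def retry_lands_on_failed_executor(blacklist_first):
    """True if an offer arriving mid-update hands the retry to the failed executor."""
    mgr = ToyTaskSetManager(blacklist_first)
    assigned = []
    mgr.between_steps = lambda: assigned.append(mgr.offer("exec-1"))
    mgr.handle_failed_task(task=0, executor="exec-1")
    return assigned == [0]
```

In this model, the old ordering lets the offer that arrives between the two 
steps pick up the retry on the executor that just failed it, while the new 
ordering leaves nothing schedulable for that executor at that moment.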
    
    ## How was this patch tested?
    
    Implemented a unit test that verifies the task is blacklisted before it is 
added to the pending task list. The unit test fails without the fix and passes 
with it.
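
    A call-order check of this kind might look roughly like the following. It 
is a hypothetical Python analogue using unittest.mock, not the test added by 
this pull request; handle_failed_task, update_blacklist_for_failed_task, 
add_pending_task and recorded_call_order are illustrative names only.

```python
from unittest import mock


def handle_failed_task(manager, task_index, executor_id):
    # Ordering described by the fix: update the blacklist, then re-queue the task.
    manager.update_blacklist_for_failed_task(executor_id, task_index)
    manager.add_pending_task(task_index)


def recorded_call_order():
    """Run the handler against a mock and return the method names in call order."""
    manager = mock.Mock()
    handle_failed_task(manager, task_index=0, executor_id="exec-1")
    names = []
    for name, _args, _kwargs in manager.mock_calls:
        names.append(name)
    return names
```

Swapping the two calls in handle_failed_task reverses the recorded order, 
which is how such a test would fail without the fix.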
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericvandenbergfb/spark blacklistFix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18427.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18427
    
----
commit 3cf068df4cb9f863b895b10d12203f3b5406a989
Author: Eric Vandenberg <ericvandenb...@fb.com>
Date:   2017-06-26T22:20:42Z

    [SPARK-21219][scheduler] Fix race condition between adding task to pending 
list and updating black list state.

----


