GitHub user ericvandenbergfb opened a pull request: https://github.com/apache/spark/pull/18427
[SPARK-21219][scheduler] Fix race condition between adding task to pending list and updating black list state.

## What changes were proposed in this pull request?

There's a race condition in the current TaskSetManager where a failed task is added for retry (addPendingTask) and can asynchronously be assigned to an executor *prior* to the blacklist state being updated (updateBlacklistForFailedTask). As a result, the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down, since the retry task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retry task, which never actually executed. There are sample logs showing the issue in https://issues.apache.org/jira/browse/SPARK-21219

The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask, so that the blacklist is updated before the task is added to the pending list.

## How was this patch tested?

Implemented a unit test that verifies the task is blacklisted before it is added to the pending list. Ran the unit test without the fix and it fails. Ran the unit test with the fix and it passes.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericvandenbergfb/spark blacklistFix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18427.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18427

----
commit 3cf068df4cb9f863b895b10d12203f3b5406a989
Author: Eric Vandenberg <ericvandenb...@fb.com>
Date:   2017-06-26T22:20:42Z

    [SPARK-21219][scheduler] Fix race condition between adding task to pending list and updating black list state.
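The ordering problem described in this pull request can be illustrated with a small, deterministic toy model. This is only a sketch under stated assumptions, not Spark's implementation: `ToyTaskSetManager`, `resource_offer`, the `interleave` hook, and the executor name `exec-1` are all hypothetical, and the hook merely stands in for a scheduling decision that happens between the two calls.

```python
class ToyTaskSetManager:
    """Toy stand-in for a task set manager; NOT Spark's TaskSetManager."""

    def __init__(self):
        self.pending = []    # task ids waiting to be scheduled
        self.blacklist = {}  # task id -> set of executor ids to avoid

    def add_pending_task(self, task):
        self.pending.append(task)

    def update_blacklist_for_failed_task(self, task, executor):
        self.blacklist.setdefault(task, set()).add(executor)

    def resource_offer(self, executor):
        """Hand out the first pending task not blacklisted on this executor."""
        for task in self.pending:
            if executor not in self.blacklist.get(task, set()):
                self.pending.remove(task)
                return task
        return None

    def handle_failed_task(self, task, executor, blacklist_first, interleave):
        # The two steps whose relative order the pull request changes.
        steps = [
            lambda: self.update_blacklist_for_failed_task(task, executor),
            lambda: self.add_pending_task(task),
        ]
        if not blacklist_first:
            steps.reverse()
        steps[0]()
        interleave()  # hypothetical: an offer arrives between the two steps
        steps[1]()


def simulate(blacklist_first):
    """Return the task handed to the failing executor mid-update, if any."""
    tsm = ToyTaskSetManager()
    assigned = []
    tsm.handle_failed_task(
        task=0,
        executor="exec-1",
        blacklist_first=blacklist_first,
        interleave=lambda: assigned.append(tsm.resource_offer("exec-1")),
    )
    return assigned[0]


if __name__ == "__main__":
    print("pending first  :", simulate(blacklist_first=False))
    print("blacklist first:", simulate(blacklist_first=True))
```

In this model, adding the task to the pending list first lets the failing executor pick the retry up again, while updating the blacklist first leaves nothing for that executor to take.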