[ https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083082#comment-16083082 ]
Jose Soltren commented on SPARK-21219:
--------------------------------------

I think it would be good to backport this to 2.2 and 2.1. Any objections? Assuming there are none, I'll start putting together backport PRs shortly.

> Task retry occurs on same executor due to race condition with blacklisting
> --------------------------------------------------------------------------
>
>                 Key: SPARK-21219
>                 URL: https://issues.apache.org/jira/browse/SPARK-21219
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.1.1
>            Reporter: Eric Vandenberg
>            Assignee: Eric Vandenberg
>            Priority: Minor
>             Fix For: 2.3.0
>
>         Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> When a task fails, it is (1) added to the pending task list and then (2) the corresponding blacklist policy is enforced (i.e., recording whether it can or cannot run on a particular node/executor/etc.). Unfortunately, with this ordering the retried task could be assigned to the same executor, which, incidentally, could be shutting down and would immediately fail the retry. Instead, the order should be (1) update the blacklist state and then (2) assign the task, ensuring that the blacklist policy is properly enforced.
>
> The attached logs demonstrate the race condition.
>
> See spark_executor.log.anon:
>
> 1. Task 55.2 fails on the executor:
> 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 39575)
> java.lang.OutOfMemoryError: Java heap space
>
> 2. Immediately, the same executor is assigned the retry task:
> 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
> 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)
>
> 3. The retry of course fails, since the executor is also shutting down due to the original task 55.2 OOM failure.
>
> See spark_driver.log.anon:
>
> The driver processes the lost task 55.2:
> 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, foobar####.masked-server.com, executor attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0): java.lang.OutOfMemoryError: Java heap space
>
> The driver then receives the ExecutorLostFailure for the retry task 55.3 (although it's obfuscated in these logs, the server info is the same):
> 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, foobar####.masked-server.com, executor attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0): ExecutorLostFailure (executor attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
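
To make the ordering issue concrete, here is a minimal, self-contained Scala sketch of the race described above. It uses simplified stand-ins (SimpleScheduler, SimpleBlacklist, handleFailedTaskBuggy/Fixed, resourceOffer) that are hypothetical and do not reflect Spark's actual TaskSetManager or BlacklistTracker APIs; it only illustrates why updating the blacklist before re-enqueueing the retry closes the window in which the failing executor can be offered the retry.

    // Hedged sketch: simplified, hypothetical types, not Spark's real scheduler code.
    object BlacklistOrderingSketch {

      final case class Task(id: Int)

      // Tracks executors that should no longer receive tasks.
      class SimpleBlacklist {
        private val badExecutors = scala.collection.mutable.Set.empty[String]
        def addFailure(execId: String): Unit = badExecutors += execId
        def isBlacklisted(execId: String): Boolean = badExecutors.contains(execId)
      }

      class SimpleScheduler(blacklist: SimpleBlacklist) {
        private val pending = scala.collection.mutable.Queue.empty[Task]

        // Buggy ordering (mirrors the race in the issue): the retry becomes
        // schedulable before the blacklist learns about the failure, so a
        // concurrent resource offer from the same executor can pick it up.
        def handleFailedTaskBuggy(task: Task, execId: String): Unit = {
          pending.enqueue(task)        // (1) retry is immediately schedulable
          blacklist.addFailure(execId) // (2) blacklist updated too late
        }

        // Fixed ordering: enforce the blacklist first, then expose the retry.
        def handleFailedTaskFixed(task: Task, execId: String): Unit = {
          blacklist.addFailure(execId) // (1) update blacklist state
          pending.enqueue(task)        // (2) now the retry can be offered safely
        }

        // Offer a free slot on `execId`; skip blacklisted executors.
        def resourceOffer(execId: String): Option[Task] =
          if (blacklist.isBlacklisted(execId) || pending.isEmpty) None
          else Some(pending.dequeue())
      }

      def main(args: Array[String]): Unit = {
        val blacklist = new SimpleBlacklist
        val scheduler = new SimpleScheduler(blacklist)
        scheduler.handleFailedTaskFixed(Task(55), execId = "exec-1")
        println(scheduler.resourceOffer("exec-1")) // None: failing executor is skipped
        println(scheduler.resourceOffer("exec-2")) // Some(Task(55)): retry goes elsewhere
      }
    }

With the buggy ordering, a resource offer arriving between steps (1) and (2) can hand the retry straight back to the dying executor, which is exactly the sequence the attached executor/driver logs show.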