[ https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083082#comment-16083082 ]

Jose Soltren commented on SPARK-21219:
--------------------------------------

I think it would be good to backport this to 2.2 and 2.1. Any objections? I'll 
start putting together backport PRs shortly, assuming there are none.

> Task retry occurs on same executor due to race condition with blacklisting
> --------------------------------------------------------------------------
>
>                 Key: SPARK-21219
>                 URL: https://issues.apache.org/jira/browse/SPARK-21219
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.1.1
>            Reporter: Eric Vandenberg
>            Assignee: Eric Vandenberg
>            Priority: Minor
>             Fix For: 2.3.0
>
>         Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> When a task fails, it is (1) added back into the pending task list and then (2) 
> the corresponding blacklist policy is enforced (i.e., recording whether it can 
> or cannot run on a particular node/executor/etc.). With this ordering, the retry 
> can be assigned to the same executor, which may already be shutting down and 
> will immediately fail the retry. The order should instead be (1) update the 
> blacklist state and then (2) assign the task, so that the blacklist policy is 
> properly enforced. (A minimal sketch of the corrected ordering appears after 
> the quoted description below.)
> The attached logs demonstrate the race condition.
> See spark_executor.log.anon:
> 1. Task 55.2 fails on the executor
> 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
> 39575)
> java.lang.OutOfMemoryError: Java heap space
> 2. Immediately the same executor is assigned the retry task:
> 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
> 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)
> 3. The retry of course fails, since the executor is shutting down due to the 
> original task 55.2 OOM failure.
> See the spark_driver.log.anon:
> The driver processes the lost task 55.2:
> 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 
> 39575, foobar####.masked-server.com, executor 
> attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0):
>  java.lang.OutOfMemoryError: Java heap space
> The driver then receives the ExecutorLostFailure for the retry task 55.3 
> (although it is obfuscated in these logs, the server info is the same):
> 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 
> 39651, foobar####.masked-server.com, executor 
> attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0):
>  ExecutorLostFailure (executor 
> attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0
>  exited caused by one of the running tasks) Reason: Remote RPC client 
> disassociated. Likely due to containers exceeding thresholds, or network 
> issues. Check driver logs for WARN messages.
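
Below is a minimal, self-contained Scala sketch of the ordering problem described
above. The names (SimpleTaskSetManager, ExecutorBlacklist,
handleFailedTaskBuggy/Fixed) are illustrative only and are not Spark's actual
TaskSetManager internals; the point is simply that the blacklist state has to be
updated before the failed task is put back on the pending queue.

import scala.collection.mutable

// One attempt of a task: the task's index in its task set plus the attempt number.
final case class TaskAttempt(taskIndex: Int, attempt: Int)

// Tracks, per task index, the executors the task may no longer run on.
class ExecutorBlacklist {
  private val blacklisted = mutable.Map.empty[Int, mutable.Set[String]]

  // Record that this task index must not be retried on this executor.
  def addFailure(taskIndex: Int, executorId: String): Unit = {
    blacklisted.getOrElseUpdate(taskIndex, mutable.Set.empty) += executorId
  }

  def isBlacklisted(taskIndex: Int, executorId: String): Boolean =
    blacklisted.get(taskIndex).exists(_.contains(executorId))
}

class SimpleTaskSetManager(blacklist: ExecutorBlacklist) {
  private val pending = mutable.Queue.empty[TaskAttempt]

  // Buggy ordering (the race described above): the failed task is re-enqueued
  // *before* the blacklist is updated, so a resource offer from the failing
  // executor that arrives in between can pick the retry up again.
  def handleFailedTaskBuggy(t: TaskAttempt, executorId: String): Unit = {
    pending.enqueue(t.copy(attempt = t.attempt + 1))  // (1) re-enqueue
    blacklist.addFailure(t.taskIndex, executorId)     // (2) blacklist
  }

  // Fixed ordering: update the blacklist first, then re-enqueue the task, so
  // any later offer from that executor is rejected.
  def handleFailedTaskFixed(t: TaskAttempt, executorId: String): Unit = {
    blacklist.addFailure(t.taskIndex, executorId)     // (1) blacklist
    pending.enqueue(t.copy(attempt = t.attempt + 1))  // (2) re-enqueue
  }

  // Resource offer from an executor: hand out the next pending task that is
  // not blacklisted for that executor.
  def resourceOffer(executorId: String): Option[TaskAttempt] =
    pending.dequeueFirst(t => !blacklist.isBlacklisted(t.taskIndex, executorId))
}

object BlacklistOrderingDemo extends App {
  val blacklist = new ExecutorBlacklist
  val tsm = new SimpleTaskSetManager(blacklist)

  // Task 55, attempt 2 fails (OOM) on exec-1 and is handled with the fixed ordering.
  tsm.handleFailedTaskFixed(TaskAttempt(taskIndex = 55, attempt = 2), executorId = "exec-1")

  // An offer from the same executor gets nothing...
  assert(tsm.resourceOffer("exec-1").isEmpty)
  // ...while a different executor can still run the retry (attempt 3).
  assert(tsm.resourceOffer("exec-2").exists(_.attempt == 3))
  println("retry was kept off the failing executor")
}

With the buggy ordering, an offer from the failing executor that arrives between
steps (1) and (2) can hand the retry straight back to it, which is exactly the
task 55.2 -> 55.3 sequence in the attached logs; updating the blacklist first
closes that window.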



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
