[ 
https://issues.apache.org/jira/browse/SPARK-37300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiahua updated SPARK-37300:
-----------------------------
    Description: When an executor finishes a task of some stage, the driver 
receives a StatusUpdate event and handles it. If the driver finds at the same 
time that the executor's heartbeat has timed out, it must also handle an 
ExecutorLost event. The two events race with each other, which can leave 
TaskSetManager.successful and TaskSetManager.tasksSuccessful with wrong 
values. A more detailed description and discussion can be found at 
https://issues.apache.org/jira/browse/SPARK-36575 and 
https://github.com/apache/spark/pull/33872  (was: When an executor finishes a 
task of some stage, the driver receives a StatusUpdate event and handles it. 
If the driver finds at the same time that the executor's heartbeat has timed 
out, it must also handle an ExecutorLost event. The two events race with each 
other, which can leave TaskSetManager.successful and 
TaskSetManager.tasksSuccessful with wrong values.

The problem is that TaskResultGetter.enqueueSuccessfulTask uses an 
asynchronous thread to handle a successful task, which means the synchronized 
lock on TaskSchedulerImpl is released partway through the handling (see 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L61).
So TaskSchedulerImpl may handle executorLost first, and only afterwards does 
the asynchronous thread go on to handle the successful task. This leaves 
TaskSetManager.successful and TaskSetManager.tasksSuccessful with wrong 
values.)
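
The ordering described above (executorLost handled first, the success event 
handled later by the asynchronous thread) can be illustrated with a minimal 
sketch. This is not Spark's code: TaskInfoSketch, TaskSetSketch, RaceSketch 
and their methods are hypothetical names, and the model assumes that losing 
the executor marks the running attempt as finished. It only shows the guard 
named in the issue title, where a late success event is ignored if the task 
is already in a finished state.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for per-attempt task bookkeeping.
final class TaskInfoSketch(val index: Int) {
  var finished: Boolean = false
}

// Hypothetical, simplified stand-in for a task set's success counters.
final class TaskSetSketch(numTasks: Int) {
  val successful: Array[Boolean] = Array.fill(numTasks)(false)
  var tasksSuccessful: Int = 0
  private val infos = mutable.Map.empty[Long, TaskInfoSketch]

  def launch(tid: Long, index: Int): Unit = synchronized {
    infos(tid) = new TaskInfoSketch(index)
  }

  // Assumption: losing the executor marks the running attempt as finished.
  def handleExecutorLost(tid: Long): Unit = synchronized {
    infos.get(tid).foreach { info =>
      if (!info.finished) info.finished = true
    }
  }

  // Applies a success event only if the attempt is not already finished.
  // Returns true when the event was applied, false when it was ignored.
  def handleSuccessfulTask(tid: Long): Boolean = synchronized {
    infos.get(tid) match {
      case Some(info) if !info.finished =>
        info.finished = true
        if (!successful(info.index)) {
          successful(info.index) = true
          tasksSuccessful += 1
        }
        true
      case _ =>
        false // late or unknown event: counters stay untouched
    }
  }
}

object RaceSketch {
  def main(args: Array[String]): Unit = {
    val ts = new TaskSetSketch(1)
    ts.launch(tid = 0L, index = 0)
    ts.handleExecutorLost(0L)                 // executor loss is processed first
    val applied = ts.handleSuccessfulTask(0L) // the success event arrives late
    println(s"applied=$applied tasksSuccessful=${ts.tasksSuccessful}")
  }
}
```

In this sketch the late success event is dropped, so running RaceSketch 
prints applied=false tasksSuccessful=0. Without the finished check, the same 
sequence would count the task as successful even though its executor had 
already been treated as lost.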

> TaskSchedulerImpl should ignore task finished event if its task was already 
> finished state
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37300
>                 URL: https://issues.apache.org/jira/browse/SPARK-37300
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: hujiahua
>            Priority: Major
>
> When an executor finishes a task of some stage, the driver receives a 
> StatusUpdate event and handles it. If the driver finds at the same time 
> that the executor's heartbeat has timed out, it must also handle an 
> ExecutorLost event. The two events race with each other, which can leave 
> TaskSetManager.successful and TaskSetManager.tasksSuccessful with wrong 
> values. A more detailed description and discussion can be found at 
> https://issues.apache.org/jira/browse/SPARK-36575 and 
> https://github.com/apache/spark/pull/33872



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
