Xingbo Jiang created SPARK-46182:
------------------------------------

             Summary: Shuffle data lost on decommissioned executor caused by 
race condition between lastTaskRunningTime and lastShuffleMigrationTime
                 Key: SPARK-46182
                 URL: https://issues.apache.org/jira/browse/SPARK-46182
             Project: Spark
          Issue Type: Task
          Components: Spark Core
    Affects Versions: 3.5.0, 3.4.1
            Reporter: Xingbo Jiang
            Assignee: Xingbo Jiang


We recently identified a very tricky race condition during node decommissioning, 
which can lead to shuffle data loss even when data migration is enabled:

* At 04:30:51, an RDD block refresh happened and found no pending work, so 
lastRDDMigrationTime was updated
* A few milliseconds later, the shutdownThread in CoarseGrainedExecutorBackend 
found 1 running task, so lastTaskRunningTime was updated to the current system 
nano time
* Shortly after that, a shuffle block refresh happened and found no pending work, 
so lastShuffleMigrationTime was updated to a timestamp later than 
lastTaskRunningTime
* Shortly after that, a task finished on the decommissioned executor, and 
generated new shuffle blocks
* One second later, the shutdownThread in CoarseGrainedExecutorBackend found no 
running task, so lastTaskRunningTime was not updated; the executor did not exit 
because min(lastRDDMigrationTime, lastShuffleMigrationTime) < lastTaskRunningTime 
(lastRDDMigrationTime still held the value from 04:30:51; the exit condition is 
sketched after this list)
* Thirty seconds later, at 04:31:21, another RDD block refresh happened and found 
no pending work, so lastRDDMigrationTime was updated to the current system nano 
time
* At that moment, all known blocks had been migrated, and 
min(lastRDDMigrationTime, lastShuffleMigrationTime) > lastTaskRunningTime
* The shutdownThread was therefore triggered and asked to stop the executor
* The shuffle block refresh thread was still sleeping and was interrupted by the 
stop command, so it never had the chance to discover the shuffle blocks generated 
by the previously finished task
* Eventually, the executor exited, the output of the task was lost, and Spark had 
to recompute that partition
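
For reference, the exit decision described in the timeline can be summarized with 
the following simplified Scala sketch. The object and variable names are 
illustrative and follow the wording of this report, not the exact source of 
CoarseGrainedExecutorBackend / BlockManagerDecommissioner:

{code:scala}
// Simplified, self-contained sketch of the decommission shutdown check
// described above (illustrative names, not the exact Spark source).
object DecommissionShutdownSketch {
  // Stubs standing in for executor / block manager state (assumed names).
  @volatile var numRunningTasks: Int = 0
  @volatile var lastRDDMigrationTime: Long = 0L      // bumped by the RDD block refresh thread
  @volatile var lastShuffleMigrationTime: Long = 0L  // bumped by the shuffle block refresh thread
  @volatile var allBlocksMigrated: Boolean = false   // both refreshes found no pending work

  def exitExecutor(): Unit = println("stopping executor")

  def shutdownLoop(): Unit = {
    var lastTaskRunningTime = System.nanoTime()
    while (true) {
      if (numRunningTasks == 0) {
        val lastMigrationTime = math.min(lastRDDMigrationTime, lastShuffleMigrationTime)
        // Exit only when every known block is migrated AND the last migration
        // pass finished after the last time a task was seen running.
        if (allBlocksMigrated && lastMigrationTime > lastTaskRunningTime) {
          exitExecutor()
          return
        }
      } else {
        // A task is still running; remember when we last saw it.
        lastTaskRunningTime = System.nanoTime()
      }
      Thread.sleep(1000) // re-check every second
    }
  }
}
{code}

In the timeline above, the condition only flips to true at 04:31:21, when 
lastRDDMigrationTime finally moves past lastTaskRunningTime; 
lastShuffleMigrationTime was already newer than lastTaskRunningTime even though 
new shuffle blocks had been written after its last refresh pass.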

The root cause of the race condition is that the shuffle block refresh happened 
between the update of lastTaskRunningTime and the completion of the task. In that 
window, the shutdownThread can request to stop the executor before the 
BlockManagerDecommissioner discovers the new shuffle blocks generated by the last 
finished task.
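
The vulnerable window can be illustrated with a simplified sketch of the periodic 
shuffle block refresh; again, the names are placeholders following this 
description, not the exact BlockManagerDecommissioner code:

{code:scala}
// Simplified sketch of the periodic shuffle block refresh described above
// (illustrative placeholders, not the exact Spark source).
object ShuffleRefreshSketch {
  @volatile var stopped: Boolean = false
  @volatile var lastShuffleMigrationTime: Long = 0L
  val refreshIntervalMs: Long = 30000L // placeholder for the configured refresh interval

  // Placeholders for the real block scan / migration plumbing.
  def findNewShuffleBlocksToMigrate(): Seq[String] = Seq.empty
  def enqueueForMigration(blocks: Seq[String]): Unit = ()

  def refreshLoop(): Unit = {
    while (!stopped) {
      val newBlocks = findNewShuffleBlocksToMigrate()
      if (newBlocks.isEmpty) {
        // "No pending work": record the time of this pass.
        lastShuffleMigrationTime = System.nanoTime()
      } else {
        enqueueForMigration(newBlocks)
      }
      // If the stop command arrives while this thread is sleeping, the sleep is
      // interrupted and no further pass runs, so shuffle blocks written by a
      // task that finished after the previous pass are never discovered.
      Thread.sleep(refreshIntervalMs)
    }
  }
}
{code}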


