Z1Wu commented on PR #3144:
URL: https://github.com/apache/celeborn/pull/3144#issuecomment-2734252785

   > For celeborn shuffle fetch failure, it will rerun the whole stage, so it 
is fine to cancel all the running tasks.
   
   ``` sh
   shuffleMapStage -> FailedStage 
   ```
   
   In the current implementation, when a fetch failure happens and the shuffle 
map output is not indeterminate, Spark reruns the whole shuffleMapStage but 
only the unfinished tasks of the FailedStage. Tasks of the FailedStage that 
finish while the shuffleMapStage is being rerun are still counted as 
successful and are not rerun when the FailedStage is retried.
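
To make the behaviour described above concrete, here is a toy Python sketch. 
It is not Spark code: `FailedStageModel` and its methods are hypothetical 
names used only for illustration, and the model assumes that a task success 
is recorded per partition no matter which stage attempt produced it.

```python
from dataclasses import dataclass, field


@dataclass
class FailedStageModel:
    """Toy model of the downstream stage; NOT Spark's DAGScheduler."""
    num_partitions: int
    # Partitions that already have a successful task.
    finished: set = field(default_factory=set)

    def task_succeeded(self, partition: int) -> None:
        # Assumption: a success counts regardless of the stage attempt.
        self.finished.add(partition)

    def partitions_to_rerun(self) -> list:
        # Only partitions without a successful task are resubmitted.
        return [p for p in range(self.num_partitions) if p not in self.finished]


stage = FailedStageModel(num_partitions=4)
stage.task_succeeded(0)  # finished before the fetch failure
# Partition 1 hits a fetch failure; the shuffleMapStage is rerun in full.
stage.task_succeeded(2)  # still-running task finishes during the map rerun
retry = stage.partitions_to_rerun()
print(retry)  # [1, 3]
```

In this sketch partition 2 is not part of the retry, even though its task 
finished only after the fetch failure was reported.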
   
   
![image](https://github.com/user-attachments/assets/d12f7217-1b38-4f3d-848f-0db967d6c4bf)
   
   
https://github.com/apache/spark/blob/0670e4f7945ae4935261ed1f45db7ede79aca127/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1928-L1936
   
   Please correct me if I have misunderstood anything.
   

