Z1Wu commented on PR #3144: URL: https://github.com/apache/celeborn/pull/3144#issuecomment-2734252785
> For celeborn shuffle fetch failure, it will rerun the whole stage, so it is fine to cancel all the running tasks.

```sh
shuffleMapStage -> FailedStage
```

In the current implementation, when a fetch failure happens and the shuffle map output is not indeterminate, Spark only reruns the whole shuffleMapStage plus the tasks in the FailedStage that have not finished. Tasks that finish during the rerun of the shuffleMapStage are still counted as successful and are not rerun when the FailedStage is retried.

https://github.com/apache/spark/blob/0670e4f7945ae4935261ed1f45db7ede79aca127/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1928-L1936

Please correct me if I have misunderstood anything.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
