turboFei commented on PR #3144:
URL: https://github.com/apache/celeborn/pull/3144#issuecomment-2746539744

   > > For celeborn shuffle fetch failure, it will rerun the whole stage, so it
   > > is fine to cancel all the running tasks.
   > 
   > ```shell
   > shuffleMapStage -> FailedStage 
   > ```
   > 
   > In current implementation, when fetch failure happens without
   > indeterministic shuffle map output, Spark will only rerun the whole
   > shuffleMapStage and tasks that are not finished in the FailedStage. And tasks
   > that finished during rerun of shuffleMapStage will still be counted as
   > successful tasks and will not be rerun in retry of FailedStage.
   > 
   > 
![image](https://private-user-images.githubusercontent.com/21239012/424122319-d12f7217-1b38-4f3d-848f-0db967d6c4bf.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NDI3NzEzODAsIm5iZiI6MTc0Mjc3MTA4MCwicGF0aCI6Ii8yMTIzOTAxMi80MjQxMjIzMTktZDEyZjcyMTctMWIzOC00ZjNkLTg0OGYtMGRiOTY3ZDZjNGJmLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAzMjMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMzIzVDIzMDQ0MFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTFjYjdiYTkwMGIyYWRlZmE3MmM3MzJlZmNiYTgyYTBjNWUxOTY0ZDlhODNhNTNlYmY5NWIwZWVlYzFhNTMyMmImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.Kk-pLjhIXTxLoutKBcE_SmEIlMVFvIJvX5c0Ik4mkmI)
   > 
   > https://github.com/apache/spark/blob/0670e4f7945ae4935261ed1f45db7ede79aca127/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1928-L1936
   > 
   > Maybe we need to add a condition to only cancel still-running tasks which
   > fetch from indeterministic shuffle output. Please correct me if I have any
   > misunderstanding.
   
   This function is for the zombie task case.
   
   A Celeborn stage rerun, however, reruns every task in the stage.
   
   That means:
   1. If we do not cancel the tasks still running in the stage, they keep
occupying resources, and the tasks to be rerun might have to wait for them to
finish before launching.
   2. The old tasks might keep running at the same time as the tasks of the new
stage attempt, which doubles the compute resources used.
   
   So I think that for a Celeborn stage rerun, cancelling all the running tasks
directly should be fine. @Z1Wu
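
To make the two points above concrete, here is a toy Python model of a whole-stage rerun. It is only a sketch, not Spark or Celeborn scheduler code: `State`, `RetryPlan`, `plan_whole_stage_rerun`, and the assumption that each task holds exactly one executor slot are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    FINISHED = "finished"
    RUNNING = "running"
    FETCH_FAILED = "fetch_failed"


@dataclass(frozen=True)
class RetryPlan:
    resubmitted: frozenset  # partitions launched by the new stage attempt
    zombies: frozenset      # old-attempt tasks still running beside the retry
    free_slots: int         # executor slots the retry can use right away


def plan_whole_stage_rerun(tasks, total_slots, cancel_running):
    """Plan a retry in which every partition is recomputed.

    `tasks` maps partition id -> State in the failed stage attempt.
    Assumes one slot per task (a simplification for this sketch).
    """
    running = frozenset(p for p, s in tasks.items() if s is State.RUNNING)
    zombies = frozenset() if cancel_running else running
    return RetryPlan(
        resubmitted=frozenset(tasks),           # whole-stage rerun: all partitions
        zombies=zombies,                        # their partitions are recomputed anyway
        free_slots=total_slots - len(zombies),  # zombies keep holding their slots
    )


failed_attempt = {
    0: State.FINISHED,
    1: State.RUNNING,
    2: State.RUNNING,
    3: State.FETCH_FAILED,
}

keep = plan_whole_stage_rerun(failed_attempt, total_slots=4, cancel_running=False)
kill = plan_whole_stage_rerun(failed_attempt, total_slots=4, cancel_running=True)

print(sorted(keep.zombies), keep.free_slots)  # [1, 2] 2
print(sorted(kill.zombies), kill.free_slots)  # [] 4
```

In this model, leaving the old tasks alone means partitions 1 and 2 are computed twice and the retry starts with only half of the slots, while cancelling them frees all four slots immediately.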


