Re: [PR] [CELEBORN-2264] Support cancel shuffle when write bytes exceeds threshold [celeborn]

via GitHub Wed, 11 Mar 2026 04:13:06 -0700


s0nskar commented on PR #3601:
URL: https://github.com/apache/celeborn/pull/3601#issuecomment-4038469927


   Wouldn't a server-side approach like 
https://github.com/apache/celeborn/pull/3336 make more sense for handling this? 
Just thinking out loud; a few cons I can see with this approach:
   
   1. It does not account for shuffle data the app already has stored on the 
Celeborn server, or for multiple shuffle stages running in parallel.
   
   2. We remove the written-bytes entry as soon as all mappers have completed:
   ```
         if (shuffleWriteLimitEnabled) {
           shuffleTotalWrittenBytes.remove(shuffleId)
         }
   ```
   but the shuffle data stays on the server until shuffle cleanup happens.
   
   3. There is no central config management. Such configs should be managed by 
a config store so they can be applied globally to all apps, instead of each app 
controlling them. (An override could be provided for certain apps.)
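
To make points 1 and 2 concrete, here is a minimal sketch of app-level accounting that sums written bytes across all of an app's shuffles and drops an entry only when the shuffle data is actually cleaned up, not when its mappers finish. `AppShuffleUsage` and its methods are hypothetical names for illustration, not existing Celeborn classes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper (not part of Celeborn): written bytes per shuffle for one app.
final class AppShuffleUsage {
  private final ConcurrentHashMap<Integer, LongAdder> bytesByShuffle =
      new ConcurrentHashMap<>();

  // Record bytes written for a shuffle; safe to call from concurrent stages.
  public void recordWrite(int shuffleId, long bytes) {
    bytesByShuffle.computeIfAbsent(shuffleId, id -> new LongAdder()).add(bytes);
  }

  // Assumed to be called when the shuffle data is deleted, not at mapper end.
  public void onShuffleCleanup(int shuffleId) {
    bytesByShuffle.remove(shuffleId);
  }

  // Total bytes the app still holds across all tracked shuffles.
  public long totalBytes() {
    long total = 0L;
    for (LongAdder adder : bytesByShuffle.values()) {
      total += adder.sum();
    }
    return total;
  }

  public boolean exceeds(long thresholdBytes) {
    return totalBytes() > thresholdBytes;
  }
}
```

With this shape, two parallel shuffles that are each under the threshold can still trip the limit together, and a finished shuffle keeps counting until its cleanup runs.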
   
   Cons of the server-side approach:
   
   1. Since it relies on heartbeats, for very high-throughput applications the 
gap between the threshold and the usage at which the app is actually killed can 
be large, but for normal applications it should be fine.
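
A rough way to size that gap, assuming usage is only observed once per heartbeat and ignoring any extra delay before the kill takes effect: the worst-case overshoot is about the write rate times the heartbeat interval. The class name and the numbers below are illustrative assumptions, not Celeborn defaults.

```java
// Hypothetical back-of-the-envelope estimate, not Celeborn code.
final class HeartbeatOvershoot {
  // Bytes that can be written past the threshold before the next heartbeat
  // reports the usage; throws ArithmeticException on long overflow.
  public static long worstCaseOvershootBytes(
      long writeBytesPerSecond, long heartbeatIntervalSeconds) {
    return Math.multiplyExact(writeBytesPerSecond, heartbeatIntervalSeconds);
  }
}
```

For example, with a 30 s interval, an app writing 2 GiB/s could overshoot by about 60 GiB, while one writing 50 MiB/s would overshoot by about 1.5 GiB.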
   
   
   @SteNicholas @RexXiong, I'd like to know your thoughts on this.
   
   
   

