saurabhd336 opened a new pull request, #3644:
URL: https://github.com/apache/celeborn/pull/3644

   ### What changes were proposed in this pull request?
   Celeborn often detects disk-full conditions too late, because DiskInfo's 
`usableSpace` is updated asynchronously in the worker heartbeat flow.
   In such cases, if heartbeats are missed and/or multiple very large writers 
push too much data into memory buffers (bypassing the disk-full based 
HARD_SPLIT checks), it can cause severe degradation.
   
   In some cases we have seen the configured disk usage limit breached easily, 
causing job degradation and cleanup failures (because RocksDB shares the disk 
with shuffle data), which makes the situation even worse.
   
   This change proposes a more real-time, coordinated acquisition during flush, 
making disk-full detection foolproof and preventing any spillage beyond the 
configured limits.
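
   As a rough illustration of what coordinated acquisition at flush time could 
look like, here is a minimal sketch. It is not code from this PR: the class 
`DiskReservation` and its methods are hypothetical names, and it assumes each 
flusher claims bytes from a shared per-disk budget before it writes.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a per-disk byte budget; not Celeborn's actual API. */
final class DiskReservation {
    // Bytes that flushers may still write before the configured limit is hit.
    private final AtomicLong available;

    DiskReservation(long usableBytes, long reservedBytes) {
        this.available = new AtomicLong(Math.max(0L, usableBytes - reservedBytes));
    }

    /** Atomically claims bytes before a flush; false means "treat the disk as full". */
    boolean tryAcquire(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be non-negative");
        }
        while (true) {
            long current = available.get();
            if (bytes > current) {
                return false; // would exceed the budget, so the caller must not write
            }
            if (available.compareAndSet(current, current - bytes)) {
                return true;
            }
            // Another flusher won the race; re-read the budget and retry.
        }
    }

    /** Returns bytes to the budget, e.g. after a failed flush or a deleted file. */
    void release(long bytes) {
        available.addAndGet(bytes);
    }

    /** Resets the budget from a fresh measurement (for example, on heartbeat). */
    void reconcile(long usableBytes, long reservedBytes) {
        available.set(Math.max(0L, usableBytes - reservedBytes));
    }

    long available() {
        return available.get();
    }
}
```

   Because the budget is decremented synchronously on the flush path, concurrent 
writers cannot jointly overshoot it between two heartbeats. One simplification 
to note: `reconcile` overwrites the counter, so a real implementation would 
also need to account for reservations that are claimed but not yet on disk.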
   
   ### Why are the changes needed?
   Disk-full detection is not foolproof: `usableSpace` is refreshed 
asynchronously, so the configured disk usage limit can be breached.
   
   ### Does this PR resolve a correctness bug?
   No
   
   ### Does this PR introduce _any_ user-facing change?
   TODO (will be added behind an off-by-default config, based on reviews)
   
   
   ### How was this patch tested?
   TODO
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
