saurabhd336 opened a new pull request, #3644: URL: https://github.com/apache/celeborn/pull/3644
### What changes were proposed in this pull request?

Celeborn often detects disk-full conditions too late because the `usableSpace` field of `DiskInfo` is updated asynchronously in the worker heartbeat flow. If heartbeats are missed, or several very large writers push too much data into memory buffers (bypassing the disk-full based HARD_SPLIT checks), the result can be severe degradation. In some cases we have seen the configured disk usage limit breached easily, causing job degradation and cleanup failures (because RocksDB shares the disk with shuffle data), which makes the situation even worse.

This change proposes a more real-time, coordinated acquisition during flush, making disk-full detection foolproof and preventing any writes beyond the configured limits.

### Why are the changes needed?

Disk-full detection is not foolproof.

### Does this PR resolve a correctness bug?

No

### Does this PR introduce _any_ user-facing change?

TODO (will add behind an off-by-default config based on reviews)

### How was this patch tested?

TODO
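
To illustrate the coordinated acquisition during flush described under "What changes were proposed", here is a minimal sketch. It assumes each flush reserves its bytes against a per-disk budget before writing. The class `DiskReservation` and its methods are hypothetical names for illustration only; they are not Celeborn's API and not necessarily how this PR implements the check.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical per-disk budget: flushers must reserve bytes before writing,
 * so concurrent writers cannot jointly exceed the configured limit.
 */
final class DiskReservation {
    private final long limitBytes;       // assumed configured usage cap for the disk
    private final AtomicLong usedBytes;  // bytes already written or reserved

    DiskReservation(long limitBytes, long initialUsedBytes) {
        this.limitBytes = limitBytes;
        this.usedBytes = new AtomicLong(initialUsedBytes);
    }

    /** Reserves {@code bytes} if they fit under the limit; returns false otherwise. */
    boolean tryAcquire(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be non-negative");
        }
        while (true) {
            long current = usedBytes.get();
            // Compare against the remaining budget to avoid long overflow.
            if (bytes > limitBytes - current) {
                return false; // caller would treat the disk as full
            }
            // Retry if another flusher changed the counter in the meantime.
            if (usedBytes.compareAndSet(current, current + bytes)) {
                return true;
            }
        }
    }

    /** Returns a reservation, e.g. after a failed write or file cleanup. */
    void release(long bytes) {
        usedBytes.addAndGet(-bytes);
    }

    /** Bytes still available under the limit. */
    long remaining() {
        return limitBytes - usedBytes.get();
    }
}
```

In this sketch a flusher would call `tryAcquire(buffer size)` right before writing and treat a `false` result as a disk-full signal, rather than relying on a `usableSpace` value that is only refreshed on heartbeat. The compare-and-set loop keeps the check and the reservation atomic across concurrent writers without a lock.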
