leixm opened a new pull request, #3642:
URL: https://github.com/apache/celeborn/pull/3642

   ### What changes were proposed in this pull request?
   Add support for a configurable retry interval in `CommitHandler`.
   
   
   ### Why are the changes needed?
   When a `commitFiles` RPC fails, the current implementation retries immediately,
   with no backoff. If the worker is experiencing transient network issues, an
   immediate retry is likely to fail again. This PR adds a configurable retry
   interval (`celeborn.client.requestCommitFiles.retryInterval`, default 10s) that
   gives the worker time to recover before the next attempt, which should improve
   the success rate of retries. A dedicated `ScheduledExecutorService` schedules
   the delayed retries, so threads in the shared RPC pool are not blocked while
   waiting.
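
The idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `CommitRetrySketch`, the helper `callWithRetry`, and its parameters are hypothetical names, and the RPC is modeled as a `Supplier` of a `CompletableFuture`. The point it shows is that a failed attempt is rescheduled on a separate `ScheduledExecutorService` after the interval, rather than sleeping on the calling thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class CommitRetrySketch {

    // Hypothetical helper (not Celeborn's API): runs an async call and, on
    // failure, retries it up to maxRetries times with a fixed delay between
    // attempts.
    public static <T> CompletableFuture<T> callWithRetry(
            Supplier<CompletableFuture<T>> rpc,
            int maxRetries,
            long retryIntervalMs,
            ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(rpc, maxRetries, retryIntervalMs, scheduler, result);
        return result;
    }

    private static <T> void attempt(
            Supplier<CompletableFuture<T>> rpc,
            int retriesLeft,
            long retryIntervalMs,
            ScheduledExecutorService scheduler,
            CompletableFuture<T> result) {
        CompletableFuture<T> call;
        try {
            call = rpc.get();
        } catch (RuntimeException e) {
            // Treat a synchronous failure the same way as an asynchronous one.
            call = new CompletableFuture<>();
            call.completeExceptionally(e);
        }
        call.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (retriesLeft <= 0) {
                result.completeExceptionally(error);
            } else {
                // Wait on the dedicated scheduler instead of blocking the caller.
                scheduler.schedule(
                        () -> attempt(rpc, retriesLeft - 1, retryIntervalMs, scheduler, result),
                        retryIntervalMs,
                        TimeUnit.MILLISECONDS);
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();

        // Simulated RPC that fails twice before succeeding.
        Supplier<CompletableFuture<String>> flakyRpc = () -> {
            CompletableFuture<String> f = new CompletableFuture<>();
            if (attempts.incrementAndGet() < 3) {
                f.completeExceptionally(new RuntimeException("transient failure"));
            } else {
                f.complete("committed");
            }
            return f;
        };

        String outcome = callWithRetry(flakyRpc, 3, 100L, scheduler).get(5, TimeUnit.SECONDS);
        System.out.println(outcome + " after " + attempts.get() + " attempts");
        scheduler.shutdown();
    }
}
```

In this sketch the delay is fixed, matching the single interval setting described above; an exponential backoff would be a different design and is not what the description proposes.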
   
   
   ### Does this PR resolve a correctness bug?
   No.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Existing unit tests.
   


