[PR] [CELEBORN-2040] Avoid throw FetchFailedException when GetReducerFileGroupResponse failed via broadcast [celeborn]

via GitHub Fri, 20 Jun 2025 04:53:20 -0700


vastian180 opened a new pull request, #3341:
URL: https://github.com/apache/celeborn/pull/3341


   ### What changes were proposed in this pull request?
   
   In our production environment, obtaining GetReducerFileGroupResponse 
via broadcast [CELEBORN-1921] can fail for reasons such as executor 
preemption or local disk errors while a task is writing broadcast data. 
These failures throw a CelebornIOException, which is eventually converted to a 
FetchFailedException. 
   
   However, I think these errors are not caused by shuffle-related metadata 
loss, so a FetchFailedException should not be thrown to trigger a stage retry. 
Instead, the task should simply fail and be retried at the task level.
   
   ### Why are the changes needed?
   
   To reduce false positive fetch failures.
   
   ### Does this PR introduce _any_ user-facing change?
   When `ShuffleClient.deserializeReducerFileGroupResponse(shuffleId, 
response.broadcast())` returns null, no fetch failure is reported; instead, 
the task simply fails and is retried at the task level.
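
   The intended behavior can be illustrated with a minimal sketch. It is not the 
actual Celeborn code: `BroadcastFailureSketch`, `TaskLevelFailure`, and 
`resolveFileGroups` are hypothetical names, and the `Supplier` stands in for the 
broadcast deserialization call. The only point shown is that a null result from 
deserialization surfaces as an ordinary task failure rather than as a fetch failure.

```java
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the Celeborn code base.
final class BroadcastFailureSketch {

    // Stand-in for an ordinary exception that fails only the current task,
    // so the scheduler retries the task instead of the whole stage.
    static final class TaskLevelFailure extends RuntimeException {
        TaskLevelFailure(String message) {
            super(message);
        }
    }

    // The supplier plays the role of the broadcast deserialization step and
    // is assumed to return null when the broadcast data cannot be read.
    static String resolveFileGroups(int shuffleId, Supplier<String> broadcastDecoder) {
        String fileGroups = broadcastDecoder.get();
        if (fileGroups == null) {
            // Shuffle metadata is not known to be lost here, so no fetch
            // failure is raised; the task just fails.
            throw new TaskLevelFailure(
                "Failed to deserialize file groups from broadcast for shuffle " + shuffleId);
        }
        return fileGroups;
    }
}
```

   Under these assumptions, a caller that gets a non-null value proceeds as 
before, while a null value results in `TaskLevelFailure` and leaves the stage 
untouched.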
   
   ### How was this patch tested?
   Long-running validation in production.
   


