SteNicholas opened a new pull request, #3123: URL: https://github.com/apache/celeborn/pull/3123
### What changes were proposed in this pull request?

Backport https://github.com/apache/celeborn/pull/3103 to branch-0.5.

While testing the performance of Flink with Celeborn in session mode, using both the shuffle-service plugin and the hybrid shuffle integration strategies, I noticed that the task heap kept growing even when no jobs were running. The cause is that the Celeborn client sends addCredit and notifyRequiredSegment requests as RPCs that expect a response. Each request creates a callback that holds references to CelebornBufferStream, SingleInputGate, and StreamTask. These callbacks are stored in TransportResponseHandler#outstandingRpcs and are removed only when a response arrives. However, the Worker does not send a response for these two RPC requests, so the callbacks are never removed, which leads to a significant memory leak in the client.

### Why are the changes needed?

Without this fix, the Flink client can leak memory.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested by rerunning TPCDS in Flink session mode.
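The leak mechanism described in this pull request can be illustrated with a minimal, self-contained Java sketch. It is not Celeborn code: the `ResponseHandler` class, the `Callback` interface, and the `sendRpc`, `sendOneWay`, and `handleResponse` methods are hypothetical stand-ins that only model a map of pending callbacks keyed by request id. The sketch does not claim to show how the actual patch resolves the problem; it only shows why a request that never receives a response keeps its callback (and everything the callback references) reachable, and two generic ways such an entry can be avoided or cleared.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class OutstandingRpcLeakSketch {

    /** Hypothetical stand-in for an RPC response callback. */
    interface Callback {
        void onSuccess(byte[] response);
    }

    /** Hypothetical, simplified model of a handler that tracks pending RPCs. */
    static class ResponseHandler {
        private final Map<Long, Callback> outstandingRpcs = new ConcurrentHashMap<>();
        private final AtomicLong nextId = new AtomicLong();

        /** Registers the callback until a response with the same id arrives. */
        long sendRpc(byte[] payload, Callback callback) {
            long id = nextId.getAndIncrement();
            outstandingRpcs.put(id, callback);
            return id;
        }

        /** Fire-and-forget: nothing is registered, so nothing can be retained. */
        void sendOneWay(byte[] payload) {
            // A real client would write the payload to the channel here.
        }

        /** Removes the pending callback and invokes it, if one is registered. */
        void handleResponse(long id, byte[] response) {
            Callback callback = outstandingRpcs.remove(id);
            if (callback != null) {
                callback.onSuccess(response);
            }
        }

        int outstandingCount() {
            return outstandingRpcs.size();
        }
    }

    /** Sends n RPCs that are never answered; every callback stays in the map. */
    static int pendingAfterUnansweredRpcs(int n) {
        ResponseHandler handler = new ResponseHandler();
        for (int i = 0; i < n; i++) {
            // Stands in for the stream, gate, and task objects a callback captures.
            byte[] capturedState = new byte[16];
            handler.sendRpc(new byte[0], response -> {
                int unused = capturedState.length; // keeps capturedState reachable
            });
        }
        return handler.outstandingCount();
    }

    /** Sends n RPCs and answers each one; the map is empty afterwards. */
    static int pendingAfterAnsweredRpcs(int n) {
        ResponseHandler handler = new ResponseHandler();
        for (int i = 0; i < n; i++) {
            long id = handler.sendRpc(new byte[0], response -> { });
            handler.handleResponse(id, new byte[0]);
        }
        return handler.outstandingCount();
    }

    /** Sends n one-way messages; no callback is ever registered. */
    static int pendingAfterOneWay(int n) {
        ResponseHandler handler = new ResponseHandler();
        for (int i = 0; i < n; i++) {
            handler.sendOneWay(new byte[0]);
        }
        return handler.outstandingCount();
    }

    public static void main(String[] args) {
        System.out.println("unanswered RPCs pending: " + pendingAfterUnansweredRpcs(1000));
        System.out.println("answered RPCs pending:   " + pendingAfterAnsweredRpcs(1000));
        System.out.println("one-way sends pending:   " + pendingAfterOneWay(1000));
    }
}
```

In this model, only the unanswered case leaves entries behind, which matches the symptom reported above: a heap that keeps growing in a long-lived session cluster because periodic requests accumulate callbacks that nothing ever removes.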
