leixm commented on PR #2975: URL: https://github.com/apache/celeborn/pull/2975#issuecomment-2519277606
There are three problems in total:

1. In PartitionDataWriter#write, if an OOM occurs in flushBuffer.addComponent(true, data), data is not released.
2. In PushDataHandler#writeLocalData, if an OOM occurs in fileWriter.write(body), fileWriter.decrementPendingWrites() is not called, so some fileWriters wait for a period of time when close() is called.
3. For a failed shuffle, commit is never called and the PartitionDataWriter is never closed, so returnBuffer is never called. The PartitionDataWriter should be closed when the shuffle expires.
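To illustrate problems 1 and 2, here is a minimal sketch of the cleanup pattern, not the actual patch. The classes RefCountedBuffer and SketchWriter are hypothetical stand-ins for a reference-counted buffer and for PartitionDataWriter. The placement of incrementPendingWrites() and the simulated OutOfMemoryError are assumptions made only so the example is self-contained.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-ins for illustration; these are not Celeborn's classes. */
final class WriteCleanupSketch {

    /** Minimal reference-counted buffer (assumed stand-in for a pooled buffer). */
    static final class RefCountedBuffer {
        private final AtomicInteger refCnt = new AtomicInteger(1);

        void release() { refCnt.decrementAndGet(); }

        int refCnt() { return refCnt.get(); }
    }

    /** Simplified writer that tracks in-flight writes, as close() would wait on them. */
    static final class SketchWriter {
        final AtomicInteger pendingWrites = new AtomicInteger();
        private final boolean failOnAppend;

        SketchWriter(boolean failOnAppend) { this.failOnAppend = failOnAppend; }

        void incrementPendingWrites() { pendingWrites.incrementAndGet(); }

        void decrementPendingWrites() { pendingWrites.decrementAndGet(); }

        /** Problem 1: release data if appending it to the flush buffer fails. */
        void write(RefCountedBuffer data) {
            boolean appended = false;
            try {
                append(data);
                appended = true;
            } finally {
                if (!appended) {
                    // Ownership was never transferred, so the writer must release here.
                    data.release();
                }
            }
        }

        private void append(RefCountedBuffer data) {
            if (failOnAppend) {
                throw new OutOfMemoryError("simulated allocation failure");
            }
            // On success the flush buffer owns data and releases it after flushing.
        }
    }

    /** Problem 2: always decrement the pending-write counter, even on failure. */
    static boolean writeLocalData(SketchWriter writer, RefCountedBuffer body) {
        writer.incrementPendingWrites();
        try {
            writer.write(body);
            return true;
        } catch (Throwable t) {
            // The sketch only reports failure; real code would log or propagate it.
            return false;
        } finally {
            writer.decrementPendingWrites();
        }
    }
}
```

With a writer constructed as new SketchWriter(true), writeLocalData reports failure, the buffer's reference count drops to zero, and pendingWrites returns to zero, so a later close() would have nothing left to wait for. Problem 3 is not covered by this sketch.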
