hudi-bot opened a new issue, #15627: URL: https://github.com/apache/hudi/issues/15627
Currently, when writing in Spark we keep the file writers for individual partitions open for as long as we're processing the batch. As a result, all of the data written out is kept in memory (at least the last row group, in the case of Parquet writers) until the batch is fully processed and all of the writers are closed. While this lets us better control how many files are created in each partition (since the writer stays open, we don't need to create a new file when another record for that partition arrives), it carries the penalty of keeping all of the data in memory, potentially leading to OOMs, longer GC cycles, etc.

## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-5385
- Type: Bug
- Epic: https://issues.apache.org/jira/browse/HUDI-3249
- Affects version(s): 0.12.1
- Fix version(s): 1.1.0
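To make the tradeoff concrete, here is a minimal, purely illustrative sketch (not Hudi code; `BoundedWriterPool` and its members are hypothetical names): a pool that caps the number of simultaneously open per-partition writers and evicts the least-recently-used one when the cap is hit. Evicting flushes that writer's buffered rows, which bounds memory at the cost of producing an extra file if the same partition is written to again later — exactly the files-vs-memory tension described above.

```python
from collections import OrderedDict

class BoundedWriterPool:
    """Illustrative sketch: cap the number of open per-partition writers
    so buffered data cannot grow without bound. Closing the
    least-recently-used writer flushes its buffer, trading extra files
    for lower memory pressure. A plain list stands in for a real
    (e.g. Parquet) writer that buffers a row group in memory."""

    def __init__(self, max_open=2):
        self.max_open = max_open
        self.open_writers = OrderedDict()  # partition -> buffered rows
        self.closed_files = []             # (partition, rows) flushed to "disk"

    def write(self, partition, record):
        if partition not in self.open_writers:
            if len(self.open_writers) >= self.max_open:
                # Evict the least-recently-used writer: its buffered
                # rows are flushed, freeing memory, but a later write
                # to that partition will open a new file.
                lru, buf = self.open_writers.popitem(last=False)
                self.closed_files.append((lru, buf))
            self.open_writers[partition] = []
        self.open_writers.move_to_end(partition)  # mark as recently used
        self.open_writers[partition].append(record)

    def close_all(self):
        # End of batch: flush every still-open writer.
        while self.open_writers:
            part, buf = self.open_writers.popitem(last=False)
            self.closed_files.append((part, buf))
```

For example, with `max_open=2`, writing to partitions `a`, `b`, `c`, then `a` again produces two files for `a` (one from eviction, one at close), whereas today's keep-everything-open behavior would produce one file per partition but buffer all three partitions' data until the batch completes.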
