hudi-bot opened a new issue, #15627:
URL: https://github.com/apache/hudi/issues/15627

   Currently, when writing in Spark we keep the file writers for individual 
partitions open for as long as we're processing the batch. This entails that 
all of the data written out is kept in memory (at least the last row group, in 
the case of Parquet writers) until the batch is fully processed and all of the 
writers are closed.
   
   While this gives us better control over how many files are created in each 
partition (keeping the writer open means we don't need to create a new file 
when a new record comes in), it carries the penalty of holding all of that 
data in memory, potentially leading to OOMs, longer GC cycles, etc.
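   To make the tradeoff concrete, here is a minimal sketch of the pattern described above: a pool that keeps one writer open per partition so each partition produces a single file, at the cost of every open writer's buffer staying resident until the batch ends. This is a hypothetical illustration using only `java.nio` buffered writers, not Hudi's actual writer classes (the class name `PartitionWriterPool` and its methods are assumptions for this example).

   ```java
   import java.io.IOException;
   import java.io.Writer;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.util.HashMap;
   import java.util.Map;

   // Hypothetical sketch, not Hudi code: one open writer per partition,
   // reused across records so a partition yields a single file per batch.
   public class PartitionWriterPool implements AutoCloseable {
       private final Path baseDir;
       // Buffered data in every open writer stays in memory until close(),
       // analogous to the last in-memory row group of a Parquet writer.
       private final Map<String, Writer> openWriters = new HashMap<>();

       public PartitionWriterPool(Path baseDir) {
           this.baseDir = baseDir;
       }

       // Reuses an existing writer if the partition was seen before,
       // so no new file is created when another record arrives.
       public void write(String partition, String record) throws IOException {
           Writer w = openWriters.get(partition);
           if (w == null) {
               Path dir = baseDir.resolve(partition);
               Files.createDirectories(dir);
               w = Files.newBufferedWriter(dir.resolve("data.txt"));
               openWriters.put(partition, w);
           }
           w.write(record);
           w.write('\n');
       }

       // Number of writers (and hence buffers) currently held open.
       public int openWriterCount() {
           return openWriters.size();
       }

       // Flushes and releases all buffered data at the end of the batch.
       @Override
       public void close() throws IOException {
           for (Writer w : openWriters.values()) {
               w.close();
           }
           openWriters.clear();
       }
   }
   ```

   With many partitions in one batch, `openWriterCount()` grows with partition cardinality, which is exactly the memory pressure the issue describes; closing writers eagerly would bound memory but create more files per partition.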
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-5385
   - Type: Bug
   - Epic: https://issues.apache.org/jira/browse/HUDI-3249
   - Affects version(s):
     - 0.12.1
   - Fix version(s):
     - 1.1.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
