Will-Lo opened a new pull request, #3751: URL: https://github.com/apache/gobblin/pull/3751
Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

### JIRA
- [ ] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
    - https://issues.apache.org/jira/browse/GOBBLIN-XXX

### Description
- [ ] Here are some details about my PR, including screenshots (if applicable):

The current ORCWriter, which converts Avro to ORC, frequently runs into OOM issues on large schemas. This is theorized to be partly due to the way the converter allocates memory for large lists and maps: it uses a resize algorithm that multiplies the last array size by 3. This can leave a lot of unused space which, together with the large records already held in the buffer and the file writer, causes memory issues.

This PR introduces a few components/ideas to manage memory:

1. The converter also estimates the size of each record, since it already has to traverse the record in order to perform the conversion.
2. The internal buffer of the `GobblinBaseOrcWriter` should account for the memory available in the JVM (which is available through the Java runtime APIs), minus the size of the records that can be stored in the underlying file writer and the size of the Avro-to-ORC converter after its resizes. The remainder is then divided by the average size of a record to determine how many records the buffer can hold.
3. There was a conscious decision not to re-initialize the underlying ORCWriter every time a tune is performed, because doing so would create a new file, which can lead to a large number of files in the end. Since rows are compressed when they are added to this writer, it should generally perform well enough* if it is tuned once during each writer initialization; in Fast Ingest this occurs every 5 minutes.
4. The average record size and the size allocated to the converter are stored in the Gobblin state, and every writer initialization will use the previous run's calculations instead of slowly tuning up.

### Tests
- [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

### Commits
- [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"
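For readers following the Description, here is a minimal Java sketch of the buffer sizing idea in item 2 and of the 3x resize behaviour mentioned at the top. It is not the code in this PR: the class `OrcBufferSizingSketch`, its method names, and the lower bound of one row are assumptions made for illustration. Only the `java.lang.Runtime` calls are real JDK APIs.

```java
public class OrcBufferSizingSketch {

    // Growth factor taken from the "multiplies the last array size by 3" description.
    static final int RESIZE_FACTOR = 3;

    /** Heap the JVM could still hand out: max heap minus what is currently in use. */
    static long availableJvmBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    /**
     * Hypothetical model of the converter's resize: multiply the capacity by
     * RESIZE_FACTOR until it can hold the required number of elements.
     */
    static long grownCapacity(long capacity, long required) {
        long c = Math.max(1, capacity);
        while (c < required) {
            c *= RESIZE_FACTOR;
        }
        return c;
    }

    /**
     * Rows the internal buffer may hold: (available - file writer - converter)
     * divided by the average record size. Falls back to 1 row when the budget
     * or the average is not positive (an assumption, not from the PR).
     */
    static long estimateBufferRows(long availableBytes, long fileWriterBytes,
                                   long converterBytes, long avgRecordBytes) {
        long budget = availableBytes - fileWriterBytes - converterBytes;
        if (budget <= 0 || avgRecordBytes <= 0) {
            return 1;
        }
        return Math.max(1, budget / avgRecordBytes);
    }

    public static void main(String[] args) {
        // One element over capacity triples the allocation.
        System.out.println("capacity after resize: " + grownCapacity(1000, 1001));

        // Illustrative numbers only; real values would come from the converter's
        // size estimates and the previous run's figures kept in the Gobblin state.
        long rows = estimateBufferRows(availableJvmBytes(), 64L << 20, 16L << 20, 2048);
        System.out.println("buffer rows: " + rows);
    }
}
```

The resize helper shows why a list that is only slightly too small can leave roughly two thirds of the new allocation unused, which is the overhead the converter term in the subtraction is meant to cover.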
