[GitHub] [gobblin] Will-Lo opened a new pull request, #3751: [DRAFT] Create selftuning buffered ORC writer

via GitHub Fri, 25 Aug 2023 10:59:42 -0700


Will-Lo opened a new pull request, #3751:
URL: https://github.com/apache/gobblin/pull/3751


   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I 
have checked off all the steps below!
   
   
   ### JIRA
   - [ ] My PR addresses the following [Gobblin 
JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references 
them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-XXX
   
   
   ### Description
   - [ ] Here are some details about my PR, including screenshots (if 
applicable):
   
   The current ORCWriter, which converts Avro to ORC, frequently runs into OOM 
issues on large schemas. This is theorized to be partially due to the way the 
converter allocates memory for large lists and maps: it uses a resize 
algorithm that multiplies the last array size by 3. This can leave a lot of 
unused space which, together with the large records already stored in the 
buffer and the file writer, causes memory issues.
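   
   As a rough illustration of why a 3x resize can waste memory, here is a minimal sketch. It is not the converter's actual code: the class and method names are hypothetical, and it assumes the array is repeatedly tripled until it fits, which the description above does not confirm.

```java
// Hypothetical sketch, not Gobblin code: models a buffer that grows by
// multiplying its capacity by 3 until the required elements fit.
final class ResizeOverheadSketch {
    // The factor 3 comes from the PR description; the rest is an assumption.
    static final long RESIZE_FACTOR = 3L;

    /** Capacity reached after tripling initialCapacity until it holds `needed` elements. */
    static long capacityAfterResizes(long initialCapacity, long needed) {
        if (initialCapacity <= 0) {
            throw new IllegalArgumentException("initialCapacity must be positive");
        }
        long capacity = initialCapacity;
        while (capacity < needed) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            capacity = Math.multiplyExact(capacity, RESIZE_FACTOR);
        }
        return capacity;
    }

    /** Number of allocated but unused slots after the resizes. */
    static long unusedSlots(long initialCapacity, long needed) {
        return capacityAfterResizes(initialCapacity, needed) - needed;
    }
}
```

   Under these assumptions, a list that starts at 1000 slots and needs 3001 grows to 9000 slots, leaving 5999 unused.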
   
   This PR introduces a few components/ideas to manage memory:
   1. Have the converter also estimate the size of each record, since it 
already traverses the record in order to perform the conversion.
   2. The internal buffer of the `GobblinBaseOrcWriter` should account for the 
memory available in the JVM (which is available through the Java runtime APIs) 
minus the size of the records that can be stored in the underlying file writer 
and the size of the Avro to ORC converter due to resizes. It should then divide 
this number by the average size of a record to get the number of records the 
buffer can hold.
   3. There was a conscious decision not to re-initialize the underlying 
ORCWriter every time a tune is performed, because it would have to create a new 
file, which can lead to a large number of files in the end. Since compression 
is applied when rows are added to this writer, it should generally perform 
well enough* if it is tuned during each writer initialization at the 
beginning. In Fast Ingest this occurs every 5 minutes.
   4. The average record size and the size allocated to the converter are 
stored in the Gobblin state, and every writer initialization will use the 
previous run's calculations instead of slowly tuning up.
   
   
   ### Tests
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   
   ### Commits
   - [ ] My commits all reference JIRA issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
