[GitHub] [hudi] honeyaya opened a new pull request, #7255: [HUDI-5250] use the estimate record size when estimation threshold is l…

GitBox Sun, 20 Nov 2022 19:06:49 -0800


honeyaya opened a new pull request, #7255:
URL: https://github.com/apache/hudi/pull/7255


   Use the default estimated record size in averageBytesPerRecord() when the 
estimation threshold is less than 0.
   
   ### Change Logs
   
   Currently, Hudi derives the average record size from the records written 
during previous commits and uses it to estimate how many records to pack into 
one file. The relevant code is UpsertPartitioner.averageBytesPerRecord().
   
   However, we found that a single data file could grow to 600~700M while most 
other files are smaller than 200M.
   
   -  Reason
   
   1. When totalBytesWritten/totalRecordsWritten for the last commit is very 
small but the records in the next commit are much larger, the data files 
written by the next commit become very large.
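
The effect can be checked with rough arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, the 120 MB target and the 100-byte and 550-byte record sizes are assumed numbers rather than figures from this PR, and it ignores compression and encoding.

```java
// Illustrative arithmetic only; names and numbers are assumptions, not Hudi code.
final class FileSizeDriftSketch {

    // Number of records the partitioner would plan for one file,
    // given the target file size and the estimated bytes per record.
    static long recordsPerFile(long targetFileSizeBytes, long estimatedBytesPerRecord) {
        return targetFileSizeBytes / estimatedBytesPerRecord;
    }

    // Size the file actually reaches if each record turns out to be
    // actualBytesPerRecord bytes instead of the estimate.
    static long actualFileSizeBytes(long recordsAssigned, long actualBytesPerRecord) {
        return recordsAssigned * actualBytesPerRecord;
    }
}
```

With an assumed target of 120 MB and an estimate of 100 bytes per record, about 1.26 million records are planned per file. If the records in the next commit actually average 550 bytes, that file grows to about 660 MB, which is the kind of skew described above.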
   
   - Solution plans
   
   1. Plan 1: Calculate avgSize over the past several commits rather than only 
one. However, getCommitMetadata is expensive, so this function might become 
slow; we did not choose this option.
   
   2. Plan 2: Use the configured record size estimate, since our record size is 
fixed in some sense. Moreover, the Hudi community discourages adding another 
boolean config to control whether the last commit's avgSize is used, so we 
reuse the estimation threshold: when it is less than 0, we use the default 
record size estimate.
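
The chosen behaviour (fall back to the default estimate when the threshold is negative) can be sketched as follows. This is a simplified, self-contained model, not the actual Hudi implementation: CommitStats and the method parameters are hypothetical stand-ins for the commit metadata and the write config, and the loop over commits follows the behaviour described in the config documentation.

```java
import java.util.List;

// Simplified model of the record size estimation; not the real UpsertPartitioner.
final class RecordSizeEstimateSketch {

    // Hypothetical stand-in for the write statistics of one commit.
    static final class CommitStats {
        final long totalBytesWritten;
        final long totalRecordsWritten;

        CommitStats(long totalBytesWritten, long totalRecordsWritten) {
            this.totalBytesWritten = totalBytesWritten;
            this.totalRecordsWritten = totalRecordsWritten;
        }
    }

    // commitsNewestFirst: commit statistics, most recent commit first.
    // defaultRecordSizeEstimate: the configured record size estimate in bytes.
    // estimationThreshold: the estimation threshold; negative means "skip history".
    // smallFileLimitBytes: the small file limit used to scale the threshold.
    static long averageBytesPerRecord(List<CommitStats> commitsNewestFirst,
                                      long defaultRecordSizeEstimate,
                                      double estimationThreshold,
                                      long smallFileLimitBytes) {
        // Behaviour proposed in this PR: a negative threshold disables the
        // history-based estimate and returns the configured default.
        if (estimationThreshold < 0) {
            return defaultRecordSizeEstimate;
        }
        long fileSizeThreshold = (long) (estimationThreshold * smallFileLimitBytes);
        // Otherwise use the most recent commit that wrote enough bytes
        // to give a meaningful average.
        for (CommitStats stats : commitsNewestFirst) {
            if (stats.totalBytesWritten > fileSizeThreshold && stats.totalRecordsWritten > 0) {
                return (long) Math.ceil((1.0 * stats.totalBytesWritten) / stats.totalRecordsWritten);
            }
        }
        // No commit was large enough: keep the configured default.
        return defaultRecordSizeEstimate;
    }
}
```

Reusing the threshold this way avoids a separate boolean switch: any negative value bypasses the commit history entirely, so no commit metadata needs to be read.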
   
   
   ### Impact
   
   Small; the change is limited to UpsertPartitioner.averageBytesPerRecord().
   
   ### Risk level (write none, low medium or high below)
   
   Low. The new behavior applies only when the estimation threshold is less 
than 0.
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated_
   
   > We use the previous commits' metadata to calculate the estimated record 
size and use it to bin pack records into partitions. If the previous commit is 
too small to make an accurate estimation, Hudi will search commits in the 
reverse order, until we find a commit that has totalBytesWritten larger than 
(PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold). Will use 
hoodie.copyonwrite.record.size.estimate value when this value is less than 0.
   
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
