honeyaya opened a new pull request, #7255: URL: https://github.com/apache/hudi/pull/7255
Use the default estimated record size in averageBytesPerRecord() when the estimation threshold is less than 0

### Change Logs

Currently, Hudi derives the average record size from the records written in previous commits and uses it to estimate how many records to pack into one file; the relevant code is UpsertPartitioner.averageBytesPerRecord(). However, we found that a single data file can grow to 600~700M while most other files stay below 200M.

- Reason
  1. If totalBytesWritten/totalRecordsWritten for the last commit is very small but the records in the next commit are very large, too many records are packed into each file and the data files become very large.
- Solution plans
  1. Plan 1: calculate avgSize over the past several commits rather than only one. However, getCommitMetadata is expensive and could make this function slow, so we did not choose this plan.
  1. Plan 2: use the configured record size estimate, since our data size is roughly fixed. The Hudi community also discourages adding another boolean config to control whether the last commit's avgSize is used, so we reuse the estimation threshold instead: when it is less than 0, we use the default record size estimate.

### Impact

UpsertPartitioner.averageBytesPerRecord(); small.

### Risk level (write none, low medium or high below)

low; the new behavior applies only when the estimation threshold is less than 0.

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated_

> We use the previous commits' metadata to calculate the estimated record size and use it to bin pack records into partitions. If the previous commit is too small to make an accurate estimation, Hudi will search commits in the reverse order, until we find a commit that has totalBytesWritten larger than (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold).
> Will use the hoodie.copyonwrite.record.size.estimate value when this threshold is less than 0.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
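
### Illustrative sketch

To make the behavior described under Change Logs concrete, here is a minimal, self-contained Java sketch of the estimation logic with the proposed negative-threshold shortcut. It is not the code from this PR or from UpsertPartitioner: the class `RecordSizeEstimateSketch`, the `CommitStats` holder, the helper `recordsPerFile`, and all numeric values are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; names and values are illustrative, not Hudi's actual API. */
final class RecordSizeEstimateSketch {

  /** Stand-in for the write statistics stored in one commit's metadata. */
  static final class CommitStats {
    final long totalBytesWritten;
    final long totalRecordsWritten;

    CommitStats(long totalBytesWritten, long totalRecordsWritten) {
      this.totalBytesWritten = totalBytesWritten;
      this.totalRecordsWritten = totalRecordsWritten;
    }
  }

  /**
   * Estimates the average record size in bytes.
   * commitsNewestFirst must be ordered from the latest commit to the oldest.
   */
  static long averageBytesPerRecord(List<CommitStats> commitsNewestFirst,
                                    long defaultRecordSizeEstimate,
                                    long smallFileLimitBytes,
                                    double estimationThreshold) {
    // Behavior proposed in this PR: a negative threshold skips the
    // commit-history lookup and returns the configured estimate directly.
    if (estimationThreshold < 0) {
      return defaultRecordSizeEstimate;
    }
    // A commit is trusted only if it wrote more than limit * threshold bytes.
    long minBytes = (long) (smallFileLimitBytes * estimationThreshold);
    for (CommitStats c : commitsNewestFirst) {
      if (c.totalBytesWritten > minBytes && c.totalRecordsWritten > 0) {
        return (long) Math.ceil((double) c.totalBytesWritten / c.totalRecordsWritten);
      }
    }
    // No commit was large enough: fall back to the configured estimate.
    return defaultRecordSizeEstimate;
  }

  /** Number of records packed into one file for a given size estimate. */
  static long recordsPerFile(long targetFileSizeBytes, long avgBytesPerRecord) {
    return targetFileSizeBytes / Math.max(1L, avgBytesPerRecord);
  }

  public static void main(String[] args) {
    // Assumed history: the last commit wrote 2,000,000 records of ~100 bytes each.
    List<CommitStats> commits = Arrays.asList(new CommitStats(200_000_000L, 2_000_000L));
    long targetFileSize = 125_829_120L; // 120 MiB, illustrative
    long smallFileLimit = 104_857_600L; // 100 MiB, illustrative

    long fromHistory = averageBytesPerRecord(commits, 1024L, smallFileLimit, 1.0);
    long fromDefault = averageBytesPerRecord(commits, 1024L, smallFileLimit, -1.0);

    System.out.println("history-based estimate: " + fromHistory
        + " bytes, records per file: " + recordsPerFile(targetFileSize, fromHistory));
    System.out.println("default estimate:       " + fromDefault
        + " bytes, records per file: " + recordsPerFile(targetFileSize, fromDefault));
  }
}
```

With these assumed numbers, the history-based estimate is 100 bytes, so about 1.26 million records are assigned to one file. If the next commit's records actually average 500 bytes, that file grows to roughly 629 MB, which is the kind of oversized file described above. With a negative threshold the configured estimate of 1024 bytes is used regardless of what the previous commit looked like.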