[ https://issues.apache.org/jira/browse/HUDI-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar closed HUDI-1398.
--------------------------------
    Resolution: Fixed

> Align insert file size for reducing IO
> --------------------------------------
>
>                 Key: HUDI-1398
>                 URL: https://issues.apache.org/jira/browse/HUDI-1398
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: steven zhang
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>
> Currently, if there are any records left over, we insert
> totalUnassignedInserts into new files and set the number of records for
> each new bucket as follows:
> recordsPerBucket.add(totalUnassignedInserts / insertBuckets);
> ([https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java]
> L 188)
> This just computes the average number of records per bucket, so it may
> create new small files.
> For example:
> totalUnassignedInserts = 250
> insertRecordsPerBucket = 120
> so insertBuckets = 3 (e.g. file_a, file_b, file_c)
> then file_a = file_b = file_c = 83
> All three of these files will count as small files when the next delta is
> processed, whereas we can reduce IO by setting file_a = 120, file_b = 120,
> file_c = 10.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
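
The arithmetic in the example can be sketched in Java as below. This is only an illustration of the two splitting strategies described in the issue, not the actual UpsertPartitioner code or the change that was merged. The class InsertBucketSizing and both method names are hypothetical, and deriving insertBuckets by rounding up is an assumption that matches the example (250 / 120 gives 3 buckets).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Apache Hudi.
final class InsertBucketSizing {

    // Behaviour described in the issue: every new bucket gets the average count.
    // Assumption: the bucket count is the total divided by the target, rounded up.
    static List<Long> averageSplit(long totalUnassignedInserts, long insertRecordsPerBucket) {
        if (insertRecordsPerBucket <= 0) {
            throw new IllegalArgumentException("insertRecordsPerBucket must be positive");
        }
        int insertBuckets =
            (int) Math.ceil((double) totalUnassignedInserts / insertRecordsPerBucket);
        List<Long> recordsPerBucket = new ArrayList<>();
        for (int i = 0; i < insertBuckets; i++) {
            // Integer division, as in the line quoted in the issue.
            recordsPerBucket.add(totalUnassignedInserts / insertBuckets);
        }
        return recordsPerBucket;
    }

    // Proposal from the issue: fill each bucket up to the target size and
    // leave only the remainder in the last bucket.
    static List<Long> alignedSplit(long totalUnassignedInserts, long insertRecordsPerBucket) {
        if (insertRecordsPerBucket <= 0) {
            throw new IllegalArgumentException("insertRecordsPerBucket must be positive");
        }
        List<Long> recordsPerBucket = new ArrayList<>();
        long remaining = totalUnassignedInserts;
        while (remaining > 0) {
            long records = Math.min(remaining, insertRecordsPerBucket);
            recordsPerBucket.add(records);
            remaining -= records;
        }
        return recordsPerBucket;
    }
}
```

With the numbers from the issue, averageSplit(250, 120) yields three buckets of 83 records, while alignedSplit(250, 120) yields 120, 120 and 10, so only the last file is left undersized.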