[jira] [Updated] (HUDI-1398) Align insert file size for reducing IO
[ https://issues.apache.org/jira/browse/HUDI-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-1398:
---------------------------------
    Status: Open  (was: New)

> Align insert file size for reducing IO
> --------------------------------------
>
>                 Key: HUDI-1398
>                 URL: https://issues.apache.org/jira/browse/HUDI-1398
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: steven zhang
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>
> Currently, if any inserts remain unassigned, we write totalUnassignedInserts
> into new files and set the number of records for each new bucket as follows:
>   recordsPerBucket.add(totalUnassignedInserts / insertBuckets);
> ([https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java]
> L 188)
> This only computes the average number of records per bucket, so it may
> create new small files.
> For example:
>   totalUnassignedInserts = 250
>   insertRecordsPerBucket = 120
> so insertBuckets = 3 (e.g. file_a, file_b, file_c)
> and file_a = file_b = file_c = 83.
> All three of these files will then count as small files when the next delta
> is processed.
> We can reduce IO by setting file_a = 120, file_b = 120, file_c = 10.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
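The arithmetic in the issue description can be sketched as follows. This is only an illustration of the two allocation strategies, not the code in UpsertPartitioner or in the linked pull request: the class name BucketSizing and the methods bucketCount, averaged and aligned are hypothetical, and the ceiling division used for insertBuckets is an assumption chosen because it reproduces the example (250 records at 120 per bucket giving 3 buckets).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hudi: compares the two ways of splitting
// unassigned inserts across new buckets that the issue describes.
final class BucketSizing {
    private BucketSizing() {}

    // Assumed ceiling division: 250 records at 120 per bucket -> 3 buckets.
    static int bucketCount(long totalUnassignedInserts, long insertRecordsPerBucket) {
        if (insertRecordsPerBucket <= 0) {
            throw new IllegalArgumentException("insertRecordsPerBucket must be positive");
        }
        return (int) ((totalUnassignedInserts + insertRecordsPerBucket - 1) / insertRecordsPerBucket);
    }

    // Behaviour described as current: every bucket gets the integer average.
    static List<Long> averaged(long totalUnassignedInserts, long insertRecordsPerBucket) {
        int insertBuckets = bucketCount(totalUnassignedInserts, insertRecordsPerBucket);
        List<Long> recordsPerBucket = new ArrayList<>();
        for (int i = 0; i < insertBuckets; i++) {
            recordsPerBucket.add(totalUnassignedInserts / insertBuckets);
        }
        return recordsPerBucket;
    }

    // Behaviour proposed: fill each bucket to its target size and leave
    // only the final bucket short.
    static List<Long> aligned(long totalUnassignedInserts, long insertRecordsPerBucket) {
        if (insertRecordsPerBucket <= 0) {
            throw new IllegalArgumentException("insertRecordsPerBucket must be positive");
        }
        List<Long> recordsPerBucket = new ArrayList<>();
        long remaining = totalUnassignedInserts;
        while (remaining > 0) {
            long n = Math.min(insertRecordsPerBucket, remaining);
            recordsPerBucket.add(n);
            remaining -= n;
        }
        return recordsPerBucket;
    }
}
```

With the numbers from the issue, BucketSizing.averaged(250, 120) yields three buckets of 83 records each, while BucketSizing.aligned(250, 120) yields 120, 120 and 10, so at most one of the new files is below the target size.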
[jira] [Updated] (HUDI-1398) Align insert file size for reducing IO
[ https://issues.apache.org/jira/browse/HUDI-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-1398:
---------------------------------
    Fix Version/s: 0.7.0
[jira] [Updated] (HUDI-1398) Align insert file size for reducing IO
[ https://issues.apache.org/jira/browse/HUDI-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-1398:
--------------------------------
    Labels: pull-request-available  (was: )