[ https://issues.apache.org/jira/browse/HUDI-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-993:
--------------------------------
    Fix Version/s: 0.7.0

> Use hoodie.delete.shuffle.parallelism for Delete API
> ----------------------------------------------------
>
>                 Key: HUDI-993
>                 URL: https://issues.apache.org/jira/browse/HUDI-993
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Performance
>            Reporter: Dongwook Kwon
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>
> While HUDI-328 introduced the Delete API, I noticed the [deduplicateKeys|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/DeleteHelper.java#L51-L57] method doesn't apply any parallelism to its RDD operations, whereas [deduplicateRecords|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/WriteHelper.java#L104] for upsert does. And {{hoodie.delete.shuffle.parallelism}} doesn't seem to be used anywhere.
>
> I have found certain cases where the input RDD has low parallelism but the target table has large files, and the Spark job's performance suffers from that low parallelism; in such cases, an upsert with {{EmptyHoodieRecordPayload}} is faster than the Delete API. Also, upsert only avoids this because {{hoodie.combine.before.upsert}} is true by default; if it were disabled, upsert would hit the same issue.
>
> So I wonder whether the input RDD should be repartitioned to {{hoodie.delete.shuffle.parallelism}} when {{hoodie.combine.before.delete}} is false, so that performance holds up regardless of that setting. A rough sketch of what I mean follows below.
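>
> A minimal sketch, assuming the change goes into {{DeleteHelper.deduplicateKeys}} and that the caller passes the configured value (e.g. via a getter like {{config.getDeleteShuffleParallelism()}}; that name and this signature are assumptions, and this is not necessarily what the attached pull request does). It mirrors the {{reduceByKey}} overload with a numPartitions argument that {{WriteHelper.deduplicateRecords}} already uses for upsert:
> {code:java}
> import org.apache.hudi.common.model.HoodieKey;
> import org.apache.hudi.table.HoodieTable;
> import org.apache.spark.api.java.JavaRDD;
>
> // Sketch only: thread the configured delete parallelism into the dedup step,
> // the way WriteHelper.deduplicateRecords already does for upsert.
> private static JavaRDD<HoodieKey> deduplicateKeys(JavaRDD<HoodieKey> keys,
>     HoodieTable<?> table, int parallelism) {
>   if (table.getIndex().isGlobal()) {
>     // Global index: dedup on recordKey alone. The numPartitions argument makes
>     // the shuffle honor hoodie.delete.shuffle.parallelism instead of inheriting
>     // the (possibly tiny) partitioning of the input RDD.
>     return keys.keyBy(HoodieKey::getRecordKey)
>         .reduceByKey((key1, key2) -> key1, parallelism)
>         .values();
>   }
>   // Non-global index: distinct(numPartitions) likewise repartitions while deduping.
>   return keys.distinct(parallelism);
> }
> {code}
> Since the dedup step is skipped entirely when {{hoodie.combine.before.delete}} is false, that branch would additionally need a plain {{keys.repartition(parallelism)}} before tagging locations.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)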