[ 
https://issues.apache.org/jira/browse/HUDI-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-993:
--------------------------------
    Fix Version/s:     (was: 0.8.0)
                   0.7.0

> Use hoodie.delete.shuffle.parallelism for Delete API
> ----------------------------------------------------
>
>                 Key: HUDI-993
>                 URL: https://issues.apache.org/jira/browse/HUDI-993
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Performance
>            Reporter: Dongwook Kwon
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>
> While HUDI-328 introduced the Delete API, I noticed the 
> [deduplicateKeys|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/DeleteHelper.java#L51-L57]
>  method doesn't apply any parallelism to its RDD operation, whereas 
> [deduplicateRecords|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/WriteHelper.java#L104]
>  for upsert does. As a result, {{hoodie.delete.shuffle.parallelism}} doesn't 
> seem to be used at all.
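> For illustration, a minimal sketch (not the actual patch) of how 
> {{deduplicateKeys}} could honor the configured parallelism, mirroring what 
> {{deduplicateRecords}} already does for upsert; the extra {{parallelism}} 
> parameter is an assumption on my part:
> {code:java}
> import org.apache.hudi.common.model.HoodieKey;
> import org.apache.hudi.table.HoodieTable;
> import org.apache.spark.api.java.JavaRDD;
> 
> // Sketch only: pass the delete shuffle parallelism into the dedupe step,
> // the same way WriteHelper.deduplicateRecords does for upsert.
> public static JavaRDD<HoodieKey> deduplicateKeys(JavaRDD<HoodieKey> keys,
>     HoodieTable table, int parallelism) {
>   if (table.getIndex().isGlobal()) {
>     // Global index: dedupe on record key alone, shuffling into
>     // "parallelism" partitions instead of inheriting the input partitioning.
>     return keys.keyBy(HoodieKey::getRecordKey)
>         .reduceByKey((key1, key2) -> key1, parallelism)
>         .values();
>   }
>   // distinct(numPartitions) lets the dedupe shuffle fan out to the
>   // configured width rather than the (possibly tiny) input partition count.
>   return keys.distinct(parallelism);
> }
> {code}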
>  
> I found that in certain cases, e.g. when the input RDD has low parallelism 
> but the target table has large files, the Spark job's performance suffers 
> from that low parallelism; in such cases an upsert with 
> {{EmptyHoodieRecordPayload}} is actually faster than the Delete API. That is 
> only because {{hoodie.combine.before.upsert}} is true by default, which 
> routes upsert through the parallelized dedupe; if it were disabled, upsert 
> would hit the same issue.
> So I wonder whether the input RDD should be repartitioned to 
> {{hoodie.delete.shuffle.parallelism}} for better performance, regardless of 
> whether {{hoodie.combine.before.delete}} is enabled.
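> A rough sketch of that idea (the config accessor names here are assumed, 
> for illustration only):
> {code:java}
> // Sketch: always fan the input keys out to hoodie.delete.shuffle.parallelism,
> // whether or not hoodie.combine.before.delete is enabled.
> int parallelism = config.getDeleteShuffleParallelism();
> JavaRDD<HoodieKey> dedupedKeys = config.shouldCombineBeforeDelete()
>     ? deduplicateKeys(keys, table, parallelism)  // shuffle happens inside the dedupe
>     : keys.repartition(parallelism);             // no dedupe, so repartition explicitly
> {code}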



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
