[ 
https://issues.apache.org/jira/browse/HUDI-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6213:
---------------------------------
    Labels: pull-request-available  (was: )

> Parallelize deletion of files during rollback.
> ----------------------------------------------
>
>                 Key: HUDI-6213
>                 URL: https://issues.apache.org/jira/browse/HUDI-6213
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>
> Assume we are rolling back a commit with a large number of files (1K+) in a 
> partition.
> *Current strategy:*
> For each partition, create a rollback request which contains the list of all 
> the files to be deleted from that partition. Since each rollback request is 
> executed on an executor, in this model an executor would be deleting the 1K+ 
> files sequentially. This is slow and does not take advantage of the rollback 
> parallelism or the presence of multiple executors.
> *Changed strategy:*
> Each rollback request should only contain a single file to be deleted from a 
> partition. Since each rollback request is executed on an executor, in this 
> model 1K+ tasks will be executed in parallel on the available executors. This 
> will speed up the deletion part of the rollback.
>  
> We have several datasets where 90K+ files are inserted per commit, so 
> rolling back a failed commit takes hours. With this change it takes 
> minutes.
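
To illustrate the difference between the two strategies described above, here is a minimal sketch in Python. It is not Hudi code: `RollbackRequest`, `per_partition_requests`, `per_file_requests`, and the partition and file names are hypothetical and exist only for this example. It shows how the same set of files yields a few large requests under the current strategy and many single-file requests under the changed one.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RollbackRequest:
    # Hypothetical stand-in for a rollback request; not Hudi's actual class.
    partition_path: str
    files_to_delete: List[str]


def per_partition_requests(files_by_partition: Dict[str, List[str]]) -> List[RollbackRequest]:
    # Current strategy: one request per partition, holding every file to delete.
    return [RollbackRequest(p, list(files)) for p, files in files_by_partition.items()]


def per_file_requests(files_by_partition: Dict[str, List[str]]) -> List[RollbackRequest]:
    # Changed strategy: one request per file, so each deletion is its own task.
    return [
        RollbackRequest(p, [f])
        for p, files in files_by_partition.items()
        for f in files
    ]


# Made-up input: one large partition and one small one.
files_by_partition = {
    "2023/05/01": [f"file-{i}.parquet" for i in range(1000)],
    "2023/05/02": [f"file-{i}.parquet" for i in range(3)],
}

old_requests = per_partition_requests(files_by_partition)
new_requests = per_file_requests(files_by_partition)

# Number of requests, and the largest number of files any single request deletes.
print(len(old_requests), max(len(r.files_to_delete) for r in old_requests))  # 2 1000
print(len(new_requests), max(len(r.files_to_delete) for r in new_requests))  # 1003 1
```

With per-partition requests, the longest task deletes 1000 files one after another. With per-file requests, each task deletes a single file, so the work can be spread across however many executors and however much rollback parallelism are available.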



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
