[ https://issues.apache.org/jira/browse/HUDI-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-6213:
---------------------------------
    Labels: pull-request-available  (was: )

> Parallelize deletion of files during rollback.
> ----------------------------------------------
>
>                 Key: HUDI-6213
>                 URL: https://issues.apache.org/jira/browse/HUDI-6213
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>
> Assume we are rolling back a commit with a large number of files (1K+) in a
> partition.
>
> *Current strategy:*
> For each partition, create a rollback request that contains the list of all
> the files to be deleted from that partition. Since each rollback request is
> executed on a single executor, in this model one executor deletes the 1K+
> files sequentially. This is slow and takes no advantage of the rollback
> parallelism or of the presence of multiple executors.
>
> *Changed strategy:*
> Each rollback request should contain only a single file to be deleted from a
> partition. Since each rollback request is executed on an executor, in this
> model the 1K+ deletions run as separate tasks in parallel across the
> available executors. This speeds up the deletion part of the rollback.
>
> We have several datasets where more than 90K files are inserted per commit.
> Rolling back a failed commit on these datasets takes hours. With this change
> it takes minutes.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
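
The difference between the two strategies in the issue description can be sketched as follows. This is only an illustration under assumptions, not Hudi's code: Hudi itself is written in Java, and `RollbackRequest`, `plan_per_partition`, `plan_per_file`, the partition path, and the file names are all hypothetical. The sketch shows only how the number of schedulable requests changes; the deletion itself is not performed.

```python
from collections import namedtuple

# Hypothetical stand-in for a rollback request; not Hudi's actual class.
RollbackRequest = namedtuple("RollbackRequest", ["partition", "files"])


def plan_per_partition(files_by_partition):
    """Current strategy: one request per partition, carrying every file in it."""
    return [RollbackRequest(p, list(fs)) for p, fs in files_by_partition.items()]


def plan_per_file(files_by_partition):
    """Changed strategy: one request per file, so each can become its own task."""
    return [
        RollbackRequest(p, [f])
        for p, fs in files_by_partition.items()
        for f in fs
    ]


# Illustrative data: one partition holding 1,000 files to delete.
files_by_partition = {"2023/05/01": [f"file-{i}.parquet" for i in range(1000)]}

per_partition = plan_per_partition(files_by_partition)
per_file = plan_per_file(files_by_partition)

print(len(per_partition))  # 1 request: a single executor deletes 1,000 files in sequence
print(len(per_file))       # 1000 requests: can be spread across the available executors
```

With one request per file, the unit of scheduling is a single deletion, so the achievable parallelism is bounded by the number of executors and the configured rollback parallelism rather than by the number of partitions.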