[ https://issues.apache.org/jira/browse/HUDI-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
nicolas paris reassigned HUDI-4792:
-----------------------------------

    Assignee: nicolas paris

> Speed up cleaning with metadata table enabled
> ----------------------------------------------
>
>                 Key: HUDI-4792
>                 URL: https://issues.apache.org/jira/browse/HUDI-4792
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: nicolas paris
>            Assignee: nicolas paris
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, fetching the file groups to be deleted is parallelized over each
> partition. As a result, when there are many partitions, many calls are made
> to the metadata. While this is fine for the file system view, it is highly
> inefficient with the metadata table (MDT) view. Most likely, each call
> triggers a merge-on-read (MoR) on the MDT, and with thousands of partitions
> the process is incredibly slow.
> I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT,
> on a Hudi table with 40k partitions:
> * w/ MDT: 5 hours
> * w/o MDT: 5 minutes
> This slowness makes using the MDT unreasonable for tables with many
> partitions, because cleaning is a must-have.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
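
The call pattern described in the issue can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyMetadataTable`, `list_files`, `list_files_batch`, and the "one merge per call" cost model are hypothetical and are not Hudi APIs. The cost model follows the issue's guess that each call likely triggers a merge on the MDT. Batching is shown as one possible direction, not necessarily what the linked pull request does.

```python
class ToyMetadataTable:
    """Hypothetical stand-in for a metadata-table reader; not a Hudi API."""

    def __init__(self, files_by_partition):
        self._files = files_by_partition
        self.merged_reads = 0  # number of simulated merge passes

    def _merge_and_read(self):
        # Assumed cost model: every call pays for one merge of the table.
        self.merged_reads += 1
        return self._files

    def list_files(self, partition):
        # One lookup per partition, so one merge per partition.
        return list(self._merge_and_read().get(partition, []))

    def list_files_batch(self, partitions):
        # One lookup for all partitions, so a single merge.
        merged = self._merge_and_read()
        return {p: list(merged.get(p, [])) for p in partitions}


def plan_per_partition(mdt, partitions):
    """Mirrors the pattern in the issue: one metadata call per partition."""
    return {p: mdt.list_files(p) for p in partitions}


def plan_batched(mdt, partitions):
    """Alternative pattern: one metadata call covering every partition."""
    return mdt.list_files_batch(partitions)


partitions = [f"dt={i:05d}" for i in range(1000)]
files = {p: [f"{p}/file-0.parquet"] for p in partitions}

per_partition_mdt = ToyMetadataTable(files)
batched_mdt = ToyMetadataTable(files)

result_a = plan_per_partition(per_partition_mdt, partitions)
result_b = plan_batched(batched_mdt, partitions)

# Both plans see the same files, but the number of merges differs.
print(per_partition_mdt.merged_reads, batched_mdt.merged_reads)
```

In this model the per-partition plan performs one merge per partition while the batched plan performs one in total, which matches the shape of the slowdown reported above. The real cost in Hudi depends on how the metadata table reader caches and merges its log files.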