[ https://issues.apache.org/jira/browse/HUDI-5496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654303#comment-17654303 ]
voon commented on HUDI-5496:
----------------------------

Duplicate of HUDI-5173.

> Prevent Hudi from generating clustering plans with filegroups consisting of
> only 1 fileSlice
> --------------------------------------------------------------------------------------------
>
>                 Key: HUDI-5496
>                 URL: https://issues.apache.org/jira/browse/HUDI-5496
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: voon
>            Assignee: voon
>            Priority: Major
>              Labels: pull-request-available
>
> Suppose a partition is no longer being written or updated, i.e. there will
> be no further changes to the partition, so the size of its parquet files
> will always stay the same.
>
> If the parquet files in the partition (even after prior clustering) are
> smaller than {*}hoodie.clustering.plan.strategy.small.file.limit{*}, the
> fileSlice will always be returned as a candidate by
> {_}getFileSlicesEligibleForClustering(){_}.
>
> This may cause inputGroups with only 1 fileSlice to be selected as candidates
> for clustering. An example of a clusteringPlan demonstrating such a case, in
> JSON format, is shown below.
>
>
> {code:java}
> {
>   "inputGroups": [
>     {
>       "slices": [
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-03/cf2929a7-78dc-4e99-be0c-926e9487187d-0_0-2-0_20230104102201656.parquet",
>           "deltaFilePaths": [],
>           "fileId": "cf2929a7-78dc-4e99-be0c-926e9487187d-0",
>           "partitionPath": "dt=2023-01-03",
>           "bootstrapFilePath": "",
>           "version": 1
>         }
>       ],
>       "metrics": {
>         "TOTAL_LOG_FILES": 0.0,
>         "TOTAL_IO_MB": 260.0,
>         "TOTAL_IO_READ_MB": 130.0,
>         "TOTAL_LOG_FILES_SIZE": 0.0,
>         "TOTAL_IO_WRITE_MB": 130.0
>       },
>       "numOutputFileGroups": 1,
>       "extraMetadata": null,
>       "version": 1
>     },
>     {
>       "slices": [
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/b101162e-4813-4de6-9881-4ee0ff918f32-0_0-2-0_20230104103401458.parquet",
>           "deltaFilePaths": [],
>           "fileId": "b101162e-4813-4de6-9881-4ee0ff918f32-0",
>           "partitionPath": "dt=2023-01-04",
>           "bootstrapFilePath": "",
>           "version": 1
>         },
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/9b1c1494-2a58-43f1-890d-4b52070937b1-0_0-2-0_20230104102201656.parquet",
>           "deltaFilePaths": [],
>           "fileId": "9b1c1494-2a58-43f1-890d-4b52070937b1-0",
>           "partitionPath": "dt=2023-01-04",
>           "bootstrapFilePath": "",
>           "version": 1
>         }
>       ],
>       "metrics": {
>         "TOTAL_LOG_FILES": 0.0,
>         "TOTAL_IO_MB": 418.0,
>         "TOTAL_IO_READ_MB": 209.0,
>         "TOTAL_LOG_FILES_SIZE": 0.0,
>         "TOTAL_IO_WRITE_MB": 209.0
>       },
>       "numOutputFileGroups": 1,
>       "extraMetadata": null,
>       "version": 1
>     }
>   ],
>   "strategy": {
>     "strategyClassName": "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
>     "strategyParams": {},
>     "version": 1
>   },
>   "extraMetadata": {},
>   "version": 1,
>   "preserveHoodieMetadata": true
> }
> {code}
>
> Such a case will cause performance issues as a parquet file is re-written
> unnecessarily (write amplification).
>
> The fix is to only select inputGroups with more than 1 fileSlice as
> candidates for clustering.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
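
The fix proposed in the description (only select inputGroups with more than 1 fileSlice) could be sketched roughly as below. This is a minimal illustration and not Hudi's actual plan strategy code: the class `ClusteringGroupFilter`, the method `keepMultiSliceGroups`, and the modelling of an inputGroup as a plain list of fileIds are all assumptions made for the example.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for illustration; not part of the Hudi code base.
public class ClusteringGroupFilter {

    // Each inner list models one inputGroup as the fileIds of its fileSlices.
    public static List<List<String>> keepMultiSliceGroups(List<List<String>> inputGroups) {
        return inputGroups.stream()
                // A group holding a single fileSlice would only rewrite that one
                // file, so it is dropped from the candidate list.
                .filter(slices -> slices.size() > 1)
                .collect(Collectors.toList());
    }
}
```

Applied to the plan quoted above, a filter of this shape would drop the dt=2023-01-03 group (one slice) and keep the dt=2023-01-04 group (two slices).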