[ https://issues.apache.org/jira/browse/HUDI-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo closed HUDI-5326.
---------------------------
    Resolution: Fixed

> Incorrect cluster grouping in SparkSizeBasedClusteringPlanStrategy
> -------------------------------------------------------------------
>
>                 Key: HUDI-5326
>                 URL: https://issues.apache.org/jira/browse/HUDI-5326
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: clustering
>            Reporter: zouxxyy
>            Assignee: zouxxyy
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>
> Currently, the clustering groups built by
> `SparkSizeBasedClusteringPlanStrategy` can be larger than
> `hoodie.clustering.plan.strategy.max.bytes.per.group`, which makes the
> merged file sizes inconsistent with
> `hoodie.clustering.plan.strategy.target.file.max.bytes`.
>
> E.g.:
> We set max.bytes.per.group=20 and target.file.max.bytes=10.
> Suppose we have small files whose sizes are 4, 4, 4, 4, 3, 3, 3.
> The files 4, 4, 4, 4, 3, 3 (total size 22) are placed into one clustering
> group and merged into ceil(22/10) = 3 files, which is inconsistent with
> `max.bytes.per.group`/`target.file.max.bytes` = 2.
> Their average size is about 7, which is well below the target of 10.
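The arithmetic in the example above can be reproduced with a small sketch. This is not Hudi's code: the function names are hypothetical, sizes are plain integers in the same units as the report, and the output file count is assumed to be ceil(group size / target.file.max.bytes), which is what "22/10 = 3" implies. The first function closes a group only once its running total has already reached the limit, which yields the 22-unit group from the report; the second closes the group before adding a file that would exceed the limit, which is one possible way to keep groups within bounds.

```python
import math


def group_check_after_add(sizes, max_bytes_per_group):
    """Close a group only once its running total has already reached the limit.

    Hypothetical helper: reproduces the oversized group from the report.
    """
    groups, current, total = [], [], 0
    for size in sizes:
        if current and total >= max_bytes_per_group:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups


def group_check_before_add(sizes, max_bytes_per_group):
    """Close a group before adding a file that would push it past the limit.

    Hypothetical helper: one possible way to keep every group within bounds.
    """
    groups, current, total = [], [], 0
    for size in sizes:
        if current and total + size > max_bytes_per_group:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups


def output_file_count(group, target_file_max_bytes):
    """Assumed rule: number of output files is the ceiling of size / target."""
    return math.ceil(sum(group) / target_file_max_bytes)


if __name__ == "__main__":
    sizes = [4, 4, 4, 4, 3, 3, 3]
    for grouper in (group_check_after_add, group_check_before_add):
        print(grouper.__name__)
        for group in grouper(sizes, 20):
            files = output_file_count(group, 10)
            print(f"  group={group} total={sum(group)} files={files} "
                  f"avg={sum(group) / files:.1f}")
```

With the sizes from the report, the first grouping puts 4, 4, 4, 4, 3, 3 (total 22) into one group, which becomes 3 files. The second stops at 4, 4, 4, 4, 3 (total 19), which becomes 2 files averaging 9.5, much closer to the target of 10.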