[ https://issues.apache.org/jira/browse/HUDI-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Danny Chen resolved HUDI-4766.
------------------------------

> Fix HoodieFlinkClusteringJob
> ----------------------------
>
>                 Key: HUDI-4766
>                 URL: https://issues.apache.org/jira/browse/HUDI-4766
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: voon
>            Assignee: voon
>            Priority: Major
>              Labels: pull-request-available
>
> h1. Flink Hudi Clustering Issues
>
> # Integer type is used for byte-size configuration parameters instead of long
> ** Maximum representable size is 2^31-1 bytes (~2 gigabytes)
> # Unable to choose a particular instant to execute
> # Unable to select a filter mode, as the method that controls this is
> overridden by _FlinkSizeBasedClusteringPlanStrategy#filterPartitionPaths_
> # No cleaning
> ** With reference to offline compaction (HoodieFlinkCompactor), cleaning is
> only enabled if _clean.async.enabled = false_.
> # The schedule configuration is inconsistent with HoodieFlinkCompactor: the
> flag is defined as false, which is the opposite of HoodieFlinkCompactor's default
> # No ability to pass props in using _--props/--hoodie-conf_
> ** Required for passing in configurations like:
> *** _hoodie.parquet.compression.ratio_
> *** Partition filter configurations, depending on the strategy
> # Clustering groups will write out files capped at _hoodie.parquet.max.file.size_
> (120MB by default)
> # Multiple clustering jobs can execute, but there is no fine-grained control over
> restarting jobs that have failed. The current implementation only filters for
> REQUESTED clustering jobs; rollbacks will never be performed.
> # Removed the unused _getNumberOfOutputFileGroups()_ function.
> ** _hoodie.clustering.plan.strategy.small.file.limit_
> ** _hoodie.clustering.plan.strategy.max.bytes.per.group_
> ** _hoodie.clustering.plan.strategy.target.file.max.bytes_
> ** Will create N file groups (1 task will be writing to each file group,
> increasing parallelism)

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
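Issue 1 in the list above can be illustrated with a minimal Java sketch (the class and variable names here are hypothetical, not Hudi code): a byte-size parameter stored in an int caps out at 2^31-1 bytes, so any target above ~2 GiB cannot be represented and overflows when narrowed from a long.

```java
// Sketch: why byte-size configs need long rather than int (hypothetical demo class).
public class ByteSizeOverflowDemo {
    public static void main(String[] args) {
        // Largest value an int-typed byte-size config can hold: 2^31 - 1 (~2 GiB).
        int maxIntBytes = Integer.MAX_VALUE;

        // A 4 GiB target size is only representable as a long.
        long targetBytes = 4L * 1024 * 1024 * 1024;

        // Narrowing back to int wraps around: 4 GiB == 2^32, which truncates to 0.
        int truncated = (int) targetBytes;

        System.out.println(maxIntBytes); // 2147483647
        System.out.println(truncated);   // 0
    }
}
```

This is why the fix widens the clustering byte-size parameters to long instead of int.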