[ https://issues.apache.org/jira/browse/HUDI-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yue Zhang updated HUDI-2194:
----------------------------
    Description:
As we know, SparkRecentDaysClusteringPlanStrategy is the default clustering strategy for creating a ClusteringPlan. It is useful when a Hudi table is partitioned by time.

Currently, users can set `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to control how many partitions, counting back from the latest partition, are listed when creating a ClusteringPlan.

For example, suppose we have 6 date-based partitions and the user sets `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2:

|20210718|20210719|20210720|20210721|20210722|20210723(latest)|
                                    |<-- choose to cluster -->|

Sometimes users also want to skip the latest x partitions when making a clustering plan, because the latest partitions contain lots of update data or for some other reason.

This patch adds a new config named `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` that sets the number of partitions to skip from the latest when choosing partitions to create a ClusteringPlan.

For example, if the user sets `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2 and `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` to 2:

|20210718|20210719|20210720|20210721|20210722|20210723(latest)|
                  |<---- choose --->|

  was:
As we know, SparkRecentDaysClusteringPlanStrategy is the default clustering strategy for creating a ClusteringPlan. It is useful when a Hudi table is partitioned by time.

Currently, users can set `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to control how many partitions, counting back from the latest partition, are listed when creating a ClusteringPlan.

For example, suppose we have 6 date-based partitions and the user sets `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2:

|20210718 | 20210719 | 20210720 | 20210721 | 20210722 | 20210723(latest) |
                                           |<---- choose to cluster ---->|

Sometimes users also want to skip the latest x partitions when making a clustering plan, because the latest partitions contain lots of update data or for some other reason.

This patch adds a new config named `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` that sets the number of partitions to skip from the latest when choosing partitions to create a ClusteringPlan.

For example, if the user sets `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2 and `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` to 2:

|20210718|20210719|20210720|20210721|20210722|20210723(latest)|
                  |<---- choose --->|


> Skip the latest N partitions when creating ClusteringPlan
> ---------------------------------------------------------
>
>                 Key: HUDI-2194
>                 URL: https://issues.apache.org/jira/browse/HUDI-2194
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Yue Zhang
>            Priority: Major
>
> As we know, SparkRecentDaysClusteringPlanStrategy is the default clustering
> strategy for creating a ClusteringPlan. It is useful when a Hudi table is
> partitioned by time.
>
> Currently, users can set
> `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to control
> how many partitions, counting back from the latest partition, are listed
> when creating a ClusteringPlan.
>
> For example, suppose we have 6 date-based partitions and the user sets
> `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2:
>
> |20210718|20210719|20210720|20210721|20210722|20210723(latest)|
>                                     |<-- choose to cluster -->|
>
> Sometimes users also want to skip the latest x partitions when making a
> clustering plan, because the latest partitions contain lots of update data
> or for some other reason.
>
> This patch adds a new config named
> `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` that
> sets the number of partitions to skip from the latest when choosing
> partitions to create a ClusteringPlan.
>
> For example, if the user sets
> `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2 and
> `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` to 2:
>
> |20210718|20210719|20210720|20210721|20210722|20210723(latest)|
>                   |<---- choose --->|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
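
The partition selection described in this issue can be sketched in a few lines of Python. This is only an illustration of the intended behavior, not Hudi's actual implementation: the function name `select_partitions` and its parameters are hypothetical, and the default of 0 for the skip count is an assumption. It sorts the date-based partition names newest first, skips the configured number of latest partitions, and then keeps the lookback count.

```python
def select_partitions(partitions, lookback, skip_from_latest=0):
    """Return the partitions chosen for clustering, newest first.

    Hypothetical helper for illustration only; `lookback` stands in for
    hoodie.clustering.plan.strategy.daybased.lookback.partitions and
    `skip_from_latest` for
    hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions.
    """
    # yyyyMMdd strings sort chronologically as plain strings,
    # so a reverse sort puts the latest partition first.
    newest_first = sorted(partitions, reverse=True)
    # Drop the latest `skip_from_latest` partitions, then keep `lookback` of the rest.
    return newest_first[skip_from_latest:skip_from_latest + lookback]


partitions = ["20210718", "20210719", "20210720",
              "20210721", "20210722", "20210723"]

# Lookback of 2, nothing skipped: the two latest partitions.
print(select_partitions(partitions, lookback=2))
# ['20210723', '20210722']

# Lookback of 2, skipping the 2 latest partitions.
print(select_partitions(partitions, lookback=2, skip_from_latest=2))
# ['20210721', '20210720']
```

The two calls correspond to the two examples in the description: without skipping, 20210722 and 20210723 are chosen; with a skip of 2, the choice moves back to 20210720 and 20210721.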