[ 
https://issues.apache.org/jira/browse/HUDI-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-2194:
----------------------------
    Description: 
As we know, SparkRecentDaysClusteringPlanStrategy is the default clustering 
strategy for creating a ClusteringPlan. It is useful when a Hudi table is 
partitioned by time.

 

Currently, users can set 
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to control the 
number of partitions, counting back from the latest partition, that are listed 
when creating a ClusteringPlan.

For example, suppose we have 6 date-based partitions and users set 
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2:
|20210718|20210719|20210720|20210721|20210722|20210723(latest)|
                                    |<-- choose to cluster -->|

Sometimes users also want to skip the latest x partitions when making a 
clustering plan, because the latest partitions contain a lot of update data or 
for other reasons.

 

This patch adds a new config, 
`hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions`, that sets 
the number of partitions to skip from the latest when choosing partitions to 
create a ClusteringPlan.

 

For example, if users set 
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2 and 
`hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` to 2:
|20210718|20210719|20210720|20210721|20210722|20210723(latest)|
                  |<---- choose --->|
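
The selection described in the two examples above can be sketched as follows. 
This is only an illustration in Python, not Hudi's implementation: the function 
name `select_partitions` and its parameters are hypothetical, and the sketch 
assumes partition names sort lexicographically by date.

```python
from typing import List


def select_partitions(
    partitions: List[str], lookback: int, skip_from_latest: int = 0
) -> List[str]:
    """Return the partitions to cluster, newest first.

    Hypothetical helper: `lookback` stands in for
    hoodie.clustering.plan.strategy.daybased.lookback.partitions and
    `skip_from_latest` for
    hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions.
    """
    if lookback < 0 or skip_from_latest < 0:
        raise ValueError("lookback and skip_from_latest must be non-negative")
    # Assumption: date-named partitions sort chronologically as strings.
    newest_first = sorted(partitions, reverse=True)
    # Drop the newest `skip_from_latest` partitions, then keep `lookback` of the rest.
    return newest_first[skip_from_latest:skip_from_latest + lookback]


partitions = ["20210718", "20210719", "20210720", "20210721", "20210722", "20210723"]

# lookback = 2, nothing skipped
print(select_partitions(partitions, lookback=2))
# ['20210723', '20210722']

# lookback = 2, skip the latest 2
print(select_partitions(partitions, lookback=2, skip_from_latest=2))
# ['20210721', '20210720']
```

With a skip value of 0 the sketch reduces to the existing lookback-only 
behaviour, so the new config would not change anything for users who leave it 
unset (assuming its default is 0, which this description does not state).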

 

  was:
As we know, SparkRecentDaysClusteringPlanStrategy is the default clustering 
strategy for creating a ClusteringPlan. It is useful when a Hudi table is 
partitioned by time.

 

Currently, users can set 
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to control the 
number of partitions, counting back from the latest partition, that are listed 
when creating a ClusteringPlan.

For example, suppose we have 6 date-based partitions and users set 
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2:

| 20210718 | 20210719 | 20210720 | 20210721 | 20210722 | 20210723(latest) |
                                            |<---- choose to cluster ---->|

Sometimes users also want to skip the latest x partitions when making a 
clustering plan, because the latest partitions contain a lot of update data or 
for other reasons.

 

This patch adds a new config, 
`hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions`, that sets 
the number of partitions to skip from the latest when choosing partitions to 
create a ClusteringPlan.

 

For example, if users set 
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2 and 
`hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` to 2:
|20210718|20210719|20210720|20210721|20210722|20210723(latest)|
                  |<---- choose --->|

 


> Skip the latest N partitions when creating ClusteringPlan
> ---------------------------------------------------------
>
>                 Key: HUDI-2194
>                 URL: https://issues.apache.org/jira/browse/HUDI-2194
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Yue Zhang
>            Priority: Major
>
> As we know, SparkRecentDaysClusteringPlanStrategy is the default clustering 
> strategy for creating a ClusteringPlan. It is useful when a Hudi table is 
> partitioned by time.
>  
> Currently, users can set 
> `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to control 
> the number of partitions, counting back from the latest partition, that are 
> listed when creating a ClusteringPlan.
> For example, suppose we have 6 date-based partitions and users set 
> `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2:
> |20210718|20210719|20210720|20210721|20210722|20210723(latest)|
>                                     |<-- choose to cluster -->|
> Sometimes users also want to skip the latest x partitions when making a 
> clustering plan, because the latest partitions contain a lot of update data 
> or for other reasons.
>  
> This patch adds a new config, 
> `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions`, that 
> sets the number of partitions to skip from the latest when choosing 
> partitions to create a ClusteringPlan.
>  
> For example, if users set 
> `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2 and 
> `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` to 2:
> |20210718|20210719|20210720|20210721|20210722|20210723(latest)|
>                   |<---- choose --->|
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
