[ 
https://issues.apache.org/jira/browse/HUDI-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4773:
----------------------------
    Component/s: table-service

> Adding filter mode to Clustering to filter for recent files
> -----------------------------------------------------------
>
>                 Key: HUDI-4773
>                 URL: https://issues.apache.org/jira/browse/HUDI-4773
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: clustering, table-service
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: pull-request-available
>
> We have a partition-aware clustering strategy as well as a strategy based on 
> recent partitions. These work well if partitioning is based on dates. But 
> what if partitioning is based on some other, arbitrary field? 
>  
> So, we might need another clustering filter strategy that considers only 
> those file groups which were touched in the last N commits. 
> For example, if a user configures clustering to run every 5 commits, then 
> each time clustering runs, it will consider only the file groups touched in 
> the last 5 commits. This avoids repeatedly clustering file groups that are 
> already clustered, and clustering will be fast because only the delta file 
> groups are considered. 
>  
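
A minimal sketch of the proposed "last N commits" filter, to make the idea in
the description concrete. It is not Hudi code: the class RecentFileGroupFilter,
the Commit type, and the method fileGroupsTouchedInLastN are hypothetical names
invented for illustration, and the timeline is modeled as a plain list ordered
oldest to newest.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these types exist in Apache Hudi.
class RecentFileGroupFilter {

    // Stand-in for one completed commit and the file groups it wrote to.
    static final class Commit {
        final String instantTime;
        final List<String> touchedFileGroupIds;

        Commit(String instantTime, List<String> touchedFileGroupIds) {
            this.instantTime = instantTime;
            this.touchedFileGroupIds = touchedFileGroupIds;
        }
    }

    // Returns the IDs of file groups touched by the last n commits.
    // The timeline is assumed to be ordered from oldest to newest.
    static Set<String> fileGroupsTouchedInLastN(List<Commit> timeline, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        // Clamp so that a short timeline simply yields every commit.
        int from = Math.max(0, timeline.size() - n);
        // LinkedHashSet removes duplicates and keeps first-seen order.
        Set<String> candidates = new LinkedHashSet<>();
        for (Commit commit : timeline.subList(from, timeline.size())) {
            candidates.addAll(commit.touchedFileGroupIds);
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<Commit> timeline = Arrays.asList(
            new Commit("001", Arrays.asList("fg-1", "fg-2")),
            new Commit("002", Arrays.asList("fg-2")),
            new Commit("003", Arrays.asList("fg-3")),
            new Commit("004", Arrays.asList("fg-2", "fg-4")));

        // Only commits 003 and 004 are considered; fg-1 is skipped.
        System.out.println(fileGroupsTouchedInLastN(timeline, 2));
        // prints [fg-3, fg-2, fg-4]
    }
}
```

In this sketch, fg-1 was last written three commits ago, so a clustering run
that looks at the last 2 commits leaves it alone. Setting N to the clustering
frequency (5 in the example from the description) would make each run cover
exactly the commits since the previous run.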



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
