[ https://issues.apache.org/jira/browse/HUDI-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo updated HUDI-4773:
----------------------------
    Component/s: table-service

> Adding filter mode to Clustering to filter for recent files
> -----------------------------------------------------------
>
>                 Key: HUDI-4773
>                 URL: https://issues.apache.org/jira/browse/HUDI-4773
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: clustering, table-service
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: pull-request-available
>
> We have a partition-aware clustering strategy and a recent-partitions-based
> strategy for clustering. These work well when partitioning is based on
> dates, but what if partitioning is based on some other arbitrary field?
>
> So we might need another clustering filter strategy that considers only
> those file groups that were touched in the last N commits.
> For example, if a user configures clustering to run every 5 commits, each
> clustering run will consider only the file groups touched in the last 5
> commits. This also avoids repeatedly triggering clustering for file groups
> that are already clustered, and clustering will be very fast because only
> the delta file groups are considered.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
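
The filtering idea described in the issue (consider only file groups touched in the last N commits) can be illustrated with a minimal sketch. This is not Hudi code and does not use any Hudi API: the `Commit` type, its fields, and the function name are hypothetical, and the commit timeline is modelled as a plain in-memory list, assuming each commit records the ids of the file groups it wrote.

```python
from typing import Iterable, List, NamedTuple, Set


class Commit(NamedTuple):
    """Hypothetical stand-in for a completed commit on the timeline."""
    instant_time: str              # sortable timestamp string (assumption)
    touched_file_groups: Set[str]  # ids of file groups written by this commit


def file_groups_touched_in_last_n_commits(commits: Iterable[Commit], n: int) -> List[str]:
    """Return the sorted ids of file groups written by the n most recent commits."""
    if n <= 0:
        return []
    # Order commits by instant time and keep only the n latest ones.
    latest = sorted(commits, key=lambda c: c.instant_time)[-n:]
    touched: Set[str] = set()
    for commit in latest:
        touched |= commit.touched_file_groups
    return sorted(touched)


# Illustrative timeline: four commits, clustering looks back over the last two.
timeline = [
    Commit("001", {"fg-a", "fg-b"}),
    Commit("002", {"fg-c"}),
    Commit("003", {"fg-a"}),
    Commit("004", {"fg-d"}),
]
candidates = file_groups_touched_in_last_n_commits(timeline, 2)
```

Here only `fg-a` and `fg-d` become clustering candidates; `fg-b` and `fg-c` were last written before the lookback window and are skipped, which is the saving the issue describes. A real strategy would also have to decide how to treat file groups already included in a pending clustering plan, which this sketch ignores.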