[ https://issues.apache.org/jira/browse/HUDI-80?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028627#comment-17028627 ]
leesf commented on HUDI-80:
---------------------------

Fixed via master: 8ff06ddb0fdc8325382dbca4bd9dd4884b4e1110

> Incrementalize cleaning based on timeline metadata
> --------------------------------------------------
>
>                 Key: HUDI-80
>                 URL: https://issues.apache.org/jira/browse/HUDI-80
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: Balaji Varadarajan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.2
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, cleaning lists all partitions once and then picks the file groups to clean from DFS. This is partly due to support for retaining the last X versions of a file group as well (in addition to the default mode of retaining the last X commits). This could be expensive in some cases. See [https://github.com/apache/incubator-hudi/issues/613] for an issue reported.
>
> This task tracks work to:
> * Determine if we can get rid of the last-X-versions cleaning mode
> * Implement cleaning based on file metadata in the Hudi timeline itself
> * Resulting RPC calls to DFS would be O(number of file groups cleaned) / O(number of partitions touched in the last X commits)
>
> HUDI-1 implements a timeline service for writing, which promotes caching of file system metadata. This can be implemented on top of that.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
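The core idea of the issue — deriving candidate partitions from commit metadata written to the timeline since the last clean, instead of listing every partition on DFS — can be sketched in plain Java as follows. This is an illustrative sketch only, not Hudi's actual implementation; the `IncrementalCleanPlanner` and `Commit` names and the instant-time comparison are hypothetical simplifications.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of incremental clean planning: rather than listing all
// partitions on DFS, derive candidate partitions from the commit metadata on
// the timeline written after the last completed clean.
public class IncrementalCleanPlanner {

    // A commit on the timeline: its instant time and the partitions it touched.
    // (In Hudi, this information lives in the commit metadata files.)
    record Commit(String instantTime, Set<String> partitionsTouched) {}

    // Collect partitions touched by commits strictly after the last clean
    // instant. Only these partitions need DFS calls, so the cost becomes
    // O(partitions touched in recent commits) instead of O(all partitions).
    static Set<String> partitionsToClean(List<Commit> timeline, String lastCleanInstant) {
        Set<String> result = new TreeSet<>();
        for (Commit c : timeline) {
            // Hudi instant times are lexicographically ordered timestamps,
            // so a string comparison suffices in this sketch.
            if (c.instantTime().compareTo(lastCleanInstant) > 0) {
                result.addAll(c.partitionsTouched());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Commit> timeline = List.of(
            new Commit("20200101", Set.of("2020/01/01")),
            new Commit("20200102", Set.of("2020/01/01", "2020/01/02")),
            new Commit("20200103", Set.of("2020/01/03")));
        // Last clean completed at instant 20200102, so only the 20200103
        // commit's partitions need to be inspected.
        System.out.println(partitionsToClean(timeline, "20200102")); // prints [2020/01/03]
    }
}
```

Note how the last-X-versions retention mode mentioned in the description resists this optimization: version counts of untouched file groups do not change between commits touching them, which is why the issue first asks whether that mode can be dropped.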