[ 
https://issues.apache.org/jira/browse/HUDI-80?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967698#comment-16967698
 ] 

Balaji Varadarajan commented on HUDI-80:
----------------------------------------

The proposed solution is to:

(a) Retain the clean-by-versions mode, but enable incremental cleaning only 
for the clean-by-commits mode.

(b) Incremental cleaning avoids listing all partitions to find files to 
clean. Instead, it determines the next set of partitions to clean by reading 
the newer commits incrementally.

(c) We still rely on the embedded timeline server to reduce RPC calls. When 
deltastreamer runs in continuous mode, we can leverage this benefit.
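
To make (b) concrete, here is a rough sketch of how the partition selection 
could work. It is not the Hudi implementation: the Commit class, the 
partitions_to_clean function and the "previous earliest retained instant" 
bookkeeping are hypothetical names used only for illustration, and the 
selection rule (take every partition written at or after the instant the 
previous clean retained) is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Commit:
    """Hypothetical stand-in for one timeline entry; not a Hudi class."""
    instant_time: str  # sortable timestamp string, e.g. "002"
    partitions_written: Set[str] = field(default_factory=set)


def partitions_to_clean(timeline: List[Commit],
                        previous_earliest_retained: Optional[str],
                        all_partitions: Set[str]) -> Set[str]:
    """Return the partitions a clean-by-commits run needs to inspect.

    Assumption: the previous clean recorded the earliest instant it
    retained. Only partitions written at or after that instant can hold
    file versions that became cleanable since then.
    """
    if previous_earliest_retained is None:
        # No earlier clean to build on: fall back to the full listing.
        return set(all_partitions)

    touched: Set[str] = set()
    for commit in timeline:
        if commit.instant_time >= previous_earliest_retained:
            touched |= commit.partitions_written
    return touched


timeline = [
    Commit("001", {"2019/11/01"}),
    Commit("002", {"2019/11/02"}),
    Commit("003", {"2019/11/02", "2019/11/03"}),
]
all_partitions = {"2019/10/31", "2019/11/01", "2019/11/02", "2019/11/03"}

first_run = partitions_to_clean(timeline, None, all_partitions)
incremental_run = partitions_to_clean(timeline, "002", all_partitions)
```

In this sketch the first run inspects all four partitions, while the 
incremental run inspects only the two partitions written by commits "002" 
and "003", without listing the rest of the table.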

 

> Incrementalize cleaning based on timeline metadata
> --------------------------------------------------
>
>                 Key: HUDI-80
>                 URL: https://issues.apache.org/jira/browse/HUDI-80
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Write Client
>            Reporter: Vinoth Chandar
>            Assignee: Balaji Varadarajan
>            Priority: Major
>             Fix For: 0.5.1
>
>
> Currently, cleaning lists all partitions once and then picks the file groups 
> to clean from DFS. This is partly due to support for retaining the last x 
> versions of a file group as well (in addition to the default mode of 
> retaining the last x commits). This could be expensive in some cases. See 
> [https://github.com/apache/incubator-hudi/issues/613] for a reported issue. 
>  
> This task tracks work to 
>  * Determine if we can get rid of the last X versions cleaning mode 
>  * Implement cleaning based on file metadata in the Hudi timeline itself
>  * Resulting RPC calls to DFS would be O(number of file groups 
> cleaned)/O(number of partitions touched in last X commits)
>  
> HUDI-1 implements a timeline service for writing that promotes caching of 
> file system metadata. This can be implemented on top of that. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
