[DISCUSS][NEW FEATURE] Hudi Lake Manager

Yue Zhang Sun, 17 Apr 2022 23:15:29 -0700


Hi all, 
    I would like to discuss and contribute a new feature named Hudi Lake 
Manager.



    As more and more users from different companies and businesses begin to 
use Hudi pipelines to write data, data governance has gradually become one of 
the biggest pain points for users. To get better query performance or better 
timeliness, users need to carefully configure clustering, compaction, cleaning 
and archival for each ingestion pipeline, which undoubtedly brings higher 
learning and maintenance costs. If a business has hundreds or thousands of 
ingestion pipelines, its users need to maintain hundreds or thousands of sets 
of configurations, and possibly keep tuning them.


    The new feature, Hudi Lake Manager, aims to decouple Hudi ingestion from 
Hudi table services, including cleaning, archival, clustering, compaction and 
any table services added in the future.


    Users only need to care about their own ingestion pipelines and can leave 
all the table services to the manager, which automatically discovers and 
manages the Hudi tables, thereby greatly reducing the operation and 
maintenance burden and the cost of onboarding.


    The lake manager plays the role of a Hudi table master/coordinator, which 
can discover Hudi tables and, in a unified and automatic way, invoke services 
such as cleaner/clustering/compaction/archive (multi-writer and async) based on 
certain conditions.
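
    To make the idea of "invoking services based on certain conditions" more 
concrete, here is a minimal sketch in Python. It is only an illustration, not 
a proposed design: the TableState fields, the threshold values and the 
services_due function are all hypothetical names made up for this example, and 
nothing here calls a real Hudi API.

```python
from dataclasses import dataclass


@dataclass
class TableState:
    """Hypothetical per-table counters a manager might track after discovery."""
    path: str
    commits_since_compaction: int
    commits_since_clustering: int
    commits_since_clean: int


# Assumed thresholds (number of commits) before a service is considered due.
DEFAULT_POLICY = {"compaction": 5, "clustering": 10, "clean": 20}


def services_due(state, policy=DEFAULT_POLICY):
    """Return the table services whose commit counter reached its threshold."""
    counters = {
        "compaction": state.commits_since_compaction,
        "clustering": state.commits_since_clustering,
        "clean": state.commits_since_clean,
    }
    # Keep the order of the policy so the result is deterministic.
    return [svc for svc, threshold in policy.items() if counters[svc] >= threshold]


if __name__ == "__main__":
    table = TableState("/warehouse/orders", 5, 3, 25)
    print(services_due(table))
```

    A real manager would derive such counters from the table timeline and 
would also need to coordinate with writers, which this sketch leaves out.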


    A common and interesting example is that in our production environment, we 
basically use date as the partition key and have specific data retention 
requirements. To meet them, we need to write a script for each pipeline to 
delete the data and the corresponding Hive metadata. With the lake manager, we 
can expand the scope of the cleaner and implement a mechanism for data 
retention based on date partitions.
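
    As a rough illustration of date-based retention, the sketch below picks 
the partitions that fall outside a retention window. It assumes partitions are 
named in YYYY-MM-DD form; the function name expired_partitions and its 
parameters are hypothetical, and actually deleting the data and the Hive 
metadata is deliberately left out.

```python
from datetime import date, timedelta


def expired_partitions(partitions, retention_days, today):
    """Return date-named partitions strictly older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in partitions:
        try:
            day = date.fromisoformat(name)
        except ValueError:
            # Not a date-named partition: never select it for deletion.
            continue
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    parts = ["2022-04-01", "2022-04-10", "2022-04-17", "not-a-date"]
    print(expired_partitions(parts, retention_days=7, today=date(2022, 4, 17)))
```

    Passing "today" explicitly keeps the selection reproducible; a manager 
could apply one such policy per table instead of one script per pipeline.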


    I found that there is a very valuable RFC-36 in progress now, 
https://github.com/apache/hudi/pull/4718 (Proposal for hudi metastore server), 
which will store the metadata of Hudi tables. Maybe we could expand that RFC's 
scope to design and develop the lake manager, or we could raise a new RFC and 
take RFC-36 as an input.


    I hope we can discuss the feasibility of this idea; any feedback would be 
greatly appreciated.
    I would also be happy to volunteer to work on it if possible.
Yue Zhang
