Hi all,
I would like to discuss and contribute a new feature named Hudi Lake
Manager.
As more and more users from different companies and businesses begin to use
Hudi pipelines to write data, data governance has gradually become one of the
biggest pain points for users. To get better query performance or better
timeliness, users need to carefully configure clustering, compaction, cleaning
and archival for each ingestion pipeline, which undoubtedly brings higher
learning and maintenance costs. If a business has hundreds or thousands of
ingestion pipelines, users have to maintain hundreds or thousands of sets of
configurations and possibly keep tuning them.
The new feature, Hudi Lake Manager, aims to decouple Hudi ingestion from Hudi
table services, including cleaning, archival, clustering, compaction and any
table services added in the future.
Users only need to care about their own ingestion pipelines and can leave all
the table services to the manager, which automatically discovers and manages
Hudi tables, thereby greatly reducing the operation and maintenance burden and
the cost of onboarding.
The lake manager plays the role of a Hudi table master/coordinator: it can
discover Hudi tables and automatically trigger services such as
cleaning/clustering/compaction/archival (multi-writer and async) in a unified
way, based on certain conditions.
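To make "based on certain conditions" a bit more concrete, here is a rough
sketch of how the manager could decide which services to trigger for a
discovered table. This is only an illustration: TableStats,
services_to_schedule and both thresholds are hypothetical names and values
made up for this mail, not existing Hudi APIs or configs.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Hypothetical per-table summary the manager would collect on discovery."""
    path: str
    table_type: str  # "COPY_ON_WRITE" or "MERGE_ON_READ"
    delta_commits_since_compaction: int
    commits_since_clean: int


def services_to_schedule(stats, compaction_threshold=5, clean_threshold=10):
    """Return the table services that are due for one table.

    The thresholds are illustrative assumptions, not Hudi defaults.
    """
    services = []
    # Compaction only applies to merge-on-read tables.
    if (stats.table_type == "MERGE_ON_READ"
            and stats.delta_commits_since_compaction >= compaction_threshold):
        services.append("compaction")
    # Cleaning applies to both table types.
    if stats.commits_since_clean >= clean_threshold:
        services.append("clean")
    return services
```

The point is that these rules would live in one place inside the manager
instead of being repeated in the configuration of every ingestion pipeline.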
A common and interesting example: in our production environment, we basically
use the date as the partition key and have specific data retention
requirements. To meet them, we need to write a script for each pipeline to
delete the data and the corresponding Hive metadata. With the lake manager, we
could expand the scope of the cleaner and implement a data retention mechanism
based on date partitions.
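As a minimal sketch of the retention rule, assuming partitions are named like
2022-01-31 (the function name, the date format and the retention_days
parameter are all hypothetical, nothing here is an existing Hudi API), the
manager would only need to pick out the partitions that fall outside the
retention window:

```python
from datetime import date, datetime, timedelta


def expired_partitions(partitions, retention_days, today, fmt="%Y-%m-%d"):
    """Return the date partitions older than the retention window, sorted.

    Partitions whose name does not parse as a date are skipped, never deleted.
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in partitions:
        try:
            partition_date = datetime.strptime(name, fmt).date()
        except ValueError:
            continue  # not a date partition, leave it alone
        if partition_date < cutoff:
            expired.append(name)
    return sorted(expired)
```

For example, with today = date(2022, 1, 11) and retention_days = 7, the
partition 2022-01-01 is selected while 2022-01-05 and 2022-01-10 are kept.
Actually deleting the data and dropping the corresponding Hive metadata would
be a separate step done by the manager.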
I found that there is a very valuable RFC-36 in progress now,
https://github.com/apache/hudi/pull/4718 (Proposal for Hudi metastore server),
which will store the metadata of Hudi tables. Maybe we could expand the scope
of that RFC to design and develop the lake manager, or we could raise a new RFC
and take RFC-36 as an input.
I hope we can discuss the feasibility of this idea; any feedback would be
greatly appreciated.
I would also be glad to volunteer to work on this if possible.
Yue Zhang
[email protected]