Re: [DISSCUSS][NEW FEATURE] Hudi Lake Manager

Shiyan Xu Mon, 18 Apr 2022 02:11:18 -0700

Great idea, Zhang Yue! I see potential for collaboration with the work on
the table management service in RFC-43:
https://github.com/apache/hudi/pull/4309


On Mon, Apr 18, 2022 at 2:15 PM Yue Zhang <[email protected]> wrote:

>
>
> Hi all,
>     I would like to discuss and contribute a new feature named Hudi Lake
> Manager.
>
>
>     As more and more users from different companies and different
> businesses begin to use Hudi pipelines to write data, data governance
> has gradually become one of the biggest pain points for users. To get
> better query performance or better timeliness, users need to carefully
> configure clustering, compaction, cleaning and archival for each ingestion
> pipeline, which undoubtedly brings higher learning and maintenance costs.
> If a business has hundreds or thousands of ingestion pipelines, users
> need to maintain hundreds or thousands of sets of configurations and
> possibly keep tuning them.
>
>
>     This new feature, Hudi Lake Manager, is meant to decouple Hudi
> ingestion from Hudi table services, including cleaning, archival,
> clustering, compaction and any table services added in the future.
>
>
>     Users only need to care about their own ingestion pipeline and can
> leave all the table services to the manager, which automatically discovers
> and manages Hudi tables, thereby greatly reducing the operation and
> maintenance burden and the onboarding cost.
>
>
>     This lake manager plays the role of a Hudi table master/coordinator,
> which can discover Hudi tables and invoke services such as
> cleaner/clustering/compaction/archive (multi-writer and async) in a
> unified, automatic way based on certain conditions.
>
>
>     A common and interesting example is that in our production
> environment, we basically use date as the partition key and have specific
> data retention requirements. To meet them, we need to write a script for
> each pipeline to delete the data and the corresponding Hive metadata. With
> this lake manager, we can expand the scope of the cleaner and implement a
> mechanism for data retention based on date partitions.
>
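
The date-based retention idea in the paragraph above can be sketched as
follows. This is only a rough illustration, assuming partitions are named
as yyyy-MM-dd dates; the function name `expired_partitions` and its
parameters are hypothetical and are not part of Hudi or of the proposal.
It only selects the partitions that fall outside the retention window;
actually deleting the data and the Hive metadata is left out.

```python
from datetime import date, timedelta


def expired_partitions(partitions, today, retention_days):
    """Return the date partitions older than the retention window.

    Hypothetical helper: `partitions` is a list of partition names,
    assumed to be ISO dates such as "2022-04-01".
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in partitions:
        try:
            part_date = date.fromisoformat(name)
        except ValueError:
            # Skip partitions that are not named as a yyyy-MM-dd date.
            continue
        if part_date < cutoff:
            expired.append(name)
    return sorted(expired)
```

A manager-side job could call this per table with that table's own
retention setting and hand the result to whatever performs the deletion.
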
>
>     I found there is a very valuable RFC-36 in progress now
> (https://github.com/apache/hudi/pull/4718), a proposal for a Hudi
> metastore server, which will store the metadata of Hudi tables. Maybe we
> could expand that RFC's scope to design and develop the lake manager, or
> we could raise a new RFC and take RFC-36 as input.
>
>
>     I hope we can discuss the feasibility of this idea; any feedback
> would be greatly appreciated.
>     I would also be glad to contribute to this work if possible.
> Yue Zhang
> [email protected]
>
> --
Best,
Shiyan
