Re: [DISSCUSS][NEW FEATURE] Hudi Lake Manager

Y Ethan Guo Mon, 18 Apr 2022 10:38:09 -0700

+1 This is a great idea! The proposed lake manager and centralized
management layer are essential to ease the burden of carrying out data
governance and optimizing the storage layout, making them independent of
ingestion and streaming.  I see that this provides a better abstraction for
any potential centralized maintenance and optimization beyond existing
table services.


It would be good to have this centralized Lake Manager component in the
metastore server proposed by RFC-36.  RFC-43 can also somehow be part of
it.  The Lake Manager implementation can be self-contained in some way.

On Mon, Apr 18, 2022 at 2:11 AM Shiyan Xu <xu.shiyan.raym...@gmail.com>
wrote:

> Great idea, Zhang Yue! I see more potential collaborations with the work on
> the table management service in RFC-43:
> https://github.com/apache/hudi/pull/4309
>
> On Mon, Apr 18, 2022 at 2:15 PM Yue Zhang <zhangyue921...@163.com> wrote:
>
> >
> >
> > Hi all,
> >     I would like to discuss and contribute a new feature named Hudi Lake
> > Manager.
> >
> >
> >     As more and more users from different companies and different
> > businesses begin to use Hudi pipelines to write data, data governance
> > has gradually become one of the biggest pain points for users. To get
> > better query performance or better timeliness, users need to carefully
> > configure clustering, compaction, cleaning, and archival for each
> > ingestion pipeline, which undoubtedly brings higher learning and
> > maintenance costs. If a business has hundreds or thousands of ingestion
> > pipelines, users need to maintain hundreds or thousands of sets of
> > configurations and may have to keep tuning them.
> >
> >
> >     This new feature, Hudi Lake Manager, is meant to decouple Hudi
> > ingestion from Hudi table services, including the cleaner, archival,
> > clustering, compaction, and any table services added in the future.
> >
> >
> >     Users only need to care about their own ingestion pipeline and can
> > leave all the table services to the manager, which automatically
> > discovers and manages the Hudi tables, thereby greatly reducing the
> > operation and maintenance burden and the onboarding cost.
> >
> >
> >     This lake manager plays the role of a Hudi table master/coordinator,
> > which can discover Hudi tables and automatically invoke services such as
> > cleaner/clustering/compaction/archive (multi-writer and async) in a
> > unified way, based on certain conditions.
> >
> >
> >     A common and interesting example: in our production environment, we
> > basically use date as the partition key and have specific data retention
> > requirements. To meet them, we need to write a script for each pipeline
> > to delete the data and the corresponding Hive metadata. With this lake
> > manager, we can expand the scope of the cleaner and implement a data
> > retention mechanism based on date partitions.
> >
> >
> >     I found that there is a very valuable RFC-36 in progress,
> > https://github.com/apache/hudi/pull/4718, a proposal for a Hudi metastore
> > server, which will store the metadata of Hudi tables. Maybe we could
> > expand that RFC's scope to design and develop the lake manager, or we
> > could raise a new RFC and take RFC-36 as input.
> >
> >
> >     I hope we can discuss the feasibility of this idea; any feedback
> > would be greatly appreciated.
> >     I would also be glad to volunteer for this work if possible.
> > Yue Zhang
> > zhangyue921...@163.com
> >
> > --
> Best,
> Shiyan
>
