From my point of view, this Lake Manager should be a centralized
management layer on top of Hudi tables that schedules the different table
services and handles data governance.  The scheduling/managing part should
be lightweight, while the execution should still happen on the cluster; it
should not be a single node executing all services and becoming a
bottleneck.  And I agree that there should be fallbacks to achieve high
availability: e.g., if the main manager goes down, a backup should take
over, or each table should fall back to executing its table services
independently.  How to achieve this can be discussed later in the detailed
design.
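
To make the fallback idea concrete, here is a minimal, hypothetical sketch
of the scheduler-selection logic (the class, enum, and method names are
illustrative only, not actual Hudi APIs):

```java
// Hypothetical sketch of the fallback discussed above: the manager only
// *schedules* table services; execution stays on the cluster.  If the
// primary manager is down, a standby takes over; if both are down, each
// table falls back to running its own independent table services.
public class LakeManagerFallback {

    public enum Mode { PRIMARY_MANAGER, STANDBY_MANAGER, INDEPENDENT_SERVICES }

    // Decide who should schedule table services for a given table,
    // based on simple liveness signals for the two manager nodes.
    public static Mode resolveScheduler(boolean primaryUp, boolean standbyUp) {
        if (primaryUp) {
            return Mode.PRIMARY_MANAGER;      // normal case: central scheduling
        }
        if (standbyUp) {
            return Mode.STANDBY_MANAGER;      // failover to the backup manager
        }
        return Mode.INDEPENDENT_SERVICES;     // last resort: tables self-manage
    }

    public static void main(String[] args) {
        System.out.println(resolveScheduler(true, true));   // PRIMARY_MANAGER
        System.out.println(resolveScheduler(false, true));  // STANDBY_MANAGER
        System.out.println(resolveScheduler(false, false)); // INDEPENDENT_SERVICES
    }
}
```

The point of the sketch is that the fallback decision itself is cheap and
local; the heavy work (compaction, clustering, etc.) is never executed on
the manager node in any of the three modes.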

IMO, we should still keep the mode of running independent table services
and let users decide whether or not to use the Lake Manager to manage
table services (providing one more option here), rather than making it a
compulsory move as you said.
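
To illustrate the opt-in model, the choice could stay a per-table setting;
the property names below are purely hypothetical and are not actual Hudi
configuration keys:

```properties
# Hypothetical per-table settings illustrating the opt-in model:
# "inline" / "async" keep today's independent table services;
# "lake_manager" delegates scheduling to the centralized manager.
hoodie.table.services.mode=lake_manager
hoodie.lake.manager.endpoint=manager-host:9090
```

A table that omits such a flag would keep today's behavior, so adopting the
Lake Manager never becomes a forced migration.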


On Mon, Apr 18, 2022 at 8:01 PM Danny Chan <danny0...@apache.org> wrote:

> I have a different concern here: the Lake Manager seems like a
> single-node service, and there is a risk that it becomes a bottleneck
> when handling too many table services. And for every single-node
> service we should consider how to achieve high availability.
>
> What is the final state of the Hudi service here? Should we drop the
> advantage of the serverless/lightweight architecture and move forward
> to a service mode?
> I mean will Hudi be more and more like a database on the cloud ?
>
> Best,
> Danny
>
> Y Ethan Guo <ethan.guoyi...@gmail.com> wrote on Tue, Apr 19, 2022 at 01:38:
> >
> > +1 This is a great idea! The proposed lake manager and centralized
> > management layer are essential to ease the burden of carrying out data
> > governance and optimizing the storage layout, making them independent of
> > ingestion and streaming.  I see that this provides a better abstraction
> > for any potential centralized maintenance and optimization beyond
> > existing table services.
> >
> > It would be good to have this centralized Lake Manager component in the
> > metastore server proposed by RFC-36.  RFC-43 can also somehow be part of
> > it.  The Lake Manager implementation can be self-contained in some way.
> >
> > On Mon, Apr 18, 2022 at 2:11 AM Shiyan Xu <xu.shiyan.raym...@gmail.com>
> > wrote:
> >
> > > Great idea, Zhang Yue! I see more potential collaborations in the work
> > > for the table management service in RFC-43:
> > > https://github.com/apache/hudi/pull/4309
> > >
> > > On Mon, Apr 18, 2022 at 2:15 PM Yue Zhang <zhangyue921...@163.com> wrote:
> > >
> > > >
> > > >
> > > > Hi all,
> > > >     I would like to discuss and contribute a new feature named Hudi
> Lake
> > > > Manager.
> > > >
> > > >
> > > >     As more and more users from different companies and different
> > > > businesses begin to use Hudi pipelines to write data, data governance
> > > > has gradually become one of the biggest pain points for users. To get
> > > > better query performance or better timeliness, users need to carefully
> > > > configure clustering, compaction, cleaning, and archival for each
> > > > ingestion pipeline, which undoubtedly brings higher learning and
> > > > maintenance costs. Imagine a business with hundreds or thousands of
> > > > ingestion pipelines: users then need to maintain hundreds or thousands
> > > > of sets of configurations and possibly keep tuning them.
> > > >
> > > >
> > > >     This new feature, Hudi Lake Manager, is meant to decouple Hudi
> > > > ingestion from Hudi table services, including cleaning, archival,
> > > > clustering, compaction, and any table services added in the future.
> > > >
> > > >
> > > >     Users only need to care about their own ingestion pipelines and
> > > > can leave all the table services to the manager, which automatically
> > > > discovers and manages the Hudi tables, thereby greatly reducing the
> > > > operation and maintenance pressure and the onboarding cost.
> > > >
> > > >
> > > >     This Lake Manager plays the role of a Hudi table
> > > > master/coordinator, which can discover Hudi tables and uniformly and
> > > > automatically invoke services such as
> > > > cleaning/clustering/compaction/archival (multi-writer and async) based
> > > > on certain conditions.
> > > >
> > > >
> > > >     A common and interesting example: in our production environment,
> > > > we basically use the date as the partition key and have specific data
> > > > retention requirements. To meet them, we currently need to write a
> > > > script for each pipeline to delete the data and the corresponding Hive
> > > > metadata. With this Lake Manager, we could expand the scope of the
> > > > cleaner and implement a mechanism for data retention based on the date
> > > > partition.
> > > >
> > > >
> > > >     I found there is a very valuable RFC-36 ongoing now,
> > > > https://github.com/apache/hudi/pull/4718, a proposal for a Hudi
> > > > metastore server that will store the metadata of Hudi tables. Maybe we
> > > > could expand that RFC's scope to design and develop the Lake Manager,
> > > > or we could raise a new RFC and take RFC-36 as an input.
> > > >
> > > >
> > > >     I hope we can discuss the feasibility of this idea; any feedback
> > > > would be greatly appreciated.
> > > >     I also volunteer to contribute my part if possible.
> > > > Yue Zhang
> > > > zhangyue921...@163.com
> > > >
> > > --
> > > Best,
> > > Shiyan
> > >
>
