Re: [DISSCUSS][NEW FEATURE] Hudi Lake Manager

Alexey Kudinkin Thu, 21 Apr 2022 15:31:45 -0700

Hey, folks!

I feel there's quite a bit of confusion in this thread, so let's try to
clear it up. My understanding (please correct me if I'm wrong) is that
Lake Manager was called a "service" in the same sense in which we call
compaction, clustering and cleaning *table services*.

So I'd suggest we be extra careful with familiar terms to avoid adding
to the confusion: anything that is an *RPC service* (like Metastore
Server) we can call a "server", and for compaction, clustering and the
rest we stick w/ "table services".

If my understanding of the proposal is correct, then I think the proposal
is to consolidate knobs and levers for Data Governance, Data Management, etc.
w/in a layer called *Lake Manager*, which will orchestrate the already
existing table services through a nicely abstracted high-level API.
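
To make that reading concrete, here is a minimal sketch of such a layer.
All names in it (LakeManager, submit_compaction, submit_cleaning, the table
path) are hypothetical and are not Hudi APIs; the submit functions merely
stand in for launching a Spark or Flink job. It only illustrates a layer
that holds no table state and dispatches to table services:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: a table service takes a table path and returns a status.
TableService = Callable[[str], str]


@dataclass
class LakeManager:
    """Hypothetical orchestration layer: keeps no table state, only dispatches."""
    services: Dict[str, TableService] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def register(self, name: str, service: TableService) -> None:
        # Expose an existing table service under a name.
        self.services[name] = service

    def run(self, name: str, table_path: str) -> str:
        # Delegate to the named service; the manager does no work itself.
        if name not in self.services:
            raise KeyError(f"unknown table service: {name}")
        result = self.services[name](table_path)
        self.log.append(f"{name}:{table_path}")
        return result


def submit_compaction(table_path: str) -> str:
    # Stand-in for submitting a compaction job to an execution engine.
    return f"compaction submitted for {table_path}"


def submit_cleaning(table_path: str) -> str:
    # Stand-in for submitting a cleaning job to an execution engine.
    return f"cleaning submitted for {table_path}"


manager = LakeManager()
manager.register("compaction", submit_compaction)
manager.register("cleaning", submit_cleaning)
status = manager.run("compaction", "/data/hudi/trips")
```

The point of the sketch is only that the manager is a thin dispatcher:
the actual work stays in the table services and the engines they run on.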

Regarding adding any new *server* components: given Hudi's *stateless*
architecture, where we rely on standalone execution engines (like Spark or
Flink) to operate, I don't really see us introducing a server component
directly into Hudi's core. Metastore Server, on the other hand, will be a
*standalone* component that Hudi (as well as other processes) could
rely on to access the metadata.

On Mon, Apr 18, 2022 at 10:07 PM Yue Zhang <zhangyue19921...@apache.org>
wrote:

> Thanks for all your attention.
> Sure, we do need to take care of high availability in the design.
>
> Also, in my opinion this lake manager wouldn't turn Hudi into a database
> on the cloud. It is just an official option, something like
> HoodieDeltaStreamer, that helps users reduce maintenance and Hudi data
> governance effort.
>
> As for the resource and performance concerns, this lake manager should be
> designed as a planner/master. For example, the lake manager will call the
> cleaner APIs to launch a (Spark/Flink) execution that deletes files under
> certain conditions based on table metadata, rather than doing the work
> itself, so its workload and resource requirements are much lower.
> But in general, I agree that we have to consider failure recovery, high
> availability, etc.
>
> On 2022/04/19 04:30:22 Simon Su wrote:
> > >
> > > I agree with what Danny said. IMO, there are two points that should be
> > > considered:
> >
> > 1. If Lake Manager is designed as a service, we should consider its
> > High Availability, Dynamic Expanding/Shrinking, and state consistency.
> > 2. How many resources Lake Manager will use to execute Hudi actions
> > such as compaction, clustering, etc.
> >
>
