I left my thoughts on the RFC https://github.com/apache/hudi/pull/4309
I just see this as another deployment model where a centralized set of microservices takes up scheduling and execution of Hudi's table services. +1 on thinking about sharding, locking and HA upfront.

Thanks
Vinoth

On Thu, Apr 21, 2022 at 3:31 PM Alexey Kudinkin <ale...@onehouse.ai> wrote:

> Hey, folks!
>
> I feel there's quite a bit of confusion in this thread, so let's try to
> clear it up: my understanding (please correct me if I'm wrong) is that
> Lake Manager was referred to as a service in a similar sense to how we
> call compaction, clustering and cleaning *table services*.
>
> So, I'd suggest we be extra careful in using familiar terms to avoid
> stirring up confusion: for all things related to *RPC services* (like
> the Metastore Server) we can call them "servers", and for compaction,
> clustering and the rest we stick w/ "table services".
>
> If my understanding is correct, then the proposal is to consolidate the
> knobs and levers for Data Governance, Data Management, etc. within a
> layer called *Lake Manager*, which will orchestrate the already existing
> table services through a nicely abstracted high-level API.
>
> Regarding adding any new *server* components: given Hudi's *stateless*
> architecture, where we rely on standalone execution engines (like Spark
> or Flink) to operate, I don't really see us introducing a server
> component directly into Hudi's core. The Metastore Server, on the other
> hand, will be a *standalone* component that Hudi (as well as other
> processes) could rely on to access the metadata.
>
> On Mon, Apr 18, 2022 at 10:07 PM Yue Zhang <zhangyue19921...@apache.org>
> wrote:
>
> > Thanks for all your attention.
> > Sure, we do need to take care of high availability in the design.
> >
> > Also, in my opinion this Lake Manager wouldn't turn Hudi into a
> > database on the cloud. It is just an official option.
> > Something like HoodieDeltaStreamer, it would help users reduce
> > maintenance and Hudi data governance efforts.
> >
> > As for the resource and performance concerns, this Lake Manager should
> > be designed as a planner/master. For example, the Lake Manager would
> > call the cleaner APIs to launch a (Spark/Flink) execution that deletes
> > files under certain conditions based on table metadata, rather than
> > doing the work itself, so its workload and resource requirements are
> > much lower. But in general, I agree that we have to consider failure
> > recovery, high availability, etc.
> >
> > On 2022/04/19 04:30:22 Simon Su wrote:
> > >
> > > I agree with what Danny said. IMO, there are two points that should
> > > be considered:
> > >
> > > 1. If Lake Manager is designed as a service, we should consider its
> > > high availability, dynamic expanding/shrinking, and state
> > > consistency.
> > > 2. How many resources will Lake Manager use to execute those actions
> > > of Hudi, such as compaction, clustering, etc.?
> > >
> > >
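[Editor's note] The planner/executor split Yue describes (the Lake Manager only plans table-service work from metadata, then delegates execution to an engine like Spark or Flink) could be sketched roughly as below. All names here (`CleanPlan`, `planCleaning`, `execute`) are hypothetical illustrations, not Hudi's actual API; the "executor" is simulated in-process where a real deployment would submit a Spark/Flink job.

```java
import java.util.List;
import java.util.stream.Collectors;

public class Main {
    // A unit of planned table-service work, e.g. "delete these file versions".
    record CleanPlan(String tableName, List<String> filesToDelete) {}

    // Planner side: cheap, metadata-only decision making inside the manager.
    // Retains the newest `retainedVersions` files and schedules the rest.
    static CleanPlan planCleaning(String table, List<String> fileVersions,
                                  long retainedVersions) {
        List<String> sorted = fileVersions.stream()
                .sorted()
                .collect(Collectors.toList());
        int cut = (int) Math.max(0, sorted.size() - retainedVersions);
        return new CleanPlan(table, sorted.subList(0, cut));
    }

    // Executor stub: a real Lake Manager would hand the plan to an external
    // Spark/Flink execution instead of deleting files itself.
    static int execute(CleanPlan plan) {
        for (String f : plan.filesToDelete()) {
            System.out.println("deleting " + plan.tableName() + "/" + f);
        }
        return plan.filesToDelete().size();
    }

    public static void main(String[] args) {
        CleanPlan plan = planCleaning("trips",
                List.of("v1.parquet", "v2.parquet", "v3.parquet"), 2);
        System.out.println("deleted=" + execute(plan));
    }
}
```

The point of the split is the one made in the thread: because planning touches only table metadata, the manager process stays lightweight, and the heavy file I/O runs on elastically provisioned engine resources.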