I left my thoughts on the RFC https://github.com/apache/hudi/pull/4309
I just see this as another deployment model where a centralized set of microservices takes up scheduling and execution of Hudi's table services. +1 on thinking about sharding, locking and HA upfront.

Thanks
Vinoth

On Thu, Apr 21, 2022 at 3:31 PM Alexey Kudinkin <ale...@onehouse.ai> wrote:

> Hey, folks!
>
> I feel there's quite a bit of confusion in this thread, so let's try to
> clear it up: my understanding (please correct me if I'm wrong) is that
> Lake Manager was referred to as a service in a similar sense to how we
> call compaction, clustering and cleaning *table services*.
>
> So, I'd suggest we be extra careful in using familiar terms to avoid
> stirring up confusion: for all things related to *RPC services* (like
> the Metastore Server) we can call them "servers", and for compaction,
> clustering and the rest we stick w/ "table services".
>
> If my understanding is correct, then the proposal is to consolidate the
> knobs and levers for Data Governance, Data Management, etc. within a
> layer called *Lake Manager*, which will orchestrate the already existing
> table services through a nicely abstracted high-level API.
>
> Regarding adding any new *server* components: given Hudi's *stateless*
> architecture, where we rely on standalone execution engines (like Spark
> or Flink) to operate, I don't really see us introducing a server
> component directly into Hudi's core. The Metastore Server, on the other
> hand, will be a *standalone* component that Hudi (as well as other
> processes) could rely on to access the metadata.
>
> On Mon, Apr 18, 2022 at 10:07 PM Yue Zhang <zhangyue19921...@apache.org>
> wrote:
>
> > Thanks for all your attention.
> > Sure, we do need to take care of high availability in the design.
> >
> > Also, in my opinion this Lake Manager wouldn't turn Hudi into a
> > database on the cloud. It is just an official option.
> > Something like HoodieDeltaStreamer, it would help users reduce
> > maintenance and Hudi data governance efforts.
> >
> > As for the resource and performance concerns, this Lake Manager should
> > be designed as a planner/master. For example, the Lake Manager would
> > call the cleaner APIs to launch a (Spark/Flink) execution that deletes
> > files under certain conditions based on table metadata, rather than
> > doing the work itself, so its workload and resource requirements are
> > much lower. But in general, I agree that we have to consider failure
> > recovery, high availability, etc.
> >
> > On 2022/04/19 04:30:22 Simon Su wrote:
> > >
> > > I agree with what Danny said. IMO, there are two points that should
> > > be considered:
> > >
> > > 1. If Lake Manager is designed as a service, we should consider its
> > > high availability, dynamic expanding/shrinking, and state
> > > consistency.
> > > 2. How many resources will Lake Manager use to execute those actions
> > > of Hudi, such as compaction, clustering, etc.?
> > >
> > >
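[Editor's note] The planner/executor split Yue describes (the Lake Manager only plans table-service work from metadata, then delegates execution to an engine like Spark or Flink) could be sketched roughly as below. All names here (`CleanPlan`, `planCleaning`, `execute`) are hypothetical illustrations, not Hudi's actual API; the "executor" is simulated in-process where a real deployment would submit a Spark/Flink job.

```java
import java.util.List;
import java.util.stream.Collectors;

public class Main {
    // A unit of planned table-service work, e.g. "delete these file versions".
    record CleanPlan(String tableName, List<String> filesToDelete) {}

    // Planner side: cheap, metadata-only decision making inside the manager.
    // Retains the newest `retainedVersions` files and schedules the rest.
    static CleanPlan planCleaning(String table, List<String> fileVersions,
                                  long retainedVersions) {
        List<String> sorted = fileVersions.stream()
                .sorted()
                .collect(Collectors.toList());
        int cut = (int) Math.max(0, sorted.size() - retainedVersions);
        return new CleanPlan(table, sorted.subList(0, cut));
    }

    // Executor stub: a real Lake Manager would hand the plan to an external
    // Spark/Flink execution instead of deleting files itself.
    static int execute(CleanPlan plan) {
        for (String f : plan.filesToDelete()) {
            System.out.println("deleting " + plan.tableName() + "/" + f);
        }
        return plan.filesToDelete().size();
    }

    public static void main(String[] args) {
        CleanPlan plan = planCleaning("trips",
                List.of("v1.parquet", "v2.parquet", "v3.parquet"), 2);
        System.out.println("deleted=" + execute(plan));
    }
}
```

The point of the split is the one made in the thread: because planning touches only table metadata, the manager process stays lightweight, and the heavy file I/O runs on elastically provisioned engine resources.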