Great idea, Zhang Yue! I see potential for further collaboration with the table management service work in RFC-43: https://github.com/apache/hudi/pull/4309

On Mon, Apr 18, 2022 at 2:15 PM Yue Zhang <zhangyue921...@163.com> wrote:

> Hi all,
> I would like to discuss and contribute a new feature named Hudi Lake
> Manager.
>
> As more and more users from different companies and businesses begin
> to use Hudi pipelines to write data, data governance has gradually
> become one of the biggest pain points for users. To get better query
> performance or better timeliness, users need to carefully configure
> clustering, compaction, cleaning and archival for each ingestion
> pipeline, which undoubtedly brings higher learning and maintenance
> costs. If a business has hundreds or thousands of ingestion pipelines,
> users need to maintain hundreds or thousands of sets of configurations,
> and possibly keep tuning them.
>
> The new Hudi Lake Manager feature would decouple Hudi ingestion from
> Hudi table services, including cleaning, archival, clustering,
> compaction and any table services added in the future.
>
> Users would only need to care about their own ingestion pipelines and
> could leave all table services to the manager, which automatically
> discovers and manages Hudi tables, thereby greatly reducing the
> operation and maintenance burden and the cost of onboarding.
>
> The lake manager acts as a Hudi table master/coordinator: it discovers
> Hudi tables and invokes services such as cleaning, clustering,
> compaction and archival (multi-writer and async) in a unified,
> automatic way, based on certain conditions.
>
> A common and interesting example: in our production environment, we
> basically use the date as the partition key and have specific data
> retention requirements. To meet them, we need to write a script for
> each pipeline that deletes the data and the corresponding Hive
> metadata. With the lake manager, we could expand the scope of the
> cleaner and implement a data retention mechanism based on date
> partitions.
>
> I found that a very valuable RFC-36 is in progress, a proposal for a
> Hudi metastore server that will store the metadata of Hudi tables:
> https://github.com/apache/hudi/pull/4718
> Maybe we could expand that RFC's scope to design and develop the lake
> manager, or we could raise a new RFC and take RFC-36 as input.
>
> I hope we can discuss the feasibility of this idea; it would be
> greatly appreciated.
> I would also like to volunteer for this work, if possible.
>
> Yue Zhang
> zhangyue921...@163.com

--
Best,
Shiyan