Hi all, I would like to discuss and contribute a new feature named Hudi Lake Manager.
As more users from different companies and businesses use Hudi pipelines to write data, data governance has gradually become one of the biggest pain points. To get better query performance or better timeliness, users need to carefully configure clustering, compaction, cleaning and archival for each ingestion pipeline, which brings higher learning and maintenance costs. If a business has hundreds or thousands of ingestion pipelines, users need to maintain hundreds or thousands of sets of configurations, and possibly keep tuning them.

The new feature, Hudi Lake Manager, is meant to decouple Hudi ingestion from Hudi table services, including cleaner, archival, clustering, compaction and any table services added in the future. Users only need to care about their own ingestion pipelines and can leave all table services to the manager, which automatically discovers and manages Hudi tables, thereby greatly reducing the operation and maintenance burden and the cost of onboarding.

The lake manager plays the role of a Hudi table master/coordinator: it discovers Hudi tables and invokes services such as cleaner/clustering/compaction/archive (multi-writer and async) in a unified, automatic way based on certain conditions.

A common and interesting example: in our production environment, we basically use date as the partition key and have specific data retention requirements. To meet them, we currently need to write a script for each pipeline to delete the data and the corresponding Hive metadata. With the lake manager, we could expand the scope of the cleaner and implement a data retention mechanism based on date partitions.
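To make the date-partition retention idea concrete, here is a minimal sketch of the selection step only. It is not Hudi code: the function name `expired_partitions`, the `retention_days` parameter and the `YYYY-MM-DD` partition format are all assumptions for illustration, and the actual deletion of data and Hive metadata is left out.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def expired_partitions(
    partitions: Iterable[str],
    retention_days: int,
    today: date,
    fmt: str = "%Y-%m-%d",  # assumed partition value format
) -> List[str]:
    """Return the date partitions that are older than the retention window.

    Hypothetical helper: a lake manager could run a check like this per
    table and hand the result to an extended cleaner.
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for partition in partitions:
        try:
            partition_date = datetime.strptime(partition, fmt).date()
        except ValueError:
            # Not a date partition in the assumed format: never expire it.
            continue
        if partition_date < cutoff:
            expired.append(partition)
    return sorted(expired)
```

With a rule like this kept in one place, each table would only need to declare its retention period instead of carrying its own cleanup script.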
I found that there is a very valuable RFC-36 in progress, a proposal for a Hudi metastore server that will store the metadata of Hudi tables: https://github.com/apache/hudi/pull/4718. Maybe we could expand that RFC's scope to design and develop the lake manager, or we could raise a new RFC and take RFC-36 as an input.

I hope we can discuss the feasibility of this idea; any feedback would be greatly appreciated. I would also be glad to volunteer to work on it if possible.

Yue Zhang
zhangyue921...@163.com