Hi Sagar,

Thanks for the feedback. I'll draft an RFC named "Partition Level Bucket Index" that focuses on offline resizing performance and the complexity of bucket-sizing management (for example, recording each partition's bucket count in .hoodie_partition_metadata). If you are interested, you are welcome to join the discussion and review.
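
To make the bucket-sizing rules easier to discuss, here is a minimal Python sketch of regex-based rules that pick a bucket count per partition. It is only an illustration under stated assumptions, not Hudi code: the rule list, the default of 8, the function names, and the use of CRC32 as a stand-in hash are all hypothetical, and Hudi's actual hashing and configuration differ.

```python
import re
import zlib

# Hypothetical rules: (partition-path regex, bucket count). First match wins.
RULES = [
    (re.compile(r"dt=2025-01-\d{2}"), 64),  # large recent partitions
    (re.compile(r"dt=2024-.*"), 16),        # smaller historical partitions
]
DEFAULT_BUCKETS = 8  # assumed fallback when no rule matches


def buckets_for_partition(partition_path, rules=RULES, default=DEFAULT_BUCKETS):
    """Return the bucket count of the first rule that fully matches the path."""
    for pattern, num_buckets in rules:
        if pattern.fullmatch(partition_path):
            return num_buckets
    return default


def bucket_id(record_key, partition_path):
    """Map a record key to a bucket using the partition's own bucket count."""
    # CRC32 is a stand-in for a stable hash; it is not Hudi's hash function.
    key_hash = zlib.crc32(record_key.encode("utf-8"))
    return key_hash % buckets_for_partition(partition_path)
```

With these example rules, `buckets_for_partition("dt=2025-01-24")` resolves to 64 and an unmatched path such as `dt=2023-01-01` falls back to 8. Because the bucket count is resolved per partition, changing a rule only affects the partitions it matches, which is why an offline rewrite of just those partitions is enough.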

Best regards,
zhangyue19921010

At 2025-01-25 20:10:32, "Sagar Sumit" <cod...@apache.org> wrote:
>Hi Yue,
>
>Thanks for your proposal. I think it simplifies the bucket index design while
>offering flexibility for varied partition sizes. I can see this being
>particularly useful for workloads with predictable partition growth or where
>operational simplicity is a priority, such as batch ingestion pipelines or
>scenarios with heterogeneous partition sizes.
>
>That said, some challenges to consider include the operational overhead of
>stopping writes during offline resizing and the potential complexity in
>defining and managing bucket-sizing rules. For real-time workloads or highly
>dynamic partition growth, RFC-42’s dynamic resizing might still be preferable.
>Balancing these trade-offs will be key to making this feature successful.
>
>Overall, I believe there is room for both strategies, and we should let users
>choose the one that best suits their needs. Let's explore this further and
>discuss how we can refine the proposal in an RFC.
>
>Regards,
>Sagar
>
>On 2025/01/24 09:58:49 Yue Zhang wrote:
>> Hi Hudis:
>>
>> As we know, Hudi proposed and introduced the Bucket Index in RFC-29.
>> The Bucket Index unifies indexing across Flink and Spark, that is, Spark
>> and Flink can upsert the same Hudi table using the bucket index.
>>
>> However, a Bucket Index table is limited to a fixed number of buckets. To
>> solve this problem, RFC-42 proposed consistent hashing, which resizes
>> buckets by splitting or merging several local buckets dynamically.
>>
>> But from production experience, a Partition-Level Bucket Index with an
>> offline way to rescale buckets is sometimes good enough, without introducing
>> additional effort (multiple writes, clustering, automatic resizing, etc.).
>> The more complex the architecture, the more error-prone it is and the
>> greater the operation and maintenance pressure.
>>
>> In this regard, we could upgrade the traditional Bucket Index to
>> implement a Partition-Level Bucket Index, so that users can set a specific
>> number of buckets for different partitions through a rule engine (such as
>> regular expression matching). For existing partitions, an offline command
>> is provided to reorganize the data using insert overwrite (writes to the
>> affected partition must be stopped while it runs).
>> More importantly, an existing Bucket Index table can be upgraded to a
>> Partition-Level Bucket Index smoothly without rebuilding the whole table.
>>
>> Any thoughts on this feature? Any feedback would be greatly appreciated!
>>
>> Best regards,
>> zhangyue19921010