Very practical feature! I agree that supporting an async/offline way of re-balancing selected partitions to a different number of buckets would help a lot in practice and avoid the complexity of dynamic scaling.
I'd suggest renaming it to something like "Support adjusting bucket numbers for simple bucket index", because the bucket index is itself a non-global (partition-level) index; a name like this avoids confusion with the index scope concept.

On Sun, Jan 26, 2025 at 8:44 PM Yue Zhang <zhangyue921...@163.com> wrote:

> Hi Sagar,
>
> Thanks for your attention. Okay, I'll draft an RFC named "Partition Level
> Bucket Index", focusing on offline resizing performance and the complexity
> of bucket-sizing management (for example, recording the number of partition
> buckets in .hoodie_partition_metadata). If you are also interested, you are
> welcome to discuss and review!
>
> Best regards,
> zhangyue19921010
>
> At 2025-01-25 20:10:32, "Sagar Sumit" <cod...@apache.org> wrote:
> >Hi Yue,
> >
> >Thanks for your proposal. I think it simplifies the bucket index design
> >while offering flexibility for varied partition sizes. I can see this being
> >particularly useful for workloads with predictable partition growth or
> >where operational simplicity is a priority, such as batch ingestion
> >pipelines or scenarios with heterogeneous partition sizes.
> >
> >That said, some challenges to consider include the operational overhead
> >of stopping writes during offline resizing and the potential complexity in
> >defining and managing bucket-sizing rules. For real-time workloads or
> >highly dynamic partition growth, RFC-42’s dynamic resizing might still be
> >preferable. Balancing these trade-offs will be key to making this feature
> >successful.
> >
> >Overall, I believe there is room for both strategies, and we should let
> >users choose the one that best suits their needs. Let's explore this
> >further and discuss how we can refine the proposal in an RFC.
> >
> >Regards,
> >Sagar
> >
> >On 2025/01/24 09:58:49 Yue Zhang wrote:
> >> Hi Hudis:
> >>
> >> As we know, Hudi proposed and introduced the Bucket Index in RFC-29.
> >> The Bucket Index unifies the indexes of Flink and Spark well, that is,
> >> Spark and Flink can upsert the same Hudi table using the bucket index.
> >>
> >> However, a Bucket Index table is limited to a fixed number of buckets.
> >> To solve this problem, RFC-42 proposed consistent hashing, which
> >> achieves bucket resizing by splitting or merging several local buckets
> >> dynamically.
> >>
> >> But from PRD experience, sometimes a Partition-Level Bucket Index and
> >> an offline way to rescale buckets is good enough, without introducing
> >> additional effort (multiple writes, clustering, automatic resizing,
> >> etc.). The more complex the architecture, the more error-prone it is
> >> and the greater the operation and maintenance pressure.
> >>
> >> In this regard, we could upgrade the traditional Bucket Index to
> >> implement a Partition-Level Bucket Index, so that users can set a
> >> specific number of buckets for different partitions through a rule
> >> engine (such as regular expression matching). In addition, for certain
> >> existing partitions, an offline command is provided to reorganize the
> >> data using insert overwrite (this requires stopping writes to the
> >> partition being reorganized).
> >>
> >> More importantly, an existing Bucket Index table can be upgraded to a
> >> Partition-Level Bucket Index smoothly, without rebuilding the whole
> >> table.
> >>
> >> Some thoughts on this feature? Any feedback would be greatly
> >> appreciated!
> >>
> >> Best regards,
> >> zhangyue19921010
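
To make the rule-engine idea in the proposal concrete, here is a rough sketch of how a regex-based rule list could map a partition path to a bucket count and then to a bucket id. Everything in it is an assumption for illustration only: the names `BUCKET_RULES`, `DEFAULT_NUM_BUCKETS`, `num_buckets_for` and `bucket_id`, the `dt=...` partition layout, the bucket counts, and the use of CRC32 as the hash. None of this is Hudi's actual API or hashing scheme.

```python
import re
import zlib

# Hypothetical rules: (regex on the partition path, bucket count).
# The first matching rule wins; patterns and counts are made up.
BUCKET_RULES = [
    (re.compile(r"^dt=2025-01-\d{2}$"), 64),  # large recent partitions
    (re.compile(r"^dt=2024-.*$"), 16),        # smaller historical partitions
]

# Hypothetical table-level default, used when no rule matches.
DEFAULT_NUM_BUCKETS = 8


def num_buckets_for(partition_path: str) -> int:
    """Return the bucket count configured for a partition path."""
    for pattern, num_buckets in BUCKET_RULES:
        if pattern.match(partition_path):
            return num_buckets
    return DEFAULT_NUM_BUCKETS


def bucket_id(record_key: str, partition_path: str) -> int:
    """Map a record key to a bucket within its partition.

    CRC32 is a stand-in chosen because it is stable across runs;
    it is not the hash function Hudi uses.
    """
    key_hash = zlib.crc32(record_key.encode("utf-8"))
    return key_hash % num_buckets_for(partition_path)
```

With rules like these, two partitions of the same table resolve to different bucket counts, while keys inside one partition still hash deterministically. Changing a rule for an existing partition changes where its keys land, which is why the proposal pairs the rules with an offline insert-overwrite step for partitions that already hold data.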