Very practical feature! I agree that supporting an async/offline way to
rebalance selected partitions with a different number of buckets would
help a lot in practice and avoid the complexity of dynamic scaling.

I'd suggest renaming it to something like "Support adjusting bucket numbers
for simple bucket index". The bucket index itself is already a non-global
(partition-level) index, so this name avoids confusion with the index scope
concept.
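
To make the rule-engine idea from the proposal quoted below concrete, here is
a minimal sketch in Python. It is only an illustration, not Hudi code: the
names BucketRule, resolve_num_buckets and bucket_id, the example patterns,
the default of 8 buckets, and the CRC32 hash are all assumptions made for
this example.

```python
import re
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketRule:
    """Hypothetical rule: partitions whose path matches `pattern` use `num_buckets`."""
    pattern: str
    num_buckets: int


def resolve_num_buckets(partition_path, rules, default_num_buckets):
    """Return the bucket count of the first rule that fully matches the partition path."""
    for rule in rules:
        if re.fullmatch(rule.pattern, partition_path):
            return rule.num_buckets
    # No rule matched: fall back to the table-wide default.
    return default_num_buckets


def bucket_id(record_key, num_buckets):
    """Map a record key to a bucket. CRC32 is a stand-in for whatever stable hash is used."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


# Example rules (assumed): recent daily partitions get more buckets than old ones.
rules = [
    BucketRule(r"dt=2025-01-\d{2}", 64),
    BucketRule(r"dt=2024-.*", 16),
]

num_buckets = resolve_num_buckets("dt=2025-01-26", rules, default_num_buckets=8)
print(num_buckets, bucket_id("order-1001", num_buckets))
```

Because the bucket count is resolved per partition, changing a rule only
affects how keys map to buckets in the matching partitions, which is why an
existing partition would need an offline rewrite when its count changes.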


On Sun, Jan 26, 2025 at 8:44 PM Yue Zhang <zhangyue921...@163.com> wrote:

> Hi Sagar,
>
> Thanks for your attention. Okay, I'll draft an RFC named "Partition Level
> Bucket Index", focusing on offline resizing performance and the complexity
> of bucket-sizing management (for example, recording the number of partition
> buckets in .hoodie_partition_metadata). If you are interested, you are
> welcome to join the discussion and review!
>
> Best regards,
> zhangyue19921010
>
> At 2025-01-25 20:10:32, "Sagar Sumit" <cod...@apache.org> wrote:
> >Hi Yue,
> >
> >Thanks for your proposal. I think it simplifies the bucket index design
> while offering flexibility for varied partition sizes. I can see this being
> particularly useful for workloads with predictable partition growth or
> where operational simplicity is a priority, such as batch ingestion
> pipelines or scenarios with heterogeneous partition sizes.
> >
> >That said, some challenges to consider include the operational overhead
> of stopping writes during offline resizing and the potential complexity in
> defining and managing bucket-sizing rules. For real-time workloads or
> highly dynamic partition growth, RFC-42’s dynamic resizing might still be
> preferable. Balancing these trade-offs will be key to making this feature
> successful.
> >
> >Overall, I believe there is room for both strategies, and we should let
> users choose the one that best suits their needs. Let's explore this
> further and discuss how we can refine the proposal in an RFC.
> >
> >Regards,
> >Sagar
> >
> >On 2025/01/24 09:58:49 Yue Zhang wrote:
> >> Hi Hudis:
> >>
> >>      As we know, Hudi proposed and introduced the Bucket Index in RFC-29.
> The Bucket Index unifies indexing across Flink and Spark; that is, Spark
> and Flink can upsert the same Hudi table using the bucket index.
> >>
> >>      However, a Bucket Index table is limited to a fixed number of
> buckets. To solve this problem, RFC-42 proposed consistent hashing,
> which resizes buckets by dynamically splitting or merging several
> local buckets.
> >>
> >>     But from PRD experience, a Partition-Level Bucket Index combined
> with an offline way to rescale buckets is sometimes good enough, without
> introducing additional effort (multiple writes, clustering, automatic
> resizing, etc.). The more complex the architecture, the more error-prone
> it is and the greater the operation and maintenance pressure.
> >>
> >>     In this regard, we could upgrade the traditional Bucket Index to
> implement a Partition-Level Bucket Index, so that users can set a specific
> number of buckets for different partitions through a rule engine (such as
> regular expression matching). In addition, for certain existing
> partitions, an offline command would be provided to reorganize the data
> using insert overwrite (writes to that partition must be stopped meanwhile).
> >>     More importantly, the existing Bucket Index table can be upgraded
> to Partition-Level Bucket Index smoothly without re-building the whole
> table.
> >>     Any thoughts on this feature? Any feedback would be greatly
> appreciated!
> >> Best regards,
> >> zhangyue19921010
>
