Re: Partition Level Bucket Index

Sagar Sumit Sat, 25 Jan 2025 04:10:51 -0800

Hi Yue,

Thanks for your proposal. I think it simplifies the bucket index design while 
offering flexibility for varied partition sizes. I can see this being 
particularly useful for workloads with predictable partition growth or where 
operational simplicity is a priority, such as batch ingestion pipelines or 
scenarios with heterogeneous partition sizes.


That said, some challenges to consider include the operational overhead of 
stopping writes during offline resizing and the potential complexity in 
defining and managing bucket-sizing rules. For real-time workloads or highly 
dynamic partition growth, RFC-42’s dynamic resizing might still be preferable. 
Balancing these trade-offs will be key to making this feature successful.

Overall, I believe there is room for both strategies, and we should let users 
choose the one that best suits their needs. Let's explore this further and 
discuss how we can refine the proposal in an RFC.
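
As a starting point for that discussion, here is a minimal sketch of how 
regex-based bucket-sizing rules might resolve a bucket count per partition. 
Everything in it is an assumption for illustration: BucketRule, 
resolve_num_buckets, bucket_id, the first-match-wins ordering, and the example 
patterns are hypothetical names and choices, not existing Hudi APIs or 
configs, and crc32 is only a stand-in for whatever hash the index would use.

```python
import re
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketRule:
    """Hypothetical rule: partitions matching `pattern` get `num_buckets`."""
    pattern: str      # regular expression matched against the whole partition path
    num_buckets: int  # bucket count for partitions that match


def resolve_num_buckets(partition_path, rules, default_num_buckets):
    """Return the bucket count of the first matching rule, else the default."""
    for rule in rules:
        if re.fullmatch(rule.pattern, partition_path):
            return rule.num_buckets
    return default_num_buckets


def bucket_id(record_key, num_buckets):
    """Map a record key to a bucket; crc32 is a stand-in for a stable hash."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


# Example rules (assumed): recent partitions are larger, so they get more buckets.
rules = [
    BucketRule(r"dt=2025-01-\d{2}", 64),
    BucketRule(r"dt=2024-.*", 16),
]

# Partitions that match no rule fall back to the table-level default.
n = resolve_num_buckets("dt=2025-01-24", rules, default_num_buckets=8)
b = bucket_id("order-123", n)
```

With first-match-wins ordering, rule order matters, which is part of the 
rule-management complexity mentioned above. Changing the bucket count of a 
partition that already holds data would still require the offline rewrite.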

Regards,
Sagar

On 2025/01/24 09:58:49 Yue Zhang wrote:
> Hi Hudis:
> 
>      As we know, Hudi introduced the Bucket Index in RFC-29. The Bucket 
> Index unifies indexing across Flink and Spark; that is, Spark and Flink 
> can upsert the same Hudi table using the bucket index.
> 
>      However, a Bucket Index table is limited to a fixed number of buckets. 
> To solve this problem, RFC-42 proposed consistent hashing, which achieves 
> bucket resizing by dynamically splitting or merging several local buckets.
> 
>     But from production experience, a Partition-Level Bucket Index with an 
> offline way to rescale buckets is sometimes good enough, without introducing 
> additional effort (multiple writes, clustering, automatic resizing, etc.). 
> The more complex the architecture, the more error-prone it is and the 
> greater the operation and maintenance burden.
> 
>     In this regard, we could upgrade the traditional Bucket Index to 
> implement a Partition-Level Bucket Index, so that users can set a specific 
> number of buckets for different partitions through a rule engine (such as 
> regular expression matching). In addition, for certain existing 
> partitions, an offline command would be provided to reorganize the data 
> using insert overwrite (this requires stopping writes to that partition).
>     More importantly, an existing Bucket Index table can be upgraded to a 
> Partition-Level Bucket Index smoothly, without rebuilding the whole table.
>     Any thoughts on this feature? Any feedback would be greatly appreciated!
> Best regards,
> zhangyue19921010
