Re:Re: Partition Level Bucket Index

Yue Zhang Sun, 26 Jan 2025 18:45:58 -0800

Hi Sagar,

Thanks for your attention. Okay, I'll draft an RFC named "Partition Level
Bucket Index", focusing on offline resizing performance and the complexity of
bucket-sizing management (for example, recording the number of buckets per
partition in .hoodie_partition_metadata). If you are interested, you are
welcome to discuss and review it!


Best regards,
zhangyue19921010

At 2025-01-25 20:10:32, "Sagar Sumit" <cod...@apache.org> wrote:
>Hi Yue,
>
>Thanks for your proposal. I think it simplifies the bucket index design while 
>offering flexibility for varied partition sizes. I can see this being 
>particularly useful for workloads with predictable partition growth or where 
>operational simplicity is a priority, such as batch ingestion pipelines or 
>scenarios with heterogeneous partition sizes.
>
>That said, some challenges to consider include the operational overhead of 
>stopping writes during offline resizing and the potential complexity in 
>defining and managing bucket-sizing rules. For real-time workloads or highly 
>dynamic partition growth, RFC-42’s dynamic resizing might still be preferable. 
>Balancing these trade-offs will be key to making this feature successful.
>
>Overall, I believe there is room for both strategies, and we should let users 
>choose the one that best suits their needs. Let's explore this further and 
>discuss how we can refine the proposal in an RFC.
>
>Regards,
>Sagar
>
>On 2025/01/24 09:58:49 Yue Zhang wrote:
>> Hi Hudis:
>> 
>>      As we know, Hudi introduced the Bucket Index in RFC-29. The Bucket
>> Index unifies indexing across Flink and Spark: both engines can upsert
>> into the same Hudi table using the bucket index.
>> 
>>      However, a Bucket Index table is limited to a fixed number of buckets.
>> To solve this problem, RFC-42 proposed consistent hashing, which resizes
>> buckets by dynamically splitting or merging several local buckets.
>> 
>>     But from production experience, a Partition-Level Bucket Index with an
>> offline way to rescale buckets is sometimes good enough, without
>> introducing additional effort (multiple writes, clustering, automatic
>> resizing, etc.). The more complex the architecture, the more error-prone it
>> is and the greater the operation and maintenance pressure.
>> 
>>     In this regard, we could upgrade the traditional Bucket Index to
>> implement a Partition-Level Bucket Index, so that users can set a specific
>> number of buckets for different partitions through a rule engine (such as
>> regular expression matching). For existing partitions, an offline command
>> would be provided to reorganize the data using insert overwrite (this
>> requires stopping writes to the partition being resized).
>>     More importantly, an existing Bucket Index table can be upgraded to a
>> Partition-Level Bucket Index smoothly, without rebuilding the whole table.
>>     Any thoughts on this feature? Any feedback would be greatly appreciated!
>> Best regards,
>> zhangyue19921010
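
For illustration only, the rule-engine idea from the original proposal (a regular expression on the partition path selects that partition's bucket count) could be sketched roughly as follows. This is not Hudi code: the rule list, the default of 8 buckets, the function names, and the use of CRC32 as the hash are all assumptions made for the example, and Hudi's actual bucket hashing may differ.

```python
import re
import zlib

# Hypothetical rules: (partition-path regex, bucket count). First match wins.
RULES = [
    (re.compile(r"dt=2025-01-\d{2}"), 64),  # assumed: recent, large partitions
    (re.compile(r"dt=2024-.*"), 16),        # assumed: older, smaller partitions
]
DEFAULT_BUCKETS = 8  # assumed table-level fallback when no rule matches


def buckets_for_partition(partition_path: str) -> int:
    """Return the bucket count for a partition from the first matching rule."""
    for pattern, num_buckets in RULES:
        if pattern.fullmatch(partition_path):
            return num_buckets
    return DEFAULT_BUCKETS


def bucket_id(record_key: str, partition_path: str) -> int:
    """Map a record key to a bucket within its partition.

    CRC32 is used only because it is deterministic across runs; it stands in
    for whatever hash function the real index uses.
    """
    n = buckets_for_partition(partition_path)
    return zlib.crc32(record_key.encode("utf-8")) % n
```

In this sketch, changing a partition's bucket count changes where existing keys land, which is why the proposal pairs the rules with an offline insert-overwrite step that rewrites the partition.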
