Hi Hudis:

     As we know, Hudi introduced the Bucket Index in RFC-29. The Bucket Index 
unifies indexing across Flink and Spark: both engines can upsert the same Hudi 
table using the bucket index.

     However, a Bucket Index table is limited to a fixed number of buckets. To 
address this, RFC-42 proposed consistent hashing, which resizes buckets by 
dynamically splitting or merging several local buckets.

    From production experience, however, a Partition-Level Bucket Index plus an 
offline way to rescale buckets is sometimes good enough, without the additional 
effort (multiple writes, clustering, automatic resizing, etc.). The more 
complex the architecture, the more error-prone it is and the greater the 
operation and maintenance pressure.

    In this regard, we could upgrade the traditional Bucket Index to implement 
a Partition-Level Bucket Index, so that users can set a specific number of 
buckets for different partitions through a rule engine (such as regular 
expression matching). In addition, for existing partitions, an offline command 
would be provided to reorganize the data using insert overwrite (writes to the 
partition being rescaled need to be stopped while the command runs).
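
    To make the rule-engine idea concrete, here is a minimal sketch, not a 
proposed implementation. All names (RULES, DEFAULT_BUCKETS, 
buckets_for_partition, bucket_id, rescale) and the sample rules are 
hypothetical, and zlib.crc32 only stands in for whatever hash Hudi actually 
uses. It assumes rules are evaluated in order and the first match wins, with a 
table-level default as the fallback.

```python
import re
import zlib

# Hypothetical rules: (regex on the partition path, bucket count).
# Assumption: rules are checked in order and the first full match wins.
RULES = [
    (re.compile(r"dt=2023-11-11"), 256),  # one known hot partition
    (re.compile(r"dt=2023-.*"), 64),      # every other 2023 partition
]
DEFAULT_BUCKETS = 16  # assumed table-level default when no rule matches


def buckets_for_partition(partition_path):
    """Resolve the bucket count for a partition from the rule list."""
    for pattern, num_buckets in RULES:
        if pattern.fullmatch(partition_path):
            return num_buckets
    return DEFAULT_BUCKETS


def bucket_id(record_key, partition_path):
    """Map a record key to a bucket inside its partition.

    crc32 is a stand-in hash, chosen only because it is deterministic.
    """
    num_buckets = buckets_for_partition(partition_path)
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def rescale(record_keys, new_num_buckets):
    """Regroup the keys of one partition under a new bucket count.

    This mimics what an offline insert-overwrite rescale would do to the
    data layout; it does not touch any real table.
    """
    groups = {}
    for key in record_keys:
        new_id = zlib.crc32(key.encode("utf-8")) % new_num_buckets
        groups.setdefault(new_id, []).append(key)
    return groups
```

    With these sample rules, dt=2023-11-11 resolves to 256 buckets, any other 
2023 partition to 64, and everything else falls back to the default of 16.
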
    More importantly, the existing Bucket Index table can be upgraded to 
Partition-Level Bucket Index smoothly without re-building the whole table.
    Any thoughts on this feature? Feedback would be greatly appreciated!

Best regards,
zhangyue19921010
