+1 for the improvement to make the bucket index more comprehensive. Looking forward to the RFC for more details.
Yuwei Xiao <ywxiaoz...@gmail.com> wrote on Fri, Dec 10, 2021 at 16:22:

> Dear Hudi Community,
>
> I would like to propose Consistent Hashing Indexing to enable a dynamic
> bucket number, saving hyper-parameter tuning for Hudi users.
>
> Currently, we have the Bucket Index landing [1]. It is an effective index
> approach to address the performance issue during Upsert. I observed ~3x
> throughput improvement for Upsert in my local setup compared to the Bloom
> Filter approach. However, it requires pre-configuring a bucket number when
> creating the table. As described in [1], this imposes two limitations:
>
> - Due to the one-to-one mapping between buckets and file groups, the size
> of a single file group may grow without bound. Services like compaction
> will take longer because of the larger read/write amplification.
>
> - There may be data skew because of imbalanced data distribution,
> resulting in long-tail reads/writes.
>
> Based on the above observations, supporting a dynamic bucket number is
> necessary, especially for rapidly changing Hudi tables. Looking at the
> market, Consistent Hashing has been adopted in DB systems [2][3]. Its
> main idea is to turn the "key -> bucket" mapping into
> "key -> hash_value -> (range mapping) -> bucket", constraining the
> re-hashing process to touch only a few local buckets (e.g., only large
> file groups) rather than shuffling the whole hash table.
>
> In order to introduce Consistent Hashing to Hudi, we need to consider the
> following issues:
>
> - Storing hashing metadata, such as range mapping info. Metadata size and
> concurrent updates to the metadata should also be considered.
>
> - Splitting & merging criteria. We need to design one (or several)
> policies to manage when and how to split & merge buckets. A simple policy
> would be splitting in the middle when the file group reaches a size
> threshold.
>
> - Supporting concurrent writes & reads. The splitting or merging must not
> block concurrent writers & readers, and the whole process should be fast
> enough (e.g., one bucket at a time) to minimize the impact on other
> operations.
>
> - Integrating the splitting & merging process into the existing Hudi
> table service pipelines.
>
> I have sketched a prototype design to address the above problems:
>
> - Maintain hashing metadata for each partition (persisted as files), and
> use instants to manage multi-versioning and concurrent updates of it.
>
> - A flexible framework will be implemented for different pluggable
> policies. The splitting plan, specifying which bucket to split (or merge)
> and how, will be generated during scheduling (just like compaction does).
>
> - Dual-write will be activated once a writer observes the splitting (or
> merging) process, upserting records as log files into both the old and
> new buckets (file groups). Readers can see the records once the writer
> completes, regardless of the splitting process.
>
> - The splitting & merging could be integrated as a sub-task of the
> Clustering service, because we can view them as a special case of
> Clustering's goal (i.e., managing file groups based on file size). Though
> we would need to modify Clustering to handle log files, the bucket index
> enhances Clustering by allowing concurrent updates.
>
> Would love to hear your thoughts and any feedback on the proposal. I can
> draft an RFC with a detailed design once we reach an agreement.
>
> [1]
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
>
> [2] YugabyteDB
> https://docs.yugabyte.com/latest/architecture/docdb-sharding/sharding/#example
>
> [3] PolarDB-X
> https://help.aliyun.com/document_detail/316603.html#title-y5n-2i1-5ws
>
> Best,
>
> Yuwei Xiao
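For anyone skimming the thread, the "key -> hash_value -> (range mapping) -> bucket" idea and the split-in-the-middle policy from the proposal can be sketched in a few lines of Python. This is a minimal illustration, not Hudi code: the class and method names are made up for this sketch, and it ignores persistence, instants, and concurrency.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashSketch:
    """Toy model: each bucket (file group) owns one contiguous range of the
    hash space, identified by the range's exclusive upper bound."""

    HASH_SPACE = 2 ** 32  # hash values fall in [0, HASH_SPACE)

    def __init__(self, num_buckets):
        # Start with equal-sized ranges, one per bucket.
        step = self.HASH_SPACE // num_buckets
        self.upper_bounds = [step * (i + 1) for i in range(num_buckets - 1)]
        self.upper_bounds.append(self.HASH_SPACE)
        self.bucket_ids = list(range(num_buckets))
        self.next_id = num_buckets  # id to assign to the next new file group

    def _hash(self, key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

    def bucket_of(self, key):
        # Range mapping: the first range whose upper bound exceeds the hash.
        h = self._hash(key)
        return self.bucket_ids[bisect_right(self.upper_bounds, h)]

    def split(self, bucket_id):
        # "Split in the middle" policy: halve the oversized bucket's range.
        # Only keys in this one range are re-mapped; every other bucket's
        # range is untouched, which is the point of consistent hashing.
        idx = self.bucket_ids.index(bucket_id)
        lower = self.upper_bounds[idx - 1] if idx > 0 else 0
        mid = (lower + self.upper_bounds[idx]) // 2
        self.upper_bounds.insert(idx, mid)          # old bucket keeps [lower, mid)
        self.bucket_ids.insert(idx + 1, self.next_id)  # new bucket gets [mid, upper)
        self.next_id += 1
```

Splitting one bucket therefore re-hashes only the records in that bucket's range: a key's mapping can change only if it previously landed in the split bucket, which mirrors the "touch only several local buckets" property described above.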