[DISCUSS] Hash Index for HUDI

耿筱喻 Wed, 02 Jun 2021 07:42:37 -0700

Hi, 
Currently, the Hudi index implementation is pluggable and provides two options: 
bloom filter and HBase. When a Hudi table becomes large, the performance of the 
bloom filter index degrades drastically because the false positive probability 
increases.


A hash index is an efficient, lightweight approach to address this performance 
issue. Hive uses the same idea under the name Bucket: records whose keys have 
the same hash value under a single hash function are clustered together. This 
pre-distribution can accelerate SQL queries in some scenarios. In addition, 
Hive buckets enable efficient sampling.
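
To illustrate the idea, here is a minimal sketch of hash bucketing. It is not 
the design from the RFC and not Hudi or Hive code: the bucket count, the CRC32 
hash, and the function names bucket_id and assign_buckets are assumptions made 
only for this example.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_id(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket number in [0, num_buckets)."""
    # zlib.crc32 is stable across processes, unlike Python's built-in
    # hash() on strings, which is randomized per interpreter run.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def assign_buckets(keys, num_buckets: int = NUM_BUCKETS):
    """Group record keys by bucket; equal keys always share a bucket."""
    buckets = {i: [] for i in range(num_buckets)}
    for key in keys:
        buckets[bucket_id(key, num_buckets)].append(key)
    return buckets
```

With such a layout, locating a record key only requires computing its bucket 
and looking inside that one bucket, so there are no false positives to filter 
out as the table grows.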

I have written an RFC for this: 
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index.

Feel free to discuss it in this thread. Suggestions are welcome.

Regards,
Shawy
