Hi,

Currently, Hudi's index implementation is pluggable and provides two options: bloom filter and HBase. When a Hudi table becomes large, the performance of the bloom filter index degrades drastically because the false-positive probability increases.
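As a rough illustration of the false-positive point, here is a minimal sketch using the standard textbook approximation p = (1 - e^(-k*n/m))^k for a bloom filter with m bits, k hash functions, and n inserted keys. The filter size, hash count, and key counts below are made-up values for illustration, not Hudi defaults, and the function name is hypothetical.

```python
import math


def bloom_false_positive_rate(num_entries: int, num_bits: int, num_hashes: int) -> float:
    """Textbook approximation of a bloom filter's false-positive rate:
    p = (1 - e^(-k * n / m)) ** k
    """
    return (1.0 - math.exp(-num_hashes * num_entries / num_bits)) ** num_hashes


# Hypothetical filter parameters, chosen only for illustration.
NUM_BITS = 1_000_000
NUM_HASHES = 7

# With a fixed-size filter, the false-positive rate climbs as more keys are added.
for num_entries in (50_000, 100_000, 200_000, 400_000):
    rate = bloom_false_positive_rate(num_entries, NUM_BITS, NUM_HASHES)
    print(f"{num_entries:>7} keys -> estimated false-positive rate {rate:.4f}")
```

Every false positive means a file that has to be opened and checked even though it does not contain the key, which is where the lookup cost comes from.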
A hash index is an efficient, lightweight approach to address this performance issue. Hive uses the same idea under the name Bucket: records whose keys have the same hash value under a single hash function are clustered together. This pre-distribution can accelerate SQL queries in some scenarios. In addition, bucketing in Hive enables efficient sampling.

I have written an RFC for this: https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index. Feel free to discuss it in this thread; suggestions are welcome.

Regards,
Shawy
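
P.S. To make the bucketing idea concrete, here is a minimal sketch of hash-based bucket assignment. It is only an illustration under my own assumptions: the bucket count, the use of CRC32 as the hash function, and the names `bucket_id` and `assign_buckets` are all hypothetical and are not taken from Hive or from the RFC.

```python
import zlib
from collections import defaultdict

# Hypothetical bucket count, chosen only for illustration.
NUM_BUCKETS = 4


def bucket_id(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket with one fixed hash function.

    CRC32 is an assumption made for this sketch; it is deterministic across
    processes, unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def assign_buckets(record_keys, num_buckets: int = NUM_BUCKETS) -> dict:
    """Cluster record keys so that keys with the same bucket id sit together."""
    buckets = defaultdict(list)
    for key in record_keys:
        buckets[bucket_id(key, num_buckets)].append(key)
    return dict(buckets)


keys = [f"uuid-{i}" for i in range(10)]
buckets = assign_buckets(keys)

# Lookup: the hash alone says which bucket can hold the key,
# so no other bucket has to be checked.
target = "uuid-7"
print(target in buckets[bucket_id(target)])

# Sampling: reading a single bucket gives a subset of the data
# without scanning the rest.
sample = buckets.get(0, [])
print(len(sample), "of", len(keys), "keys are in bucket 0")
```

The point of the design is that locating a record becomes a pure computation on its key, so the lookup cost does not depend on how large the table has grown.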