Combining the bucket index and the bloom filter is a great idea. There is no conflict between the two in the implementation, and the bloom filter info can still be stored in the file so that records can be located faster.
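As a rough sketch of how the two could compose (illustrative only: the class and method names below, and the use of Guava's BloomFilter as a stand-in, are placeholders rather than the actual Hudi API), the lookup would first prune to the single candidate file group by hash mod, and only then consult that file group's bloom filter:

    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    // Illustrative sketch, not the actual Hudi implementation.
    class BucketThenBloomSketch {

        // bucketFilters.get(i) stands for the bloom filter read from the
        // base file footer of bucket i's file group.
        static boolean mayContain(String recordKey,
                                  List<BloomFilter<CharSequence>> bucketFilters) {
            // Step 1: bucket pruning. Hash mod picks exactly one candidate
            // file group, so there is no fan-out over all files.
            int bucketId = (recordKey.hashCode() & Integer.MAX_VALUE)
                    % bucketFilters.size();
            // Step 2: the bloom filter stored with that single file group
            // rules out most misses without reading the file itself.
            return bucketFilters.get(bucketId).mightContain(recordKey);
        }

        // How one such per-bucket filter might be built.
        static BloomFilter<CharSequence> newFilter(int expectedInsertions) {
            return BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8),
                    expectedInsertions);
        }
    }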
Best,
Shawy

> On Jun 9, 2021, at 16:23, Thiru Malai <thiru.dr...@gmail.com> wrote:
>
> Hi,
>
> This feature seems promising. If we are planning to assign the filegroupID
> as the hash mod value, then we can leverage this change in the Bloom Index
> as well by pruning the files based on the hash mod value before min/max
> record_key pruning. The exploded RDD will then be comparatively smaller,
> which will eventually reduce the shuffle size in the "Compute all
> comparisons needed between records and files" stages.
>
> Can we add this hash-based indexing approach to the Bloom Filter based
> approach as well?
>
> On 2021/06/07 03:26:34, Danny Chan <danny0...@apache.org> wrote:
>>> expanding the number of buckets by a multiple is recommended
>> That condition is too harsh, and the bucket number would grow
>> exponentially.
>>
>>> with hash index can be solved by using multiple file groups per bucket
>>> as mentioned in the RFC
>> The relation between file groups and buckets would be too complicated; we
>> should avoid that. It also requires the query engine to be aware of the
>> bucketing rules, which is not transparent and not a common query
>> optimization.
>>
>> Best,
>> Danny Chan
>>
>> On Fri, Jun 4, 2021 at 6:06 PM, 耿筱喻 <gengxiaoyu1...@gmail.com> wrote:
>>
>>> Thank you for your questions.
>>>
>>> For the first question, expanding the number of buckets by a multiple is
>>> recommended. Rehashing is combined with clustering to re-distribute the
>>> data without shuffling. For example, 2 buckets expand to 4 by splitting
>>> the 1st bucket and rehashing its data into two smaller buckets, the 1st
>>> and the 3rd (see the sketch at the end of this thread). Details have
>>> been added to the RFC.
>>>
>>> For the second one, data skew when writing to Hudi with a hash index can
>>> be solved by using multiple file groups per bucket, as mentioned in the
>>> RFC. For a data processing engine like Spark, data skew during table
>>> joins can be handled by splitting the skewed partition into smaller
>>> units and distributing them to different tasks, which works in scenarios
>>> with a fixed SQL pattern. Besides, a data skew solution needs more
>>> effort to stay compatible with the bucket join rule. However, the read
>>> and write long tail caused by data skew in SQL queries is hard to solve.
>>>
>>> Regards,
>>> Shawy
>>>
>>>> On Jun 3, 2021, at 10:47, Danny Chan <danny0...@apache.org> wrote:
>>>>
>>>> Thanks for the new feature, very promising ~
>>>>
>>>> Some confusion about the *Scalability* and *Data Skew* parts:
>>>>
>>>> How do we expand the number of existing buckets? Say we have 100
>>>> buckets before but 120 buckets now, what is the algorithm?
>>>>
>>>> About the data skew, did you mean there is no good solution to this
>>>> problem right now?
>>>>
>>>> Best,
>>>> Danny Chan
>>>>
>>>> On Wed, Jun 2, 2021 at 10:42 PM, 耿筱喻 <gengxiaoyu1...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> Currently, the Hudi index implementation is pluggable and provides two
>>>>> options: bloom filter and HBase. When a Hudi table becomes large, the
>>>>> performance of the bloom filter degrades drastically due to the
>>>>> increase in the false positive probability.
>>>>>
>>>>> A hash index is an efficient, lightweight approach to address this
>>>>> performance issue. It is used in Hive, where it is called a Bucket: it
>>>>> clusters the records whose keys have the same hash value under a
>>>>> single hash function. This pre-distribution can accelerate SQL queries
>>>>> in some scenarios. Besides, Buckets in Hive offer efficient sampling.
>>>>>
>>>>> I have created an RFC for this:
>>>>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index .
>>>>>
>>>>> Feel free to discuss under this thread; suggestions are welcome.
>>>>>
>>>>> Regards,
>>>>> Shawy
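
P.S. To make the expansion algorithm quoted above concrete, here is a minimal sketch of the doubling rule. It is illustrative only (the method name and the use of String#hashCode are my assumptions, not the RFC implementation): when the bucket count doubles from n to 2n, a record in old bucket i can only land in bucket i or bucket i + n, so only the buckets being split have to be rewritten and all other file groups stay untouched.

    // Sketch of the doubling rule only; not the RFC code. recordKey.hashCode()
    // stands in for whatever hash function the index actually uses.
    static int newBucketAfterDoubling(String recordKey, int oldNumBuckets) {
        int h = recordKey.hashCode() & Integer.MAX_VALUE; // force non-negative
        int oldBucket = h % oldNumBuckets;                // e.g. bucket 1 of 2
        int newBucket = h % (2 * oldNumBuckets);          // becomes 1 or 3 of 4
        // h % 2n is always (h % n) or (h % n) + n, so records in buckets that
        // are not split never move.
        assert newBucket == oldBucket || newBucket == oldBucket + oldNumBuckets;
        return newBucket;
    }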