Combining the bucket index and the bloom filter is a great idea. There is no
conflict between the two in implementation, and the bloom filter info can still
be stored in the file to speed up record positioning.
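
As a minimal sketch of how the two could compose at lookup time (the class and
method names below are hypothetical, not Hudi's actual API): the bucket index
first narrows a record key to one bucket by hash mod, and the bloom filter
stored in each surviving file then prunes files that definitely do not contain
the key.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    public class BucketThenBloomLookup {

        // Hypothetical stand-in for a file group and its file-stored bloom filter.
        static class FileGroup {
            final int bucketId;
            final Predicate<String> bloom; // bloom.test(key) ~ "might contain key"

            FileGroup(int bucketId, Predicate<String> bloom) {
                this.bucketId = bucketId;
                this.bloom = bloom;
            }
        }

        // Step 1: hash-mod pruning needs no file metadata at all.
        // Step 2: the per-file bloom filter removes definite non-matches.
        static List<FileGroup> candidates(String key, List<FileGroup> files, int numBuckets) {
            int bucket = Math.floorMod(key.hashCode(), numBuckets);
            List<FileGroup> out = new ArrayList<>();
            for (FileGroup fg : files) {
                if (fg.bucketId == bucket && fg.bloom.test(key)) {
                    out.add(fg);
                }
            }
            return out;
        }
    }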

Best,
Shawy

> On Jun 9, 2021, at 16:23, Thiru Malai <thiru.dr...@gmail.com> wrote:
> 
> Hi,
> 
> This feature seems promising. If we are planning to assign the filegroupID as 
> the hash mod value, then we can leverage this change in the Bloom Index as 
> well by pruning the files based on the hash mod value before the min/max 
> record_key pruning. The exploded RDD would then be comparatively smaller, 
> which would eventually reduce the shuffle size in the "Compute all comparisons 
> needed between records and files" stages.
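> 
> As a rough illustration of that pruning chain (FileStats and its fields are
> hypothetical names, not the actual Bloom Index types): the hash-mod check
> runs first, so the min/max range check only ever sees files from a single
> bucket, which shrinks the exploded (recordKey, file) pairs before the shuffle.
> 
>     class BloomIndexPruning {
>         static class FileStats {
>             int bucketId;
>             String minRecordKey;
>             String maxRecordKey;
>         }
> 
>         // True only if the file survives both pruning steps; the bloom
>         // filter probe would then run on this smaller candidate set.
>         static boolean mayContain(String key, FileStats f, int numBuckets) {
>             int bucket = Math.floorMod(key.hashCode(), numBuckets);
>             return f.bucketId == bucket                 // hash-mod pruning
>                 && key.compareTo(f.minRecordKey) >= 0   // min/max record_key
>                 && key.compareTo(f.maxRecordKey) <= 0;  //   pruning
>         }
>     }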
> 
> Can we add this hash-based indexing approach to the Bloom Filter based 
> approach as well?
> 
> On 2021/06/07 03:26:34, Danny Chan <danny0...@apache.org> wrote: 
>>> number of buckets expanded by multiple is recommended
>> The condition is too harsh, and the bucket number would grow
>> exponentially.
>> 
>>> with hash index can be solved by using multiple file groups per bucket as
>>> mentioned in the RFC
>> The relation between file groups and buckets would become too complicated; we
>> should avoid that. It would also require the query engine to be aware of the
>> bucketing rules, which is neither transparent nor a common query
>> optimization.
>> 
>> Best,
>> Danny Chan
>> 
>> 耿筱喻 <gengxiaoyu1...@gmail.com> wrote on Fri, Jun 4, 2021 at 6:06 PM:
>> 
>>> Thank you for your questions.
>>> 
>>> For the first question, expanding the number of buckets by a multiple is
>>> recommended. Rehashing is combined with clustering to redistribute the data
>>> without shuffling. For example, 2 buckets expand to 4 by splitting the 1st
>>> bucket and rehashing the data in it into two smaller buckets: the 1st and
>>> 3rd buckets. Details have been added to the RFC.
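>>> 
>>> A toy demonstration of why doubling works (plain Java, not Hudi code): if
>>> hash mod N == i, then hash mod 2N is either i or i + N, so old bucket i
>>> rehashes only into new buckets i and i + N and no other bucket is touched.
>>> 
>>>     public class BucketDoubling {
>>>         public static void main(String[] args) {
>>>             int n = 2; // 2 buckets expanding to 4
>>>             for (int hash = 0; hash < 20; hash++) {
>>>                 if (Math.floorMod(hash, n) == 1) {
>>>                     // every record from old bucket 1 lands in bucket 1 or 3
>>>                     System.out.println(hash + " -> " + Math.floorMod(hash, 2 * n));
>>>                 }
>>>             }
>>>         }
>>>     }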
>>> 
>>> For the second one, data skew when writing to Hudi with the hash index can
>>> be solved by using multiple file groups per bucket, as mentioned in the RFC.
>>> For a data processing engine like Spark, data skew during table joins can be
>>> solved by splitting the skewed partition into smaller units and distributing
>>> them to different tasks, which works in scenarios with fixed SQL patterns.
>>> Besides, a data skew solution needs more effort to be compatible with the
>>> bucket join rule. However, the read and write long tail caused by data skew
>>> in SQL queries is hard to solve.
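>>> 
>>> A hypothetical sketch of the two-level routing (the names are illustrative,
>>> not the RFC's API): the bucket is still fixed by the record key's hash, but
>>> a hot bucket can own several file groups, and a secondary hash picks one of
>>> them deterministically so a skewed bucket does not funnel into one file.
>>> 
>>>     class SkewAwareRouting {
>>>         static String targetFileGroup(String key, int numBuckets, int fileGroupsPerBucket) {
>>>             int bucket = Math.floorMod(key.hashCode(), numBuckets);
>>>             // a secondary hash of the same key spreads writes inside the
>>>             // bucket while staying deterministic for later index lookups
>>>             int slot = Math.floorMod(key.hashCode() >>> 16, fileGroupsPerBucket);
>>>             return "bucket-" + bucket + "-fg-" + slot;
>>>         }
>>>     }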
>>> 
>>> Regards,
>>> Shawy
>>> 
>>>> On Jun 3, 2021, at 10:47, Danny Chan <danny0...@apache.org> wrote:
>>>> 
>>>> Thanks for the new feature, very promising ~
>>>> 
>>>> Some confusion about the *Scalability* and *Data Skew* parts:
>>>> 
>>>> How do we expand the number of existing buckets? Say we had 100
>>>> buckets before but 120 buckets now; what is the algorithm?
>>>> 
>>>> About the data skew, did you mean there is no good solution to this
>>>> problem for now?
>>>> 
>>>> Best,
>>>> Danny Chan
>>>> 
>>>> 耿筱喻 <gengxiaoyu1...@gmail.com> wrote on Wed, Jun 2, 2021 at 10:42 PM:
>>>> 
>>>>> Hi,
>>>>> Currently, Hudi's index implementation is pluggable and provides two
>>>>> options: bloom filter and HBase. When a Hudi table becomes large, the
>>>>> performance of the bloom filter degrades drastically due to the increase
>>>>> in the false positive probability.
>>>>> 
>>>>> A hash index is an efficient, lightweight approach to address this
>>>>> performance issue. It is used in Hive, where it is called bucketing: it
>>>>> clusters the records whose keys have the same hash value under a given
>>>>> hash function. This pre-distribution can accelerate SQL queries in some
>>>>> scenarios. Besides, bucketing in Hive offers efficient sampling.
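>>>>> 
>>>>> The core mapping, in a minimal sketch (mirroring Hive-style bucketing,
>>>>> not actual Hive or Hudi internals): a fixed hash function assigns each
>>>>> record key to exactly one bucket, so locating a record's file group is a
>>>>> pure computation with no per-file index lookup and no false positives.
>>>>> 
>>>>>     class HashIndexSketch {
>>>>>         static int bucketOf(String recordKey, int numBuckets) {
>>>>>             // floorMod keeps the result non-negative for negative hash codes
>>>>>             return Math.floorMod(recordKey.hashCode(), numBuckets);
>>>>>         }
>>>>>     }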
>>>>> 
>>>>> I have created an RFC for this:
>>>>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
>>>>> 
>>>>> Feel free to discuss under this thread; suggestions are welcome.
>>>>> 
>>>>> Regards,
>>>>> Shawy
>>> 
>>> 
>> 
