Thank you for your questions and advice. Unlike RFC-08, this one does not introduce an HFile to store the mapping from a record key to its location. Having one file group per bucket is one of the options. For the one-file-group-per-bucket case, using the bucket id as the file group id is a great idea. This part of the RFC has been modified.
Regards,
Shawy

> On Jun 4, 2021, at 21:24, Vinoth Chandar <vin...@apache.org> wrote:
>
> Thanks for opening the RFC! At first glance, it seemed similar to RFC-08,
> but the proposal seems to be adding a bucket id to each file group ID?
> If I may suggest, we should call this BucketedIndex?
>
> Instead of changing the existing file name, can we simply assign the
> filegroupID as the hash mod value? i.e. just make the fileGroupIDs 0 -
> numBuckets-1 (with some hash value of the partition path also for
> uniqueness across the table)?
> This way it is a localized change, not a major change in how we name
> files/objects?
>
> I will review the RFC more carefully early next week.
>
> Thanks
> Vinoth
>
> On Fri, Jun 4, 2021 at 3:05 AM 耿筱喻 <gengxiaoyu1...@gmail.com> wrote:
>
>> Thank you for your questions.
>>
>> For the first question, expanding the number of buckets by a multiple is
>> recommended. Combine rehashing and clustering to redistribute the data
>> without shuffling. For example, 2 buckets expand to 4 by splitting the
>> 1st bucket and rehashing its data into two smaller buckets: the 1st and
>> the 3rd. Details have been added to the RFC.
>>
>> For the second one, data skew when writing to Hudi with a hash index can
>> be solved by using multiple file groups per bucket, as mentioned in the
>> RFC. For a data processing engine like Spark, data skew when joining
>> tables can be solved by splitting the skewed partition into smaller
>> units and distributing them to different tasks to execute; this works in
>> scenarios that have a fixed SQL pattern. Besides, a data skew solution
>> needs more effort to be compatible with the bucket join rule. However,
>> the read and write long tail caused by data skew in SQL queries is hard
>> to solve.
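[Editor's note: the bucket-doubling scheme described in the quoted reply can be sketched as below. This is an illustrative Python sketch, not code from the RFC; the hash function and names are assumptions. The key property is that when the bucket count doubles, a record in old bucket `i` can only land in bucket `i` or bucket `i + old_num`, so only the split buckets need rewriting.]

```python
def bucket_of(record_key: str, num_buckets: int) -> int:
    # Illustrative stable hash; the RFC's actual hash function may differ.
    h = sum(ord(c) * 31**i for i, c in enumerate(record_key))
    return h % num_buckets

def split_targets(old_bucket: int, old_num: int) -> tuple:
    # When the bucket count doubles, records in old bucket i can only move
    # to bucket i or bucket i + old_num; other buckets are untouched.
    return (old_bucket, old_bucket + old_num)

# Doubling 2 -> 4: every record in old bucket 0 lands in bucket 0 or 2,
# matching the 1st-bucket-splits-into-1st-and-3rd example in the thread.
old_num = 2
for key in ("uuid-a", "uuid-b", "uuid-c", "uuid-d"):
    old_b = bucket_of(key, old_num)
    new_b = bucket_of(key, 2 * old_num)
    assert new_b in split_targets(old_b, old_num)
```

This holds for any hash because `h % 4` is congruent to `h % 2` modulo 2, which is why doubling (rather than an arbitrary new count such as 120) keeps the redistribution shuffle-free.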
>>
>> Regards,
>> Shawy
>>
>>> On Jun 3, 2021, at 10:47, Danny Chan <danny0...@apache.org> wrote:
>>>
>>> Thanks for the new feature, very promising ~
>>>
>>> Some confusion about the *Scalability* and *Data Skew* parts:
>>>
>>> How do we expand the number of existing buckets? Say we have 100
>>> buckets before, but 120 buckets now; what is the algorithm?
>>>
>>> About the data skew, did you mean there is no good solution to this
>>> problem now?
>>>
>>> Best,
>>> Danny Chan
>>>
>>> On Wed, Jun 2, 2021 at 10:42 PM, 耿筱喻 <gengxiaoyu1...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Currently, the Hudi index implementation is pluggable and provides two
>>>> options: bloom filter and HBase. When a Hudi table becomes large, the
>>>> performance of the bloom filter degrades drastically due to the
>>>> increase in false positive probability.
>>>>
>>>> A hash index is an efficient, lightweight approach to address this
>>>> performance issue. It is used in Hive, where it is called a bucket: it
>>>> clusters the records whose keys have the same hash value under a
>>>> unique hash function. This pre-distribution can accelerate SQL queries
>>>> in some scenarios. Besides, bucketing in Hive offers efficient
>>>> sampling.
>>>>
>>>> I have created an RFC for this:
>>>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
>>>>
>>>> Feel free to discuss under this thread; suggestions are welcome.
>>>>
>>>> Regards,
>>>> Shawy
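[Editor's note: for concreteness, the suggestion in the thread of deriving the file group id directly from the hash mod value, mixed with a hash of the partition path for uniqueness across the table, could look roughly like the sketch below. This is a hypothetical Python illustration; the hash choice (MD5), the id format, and all names are assumptions, not Hudi's actual code.]

```python
import hashlib

def bucket_id(record_key: str, num_buckets: int) -> int:
    # Stable hash of the record key; MD5 is an assumption for illustration.
    h = int(hashlib.md5(record_key.encode("utf-8")).hexdigest(), 16)
    return h % num_buckets

def file_group_id(partition_path: str, record_key: str, num_buckets: int) -> str:
    # Mix in a hash of the partition path so ids stay unique across the
    # table, as suggested in the thread; the id format is hypothetical.
    part = hashlib.md5(partition_path.encode("utf-8")).hexdigest()[:8]
    return "%s-%08d" % (part, bucket_id(record_key, num_buckets))

# The file group for a record is now a pure computation: no per-record
# index file (as in RFC-08) and no change to file naming is required.
fg = file_group_id("2021/06/04", "uuid-123", num_buckets=16)
```

The point of the sketch is that a lookup becomes deterministic arithmetic over the record key, which is what makes the hash index lightweight compared to bloom filter or HBase lookups.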