Thanks for opening the RFC! At first glance it seemed similar to RFC-08, but the proposal seems to add a bucket ID to each file group ID? If I may suggest, could we call this BucketedIndex?
Instead of changing the existing file names, can we simply assign the fileGroupID as the hash mod value? i.e. just make the fileGroupIDs 0 to numBuckets-1 (with some hash of the partition path as well, for uniqueness across the table)? This way it is a localized change, not a major change in how we name files/objects. I will review the RFC more carefully early next week.

Thanks
Vinoth

On Fri, Jun 4, 2021 at 3:05 AM 耿筱喻 <gengxiaoyu1...@gmail.com> wrote:

> Thank you for your questions.
>
> For the first question, expanding the number of buckets by a multiple is
> recommended. Rehashing can be combined with clustering to re-distribute
> the data without shuffling. For example, 2 buckets expand to 4 by
> splitting the 1st bucket and rehashing its data into two smaller buckets:
> the 1st and the 3rd. Details have been added to the RFC.
>
> For the second one, data skew when writing to Hudi with the hash index
> can be solved by using multiple file groups per bucket, as mentioned in
> the RFC. For a data processing engine like Spark, data skew during table
> joins can be solved by splitting the skewed partition into smaller units
> and distributing them to different tasks; this works in scenarios with a
> fixed SQL pattern. However, a data-skew solution needs more effort to
> stay compatible with the bucket-join rule, and the read/write long tail
> caused by data skew in SQL queries is hard to solve.
>
> Regards,
> Shawy
>
> > On Jun 3, 2021, at 10:47, Danny Chan <danny0...@apache.org> wrote:
> >
> > Thanks for the new feature, very promising ~
> >
> > Some confusion about the *Scalability* and *Data Skew* part:
> >
> > How do we expand the number of existing buckets? Say we had 100
> > buckets before, but 120 buckets now, what is the algorithm?
> >
> > About the data skew, did you mean there is no good solution to solve
> > this problem now?
> >
> > Best,
> > Danny Chan
> >
> > 耿筱喻 <gengxiaoyu1...@gmail.com> wrote on Wed, Jun 2, 2021 at 10:42 PM:
> >
> >> Hi,
> >> Currently, the Hudi index implementation is pluggable and provides two
> >> options: bloom filter and HBase. When a Hudi table becomes large, the
> >> performance of the bloom filter degrades drastically due to the
> >> increase in the false-positive probability.
> >>
> >> A hash index is an efficient, lightweight approach to address this
> >> performance issue. It is used in Hive, where it is called a bucket,
> >> which clusters the records whose keys have the same hash value under a
> >> unique hash function. This pre-distribution can accelerate SQL queries
> >> in some scenarios. Besides, buckets in Hive offer efficient sampling.
> >>
> >> I have created an RFC for this:
> >> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
> >>
> >> Feel free to discuss under this thread; suggestions are welcome.
> >>
> >> Regards,
> >> Shawy
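Vinoth's suggestion at the top of the thread (deriving the fileGroupID directly from the hash mod value, combined with a partition-path hash for table-wide uniqueness) could be sketched roughly as follows. The function names and the ID format here are illustrative assumptions, not actual Hudi APIs:

```python
import hashlib

def stable_hash(s: str) -> int:
    # A stable hash (unlike Python's salted hash()) so that bucket
    # assignment is deterministic across processes and runs.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def bucket_file_group_id(record_key: str, partition_path: str,
                         num_buckets: int) -> str:
    # Bucket = hash(record key) mod numBuckets; a partition-path hash is
    # prepended only to keep the ID unique across partitions of the table.
    bucket = stable_hash(record_key) % num_buckets
    return f"{stable_hash(partition_path):016x}-{bucket:08d}"
```

Because the same key always maps to the same file group, the index lookup becomes a pure computation rather than a bloom-filter or HBase probe, and no existing file-naming scheme needs to change.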
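Shawy's expansion example (2 buckets growing to 4 by splitting the 1st bucket into the 1st and 3rd) follows from modular arithmetic: if h % N == b, then h % (f*N) can only be one of b, b+N, ..., b+(f-1)*N. A minimal sketch of that local split, with an assumed stable hash function:

```python
import hashlib

def stable_hash(s: str) -> int:
    # Deterministic hash so bucket assignment is reproducible.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def split_bucket(record_keys, old_bucket, old_num_buckets, factor=2):
    # When numBuckets grows from N to factor*N, records in old bucket b can
    # only land in new buckets {b, b+N, ..., b+(factor-1)*N}, so each old
    # bucket splits locally -- no table-wide shuffle is required.
    new_num = old_num_buckets * factor
    out = {old_bucket + i * old_num_buckets: [] for i in range(factor)}
    for key in record_keys:
        h = stable_hash(key)
        assert h % old_num_buckets == old_bucket  # key really belonged here
        out[h % new_num].append(key)
    return out
```

With old_num_buckets=2 and old_bucket=1, the two output buckets are exactly 1 and 3, matching the example in the reply; a clustering job could perform this split per bucket without touching the other buckets.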
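The reply also mentions using multiple file groups per bucket to absorb write skew. One way this *could* work (a speculative sketch, not necessarily the scheme in the RFC) is a second-level hash that spreads keys of a hot bucket across several file groups while keeping the key-to-file-group mapping deterministic:

```python
import hashlib

def stable_hash(s: str) -> int:
    # Deterministic hash so the same key always resolves the same way.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def file_group_for(record_key: str, num_buckets: int,
                   groups_per_bucket: int):
    # First level picks the bucket (so bucket pruning on reads still works);
    # the second level fans writes out across the bucket's file groups.
    h = stable_hash(record_key)
    bucket = h % num_buckets
    sub_group = (h // num_buckets) % groups_per_bucket
    return bucket, sub_group
```

Reads can still prune to one bucket and then scan only its groups_per_bucket file groups, which bounds the extra read cost while breaking up the write hot spot.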