Thanks for opening the RFC! At first glance it seemed similar to RFC-08, but the proposal seems to add a bucket ID to each file group ID? If I may suggest, could we call this BucketedIndex?
Instead of changing the existing file names, can we simply assign the fileGroupID as the hash mod value? i.e. just make the fileGroupIDs 0 to numBuckets-1 (with some hash of the partition path as well, for uniqueness across the table)? This way it is a localized change, not a major change in how we name files/objects. I will review the RFC more carefully early next week.

Thanks
Vinoth

On Fri, Jun 4, 2021 at 3:05 AM 耿筱喻 <gengxiaoyu1...@gmail.com> wrote:

> Thank you for your questions.
>
> For the first question, expanding the number of buckets by a multiple is
> recommended. Rehashing can be combined with clustering to re-distribute
> the data without shuffling. For example, 2 buckets expand to 4 by
> splitting the 1st bucket and rehashing its data into two smaller buckets:
> the 1st and the 3rd. Details have been added to the RFC.
>
> For the second one, data skew when writing to Hudi with the hash index
> can be solved by using multiple file groups per bucket, as mentioned in
> the RFC. For a data processing engine like Spark, data skew during table
> joins can be solved by splitting the skewed partition into smaller units
> and distributing them to different tasks; this works in scenarios with a
> fixed SQL pattern. However, a data-skew solution needs more effort to
> stay compatible with the bucket-join rule, and the read/write long tail
> caused by data skew in SQL queries is hard to solve.
>
> Regards,
> Shawy
>
> > On Jun 3, 2021, at 10:47, Danny Chan <danny0...@apache.org> wrote:
> >
> > Thanks for the new feature, very promising ~
> >
> > Some confusion about the *Scalability* and *Data Skew* part:
> >
> > How do we expand the number of existing buckets? Say we had 100
> > buckets before, but 120 buckets now, what is the algorithm?
> >
> > About the data skew, did you mean there is no good solution to solve
> > this problem now?
> >
> > Best,
> > Danny Chan
> >
> > 耿筱喻 <gengxiaoyu1...@gmail.com> wrote on Wed, Jun 2, 2021 at 10:42 PM:
> >
> >> Hi,
> >> Currently, the Hudi index implementation is pluggable and provides two
> >> options: bloom filter and HBase. When a Hudi table becomes large, the
> >> performance of the bloom filter degrades drastically due to the
> >> increase in the false-positive probability.
> >>
> >> A hash index is an efficient, lightweight approach to address this
> >> performance issue. It is used in Hive, where it is called a bucket,
> >> which clusters the records whose keys have the same hash value under a
> >> unique hash function. This pre-distribution can accelerate SQL queries
> >> in some scenarios. Besides, buckets in Hive offer efficient sampling.
> >>
> >> I have created an RFC for this:
> >> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
> >>
> >> Feel free to discuss under this thread; suggestions are welcome.
> >>
> >> Regards,
> >> Shawy
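Vinoth's suggestion at the top of the thread (deriving the fileGroupID directly from the hash mod value, combined with a partition-path hash for table-wide uniqueness) could be sketched roughly as follows. The function names and the ID format here are illustrative assumptions, not actual Hudi APIs:

```python
import hashlib

def stable_hash(s: str) -> int:
    # A stable hash (unlike Python's salted hash()) so that bucket
    # assignment is deterministic across processes and runs.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def bucket_file_group_id(record_key: str, partition_path: str,
                         num_buckets: int) -> str:
    # Bucket = hash(record key) mod numBuckets; a partition-path hash is
    # prepended only to keep the ID unique across partitions of the table.
    bucket = stable_hash(record_key) % num_buckets
    return f"{stable_hash(partition_path):016x}-{bucket:08d}"
```

Because the same key always maps to the same file group, the index lookup becomes a pure computation rather than a bloom-filter or HBase probe, and no existing file-naming scheme needs to change.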
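Shawy's expansion example (2 buckets growing to 4 by splitting the 1st bucket into the 1st and 3rd) follows from modular arithmetic: if h % N == b, then h % (f*N) can only be one of b, b+N, ..., b+(f-1)*N. A minimal sketch of that local split, with an assumed stable hash function:

```python
import hashlib

def stable_hash(s: str) -> int:
    # Deterministic hash so bucket assignment is reproducible.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def split_bucket(record_keys, old_bucket, old_num_buckets, factor=2):
    # When numBuckets grows from N to factor*N, records in old bucket b can
    # only land in new buckets {b, b+N, ..., b+(factor-1)*N}, so each old
    # bucket splits locally -- no table-wide shuffle is required.
    new_num = old_num_buckets * factor
    out = {old_bucket + i * old_num_buckets: [] for i in range(factor)}
    for key in record_keys:
        h = stable_hash(key)
        assert h % old_num_buckets == old_bucket  # key really belonged here
        out[h % new_num].append(key)
    return out
```

With old_num_buckets=2 and old_bucket=1, the two output buckets are exactly 1 and 3, matching the example in the reply; a clustering job could perform this split per bucket without touching the other buckets.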
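The reply also mentions using multiple file groups per bucket to absorb write skew. One way this *could* work (a speculative sketch, not necessarily the scheme in the RFC) is a second-level hash that spreads keys of a hot bucket across several file groups while keeping the key-to-file-group mapping deterministic:

```python
import hashlib

def stable_hash(s: str) -> int:
    # Deterministic hash so the same key always resolves the same way.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def file_group_for(record_key: str, num_buckets: int,
                   groups_per_bucket: int):
    # First level picks the bucket (so bucket pruning on reads still works);
    # the second level fans writes out across the bucket's file groups.
    h = stable_hash(record_key)
    bucket = h % num_buckets
    sub_group = (h // num_buckets) % groups_per_bucket
    return bucket, sub_group
```

Reads can still prune to one bucket and then scan only its groups_per_bucket file groups, which bounds the extra read cost while breaking up the write hot spot.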