Thank you for your questions and advice. Unlike RFC-08, this one does not introduce an HFile to store the mapping from a record key to its location. Having one file group per bucket is one of the options. For the one-file-group-per-bucket case, using the bucket id as the file group id is a great idea. This part of the RFC has been modified.
Regards,
Shawy

> On Jun 4, 2021, at 21:24, Vinoth Chandar <vin...@apache.org> wrote:
>
> Thanks for opening the RFC! At first glance, it seemed similar to RFC-08,
> but the proposal seems to be adding a bucket id to each file group ID?
> If I may suggest, we should call this BucketedIndex?
>
> Instead of changing the existing file name, can we simply assign the
> filegroupID as the hash mod value? i.e. just make the fileGroupIDs 0 -
> numBuckets-1 (with some hash value of the partition path also for
> uniqueness across the table)?
> This way it is a localized change, not a major change in how we name
> files/objects?
>
> I will review the RFC more carefully early next week.
>
> Thanks
> Vinoth
>
> On Fri, Jun 4, 2021 at 3:05 AM 耿筱喻 <gengxiaoyu1...@gmail.com> wrote:
>
>> Thank you for your questions.
>>
>> For the first question, expanding the number of buckets by a multiple is
>> recommended. Combine rehashing and clustering to redistribute the data
>> without shuffling. For example, 2 buckets expand to 4 by splitting the
>> 1st bucket and rehashing its data into two smaller buckets: the 1st and
>> the 3rd. Details have been added to the RFC.
>>
>> For the second one, data skew when writing to Hudi with a hash index can
>> be solved by using multiple file groups per bucket, as mentioned in the
>> RFC. For a data processing engine like Spark, data skew when joining
>> tables can be solved by splitting the skewed partition into smaller
>> units and distributing them to different tasks to execute; this works in
>> scenarios that have a fixed SQL pattern. Besides, a data skew solution
>> needs more effort to be compatible with the bucket join rule. However,
>> the read and write long tail caused by data skew in SQL queries is hard
>> to solve.
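[Editor's note: the bucket-doubling scheme described in the quoted reply can be sketched as below. This is an illustrative Python sketch, not code from the RFC; the hash function and names are assumptions. The key property is that when the bucket count doubles, a record in old bucket `i` can only land in bucket `i` or bucket `i + old_num`, so only the split buckets need rewriting.]

```python
def bucket_of(record_key: str, num_buckets: int) -> int:
    # Illustrative stable hash; the RFC's actual hash function may differ.
    h = sum(ord(c) * 31**i for i, c in enumerate(record_key))
    return h % num_buckets

def split_targets(old_bucket: int, old_num: int) -> tuple:
    # When the bucket count doubles, records in old bucket i can only move
    # to bucket i or bucket i + old_num; other buckets are untouched.
    return (old_bucket, old_bucket + old_num)

# Doubling 2 -> 4: every record in old bucket 0 lands in bucket 0 or 2,
# matching the 1st-bucket-splits-into-1st-and-3rd example in the thread.
old_num = 2
for key in ("uuid-a", "uuid-b", "uuid-c", "uuid-d"):
    old_b = bucket_of(key, old_num)
    new_b = bucket_of(key, 2 * old_num)
    assert new_b in split_targets(old_b, old_num)
```

This holds for any hash because `h % 4` is congruent to `h % 2` modulo 2, which is why doubling (rather than an arbitrary new count such as 120) keeps the redistribution shuffle-free.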
>>
>> Regards,
>> Shawy
>>
>>> On Jun 3, 2021, at 10:47, Danny Chan <danny0...@apache.org> wrote:
>>>
>>> Thanks for the new feature, very promising ~
>>>
>>> Some confusion about the *Scalability* and *Data Skew* parts:
>>>
>>> How do we expand the number of existing buckets? Say we have 100
>>> buckets before, but 120 buckets now; what is the algorithm?
>>>
>>> About the data skew, did you mean there is no good solution to this
>>> problem now?
>>>
>>> Best,
>>> Danny Chan
>>>
>>> On Wed, Jun 2, 2021 at 10:42 PM, 耿筱喻 <gengxiaoyu1...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Currently, the Hudi index implementation is pluggable and provides two
>>>> options: bloom filter and HBase. When a Hudi table becomes large, the
>>>> performance of the bloom filter degrades drastically due to the
>>>> increase in false positive probability.
>>>>
>>>> A hash index is an efficient, lightweight approach to address this
>>>> performance issue. It is used in Hive, where it is called a bucket: it
>>>> clusters the records whose keys have the same hash value under a
>>>> unique hash function. This pre-distribution can accelerate SQL queries
>>>> in some scenarios. Besides, bucketing in Hive offers efficient
>>>> sampling.
>>>>
>>>> I have created an RFC for this:
>>>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
>>>>
>>>> Feel free to discuss under this thread; suggestions are welcome.
>>>>
>>>> Regards,
>>>> Shawy
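[Editor's note: for concreteness, the suggestion in the thread of deriving the file group id directly from the hash mod value, mixed with a hash of the partition path for uniqueness across the table, could look roughly like the sketch below. This is a hypothetical Python illustration; the hash choice (MD5), the id format, and all names are assumptions, not Hudi's actual code.]

```python
import hashlib

def bucket_id(record_key: str, num_buckets: int) -> int:
    # Stable hash of the record key; MD5 is an assumption for illustration.
    h = int(hashlib.md5(record_key.encode("utf-8")).hexdigest(), 16)
    return h % num_buckets

def file_group_id(partition_path: str, record_key: str, num_buckets: int) -> str:
    # Mix in a hash of the partition path so ids stay unique across the
    # table, as suggested in the thread; the id format is hypothetical.
    part = hashlib.md5(partition_path.encode("utf-8")).hexdigest()[:8]
    return "%s-%08d" % (part, bucket_id(record_key, num_buckets))

# The file group for a record is now a pure computation: no per-record
# index file (as in RFC-08) and no change to file naming is required.
fg = file_group_id("2021/06/04", "uuid-123", num_buckets=16)
```

The point of the sketch is that a lookup becomes deterministic arithmetic over the record key, which is what makes the hash index lightweight compared to bloom filter or HBase lookups.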