Yeah, all the rate-limiting code in HBaseIndex is there to work around these large bulk writes.
On Tue, Oct 5, 2021 at 11:16 AM Jian Feng <[email protected]> wrote:

> Actually, I hit this problem when bootstrapping a huge table. After I
> changed the region key split strategy, the problem was solved.
> I'm glad to hear that the HFile solution will work in the future. Since
> the bloom index cannot index MOR log files, newly inserted data is still
> written to parquet; that's why I chose the HBase index, to get better
> performance.
>
> Vinoth Chandar <[email protected]> wrote on Tue, Oct 5, 2021 at 7:29 PM:
>
> > +1 on that answer. It's pretty spot on.
> >
> > Even though a random prefix helps with HBase balancing, the issue then
> > becomes that you lose all the key ordering inside the Hudi table, which
> > can be a nice thing if you ever want range pruning/indexing to be
> > effective.
> >
> > To paint a picture of all the work being done around this area: this
> > work, driven by Uber engineers, https://github.com/apache/hudi/pull/3508
> > could technically solve the issue by directly reading HFiles
> > for the indexing, avoiding going to HBase servers. But obviously, it
> > could be less performant for small upsert batches than HBase (given the
> > region servers will cache etc).
> > If your backing storage is a cloud/object storage, which again throttles
> > by prefixes etc, then we could run into the same hotspotting problem
> > again. Otherwise, for larger batches, this would be far more scalable.
> >
> > On Mon, Oct 4, 2021 at 7:06 PM 管梓越 <[email protected]> wrote:
> >
> > > Hi Jianfeng,
> > > As far as I know, there may not be a solution on the Hudi side yet.
> > > However, I have met this problem before, so I hope my experience
> > > helps.
> > > Just as with other uses of HBase, adding a random prefix to the rowkey
> > > may be the most universal solution to this problem.
> > > We can change the primary key for Hudi by adding such a prefix before
> > > the data is ingested into Hudi. A new column could be added to store
> > > the original primary key for queries and to hide the Hudi pk.
> > > Also, we can make a small modification to the HBase index: copy the
> > > code of the HBase index and add the prefix in the paths that query and
> > > update HBase. This way, the pk in HBase will differ from the one in
> > > Hudi, but such logic will be transparent to business logic. I have
> > > adopted this method in a prod environment. The withIndexClass config
> > > in IndexConfig can specify a custom index, which allows changing the
> > > index without recompiling the whole Hudi project.
> > >
> > > On Mon, Oct 4, 2021, 11:29 PM <[email protected]> wrote:
> > > When I bootstrapped a huge HBase index table, I found that all keys
> > > have the prefix 'itemid:', which caused data skew: there are 100
> > > region servers in HBase but only one was handling data. Is there any
> > > way to avoid this issue on the Hudi side?
> > > --
> > > *Jian Feng,冯健* Shopee | Engineer | Data Infrastructure
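
For illustration, the rowkey-prefix idea discussed in this thread can be sketched roughly as below. This is a minimal standalone Python sketch, not Hudi or HBase code. It swaps the "random" prefix for a deterministic, hash-derived one so that the same record key always maps to the same salted rowkey at both query and update time. The names `salted_rowkey`, `original_key`, and `NUM_BUCKETS`, the `_` separator, and the choice of MD5 are all assumptions made for the example.

```python
import hashlib
from collections import Counter

# Assumption: one salt bucket per region server (the thread mentions 100).
NUM_BUCKETS = 100


def salted_rowkey(record_key: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Prepend a deterministic bucket prefix derived from a hash of the key.

    Hypothetical helper: the same record key always yields the same rowkey,
    so index lookups and index updates agree on where the key lives.
    """
    digest = hashlib.md5(record_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    width = len(str(num_buckets - 1))  # zero-pad so prefixes sort consistently
    return f"{bucket:0{width}d}_{record_key}"


def original_key(rowkey: str) -> str:
    """Strip the bucket prefix to recover the original record key."""
    return rowkey.split("_", 1)[1]


# Keys sharing the 'itemid:' prefix would otherwise sort into one key range.
keys = [f"itemid:{i}" for i in range(10_000)]
buckets = Counter(salted_rowkey(k).split("_", 1)[0] for k in keys)
print(f"{len(buckets)} buckets used; largest holds {max(buckets.values())} keys")
```

As noted in the replies above, the trade-off is that the salted rowkeys no longer preserve the ordering of the original keys, so range-based pruning on the rowkey stops being effective.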
