[Delta Streamer] file name mismatch with meta when compaction running

2021-10-05 Thread Jian Feng
When I run DeltaStreamer (version 0.9) to ingest data from Kafka into an
HBase-indexed MOR table, I hit this error during compaction after a few
commits:
[image: image.png]

In HDFS there is a file with the same fileId and commit instant, but a
different middle segment:
hdfs://tl5/projects/data_vite/mysql_ingestion/rti_vite/shopee_item_v4_db__item_v4_tab_newHbase/BR/2021-10/813800cd-1aaf-43ea-829f-4feef4a51cb3-0_19-2672-4427765_
*20211006051032*.parquet

Below is the content of 20211006051032.commit:


[image: image.png]


What do 2672-4427765 and 2657-4368242 mean, and how can I fix this error?

I tried recreating the table, but it happened again.
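
On the question of what 2672-4427765 means, here is a minimal sketch. It
assumes Hudi's base file naming convention is
<fileId>_<writeToken>_<instantTime>.parquet and that the write token is
<taskPartitionId>-<stageId>-<taskAttemptId>. Under that assumption, 2672 and
4427765 would be the Spark stage ID and task attempt ID of the writer, so two
names that differ only in that segment would come from different write
attempts. The helper parse_base_file_name is hypothetical and is not part of
Hudi.

```python
from typing import NamedTuple


class BaseFileName(NamedTuple):
    file_id: str
    task_partition_id: int
    stage_id: int
    task_attempt_id: int
    instant_time: str
    extension: str


def parse_base_file_name(name: str) -> BaseFileName:
    """Split a base file name, assuming <fileId>_<writeToken>_<instantTime>.<ext>."""
    stem, _, extension = name.rpartition(".")
    # The fileId contains hyphens but no underscores, so "_" separates the parts.
    file_id, write_token, instant_time = stem.split("_")
    # Assumed write token layout: <taskPartitionId>-<stageId>-<taskAttemptId>.
    partition_id, stage_id, attempt_id = write_token.split("-")
    return BaseFileName(
        file_id, int(partition_id), int(stage_id), int(attempt_id),
        instant_time, extension,
    )


parsed = parse_base_file_name(
    "813800cd-1aaf-43ea-829f-4feef4a51cb3-0_19-2672-4427765_20211006051032.parquet"
)
print(parsed)
```

Comparing the parsed fields of the file on HDFS with the path recorded in the
commit metadata shows which of the three write-token components differ.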


-- 
*Jian Feng,冯健*
Shopee | Engineer | Data Infrastructure


Re: [Phishing Risk] [External] is there solution to solve hbase data screw issue

2021-10-05 Thread Jian Feng
Actually, I hit this problem when bootstrapping a huge table. After I changed
the region key split strategy, the problem was solved.
I'm glad to hear that the HFile solution will work in the future. Since the
Bloom index cannot index MOR log files, newly inserted data is still written
to Parquet; that is why I chose the HBase index, which gets better performance.

On Tue, Oct 5, 2021 at 7:29 PM Vinoth Chandar wrote:

> +1 on that answer. It's pretty spot on.
>
> Even though a random prefix helps with HBase balancing, the issue then
> becomes that you lose all the key ordering inside the Hudi table, which
> can be a nice thing to have if you ever want range pruning/indexing to be
> effective.
>
> To paint a picture of the work being done in this area: this work,
> driven by Uber engineers (https://github.com/apache/hudi/pull/3508), could
> technically solve the issue by reading HFiles directly
> for the indexing, avoiding going to the HBase servers. But obviously, it could
> be less performant for small upsert batches than HBase (given that the region
> servers cache data, etc.).
> If your backing storage is cloud/object storage, which again throttles by
> prefix, then we could run into the same hotspotting problem again.
> Otherwise, for larger batches, this would be far more scalable.
>
>
> On Mon, Oct 4, 2021 at 7:06 PM 管梓越  wrote:
>
> > Hi Jianfeng,
> >   As far as I know, there may not be a solution on the Hudi side yet.
> > However, I have hit this problem before, so I hope my experience helps.
> > As with other uses of HBase, adding a random prefix to the rowkey may be
> > the most universal solution to this problem.
> > We can change the primary key for Hudi by adding such a prefix before the
> > data is ingested into Hudi. A new column can be added to hold the original
> > primary key for queries, hiding Hudi's pk.
> > Alternatively, we can make a small modification to the HBase index: copy
> > the HBase index code and add the prefix wherever it queries and updates
> > HBase. This way, the pk in HBase will differ from the one in Hudi, but this
> > logic will be transparent to business logic. I have adopted this method in
> > our prod environment. The withIndexClass config in IndexConfig can specify
> > a custom index, which allows changing the index without recompiling the
> > whole Hudi project.
> >
> > On Mon, Oct 4, 2021, 11:29 PM  wrote:
> > When I bootstrapped a huge HBase index table, I found that all keys have
> > the prefix 'itemid:', which caused data skew: there are 100 region servers
> > in HBase, but only one was handling data. Is there any way to avoid this
> > issue on the Hudi side? -- *Jian Feng,冯健* Shopee | Engineer | Data
> > Infrastructure
> >
>


Code Walkthrough Session on 8th October

2021-10-05 Thread joyan sil
Hello Everyone,

Please join for a code walkthrough session on Friday,* 8th October, 8 AM
PST* by @vc.
Please register using the link:
https://meeting.zoho.in/meeting/register?sessionId=1356203477

Let me know if you face any issues registering or have any suggestions.

Regards
Joyan


Re: Monthly or Bi-Monthly Dev meeting?

2021-10-05 Thread Pratyaksh Sharma
Works for me in India :)

On Tue, Oct 5, 2021 at 9:41 AM Vinoth Chandar  wrote:

> Looks like there is enough interest here.
>
> Moving on to timing. Does 8 AM PST, on the second Thursday of every
> month, work for everyone?
> This is the time I find works best for most time zones.
>
> On Thu, Sep 23, 2021 at 1:15 PM Y Ethan Guo 
> wrote:
>
> > +1 on monthly community sync.
> >
> > On Thu, Sep 23, 2021 at 12:32 PM Udit Mehrotra 
> wrote:
> >
> > > +1 for the monthly meeting. It would be great to start syncing up
> > > again. Thanks Vinoth for bringing it up !
> > >
> > > On Thu, Sep 23, 2021 at 12:14 PM Sivabalan  wrote:
> > > >
> > > > +1 on monthly meet up.
> > > >
> > > > On Thu, Sep 23, 2021 at 11:01 AM vino yang 
> > > wrote:
> > > >
> > > > > +1 for monthly
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > On Thu, Sep 23, 2021 at 9:36 PM Pratyaksh Sharma wrote:
> > > > >
> > > > > > > Monthly should be good. Been a long time since we connected in
> > > > > > > these meetings. :)
> > > > > >
> > > > > > On Thu, Sep 23, 2021 at 7:02 PM Vinoth Chandar <
> > > > > > mail.vinoth.chan...@gmail.com> wrote:
> > > > > >
> > > > > > > 1 hour monthly is what I was proposing to be specific.
> > > > > > >
> > > > > > > On Thu, Sep 23, 2021 at 6:30 AM Gary Li 
> > wrote:
> > > > > > >
> > > > > > > > +1 for monthly.
> > > > > > > >
> > > > > > > > On Thu, Sep 23, 2021 at 8:28 PM Vinoth Chandar <
> > > vin...@apache.org>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > > Once upon a time, we used to have a weekly community sync.
> > > > > > > > > > Wondering if there is interest in having a monthly or
> > > > > > > > > > bi-monthly dev meeting?
> > > > > > > > >
> > > > > > > > > Agenda could be
> > > > > > > > > - Update/Summary of all dev work tracks
> > > > > > > > > > - Show and tell, where people can present their ongoing work
> > > > > > > > > - Open floor discussions, bring up new issues.
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Vinoth
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> > >
> >
>


Re: [Phishing Risk] [External] is there solution to solve hbase data screw issue

2021-10-05 Thread Vinoth Chandar
+1 on that answer. It's pretty spot on.

Even though a random prefix helps with HBase balancing, the issue then becomes
that you lose all the key ordering inside the Hudi table, which
can be a nice thing to have if you ever want range pruning/indexing to be
effective.
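
To make that trade-off concrete, here is a minimal sketch of key salting. It
swaps the "random" prefix for a hash-derived one, on the assumption that the
same record key must always map to the same rowkey so that index lookups still
work. NUM_BUCKETS and salted_rowkey are hypothetical names for illustration
and are not part of Hudi or HBase.

```python
import hashlib

NUM_BUCKETS = 100  # hypothetical: roughly one bucket per region server


def salted_rowkey(record_key: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Prefix the key with a bucket number derived from a hash of the key."""
    digest = hashlib.md5(record_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    width = len(str(num_buckets - 1))  # zero-pad so prefixes sort as text
    return f"{bucket:0{width}d}|{record_key}"


keys = [f"itemid:{i}" for i in range(1000, 1010)]
salted = [salted_rowkey(k) for k in keys]

# Keys that all shared the 'itemid:' prefix now start with different buckets.
for original, rowkey in zip(keys, salted):
    print(original, "->", rowkey)

# Sorting by salted rowkey groups keys by bucket first, so the original
# ordering of neighbouring keys is generally not preserved.
print(sorted(salted))
```

Neighbouring keys such as itemid:1000 and itemid:1001 generally land in
different buckets, which spreads the load but means a range scan over the
original keys has to touch every bucket.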

To paint a picture of the work being done in this area: this work,
driven by Uber engineers (https://github.com/apache/hudi/pull/3508), could
technically solve the issue by reading HFiles directly
for the indexing, avoiding going to the HBase servers. But obviously, it could
be less performant for small upsert batches than HBase (given that the region
servers cache data, etc.).
If your backing storage is cloud/object storage, which again throttles by
prefix, then we could run into the same hotspotting problem again.
Otherwise, for larger batches, this would be far more scalable.


On Mon, Oct 4, 2021 at 7:06 PM 管梓越  wrote:

> Hi Jianfeng,
>   As far as I know, there may not be a solution on the Hudi side yet.
> However, I have hit this problem before, so I hope my experience helps.
> As with other uses of HBase, adding a random prefix to the rowkey may be
> the most universal solution to this problem.
> We can change the primary key for Hudi by adding such a prefix before the
> data is ingested into Hudi. A new column can be added to hold the original
> primary key for queries, hiding Hudi's pk.
> Alternatively, we can make a small modification to the HBase index: copy the
> HBase index code and add the prefix wherever it queries and updates HBase.
> This way, the pk in HBase will differ from the one in Hudi, but this
> logic will be transparent to business logic. I have adopted this method in
> our prod environment. The withIndexClass config in IndexConfig can specify a
> custom index, which allows changing the index without recompiling the
> whole Hudi project.
>
> On Mon, Oct 4, 2021, 11:29 PM  wrote:
> When I bootstrapped a huge HBase index table, I found that all keys have the
> prefix 'itemid:', which caused data skew: there are 100 region servers in
> HBase, but only one was handling data. Is there any way to avoid this issue
> on the Hudi side? -- *Jian Feng,冯健* Shopee | Engineer | Data Infrastructure
>