[ANNOUNCE] New Apache Hudi Committer - Zhaojing Yu

2022-03-24 Thread Danny Chan
Hi everyone,

On behalf of the PMC, I'm very happy to announce Zhaojing Yu as a new
Hudi committer.

Zhaojing is very active in Flink Hudi contributions. He contributed many
cool features, such as the Flink streaming bootstrap, the compaction
service and all kinds of writing modes. He also fixed many critical bugs
on the Flink side.

Besides that, Zhaojing is also active in publicizing Hudi use cases in
China, and he is very active in answering user questions in our DingTalk
group. He now works at ByteDance, pushing forward the Volcanic
cloud service Hudi products!

Please join me in congratulating Zhaojing for becoming a Hudi committer!

Cheers,
Danny


Re: [DISCUSS] New RFC to support Lock-free concurrency control on Merge-on-read tables

2022-03-24 Thread Jian Feng
Sure, I'm working on it. I will add you as a co-author when I create the PR.

On Fri, Mar 25, 2022 at 1:17 AM Vinoth Chandar  wrote:

> +1. Love to be a co-author on the RFC, if you are open to it.
>
> On Mon, Mar 21, 2022 at 12:31 PM 冯健  wrote:
>
> > Hi team,
> >
> > Optimistic concurrency control (OCC) has some limitations:
> >
> >    - When conflicts do occur, they may waste massive resources during
> >      every attempt (lakehouse-concurrency-control-are-we-too-optimistic
> >      <https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic>).
> >    - Multiple writers may cause data duplicates when records with the
> >      same new record-key arrive (multi-writer-guarantees
> >      <https://hudi.apache.org/docs/concurrency_control#multi-writer-guarantees>).
> >
> > Some background: with OCC, we assume that multiple writers won't write
> > data to the same FileID most of the time. If there is a FileID-level
> > conflict, the commit will be rolled back. But FileID-level conflict
> > detection can't guarantee no duplicates if two records with the same new
> > record-key arrive in multiple writers, since the mapping of key to
> > bucket is not consistent with the bloom index.
> >
> > What I plan to do is support lock-free concurrency control with a
> > no-duplicates guarantee in Hudi (only for Merge-on-read tables).
> >
> >    - With a canIndexLogfiles index, multiple writers ingesting data into
> >      Merge-on-read tables can only append data to delta logs. This is a
> >      lock-free process if we can make sure they don’t write data to the
> >      same log file (I plan to create multiple marker files to achieve
> >      this). And with the log merge API (preCombine logic in the Payload
> >      class), data in log files can be read properly.
> >    - Since Hudi already has an index type, the Bucket index, that maps
> >      keys to buckets in a consistent way, data duplicates can be
> >      eliminated.
> >
> >
> > Thanks,
> > Jian Feng
> >
>
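
The consistent key-to-bucket mapping mentioned in the proposal can be
illustrated with a minimal sketch. This is not Hudi's actual Bucket index
implementation: the hash function, the bucket count, and the names
bucket_for and log_file_name are all assumptions made up for this example.
It only shows why a deterministic mapping lets independent writers agree on
the bucket for a key while still appending to separate log files.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical bucket count, assumed fixed for the table


def bucket_for(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket with a deterministic hash.

    zlib.crc32 gives the same result in every process, so two writers
    that never talk to each other still pick the same bucket for a key.
    """
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def log_file_name(bucket: int, writer_id: str, instant: str) -> str:
    """Hypothetical naming scheme: one log file per (bucket, writer, instant).

    Including the writer id means two writers never append to the same file.
    """
    return f"bucket-{bucket:04d}_{writer_id}_{instant}.log"


if __name__ == "__main__":
    key = "user-42"
    b = bucket_for(key)
    # Both writers target the same bucket but different log files.
    print(log_file_name(b, "writer-a", "001"))
    print(log_file_name(b, "writer-b", "001"))
```

Because the bucket depends only on the key, a new record-key arriving at
two writers at once lands in the same bucket on both sides, which is what
makes later de-duplication possible.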


-- 
*Jian Feng,冯健*
Shopee | Engineer | Data Infrastructure


Re: [PSA] CI failures, PR merges halted

2022-03-24 Thread Y Ethan Guo
Hi all,

The CI issues have been resolved.  CI is green on master.  Please rebase
your PRs on the latest master to avoid noise in CI runs.

Best,
- Ethan

On Wed, Mar 23, 2022 at 8:26 AM sagar sumit  wrote:

> Hi all,
>
> We have noticed consistent failures in the CI. These failures are mainly
> due to the reasons mentioned in
> https://issues.apache.org/jira/browse/HUDI-3689
>
> Please bear with us while we are working to make the CI green again. This
> will halt merges for some time. We will keep you posted on this thread.
> Thanks for your patience.
>
> Regards,
> Sagar
>


Re: [DISCUSS] New RFC to support Lock-free concurrency control on Merge-on-read tables

2022-03-24 Thread Vinoth Chandar
+1. Love to be a co-author on the RFC, if you are open to it.

On Mon, Mar 21, 2022 at 12:31 PM 冯健  wrote:

> Hi team,
>
> Optimistic concurrency control (OCC) has some limitations:
>
>    - When conflicts do occur, they may waste massive resources during
>      every attempt (lakehouse-concurrency-control-are-we-too-optimistic
>      <https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic>).
>    - Multiple writers may cause data duplicates when records with the
>      same new record-key arrive (multi-writer-guarantees
>      <https://hudi.apache.org/docs/concurrency_control#multi-writer-guarantees>).
>
> Some background: with OCC, we assume that multiple writers won't write
> data to the same FileID most of the time. If there is a FileID-level
> conflict, the commit will be rolled back. But FileID-level conflict
> detection can't guarantee no duplicates if two records with the same new
> record-key arrive in multiple writers, since the mapping of key to
> bucket is not consistent with the bloom index.
>
> What I plan to do is support lock-free concurrency control with a
> no-duplicates guarantee in Hudi (only for Merge-on-read tables).
>
>    - With a canIndexLogfiles index, multiple writers ingesting data into
>      Merge-on-read tables can only append data to delta logs. This is a
>      lock-free process if we can make sure they don’t write data to the
>      same log file (I plan to create multiple marker files to achieve
>      this). And with the log merge API (preCombine logic in the Payload
>      class), data in log files can be read properly.
>    - Since Hudi already has an index type, the Bucket index, that maps
>      keys to buckets in a consistent way, data duplicates can be
>      eliminated.
>
>
> Thanks,
> Jian Feng
>
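
The read-side idea in the proposal (merging records from several writers'
log files with preCombine-style logic) might look roughly like the sketch
below. It is an illustration only, not Hudi's Payload API: LogRecord,
merge_logs, and the "larger ordering value wins" rule are assumptions made
for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class LogRecord:
    key: str
    ordering: int  # hypothetical ordering field, e.g. an event timestamp
    value: str


def merge_logs(log_files: Iterable[List[LogRecord]]) -> Dict[str, LogRecord]:
    """Combine records from several log files into one record per key.

    Assumed rule: for each key, keep the record with the larger ordering
    value; on a tie, the record seen later replaces the earlier one.
    """
    latest: Dict[str, LogRecord] = {}
    for records in log_files:
        for rec in records:
            current = latest.get(rec.key)
            if current is None or rec.ordering >= current.ordering:
                latest[rec.key] = rec
    return latest


if __name__ == "__main__":
    # Two writers appended to separate log files of the same bucket.
    writer_a = [LogRecord("k1", 1, "from-a"), LogRecord("k2", 5, "only-a")]
    writer_b = [LogRecord("k1", 2, "from-b")]
    for key, rec in sorted(merge_logs([writer_a, writer_b]).items()):
        print(key, rec.value)
```

When ordering values differ, the result does not depend on the order in
which the log files are read, so writers never need to coordinate at write
time; the duplicate key is resolved when the logs are merged.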