+1. I'd love to be a co-author on the RFC, if you are open to it.

On Mon, Mar 21, 2022 at 12:31 PM 冯健 <fengjian...@gmail.com> wrote:

> Hi team,
>
> The situation is that Optimistic Concurrency Control (OCC) has some
> limitations:
>
>    - When conflicts do occur, they may waste massive resources on every
>      attempt (lakehouse-concurrency-control-are-we-too-optimistic
>      <https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic>).
>    - Multiple writers may cause data duplicates when records with the same
>      new record key arrive (multi-writer-guarantees
>      <https://hudi.apache.org/docs/concurrency_control#multi-writer-guarantees>).
>
> Some background: with OCC, we assume multiple writers won't write data to
> the same FileID most of the time; if there is a FileID-level conflict, the
> commit is rolled back. But FileID-level conflict detection cannot guarantee
> no duplicates when two records with the same new record key arrive via
> multiple writers, since the key-to-bucket mapping is not consistent under
> the bloom index.
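To make the contrast concrete, here is a minimal sketch of a consistent key-to-bucket mapping (the idea behind a bucket-style index): the bucket is a pure function of the record key, so independent writers always agree on where a new key belongs. The class and method names are illustrative, not Hudi's actual API.

```java
// Hypothetical sketch: a deterministic key -> bucket mapping, assuming a
// fixed bucket count per partition. Unlike a bloom-index lookup, the result
// depends only on the key, so two writers can never disagree.
public final class BucketMapping {
    private final int numBuckets;  // fixed up front per partition

    public BucketMapping(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    /** Every writer computes the same bucket for the same record key. */
    public int bucketFor(String recordKey) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        return Math.floorMod(recordKey.hashCode(), numBuckets);
    }
}
```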
>
> What I plan to do is support lock-free concurrency control with a
> no-duplicates guarantee in Hudi (only for Merge-On-Read tables):
>
>    - With a canIndexLogfiles index, multiple writers ingesting data into
>      Merge-On-Read tables can only append data to delta logs. This is a
>      lock-free process if we can make sure they don't write data to the
>      same log file (I plan to create multiple marker files to achieve
>      this). And with the log merge API (the preCombine logic in the
>      Payload class), data in log files can be read properly.
>    - Since Hudi already has an index type, the Bucket index, which maps
>      keys to buckets in a consistent way, data duplicates can be
>      eliminated.
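The log-merge step above can be sketched as a fold over delta-log records, assuming a simple last-writer-wins rule on an ordering field. The `LogRecord` type and `merge` helper here are hypothetical stand-ins, not Hudi's `HoodieRecordPayload` API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the preCombine-style merge: when log files from
// several writers are read, records sharing a key are combined by keeping
// the one with the highest ordering value.
public final class LogMerge {
    // Illustrative record shape; not Hudi's actual payload class.
    public static final class LogRecord {
        public final String key;
        public final long orderingVal;  // e.g. an event timestamp
        public final String value;

        public LogRecord(String key, long orderingVal, String value) {
            this.key = key;
            this.orderingVal = orderingVal;
            this.value = value;
        }
    }

    /** Merge delta-log records: for duplicate keys, keep the record with
     *  the highest ordering value (last-writer-wins). */
    public static Map<String, LogRecord> merge(List<LogRecord> logRecords) {
        Map<String, LogRecord> merged = new HashMap<>();
        for (LogRecord r : logRecords) {
            // Map.merge calls the lambda with (existing, incoming).
            merged.merge(r.key, r, (a, b) -> a.orderingVal >= b.orderingVal ? a : b);
        }
        return merged;
    }
}
```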
>
>
> Thanks,
> Jian Feng
>
