+1. Love to be a co-author on the RFC, if you are open to it.

On Mon, Mar 21, 2022 at 12:31 PM 冯健 <[email protected]> wrote:
> Hi team,
>
> The situation is that Optimistic Concurrency Control (OCC) has some
> limitations:
>
> - When conflicts do occur, they may waste massive resources on every
>   attempt (lakehouse-concurrency-control-are-we-too-optimistic
>   <https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic>).
> - Multiple writers may cause data duplicates when records with the same
>   new record key arrive (multi-writer-guarantees
>   <https://hudi.apache.org/docs/concurrency_control#multi-writer-guarantees>).
>
> Some background: with OCC, we assume multiple writers won't write data to
> the same FileID most of the time; if there is a FileID-level conflict, the
> commit is rolled back. FileID-level conflict detection cannot guarantee
> the absence of duplicates when two records with the same new record key
> arrive via multiple writers, since the key-to-bucket mapping is not
> consistent with the bloom index.
>
> What I plan to do is support lock-free concurrency control with a
> no-duplicates guarantee in Hudi (only for Merge-On-Read tables):
>
> - With a canIndexLogfiles index, multiple writers ingesting data into
>   Merge-On-Read tables can only append data to delta logs. This is a
>   lock-free process if we can make sure they don't write data to the same
>   log file (the plan is to create multiple marker files to achieve this).
>   And with the log merge API (the preCombine logic in the Payload class),
>   data in log files can be read properly.
> - Since Hudi already has an index type like the Bucket index, which maps
>   keys to buckets in a consistent way, data duplicates can be eliminated.
>
> Thanks,
> Jian Feng
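To illustrate the second point, here is a minimal sketch of why a consistent key-to-bucket mapping (as in a Bucket index) rules out cross-writer duplicates. The bucket count and hash choice below are illustrative assumptions, not Hudi's actual implementation:

```python
import hashlib

NUM_BUCKETS = 8  # hypothetical fixed bucket count for the table/partition


def bucket_for(record_key: str) -> int:
    """Deterministically map a record key to a bucket (file group)."""
    digest = hashlib.md5(record_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


# Two independent writers ingesting the same new record key compute the
# same bucket, so their records land in the same file group and the log
# merge (preCombine) step can deduplicate them at read/compaction time.
writer_a = bucket_for("order-12345")
writer_b = bucket_for("order-12345")
assert writer_a == writer_b
```

With a bloom index, by contrast, a brand-new key can be routed to different file groups by different writers, so no later merge within a single file group can see both copies.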
