Sure, I'm working on it. I will add you as a co-author when I create the PR.

On Fri, Mar 25, 2022 at 1:17 AM Vinoth Chandar <vin...@apache.org> wrote:

> +1. Love to be a co-author on the RFC, if you are open to it.
>
> On Mon, Mar 21, 2022 at 12:31 PM 冯健 <fengjian...@gmail.com> wrote:
>
> > Hi team,
> >
> > Optimistic concurrency control (OCC) has some limitations:
> >
> >    - When conflicts do occur, they may waste massive resources on every
> >    attempt (lakehouse-concurrency-control-are-we-too-optimistic
> >    <https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic>).
> >    - Multiple writers may produce duplicate data when records with the
> >    same new record key arrive (multi-writer-guarantees
> >    <https://hudi.apache.org/docs/concurrency_control#multi-writer-guarantees>).
> >
> > Some background: with OCC, we assume that multiple writers won't write
> > data to the same FileID most of the time. If there is a FileID-level
> > conflict, the commit is rolled back. However, FileID-level conflict
> > detection can't guarantee the absence of duplicates when two records with
> > the same new record key arrive in multiple writers, because with the
> > bloom index the key-to-bucket mapping is not consistent.
> >
> > What I plan to do is support lock-free concurrency control with a
> > no-duplicates guarantee in Hudi (only for Merge-On-Read tables).
> >
> >    - With a canIndexLogfiles index, multiple writers ingesting data into
> >    Merge-On-Read tables only append data to delta logs. This is a
> >    lock-free process if we can make sure they don't write data to the
> >    same log file (I plan to create multiple marker files to achieve
> >    this). With the log merge API (the preCombine logic in the Payload
> >    class), data in the log files can be read properly.
> >    - Hudi already has an index type, the Bucket index, that maps keys to
> >    buckets in a consistent way, so data duplicates can be eliminated.
> >
> > Thanks,
> > Jian Feng
> >
>

--
*Jian Feng,冯健*
Shopee | Engineer | Data Infrastructure
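
The two ideas in the quoted proposal (a consistent key-to-bucket mapping, and writers that only append to their own log files, with a preCombine-style merge at read time) can be illustrated with a small sketch. This is a toy in-memory model under stated assumptions, not Hudi code: `bucket_for`, `Table`, `NUM_BUCKETS`, the CRC32 hash, and the "highest ordering value wins" rule are all hypothetical names and choices made for illustration only.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical fixed bucket count, chosen for illustration


def bucket_for(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket deterministically.

    Every writer computes the same bucket for the same key, which is the
    property the proposal relies on (the hash function is an assumption).
    """
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


class Table:
    """Toy stand-in for a Merge-On-Read table: bucket -> writer -> log."""

    def __init__(self):
        self.logs = defaultdict(lambda: defaultdict(list))

    def append(self, writer_id, record_key, ordering_value, payload):
        # Each writer appends only to its own log inside the bucket, so two
        # writers never touch the same log and no lock is modeled here.
        bucket = bucket_for(record_key)
        self.logs[bucket][writer_id].append((record_key, ordering_value, payload))

    def read_bucket(self, bucket):
        # Merge all writers' logs in one bucket. For duplicate keys, keep the
        # record with the highest ordering value (a preCombine-style rule).
        merged = {}
        for log in self.logs[bucket].values():
            for key, order, payload in log:
                if key not in merged or order > merged[key][0]:
                    merged[key] = (order, payload)
        return {key: payload for key, (_, payload) in merged.items()}

    def read(self):
        # A key lives in exactly one bucket, so per-bucket results never
        # overlap and can simply be combined.
        result = {}
        for bucket in list(self.logs):
            result.update(self.read_bucket(bucket))
        return result


table = Table()
table.append("writer-a", "user-1", 1, "v1")  # same new key from two writers
table.append("writer-b", "user-1", 2, "v2")
table.append("writer-b", "user-2", 1, "x")
snapshot = table.read()  # one entry per key after the merge
```

Because both writers route `user-1` to the same bucket, the read-time merge sees both versions together and keeps only one, which is the no-duplicates behavior the proposal is aiming for.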