Hi Sivabalan,

Sorry for the late reply. I now see that GLOBAL_BLOOM allows a record to be looked up across partitions, not only within the partition of the incoming record. This is indeed helpful when a record keeps the same record key but its partition path changes.
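To check my understanding of the difference, here is a toy model of the two lookup styles. It is only a sketch: the class and method names (IndexLookupSketch, PartitionScopedIndex, GlobalIndex) are made up for illustration and are not Hudi's index API, and real bloom indexes work on file-level filters rather than plain maps.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model only: these types are hypothetical and are not part of Hudi. */
public class IndexLookupSketch {

    /** Keyed by (partition path, record key): a lookup only sees one partition. */
    static final class PartitionScopedIndex {
        private final Map<String, String> partitionByCompositeKey = new HashMap<>();

        void put(String partitionPath, String recordKey) {
            partitionByCompositeKey.put(partitionPath + "|" + recordKey, partitionPath);
        }

        Optional<String> lookup(String partitionPath, String recordKey) {
            return Optional.ofNullable(partitionByCompositeKey.get(partitionPath + "|" + recordKey));
        }
    }

    /** Keyed by record key alone: a lookup finds the record in any partition. */
    static final class GlobalIndex {
        private final Map<String, String> partitionByKey = new HashMap<>();

        void put(String partitionPath, String recordKey) {
            partitionByKey.put(recordKey, partitionPath);
        }

        Optional<String> lookup(String recordKey) {
            return Optional.ofNullable(partitionByKey.get(recordKey));
        }
    }

    public static void main(String[] args) {
        PartitionScopedIndex scoped = new PartitionScopedIndex();
        GlobalIndex global = new GlobalIndex();
        scoped.put("2019/12/10", "id-1");
        global.put("2019/12/10", "id-1");

        // The same key arrives with a new partition path.
        System.out.println("scoped: " + scoped.lookup("2019/12/11", "id-1"));
        System.out.println("global: " + global.lookup("id-1"));
    }
}
```

In this model the partition-scoped lookup misses the existing record, which is how the duplicate would arise, while the key-only lookup still finds it in the old partition.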
Now I'm thinking that when we "tagLocationBacktoRecords <https://github.com/apache/incubator-hudi/blob/2745b7552f2f2ee7a61d3ea49139ef2af3ffe13f/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L112>", we could potentially create a delete operation for the record in the old partition while keeping the incoming insert operation for it in the new partition. This is crucial for avoiding duplicate records (with the same record key) in the Hudi dataset.

Is this functionality already implemented? I might have missed some part of the logic in the codebase. Please point out anything I have misunderstood. Thank you.

Best,
Raymond

On Wed, Dec 11, 2019 at 11:16 AM Sivabalan <[email protected]> wrote:

> Depends on whether you are using regular BLOOM or GLOBAL_BLOOM. May I know
> which one are you talking about?
>
>
> On Wed, Dec 11, 2019 at 9:12 AM Shiyan Xu <[email protected]>
> wrote:
>
> > Hi Hudi devs,
> >
> > Upon upsert operations, does Hudi detect record's partition path change? As
> > for the same record, the partition path field may get updated while the
> > record key (the primary id) stays the same, then the insert would result in
> > duplicate record (based on record key) in the dataset. Is there any
> > relevant logic of this kind of detection and/or clean-up in the codebase?
> >
> > Best,
> > Raymond
> >
>
>
> --
> Regards,
> -Sivabalan
>
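P.S. To make the delete-plus-insert idea concrete, here is a rough sketch of the tagging step I have in mind. Everything in it is an assumption for illustration: PartitionChangeSketch, Op, OpType and tagWithPartitionChange are hypothetical names and do not exist in Hudi, and the maps stand in for whatever the global index lookup returns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposal; none of these types exist in Hudi. */
public class PartitionChangeSketch {

    enum OpType { INSERT, UPDATE, DELETE }

    /** One write operation against a single partition. */
    static final class Op {
        final OpType type;
        final String recordKey;
        final String partitionPath;

        Op(OpType type, String recordKey, String partitionPath) {
            this.type = type;
            this.recordKey = recordKey;
            this.partitionPath = partitionPath;
        }

        @Override
        public String toString() {
            return type + " " + recordKey + " @ " + partitionPath;
        }
    }

    /**
     * existingPartitionByKey: record key -> partition found by a key-only (global) lookup.
     * incomingPartitionByKey: record key -> partition path carried by the incoming record.
     */
    static List<Op> tagWithPartitionChange(Map<String, String> existingPartitionByKey,
                                           Map<String, String> incomingPartitionByKey) {
        List<Op> ops = new ArrayList<>();
        for (Map.Entry<String, String> incoming : incomingPartitionByKey.entrySet()) {
            String key = incoming.getKey();
            String newPartition = incoming.getValue();
            String oldPartition = existingPartitionByKey.get(key);
            if (oldPartition == null) {
                // Key not seen before: plain insert.
                ops.add(new Op(OpType.INSERT, key, newPartition));
            } else if (oldPartition.equals(newPartition)) {
                // Same key, same partition: ordinary update.
                ops.add(new Op(OpType.UPDATE, key, newPartition));
            } else {
                // Partition path changed: remove the old copy, insert the new one.
                ops.add(new Op(OpType.DELETE, key, oldPartition));
                ops.add(new Op(OpType.INSERT, key, newPartition));
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        Map<String, String> existing = new LinkedHashMap<>();
        existing.put("id-1", "2019/12/10");
        Map<String, String> incoming = new LinkedHashMap<>();
        incoming.put("id-1", "2019/12/11");
        for (Op op : tagWithPartitionChange(existing, incoming)) {
            System.out.println(op);
        }
    }
}
```

The point of emitting the delete alongside the insert is that the dataset never ends up holding two live copies of the same record key in different partitions.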
