Hi Sivabalan,

Sorry for the late reply. I now see that GLOBAL_BLOOM allows records to be
looked up across partitions. This is indeed helpful when a record's
partition path changes while its record key stays the same.

Now I'm thinking that in "tagLocationBacktoRecords
<https://github.com/apache/incubator-hudi/blob/2745b7552f2f2ee7a61d3ea49139ef2af3ffe13f/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L112>",
we could potentially create a delete operation for the record in the old
partition while keeping the incoming insert operation for it in the new
partition. This is crucial for avoiding duplicate records (with the same
record key) in the Hudi dataset. Is this functionality already
implemented? I might have missed some part of the logic in the codebase.
Please point out anything I have misunderstood.
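
To make the idea concrete, here is a rough sketch in Python. It is not Hudi
code: Record, Op, tag_with_partition_change and the plain dict standing in
for the global index are all hypothetical names I made up for illustration.
It only shows the delete-in-old-partition plus insert-in-new-partition
behavior I have in mind.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    key: str        # record key (the primary id)
    partition: str  # partition path carried by the incoming record


@dataclass(frozen=True)
class Op:
    kind: str       # "insert", "update", or "delete"
    key: str
    partition: str  # partition the operation applies to


def tag_with_partition_change(
    incoming: Iterable[Record], global_index: Dict[str, str]
) -> List[Op]:
    """Turn incoming records into operations.

    global_index is a hypothetical stand-in for a global lookup that maps
    each record key to the partition where it currently lives.
    """
    ops: List[Op] = []
    for rec in incoming:
        old_partition = global_index.get(rec.key)
        if old_partition is None:
            # Key not seen before: plain insert.
            ops.append(Op("insert", rec.key, rec.partition))
        elif old_partition == rec.partition:
            # Same key, same partition: regular update.
            ops.append(Op("update", rec.key, rec.partition))
        else:
            # Partition path changed: delete the stale copy, then insert
            # into the new partition so the key is never duplicated.
            ops.append(Op("delete", rec.key, old_partition))
            ops.append(Op("insert", rec.key, rec.partition))
    return ops
```

For example, if key "id-1" currently lives in partition "2019/12/10" and
arrives with partition "2019/12/11", the sketch yields a delete for
"2019/12/10" followed by an insert for "2019/12/11".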

Thank you.

Best,
Raymond

On Wed, Dec 11, 2019 at 11:16 AM Sivabalan <[email protected]> wrote:

> Depends on whether you are using regular BLOOM or GLOBAL_BLOOM. May I know
> which one are you talking about?
>
>
> On Wed, Dec 11, 2019 at 9:12 AM Shiyan Xu <[email protected]>
> wrote:
>
> > Hi Hudi devs,
> >
> > Upon upsert operations, does Hudi detect record's partition path change?
> > As for the same record, the partition path field may get updated while the
> > record key (the primary id) stays the same, then the insert would result
> > in duplicate record (based on record key) in the dataset. Is there any
> > relevant logic of this kind of detection and/or clean-up in the codebase?
> >
> > Best,
> > Raymond
> >
>
>
> --
> Regards,
> -Sivabalan
>
