Re: [QUESTION] Handle record partition change

Shiyan Xu Wed, 18 Dec 2019 19:40:23 -0800

Sure. I can create a JIRA and note down the discussion points there.

On Wed, Dec 18, 2019 at 7:14 PM Vinoth Chandar <[email protected]> wrote:


> Interesting discussion. Can we file a JIRA for option 2? It also seems to
> make the semantics simpler.
>
> On Wed, Dec 18, 2019 at 11:21 AM Shiyan Xu <[email protected]> wrote:
>
> > Thanks Sivabalan. Exactly, that's what I meant.
> > I can think of a use case for option 2: a Hudi dataset manages people's
> > info and is partitioned by birthday. In most cases where a person's info
> > is updated, the birthday does not change (that's why we chose it as the
> > partition field). But in some edge cases the birthday was entered
> > incorrectly, and we want to fix it manually or allow users to update it
> > occasionally. In this case, option 2 would help keep records in the
> > expected partition, so that a query like "show me people who were born
> > after 2000" would work.
> >
> > I guess a configuration like "MIGRATE_RECORD_PARTITION=true" could let
> > us support both options.
> >
> > On Wed, Dec 18, 2019 at 10:32 AM Sivabalan <[email protected]> wrote:
> >
> > > Raymond,
> > >      The patch <https://github.com/apache/incubator-hudi/pull/1091>
> > > which I have put up works differently. If the initial record is in
> > > Partition1 and updates are sent to Partition2, we silently update the
> > > record in Partition1. I guess you are asking for the opposite, i.e.
> > > insert into Partition2 and delete the record in Partition1. I am not
> > > sure about the usability of this in general. Let's ask the experts in
> > > our group.
> > >
> > > @vinoth, balaji and others:
> > > Do we support both behaviors or just one? If we plan to support both,
> > > it might incur API changes, or we could handle it with a config as
> > > well.
> > >
> > > Here is the use case.
> > > - Insert record1 into partition1 with global bloom.
> > > - Update record1 with its partition set to partition2 (a different
> > >   partition from where the record currently resides).
> > >
> > > Option1:
> > > Update record1 in Partition1 and do nothing in Partition2.
> > >    - With global bloom, the primary key is just the record key, so
> > >      the partition is ignored.
> > >
> > > Option2:
> > > Insert a new record, record1, into Partition2 and delete record1 from
> > > Partition1.
> > >
> > > I have already put up a patch for Option1, but it looks like Raymond is
> > > looking for Option2.
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Dec 18, 2019 at 8:48 AM Shiyan Xu <[email protected]> wrote:
> > >
> > > > Hi Sivabalan,
> > > >
> > > > Sorry for the late reply. I now see that GLOBAL_BLOOM allows records
> > > > to be looked up across different partitions. This is indeed helpful
> > > > in the situation where a record with the same key has its partition
> > > > path updated.
> > > >
> > > > Now I'm thinking that in "tagLocationBacktoRecords
> > > > <https://github.com/apache/incubator-hudi/blob/2745b7552f2f2ee7a61d3ea49139ef2af3ffe13f/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L112>",
> > > > we could potentially create a delete operation for the record in the
> > > > old partition while keeping the incoming insert operation for it in
> > > > the new partition. This is crucial for avoiding duplicate records
> > > > (with the same record keys) in the Hudi dataset. Is this functionality
> > > > already implemented? I might have missed some part of the logic in
> > > > the codebase. Please kindly point out any misunderstanding on my part.
> > > >
> > > > Thank you.
> > > >
> > > > Best,
> > > > Raymond
> > > >
> > > > On Wed, Dec 11, 2019 at 11:16 AM Sivabalan <[email protected]> wrote:
> > > >
> > > > > It depends on whether you are using regular BLOOM or GLOBAL_BLOOM.
> > > > > May I know which one you are talking about?
> > > > >
> > > > >
> > > > > On Wed, Dec 11, 2019 at 9:12 AM Shiyan Xu <[email protected]> wrote:
> > > > >
> > > > > > Hi Hudi devs,
> > > > > >
> > > > > > Upon upsert operations, does Hudi detect a change in a record's
> > > > > > partition path? For the same record, the partition path field may
> > > > > > get updated while the record key (the primary id) stays the same;
> > > > > > the insert would then result in a duplicate record (based on
> > > > > > record key) in the dataset. Is there any logic for this kind of
> > > > > > detection and/or clean-up in the codebase?
> > > > > >
> > > > > > Best,
> > > > > > Raymond
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > > -Sivabalan
> > > > >
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>
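
For the JIRA, the two behaviors discussed in this thread can be sketched with a toy in-memory model. This is not Hudi code and does not use any Hudi API: the function `upsert`, the `table` and `index` dictionaries, and the flag `migrate_record_partition` (named after the "MIGRATE_RECORD_PARTITION" suggestion above) are all hypothetical, and the global index is assumed to map each record key to the single partition that currently holds it.

```python
def upsert(table, index, key, partition, value, migrate_record_partition=False):
    """Toy model of an upsert against a global (partition-agnostic) index.

    table: dict mapping partition -> {record key -> value}   (hypothetical)
    index: dict mapping record key -> partition holding it   (hypothetical)
    """
    old_partition = index.get(key)

    if old_partition is None or old_partition == partition:
        # New record, or an update that stays in the same partition.
        table.setdefault(partition, {})[key] = value
        index[key] = partition
    elif migrate_record_partition:
        # Option 2: delete from the old partition, insert into the new one.
        del table[old_partition][key]
        table.setdefault(partition, {})[key] = value
        index[key] = partition
    else:
        # Option 1: update in place in the old partition; the incoming
        # partition path is ignored.
        table[old_partition][key] = value


# Usage: the scenario from the thread, run under Option 2.
table, index = {}, {}
upsert(table, index, "record1", "partition1", {"name": "a"})
upsert(table, index, "record1", "partition2", {"name": "b"},
       migrate_record_partition=True)
```

In both branches the record key appears in exactly one partition afterwards, which is the duplicate-avoidance property raised earlier in the thread; the options differ only in which partition ends up holding the record.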
