Re: [QUESTION] Handle record partition change

Vinoth Chandar Wed, 18 Dec 2019 19:15:11 -0800

Interesting discussion. Can we file a JIRA for option 2? It also seems to
make the semantics simpler.


On Wed, Dec 18, 2019 at 11:21 AM Shiyan Xu <[email protected]>
wrote:

> Thanks Sivabalan. Exactly, that's what I meant.
> I can think of a use case for option 2: a Hudi dataset manages people info
> and is partitioned by birthday. In most cases where people info is updated,
> the birthday does not change (that's why we chose it as the partition
> field). But in some edge cases the birthday was entered incorrectly, and we
> want to fix it manually or allow users to update it occasionally. In this
> case, option 2 would help keep records in the expected partition, so that
> a query like "show me people who were born after 2000" would work.
>
> I guess a configuration like "MIGRATE_RECORD_PARTITION=true" could help
> achieve both options.
>
> On Wed, Dec 18, 2019 at 10:32 AM Sivabalan <[email protected]> wrote:
>
> > Raymond,
> >      The patch <https://github.com/apache/incubator-hudi/pull/1091> which I
> > have put up works differently. If the initial record is in Partition1 and
> > updates are sent to Partition2, we silently update the record in
> > Partition1. I guess you are asking for the opposite, i.e. insert into
> > Partition2 and delete the record in Partition1. I am not sure about the
> > usability of this in general. Let's ask the experts in our group.
> >
> > @vinoth, balaji and others:
> > Do we support both functionalities or just one? If we plan to support both,
> > it might incur API changes, or we could handle it with a config as well.
> >
> > Here is the use case.
> > - Insert record1 into partition1 with global bloom.
> > - Update record1 with its partition set to partition2 (a different
> > partition from the one where the record currently lives).
> >
> > Option1:
> > Update record1 in Partition1 and do nothing in Partition2.
> >    - With global bloom, the primary key is just the record key, so the
> > partition is ignored.
> >
> > Option2:
> > Insert a new record, record1, into Partition2 and delete record1 from
> > Partition1.
> >
> > I have already put up a patch for Option1, but it looks like Raymond is
> > looking for Option2.
> >
> >
> >
> >
> >
> > On Wed, Dec 18, 2019 at 8:48 AM Shiyan Xu <[email protected]>
> > wrote:
> >
> > > Hi Sivabalan,
> > >
> > > Sorry for the late reply. I now see that GLOBAL_BLOOM allows records to
> > > be looked up across partitions. This is indeed helpful in the situation
> > > where a record with the same key has its partition path updated.
> > >
> > > Now I'm thinking that in "tagLocationBacktoRecords
> > > <https://github.com/apache/incubator-hudi/blob/2745b7552f2f2ee7a61d3ea49139ef2af3ffe13f/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L112>"
> > > we could potentially create a delete operation for the record in the old
> > > partition while keeping the incoming insert operation for it in the new
> > > partition. This is crucial for avoiding duplicate records (with the same
> > > record key) in the Hudi dataset. Is this functionality already
> > > implemented? I might have missed some part of the logic in the codebase.
> > > Please kindly point out anything I have misunderstood.
> > >
> > > Thank you.
> > >
> > > Best,
> > > Raymond
> > >
> > > On Wed, Dec 11, 2019 at 11:16 AM Sivabalan <[email protected]> wrote:
> > >
> > > > Depends on whether you are using regular BLOOM or GLOBAL_BLOOM. May I
> > > > know which one you are talking about?
> > > >
> > > >
> > > > On Wed, Dec 11, 2019 at 9:12 AM Shiyan Xu <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Hudi devs,
> > > > >
> > > > > Upon upsert operations, does Hudi detect a change in a record's
> > > > > partition path? For the same record, the partition path field may
> > > > > get updated while the record key (the primary id) stays the same;
> > > > > the insert would then result in a duplicate record (based on record
> > > > > key) in the dataset. Is there any logic for this kind of detection
> > > > > and/or clean-up in the codebase?
> > > > >
> > > > > Best,
> > > > > Raymond
> > > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> > > >
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>
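
To make the two behaviours discussed in this thread concrete, here is a
minimal sketch of Option1 and Option2 over a toy in-memory table. It is not
Hudi code and does not use any Hudi API: the class ToyGlobalIndexTable, its
methods, and the migrate_record_partition flag (modelled on the
"MIGRATE_RECORD_PARTITION=true" idea floated above) are all hypothetical
names chosen for illustration. The dictionary standing in for the global
index only mimics the property that lookups go by record key alone.

```python
from collections import defaultdict


class ToyGlobalIndexTable:
    """Toy model of a table with a global (key-only) index.

    Hypothetical illustration only; none of these names exist in Hudi.
    """

    def __init__(self, migrate_record_partition=False):
        # Assumed flag: False behaves like Option1, True like Option2.
        self.migrate_record_partition = migrate_record_partition
        # partition path -> {record key: payload}
        self.partitions = defaultdict(dict)
        # record key -> partition path where the record currently lives
        self.global_index = {}

    def upsert(self, record_key, partition, payload):
        current = self.global_index.get(record_key)
        if current is None:
            # Key not seen before: plain insert into the incoming partition.
            self.partitions[partition][record_key] = payload
            self.global_index[record_key] = partition
        elif current == partition or not self.migrate_record_partition:
            # Option1: update the record where it already lives and
            # ignore the incoming partition path.
            self.partitions[current][record_key] = payload
        else:
            # Option2: delete from the old partition, insert into the new
            # one, so the key never exists in two partitions at once.
            del self.partitions[current][record_key]
            self.partitions[partition][record_key] = payload
            self.global_index[record_key] = partition

    def locate(self, record_key):
        """Return the partition currently holding the key, or None."""
        return self.global_index.get(record_key)
```

With the flag off, upserting record1 first into partition1 and then with
partition2 leaves the record in partition1 with the new payload. With the
flag on, the same two calls leave it only in partition2. In both modes a
record key appears in at most one partition, which is the duplicate-avoidance
property the original question asked about.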
