Sure. I can create a JIRA and note down the discussion points there.

On Wed, Dec 18, 2019 at 7:14 PM Vinoth Chandar <[email protected]> wrote:

> Interesting discussion. We can file a JIRA for option 2? It seems to also
> make the semantics simpler.
>
> On Wed, Dec 18, 2019 at 11:21 AM Shiyan Xu <[email protected]> wrote:
>
> > Thanks Sivabalan. Exactly, that's what I meant.
> > I can think of a use case for option 2: a Hudi dataset manages people info
> > and is partitioned by birthday. In most cases where people info is updated,
> > birthdays do not change (that's why we choose it as the partition
> > field). But in some edge cases a birthday was entered incorrectly and we
> > want to fix it manually, or allow the user to update it occasionally. In this
> > case, option 2 would be helpful in keeping records in the expected
> > partition, so that a query like "show me people who were born after 2000"
> > would work.
> >
> > I guess a configuration like "MIGRATE_RECORD_PARTITION=true" could help
> > achieve both options.
> >
> > On Wed, Dec 18, 2019 at 10:32 AM Sivabalan <[email protected]> wrote:
> >
> > > Raymond,
> > > The patch <https://github.com/apache/incubator-hudi/pull/1091> which I
> > > have put up works differently. If the initial record is in Partition1 and
> > > updates are sent to Partition2, we silently update the record in
> > > Partition1. I guess you are asking for the opposite, i.e. insert in
> > > Partition2 and delete the record in Partition1. I am not sure about the
> > > usability of this in general. Let's ask the experts in our group.
> > >
> > > @vinoth, balaji and others:
> > > Do we support both functionalities or just one? If we plan to support both,
> > > it might incur API changes, or we could handle it with a config as well.
> > >
> > > Here is the use case.
> > > - Insert record1 to partition1 with global bloom.
> > > - Update record1 with partition set to partition2 (a different partition
> > >   from where the record is present as of now).
> > >
> > > Option1:
> > > Update record1 to Partition1 and do nothing in Partition2.
> > > - With global bloom, the primary key is just the record key, so the
> > >   partition is ignored.
> > >
> > > Option2:
> > > Insert a new record, record1, to Partition2 and delete record1 from
> > > Partition1.
> > >
> > > I have already put up a patch for Option1, but it looks like Raymond is
> > > looking for Option2.
> > >
> > > On Wed, Dec 18, 2019 at 8:48 AM Shiyan Xu <[email protected]> wrote:
> > >
> > > > Hi Sivabalan,
> > > >
> > > > Sorry for the late reply. I now see that GLOBAL_BLOOM allows records to be
> > > > looked up in different partitions. This is indeed helpful in the situation
> > > > where a record with the same key has its partition path updated.
> > > >
> > > > Now I'm thinking when we "tagLocationBacktoRecords
> > > > <https://github.com/apache/incubator-hudi/blob/2745b7552f2f2ee7a61d3ea49139ef2af3ffe13f/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L112>",
> > > > we could potentially create a delete operation for the record in the old
> > > > partition while keeping the incoming insert operation for it in the new
> > > > partition. This is crucial for avoiding duplicate records (with the same
> > > > record keys) in the Hudi dataset. Is this functionality already
> > > > implemented? I might have missed some part of the logic in the codebase.
> > > > Please point out anything I have misunderstood.
> > > >
> > > > Thank you.
> > > >
> > > > Best,
> > > > Raymond
> > > >
> > > > On Wed, Dec 11, 2019 at 11:16 AM Sivabalan <[email protected]> wrote:
> > > >
> > > > > Depends on whether you are using regular BLOOM or GLOBAL_BLOOM. May I
> > > > > know which one you are talking about?
> > > > >
> > > > > On Wed, Dec 11, 2019 at 9:12 AM Shiyan Xu <[email protected]> wrote:
> > > > >
> > > > > > Hi Hudi devs,
> > > > > >
> > > > > > Upon upsert operations, does Hudi detect a change in a record's
> > > > > > partition path? For the same record, the partition path field may get
> > > > > > updated while the record key (the primary id) stays the same; the insert
> > > > > > would then result in a duplicate record (based on record key) in the
> > > > > > dataset. Is there any relevant logic for this kind of detection and/or
> > > > > > clean-up in the codebase?
> > > > > >
> > > > > > Best,
> > > > > > Raymond
> > > > >
> > > > > --
> > > > > Regards,
> > > > > -Sivabalan
> > >
> > > --
> > > Regards,
> > > -Sivabalan
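
For the JIRA, the two options discussed in the thread can be sketched roughly as below. This is a toy Python model, not Hudi code: the `upsert` function, the plain dicts standing in for partitions and for the global index lookup, and the `migrate_record_partition` flag (named after the "MIGRATE_RECORD_PARTITION" idea above) are all assumptions made for illustration.

```python
def upsert(partitions, global_index, record_key, incoming_partition, payload,
           migrate_record_partition=False):
    """Upsert one record into a toy partitioned dataset.

    partitions:   dict of partition path -> {record_key: payload}
    global_index: dict of record_key -> partition path currently holding the key
                  (hypothetical stand-in for a lookup across all partitions)
    migrate_record_partition: hypothetical flag; False = Option 1, True = Option 2
    Returns the partition the record ends up in.
    """
    existing_partition = global_index.get(record_key)

    if existing_partition is None or existing_partition == incoming_partition:
        # New key, or an update that stays in the same partition: plain upsert.
        target = incoming_partition
    elif migrate_record_partition:
        # Option 2: delete the record from the old partition and
        # insert the incoming record into the new partition.
        del partitions[existing_partition][record_key]
        target = incoming_partition
    else:
        # Option 1: ignore the incoming partition and update the record
        # where it already lives.
        target = existing_partition

    partitions.setdefault(target, {})[record_key] = payload
    global_index[record_key] = target
    return target


if __name__ == "__main__":
    # Option 1: the update for partition2 is applied in partition1.
    parts, index = {}, {}
    upsert(parts, index, "record1", "partition1", {"birthday": "1999-01-01"})
    upsert(parts, index, "record1", "partition2", {"birthday": "2001-01-01"})
    print(parts)

    # Option 2: the record moves to partition2 and leaves partition1.
    parts, index = {}, {}
    upsert(parts, index, "record1", "partition1", {"birthday": "1999-01-01"})
    upsert(parts, index, "record1", "partition2", {"birthday": "2001-01-01"},
           migrate_record_partition=True)
    print(parts)
```

In both branches the record key stays unique across the dataset; the only difference is which partition holds the record afterwards, which is what matters for the partition-pruned query in Raymond's birthday example.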
