Interesting discussion. Can we file a JIRA for option 2? It also seems to make the semantics simpler.

On Wed, Dec 18, 2019 at 11:21 AM Shiyan Xu <[email protected]> wrote:

> Thanks Sivabalan. Exactly, that's what I meant.
>
> I can think of a use case for option 2: a Hudi dataset manages people info
> and is partitioned by birthday. In most cases where people info is updated,
> birthdays do not change (that's why we chose it as the partition field).
> But in some edge cases the birthday was entered wrongly, and we want to
> fix it manually or allow users to update it occasionally. In this case,
> option 2 would be helpful in keeping records in the expected partition, so
> that a query like "show me people who were born after 2000" would work.
>
> I guess a configuration like "MIGRATE_RECORD_PARTITION=true" could help
> achieve both options.
>
> On Wed, Dec 18, 2019 at 10:32 AM Sivabalan <[email protected]> wrote:
>
> > Raymond,
> > The patch <https://github.com/apache/incubator-hudi/pull/1091> which I
> > have put up works differently. If the initial record is in Partition1 and
> > updates are sent to Partition2, we silently update the record in
> > Partition1. I guess you are asking for the opposite, i.e. insert in
> > Partition2 and delete the record in Partition1. I am not sure about the
> > usability of this in general. Let's ask the experts in our group.
> >
> > @vinoth, balaji and others:
> > Do we support both functionalities or just one? If we plan to support
> > both, then it might incur API changes, or we could tackle it with a
> > config as well.
> >
> > Here is the use case.
> > - Insert record1 to partition1 with global bloom.
> > - Update record1 with partition set to partition2 (a different partition
> >   from where the record is present as of now).
> >
> > Option1:
> > Update record1 in Partition1 and do nothing in Partition2.
> > - With global bloom, the primary key is just the record key, and
> >   hence the partition is ignored.
> >
> > Option2:
> > Insert a new record, record1, to Partition2 and delete record1 from
> > Partition1.
> >
> > I have already put up a patch for Option1, but it looks like Raymond is
> > looking for Option2.
> >
> > On Wed, Dec 18, 2019 at 8:48 AM Shiyan Xu <[email protected]> wrote:
> >
> > > Hi Sivabalan,
> > >
> > > Sorry for the late reply. I now see that GLOBAL_BLOOM allows records to
> > > be looked up in different partitions. This is indeed helpful in the
> > > situation where the same record key gets updated on its partition path.
> > >
> > > Now I'm thinking when we "tagLocationBacktoRecords
> > > <https://github.com/apache/incubator-hudi/blob/2745b7552f2f2ee7a61d3ea49139ef2af3ffe13f/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L112>",
> > > we could potentially create a delete operation for the record in the old
> > > partition while keeping the incoming insert operation for it in the new
> > > partition. This is crucial for avoiding duplicate records (with the same
> > > record keys) in the Hudi dataset. Is this some functionality already
> > > implemented? I might have missed some part of the logic from the
> > > codebase. Please kindly point out if I have misunderstood anything.
> > >
> > > Thank you.
> > >
> > > Best,
> > > Raymond
> > >
> > > On Wed, Dec 11, 2019 at 11:16 AM Sivabalan <[email protected]> wrote:
> > >
> > > > Depends on whether you are using regular BLOOM or GLOBAL_BLOOM. May I
> > > > know which one you are talking about?
> > > >
> > > > On Wed, Dec 11, 2019 at 9:12 AM Shiyan Xu <[email protected]> wrote:
> > > >
> > > > > Hi Hudi devs,
> > > > >
> > > > > Upon upsert operations, does Hudi detect a record's partition path
> > > > > change? For the same record, the partition path field may get
> > > > > updated while the record key (the primary id) stays the same; the
> > > > > insert would then result in a duplicate record (based on record
> > > > > key) in the dataset. Is there any relevant logic for this kind of
> > > > > detection and/or clean-up in the codebase?
> > > > >
> > > > > Best,
> > > > > Raymond
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> >
> > --
> > Regards,
> > -Sivabalan
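
For anyone skimming the thread, the difference between Option1 and Option2 can be sketched with a toy in-memory model. This is only an illustration of the semantics discussed above, not Hudi code: the dict-based dataset, the `find_partition` and `upsert` helpers, and the `migrate_record_partition` flag (named after the "MIGRATE_RECORD_PARTITION" idea floated in the thread) are all hypothetical.

```python
from typing import Dict, Optional

# Toy model (assumption): partition path -> {record key -> payload}.
Dataset = Dict[str, Dict[str, dict]]


def find_partition(dataset: Dataset, record_key: str) -> Optional[str]:
    """Look the key up across all partitions, the way a global index would."""
    for partition, records in dataset.items():
        if record_key in records:
            return partition
    return None


def upsert(dataset: Dataset, record_key: str, incoming_partition: str,
           payload: dict, migrate_record_partition: bool = False) -> str:
    """Upsert one record and return the partition it ends up in.

    migrate_record_partition is a hypothetical flag:
      False -> Option1: update in place, ignore the incoming partition.
      True  -> Option2: delete from the old partition, insert into the new one.
    """
    existing = find_partition(dataset, record_key)
    if existing is None:
        # Plain insert: the key is not present anywhere yet.
        dataset.setdefault(incoming_partition, {})[record_key] = payload
        return incoming_partition
    if existing == incoming_partition or not migrate_record_partition:
        # Option1: the record stays where it was first written.
        dataset[existing][record_key] = payload
        return existing
    # Option2: move the record so that only one copy of the key remains.
    del dataset[existing][record_key]
    dataset.setdefault(incoming_partition, {})[record_key] = payload
    return incoming_partition


if __name__ == "__main__":
    ds: Dataset = {}
    upsert(ds, "record1", "partition1", {"v": 1})
    print(upsert(ds, "record1", "partition2", {"v": 2}))        # Option1
    print(upsert(ds, "record1", "partition2", {"v": 3}, True))  # Option2
```

In both modes the record key stays unique across the dataset; the only difference is which partition holds the record afterwards, which is what matters for partition-pruned queries like the birthday example.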
