Good discussion. Sorry to jump in late (I was dealing with some downtime last week).
insert/bulk_insert operations will in fact introduce duplicates if your input has duplicates. I would also like to understand which feature of Hudi is useful to you in general, since you seem to want duplicates. Only two things I can think of could filter out duplicate records, and both apply only to duplicates within the same batch (i.e. you load both json files that contain duplicates in the same run):

- you pass the --filter-dupes option to the DeltaStreamer tool, or
- you have precombining enabled for inserts: http://hudi.apache.org/configurations.html#combineInput

Does either of these apply to you?

Thanks
Vinoth

On Fri, Apr 5, 2019 at 9:10 AM Kabeer Ahmed <[email protected]> wrote:
> Hi Rahul,
>
> I am sorry, I didn't understand the use case properly. Can you please explain with an example? Let me put down my understanding based on your email.
>
> > In the json file, every time I will pass a fixed value for the key field.
> Are you saying that you will always have only one value for every key? Example: Rahul -> "Some Value"
>
> > Currently if I load data like this, only one entry per file gets loaded.
> What do you mean by this line? Do you mean that you are currently loading data like this and only one entry per file is loading? Isn't that what you are trying to achieve in the line above?
>
> > I don't want the same key's values to be skipped while inserting.
> Are you saying that you want identical values repeated under your keys? E.g. if the primary_key Rahul has "Some Value" inserted 5 times, you would want it to appear 5 times in your store?
>
> In summary: it appears that you want all entries kept, even when they are the same. So you need something like:
>
> | primary_key | Values                            |
> | Rahul       | "Some Value", "Some Value", ..... |
>
> Let me know if my understanding is correct.
> Thanks
> Kabeer.
>
> > Dear Omkar/Kabeer
> >
> > In one of my use cases, think of it like I don't want updates at all.
> > In the json file, every time I will pass a fixed value for the key field. Currently if I load data like this, only one entry per file gets loaded. I don't want the same key's values to be skipped while inserting.
> >
> > Thanks & Regards
> > Rahul
> >
> > On Apr 5 2019, at 9:11 am, Unknown wrote:
> >
> > On 2019/04/04 19:48:39, Kabeer Ahmed <[email protected]> wrote:
> > > Omkar - there might be various reasons to keep duplicates, e.g. handling all trades in a given day from a single client, tracking visitor click data for a website, etc.
> > >
> > > Rahul - if you can give more details about your requirements, we can come up with a solution. I have never used INSERT & BULK_INSERT, and I am not sure those options let the user specify the logic you are seeking. Without knowing your exact requirement, I can still suggest looking into implementing your own combineAndGetUpdateValue() logic.
> > > Let's say all your values for a particular key are strings. You could append new string values to the existing value and store them as:
> > >
> > > key   | Value
> > > Rahul | Nice
> > >
> > > // when another entry arrives, append it to the existing value with a comma separator, say:
> > >
> > > key   | Value
> > > Rahul | Nice, Person
> > >
> > > When you retrieve the key's values you could then decide how to ship them back to the user - something you would know based on your requirement, since your json anyway allows multiple ways to insert values for a key.
> > >
> > > Feel free to reach out if you need help and I will help you as much as I can.
> > > On Apr 4 2019, at 6:35 pm, Omkar Joshi <[email protected]> wrote:
> > > > Hi Rahul,
> > > >
> > > > Thanks for trying out Hudi!!
> > > > Any reason why you need to have duplicates in the Hudi dataset? Will you ever be updating it later?
> > > > Thanks,
> > > > Omkar
> > > >
> > > > On Thu, Apr 4, 2019 at 1:33 AM [email protected] <[email protected]> wrote:
> > > > > Dear All,
> > > > > I am using a COW table with INSERT/BULK_INSERT, and I am loading the data from json files.
> > > > >
> > > > > If an existing key in the Hudi dataset is loaded again, only the new data for that key shows up. Can I keep both records? (with INSERT)
> > > > >
> > > > > If the same key appears multiple times in a source json file, only one record for that key gets loaded. Can I load duplicate keys from the same file? (with both INSERT and BULK_INSERT)
> > > > >
> > > > > Thanks & Regards
> > > > > Rahul
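The two dedup paths Vinoth mentions earlier in the thread could look roughly like this. This is a sketch, not a verified command: `--filter-dupes` is the flag he refers to, and `hoodie.combine.before.insert` plus the precombine field are the configs behind the link he posted, but the class/package names follow 2019-era Hudi packaging and may differ by version, and the jar name, paths, table name, and source properties file are hypothetical placeholders.

```shell
# Option 1: let DeltaStreamer drop duplicates within the incoming batch.
# --filter-dupes is the relevant flag; everything else is a hypothetical
# invocation for illustration only.
spark-submit \
  --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
  hoodie-utilities-bundle.jar \
  --storage-type COPY_ON_WRITE \
  --op INSERT \
  --filter-dupes \
  --target-base-path /data/hudi/my_table \
  --target-table my_table \
  --props /etc/hudi/dfs-source.properties

# Option 2: precombine on insert via config. Records in the same batch that
# share a key are combined before writing, keeping the one with the highest
# precombine-field value:
#   hoodie.combine.before.insert=true
#   hoodie.datasource.write.precombine.field=<your ordering field>
```

Note that both paths deduplicate; neither of them helps Rahul, who wants duplicates retained, which is exactly Vinoth's point.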
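Kabeer's append idea maps onto Hudi's record-payload hook, combineAndGetUpdateValue(). A real implementation would extend Hudi's HoodieRecordPayload and operate on Avro records; the self-contained sketch below only demonstrates the merge semantics he describes, on plain strings and an in-memory map. All class and method names here are hypothetical stand-ins, not Hudi API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the merge semantics from Kabeer's email: instead of a new value
// replacing the old one, append it with a comma separator. In real Hudi this
// logic would live in a custom HoodieRecordPayload's combineAndGetUpdateValue()
// (and in preCombine() for duplicates within a single batch); here it is
// simulated on a plain map purely for illustration.
public class AppendingPayloadSketch {

    // What the combine step would do for one key:
    // keep the current value and append the incoming one.
    static String combine(String currentValue, String incomingValue) {
        if (currentValue == null || currentValue.isEmpty()) {
            return incomingValue;
        }
        return currentValue + ", " + incomingValue;
    }

    // A toy "table": upsert with appending semantics instead of replacement.
    static void upsert(Map<String, String> table, String key, String value) {
        table.put(key, combine(table.get(key), value));
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        upsert(table, "Rahul", "Nice");
        upsert(table, "Rahul", "Person");
        // Both inserts for the same key are retained:
        System.out.println(table.get("Rahul")); // prints "Nice, Person"
    }
}
```

As Kabeer notes, how the stored "Nice, Person" string is split back into individual values on read is up to the application.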
