On 2019/04/08 01:41:16, Vinoth Chandar <[email protected]> wrote: 
> Good discussion.. Sorry to jump in late.. (been having some downtime last
> week)
> 
> insert/bulk_insert operations will in fact introduce duplicates if your
> input has duplicates. I would also like to understand what feature of Hudi
> is useful to you in general, since you seem to want duplicates.
> 
> Only two things I can think of which could filter out duplicate records,
> and both apply to duplicates within the same batch only (i.e. you load both
> json files that contain duplicates in the same run):
> 
>  - Either you pass the --filter-dupes option to the DeltaStreamer tool
>  - You have precombining on for inserts (see the sketch below)
> http://hudi.apache.org/configurations.html#combineInput .
> 
> Do any of these apply to you?
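> 
> For reference, a rough sketch of the precombine path from spark-shell,
> assuming an input DataFrame df (the table name and path are made up, and
> the exact option keys can differ between releases):
> 
>     df.write.format("org.apache.hudi")
>       .option("hoodie.table.name", "my_table")
>       .option("hoodie.datasource.write.recordkey.field", "ID")
>       .option("hoodie.datasource.write.precombine.field", "ts")
>       .option("hoodie.datasource.write.operation", "insert")
>       // drop duplicate keys within the incoming batch before the write
>       .option("hoodie.combine.before.insert", "true")
>       .mode("append")
>       .save("/tmp/hudi/my_table")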
> 
> Thanks
> Vinoth
> 
> 
> 
> On Fri, Apr 5, 2019 at 9:10 AM Kabeer Ahmed <[email protected]> wrote:
> 
> > Hi Rahul,
> >
> > I am sorry, I didn't understand the use case properly. Can you please
> > explain with an example? Let me put my version of understanding based on
> > your email.
> > > In the json file, every time I will pass a fixed value for the key field.
> > Are you saying that you will always have only one value for every key?
> > Example: Rahul -> "Some Value"
> >
> > > Currently if I load data like this, only 1 entry per file loads.
> > What do you mean by this line? Do you mean you are currently loading data
> > like this and only 1 entry per file is loading? Isn't that what you are
> > trying to achieve in the line above?
> >
> > > I don't want the same key's values to be skipped while inserting.
> > All you are saying is that you want the same values also repeated for your
> > keys, e.g.: if the Rahul primary_key has "Some Value" inserted 5 times,
> > then you would want that to appear 5 times in your store?
> >
> > In summary: it appears that what you want is to keep all 5 values if
> > someone enters 5 values, even if they are the same. So you need something
> > as below:
> > > | primary_key | Values |
> > > | Rahul | "Some Value", "Some Value", ..... |
> >
> > Let me know if my understanding is correct.
> > Thanks
> > Kabeer.
> >
> > > Dear Omkar/Kabeer
> > > In one of my use cases, think like I don't want updates at all. In the
> > > json file, every time I will pass a fixed value for the key field.
> > > Currently if I load data like this, only 1 entry per file loads. I don't
> > > want the same key's values to be skipped while inserting.
> > > Thanks & Regards
> > > Rahul
> >
> > On Apr 5 2019, at 9:11 am, Unknown wrote:
> > >
> > >
> > > On 2019/04/04 19:48:39, Kabeer Ahmed <[email protected]> wrote:
> > > > Omkar - there might be various reasons to have duplicates, e.g.:
> > handling trades in a given day from a single client, tracking visitor
> > click data to the website, etc.
> > > >
> > > > Rahul - If you can give more details about your requirements, then we
> > can come up with a solution.
> > > > I have never used INSERT & BULK_INSERT at all and I am not sure if
> > these options (insert and bulk_insert) allow the user to specify the
> > logic you are seeking. Without knowing your exact requirement, I can
> > still suggest looking into the option of implementing your own
> > combineAndGetUpdateValue() logic.
> > > > Let's say all your values for a particular key are strings. You could
> > append the string values to the existing values and store them as:
> > > >
> > > > key | Value
> > > > Rahul | Nice
> > > > // when there is another entry, append the new value to the existing
> > one with a comma separator, say.
> > > >
> > > > key | Value
> > > > Rahul | Nice, Person
> > > > When you retrieve the key's values you could then decide how to ship
> > them back to the user as you want - which is something you would know
> > based on your requirement - since your json anyway has multiple values
> > inserted for a key.
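> > > >
> > > > A very rough sketch of such a payload in scala (class and package
> > > > names differ between hudi releases, and the "value" field name here
> > > > is made up for illustration):
> > > >
> > > >     import org.apache.avro.Schema
> > > >     import org.apache.avro.generic.{GenericRecord, IndexedRecord}
> > > >     import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
> > > >     import org.apache.hudi.common.util.Option
> > > >
> > > >     // keeps every value for a key by appending instead of overwriting
> > > >     class AppendingPayload(record: GenericRecord, orderingVal: Comparable[_])
> > > >         extends OverwriteWithLatestAvroPayload(record, orderingVal) {
> > > >
> > > >       override def combineAndGetUpdateValue(current: IndexedRecord,
> > > >           schema: Schema): Option[IndexedRecord] = {
> > > >         val incoming = getInsertValue(schema).get.asInstanceOf[GenericRecord]
> > > >         val existing = current.asInstanceOf[GenericRecord]
> > > >         // append the new value to the stored one, comma separated
> > > >         incoming.put("value", existing.get("value") + ", " + incoming.get("value"))
> > > >         // note: duplicates within one incoming batch go through
> > > >         // preCombine() instead, which would need similar treatment
> > > >         Option.of(incoming)
> > > >       }
> > > >     }
> > > >
> > > > You would then point the writer at the class with
> > > > hoodie.datasource.write.payload.class (if I remember the key right).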
> > > >
> > > > Feel free to reach out if you need help and I will help you as much as
> > I can.
> > > > On Apr 4 2019, at 6:35 pm, Omkar Joshi <[email protected]> wrote:
> > > > > Hi Rahul,
> > > > >
> > > > > Thanks for trying out Hudi!!
> > > > > Any reason why you need to have duplicates in the HUDI dataset? Will you
> > ever
> > > > > be updating it later?
> > > > >
> > > > > Thanks,
> > > > > Omkar
> > > > >
> > > > > On Thu, Apr 4, 2019 at 1:33 AM [email protected] <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Dear All
> > > > > > I am using a COW table with INSERT/BULK_INSERT.
> > > > > > I am loading the data from json files.
> > > > > >
> > > > > > If an existing key in the hudi dataset is loaded again, then only
> > > > > > the new data with that key shows. Can I show both records? (In
> > > > > > INSERT)
> > > > > >
> > > > > > If the same key is there multiple times in a source json file,
> > > > > > then only one record is getting loaded. Can I load duplicate keys
> > > > > > from the same file? (both insert/bulk_insert)
> > > > > >
> > > > > >
> > > > > > Thanks & Regards
> > > > > > Rahul
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > Dear Omkar/Kabeer
> > > In one of my use cases, think like I don't want updates at all. In the
> > > json file, every time I will pass a fixed value for the key field.
> > > Currently if I load data like this, only 1 entry per file loads. I don't
> > > want the same key's values to be skipped while inserting.
> > > Thanks & Regards
> > > Rahul
> > >
> >
> >
> 

Dear Kabeer/Vinoth


For example, I have a file which contains UserName, Transaction_Amount, ID fields.
In this json file I am putting the same value for ID every time, & I mapped this
as the hudi dataset key field.
(Currently all records which will come are new records & I don't have an auto
increment ID in the files which I am getting.)

Suppose I have 4 entries in a json file,
e.g.:
rahul,15,0
kabeer,17,0
vinod,18,0
nishith,16,0

Currently, if I load it normally, only 1 record will be there in the hudi
dataset, as all the keys are 0 (while selecting from the hive table).

I want all 4 entries to be loaded.

@vinoth For this use case I don't want key-based updates, and I just want to
control small files in hadoop using hudi. I want to use only hudi's small file
size control feature and incremental pull.
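
To make the setup concrete, this is roughly the write I am doing today
(table name and path changed; the option keys may differ by hudi version):

    // every row carries ID = 0, so hudi treats all 4 rows as one key
    // and the dataset ends up with a single record
    inputDF.write.format("org.apache.hudi")
      .option("hoodie.table.name", "user_txns")
      .option("hoodie.datasource.write.recordkey.field", "ID")
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .mode("append")
      .save("/tmp/hudi/user_txns")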


Thanks & Regards
Rahul


