On 2019/04/09 06:31:33, rahuledaval...@gmail.com <rahuledaval...@gmail.com>
wrote:
>
>
> On 2019/04/09 06:22:14, rahuledaval...@gmail.com <rahuledaval...@gmail.com>
> wrote:
> >
> >
> > On 2019/04/08 01:41:16, Vinoth Chandar <vin...@apache.org> wrote:
> > > Good discussion.. Sorry to jump in late.. (been having some downtime last
> > > week)
> > >
> > > insert/bulk_insert operations will in fact introduce duplicates if your
> > > input has duplicates. I would also like to understand what feature of Hudi
> > > is useful to you in general, since you seem to want duplicates.
> > >
> > > Only two things I can think of which could filter out duplicate records,
> > > and both apply to duplicates within the same batch only (i.e. you load both
> > > json files that contain duplicates in the same run):
> > >
> > > - Either you pass the --filter-dupes option to the DeltaStreamer tool
> > > - Or you have precombining turned on for inserts (rough sketch below):
> > >   http://hudi.apache.org/configurations.html#combineInput
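> > >
> > > To illustrate the second option with the Java write client (a rough sketch
> > > only - cfg and basePath here are placeholders):
> > >
> > > HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
> > >     .withPath(basePath)
> > >     .combineInput(true, true)   // precombine on insert and on upsert
> > >     .build();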
> > >
> > > Do any of these apply to you..
> > >
> > > Thanks
> > > Vinoth
> > >
> > >
> > >
> > > On Fri, Apr 5, 2019 at 9:10 AM Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > >
> > > > Hi Rahul,
> > > >
> > > > I am sorry, I didn't understand the use case properly. Can you please
> > > > explain with an example? Let me put my version of understanding based on
> > > > your email.
> > > > > In json file, every time i will pass a fixed value for a key field.
> > > > Are you saying that you will always have only one value for every key?
> > > > Example: Rahul -> "Some Value"
> > > >
> > > > > Currently if i load data like this only 1 entry per file only load.
> > > > What do you mean by this line? Do you mean you are currently loading data
> > > > like this and only 1 entry per file is loading? Isn't that what you are
> > > > trying to achieve in the line above?
> > > >
> > > > > I don't want same key's values to be skipped while inserting.
> > > > Are you saying that you want the same values repeated for your keys?
> > > > E.g. if the primary_key Rahul has "Some Value" inserted 5 times, then you
> > > > would want that value to appear 5 times in your store?
> > > >
> > > > In summary: it appears that you want all 5 values stored even if they are
> > > > the same, i.e. something like below:
> > > > > | primary_key | Values |
> > > > > | Rahul | "Some Value", "Some Value", ..... |
> > > >
> > > > Let me know if my understanding is correct.
> > > > Thanks
> > > > Kabeer.
> > > >
> > > > > Dear Omkar/Kabeer
> > > > > In one of my use cases, think of it like I don't want updates at all. In
> > > > > the json file, every time I will pass a fixed value for the key field.
> > > > > Currently if I load data like this, only 1 entry per file gets loaded. I
> > > > > don't want the same key's values to be skipped while inserting.
> > > > > Thanks & Regards
> > > > > Rahul
> > > >
> > > > On Apr 5 2019, at 9:11 am, Unknown wrote:
> > > > >
> > > > >
> > > > > On 2019/04/04 19:48:39, Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > > > > > Omkar - there might be various reasons to have duplicates, e.g. handling
> > > > > > trades in a given day from a single client, tracking visitor click data
> > > > > > on the website, etc.
> > > > > >
> > > > > > Rahul - If you can give more details about your requirements, then we
> > > > > > can come up with a solution.
> > > > > > I have never used INSERT & BULK_INSERT at all and I am not sure if
> > > > > > these options (insert and bulk_insert) allow the user to specify the
> > > > > > logic that you are seeking. Without knowing your exact requirement, I
> > > > > > can still suggest looking into implementing your own
> > > > > > combineAndGetUpdateValue() logic.
> > > > > > Let's say all your values for a particular key are strings. You could
> > > > > > append the incoming string value to the existing value and store them as:
> > > > > >
> > > > > > key | Value
> > > > > > Rahul | Nice
> > > > > > // when there is another entry, append it to the existing value with,
> > > > > > // say, a comma separator:
> > > > > >
> > > > > > key | Value
> > > > > > Rahul | Nice, Person
> > > > > > When you retrieve the key's values you could then decide how to ship them
> > > > > > back to the user - which is something you would know based on your
> > > > > > requirement - since your json anyway carries multiple values for a key.
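> > > > > >
> > > > > > To make this concrete, here is a rough, untested sketch of what such a
> > > > > > custom payload could look like. The class name (AppendingPayload) and the
> > > > > > field name ("Value") are made up for illustration, and the exact
> > > > > > HoodieRecordPayload signatures and package names differ between Hudi
> > > > > > versions, so please treat it as pseudocode rather than a working
> > > > > > implementation:
> > > > > >
> > > > > > import java.util.Optional;
> > > > > >
> > > > > > import org.apache.avro.Schema;
> > > > > > import org.apache.avro.generic.GenericRecord;
> > > > > > import org.apache.avro.generic.IndexedRecord;
> > > > > > // newer releases use the org.apache.hudi package and their own Option type
> > > > > > import com.uber.hoodie.common.model.HoodieRecordPayload;
> > > > > >
> > > > > > // Hypothetical payload that appends the incoming "Value" string to the
> > > > > > // value already stored for the same key, instead of overwriting it.
> > > > > > public class AppendingPayload implements HoodieRecordPayload<AppendingPayload> {
> > > > > >
> > > > > >   // record arriving in this batch (a real payload would keep serialized
> > > > > >   // Avro bytes instead, so that the payload itself is Serializable)
> > > > > >   private final GenericRecord incoming;
> > > > > >
> > > > > >   public AppendingPayload(GenericRecord incoming, Comparable orderingVal) {
> > > > > >     this.incoming = incoming;
> > > > > >   }
> > > > > >
> > > > > >   @Override
> > > > > >   public AppendingPayload preCombine(AppendingPayload another) {
> > > > > >     return this; // no de-duplication within a batch
> > > > > >   }
> > > > > >
> > > > > >   @Override
> > > > > >   public Optional<IndexedRecord> combineAndGetUpdateValue(IndexedRecord current,
> > > > > >       Schema schema) {
> > > > > >     GenericRecord existing = (GenericRecord) current;
> > > > > >     // append with a comma separator, as described above
> > > > > >     incoming.put("Value", existing.get("Value") + ", " + incoming.get("Value"));
> > > > > >     return Optional.of(incoming);
> > > > > >   }
> > > > > >
> > > > > >   @Override
> > > > > >   public Optional<IndexedRecord> getInsertValue(Schema schema) {
> > > > > >     return Optional.of(incoming);
> > > > > >   }
> > > > > > }
> > > > > >
> > > > > > You would then configure the writer to use this payload class (via the
> > > > > > payload class option for whichever write path you use).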
> > > > > >
> > > > > > Feel free to reach out if you need help and I will help you as much as
> > > > > > I can.
> > > > > > On Apr 4 2019, at 6:35 pm, Omkar Joshi <om...@uber.com.INVALID>
> > > > > > wrote:
> > > > > > > Hi Rahul,
> > > > > > >
> > > > > > > Thanks for trying out Hudi!!
> > > > > > > Any reason why you need to have duplicates in the HUDI dataset? Will
> > > > > > > you ever be updating it later?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Omkar
> > > > > > >
> > > > > > > On Thu, Apr 4, 2019 at 1:33 AM rahuledaval...@gmail.com <
> > > > > > > rahuledaval...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Dear All
> > > > > > > > I am using cow table with INSERT/BULK_INSERT.
> > > > > > > > I am loading the data from json files.
> > > > > > > >
> > > > > > > > If an existing key in the hudi dataset is loaded again, then only
> > > > > > > > the new data with that key is showing. Can I show both records?
> > > > > > > > (with INSERT)
> > > > > > > >
> > > > > > > > If the same key is there multiple times in a source json file, then
> > > > > > > > only one record is getting loaded. Can I load duplicate keys from
> > > > > > > > the same file? (both insert/bulk_insert)
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks & Regards
> > > > > > > > Rahul
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > Dear Omkar/Kabeer
> > > > > In one of my use cases, think of it like I don't want updates at all. In
> > > > > the json file, every time I will pass a fixed value for the key field.
> > > > > Currently if I load data like this, only 1 entry per file gets loaded. I
> > > > > don't want the same key's values to be skipped while inserting.
> > > > > Thanks & Regards
> > > > > Rahul
> > > > >
> > > >
> > > >
> > >
> >
> > Dear Kabeer/Vinoth
> >
> >
> > For example, I have a file which contains UserName, Transaction_Amount and ID
> > fields.
> > In this json file I am putting the same value for ID every time, and I mapped
> > this as the hudi dataset key field.
> > (Currently all records which come in are new records, and I don't have an
> > auto-increment ID in the files which I am getting.)
> >
> > Suppose I have 4 entries in a json file,
> > e.g.:
> > rahul,15,0
> > kabeer,17,0
> > vinod,18,0
> > nishith,16,0
> >
> > Currently if I load it normally, only 1 record will be there in the hudi
> > dataset, as all the keys are 0 (when selecting from the hive table).
> >
> > I want all 4 entries to be loaded.
> >
> > @Vinoth For this use case I don't want key-based updates; I just want to
> > control small files in hadoop using hudi. I only want to use hudi's small
> > file size control and incremental pull features.
> >
> >
> > Thanks & Regards
> > Rahul
> >
> >
> >
> >
> Dear Vinoth
>
> As per your suggestion I checked the hoodie.combine.before.insert /
> hoodie.combine.before.upsert properties.
>
> combineInput(on_insert = false, on_update=true)
> Property: hoodie.combine.before.insert, hoodie.combine.before.upsert
> Flag which first combines the input RDD and merges multiple partial records
> into a single record before inserting or updating in DFS
>
> But since the documentation says the default is already false, at first I
> thought there was nothing to change. Anyway, I tried with false set explicitly,
> and now it is inserting duplicate records.
>
> After this, while searching, I found an already-raised issue about this, which
> points at the following code:
> HoodieWriteConfig writeConfig =
> HoodieWriteConfig.newBuilder().combineInput(true, true)
> .withPath(basePath).withAutoCommit(false)
>
> The issue says these default values of true, true need to be changed.
>
> I can see that in the latest code this is still not updated. Please check this.
>
>
> Thanks & Regards
> Rahul P
>
>
>
>
Dear Vinoth/Kabeer
With the hoodie.combine.before.insert property set to false, I am able to insert
duplicate records into the hudi dataset. But if I load the same keys again in the
next load, only the new data with those keys is showing.
e.g. after the 1st insert:
rahul,15,0
kabeer,17,0
vinod,18,0
nishith,16,0
I loaded the same file again (I am using INSERT for this).
The records are now showing like below:
rahul,15,0
rahul,15,0
rahul,15,0
rahul,15,0
How can I avoid this second behaviour? I need the output to be:
rahul,15,0
kabeer,17,0
vinod,18,0
nishith,16,0
rahul,15,0
kabeer,17,0
vinod,18,0
nishith,16,0
Please assist on this.
Note: I am also using the hoodie.parquet.small.file.limit feature, so since my
data volume is small, new data gets appended into the same parquet files.
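In case it helps, here is a simplified sketch of the kind of write configuration I
am describing. It is an illustration only - the Spark datasource API is just one
way to write, and the table, path and field names are placeholders - with the
option keys being the standard Hudi write configs:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// assumes an existing SparkSession named spark
Dataset<Row> input = spark.read().json("hdfs:///data/incoming/*.json");

input.write()
    .format("org.apache.hudi")                              // "com.uber.hoodie" on older releases
    .option("hoodie.table.name", "my_table")
    .option("hoodie.datasource.write.operation", "insert")
    .option("hoodie.datasource.write.recordkey.field", "ID")
    .option("hoodie.datasource.write.precombine.field", "Transaction_Amount")
    .option("hoodie.combine.before.insert", "false")         // keep duplicates within a batch
    .option("hoodie.parquet.small.file.limit", "134217728")  // small file handling, in bytes
    .mode(SaveMode.Append)
    .save("hdfs:///data/hudi/my_table");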
Thanks & Regards
Rahul