https://github.com/apache/incubator-hudi/pull/634
Rahul, I do see what you mean. The defaults documented are for the write client and they are correct. I think it makes sense to change the defaults for DataSource and DeltaStreamer inserts. We can discuss the pros/cons on the PR?

On Tue, Apr 9, 2019 at 10:11 AM Vinoth Chandar <[email protected]> wrote:

> Hi Rahul,
>
> +1 to Kabeer's suggestion. You can just generate a UUID as a new key and issue upserts. It will also help you identify duplicates eventually.
>
> >> @vinoth For this use case I don't want key-based updates, and I just want to control small files in Hadoop using Hudi. I want to use only Hudi's small file size control feature and incremental pull.
>
> For your use case, I think you should just use the insert operation (which totally bypasses the index lookup/update), not upsert, and set combine.on.insert to false (please see the docs for the exact property name).
>
> On changing defaults, I still think the defaults make sense:
>
>     private static final String COMBINE_BEFORE_INSERT_PROP = "hoodie.combine.before.insert";
>     private static final String DEFAULT_COMBINE_BEFORE_INSERT = "false";
>
> We turn off combining by default for the insert operation. Please raise a JIRA if that's not working for you out of the box.
>
> Thanks
> Vinoth

On Tue, Apr 9, 2019 at 8:17 AM Kabeer Ahmed <[email protected]> wrote:

> Hi Rahul,
>
> Thank you for spelling out the example. Isn't your example easy if we really switch the primary key to, say, the names (kabeer/rahul etc.)? I am sure your example must be a much simplified version of the task you actually have at hand.
> If that is really not possible, then I would consider something like the below: a compound key built from ID and name.
>
>     name, amount, ID, IDwithName (compound key)
>     rahul, 15, 0, 0rahul
>     kabeer, 17, 0, 0kabeer
>     vinod, 18, 0, ....
>     nishith, 16, 0, ....
>
> This will help you with all further inserts and updates. A further update would then be based on 0rahul, 0kabeer etc.; you will have provided Hudi with unique keys and you get the desired results. This is a common way we achieve these results when a need similar to yours arises.
> Thanks
> Kabeer

On Apr 9, 2019, at 7:31 am, Unknown wrote:

On 2019/04/09 06:22:14, [email protected] <[email protected]> wrote:

On 2019/04/08 01:41:16, Vinoth Chandar <[email protected]> wrote:

> Good discussion. Sorry to jump in late (been having some downtime last week).
>
> insert/bulk_insert operations will in fact introduce duplicates if your input has duplicates. I would also like to understand what feature of Hudi is useful to you in general, since you seem to want duplicates.
>
> Only two things I can think of could filter out duplicate records, and both apply to duplicates within the same batch only (i.e. you load both json files that contain duplicates in the same run):
>
> - Either you pass the -filter-dupes option to the DeltaStreamer tool
> - Or you have precombining on for inserts: http://hudi.apache.org/configurations.html#combineInput
>
> Do any of these apply to you?
>
> Thanks
> Vinoth
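Concretely, the suggestions above (a generated UUID or compound record key, the insert operation, and combine-before-insert left off) add up to a DataSource write roughly like the sketch below. The table name, column names and path are placeholders, and the format id shown is the later org.apache.hudi one; the incubator releases discussed in this thread used com.uber.hoodie.

    // Sketch only: tag each row with a surrogate UUID record key so rows that share a
    // business key are all kept, and use the "insert" operation so Hudi skips the index
    // lookup and (with combining disabled) does not de-duplicate the batch.
    // "uuid_key", "ts", "dt", "my_table" and basePath are placeholder names.
    import static org.apache.spark.sql.functions.expr;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public class InsertWithUuidKey {
      static void write(Dataset<Row> inputDf, String basePath) {
        Dataset<Row> withKey = inputDf.withColumn("uuid_key", expr("uuid()"));

        withKey.write()
            .format("org.apache.hudi")                                   // "com.uber.hoodie" on incubator releases
            .option("hoodie.table.name", "my_table")
            .option("hoodie.datasource.write.operation", "insert")       // bypass the upsert/index path
            .option("hoodie.datasource.write.recordkey.field", "uuid_key")
            .option("hoodie.datasource.write.partitionpath.field", "dt") // partition column, placeholder
            .option("hoodie.datasource.write.precombine.field", "ts")    // assumes a ts column exists
            .option("hoodie.combine.before.insert", "false")             // keep duplicate business keys
            .mode(SaveMode.Append)
            .save(basePath);
      }
    }

Kabeer's compound-key variant is the same shape: derive the key column with concat(col("ID"), col("UserName")) instead of uuid() and point hoodie.datasource.write.recordkey.field at that column.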
On Fri, Apr 5, 2019 at 9:10 AM Kabeer Ahmed <[email protected]> wrote:

> Hi Rahul,
> I am sorry, I didn't understand the use case properly. Can you please explain it with an example? Let me put down my version of the understanding based on your email.
>
> >> In the json file, every time I will pass a fixed value for the key field.
>
> Are you saying that you will always have only one value for every key? Example: Rahul -> "Some Value"
>
> >> Currently if I load data like this, only 1 entry per file loads.
>
> What do you mean by this line? Do you mean you are currently loading data like this and only 1 entry per file is loading? Isn't that what you are trying to achieve in the line above?
>
> >> I don't want the same key's values to be skipped while inserting.
>
> So you are saying that you want the same values also repeated under your keys, e.g. if the Rahul primary_key has "Some Value" inserted 5 times, then you would want that to appear 5 times in your store?
>
> In summary: it appears that what you want is that if someone enters 5 values, all of them are kept even if they are the same. So you need something like the below:
>
>     | primary_key | Values                            |
>     | Rahul       | "Some Value", "Some Value", ..... |
>
> Let me know if my understanding is correct.
> Thanks
> Kabeer
>
> >> Dear Omkar/Kabeer,
> >> In one of my use cases, think of it as if I don't want updates at all. In the json file, every time I will pass a fixed value for the key field. Currently if I load data like this, only 1 entry per file loads. I don't want the same key's values to be skipped while inserting.
> >> Thanks & Regards
> >> Rahul

On Apr 5, 2019, at 9:11 am, Unknown wrote:

On 2019/04/04 19:48:39, Kabeer Ahmed <[email protected]> wrote:

> Omkar - there might be various reasons to have duplicates, e.g. handling a single client's trades in a given day, tracking visitor click data on a website, etc.
>
> Rahul - if you can give more details about your requirements, then we can come up with a solution.
> I have never used INSERT & BULK_INSERT at all, and I am not sure whether these options (insert and bulk_insert) allow the user to specify the logic that you are seeking. Without knowing your exact requirement, I can still suggest looking into implementing your own combineAndGetUpdateValue() logic.
> Let's say all your values for a particular key are strings. You could append the incoming string values to the existing values and store them as:
>
>     key   | Value
>     Rahul | Nice
>
> When another entry arrives, append it to the existing value with a comma separator, say:
>
>     key   | Value
>     Rahul | Nice, Person
>
> When you retrieve the key's values you can then decide how to ship them back to the user - which is something you would know based on your requirement - since your json anyway has multiple ways of inserting values for a key.
> Feel free to reach out if you need help and I will help you as much as I can.
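Kabeer's combineAndGetUpdateValue() idea would look roughly like the sketch below: a custom payload that appends the incoming value to the stored one instead of overwriting it. The class, the "value" field and the imports are illustrative only; the exact HoodieRecordPayload signatures and packages (java.util.Optional vs. Hudi's own Option, com.uber.hoodie vs. org.apache.hudi) differ between the incubator release discussed here and later ones.

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.IndexedRecord;
    import org.apache.hudi.avro.HoodieAvroUtils;
    import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
    import org.apache.hudi.common.util.Option;

    // Rough sketch: keep both the stored and the incoming "value" instead of letting
    // the latest record win. "value" is a placeholder field name.
    public class AppendingPayload extends OverwriteWithLatestAvroPayload {

      public AppendingPayload(GenericRecord record, Comparable orderingVal) {
        super(record, orderingVal);
      }

      @Override
      public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
          throws IOException {
        // recordBytes holds the incoming record as serialized by the parent payload
        GenericRecord incoming = (GenericRecord) HoodieAvroUtils.bytesToAvro(recordBytes, schema);
        GenericRecord existing = (GenericRecord) currentValue;

        Object oldVal = existing.get("value");
        Object newVal = incoming.get("value");
        // append with a comma separator, as suggested above
        incoming.put("value", oldVal == null ? newVal : oldVal + ", " + newVal);
        return Option.of((IndexedRecord) incoming);
      }
    }

Such a payload would then be wired in through the payload class setting (hoodie.datasource.write.payload.class on the DataSource path), assuming that option exists in the release being used.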
On Apr 4, 2019, at 6:35 pm, Omkar Joshi <[email protected]> wrote:

> Hi Rahul,
>
> Thanks for trying out Hudi!
> Any reason why you need to have duplicates in the Hudi dataset? Will you ever be updating it later?
>
> Thanks,
> Omkar

On Thu, Apr 4, 2019 at 1:33 AM [email protected] <[email protected]> wrote:

> Dear All,
> I am using a COW table with INSERT/BULK_INSERT.
> I am loading the data from json files.
>
> If an existing key in the Hudi dataset is loaded again, then only the new data with that key shows up. Can I keep both records? (with INSERT)
>
> If the same key appears multiple times in a source json file, then only one of them gets loaded. Can I load duplicate keys from the same file? (with both insert and bulk_insert)
>
> Thanks & Regards
> Rahul

> Dear Kabeer/Vinoth,
> For example, I have a file which contains UserName, Transaction_Amount and ID fields.
> In this json file I put the same value for ID every time, and I mapped it as the Hudi dataset key field.
> (Currently all records which arrive are new records, and I don't have an auto-increment ID in the files I am getting.)
>
> Suppose I have 4 entries in a json file, e.g.:
>
>     rahul,15,0
>     kabeer,17,0
>     vinod,18,0
>     nishith,16,0
>
> Currently if I load this normally, only 1 record will be in the Hudi dataset, as the key is 0 for all of them (when selecting from the Hive table). I want all 4 entries to be loaded.
>
> @vinoth For this use case I don't want key-based updates, and I just want to control small files in Hadoop using Hudi. I want to use only Hudi's small file size control feature and incremental pull.
>
> Thanks & Regards
> Rahul

> Dear Vinoth,
> As per your suggestion I checked the hoodie.combine.before.upsert property:
>
>     combineInput(on_insert = false, on_update = true)
>     Property: hoodie.combine.before.insert, hoodie.combine.before.upsert
>     Flag which first combines the input RDD and merges multiple partial records into a single record before inserting or updating in DFS
>
> But since it's documented that the default is false, at first I thought not to try it. Anyway, I tried it with false and now it is inserting duplicate records.
> After this, while searching, I found an already raised issue for this:
>
>     HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder().combineInput(true, true)
>         .withPath(basePath).withAutoCommit(false)
>
> In it, it says the default values of true, true need to be changed. I can see it is still not updated in the latest code. Please check this.
>
> Thanks & Regards
> Rahul P
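Given the confusion above over which default applies where, the safest thing is to set the combine flags explicitly on the write config instead of relying on defaults. A minimal sketch, using only the builder methods quoted in the thread plus build(); basePath is a placeholder, and the config class lives under com.uber.hoodie.config on the incubator line and org.apache.hudi.config on later releases.

    import org.apache.hudi.config.HoodieWriteConfig;

    // Sketch: make insert-time behaviour explicit. combineInput(onInsert, onUpsert):
    // false on insert keeps duplicate keys within a batch, true on upsert still merges
    // partial records before updating.
    HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
        .withPath("/data/hudi/my_table")   // basePath placeholder
        .combineInput(false, true)
        .withAutoCommit(false)
        .build();

On the DataSource and DeltaStreamer paths the equivalent knobs are the hoodie.combine.before.insert and hoodie.combine.before.upsert properties (plus the -filter-dupes option Vinoth mentioned for DeltaStreamer).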
