https://github.com/apache/incubator-hudi/pull/634
Rahul, I do see what you mean. The defaults documented are for the write client and they are correct. I think it makes sense to change the defaults for DataSource and DeltaStreamer inserts. We can discuss the pros/cons on the PR?

On Tue, Apr 9, 2019 at 10:11 AM Vinoth Chandar <[email protected]> wrote:

> Hi Rahul,
>
> +1 to Kabeer's suggestion. You can just generate a UUID as a new key and issue upserts. It will also help you identify duplicates eventually.
>
> >> @vinoth For this use case I don't want key-based updates, and I just want to control small files in Hadoop using Hudi. I want to use only Hudi's small file size control feature and incremental pull.
>
> For your use case, I think you should just use the insert operation (which totally bypasses the index lookup/update), not upsert, and set combine.on.insert to false (please see the docs for the exact property name).
>
> On changing defaults, I still think the defaults make sense:
>
>     private static final String COMBINE_BEFORE_INSERT_PROP = "hoodie.combine.before.insert";
>     private static final String DEFAULT_COMBINE_BEFORE_INSERT = "false";
>
> We turn off combining by default for the insert operation. Please raise a JIRA if that's not working for you out of the box.
>
> Thanks
> Vinoth

On Tue, Apr 9, 2019 at 8:17 AM Kabeer Ahmed <[email protected]> wrote:

> Hi Rahul,
>
> Thank you for spelling out the example. Isn't your example easy if we really switch the primary key to, say, the names (kabeer/rahul etc.)? I am sure your example must be a much simplified version of the task you actually have at hand.
> If that is really not possible, then I would consider something like the below: a compound key built from ID and name.
>
>     name, amount, ID, IDwithName (compound key)
>     rahul, 15, 0, 0rahul
>     kabeer, 17, 0, 0kabeer
>     vinod, 18, 0, ....
>     nishith, 16, 0, ....
>
> This will help you with all further inserts and updates. A further update would then be based on 0rahul, 0kabeer etc.; you will have provided Hudi with unique keys and you get the desired results. This is a common way we achieve these results when a need similar to yours arises.
> Thanks
> Kabeer

On Apr 9, 2019, at 7:31 am, Unknown wrote:

On 2019/04/09 06:22:14, [email protected] <[email protected]> wrote:

On 2019/04/08 01:41:16, Vinoth Chandar <[email protected]> wrote:

> Good discussion. Sorry to jump in late (been having some downtime last week).
>
> insert/bulk_insert operations will in fact introduce duplicates if your input has duplicates. I would also like to understand what feature of Hudi is useful to you in general, since you seem to want duplicates.
>
> Only two things I can think of could filter out duplicate records, and both apply to duplicates within the same batch only (i.e. you load both json files that contain duplicates in the same run):
>
> - Either you pass the -filter-dupes option to the DeltaStreamer tool
> - Or you have precombining on for inserts: http://hudi.apache.org/configurations.html#combineInput
>
> Do any of these apply to you?
>
> Thanks
> Vinoth
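Concretely, the suggestions above (a generated UUID or compound record key, the insert operation, and combine-before-insert left off) add up to a DataSource write roughly like the sketch below. The table name, column names and path are placeholders, and the format id shown is the later org.apache.hudi one; the incubator releases discussed in this thread used com.uber.hoodie.

    // Sketch only: tag each row with a surrogate UUID record key so rows that share a
    // business key are all kept, and use the "insert" operation so Hudi skips the index
    // lookup and (with combining disabled) does not de-duplicate the batch.
    // "uuid_key", "ts", "dt", "my_table" and basePath are placeholder names.
    import static org.apache.spark.sql.functions.expr;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public class InsertWithUuidKey {
      static void write(Dataset<Row> inputDf, String basePath) {
        Dataset<Row> withKey = inputDf.withColumn("uuid_key", expr("uuid()"));

        withKey.write()
            .format("org.apache.hudi")                                   // "com.uber.hoodie" on incubator releases
            .option("hoodie.table.name", "my_table")
            .option("hoodie.datasource.write.operation", "insert")       // bypass the upsert/index path
            .option("hoodie.datasource.write.recordkey.field", "uuid_key")
            .option("hoodie.datasource.write.partitionpath.field", "dt") // partition column, placeholder
            .option("hoodie.datasource.write.precombine.field", "ts")    // assumes a ts column exists
            .option("hoodie.combine.before.insert", "false")             // keep duplicate business keys
            .mode(SaveMode.Append)
            .save(basePath);
      }
    }

Kabeer's compound-key variant is the same shape: derive the key column with concat(col("ID"), col("UserName")) instead of uuid() and point hoodie.datasource.write.recordkey.field at that column.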
On Fri, Apr 5, 2019 at 9:10 AM Kabeer Ahmed <[email protected]> wrote:

> Hi Rahul,
> I am sorry, I didn't understand the use case properly. Can you please explain it with an example? Let me put down my version of the understanding based on your email.
>
> >> In the json file, every time I will pass a fixed value for the key field.
>
> Are you saying that you will always have only one value for every key? Example: Rahul -> "Some Value"
>
> >> Currently if I load data like this, only 1 entry per file loads.
>
> What do you mean by this line? Do you mean you are currently loading data like this and only 1 entry per file is loading? Isn't that what you are trying to achieve in the line above?
>
> >> I don't want the same key's values to be skipped while inserting.
>
> So you are saying that you want the same values also repeated under your keys, e.g. if the Rahul primary_key has "Some Value" inserted 5 times, then you would want that to appear 5 times in your store?
>
> In summary: it appears that what you want is that if someone enters 5 values, all of them are kept even if they are the same. So you need something like the below:
>
>     | primary_key | Values                            |
>     | Rahul       | "Some Value", "Some Value", ..... |
>
> Let me know if my understanding is correct.
> Thanks
> Kabeer
>
> >> Dear Omkar/Kabeer,
> >> In one of my use cases, think of it as if I don't want updates at all. In the json file, every time I will pass a fixed value for the key field. Currently if I load data like this, only 1 entry per file loads. I don't want the same key's values to be skipped while inserting.
> >> Thanks & Regards
> >> Rahul

On Apr 5, 2019, at 9:11 am, Unknown wrote:

On 2019/04/04 19:48:39, Kabeer Ahmed <[email protected]> wrote:

> Omkar - there might be various reasons to have duplicates, e.g. handling a single client's trades in a given day, tracking visitor click data on a website, etc.
>
> Rahul - if you can give more details about your requirements, then we can come up with a solution.
> I have never used INSERT & BULK_INSERT at all, and I am not sure whether these options (insert and bulk_insert) allow the user to specify the logic that you are seeking. Without knowing your exact requirement, I can still suggest looking into implementing your own combineAndGetUpdateValue() logic.
> Let's say all your values for a particular key are strings. You could append the incoming string values to the existing values and store them as:
>
>     key   | Value
>     Rahul | Nice
>
> When another entry arrives, append it to the existing value with a comma separator, say:
>
>     key   | Value
>     Rahul | Nice, Person
>
> When you retrieve the key's values you can then decide how to ship them back to the user - which is something you would know based on your requirement - since your json anyway has multiple ways of inserting values for a key.
> Feel free to reach out if you need help and I will help you as much as I can.
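Kabeer's combineAndGetUpdateValue() idea would look roughly like the sketch below: a custom payload that appends the incoming value to the stored one instead of overwriting it. The class, the "value" field and the imports are illustrative only; the exact HoodieRecordPayload signatures and packages (java.util.Optional vs. Hudi's own Option, com.uber.hoodie vs. org.apache.hudi) differ between the incubator release discussed here and later ones.

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.IndexedRecord;
    import org.apache.hudi.avro.HoodieAvroUtils;
    import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
    import org.apache.hudi.common.util.Option;

    // Rough sketch: keep both the stored and the incoming "value" instead of letting
    // the latest record win. "value" is a placeholder field name.
    public class AppendingPayload extends OverwriteWithLatestAvroPayload {

      public AppendingPayload(GenericRecord record, Comparable orderingVal) {
        super(record, orderingVal);
      }

      @Override
      public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
          throws IOException {
        // recordBytes holds the incoming record as serialized by the parent payload
        GenericRecord incoming = (GenericRecord) HoodieAvroUtils.bytesToAvro(recordBytes, schema);
        GenericRecord existing = (GenericRecord) currentValue;

        Object oldVal = existing.get("value");
        Object newVal = incoming.get("value");
        // append with a comma separator, as suggested above
        incoming.put("value", oldVal == null ? newVal : oldVal + ", " + newVal);
        return Option.of((IndexedRecord) incoming);
      }
    }

Such a payload would then be wired in through the payload class setting (hoodie.datasource.write.payload.class on the DataSource path), assuming that option exists in the release being used.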
On Apr 4, 2019, at 6:35 pm, Omkar Joshi <[email protected]> wrote:

> Hi Rahul,
>
> Thanks for trying out Hudi!
> Any reason why you need to have duplicates in the Hudi dataset? Will you ever be updating it later?
>
> Thanks,
> Omkar

On Thu, Apr 4, 2019 at 1:33 AM [email protected] <[email protected]> wrote:

> Dear All,
> I am using a COW table with INSERT/BULK_INSERT.
> I am loading the data from json files.
>
> If an existing key in the Hudi dataset is loaded again, then only the new data with that key shows up. Can I keep both records? (with INSERT)
>
> If the same key appears multiple times in a source json file, then only one of them gets loaded. Can I load duplicate keys from the same file? (with both insert and bulk_insert)
>
> Thanks & Regards
> Rahul

> Dear Kabeer/Vinoth,
> For example, I have a file which contains UserName, Transaction_Amount and ID fields.
> In this json file I put the same value for ID every time, and I mapped it as the Hudi dataset key field.
> (Currently all records which arrive are new records, and I don't have an auto-increment ID in the files I am getting.)
>
> Suppose I have 4 entries in a json file, e.g.:
>
>     rahul,15,0
>     kabeer,17,0
>     vinod,18,0
>     nishith,16,0
>
> Currently if I load this normally, only 1 record will be in the Hudi dataset, as the key is 0 for all of them (when selecting from the Hive table). I want all 4 entries to be loaded.
>
> @vinoth For this use case I don't want key-based updates, and I just want to control small files in Hadoop using Hudi. I want to use only Hudi's small file size control feature and incremental pull.
>
> Thanks & Regards
> Rahul

> Dear Vinoth,
> As per your suggestion I checked the hoodie.combine.before.upsert property:
>
>     combineInput(on_insert = false, on_update = true)
>     Property: hoodie.combine.before.insert, hoodie.combine.before.upsert
>     Flag which first combines the input RDD and merges multiple partial records into a single record before inserting or updating in DFS
>
> But since it's documented that the default is false, at first I thought not to try it. Anyway, I tried it with false and now it is inserting duplicate records.
> After this, while searching, I found an already raised issue for this:
>
>     HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder().combineInput(true, true)
>         .withPath(basePath).withAutoCommit(false)
>
> In it, it says the default values of true, true need to be changed. I can see it is still not updated in the latest code. Please check this.
>
> Thanks & Regards
> Rahul P
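Given the confusion above over which default applies where, the safest thing is to set the combine flags explicitly on the write config instead of relying on defaults. A minimal sketch, using only the builder methods quoted in the thread plus build(); basePath is a placeholder, and the config class lives under com.uber.hoodie.config on the incubator line and org.apache.hudi.config on later releases.

    import org.apache.hudi.config.HoodieWriteConfig;

    // Sketch: make insert-time behaviour explicit. combineInput(onInsert, onUpsert):
    // false on insert keeps duplicate keys within a batch, true on upsert still merges
    // partial records before updating.
    HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
        .withPath("/data/hudi/my_table")   // basePath placeholder
        .combineInput(false, true)
        .withAutoCommit(false)
        .build();

On the DataSource and DeltaStreamer paths the equivalent knobs are the hoodie.combine.before.insert and hoodie.combine.before.upsert properties (plus the -filter-dupes option Vinoth mentioned for DeltaStreamer).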
