Re: Handling delta

2020-07-19 Thread Vinoth Chandar
Thanks everyone for helping out prakash!

On Thu, Jul 16, 2020 at 10:24 AM Sivaprakash wrote:
> Great !! Got it working !!
> 'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
> 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
> Thank you.

Re: Handling delta

2020-07-16 Thread Sivaprakash
Great !! Got it working !!

'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',

Thank you.

On Thu, Jul 16, 2020 at 7:10 PM Adam Feldman wrote:
> Hi Sivaprakash,
> To be able to specify multiple keys

Re: Handling delta

2020-07-16 Thread Adam Feldman
Hi Sivaprakash, To be able to specify multiple keys in comma-separated notation, you must also set KEYGENERATOR_CLASS_OPT_KEY to classOf[ComplexKeyGenerator].getName. Please see the description here: https://hudi.apache.org/docs/writing_data.html#datasource-writer. Note: RECORDKEY_FIELD_OPT_KEY
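Putting Adam's note together with the options Sivaprakash posted, the write configuration could be assembled like this. This is only a sketch: the column names COL1/COL2 come from the thread, while the precombine field, table name, and save path are hypothetical placeholders (the option keys themselves are standard Hudi datasource options).

```python
# Sketch: building Hudi writer options for a composite record key.
# COL1/COL2 are from the thread; "ts", "my_table", and the path are made up.

def hudi_write_options(record_key_fields, precombine_field, table_name):
    """Return the option map for a Hudi upsert keyed on multiple columns."""
    return {
        # Comma-separated list of key columns...
        "hoodie.datasource.write.recordkey.field": ",".join(record_key_fields),
        # ...which requires the ComplexKeyGenerator (per Adam's note).
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.table.name": table_name,
    }

opts = hudi_write_options(["COL1", "COL2"], "ts", "my_table")

# In an actual PySpark job (not executed here):
# df.write.format("hudi").options(**opts).mode("append").save("/tmp/my_table")
```

With a composite key, Hudi builds each HoodieKey from both columns, so two rows that share COL1 but differ in COL2 are distinct records.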

Re: Handling delta

2020-07-16 Thread Sivaprakash
Looks like this property does the trick:

Property: hoodie.datasource.write.recordkey.field, Default: uuid
Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the do
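The property description above can be illustrated with a small stand-in (this is not Hudi's code): the key field path is walked into the record one level per dot, and the final value is stringified, mirroring the ".toString()" behaviour the docs describe. The record and field names below are invented for illustration.

```python
# Illustration only (not Hudi internals): resolving a dot-notation record
# key path against a nested record, then stringifying the final value.

def resolve_record_key(record: dict, field_path: str) -> str:
    value = record
    for part in field_path.split("."):
        value = value[part]          # descend one nesting level per dot
    return str(value)                # Hudi invokes .toString() on the field

row = {"id": 42, "meta": {"region": "EU"}}
print(resolve_record_key(row, "meta.region"))  # EU
print(resolve_record_key(row, "id"))           # 42, as a string
```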

Re: Handling delta

2020-07-16 Thread Sivaprakash
Hello Balaji, Thank you for your info !! I tried those options, but here is what I find (I'm trying to understand how Hudi internally manages its files):

First Write
1. ('NR001', 'YXXXTRE', 'YXXXTRE_445343')
   ('NR002', 'TRE', 'TRE_445343')
   ('NR003', 'YZZZTRE', 'YZZZTRE_445343')
Commit time for

Re: Handling delta

2020-07-16 Thread Balaji Varadarajan
Hi Sivaprakash, Uniqueness of records is determined by the record key you specify to hudi. Hudi supports filtering out existing records (by record key). By default, it would upsert all incoming records.  Please look at  https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandledu

Re: Handling delta

2020-07-16 Thread Sivaprakash
Yes, I mean the 10 records that I mentioned in Step 1. But I re-write the whole dataset the second time as well. I see that commit_time is getting updated for all 50 records (which feels normal to me), but I'm not sure how to see/prove to myself that the data is not growing (to 100 records; actually it should be o
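One way to convince yourself the table has not grown is to count rows per record key using the `_hoodie_record_key` metadata column that Hudi adds to every row (in PySpark that would be something like `spark.read.format("hudi").load(path).groupBy("_hoodie_record_key").count()`). The sketch below simulates that check in plain Python; the 50 rows and the commit time value are fabricated for illustration.

```python
# Simulated check: after the second upsert there should be 50 rows,
# one per record key, not 100. Rows and commit time are made up.
from collections import Counter

rows_after_second_write = [
    {"_hoodie_record_key": f"NR{i:03d}", "_hoodie_commit_time": "20200716120000"}
    for i in range(1, 51)
]

per_key = Counter(r["_hoodie_record_key"] for r in rows_after_second_write)
print(len(rows_after_second_write))            # 50 total rows
print(max(per_key.values()))                   # 1 -> no key is duplicated
```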

Re: Handling delta

2020-07-16 Thread Adam Feldman
Hi Sivaprakash, Not an expert here either, but for your second question. Yes, I believe when writing delta to the table you must identify the actual delta yourself and only write the new/changed/removed records. I guess we could put a request in for hudi to take care of this, but two possible issue
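Adam's "identify the actual delta yourself" suggestion can be sketched as a snapshot comparison by record key: keep only rows that are new or whose payload changed, and separately note keys that disappeared. The record keys echo the thread's NR-style values; the payloads are invented, and real pipelines would do this with a join rather than in-memory dicts.

```python
# Sketch: computing the delta between two snapshots keyed by record key.

def compute_delta(previous: dict, current: dict):
    changed = {k: v for k, v in current.items()
               if previous.get(k) != v}        # new or modified rows
    removed = set(previous) - set(current)      # keys gone from the new snapshot
    return changed, removed

prev = {"NR001": "A", "NR002": "B", "NR003": "C"}
curr = {"NR001": "A", "NR002": "B2", "NR004": "D"}

changed, removed = compute_delta(prev, curr)
print(changed)   # {'NR002': 'B2', 'NR004': 'D'}
print(removed)   # {'NR003'}
```

Only `changed` would then be written with the upsert operation; `removed` would feed a delete operation if removals matter for the use case.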

Re: Handling delta

2020-07-16 Thread Allen Underwood
Hi Sivaprakash, So I'm by no means an expert on this, but I think you might find what you're looking for here: https://hudi.apache.org/docs/concepts.html

I'm not sure I fully understand Step 2 you mentioned - "I'm writing 50 records out of which only 10 records have been changed" - does that mean t

Handling delta

2020-07-16 Thread Sivaprakash
This might be a basic question - I'm experimenting with Hudi (PySpark). I have used Insert/Upsert options to write delta into my data lake. However, one thing is not clear to me:

Step 1: I write 50 records
Step 2: I'm writing 50 records out of which only *10 records have been changed* (I'm using upsert