Re: Handling delta

2020-07-19 Thread Vinoth Chandar
Thanks everyone for helping out prakash!

On Thu, Jul 16, 2020 at 10:24 AM Sivaprakash wrote:
> Great !! Got it working !!
> 'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
> 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
> Thank you.

Re: Handling delta

2020-07-16 Thread Sivaprakash
Great !! Got it working !!

'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',

Thank you.

On Thu, Jul 16, 2020 at 7:10 PM Adam Feldman wrote:
> Hi Sivaprakash,
> To be able to specify multiple keys

Re: Handling delta

2020-07-16 Thread Adam Feldman
Hi Sivaprakash, To be able to specify multiple keys in comma-separated notation, you must also set KEYGENERATOR_CLASS_OPT_KEY to classOf[ComplexKeyGenerator].getName. Please see the description here: https://hudi.apache.org/docs/writing_data.html#datasource-writer. Note: RECORDKEY_FIELD_OPT_KEY
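Putting Adam's note together with the options Sivaprakash posted, the write configuration could be assembled like this. This is only a sketch: the column names COL1/COL2 come from the thread, while the precombine field, table name, and save path are hypothetical placeholders (the option keys themselves are standard Hudi datasource options).

```python
# Sketch: building Hudi writer options for a composite record key.
# COL1/COL2 are from the thread; "ts", "my_table", and the path are made up.

def hudi_write_options(record_key_fields, precombine_field, table_name):
    """Return the option map for a Hudi upsert keyed on multiple columns."""
    return {
        # Comma-separated list of key columns...
        "hoodie.datasource.write.recordkey.field": ",".join(record_key_fields),
        # ...which requires the ComplexKeyGenerator (per Adam's note).
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.table.name": table_name,
    }

opts = hudi_write_options(["COL1", "COL2"], "ts", "my_table")

# In an actual PySpark job (not executed here):
# df.write.format("hudi").options(**opts).mode("append").save("/tmp/my_table")
```

With a composite key, Hudi builds each HoodieKey from both columns, so two rows that share COL1 but differ in COL2 are distinct records.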

Re: Handling delta

2020-07-16 Thread Sivaprakash
Looks like this property does the trick:

Property: hoodie.datasource.write.recordkey.field, Default: uuid
Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the do
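The property description above can be illustrated with a small stand-in (this is not Hudi's code): the key field path is walked into the record one level per dot, and the final value is stringified, mirroring the ".toString()" behaviour the docs describe. The record and field names below are invented for illustration.

```python
# Illustration only (not Hudi internals): resolving a dot-notation record
# key path against a nested record, then stringifying the final value.

def resolve_record_key(record: dict, field_path: str) -> str:
    value = record
    for part in field_path.split("."):
        value = value[part]          # descend one nesting level per dot
    return str(value)                # Hudi invokes .toString() on the field

row = {"id": 42, "meta": {"region": "EU"}}
print(resolve_record_key(row, "meta.region"))  # EU
print(resolve_record_key(row, "id"))           # 42, as a string
```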

Re: Handling delta

2020-07-16 Thread Sivaprakash
Hello Balaji, Thank you for your info !! I tried those options, but here is what I find (I'm trying to understand how Hudi internally manages its files):

First Write
1. ('NR001', 'YXXXTRE', 'YXXXTRE_445343')
   ('NR002', 'TRE', 'TRE_445343')
   ('NR003', 'YZZZTRE', 'YZZZTRE_445343')
Commit time for

Re: Handling delta

2020-07-16 Thread Balaji Varadarajan
Hi Sivaprakash, Uniqueness of records is determined by the record key you specify to hudi. Hudi supports filtering out existing records (by record key). By default, it would upsert all incoming records.  Please look at  https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandledu

Re: Handling delta

2020-07-16 Thread Sivaprakash
Yes, I mean the 10 records that I mentioned in Step 1. But I re-write the whole dataset the second time as well. I see that commit_time is getting updated for all 50 records (which feels normal to me), but I'm not sure how to see/prove to myself that the data is not growing (to 100 records; actually it should be o
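One way to convince yourself the table has not grown is to count rows per record key using the `_hoodie_record_key` metadata column that Hudi adds to every row (in PySpark that would be something like `spark.read.format("hudi").load(path).groupBy("_hoodie_record_key").count()`). The sketch below simulates that check in plain Python; the 50 rows and the commit time value are fabricated for illustration.

```python
# Simulated check: after the second upsert there should be 50 rows,
# one per record key, not 100. Rows and commit time are made up.
from collections import Counter

rows_after_second_write = [
    {"_hoodie_record_key": f"NR{i:03d}", "_hoodie_commit_time": "20200716120000"}
    for i in range(1, 51)
]

per_key = Counter(r["_hoodie_record_key"] for r in rows_after_second_write)
print(len(rows_after_second_write))            # 50 total rows
print(max(per_key.values()))                   # 1 -> no key is duplicated
```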

Re: Handling delta

2020-07-16 Thread Adam Feldman
Hi Sivaprakash, Not an expert here either, but for your second question. Yes, I believe when writing delta to the table you must identify the actual delta yourself and only write the new/changed/removed records. I guess we could put a request in for hudi to take care of this, but two possible issue
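Adam's "identify the actual delta yourself" suggestion can be sketched as a snapshot comparison by record key: keep only rows that are new or whose payload changed, and separately note keys that disappeared. The record keys echo the thread's NR-style values; the payloads are invented, and real pipelines would do this with a join rather than in-memory dicts.

```python
# Sketch: computing the delta between two snapshots keyed by record key.

def compute_delta(previous: dict, current: dict):
    changed = {k: v for k, v in current.items()
               if previous.get(k) != v}        # new or modified rows
    removed = set(previous) - set(current)      # keys gone from the new snapshot
    return changed, removed

prev = {"NR001": "A", "NR002": "B", "NR003": "C"}
curr = {"NR001": "A", "NR002": "B2", "NR004": "D"}

changed, removed = compute_delta(prev, curr)
print(changed)   # {'NR002': 'B2', 'NR004': 'D'}
print(removed)   # {'NR003'}
```

Only `changed` would then be written with the upsert operation; `removed` would feed a delete operation if removals matter for the use case.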

Re: Handling delta

2020-07-16 Thread Allen Underwood
Hi Sivaprakash, So I'm by no means an expert on this, but I think you might find what you're looking for here: https://hudi.apache.org/docs/concepts.html

I'm not sure I fully understand Step 2 you mentioned - "I'm writing 50 records out of which only 10 records have been changed" - does that mean t

Handling delta

2020-07-16 Thread Sivaprakash
This might be a basic question - I'm experimenting with Hudi (PySpark). I have used Insert/Upsert options to write delta into my data lake. However, one thing is not clear to me:

Step 1: I write 50 records
Step 2: I'm writing 50 records out of which only *10 records have been changed* (I'm using upsert