Hi Sivaprakash,

To specify multiple keys in comma-separated notation, you must also set
KEYGENERATOR_CLASS_OPT_KEY to classOf[ComplexKeyGenerator].getName. Please
see the description here:
https://hudi.apache.org/docs/writing_data.html#datasource-writer
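Since you are on PySpark, the same options can be passed as plain strings. A minimal sketch (the table name is made up, and the key-generator package path may vary by Hudi version):

```python
# Hudi write options for a composite (multi-column) record key.
# Option names follow the Hudi docs of this era; the keygen package
# path (org.apache.hudi.keygen) may differ in other Hudi versions.
hudi_options = {
    "hoodie.table.name": "my_table",  # hypothetical table name
    # Comma-separated columns that together form the record key:
    "hoodie.datasource.write.recordkey.field": "COL1,COL2",
    # Required so the comma-separated list is parsed as multiple keys:
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}

# With a SparkSession and a DataFrame `df`, the write itself would be:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/path/to/table")
```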
Note: RECORDKEY_FIELD_OPT_KEY is the key variable mapped to the
hoodie.datasource.write.recordkey.field configuration.

Thanks,
Adam Feldman

On Thu, Jul 16, 2020 at 1:00 PM Sivaprakash <[email protected]> wrote:

> Looks like this property does the trick:
>
> Property: hoodie.datasource.write.recordkey.field, Default: uuid
> Record key field. Value to be used as the recordKey component of HoodieKey.
> Actual value will be obtained by invoking .toString() on the field value.
> Nested fields can be specified using the dot notation, eg: a.b.c
>
> However, I couldn't provide more than one column like this:
>
> 'hoodie.datasource.write.recordkey.field': 'COL1.COL2'
>
> Is anything wrong with the syntax? (I tried with a comma as well.)
>
> On Thu, Jul 16, 2020 at 6:41 PM Sivaprakash
> <[email protected]> wrote:
>
> > Hello Balaji,
> >
> > Thank you for your info!
> >
> > I tried those options, but here is what I find (I'm trying to understand
> > how Hudi internally manages its files):
> >
> > First write:
> >
> > 1. ('NR001', 'YXXXTRE', 'YXXXTRE_445343')
> >    ('NR002', 'YYYYTRE', 'YYYYTRE_445343')
> >    ('NR003', 'YZZZTRE', 'YZZZTRE_445343')
> >
> >    Commit time for all the records: 20200716212533
> >
> > 2. ('NR001', 'YXXXTRE', 'YXXXTRE_445343')
> >    ('NR002', 'ZYYYTRE', 'ZYYYTRE_445343')
> >    ('NR003', 'YZZZTRE', 'YZZZTRE_445343')
> >
> >    (Only one record changed in my new dataset; the other two records are
> >    the same as in 1. But after a snapshot/incremental read, I see that
> >    the commit time is updated for all 3 records.)
> >
> >    Commit time for all the records: 20200716214544
> >
> > - Does this mean that Hudi re-creates all 3 records? I thought it would
> >   rewrite only the 2nd record.
> > - I am trying to understand the storage volume efficiency here.
> > - Does some configuration have to be enabled to fix this?
> > Configuration that I use:
> >
> > - COPY_ON_WRITE, Append, Upsert
> > - The first column (NR001) is configured as
> >   *hoodie.datasource.write.recordkey.field*
> >
> > On Thu, Jul 16, 2020 at 6:10 PM Balaji Varadarajan
> > <[email protected]> wrote:
> >
> >> Hi Sivaprakash,
> >> Uniqueness of records is determined by the record key you specify to
> >> Hudi. Hudi supports filtering out existing records (by record key). By
> >> default, it would upsert all incoming records.
> >> Please look at
> >> https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput
> >> for information on how to dedupe records based on record key.
> >>
> >> Balaji.V
> >>
> >> On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash
> >> <[email protected]> wrote:
> >>
> >> This might be a basic question - I'm experimenting with Hudi (PySpark).
> >> I have used Insert/Upsert options to write deltas into my data lake.
> >> However, one thing is not clear to me:
> >>
> >> Step 1: I write 50 records.
> >> Step 2: I write 50 records, out of which only *10 records have been
> >> changed* (I'm using upsert mode, and I tried with MERGE_ON_READ as well
> >> as COPY_ON_WRITE).
> >> Step 3: I was expecting only the 10 records to be written, but it
> >> writes the whole 50 records. Is this normal behaviour? Does it mean I
> >> need to determine the delta myself and write only that?
> >>
> >> Am I missing something?
> >
> > --
> > - Prakash.
>
> --
> - Prakash.
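For reference, the dedupe behaviour Balaji's FAQ link describes is driven by a few write options. A hedged sketch only (option names as given in the Hudi docs of this era; "ts" is a hypothetical ordering column in the data, and exact option availability depends on the Hudi version):

```python
# Dedup-related Hudi write options referenced in the FAQ discussion above.
dedup_options = {
    # On upsert, when two incoming records share a record key, the one
    # with the larger precombine-field value wins ('ts' is hypothetical):
    "hoodie.datasource.write.precombine.field": "ts",
    # Drop records whose keys already exist in the table during inserts:
    "hoodie.datasource.write.insert.drop.duplicates": "true",
    # Combine duplicate keys within the incoming batch before insert:
    "hoodie.combine.before.insert": "true",
}
```

These would be merged into the same options dict passed to the Hudi DataFrame writer.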
