Thanks everyone for helping out Prakash!

On Thu, Jul 16, 2020 at 10:24 AM Sivaprakash <[email protected]> wrote:

Great!! Got it working!!

'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',

Thank you.

On Thu, Jul 16, 2020 at 7:10 PM Adam Feldman <[email protected]> wrote:

Hi Sivaprakash,

To specify multiple keys in comma-separated notation, you must also set KEYGENERATOR_CLASS_OPT_KEY to classOf[ComplexKeyGenerator].getName. Please see the description at https://hudi.apache.org/docs/writing_data.html#datasource-writer.

Note: RECORDKEY_FIELD_OPT_KEY is the key variable mapped to the hoodie.datasource.write.recordkey.field configuration.

Thanks,

Adam Feldman
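For reference, a minimal PySpark sketch of the two settings above working together. The table name, base path, partition column, and precombine column are placeholders, not part of the thread:

from pyspark.sql import SparkSession

# The Hudi bundle must be on the classpath, e.g.
# spark-submit --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3
spark = SparkSession.builder.appName("hudi-composite-key").getOrCreate()

df = spark.createDataFrame(
    [('NR001', 'YXXXTRE', 'YXXXTRE_445343', 1)],
    ['COL1', 'COL2', 'COL3', 'ts'],
)

hudi_options = {
    'hoodie.table.name': 'my_table',  # placeholder table name
    # Composite record key: comma-separated columns, plus the key
    # generator that knows how to combine them into one HoodieKey.
    'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
    'hoodie.datasource.write.keygenerator.class':
        'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.partitionpath.field': 'COL3',  # placeholder
    'hoodie.datasource.write.precombine.field': 'ts',       # placeholder
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
}

(df.write.format('hudi')  # use 'org.apache.hudi' on releases before 0.5.1
   .options(**hudi_options)
   .mode('append')
   .save('/tmp/hudi/my_table'))  # placeholder base path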
On Thu, Jul 16, 2020 at 1:00 PM Sivaprakash <[email protected]> wrote:

Looks like this property does the trick:

Property: hoodie.datasource.write.recordkey.field, Default: uuid
Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation, eg: a.b.c

However, I couldn't provide more than one column like this... COL1.COL2

'hoodie.datasource.write.recordkey.field': 'COL1.COL2'

Is anything wrong with the syntax? (I tried with a comma as well.)

On Thu, Jul 16, 2020 at 6:41 PM Sivaprakash <[email protected]> wrote:

Hello Balaji,

Thank you for your info!!

I tried those options, but here is what I find (I'm trying to understand how Hudi internally manages its files).

First write:

('NR001', 'YXXXTRE', 'YXXXTRE_445343')
('NR002', 'YYYYTRE', 'YYYYTRE_445343')
('NR003', 'YZZZTRE', 'YZZZTRE_445343')

Commit time for all the records: 20200716212533

Second write:

('NR001', 'YXXXTRE', 'YXXXTRE_445343')
('NR002', 'ZYYYTRE', 'ZYYYTRE_445343')
('NR003', 'YZZZTRE', 'YZZZTRE_445343')

(Only one record changed in my new dataset; the other two records are the same as in the first write, but after a snapshot/incremental read I see that the commit time is updated for all 3 records.)

Commit time for all the records: 20200716214544

- Does this mean that Hudi re-creates all 3 records? I thought it would rewrite only the 2nd record.
- I am trying to understand the storage volume efficiency here.
- Does some configuration have to be enabled to change this?

Configuration that I use:

- COPY_ON_WRITE, Append, Upsert
- The first column (NR001) is configured as hoodie.datasource.write.recordkey.field
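This behaviour follows from how upsert works: Hudi identifies records by record key only and does not diff payloads, so when the second batch re-sends all three keys, all three are treated as updates and stamped with the new commit time; with COPY_ON_WRITE, the whole parquet file slice containing those keys is rewritten. A quick way to observe this, continuing from the sketch earlier in the thread, is to inspect Hudi's metadata columns on a snapshot read (the base path is a placeholder; newer releases accept the base path without the glob):

snapshot = spark.read.format('hudi').load('/tmp/hudi/my_table/*/*')
snapshot.select('_hoodie_commit_time', '_hoodie_file_name', 'COL1').show(truncate=False)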
On Thu, Jul 16, 2020 at 6:10 PM Balaji Varadarajan <[email protected]> wrote:

Hi Sivaprakash,

Uniqueness of records is determined by the record key you specify to Hudi. Hudi supports filtering out existing records (by record key); by default, it upserts all incoming records. Please look at https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput for information on how to dedupe records based on record key.

Balaji.V

On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash <[email protected]> wrote:

This might be a basic question - I'm experimenting with Hudi (PySpark). I have used the Insert/Upsert options to write deltas into my data lake. However, one thing is not clear to me.

Step 1: I write 50 records.
Step 2: I write 50 records, out of which only *10 records have been changed* (I'm using upsert mode, and tried MERGE_ON_READ as well as COPY_ON_WRITE).
Step 3: I was expecting only the 10 changed records to be written, but it writes the whole 50 records. Is this normal behaviour? Does it mean I need to determine the delta myself and write only the changed records?

Am I missing something?
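Following up on the FAQ pointer above, deduping by record key is driven by the precombine field plus a couple of write options. A hypothetical sketch (option names as of the Hudi 0.5.x/0.6.x releases, so verify against the version you run; 'ts' is a placeholder column):

dedupe_options = {
    # Within one incoming batch, keep a single row per record key: the
    # one with the largest value in the precombine column.
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.combine.before.upsert': 'true',
    # For insert operations, drop incoming records whose key already
    # exists in the table instead of writing them again.
    'hoodie.datasource.write.insert.drop.duplicates': 'true',
}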
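As for reading back only what changed, an incremental query returns just the records written after a given commit instant, using commit times like those shown earlier in the thread. A sketch, assuming a recent option name (early 0.5.x releases used 'hoodie.datasource.view.type' instead of 'hoodie.datasource.query.type'):

incremental = (spark.read.format('hudi')
    # Return only records committed after this instant.
    .option('hoodie.datasource.query.type', 'incremental')
    .option('hoodie.datasource.read.begin.instanttime', '20200716212533')
    .load('/tmp/hudi/my_table'))  # placeholder base path
incremental.show(truncate=False)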
