Great!!

Got it working!!

'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
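
For reference, this is roughly the full set of write options I ended up with
(a sketch; the table name, precombine column, and path are placeholders):

hudi_options = {
    'hoodie.table.name': 'my_table',                      # placeholder
    'hoodie.datasource.write.table.name': 'my_table',
    'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.precombine.field': 'ts_col',  # placeholder ordering column
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
}

df.write.format("hudi").options(**hudi_options).mode("append").save("/path/to/table")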

Thank you.

On Thu, Jul 16, 2020 at 7:10 PM Adam Feldman <[email protected]> wrote:

> Hi Sivaprakash,
> To specify multiple keys in comma-separated notation, you must also set
> KEYGENERATOR_CLASS_OPT_KEY to classOf[ComplexKeyGenerator].getName. Please
> see the description here:
> https://hudi.apache.org/docs/writing_data.html#datasource-writer.
>
> Note: RECORDKEY_FIELD_OPT_KEY is the key variable mapped to the
> hoodie.datasource.write.recordkey.field configuration.
>
> Thanks,
>
> Adam Feldman
>
> On Thu, Jul 16, 2020 at 1:00 PM Sivaprakash <
> [email protected]>
> wrote:
>
> > Looks like this property does the trick
> >
> > Property: hoodie.datasource.write.recordkey.field, Default: uuid
> > Record key field. Value to be used as the recordKey component of HoodieKey.
> > Actual value will be obtained by invoking .toString() on the field value.
> > Nested fields can be specified using the dot notation eg: a.b.c
> >
> > However, I couldn't provide more than one column like this: COL1.COL2
> >
> > 'hoodie.datasource.write.recordkey.field': 'COL1.COL2'
> >
> > Anything wrong with the syntax? (I tried with a comma as well.)
> >
> >
> > On Thu, Jul 16, 2020 at 6:41 PM Sivaprakash <
> > [email protected]>
> > wrote:
> >
> > > Hello Balaji
> > >
> > > Thank you for the info!!
> > >
> > > I tried those options, but here is what I find (I'm trying to understand
> > > how Hudi manages its files internally):
> > >
> > > First Write
> > >
> > > 1.
> > >
> > > ('NR001', 'YXXXTRE', 'YXXXTRE_445343')
> > > ('NR002', 'YYYYTRE', 'YYYYTRE_445343')
> > > ('NR003', 'YZZZTRE', 'YZZZTRE_445343')
> > >
> > > Commit time for all the records: 20200716212533
> > >
> > > 2.
> > >
> > > ('NR001', 'YXXXTRE', 'YXXXTRE_445343')
> > > ('NR002', 'ZYYYTRE', 'ZYYYTRE_445343')
> > > ('NR003', 'YZZZTRE', 'YZZZTRE_445343')
> > >
> > > (There is only one record change in my new dataset; the other two
> > > records are the same as in 1. But after a snapshot/incremental read I
> > > see that the commit time is updated for all 3 records.)
> > >
> > > Commit time for all the records: 20200716214544
> > >
> > >
> > >    - Does it mean that Hudi rewrites all 3 records again? I thought it
> > >    would write only the 2nd record.
> > >    - I'm trying to understand the storage volume efficiency here (see the
> > >    read snippet after the config list below).
> > >    - Does some configuration have to be enabled to fix this?
> > >
> > > Configuration that I use:
> > >
> > >    - COPY_ON_WRITE, Append, Upsert
> > >    - The first column (NR001) is configured as
> > >    *hoodie.datasource.write.recordkey.field*
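> > >
> > > For reference, this is roughly how I read the table back to check the
> > > commit times (a sketch; the path/glob is a placeholder):
> > >
> > > df = spark.read.format("hudi").load("/path/to/table/*")
> > > df.select("_hoodie_commit_time", "_hoodie_record_key",
> > >           "_hoodie_file_name").show(truncate=False)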
> > >
> > >
> > >
> > >
> > > On Thu, Jul 16, 2020 at 6:10 PM Balaji Varadarajan
> > > <[email protected]> wrote:
> > >
> > >>  Hi Sivaprakash,
> > >> Uniqueness of records is determined by the record key you specify to
> > >> Hudi. Hudi supports filtering out existing records (by record key). By
> > >> default, it would upsert all incoming records.
> > >> Please look at
> > >> https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput
> > >> for information on how to dedupe records based on record key.
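> > >>
> > >> As a rough illustration (the precombine column name is a placeholder and
> > >> the values here are examples, not settings you must change), the
> > >> relevant write options look like this:
> > >>
> > >> dedupe_options = {
> > >>     # Pick the record with the largest value of this column when the
> > >>     # same key shows up more than once in the incoming batch.
> > >>     'hoodie.datasource.write.precombine.field': 'ts_col',
> > >>     # Dedupe the incoming batch by record key before an insert.
> > >>     'hoodie.combine.before.insert': 'true',
> > >>     # On insert, drop incoming records whose key already exists in the
> > >>     # table.
> > >>     'hoodie.datasource.write.insert.drop.duplicates': 'true',
> > >> }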
> > >>
> > >> Balaji.V
> > >>     On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash <
> > >> [email protected]> wrote:
> > >>
> > >> This might be a basic question - I'm experimenting with Hudi (PySpark).
> > >> I have used the Insert/Upsert options to write a delta into my data
> > >> lake. However, one thing is not clear to me:
> > >>
> > >> Step 1: I write 50 records.
> > >> Step 2: I write 50 records, out of which only *10 records have been
> > >> changed* (I'm using upsert mode and tried MERGE_ON_READ as well as
> > >> COPY_ON_WRITE).
> > >> Step 3: I was expecting only the 10 records to be written, but it writes
> > >> the whole 50 records. Is this normal behaviour? Does it mean I need to
> > >> determine the delta myself and write only those records?
> > >>
> > >> Am I missing something?
> > >>
> > >
> > >
> > >
> > > --
> > > - Prakash.
> > >
> >
> >
> > --
> > - Prakash.
> >
>


-- 
- Prakash.
