Hi Sivaprakash,

To specify multiple keys in comma-separated notation, you must also set
KEYGENERATOR_CLASS_OPT_KEY to classOf[ComplexKeyGenerator].getName. Please
see the description here:
https://hudi.apache.org/docs/writing_data.html#datasource-writer
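Since you are on PySpark, the same options can be passed as plain strings. A minimal sketch (the table name is made up, and the key-generator package path may vary by Hudi version):

```python
# Hudi write options for a composite (multi-column) record key.
# Option names follow the Hudi docs of this era; the keygen package
# path (org.apache.hudi.keygen) may differ in other Hudi versions.
hudi_options = {
    "hoodie.table.name": "my_table",  # hypothetical table name
    # Comma-separated columns that together form the record key:
    "hoodie.datasource.write.recordkey.field": "COL1,COL2",
    # Required so the comma-separated list is parsed as multiple keys:
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}

# With a SparkSession and a DataFrame `df`, the write itself would be:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/path/to/table")
```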
Note: RECORDKEY_FIELD_OPT_KEY is the key variable mapped to the
hoodie.datasource.write.recordkey.field configuration.

Thanks,
Adam Feldman

On Thu, Jul 16, 2020 at 1:00 PM Sivaprakash <[email protected]> wrote:

> Looks like this property does the trick:
>
> Property: hoodie.datasource.write.recordkey.field, Default: uuid
> Record key field. Value to be used as the recordKey component of HoodieKey.
> Actual value will be obtained by invoking .toString() on the field value.
> Nested fields can be specified using the dot notation, eg: a.b.c
>
> However, I couldn't provide more than one column like this:
>
> 'hoodie.datasource.write.recordkey.field': 'COL1.COL2'
>
> Is anything wrong with the syntax? (I tried with a comma as well.)
>
> On Thu, Jul 16, 2020 at 6:41 PM Sivaprakash
> <[email protected]> wrote:
>
> > Hello Balaji,
> >
> > Thank you for your info!
> >
> > I tried those options, but here is what I find (I'm trying to understand
> > how Hudi internally manages its files):
> >
> > First write:
> >
> > 1. ('NR001', 'YXXXTRE', 'YXXXTRE_445343')
> >    ('NR002', 'YYYYTRE', 'YYYYTRE_445343')
> >    ('NR003', 'YZZZTRE', 'YZZZTRE_445343')
> >
> >    Commit time for all the records: 20200716212533
> >
> > 2. ('NR001', 'YXXXTRE', 'YXXXTRE_445343')
> >    ('NR002', 'ZYYYTRE', 'ZYYYTRE_445343')
> >    ('NR003', 'YZZZTRE', 'YZZZTRE_445343')
> >
> >    (Only one record changed in my new dataset; the other two records are
> >    the same as in 1. But after a snapshot/incremental read, I see that
> >    the commit time is updated for all 3 records.)
> >
> >    Commit time for all the records: 20200716214544
> >
> > - Does this mean that Hudi re-creates all 3 records? I thought it would
> >   rewrite only the 2nd record.
> > - I am trying to understand the storage volume efficiency here.
> > - Does some configuration have to be enabled to fix this?
> > Configuration that I use:
> >
> > - COPY_ON_WRITE, Append, Upsert
> > - The first column (NR001) is configured as
> >   *hoodie.datasource.write.recordkey.field*
> >
> > On Thu, Jul 16, 2020 at 6:10 PM Balaji Varadarajan
> > <[email protected]> wrote:
> >
> >> Hi Sivaprakash,
> >> Uniqueness of records is determined by the record key you specify to
> >> Hudi. Hudi supports filtering out existing records (by record key). By
> >> default, it would upsert all incoming records.
> >> Please look at
> >> https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput
> >> for information on how to dedupe records based on record key.
> >>
> >> Balaji.V
> >>
> >> On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash
> >> <[email protected]> wrote:
> >>
> >> This might be a basic question - I'm experimenting with Hudi (PySpark).
> >> I have used Insert/Upsert options to write deltas into my data lake.
> >> However, one thing is not clear to me:
> >>
> >> Step 1: I write 50 records.
> >> Step 2: I write 50 records, out of which only *10 records have been
> >> changed* (I'm using upsert mode, and I tried with MERGE_ON_READ as well
> >> as COPY_ON_WRITE).
> >> Step 3: I was expecting only the 10 records to be written, but it
> >> writes the whole 50 records. Is this normal behaviour? Does it mean I
> >> need to determine the delta myself and write only that?
> >>
> >> Am I missing something?
> >
> > --
> > - Prakash.
>
> --
> - Prakash.
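For reference, the dedupe behaviour Balaji's FAQ link describes is driven by a few write options. A hedged sketch only (option names as given in the Hudi docs of this era; "ts" is a hypothetical ordering column in the data, and exact option availability depends on the Hudi version):

```python
# Dedup-related Hudi write options referenced in the FAQ discussion above.
dedup_options = {
    # On upsert, when two incoming records share a record key, the one
    # with the larger precombine-field value wins ('ts' is hypothetical):
    "hoodie.datasource.write.precombine.field": "ts",
    # Drop records whose keys already exist in the table during inserts:
    "hoodie.datasource.write.insert.drop.duplicates": "true",
    # Combine duplicate keys within the incoming batch before insert:
    "hoodie.combine.before.insert": "true",
}
```

These would be merged into the same options dict passed to the Hudi DataFrame writer.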
