Thanks everyone for helping out Prakash!

On Thu, Jul 16, 2020 at 10:24 AM Sivaprakash <[email protected]> wrote:

Great!! Got it working!!

'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',

Thank you.

On Thu, Jul 16, 2020 at 7:10 PM Adam Feldman <[email protected]> wrote:

Hi Sivaprakash,

To specify multiple keys in comma-separated notation, you must also set KEYGENERATOR_CLASS_OPT_KEY to classOf[ComplexKeyGenerator].getName. Please see the description at https://hudi.apache.org/docs/writing_data.html#datasource-writer.

Note: RECORDKEY_FIELD_OPT_KEY is the key variable mapped to the hoodie.datasource.write.recordkey.field configuration.

Thanks,

Adam Feldman
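For reference, a minimal PySpark sketch of the two settings above working together. The table name, base path, partition column, and precombine column are placeholders, not part of the thread:

from pyspark.sql import SparkSession

# The Hudi bundle must be on the classpath, e.g.
# spark-submit --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3
spark = SparkSession.builder.appName("hudi-composite-key").getOrCreate()

df = spark.createDataFrame(
    [('NR001', 'YXXXTRE', 'YXXXTRE_445343', 1)],
    ['COL1', 'COL2', 'COL3', 'ts'],
)

hudi_options = {
    'hoodie.table.name': 'my_table',  # placeholder table name
    # Composite record key: comma-separated columns, plus the key
    # generator that knows how to combine them into one HoodieKey.
    'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
    'hoodie.datasource.write.keygenerator.class':
        'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.partitionpath.field': 'COL3',  # placeholder
    'hoodie.datasource.write.precombine.field': 'ts',       # placeholder
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
}

(df.write.format('hudi')  # use 'org.apache.hudi' on releases before 0.5.1
   .options(**hudi_options)
   .mode('append')
   .save('/tmp/hudi/my_table'))  # placeholder base path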
On Thu, Jul 16, 2020 at 1:00 PM Sivaprakash <[email protected]> wrote:

Looks like this property does the trick:

Property: hoodie.datasource.write.recordkey.field, Default: uuid
Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation, eg: a.b.c

However, I couldn't provide more than one column like this... COL1.COL2

'hoodie.datasource.write.recordkey.field': 'COL1.COL2'

Is anything wrong with the syntax? (I tried with a comma as well.)

On Thu, Jul 16, 2020 at 6:41 PM Sivaprakash <[email protected]> wrote:

Hello Balaji,

Thank you for your info!!

I tried those options, but here is what I find (I'm trying to understand how Hudi internally manages its files).

First write:

('NR001', 'YXXXTRE', 'YXXXTRE_445343')
('NR002', 'YYYYTRE', 'YYYYTRE_445343')
('NR003', 'YZZZTRE', 'YZZZTRE_445343')

Commit time for all the records: 20200716212533

Second write:

('NR001', 'YXXXTRE', 'YXXXTRE_445343')
('NR002', 'ZYYYTRE', 'ZYYYTRE_445343')
('NR003', 'YZZZTRE', 'YZZZTRE_445343')

(Only one record changed in my new dataset; the other two records are the same as in the first write, but after a snapshot/incremental read I see that the commit time is updated for all 3 records.)

Commit time for all the records: 20200716214544

- Does this mean that Hudi re-creates all 3 records? I thought it would rewrite only the 2nd record.
- I am trying to understand the storage volume efficiency here.
- Does some configuration have to be enabled to change this?

Configuration that I use:

- COPY_ON_WRITE, Append, Upsert
- The first column (NR001) is configured as hoodie.datasource.write.recordkey.field
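This behaviour follows from how upsert works: Hudi identifies records by record key only and does not diff payloads, so when the second batch re-sends all three keys, all three are treated as updates and stamped with the new commit time; with COPY_ON_WRITE, the whole parquet file slice containing those keys is rewritten. A quick way to observe this, continuing from the sketch earlier in the thread, is to inspect Hudi's metadata columns on a snapshot read (the base path is a placeholder; newer releases accept the base path without the glob):

snapshot = spark.read.format('hudi').load('/tmp/hudi/my_table/*/*')
snapshot.select('_hoodie_commit_time', '_hoodie_file_name', 'COL1').show(truncate=False)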
On Thu, Jul 16, 2020 at 6:10 PM Balaji Varadarajan <[email protected]> wrote:

Hi Sivaprakash,

Uniqueness of records is determined by the record key you specify to Hudi. Hudi supports filtering out existing records (by record key); by default, it upserts all incoming records. Please look at https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput for information on how to dedupe records based on record key.

Balaji.V

On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash <[email protected]> wrote:

This might be a basic question - I'm experimenting with Hudi (PySpark). I have used the Insert/Upsert options to write deltas into my data lake. However, one thing is not clear to me.

Step 1: I write 50 records.
Step 2: I write 50 records, out of which only *10 records have been changed* (I'm using upsert mode, and tried MERGE_ON_READ as well as COPY_ON_WRITE).
Step 3: I was expecting only the 10 changed records to be written, but it writes the whole 50 records. Is this normal behaviour? Does it mean I need to determine the delta myself and write only the changed records?

Am I missing something?
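Following up on the FAQ pointer above, deduping by record key is driven by the precombine field plus a couple of write options. A hypothetical sketch (option names as of the Hudi 0.5.x/0.6.x releases, so verify against the version you run; 'ts' is a placeholder column):

dedupe_options = {
    # Within one incoming batch, keep a single row per record key: the
    # one with the largest value in the precombine column.
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.combine.before.upsert': 'true',
    # For insert operations, drop incoming records whose key already
    # exists in the table instead of writing them again.
    'hoodie.datasource.write.insert.drop.duplicates': 'true',
}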
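As for reading back only what changed, an incremental query returns just the records written after a given commit instant, using commit times like those shown earlier in the thread. A sketch, assuming a recent option name (early 0.5.x releases used 'hoodie.datasource.view.type' instead of 'hoodie.datasource.query.type'):

incremental = (spark.read.format('hudi')
    # Return only records committed after this instant.
    .option('hoodie.datasource.query.type', 'incremental')
    .option('hoodie.datasource.read.begin.instanttime', '20200716212533')
    .load('/tmp/hudi/my_table'))  # placeholder base path
incremental.show(truncate=False)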
