Re: Handling delta

Sivaprakash Thu, 16 Jul 2020 10:01:24 -0700

Looks like this property does the trick

Property: hoodie.datasource.write.recordkey.field, Default: uuid
Record key field. Value to be used as the recordKey component of HoodieKey.
Actual value will be obtained by invoking .toString() on the field value.
Nested fields can be specified using the dot notation eg: a.b.c


However, I couldn't provide more than one column like this: COL1.COL2

'hoodie.datasource.write.recordkey.field': 'COL1.COL2'

Is anything wrong with the syntax? (I tried with a comma as well.)
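
A possible explanation, offered as a sketch rather than a confirmed answer: per the property description above, the dot notation addresses a nested field, so 'COL1.COL2' would be read as field COL2 inside a struct COL1, not as two columns. In the Hudi releases I am aware of, a multi-column key is given as a comma-separated list together with a key generator that understands it. The class name below (org.apache.hudi.keygen.ComplexKeyGenerator) is an assumption that depends on the Hudi version; the helper name and the commented-out df and base_path are hypothetical.

```python
def composite_key_options(key_columns):
    """Build Hudi write options for a record key made of several columns.

    Assumption: the installed Hudi version ships ComplexKeyGenerator under
    org.apache.hudi.keygen (older releases used a different package).
    """
    return {
        # Comma-separated column names, not dot notation (dots mean nesting).
        "hoodie.datasource.write.recordkey.field": ",".join(key_columns),
        # The default key generator expects a single field, so switch it.
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.ComplexKeyGenerator",
    }


options = composite_key_options(["COL1", "COL2"])

# Hypothetical usage with an existing DataFrame `df` and path `base_path`:
# df.write.format("org.apache.hudi").options(**options).mode("append").save(base_path)
```

The other write options already in use (table name, operation, partition path field and so on) would be merged into the same dictionary.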


On Thu, Jul 16, 2020 at 6:41 PM Sivaprakash <[email protected]>
wrote:

> Hello Balaji
>
> Thank you for your info !!
>
> I tried those options, and here is what I find (I'm trying to understand how
> Hudi internally manages its files):
>
> First Write
>
> 1.
>
> ('NR001', 'YXXXTRE', 'YXXXTRE_445343')
> ('NR002', 'YYYYTRE', 'YYYYTRE_445343')
> ('NR003', 'YZZZTRE', 'YZZZTRE_445343')
>
> Commit time for all the records 20200716212533
>
> 2.
>
> ('NR001', 'YXXXTRE', 'YXXXTRE_445343')
> ('NR002', 'ZYYYTRE', 'ZYYYTRE_445343')
> ('NR003', 'YZZZTRE', 'YZZZTRE_445343')
>
> (Only one record changed in my new dataset; the other two records are the
> same as in 1. But after a snapshot/incremental read I see that the commit
> time is updated for all 3 records.)
>
> Commit time for all the records 20200716214544
>
>
>    - Does it mean that Hudi re-creates 3 records again? I thought it
>    would create only the 2nd record
>    - Trying to understand the storage volume efficiency here
>    - Does some configuration have to be enabled to fix this?
>
> configuration that I use
>
>    - COPY_ON_WRITE, Append, Upsert
>    - First Column (NR001) is configured as
>    *hoodie.datasource.write.recordkey.field*
>
>
>
>
> On Thu, Jul 16, 2020 at 6:10 PM Balaji Varadarajan
> <[email protected]> wrote:
>
>>  Hi Sivaprakash,
>> Uniqueness of records is determined by the record key you specify to
>> hudi. Hudi supports filtering out existing records (by record key). By
>> default, it would upsert all incoming records.
>> Please look at
>> https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput
>>  for
>> information on how to dedupe records based on record key.
>>
>> Balaji.V
>>     On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash <
>> [email protected]> wrote:
>>
>>  This might be a basic question - I'm experimenting with Hudi (PySpark). I
>> have used the Insert/Upsert options to write deltas into my data lake.
>> However, one thing is not clear to me.
>>
>> Step 1: I write 50 records.
>> Step 2: I write 50 records, of which only *10 records have been
>> changed* (I'm using upsert mode and tried both MERGE_ON_READ and
>> COPY_ON_WRITE).
>> Step 3: I was expecting only 10 records to be written, but it writes the
>> whole 50 records. Is this normal behaviour? Does that mean I need to
>> determine the delta myself and write only those records?
>>
>> Am I missing something?
>>
>
>
>
> --
> - Prakash.
>


-- 
- Prakash.
