Hello Balaji

Thank you for the info!

I tried those options, and here is what I found (I'm trying to understand
how Hudi manages its files internally):

First Write

1.

('NR001', 'YXXXTRE', 'YXXXTRE_445343')
('NR002', 'YYYYTRE', 'YYYYTRE_445343')
('NR003', 'YZZZTRE', 'YZZZTRE_445343')

Commit time for all the records: 20200716212533

2.

('NR001', 'YXXXTRE', 'YXXXTRE_445343')
('NR002', 'ZYYYTRE', 'ZYYYTRE_445343')
('NR003', 'YZZZTRE', 'YZZZTRE_445343')

(Only one record changed in my new dataset; the other two records are the
same as in 1. However, after a snapshot/incremental read I see that the
commit time is updated for all 3 records.)

Commit time for all the records: 20200716214544


   - Does this mean that Hudi re-creates all 3 records? I thought it would
   write only the 2nd record (NR002), the one that changed.
   - I'm trying to understand the storage volume efficiency here.
   - Does some configuration have to be enabled to fix this?
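
To make the question concrete, here is a minimal sketch in plain Python (not Hudi code) of what "determining the delta myself" would look like for the two writes above. The helper name `changed_records` is hypothetical, and the sketch assumes both batches fit in memory and that the first column is the record key.

```python
# Minimal sketch, not Hudi code: find incoming rows that are new or changed
# compared with the previous batch, matching rows on the record key.
def changed_records(previous, incoming, key_index=0):
    """Return rows of `incoming` whose key is new or whose content differs."""
    previous_by_key = {row[key_index]: row for row in previous}
    return [row for row in incoming if previous_by_key.get(row[key_index]) != row]


first_write = [
    ('NR001', 'YXXXTRE', 'YXXXTRE_445343'),
    ('NR002', 'YYYYTRE', 'YYYYTRE_445343'),
    ('NR003', 'YZZZTRE', 'YZZZTRE_445343'),
]
second_write = [
    ('NR001', 'YXXXTRE', 'YXXXTRE_445343'),
    ('NR002', 'ZYYYTRE', 'ZYYYTRE_445343'),
    ('NR003', 'YZZZTRE', 'YZZZTRE_445343'),
]

delta = changed_records(first_write, second_write)
print(delta)  # [('NR002', 'ZYYYTRE', 'ZYYYTRE_445343')]
```

Only the NR002 row survives the comparison, which is the single record I expected the second write to touch.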

Configuration that I use:

   - COPY_ON_WRITE, Append, Upsert
   - First Column (NR001) is configured as
   *hoodie.datasource.write.recordkey.field*
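
As a rough sketch, the settings above could be written as the following options dictionary. The table name `my_table` and the column name `id` are placeholders I made up for illustration, and the commented-out write call assumes an existing Spark DataFrame `df` and a target path `base_path`, neither of which is shown here.

```python
# Sketch of the write options described above. Values marked "hypothetical"
# are illustrative placeholders, not taken from the actual job.
hudi_options = {
    "hoodie.table.name": "my_table",                  # hypothetical table name
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",  # hypothetical name of the first column
}

# Assuming a DataFrame `df` and a path `base_path`, the options would
# typically be passed to the Spark writer in Append mode like this:
# df.write.format("org.apache.hudi").options(**hudi_options).mode("append").save(base_path)

for key, value in sorted(hudi_options.items()):
    print(f"{key} = {value}")
```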




On Thu, Jul 16, 2020 at 6:10 PM Balaji Varadarajan
<[email protected]> wrote:

>  Hi Sivaprakash,
> Uniqueness of records is determined by the record key you specify to hudi.
> Hudi supports filtering out existing records (by record key). By default,
> it would upsert all incoming records.
> Please look at
> https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput
> for information on how to dedupe records based on record key.
>
> Balaji.V
>     On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash <
> [email protected]> wrote:
>
>  This might be a basic question - I'm experimenting with Hudi (PySpark). I
> have used Insert/Upsert options to write delta into my data lake. However,
> one thing is not clear to me.
>
> Step 1: I write 50 records.
> Step 2: I write 50 records, of which only *10 records have been
> changed* (I'm using upsert mode, and tried both MERGE_ON_READ and
> COPY_ON_WRITE).
> Step 3: I was expecting only 10 records to be written, but it writes all
> 50 records. Is this normal behaviour? Does it mean I need to determine
> the delta myself and write only those records?
>
> Am I missing something?
>



-- 
- Prakash.
