Re: Handling delta

Balaji Varadarajan Thu, 16 Jul 2020 09:11:27 -0700

Hi Sivaprakash,
Hudi determines the uniqueness of a record by the record key you configure. It 
can filter out incoming records whose key already exists in the table, but by 
default it upserts all incoming records. 
Please look at 
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput
 for information on how to dedupe records based on record key.
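
As a rough sketch of the two behaviours described above (upsert everything vs.
filter out keys that already exist), the write options could be assembled as
below. The helper name and its parameters are hypothetical. The option keys are
the Hudi datasource write options as I understand them, so please check them
against the configuration reference for your Hudi version.

```python
def hudi_write_options(table_name, record_key, precombine_field,
                       skip_existing_keys=False):
    """Build Hudi datasource write options as a plain dict.

    Hypothetical helper for illustration; verify the option keys against
    the configuration docs for the Hudi release you run.
    """
    options = {
        "hoodie.table.name": table_name,
        # Field that identifies a record; uniqueness is decided on this key.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to pick the winner when one batch repeats a key.
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Default behaviour from the reply: upsert every incoming record.
        "hoodie.datasource.write.operation": "upsert",
    }
    if skip_existing_keys:
        # Assumption: this filter applies to insert operations, so the
        # operation is switched from upsert to insert.
        options["hoodie.datasource.write.operation"] = "insert"
        options["hoodie.datasource.write.insert.drop.duplicates"] = "true"
    return options


# Usage with an existing PySpark DataFrame `df` (not executed here):
# opts = hudi_write_options("my_table", "id", "updated_at")
# df.write.format("org.apache.hudi").options(**opts).mode("append").save(base_path)
```

Since the filtering is by record key only, it would presumably drop the 10
changed records as well, because their keys already exist in the table.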


Balaji.V
    On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash 
<[email protected]> wrote:  
 
 This might be a basic question - I'm experimenting with Hudi (PySpark). I
have used the insert/upsert options to write deltas into my data lake. However,
one thing is not clear to me.

Step 1: I write 50 records.
Step 2: I write 50 records, of which only *10 records have been
changed* (I'm using upsert mode and tried both MERGE_ON_READ and
COPY_ON_WRITE).
Step 3: I was expecting only 10 records to be written, but all 50 records
are written. Is this normal behaviour? Does it mean I need to determine
the delta myself and write only those records?

Am I missing something?
