Hi Sivaprakash,
Uniqueness of records is determined by the record key you specify to Hudi. Hudi
supports filtering out existing records (by record key), but by default it
upserts all incoming records.
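As a rough sketch of what that looks like from PySpark, the write options below
name the record key explicitly. The helper hudi_upsert_options, the table name,
and the column names (id, updated_at, region) are made up for illustration; the
hoodie.* keys are the datasource write options as I understand them, so please
check them against the configuration docs for your Hudi version.

```python
def hudi_upsert_options(table_name, record_key, precombine_field,
                        partition_field, table_type="COPY_ON_WRITE"):
    """Build write options for an upsert keyed on `record_key`.

    Hypothetical helper: it only assembles a dict and does not touch Spark.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": table_type,
        # The field that decides whether two records are "the same" record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # The field used to pick a winner when the same key appears twice.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


# Assumed names, for illustration only.
opts = hudi_upsert_options("my_table", "id", "updated_at", "region")

# With a SparkSession and the Hudi bundle available, the write itself would be
# along these lines (not executed here; df and base_path are placeholders):
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

With options like these, every row in the incoming DataFrame is treated as an
upsert for its key, whether or not its values actually changed.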
Please look at
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput
for information on how to dedupe records based on record key.
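If you want only the changed rows to reach the writer, one option is to work
out the delta yourself before calling upsert. The sketch below shows the idea
on plain Python dicts rather than DataFrames; changed_records, the id key, and
the sample rows are all hypothetical, and on Spark the same comparison would
more likely be a join or exceptAll against the current snapshot of the table.

```python
def changed_records(existing, incoming, key="id"):
    """Return incoming rows that are new or differ from the existing row
    with the same key (whole-row comparison)."""
    existing_by_key = {row[key]: row for row in existing}
    return [row for row in incoming if existing_by_key.get(row[key]) != row]


# Step 1: 50 records already written.
existing = [{"id": i, "value": i} for i in range(50)]

# Step 2: 50 incoming records, of which only the first 10 have changed.
incoming = [{"id": i, "value": i + 100 if i < 10 else i} for i in range(50)]

delta = changed_records(existing, incoming)
print(len(delta))  # 10
```

Note that this compares whole rows, which is a different check from the
key-based filtering of existing records mentioned above.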
Balaji.V
On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash
<[email protected]> wrote:
This might be a basic question - I'm experimenting with Hudi (PySpark). I
have used the Insert/Upsert options to write deltas into my data lake. However,
one thing is not clear to me:
Step 1: I write 50 records.
Step 2: I write 50 records, of which only *10 records have been changed*
(I'm using upsert mode, and tried both MERGE_ON_READ and COPY_ON_WRITE).
Step 3: I was expecting only 10 records to be written, but it writes all
50 records. Is this normal behaviour? Does that mean I need to determine
the delta myself and write only those records?
Am I missing something?