Yes, those are the 10 records I mentioned in Step 1. But I re-wrote the whole dataset the second time as well. I see that commit_time is getting updated for all 50 records (which seems normal to me), but I'm not sure how to see/prove to myself that the data is not growing (to 100 records; it should still be only 50 records).
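Conceptually, an upsert keyed on the record key behaves like a dict update rather than an append, which is why the row count stays at 50. A minimal pure-Python sketch of that idea (no Spark or Hudi required; the `key`/`value` field names are made up for illustration):

```python
# Illustration only: an upsert keyed on the record key works like a
# dict update, so re-writing all 50 records never grows the table to 100.
table = {}  # record_key -> row

def upsert(rows):
    """Merge incoming rows into the table by record key."""
    for row in rows:
        table[row["key"]] = row  # existing keys are overwritten, not duplicated

# Step 1: write 50 records.
upsert([{"key": i, "value": "v1"} for i in range(50)])

# Step 2: re-write all 50 records, of which only 10 are actually changed.
upsert([{"key": i, "value": "v2" if i < 10 else "v1"} for i in range(50)])

print(len(table))  # 50 -- the table did not grow to 100
print(sum(1 for r in table.values() if r["value"] == "v2"))  # 10 changed rows
```

In the real table you can check the same thing by reading it back and counting rows after each write.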
On Thu, Jul 16, 2020 at 4:01 PM Allen Underwood <[email protected]> wrote:

> Hi Sivaprakash,
>
> So I'm by no means an expert on this, but I think you might find what
> you're looking for here:
> https://hudi.apache.org/docs/concepts.html
>
> I'm not sure I fully understand Step 2 you mentioned - "I'm writing 50
> records out of which only 10 records have been changed" - does that mean
> that you updated 10 records from Step 1? Or you're updating some of the
> other 40 records from Step 2?
>
> Either way, I guess the key is that all deltas will be written... it's after
> those records are written to disk that they are consolidated during the
> COMPACTION phase. I *BELIEVE* this is how it works.
> Take a look at COMPACTION under the timeline section here:
> https://hudi.apache.org/docs/concepts.html#timeline
>
> Hope that helps a bit.
>
> Allen
>
> On Thu, Jul 16, 2020 at 7:23 AM Sivaprakash <
> [email protected]> wrote:
>
>> This might be a basic question - I'm experimenting with Hudi (PySpark). I
>> have used the Insert/Upsert options to write deltas into my data lake.
>> However, one thing is not clear to me:
>>
>> Step 1: I write 50 records.
>> Step 2: I write 50 records, out of which only *10 records have been
>> changed* (I'm using upsert mode; tried with MERGE_ON_READ and also
>> COPY_ON_WRITE).
>> Step 3: I was expecting only 10 records to be written, but it writes the
>> whole 50 records - is this normal behaviour? Does that mean I need to
>> determine the delta myself and write only that?
>>
>> Am I missing something?
>>
>
> --
> *Allen Underwood*
> Principal Software Engineer
> Broadcom | Symantec Enterprise Division
> *Mobile*: 404.808.5926

--
- Prakash.
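For reference, the upsert described in the steps above can be sketched in PySpark roughly as follows. The table name, record-key/precombine field names, and path are hypothetical, and actually running the write requires a Spark session with the Hudi bundle on the classpath; only the options dict below is plain Python:

```python
# Hypothetical table name, field names, and path; the option keys are the
# standard Hudi datasource write options.
hudi_options = {
    "hoodie.table.name": "demo_table",
    "hoodie.datasource.write.recordkey.field": "key",   # dedupe key
    "hoodie.datasource.write.precombine.field": "ts",   # picks latest on ties
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}

# With a Spark session and a DataFrame `df` in scope:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/demo_table")
#
# To prove the table is not growing, read it back and count after each write:
# spark.read.format("hudi").load("/tmp/demo_table").count()
#
# The _hoodie_commit_time metadata column shows which commit last touched
# each row; in COPY_ON_WRITE it is rewritten for every row in a rewritten file.
```

Because the upsert is keyed on the record key, re-writing all 50 rows updates rows in place rather than appending, so the count stays at 50.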
