Yes, those are the 10 records I mentioned in Step 1. But I also re-wrote the
whole dataset the second time. I see that commit_time is getting updated for
all 50 records (which seems normal), but I'm not sure how to see/prove to
myself that the data is not growing (to 100 records; it should stay at
only 50 records).
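In case it helps, the upsert-by-key behaviour being discussed can be sketched in plain Python. This is a minimal model only, not Hudi's implementation: the dict key stands in for Hudi's record key, and the commit-time strings `t1`/`t2` are made-up placeholders.

```python
# Minimal model of upsert semantics keyed on a record key.
# Illustration only -- not how Hudi stores data internally.
table = {}

def upsert(records):
    """Insert new keys, overwrite existing ones; the row count never doubles."""
    for key, value, commit_time in records:
        table[key] = (value, commit_time)

# Step 1: write 50 records at commit t1
upsert([(k, f"v{k}", "t1") for k in range(50)])

# Step 2: rewrite all 50 records at commit t2 (only 10 actually changed)
upsert([(k, f"v{k}-new" if k < 10 else f"v{k}", "t2") for k in range(50)])

assert len(table) == 50  # still 50 rows, not 100
assert all(ct == "t2" for _, ct in table.values())  # commit_time advanced for all
```

For the real table, one way to convince yourself would be to read it back and count rows, e.g. `spark.read.format("hudi").load(base_path).count()` should stay at 50 (here `base_path` is whatever path you wrote the table to); counting distinct `_hoodie_record_key` values would likewise confirm no duplicate keys were added.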

On Thu, Jul 16, 2020 at 4:01 PM Allen Underwood
<[email protected]> wrote:

> Hi Sivaprakash,
>
> So I'm by no means an expert on this, but I think you might find what
> you're looking for here:
> https://hudi.apache.org/docs/concepts.html
>
> I'm not sure I fully understand Step 2 you mentioned - I'm writing 50
> records out of which only 10 records have been changed - does that mean
> that you updated 10 records from step 1?  Or you're updating some of the
> other 40 records from step 2?
>
> Either way I guess, the key is all deltas will be written...it's after
> those records are written to disk that they are consolidated during the
> COMPACTION phase.  I *BELIEVE* this is how it works.
> Take a look at COMPACTION under the timeline section here:
> https://hudi.apache.org/docs/concepts.html#timeline
>
> Hope that helps a bit.
>
> Allen
>
> On Thu, Jul 16, 2020 at 7:23 AM Sivaprakash <
> [email protected]> wrote:
>
>> This might be a basic question - I'm experimenting with Hudi (PySpark). I
>> have used the Insert/Upsert options to write deltas into my data lake.
>> However, one thing is not clear to me:
>>
>> Step 1:- I write 50 records
>> Step 2:- I'm writing 50 records, out of which only *10 records have been
>> changed* (I'm using upsert mode; tried with MERGE_ON_READ and also
>> COPY_ON_WRITE)
>> Step 3:- I was expecting only the 10 changed records to be written, but it
>> writes the whole 50 records. Is this normal behaviour? Does that mean I
>> need to determine the delta myself and write only those records?
>>
>> Am I missing something?
>>
>
>
> --
> *Allen Underwood*
> Principal Software Engineer
> Broadcom | Symantec Enterprise Division
> *Mobile*: 404.808.5926
>


-- 
- Prakash.
