> It should emit changes for each snapshot in the requested range.

Wing Yew has a good point here. +1
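For reference, a rough sketch of what that looks like against the core API
(this assumes the IncrementalChangelogScan API in org.apache.iceberg; the
table and snapshot IDs are placeholders). Each planned task is tied to a
single snapshot in the range, so the scan naturally emits changes per
snapshot rather than one squashed diff:

    import org.apache.iceberg.ChangelogScanTask;
    import org.apache.iceberg.IncrementalChangelogScan;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.CloseableIterable;

    // Scan the range S1..S3; snapshot IDs are placeholders.
    IncrementalChangelogScan scan = table.newIncrementalChangelogScan()
        .fromSnapshotInclusive(s1Id)
        .toSnapshot(s3Id);
    try (CloseableIterable<ChangelogScanTask> tasks = scan.planFiles()) {
      for (ChangelogScanTask task : tasks) {
        // changeOrdinal() orders the snapshots within the scanned range;
        // operation() is the change type (e.g. INSERT or DELETE).
        System.out.printf("snapshot=%d ordinal=%d op=%s%n",
            task.commitSnapshotId(), task.changeOrdinal(), task.operation());
      }
    }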
On Thu, Aug 22, 2024 at 8:46 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:

> First, thank you all for your responses to my question.
>
> For Peter's question, I believe that (b) is the correct behavior. It is
> also the current behavior when using copy-on-write (deletes and updates
> are still supported, but not using delete files). A changelog scan is an
> incremental scan over multiple snapshots. It should emit changes for each
> snapshot in the requested range. Spark provides additional functionality
> on top of the changelog scan to produce net changes for the requested
> range. See
> https://iceberg.apache.org/docs/latest/spark-procedures/#create_changelog_view.
> Basically, the create_changelog_view procedure uses a changelog scan
> (reading the changelog table, i.e., <table>.changes) to get a DataFrame,
> which is saved to a temporary Spark view that can then be queried; if
> net_changes is true, only the net changes are produced for this temporary
> view. This functionality uses ChangelogIterator.removeNetCarryovers
> (which is in Spark).
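[A minimal sketch of that procedure call through the Java API, for anyone
following along; the catalog/table names and snapshot IDs below are
placeholders:

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.active();

    // Build a changelog view over a snapshot range; with net_changes => true,
    // only the net changes for the range are kept (carry-over and
    // intermediate rows are dropped).
    spark.sql(
        "CALL my_catalog.system.create_changelog_view("
            + "table => 'db.tbl', "
            + "options => map('start-snapshot-id','1','end-snapshot-id','3'), "
            + "net_changes => true)");

    // The view name defaults to <table name>_changes.
    spark.sql("SELECT * FROM tbl_changes ORDER BY _change_ordinal").show();
]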
>
> On Thu, Aug 22, 2024 at 7:51 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> Peter, good question. In this case, (b) is the complete change history.
>> (a) is the squashed version.
>>
>> I would probably check how other changelog systems deal with this
>> scenario.
>>
>> On Thu, Aug 22, 2024 at 3:49 AM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> A technically different, but somewhat similar question:
>>>
>>> What is the expected behaviour when the `IncrementalScan` is created
>>> not for a single snapshot, but for multiple snapshots?
>>> S1 added PK1-V1
>>> S2 updated PK1-V1 to PK1-V1b (removed PK1-V1 and added PK1-V1b)
>>> S3 updated PK1-V1b to PK1-V1c (removed PK1-V1b and added PK1-V1c)
>>>
>>> Let's say we have
>>> *IncrementalScan.fromSnapshotInclusive(S1).toSnapshot(S3)*.
>>> Do we need to return:
>>> (a)
>>> - PK1,V1c,INSERTED
>>>
>>> Or is it OK to return:
>>> (b)
>>> - PK1,V1,INSERTED
>>> - PK1,V1,DELETED
>>> - PK1,V1b,INSERTED
>>> - PK1,V1b,DELETED
>>> - PK1,V1c,INSERTED
>>>
>>> I think (a) is the correct behaviour.
>>>
>>> Thanks,
>>> Peter
>>>
>>> On Wed, Aug 21, 2024 at 10:27 PM, Steven Wu <stevenz...@gmail.com>
>>> wrote:
>>>
>>>> Agree with everyone that option (a) is the correct behavior.
>>>>
>>>> On Wed, Aug 21, 2024 at 11:57 AM Steve Zhang
>>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>>
>>>>> I agree that option (a) is what the user expects for row-level
>>>>> changes.
>>>>>
>>>>> I feel the deletes added in a given snapshot provide the PK of each
>>>>> DELETED entry, while existing deletes are read together with the data
>>>>> files to find the DELETED value (V1b) and the rest of the columns.
>>>>>
>>>>> Thanks,
>>>>> Steve Zhang
>>>>>
>>>>> On Aug 20, 2024, at 6:06 PM, Wing Yew Poon
>>>>> <wyp...@cloudera.com.INVALID> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a PR open to add changelog support for the case where delete
>>>>> files are present (https://github.com/apache/iceberg/pull/10935). I
>>>>> have a question about what the changelog should emit in the following
>>>>> scenario:
>>>>>
>>>>> The table has a schema with a primary key/identifier column PK and an
>>>>> additional column V.
>>>>> In snapshot 1, we write a data file DF1 with rows
>>>>> PK1, V1
>>>>> PK2, V2
>>>>> etc.
>>>>> In snapshot 2, we write an equality delete file ED1 with PK=PK1, and
>>>>> a new data file DF2 with rows
>>>>> PK1, V1b
>>>>> (possibly other rows)
>>>>> In snapshot 3, we write an equality delete file ED2 with PK=PK1, and
>>>>> a new data file DF3 with rows
>>>>> PK1, V1c
>>>>> (possibly other rows)
>>>>>
>>>>> Thus, in snapshot 2 and snapshot 3, we update the row identified by
>>>>> PK1 with new values by using an equality delete and writing new data
>>>>> for the row.
>>>>> These are the files present in snapshot 3:
>>>>> DF1 (sequence number 1)
>>>>> DF2 (sequence number 2)
>>>>> DF3 (sequence number 3)
>>>>> ED1 (sequence number 2)
>>>>> ED2 (sequence number 3)
>>>>>
>>>>> The question I have is what the changelog should emit for snapshot 3.
>>>>> For snapshot 1, the changelog should emit a row for each row in DF1
>>>>> as INSERTED.
>>>>> For snapshot 2, it should emit a row for PK1, V1 as DELETED, and a
>>>>> row for PK1, V1b as INSERTED.
>>>>> For snapshot 3, I see two possibilities:
>>>>> (a)
>>>>> PK1,V1b,DELETED
>>>>> PK1,V1c,INSERTED
>>>>>
>>>>> (b)
>>>>> PK1,V1,DELETED
>>>>> PK1,V1b,DELETED
>>>>> PK1,V1c,INSERTED
>>>>>
>>>>> The interpretation for (b) is that both ED1 and ED2 apply to DF1,
>>>>> with ED1 being an existing delete file and ED2 being an added delete
>>>>> file for it. We discount ED1, apply ED2, and get a DELETED row for
>>>>> PK1,V1. ED2 also applies to DF2, from which we get a DELETED row for
>>>>> PK1,V1b.
>>>>>
>>>>> The interpretation for (a) is that ED1 is an existing delete file for
>>>>> DF1, and in snapshot 3 the row PK1,V1 already does not exist before
>>>>> the snapshot, so we do not emit a row for it. (We can think of it as:
>>>>> ED1 is already applied to DF1, and we only consider any additional
>>>>> rows that get deleted when ED2 is applied.)
>>>>>
>>>>> I lean towards (a), as I think it is more reflective of net changes.
>>>>> I am interested to hear what folks think.
>>>>>
>>>>> Thank you,
>>>>> Wing Yew
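To make interpretation (a) concrete, the rule being proposed amounts to
something like the following (a hypothetical helper for illustration, not
Iceberg API; Row and EqDelete are stand-in types):

    import java.util.List;

    // Under (a), a data file row is emitted as DELETED for snapshot S only
    // if a delete file added in S matches it AND no delete file that
    // already existed before S matched it (such rows were already gone).
    static boolean emitAsDeleted(Row row,
                                 List<EqDelete> addedDeletes,
                                 List<EqDelete> existingDeletes) {
      boolean deletedNow =
          addedDeletes.stream().anyMatch(d -> d.matches(row));
      boolean deletedBefore =
          existingDeletes.stream().anyMatch(d -> d.matches(row));
      return deletedNow && !deletedBefore;
    }

Applied to the example for snapshot 3: ED1 is an existing delete for DF1,
so PK1,V1 is suppressed, while ED2 is an added delete for DF2, so PK1,V1b
is emitted as DELETED, which gives output (a).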