> Consequently, we must throw on DELETE snapshots, even if users opt in to
> reading appends of OVERWRITE snapshots.

OVERWRITE snapshots themselves can contain deletes, so in this regard I
don't see a difference between DELETE and OVERWRITE snapshots.
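
For context, both are just commit operation strings on the snapshot, and
both kinds may carry delete files; a minimal sketch against the public
Iceberg API (`table` is an assumed Table handle):

    import org.apache.iceberg.DataOperations;
    import org.apache.iceberg.Snapshot;
    import org.apache.iceberg.Table;

    // walk the snapshot log and classify commits by operation type
    static void classify(Table table) {
      for (Snapshot snapshot : table.snapshots()) {
        String op = snapshot.operation();
        if (DataOperations.APPEND.equals(op)) {
          // only new data files, safe for append-only streaming reads
        } else if (DataOperations.OVERWRITE.equals(op)
            || DataOperations.DELETE.equals(op)) {
          // both may carry delete files and remove rows
        }
      }
    }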

Maximilian Michels <m...@apache.org> wrote (on Thu, Jun 26, 2025 at
11:03):

> Thank you for your feedback, Steven and Gyula!
>
> @Steven
>
> >> the Flink streaming read only consumes `append`-only commits. This
> >> refers to the snapshot-level `DataOperations` commit type, whereas you
> >> were talking about row-level appends, deletes, etc.
>
>
> Yes, there is no doubt that Flink streaming reads only process append
> snapshots. But the issue here is that they jump over "overwrite"
> snapshots, which contain both new data files and delete files. I don't
> think users are generally aware of this. Even worse, reading that way leads
> to an inconsistent view of the table state, which is a serious data
> integrity issue.
>
> >> 2. Add an option to read appended data of overwrite snapshots to allow
> >> users to de-duplicate downstream (opt-in config)
> > That is true for updates. But what about actual row deletion?
>
> Good point! I had only considered upserts, but unlike upserts, deletes
> cannot be reconstructed via downstream deduplication. Consequently, we
> must throw on DELETE snapshots, even if users opt in to reading appends
> of OVERWRITE snapshots.
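>
> To illustrate (a toy sketch; `AppendedRow`, `dedupe`, and the input list
> are made up): keeping the latest appended row per key reconstructs
> upserts, but a row-level delete emits no appended row at all, so the
> deleted key would incorrectly survive:
>
>     import java.util.HashMap;
>     import java.util.List;
>     import java.util.Map;
>
>     record AppendedRow(String key, String value, long ts) {}
>
>     // de-duplicate the appended rows of OVERWRITE snapshots downstream
>     static Map<String, AppendedRow> dedupe(List<AppendedRow> appendedRows) {
>       Map<String, AppendedRow> latest = new HashMap<>();
>       for (AppendedRow row : appendedRows) {
>         // upsert: the newer row per key wins
>         latest.merge(row.key(), row, (a, b) -> b.ts() >= a.ts() ? b : a);
>       }
>       // nothing here ever removes a key, so row-level deletes are lost
>       return latest;
>     }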
>
> > It would be interesting to hear what people think about this. In some
> > scenarios, this is probably better.
>
> > In the scenario of primarily streaming append (and occasionally batch
> > overwrite), is this always a preferred behavior?
>
>
> I think so, but I'm also curious about what others think.
>
> @Gyula
>
> >> There is a slight risk with (3) that it may break pipelines where the
> >> current behaviour is well understood and expected, but that's probably
> >> a very small minority of users.
>
>
> Agreed, that is a possible risk, but weighed against the hidden risk of
> effectively "losing" data, it seems a small one to take.
>
> -Max
>
>
> On Thu, Jun 26, 2025 at 9:18 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hi Max!
>>
>> I like this proposal, especially since proper streaming reads of deletes
>> seem to be quite a bit of work, judging by recent efforts.
>>
>> Giving an option to include the append parts of OVERWRITE snapshots (2)
>> is a great quick improvement that will unblock use cases where the
>> Iceberg table stores insert/upsert-only data, which is pretty common when
>> storing metadata/enrichment information, for example. The option can
>> later simply be removed once proper streaming reads are available.
>>
>> As for (3) as a default, I think it's reasonable, given that this is a
>> source of a lot of user confusion when the source simply doesn't read
>> *anything* even though new snapshots are committed to the table. There is
>> a slight risk with (3) that it may break pipelines where the current
>> behaviour is well understood and expected, but that's probably a very
>> small minority of users.
>>
>> Cheers
>> Gyula
>>
>> On Wed, Jun 25, 2025 at 9:27 PM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> the Flink streaming read only consumes `append`-only commits. This
>>> refers to the snapshot-level `DataOperations` commit type, whereas you
>>> were talking about row-level appends, deletes, etc.
>>>
>>> > 2. Add an option to read appended data of overwrite snapshots to allow
>>> users to de-duplicate downstream (opt-in config)
>>>
>>> That is true for updates. But what about actual row deletion?
>>>
>>> > 3. Throw an error if overwrite snapshots are discovered in an
>>> append-only scan and the option in (2) is not activated
>>>
>>> It would be interesting to hear what people think about this. In some
>>> scenarios, this is probably better.
>>>
>>> In the scenario of primarily streaming append (and occasionally batch
>>> overwrite), is this always a preferred behavior?
>>>
>>> On Wed, Jun 25, 2025 at 8:24 AM Maximilian Michels <m...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> It is well known that Flink and other Iceberg engines do not support
>>>> "merge-on-read" in streaming/incremental read mode. There are plans to
>>>> change that, see the "Improving Merge-On-Read Query Performance"
>>>> thread, but this is not what this message is about.
>>>>
>>>> I used to think that when we incrementally read from an Iceberg table
>>>> with Flink, we would simply not apply delete files, but still see
>>>> appends. That is not the case. Instead, we run an append-only table
>>>> scan [1], which will silently skip over any "overwrite" [2] snapshots.
>>>> That means the consumer misses both appends and deletes whenever the
>>>> producer runs in "merge-on-read" mode.
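>>>>
>>>> For reference, the behavior boils down to Iceberg's incremental
>>>> *append* scan, roughly like this (a simplified sketch, not the actual
>>>> planner code in [1]; the snapshot IDs are assumed inputs):
>>>>
>>>>     import org.apache.iceberg.IncrementalAppendScan;
>>>>     import org.apache.iceberg.Table;
>>>>
>>>>     // plan only files committed by APPEND snapshots in the range;
>>>>     // OVERWRITE snapshots are passed over without any error
>>>>     IncrementalAppendScan scan = table
>>>>         .newIncrementalAppendScan()
>>>>         .fromSnapshotExclusive(lastConsumedSnapshotId)
>>>>         .toSnapshot(currentSnapshotId);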
>>>>
>>>> Now, it is debatable whether it is a good idea to skip applying
>>>> deletes, but I'm not sure ignoring "overwrite" snapshots entirely is a
>>>> great idea either. Not every upsert effectively results in a delete +
>>>> append. Many upserts are merely appends. Reading appends at least
>>>> gives consumers the chance to de-duplicate later on, but skipping the
>>>> entire snapshot definitely means data loss downstream.
>>>>
>>>> I think we have three options:
>>>>
>>>> 1. Implement "merge-on-read" efficiently for incremental / streaming
>>>> reads (currently a work in progress)
>>>> 2. Add an option to read appended data of overwrite snapshots to allow
>>>> users to de-duplicate downstream (opt-in config)
>>>> 3. Throw an error if overwrite snapshots are discovered in an
>>>> append-only scan and the option in (2) is not activated
>>>>
>>>> (2) and (3) are stopgaps until we have (1).
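>>>>
>>>> To make (3) concrete, the guard could look something like this (a
>>>> hypothetical sketch; the opt-in flag name is made up):
>>>>
>>>>     import org.apache.iceberg.DataOperations;
>>>>     import org.apache.iceberg.Snapshot;
>>>>
>>>>     // hypothetical check while walking snapshots in an append-only scan
>>>>     static void check(Snapshot snapshot, boolean readAppendsOfOverwrites) {
>>>>       if (DataOperations.OVERWRITE.equals(snapshot.operation())
>>>>           && !readAppendsOfOverwrites) {
>>>>         throw new IllegalStateException(
>>>>             "OVERWRITE snapshot " + snapshot.snapshotId()
>>>>                 + " found in append-only scan; its rows would be skipped");
>>>>       }
>>>>     }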
>>>>
>>>> What do you think?
>>>>
>>>> Cheers,
>>>> Max
>>>>
>>>> [1]
>>>> https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L89
>>>> [2]
>>>> https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/api/src/main/java/org/apache/iceberg/DataOperations.java#L32
>>>>
>>>
