I agree with Peter; it would be weird to get an error on a DELETE snapshot
if you have already explicitly opted in to reading the appends of OVERWRITE
snapshots.

Users may not be able to control the type of snapshot being created, so
throwing in that case would render this feature useless.

Gyula

On Thu, Jun 26, 2025 at 5:34 PM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> > Consequently, we must throw on DELETE snapshots, even if users opt in
> > to reading appends of OVERWRITE snapshots.
>
> OVERWRITE snapshots themselves could still contain deletes. So in this
> regard, I don't see a difference between the DELETE and the OVERWRITE
> snapshots.
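>
> To make this concrete, here is a rough sketch against the Iceberg Java
> API (the table handle and snapshot id are placeholders) that checks
> whether an OVERWRITE snapshot actually carries deletes, either as delete
> files (merge-on-read) or as rewritten data files (copy-on-write):
>
>   import org.apache.iceberg.DataOperations;
>   import org.apache.iceberg.Snapshot;
>   import org.apache.iceberg.Table;
>
>   static boolean overwriteCarriesDeletes(Table table, long snapshotId) {
>     Snapshot snapshot = table.snapshot(snapshotId);
>     if (!DataOperations.OVERWRITE.equals(snapshot.operation())) {
>       return false;
>     }
>     // merge-on-read: the commit added row-level delete files
>     boolean addedDeletes =
>         snapshot.addedDeleteFiles(table.io()).iterator().hasNext();
>     // copy-on-write: the commit removed data files while rewriting rows
>     boolean removedData =
>         snapshot.removedDataFiles(table.io()).iterator().hasNext();
>     return addedDeletes || removedData;
>   }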
>
> On Thu, Jun 26, 2025 at 11:03 AM Maximilian Michels <m...@apache.org>
> wrote:
>
>> Thank you for your feedback Steven and Gyula!
>>
>> @Steven
>>
>>> the Flink streaming read only consumes `append`-only commits; this is
>>> a snapshot-level `DataOperation` type. You were talking about row-level
>>> appends, deletes, etc.
>>
>>
>> Yes, there is no doubt that Flink streaming reads only process append
>> snapshots. But the issue is that they jump over "overwrite" snapshots,
>> which contain both new data files and delete files. I don't think users
>> are generally aware of this. Even worse, reading that way leads to an
>> inconsistent view of the table state, which is a serious data integrity
>> issue.
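>>
>> To make the skipping visible, here is a rough sketch against the core
>> Iceberg API (the table handle is a placeholder) that lists every
>> snapshot an append-only scan would silently drop:
>>
>>   import org.apache.iceberg.DataOperations;
>>   import org.apache.iceberg.Snapshot;
>>   import org.apache.iceberg.Table;
>>
>>   static void logSkippedSnapshots(Table table) {
>>     for (Snapshot snapshot : table.snapshots()) {
>>       // anything other than a plain "append" is invisible to the
>>       // append-only incremental scan used for streaming reads
>>       if (!DataOperations.APPEND.equals(snapshot.operation())) {
>>         System.out.printf("skipped: id=%d op=%s%n",
>>             snapshot.snapshotId(), snapshot.operation());
>>       }
>>     }
>>   }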
>>
>> >> 2. Add an option to read appended data of overwrite snapshots to
>> >> allow users to de-duplicate downstream (opt-in config)
>> >
>> > For updates it is true. But what about actual row deletion?
>>
>> Good point! I had only considered upserts. Unlike upserts, deletes
>> cannot be reconstructed via downstream deduplication. Consequently, we
>> must throw on DELETE snapshots, even if users opt in to reading appends
>> of OVERWRITE snapshots.
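>>
>> For illustration, downstream de-duplication of the appended rows could
>> look roughly like the sketch below (Flink DataStream API; the record
>> shape, key, and sequence field are assumptions). It recovers the latest
>> version per key for upserts, but a deleted row simply stops arriving,
>> so there is nothing to retract:
>>
>>   import org.apache.flink.streaming.api.datastream.DataStream;
>>
>>   // assumed record shape: primary key plus a monotonically
>>   // increasing sequence number per version of a row
>>   record Upsert(String key, long sequence, String payload) {}
>>
>>   static DataStream<Upsert> dedupe(DataStream<Upsert> appends) {
>>     return appends
>>         .keyBy(Upsert::key)
>>         // keep the newest version per key; works for upserts, but a
>>         // delete never shows up as a record, so stale versions of
>>         // deleted rows are kept forever
>>         .reduce((a, b) -> a.sequence() >= b.sequence() ? a : b);
>>   }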
>>
>> It would be interesting to hear what people think about this. In some
>>> scenarios, this is probably better.
>>
>> In the scenario of primarily streaming append (and occasionally batch
>>> overwrite), is this always a preferred behavior?
>>
>>
>> I think so, but I'm also curious about what others think.
>>
>> @Gyula
>>
>>> There is a slight risk with 3 that it may break pipelines where the
>>> current behaviour is well understood and expected but that's probably a
>>> very small minority of the users.
>>
>>
>> Agreed, that is a possible risk, but if we consider the hidden risk of
>> effectively "losing" data, it looks like a small risk to take.
>>
>> -Max
>>
>>
>> On Thu, Jun 26, 2025 at 9:18 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>
>>> Hi Max!
>>>
>>> I like this proposal, especially since proper streaming reads of
>>> deletes seem to be quite a bit of work, judging by recent efforts.
>>>
>>> Giving an option to include the append parts of OVERWRITE snapshots (2)
>>> is a great quick improvement that will unblock use cases where the
>>> Iceberg table stores insert/upsert-only data, which is pretty common
>>> when storing metadata/enrichment information, for example. The option
>>> can then simply be removed later, once proper streaming reads are
>>> available.
>>>
>>> As for (3) as a default, I think it's reasonable given that this is a
>>> source of a lot of user confusion: users see that the source simply
>>> doesn't read *anything* even though new snapshots are committed to the
>>> table.
>>> There is a slight risk with 3 that it may break pipelines where the
>>> current behaviour is well understood and expected but that's probably a
>>> very small minority of the users.
>>>
>>> Cheers
>>> Gyula
>>>
>>> On Wed, Jun 25, 2025 at 9:27 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> the Flink streaming read only consumes `append`-only commits; this is
>>>> a snapshot-level `DataOperation` type. You were talking about row-level
>>>> appends, deletes, etc.
>>>>
>>>> > 2. Add an option to read appended data of overwrite snapshots to allow
>>>> users to de-duplicate downstream (opt-in config)
>>>>
>>>> For updates it is true. But what about actual row deletion?
>>>>
>>>> > 3. Throw an error if overwrite snapshots are discovered in an
>>>> append-only scan and the option in (2) is not activated
>>>>
>>>> It would be interesting to hear what people think about this. In some
>>>> scenarios, this is probably better.
>>>>
>>>> In the scenario of primarily streaming append (and occasionally batch
>>>> overwrite), is this always a preferred behavior?
>>>>
>>>> On Wed, Jun 25, 2025 at 8:24 AM Maximilian Michels <m...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> It is well known that Flink and other Iceberg engines do not support
>>>>> "merge-on-read" in streaming/incremental read mode. There are plans to
>>>>> change that, see the "Improving Merge-On-Read Query Performance"
>>>>> thread, but this is not what this message is about.
>>>>>
>>>>> I used to think that when we incrementally read from an Iceberg table
>>>>> with Flink, we would simply not apply delete files, but still see
>>>>> appends. That is not the case. Instead, we run an append-only table
>>>>> scan [1], which will silently skip over any "overwrite" [2] snapshots.
>>>>> That means the consumer will miss both the appends and the deletes
>>>>> whenever the producer runs in "merge-on-read" mode.
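>>>>>
>>>>> For reference, the planning in [1] boils down to roughly the
>>>>> following (a sketch against the core Iceberg API; the snapshot-range
>>>>> boundaries are placeholders):
>>>>>
>>>>>   import org.apache.iceberg.IncrementalAppendScan;
>>>>>   import org.apache.iceberg.Table;
>>>>>
>>>>>   // an incremental *append* scan only plans files from snapshots
>>>>>   // whose operation is "append"; "overwrite" snapshots inside the
>>>>>   // range are silently ignored rather than failing the scan
>>>>>   static void planAppendsOnly(Table table, long lastConsumedId) {
>>>>>     IncrementalAppendScan scan = table.newIncrementalAppendScan()
>>>>>         .fromSnapshotExclusive(lastConsumedId)
>>>>>         .toSnapshot(table.currentSnapshot().snapshotId());
>>>>>     scan.planFiles().forEach(task ->
>>>>>         System.out.println("planned: " + task.file().path()));
>>>>>   }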
>>>>>
>>>>> Now, it is debatable whether it is a good idea to skip applying
>>>>> deletes, but I'm not sure ignoring "overwrite" snapshots entirely is a
>>>>> great idea either. Not every upsert effectively results in a delete +
>>>>> append. Many upserts are merely appends. Reading appends at least
>>>>> gives consumers the chance to de-duplicate later on, but skipping the
>>>>> entire snapshot definitely means data loss downstream.
>>>>>
>>>>> I think we have three options:
>>>>>
>>>>> 1. Implement "merge-on-read" efficiently for incremental / streaming
>>>>> reads (currently a work in progress)
>>>>> 2. Add an option to read appended data of overwrite snapshots to allow
>>>>> users to de-duplicate downstream (opt-in config)
>>>>> 3. Throw an error if overwrite snapshots are discovered in an
>>>>> append-only scan and the option in (2) is not activated
>>>>>
>>>>> (2) and (3) are stop gaps until we have (1).
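>>>>>
>>>>> To sketch how (2) and (3) could fit together (this is not the actual
>>>>> FlinkSplitPlanner code; the option flag and method are hypothetical):
>>>>>
>>>>>   import org.apache.iceberg.DataOperations;
>>>>>   import org.apache.iceberg.Snapshot;
>>>>>   import org.apache.iceberg.Table;
>>>>>
>>>>>   static void planSnapshot(
>>>>>       Table table, Snapshot snapshot, boolean readOverwriteAppends) {
>>>>>     String op = snapshot.operation();
>>>>>     if (DataOperations.OVERWRITE.equals(op) && !readOverwriteAppends) {
>>>>>       // (3): fail loudly instead of silently skipping the snapshot
>>>>>       throw new IllegalStateException(
>>>>>           "Append-only scan hit overwrite snapshot "
>>>>>               + snapshot.snapshotId());
>>>>>     }
>>>>>     if (DataOperations.APPEND.equals(op)
>>>>>         || DataOperations.OVERWRITE.equals(op)) {
>>>>>       // (2): expose only the data files added by the snapshot;
>>>>>       // deletes are left to downstream de-duplication
>>>>>       snapshot.addedDataFiles(table.io()).forEach(file ->
>>>>>           System.out.println("plan: " + file.path()));
>>>>>     } else {
>>>>>       // e.g. "delete": not representable in an append-only stream
>>>>>       throw new IllegalStateException("Unsupported operation " + op);
>>>>>     }
>>>>>   }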
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Cheers,
>>>>> Max
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L89
>>>>> [2]
>>>>> https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/api/src/main/java/org/apache/iceberg/DataOperations.java#L32
>>>>>
>>>>
