Hi Max!

I like this proposal, especially since proper streaming reads of deletes
seem to be quite a bit of work based on recent efforts.

Giving users an option to include the append parts of OVERWRITE snapshots
(2) is a great quick improvement that will unblock use cases where the
Iceberg table stores insert/upsert-only data, which is pretty common when
storing metadata/enrichment information, for example. The option can then
simply be removed once a proper streaming read is available.
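
To make (2) concrete, here is a minimal sketch of what the opt-in could
look like on the Flink source builder. The option key
"streaming-include-overwrite-appends" is made up for illustration and
does not exist today:

import java.time.Duration;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.source.IcebergSource;

// table location is a placeholder
TableLoader tableLoader =
    TableLoader.fromHadoopTable("hdfs://warehouse/db/events");

IcebergSource<RowData> source =
    IcebergSource.forRowData()
        .tableLoader(tableLoader)
        .streaming(true)
        .monitorInterval(Duration.ofSeconds(30))
        // hypothetical opt-in from (2): also emit the data files that
        // OVERWRITE snapshots append, so consumers can de-duplicate
        .set("streaming-include-overwrite-appends", "true")
        .build();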

As for (3) as a default, I think it's reasonable, given that this is a
source of a lot of user confusion when users see that the source simply
doesn't read *anything* even though new snapshots are committed to the
table. There is a slight risk with (3) that it may break pipelines where
the current behaviour is well understood and expected, but that probably
affects only a very small minority of users.

Cheers
Gyula

On Wed, Jun 25, 2025 at 9:27 PM Steven Wu <stevenz...@gmail.com> wrote:

> the Flink streaming read only consumes `append` commits. This refers to
> the snapshot commit `DataOperations` type. You were talking about
> row-level appends, deletes, etc.
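>
> For context, the commit type is exposed on each snapshot, so you can
> check which snapshots an append-only streaming read would consume.
> A small sketch against the core API (assumes `table` is an already
> loaded `Table`):
>
> import org.apache.iceberg.DataOperations;
> import org.apache.iceberg.Snapshot;
> import org.apache.iceberg.Table;
>
> // merge-on-read upserts typically commit as DataOperations.OVERWRITE,
> // while plain inserts commit as DataOperations.APPEND
> for (Snapshot snapshot : table.snapshots()) {
>   boolean consumed = DataOperations.APPEND.equals(snapshot.operation());
>   System.out.printf(
>       "snapshot %d op=%s consumed=%s%n",
>       snapshot.snapshotId(), snapshot.operation(), consumed);
> }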
>
> > 2. Add an option to read appended data of overwrite snapshots to allow
> users to de-duplicate downstream (opt-in config)
>
> For updates, that is true. But what about actual row deletions?
>
> > 3. Throw an error if overwrite snapshots are discovered in an
> append-only scan and the option in (2) is not activated
>
> It would be interesting to hear what people think about this. In some
> scenarios, this is probably better.
>
> In the scenario of primarily streaming appends (and occasional batch
> overwrites), is this always the preferred behavior?
>
> On Wed, Jun 25, 2025 at 8:24 AM Maximilian Michels <m...@apache.org> wrote:
>
>> Hi,
>>
>> It is well known that Flink and other Iceberg engines do not support
>> "merge-on-read" in streaming/incremental read mode. There are plans to
>> change that, see the "Improving Merge-On-Read Query Performance"
>> thread, but this is not what this message is about.
>>
>> I used to think that when we incrementally read from an Iceberg table
>> with Flink, we would simply not apply delete files but still see
>> appends. That is not the case. Instead, we run an append-only table
>> scan [1], which silently skips over any "overwrite" [2] snapshots.
>> That means the consumer misses both the appends and the deletes
>> whenever the producer runs in "merge-on-read" mode.
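>>
>> To illustrate with the core API: the streaming source effectively
>> plans an incremental *append* scan, so an overwrite snapshot in the
>> range contributes no tasks at all. A sketch (`table` and the two
>> snapshot ids are assumed to be in scope):
>>
>> import org.apache.iceberg.FileScanTask;
>> import org.apache.iceberg.IncrementalAppendScan;
>> import org.apache.iceberg.io.CloseableIterable;
>>
>> IncrementalAppendScan scan =
>>     table.newIncrementalAppendScan()
>>         .fromSnapshotExclusive(lastConsumedSnapshotId)
>>         .toSnapshot(latestSnapshotId);
>>
>> // only files added by "append" snapshots show up here; files added
>> // (or deleted) by "overwrite" snapshots in the range are skipped
>> try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
>>   tasks.forEach(task -> System.out.println(task.file().path()));
>> }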
>>
>> Now, it is debatable whether it is a good idea to skip applying
>> deletes, but I'm not sure ignoring "overwrite" snapshots entirely is a
>> great idea either. Not every upsert effectively results in a delete +
>> append. Many upserts are merely appends. Reading appends at least
>> gives consumers the chance to de-duplicate later on, but skipping the
>> entire snapshot definitely means data loss downstream.
>>
>> I think we have three options:
>>
>> 1. Implement "merge-on-read" efficiently for incremental / streaming
>> reads (currently a work in progress)
>> 2. Add an option to read appended data of overwrite snapshots to allow
>> users to de-duplicate downstream (opt-in config)
>> 3. Throw an error if overwrite snapshots are discovered in an
>> append-only scan and the option in (2) is not activated
>>
>> (2) and (3) are stopgaps until we have (1); a rough sketch of how
>> they could combine is below.
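>>
>> The decision per discovered snapshot could look roughly like this
>> (a hypothetical planner hook, nothing like it exists in
>> FlinkSplitPlanner today; the boolean flag would come from the opt-in
>> config in (2)):
>>
>> import org.apache.iceberg.DataOperations;
>> import org.apache.iceberg.Snapshot;
>>
>> // returns true if the files *added* by an overwrite snapshot should
>> // be planned, e.g. via snapshot.addedDataFiles(table.io())
>> static boolean planOverwriteAppends(
>>     Snapshot snapshot, boolean readOverwriteAppends) {
>>   if (!DataOperations.OVERWRITE.equals(snapshot.operation())) {
>>     return false; // plain "append" snapshots take the existing path
>>   }
>>   if (readOverwriteAppends) {
>>     return true; // (2): emit appended files, de-duplicate downstream
>>   }
>>   // (3): fail loudly instead of silently dropping the snapshot
>>   throw new IllegalStateException(
>>       "Append-only scan hit OVERWRITE snapshot " + snapshot.snapshotId());
>> }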
>>
>> What do you think?
>>
>> Cheers,
>> Max
>>
>> [1]
>> https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L89
>> [2]
>> https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/api/src/main/java/org/apache/iceberg/DataOperations.java#L32
>>
>
