Sounds good to me!

Gyula

On 27 Jun 2025, at 13:48, Maximilian Michels <m...@apache.org> wrote:


In my understanding, overwrite snapshots are specifically there to overwrite data, i.e. every delete comes paired with an append. That is also the case for the Flink Iceberg sink. There may be other writers which use overwrite snapshots differently. But point taken, we may also need an opt-in for skipping delete snapshots, in addition to overwrite snapshots.

I've taken a look at how Spark handles this topic. Turns out, they came up with a similar approach: 

[Spark] Iceberg only supports reading data from append snapshots. Overwrite snapshots cannot be processed and will cause an exception by default. Overwrites may be ignored by setting streaming-skip-overwrite-snapshots=true. Similarly, delete snapshots will cause an exception by default, and deletes may be ignored by setting streaming-skip-delete-snapshots=true.
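
For reference, this is how those Spark options are passed in a structured streaming read. A minimal sketch in Java; the `spark` session and the table identifier "db.table" are placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Spark streaming read that opts in to skipping non-append
    // snapshots, using the options quoted above.
    Dataset<Row> stream = spark.readStream()
        .format("iceberg")
        .option("streaming-skip-overwrite-snapshots", "true")
        .option("streaming-skip-delete-snapshots", "true")
        .load("db.table");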

Spark fails by default and has options to ignore overwrite and delete snapshots entirely. For Flink, perhaps we can make the following changes to align with Spark and also provide some flexibility for users:

For Flink incremental / streaming reads:
A. By default, throw on non-append snapshots (like Spark, important for maintaining an accurate table view)
B. Add options to skip overwrite / delete snapshots (like Spark)
C. Add an option to read the append data of overwrite snapshots (allows downstream deduplication of upserts; see the sketch below)
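
To make B and C concrete, here is a purely hypothetical sketch against the FLIP-27 IcebergSource builder. None of these option keys exist today; the names just mirror Spark's and only illustrate the proposals above (`tableLoader` is a placeholder):

    import org.apache.flink.table.data.RowData;
    import org.apache.iceberg.flink.TableLoader;
    import org.apache.iceberg.flink.source.IcebergSource;

    // Hypothetical read options; these keys do NOT exist yet.
    IcebergSource<RowData> source = IcebergSource.forRowData()
        .tableLoader(tableLoader)
        .streaming(true)
        // B: skip overwrite / delete snapshots entirely (like Spark)
        .set("streaming-skip-overwrite-snapshots", "true")
        .set("streaming-skip-delete-snapshots", "true")
        // C: alternatively, read only the appended files of overwrites
        .set("streaming-read-appends-of-overwrite-snapshots", "true")
        .build();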

What do you think?

Thanks,
Max

On Thu, Jun 26, 2025 at 5:38 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
I agree with Peter: it would be weird to get an error on a DELETE snapshot if you already explicitly opted in to reading the appends of OVERWRITE snapshots.

Users may not be able to control the type of snapshot being created, so this would otherwise render the feature useless.

Gyula

On Thu, Jun 26, 2025 at 5:34 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> Consequently, we must throw on DELETE snapshots, even if users opt in to reading the appends of OVERWRITE snapshots.

OVERWRITE snapshots themselves could still contain deletes. So in this regard, I don't see a difference between DELETE and OVERWRITE snapshots.
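
To illustrate the point, an overwrite snapshot's summary map records both kinds of files. A minimal sketch, with `snapshot` as a placeholder; the summary keys are the standard ones written into table metadata:

    import java.util.Map;
    import org.apache.iceberg.DataOperations;
    import org.apache.iceberg.Snapshot;

    // An OVERWRITE snapshot typically reports both added data files
    // and added delete files in its summary.
    Map<String, String> summary = snapshot.summary();
    if (DataOperations.OVERWRITE.equals(snapshot.operation())) {
      System.out.printf(
          "added-data-files=%s, added-delete-files=%s%n",
          summary.get("added-data-files"), summary.get("added-delete-files"));
    }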

On Thu, Jun 26, 2025 at 11:03 AM Maximilian Michels <m...@apache.org> wrote:
Thank you for your feedback Steven and Gyula!

> The Flink streaming read only consumes `append` commits. This refers to the snapshot-level `DataOperations` type, whereas you were talking about row-level appends, deletes, etc.

Yes, there is no doubt that Flink streaming reads only process append snapshots. But the issue here is that they jump over "overwrite" snapshots, which contain both new data files and delete files. I don't think users are generally aware of this. Even worse, reading that way leads to an inconsistent view of the table state, which is a serious data integrity issue.
 
>>2. Add an option to read appended data of overwrite snapshots to allow
>>users to de-duplicate downstream (opt-in config)
>For updates it is true. But what about actual row deletion?

Good point! I had only considered upserts, but unlike upserts, deletes definitely cannot be reconstructed via downstream deduplication. Consequently, we must throw on DELETE snapshots, even if users opt in to reading the appends of OVERWRITE snapshots.

> It would be interesting to hear what people think about this. In some scenarios, this is probably better.
> In the scenario of primarily streaming append (and occasionally batch overwrite), is this always a preferred behavior?

I think so, but I'm also curious about what others think.

> There is a slight risk with (3) that it may break pipelines where the current behaviour is well understood and expected, but that's probably a very small minority of users.

Agreed, that is a possible risk, but if we consider the hidden risk of effectively "losing" data, it looks like a small risk to take.

-Max


On Thu, Jun 26, 2025 at 9:18 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
Hi Max!

I like this proposal, especially since proper streaming reads of deletes seem to be quite a bit of work, based on recent efforts.

Giving an option to include the append parts of OVERWRITE snapshots (2) is a great quick improvement that will unblock use cases where the Iceberg table is used to store insert/upsert-only data, which is pretty common when storing metadata or enrichment information, for example. This can later simply be removed once proper streaming reads are available.

As for (3) as a default, I think it's reasonable, given that this is a source of a lot of user confusion when users see that the source simply doesn't read *anything* even though new snapshots are added to the table.
There is a slight risk with (3) that it may break pipelines where the current behaviour is well understood and expected, but that's probably a very small minority of users.

Cheers
Gyula

On Wed, Jun 25, 2025 at 9:27 PM Steven Wu <stevenz...@gmail.com> wrote:
The Flink streaming read only consumes `append` commits. This refers to the snapshot-level `DataOperations` type, whereas you were talking about row-level appends, deletes, etc.

> 2. Add an option to read appended data of overwrite snapshots to allow
> users to de-duplicate downstream (opt-in config)

For updates it is true. But what about actual row deletion?

> 3. Throw an error if overwrite snapshots are discovered in an
> append-only scan and the option in (2) is not activated

It would be interesting to hear what people think about this. In some scenarios, this is probably better.

In the scenario of primarily streaming append (and occasionally batch overwrite), is this always a preferred behavior?

On Wed, Jun 25, 2025 at 8:24 AM Maximilian Michels <m...@apache.org> wrote:
Hi,

It is well known that Flink and other Iceberg engines do not support
"merge-on-read" in streaming/incremental read mode. There are plans to
change that (see the "Improving Merge-On-Read Query Performance"
thread), but this is not what this message is about.

I used to think that when we incrementally read from an Iceberg table
with Flink, we would simply not apply delete files, but still see
appends. That is not the case. Instead, we run an append-only table
scan [1], which will silently skip over any "overwrite" [2] snapshots.
That means that the consumer will skip appends and deletes if the
producer runs in "merge-on-read" mode.
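
For illustration, this is roughly what the Flink split planner [1] does under the hood, expressed against the core API. A minimal sketch; `table` and the two snapshot IDs are placeholders:

    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.IncrementalAppendScan;
    import org.apache.iceberg.io.CloseableIterable;

    // Append-only incremental scan: only snapshots with operation
    // "append" contribute files. "overwrite" snapshots in the range
    // are skipped entirely, including their newly added data files.
    IncrementalAppendScan scan = table.newIncrementalAppendScan()
        .fromSnapshotExclusive(lastConsumedSnapshotId)
        .toSnapshot(currentSnapshotId);

    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      tasks.forEach(task -> System.out.println(task.file().path()));
    }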

Now, it is debatable whether it is a good idea to skip applying
deletes, but I'm not sure ignoring "overwrite" snapshots entirely is a
great idea either. Not every upsert effectively results in a delete +
append. Many upserts are merely appends. Reading appends at least
gives consumers the chance to de-duplicate later on, but skipping the
entire snapshot definitely means data loss downstream.

I think we have three options:

1. Implement "merge-on-read" efficiently for incremental / streaming
reads (currently a work in progress)
2. Add an option to read appended data of overwrite snapshots to allow
users to de-duplicate downstream (opt-in config)
3. Throw an error if overwrite snapshots are discovered in an
append-only scan and the option in (2) is not activated

(2) and (3) are stop-gaps until we have (1). A rough sketch of the
check described in (3) follows below.
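
Something along these lines, purely as a hypothetical sketch; the
method and the `readAppendsOfOverwrites` flag are made up for
illustration:

    import org.apache.iceberg.DataOperations;
    import org.apache.iceberg.Snapshot;

    // Hypothetical guard for (3): fail on non-append snapshots unless
    // the opt-in from (2) covers them. The flag name is made up.
    static void validateSnapshot(Snapshot snapshot, boolean readAppendsOfOverwrites) {
      String op = snapshot.operation();
      if (DataOperations.APPEND.equals(op)) {
        return; // safe for append-only incremental reads
      }
      if (DataOperations.OVERWRITE.equals(op) && readAppendsOfOverwrites) {
        return; // user opted in to reading only the appended files
      }
      throw new IllegalStateException(
          String.format(
              "Cannot consume snapshot %d with operation '%s' in an "
                  + "append-only incremental scan",
              snapshot.snapshotId(), op));
    }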

What do you think?

Cheers,
Max

[1] https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L89
[2] https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/api/src/main/java/org/apache/iceberg/DataOperations.java#L32
