Hi,

It is well known that Flink and other Iceberg engines do not support
"merge-on-read" in streaming/incremental read mode. There are plans to
change that (see the "Improving Merge-On-Read Query Performance"
thread), but that is not what this message is about.

I used to think that when we incrementally read from an Iceberg table
with Flink, we would simply not apply delete files, but still see
appends. That is not the case. Instead, we run an append-only table
scan [1], which silently skips over any "overwrite" [2] snapshots.
That means the consumer misses both the appends and the deletes
whenever the producer runs in "merge-on-read" mode.

Now, it is debatable whether it is a good idea to skip applying
deletes, but I'm not sure ignoring "overwrite" snapshots entirely is a
great idea either. Not every upsert effectively results in a delete +
append; many upserts are merely appends. Reading the appends at least
gives consumers a chance to de-duplicate later on, whereas skipping
the entire snapshot guarantees data loss downstream.

I think we have three options:

1. Implement "merge-on-read" efficiently for incremental / streaming
reads (currently a work in progress)
2. Add an option to read the appended data of overwrite snapshots to
allow users to de-duplicate downstream (opt-in config)
3. Throw an error if overwrite snapshots are discovered in an
append-only scan and the option from (2) is not enabled

(2) and (3) are stopgaps until we have (1); see the sketch below.
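
To make (2) and (3) concrete, here is roughly what the planner could
do when it hits an overwrite snapshot. The class name and the
"includeOverwriteAppends" flag are made up for the sake of discussion;
Snapshot#addedDataFiles(FileIO) is existing API:

  import org.apache.iceberg.DataFile;
  import org.apache.iceberg.DataOperations;
  import org.apache.iceberg.Snapshot;
  import org.apache.iceberg.Table;

  class OverwriteAwarePlanner {

    static void handleSnapshot(Table table, Snapshot snapshot,
        boolean includeOverwriteAppends) {
      if (!DataOperations.OVERWRITE.equals(snapshot.operation())) {
        return; // plain appends keep getting planned as today
      }
      if (includeOverwriteAppends) {
        // (2): plan only the data files the snapshot added; delete
        // files stay unapplied, so consumers must de-duplicate
        for (DataFile file : snapshot.addedDataFiles(table.io())) {
          System.out.println("would plan: " + file.path());
        }
      } else {
        // (3): fail loudly instead of silently dropping the snapshot
        throw new IllegalStateException(
            "Append-only incremental scan encountered overwrite snapshot "
                + snapshot.snapshotId());
      }
    }
  }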

What do you think?

Cheers,
Max

[1] 
https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L89
[2] 
https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/api/src/main/java/org/apache/iceberg/DataOperations.java#L32
