That's essentially what I propose, minus the WARN message, which is a great addition. The flag to skip overwrite snapshots should probably stay forever, similar to Spark's flags for skipping overwrite / delete snapshots. Here's the proposal again (a rough sketch of the resulting snapshot handling follows the list):

For Flink incremental / streaming reads:
A. By default, throw on non-append snapshots (like Spark, important for maintaining an accurate table view)
B. Add options to skip overwrite / delete snapshots (like Spark)
C. Add an option to read the append data of overwrite snapshots (allows down-the-line deduplication of upserts)
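To make this concrete, here is a rough sketch of what (A)-(C) could look like when planning an incremental read. The SnapshotValidator class and its three flags are made up for illustration; only Snapshot, DataOperations, and the operation strings are actual Iceberg API:

import org.apache.iceberg.DataOperations;
import org.apache.iceberg.Snapshot;

/** Sketch of proposal (A)-(C); class name and flags are illustrative. */
class SnapshotValidator {
  private final boolean skipOverwriteSnapshots;  // (B), mirrors Spark's streaming-skip-overwrite-snapshots
  private final boolean skipDeleteSnapshots;     // (B), mirrors Spark's streaming-skip-delete-snapshots
  private final boolean readAppendsOfOverwrites; // (C), hypothetical new Flink option

  SnapshotValidator(boolean skipOverwrites, boolean skipDeletes, boolean readAppendsOfOverwrites) {
    this.skipOverwriteSnapshots = skipOverwrites;
    this.skipDeleteSnapshots = skipDeletes;
    this.readAppendsOfOverwrites = readAppendsOfOverwrites;
  }

  /** Returns true if the snapshot should be planned, false if it should be skipped. */
  boolean shouldRead(Snapshot snapshot) {
    switch (snapshot.operation()) {
      case DataOperations.APPEND:
        return true;
      case DataOperations.OVERWRITE:
        if (readAppendsOfOverwrites) {
          return true; // (C): plan only the added data files, de-duplicate downstream
        } else if (skipOverwriteSnapshots) {
          return false; // (B): explicit opt-in to the old skipping behavior
        }
        // (A): fail loudly by default instead of silently dropping data
        throw new IllegalStateException(
            "Cannot read overwrite snapshot " + snapshot.snapshotId());
      case DataOperations.DELETE:
        if (skipDeleteSnapshots) {
          return false; // (B)
        }
        throw new IllegalStateException(
            "Cannot read delete snapshot " + snapshot.snapshotId());
      default:
        return false; // e.g. "replace" (compaction) rewrites files without changing data
    }
  }
}

Keeping the decision in one place like this would also make it easy to later route OVERWRITE snapshots into a proper merge-on-read path instead of throwing.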
We could change (A) to just log a warning. Delaying throwing an exception for (A) until 2.0 means we are OK with randomly skipping over some data, as opposed to informing users and allowing them to choose what to do with the data. Warnings are easily overlooked.

On Mon, Jun 30, 2025 at 11:38 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

> At a minimum, a LOG.warn message about the deprecation.
> Maybe a "hidden" flag which could turn back on skipping overwrite snapshots.
> This flag could be deprecated immediately and removed in the next release.
> Maybe wait until 2.0, where we can introduce breaking changes?
>
> Maximilian Michels <m...@apache.org> wrote (on Mon, Jun 30, 2025 at 11:19):
>
>> What would such a grace period look like? Even if we defer this by X releases, some users will probably be surprised by this. When users upgrade their Iceberg version, they would generally expect slightly different (improved) behavior. Right now, some users are discovering that they are not reading all the data, which I believe is a much bigger surprise to them.
>>
>> On Fri, Jun 27, 2025 at 5:47 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> I would try to avoid breaking the current behaviour. Maybe after some grace period it could be OK, but not "suddenly".
>>>
>>> <gyula.f...@gmail.com> wrote (on Fri, Jun 27, 2025 at 15:28):
>>>
>>>> Sounds good to me!
>>>>
>>>> Gyula
>>>> Sent from my iPhone
>>>>
>>>> On 27 Jun 2025, at 13:48, Maximilian Michels <m...@apache.org> wrote:
>>>>
>>>> In my understanding, overwrite snapshots are specifically there to overwrite data, i.e. there is always an append for a delete. That is also the case for the Flink Iceberg sink. There may be other writers that use overwrite snapshots differently. But point taken, we may also need an opt-in to skipping delete snapshots, in addition to overwrite snapshots.
>>>>
>>>> I've taken a look at how Spark handles this topic. Turns out, they came up with a similar approach:
>>>>
>>>>> [Spark] Iceberg only supports reading data from append snapshots. Overwrite snapshots cannot be processed and will cause an exception by default. Overwrites may be ignored by setting streaming-skip-overwrite-snapshots=true. Similarly, delete snapshots will cause an exception by default, and deletes may be ignored by setting streaming-skip-delete-snapshots=true.
>>>>
>>>> https://iceberg.apache.org/docs/1.8.1/spark-structured-streaming/#streaming-reads
>>>>
>>>> Spark fails by default and has options to ignore overwrite and delete snapshots entirely. For Flink, perhaps we can make the following changes to align with Spark and also provide some flexibility for users:
>>>>
>>>> For Flink incremental / streaming reads:
>>>> A. By default, throw on non-append snapshots (like Spark, important for maintaining an accurate table view)
>>>> B. Add options to skip overwrite / delete snapshots (like Spark)
>>>> C. Add an option to read the append data of overwrite snapshots (allows down-the-line deduplication of upserts)
>>>>
>>>> What do you think?
>>>>
>>>> Thanks,
>>>> Max
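For reference, the Spark behavior quoted above is opted into like this (a minimal sketch; the table name db.events is illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class SparkStreamingReadExample {
  // Streaming read that ignores overwrite and delete snapshots instead of
  // failing on them, per the Spark documentation quoted above.
  static Dataset<Row> streamingRead(SparkSession spark) {
    return spark
        .readStream()
        .format("iceberg")
        .option("streaming-skip-overwrite-snapshots", "true")
        .option("streaming-skip-delete-snapshots", "true")
        .load("db.events"); // illustrative table name
  }
}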
>>>> On Thu, Jun 26, 2025 at 5:38 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>
>>>>> I agree with Peter, it would be weird to get an error on a DELETE snapshot if you already explicitly opted in to reading the appends of OVERWRITE snapshots.
>>>>>
>>>>> Users may not be able to control the type of snapshot that gets created, so this would otherwise render the feature useless.
>>>>>
>>>>> Gyula
>>>>>
>>>>> On Thu, Jun 26, 2025 at 5:34 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> > Consequently, we must throw on DELETE snapshots, even if users opt in to reading appends of OVERWRITE snapshots.
>>>>>>
>>>>>> OVERWRITE snapshots themselves could still contain deletes. So in this regard, I don't see a difference between DELETE and OVERWRITE snapshots.
>>>>>>
>>>>>> Maximilian Michels <m...@apache.org> wrote (on Thu, Jun 26, 2025 at 11:03):
>>>>>>
>>>>>>> Thank you for your feedback, Steven and Gyula!
>>>>>>>
>>>>>>> @Steven
>>>>>>>
>>>>>>>> The Flink streaming read only consumes `append` commits. This is a snapshot-level `DataOperation` type. You were talking about row-level appends, deletes, etc.
>>>>>>>
>>>>>>> Yes, there is no doubt that Flink streaming reads only process append snapshots. But the issue here is that they jump over "overwrite" snapshots, which contain both new data files and delete files. I don't think users are generally aware of this. Even worse, reading that way leads to an inconsistent view of the table state, which is a serious data integrity issue.
>>>>>>>
>>>>>>> >> 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>> > For updates it is true. But what about actual row deletion?
>>>>>>>
>>>>>>> Good point! I had only considered upserts, but unlike upserts, deletes definitely cannot be reconstructed via downstream deduplication. Consequently, we must throw on DELETE snapshots, even if users opt in to reading appends of OVERWRITE snapshots.
>>>>>>>
>>>>>>>> It would be interesting to hear what people think about this. In some scenarios, this is probably better.
>>>>>>>
>>>>>>>> In a scenario of primarily streaming appends (and occasional batch overwrites), is this always the preferred behavior?
>>>>>>>
>>>>>>> I think so, but I'm also curious about what others think.
>>>>>>>
>>>>>>> @Gyula
>>>>>>>
>>>>>>>> There is a slight risk with 3 that it may break pipelines where the current behaviour is well understood and expected, but that's probably a very small minority of the users.
>>>>>>>
>>>>>>> Agreed, that is a possible risk, but if we consider the hidden risk of effectively "losing" data, it looks like a small risk to take.
>>>>>>>
>>>>>>> -Max
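A toy illustration of the upsert/delete distinction above, assuming appended rows carry a key and a monotonically increasing sequence number (all names are made up): keeping the newest row per key reconstructs upserts from appends alone, while a deletion leaves no appended row behind for the reader to observe.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UpsertDedup {
  record Row(String key, long sequence, String value) {}

  // Reconstructs the latest table state from appended rows only. This works
  // for upserts (the newest row per key wins) but cannot express a deletion:
  // a removed key simply keeps its last appended value forever.
  static Map<String, Row> latestByKey(List<Row> appendedRows) {
    Map<String, Row> latest = new HashMap<>();
    for (Row row : appendedRows) {
      latest.merge(row.key(), row, (a, b) -> a.sequence() >= b.sequence() ? a : b);
    }
    return latest;
  }
}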
>>>>>>> On Thu, Jun 26, 2025 at 9:18 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Max!
>>>>>>>>
>>>>>>>> I like this proposal, especially since proper streaming reads of deletes seem to be quite a bit of work based on recent efforts.
>>>>>>>>
>>>>>>>> Giving an option to include the append parts of OVERWRITE snapshots (2) is a great quick improvement that will unblock use cases where the Iceberg table stores insert/upsert-only data, which is pretty common when storing metadata/enrichment information, for example. This can later simply be removed once proper streaming reads are available.
>>>>>>>>
>>>>>>>> As for (3) as a default, I think it's reasonable, given that this is a source of a lot of user confusion when the source simply doesn't read *anything* even though new snapshots are added to the table. There is a slight risk with 3 that it may break pipelines where the current behaviour is well understood and expected, but that's probably a very small minority of the users.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Gyula
>>>>>>>>
>>>>>>>> On Wed, Jun 25, 2025 at 9:27 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The Flink streaming read only consumes `append` commits. This is a snapshot-level `DataOperation` type. You were talking about row-level appends, deletes, etc.
>>>>>>>>>
>>>>>>>>> > 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>>
>>>>>>>>> For updates it is true. But what about actual row deletion?
>>>>>>>>>
>>>>>>>>> > 3. Throw an error if overwrite snapshots are discovered in an append-only scan and the option in (2) is not activated
>>>>>>>>>
>>>>>>>>> It would be interesting to hear what people think about this. In some scenarios, this is probably better.
>>>>>>>>>
>>>>>>>>> In a scenario of primarily streaming appends (and occasional batch overwrites), is this always the preferred behavior?
>>>>>>>>>
>>>>>>>>> On Wed, Jun 25, 2025 at 8:24 AM Maximilian Michels <m...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> It is well known that Flink and other Iceberg engines do not support "merge-on-read" in streaming/incremental read mode. There are plans to change that, see the "Improving Merge-On-Read Query Performance" thread, but that is not what this message is about.
>>>>>>>>>>
>>>>>>>>>> I used to think that when we incrementally read from an Iceberg table with Flink, we would simply not apply delete files, but still see appends. That is not the case. Instead, we run an append-only table scan [1], which will silently skip over any "overwrite" [2] snapshots. That means the consumer will skip appends and deletes if the producer runs in "merge-on-read" mode.
>>>>>>>>>>
>>>>>>>>>> Now, it is debatable whether it is a good idea to skip applying deletes, but I'm not sure ignoring "overwrite" snapshots entirely is a great idea either. Not every upsert effectively results in a delete + append; many upserts are merely appends. Reading appends at least gives consumers the chance to de-duplicate later on, but skipping the entire snapshot definitely means data loss downstream.
>>>>>>>>>>
>>>>>>>>>> I think we have three options:
Implement "merge-on-read" efficiently for incremental / >>>>>>>>>> streaming >>>>>>>>>> reads (currently a work in progress) >>>>>>>>>> 2. Add an option to read appended data of overwrite snapshots to >>>>>>>>>> allow >>>>>>>>>> users to de-duplicate downstream (opt-in config) >>>>>>>>>> 3. Throw an error if overwrite snapshots are discovered in an >>>>>>>>>> append-only scan and the option in (2) is not activated >>>>>>>>>> >>>>>>>>>> (2) and (3) are stop gaps until we have (1). >>>>>>>>>> >>>>>>>>>> What do you think? >>>>>>>>>> >>>>>>>>>> Cheers, >>>>>>>>>> Max >>>>>>>>>> >>>>>>>>>> [1] >>>>>>>>>> https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L89 >>>>>>>>>> [2] >>>>>>>>>> https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/api/src/main/java/org/apache/iceberg/DataOperations.java#L32 >>>>>>>>>> >>>>>>>>>