Hi Peter!

We have quite a few use-cases that would require option C and currently have no workaround. So this is something we will have to do in either case; we feel it would also be a good addition for other users.

Cheers,
Gyula
On Mon, Jun 30, 2025 at 12:39 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

> I like the first two points of your proposal, A (with a warning) + B, and changing the default with 2.0, but I would suggest avoiding C. I see a very limited case for deduplication (only if there are no deletes in the table, just updates), and it would cause more confusion and configuration clutter.
>
> Maximilian Michels <m...@apache.org> wrote (on Mon, Jun 30, 2025 at 11:50 AM):
>
>> That's essentially what I propose, minus the WARN message, which is a great addition. The flag to skip overwrite snapshots should probably be there forever, similar to Spark's flags to skip overwrite / delete snapshots. Here's the proposal again:
>>
>> For Flink incremental / streaming reads:
>> A. By default, throw on non-append snapshots (like Spark; important for maintaining an accurate table view)
>> B. Add options to skip overwrite / delete snapshots (like Spark)
>> C. Add an option to read the append data of overwrite snapshots (allows down-the-line deduplication of upserts)
>>
>> We could change (A) to just log a warning. Delaying throwing an exception for (A) until 2.0 means we are ok with randomly skipping over some data, as opposed to informing users and letting them choose what to do with the data. Warnings are easily overlooked.
>>
>> On Mon, Jun 30, 2025 at 11:38 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> Minimally, a LOG.warn message about the deprecation. Maybe a "hidden" flag which could revert to skipping overwrite snapshots. This flag could be deprecated immediately and removed in the next release. Maybe wait until 2.0, where we can introduce breaking changes?
>>>
>>> Maximilian Michels <m...@apache.org> wrote (on Mon, Jun 30, 2025 at 11:19 AM):
>>>
>>>> What would such a grace period look like? Even if we defer this for X releases, some users will probably be surprised by it. When users upgrade their Iceberg version, they generally expect slightly different (improved) behavior. Right now, some users are discovering that they are not reading all the data, which I believe is a much bigger surprise to them.
>>>>
>>>> On Fri, Jun 27, 2025 at 5:47 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>
>>>>> I would try to avoid breaking the current behaviour. Maybe after some grace period it could be ok, but not "suddenly".
>>>>>
>>>>> <gyula.f...@gmail.com> wrote (on Fri, Jun 27, 2025 at 15:28):
>>>>>
>>>>>> Sounds good to me!
>>>>>>
>>>>>> Gyula
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On 27 Jun 2025, at 13:48, Maximilian Michels <m...@apache.org> wrote:
>>>>>>
>>>>>> In my understanding, overwrite snapshots are specifically there to overwrite data, i.e. there is always an append for a delete. That is also the case for the Flink Iceberg sink. There may be other writers which use overwrite snapshots differently. But point taken, we may need to also opt in to skipping delete snapshots, in addition to overwrite snapshots.
>>>>>>
>>>>>> I've taken a look at how Spark handles this topic. It turns out they came up with a similar approach:
>>>>>>
>>>>>>> [Spark] Iceberg only supports reading data from append snapshots. Overwrite snapshots cannot be processed and will cause an exception by default. Overwrites may be ignored by setting streaming-skip-overwrite-snapshots=true. Similarly, delete snapshots will cause an exception by default, and deletes may be ignored by setting streaming-skip-delete-snapshots=true.
>>>>>>
>>>>>> https://iceberg.apache.org/docs/1.8.1/spark-structured-streaming/#streaming-reads
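>>>>>> To illustrate, this is roughly what that looks like on the Spark side (a minimal, untested sketch in Java; it assumes an existing SparkSession `spark` and the usual imports, and the table identifier is a placeholder):
>>>>>>
>>>>>>   // Streaming read that skips overwrite / delete snapshots instead
>>>>>>   // of failing on them, using the flags from the docs quoted above.
>>>>>>   Dataset<Row> stream = spark.readStream()
>>>>>>       .format("iceberg")
>>>>>>       .option("streaming-skip-overwrite-snapshots", "true")
>>>>>>       .option("streaming-skip-delete-snapshots", "true")
>>>>>>       .load("db.events");  // placeholder table identifier
>>>>>>
>>>>>> Note that skipping is an explicit opt-in there; the default is to fail.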
>>>>>> Spark fails by default and has options to ignore overwrite and delete snapshots entirely. For Flink, perhaps we can make the following changes to align with Spark and also provide some flexibility for users:
>>>>>>
>>>>>> For Flink incremental / streaming reads:
>>>>>> A. By default, throw on non-append snapshots (like Spark; important for maintaining an accurate table view)
>>>>>> B. Add options to skip overwrite / delete snapshots (like Spark)
>>>>>> C. Add an option to read the append data of overwrite snapshots (allows down-the-line deduplication of upserts)
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Thanks,
>>>>>> Max
>>>>>>
>>>>>> On Thu, Jun 26, 2025 at 5:38 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>>>
>>>>>>> I agree with Peter, it would be weird to get an error on a DELETE snapshot if you have already explicitly opted in to reading the appends of OVERWRITE snapshots.
>>>>>>>
>>>>>>> Users may not be able to control the type of snapshot that gets created, so this would otherwise render the feature useless.
>>>>>>>
>>>>>>> Gyula
>>>>>>>
>>>>>>> On Thu, Jun 26, 2025 at 5:34 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> > Consequently, we must throw on DELETE snapshots, even if users opt-in to reading appends of OVERWRITE snapshots.
>>>>>>>>
>>>>>>>> OVERWRITE snapshots themselves could still contain deletes. So in this regard, I don't see a difference between DELETE and OVERWRITE snapshots.
>>>>>>>>
>>>>>>>> Maximilian Michels <m...@apache.org> wrote (on Thu, Jun 26, 2025 at 11:03 AM):
>>>>>>>>
>>>>>>>>> Thank you for your feedback, Steven and Gyula!
>>>>>>>>>
>>>>>>>>> @Steven
>>>>>>>>>
>>>>>>>>>> The Flink streaming read only consumes `append` commits. This is a snapshot-commit `DataOperation` type. You were talking about row-level appends, deletes, etc.
>>>>>>>>>
>>>>>>>>> Yes, there is no doubt that Flink streaming reads only process append snapshots. But the issue here is that they jump over "overwrite" snapshots, which contain both new data files and delete files. I don't think users are generally aware of this. Even worse, reading that way leads to an inconsistent view of the table state, which is a serious data integrity issue.
>>>>>>>>>
>>>>>>>>>>> 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>>>
>>>>>>>>>> For updates it is true. But what about actual row deletion?
>>>>>>>>>
>>>>>>>>> Good point! I had only considered upserts, but deletes definitely cannot be reconstructed via downstream deduplication the way upserts can. Consequently, we must throw on DELETE snapshots, even if users opt in to reading the appends of OVERWRITE snapshots.
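>>>>>>>>> To make the upsert case concrete, here is a minimal, untested sketch of downstream de-duplication with the Flink DataStream API. `UpsertRow`, its `key` and `version` fields, and the version-based ordering are placeholders made up for illustration:
>>>>>>>>>
>>>>>>>>>   import org.apache.flink.api.common.functions.RichFlatMapFunction;
>>>>>>>>>   import org.apache.flink.api.common.state.ValueState;
>>>>>>>>>   import org.apache.flink.api.common.state.ValueStateDescriptor;
>>>>>>>>>   import org.apache.flink.configuration.Configuration;
>>>>>>>>>   import org.apache.flink.util.Collector;
>>>>>>>>>
>>>>>>>>>   // Keep only the newest row per key, so appends re-read from
>>>>>>>>>   // OVERWRITE snapshots collapse back into a consistent view.
>>>>>>>>>   public class LatestVersionDedup extends RichFlatMapFunction<UpsertRow, UpsertRow> {
>>>>>>>>>     private transient ValueState<Long> lastVersion;
>>>>>>>>>
>>>>>>>>>     @Override
>>>>>>>>>     public void open(Configuration parameters) {
>>>>>>>>>       lastVersion = getRuntimeContext().getState(
>>>>>>>>>           new ValueStateDescriptor<>("last-version", Long.class));
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     @Override
>>>>>>>>>     public void flatMap(UpsertRow row, Collector<UpsertRow> out) throws Exception {
>>>>>>>>>       Long seen = lastVersion.value();
>>>>>>>>>       if (seen == null || row.version > seen) {  // drop stale duplicates
>>>>>>>>>         lastVersion.update(row.version);
>>>>>>>>>         out.collect(row);
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   // usage: source.keyBy(r -> r.key).flatMap(new LatestVersionDedup());
>>>>>>>>>
>>>>>>>>> This only works because every update also shows up as an append; actual row deletions leave no trace in the appends, which is why DELETE snapshots still have to throw.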
>>>>>>>>>> It would be interesting to hear what people think about this. In some scenarios, this is probably better.
>>>>>>>>>
>>>>>>>>>> In the scenario of primarily streaming append (and occasionally batch overwrite), is this always the preferred behavior?
>>>>>>>>>
>>>>>>>>> I think so, but I'm also curious about what others think.
>>>>>>>>>
>>>>>>>>> @Gyula
>>>>>>>>>
>>>>>>>>>> There is a slight risk with 3 that it may break pipelines where the current behaviour is well understood and expected, but that's probably a very small minority of the users.
>>>>>>>>>
>>>>>>>>> Agreed, that is a possible risk, but if we consider the hidden risk of effectively "losing" data, it looks like a small risk to take.
>>>>>>>>>
>>>>>>>>> -Max
>>>>>>>>>
>>>>>>>>> On Thu, Jun 26, 2025 at 9:18 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Max!
>>>>>>>>>>
>>>>>>>>>> I like this proposal, especially since proper streaming reads of deletes seem to be quite a bit of work based on recent efforts.
>>>>>>>>>>
>>>>>>>>>> Giving an option to include the append parts of OVERWRITE snapshots (2) is a great quick improvement that will unblock use-cases where the Iceberg table is used to store insert/upsert-only data, which is pretty common when storing metadata/enrichment information, for example. This can then later simply be removed once proper streaming reads are available.
>>>>>>>>>>
>>>>>>>>>> As for (3) as a default, I think it's reasonable given that this is a source of a lot of user confusion when they see that the source simply doesn't read *anything* even though new snapshots are produced to the table. There is a slight risk with 3 that it may break pipelines where the current behaviour is well understood and expected, but that's probably a very small minority of the users.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Gyula
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 25, 2025 at 9:27 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The Flink streaming read only consumes `append` commits. This is a snapshot-commit `DataOperation` type. You were talking about row-level appends, deletes, etc.
>>>>>>>>>>>
>>>>>>>>>>>> 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>>>>
>>>>>>>>>>> For updates it is true. But what about actual row deletion?
>>>>>>>>>>>
>>>>>>>>>>>> 3. Throw an error if overwrite snapshots are discovered in an append-only scan and the option in (2) is not activated
>>>>>>>>>>>
>>>>>>>>>>> It would be interesting to hear what people think about this. In some scenarios, this is probably better.
>>>>>>>>>>>
>>>>>>>>>>> In the scenario of primarily streaming append (and occasionally batch overwrite), is this always the preferred behavior?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 25, 2025 at 8:24 AM Maximilian Michels <m...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> It is well known that Flink and other Iceberg engines do not support "merge-on-read" in streaming/incremental read mode. There are plans to change that, see the "Improving Merge-On-Read Query Performance" thread, but that is not what this message is about.
>>>>>>>>>>>>
>>>>>>>>>>>> I used to think that when we incrementally read from an Iceberg table with Flink, we would simply not apply delete files, but still see appends. That is not the case. Instead, we run an append-only table scan [1], which silently skips over any "overwrite" [2] snapshots. That means that the consumer will miss both appends and deletes if the producer runs in "merge-on-read" mode.
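>>>>>>>>>>>> To see this on a concrete table, here is a small, untested sketch that walks the snapshot history with the public Iceberg API and reports what an append-only scan would surface versus skip:
>>>>>>>>>>>>
>>>>>>>>>>>>   import org.apache.iceberg.DataOperations;
>>>>>>>>>>>>   import org.apache.iceberg.Snapshot;
>>>>>>>>>>>>   import org.apache.iceberg.Table;
>>>>>>>>>>>>
>>>>>>>>>>>>   static void reportSnapshots(Table table) {
>>>>>>>>>>>>     for (Snapshot snapshot : table.snapshots()) {
>>>>>>>>>>>>       if (DataOperations.APPEND.equals(snapshot.operation())) {
>>>>>>>>>>>>         System.out.printf("read    %d (append)%n", snapshot.snapshotId());
>>>>>>>>>>>>       } else {
>>>>>>>>>>>>         // overwrite / delete / replace snapshots are not surfaced by
>>>>>>>>>>>>         // an append-only scan, including their new data files
>>>>>>>>>>>>         System.out.printf("skipped %d (%s)%n", snapshot.snapshotId(), snapshot.operation());
>>>>>>>>>>>>       }
>>>>>>>>>>>>     }
>>>>>>>>>>>>   }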
>>>>>>>>>>>> Now, it is debatable whether it is a good idea to skip applying deletes, but I'm not sure ignoring "overwrite" snapshots entirely is a great idea either. Not every upsert effectively results in a delete + append; many upserts are merely appends. Reading the appends at least gives consumers the chance to de-duplicate later on, while skipping the entire snapshot definitely means data loss downstream.
>>>>>>>>>>>>
>>>>>>>>>>>> I think we have three options:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Implement "merge-on-read" efficiently for incremental / streaming reads (currently a work in progress)
>>>>>>>>>>>> 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>>>>> 3. Throw an error if overwrite snapshots are discovered in an append-only scan and the option in (2) is not activated
>>>>>>>>>>>>
>>>>>>>>>>>> (2) and (3) are stop-gaps until we have (1).
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Max
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L89
>>>>>>>>>>>> [2] https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/api/src/main/java/org/apache/iceberg/DataOperations.java#L32
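>>>>>>>>>>>> PS: For option (2), a rough, untested sketch of how such an opt-in could be wired on the Flink source. The option name streaming-read-overwrite-appends is purely a placeholder; no such option exists today, and the table path is made up:
>>>>>>>>>>>>
>>>>>>>>>>>>   import org.apache.flink.api.common.eventtime.WatermarkStrategy;
>>>>>>>>>>>>   import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>>>>>>>>>>>   import org.apache.flink.table.data.RowData;
>>>>>>>>>>>>   import org.apache.iceberg.flink.TableLoader;
>>>>>>>>>>>>   import org.apache.iceberg.flink.source.IcebergSource;
>>>>>>>>>>>>
>>>>>>>>>>>>   StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>>>>>>>>>>>   TableLoader loader = TableLoader.fromHadoopTable("hdfs://nn/warehouse/db/events");  // placeholder
>>>>>>>>>>>>   IcebergSource<RowData> source = IcebergSource.forRowData()
>>>>>>>>>>>>       .tableLoader(loader)
>>>>>>>>>>>>       .streaming(true)
>>>>>>>>>>>>       .set("streaming-read-overwrite-appends", "true")  // hypothetical opt-in for (2)
>>>>>>>>>>>>       .build();
>>>>>>>>>>>>   env.fromSource(source, WatermarkStrategy.noWatermarks(), "Iceberg source");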