I like the first two points of your proposal, A (with a warning) + B, and changing the default in 2.0, but I would suggest avoiding C. I see a very limited use case for deduplication (only if there are no deletes in the table, just updates), and it would cause more confusion and configuration clutter.
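To make this concrete, here is a minimal sketch of how (A) plus the (B) skip flags could look in the Flink snapshot planning path. Everything below is hypothetical: the flag names just mirror Spark's streaming-skip-* options, and none of this exists in the connector today.

import org.apache.iceberg.DataOperations;
import org.apache.iceberg.Snapshot;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class SnapshotFilter {
  private static final Logger LOG = LoggerFactory.getLogger(SnapshotFilter.class);

  // Hypothetical flags, named after Spark's streaming-skip-* options.
  private final boolean skipOverwriteSnapshots;
  private final boolean skipDeleteSnapshots;

  SnapshotFilter(boolean skipOverwriteSnapshots, boolean skipDeleteSnapshots) {
    this.skipOverwriteSnapshots = skipOverwriteSnapshots;
    this.skipDeleteSnapshots = skipDeleteSnapshots;
  }

  /** Returns true if the snapshot should be planned, false if it is skipped. */
  boolean shouldConsume(Snapshot snapshot) {
    String op = snapshot.operation();
    if (DataOperations.APPEND.equals(op)) {
      return true; // append snapshots are always consumed
    }
    if ((DataOperations.OVERWRITE.equals(op) && skipOverwriteSnapshots)
        || (DataOperations.DELETE.equals(op) && skipDeleteSnapshots)) {
      // (B): explicit opt-in to the old skipping behavior, made visible via WARN
      LOG.warn("Skipping {} snapshot {}; the emitted stream will not reflect it",
          op, snapshot.snapshotId());
      return false;
    }
    // (A): fail loudly instead of silently dropping data
    throw new UnsupportedOperationException(
        "Cannot incrementally read snapshot " + snapshot.snapshotId()
            + " with operation '" + op + "'");
  }
}

During a deprecation period, (A) could log the same WARN instead of throwing, with the throw becoming the default in 2.0.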
On Mon, Jun 30, 2025 at 11:50 AM Maximilian Michels <m...@apache.org> wrote:

> That's essentially what I propose, minus the WARN message which is a great addition. The flag to skip overwrite snapshots should probably be there forever, similarly to Spark's flags to skip overwrite / delete snapshots. Here's again the proposal:
>
> For Flink incremental / streaming reads:
> A. By default, throw on non-append snapshots (like Spark, important for maintaining an accurate table view)
> B. Add options to skip overwrite / delete snapshots (like Spark)
> C. Add an option to read append data of overwrite snapshots (allow down-the-line deduplication of upserts)
>
> We could change (A) to just log a warning. Delaying throwing an exception for (A) until 2.0 means we are ok with randomly skipping over some data, as opposed to informing users and allowing them to choose what to do with the data. Warnings are easily overlooked.
>
> On Mon, Jun 30, 2025 at 11:38 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
>> Minimally, a LOG.warn message about deprecation.
>> Maybe a "hidden" flag which could turn back on skipping overwrite snapshots. This flag could be deprecated immediately and removed in the next release.
>> Maybe wait until 2.0, where we can introduce breaking changes?
>>
>> On Mon, Jun 30, 2025 at 11:19 AM Maximilian Michels <m...@apache.org> wrote:
>>
>>> What would such a grace period look like? Even if we defer this X amount of releases, some users will probably be surprised by this. When users upgrade their Iceberg version, they would generally expect slightly different (improved) behavior. Right now, some users are discovering that they are not reading all the data, which I believe is a much bigger surprise to them.
>>>
>>> On Fri, Jun 27, 2025 at 5:47 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> I would try to avoid breaking the current behaviour.
>>>> Maybe after some grace period it could be ok, but not "suddenly".
>>>>
>>>> On Fri, Jun 27, 2025 at 3:28 PM <gyula.f...@gmail.com> wrote:
>>>>
>>>>> Sounds good to me!
>>>>>
>>>>> Gyula
>>>>> Sent from my iPhone
>>>>>
>>>>> On 27 Jun 2025, at 13:48, Maximilian Michels <m...@apache.org> wrote:
>>>>>
>>>>> In my understanding, overwrite snapshots are specifically there to overwrite data, i.e. there is always an append for a delete. That is also the case for the Flink Iceberg sink. There may be other writers which use overwrite snapshots differently. But point taken, we may need to also opt in to skipping over delete snapshots, in addition to overwrite snapshots.
>>>>>
>>>>> I've taken a look at how Spark handles this topic. Turns out, they came up with a similar approach:
>>>>>
>>>>>> [Spark] Iceberg only supports reading data from append snapshots. Overwrite snapshots cannot be processed and will cause an exception by default. Overwrites may be ignored by setting streaming-skip-overwrite-snapshots=true. Similarly, delete snapshots will cause an exception by default, and deletes may be ignored by setting streaming-skip-delete-snapshots=true.
>>>>>
>>>>> https://iceberg.apache.org/docs/1.8.1/spark-structured-streaming/#streaming-reads
>>>>>
>>>>> Spark fails by default and has options to ignore overwrite and delete snapshots entirely.
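For reference, a Spark streaming read that opts in to skipping both snapshot types would look roughly like this (untested sketch; the option keys are taken verbatim from the docs quoted above, the table name is a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IcebergStreamingRead {
  public static void main(String[] args) throws Exception {
    SparkSession spark =
        SparkSession.builder().appName("iceberg-streaming-read").getOrCreate();

    // Without these options, Spark throws on overwrite / delete snapshots.
    Dataset<Row> stream = spark.readStream()
        .format("iceberg")
        .option("streaming-skip-overwrite-snapshots", "true")
        .option("streaming-skip-delete-snapshots", "true")
        .load("db.table_name");

    stream.writeStream().format("console").start().awaitTermination();
  }
}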
>>>>> For Flink, perhaps we can make the following changes to align with Spark and also provide some flexibility for users:
>>>>>
>>>>> For Flink incremental / streaming reads:
>>>>> A. By default, throw on non-append snapshots (like Spark, important for maintaining an accurate table view)
>>>>> B. Add options to skip overwrite / delete snapshots (like Spark)
>>>>> C. Add an option to read append data of overwrite snapshots (allow down-the-line deduplication of upserts)
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Thanks,
>>>>> Max
>>>>>
>>>>> On Thu, Jun 26, 2025 at 5:38 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>>
>>>>>> I agree with Peter, it would be weird to get an error on a DELETE snapshot if you already explicitly opted in to reading the appends of OVERWRITE snapshots.
>>>>>>
>>>>>> Users may not be able to control the type of snapshot to be created, so this would otherwise render this feature useless.
>>>>>>
>>>>>> Gyula
>>>>>>
>>>>>> On Thu, Jun 26, 2025 at 5:34 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> > Consequently, we must throw on DELETE snapshots, even if users opt in to reading appends of OVERWRITE snapshots.
>>>>>>>
>>>>>>> OVERWRITE snapshots themselves could still contain deletes. So in this regard, I don't see a difference between the DELETE and the OVERWRITE snapshots.
>>>>>>>
>>>>>>> On Thu, Jun 26, 2025 at 11:03 AM Maximilian Michels <m...@apache.org> wrote:
>>>>>>>
>>>>>>>> Thank you for your feedback Steven and Gyula!
>>>>>>>>
>>>>>>>> @Steven
>>>>>>>>
>>>>>>>>> the Flink streaming read only consumes `append` only commits. This is a snapshot commit `DataOperation` type. You were talking about row-level appends, deletes etc.
>>>>>>>>
>>>>>>>> Yes, there is no doubt that Flink streaming reads only process append snapshots. But the issue here is that they jump over "overwrite" snapshots, which contain both new data files and delete files. I don't think users are generally aware of this. Even worse, reading that way leads to an inconsistent view of the table state, which is a serious data integrity issue.
>>>>>>>>
>>>>>>>>> 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>
>>>>>>>>> For updates it is true. But what about actual row deletion?
>>>>>>>>
>>>>>>>> Good point! I had only considered upserts, but deletes definitely cannot be reconstructed via downstream deduplication the way upserts can. Consequently, we must throw on DELETE snapshots, even if users opt in to reading appends of OVERWRITE snapshots.
>>>>>>>>
>>>>>>>>> It would be interesting to hear what people think about this. In some scenarios, this is probably better.
>>>>>>>>
>>>>>>>>> In the scenario of primarily streaming append (and occasionally batch overwrite), is this always a preferred behavior?
>>>>>>>>
>>>>>>>> I think so, but I'm also curious about what others think.
>>>>>>>>
>>>>>>>> @Gyula
>>>>>>>>
>>>>>>>>> There is a slight risk with 3 that it may break pipelines where the current behaviour is well understood and expected, but that's probably a very small minority of the users.
>>>>>>>> Agreed, that is a possible risk, but if we consider the hidden risk of effectively "losing" data, it looks like a small risk to take.
>>>>>>>>
>>>>>>>> -Max
>>>>>>>>
>>>>>>>> On Thu, Jun 26, 2025 at 9:18 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Max!
>>>>>>>>>
>>>>>>>>> I like this proposal, especially since proper streaming reads of deletes seem to be quite a bit of work based on recent efforts.
>>>>>>>>>
>>>>>>>>> Giving an option to include the append parts of OVERWRITE snapshots (2) is a great quick improvement that will unblock use cases where the Iceberg table is used to store insert/upsert-only data, which is pretty common when storing metadata/enrichment information, for example. This can then later simply be removed once proper streaming reads are available.
>>>>>>>>>
>>>>>>>>> As for (3) as a default, I think it's reasonable given that this is a source of a lot of user confusion when they see that the source simply doesn't read *anything* even if new snapshots are produced to the table. There is a slight risk with 3 that it may break pipelines where the current behaviour is well understood and expected, but that's probably a very small minority of the users.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Gyula
>>>>>>>>>
>>>>>>>>> On Wed, Jun 25, 2025 at 9:27 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> the Flink streaming read only consumes `append` only commits. This is a snapshot commit `DataOperation` type. You were talking about row-level appends, deletes etc.
>>>>>>>>>>
>>>>>>>>>> > 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>>>
>>>>>>>>>> For updates it is true. But what about actual row deletion?
>>>>>>>>>>
>>>>>>>>>> > 3. Throw an error if overwrite snapshots are discovered in an append-only scan and the option in (2) is not activated
>>>>>>>>>>
>>>>>>>>>> It would be interesting to hear what people think about this. In some scenarios, this is probably better.
>>>>>>>>>>
>>>>>>>>>> In the scenario of primarily streaming append (and occasionally batch overwrite), is this always a preferred behavior?
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 25, 2025 at 8:24 AM Maximilian Michels <m...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> It is well known that Flink and other Iceberg engines do not support "merge-on-read" in streaming/incremental read mode. There are plans to change that, see the "Improving Merge-On-Read Query Performance" thread, but this is not what this message is about.
>>>>>>>>>>>
>>>>>>>>>>> I used to think that when we incrementally read from an Iceberg table with Flink, we would simply not apply delete files but still see appends. That is not the case. Instead, we run an append-only table scan [1], which silently skips over any "overwrite" [2] snapshots. That means the consumer will skip both appends and deletes if the producer runs in "merge-on-read" mode.
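Simplified, the append-only planning described in the last quoted paragraph boils down to something like the following sketch against the public Iceberg API (not the exact FlinkSplitPlanner code linked in [1]):

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class AppendOnlyPlanner {
  // Plans only the data files committed by APPEND snapshots in (from, to].
  // OVERWRITE and DELETE snapshots in that range are passed over without
  // any error, which is exactly the behavior discussed in this thread.
  public static CloseableIterable<FileScanTask> planAppendsOnly(
      Table table, long fromSnapshotId, long toSnapshotId) {
    return table.newIncrementalAppendScan()
        .fromSnapshotExclusive(fromSnapshotId)
        .toSnapshot(toSnapshotId)
        .planFiles();
  }
}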
>>>>>>>>>>> Now, it is debatable whether it is a good idea to skip applying deletes, but I'm not sure ignoring "overwrite" snapshots entirely is a great idea either. Not every upsert effectively results in a delete + append; many upserts are merely appends. Reading appends at least gives consumers the chance to de-duplicate later on, but skipping the entire snapshot definitely means data loss downstream.
>>>>>>>>>>>
>>>>>>>>>>> I think we have three options:
>>>>>>>>>>>
>>>>>>>>>>> 1. Implement "merge-on-read" efficiently for incremental / streaming reads (currently a work in progress)
>>>>>>>>>>> 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>>>> 3. Throw an error if overwrite snapshots are discovered in an append-only scan and the option in (2) is not activated
>>>>>>>>>>>
>>>>>>>>>>> (2) and (3) are stopgaps until we have (1).
>>>>>>>>>>>
>>>>>>>>>>> What do you think?
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Max
>>>>>>>>>>>
>>>>>>>>>>> [1] https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L89
>>>>>>>>>>> [2] https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/api/src/main/java/org/apache/iceberg/DataOperations.java#L32
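As a footnote to the point earlier in the thread that OVERWRITE snapshots can themselves contain deletes: the snapshot summary carries file-level counters that can at least tell an effectively append-only overwrite apart from one that removed data files or added delete files. A rough sketch (assuming the writer populates the standard summary fields, which is not guaranteed):

import java.util.Map;
import org.apache.iceberg.DataOperations;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.SnapshotSummary;

public class OverwriteInspector {
  // Returns true if this OVERWRITE snapshot neither removed data files nor
  // added delete files, so an append-only reader would see all of its rows.
  public static boolean isEffectivelyAppendOnly(Snapshot snapshot) {
    if (!DataOperations.OVERWRITE.equals(snapshot.operation())) {
      return DataOperations.APPEND.equals(snapshot.operation());
    }
    Map<String, String> summary = snapshot.summary();
    if (summary == null) {
      return false; // no summary, assume the worst
    }
    long deletedDataFiles =
        Long.parseLong(summary.getOrDefault(SnapshotSummary.DELETED_FILES_PROP, "0"));
    long addedDeleteFiles =
        Long.parseLong(summary.getOrDefault(SnapshotSummary.ADDED_DELETE_FILES_PROP, "0"));
    return deletedDataFiles == 0 && addedDeleteFiles == 0;
  }
}

Note that these counters cannot tell whether removed rows were superseded by upserts or genuinely deleted, so this only separates out the trivially safe case.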