I would try to avoid breaking the current behaviour. Maybe after some grace period it could be OK, but not "suddenly".
<gyula.f...@gmail.com> wrote (on Fri, 27 Jun 2025, 15:28):

> Sounds good to me!
>
> Gyula
> Sent from my iPhone
>
> On 27 Jun 2025, at 13:48, Maximilian Michels <m...@apache.org> wrote:
>
> In my understanding, overwrite snapshots are specifically there to
> overwrite data, i.e. there is always an append for a delete. That is
> also the case for the Flink Iceberg sink. There may be other writers,
> which use overwrite snapshots differently. But point taken, we may
> need to also opt in to skipping over delete snapshots, in addition to
> overwrite snapshots.
>
> I've taken a look at how Spark handles this topic. Turns out, they
> came up with a similar approach:
>
>> [Spark] Iceberg only supports reading data from append snapshots.
>> Overwrite snapshots cannot be processed and will cause an exception
>> by default. Overwrites may be ignored by setting
>> streaming-skip-overwrite-snapshots=true. Similarly, delete snapshots
>> will cause an exception by default, and deletes may be ignored by
>> setting streaming-skip-delete-snapshots=true.
>
> https://iceberg.apache.org/docs/1.8.1/spark-structured-streaming/#streaming-reads
>
> Spark fails by default and has options to ignore overwrite and delete
> snapshots entirely. For Flink, perhaps we can make the following
> changes to align with Spark and also provide some flexibility for
> users.
>
> For Flink incremental / streaming reads:
> A. By default, throw on non-append snapshots (like Spark, important
> for maintaining an accurate table view)
> B. Add options to skip overwrite / delete snapshots (like Spark)
> C. Add an option to read append data of overwrite snapshots (allow
> down-the-line deduplication of upserts)
>
> What do you think?
>
> Thanks,
> Max
>
> On Thu, Jun 26, 2025 at 5:38 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> I agree with Peter, it would be weird to get an error on a DELETE
>> snapshot if you already explicitly opted in for reading the appends
>> of OVERWRITE snapshots.
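[Editor's sketch] To make options A-C concrete, here is a minimal toy model of the per-snapshot decision such a scan could make. This is illustrative only: `Op` and `plan` are invented names, not Iceberg or Flink API, and the keyword arguments are placeholders for whatever config keys would eventually be chosen. Following the position Max takes later in the thread, DELETE snapshots still throw even when reading appends of overwrites is enabled (a point Péter and Gyula push back on).

```python
from enum import Enum


class Op(Enum):
    """Snapshot operation types, mirroring Iceberg's DataOperations."""
    APPEND = "append"
    OVERWRITE = "overwrite"
    DELETE = "delete"


def plan(op, skip_overwrite=False, skip_delete=False,
         read_overwrite_appends=False):
    """Decide what an incremental scan does with one snapshot.

    A (default): throw on any non-append snapshot.
    B: skip_* options ignore overwrite/delete snapshots entirely.
    C: read_overwrite_appends reads only the appended files of an
       overwrite snapshot, leaving deduplication to the consumer.
    """
    if op is Op.APPEND:
        return "read"
    if op is Op.OVERWRITE:
        if read_overwrite_appends:
            return "read-appends"
        if skip_overwrite:
            return "skip"
    if op is Op.DELETE and skip_delete:
        return "skip"
    raise ValueError(
        f"cannot process {op.value} snapshot in append-only scan")
```

For example, `plan(Op.OVERWRITE)` raises by default (option A), while `plan(Op.OVERWRITE, read_overwrite_appends=True)` returns "read-appends" (option C).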
>> Users may not be able to control the type of snapshot to be
>> created, so this would otherwise render this feature useless.
>>
>> Gyula
>>
>> On Thu, Jun 26, 2025 at 5:34 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> > Consequently, we must throw on DELETE snapshots, even if users
>>> > opt in to reading appends of OVERWRITE snapshots.
>>>
>>> OVERWRITE snapshots themselves could still contain deletes. So in
>>> this regard, I don't see a difference between the DELETE and the
>>> OVERWRITE snapshots.
>>>
>>> Maximilian Michels <m...@apache.org> wrote (on Thu, 26 Jun 2025, 11:03):
>>>
>>>> Thank you for your feedback Steven and Gyula!
>>>>
>>>> @Steven
>>>>
>>>>> the Flink streaming read only consumes `append` only commits.
>>>>> This is a snapshot commit `DataOperation` type. You were talking
>>>>> about row-level appends, delete etc.
>>>>
>>>> Yes, there is no doubt that Flink streaming reads only process
>>>> append snapshots. But the issue here is that they jump over
>>>> "overwrite" snapshots, which contain both new data files and
>>>> delete files. I don't think users are generally aware of this.
>>>> Even worse, reading that way leads to an inconsistent view of the
>>>> table state, which is a serious data integrity issue.
>>>>
>>>>> 2. Add an option to read appended data of overwrite snapshots to
>>>>> allow users to de-duplicate downstream (opt-in config)
>>>>>
>>>>> For updates it is true. But what about actual row deletion?
>>>>
>>>> Good point! I had only considered upserts, but deletes can
>>>> definitely not be reconstructed via downstream deduplication the
>>>> way upserts can. Consequently, we must throw on DELETE snapshots,
>>>> even if users opt in to reading appends of OVERWRITE snapshots.
>>>>
>>>>> It would be interesting to hear what people think about this. In
>>>>> some scenarios, this is probably better.
>>>>> In the scenario of primarily streaming append (and occasionally
>>>>> batch overwrite), is this always a preferred behavior?
>>>>
>>>> I think so, but I'm also curious about what others think.
>>>>
>>>> @Gyula
>>>>
>>>>> There is a slight risk with 3 that it may break pipelines where
>>>>> the current behaviour is well understood and expected, but that's
>>>>> probably a very small minority of the users.
>>>>
>>>> Agreed, that is a possible risk, but if we consider the hidden
>>>> risk of effectively "losing" data, it looks like a small risk to
>>>> take.
>>>>
>>>> -Max
>>>>
>>>> On Thu, Jun 26, 2025 at 9:18 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>
>>>>> Hi Max!
>>>>>
>>>>> I like this proposal, especially since proper streaming reads of
>>>>> deletes seem to be quite a bit of work based on recent efforts.
>>>>>
>>>>> Giving an option to include the append parts of OVERWRITE
>>>>> snapshots (2) is a great quick improvement that will unblock
>>>>> use-cases where the Iceberg table is used to store insert/upsert
>>>>> only data, which is pretty common when storing
>>>>> metadata/enrichment information, for example. This can then later
>>>>> simply be removed once proper streaming read is available.
>>>>>
>>>>> As for (3) as a default, I think it's reasonable given that this
>>>>> is a source of a lot of user confusion when they see that the
>>>>> source simply doesn't read *anything* even if new snapshots are
>>>>> produced to the table. There is a slight risk with 3 that it may
>>>>> break pipelines where the current behaviour is well understood
>>>>> and expected, but that's probably a very small minority of the
>>>>> users.
>>>>>
>>>>> Cheers,
>>>>> Gyula
>>>>>
>>>>> On Wed, Jun 25, 2025 at 9:27 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>>> the Flink streaming read only consumes `append` only commits.
>>>>>> This is a snapshot commit `DataOperation` type. You were talking
>>>>>> about row-level appends, delete etc.
>>>>>>
>>>>>> > 2. Add an option to read appended data of overwrite snapshots
>>>>>> > to allow users to de-duplicate downstream (opt-in config)
>>>>>>
>>>>>> For updates it is true. But what about actual row deletion?
>>>>>>
>>>>>> > 3. Throw an error if overwrite snapshots are discovered in an
>>>>>> > append-only scan and the option in (2) is not activated
>>>>>>
>>>>>> It would be interesting to hear what people think about this. In
>>>>>> some scenarios, this is probably better.
>>>>>>
>>>>>> In the scenario of primarily streaming append (and occasionally
>>>>>> batch overwrite), is this always a preferred behavior?
>>>>>>
>>>>>> On Wed, Jun 25, 2025 at 8:24 AM Maximilian Michels <m...@apache.org> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> It is well known that Flink and other Iceberg engines do not
>>>>>>> support "merge-on-read" in streaming/incremental read mode.
>>>>>>> There are plans to change that, see the "Improving
>>>>>>> Merge-On-Read Query Performance" thread, but this is not what
>>>>>>> this message is about.
>>>>>>>
>>>>>>> I used to think that when we incrementally read from an Iceberg
>>>>>>> table with Flink, we would simply not apply delete files, but
>>>>>>> still see appends. That is not the case. Instead, we run an
>>>>>>> append-only table scan [1], which will silently skip over any
>>>>>>> "overwrite" [2] snapshots. That means that the consumer will
>>>>>>> skip appends and deletes if the producer runs in
>>>>>>> "merge-on-read" mode.
>>>>>>>
>>>>>>> Now, it is debatable whether it is a good idea to skip applying
>>>>>>> deletes, but I'm not sure ignoring "overwrite" snapshots
>>>>>>> entirely is a great idea either. Not every upsert effectively
>>>>>>> results in a delete + append. Many upserts are merely appends.
>>>>>>> Reading appends at least gives consumers the chance to
>>>>>>> de-duplicate later on, but skipping the entire snapshot
>>>>>>> definitely means data loss downstream.
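[Editor's sketch] The data-loss problem Max describes can be shown with a toy model. The snapshot dicts below are an invented structure, not the Iceberg manifest format; an upsert of key `k1` is committed as an OVERWRITE snapshot carrying one delete and one data file, and the append-only scan mimics the current Flink behaviour of skipping non-append snapshots.

```python
# Two commits: an initial append, then an upsert of k1 committed as an
# overwrite snapshot (delete the old row, append the new one).
snapshots = [
    {"op": "append",    "adds": [("k1", "v1")], "deletes": []},
    {"op": "overwrite", "adds": [("k1", "v2")], "deletes": ["k1"]},
]


def append_only_scan(snapshots, skip_overwrites=True):
    """Rows seen by an incremental consumer.

    skip_overwrites=True mimics the current Flink behaviour; False
    mimics proposed option (2), reading the appends of overwrites.
    """
    rows = []
    for s in snapshots:
        if s["op"] == "append" or not skip_overwrites:
            rows.extend(s["adds"])
    return rows


# Current behaviour: the updated value v2 is silently lost downstream.
print(append_only_scan(snapshots))  # [('k1', 'v1')]
# Option (2): both versions surface; the consumer can deduplicate.
print(append_only_scan(snapshots, skip_overwrites=False))  # [('k1', 'v1'), ('k1', 'v2')]
```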
>>>>>>>
>>>>>>> I think we have three options:
>>>>>>>
>>>>>>> 1. Implement "merge-on-read" efficiently for incremental /
>>>>>>> streaming reads (currently a work in progress)
>>>>>>> 2. Add an option to read appended data of overwrite snapshots
>>>>>>> to allow users to de-duplicate downstream (opt-in config)
>>>>>>> 3. Throw an error if overwrite snapshots are discovered in an
>>>>>>> append-only scan and the option in (2) is not activated
>>>>>>>
>>>>>>> (2) and (3) are stopgaps until we have (1).
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Max
>>>>>>>
>>>>>>> [1] https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L89
>>>>>>> [2] https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/api/src/main/java/org/apache/iceberg/DataOperations.java#L32
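[Editor's sketch] Option (2) relies on consumers deduplicating the appended rows downstream. A minimal sketch of such keyed deduplication (plain Python with invented names; in a Flink job this would typically be keyed state retaining the latest row per key). Note the limitation raised in the thread: this reconstructs upserts, but actual row deletions cannot be recovered from appends alone.

```python
def dedupe_latest(rows):
    """Keep only the latest value per key.

    rows: iterable of (key, value) pairs in commit order, so a later
    occurrence of a key supersedes earlier ones.
    """
    latest = {}
    for key, value in rows:
        latest[key] = value  # later rows win
    return latest


# Appends read across snapshots, including appends of overwrite
# snapshots (option 2): k1 was upserted from v1 to v2.
rows = [("k1", "v1"), ("k2", "a"), ("k1", "v2")]
print(dedupe_latest(rows))  # {'k1': 'v2', 'k2': 'a'}
```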