Hi Peter!

We have quite a few use-cases that would require option C and currently have no workaround. So this is something we will have to do in either case; we feel it would also be a good addition for other users.

Cheers,
Gyula
On Mon, Jun 30, 2025 at 12:39 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

> I like the first two points of your proposal, A (with a warning) + B, and changing the default with 2.0, but I would suggest avoiding C. I see a very limited case for deduplication (only if there are no deletes in the table, just updates), and it would cause more confusion and configuration clutter.
>
> Maximilian Michels <m...@apache.org> wrote (on Mon, Jun 30, 2025 at 11:50 AM):
>
>> That's essentially what I propose, minus the WARN message, which is a great addition. The flag to skip overwrite snapshots should probably be there forever, similar to Spark's flags to skip overwrite / delete snapshots. Here's the proposal again:
>>
>> For Flink incremental / streaming reads:
>> A. By default, throw on non-append snapshots (like Spark; important for maintaining an accurate table view)
>> B. Add options to skip overwrite / delete snapshots (like Spark)
>> C. Add an option to read the append data of overwrite snapshots (allows down-the-line deduplication of upserts)
>>
>> We could change (A) to just log a warning. Delaying throwing an exception for (A) until 2.0 means we are ok with randomly skipping over some data, as opposed to informing users and letting them choose what to do with the data. Warnings are easily overlooked.
>>
>> On Mon, Jun 30, 2025 at 11:38 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> Minimally, a LOG.warn message about the deprecation. Maybe a "hidden" flag which could revert to skipping overwrite snapshots. This flag could be deprecated immediately and removed in the next release. Maybe wait until 2.0, where we can introduce breaking changes?
>>>
>>> Maximilian Michels <m...@apache.org> wrote (on Mon, Jun 30, 2025 at 11:19 AM):
>>>
>>>> What would such a grace period look like? Even if we defer this for X releases, some users will probably be surprised by it. When users upgrade their Iceberg version, they generally expect slightly different (improved) behavior. Right now, some users are discovering that they are not reading all the data, which I believe is a much bigger surprise to them.
>>>>
>>>> On Fri, Jun 27, 2025 at 5:47 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>
>>>>> I would try to avoid breaking the current behaviour. Maybe after some grace period it could be ok, but not "suddenly".
>>>>>
>>>>> <gyula.f...@gmail.com> wrote (on Fri, Jun 27, 2025 at 15:28):
>>>>>
>>>>>> Sounds good to me!
>>>>>>
>>>>>> Gyula
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On 27 Jun 2025, at 13:48, Maximilian Michels <m...@apache.org> wrote:
>>>>>>
>>>>>> In my understanding, overwrite snapshots are specifically there to overwrite data, i.e. there is always an append for a delete. That is also the case for the Flink Iceberg sink. There may be other writers which use overwrite snapshots differently. But point taken, we may need to also opt in to skipping delete snapshots, in addition to overwrite snapshots.
>>>>>>
>>>>>> I've taken a look at how Spark handles this topic. It turns out they came up with a similar approach:
>>>>>>
>>>>>>> [Spark] Iceberg only supports reading data from append snapshots. Overwrite snapshots cannot be processed and will cause an exception by default. Overwrites may be ignored by setting streaming-skip-overwrite-snapshots=true. Similarly, delete snapshots will cause an exception by default, and deletes may be ignored by setting streaming-skip-delete-snapshots=true.
>>>>>>
>>>>>> https://iceberg.apache.org/docs/1.8.1/spark-structured-streaming/#streaming-reads
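>>>>>> To illustrate, this is roughly what that looks like on the Spark side (a minimal, untested sketch in Java; it assumes an existing SparkSession `spark` and the usual imports, and the table identifier is a placeholder):
>>>>>>
>>>>>>   // Streaming read that skips overwrite / delete snapshots instead
>>>>>>   // of failing on them, using the flags from the docs quoted above.
>>>>>>   Dataset<Row> stream = spark.readStream()
>>>>>>       .format("iceberg")
>>>>>>       .option("streaming-skip-overwrite-snapshots", "true")
>>>>>>       .option("streaming-skip-delete-snapshots", "true")
>>>>>>       .load("db.events");  // placeholder table identifier
>>>>>>
>>>>>> Note that skipping is an explicit opt-in there; the default is to fail.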
>>>>>> Spark fails by default and has options to ignore overwrite and delete snapshots entirely. For Flink, perhaps we can make the following changes to align with Spark and also provide some flexibility for users:
>>>>>>
>>>>>> For Flink incremental / streaming reads:
>>>>>> A. By default, throw on non-append snapshots (like Spark; important for maintaining an accurate table view)
>>>>>> B. Add options to skip overwrite / delete snapshots (like Spark)
>>>>>> C. Add an option to read the append data of overwrite snapshots (allows down-the-line deduplication of upserts)
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Thanks,
>>>>>> Max
>>>>>>
>>>>>> On Thu, Jun 26, 2025 at 5:38 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>>>
>>>>>>> I agree with Peter, it would be weird to get an error on a DELETE snapshot if you have already explicitly opted in to reading the appends of OVERWRITE snapshots.
>>>>>>>
>>>>>>> Users may not be able to control the type of snapshot that gets created, so this would otherwise render the feature useless.
>>>>>>>
>>>>>>> Gyula
>>>>>>>
>>>>>>> On Thu, Jun 26, 2025 at 5:34 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> > Consequently, we must throw on DELETE snapshots, even if users opt-in to reading appends of OVERWRITE snapshots.
>>>>>>>>
>>>>>>>> OVERWRITE snapshots themselves could still contain deletes. So in this regard, I don't see a difference between DELETE and OVERWRITE snapshots.
>>>>>>>>
>>>>>>>> Maximilian Michels <m...@apache.org> wrote (on Thu, Jun 26, 2025 at 11:03 AM):
>>>>>>>>
>>>>>>>>> Thank you for your feedback, Steven and Gyula!
>>>>>>>>>
>>>>>>>>> @Steven
>>>>>>>>>
>>>>>>>>>> The Flink streaming read only consumes `append` commits. This is a snapshot-commit `DataOperation` type. You were talking about row-level appends, deletes, etc.
>>>>>>>>>
>>>>>>>>> Yes, there is no doubt that Flink streaming reads only process append snapshots. But the issue here is that they jump over "overwrite" snapshots, which contain both new data files and delete files. I don't think users are generally aware of this. Even worse, reading that way leads to an inconsistent view of the table state, which is a serious data integrity issue.
>>>>>>>>>
>>>>>>>>>>> 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>>>
>>>>>>>>>> For updates it is true. But what about actual row deletion?
>>>>>>>>>
>>>>>>>>> Good point! I had only considered upserts, but deletes definitely cannot be reconstructed via downstream deduplication the way upserts can. Consequently, we must throw on DELETE snapshots, even if users opt in to reading the appends of OVERWRITE snapshots.
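>>>>>>>>> To make the upsert case concrete, here is a minimal, untested sketch of downstream de-duplication with the Flink DataStream API. `UpsertRow`, its `key` and `version` fields, and the version-based ordering are placeholders made up for illustration:
>>>>>>>>>
>>>>>>>>>   import org.apache.flink.api.common.functions.RichFlatMapFunction;
>>>>>>>>>   import org.apache.flink.api.common.state.ValueState;
>>>>>>>>>   import org.apache.flink.api.common.state.ValueStateDescriptor;
>>>>>>>>>   import org.apache.flink.configuration.Configuration;
>>>>>>>>>   import org.apache.flink.util.Collector;
>>>>>>>>>
>>>>>>>>>   // Keep only the newest row per key, so appends re-read from
>>>>>>>>>   // OVERWRITE snapshots collapse back into a consistent view.
>>>>>>>>>   public class LatestVersionDedup extends RichFlatMapFunction<UpsertRow, UpsertRow> {
>>>>>>>>>     private transient ValueState<Long> lastVersion;
>>>>>>>>>
>>>>>>>>>     @Override
>>>>>>>>>     public void open(Configuration parameters) {
>>>>>>>>>       lastVersion = getRuntimeContext().getState(
>>>>>>>>>           new ValueStateDescriptor<>("last-version", Long.class));
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     @Override
>>>>>>>>>     public void flatMap(UpsertRow row, Collector<UpsertRow> out) throws Exception {
>>>>>>>>>       Long seen = lastVersion.value();
>>>>>>>>>       if (seen == null || row.version > seen) {  // drop stale duplicates
>>>>>>>>>         lastVersion.update(row.version);
>>>>>>>>>         out.collect(row);
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   // usage: source.keyBy(r -> r.key).flatMap(new LatestVersionDedup());
>>>>>>>>>
>>>>>>>>> This only works because every update also shows up as an append; actual row deletions leave no trace in the appends, which is why DELETE snapshots still have to throw.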
>>>>>>>>>> It would be interesting to hear what people think about this. In some scenarios, this is probably better.
>>>>>>>>>
>>>>>>>>>> In the scenario of primarily streaming append (and occasionally batch overwrite), is this always the preferred behavior?
>>>>>>>>>
>>>>>>>>> I think so, but I'm also curious about what others think.
>>>>>>>>>
>>>>>>>>> @Gyula
>>>>>>>>>
>>>>>>>>>> There is a slight risk with 3 that it may break pipelines where the current behaviour is well understood and expected, but that's probably a very small minority of the users.
>>>>>>>>>
>>>>>>>>> Agreed, that is a possible risk, but if we consider the hidden risk of effectively "losing" data, it looks like a small risk to take.
>>>>>>>>>
>>>>>>>>> -Max
>>>>>>>>>
>>>>>>>>> On Thu, Jun 26, 2025 at 9:18 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Max!
>>>>>>>>>>
>>>>>>>>>> I like this proposal, especially since proper streaming reads of deletes seem to be quite a bit of work based on recent efforts.
>>>>>>>>>>
>>>>>>>>>> Giving an option to include the append parts of OVERWRITE snapshots (2) is a great quick improvement that will unblock use-cases where the Iceberg table is used to store insert/upsert-only data, which is pretty common when storing metadata/enrichment information, for example. This can then later simply be removed once proper streaming reads are available.
>>>>>>>>>>
>>>>>>>>>> As for (3) as a default, I think it's reasonable given that this is a source of a lot of user confusion when they see that the source simply doesn't read *anything* even though new snapshots are produced to the table. There is a slight risk with 3 that it may break pipelines where the current behaviour is well understood and expected, but that's probably a very small minority of the users.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Gyula
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 25, 2025 at 9:27 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The Flink streaming read only consumes `append` commits. This is a snapshot-commit `DataOperation` type. You were talking about row-level appends, deletes, etc.
>>>>>>>>>>>
>>>>>>>>>>>> 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>>>>
>>>>>>>>>>> For updates it is true. But what about actual row deletion?
>>>>>>>>>>>
>>>>>>>>>>>> 3. Throw an error if overwrite snapshots are discovered in an append-only scan and the option in (2) is not activated
>>>>>>>>>>>
>>>>>>>>>>> It would be interesting to hear what people think about this. In some scenarios, this is probably better.
>>>>>>>>>>>
>>>>>>>>>>> In the scenario of primarily streaming append (and occasionally batch overwrite), is this always the preferred behavior?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 25, 2025 at 8:24 AM Maximilian Michels <m...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> It is well known that Flink and other Iceberg engines do not support "merge-on-read" in streaming/incremental read mode. There are plans to change that, see the "Improving Merge-On-Read Query Performance" thread, but that is not what this message is about.
>>>>>>>>>>>>
>>>>>>>>>>>> I used to think that when we incrementally read from an Iceberg table with Flink, we would simply not apply delete files, but still see appends. That is not the case. Instead, we run an append-only table scan [1], which silently skips over any "overwrite" [2] snapshots. That means that the consumer will miss both appends and deletes if the producer runs in "merge-on-read" mode.
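>>>>>>>>>>>> To see this on a concrete table, here is a small, untested sketch that walks the snapshot history with the public Iceberg API and reports what an append-only scan would surface versus skip:
>>>>>>>>>>>>
>>>>>>>>>>>>   import org.apache.iceberg.DataOperations;
>>>>>>>>>>>>   import org.apache.iceberg.Snapshot;
>>>>>>>>>>>>   import org.apache.iceberg.Table;
>>>>>>>>>>>>
>>>>>>>>>>>>   static void reportSnapshots(Table table) {
>>>>>>>>>>>>     for (Snapshot snapshot : table.snapshots()) {
>>>>>>>>>>>>       if (DataOperations.APPEND.equals(snapshot.operation())) {
>>>>>>>>>>>>         System.out.printf("read    %d (append)%n", snapshot.snapshotId());
>>>>>>>>>>>>       } else {
>>>>>>>>>>>>         // overwrite / delete / replace snapshots are not surfaced by
>>>>>>>>>>>>         // an append-only scan, including their new data files
>>>>>>>>>>>>         System.out.printf("skipped %d (%s)%n", snapshot.snapshotId(), snapshot.operation());
>>>>>>>>>>>>       }
>>>>>>>>>>>>     }
>>>>>>>>>>>>   }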
>>>>>>>>>>>> Now, it is debatable whether it is a good idea to skip applying deletes, but I'm not sure ignoring "overwrite" snapshots entirely is a great idea either. Not every upsert effectively results in a delete + append; many upserts are merely appends. Reading the appends at least gives consumers the chance to de-duplicate later on, while skipping the entire snapshot definitely means data loss downstream.
>>>>>>>>>>>>
>>>>>>>>>>>> I think we have three options:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Implement "merge-on-read" efficiently for incremental / streaming reads (currently a work in progress)
>>>>>>>>>>>> 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>>>>> 3. Throw an error if overwrite snapshots are discovered in an append-only scan and the option in (2) is not activated
>>>>>>>>>>>>
>>>>>>>>>>>> (2) and (3) are stop-gaps until we have (1).
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Max
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L89
>>>>>>>>>>>> [2] https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/api/src/main/java/org/apache/iceberg/DataOperations.java#L32
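>>>>>>>>>>>> PS: For option (2), a rough, untested sketch of how such an opt-in could be wired on the Flink source. The option name streaming-read-overwrite-appends is purely a placeholder; no such option exists today, and the table path is made up:
>>>>>>>>>>>>
>>>>>>>>>>>>   import org.apache.flink.api.common.eventtime.WatermarkStrategy;
>>>>>>>>>>>>   import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>>>>>>>>>>>   import org.apache.flink.table.data.RowData;
>>>>>>>>>>>>   import org.apache.iceberg.flink.TableLoader;
>>>>>>>>>>>>   import org.apache.iceberg.flink.source.IcebergSource;
>>>>>>>>>>>>
>>>>>>>>>>>>   StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>>>>>>>>>>>   TableLoader loader = TableLoader.fromHadoopTable("hdfs://nn/warehouse/db/events");  // placeholder
>>>>>>>>>>>>   IcebergSource<RowData> source = IcebergSource.forRowData()
>>>>>>>>>>>>       .tableLoader(loader)
>>>>>>>>>>>>       .streaming(true)
>>>>>>>>>>>>       .set("streaming-read-overwrite-appends", "true")  // hypothetical opt-in for (2)
>>>>>>>>>>>>       .build();
>>>>>>>>>>>>   env.fromSource(source, WatermarkStrategy.noWatermarks(), "Iceberg source");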