I like the first two points of your proposal, A (with a warning) + B, and changing the default in 2.0, but I would suggest avoiding C. I see a very limited use case for deduplication (only if there are no deletes in the table, just updates), and it would cause more confusion and configuration clutter.
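To make this concrete, here is a minimal sketch of how (A) plus the (B) skip flags could look in the Flink snapshot planning path. Everything below is hypothetical: the flag names just mirror Spark's streaming-skip-* options, and none of this exists in the connector today.

import org.apache.iceberg.DataOperations;
import org.apache.iceberg.Snapshot;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class SnapshotFilter {
  private static final Logger LOG = LoggerFactory.getLogger(SnapshotFilter.class);

  // Hypothetical flags, named after Spark's streaming-skip-* options.
  private final boolean skipOverwriteSnapshots;
  private final boolean skipDeleteSnapshots;

  SnapshotFilter(boolean skipOverwriteSnapshots, boolean skipDeleteSnapshots) {
    this.skipOverwriteSnapshots = skipOverwriteSnapshots;
    this.skipDeleteSnapshots = skipDeleteSnapshots;
  }

  /** Returns true if the snapshot should be planned, false if it is skipped. */
  boolean shouldConsume(Snapshot snapshot) {
    String op = snapshot.operation();
    if (DataOperations.APPEND.equals(op)) {
      return true; // append snapshots are always consumed
    }
    if ((DataOperations.OVERWRITE.equals(op) && skipOverwriteSnapshots)
        || (DataOperations.DELETE.equals(op) && skipDeleteSnapshots)) {
      // (B): explicit opt-in to the old skipping behavior, made visible via WARN
      LOG.warn("Skipping {} snapshot {}; the emitted stream will not reflect it",
          op, snapshot.snapshotId());
      return false;
    }
    // (A): fail loudly instead of silently dropping data
    throw new UnsupportedOperationException(
        "Cannot incrementally read snapshot " + snapshot.snapshotId()
            + " with operation '" + op + "'");
  }
}

During a deprecation period, (A) could log the same WARN instead of throwing, with the throw becoming the default in 2.0.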
On Mon, Jun 30, 2025 at 11:50 AM Maximilian Michels <m...@apache.org> wrote:

> That's essentially what I propose, minus the WARN message which is a great addition. The flag to skip overwrite snapshots should probably be there forever, similarly to Spark's flags to skip overwrite / delete snapshots. Here's again the proposal:
>
> For Flink incremental / streaming reads:
> A. By default, throw on non-append snapshots (like Spark, important for maintaining an accurate table view)
> B. Add options to skip overwrite / delete snapshots (like Spark)
> C. Add an option to read append data of overwrite snapshots (allow down-the-line deduplication of upserts)
>
> We could change (A) to just log a warning. Delaying throwing an exception for (A) until 2.0 means we are ok with randomly skipping over some data, as opposed to informing users and allowing them to choose what to do with the data. Warnings are easily overlooked.
>
> On Mon, Jun 30, 2025 at 11:38 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
>> Minimally, a LOG.warn message about deprecation.
>> Maybe a "hidden" flag which could turn back on skipping overwrite snapshots. This flag could be deprecated immediately and removed in the next release.
>> Maybe wait until 2.0, where we can introduce breaking changes?
>>
>> On Mon, Jun 30, 2025 at 11:19 AM Maximilian Michels <m...@apache.org> wrote:
>>
>>> What would such a grace period look like? Even if we defer this X amount of releases, some users will probably be surprised by this. When users upgrade their Iceberg version, they would generally expect slightly different (improved) behavior. Right now, some users are discovering that they are not reading all the data, which I believe is a much bigger surprise to them.
>>>
>>> On Fri, Jun 27, 2025 at 5:47 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> I would try to avoid breaking the current behaviour.
>>>> Maybe after some grace period it could be ok, but not "suddenly".
>>>>
>>>> On Fri, Jun 27, 2025 at 3:28 PM <gyula.f...@gmail.com> wrote:
>>>>
>>>>> Sounds good to me!
>>>>>
>>>>> Gyula
>>>>> Sent from my iPhone
>>>>>
>>>>> On 27 Jun 2025, at 13:48, Maximilian Michels <m...@apache.org> wrote:
>>>>>
>>>>> In my understanding, overwrite snapshots are specifically there to overwrite data, i.e. there is always an append for a delete. That is also the case for the Flink Iceberg sink. There may be other writers which use overwrite snapshots differently. But point taken, we may need to also opt in to skipping over delete snapshots, in addition to overwrite snapshots.
>>>>>
>>>>> I've taken a look at how Spark handles this topic. Turns out, they came up with a similar approach:
>>>>>
>>>>>> [Spark] Iceberg only supports reading data from append snapshots. Overwrite snapshots cannot be processed and will cause an exception by default. Overwrites may be ignored by setting streaming-skip-overwrite-snapshots=true. Similarly, delete snapshots will cause an exception by default, and deletes may be ignored by setting streaming-skip-delete-snapshots=true.
>>>>>
>>>>> https://iceberg.apache.org/docs/1.8.1/spark-structured-streaming/#streaming-reads
>>>>>
>>>>> Spark fails by default and has options to ignore overwrite and delete snapshots entirely.
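For reference, a Spark streaming read that opts in to skipping both snapshot types would look roughly like this (untested sketch; the option keys are taken verbatim from the docs quoted above, the table name is a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IcebergStreamingRead {
  public static void main(String[] args) throws Exception {
    SparkSession spark =
        SparkSession.builder().appName("iceberg-streaming-read").getOrCreate();

    // Without these options, Spark throws on overwrite / delete snapshots.
    Dataset<Row> stream = spark.readStream()
        .format("iceberg")
        .option("streaming-skip-overwrite-snapshots", "true")
        .option("streaming-skip-delete-snapshots", "true")
        .load("db.table_name");

    stream.writeStream().format("console").start().awaitTermination();
  }
}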
>>>>> For Flink, perhaps we can make the following changes to align with Spark and also provide some flexibility for users:
>>>>>
>>>>> For Flink incremental / streaming reads:
>>>>> A. By default, throw on non-append snapshots (like Spark, important for maintaining an accurate table view)
>>>>> B. Add options to skip overwrite / delete snapshots (like Spark)
>>>>> C. Add an option to read append data of overwrite snapshots (allow down-the-line deduplication of upserts)
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Thanks,
>>>>> Max
>>>>>
>>>>> On Thu, Jun 26, 2025 at 5:38 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>>
>>>>>> I agree with Peter, it would be weird to get an error on a DELETE snapshot if you already explicitly opted in to reading the appends of OVERWRITE snapshots.
>>>>>>
>>>>>> Users may not be able to control the type of snapshot to be created, so this would otherwise render this feature useless.
>>>>>>
>>>>>> Gyula
>>>>>>
>>>>>> On Thu, Jun 26, 2025 at 5:34 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> > Consequently, we must throw on DELETE snapshots, even if users opt in to reading appends of OVERWRITE snapshots.
>>>>>>>
>>>>>>> OVERWRITE snapshots themselves could still contain deletes. So in this regard, I don't see a difference between the DELETE and the OVERWRITE snapshots.
>>>>>>>
>>>>>>> On Thu, Jun 26, 2025 at 11:03 AM Maximilian Michels <m...@apache.org> wrote:
>>>>>>>
>>>>>>>> Thank you for your feedback Steven and Gyula!
>>>>>>>>
>>>>>>>> @Steven
>>>>>>>>
>>>>>>>>> the Flink streaming read only consumes `append` only commits. This is a snapshot commit `DataOperation` type. You were talking about row-level appends, deletes etc.
>>>>>>>>
>>>>>>>> Yes, there is no doubt that Flink streaming reads only process append snapshots. But the issue here is that they jump over "overwrite" snapshots, which contain both new data files and delete files. I don't think users are generally aware of this. Even worse, reading that way leads to an inconsistent view of the table state, which is a serious data integrity issue.
>>>>>>>>
>>>>>>>>> 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>
>>>>>>>>> For updates it is true. But what about actual row deletion?
>>>>>>>>
>>>>>>>> Good point! I had only considered upserts, but deletes definitely cannot be reconstructed via downstream deduplication the way upserts can. Consequently, we must throw on DELETE snapshots, even if users opt in to reading appends of OVERWRITE snapshots.
>>>>>>>>
>>>>>>>>> It would be interesting to hear what people think about this. In some scenarios, this is probably better.
>>>>>>>>
>>>>>>>>> In the scenario of primarily streaming append (and occasionally batch overwrite), is this always a preferred behavior?
>>>>>>>>
>>>>>>>> I think so, but I'm also curious about what others think.
>>>>>>>>
>>>>>>>> @Gyula
>>>>>>>>
>>>>>>>>> There is a slight risk with 3 that it may break pipelines where the current behaviour is well understood and expected, but that's probably a very small minority of the users.
>>>>>>>> Agreed, that is a possible risk, but if we consider the hidden risk of effectively "losing" data, it looks like a small risk to take.
>>>>>>>>
>>>>>>>> -Max
>>>>>>>>
>>>>>>>> On Thu, Jun 26, 2025 at 9:18 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Max!
>>>>>>>>>
>>>>>>>>> I like this proposal, especially since proper streaming reads of deletes seem to be quite a bit of work based on recent efforts.
>>>>>>>>>
>>>>>>>>> Giving an option to include the append parts of OVERWRITE snapshots (2) is a great quick improvement that will unblock use cases where the Iceberg table is used to store insert/upsert-only data, which is pretty common when storing metadata/enrichment information, for example. This can then later simply be removed once proper streaming reads are available.
>>>>>>>>>
>>>>>>>>> As for (3) as a default, I think it's reasonable given that this is a source of a lot of user confusion when they see that the source simply doesn't read *anything* even if new snapshots are produced to the table. There is a slight risk with 3 that it may break pipelines where the current behaviour is well understood and expected, but that's probably a very small minority of the users.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Gyula
>>>>>>>>>
>>>>>>>>> On Wed, Jun 25, 2025 at 9:27 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> the Flink streaming read only consumes `append` only commits. This is a snapshot commit `DataOperation` type. You were talking about row-level appends, deletes etc.
>>>>>>>>>>
>>>>>>>>>> > 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>>>
>>>>>>>>>> For updates it is true. But what about actual row deletion?
>>>>>>>>>>
>>>>>>>>>> > 3. Throw an error if overwrite snapshots are discovered in an append-only scan and the option in (2) is not activated
>>>>>>>>>>
>>>>>>>>>> It would be interesting to hear what people think about this. In some scenarios, this is probably better.
>>>>>>>>>>
>>>>>>>>>> In the scenario of primarily streaming append (and occasionally batch overwrite), is this always a preferred behavior?
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 25, 2025 at 8:24 AM Maximilian Michels <m...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> It is well known that Flink and other Iceberg engines do not support "merge-on-read" in streaming/incremental read mode. There are plans to change that, see the "Improving Merge-On-Read Query Performance" thread, but this is not what this message is about.
>>>>>>>>>>>
>>>>>>>>>>> I used to think that when we incrementally read from an Iceberg table with Flink, we would simply not apply delete files but still see appends. That is not the case. Instead, we run an append-only table scan [1], which silently skips over any "overwrite" [2] snapshots. That means the consumer will skip both appends and deletes if the producer runs in "merge-on-read" mode.
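Simplified, the append-only planning described in the last quoted paragraph boils down to something like the following sketch against the public Iceberg API (not the exact FlinkSplitPlanner code linked in [1]):

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class AppendOnlyPlanner {
  // Plans only the data files committed by APPEND snapshots in (from, to].
  // OVERWRITE and DELETE snapshots in that range are passed over without
  // any error, which is exactly the behavior discussed in this thread.
  public static CloseableIterable<FileScanTask> planAppendsOnly(
      Table table, long fromSnapshotId, long toSnapshotId) {
    return table.newIncrementalAppendScan()
        .fromSnapshotExclusive(fromSnapshotId)
        .toSnapshot(toSnapshotId)
        .planFiles();
  }
}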
>>>>>>>>>>> Now, it is debatable whether it is a good idea to skip applying deletes, but I'm not sure ignoring "overwrite" snapshots entirely is a great idea either. Not every upsert effectively results in a delete + append; many upserts are merely appends. Reading appends at least gives consumers the chance to de-duplicate later on, but skipping the entire snapshot definitely means data loss downstream.
>>>>>>>>>>>
>>>>>>>>>>> I think we have three options:
>>>>>>>>>>>
>>>>>>>>>>> 1. Implement "merge-on-read" efficiently for incremental / streaming reads (currently a work in progress)
>>>>>>>>>>> 2. Add an option to read appended data of overwrite snapshots to allow users to de-duplicate downstream (opt-in config)
>>>>>>>>>>> 3. Throw an error if overwrite snapshots are discovered in an append-only scan and the option in (2) is not activated
>>>>>>>>>>>
>>>>>>>>>>> (2) and (3) are stopgaps until we have (1).
>>>>>>>>>>>
>>>>>>>>>>> What do you think?
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Max
>>>>>>>>>>>
>>>>>>>>>>> [1] https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L89
>>>>>>>>>>> [2] https://github.com/apache/iceberg/blob/5b50afe8f2b4cf16af6c015625385021b44aca14/api/src/main/java/org/apache/iceberg/DataOperations.java#L32
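As a footnote to the point earlier in the thread that OVERWRITE snapshots can themselves contain deletes: the snapshot summary carries file-level counters that can at least tell an effectively append-only overwrite apart from one that removed data files or added delete files. A rough sketch (assuming the writer populates the standard summary fields, which is not guaranteed):

import java.util.Map;
import org.apache.iceberg.DataOperations;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.SnapshotSummary;

public class OverwriteInspector {
  // Returns true if this OVERWRITE snapshot neither removed data files nor
  // added delete files, so an append-only reader would see all of its rows.
  public static boolean isEffectivelyAppendOnly(Snapshot snapshot) {
    if (!DataOperations.OVERWRITE.equals(snapshot.operation())) {
      return DataOperations.APPEND.equals(snapshot.operation());
    }
    Map<String, String> summary = snapshot.summary();
    if (summary == null) {
      return false; // no summary, assume the worst
    }
    long deletedDataFiles =
        Long.parseLong(summary.getOrDefault(SnapshotSummary.DELETED_FILES_PROP, "0"));
    long addedDeleteFiles =
        Long.parseLong(summary.getOrDefault(SnapshotSummary.ADDED_DELETE_FILES_PROP, "0"));
    return deletedDataFiles == 0 && addedDeleteFiles == 0;
  }
}

Note that these counters cannot tell whether removed rows were superseded by upserts or genuinely deleted, so this only separates out the trivially safe case.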