I went through the proposal and left comments as well. Thanks for working on it, Russell!
I don't see a good solution for how row lineage can work with equality deletes. If that remains the case, I would be in favor of not allowing equality deletes at all when row lineage is enabled, as opposed to treating all added data records as new. I will spend more time thinking about whether we can make it work.

- Anton

On Wed, Aug 28, 2024 at 12:41, Ryan Blue <b...@databricks.com.invalid> wrote:

> Sounds good to me. Thanks for pushing this forward, Russell!
>
> On Tue, Aug 27, 2024 at 7:17 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> I think folks have had a lot of good comments, and since there haven't been a lot of strong opinions I'm going to take what I think are the least interesting options and move them into the "discarded" section. Please continue to comment, and let's make sure anything that folks think is a blocker for a spec PR is eliminated. If we have general consensus at a high level, I think we can move to discussing the actual spec changes on a spec change PR.
>>
>> I'm going to be keeping the proposals for:
>>
>> Global Identifier as the Identifier
>> and
>> Last Updated Sequence Number as the Version
>>
>> On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid> wrote:
>>
>>> The situation in which you would use equality deletes is when you do not want to read the existing table data. That seems at odds with a feature like row-level tracking, where you do want to keep track. To me, it would be a reasonable solution to just say that equality deletes can't be used in tables where row-level tracking is enabled.
>>>
>>> On Mon, Aug 19, 2024 at 11:34 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> As far as I know, Flink is actually the only engine we have at the moment that can produce equality deletes, and only equality deletes have this specific problem. Since an equality delete can be written without actually knowing whether rows are being updated or not, it is always ambiguous whether a new row is an updated row, a newly added row, or a newly added row that happens to follow a delete of an existing row.
>>>>
>>>> I think in this case we need to ignore row_versioning and just give every new row a brand new identifier. For a reader this means all updates look like a "delete" plus an "add", with no "updates". For other processes (copy-on-write and position deletes) we only mark records as deleted or updated after finding them first, which makes it easy to take the lineage identifier from the source record and update it. For Spark, we just kept working on engine improvements (like SPJ and dynamic partition pushdown) to make that scan and join faster, but latency is probably still somewhat higher.
>>>>
>>>> I think we could theoretically resolve equality deletes into updates at compaction time, but only if the user first defines accurate "row identity" columns, because otherwise we have no way of determining whether rows were updated or not. This is basically the issue we have now in the CDC procedures. Ideally, I think we need to find a way to have Flink locate updated rows at runtime using some better indexing structure, or something like that as you suggested.
>>>>
>>>> On Sat, Aug 17, 2024 at 1:07 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>
>>>>> Hi Russell,
>>>>>
>>>>> As discussed offline, this would be very hard to implement with the current Flink CDC write strategies.
>>>>> I think this is true for every streaming writer.
>>>>>
>>>>> For tracking the previous version of a row, the streaming writer would need to scan the table, and it would need to do so for every record to find the previous version. This could be feasible if the data were stored in a way that supports fast queries on the primary key, like an LSM tree (see: Paimon [1]); otherwise it would be prohibitively costly and unfeasible for higher loads. So adding a new storage strategy could be one solution.
>>>>>
>>>>> Alternatively, we might find a way for compaction to update the lineage fields. We could provide a way to link the equality deletes to the new rows which updated them during the write, then on compaction we could update the lineage fields based on this info.
>>>>>
>>>>> Are there any better ideas from Spark streaming which we can adopt?
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> [1] - https://paimon.apache.org/docs/0.8/
>>>>>
>>>>> On Sat, Aug 17, 2024, 01:06 Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> Hi Y'all,
>>>>>>
>>>>>> We've been working on a new proposal to add Row Lineage to Iceberg in the V3 Spec. The general idea is to give every row a unique identifier as well as a marker of what version of the row it is. This should let us build a variety of features related to CDC, Incremental Processing and Audit Logging. If you are interested, please check out the linked proposal below. This will require compliance from all engines to be really useful, so it's important we come to consensus on whether or not this is possible.
>>>>>>
>>>>>> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>>>>>>
>>>>>> Thank you for your consideration,
>>>>>> Russ
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>
>
> --
> Ryan Blue
> Databricks
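
For readers following the thread, here is a minimal sketch of the semantics being debated above. It is plain Python with hypothetical names (row_id standing in for the proposal's "Global Identifier", last_updated_seq for the "Last Updated Sequence Number"), not Iceberg APIs or spec field names. It shows how a reader could derive CDC-style changes from the lineage fields, and why a writer that only emits equality deletes, and therefore assigns fresh identifiers to new rows, makes every update surface as a delete plus an insert.

# Illustrative sketch only (not Iceberg code): rows are plain dicts carrying
# the two lineage fields discussed above, here named row_id ("Global
# Identifier") and last_updated_seq ("Last Updated Sequence Number").

def diff_snapshots(old_rows, new_rows):
    """Derive CDC-style changes by joining two snapshots on row_id."""
    old_by_id = {r["row_id"]: r for r in old_rows}
    new_by_id = {r["row_id"]: r for r in new_rows}
    changes = []
    for row_id, new in new_by_id.items():
        old = old_by_id.get(row_id)
        if old is None:
            changes.append(("insert", new))
        elif new["last_updated_seq"] > old["last_updated_seq"]:
            changes.append(("update", old, new))
    for row_id, old in old_by_id.items():
        if row_id not in new_by_id:
            changes.append(("delete", old))
    return changes

snapshot_1 = [
    {"row_id": 1, "last_updated_seq": 1, "name": "a"},
    {"row_id": 2, "last_updated_seq": 1, "name": "b"},
]

# A copy-on-write or position-delete writer reads the old row before replacing
# it, so it can carry row_id 1 forward and bump the version:
cow_snapshot_2 = [
    {"row_id": 1, "last_updated_seq": 2, "name": "a-updated"},
    {"row_id": 2, "last_updated_seq": 1, "name": "b"},
]
print(diff_snapshots(snapshot_1, cow_snapshot_2))
# -> one ('update', ...) change for row_id 1

# An equality-delete writer never looks up the old row, so the replacement row
# gets a brand-new identifier and the same change surfaces as delete + insert:
eq_snapshot_2 = [
    {"row_id": 3, "last_updated_seq": 2, "name": "a-updated"},
    {"row_id": 2, "last_updated_seq": 1, "name": "b"},
]
print(diff_snapshots(snapshot_1, eq_snapshot_2))
# -> ('insert', ...) for row_id 3 and ('delete', ...) for row_id 1

This mirrors the behavior Russell describes above (every new row gets a brand new identifier when equality deletes are in play) and is the reason Anton and Ryan lean toward disallowing equality deletes on tables with row lineage enabled.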