One for each Table Version? Maybe worth thinking about going forward. We had a
little discussion about this at the community sync-up last Wednesday, and the
general consensus is that we keep doing things the way we are doing them
until it becomes too unwieldy, then figure out a new solution. Feel free to
start up another thread though; it's worth thinking about.

On Sat, Sep 14, 2024 at 12:43 AM Manu Zhang <owenzhang1...@gmail.com> wrote:

> Thanks Russell. This isn't a question on the proposal itself, but I find it a
> bit hard to follow and maintain all three specs in one place. We are also
> publishing an unfinalized spec to the website. Would it be better to
> maintain the specs in a "copy-on-write" style, i.e. each spec having its own
> format file?
>
> Sorry to go off topic; I can start a separate thread if you think this
> concern is valid.
>
>
> On Sat, Sep 14, 2024 at 6:33 AM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> The pull request is available; please focus any remaining comments there and
>> we can wrap this one up.
>>
>> https://github.com/apache/iceberg/pull/11130
>>
>> On Thu, Aug 29, 2024 at 11:20 AM rdb...@gmail.com <rdb...@gmail.com>
>> wrote:
>>
>>> +1 for making row lineage and equality deletes mutually exclusive.
>>>
>>> The idea behind equality deletes is to avoid needing to read existing
>>> data in order to delete records. That doesn't fit with row lineage, because
>>> the purpose of lineage is to identify when a row changes by maintaining an
>>> identifier, which would have to be read.
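>>>
>>> To make that contrast concrete, here is a toy sketch (plain Java; the
>>> types and names are mine for illustration, not Iceberg's):
>>>
>>> class DeleteExamples {
>>>   // A position delete names the exact row it removes, so the writer
>>>   // has already read that row and could carry its lineage fields
>>>   // forward to any replacement row.
>>>   record PositionDelete(String dataFile, long position) {}
>>>
>>>   // An equality delete only states a predicate ("delete every row
>>>   // where id = 5"). The writer never reads the matching rows, so it
>>>   // never sees the identifiers that lineage needs to maintain.
>>>   record EqualityDelete(String column, Object value) {}
>>>
>>>   public static void main(String[] args) {
>>>     System.out.println(new PositionDelete("data-001.parquet", 42L));
>>>     System.out.println(new EqualityDelete("id", 5));
>>>   }
>>> }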
>>>
>>> On Wed, Aug 28, 2024 at 4:16 PM Anton Okolnychyi <aokolnyc...@gmail.com>
>>> wrote:
>>>
>>>> I went through the proposal and left comments as well. Thanks for
>>>> working on it, Russell!
>>>>
>>>> I don't see a good solution for how row lineage can work with equality
>>>> deletes. If there isn't one, I would be in favor of not allowing equality
>>>> deletes at all when row lineage is enabled, as opposed to treating all added
>>>> data records as new. I will spend more time thinking about whether we can
>>>> make it work.
>>>>
>>>> - Anton
>>>>
>>>> On Wed, Aug 28, 2024 at 12:41 Ryan Blue <b...@databricks.com.invalid>
>>>> wrote:
>>>>
>>>>> Sounds good to me. Thanks for pushing this forward, Russell!
>>>>>
>>>>> On Tue, Aug 27, 2024 at 7:17 PM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> I think folks have had a lot of good comments, and since there haven't
>>>>>> been a lot of strong opinions, I'm going to take what I think are the
>>>>>> least interesting options and move them into the "discarded" section.
>>>>>> Please continue to comment, and let's make sure anything that folks
>>>>>> think is a blocker for a spec PR is eliminated. If we have general
>>>>>> consensus at a high level, I think we can move to discussing the actual
>>>>>> spec changes on a spec-change PR.
>>>>>>
>>>>>> I'm going to keep the proposals for:
>>>>>>
>>>>>> Global Identifier as the Identifier
>>>>>> and
>>>>>> Last Updated Sequence number as the Version
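>>>>>>
>>>>>> For concreteness, here is a toy sketch of what those two choices could
>>>>>> look like per row (Java; the field names are illustrative, not the
>>>>>> final spec names):
>>>>>>
>>>>>> class LineageSketch {
>>>>>>   // Every row carries a table-unique id plus the sequence number of
>>>>>>   // the commit that last changed it (its "version").
>>>>>>   record RowLineage(long rowId, long lastUpdatedSequenceNumber) {}
>>>>>>
>>>>>>   // An update keeps the id and only advances the version, so readers
>>>>>>   // can pair the old and new images of a row across snapshots.
>>>>>>   static RowLineage update(RowLineage prev, long commitSeq) {
>>>>>>     return new RowLineage(prev.rowId(), commitSeq);
>>>>>>   }
>>>>>>
>>>>>>   public static void main(String[] args) {
>>>>>>     var v1 = new RowLineage(17L, 3L); // row 17, last touched by commit 3
>>>>>>     var v2 = update(v1, 5L);          // commit 5 updates the same row
>>>>>>     System.out.println(v1 + " -> " + v2);
>>>>>>   }
>>>>>> }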
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> The situation in which you would use equality deletes is when you do
>>>>>>> not want to read the existing table data. That seems at odds with a
>>>>>>> feature like row-level tracking, where the whole point is to keep track
>>>>>>> of rows. To me, it would be a reasonable solution to just say that
>>>>>>> equality deletes can't be used in tables where row-level tracking is
>>>>>>> enabled.
>>>>>>>
>>>>>>> On Mon, Aug 19, 2024 at 11:34 AM Russell Spitzer <
>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> As far as I know, Flink is actually the only engine we have at the
>>>>>>>> moment that can produce equality deletes, and only equality deletes
>>>>>>>> have this specific problem. Since an equality delete can be written
>>>>>>>> without actually knowing whether rows are being updated or not, it is
>>>>>>>> always ambiguous whether a new row is an updated row, a newly added
>>>>>>>> row, or a new row appended alongside an unrelated delete.
>>>>>>>>
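>>>>>>>> As a tiny worked example of that ambiguity (toy Java, my own
>>>>>>>> modeling, not anything in the spec):
>>>>>>>>
>>>>>>>> class EqualityDeleteAmbiguity {
>>>>>>>>   public static void main(String[] args) {
>>>>>>>>     // One commit contains exactly these two entries:
>>>>>>>>     String delete = "equality delete: id = 5";
>>>>>>>>     String append = "new data row:    id = 5, name = 'b'";
>>>>>>>>
>>>>>>>>     // Two different histories produce this same commit:
>>>>>>>>     //  (a) an UPDATE of row 5, which should keep its lineage id and
>>>>>>>>     //      bump its version; or
>>>>>>>>     //  (b) a DELETE of row 5 plus an unrelated INSERT, which should
>>>>>>>>     //      mint a brand-new id.
>>>>>>>>     // Nothing in the commit distinguishes (a) from (b), because the
>>>>>>>>     // writer never read the old row 5.
>>>>>>>>     System.out.println(delete + "\n" + append);
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>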
>>>>>>>> I think in this case we need to ignore row_versioning and just give
>>>>>>>> every new row a brand-new identifier. For a reader this means all
>>>>>>>> updates look like a "delete" and an "add", with no "updates". For other
>>>>>>>> processes (COW and position deletes) we only mark records as deleted or
>>>>>>>> updated after finding them first, which makes it easy to take the
>>>>>>>> lineage identifier from the source record and change it. For Spark, we
>>>>>>>> just kept working on engine improvements (like SPJ and dynamic
>>>>>>>> partition pushdown) to try to make that scan and join faster, but we
>>>>>>>> probably still end up with somewhat higher latency.
>>>>>>>>
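>>>>>>>> A sketch of what that means for a CDC-style reader diffing two
>>>>>>>> snapshots purely by the lineage fields (toy Java types of my own, not
>>>>>>>> an actual API):
>>>>>>>>
>>>>>>>> import java.util.*;
>>>>>>>>
>>>>>>>> class ChangeReader {
>>>>>>>>   record Row(long rowId, long version, String payload) {}
>>>>>>>>
>>>>>>>>   // A shared rowId with a higher version is an UPDATE; ids present on
>>>>>>>>   // only one side are INSERTs/DELETEs. If equality deletes force fresh
>>>>>>>>   // ids on every new row, the UPDATE branch simply never fires.
>>>>>>>>   static void diff(List<Row> before, List<Row> after) {
>>>>>>>>     Map<Long, Row> old = new HashMap<>();
>>>>>>>>     before.forEach(r -> old.put(r.rowId(), r));
>>>>>>>>     for (Row r : after) {
>>>>>>>>       Row prev = old.remove(r.rowId());
>>>>>>>>       if (prev == null) System.out.println("INSERT " + r);
>>>>>>>>       else if (r.version() > prev.version())
>>>>>>>>         System.out.println("UPDATE " + prev + " -> " + r);
>>>>>>>>     }
>>>>>>>>     old.values().forEach(r -> System.out.println("DELETE " + r));
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   public static void main(String[] args) {
>>>>>>>>     diff(List.of(new Row(1, 3, "a")),
>>>>>>>>          List.of(new Row(1, 5, "a2"), new Row(2, 5, "b")));
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>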
>>>>>>>> I think we could theoretically resolve equality deletes into updates
>>>>>>>> later, at compaction time, but only if the user first defines accurate
>>>>>>>> "row identity" columns, because otherwise we have no way of determining
>>>>>>>> whether rows were updated or not. This is basically the issue we have
>>>>>>>> now in the CDC procedures. Ideally, I think we need to find a way to
>>>>>>>> have Flink locate updated rows at runtime using some better indexing
>>>>>>>> structure or something like that, as you suggested.
>>>>>>>>
>>>>>>>> On Sat, Aug 17, 2024 at 1:07 AM Péter Váry <
>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Russell,
>>>>>>>>>
>>>>>>>>> As discussed offline, this would be very hard to implement with
>>>>>>>>> the current Flink CDC write strategies. I think this is true for
>>>>>>>>> every streaming writer.
>>>>>>>>>
>>>>>>>>> To track the previous version of a row, the streaming writer would
>>>>>>>>> need to scan the table for every record to find that record's previous
>>>>>>>>> version. This could be feasible if the data were stored in a way that
>>>>>>>>> supports fast queries on the primary key, like an LSM tree (see:
>>>>>>>>> Paimon [1]); otherwise it would be prohibitively costly and unfeasible
>>>>>>>>> for higher loads. So adding a new storage strategy could be one
>>>>>>>>> solution.
>>>>>>>>>
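>>>>>>>>> Roughly the lookup that puts on the write hot path (toy Java; the
>>>>>>>>> in-memory map stands in for whatever key-ordered structure, e.g. an
>>>>>>>>> LSM tree, would make it cheap at scale):
>>>>>>>>>
>>>>>>>>> import java.util.*;
>>>>>>>>>
>>>>>>>>> class StreamingUpsert {
>>>>>>>>>   record Lineage(long rowId, long version) {}
>>>>>>>>>
>>>>>>>>>   // Without an index this is a table scan per incoming record; with
>>>>>>>>>   // a key-ordered store it becomes a point read.
>>>>>>>>>   static final Map<String, Lineage> index = new HashMap<>();
>>>>>>>>>
>>>>>>>>>   static Lineage upsert(String primaryKey, long commitSeq, long freshId) {
>>>>>>>>>     Lineage prev = index.get(primaryKey);
>>>>>>>>>     // Carry the old rowId forward on update; mint a new one otherwise.
>>>>>>>>>     Lineage next = (prev == null)
>>>>>>>>>         ? new Lineage(freshId, commitSeq)
>>>>>>>>>         : new Lineage(prev.rowId(), commitSeq);
>>>>>>>>>     index.put(primaryKey, next);
>>>>>>>>>     return next;
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   public static void main(String[] args) {
>>>>>>>>>     System.out.println(upsert("pk-1", 1, 100)); // insert: mints id 100
>>>>>>>>>     System.out.println(upsert("pk-1", 2, 101)); // update: keeps id 100
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>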
>>>>>>>>> Alternatively, we might find a way for compaction to update the
>>>>>>>>> lineage fields. We could provide a way to link the equality deletes
>>>>>>>>> to the new rows that updated them at write time; then, during
>>>>>>>>> compaction, we could update the lineage fields based on this info.
>>>>>>>>>
>>>>>>>>> Are there any better ideas from Spark streaming that we can adopt?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> [1] - https://paimon.apache.org/docs/0.8/
>>>>>>>>>
>>>>>>>>> On Sat, Aug 17, 2024, 01:06 Russell Spitzer <
>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Y'all,
>>>>>>>>>>
>>>>>>>>>> We've been working on a new proposal to add Row Lineage to
>>>>>>>>>> Iceberg in the V3 spec. The general idea is to give every row a
>>>>>>>>>> unique identifier as well as a marker of which version of the row it
>>>>>>>>>> is. This should let us build a variety of features related to CDC,
>>>>>>>>>> incremental processing, and audit logging. If you are interested,
>>>>>>>>>> please check out the linked proposal below. This will require
>>>>>>>>>> compliance from all engines to be really useful, so it's important
>>>>>>>>>> that we come to consensus on whether or not this is possible.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thank you for your consideration,
>>>>>>>>>> Russ
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Databricks
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Databricks
>>>>>
>>>>
