I went through the proposal and left comments as well. Thanks for working
on it, Russell!

I don't see a good solution for how row lineage can work with equality
deletes. If that remains the case, I would be in favor of not allowing
equality deletes at all when row lineage is enabled, as opposed to treating
all added data records as new. I will spend more time thinking about whether
we can make it work.

- Anton

Wed, Aug 28, 2024 at 12:41 Ryan Blue <b...@databricks.com.invalid> wrote:

> Sounds good to me. Thanks for pushing this forward, Russell!
>
> On Tue, Aug 27, 2024 at 7:17 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> I think folks have had a lot of good comments, and since there haven't
>> been many strong opinions, I'm going to take what I think are the least
>> interesting options and move them into the "discarded" section. Please
>> continue to comment, and let's make sure anything folks consider a blocker
>> for a spec PR is addressed. If we have general consensus at a high level, I
>> think we can move to discussing the actual spec changes on a spec change
>> PR.
>>
>> I'm going to be keeping the proposals for:
>>
>> Global Identifier as the Identifier
>> and
>> Last Updated Sequence number as the Version
>>
>>
>>
>> On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>>
>>> The situation in which you would use equality deletes is when you do not
>>> want to read the existing table data. That seems at odds with a feature
>>> like row-level tracking, where you do want to keep track of existing rows.
>>> To me, it would be a reasonable solution to just say that equality deletes
>>> can't be used in tables where row-level tracking is enabled.
>>>
>>> On Mon, Aug 19, 2024 at 11:34 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> As far as I know, Flink is actually the only engine we have at the
>>>> moment that can produce equality deletes, and only equality deletes have
>>>> this specific problem. Since an equality delete can be written without
>>>> actually knowing whether rows are being updated or not, it is always
>>>> ambiguous whether a new row is an update of an existing row, a newly
>>>> added row, or a new row that happens to replace a row that was separately
>>>> deleted.
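>>>>
>>>> To make that ambiguity concrete, here is a rough Python sketch. The
>>>> lineage field names (_row_id, _last_updated_seq) are placeholders for the
>>>> proposal's fields, not finalized spec names:
>>>>
>>>> # Existing table data, with hypothetical lineage fields already assigned.
>>>> existing = [{"id": 1, "data": "a", "_row_id": 100, "_last_updated_seq": 1}]
>>>>
>>>> # A Flink-style commit: an equality delete plus new data rows, written
>>>> # without ever reading `existing`.
>>>> equality_delete = {"id": 1}          # "delete any row where id = 1"
>>>> new_rows = [{"id": 1, "data": "b"}]  # no lineage carried over
>>>>
>>>> # From the writer's point of view these histories are indistinguishable:
>>>> #   1) row 100 was updated                   -> should keep _row_id = 100
>>>> #   2) row 100 was deleted and an unrelated  -> should get a fresh _row_id
>>>> #      row with id = 1 was inserted
>>>> # so there is no correct way to fill in lineage for new_rows at write time.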
>>>>
>>>> I think in this case we need to ignore row_versioning and just give
>>>> every new row a brand new identifier. For a reader this means all updates
>>>> look like a "delete" and an "add", and there are no "updates". For other
>>>> processes (COW and position deletes) we only mark records as deleted or
>>>> updated after finding them first, which makes it easy to take the lineage
>>>> identifier from the source record and change it. For Spark, we just kept
>>>> working on engine improvements (like SPJ and dynamic partition pushdown)
>>>> to make that scan and join faster, but we probably still have to accept
>>>> somewhat higher latency.
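>>>>
>>>> As a rough illustration of the difference between the two paths (the id
>>>> allocator and field names below are made up for the sketch):
>>>>
>>>> from itertools import count
>>>>
>>>> _ids = count(1000)  # hypothetical allocator for new row identifiers
>>>>
>>>> def cow_update(matched_row, new_values, commit_seq):
>>>>     # COW / position deletes locate the old row first, so lineage can be
>>>>     # carried forward: same _row_id, new version.
>>>>     return {**matched_row, **new_values, "_last_updated_seq": commit_seq}
>>>>
>>>> def equality_delete_write(new_values, commit_seq):
>>>>     # The old row was never read, so the only safe option is a brand new
>>>>     # identity: readers see a "delete" plus an "add", never an "update".
>>>>     return {**new_values, "_row_id": next(_ids),
>>>>             "_last_updated_seq": commit_seq}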
>>>>
>>>> I think we could theoretically resolve equality deletes into updates at
>>>> compaction time, but again only if the user first defines accurate "row
>>>> identity" columns, because otherwise we have no way of determining whether
>>>> rows were updated or not. This is basically the issue we have now in the
>>>> CDC procedures. Ideally, I think we need to find a way to have Flink
>>>> locate updated rows at runtime using some better indexing structure or
>>>> something like that, as you suggested.
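>>>>
>>>> A very rough sketch of what that compaction-time resolution might look
>>>> like, assuming user-defined identity columns (all names here are
>>>> illustrative, nothing in the current spec):
>>>>
>>>> def resolve_lineage(deleted_rows, added_rows, identity_cols, commit_seq):
>>>>     # Match equality-deleted rows to newly added rows on the identity
>>>>     # columns; matches become updates (lineage carried forward), the rest
>>>>     # remain plain deletes and inserts.
>>>>     by_key = {tuple(r[c] for c in identity_cols): r for r in deleted_rows}
>>>>     resolved = []
>>>>     for row in added_rows:
>>>>         old = by_key.pop(tuple(row[c] for c in identity_cols), None)
>>>>         if old is not None:
>>>>             row = {**row, "_row_id": old["_row_id"],
>>>>                    "_last_updated_seq": commit_seq}
>>>>         resolved.append(row)
>>>>     return resolved  # unmatched rows would get fresh identifiers elsewhere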
>>>>
>>>> On Sat, Aug 17, 2024 at 1:07 AM Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Russell,
>>>>>
>>>>> As discussed offline, this would be very hard to implement with the
>>>>> current Flink CDC write strategies. I think this is true for every
>>>>> streaming writer.
>>>>>
>>>>> For tracking the previous version of a row, the streaming writer would
>>>>> need to scan the table, and it would need to do so for every record to
>>>>> find the previous version. This could work if the data were stored in a
>>>>> way that supports fast queries on the primary key, like an LSM tree (see
>>>>> Paimon [1]); otherwise it would be prohibitively costly and unfeasible
>>>>> for higher loads. So adding a new storage strategy could be one solution.
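>>>>>
>>>>> For illustration, the kind of per-key lookup a streaming writer would
>>>>> need in order to carry lineage forward (the index itself is assumed,
>>>>> e.g. Flink keyed state or an LSM-backed store; it is not something
>>>>> Iceberg provides today):
>>>>>
>>>>> lineage_index = {}  # primary key -> (_row_id, _last_updated_seq)
>>>>>
>>>>> def write_update(key, new_values, commit_seq, next_row_id):
>>>>>     # Without a fast index like this, finding the previous version of
>>>>>     # each incoming record would mean scanning the table per record.
>>>>>     prev = lineage_index.get(key)
>>>>>     row_id = prev[0] if prev is not None else next_row_id()
>>>>>     lineage_index[key] = (row_id, commit_seq)
>>>>>     return {**new_values, "_row_id": row_id,
>>>>>             "_last_updated_seq": commit_seq}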
>>>>>
>>>>> Alternatively, we might find a way for compaction to update the lineage
>>>>> fields. We could provide a way to link equality deletes to the new rows
>>>>> that replaced them at write time, and then compaction could update the
>>>>> lineage fields based on this information.
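>>>>>
>>>>> Very loosely, that link could be as small as a per-commit side table
>>>>> mapping each equality delete to the data file and position of the row
>>>>> that replaced it (purely hypothetical structures, nothing like this
>>>>> exists in the format today):
>>>>>
>>>>> # (equality delete file, delete key) -> (new data file, row position)
>>>>> delete_links = {
>>>>>     ("eq-delete-001.parquet", ("id", 1)): ("data-042.parquet", 0),
>>>>> }
>>>>> # At compaction time, the old row's _row_id could be copied to the linked
>>>>> # new row and its version bumped, instead of guessing from identity
>>>>> # columns.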
>>>>>
>>>>> Are there any better ideas from Spark streaming that we could adopt?
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> [1] - https://paimon.apache.org/docs/0.8/
>>>>>
>>>>> On Sat, Aug 17, 2024, 01:06 Russell Spitzer <russell.spit...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Y'all,
>>>>>>
>>>>>> We've been working on a new proposal to add Row Lineage to Iceberg in
>>>>>> the V3 Spec. The general idea is to give every row a unique identifier
>>>>>> as well as a marker of what version of the row it is. This should let us
>>>>>> build a variety of features related to CDC, Incremental Processing and
>>>>>> Audit Logging. If you are interested, please check out the linked
>>>>>> proposal below. This will require compliance from all engines to be
>>>>>> really useful, so it's important we come to consensus on whether or not
>>>>>> this is possible.
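>>>>>>
>>>>>> As a concrete (purely illustrative) sketch of the idea, with placeholder
>>>>>> field names rather than the names defined in the proposal:
>>>>>>
>>>>>> row = {
>>>>>>     "id": 42,
>>>>>>     "data": "a",
>>>>>>     "_row_id": 1007,         # unique identifier, stable across updates
>>>>>>     "_last_updated_seq": 3,  # version: the commit that last changed it
>>>>>> }
>>>>>>
>>>>>> # An engine that updates the row keeps _row_id and bumps the version,
>>>>>> # which is what lets CDC and incremental consumers distinguish an update
>>>>>> # from an unrelated delete plus insert.
>>>>>> updated = {**row, "data": "b", "_last_updated_seq": 5}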
>>>>>>
>>>>>>
>>>>>> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>>>>>>
>>>>>>
>>>>>> Thank you for your consideration,
>>>>>> Russ
>>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>>
>>
>
> --
> Ryan Blue
> Databricks
>
