Re: [DISCUSS] V3 spec: add monotonic requirement to data DV

Péter Váry Sat, 22 Nov 2025 07:02:10 -0800

Apologies if I was unclear. As Steven also mentioned, I wanted to confirm
whether we agree on the clarification regarding the `row-id` and
`first-row-id`.


On Sat, Nov 22, 2025 at 3:28 PM Steven Wu <[email protected]>
wrote:

> Just to clarify, I was asking a question.
>
> Is it valid to add a new data file with a row?
>
>    - whose persisted row-id value is lower than the snapshot's
>    first-row-id
>    - whose last-updated-seq-number is not set and is inherited from the
>    snapshot sequence number
>
> Thanks,
> Steven
>
> On Fri, Nov 21, 2025 at 11:25 PM Péter Váry <[email protected]>
> wrote:
>
>> +1 for this proposal
>>
>> Slightly related, but we can move this to a separate thread if it needs
>> independent discussion: We should clarify the relationship between `row-id`
>> and `first-row-id`. This has come up several times in our discussions about
>> the equality delete removal proposal, where we considered generating
>> `row-ids` manually instead of relying on the auto-assignment feature.
>>
>> As discussed with Steven:
>>
>>> It is valid to add a new data file with a row:
>>>
>>>    - whose persisted row-id value is lower than the snapshot's
>>>    first-row-id
>>>    - whose last-updated-seq-number is not set and is inherited from the
>>>    snapshot sequence number
>>>
>>>
>> On Sat, Nov 22, 2025 at 5:29 AM Prashant Singh <[email protected]>
>> wrote:
>>
>>> +1 for making it explicit that an *undelete* of a row can't be done by
>>> unsetting the corresponding bit in the DV.
>>>
>>> *Rows should only be added via new data files* sounds reasonable to me!
>>>
>>> Apart from row lineage, allowing undelete also complicates operation type
>>> inference, like here [1], as we would now have to inspect the contents of
>>> these DVs to see whether it's an insert.
>>>
>>> [1] https://github.com/apache/iceberg/pull/14581#discussion_r2533057189
>>>
>>> On Sat, Nov 22, 2025 at 4:48 AM Szehon Ho <[email protected]>
>>> wrote:
>>>
>>>> It makes sense to me; it sounds like a minor clarification. For v2
>>>> position deletes, code like rewrite_position_deletes may already assume
>>>> this and would not work well if the assumption were violated, and the
>>>> same may be true of other code.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Fri, Nov 21, 2025 at 3:03 PM Steven Wu <[email protected]> wrote:
>>>>
>>>>> Similar weird behavior can also happen for V2 position delete files
>>>>> with `undelete`.
>>>>>
>>>>> In V2, there could be multiple position delete files (say pd1, pd2)
>>>>> associated with the same data file (say f1). Let's say pd1 deletes rows 5
>>>>> and 10 and pd2 deletes row 15. Consider two cases:
>>>>> 1. a new snapshot is committed with pd1 (DELETED), pd2 (EXISTING), and
>>>>> pd3 (ADDED). pd3 deletes only row 10, so row 5 is undeleted
>>>>> 2. a new snapshot is committed with pd1 (DELETED) and pd2 (EXISTING), so
>>>>> rows 5 and 10 are undeleted
>>>>>
>>>>> In either case, some rows are effectively added (back) to the table
>>>>> with a lower sequence number than the new snapshot's sequence number.
>>>>>
>>>>>
>>>>>
>>>>> Just to recap the question: should the spec (v2 and v3) spell out that
>>>>> `undelete row` is not allowed? Rows should only be added via new data
>>>>> files.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Nov 21, 2025 at 1:09 PM Steven Wu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> > Are we specifically stating somewhere that all row-ids should be
>>>>>> > higher than or equal to the snapshot's `first-row-id`?
>>>>>> > In my mental model the `first-row-id` is only applicable for rows
>>>>>> > that don't have a specific row-id assigned.
>>>>>>
>>>>>> I meant that an ADDED row should have a `row-id` higher than or equal to
>>>>>> the snapshot's `first-row-id`. An EXISTING or UPDATED row can have a
>>>>>> lower row-id.
>>>>>>
>>>>>> On Fri, Nov 21, 2025 at 1:04 PM Steven Wu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> > Can we create a validator to prevent this from happening?
>>>>>>>
>>>>>>> We don't have this problem with the Java implementation.
>>>>>>> `BaseDVFileWriter` merges the previous DV with the new delta DV, so
>>>>>>> there is no `undelete` behavior. I am not aware of any Java API that
>>>>>>> allows "undelete", so we probably don't need to add any validation
>>>>>>> code in the Java impl.
>>>>>>>
>>>>>>> Just thought it is good to spell it out in the spec so that
>>>>>>> clients/engines can be clear about the expected behavior.
>>>>>>>
>>>>>>> On Fri, Nov 21, 2025 at 12:18 PM Péter Váry <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Are we specifically stating somewhere that all row-ids should be
>>>>>>>> higher than or equal to the snapshot's `first-row-id`?
>>>>>>>> In my mental model the `first-row-id` is only applicable for rows
>>>>>>>> that don't have a specific row-id assigned.
>>>>>>>>
>>>>>>>> Nonetheless, I agree that the `row-id` and the
>>>>>>>> `last-updated-seq-num` should have changed to new values, so we can
>>>>>>>> say that undeleting a row is not allowed because of this.
>>>>>>>>
>>>>>>>> Can we create a validator to prevent this from happening?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Nov 21, 2025 at 9:11 PM Steven Wu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The undeleted row would have an invalid `row-id` and
>>>>>>>>> `last-updated-seq-num`. Since it is a new row (added back), it
>>>>>>>>> should have a `row-id` higher than or equal to the snapshot's
>>>>>>>>> `first-row-id`, and its `last-updated-seq-number` should
>>>>>>>>> inherit/have the new snapshot's sequence number.
>>>>>>>>>
>>>>>>>>> On Fri, Nov 21, 2025 at 11:48 AM Steven Wu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Should we clarify the V3 spec to explicitly forbid "*undelete*"
>>>>>>>>>> of a row by unsetting the DV bit? Unsetting a DV bit essentially
>>>>>>>>>> adds a row with a lower row-id than the snapshot's first-row-id,
>>>>>>>>>> which would violate the row lineage spec. With the restriction, DV
>>>>>>>>>> cardinality should be monotonically increasing.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Steven
>>>>>>>>>>
>>>>>>>>>
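
For readers of the archive, the "no undelete" rule discussed in this thread
can be sketched as a simple set comparison. This is only an illustration, not
Iceberg code: delete vectors and position delete files are modeled as plain
Python sets of row positions, and the helper names `undeleted_positions` and
`is_monotonic` are hypothetical. The sample data is the pd1/pd2/pd3 example
from Steven's message.

```python
def undeleted_positions(previous, current):
    """Positions that were deleted before but are no longer deleted."""
    return set(previous) - set(current)


def is_monotonic(previous, current):
    """True if every previously deleted position is still deleted."""
    return not undeleted_positions(previous, current)


# Deleted positions for data file f1 before the new snapshot:
# pd1 deletes rows 5 and 10, pd2 deletes row 15.
pd1, pd2 = {5, 10}, {15}
before = pd1 | pd2

# Case 1: pd1 is removed and pd3 (deleting only row 10) is added.
pd3 = {10}
after_case1 = pd2 | pd3

# Case 2: pd1 is removed and nothing replaces it.
after_case2 = set(pd2)

# A merge of the old deletes with a new delta (here, row 20) never undeletes.
after_merge = before | {20}

print(sorted(undeleted_positions(before, after_case1)))
print(sorted(undeleted_positions(before, after_case2)))
print(is_monotonic(before, after_merge))
```

A check like this only compares the set of deleted positions for one data
file across two snapshots; it says nothing about the separate `row-id` versus
`first-row-id` question raised above.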
