+1 for this proposal

Slightly related (we can move this to a separate thread if it needs
independent discussion): we should clarify the relationship between `row-id`
and `first-row-id`. This has come up several times in our discussions about
the equality delete removal proposal, where we considered generating
`row-id` values manually instead of relying on the auto-assignment feature.

As discussed with Steven:

> It is valid to add a new data file with a row:
>
>    - whose persisted row-id value is lower than the snapshot's
>    first-row-id
>    - whose last-updated-seq-number is not set and is inherited from the
>    snapshot sequence number
>
>
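To make the two cases above concrete, here is a minimal sketch of how a reader could resolve the lineage values. It assumes my understanding of the v3 inheritance rule (a missing `row-id` resolves to the file's `first-row-id` plus the row position, and a missing `last-updated-seq-number` resolves to the file's sequence number). The `Row` class and `effective_lineage` function are hypothetical names for illustration, not Iceberg APIs, and the file-level values are simplified to be the same as the snapshot's.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Row:
    pos: int                                # position of the row within its data file
    row_id: Optional[int] = None            # persisted row-id, None if not written
    last_updated_seq: Optional[int] = None  # persisted last-updated-seq-number


def effective_lineage(row: Row, first_row_id: int, seq_number: int) -> Tuple[int, int]:
    """Return (row-id, last-updated-seq-number) after applying inheritance.

    Assumption: persisted values win; missing values are inherited.
    """
    row_id = row.row_id if row.row_id is not None else first_row_id + row.pos
    seq = row.last_updated_seq if row.last_updated_seq is not None else seq_number
    return row_id, seq


# Hypothetical snapshot with first-row-id 1000 and sequence number 7.
carried_over = Row(pos=0, row_id=42)  # persisted row-id below first-row-id
fresh = Row(pos=1)                    # nothing persisted, everything inherited

print(effective_lineage(carried_over, 1000, 7))  # (42, 7)
print(effective_lineage(fresh, 1000, 7))         # (1001, 7)
```

The first row is the case described above: its persisted `row-id` (42) is below the snapshot's `first-row-id`, while its sequence number is inherited.
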
On Sat, Nov 22, 2025 at 5:29, Prashant Singh <[email protected]> wrote:

> +1 for making it explicit that an *undelete* of a row can't be done by
> unsetting the corresponding bit in the DV.
>
> *Rows should only be added via new data files* sounds reasonable to me!
>
> Apart from row lineage, it also complicates operation type inference,
> as in [1], since we would now have to inspect the contents of these DVs
> to see whether it's an insert.
>
> [1] https://github.com/apache/iceberg/pull/14581#discussion_r2533057189
>
> On Sat, Nov 22, 2025 at 4:48 AM Szehon Ho <[email protected]> wrote:
>
>> It makes sense to me, it sounds like a minor clarification.  For v2
>> position deletes, code like rewrite_position_deletes may have made some
>> assumptions like this and would not work well if violated, maybe other code
>> as well.
>>
>> Thanks
>> Szehon
>>
>> On Fri, Nov 21, 2025 at 3:03 PM Steven Wu <[email protected]> wrote:
>>
>>> Similar weird behavior can also happen for V2 position delete files with
>>> `undelete`.
>>>
>>> In V2, there could be multiple position delete files (say pd1, pd2)
>>> associated with the same data file (say f1). Let's say pd1 deletes rows 5
>>> and 10 and pd2 deletes row 15.
>>> 1. a new snapshot is committed with pd1 (DELETED), pd2 (EXISTING), and
>>> pd3 (ADDED). pd3 deletes only row 10 (row 5 is undeleted)
>>> 2. a new snapshot is committed with pd1 (DELETED) and pd2 (EXISTING)
>>>
>>> In either case, some rows are essentially added (back) to the table with
>>> a lower sequence number than the new snapshot's sequence number.
>>>
>>>
>>>
>>> Just to recap the question: should the spec (v2 and v3) spell out that
>>> `undelete row` is not allowed? Rows should only be added via new data files.
>>>
>>>
>>>
>>>
>>> On Fri, Nov 21, 2025 at 1:09 PM Steven Wu <[email protected]> wrote:
>>>
>>>> >Are we specifically stating somewhere that all row-ids should be
>>>> higher than or equal to the snapshot's `first-row-id`?
>>>> In my mental model the `first-row-id` is only applicable for rows that
>>>> don't have a specific row-id assigned.
>>>>
>>>> I meant an ADDED row should have a `row-id` higher than or equal to the
>>>> snapshot's `first-row-id`. An EXISTING or UPDATED row can have a lower
>>>> `row-id`.
>>>>
>>>> On Fri, Nov 21, 2025 at 1:04 PM Steven Wu <[email protected]> wrote:
>>>>
>>>>> > Can we create a validator to prevent this from happening?
>>>>>
>>>>> We don't have this problem with the Java implementation.
>>>>> `BaseDVFileWriter` merges the previous DV with the new delta DV. So there
>>>>> is no `undelete` behavior. I am not aware of any Java API to allow
>>>>> "undelete". So we probably don't need to add any validation code in the
>>>>> Java impl.
>>>>>
>>>>> Just thought it is good to spell it out in the spec so that
>>>>> clients/engines can be clear about the expected behavior.
>>>>>
>>>>> On Fri, Nov 21, 2025 at 12:18 PM Péter Váry <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Are we specifically stating somewhere that all row-ids should be
>>>>>> higher than or equal to the snapshot's `first-row-id`?
>>>>>> In my mental model the `first-row-id` is only applicable for rows
>>>>>> that don't have a specific row-id assigned.
>>>>>>
>>>>>> Nonetheless, I agree that the `row-id` and the
>>>>>> `last-updated-seq-num` should have changed to new values, so we can
>>>>>> say that undeleting a row is not allowed because of this.
>>>>>>
>>>>>> Can we create a validator to prevent this from happening?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 21, 2025 at 21:11, Steven Wu <[email protected]> wrote:
>>>>>>
>>>>>>> The undeleted row would have an invalid `row-id` and
>>>>>>> `last-updated-seq-num`. Since it is a new row (added back), it should
>>>>>>> have a `row-id` higher than or equal to the snapshot's `first-row-id`,
>>>>>>> and the `last-updated-seq-number` should inherit/have the new
>>>>>>> snapshot's sequence number.
>>>>>>>
>>>>>>> On Fri, Nov 21, 2025 at 11:48 AM Steven Wu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Should we clarify the V3 spec to explicitly forbid "*undelete*" of
>>>>>>>> a row by unsetting the DV bit? Unsetting a DV bit essentially adds a
>>>>>>>> row with a lower row-id than the snapshot's first-row-id, which would
>>>>>>>> violate the row lineage spec. With the restriction, DV cardinality
>>>>>>>> should be monotonically increasing.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Steven
>>>>>>>>
>>>>>>>
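
On the validator question raised in the quoted thread: the "no undelete" restriction amounts to requiring that a new DV is a superset of the DV it replaces for the same data file. Below is a rough sketch of that check, with deleted positions modelled as plain Python sets rather than Iceberg's actual bitmap format. `merge_dv` and `check_no_undelete` are hypothetical names; the merge only mirrors the behaviour described above for `BaseDVFileWriter` (previous DV merged with the new delta), it is not that class's code.

```python
from typing import Iterable, Set


def merge_dv(previous_dv: Iterable[int], delta: Iterable[int]) -> Set[int]:
    """Merge the previous DV with newly deleted positions (never drops a bit)."""
    return set(previous_dv) | set(delta)


def check_no_undelete(previous_dv: Iterable[int], new_dv: Iterable[int]) -> None:
    """Raise if the new DV clears a position that the previous DV had set."""
    restored = set(previous_dv) - set(new_dv)
    if restored:
        raise ValueError(f"undelete of positions {sorted(restored)} is not allowed")


previous = {5, 10}
merged = merge_dv(previous, {15})
check_no_undelete(previous, merged)  # passes: cardinality only grows

try:
    check_no_undelete(previous, {10})  # position 5 was cleared
except ValueError as err:
    print(err)
```

A writer that always goes through a merge like this cannot produce an undelete, which matches the observation above that the Java implementation does not have the problem; the check would only matter for clients that build DVs some other way.
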
