Here is the spec PR to clarify that undelete is not allowed. I will start a
vote thread for it.
https://github.com/apache/iceberg/pull/14731

Let me start a new discussion thread for the first-row-id and row-id
question for row lineage to get more attention and input.
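
For anyone skimming the thread below, the "no undelete" rule can be sketched
roughly as follows. This is only an illustration, not Iceberg code: it models
the delete state of one data file as a plain Python set of row positions (real
DVs are bitmaps), the function names are hypothetical, and pd1/pd2/pd3 are the
position delete files from the V2 example quoted further down.

```python
def undeleted_positions(previous_deletes, new_deletes):
    """Positions that were deleted before the commit but are live after it."""
    return set(previous_deletes) - set(new_deletes)


def validate_no_undelete(previous_deletes, new_deletes):
    """Raise if the new delete state would bring any deleted row back."""
    restored = undeleted_positions(previous_deletes, new_deletes)
    if restored:
        raise ValueError(
            f"undelete is not allowed; restored positions: {sorted(restored)}"
        )


# Delete state of data file f1 before the commit: pd1 and pd2 both apply.
pd1 = {5, 10}
pd2 = {15}
before = pd1 | pd2

# Case 1: pd1 is removed, pd2 stays, pd3 (only row 10) is added.
pd3 = {10}
after_case_1 = pd2 | pd3

# Case 2: pd1 is removed, pd2 stays, nothing is added.
after_case_2 = set(pd2)

# A valid commit only ever adds deleted positions.
after_valid = before | {20}
validate_no_undelete(before, after_valid)
```

Under this model a commit is valid only when the new delete set is a superset
of the previous one, which is the same as saying the number of deleted
positions never shrinks.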

On Sat, Nov 22, 2025 at 7:02 AM Péter Váry <[email protected]>
wrote:

> Apologies if I was unclear. As Steven also mentioned, I wanted to confirm
> whether we agree on the clarification regarding the `row-id` and
> `first-row-id`.
>
> Steven Wu <[email protected]> wrote (Sat, Nov 22, 2025, 15:28):
>
>> Just to clarify, I was asking a question.
>>
>> Is it valid to add a new data file with a row?
>>
>>    - whose persisted row-id value is lower than the snapshot's
>>    first-row-id
>>    - whose last-updated-seq-number is not set and is inherited from the
>>    snapshot sequence number
>>
>> Thanks,
>> Steven
>>
>> On Fri, Nov 21, 2025 at 11:25 PM Péter Váry <[email protected]>
>> wrote:
>>
>>> +1 for this proposal
>>>
>>> Slightly related, but we can move this to a separate thread if it needs
>>> independent discussion: We should clarify the relationship between `row-id`
>>> and `first-row-id`. This has come up several times in our discussions about
>>> the equality delete removal proposal, where we considered generating
>>> `row-ids` manually instead of relying on the auto-assignment feature.
>>>
>>> As discussed with Steven:
>>>
>>>> It is valid to add a new data file with a row:
>>>>
>>>>    - whose persisted row-id value is lower than the snapshot's
>>>>    first-row-id
>>>>    - whose last-updated-seq-number is not set and is inherited from the
>>>>    snapshot sequence number
>>>>
>>>>
>>> Prashant Singh <[email protected]> wrote (Sat, Nov 22, 2025, 5:29):
>>>
>>>> +1 for making it explicit that an *undelete* of a row can't be done by
>>>> unsetting the corresponding bit in the DV.
>>>>
>>>> *Rows should only be added via new data files* sounds reasonable to
>>>> me!
>>>>
>>>> Apart from row lineage, it also complicates operation type inference,
>>>> as in [1]: would we now have to inspect the contents of these DVs to
>>>> see whether it's an insert?
>>>>
>>>> [1] https://github.com/apache/iceberg/pull/14581#discussion_r2533057189
>>>>
>>>> On Sat, Nov 22, 2025 at 4:48 AM Szehon Ho <[email protected]>
>>>> wrote:
>>>>
>>>>> It makes sense to me; it sounds like a minor clarification. For v2
>>>>> position deletes, code like rewrite_position_deletes may have made
>>>>> assumptions like this and would not work well if they were violated;
>>>>> other code may as well.
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>> On Fri, Nov 21, 2025 at 3:03 PM Steven Wu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Similar weird behavior can also happen for V2 position delete files
>>>>>> with `undelete`.
>>>>>>
>>>>>> In V2, there could be multiple position delete files (say pd1, pd2)
>>>>>> associated with the same data file (say f1). Let's say pd1 deletes rows 5
>>>>>> and 10 and pd2 deletes row 15.
>>>>>> 1. a new snapshot is committed with pd1 (DELETED), pd2 (EXISTING),
>>>>>> and pd3 (ADDED). pd3 deletes only row 10 (row 5 is undeleted)
>>>>>> 2. a new snapshot is committed with pd1 (DELETED) and pd2 (EXISTING)
>>>>>> (rows 5 and 10 are undeleted)
>>>>>>
>>>>>> In either case, essentially some rows are added (back) to the table
>>>>>> with a lower sequence number than the new snapshot's sequence number.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Just to recap the question: should the spec (v2 and v3) spell out
>>>>>> that `undelete row` is not allowed? Rows should only be added via new
>>>>>> data files.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 21, 2025 at 1:09 PM Steven Wu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> >Are we specifically stating somewhere that all row-ids should be
>>>>>>> higher than or equal to the snapshot's `first-row-id`?
>>>>>>> In my mental model the `first-row-id` is only applicable for rows
>>>>>>> that don't have a specific row-id assigned.
>>>>>>>
>>>>>>> I meant an ADDED row should have a `row-id` higher than or equal to
>>>>>>> the snapshot's `first-row-id`. An EXISTING or UPDATED row can have a
>>>>>>> lower row-id.
>>>>>>>
>>>>>>> On Fri, Nov 21, 2025 at 1:04 PM Steven Wu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> > Can we create a validator to prevent this from happening?
>>>>>>>>
>>>>>>>> We don't have this problem with the Java implementation.
>>>>>>>> `BaseDVFileWriter` merges the previous DV with the new delta DV, so
>>>>>>>> there is no `undelete` behavior. I am not aware of any Java API to
>>>>>>>> allow "undelete". So we probably don't need to add any validation
>>>>>>>> code in the Java impl.
>>>>>>>>
>>>>>>>> Just thought it is good to spell it out in the spec so that
>>>>>>>> clients/engines can be clear about the expected behavior.
>>>>>>>>
>>>>>>>> On Fri, Nov 21, 2025 at 12:18 PM Péter Váry <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Are we specifically stating somewhere that all row-ids should be
>>>>>>>>> higher than or equal to the snapshot's `first-row-id`?
>>>>>>>>> In my mental model the `first-row-id` is only applicable for rows
>>>>>>>>> that don't have a specific row-id assigned.
>>>>>>>>>
>>>>>>>>> Nonetheless, I agree that the `row-id` and the
>>>>>>>>> `last-updated-seq-num` should have changed to new values, so we
>>>>>>>>> can say that undeleting a row is not allowed because of this.
>>>>>>>>>
>>>>>>>>> Can we create a validator to prevent this from happening?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Steven Wu <[email protected]> wrote (Fri, Nov 21, 2025, 21:11):
>>>>>>>>>
>>>>>>>>>> The undeleted row would have invalid `row-id` and
>>>>>>>>>> `last-updated-seq-num`. Since it is a new row (added back), it
>>>>>>>>>> should have a `row-id` higher than or equal to the snapshot's
>>>>>>>>>> `first-row-id`, and its `last-updated-seq-number` should
>>>>>>>>>> inherit/have the new snapshot's sequence number.
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 21, 2025 at 11:48 AM Steven Wu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Should we clarify the V3 spec to explicitly forbid "*undelete*"
>>>>>>>>>>> of a row by unsetting the DV bit? Unsetting a DV bit essentially
>>>>>>>>>>> adds a row with a lower row-id than the snapshot's first-row-id,
>>>>>>>>>>> which would violate the row lineage spec. With the restriction,
>>>>>>>>>>> DV cardinality should be monotonically increasing.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Steven
>>>>>>>>>>>
>>>>>>>>>>
