Re: [DISCUSS] V3 spec: add monotonic requirement to data DV

Steven Wu Mon, 01 Dec 2025 11:08:45 -0800

> _row_id a unique long identifier for every row within the table. The
> value is assigned via inheritance when a row is first added to the table.


Actually, the current spec doesn't allow explicitly assigning a row-id to
new rows.

So for now we don't need to worry about whether *new* rows are allowed to
have explicitly assigned row-id values lower than the snapshot's
first-row-id.
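
To make the two rules in this thread concrete, here is a minimal Python
sketch. It is not Iceberg code: a DV is modeled as a plain set of deleted
row positions, the helper names `dv_is_monotonic` and `inherit_row_ids` are
hypothetical, and `first_row_id` stands for the value a data file ends up
with after inheriting from the snapshot's first-row-id. The inheritance rule
shown (`first_row_id + position` when no `_row_id` is persisted) is my
reading of the row lineage spec, so treat it as an assumption.

```python
from typing import List, Optional, Sequence, Set


def dv_is_monotonic(previous_dv: Set[int], new_dv: Set[int]) -> bool:
    """True if every position deleted by the previous DV is still deleted."""
    # Set containment: the replacement DV may only add deleted positions.
    return previous_dv <= new_dv


def inherit_row_ids(
    first_row_id: int, persisted: Sequence[Optional[int]]
) -> List[int]:
    """Resolve _row_id per row: keep a persisted value, else inherit by position."""
    return [
        rid if rid is not None else first_row_id + pos
        for pos, rid in enumerate(persisted)
    ]


# The example from this thread, viewed as a DV: rows 5 and 10 were deleted.
print(dv_is_monotonic({5, 10}, {5, 10, 15}))  # True: only adds a deletion
print(dv_is_monotonic({5, 10}, {10}))         # False: row 5 is "undeleted"

# A newly added file (no persisted row ids) inherits ids from first_row_id.
print(inherit_row_ids(100, [None, None, None]))  # [100, 101, 102]
```

With the no-undelete restriction, the first check always holds, and every
ADDED row gets a row-id at or above the first-row-id it inherits from.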

On Mon, Dec 1, 2025 at 9:50 AM Steven Wu <[email protected]> wrote:

> Here is the spec PR to clarify that undelete is not allowed. I will start
> a vote thread for it.
> https://github.com/apache/iceberg/pull/14731
>
> Let me start a new discussion thread for the first-row-id and row-id
> question for row lineage to get more attention and input.
>
> On Sat, Nov 22, 2025 at 7:02 AM Péter Váry <[email protected]>
> wrote:
>
>> Apologies if I was unclear. As Steven also mentioned, I wanted to confirm
>> whether we agree on the clarification regarding the `row-id` and
>> `first-row-id`.
>>
>> On Sat, Nov 22, 2025 at 15:28, Steven Wu <[email protected]> wrote:
>>
>>> Just to clarify, I was asking a question.
>>>
>>> Is it valid to add a new data file with a row
>>>
>>>    - whose persisted row-id value is lower than the snapshot's
>>>    first-row-id
>>>    - whose last-updated-seq-number is not set and is inherited from the
>>>    snapshot sequence number?
>>>
>>> Thanks,
>>> Steven
>>>
>>> On Fri, Nov 21, 2025 at 11:25 PM Péter Váry <[email protected]>
>>> wrote:
>>>
>>>> +1 for this proposal
>>>>
>>>> Slightly related, but we can move this to a separate thread if it needs
>>>> independent discussion: We should clarify the relationship between `row-id`
>>>> and `first-row-id`. This has come up several times in our discussions about
>>>> the equality delete removal proposal, where we considered generating
>>>> `row-ids` manually instead of relying on the auto-assignment feature.
>>>>
>>>> As discussed with Steven:
>>>>
>>>>> It is valid to add a new data file with a row:
>>>>>
>>>>>    - whose persisted row-id value is lower than the snapshot's
>>>>>    first-row-id
>>>>>    - whose last-updated-seq-number is not set and is inherited from the
>>>>>    snapshot sequence number
>>>>>
>>>>>
>>>> On Sat, Nov 22, 2025 at 5:29, Prashant Singh <[email protected]> wrote:
>>>>
>>>>> +1 for making it explicit that an *undelete* of a row can't be done
>>>>> by unsetting the corresponding bit in the DV.
>>>>>
>>>>> *Rows should only be added via new data files* sounds reasonable to
>>>>> me!
>>>>>
>>>>> Apart from row lineage, it also complicates operation type inference,
>>>>> as in [1], since we would now have to inspect the contents of these
>>>>> DVs to see whether it's an insert.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/iceberg/pull/14581#discussion_r2533057189
>>>>>
>>>>> On Sat, Nov 22, 2025 at 4:48 AM Szehon Ho <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> It makes sense to me; it sounds like a minor clarification. For v2
>>>>>> position deletes, code like rewrite_position_deletes may have made
>>>>>> assumptions like this and would not work well if they were violated,
>>>>>> and maybe other code as well.
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>> On Fri, Nov 21, 2025 at 3:03 PM Steven Wu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Similar weird behavior can also happen for V2 position delete files
>>>>>>> with `undelete`.
>>>>>>>
>>>>>>> In V2, there could be multiple position delete files (say pd1, pd2)
>>>>>>> associated with the same data file (say f1). Let's say pd1 deletes
>>>>>>> rows 5 and 10 and pd2 deletes row 15. Consider two cases:
>>>>>>> 1. a new snapshot is committed with pd1 (DELETED), pd2 (EXISTING),
>>>>>>> and pd3 (ADDED), where pd3 deletes only row 10 (row 5 is undeleted)
>>>>>>> 2. a new snapshot is committed with pd1 (DELETED) and pd2 (EXISTING),
>>>>>>> so rows 5 and 10 are both undeleted
>>>>>>>
>>>>>>> In either case, essentially some rows are added (back) to the table
>>>>>>> with a lower sequence number than the new snapshot's sequence number.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Just to recap the question: should the spec (v2 and v3) spell out
>>>>>>> that `undelete row` is not allowed? Rows should only be added via new
>>>>>>> data files.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Nov 21, 2025 at 1:09 PM Steven Wu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> > Are we specifically stating somewhere that all row-ids should be
>>>>>>>> > higher than or equal to the snapshot's `first-row-id`?
>>>>>>>> > In my mental model the `first-row-id` is only applicable for rows
>>>>>>>> > that don't have a specific row-id assigned.
>>>>>>>>
>>>>>>>> I meant an ADDED row should have a `row-id` higher than or equal to
>>>>>>>> the snapshot's `first-row-id`. An EXISTING or UPDATED row can have a
>>>>>>>> lower row id.
>>>>>>>>
>>>>>>>> On Fri, Nov 21, 2025 at 1:04 PM Steven Wu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> > Can we create a validator to prevent this from happening?
>>>>>>>>>
>>>>>>>>> We don't have this problem with the Java implementation.
>>>>>>>>> `BaseDVFileWriter` merges the previous DV with the new delta DV, so
>>>>>>>>> there is no `undelete` behavior. I am not aware of any Java API that
>>>>>>>>> allows "undelete", so we probably don't need to add any validation
>>>>>>>>> code in the Java impl.
>>>>>>>>>
>>>>>>>>> I just thought it would be good to spell it out in the spec so that
>>>>>>>>> clients/engines can be clear about the expected behavior.
>>>>>>>>>
>>>>>>>>> On Fri, Nov 21, 2025 at 12:18 PM Péter Váry <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Are we specifically stating somewhere that all row-ids should be
>>>>>>>>>> higher than or equal to the snapshot's `first-row-id`?
>>>>>>>>>> In my mental model the `first-row-id` is only applicable for rows
>>>>>>>>>> that don't have a specific row-id assigned.
>>>>>>>>>>
>>>>>>>>>> Nonetheless, I agree that the `row-id` and the
>>>>>>>>>> `last-updated-seq-num` should have changed to new values, so we can
>>>>>>>>>> say that undeleting a row is not allowed because of this.
>>>>>>>>>>
>>>>>>>>>> Can we create a validator to prevent this from happening?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 21, 2025 at 21:11, Steven Wu <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> The undeleted row would have an invalid `row-id` and
>>>>>>>>>>> `last-updated-seq-num`. Since it is a new row (added back), it
>>>>>>>>>>> should have a `row-id` higher than or equal to the snapshot's
>>>>>>>>>>> `first-row-id`, and its `last-updated-seq-number` should
>>>>>>>>>>> inherit/have the new snapshot's sequence number.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 21, 2025 at 11:48 AM Steven Wu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Should we clarify the V3 spec to explicitly forbid "*undelete*"
>>>>>>>>>>>> of a row by unsetting the DV bit? Unsetting a DV bit essentially
>>>>>>>>>>>> adds a row with a lower row-id than the snapshot's first-row-id,
>>>>>>>>>>>> which would violate the row lineage spec. With the restriction,
>>>>>>>>>>>> DV cardinality should be monotonically increasing.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Steven
>>>>>>>>>>>>
>>>>>>>>>>>
