> `_row_id`: a unique long identifier for every row within the table. The value is assigned via inheritance when a row is first added to the table.
Actually, the current spec doesn't allow explicitly assigning a row-id to
new rows. So for now we don't need to worry about the question of whether
*new* rows may have explicitly assigned row-id values lower than the
snapshot's first-row-id.

On Mon, Dec 1, 2025 at 9:50 AM Steven Wu <[email protected]> wrote:

> Here is the spec PR to clarify that undelete is not allowed. I will start
> a vote thread for that.
> https://github.com/apache/iceberg/pull/14731
>
> Let me start a new discussion thread for the first-row-id and row-id
> question for row lineage to get more attention and input.
>
> On Sat, Nov 22, 2025 at 7:02 AM Péter Váry <[email protected]> wrote:
>
>> Apologies if I was unclear. As Steven also mentioned, I wanted to confirm
>> whether we agree on the clarification regarding the `row-id` and
>> `first-row-id`.
>>
>> On Sat, Nov 22, 2025 at 15:28, Steven Wu <[email protected]> wrote:
>>
>>> Just to clarify, I was asking a question.
>>>
>>> Is it valid to add a new data file with a row
>>>
>>> - whose persisted row-id value is lower than the snapshot's
>>>   first-row-id
>>> - whose last-updated-seq-number is not set and is inherited from the
>>>   snapshot sequence number?
>>>
>>> Thanks,
>>> Steven
>>>
>>> On Fri, Nov 21, 2025 at 11:25 PM Péter Váry <[email protected]> wrote:
>>>
>>>> +1 for this proposal
>>>>
>>>> Slightly related, but we can move this to a separate thread if it needs
>>>> independent discussion: we should clarify the relationship between
>>>> `row-id` and `first-row-id`. This has come up several times in our
>>>> discussions about the equality delete removal proposal, where we
>>>> considered generating `row-ids` manually instead of relying on the
>>>> auto-assignment feature.
>>>>
>>>> As discussed with Steven:
>>>>
>>>>> It is valid to add a new data file with a row:
>>>>>
>>>>> - whose persisted row-id value is lower than the snapshot's
>>>>>   first-row-id
>>>>> - whose last-updated-seq-number is not set and is inherited from the
>>>>>   snapshot sequence number
>>>>>
>>>>
>>>> On Sat, Nov 22, 2025 at 5:29, Prashant Singh <[email protected]> wrote:
>>>>
>>>>> +1 for making it explicit that an *undelete* of a row can't be done
>>>>> by unsetting the corresponding bit in the DV.
>>>>>
>>>>> *Rows should only be added via new data files* sounds reasonable to
>>>>> me!
>>>>>
>>>>> Apart from row lineage, undelete also complicates operation type
>>>>> inference, like here [1], as we would now have to inspect the contents
>>>>> of these DVs to see if it's an insert.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/iceberg/pull/14581#discussion_r2533057189
>>>>>
>>>>> On Sat, Nov 22, 2025 at 4:48 AM Szehon Ho <[email protected]> wrote:
>>>>>
>>>>>> It makes sense to me; it sounds like a minor clarification. For v2
>>>>>> position deletes, code like rewrite_position_deletes may have made
>>>>>> assumptions like this and would not work well if they were violated,
>>>>>> and maybe other code as well.
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>> On Fri, Nov 21, 2025 at 3:03 PM Steven Wu <[email protected]> wrote:
>>>>>>
>>>>>>> Similar weird behavior can also happen for V2 position delete files
>>>>>>> with `undelete`.
>>>>>>>
>>>>>>> In V2, there could be multiple position delete files (say pd1, pd2)
>>>>>>> associated with the same data file (say f1). Let's say pd1 deletes
>>>>>>> rows 5 and 10 and pd2 deletes row 15.
>>>>>>> 1. a new snapshot is committed with pd1 (DELETED), pd2 (EXISTING),
>>>>>>>    and pd3 (ADDED). pd3 deletes only row 10 (row 5 is undeleted)
>>>>>>> 2. a new snapshot is committed with pd1 (DELETED) and pd2 (EXISTING)
>>>>>>>
>>>>>>> In either case, essentially some rows are added (back) to the table
>>>>>>> with a lower sequence number than the new snapshot's sequence number.
>>>>>>>
>>>>>>> Just to recap the question: should the spec (v2 and v3) spell out
>>>>>>> that `undelete row` is not allowed? Rows should only be added via new
>>>>>>> data files.
>>>>>>>
>>>>>>> On Fri, Nov 21, 2025 at 1:09 PM Steven Wu <[email protected]> wrote:
>>>>>>>
>>>>>>>> > Are we specifically stating somewhere that all row-ids should be
>>>>>>>> > higher than or equal to the snapshot's `first-row-id`?
>>>>>>>> > In my mental model the `first-row-id` is only applicable for rows
>>>>>>>> > that don't have a specific row-id assigned.
>>>>>>>>
>>>>>>>> I meant that an ADDED row should have a `row-id` higher than or
>>>>>>>> equal to the snapshot's `first-row-id`. An EXISTING or UPDATED row
>>>>>>>> can have a lower row-id.
>>>>>>>>
>>>>>>>> On Fri, Nov 21, 2025 at 1:04 PM Steven Wu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> > Can we create a validator to prevent this from happening?
>>>>>>>>>
>>>>>>>>> We don't have this problem with the Java implementation.
>>>>>>>>> `BaseDVFileWriter` merges the previous DV with the new delta DV, so
>>>>>>>>> there is no `undelete` behavior. I am not aware of any Java API that
>>>>>>>>> allows "undelete". So we probably don't need to add any validation
>>>>>>>>> code in the Java impl.
>>>>>>>>>
>>>>>>>>> I just thought it would be good to spell it out in the spec so that
>>>>>>>>> clients/engines can be clear about the expected behavior.
>>>>>>>>>
>>>>>>>>> On Fri, Nov 21, 2025 at 12:18 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Are we specifically stating somewhere that all row-ids should be
>>>>>>>>>> higher than or equal to the snapshot's `first-row-id`?
>>>>>>>>>> In my mental model the `first-row-id` is only applicable for rows
>>>>>>>>>> that don't have a specific row-id assigned.
>>>>>>>>>>
>>>>>>>>>> Nonetheless, I agree that the `row-id` and the
>>>>>>>>>> `last-updated-seq-num` should have changed to new values, so we can
>>>>>>>>>> say that undeleting a row is not allowed because of this.
>>>>>>>>>>
>>>>>>>>>> Can we create a validator to prevent this from happening?
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 21, 2025 at 21:11, Steven Wu <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> The undeleted row would have an invalid `row-id` and
>>>>>>>>>>> `last-updated-seq-num`. Since it is a new row (added back), it
>>>>>>>>>>> should have a `row-id` higher than or equal to the snapshot's
>>>>>>>>>>> `first-row-id`, and the `last-updated-seq-number` should
>>>>>>>>>>> inherit/have the new snapshot's sequence number.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 21, 2025 at 11:48 AM Steven Wu <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Should we clarify the V3 spec to explicitly forbid "*undelete*"
>>>>>>>>>>>> of a row by unsetting the DV bit? Unsetting a DV bit essentially
>>>>>>>>>>>> adds a row with a lower row-id than the snapshot's first-row-id,
>>>>>>>>>>>> which would violate the row lineage spec. With the restriction,
>>>>>>>>>>>> DV cardinality should be monotonically increasing.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Steven
>>>>>>>>>>>>
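For readers following the pd1/pd2/pd3 example in this thread, here is a minimal sketch of what an "undelete" check could look like. It is only an illustration, not code from the Java implementation or any other Iceberg client: the function names are hypothetical, and deleted positions are modeled as plain Python sets rather than real DV bitmaps or position delete files. The idea is the one stated in the thread: for a given data file, the set of deleted positions after a commit must contain every position deleted before it.

```python
from typing import Iterable, List, Set


def merged_deletes(delete_sets: Iterable[Set[int]]) -> Set[int]:
    """Union of the deleted row positions from all delete files (or DVs)
    that apply to one data file. Hypothetical helper for illustration."""
    merged: Set[int] = set()
    for positions in delete_sets:
        merged |= positions
    return merged


def undeleted_positions(
    before: Iterable[Set[int]], after: Iterable[Set[int]]
) -> List[int]:
    """Positions that were deleted before the commit but are live after it.

    An empty result means the deleted set only grew, which is the
    monotonic behavior the thread proposes to require.
    """
    return sorted(merged_deletes(before) - merged_deletes(after))


# The example from the thread: pd1 deletes rows 5 and 10, pd2 deletes row 15.
pd1, pd2 = {5, 10}, {15}
pd3 = {10}  # replacement file that no longer covers row 5

# Scenario 1: pd1 is removed and pd3 is added, pd2 stays.
scenario_1 = undeleted_positions([pd1, pd2], [pd2, pd3])

# Scenario 2: pd1 is removed and nothing replaces it.
scenario_2 = undeleted_positions([pd1, pd2], [pd2])

# A commit that only adds deletes (row 20) undeletes nothing.
valid_commit = undeleted_positions([pd1, pd2], [pd1, pd2, {20}])

print(scenario_1, scenario_2, valid_commit)
```

In scenario 1 the check reports row 5, in scenario 2 it reports rows 5 and 10, and the last commit passes with an empty result. A writer that always merges the previous deletes into the new ones, as described for `BaseDVFileWriter` above, would never trip such a check.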
