Here is the spec PR to clarify undelete is not allowed. Will start a vote thread for that. https://github.com/apache/iceberg/pull/14731
Let me start a new discussion thread for the `first-row-id` and `row-id` question for row lineage, to get more attention and input.

On Sat, Nov 22, 2025 at 7:02 AM Péter Váry <[email protected]> wrote:

> Apologies if I was unclear. As Steven also mentioned, I wanted to confirm
> whether we agree on the clarification regarding the `row-id` and
> `first-row-id`.
>
> Steven Wu <[email protected]> wrote (on Sat, Nov 22, 2025, 15:28):
>
>> Just to clarify, I was asking a question.
>>
>> Is it valid to add a new data file with a row
>>
>> - whose persisted row-id value is lower than the snapshot's
>>   first-row-id
>> - whose last-updated-seq-number is not set and is inherited from the
>>   snapshot sequence number?
>>
>> Thanks,
>> Steven
>>
>> On Fri, Nov 21, 2025 at 11:25 PM Péter Váry <[email protected]>
>> wrote:
>>
>>> +1 for this proposal
>>>
>>> Slightly related, but we can move this to a separate thread if it needs
>>> independent discussion: We should clarify the relationship between `row-id`
>>> and `first-row-id`. This has come up several times in our discussions about
>>> the equality delete removal proposal, where we considered generating
>>> `row-ids` manually instead of relying on the auto-assignment feature.
>>>
>>> As discussed with Steven:
>>>
>>>> It is valid to add a new data file with a row:
>>>>
>>>> - whose persisted row-id value is lower than the snapshot's
>>>>   first-row-id
>>>> - whose last-updated-seq-number is not set and is inherited from the
>>>>   snapshot sequence number
>>>>
>>> Prashant Singh <[email protected]> wrote (on Sat, Nov 22, 2025, 5:29):
>>>
>>>> +1 for making it explicit that an *undelete* of a row can't be done by
>>>> unsetting the corresponding bit in the DV.
>>>>
>>>> *Rows should only be added via new data files* sounds reasonable to
>>>> me!
>>>>
>>>> Apart from row lineage, it also complicates operation type inference,
>>>> like here [1], as we would now have to inspect the contents of these
>>>> DVs to see if it's an insert?
>>>>
>>>> [1] https://github.com/apache/iceberg/pull/14581#discussion_r2533057189
>>>>
>>>> On Sat, Nov 22, 2025 at 4:48 AM Szehon Ho <[email protected]>
>>>> wrote:
>>>>
>>>>> It makes sense to me; it sounds like a minor clarification. For v2
>>>>> position deletes, code like rewrite_position_deletes may have made
>>>>> assumptions like this and would not work well if they were violated,
>>>>> and maybe other code as well.
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>> On Fri, Nov 21, 2025 at 3:03 PM Steven Wu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Similar weird behavior can also happen for V2 position delete files
>>>>>> with `undelete`.
>>>>>>
>>>>>> In V2, there could be multiple position delete files (say pd1, pd2)
>>>>>> associated with the same data file (say f1). Let's say pd1 deletes rows 5
>>>>>> and 10, and pd2 deletes row 15.
>>>>>>
>>>>>> 1. A new snapshot is committed with pd1 (DELETED), pd2 (EXISTING),
>>>>>>    and pd3 (ADDED). pd3 deletes only row 10 (undeleted row 5).
>>>>>> 2. A new snapshot is committed with pd1 (DELETED) and pd2 (EXISTING).
>>>>>>
>>>>>> In either case, essentially some rows are added (back) to the table
>>>>>> with a lower sequence number than the new snapshot's sequence number.
>>>>>>
>>>>>> Just to recap the question: should the spec (v2 and v3) spell out
>>>>>> that `undelete row` is not allowed? Rows should only be added via new
>>>>>> data files.
>>>>>>
>>>>>> On Fri, Nov 21, 2025 at 1:09 PM Steven Wu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> > Are we specifically stating somewhere that all row-ids should be
>>>>>>> > higher than or equal to the snapshot's `first-row-id`?
>>>>>>> > In my mental model the `first-row-id` is only applicable for rows
>>>>>>> > that don't have a specific row-id assigned.
>>>>>>>
>>>>>>> I meant an ADDED row should have a `row-id` higher than or equal to
>>>>>>> the snapshot's `first-row-id`. An EXISTING or UPDATED row can have a
>>>>>>> lower row-id.
>>>>>>>
>>>>>>> On Fri, Nov 21, 2025 at 1:04 PM Steven Wu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> > Can we create a validator to prevent this from happening?
>>>>>>>>
>>>>>>>> We don't have this problem with the Java implementation.
>>>>>>>> `BaseDVFileWriter` merges the previous DV with the new delta DV, so
>>>>>>>> there is no `undelete` behavior. I am not aware of any Java API that
>>>>>>>> allows "undelete". So we probably don't need to add any validation
>>>>>>>> code in the Java impl.
>>>>>>>>
>>>>>>>> I just thought it would be good to spell it out in the spec so that
>>>>>>>> clients/engines can be clear about the expected behavior.
>>>>>>>>
>>>>>>>> On Fri, Nov 21, 2025 at 12:18 PM Péter Váry <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Are we specifically stating somewhere that all row-ids should be
>>>>>>>>> higher than or equal to the snapshot's `first-row-id`?
>>>>>>>>> In my mental model the `first-row-id` is only applicable for rows
>>>>>>>>> that don't have a specific row-id assigned.
>>>>>>>>>
>>>>>>>>> Nonetheless, I agree that the `row-id` and the
>>>>>>>>> `last-updated-seq-num` should have changed to new values, so we can
>>>>>>>>> say that undeleting a row is not allowed because of this.
>>>>>>>>>
>>>>>>>>> Can we create a validator to prevent this from happening?
>>>>>>>>>
>>>>>>>>> Steven Wu <[email protected]> wrote (on Fri, Nov 21, 2025, 21:11):
>>>>>>>>>
>>>>>>>>>> The undeleted row would have invalid `row-id` and
>>>>>>>>>> `last-updated-seq-num`.
>>>>>>>>>> Since it is a new row (added back), it should have
>>>>>>>>>> a `row-id` higher than or equal to the snapshot's `first-row-id`,
>>>>>>>>>> and the `last-updated-seq-number` should inherit/have the new
>>>>>>>>>> snapshot's sequence number.
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 21, 2025 at 11:48 AM Steven Wu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Should we clarify the V3 spec to explicitly forbid "*undelete*"
>>>>>>>>>>> of a row by unsetting the DV bit? Unsetting a DV bit essentially
>>>>>>>>>>> adds a row with a lower row-id than the snapshot's first-row-id,
>>>>>>>>>>> which would violate the row lineage spec. With the restriction,
>>>>>>>>>>> DV cardinality should be monotonically increasing.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Steven
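
To make the undelete scenarios from the quoted thread concrete, here is a minimal Python sketch. It models each position delete file or DV as a plain set of row positions for a single data file, which is an assumption made for illustration and not how Iceberg stores them. The helper names (`effective_deletes`, `undeleted_positions`, `merge_dv`) are hypothetical and not part of any Iceberg API.

```python
from typing import Iterable, Set


def effective_deletes(delete_files: Iterable[Set[int]]) -> Set[int]:
    """Union the row positions deleted by every live delete file (or DV) of one data file."""
    deleted: Set[int] = set()
    for positions in delete_files:
        deleted |= positions
    return deleted


def undeleted_positions(before: Iterable[Set[int]], after: Iterable[Set[int]]) -> Set[int]:
    """Return positions that were deleted before the commit but are live again after it."""
    return effective_deletes(before) - effective_deletes(after)


def merge_dv(previous: Set[int], delta: Set[int]) -> Set[int]:
    """Merge the previous DV with newly deleted positions, so the DV never shrinks."""
    return previous | delta


# Position delete files from the V2 example in the thread, all for data file f1.
pd1 = {5, 10}
pd2 = {15}
pd3 = {10}

# Case 1: pd1 is removed, pd2 stays, pd3 is added. Row 5 comes back.
case1 = undeleted_positions([pd1, pd2], [pd2, pd3])

# Case 2: pd1 is removed, pd2 stays. Rows 5 and 10 come back.
case2 = undeleted_positions([pd1, pd2], [pd2])

# Merging the previous DV with a delta only ever adds positions.
merged = merge_dv({5, 10}, {15})
no_undelete = undeleted_positions([{5, 10}], [merged])

print(sorted(case1), sorted(case2), sorted(merged), sorted(no_undelete))
```

In this model, a non-empty result from `undeleted_positions` means rows became live again without a new data file, which is the behavior the proposed clarification would forbid. Merging the previous DV with the delta, as the thread describes for `BaseDVFileWriter`, never produces such rows here.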
