Apologies if I was unclear. As Steven also mentioned, I wanted to confirm whether we agree on the clarification regarding the `row-id` and `first-row-id`.

Steven Wu <[email protected]> wrote (on Sat, Nov 22, 2025, 15:28):

> Just to clarify, I was asking a question.
>
> Is it valid to add a new data file with a row
>
> - whose persisted row-id value is lower than the snapshot's
>   first-row-id
> - whose last-updated-seq-number is not set and is inherited from the
>   snapshot sequence number?
>
> Thanks,
> Steven
>
> On Fri, Nov 21, 2025 at 11:25 PM Péter Váry <[email protected]>
> wrote:
>
>> +1 for this proposal
>>
>> Slightly related, but we can move this to a separate thread if it needs
>> independent discussion: We should clarify the relationship between
>> `row-id` and `first-row-id`. This has come up several times in our
>> discussions about the equality delete removal proposal, where we
>> considered generating `row-ids` manually instead of relying on the
>> auto-assignment feature.
>>
>> As discussed with Steven:
>>
>>> It is valid to add a new data file with a row:
>>>
>>> - whose persisted row-id value is lower than the snapshot's
>>>   first-row-id
>>> - whose last-updated-seq-number is not set and is inherited from the
>>>   snapshot sequence number
>>>
>> Prashant Singh <[email protected]> wrote (on Sat, Nov 22, 2025,
>> 5:29):
>>
>>> +1 for making it explicit that an *undelete* of a row can't be done by
>>> unsetting the corresponding bit in the DV.
>>>
>>> *Rows should only be added via new data files* sounds reasonable to
>>> me!
>>>
>>> Apart from row lineage, it also complicates the operation type
>>> inference, like here [1], as we would now have to inspect the contents
>>> of these DVs to see if it's an insert.
>>>
>>> [1] https://github.com/apache/iceberg/pull/14581#discussion_r2533057189
>>>
>>> On Sat, Nov 22, 2025 at 4:48 AM Szehon Ho <[email protected]>
>>> wrote:
>>>
>>>> It makes sense to me, it sounds like a minor clarification.
>>>> For v2 position deletes, code like rewrite_position_deletes may have
>>>> made some assumptions like this and would not work well if violated,
>>>> maybe other code as well.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Fri, Nov 21, 2025 at 3:03 PM Steven Wu <[email protected]> wrote:
>>>>
>>>>> Similar weird behavior can also happen for V2 position delete files
>>>>> with `undelete`.
>>>>>
>>>>> In V2, there could be multiple position delete files (say pd1, pd2)
>>>>> associated with the same data file (say f1). Let's say pd1 deletes
>>>>> row 5 and 10 and pd2 deletes row 15.
>>>>> 1. a new snapshot is committed with pd1 (DELETED), pd2 (EXISTING),
>>>>>    and pd3 (ADDED). pd3 deletes only row 10 (undeleted row 5)
>>>>> 2. a new snapshot is committed with pd1 (DELETED) and pd2 (EXISTING)
>>>>>
>>>>> In either case, essentially some rows are added (back) to the table
>>>>> with lower sequence number than the new snapshot's sequence number.
>>>>>
>>>>> Just to recap the question: should the spec (v2 and v3) spell out
>>>>> that `undelete row` is not allowed? Rows should only be added via
>>>>> new data files.
>>>>>
>>>>> On Fri, Nov 21, 2025 at 1:09 PM Steven Wu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> > Are we specifically stating somewhere that all row-ids should be
>>>>>> > higher than or equal to the snapshot's `first-row-id`?
>>>>>> > In my mental model the `first-row-id` is only applicable for rows
>>>>>> > that don't have a specific row-id assigned.
>>>>>>
>>>>>> I meant an ADDED row should have `row-id` higher than or equal to
>>>>>> the snapshot's `first-row-id`. EXISTING or UPDATED row can have
>>>>>> lower row id.
>>>>>>
>>>>>> On Fri, Nov 21, 2025 at 1:04 PM Steven Wu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> > Can we create a validator to prevent this from happening?
>>>>>>>
>>>>>>> We don't have this problem with the Java implementation.
>>>>>>> `BaseDVFileWriter` merges the previous DV with the new delta DV.
>>>>>>> So there is no `undelete` behavior. I am not aware of any Java API
>>>>>>> to allow "undelete". So we probably don't need to add any
>>>>>>> validation code in the Java impl.
>>>>>>>
>>>>>>> Just thought it is good to spell it out in the spec so that
>>>>>>> clients/engines can be clear about the expected behavior.
>>>>>>>
>>>>>>> On Fri, Nov 21, 2025 at 12:18 PM Péter Váry
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Are we specifically stating somewhere that all row-ids should be
>>>>>>>> higher than or equal to the snapshot's `first-row-id`?
>>>>>>>> In my mental model the `first-row-id` is only applicable for rows
>>>>>>>> that don't have a specific row-id assigned.
>>>>>>>>
>>>>>>>> Nonetheless, I agree that the `row-id` and the
>>>>>>>> `last-updated-seq-num` should have changed to new values, so we
>>>>>>>> can say that undeleting a row is not allowed because of this.
>>>>>>>>
>>>>>>>> Can we create a validator to prevent this from happening?
>>>>>>>>
>>>>>>>> Steven Wu <[email protected]> wrote (on Fri, Nov 21, 2025,
>>>>>>>> 21:11):
>>>>>>>>
>>>>>>>>> The undeleted row would have invalid `row-id` and
>>>>>>>>> `last-updated-seq-num`. Since it is a new row (added back), it
>>>>>>>>> should have the `row-id` higher than or equal to the snapshot's
>>>>>>>>> `first-row-id` and the `last-updated-seq-number` should
>>>>>>>>> inherit/have the new snapshot's sequence number.
>>>>>>>>>
>>>>>>>>> On Fri, Nov 21, 2025 at 11:48 AM Steven Wu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Should we clarify the V3 spec to explicitly forbid "*undelete*"
>>>>>>>>>> of a row by unsetting the DV bit? Unsetting a DV bit
>>>>>>>>>> essentially adds a row with lower row-id than the snapshot's
>>>>>>>>>> first-row-id, which would violate the row lineage spec.
>>>>>>>>>> With the restriction, DV cardinality should be monotonically
>>>>>>>>>> increasing.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Steven
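
For illustration only, here is a minimal sketch of the kind of check the validator question in the thread points at. It models the effective set of deleted positions for one data file as a plain Python set and flags any position that was deleted before a commit but is live again after it. The function names are hypothetical and are not part of any Iceberg API; the sample data replays the V2 example from the thread (pd1 deletes rows 5 and 10, pd2 deletes row 15).

```python
def undeleted_positions(before: set[int], after: set[int]) -> set[int]:
    """Return positions that were deleted before the commit but are live after it."""
    return before - after


def validate_no_undelete(before: set[int], after: set[int]) -> None:
    """Raise ValueError if the new delete set drops any previously deleted position.

    Hypothetical helper: a real implementation would read the delete
    positions from DVs or position delete files instead of Python sets.
    """
    restored = undeleted_positions(before, after)
    if restored:
        raise ValueError(
            f"undelete is not allowed; restored positions: {sorted(restored)}"
        )


# Deleted positions for data file f1 before the commit: pd1 = {5, 10}, pd2 = {15}.
before = {5, 10} | {15}

# Case 1: pd1 is DELETED, pd2 is EXISTING, pd3 (ADDED) deletes only row 10.
after_case_1 = {15} | {10}

# Case 2: pd1 is DELETED, pd2 is EXISTING.
after_case_2 = {15}

print(undeleted_positions(before, after_case_1))
print(undeleted_positions(before, after_case_2))
```

Under the proposed restriction, a commit passes this check only when the new delete set contains every previously deleted position, which is the same as saying the set of deleted positions for a data file never shrinks.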
