Gabor, thanks for starting this discussion. I have been thinking about this problem independently since the column update sync. Here is the detailed design document <https://docs.google.com/document/d/160-FizR6zOASMb86NycfgCm7cZbh6HK7FLcUW8_Xp-0/edit?usp=sharing> .
Gabor, I read your section of How to support _last_updated_sequence_number <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.xvm52pv4m7lq#heading=h.k2neg79ocgu>. If I understand correctly, the proposal is to repurpose file_sequence_number to capture the snapshot sequence number of the latest column file. I suggest we don't change the semantics of the existing file_sequence_number. Instead, we can introduce a new latest_column_file_sequence_number field in the tracking struct. My doc described the reasoning. That is the only real difference as far as I can tell. Otherwise, I think we had the same idea/design. On Wed, May 20, 2026 at 11:56 AM Gábor Kaszab <[email protected]> wrote: > Hey Iceberg Community, > > Anurag started a separate, focused discussion > <https://lists.apache.org/thread/jbh1gbrso5h6l4by9rh9poy2cjjtb8j0> on the > column update file representation, similarly, let me start another one for > the metadata representation. Hopefully, we can make some iterations on this > before the next sync. > > We covered this topic in the sync yesterday and agreed on some of the > fields, but we left the "tracking" information part open. The *required* > fields we agreed on so far: > > ColumnFile > field_ids list<int> > location string > file_size_in_bytes long > > *Tracking information* > Additionally to the above, we discussed the need of tracking information. > These are the potential ones: > > *1) Sequence number* > > - Usage for _last_updated_sequence_number > > I did think about how to produce _last_updated_sequence_number and I think > technically we don't need to store the sequence number on the update file > level for that. I wrote up the steps here > <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?pli=1&tab=t.xvm52pv4m7lq>, > but in a nutshell: we could either fille that from the > _last_updated_sequence_number written into the latest column file, or if > null we can use the base file's file_sequence_number. > > - Usage for equality deletes > > As we agreed previously, we don't want to support update files together > with equality deletes, so we won't need to store column file level sequence > numbers for this either. > > - Usage for CDC, observability, etc. > > I'm wondering if there is any use case where we want to see the order of > the column updates to see the sequence they were created. If this matters > for CDC or reproducibility or anything else, then let's have a column file > level sequence number too, if not, we can omit this. > > *2) Status* > I think, similarly to TrackedFile, we need the following statuses here: > EXISTING, ADDED, DELETED, REPLACED > With these, when the base file's status is REPLACED, taking a look at the > column_files we can know exactly what has changed wrt the column updates. > Some examples to demonstrate: > > Step 1: Start with an existing base file: > base file: {location: "file.parquet", seq_num: 1, file_seq_num: 1, status: > EXISTING, column_files:[]} > > Step 2: Adding a column update for field IDs [1, 2]: > base file: {location: "file.parquet", seq_num: 1, file_seq_num: *2*, > status: *REPLACED*, > column_files: [ *{field_ids: [1, 2], location: > "update1.parquet", status: ADDED}* ]} > > Step 3: Adding an overlapping column update with field IDs [2, 3] > ("de-duplicate" field IDs): > base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*, > status: REPLACED, > column_files: [ {field_ids: *[1],* location: > "update1.parquet", status: *REPLACED}, **{field_ids: [2, 3], location: > "update2.parquet", status: ADDED}* ]} > > Step 4: Add another column update for field ID [1] to completely eliminate > one previous update file from metadata > base file: {location: "file.parquet", seq_num: 1, file_seq_num: *4*, > status: REPLACED, > column_files: [ {field_ids: [1]*,* location: > "update1.parquet", status: *DELETED},* {field_ids: [2, 3], location: > "update2.parquet", status: *EXISTING*}, *{field_ids: [1], location: > "update3.parquet", status: ADDED}* ]} > > *Thoughts on REPLACED* > In step 3, we marked the existing column file as REPLACED while reducing > the field_ids list to de-duplicate them with the incoming update > file's field_ids. With this, REPLACED indicates that field_ids content was > reduced, however, we won't know exactly what field IDs were removed. > > - Alternative approach 1: > We could use DELETED status leaving the field ID list intact, and then > create a new ColumnFile with the reduced list. Step 3 would look like this: > > base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*, > status: REPLACED, > column_files: [ {field_ids: [1, 2]*,* location: > "update1.parquet", status: *DELETED}, **{field_ids: [1], location: > "update1.parquet", status: ADDED}, **{field_ids: [2, 3], location: > "update2.parquet", status: ADDED}* ]} > > - Alternative approach 2: > We can use REPLACED as originally, and also have a field in the tracking > data to *keep track of the removed field IDs* (similarly to > Tracking.DELETED_POSITIONS). Step 3 would look like this: > > base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*, > status: REPLACED, > column_files: [ {field_ids: *[1],* location: > "update1.parquet", status: *REPLACED, removed_field_ids: [2]}, **{field_ids: > [2, 3], location: "update2.parquet", status: ADDED}* ]} > > - Preference: > I think the REPLACED approach is cleaner, I'd prefer that. In case we want > to track what IDs were removed, we could follow "alternative approach 2". > > - Additional, note: > Re-writing the column file as REPLACED shouldn't alter the sequence number > of the column file (if we decide to have one). > > *3) Snapshot ID* > 'Tracking' has this, I think it could make sense for column files too. > > *4) First row ID* > Row IDs should come from the base file's metadata IMO, we shouldn't store > this for the update files. > > *Summary of all the potential tracking fields:* > > ColumnFileTracking > required status int > optional snapshot_id long > optional sequence_number long > optional removed_field_ids list<int> > > *Field IDs* > The first free field ID within TrackedFile is 157. The last used one is > DeletionVector.CARDINALITY > <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeletionVector.java#L42> > with field ID 156. > I'm working with Amogh to coordinate assigning the required field IDs here. > > Let me know if I miss anything here! Any feedback is appreciated! > > Best Regards, > Gabor > >
