Gabor, thanks for starting this discussion.

I have been thinking about this problem independently since the column
update sync. Here is the detailed design document:
<https://docs.google.com/document/d/160-FizR6zOASMb86NycfgCm7cZbh6HK7FLcUW8_Xp-0/edit?usp=sharing>

Gabor, I read your section "How to support _last_updated_sequence_number"
<https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.xvm52pv4m7lq#heading=h.k2neg79ocgu>.
If I understand correctly, the proposal is to repurpose
file_sequence_number to capture the snapshot sequence number of the latest
column file. I suggest we don't change the semantics of the existing
file_sequence_number. Instead, we can introduce a new
latest_column_file_sequence_number field in the tracking struct. My doc
describes the reasoning.

That is the only real difference as far as I can tell. Otherwise, I think
we arrived at the same design.
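
To make the difference concrete, here is a minimal sketch of how a reader
could resolve _last_updated_sequence_number when file_sequence_number keeps
its current meaning and the new field is added. The Tracking class, the
function name, and the fallback order are assumptions made for illustration;
they are not taken from either design doc.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tracking:
    # Existing field; its semantics stay unchanged in this option.
    file_sequence_number: int
    # Proposed new field; None until a column file has been added (assumption).
    latest_column_file_sequence_number: Optional[int] = None


def last_updated_sequence_number(row_value: Optional[int], tracking: Tracking) -> int:
    """Resolve _last_updated_sequence_number for a single row.

    Assumed order: a non-null value written into the latest column file wins,
    then the latest column file's sequence number, then the base file's
    file_sequence_number.
    """
    if row_value is not None:
        return row_value
    if tracking.latest_column_file_sequence_number is not None:
        return tracking.latest_column_file_sequence_number
    return tracking.file_sequence_number
```

With the repurposing approach, the last two branches would collapse into a
single read of file_sequence_number.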

On Wed, May 20, 2026 at 11:56 AM Gábor Kaszab <[email protected]>
wrote:

> Hey Iceberg Community,
>
> Anurag started a separate, focused discussion
> <https://lists.apache.org/thread/jbh1gbrso5h6l4by9rh9poy2cjjtb8j0> on the
> column update file representation. Similarly, let me start another one for
> the metadata representation. Hopefully, we can make some iterations on this
> before the next sync.
>
> We covered this topic in the sync yesterday and agreed on some of the
> fields, but we left the "tracking" information part open. The *required*
> fields we agreed on so far:
>
> ColumnFile
> field_ids list<int>
> location string
> file_size_in_bytes long
>
> *Tracking information*
> In addition to the above, we discussed the need for tracking information.
> These are the potential ones:
>
> *1) Sequence number*
>
>    - Usage for _last_updated_sequence_number
>
> I did think about how to produce _last_updated_sequence_number and I think
> technically we don't need to store the sequence number on the update file
> level for that. I wrote up the steps here
> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?pli=1&tab=t.xvm52pv4m7lq>,
> but in a nutshell: we could fill it from the
> _last_updated_sequence_number written into the latest column file or, if
> that is null, use the base file's file_sequence_number.
>
>    - Usage for equality deletes
>
> As we agreed previously, we don't want to support update files together
> with equality deletes, so we won't need to store column file level sequence
> numbers for this either.
>
>    - Usage for CDC, observability, etc.
>
> I'm wondering if there is any use case where we need to know the order in
> which the column updates were created. If this matters for CDC,
> reproducibility, or anything else, then let's have a column-file-level
> sequence number too. If not, we can omit it.
>
> *2) Status*
> I think, similarly to TrackedFile, we need the following statuses here:
> EXISTING, ADDED, DELETED, REPLACED
> With these, when the base file's status is REPLACED, we can look at the
> column_files to see exactly what has changed in the column updates.
> Some examples to demonstrate:
>
> Step 1: Start with an existing base file:
> base file: {location: "file.parquet", seq_num: 1, file_seq_num: 1, status:
> EXISTING, column_files:[]}
>
> Step 2: Adding a column update for field IDs [1, 2]:
> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *2*,
> status: *REPLACED*,
>                 column_files: [ *{field_ids: [1, 2], location:
> "update1.parquet", status: ADDED}* ]}
>
> Step 3: Adding an overlapping column update with field IDs [2, 3]
> ("de-duplicate" field IDs):
> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*,
> status: REPLACED,
>                 column_files: [ {field_ids: *[1],* location:
> "update1.parquet", status: *REPLACED}, **{field_ids: [2, 3], location:
> "update2.parquet", status: ADDED}* ]}
>
> Step 4: Add another column update for field ID [1] to completely eliminate
> one previous update file from metadata
> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *4*,
> status: REPLACED,
>                 column_files: [ {field_ids: [1]*,* location:
> "update1.parquet", status: *DELETED},*  {field_ids: [2, 3], location:
> "update2.parquet", status: *EXISTING*}, *{field_ids: [1], location:
> "update3.parquet", status: ADDED}* ]}
>
> *Thoughts on REPLACED*
> In step 3, we marked the existing column file as REPLACED while reducing
> the field_ids list to de-duplicate them with the incoming update
> file's field_ids. With this, REPLACED indicates that the field_ids content
> was reduced; however, we won't know exactly which field IDs were removed.
>
>   - Alternative approach 1:
> We could use DELETED status leaving the field ID list intact, and then
> create a new ColumnFile with the reduced list. Step 3 would look like this:
>
> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*,
> status: REPLACED,
>                 column_files: [ {field_ids: [1, 2]*,* location:
> "update1.parquet", status: *DELETED}, **{field_ids: [1], location:
> "update1.parquet", status: ADDED}, **{field_ids: [2, 3], location:
> "update2.parquet", status: ADDED}* ]}
>
>   - Alternative approach 2:
> We can use REPLACED as originally, and also have a field in the tracking
> data to *keep track of the removed field IDs* (similarly to
> Tracking.DELETED_POSITIONS). Step 3 would look like this:
>
> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*,
> status: REPLACED,
>                 column_files: [ {field_ids: *[1],* location:
> "update1.parquet", status: *REPLACED, removed_field_ids: [2]}, **{field_ids:
> [2, 3], location: "update2.parquet", status: ADDED}* ]}
>
>   - Preference:
> I think the REPLACED approach is cleaner, so I'd prefer it. In case we want
> to track which IDs were removed, we could follow "alternative approach 2".
>
>   - Additional note:
> Re-writing the column file as REPLACED shouldn't alter the sequence number
> of the column file (if we decide to have one).
>
> *3) Snapshot ID*
> 'Tracking' has this, I think it could make sense for column files too.
>
> *4) First row ID*
> Row IDs should come from the base file's metadata IMO, we shouldn't store
> this for the update files.
>
> *Summary of all the potential tracking fields:*
>
> ColumnFileTracking
> required status int
> optional snapshot_id long
> optional sequence_number long
> optional removed_field_ids list<int>
>
> *Field IDs*
> The first free field ID within TrackedFile is 157. The last used one is
> DeletionVector.CARDINALITY
> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeletionVector.java#L42>
> with field ID 156.
> I'm working with Amogh to coordinate assigning the required field IDs here.
>
> Let me know if I missed anything here! Any feedback is appreciated!
>
> Best Regards,
> Gabor
>
>
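
The status transitions in steps 1 to 4 of the quoted message, combined with
removed_field_ids from "alternative approach 2", can be sketched as follows.
This is only an illustration: the ColumnFile class and apply_column_update
are hypothetical names, statuses are plain strings rather than the int
proposed in ColumnFileTracking, and dropping previously DELETED entries on
the next commit is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ColumnFile:
    field_ids: Tuple[int, ...]
    location: str
    status: str  # EXISTING, ADDED, DELETED or REPLACED
    removed_field_ids: Tuple[int, ...] = ()  # only set on REPLACED entries


def apply_column_update(
    column_files: Iterable[ColumnFile],
    new_field_ids: Iterable[int],
    new_location: str,
) -> List[ColumnFile]:
    """Return the column_files list after committing one new column file."""
    new_ids = tuple(new_field_ids)
    result = []
    for cf in column_files:
        if cf.status == "DELETED":
            continue  # assumption: entries deleted by an earlier commit are dropped
        overlap = tuple(f for f in cf.field_ids if f in new_ids)
        remaining = tuple(f for f in cf.field_ids if f not in new_ids)
        if not overlap:
            # Untouched by this update.
            result.append(ColumnFile(cf.field_ids, cf.location, "EXISTING"))
        elif remaining:
            # Partially shadowed: de-duplicate field IDs and record what was removed.
            result.append(ColumnFile(remaining, cf.location, "REPLACED", overlap))
        else:
            # Fully shadowed: keep the field ID list intact and mark as DELETED.
            result.append(ColumnFile(cf.field_ids, cf.location, "DELETED"))
    result.append(ColumnFile(new_ids, new_location, "ADDED"))
    return result


step2 = apply_column_update([], [1, 2], "update1.parquet")
step3 = apply_column_update(step2, [2, 3], "update2.parquet")
step4 = apply_column_update(step3, [1], "update3.parquet")
for cf in step4:
    print(cf.field_ids, cf.location, cf.status)
```

The sketch leaves sequence numbers out on purpose, since whether the base
file's file_sequence_number should change in these steps is the open
question in this thread.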
