Hi folks,
I’ve noticed some interesting differences in how Iceberg clients assign
fresh field IDs during schema conversion.
Specifically:
1. *Iceberg Java* assigns field IDs in *ordinal order for the root
struct*, followed by a *post-order traversal* for nested structs. For
example:
struct<
  0: id: required long,
  1: info: optional struct<
    4: name: optional string,
    5: attrs: optional struct<
      2: age: optional int,
      3: score: optional double
    >
  >
>
Here, nested fields follow a post-order traversal: the fields of the
innermost struct are numbered first (age → score), then the fields of the
enclosing struct (name → attrs).
2. *Iceberg Python* appears to use a *pre-order traversal* when assigning
fresh field IDs:
https://github.com/apache/iceberg-python/blob/950fc7131b8e597f73647c6ff2bd78d0b24102ad/pyiceberg/schema.py#L1295
3. *Iceberg Rust* does not currently have a helper for schema conversion
with field-ID assignment, but some existing logic appears to follow a
*level-order traversal*:
https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/spec/schema/id_reassigner.rs#L27
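To make the three orders concrete, here is a minimal Python sketch of each one. It is not the actual code from any of the clients: the tuple-based schema representation, the function names, and the `SIBLINGS` schema are all made up for illustration, and IDs start at 0 only to match the example above.

```python
from collections import deque
from itertools import count

# A field is (name, children); children is None for a primitive, else a list of fields.
# EXAMPLE mirrors the struct shown above; SIBLINGS is a hypothetical schema with
# two sibling structs, used to tell pre-order and level-order apart.
EXAMPLE = [("id", None),
           ("info", [("name", None),
                     ("attrs", [("age", None), ("score", None)])])]
SIBLINGS = [("a", [("x", None)]), ("b", [("y", None)])]


def root_then_post_order(fields):
    """Root fields by ordinal, then nested structs with the deepest fields first."""
    ids, counter = {}, count()

    def assign_nested(children, prefix):
        # Descend first, so inner structs are numbered before this struct's own fields.
        for name, sub in children:
            if sub is not None:
                assign_nested(sub, f"{prefix}{name}.")
        for name, _ in children:
            ids[prefix + name] = next(counter)

    for name, _ in fields:
        ids[name] = next(counter)
    for name, sub in fields:
        if sub is not None:
            assign_nested(sub, f"{name}.")
    return ids


def pre_order(fields):
    """Each field gets its ID when first visited, then its children right away."""
    ids, counter = {}, count()

    def visit(children, prefix):
        for name, sub in children:
            ids[prefix + name] = next(counter)
            if sub is not None:
                visit(sub, f"{prefix}{name}.")

    visit(fields, "")
    return ids


def level_order(fields):
    """All fields at one depth are numbered before any field at the next depth."""
    ids, counter = {}, count()
    queue = deque([(fields, "")])
    while queue:
        children, prefix = queue.popleft()
        for name, sub in children:
            ids[prefix + name] = next(counter)
            if sub is not None:
                queue.append((sub, f"{prefix}{name}."))
    return ids


for assign in (root_then_post_order, pre_order, level_order):
    print(assign.__name__, assign(EXAMPLE), assign(SIBLINGS))
```

On `EXAMPLE`, `root_then_post_order` reproduces the IDs in the struct above, while `pre_order` and `level_order` happen to agree with each other (name=2, attrs=3, age=4, score=5). They only diverge once there are sibling structs, as in `SIBLINGS`. In every case the set of assigned IDs and the highest ID are the same; only the field-to-ID mapping differs.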
This leads to two questions:
1. *Does the assignment order of fresh field IDs actually matter?*
My intuition is that it should not, as long as the field-ID → field
mapping is consistent and the highest field ID is tracked correctly, but I
would love to be corrected.
2. *If the order does matter, is there a recommended or canonical traversal
order that clients should follow?*
Any guidance or historical context would be appreciated. Thanks!
Best,
Shawn