Re: [PR] fix: build_fallback_field_id_map produces incorrect column indices for schemas with nested types [iceberg-rust]

via GitHub Thu, 09 Apr 2026 19:07:18 -0700


c-thiel commented on code in PR #2307:
URL: https://github.com/apache/iceberg-rust/pull/2307#discussion_r3061632596



##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -1111,14 +1111,33 @@ fn build_field_id_map(parquet_schema: 
&SchemaDescriptor) -> Result<Option<HashMa
 }
 
 /// Build a fallback field ID map for Parquet files without embedded field IDs.
-/// Position-based (1, 2, 3, ...) for compatibility with iceberg-java 
migrations.
+///
+/// Must use top-level field positions (not leaf column positions) to stay 
consistent
+/// with `add_fallback_field_ids_to_arrow_schema`, which assigns ordinal IDs to
+/// top-level Arrow fields. Using leaf positions instead would produce wrong 
indices
+/// when nested types (struct/list/map) expand into multiple leaf columns.
+///
+/// Mirrors iceberg-java's ParquetSchemaUtil.addFallbackIds() which iterates
+/// fileSchema.getFields() assigning ordinal IDs to top-level fields.
 fn build_fallback_field_id_map(parquet_schema: &SchemaDescriptor) -> 
HashMap<i32, usize> {
     let mut column_map = HashMap::new();
+    let mut leaf_idx = 0;
 
-    // 1-indexed to match iceberg-java's convention
-    for (idx, _field) in parquet_schema.columns().iter().enumerate() {
-        let field_id = (idx + 1) as i32;
-        column_map.insert(field_id, idx);
+    for (top_pos, field) in 
parquet_schema.root_schema().get_fields().iter().enumerate() {
+        let field_id = (top_pos + 1) as i32;
+
+        if field.is_primitive() {
+            column_map.insert(field_id, leaf_idx);
+            leaf_idx += 1;
+        } else {
+            // Advance past all leaves in this group. Count them by checking
+            // how many subsequent leaves share this root index.
+            while leaf_idx < parquet_schema.num_columns()
+                && parquet_schema.get_column_root_idx(leaf_idx) == top_pos
+            {
+                leaf_idx += 1;
+            }

Review Comment:
   Just a nit, but should we extract a recursive `leaf_count(&ParquetType) -> 
usize` and writing the loop as:
   ```rust
   if field.is_primitive() {
       column_map.insert(field_id, leaf_idx);
   }
   leaf_idx += leaf_count(field);
   ```
   Decouples the map-building from parquet_schema's flat leaf layout. Then we 
would not need to rely on num_columns(), get_column_root_idx(), and the 
DFS-ordering invariant being kept in lockstep with a cursor.



##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -1111,14 +1111,33 @@ fn build_field_id_map(parquet_schema: 
&SchemaDescriptor) -> Result<Option<HashMa
 }
 
 /// Build a fallback field ID map for Parquet files without embedded field IDs.
-/// Position-based (1, 2, 3, ...) for compatibility with iceberg-java 
migrations.
+///
+/// Must use top-level field positions (not leaf column positions) to stay 
consistent
+/// with `add_fallback_field_ids_to_arrow_schema`, which assigns ordinal IDs to
+/// top-level Arrow fields. Using leaf positions instead would produce wrong 
indices
+/// when nested types (struct/list/map) expand into multiple leaf columns.
+///
+/// Mirrors iceberg-java's ParquetSchemaUtil.addFallbackIds() which iterates
+/// fileSchema.getFields() assigning ordinal IDs to top-level fields.
 fn build_fallback_field_id_map(parquet_schema: &SchemaDescriptor) -> 
HashMap<i32, usize> {

Review Comment:
   I think we should add a doc here making the contract explicit: "entries only 
for primitive top-level fields."
   Checked the consumers (PredicateConverter, 
RowGroupMetricsEvaluator::stats_for_field_id, 
get_row_selection_for_filter_predicate) — all resolve field ids to primitive 
leaves, so primitive-only seems exactly right. As discussed before this 
diverges slightly from Java's ParquetSchemaUtil.addFallbackIds, which attaches 
an ordinal id to every top-level field (not just primitives) — but Java's 
downstream lookups only resolve primitives anyway, so observable behavior 
should match.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: build_fallback_field_id_map produces incorrect column indices for schemas with nested types [iceberg-rust]

Reply via email to