c-thiel commented on code in PR #2307:
URL: https://github.com/apache/iceberg-rust/pull/2307#discussion_r3061632596
##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -1111,14 +1111,33 @@ fn build_field_id_map(parquet_schema:
&SchemaDescriptor) -> Result<Option<HashMa
}
/// Build a fallback field ID map for Parquet files without embedded field IDs.
-/// Position-based (1, 2, 3, ...) for compatibility with iceberg-java
migrations.
+///
+/// Must use top-level field positions (not leaf column positions) to stay
consistent
+/// with `add_fallback_field_ids_to_arrow_schema`, which assigns ordinal IDs to
+/// top-level Arrow fields. Using leaf positions instead would produce wrong
indices
+/// when nested types (struct/list/map) expand into multiple leaf columns.
+///
+/// Mirrors iceberg-java's ParquetSchemaUtil.addFallbackIds() which iterates
+/// fileSchema.getFields() assigning ordinal IDs to top-level fields.
fn build_fallback_field_id_map(parquet_schema: &SchemaDescriptor) ->
HashMap<i32, usize> {
let mut column_map = HashMap::new();
+ let mut leaf_idx = 0;
- // 1-indexed to match iceberg-java's convention
- for (idx, _field) in parquet_schema.columns().iter().enumerate() {
- let field_id = (idx + 1) as i32;
- column_map.insert(field_id, idx);
+ for (top_pos, field) in
parquet_schema.root_schema().get_fields().iter().enumerate() {
+ let field_id = (top_pos + 1) as i32;
+
+ if field.is_primitive() {
+ column_map.insert(field_id, leaf_idx);
+ leaf_idx += 1;
+ } else {
+ // Advance past all leaves in this group. Count them by checking
+ // how many subsequent leaves share this root index.
+ while leaf_idx < parquet_schema.num_columns()
+ && parquet_schema.get_column_root_idx(leaf_idx) == top_pos
+ {
+ leaf_idx += 1;
+ }
Review Comment:
Just a nit, but should we extract a recursive `leaf_count(&ParquetType) ->
usize` and writing the loop as:
```rust
if field.is_primitive() {
column_map.insert(field_id, leaf_idx);
}
leaf_idx += leaf_count(field);
```
Decouples the map-building from parquet_schema's flat leaf layout. Then we
would not need to rely on num_columns(), get_column_root_idx(), and the
DFS-ordering invariant being kept in lockstep with a cursor.
##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -1111,14 +1111,33 @@ fn build_field_id_map(parquet_schema:
&SchemaDescriptor) -> Result<Option<HashMa
}
/// Build a fallback field ID map for Parquet files without embedded field IDs.
-/// Position-based (1, 2, 3, ...) for compatibility with iceberg-java
migrations.
+///
+/// Must use top-level field positions (not leaf column positions) to stay
consistent
+/// with `add_fallback_field_ids_to_arrow_schema`, which assigns ordinal IDs to
+/// top-level Arrow fields. Using leaf positions instead would produce wrong
indices
+/// when nested types (struct/list/map) expand into multiple leaf columns.
+///
+/// Mirrors iceberg-java's ParquetSchemaUtil.addFallbackIds() which iterates
+/// fileSchema.getFields() assigning ordinal IDs to top-level fields.
fn build_fallback_field_id_map(parquet_schema: &SchemaDescriptor) ->
HashMap<i32, usize> {
Review Comment:
I think we should add a doc here making the contract explicit: "entries only
for primitive top-level fields."
Checked the consumers (PredicateConverter,
RowGroupMetricsEvaluator::stats_for_field_id,
get_row_selection_for_filter_predicate) — all resolve field ids to primitive
leaves, so primitive-only seems exactly right. As discussed before this
diverges slightly from Java's ParquetSchemaUtil.addFallbackIds, which attaches
an ordinal id to every top-level field (not just primitives) — but Java's
downstream lookups only resolve primitives anyway, so observable behavior
should match.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]