mbutrovich opened a new issue, #2306: URL: https://github.com/apache/iceberg-rust/issues/2306
### Describe the bug

`build_fallback_field_id_map` maps Iceberg field IDs to the wrong Parquet leaf column indices when the schema contains nested types (struct, list, map). This causes predicate evaluation to crash on migrated Parquet files (files without embedded field IDs).

**Error:** "Leave column id in predicates isn't a root column in Parquet schema"

This affects migrated tables where Parquet files were written by Spark/Hive without Iceberg field IDs, then imported via `add_files` or `importSparkTable()`.

### Root Cause

#### How fallback field IDs work

When a Parquet file lacks embedded field IDs, iceberg-rust assigns position-based fallback IDs. Two functions must agree on the mapping:

1. `add_fallback_field_ids_to_arrow_schema`: assigns field IDs 1, 2, 3... to **top-level** Arrow schema fields
2. `build_fallback_field_id_map`: maps those field IDs to Parquet **leaf** column indices for predicate evaluation

#### What goes wrong

`build_fallback_field_id_map` iterates over `parquet_schema.columns()` (leaf columns) instead of top-level fields. Nested types expand into multiple leaves, causing the mapping to diverge from the Arrow schema's field IDs.

**Example:** `name: string, address: struct(street: string, city: string), id: int`

| | Arrow top-level fields | Parquet leaf columns |
|---|---|---|
| Fields | name, address, id | name, street, city, id |
| Assigned field IDs | 1, 2, 3 | 1, 2, 3, 4 (bug) |

When a predicate references `id` (field_id=3 from Arrow), the column map returns leaf index 2 (`city`, inside the `address` group). `PredicateConverter::bound_reference` then calls `get_column_root(2).is_group()` → `true` → error.
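The divergence in the table above can be reproduced with a small standalone model. This is only a sketch: `Node`, `leaf_names`, `arrow_fallback_ids`, `leaf_based_map`, and `example_schema` are hypothetical names invented for illustration and do not come from iceberg-rust or the `parquet` crate. It assumes leaf columns are indexed depth-first from 0, as in the example.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a Parquet schema node (hypothetical type,
/// not the `parquet` crate's schema representation).
enum Node {
    Primitive(&'static str),
    Group(&'static str, Vec<Node>),
}

impl Node {
    fn name(&self) -> &'static str {
        match self {
            Node::Primitive(name) | Node::Group(name, _) => *name,
        }
    }
}

/// Append the names of all leaf (primitive) columns under `node`, depth-first.
fn collect_leaves(node: &Node, out: &mut Vec<&'static str>) {
    match node {
        Node::Primitive(name) => out.push(*name),
        Node::Group(_, children) => {
            for child in children {
                collect_leaves(child, out);
            }
        }
    }
}

/// Leaf column names in leaf-index order.
fn leaf_names(fields: &[Node]) -> Vec<&'static str> {
    let mut out = Vec::new();
    for field in fields {
        collect_leaves(field, &mut out);
    }
    out
}

/// Fallback IDs as the Arrow side assigns them: one per top-level field.
fn arrow_fallback_ids(fields: &[Node]) -> HashMap<i32, &'static str> {
    fields
        .iter()
        .enumerate()
        .map(|(pos, field)| (pos as i32 + 1, field.name()))
        .collect()
}

/// The divergent numbering described above: one ID per leaf column,
/// mapped to that leaf's index.
fn leaf_based_map(fields: &[Node]) -> HashMap<i32, usize> {
    (0..leaf_names(fields).len())
        .map(|idx| (idx as i32 + 1, idx))
        .collect()
}

/// name: string, address: struct(street: string, city: string), id: int
fn example_schema() -> Vec<Node> {
    vec![
        Node::Primitive("name"),
        Node::Group(
            "address",
            vec![Node::Primitive("street"), Node::Primitive("city")],
        ),
        Node::Primitive("id"),
    ]
}
```

In this model, field ID 3 means `id` on the Arrow side, while the leaf-based map resolves the same ID to leaf index 2, which is `city`.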
### How Iceberg Java handles this

Java's [`ParquetSchemaUtil.addFallbackIds()`](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java#L174-L184) iterates **top-level fields**, not leaf columns:

```java
public static MessageType addFallbackIds(MessageType fileSchema) {
  MessageTypeBuilder builder = org.apache.parquet.schema.Types.buildMessage();

  int ordinal = 1; // fallback IDs are 1-based positions of top-level fields
  for (Type type : fileSchema.getFields()) {
    builder.addField(type.withId(ordinal));
    ordinal += 1;
  }

  return builder.named(fileSchema.getName());
}
```

Additionally, Java's [`ParquetMetricsRowGroupFilter`](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java) gracefully handles nested types: predicates on nested columns return `ROWS_MIGHT_MATCH` instead of crashing.

### Proposed Fix

Change `build_fallback_field_id_map` to iterate over `parquet_schema.root_schema().get_fields()` (top-level fields) instead of `parquet_schema.columns()` (leaf columns). For each top-level field:

- If primitive: map `ordinal` → `leaf_column_index`
- If group (struct/list/map): skip the mapping, advance the leaf counter past all leaves in that group

This makes `build_fallback_field_id_map` consistent with `add_fallback_field_ids_to_arrow_schema`, which already correctly iterates top-level Arrow fields.

`PredicateConverter::bound_reference` already validates that the resolved column is a root column and rejects groups, so no changes are needed there.

### Files to modify

1. `crates/iceberg/src/arrow/reader.rs`: `build_fallback_field_id_map`

### Related

- https://github.com/apache/datafusion-comet/issues/3860: Downstream issue in Comet
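The two rules in the proposed fix (map primitives, skip groups while advancing the leaf counter) can be sketched on a simplified schema model. This is not the actual patch: `Node`, `leaf_count`, `top_level_fallback_map`, and `example_schema` are hypothetical names used only for illustration, and the real change would walk the `parquet` crate's schema types instead.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a Parquet schema node (hypothetical type).
enum Node {
    Primitive,
    Group(Vec<Node>),
}

/// Number of leaf (primitive) columns underneath a node.
fn leaf_count(node: &Node) -> usize {
    match node {
        Node::Primitive => 1,
        Node::Group(children) => children.iter().map(leaf_count).sum(),
    }
}

/// Map 1-based top-level ordinals to leaf column indices.
/// Primitive top-level fields get an entry; groups get none, but the
/// leaf counter still advances past every leaf they contain.
fn top_level_fallback_map(fields: &[Node]) -> HashMap<i32, usize> {
    let mut map = HashMap::new();
    let mut leaf_idx = 0;
    for (pos, field) in fields.iter().enumerate() {
        let ordinal = pos as i32 + 1;
        match field {
            Node::Primitive => {
                map.insert(ordinal, leaf_idx);
                leaf_idx += 1;
            }
            Node::Group(_) => leaf_idx += leaf_count(field),
        }
    }
    map
}

/// name: string, address: struct(street: string, city: string), id: int
fn example_schema() -> Vec<Node> {
    vec![
        Node::Primitive,                                      // name
        Node::Group(vec![Node::Primitive, Node::Primitive]),  // address
        Node::Primitive,                                      // id
    ]
}
```

For the example schema this yields entries for ordinals 1 and 3 only, with ordinal 3 pointing at leaf index 3 (`id`). Ordinal 2 (`address`) has no entry, so how a predicate on it is reported would still depend on the caller.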
