[PR] feat(arrow): Support reading nested parquet columns [iceberg-rust]

via GitHub Tue, 06 Jan 2026 01:19:27 -0800


sundy-li opened a new pull request, #2001:
URL: https://github.com/apache/iceberg-rust/pull/2001


   ## Which issue does this PR close?
   
   - Closes #.
   
   ## What changes are included in this PR?
   
   This PR enables projection of nested fields within struct columns when 
reading parquet files. Previously, selecting a field nested inside a struct 
would result in a `FeatureUnsupported` error.
   
   ### Problem
   
   When users try to select nested fields like `person.name` from a schema such 
as:
   ```
   id: Int (field_id=1)
   person: Struct (field_id=2)
     name: String (field_id=3)
     age: Int (field_id=4)
   ```
   
   The scan would fail with a "Projecting nested field is not supported now" 
error, blocking access to nested column data.
   
   ### Solution
   
   **1. `crates/iceberg/src/arrow/reader.rs`**
   - Add `RecordBatchProjector` integration to detect and handle nested field 
projections
   - After parquet projection, detect if any requested field IDs are nested 
(not direct children of the schema's top-level struct)
   - Create a `RecordBatchProjector` to extract nested fields from their parent 
structs, flattening them into the output record batch
   - Exclude metadata fields (like `_file`) from nested field detection
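
The detection step described above can be sketched roughly as follows. This is a minimal illustration, not the code in `reader.rs`: the function name `nested_field_ids`, its parameters, and the use of plain `i32` IDs in `HashSet`s are all assumptions made for the example. It treats a requested field ID as nested when it is neither a direct child of the top-level struct nor a metadata field.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of iceberg-rust): returns the requested
/// field IDs that are neither top-level columns nor metadata fields.
/// Under this sketch's assumptions, those are the IDs that live inside a struct.
fn nested_field_ids(
    requested: &[i32],
    top_level: &HashSet<i32>,
    metadata: &HashSet<i32>,
) -> Vec<i32> {
    requested
        .iter()
        .copied()
        // Metadata fields such as `_file` are skipped before the nesting check.
        .filter(|id| !metadata.contains(id) && !top_level.contains(id))
        .collect()
}
```

With the example schema from the Problem section, the top-level IDs would be `{1, 2}`, so a projection of `[1, 3]` would report `3` (`person.name`) as nested. Any concrete metadata field ID passed in is a placeholder chosen by the caller.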
   
   **2. `crates/iceberg/src/arrow/record_batch_transformer.rs`**
   - Extend `build_field_id_to_arrow_schema_map` to recursively index nested 
struct fields
   - Add helper function `collect_field_ids_recursive` to traverse the field 
hierarchy
   - This allows the transformer to find field IDs that are nested within 
structs
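
As a rough illustration of the recursive indexing idea, the sketch below walks a simplified field tree and records every field ID it finds, including IDs inside structs. It borrows the helper's name from the PR, but the signature, the `Field` type, and the choice to store index paths in the map are assumptions made for this example; the real function operates on Arrow schema types.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for an Arrow field that carries an
/// Iceberg field ID. `children` is non-empty only for struct-typed fields.
struct Field {
    id: i32,
    children: Vec<Field>,
}

/// Walks `fields` depth-first and records, for every field ID, the path of
/// child indices leading to it from the schema root.
fn collect_field_ids_recursive(
    fields: &[Field],
    prefix: &mut Vec<usize>,
    out: &mut HashMap<i32, Vec<usize>>,
) {
    for (idx, field) in fields.iter().enumerate() {
        prefix.push(idx);
        out.insert(field.id, prefix.clone());
        // Recurse into struct children so nested IDs are indexed as well.
        collect_field_ids_recursive(&field.children, prefix, out);
        prefix.pop();
    }
}

/// The example schema from the Problem section:
/// id (1), person (2) { name (3), age (4) }.
fn example_schema() -> Vec<Field> {
    vec![
        Field { id: 1, children: vec![] },
        Field {
            id: 2,
            children: vec![
                Field { id: 3, children: vec![] },
                Field { id: 4, children: vec![] },
            ],
        },
    ]
}
```

Calling the function on `example_schema()` with an empty prefix fills the map with all four IDs, so a lookup for `3` succeeds even though `name` is not a top-level column.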
   
   **3. `crates/iceberg/src/scan/mod.rs`**
   - Remove the restriction that blocked nested field selection (the 
`FeatureUnsupported` error)
   
   ### How it works
   
   1. When processing a `FileScanTask`, detect if any requested field IDs are 
nested by checking if `schema.as_struct().field_by_id(id)` returns `None`
   2. If nested fields are detected, create a `RecordBatchProjector` with the 
projected arrow schema
   3. The projector builds index paths to locate nested fields (e.g., `[1, 0]` 
means column 1, inner field 0)
   4. After parquet reads the data, the projector extracts nested fields from 
their parent structs
   5. The transformer then processes the flattened batch normally
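
Steps 3 and 4 can be pictured with a small self-contained sketch. The `Column` enum and the `extract` and `flatten` functions below are hypothetical stand-ins for Arrow arrays and the projector's internals, written only to show how an index path such as `[1, 0]` selects a nested field and how the selected columns end up side by side in a flat batch.

```rust
/// Hypothetical, simplified column type standing in for Arrow arrays.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Ints(Vec<i32>),
    Strings(Vec<String>),
    Struct(Vec<Column>),
}

/// Follows an index path: the first index picks a top-level column, and each
/// later index picks a child of the current struct column.
/// Returns `None` if the path is empty, out of range, or descends into a
/// non-struct column.
fn extract<'a>(columns: &'a [Column], path: &[usize]) -> Option<&'a Column> {
    let (first, rest) = path.split_first()?;
    let mut current = columns.get(*first)?;
    for idx in rest {
        current = match current {
            Column::Struct(children) => children.get(*idx)?,
            _ => return None,
        };
    }
    Some(current)
}

/// Builds a flat batch by extracting one column per path, in path order.
fn flatten(columns: &[Column], paths: &[Vec<usize>]) -> Option<Vec<Column>> {
    paths
        .iter()
        .map(|path| extract(columns, path).cloned())
        .collect()
}
```

For a batch laid out as `[id, person { name, age }]`, the paths `[0]` and `[1, 0]` would yield a two-column flat batch holding `id` and `name`. This sketch clones the selected columns for simplicity; it makes no claim about how the actual projector handles the underlying buffers.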
   
   ## Are these changes tested?
   
   Yes, added a `test_read_nested_parquet_column` test that:
   - Creates a parquet file with nested struct data (`id`, `person { name, age 
}`)
   - Reads with projection `[1, 3]` (selecting `id` and nested `name`)
   - Verifies both the top-level field and nested field are correctly extracted
   - All 1051 existing tests continue to pass


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

