antonlin1 opened a new pull request, #15726:
URL: https://github.com/apache/iceberg/pull/15726

   ## Problem
   
   `BaseSparkScanBuilder.allUsedFieldIds()` uses `TypeUtil.getProjectedIds()` 
to collect the set of field IDs already in use by the table schema, in order to 
safely reassign `_partition` struct child IDs and avoid collisions with data 
columns.
   
   `TypeUtil.getProjectedIds()` was designed for column projection and 
**silently omits the field IDs of MAP and LIST columns themselves** (only 
primitive, struct, and variant fields are included). As a result, MAP/LIST 
column IDs appear to be "available" for reassignment.
   
   When a `_partition` child field is assigned the same ID as a MAP/LIST column 
(e.g. `tags MAP<string,string>` with field ID 3), the reassigned ID ends up in 
`selectedIds` during a merge-on-read scan. `PruneColumns.message()` then 
encounters the MAP column in the Parquet file, finds its ID in `selectedIds`, 
but `expected.field(id)` returns `null` because that ID now belongs to a field 
nested inside `_partition`, not to a top-level data column. This triggers an 
NPE in `PruneColumns.isStruct()`:
   
   ```
   java.lang.NullPointerException: Cannot invoke 
"org.apache.iceberg.types.Types$NestedField.type()" because "expected" is null
       at 
org.apache.iceberg.parquet.PruneColumns.isStruct(PruneColumns.java:173)
       at org.apache.iceberg.parquet.PruneColumns.message(PruneColumns.java:61)
   ```
   
   This regression was introduced in #15297, which replaced 
`TypeUtil.indexById().keySet()` with `TypeUtil.getProjectedIds()` in 
`allUsedFieldIds()`.
   
   ## Fix
   
   Replace `TypeUtil.getProjectedIds()` with `TypeUtil.indexById().keySet()` in 
`allUsedFieldIds()`. `indexById()` recursively indexes **all** field IDs 
including MAP and LIST containers, which is the correct semantic for "what IDs 
are already in use". This restores the behavior of the original Spark 3.5 code.
   
   ## Reproduction
   
   A table with schema `id(1), ts(2), tags MAP<string,string>(3)` partitioned 
by `bucket(1, id)`:
   - `allUsedFieldIds()` with the bug returns `{1, 2, 4, 5}` — missing `3` (the 
MAP)
   - Partition field id=1000 is reassigned to `3`, colliding with `tags`
   - Reading with `_partition` triggers the NPE on any Parquet file containing 
`tags`
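
The collision described above can be illustrated with a small, dependency-free sketch. This is a toy model rather than Iceberg code: `Field`, `collectIds`, `firstFreeId`, and `exampleSchema` are hypothetical names, and the "smallest unused ID" allocator is an assumption made only to mirror the reassignment to `3` reported in this PR. The key/value IDs `4` and `5` are taken from the `{1, 2, 4, 5}` set listed above.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class FieldIdCollisionSketch {

  // Toy stand-in for a schema field; this is NOT Iceberg's Types.NestedField.
  static final class Field {
    final int id;
    final String name;
    final boolean container; // true for MAP/LIST columns
    final List<Field> children;

    Field(int id, String name, boolean container, List<Field> children) {
      this.id = id;
      this.name = name;
      this.container = container;
      this.children = children;
    }
  }

  static Field leaf(int id, String name) {
    return new Field(id, name, false, List.of());
  }

  static Field map(int id, String name, Field key, Field value) {
    return new Field(id, name, true, List.of(key, value));
  }

  // Collects field IDs recursively. With includeContainers=false the container's
  // own ID is skipped (the behavior this PR describes for getProjectedIds());
  // with includeContainers=true every ID is kept (like indexById().keySet()).
  static Set<Integer> collectIds(List<Field> fields, boolean includeContainers) {
    Set<Integer> ids = new TreeSet<>();
    for (Field f : fields) {
      if (!f.container || includeContainers) {
        ids.add(f.id);
      }
      ids.addAll(collectIds(f.children, includeContainers));
    }
    return ids;
  }

  // Hypothetical allocator (an assumption): smallest positive ID not in `used`.
  static int firstFreeId(Set<Integer> used) {
    int candidate = 1;
    while (used.contains(candidate)) {
      candidate++;
    }
    return candidate;
  }

  // Schema from the reproduction: id(1), ts(2), tags MAP<string,string>(3).
  static List<Field> exampleSchema() {
    return List.of(
        leaf(1, "id"),
        leaf(2, "ts"),
        map(3, "tags", leaf(4, "key"), leaf(5, "value")));
  }

  public static void main(String[] args) {
    List<Field> schema = exampleSchema();
    Set<Integer> withoutContainers = collectIds(schema, false);
    Set<Integer> withContainers = collectIds(schema, true);
    System.out.println(withoutContainers + " -> next free ID " + firstFreeId(withoutContainers));
    System.out.println(withContainers + " -> next free ID " + firstFreeId(withContainers));
  }
}
```

In this model, leaving out container IDs makes `3` look unused, so the allocator hands it out even though `tags` already owns it. Once container IDs are included, the allocator can no longer pick `3`.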
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

