antonlin1 opened a new pull request, #15726:
URL: https://github.com/apache/iceberg/pull/15726
## Problem
`BaseSparkScanBuilder.allUsedFieldIds()` uses `TypeUtil.getProjectedIds()`
to collect the set of field IDs already in use by the table schema, in order to
safely reassign `_partition` struct child IDs and avoid collisions with data
columns.
`TypeUtil.getProjectedIds()` was designed for column projection and
**silently omits MAP and LIST field IDs** (only primitives, structs, and
variants are included). This means MAP/LIST column IDs appear to be "available"
for reassignment.
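The difference between the two traversals can be illustrated with a small self-contained model. This is only a sketch and not Iceberg's implementation: `FieldIdSketch`, `Field`, `projectedStyleIds`, and `allIds` are hypothetical names, and the map key and value IDs (4 and 5) are assumed for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simplified stand-in for a schema tree. These are hypothetical types, not Iceberg classes.
final class FieldIdSketch {
  enum Kind { PRIMITIVE, STRUCT, MAP, LIST }

  static final class Field {
    final int id;
    final String name;
    final Kind kind;
    final List<Field> children; // struct members, map key/value, or list element

    Field(int id, String name, Kind kind, Field... children) {
      this.id = id;
      this.name = name;
      this.kind = kind;
      this.children = Collections.unmodifiableList(Arrays.asList(children));
    }
  }

  // Projection-style collection: MAP and LIST container IDs are skipped,
  // but the traversal still descends into their children.
  static Set<Integer> projectedStyleIds(List<Field> fields) {
    Set<Integer> ids = new TreeSet<>();
    for (Field field : fields) {
      if (field.kind != Kind.MAP && field.kind != Kind.LIST) {
        ids.add(field.id);
      }
      ids.addAll(projectedStyleIds(field.children));
    }
    return ids;
  }

  // Index-style collection: every field ID is recorded, containers included.
  static Set<Integer> allIds(List<Field> fields) {
    Set<Integer> ids = new TreeSet<>();
    for (Field field : fields) {
      ids.add(field.id);
      ids.addAll(allIds(field.children));
    }
    return ids;
  }

  // id(1), ts(2), tags MAP<string,string>(3); key=4 and value=5 are assumed IDs.
  static List<Field> exampleSchema() {
    return Arrays.asList(
        new Field(1, "id", Kind.PRIMITIVE),
        new Field(2, "ts", Kind.PRIMITIVE),
        new Field(3, "tags", Kind.MAP,
            new Field(4, "key", Kind.PRIMITIVE),
            new Field(5, "value", Kind.PRIMITIVE)));
  }

  public static void main(String[] args) {
    List<Field> schema = exampleSchema();
    System.out.println("projection-style: " + projectedStyleIds(schema)); // [1, 2, 4, 5]
    System.out.println("index-style:      " + allIds(schema));            // [1, 2, 3, 4, 5]
  }
}
```

In this model the projection-style set has no entry for the MAP container, so a caller that treats the set as "IDs in use" would consider ID 3 free.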
When a `_partition` child field is assigned the same ID as a MAP/LIST column
(e.g. `tags MAP<string,string>` with field ID 3), the reassigned ID ends up in
`selectedIds` during a merge-on-read scan. `PruneColumns.message()` then
encounters the MAP column in the Parquet file and finds its ID in `selectedIds`,
but `expected.field(id)` returns `null` because that ID now belongs to a field
nested inside `_partition`, not to a top-level data column. The subsequent
`expected.type()` call in `PruneColumns.isStruct()` then throws an NPE:
```
java.lang.NullPointerException: Cannot invoke "org.apache.iceberg.types.Types$NestedField.type()" because "expected" is null
    at org.apache.iceberg.parquet.PruneColumns.isStruct(PruneColumns.java:173)
    at org.apache.iceberg.parquet.PruneColumns.message(PruneColumns.java:61)
```
This regression was introduced in #15297, which replaced
`TypeUtil.indexById().keySet()` with `TypeUtil.getProjectedIds()` in
`allUsedFieldIds()`.
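The failure path can be modeled in a few lines. The sketch below is hypothetical and does not reproduce `PruneColumns`: `TopLevelLookupSketch`, `directChild`, and `selectedIds` are illustrative names, `PARTITION_STRUCT_ID` is a placeholder rather than the real reserved ID, and the expected schema assumes a read of `id` and `_partition` only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the failure path; names and IDs are illustrative, not Iceberg's.
final class TopLevelLookupSketch {
  static final int PARTITION_STRUCT_ID = 9000; // placeholder, not the real reserved ID

  static final class Field {
    final int id;
    final String name;
    final List<Field> children;

    Field(int id, String name, Field... children) {
      this.id = id;
      this.name = name;
      this.children = Collections.unmodifiableList(Arrays.asList(children));
    }
  }

  // Collects every ID in the expected schema, nested ones included (models selectedIds).
  static Set<Integer> selectedIds(List<Field> expected) {
    Set<Integer> ids = new TreeSet<>();
    for (Field field : expected) {
      ids.add(field.id);
      ids.addAll(selectedIds(field.children));
    }
    return ids;
  }

  // Looks at direct children only, so an ID nested inside a struct is never found.
  static Field directChild(List<Field> expected, int id) {
    for (Field field : expected) {
      if (field.id == id) {
        return field;
      }
    }
    return null;
  }

  // Expected schema for a read of `id` and `_partition` after the colliding reassignment.
  static List<Field> expectedAfterCollision() {
    return Arrays.asList(
        new Field(1, "id"),
        new Field(PARTITION_STRUCT_ID, "_partition", new Field(3, "id_bucket")));
  }

  public static void main(String[] args) {
    List<Field> expected = expectedAfterCollision();
    int tagsIdInFile = 3; // the file's `tags` column carries the colliding ID
    boolean selected = selectedIds(expected).contains(tagsIdInFile);
    Field match = directChild(expected, tagsIdInFile);
    System.out.println("selected=" + selected + ", match=" + match); // selected=true, match=null
  }
}
```

The file column passes the "is selected" check through the nested partition child, yet the top-level lookup finds nothing, which is the combination that leaves a `null` to be dereferenced.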
## Fix
Replace `TypeUtil.getProjectedIds()` with `TypeUtil.indexById().keySet()` in
`allUsedFieldIds()`. `indexById()` recursively indexes **all** field IDs,
including MAP and LIST containers, which matches the intended meaning of "what
IDs are already in use". This restores the behavior of the original Spark 3.5 code.
## Reproduction
A table with schema `id(1), ts(2), tags MAP<string,string>(3)` partitioned
by `bucket(1, id)`:
- `allUsedFieldIds()` with the bug returns `{1, 2, 4, 5}`, which omits `3` (the
MAP container; `4` and `5` are its key and value)
- Partition field id=1000 is reassigned to `3`, colliding with `tags`
- Reading with `_partition` triggers the NPE on any Parquet file containing
`tags`
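
The reassignment step in the reproduction above can be sketched as follows. The allocation policy is an assumption: this PR description does not show how `BaseSparkScanBuilder` picks a replacement ID, so `nextUnusedId` (a hypothetical helper) simply takes the smallest positive ID not in the used set, which is consistent with 1000 being reassigned to `3`.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of ID reassignment; the real allocation policy is not shown in the PR.
final class PartitionIdReassignSketch {
  // Assumed policy: return the smallest positive ID that is not already in use.
  static int nextUnusedId(Set<Integer> usedIds) {
    int candidate = 1;
    while (usedIds.contains(candidate)) {
      candidate++;
    }
    return candidate;
  }

  public static void main(String[] args) {
    // Used-ID set as reported with the bug: the MAP container ID 3 is missing.
    Set<Integer> buggyUsedIds = new TreeSet<>(Arrays.asList(1, 2, 4, 5));
    // Used-ID set with every field ID indexed.
    Set<Integer> fixedUsedIds = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5));

    System.out.println(nextUnusedId(buggyUsedIds)); // 3, collides with `tags`
    System.out.println(nextUnusedId(fixedUsedIds)); // 6, no collision
  }
}
```

Under this assumed policy, the only difference between the colliding and the safe outcome is whether the used-ID set contains the MAP container ID.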
🤖 Generated with [Claude Code](https://claude.com/claude-code)