sunchao opened a new pull request, #56070:
URL: https://github.com/apache/spark/pull/56070

   ### Why are the changes needed?
   
   Spark can prune nested struct fields referenced directly by a query, but it 
does not currently prune nested fields read through the lambda variable of 
`transform` over an `array<struct>` column.
   
   For example:
   
   ```sql
   SELECT transform(rule_results, rule ->
     named_struct(
       'rule_public_id', rule.rule_public_id,
       'rule_version', rule.rule_version))
   FROM events
   ```
   
   If `rule_results` contains additional fields, Spark currently retains the 
full element struct in the scan schema even though only two nested fields are 
required. This causes unnecessary Parquet and ORC input reads for wide array 
element schemas.
   
   This change addresses 
[SPARK-57022](https://issues.apache.org/jira/browse/SPARK-57022).
   
   ### What changes were proposed in this pull request?
   
   - Recognize statically identifiable nested field reads through the element 
variable of `ArrayTransform`.
   - Build a projected array element schema from exactly those referenced 
fields and propagate it to the scan input.
   - Rewrite the bound lambda variable type and `GetStructField` ordinals 
against the projected element schema after pruning.
   - Fall back to retaining the full element schema when the lambda consumes 
the complete element, so pruning is applied only when it is safe.
   - Add Catalyst and datasource tests covering ordinal rewrites, deep nesting, 
nested input paths with null values, indexed lambdas, case-insensitive 
resolution, and conservative fallback.
   
   The implementation intentionally has two stages. `SchemaPruning` discovers 
which fields the lambda needs from the array element. `ProjectionOverSchema` 
then rewrites the lambda against the narrower element type because pruning can 
change field ordinals. For example, pruning `struct<a, b, c>` to `struct<a, c>` 
moves `c` from ordinal `2` to ordinal `1`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Eligible queries using `transform` over arrays of structs can read a 
narrower input schema. Query results and SQL APIs are unchanged.
   
   ### How was this patch tested?
   
   - `build/sbt "catalyst/testOnly 
org.apache.spark.sql.catalyst.expressions.SchemaPruningSuite"`
   - `build/sbt "sql/testOnly 
org.apache.spark.sql.execution.datasources.parquet.ParquetV1SchemaPruningSuite 
org.apache.spark.sql.execution.datasources.parquet.ParquetV2SchemaPruningSuite 
org.apache.spark.sql.execution.datasources.orc.OrcV1SchemaPruningSuite 
org.apache.spark.sql.execution.datasources.orc.OrcV2SchemaPruningSuite -- -z 
ArrayTransform"`
   - `git diff --check apache/master...HEAD`
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Codex (GPT-5)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to