[ 
https://issues.apache.org/jira/browse/SPARK-57175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-57175:
-----------------------------------
    Labels: pull-request-available  (was: )

> Extend nested column pruning to exists and forall over arrays of structs
> ------------------------------------------------------------------------
>
>                 Key: SPARK-57175
>                 URL: https://issues.apache.org/jira/browse/SPARK-57175
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Chao Sun
>            Priority: Major
>              Labels: pull-request-available
>
> SPARK-57022 added nested column pruning for transform over array<struct> 
> inputs. The same optimization does not yet apply to the exists and forall 
> higher-order array functions.
> For example:
> {code:sql}
> SELECT exists(rule_results, rule -> rule.rule_version > 10)
> FROM events
> {code}
> If rule_results contains additional fields, Spark currently retains the full 
> element struct in the scan schema even though the predicate only reads 
> rule_version. This causes unnecessary Parquet and ORC input reads for wide 
> array element schemas.
> The optimization can be extended safely to exists and forall because both 
> consume array elements to produce a boolean result; neither returns the 
> original elements. The implementation should reuse the two-stage approach 
> introduced by SPARK-57022, together with a conservative fallback:
> * Stage 1: SchemaPruning identifies statically known GetStructField chains 
> rooted at the element lambda variable and propagates a narrower array 
> element schema to the scan.
> * Stage 2: ProjectionOverSchema rewrites the bound lambda variable type and 
> nested field ordinals after pruning.
> * Fallback: if the lambda consumes the whole element, Spark conservatively 
> retains the complete element schema.
> ArrayFilter and ArraySort are intentionally out of scope because they return 
> original input elements and therefore require a different downstream-schema 
> design.
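
The pruning steps described above can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions, not Spark's implementation: the classes LambdaVar, GetField, Literal and GreaterThan and the helpers required_fields, prune_schema and rewrite_ordinals are hypothetical stand-ins for Catalyst expressions and for the SchemaPruning and ProjectionOverSchema rules, the element struct is flat rather than nested, and the field names rule_id and details are invented for illustration (only rule_version appears in the issue).

```python
from dataclasses import dataclass


# Toy expression nodes: hypothetical stand-ins, NOT Spark's Catalyst classes.
@dataclass(frozen=True)
class LambdaVar:
    name: str


@dataclass(frozen=True)
class GetField:
    child: object
    field: str
    ordinal: int  # position of `field` inside the child's struct


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class GreaterThan:
    left: object
    right: object


def required_fields(expr, var):
    """Return the element fields read as var.<field>, or None if the
    lambda uses the whole element (so pruning must be skipped)."""
    if isinstance(expr, LambdaVar):
        return None if expr == var else set()
    if isinstance(expr, GetField):
        if expr.child == var:
            return {expr.field}
        return required_fields(expr.child, var)
    if isinstance(expr, GreaterThan):
        left = required_fields(expr.left, var)
        right = required_fields(expr.right, var)
        if left is None or right is None:
            return None
        return left | right
    return set()  # literals read nothing from the element


def prune_schema(schema, fields):
    """Keep only the requested fields, preserving their original order.
    With nothing to prune by (None or empty), keep the full schema."""
    if not fields:
        return list(schema)
    return [(name, typ) for name, typ in schema if name in fields]


def rewrite_ordinals(expr, var, pruned):
    """Re-point field accesses on `var` at ordinals in the pruned struct."""
    names = [name for name, _ in pruned]
    if isinstance(expr, GetField):
        if expr.child == var:
            return GetField(expr.child, expr.field, names.index(expr.field))
        child = rewrite_ordinals(expr.child, var, pruned)
        return GetField(child, expr.field, expr.ordinal)
    if isinstance(expr, GreaterThan):
        return GreaterThan(rewrite_ordinals(expr.left, var, pruned),
                           rewrite_ordinals(expr.right, var, pruned))
    return expr


# Assumed element schema; rule_id and details are made-up extra fields.
element_schema = [("rule_id", "string"), ("rule_version", "int"),
                  ("details", "string")]
rule = LambdaVar("rule")
# Models: exists(rule_results, rule -> rule.rule_version > 10)
predicate = GreaterThan(GetField(rule, "rule_version", 1), Literal(10))

fields = required_fields(predicate, rule)
pruned = prune_schema(element_schema, fields)
rewritten = rewrite_ordinals(predicate, rule, pruned)
```

In this toy model the predicate only needs rule_version, so the pruned element struct has a single field and the field access is re-pointed from ordinal 1 to ordinal 0. A lambda that uses rule itself (rather than a field of it) makes required_fields return None, which keeps the full element schema, mirroring the conservative fallback in the description.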



