Chao Sun created SPARK-57175:
--------------------------------
Summary: Extend nested column pruning to exists and forall over
arrays of structs
Key: SPARK-57175
URL: https://issues.apache.org/jira/browse/SPARK-57175
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.2.0
Reporter: Chao Sun
SPARK-57022 added nested column pruning for transform over array<struct>
inputs. The same optimization does not yet apply to the exists and forall
higher-order array functions.
For example:
{code:sql}
SELECT exists(rule_results, rule -> rule.rule_version > 10)
FROM events
{code}
If rule_results contains additional fields, Spark currently retains the full
element struct in the scan schema even though the predicate only reads
rule_version. This causes unnecessary Parquet and ORC input reads for wide
array element schemas.
The optimization can be extended safely to exists and forall because both
consume array elements only to produce a boolean result; neither returns the
original elements. The implementation should reuse the two-stage approach
introduced by SPARK-57022, plus a conservative fallback:
* SchemaPruning identifies statically known GetStructField chains rooted at the
element lambda variable and propagates a narrower array element schema to the
scan.
* ProjectionOverSchema rewrites the bound lambda variable type and nested field
ordinals after pruning.
* If the lambda consumes the whole element, Spark conservatively retains the
complete element schema.
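The steps above can be illustrated with a small standalone model. This is only a sketch, not Spark code: plain Python dicts and tuples stand in for Catalyst's StructType and expression tree, and the names collect_paths, prune, remap_ordinals and WHOLE, as well as the fields rule_id and details, are hypothetical and exist only for this illustration.

```python
# Toy model only: dicts and tuples stand in for Catalyst types and
# expressions. None of these names are Spark APIs.

WHOLE = object()  # marker: the lambda body uses the element itself


def collect_paths(expr, var):
    """Return field chains rooted at `var`, or WHOLE if `var` is used bare.

    Expressions are tuples:
      ("var", name)            lambda variable reference
      ("field", child, name)   struct field access (GetStructField analogue)
      ("lit", value)           literal
      (op, left, right)        any binary operator, e.g. (">", a, b)
    """
    kind = expr[0]
    if kind == "lit":
        return set()
    if kind == "var":
        return WHOLE if expr[1] == var else set()
    if kind == "field":
        chain, node = [], expr
        while node[0] == "field":  # walk down to the root of the chain
            chain.append(node[2])
            node = node[1]
        if node[0] == "var" and node[1] == var:
            return {tuple(reversed(chain))}
        return collect_paths(node, var)
    paths = set()
    for child in expr[1:]:
        found = collect_paths(child, var)
        if found is WHOLE:
            return WHOLE
        paths |= found
    return paths


def prune(schema, paths):
    """Keep only the fields of `schema` (a nested dict) touched by `paths`."""
    if paths is WHOLE:
        return schema  # conservative fallback: keep the complete element
    pruned = {}
    for name, dtype in schema.items():  # preserves original field order
        tails = {p[1:] for p in paths if p[0] == name}
        if not tails:
            continue
        if () in tails or not isinstance(dtype, dict):
            pruned[name] = dtype  # field is read as a whole
        else:
            pruned[name] = prune(dtype, tails)
    return pruned


def remap_ordinals(original, pruned):
    """Map each surviving top-level field to (old ordinal, new ordinal)."""
    old = {name: i for i, name in enumerate(original)}
    return {name: (old[name], new) for new, name in enumerate(pruned)}


# Element schema with made-up extra fields around rule_version.
element = {
    "rule_id": "string",
    "rule_version": "int",
    "details": {"owner": "string", "notes": "string"},
}
# rule -> rule.rule_version > 10
body = (">", ("field", ("var", "rule"), "rule_version"), ("lit", 10))

paths = collect_paths(body, "rule")
pruned = prune(element, paths)
ordinals = remap_ordinals(element, pruned)
```

In this model the first stage corresponds to collect_paths and prune, and the second to remap_ordinals: rule_version moves from ordinal 1 to ordinal 0 once rule_id is dropped, which is why field ordinals need rewriting after pruning. A lambda body that references the bare variable yields WHOLE, and prune then returns the schema unchanged.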
ArrayFilter and ArraySort are intentionally out of scope because they return
original input elements and therefore require a different downstream-schema
design.
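The scope distinction can be checked with a few lines of plain Python. This is an illustrative sketch rather than Spark code: any, all and a list comprehension stand in for exists, forall and filter, SQL null semantics are ignored, and the fields rule_id and payload are hypothetical.

```python
# Hypothetical rows: only rule_version comes from the example query;
# rule_id and payload are made-up extra fields.
full = [
    {"rule_id": "a", "rule_version": 7, "payload": "x"},
    {"rule_id": "b", "rule_version": 12, "payload": "y"},
]
# What the scan would produce if only rule_version were read.
pruned = [{"rule_version": r["rule_version"]} for r in full]


def pred(rule):
    return rule["rule_version"] > 10


# Boolean results do not depend on the fields that were dropped.
exists_full = any(pred(r) for r in full)
exists_pruned = any(pred(r) for r in pruned)
forall_full = all(pred(r) for r in full)
forall_pruned = all(pred(r) for r in pruned)

# A filter returns the elements themselves, so pruning changes its output.
filter_full = [r for r in full if pred(r)]
filter_pruned = [r for r in pruned if pred(r)]
```

Here exists_full equals exists_pruned and forall_full equals forall_pruned, while filter_full and filter_pruned differ because the pruned elements have lost rule_id and payload. Any consumer of the filtered array would therefore have to take part in deciding which fields are needed.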