Chao Sun created SPARK-57022:
--------------------------------

             Summary: Support nested column pruning for transform over arrays 
of structs
                 Key: SPARK-57022
                 URL: https://issues.apache.org/jira/browse/SPARK-57022
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Chao Sun


Spark supports nested column pruning for ordinary nested struct field accesses, 
but it does not currently prune nested fields accessed through the lambda 
variable of transform over an array<struct> column.

For example:
{code:sql}
SELECT transform(rule_results, rule ->  
  named_struct(
    'rule_public_id', rule.rule_public_id,
    'rule_version', rule.rule_version))
FROM events
{code}
If rule_results contains additional fields, Spark currently retains the full 
element struct in the scan schema, even though the query only reads 
rule_public_id and rule_version. This can increase Parquet and ORC I/O for 
datasets containing wide array element structs.

The proposed change extends nested schema pruning to recognize statically 
identifiable nested field accesses through the element variable of 
ArrayTransform. It builds a projected element schema from the referenced fields 
and carries that narrower schema back to the array input.

Because pruning fields changes the physical ordinal layout of the element 
struct, the projected expression must also rewrite the bound lambda variable 
type and nested GetStructField ordinals against the pruned schema. For example, 
if struct<a, b, c> is pruned to struct<a, c>, field c moves from ordinal 2 to 1.

The optimization remains conservative: if the lambda consumes the complete 
array element, such as x -> struct(x.a, x), Spark retains the full element 
schema instead of pruning it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to