Eyck Troschke created SPARK-56660:
-------------------------------------
Summary: Optimizer: No file pruning for struct literal comparisons
or tuple comparisons, although equivalent scalar predicates would allow pruning
Key: SPARK-56660
URL: https://issues.apache.org/jira/browse/SPARK-56660
Project: Spark
Issue Type: Improvement
Components: Optimizer, SQL
Affects Versions: 4.1.1
Environment: Spark 4.1.1
DeltaTable 4.2
Reporter: Eyck Troschke
When filtering a Delta table on individual fields of a nested struct, Spark
correctly performs file pruning based on per‑column statistics
(min/max/nullCount). However, when the same filter is expressed as a struct
literal comparison or a tuple comparison, Spark does not perform file pruning,
even though these predicates are semantically equivalent and could be rewritten
into conjunctive scalar predicates.
This results in unnecessary full scans and significantly worse performance for
large Delta tables.
----
h2. *Reproduction*
Given a Delta table with a nested struct column:
{{CREATE TABLE t (
id STRING,
nested STRUCT<
col1: STRING,
col2: STRING,
col3: STRING
>
) USING DELTA;}}
h3. *Case 1 — Scalar predicates (file pruning works)*
{{SELECT *
FROM t
WHERE nested.col1 = 'foo'
AND nested.col2 = 'bar'
AND nested.col3 = 'baz';}}
Spark correctly prunes files based on column‑level statistics.
h3. *Case 2 — Struct literal comparison (no pruning)*
{{SELECT *
FROM t
WHERE nested = struct('foo', 'bar', 'baz');}}
This predicate is semantically equivalent to the conjunctive scalar predicates
above, but Spark does *not* perform file pruning. The entire dataset is scanned.
h3. *Case 3 — Tuple comparison (no pruning)*
{{SELECT *
FROM t
WHERE (nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz');}}
Tuple comparisons are also not decomposed into scalar predicates, even though
they are logically equivalent and safe to rewrite.
h2. *Expected Behavior*
Spark should be able to rewrite:
{{nested = struct('foo', 'bar', 'baz')}}
and:
{{(nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz')}}
into:
{{nested.col1 = 'foo'
AND nested.col2 = 'bar'
AND nested.col3 = 'baz'}}
at least when:
* field names and order match
* all fields are primitive types
* the struct or tuple literal is deterministic
This rewrite would allow Delta Lake to use existing per‑column statistics for
file pruning.
h2. *Actual Behavior*
* Struct equality and tuple equality are treated as opaque predicates.
* The optimizer does not decompose them into scalar comparisons.
* Delta Lake cannot use per‑column statistics.
* No file pruning occurs.
* Performance degrades significantly for large tables.
h2. *Why This Matters*
* Structs and tuples are common modeling patterns in Spark SQL.
* Many APIs and UDFs return structs, making struct comparisons natural.
* Users expect semantically equivalent predicates to have equivalent
performance.
* Other SQL engines rewrite tuple/row comparisons into conjunctive predicates.
* The rewrite is deterministic and preserves semantics for primitive fields.
h2. *Proposed Improvement*
Introduce an optimizer rule that:
# Detects struct equality or tuple equality where both sides are:
** deterministic expressions
** structs/tuples of identical length
** containing only primitive fields
# Rewrites the comparison into a conjunction of field‑level comparisons.
This would enable file pruning without changing query semantics.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]