Eyck Troschke created SPARK-56660:
-------------------------------------

             Summary: Optimizer: No file pruning for struct literal comparisons 
or tuple comparisons, although equivalent scalar predicates would allow pruning
                 Key: SPARK-56660
                 URL: https://issues.apache.org/jira/browse/SPARK-56660
             Project: Spark
          Issue Type: Improvement
          Components: Optimizer, SQL
    Affects Versions: 4.1.1
         Environment: Spark 4.1.1

DeltaTable 4.2
            Reporter: Eyck Troschke


When filtering a Delta table on individual fields of a nested struct, Spark 
correctly performs file pruning based on per‑column statistics 
(min/max/nullCount). However, when the same filter is expressed as a struct 
literal comparison or a tuple comparison, Spark does not perform file pruning, 
even though these predicates are semantically equivalent and could be rewritten 
into conjunctive scalar predicates.

This results in unnecessary full scans and significantly worse performance for 
large Delta tables.
----
h2. *Reproduction*

Given a Delta table with a nested struct column:

{{CREATE TABLE t (
  id STRING,
  nested STRUCT<
    col1: STRING,
    col2: STRING,
    col3: STRING
  >
) USING DELTA;}}
h3. *Case 1 — Scalar predicates (file pruning works)*

{{SELECT *
FROM t
WHERE nested.col1 = 'foo'
  AND nested.col2 = 'bar'
  AND nested.col3 = 'baz';}}

Spark correctly prunes files based on column‑level statistics.
h3. *Case 2 — Struct literal comparison (no pruning)*

{{SELECT *
FROM t
WHERE nested = struct('foo', 'bar', 'baz');}}

This predicate is semantically equivalent to the conjunctive scalar predicates 
above, but Spark does *not* perform file pruning. The entire dataset is scanned.
h3. *Case 3 — Tuple comparison (no pruning)*

{{SELECT *
FROM t
WHERE (nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz');}}

Tuple comparisons are also not decomposed into scalar predicates, even though 
they are logically equivalent and safe to rewrite.
h2. *Expected Behavior*

Spark should be able to rewrite:

{{nested = struct('foo', 'bar', 'baz')}}

and:

{{(nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz')}}

into:

{{nested.col1 = 'foo'
AND nested.col2 = 'bar'
AND nested.col3 = 'baz'}}

at least when:
 * field names and order match
 * all fields are primitive types
 * the struct or tuple literal is deterministic

This rewrite would allow Delta Lake to use existing per‑column statistics for 
file pruning.
h2. *Actual Behavior*
 * Struct equality and tuple equality are treated as opaque predicates.
 * The optimizer does not decompose them into scalar comparisons.
 * Delta Lake cannot use per‑column statistics.
 * No file pruning occurs.
 * Performance degrades significantly for large tables.

h2. *Why This Matters*
 * Structs and tuples are common modeling patterns in Spark SQL.
 * Many APIs and UDFs return structs, making struct comparisons natural.
 * Users expect semantically equivalent predicates to have equivalent 
performance.
 * Other SQL engines rewrite tuple/row comparisons into conjunctive predicates.
 * The rewrite is deterministic and preserves semantics for primitive fields.

h2. *Proposed Improvement*

Introduce an optimizer rule that:
 # Detects struct equality or tuple equality where both sides are:

 ** deterministic expressions
 ** structs/tuples of identical length
 ** containing only primitive fields
 # Rewrites the comparison into a conjunction of field‑level comparisons.

This would enable file pruning without changing query semantics.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to