[jira] [Updated] (SPARK-56660) Optimizer: No file pruning for struct literal comparisons or tuple comparisons, although equivalent scalar predicates would allow pruning

Eyck Troschke (Jira) Wed, 29 Apr 2026 00:53:12 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-56660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Eyck Troschke updated SPARK-56660:
----------------------------------
    Description: 
When filtering a Delta table on individual fields of a nested struct, Spark 
correctly performs file pruning based on per‑column statistics 
(min/max/nullCount). However, when the same filter is expressed as a struct 
literal comparison or a tuple comparison, Spark does not perform file pruning, 
even though these predicates are semantically equivalent and could be rewritten 
into conjunctive scalar predicates.

This results in unnecessary full scans and significantly worse performance for 
large Delta tables.
h2. *Reproduction*

Given a Delta table with a nested struct column:

{{CREATE TABLE t (}}
{{id STRING,}}
{{nested STRUCT<}}
{{col1: STRING,}}
{{col2: STRING,}}
{{col3: STRING}}
{{>}}
{{) USING DELTA;}}
h3. *Case 1 — Scalar predicates (file pruning works)*

{{SELECT *}}
{{FROM t}}
{{WHERE nested.col1 = 'foo'}}
{{AND nested.col2 = 'bar'}}
{{AND nested.col3 = 'baz';}}

Spark correctly prunes files based on column‑level statistics.
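
To make the pruning step concrete, here is a minimal sketch of how per-file min/max/nullCount statistics can rule out files for a conjunction of equality predicates. It is a simplified toy model, not Delta Lake's implementation: `ColumnStats`, `may_contain`, `files_to_scan` and the sample file names are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Per-file statistics for one column (hypothetical, simplified)."""
    min_value: str
    max_value: str
    null_count: int
    num_records: int


def may_contain(stats: ColumnStats, literal: str) -> bool:
    """Return False only when no row in the file can satisfy `col = literal`."""
    if stats.null_count == stats.num_records:
        return False  # every value is NULL, so the equality is never true
    return stats.min_value <= literal <= stats.max_value


def files_to_scan(files, predicates):
    """Keep files whose statistics cannot rule out any `column = literal` term."""
    return [
        name
        for name, stats in files.items()
        if all(may_contain(stats[col], lit) for col, lit in predicates.items())
    ]


# Three hypothetical data files with statistics on two nested fields.
files = {
    "part-0": {"nested.col1": ColumnStats("aaa", "ccc", 0, 100),
               "nested.col2": ColumnStats("bar", "bar", 0, 100)},
    "part-1": {"nested.col1": ColumnStats("fab", "fzz", 0, 100),
               "nested.col2": ColumnStats("a", "c", 0, 100)},
    "part-2": {"nested.col1": ColumnStats("fab", "fzz", 0, 100),
               "nested.col2": ColumnStats("x", "z", 0, 100)},
}
selected = files_to_scan(files, {"nested.col1": "foo", "nested.col2": "bar"})
```

In this model only `part-1` survives. The check needs one scalar literal per column, which is why an undecomposed struct comparison gives it nothing to work with.
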
h3. *Case 2 — Struct literal comparison (no pruning)*

{{SELECT *}}
{{FROM t}}
{{WHERE nested = struct('foo', 'bar', 'baz');}}

This predicate is semantically equivalent to the conjunctive scalar predicates 
above, but Spark does *not* perform file pruning. The entire dataset is scanned.
h3. *Case 3 — Tuple comparison (no pruning)*

{{SELECT *}}
{{FROM t}}
{{WHERE (nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz');}}

Tuple comparisons are also not decomposed into scalar predicates, even though 
they are logically equivalent and safe to rewrite.
h2. *Expected Behavior*

Spark should be able to rewrite:

{{nested = struct('foo', 'bar', 'baz')}}

and:

{{(nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz')}}

into:

{{nested.col1 = 'foo'}}
{{AND nested.col2 = 'bar'}}
{{AND nested.col3 = 'baz'}}

at least when:
 * field names and order match
 * all fields are primitive types
 * the struct or tuple literal is deterministic

This rewrite would allow Delta Lake to use existing per‑column statistics for 
file pruning.
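
Until the optimizer performs this rewrite, the same expansion can be done on the client side. The following sketch builds the conjunctive predicate from a mapping of field names to values, using named parameter markers so that no string escaping is needed. The helper name `expand_struct_equality` is hypothetical, and the sketch assumes simple field names that need no backtick quoting; passing the result to `spark.sql(query, args=...)` is an assumption about how it would be used, not something tested here.

```python
def expand_struct_equality(column, fields):
    """Build `column.f1 = :p0 AND column.f2 = :p1 ...` plus its parameter map."""
    if not fields:
        raise ValueError("at least one field is required")
    terms, args = [], {}
    for i, (name, value) in enumerate(fields.items()):
        if value is None:
            # `=` never matches NULL, so refuse instead of emitting a dead term.
            raise ValueError(f"field {name!r} is None")
        param = f"p{i}"
        terms.append(f"{column}.{name} = :{param}")
        args[param] = value
    return " AND ".join(terms), args


predicate, args = expand_struct_equality(
    "nested", {"col1": "foo", "col2": "bar", "col3": "baz"}
)
query = f"SELECT * FROM t WHERE {predicate}"
```
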
h2. *Actual Behavior*
 * Struct equality and tuple equality are treated as opaque predicates.
 * The optimizer does not decompose them into scalar comparisons.
 * Delta Lake cannot use per‑column statistics.
 * No file pruning occurs.
 * Performance degrades significantly for large tables.

h2. *Why This Matters*
 * Structs and tuples are common modeling patterns in Spark SQL.
 * Many APIs and UDFs return structs, making struct comparisons natural.
 * Users expect semantically equivalent predicates to have equivalent 
performance.
 * Other SQL engines rewrite tuple/row comparisons into conjunctive predicates.
 * The rewrite is deterministic and preserves semantics for primitive fields.

h2. *Proposed Improvement*

Introduce an optimizer rule that:
 # Detects struct equality or tuple equality where both sides are:
 #* deterministic expressions
 #* structs/tuples of identical length
 #* containing only primitive fields
 # Rewrites the comparison into a conjunction of field‑level comparisons.

This would enable file pruning without changing query semantics.
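
The proposed rule can be sketched on a toy expression tree. This is not Catalyst code: the classes `Field`, `Lit`, `Struct`, `Eq`, `And` and the function `split_struct_equality` are hypothetical stand-ins, and a struct column is assumed to be already expanded into its field references. The sketch also declines to rewrite when a literal field is NULL, a conservative assumption in case NULL handling inside struct equality differs from scalar equality.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Field:
    path: str  # a field reference such as "nested.col1"


@dataclass(frozen=True)
class Lit:
    value: object  # a constant; None models SQL NULL


@dataclass(frozen=True)
class Struct:
    children: Tuple["Expr", ...]


@dataclass(frozen=True)
class Eq:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


Expr = Union[Field, Lit, Struct, Eq, And]


def split_struct_equality(expr):
    """Rewrite Struct = Struct into a conjunction of field-level equalities."""
    if isinstance(expr, And):
        return And(split_struct_equality(expr.left),
                   split_struct_equality(expr.right))
    if not (isinstance(expr, Eq)
            and isinstance(expr.left, Struct)
            and isinstance(expr.right, Struct)):
        return expr
    lhs, rhs = expr.left.children, expr.right.children
    if not lhs or len(lhs) != len(rhs):
        return expr  # lengths must match
    if not all(isinstance(c, (Field, Lit)) for c in lhs + rhs):
        return expr  # only flat, scalar children are handled
    if any(isinstance(c, Lit) and c.value is None for c in lhs + rhs):
        return expr  # keep the original predicate when NULL literals appear
    terms = [Eq(a, b) for a, b in zip(lhs, rhs)]
    result = terms[0]
    for term in terms[1:]:
        result = And(result, term)
    return result


fields = Struct((Field("nested.col1"), Field("nested.col2"), Field("nested.col3")))
values = Struct((Lit("foo"), Lit("bar"), Lit("baz")))
rewritten = split_struct_equality(Eq(fields, values))
```

Here `rewritten` is a left-nested `And` of three `Eq` nodes, one per field, which is the shape that statistics-based pruning can consume.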

  was:
When filtering a Delta table on individual fields of a nested struct, Spark 
correctly performs file pruning based on per‑column statistics 
(min/max/nullCount). However, when the same filter is expressed as a struct 
literal comparison or a tuple comparison, Spark does not perform file pruning, 
even though these predicates are semantically equivalent and could be rewritten 
into conjunctive scalar predicates.

This results in unnecessary full scans and significantly worse performance for 
large Delta tables.
h2. *Reproduction*

Given a Delta table with a nested struct column:

{{CREATE TABLE t (}}
{{id STRING,}}
{{nested STRUCT<}}
{{col1: STRING,}}
{{col2: STRING,}}
{{col3: STRING}}
{{>}}
{{) USING DELTA;}}
h3. *Case 1 — Scalar predicates (file pruning works)*

{{SELECT *}}
{{FROM t}}
{{WHERE nested.col1 = 'foo'}}
{{AND nested.col2 = 'bar'}}
{{AND nested.col3 = 'baz';}}

Spark correctly prunes files based on column‑level statistics.
h3. *Case 2 — Struct literal comparison (no pruning)*

{{SELECT *}}
{{FROM t}}
{{WHERE nested = struct('foo', 'bar', 'baz');}}

This predicate is semantically equivalent to the conjunctive scalar predicates 
above, but Spark does *not* perform file pruning. The entire dataset is scanned.
h3. *Case 3 — Tuple comparison (no pruning)*

{{SELECT *}}
{{FROM t}}
{{WHERE (nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz');}}

Tuple comparisons are also not decomposed into scalar predicates, even though 
they are logically equivalent and safe to rewrite.
h2. *Expected Behavior*

Spark should be able to rewrite:

{{{{nested = struct('foo', 'bar', 'baz')}}}}

and:

{{(nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz')}}

into:

{{nested.col1 = 'foo'}}
{{AND nested.col2 = 'bar'}}
{{AND nested.col3 = 'baz'}}

at least when:
 * field names and order match
 * all fields are primitive types
 * the struct or tuple literal is deterministic

This rewrite would allow Delta Lake to use existing per‑column statistics for 
file pruning.
h2. *Actual Behavior*
 * Struct equality and tuple equality are treated as opaque predicates.
 * The optimizer does not decompose them into scalar comparisons.
 * Delta Lake cannot use per‑column statistics.
 * No file pruning occurs.
 * Performance degrades significantly for large tables.

h2. *Why This Matters*
 * Structs and tuples are common modeling patterns in Spark SQL.
 * Many APIs and UDFs return structs, making struct comparisons natural.
 * Users expect semantically equivalent predicates to have equivalent 
performance.
 * Other SQL engines rewrite tuple/row comparisons into conjunctive predicates.
 * The rewrite is deterministic and preserves semantics for primitive fields.

h2. *Proposed Improvement*

Introduce an optimizer rule that:
 # Detects struct equality or tuple equality where both sides are:

 * 
 ** deterministic expressions
 ** structs/tuples of identical length
 ** containing only primitive fields

 # Rewrites the comparison into a conjunction of field‑level comparisons.

This would enable file pruning without changing query semantics.


> Optimizer: No file pruning for struct literal comparisons or tuple 
> comparisons, although equivalent scalar predicates would allow pruning
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-56660
>                 URL: https://issues.apache.org/jira/browse/SPARK-56660
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, SQL
>    Affects Versions: 4.1.1
>         Environment: Spark 4.1.1
> DeltaTable 4.2
>            Reporter: Eyck Troschke
>            Priority: Major
>



