[
https://issues.apache.org/jira/browse/SPARK-56660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eyck Troschke updated SPARK-56660:
----------------------------------
Description:
When filtering a Delta table on individual fields of a nested struct, Spark
correctly performs file pruning based on per‑column statistics
(min/max/nullCount). However, when the same filter is expressed as a struct
literal comparison or a tuple comparison, Spark does not perform file pruning,
even though these predicates are semantically equivalent and could be rewritten
into conjunctive scalar predicates.
This results in unnecessary full scans and significantly worse performance for
large Delta tables.
h2. *Reproduction*
Given a Delta table with a nested struct column:
{{CREATE TABLE t (}}
{{id STRING,}}
{{nested STRUCT<}}
{{col1: STRING,}}
{{col2: STRING,}}
{{col3: STRING}}
{{>}}
{{) USING DELTA;}}
h3. *Case 1 — Scalar predicates (file pruning works)*
{{SELECT *}}
{{FROM t}}
{{WHERE nested.col1 = 'foo'}}
{{AND nested.col2 = 'bar'}}
{{AND nested.col3 = 'baz';}}
Spark correctly prunes files based on column‑level statistics.
h3. *Case 2 — Struct literal comparison (no pruning)*
{{SELECT *}}
{{FROM t}}
{{WHERE nested = struct('foo', 'bar', 'baz');}}
This predicate is semantically equivalent to the conjunctive scalar predicates
above, but Spark does *not* perform file pruning. The entire dataset is scanned.
h3. *Case 3 — Tuple comparison (no pruning)*
{{SELECT *}}
{{FROM t}}
{{WHERE (nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz');}}
Tuple comparisons are also not decomposed into scalar predicates, even though
they are logically equivalent and safe to rewrite.
h2. *Expected Behavior*
Spark should be able to rewrite:
{{nested = struct('foo', 'bar', 'baz')}}
and:
{{(nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz')}}
into:
{{nested.col1 = 'foo'}}
{{AND nested.col2 = 'bar'}}
{{AND nested.col3 = 'baz'}}
at least when:
* field names and order match
* all fields are primitive types
* the struct or tuple literal is deterministic
This rewrite would allow Delta Lake to use existing per‑column statistics for
file pruning.
h2. *Actual Behavior*
* Struct equality and tuple equality are treated as opaque predicates.
* The optimizer does not decompose them into scalar comparisons.
* Delta Lake cannot use per‑column statistics.
* No file pruning occurs.
* Performance degrades significantly for large tables.
h2. *Why This Matters*
* Structs and tuples are common modeling patterns in Spark SQL.
* Many APIs and UDFs return structs, making struct comparisons natural.
* Users expect semantically equivalent predicates to have equivalent
performance.
* Other SQL engines rewrite tuple/row comparisons into conjunctive predicates.
* The rewrite is deterministic and preserves semantics for primitive fields.
h2. *Proposed Improvement*
Introduce an optimizer rule that:
# Detects struct equality or tuple equality where both sides are:
*
** deterministic expressions
** structs/tuples of identical length
** containing only primitive fields
# Rewrites the comparison into a conjunction of field‑level comparisons.
This would enable file pruning without changing query semantics.
was:
When filtering a Delta table on individual fields of a nested struct, Spark
correctly performs file pruning based on per‑column statistics
(min/max/nullCount). However, when the same filter is expressed as a struct
literal comparison or a tuple comparison, Spark does not perform file pruning,
even though these predicates are semantically equivalent and could be rewritten
into conjunctive scalar predicates.
This results in unnecessary full scans and significantly worse performance for
large Delta tables.
h2. *Reproduction*
Given a Delta table with a nested struct column:
{{CREATE TABLE t (}}
{{id STRING,}}
{{nested STRUCT<}}
{{col1: STRING,}}
{{col2: STRING,}}
{{col3: STRING}}
{{>}}
{{) USING DELTA;}}
h3. *Case 1 — Scalar predicates (file pruning works)*
{{SELECT *}}
{{FROM t}}
{{WHERE nested.col1 = 'foo'}}
{{AND nested.col2 = 'bar'}}
{{AND nested.col3 = 'baz';}}
Spark correctly prunes files based on column‑level statistics.
h3. *Case 2 — Struct literal comparison (no pruning)*
{{SELECT *}}
{{FROM t}}
{{WHERE nested = struct('foo', 'bar', 'baz');}}
This predicate is semantically equivalent to the conjunctive scalar predicates
above, but Spark does *not* perform file pruning. The entire dataset is scanned.
h3. *Case 3 — Tuple comparison (no pruning)*
{{SELECT *}}
{{FROM t}}
{{WHERE (nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz');}}
Tuple comparisons are also not decomposed into scalar predicates, even though
they are logically equivalent and safe to rewrite.
h2. *Expected Behavior*
Spark should be able to rewrite:
{{{{nested = struct('foo', 'bar', 'baz')}}}}
and:
{{(nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz')}}
into:
{{nested.col1 = 'foo'}}
{{AND nested.col2 = 'bar'}}
{{AND nested.col3 = 'baz'}}
at least when:
* field names and order match
* all fields are primitive types
* the struct or tuple literal is deterministic
This rewrite would allow Delta Lake to use existing per‑column statistics for
file pruning.
h2. *Actual Behavior*
* Struct equality and tuple equality are treated as opaque predicates.
* The optimizer does not decompose them into scalar comparisons.
* Delta Lake cannot use per‑column statistics.
* No file pruning occurs.
* Performance degrades significantly for large tables.
h2. *Why This Matters*
* Structs and tuples are common modeling patterns in Spark SQL.
* Many APIs and UDFs return structs, making struct comparisons natural.
* Users expect semantically equivalent predicates to have equivalent
performance.
* Other SQL engines rewrite tuple/row comparisons into conjunctive predicates.
* The rewrite is deterministic and preserves semantics for primitive fields.
h2. *Proposed Improvement*
Introduce an optimizer rule that:
# Detects struct equality or tuple equality where both sides are:
*
** deterministic expressions
** structs/tuples of identical length
** containing only primitive fields
# Rewrites the comparison into a conjunction of field‑level comparisons.
This would enable file pruning without changing query semantics.
> Optimizer: No file pruning for struct literal comparisons or tuple
> comparisons, although equivalent scalar predicates would allow pruning
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-56660
> URL: https://issues.apache.org/jira/browse/SPARK-56660
> Project: Spark
> Issue Type: Improvement
> Components: Optimizer, SQL
> Affects Versions: 4.1.1
> Environment: Spark 4.1.1
> DeltaTable 4.2
> Reporter: Eyck Troschke
> Priority: Major
>
> When filtering a Delta table on individual fields of a nested struct, Spark
> correctly performs file pruning based on per‑column statistics
> (min/max/nullCount). However, when the same filter is expressed as a struct
> literal comparison or a tuple comparison, Spark does not perform file
> pruning, even though these predicates are semantically equivalent and could
> be rewritten into conjunctive scalar predicates.
> This results in unnecessary full scans and significantly worse performance
> for large Delta tables.
> h2. *Reproduction*
> Given a Delta table with a nested struct column:
> {{CREATE TABLE t (}}
> {{id STRING,}}
> {{nested STRUCT<}}
> {{col1: STRING,}}
> {{col2: STRING,}}
> {{col3: STRING}}
> {{>}}
> {{) USING DELTA;}}
> h3. *Case 1 — Scalar predicates (file pruning works)*
> {{SELECT *}}
> {{FROM t}}
> {{WHERE nested.col1 = 'foo'}}
> {{AND nested.col2 = 'bar'}}
> {{AND nested.col3 = 'baz';}}
> Spark correctly prunes files based on column‑level statistics.
> h3. *Case 2 — Struct literal comparison (no pruning)*
> {{SELECT *}}
> {{FROM t}}
> {{WHERE nested = struct('foo', 'bar', 'baz');}}
> This predicate is semantically equivalent to the conjunctive scalar
> predicates above, but Spark does *not* perform file pruning. The entire
> dataset is scanned.
> h3. *Case 3 — Tuple comparison (no pruning)*
> {{SELECT *}}
> {{FROM t}}
> {{WHERE (nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz');}}
> Tuple comparisons are also not decomposed into scalar predicates, even though
> they are logically equivalent and safe to rewrite.
> h2. *Expected Behavior*
> Spark should be able to rewrite:
> {{nested = struct('foo', 'bar', 'baz')}}
> and:
> {{(nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz')}}
> into:
> {{nested.col1 = 'foo'}}
> {{AND nested.col2 = 'bar'}}
> {{AND nested.col3 = 'baz'}}
> at least when:
> * field names and order match
> * all fields are primitive types
> * the struct or tuple literal is deterministic
> This rewrite would allow Delta Lake to use existing per‑column statistics for
> file pruning.
> h2. *Actual Behavior*
> * Struct equality and tuple equality are treated as opaque predicates.
> * The optimizer does not decompose them into scalar comparisons.
> * Delta Lake cannot use per‑column statistics.
> * No file pruning occurs.
> * Performance degrades significantly for large tables.
> h2. *Why This Matters*
> * Structs and tuples are common modeling patterns in Spark SQL.
> * Many APIs and UDFs return structs, making struct comparisons natural.
> * Users expect semantically equivalent predicates to have equivalent
> performance.
> * Other SQL engines rewrite tuple/row comparisons into conjunctive
> predicates.
> * The rewrite is deterministic and preserves semantics for primitive fields.
> h2. *Proposed Improvement*
> Introduce an optimizer rule that:
> # Detects struct equality or tuple equality where both sides are:
> *
> ** deterministic expressions
> ** structs/tuples of identical length
> ** containing only primitive fields
> # Rewrites the comparison into a conjunction of field‑level comparisons.
> This would enable file pruning without changing query semantics.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]