Juliusz Sompolski created SPARK-56685:
-----------------------------------------

             Summary: There should be a way for CTE referenced multiple times 
to be a guaranteed reused shuffle
                 Key: SPARK-56685
                 URL: https://issues.apache.org/jira/browse/SPARK-56685
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Juliusz Sompolski


A SQL CTE referenced in more than one position is currently either inlined 
(each reference re-executes the CTE body independently) or replaced by a shared 
repartition (whose subtree is still subject to per-reference rewrites in the 
optimizer). Neither path guarantees that the references see the same rows.

When the CTE body is non-deterministic (rand(), a streaming source, a table
with concurrent writers, or any expression that can produce different output
across executions), the references diverge. For example, the query

    WITH t AS (SELECT id, rand() AS r FROM ...)
    SELECT a.r, b.r FROM t a JOIN t b ON a.id = b.id

can produce rows where a.r != b.r, contradicting the natural user assumption
that two references to the same CTE name see the same data.
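
The divergence can be modeled without Spark. The following is a minimal
plain-Python sketch, not Spark code: cte_body is a hypothetical stand-in for
the CTE body, and the self-join calls it once per reference, the way an
inlined CTE would be re-executed.

```python
import random

def cte_body(ids, rng):
    """Hypothetical stand-in for: SELECT id, rand() AS r FROM ..."""
    return {i: rng.random() for i in ids}

def self_join_inlined(ids, rng):
    """Toy model of inlining: each reference re-executes the body."""
    a = cte_body(ids, rng)  # reference "a"
    b = cte_body(ids, rng)  # reference "b" draws fresh random values
    return [(a[i], b[i]) for i in ids]  # JOIN ON a.id = b.id

rows = self_join_inlined(range(5), random.Random(0))
# Rows where the two references disagree on r for the same id.
mismatches = [(ar, br) for ar, br in rows if ar != br]
```

In this model the join is correct on id, yet the two sides carry different
values of r, because the body ran twice.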

There is currently no way in Spark to mark a CTE so that it is guaranteed to
materialize exactly once and have every reference read from the same
materialized result. Such a mechanism should be available so that both
user-written CTEs and Catalyst rewrite rules that introduce CTEs (such as the
row-level operation rewrites) can rely on reused-shuffle semantics.
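
The requested semantics can be sketched as a toy model in plain Python. This
is not a Spark API, and MaterializedOnce is a hypothetical name used only for
illustration: the body runs exactly once, and every reference reads the stored
result.

```python
import random

class MaterializedOnce:
    """Hypothetical wrapper: evaluates the body once and reuses the result."""

    def __init__(self, body):
        self._body = body
        self._result = None
        self.executions = 0  # how many times the body actually ran

    def read(self):
        # Only the first reference triggers execution of the body.
        if self.executions == 0:
            self._result = self._body()
            self.executions += 1
        return self._result

rng = random.Random(0)
ids = range(5)
# Stand-in for: WITH t AS (SELECT id, rand() AS r FROM ...)
t = MaterializedOnce(lambda: {i: rng.random() for i in ids})

a = t.read()  # reference "a"
b = t.read()  # reference "b" reads the same materialized rows
rows = [(a[i], b[i]) for i in ids]  # JOIN ON a.id = b.id
```

Under this model a.r equals b.r for every id, which is the guarantee the
issue asks for.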

This could be used to make the two reads of the source in MERGE consistent; see
https://issues.apache.org/jira/browse/SPARK-56683


