Juliusz Sompolski created SPARK-56685:
-----------------------------------------
Summary: There should be a way for CTE referenced multiple times
to be a guaranteed reused shuffle
Key: SPARK-56685
URL: https://issues.apache.org/jira/browse/SPARK-56685
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.2.0
Reporter: Juliusz Sompolski

A SQL CTE referenced in more than one position is currently either inlined
(each reference re-executes the CTE body independently) or replaced by a shared
repartition (whose subtree is still subject to per-reference rewrites in the
optimizer). Neither path guarantees that the references see the same rows.

When the CTE body is non-deterministic (rand(), a streaming source, a table
with concurrent writers, or any expression that can produce different output
across executions), the references diverge. For example, the query

    WITH t AS (SELECT id, rand() AS r FROM ...)
    SELECT a.r, b.r FROM t a JOIN t b ON a.id = b.id

can produce rows where a.r != b.r, contradicting the natural assumption that
two references to the same CTE name see the same data.

There is currently no way in Spark to mark a CTE so that it is guaranteed to
materialize exactly once and have every reference read from the same
materialized result. Such a guarantee should be available so that both
user-written CTEs and Catalyst rewrite rules that introduce CTEs (such as the
row-level operation rewrites) can rely on reused-shuffle semantics.

This could be used to make the two reads of the source in MERGE consistent; see
https://issues.apache.org/jira/browse/SPARK-56683
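
As a rough illustration of the difference between the two behaviors, here is a
toy model in plain Python. It is not Spark code and does not reflect how
Catalyst plans the query; the names cte_body, self_join_inlined and
self_join_materialized are made up for this sketch. It only contrasts
"re-execute the body per reference" with "execute once, read twice".

```python
import random


def cte_body(ids, rng):
    """Stand-in for a non-deterministic CTE body: SELECT id, rand() AS r."""
    return {i: rng.random() for i in ids}


def self_join_inlined(ids, rng):
    """Model of an inlined CTE: each reference re-executes the body."""
    a = cte_body(ids, rng)  # first reference, t a
    b = cte_body(ids, rng)  # second reference, t b (fresh random values)
    return [(i, a[i], b[i]) for i in ids]


def self_join_materialized(ids, rng):
    """Model of the requested behavior: the body runs exactly once."""
    t = cte_body(ids, rng)  # single materialization shared by both references
    return [(i, t[i], t[i]) for i in ids]


rng = random.Random(0)  # fixed seed so that runs are repeatable
ids = list(range(5))

inlined = self_join_inlined(ids, rng)
materialized = self_join_materialized(ids, rng)

# Count joined rows where the two references disagree (a.r != b.r).
mismatches_inlined = sum(1 for _, ar, br in inlined if ar != br)
mismatches_materialized = sum(1 for _, ar, br in materialized if ar != br)

print("inlined mismatches:", mismatches_inlined)
print("materialized mismatches:", mismatches_materialized)
```

In this model the materialized variant can never report a mismatch, because
both sides of the join read the same dictionary, while the inlined variant
disagrees whenever the two executions draw different values.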