Juliusz Sompolski created SPARK-56683:
-----------------------------------------
Summary: MERGE INTO TABLE reads the source twice and the two reads
can disagree leading to data inconsistency
Key: SPARK-56683
URL: https://issues.apache.org/jira/browse/SPARK-56683
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.2.0
Reporter: Juliusz Sompolski
RewriteMergeIntoTable rewrites a MERGE INTO statement into a plan that
references the source query in two positions: once as the streamed input to the
join that pairs source rows with target rows, and once inside a subquery that
the rewrite uses to identify which rows or groups have matching source rows.
The two positions are independent reads of the same source. When the source is
non-deterministic — for example, a table with concurrent writers, a streaming
source, or a query containing expressions like rand() — the two reads can
observe different sets of rows. The MERGE result is then computed against an
inconsistent picture of the source: rows can be filtered in or out by the
subquery while the join sees a different set of rows, producing dropped,
duplicated, or wrongly-matched rows.
The two reads of the source need to be made consistent so that both positions
in the rewritten plan see the same source data, regardless of source
determinism.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]