On Thu, 28 Aug 2025 at 15:10, Yugo Nagata <[email protected]> wrote:
>
> Thank you for your suggestion and the test patch. The test looks good
> to me, so I’ve attached an updated patch including it.
Thanks, that mostly looks good to me.
There's one other place in ExecMergeMatched() that's using
TM_FailureData.ctid when it shouldn't, and that's this check:
if (ItemPointerIndicatesMovedPartitions(&context->tmfd.ctid))
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("tuple to be merged was already moved to
another partition due to concurrent update")));
I've updated that to use tupleid in the attached v3 patch, and added a
couple more isolation tests. In practice, however, I don't think that
error can ever happen because this check follows table_tuple_lock()
which has a similar test which will always error out first. I was
tempted to simply remove this check from ExecMergeMatched(), but I
think perhaps it's worth keeping just in case.
Also, I thought that it's worth updating the comments for
table_tuple_lock() and TM_FailureData to make it clearer that it
updates its tid argument, and which fields of TM_FailureData can be
relied upon in the different cases.
Regards,
Dean
From 3ba14584951d44aad683201330b063b7b3d4dc45 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <[email protected]>
Date: Thu, 4 Sep 2025 13:18:42 +0100
Subject: [PATCH v3] Fix concurrent update issue with MERGE.
When executing a MERGE UPDATE action, if more than one other session
performed a concurrent update of the target row, the lock-and-retry
code would sometimes incorrectly identify the latest version of the
target tuple, leading to incorrect results.
This was caused by using the ctid field from the TM_FailureData
returned by table_tuple_lock() in a case where the result was TM_Ok,
which is unsafe because the TM_FailureData struct is not guaranteed to
be fully populated in that case. Instead, it should use the tupleid
passed to (and updated by) table_tuple_lock().
To reduce the chances of similar errors in the future, improve the
commentary for table_tuple_lock() and TM_FailureData to make it
clearer that table_tuple_lock() updates the tid passed to it, and most
fields of TM_FailureData should not be relied on in non-failure cases.
An exception to this is the "traversed" field, which is set in both
success and failure cases.
Reported-by: Dmitry <[email protected]>
Author: Yugo Nagata <[email protected]>
Reviewed-by: Dean Rasheed <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/[email protected]
Backpatch-through: 15
---
src/backend/executor/nodeModifyTable.c | 9 +-
src/include/access/tableam.h | 15 +-
.../expected/merge-match-recheck.out | 215 +++++++++++++++++-
.../isolation/specs/merge-match-recheck.spec | 24 ++
4 files changed, 254 insertions(+), 9 deletions(-)
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index b0c4e2c0d32..4c5647ac38a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -3402,7 +3402,7 @@ lmerge_matched:
* the tuple moved, and setting our current
* resultRelInfo to that.
*/
- if (ItemPointerIndicatesMovedPartitions(&context->tmfd.ctid))
+ if (ItemPointerIndicatesMovedPartitions(tupleid))
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("tuple to be merged was already moved to another partition due to concurrent update")));
@@ -3450,12 +3450,13 @@ lmerge_matched:
if (ItemPointerIsValid(&lockedtid))
UnlockTuple(resultRelInfo->ri_RelationDesc, &lockedtid,
InplaceUpdateTupleLock);
- LockTuple(resultRelInfo->ri_RelationDesc, &context->tmfd.ctid,
+ LockTuple(resultRelInfo->ri_RelationDesc, tupleid,
InplaceUpdateTupleLock);
- lockedtid = context->tmfd.ctid;
+ lockedtid = *tupleid;
}
+
if (!table_tuple_fetch_row_version(resultRelationDesc,
- &context->tmfd.ctid,
+ tupleid,
SnapshotAny,
resultRelInfo->ri_oldTupleSlot))
elog(ERROR, "failed to fetch the target tuple");
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1c9e802a6b1..b2ce35e2a34 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -121,7 +121,9 @@ typedef enum TU_UpdateIndexes
/*
* When table_tuple_update, table_tuple_delete, or table_tuple_lock fail
* because the target tuple is already outdated, they fill in this struct to
- * provide information to the caller about what happened.
+ * provide information to the caller about what happened. When those functions
+ * succeed, the contents of this struct should not be relied upon, except for
+ * `traversed`, which may be set in both success and failure cases.
*
* ctid is the target's ctid link: it is the same as the target's TID if the
* target was deleted, or the location of the replacement tuple if the target
@@ -137,6 +139,9 @@ typedef enum TU_UpdateIndexes
* tuple); otherwise cmax is zero. (We make this restriction because
* HeapTupleHeaderGetCmax doesn't work for tuples outdated in other
* transactions.)
+ *
+ * traversed indicates if an update chain was followed in order to try to lock
+ * the target tuple. (This may be set in both success and failure cases.)
*/
typedef struct TM_FailureData
{
@@ -1508,7 +1513,7 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
*
* Input parameters:
* relation: relation containing tuple (caller must hold suitable lock)
- * tid: TID of tuple to lock
+ * tid: TID of tuple to lock (updated if an update chain was followed)
* snapshot: snapshot to use for visibility determinations
* cid: current command ID (used for visibility test, and stored into
* tuple's cmax if lock is successful)
@@ -1533,8 +1538,10 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
* TM_WouldBlock: lock couldn't be acquired and wait_policy is skip
*
* In the failure cases other than TM_Invisible and TM_Deleted, the routine
- * fills *tmfd with the tuple's t_ctid, t_xmax, and, if possible, t_cmax. See
- * comments for struct TM_FailureData for additional info.
+ * fills *tmfd with the tuple's t_ctid, t_xmax, and, if possible, t_cmax.
+ * Additionally, in both success and failure cases, tmfd->traversed is set if
+ * an update chain was followed. See comments for struct TM_FailureData for
+ * additional info.
*/
static inline TM_Result
table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
diff --git a/src/test/isolation/expected/merge-match-recheck.out b/src/test/isolation/expected/merge-match-recheck.out
index 90300f1db5a..dd0bc92b03c 100644
--- a/src/test/isolation/expected/merge-match-recheck.out
+++ b/src/test/isolation/expected/merge-match-recheck.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 3 sessions
starting permutation: update1 merge_status c2 select1 c1
step update1: UPDATE target t SET balance = balance + 10, val = t.val || ' updated by update1' WHERE t.key = 1;
@@ -271,6 +271,219 @@ key|balance|status|val
step c1: COMMIT;
+starting permutation: update1 merge_bal c2 select1 c1
+step update1: UPDATE target t SET balance = balance + 10, val = t.val || ' updated by update1' WHERE t.key = 1;
+step merge_bal:
+ MERGE INTO target t
+ USING (SELECT 1 as key) s
+ ON s.key = t.key
+ WHEN MATCHED AND balance < 100 THEN
+ UPDATE SET balance = balance * 2, val = t.val || ' when1'
+ WHEN MATCHED AND balance < 200 THEN
+ UPDATE SET balance = balance * 4, val = t.val || ' when2'
+ WHEN MATCHED AND balance < 300 THEN
+ UPDATE SET balance = balance * 8, val = t.val || ' when3';
+ <waiting ...>
+step c2: COMMIT;
+step merge_bal: <... completed>
+step select1: SELECT * FROM target;
+key|balance|status|val
+---+-------+------+------------------------------
+ 1| 680|s1 |setup updated by update1 when2
+(1 row)
+
+step c1: COMMIT;
+
+starting permutation: update1_pa merge_bal_pa c2 select1_pa c1
+step update1_pa: UPDATE target_pa t SET balance = balance + 10, val = t.val || ' updated by update1_pa' WHERE t.key = 1;
+step merge_bal_pa:
+ MERGE INTO target_pa t
+ USING (SELECT 1 as key) s
+ ON s.key = t.key
+ WHEN MATCHED AND balance < 100 THEN
+ UPDATE SET balance = balance * 2, val = t.val || ' when1'
+ WHEN MATCHED AND balance < 200 THEN
+ UPDATE SET balance = balance * 4, val = t.val || ' when2'
+ WHEN MATCHED AND balance < 300 THEN
+ UPDATE SET balance = balance * 8, val = t.val || ' when3';
+ <waiting ...>
+step c2: COMMIT;
+step merge_bal_pa: <... completed>
+step select1_pa: SELECT * FROM target_pa;
+key|balance|status|val
+---+-------+------+---------------------------------
+ 1| 680|s1 |setup updated by update1_pa when2
+(1 row)
+
+step c1: COMMIT;
+
+starting permutation: update1_tg merge_bal_tg c2 select1_tg c1
+s2: NOTICE: Update: (1,160,s1,setup) -> (1,170,s1,"setup updated by update1_tg")
+step update1_tg: UPDATE target_tg t SET balance = balance + 10, val = t.val || ' updated by update1_tg' WHERE t.key = 1;
+step merge_bal_tg:
+ WITH t AS (
+ MERGE INTO target_tg t
+ USING (SELECT 1 as key) s
+ ON s.key = t.key
+ WHEN MATCHED AND balance < 100 THEN
+ UPDATE SET balance = balance * 2, val = t.val || ' when1'
+ WHEN MATCHED AND balance < 200 THEN
+ UPDATE SET balance = balance * 4, val = t.val || ' when2'
+ WHEN MATCHED AND balance < 300 THEN
+ UPDATE SET balance = balance * 8, val = t.val || ' when3'
+ RETURNING t.*
+ )
+ SELECT * FROM t;
+ <waiting ...>
+step c2: COMMIT;
+s1: NOTICE: Update: (1,170,s1,"setup updated by update1_tg") -> (1,680,s1,"setup updated by update1_tg when2")
+step merge_bal_tg: <... completed>
+key|balance|status|val
+---+-------+------+---------------------------------
+ 1| 680|s1 |setup updated by update1_tg when2
+(1 row)
+
+step select1_tg: SELECT * FROM target_tg;
+key|balance|status|val
+---+-------+------+---------------------------------
+ 1| 680|s1 |setup updated by update1_tg when2
+(1 row)
+
+step c1: COMMIT;
+
+starting permutation: update1 b3 update1b merge_bal c2 c3 select1 c1
+step update1: UPDATE target t SET balance = balance + 10, val = t.val || ' updated by update1' WHERE t.key = 1;
+step b3: BEGIN ISOLATION LEVEL READ COMMITTED;
+step update1b: UPDATE target t SET balance = balance - 100, val = t.val || ' updated by update1b' WHERE t.key = 1; <waiting ...>
+step merge_bal:
+ MERGE INTO target t
+ USING (SELECT 1 as key) s
+ ON s.key = t.key
+ WHEN MATCHED AND balance < 100 THEN
+ UPDATE SET balance = balance * 2, val = t.val || ' when1'
+ WHEN MATCHED AND balance < 200 THEN
+ UPDATE SET balance = balance * 4, val = t.val || ' when2'
+ WHEN MATCHED AND balance < 300 THEN
+ UPDATE SET balance = balance * 8, val = t.val || ' when3';
+ <waiting ...>
+step c2: COMMIT;
+step update1b: <... completed>
+step c3: COMMIT;
+step merge_bal: <... completed>
+step select1: SELECT * FROM target;
+key|balance|status|val
+---+-------+------+--------------------------------------------------
+ 1| 140|s1 |setup updated by update1 updated by update1b when1
+(1 row)
+
+step c1: COMMIT;
+
+starting permutation: update1_pa b3 update1b_pa merge_bal_pa c2 c3 select1_pa c1
+step update1_pa: UPDATE target_pa t SET balance = balance + 10, val = t.val || ' updated by update1_pa' WHERE t.key = 1;
+step b3: BEGIN ISOLATION LEVEL READ COMMITTED;
+step update1b_pa: UPDATE target_pa t SET balance = balance - 100, val = t.val || ' updated by update1b_pa' WHERE t.key = 1; <waiting ...>
+step merge_bal_pa:
+ MERGE INTO target_pa t
+ USING (SELECT 1 as key) s
+ ON s.key = t.key
+ WHEN MATCHED AND balance < 100 THEN
+ UPDATE SET balance = balance * 2, val = t.val || ' when1'
+ WHEN MATCHED AND balance < 200 THEN
+ UPDATE SET balance = balance * 4, val = t.val || ' when2'
+ WHEN MATCHED AND balance < 300 THEN
+ UPDATE SET balance = balance * 8, val = t.val || ' when3';
+ <waiting ...>
+step c2: COMMIT;
+step update1b_pa: <... completed>
+step c3: COMMIT;
+step merge_bal_pa: <... completed>
+step select1_pa: SELECT * FROM target_pa;
+key|balance|status|val
+---+-------+------+--------------------------------------------------------
+ 1| 140|s1 |setup updated by update1_pa updated by update1b_pa when1
+(1 row)
+
+step c1: COMMIT;
+
+starting permutation: update1_tg b3 update1b_tg merge_bal_tg c2 c3 select1_tg c1
+s2: NOTICE: Update: (1,160,s1,setup) -> (1,170,s1,"setup updated by update1_tg")
+step update1_tg: UPDATE target_tg t SET balance = balance + 10, val = t.val || ' updated by update1_tg' WHERE t.key = 1;
+step b3: BEGIN ISOLATION LEVEL READ COMMITTED;
+step update1b_tg: UPDATE target_tg t SET balance = balance - 100, val = t.val || ' updated by update1b_tg' WHERE t.key = 1; <waiting ...>
+step merge_bal_tg:
+ WITH t AS (
+ MERGE INTO target_tg t
+ USING (SELECT 1 as key) s
+ ON s.key = t.key
+ WHEN MATCHED AND balance < 100 THEN
+ UPDATE SET balance = balance * 2, val = t.val || ' when1'
+ WHEN MATCHED AND balance < 200 THEN
+ UPDATE SET balance = balance * 4, val = t.val || ' when2'
+ WHEN MATCHED AND balance < 300 THEN
+ UPDATE SET balance = balance * 8, val = t.val || ' when3'
+ RETURNING t.*
+ )
+ SELECT * FROM t;
+ <waiting ...>
+step c2: COMMIT;
+s3: NOTICE: Update: (1,170,s1,"setup updated by update1_tg") -> (1,70,s1,"setup updated by update1_tg updated by update1b_tg")
+step update1b_tg: <... completed>
+step c3: COMMIT;
+s1: NOTICE: Update: (1,70,s1,"setup updated by update1_tg updated by update1b_tg") -> (1,140,s1,"setup updated by update1_tg updated by update1b_tg when1")
+step merge_bal_tg: <... completed>
+key|balance|status|val
+---+-------+------+--------------------------------------------------------
+ 1| 140|s1 |setup updated by update1_tg updated by update1b_tg when1
+(1 row)
+
+step select1_tg: SELECT * FROM target_tg;
+key|balance|status|val
+---+-------+------+--------------------------------------------------------
+ 1| 140|s1 |setup updated by update1_tg updated by update1b_tg when1
+(1 row)
+
+step c1: COMMIT;
+
+starting permutation: update1_pa_move merge_bal_pa c2 c1
+step update1_pa_move: UPDATE target_pa t SET balance = 210, val = t.val || ' updated by update1_pa_move' WHERE t.key = 1;
+step merge_bal_pa:
+ MERGE INTO target_pa t
+ USING (SELECT 1 as key) s
+ ON s.key = t.key
+ WHEN MATCHED AND balance < 100 THEN
+ UPDATE SET balance = balance * 2, val = t.val || ' when1'
+ WHEN MATCHED AND balance < 200 THEN
+ UPDATE SET balance = balance * 4, val = t.val || ' when2'
+ WHEN MATCHED AND balance < 300 THEN
+ UPDATE SET balance = balance * 8, val = t.val || ' when3';
+ <waiting ...>
+step c2: COMMIT;
+step merge_bal_pa: <... completed>
+ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step c1: COMMIT;
+
+starting permutation: update1_pa b3 update1b_pa_move merge_bal_pa c2 c3 c1
+step update1_pa: UPDATE target_pa t SET balance = balance + 10, val = t.val || ' updated by update1_pa' WHERE t.key = 1;
+step b3: BEGIN ISOLATION LEVEL READ COMMITTED;
+step update1b_pa_move: UPDATE target_pa t SET balance = 210, val = t.val || ' updated by update1b_pa_move' WHERE t.key = 1; <waiting ...>
+step merge_bal_pa:
+ MERGE INTO target_pa t
+ USING (SELECT 1 as key) s
+ ON s.key = t.key
+ WHEN MATCHED AND balance < 100 THEN
+ UPDATE SET balance = balance * 2, val = t.val || ' when1'
+ WHEN MATCHED AND balance < 200 THEN
+ UPDATE SET balance = balance * 4, val = t.val || ' when2'
+ WHEN MATCHED AND balance < 300 THEN
+ UPDATE SET balance = balance * 8, val = t.val || ' when3';
+ <waiting ...>
+step c2: COMMIT;
+step update1b_pa_move: <... completed>
+step c3: COMMIT;
+step merge_bal_pa: <... completed>
+ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step c1: COMMIT;
+
starting permutation: update1 merge_delete c2 select1 c1
step update1: UPDATE target t SET balance = balance + 10, val = t.val || ' updated by update1' WHERE t.key = 1;
step merge_delete:
diff --git a/src/test/isolation/specs/merge-match-recheck.spec b/src/test/isolation/specs/merge-match-recheck.spec
index 15226e40c9e..4a0a2e9b6bb 100644
--- a/src/test/isolation/specs/merge-match-recheck.spec
+++ b/src/test/isolation/specs/merge-match-recheck.spec
@@ -146,6 +146,8 @@ setup
BEGIN ISOLATION LEVEL READ COMMITTED;
}
step "update1" { UPDATE target t SET balance = balance + 10, val = t.val || ' updated by update1' WHERE t.key = 1; }
+step "update1_pa" { UPDATE target_pa t SET balance = balance + 10, val = t.val || ' updated by update1_pa' WHERE t.key = 1; }
+step "update1_pa_move" { UPDATE target_pa t SET balance = 210, val = t.val || ' updated by update1_pa_move' WHERE t.key = 1; }
step "update1_tg" { UPDATE target_tg t SET balance = balance + 10, val = t.val || ' updated by update1_tg' WHERE t.key = 1; }
step "update2" { UPDATE target t SET status = 's2', val = t.val || ' updated by update2' WHERE t.key = 1; }
step "update2_tg" { UPDATE target_tg t SET status = 's2', val = t.val || ' updated by update2_tg' WHERE t.key = 1; }
@@ -158,6 +160,14 @@ step "update_bal1_pa" { UPDATE target_pa t SET balance = 50, val = t.val || ' up
step "update_bal1_tg" { UPDATE target_tg t SET balance = 50, val = t.val || ' updated by update_bal1_tg' WHERE t.key = 1; }
step "c2" { COMMIT; }
+session "s3"
+step "b3" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "update1b" { UPDATE target t SET balance = balance - 100, val = t.val || ' updated by update1b' WHERE t.key = 1; }
+step "update1b_pa" { UPDATE target_pa t SET balance = balance - 100, val = t.val || ' updated by update1b_pa' WHERE t.key = 1; }
+step "update1b_pa_move" { UPDATE target_pa t SET balance = 210, val = t.val || ' updated by update1b_pa_move' WHERE t.key = 1; }
+step "update1b_tg" { UPDATE target_tg t SET balance = balance - 100, val = t.val || ' updated by update1b_tg' WHERE t.key = 1; }
+step "c3" { COMMIT; }
+
# merge_status sees concurrently updated row and rechecks WHEN conditions, but recheck passes and final status = 's2'
permutation "update1" "merge_status" "c2" "select1" "c1"
permutation "update1_tg" "merge_status_tg" "c2" "select1_tg" "c1"
@@ -179,6 +189,20 @@ permutation "update_bal1" "merge_bal" "c2" "select1" "c1"
permutation "update_bal1_pa" "merge_bal_pa" "c2" "select1_pa" "c1"
permutation "update_bal1_tg" "merge_bal_tg" "c2" "select1_tg" "c1"
+# merge_bal sees concurrently updated row and rechecks WHEN conditions, recheck passes, so final balance = 680
+permutation "update1" "merge_bal" "c2" "select1" "c1"
+permutation "update1_pa" "merge_bal_pa" "c2" "select1_pa" "c1"
+permutation "update1_tg" "merge_bal_tg" "c2" "select1_tg" "c1"
+
+# merge_bal sees row concurrently updated twice and rechecks WHEN conditions, recheck fails, so final balance = 140
+permutation "update1" "b3" "update1b" "merge_bal" "c2" "c3" "select1" "c1"
+permutation "update1_pa" "b3" "update1b_pa" "merge_bal_pa" "c2" "c3" "select1_pa" "c1"
+permutation "update1_tg" "b3" "update1b_tg" "merge_bal_tg" "c2" "c3" "select1_tg" "c1"
+
+# merge_bal sees concurrently updated row moved to new partition, so fails
+permutation "update1_pa_move" "merge_bal_pa" "c2" "c1"
+permutation "update1_pa" "b3" "update1b_pa_move" "merge_bal_pa" "c2" "c3" "c1"
+
# merge_delete sees concurrently updated row and rechecks WHEN conditions, but recheck passes and row is deleted
permutation "update1" "merge_delete" "c2" "select1" "c1"
permutation "update1_tg" "merge_delete_tg" "c2" "select1_tg" "c1"
--
2.51.0