Hi,
13.08.2023 00:00, Andres Freund wrote:
> Hi,
>
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
>> Thanks. I realised that it's easy enough to test that theory about
>> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
>> false randomly. Then the test occasionally fails as described. Seems
>> like we'll need to fix that test, but it's not evidence of a server
>> bug, and my signal handler refactoring patch is in the clear. Thanks
>> for testing it!
>
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
>
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.
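(For anyone who wants to repeat that experiment: the hack could look roughly
like the sketch below, placed at the top of ConditionalLockBufferForCleanup()
in src/backend/storage/buffer/bufmgr.c. This is my untested illustration, not
the exact change Thomas or Andres used; the pg_prng call and the 50% threshold
are placeholders -- raise the threshold to 1.0 to emulate the "fail 100%" case.)

/* test-only hack in src/backend/storage/buffer/bufmgr.c, not for commit */
#include "common/pg_prng.h"

bool
ConditionalLockBufferForCleanup(Buffer buffer)
{
    /*
     * Simulate cleanup-lock contention: randomly report that the cleanup
     * lock could not be acquired.  Set the threshold to 1.0 to fail always.
     */
    if (pg_prng_double(&pg_global_prng_state) < 0.5)
        return false;

    /* ... original function body unchanged ... */
}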
I happened to encounter another failure of 031_recovery_conflict, on a
rather slow ARMv7 machine:
t/031_recovery_conflict.pl ............ 13/? # poll_query_until timed out
executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
t/031_recovery_conflict.pl ............ 14/?
# Failed test 'startup deadlock: lock acquisition is waiting'
# at t/031_recovery_conflict.pl line 261.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 15.
t/031_recovery_conflict.pl ............ Dubious, test returned 255 (wstat
65280, 0xff00)
Failed 1/15 subtests
I've managed to reproduce it with the attached test patch, on a regular
x86_64 VM with the CPU slowed down to 2%, so that the modified test runs for
5+ minutes.
...
t/031_recovery_conflict.pl .. 102/? # iteration 31
t/031_recovery_conflict.pl .. 104/? # iteration 32
t/031_recovery_conflict.pl .. 106/?
<... waiting for something ...>
031_recovery_conflict_standby.log contains:
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement:
BEGIN;
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: DECLARE test_recovery_conflict_cursor
CURSOR FOR SELECT a FROM test_recovery_conflict_table1;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: FETCH FORWARD FROM
test_recovery_conflict_cursor;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement:
SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG: recovery still waiting after 10.217
ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT: WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0,
nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel
1663/16385/16489, blk 0
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl ERROR: canceling statement due to conflict with recovery
at character 15
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl DETAIL: User transaction caused buffer deadlock with
recovery.
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl STATEMENT:
SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG: recovery finished waiting after
10.382 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT: WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0,
nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel
1663/16385/16489, blk 0
2023-10-09 12:43:15.134 UTC [145527] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.647 UTC [145530] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.981 UTC [145532] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;
...
2023-10-09 12:51:24.729 UTC [149175] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;
2023-10-09 12:51:25.969 UTC [145514] FATAL: could not receive data from WAL stream: server closed the connection
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Indeed, t_031_recovery_conflict_standby_data/pgdata/pg_wal/
000000010000000000000003 contains:
...
rmgr: Heap len (rec/tot): 59/ 59, tx: 972, lsn: 0/035ADB18, prev 0/035ADAD8, desc: INSERT off: 100,
flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap len (rec/tot): 59/ 59, tx: 972, lsn: 0/035ADB58, prev 0/035ADB18, desc: INSERT off: 101,
flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 972, lsn: 0/035ADB98, prev 0/035ADB58, desc: ABORT 2023-10-09
12:43:13.797513 UTC
rmgr: Standby len (rec/tot): 42/ 42, tx: 973, lsn: 0/035ADBC0, prev 0/035ADB98, desc: LOCK xid 973 db
16385 rel 16389
rmgr: Transaction len (rec/tot): 178/ 178, tx: 973, lsn: 0/035ADBF0, prev 0/035ADBC0, desc: PREPARE gid
lock: 2023-10-09 12:43:13.798048 UTC
rmgr: Heap len (rec/tot): 59/ 59, tx: 974, lsn: 0/035ADCA8, prev 0/035ADBF0, desc: INSERT off: 102,
flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 974, lsn: 0/035ADCE8, prev 0/035ADCA8, desc: COMMIT
2023-10-09 12:43:13.798434 UTC
rmgr: Transaction len (rec/tot): 34/ 34, tx: 975, lsn: 0/035ADD10, prev 0/035ADCE8, desc: COMMIT
2023-10-09 12:43:13.900046 UTC
rmgr: Heap2 len (rec/tot): 255/ 255, tx: 0, lsn: 0/035ADD38, prev 0/035ADD10, desc: PRUNE
snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101],
unused: [], blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap2 len (rec/tot): 248/ 248, tx: 0, lsn: 0/035ADE38, prev 0/035ADD38, desc: VACUUM nunused:
100, unused: [2, 3, 4, ... , 99, 100, 101], blkref #0: rel 1663/16385/16489 blk 0
So it looks like "SELECT * FROM test_recovery_conflict_table2" in this
case was interrupted due to a conflict with replaying PRUNE, not VACUUM,
and as a result the query never entered the "waiting for lock" state.
(I also saw other failures of the test, but they probably have the same
root cause.)
Best regards,
Alexander
diff --git a/src/test/recovery/t/031_recovery_conflict.pl b/src/test/recovery/t/031_recovery_conflict.pl
index 7cafe2fd86..7f4e50c650 100644
--- a/src/test/recovery/t/031_recovery_conflict.pl
+++ b/src/test/recovery/t/031_recovery_conflict.pl
@@ -211,7 +211,6 @@ check_conflict_stat("tablespace");
## RECOVERY CONFLICT 5: Deadlock
$sect = "startup deadlock";
-$expected_conflicts++;
# Want to test recovery deadlock conflicts, not buffer pin conflicts. Without
# changing max_standby_streaming_delay it'd be timing dependent what we hit
@@ -223,6 +222,11 @@ $node_standby->adjust_conf(
$node_standby->restart();
$psql_standby->reconnect_and_clear();
+for (my $i = 1; $i <= 100; $i++) {
+diag("iteration $i");
+
+$expected_conflicts++;
+
# Generate a few dead rows, to later be cleaned up by vacuum. Then acquire a
# lock on another relation in a prepared xact, so it's held continuously by
# the startup process. The standby psql will block acquiring that lock while
@@ -230,6 +234,7 @@ $psql_standby->reconnect_and_clear();
$node_primary->safe_psql(
$test_db,
qq[
+DROP TABLE IF EXISTS $table1;
CREATE TABLE $table1(a int, b int);
INSERT INTO $table1 VALUES (1);
BEGIN;
@@ -271,10 +276,18 @@ $node_primary->wait_for_replay_catchup($node_standby);
check_conflict_log("User transaction caused buffer deadlock with recovery.");
$psql_standby->reconnect_and_clear();
-check_conflict_stat("deadlock");
+## check_conflict_stat("deadlock");
# clean up for next tests
$node_primary->safe_psql($test_db, qq[ROLLBACK PREPARED 'lock';]);
+
+$node_standby->restart();
+$psql_standby->reconnect_and_clear();
+
+}
+
+# clean up for next tests
+## $node_primary->safe_psql($test_db, qq[ROLLBACK PREPARED 'lock';]);
$node_standby->adjust_conf('postgresql.conf', 'max_standby_streaming_delay',
'50ms');
$node_standby->restart();