Hi,

13.08.2023 00:00, Andres Freund wrote:
> Hi,
>
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
>> Thanks.  I realised that it's easy enough to test that theory about
>> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
>> false randomly.  Then the test occasionally fails as described.  Seems
>> like we'll need to fix that test, but it's not evidence of a server
>> bug, and my signal handler refactoring patch is in the clear.  Thanks
>> for testing it!
>
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?

After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
passes even after I make ConditionalLockBufferForCleanup() fail 100%.
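
A couple of lines near the top of ConditionalLockBufferForCleanup() in
src/backend/storage/buffer/bufmgr.c are enough for that kind of failure
injection. A minimal sketch (assuming pg_prng_double()/pg_global_prng_state
from common/pg_prng.h are usable in that file; not necessarily the exact
hack used upthread):

	/*
	 * Testing hack: report the cleanup lock as unavailable for roughly
	 * half of all attempts, so that callers such as VACUUM exercise the
	 * paths where they cannot get it.  Comparing against 1.0 instead
	 * would make every attempt fail.
	 */
	if (pg_prng_double(&pg_global_prng_state) < 0.5)
		return false;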


I happened to encounter another failure of 031_recovery_conflict, on a
rather slow ARMv7 machine:
t/031_recovery_conflict.pl ............ 13/? # poll_query_until timed out executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
t/031_recovery_conflict.pl ............ 14/?
#   Failed test 'startup deadlock: lock acquisition is waiting'
#   at t/031_recovery_conflict.pl line 261.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 15.
t/031_recovery_conflict.pl ............ Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/15 subtests

I've managed to reproduce it with the attached test patch, on a regular
x86_64 VM with its CPU slowed down to 2%, so that the modified test runs for
5+ minutes.
...
t/031_recovery_conflict.pl .. 102/? # iteration 31
t/031_recovery_conflict.pl .. 104/? # iteration 32
t/031_recovery_conflict.pl .. 106/?
<... waiting for something ...>

031_recovery_conflict_standby.log contains:
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: BEGIN;
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT a FROM test_recovery_conflict_table1;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: FETCH FORWARD FROM test_recovery_conflict_cursor;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG:  recovery still waiting after 10.217 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT:  WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel 1663/16385/16489, blk 0
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl ERROR:  canceling statement due to conflict with recovery at character 15
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl DETAIL:  User transaction caused buffer deadlock with recovery.
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl STATEMENT:  SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG:  recovery finished waiting after 10.382 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT:  WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel 1663/16385/16489, blk 0
2023-10-09 12:43:15.134 UTC [145527] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.647 UTC [145530] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.981 UTC [145532] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
...
2023-10-09 12:51:24.729 UTC [149175] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2023-10-09 12:51:25.969 UTC [145514] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

Indeed, t_031_recovery_conflict_standby_data/pgdata/pg_wal/000000010000000000000003 contains:
...
rmgr: Heap        len (rec/tot):     59/    59, tx:        972, lsn: 0/035ADB18, prev 0/035ADAD8, desc: INSERT off: 100, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap        len (rec/tot):     59/    59, tx:        972, lsn: 0/035ADB58, prev 0/035ADB18, desc: INSERT off: 101, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        972, lsn: 0/035ADB98, prev 0/035ADB58, desc: ABORT 2023-10-09 12:43:13.797513 UTC
rmgr: Standby     len (rec/tot):     42/    42, tx:        973, lsn: 0/035ADBC0, prev 0/035ADB98, desc: LOCK xid 973 db 16385 rel 16389
rmgr: Transaction len (rec/tot):    178/   178, tx:        973, lsn: 0/035ADBF0, prev 0/035ADBC0, desc: PREPARE gid lock: 2023-10-09 12:43:13.798048 UTC
rmgr: Heap        len (rec/tot):     59/    59, tx:        974, lsn: 0/035ADCA8, prev 0/035ADBF0, desc: INSERT off: 102, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        974, lsn: 0/035ADCE8, prev 0/035ADCA8, desc: COMMIT 2023-10-09 12:43:13.798434 UTC
rmgr: Transaction len (rec/tot):     34/    34, tx:        975, lsn: 0/035ADD10, prev 0/035ADCE8, desc: COMMIT 2023-10-09 12:43:13.900046 UTC
rmgr: Heap2       len (rec/tot):    255/   255, tx:          0, lsn: 0/035ADD38, prev 0/035ADD10, desc: PRUNE snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: [], blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap2       len (rec/tot):    248/   248, tx:          0, lsn: 0/035ADE38, prev 0/035ADD38, desc: VACUUM nunused: 100, unused: [2, 3, 4, ... , 99, 100, 101], blkref #0: rel 1663/16385/16489 blk 0

So it looks like "SELECT * FROM test_recovery_conflict_table2" in this case
was interrupted due to a conflict with replaying PRUNE, not VACUUM, and as a
result the query never reached the "waiting for lock" state the test polls
for: it was canceled at 12:43:15.039, before the first pg_locks poll at
12:43:15.134. (I also saw other failures of this test, but they probably
have the same root cause.)

Best regards,
Alexander
diff --git a/src/test/recovery/t/031_recovery_conflict.pl b/src/test/recovery/t/031_recovery_conflict.pl
index 7cafe2fd86..7f4e50c650 100644
--- a/src/test/recovery/t/031_recovery_conflict.pl
+++ b/src/test/recovery/t/031_recovery_conflict.pl
@@ -211,7 +211,6 @@ check_conflict_stat("tablespace");
 
 ## RECOVERY CONFLICT 5: Deadlock
 $sect = "startup deadlock";
-$expected_conflicts++;
 
 # Want to test recovery deadlock conflicts, not buffer pin conflicts. Without
 # changing max_standby_streaming_delay it'd be timing dependent what we hit
@@ -223,6 +222,11 @@ $node_standby->adjust_conf(
 $node_standby->restart();
 $psql_standby->reconnect_and_clear();
 
+for (my $i = 1; $i <= 100; $i++)  {
+diag("iteration $i");
+
+$expected_conflicts++;
+
 # Generate a few dead rows, to later be cleaned up by vacuum. Then acquire a
 # lock on another relation in a prepared xact, so it's held continuously by
 # the startup process. The standby psql will block acquiring that lock while
@@ -230,6 +234,7 @@ $psql_standby->reconnect_and_clear();
 $node_primary->safe_psql(
 	$test_db,
 	qq[
+DROP TABLE IF EXISTS $table1;
 CREATE TABLE $table1(a int, b int);
 INSERT INTO $table1 VALUES (1);
 BEGIN;
@@ -271,10 +276,18 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 check_conflict_log("User transaction caused buffer deadlock with recovery.");
 $psql_standby->reconnect_and_clear();
-check_conflict_stat("deadlock");
+## check_conflict_stat("deadlock");
 
 # clean up for next tests
 $node_primary->safe_psql($test_db, qq[ROLLBACK PREPARED 'lock';]);
+
+$node_standby->restart();
+$psql_standby->reconnect_and_clear();
+
+}
+
+# clean up for next tests
+## $node_primary->safe_psql($test_db, qq[ROLLBACK PREPARED 'lock';]);
 $node_standby->adjust_conf('postgresql.conf', 'max_standby_streaming_delay',
 	'50ms');
 $node_standby->restart();
