Hi,

13.08.2023 00:00, Andres Freund wrote:
> Hi,
>
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
>> Thanks.  I realised that it's easy enough to test that theory about
>> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
>> false randomly.  Then the test occasionally fails as described.  Seems
>> like we'll need to fix that test, but it's not evidence of a server
>> bug, and my signal handler refactoring patch is in the clear.  Thanks
>> for testing it!
>
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?

After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
passes even after I make ConditionalLockBufferForCleanup() fail 100%.
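
A couple of lines near the top of ConditionalLockBufferForCleanup() in
src/backend/storage/buffer/bufmgr.c are enough for that kind of failure
injection. A minimal sketch (assuming pg_prng_double()/pg_global_prng_state
from common/pg_prng.h are usable in that file; not necessarily the exact
hack used upthread):

	/*
	 * Testing hack: report the cleanup lock as unavailable for roughly
	 * half of all attempts, so that callers such as VACUUM exercise the
	 * paths where they cannot get it.  Comparing against 1.0 instead
	 * would make every attempt fail.
	 */
	if (pg_prng_double(&pg_global_prng_state) < 0.5)
		return false;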


I happened to encounter another failure of 031_recovery_conflict, on a
rather slow ARMv7 machine:
t/031_recovery_conflict.pl ............ 13/? # poll_query_until timed out executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
t/031_recovery_conflict.pl ............ 14/?
#   Failed test 'startup deadlock: lock acquisition is waiting'
#   at t/031_recovery_conflict.pl line 261.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 15.
t/031_recovery_conflict.pl ............ Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/15 subtests

I've managed to reproduce it with the attached test patch, on a regular
x86_64 VM with its CPU slowed down to 2%, so that the modified test runs for
5+ minutes.
...
t/031_recovery_conflict.pl .. 102/? # iteration 31
t/031_recovery_conflict.pl .. 104/? # iteration 32
t/031_recovery_conflict.pl .. 106/?
<... waiting for something ...>

031_recovery_conflict_standby.log contains:
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: BEGIN;
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT a FROM test_recovery_conflict_table1;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: FETCH FORWARD FROM test_recovery_conflict_cursor;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG:  recovery still waiting after 10.217 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT:  WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel 1663/16385/16489, blk 0
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl ERROR:  canceling statement due to conflict with recovery at character 15
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl DETAIL:  User transaction caused buffer deadlock with recovery.
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl STATEMENT:  SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG:  recovery finished waiting after 10.382 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT:  WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel 1663/16385/16489, blk 0
2023-10-09 12:43:15.134 UTC [145527] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.647 UTC [145530] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.981 UTC [145532] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
...
2023-10-09 12:51:24.729 UTC [149175] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2023-10-09 12:51:25.969 UTC [145514] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

Indeed, t_031_recovery_conflict_standby_data/pgdata/pg_wal/000000010000000000000003 contains:
...
rmgr: Heap        len (rec/tot):     59/    59, tx:        972, lsn: 0/035ADB18, prev 0/035ADAD8, desc: INSERT off: 100, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap        len (rec/tot):     59/    59, tx:        972, lsn: 0/035ADB58, prev 0/035ADB18, desc: INSERT off: 101, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        972, lsn: 0/035ADB98, prev 0/035ADB58, desc: ABORT 2023-10-09 12:43:13.797513 UTC
rmgr: Standby     len (rec/tot):     42/    42, tx:        973, lsn: 0/035ADBC0, prev 0/035ADB98, desc: LOCK xid 973 db 16385 rel 16389
rmgr: Transaction len (rec/tot):    178/   178, tx:        973, lsn: 0/035ADBF0, prev 0/035ADBC0, desc: PREPARE gid lock: 2023-10-09 12:43:13.798048 UTC
rmgr: Heap        len (rec/tot):     59/    59, tx:        974, lsn: 0/035ADCA8, prev 0/035ADBF0, desc: INSERT off: 102, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        974, lsn: 0/035ADCE8, prev 0/035ADCA8, desc: COMMIT 2023-10-09 12:43:13.798434 UTC
rmgr: Transaction len (rec/tot):     34/    34, tx:        975, lsn: 0/035ADD10, prev 0/035ADCE8, desc: COMMIT 2023-10-09 12:43:13.900046 UTC
rmgr: Heap2       len (rec/tot):    255/   255, tx:          0, lsn: 0/035ADD38, prev 0/035ADD10, desc: PRUNE snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: [], blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap2       len (rec/tot):    248/   248, tx:          0, lsn: 0/035ADE38, prev 0/035ADD38, desc: VACUUM nunused: 100, unused: [2, 3, 4, ... , 99, 100, 101], blkref #0: rel 1663/16385/16489 blk 0

So it looks like "SELECT * FROM test_recovery_conflict_table2" in this case
was interrupted due to a conflict with replaying PRUNE, not VACUUM, and as a
result the query never reached the "waiting for lock" state the test polls
for: it was canceled at 12:43:15.039, before the first pg_locks poll at
12:43:15.134. (I also saw other failures of this test, but they probably
have the same root cause.)

Best regards,
Alexander
diff --git a/src/test/recovery/t/031_recovery_conflict.pl b/src/test/recovery/t/031_recovery_conflict.pl
index 7cafe2fd86..7f4e50c650 100644
--- a/src/test/recovery/t/031_recovery_conflict.pl
+++ b/src/test/recovery/t/031_recovery_conflict.pl
@@ -211,7 +211,6 @@ check_conflict_stat("tablespace");
 
 ## RECOVERY CONFLICT 5: Deadlock
 $sect = "startup deadlock";
-$expected_conflicts++;
 
 # Want to test recovery deadlock conflicts, not buffer pin conflicts. Without
 # changing max_standby_streaming_delay it'd be timing dependent what we hit
@@ -223,6 +222,11 @@ $node_standby->adjust_conf(
 $node_standby->restart();
 $psql_standby->reconnect_and_clear();
 
+for (my $i = 1; $i <= 100; $i++)  {
+diag("iteration $i");
+
+$expected_conflicts++;
+
 # Generate a few dead rows, to later be cleaned up by vacuum. Then acquire a
 # lock on another relation in a prepared xact, so it's held continuously by
 # the startup process. The standby psql will block acquiring that lock while
@@ -230,6 +234,7 @@ $psql_standby->reconnect_and_clear();
 $node_primary->safe_psql(
 	$test_db,
 	qq[
+DROP TABLE IF EXISTS $table1;
 CREATE TABLE $table1(a int, b int);
 INSERT INTO $table1 VALUES (1);
 BEGIN;
@@ -271,10 +276,18 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 check_conflict_log("User transaction caused buffer deadlock with recovery.");
 $psql_standby->reconnect_and_clear();
-check_conflict_stat("deadlock");
+## check_conflict_stat("deadlock");
 
 # clean up for next tests
 $node_primary->safe_psql($test_db, qq[ROLLBACK PREPARED 'lock';]);
+
+$node_standby->restart();
+$psql_standby->reconnect_and_clear();
+
+}
+
+# clean up for next tests
+## $node_primary->safe_psql($test_db, qq[ROLLBACK PREPARED 'lock';]);
 $node_standby->adjust_conf('postgresql.conf', 'max_standby_streaming_delay',
 	'50ms');
 $node_standby->restart();
