Re: failures in t/031_recovery_conflict.pl on CI

Andres Freund Fri, 29 Apr 2022 13:08:32 -0700

Hi,

Attached are patches for this issue.


The second patch adds a test case for deadlock conflicts to make sure that
case isn't broken. I also tested the recovery conflict tests in the back
branches, and they work there with a reasonably small set of changes.
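
The core of patch 1/2 is that the timeout handler now only sets a flag
(got_standby_delay_timeout), and the conflict signal is sent from
ResolveRecoveryConflictWithBufferPin() after the wait returns. A minimal
standalone sketch of that pattern, using plain signal()/raise() instead of
PostgreSQL's timeout machinery; every name in it is a hypothetical stand-in,
not the actual code:

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-in for got_standby_delay_timeout. */
static volatile sig_atomic_t got_delay_timeout = false;

/* Set only while the handler runs, to check the invariant below. */
static volatile sig_atomic_t in_handler = false;

/* Counts how often the conflict "signal" was sent; demonstration only. */
static int	conflicts_sent = 0;

/* The handler does nothing but record that the timeout fired. */
static void
delay_timeout_handler(int signo)
{
	(void) signo;
	in_handler = true;
	got_delay_timeout = true;
	in_handler = false;
}

/*
 * Stand-in for SendRecoveryConflictWithBufferPin(). The point of the
 * pattern is that this never runs from inside the signal handler.
 */
static void
send_conflict_signal(void)
{
	assert(!in_handler);
	conflicts_sent++;
}

/*
 * Stand-in for the code after ProcWaitForSignal(): act on the flag in
 * normal control flow, then reset it. Returns the total sent so far.
 */
static int
resolve_after_wait(void)
{
	if (got_delay_timeout)
		send_conflict_signal();
	got_delay_timeout = false;
	return conflicts_sent;
}

/* Simulate the timeout expiring by delivering the signal synchronously. */
static void
simulate_timeout(void)
{
	signal(SIGALRM, delay_timeout_handler);
	raise(SIGALRM);
}
```

Calling simulate_timeout() and then resolve_after_wait() sends exactly one
conflict; calling resolve_after_wait() again without a new timeout sends
nothing further.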

Questions:
- I'm planning to backpatch the test as 031_recovery_conflict.pl, even though
  preceding numbers are unused. It seems way more problematic to use a
  different number in the backbranches than have gaps?

- The test uses pump_until() and wait_for_log(), which don't exist in the
  backbranches. For now I've just inlined the implementation, but I guess we
  could also backpatch their introduction?

- There's a few incompatibilities in the test with older branches:
  - older branches don't have allow_in_place_tablespaces - I think just
    skipping tablespace conflicts is fine, they're comparatively
    simple.

    Eventually it might make sense to backpatch allow_in_place_tablespaces,
    our test coverage in the area is quite poor.

  - the stats tests can't easily be made reliable in the backbranches - which
    is ok, as the conflict itself is verified via the log

  - some branches don't have log_recovery_conflict_waits; since it's not
    critical to the test, it's ok to just not include it there

  I played with the idea of handling the differences using version comparisons
  in the code, and having the test be identical across branches. Since it's
  something we don't do so far, I'm leaning against it, but ...


> - For HEAD we have to replace the disable_all_timeouts() calls, it breaks the
>   replay progress reporting. Is there a reason to keep them in the
>   backbranches? Hard to see how an extension or something could rely on it,
>   but ...?

I've left it as is for now, will start a separate thread.


> - There's the following comment in ResolveRecoveryConflictWithBufferPin():
>
>   "We assume that only UnpinBuffer() and the timeout requests established
>    above can wake us up here."
>
>   That bogus afaict? There's plenty other things that can cause MyProc->latch
>   to be set. Is it worth doing something about this at the same time? Right
>   now we seem to call ResolveRecoveryConflictWithBufferPin() in rapid
>   succession initially.

The comment is more recent than I had realized. I raised this separately in
https://postgr.es/m/20220429191815.xewxjlpmq7mxhsr2%40alap3.anarazel.de


pgindent uses some crazy formatting nearby:
        SendRecoveryConflictWithBufferPin(
                                          PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

I'm tempted to clean that up in passing by having just one
SendRecoveryConflictWithBufferPin() call instead of two, storing the type of
conflict in a local variable? Doesn't look entirely pretty either, but ...
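
Roughly what I have in mind, as a standalone sketch rather than the real
function: the enum and helpers below are hypothetical stand-ins for the
PROCSIG_RECOVERY_CONFLICT_* reasons and SendRecoveryConflictWithBufferPin(),
and the flags are passed as arguments only to keep the example
self-contained.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the PROCSIG_RECOVERY_CONFLICT_* reasons. */
typedef enum
{
	CONFLICT_NONE,
	CONFLICT_BUFFERPIN,
	CONFLICT_STARTUP_DEADLOCK
} ConflictReason;

/* Records the last reason sent; demonstration only. */
static ConflictReason last_sent = CONFLICT_NONE;

/* Stand-in for SendRecoveryConflictWithBufferPin(). */
static void
send_conflict(ConflictReason reason)
{
	assert(reason != CONFLICT_NONE);
	last_sent = reason;
}

/*
 * Pick the conflict type first, then send once. The delay timeout takes
 * precedence over the deadlock timeout, matching the if / else if order
 * in the patch.
 */
static ConflictReason
resolve_conflict(bool got_delay_timeout, bool got_deadlock_timeout)
{
	ConflictReason reason = CONFLICT_NONE;

	if (got_delay_timeout)
		reason = CONFLICT_BUFFERPIN;
	else if (got_deadlock_timeout)
		reason = CONFLICT_STARTUP_DEADLOCK;

	/* Single call site, so no deeply indented argument line. */
	if (reason != CONFLICT_NONE)
		send_conflict(reason);

	return reason;
}
```

The cost is an extra local and a sentinel "none" value, which is the part
that doesn't look entirely pretty.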


I'm very doubtful of this claim above ResolveRecoveryConflictWithBufferPin(),
btw. But that'd be a non-backpatchable cleanup, I think:
 * The ProcWaitForSignal() sleep normally done in LockBufferForCleanup()
 * (when not InHotStandby) is performed here, for code clarity.


Greetings,

Andres Freund

From 08b71cceefabe48ca80b1b15752031b27d05229d Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 29 Apr 2022 12:50:10 -0700
Subject: [PATCH vHEAD 1/2] Fix possibility of self-deadlock in
 ResolveRecoveryConflictWithBufferPin().

Author:
Reviewed-By:
Discussion: https://postgr.es/m/20220413002626.udl7lll7f3o7n...@alap3.anarazel.de
Backpatch:
---
 src/backend/storage/ipc/standby.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 2850867323b..8c5e8432e73 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -46,6 +46,7 @@ static HTAB *RecoveryLockLists;
 
 /* Flags set by timeout handlers */
 static volatile sig_atomic_t got_standby_deadlock_timeout = false;
+static volatile sig_atomic_t got_standby_delay_timeout = false;
 static volatile sig_atomic_t got_standby_lock_timeout = false;
 
 static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
@@ -793,7 +794,8 @@ ResolveRecoveryConflictWithBufferPin(void)
 	}
 
 	/*
-	 * Wait to be signaled by UnpinBuffer().
+	 * Wait to be signaled by UnpinBuffer() or for the wait to be interrupted
+	 * by one of the timeouts established above.
 	 *
 	 * We assume that only UnpinBuffer() and the timeout requests established
 	 * above can wake us up here. WakeupRecovery() called by walreceiver or
@@ -802,7 +804,9 @@ ResolveRecoveryConflictWithBufferPin(void)
 	 */
 	ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
 
-	if (got_standby_deadlock_timeout)
+	if (got_standby_delay_timeout)
+		SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+	else if (got_standby_deadlock_timeout)
 	{
 		/*
 		 * Send out a request for hot-standby backends to check themselves for
@@ -828,6 +832,7 @@ ResolveRecoveryConflictWithBufferPin(void)
 	 * individually, but that'd be slower.
 	 */
 	disable_all_timeouts(false);
+	got_standby_delay_timeout = false;
 	got_standby_deadlock_timeout = false;
 }
 
@@ -887,8 +892,8 @@ CheckRecoveryConflictDeadlock(void)
  */
 
 /*
- * StandbyDeadLockHandler() will be called if STANDBY_DEADLOCK_TIMEOUT
- * occurs before STANDBY_TIMEOUT.
+ * StandbyDeadLockHandler() will be called if STANDBY_DEADLOCK_TIMEOUT is
+ * exceeded.
  */
 void
 StandbyDeadLockHandler(void)
@@ -898,16 +903,11 @@ StandbyDeadLockHandler(void)
 
 /*
  * StandbyTimeoutHandler() will be called if STANDBY_TIMEOUT is exceeded.
- * Send out a request to release conflicting buffer pins unconditionally,
- * so we can press ahead with applying changes in recovery.
  */
 void
 StandbyTimeoutHandler(void)
 {
-	/* forget any pending STANDBY_DEADLOCK_TIMEOUT request */
-	disable_timeout(STANDBY_DEADLOCK_TIMEOUT, false);
-
-	SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+	got_standby_delay_timeout = true;
 }
 
 /*
-- 
2.35.1.677.gabf474a5dd

From 3a945779e2313ce70a1e2b47998ff2e30ff7cc83 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 29 Apr 2022 12:51:00 -0700
Subject: [PATCH vHEAD 2/2] Add tests for recovery deadlock conflicts.

Author:
Reviewed-By:
Discussion: https://postgr.es/m/20220413002626.udl7lll7f3o7n...@alap3.anarazel.de
Backpatch:
---
 src/test/recovery/t/031_recovery_conflict.pl | 86 ++++++++++++++++++--
 1 file changed, 79 insertions(+), 7 deletions(-)

diff --git a/src/test/recovery/t/031_recovery_conflict.pl b/src/test/recovery/t/031_recovery_conflict.pl
index 83a8579dcfa..192d2d5a31d 100644
--- a/src/test/recovery/t/031_recovery_conflict.pl
+++ b/src/test/recovery/t/031_recovery_conflict.pl
@@ -4,8 +4,6 @@
 # recovery conflict is detected Also, test that statistics in
 # pg_stat_database_conflicts are populated correctly
 
-# TODO: add a test for deadlock recovery conflicts.
-
 use strict;
 use warnings;
 use PostgreSQL::Test::Cluster;
@@ -24,6 +22,9 @@ $node_primary->append_conf(
 allow_in_place_tablespaces = on
 log_temp_files = 0
 
+# for deadlock test
+max_prepared_transactions = 10
+
 # wait some to test the wait paths as well, but not long for obvious reasons
 max_standby_streaming_delay = 50ms
 
@@ -55,9 +56,13 @@ $node_primary->safe_psql('postgres', "CREATE DATABASE $test_db");
 
 # test schema / data
 my $table1 = "test_recovery_conflict_table1";
-$node_primary->safe_psql($test_db, qq[CREATE TABLE ${table1}(a int, b int);]);
-$node_primary->safe_psql($test_db,
-	qq[INSERT INTO $table1 SELECT i % 3, 0 FROM generate_series(1,20) i]);
+my $table2 = "test_recovery_conflict_table2";
+$node_primary->safe_psql(
+	$test_db, qq[
+CREATE TABLE ${table1}(a int, b int);
+INSERT INTO $table1 SELECT i % 3, 0 FROM generate_series(1,20) i;
+CREATE TABLE ${table2}(a int, b int);
+]);
 my $primary_lsn = $node_primary->lsn('flush');
 $node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
 
@@ -217,6 +222,67 @@ reconnect_and_clear();
 check_conflict_stat("tablespace");
 
 
+## RECOVERY CONFLICT 5: Deadlock
+$sect = "startup deadlock";
+$expected_conflicts++;
+
+# Generate a few dead rows, to later be cleaned up by vacuum. Then acquire a
+# lock on another relation in a prepared xact, so it's held continuously by
+# the startup process. The standby psql will block acquiring that lock while
+# holding a pin that vacuum needs, triggering the deadlock.
+$node_primary->safe_psql(
+	$test_db,
+	qq[
+CREATE TABLE $table1(a int, b int);
+INSERT INTO $table1 VALUES (1);
+BEGIN;
+INSERT INTO $table1(a) SELECT generate_series(1, 100) i;
+ROLLBACK;
+BEGIN;
+LOCK TABLE $table2;
+PREPARE TRANSACTION 'lock';
+INSERT INTO $table1(a) VALUES (170);
+SELECT txid_current();
+]);
+
+$primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
+
+$psql_standby{stdin} .= qq[
+    BEGIN;
+    -- hold pin
+    DECLARE $cursor1 CURSOR FOR SELECT a FROM $table1;
+    FETCH FORWARD FROM $cursor1;
+    -- wait for lock held by prepared transaction
+	SELECT * FROM $table2;
+    ];
+ok( pump_until(
+		$psql_standby{run},     $psql_timeout,
+		\$psql_standby{stdout}, qr/^1$/m,),
+	"$sect: cursor holding conflicting pin, also waiting for lock, established"
+);
+
+# just to make sure we're waiting for lock already
+ok( $node_standby->poll_query_until(
+		'postgres', qq[
+SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
+], 'waiting'),
+	"$sect: lock acquisition is waiting");
+
+# VACUUM will prune away rows, causing a buffer pin conflict, while standby
+# psql is waiting on lock
+$node_primary->safe_psql($test_db, qq[VACUUM $table1;]);
+$primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
+
+check_conflict_log("User transaction caused buffer deadlock with recovery.");
+reconnect_and_clear();
+check_conflict_stat("deadlock");
+
+# clean up for next tests
+$node_primary->safe_psql($test_db, qq[ROLLBACK PREPARED 'lock';]);
+
+
 # Check that expected number of conflicts show in pg_stat_database. Needs to
 # be tested before database is dropped, for obvious reasons.
 is( $node_standby->safe_psql(
@@ -226,7 +292,7 @@ is( $node_standby->safe_psql(
 	qq[$expected_conflicts recovery conflicts shown in pg_stat_database]);
 
 
-## RECOVERY CONFLICT 5: Database conflict
+## RECOVERY CONFLICT 6: Database conflict
 $sect = "database conflict";
 
 $node_primary->safe_psql('postgres', qq[DROP DATABASE $test_db;]);
@@ -259,7 +325,13 @@ sub pump_until_standby
 
 sub reconnect_and_clear
 {
-	$psql_standby{stdin} .= "\\q\n";
+	# If psql isn't dead already, tell it to quit as \q, when already dead,
+	# causes IPC::Run to unhelpfully error out with "ack Broken pipe:".
+	$psql_standby{run}->pump_nb();
+	if ($psql_standby{run}->pumpable())
+	{
+		$psql_standby{stdin} .= "\\q\n";
+	}
 	$psql_standby{run}->finish;
 
 	# restart
-- 
2.35.1.677.gabf474a5dd

From 030e4816fe2033ccd9e8488479cd5d2c17a1d6b9 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 29 Apr 2022 12:50:10 -0700
Subject: [PATCH vREL_10_STABLE 1/2] Fix possibility of self-deadlock in
 ResolveRecoveryConflictWithBufferPin().

Author:
Reviewed-By:
Discussion: https://postgr.es/m/20220413002626.udl7lll7f3o7n...@alap3.anarazel.de
Backpatch:
---
 src/backend/storage/ipc/standby.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 863cb641007..47cdaead9e6 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -44,6 +44,7 @@ static HTAB *RecoveryLockLists;
 
 /* Flags set by timeout handlers */
 static volatile sig_atomic_t got_standby_deadlock_timeout = false;
+static volatile sig_atomic_t got_standby_delay_timeout = false;
 static volatile sig_atomic_t got_standby_lock_timeout = false;
 
 static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
@@ -593,10 +594,15 @@ ResolveRecoveryConflictWithBufferPin(void)
 		enable_timeouts(timeouts, cnt);
 	}
 
-	/* Wait to be signaled by UnpinBuffer() */
+	/*
+	 * Wait to be signaled by UnpinBuffer() or for the wait to be interrupted
+	 * by one of the timeouts established above.
+	 */
 	ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
 
-	if (got_standby_deadlock_timeout)
+	if (got_standby_delay_timeout)
+		SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+	else if (got_standby_deadlock_timeout)
 	{
 		/*
 		 * Send out a request for hot-standby backends to check themselves for
@@ -622,6 +628,7 @@ ResolveRecoveryConflictWithBufferPin(void)
 	 * individually, but that'd be slower.
 	 */
 	disable_all_timeouts(false);
+	got_standby_delay_timeout = false;
 	got_standby_deadlock_timeout = false;
 }
 
@@ -681,8 +688,8 @@ CheckRecoveryConflictDeadlock(void)
  */
 
 /*
- * StandbyDeadLockHandler() will be called if STANDBY_DEADLOCK_TIMEOUT
- * occurs before STANDBY_TIMEOUT.
+ * StandbyDeadLockHandler() will be called if STANDBY_DEADLOCK_TIMEOUT is
+ * exceeded.
  */
 void
 StandbyDeadLockHandler(void)
@@ -692,16 +699,11 @@ StandbyDeadLockHandler(void)
 
 /*
  * StandbyTimeoutHandler() will be called if STANDBY_TIMEOUT is exceeded.
- * Send out a request to release conflicting buffer pins unconditionally,
- * so we can press ahead with applying changes in recovery.
  */
 void
 StandbyTimeoutHandler(void)
 {
-	/* forget any pending STANDBY_DEADLOCK_TIMEOUT request */
-	disable_timeout(STANDBY_DEADLOCK_TIMEOUT, false);
-
-	SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+	got_standby_delay_timeout = true;
 }
 
 /*
-- 
2.35.1.677.gabf474a5dd

From ce61a8301aefd71a7ac3005ce73c777ef4e78a7e Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 29 Apr 2022 12:54:02 -0700
Subject: [PATCH vREL_10_STABLE 2/2] Backpatch 031_recovery_conflict.pl.

Author:
Reviewed-By:
Discussion: https://postgr.es/m/20220413002626.udl7lll7f3o7n...@alap3.anarazel.de
Backpatch:
---
 src/test/recovery/t/031_recovery_conflict.pl | 417 +++++++++++++++++++
 1 file changed, 417 insertions(+)
 create mode 100644 src/test/recovery/t/031_recovery_conflict.pl

diff --git a/src/test/recovery/t/031_recovery_conflict.pl b/src/test/recovery/t/031_recovery_conflict.pl
new file mode 100644
index 00000000000..3a624f65b3b
--- /dev/null
+++ b/src/test/recovery/t/031_recovery_conflict.pl
@@ -0,0 +1,417 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test that connections to a hot standby are correctly canceled when a
+# recovery conflict is detected Also, test that statistics in
+# pg_stat_database_conflicts are populated correctly
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Carp;
+use Time::HiRes qw(usleep);
+
+
+# Set up nodes
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+
+my $tablespace1 = "test_recovery_conflict_tblspc";
+
+$node_primary->append_conf(
+	'postgresql.conf', qq[
+
+# Doesn't currently exist pre 15, but might be backpatched later
+#allow_in_place_tablespaces = on
+#temp_tablespaces = $tablespace1
+
+log_temp_files = 0
+
+# for deadlock test
+max_prepared_transactions = 10
+
+# wait some to test the wait paths as well, but not long for obvious reasons
+max_standby_streaming_delay = 50ms
+
+# Some of the recovery conflict logging code only gets exercised after
+# deadlock_timeout. The test doesn't rely on that additional output, but it's
+# nice to get some minimal coverage of that code.
+#log_recovery_conflict_waits = on # Doesn't exist < 14.
+deadlock_timeout = 10ms
+]);
+$node_primary->start;
+
+my $backup_name = 'my_backup';
+
+# See allow_in_place_tablespaces comment above
+#$node_primary->safe_psql('postgres',
+#	qq[CREATE TABLESPACE $tablespace1 LOCATION '']);
+
+$node_primary->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+
+$node_standby->start;
+
+my $test_db = "test_db";
+
+# use a new database, to trigger database recovery conflict
+$node_primary->safe_psql('postgres', "CREATE DATABASE $test_db");
+
+# test schema / data
+my $table1 = "test_recovery_conflict_table1";
+my $table2 = "test_recovery_conflict_table2";
+$node_primary->safe_psql($test_db, qq[
+CREATE TABLE ${table1}(a int, b int);
+INSERT INTO $table1 SELECT i % 3, 0 FROM generate_series(1,20) i;
+CREATE TABLE ${table2}(a int, b int);
+]);
+my $primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
+
+
+# a longrunning psql that we can use to trigger conflicts
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+my %psql_standby = ('stdin' => '', 'stdout' => '');
+$psql_standby{run} =
+  $node_standby->background_psql($test_db, \$psql_standby{stdin},
+	\$psql_standby{stdout},
+	$psql_timeout);
+$psql_standby{stdout} = '';
+
+my $expected_conflicts = 0;
+
+
+## RECOVERY CONFLICT 1: Buffer pin conflict
+my $sect = "buffer pin conflict";
+$expected_conflicts++;
+
+# Aborted INSERT on primary that will be cleaned up by vacuum. Has to be old
+# enough so that there's not a snapshot conflict before the buffer pin
+# conflict.
+
+$node_primary->safe_psql(
+	$test_db,
+	qq[
+	BEGIN;
+	INSERT INTO $table1 VALUES (1,0);
+	ROLLBACK;
+	-- ensure flush, rollback doesn't do so
+	BEGIN; LOCK $table1; COMMIT;
+	]);
+
+$primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
+
+my $cursor1 = "test_recovery_conflict_cursor";
+
+# DECLARE and use a cursor on standby, causing buffer with the only block of
+# the relation to be pinned on the standby
+$psql_standby{stdin} .= qq[
+        BEGIN;
+        DECLARE $cursor1 CURSOR FOR SELECT b FROM $table1;
+        FETCH FORWARD FROM $cursor1;
+        ];
+# FETCH FORWARD should have returned a 0 since all values of b in the table
+# are 0
+ok(pump_until_standby(qr/^0$/m),
+	"$sect: cursor with conflicting pin established");
+
+# to check the log starting now for recovery conflict messages
+my $log_location = -s $node_standby->logfile;
+
+# VACUUM on the primary
+$node_primary->safe_psql($test_db, qq[VACUUM $table1;]);
+
+# Wait for catchup. Existing connection will be terminated before replay is
+# finished, so waiting for catchup ensures that there is no race between
+# encountering the recovery conflict which causes the disconnect and checking
+# the logfile for the terminated connection.
+$primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
+
+check_conflict_log("User was holding shared buffer pin for too long");
+reconnect_and_clear();
+check_conflict_stat("bufferpin");
+
+
+## RECOVERY CONFLICT 2: Snapshot conflict
+$sect = "snapshot conflict";
+$expected_conflicts++;
+
+$node_primary->safe_psql($test_db,
+	qq[INSERT INTO $table1 SELECT i, 0 FROM generate_series(1,20) i]);
+$primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
+
+# DECLARE and FETCH from cursor on the standby
+$psql_standby{stdin} .= qq[
+        BEGIN;
+        DECLARE $cursor1 CURSOR FOR SELECT b FROM $table1;
+        FETCH FORWARD FROM $cursor1;
+        ];
+ok( pump_until(
+		$psql_standby{run},     $psql_timeout,
+		\$psql_standby{stdout}, qr/^0$/m,),
+	"$sect: cursor with conflicting snapshot established");
+
+# Do some HOT updates
+$node_primary->safe_psql($test_db,
+	qq[UPDATE $table1 SET a = a + 1 WHERE a > 2;]);
+
+# VACUUM, pruning those dead tuples
+$node_primary->safe_psql($test_db, qq[VACUUM $table1;]);
+
+# Wait for attempted replay of PRUNE records
+$primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
+
+check_conflict_log(
+	"User query might have needed to see row versions that must be removed");
+reconnect_and_clear();
+check_conflict_stat("snapshot");
+
+
+## RECOVERY CONFLICT 3: Lock conflict
+$sect = "lock conflict";
+$expected_conflicts++;
+
+# acquire lock to conflict with
+$psql_standby{stdin} .= qq[
+        BEGIN;
+        LOCK TABLE $table1 IN ACCESS SHARE MODE;
+        SELECT 1;
+        ];
+ok(pump_until_standby(qr/^1$/m), "$sect: conflicting lock acquired");
+
+# DROP TABLE containing block which standby has in a pinned buffer
+$node_primary->safe_psql($test_db, qq[DROP TABLE $table1;]);
+
+$primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
+
+check_conflict_log("User was holding a relation lock for too long");
+reconnect_and_clear();
+check_conflict_stat("lock");
+
+
+# See allow_in_place_tablespaces comment above
+### RECOVERY CONFLICT 4: Tablespace conflict
+#$sect = "tablespace conflict";
+#$expected_conflicts++;
+#
+## DECLARE a cursor for a query which, with sufficiently low work_mem, will
+## spill tuples into temp files in the temporary tablespace created during
+## setup.
+#$psql_standby{stdin} .= qq[
+#        BEGIN;
+#        SET work_mem = '64kB';
+#        DECLARE $cursor1 CURSOR FOR
+#          SELECT count(*) FROM generate_series(1,6000);
+#        FETCH FORWARD FROM $cursor1;
+#        ];
+#ok(pump_until_standby(qr/^6000$/m),
+#	"$sect: cursor with conflicting temp file established");
+#
+## Drop the tablespace currently containing spill files for the query on the
+## standby
+#$node_primary->safe_psql($test_db, qq[DROP TABLESPACE $tablespace1;]);
+#
+#$primary_lsn = $node_primary->lsn('flush');
+#$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
+#
+#check_conflict_log(
+#	"User was or might have been using tablespace that must be dropped");
+#reconnect_and_clear();
+#check_conflict_stat("tablespace");
+
+
+## RECOVERY CONFLICT 5: Deadlock
+$sect = "startup deadlock";
+$expected_conflicts++;
+
+# Generate a few dead rows, to later be cleaned up by vacuum. Then acquire a
+# lock on another relation in a prepared xact, so it's held continuously by
+# the startup process. The standby psql will block acquiring that lock while
+# holding a pin that vacuum needs, triggering the deadlock.
+$node_primary->safe_psql(
+	$test_db,
+	qq[
+CREATE TABLE $table1(a int, b int);
+INSERT INTO $table1 VALUES (1);
+BEGIN;
+INSERT INTO $table1(a) SELECT generate_series(1, 100) i;
+ROLLBACK;
+BEGIN;
+LOCK TABLE $table2;
+PREPARE TRANSACTION 'lock';
+INSERT INTO $table1(a) VALUES (170);
+SELECT txid_current();
+]);
+
+$primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
+
+$psql_standby{stdin} .= qq[
+    BEGIN;
+    -- hold pin
+    DECLARE $cursor1 CURSOR FOR SELECT a FROM $table1;
+    FETCH FORWARD FROM $cursor1;
+    -- wait for lock held by prepared transaction
+	SELECT * FROM $table2;
+    ];
+ok( pump_until(
+		$psql_standby{run},     $psql_timeout,
+		\$psql_standby{stdout}, qr/^1$/m,),
+	"$sect: cursor holding conflicting pin, also waiting for lock, established"
+);
+
+# just to make sure we're waiting for lock already
+ok( $node_standby->poll_query_until(
+		'postgres', qq[
+SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
+], 'waiting'),
+	"$sect: lock acquisition is waiting");
+
+# VACUUM will prune away rows, causing a buffer pin conflict, while standby
+# psql is waiting on lock
+$node_primary->safe_psql($test_db, qq[VACUUM $table1;]);
+$primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
+
+check_conflict_log("User transaction caused buffer deadlock with recovery.");
+reconnect_and_clear();
+check_conflict_stat("deadlock");
+
+# clean up for next tests
+$node_primary->safe_psql($test_db, qq[ROLLBACK PREPARED 'lock';]);
+
+
+# Check that expected number of conflicts show in pg_stat_database. Needs to
+# be tested before database is dropped, for obvious reasons.
+is( $node_standby->safe_psql(
+		$test_db,
+		qq[SELECT conflicts FROM pg_stat_database WHERE datname='$test_db';]),
+	$expected_conflicts,
+	qq[$expected_conflicts recovery conflicts shown in pg_stat_database]);
+
+
+## RECOVERY CONFLICT 6: Database conflict
+$sect = "database conflict";
+
+$node_primary->safe_psql('postgres', qq[DROP DATABASE $test_db;]);
+
+$primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_standby, 'replay', $primary_lsn);
+
+check_conflict_log("User was connected to a database that must be dropped");
+
+
+# explicitly shut down psql instances gracefully - to avoid hangs or worse on
+# windows
+$psql_standby{stdin} .= "\\q\n";
+$psql_standby{run}->finish;
+
+$node_standby->stop();
+$node_primary->stop();
+
+
+done_testing();
+
+
+sub pump_until_standby
+{
+	my $match = shift;
+
+	return pump_until($psql_standby{run}, $psql_timeout,
+		\$psql_standby{stdout}, $match);
+}
+
+sub reconnect_and_clear
+{
+	# if psql isn't dead already, tell it to quit (\q when already dead
+	# causes IPC::Run to, unhelpfully, error out with "ack Broken pipe:")
+	$psql_standby{run}->pump_nb();
+	if ($psql_standby{run}->pumpable())
+	{
+		$psql_standby{stdin} .= "\\q\n";
+	}
+	$psql_standby{run}->finish;
+
+	# restart
+	$psql_standby{run}->run();
+	$psql_standby{stdin}  = '';
+	$psql_standby{stdout} = '';
+
+	# Run query to ensure connection has finished re-establishing
+	$psql_standby{stdin} .= qq[SELECT 1;\n];
+	die unless pump_until_standby(qr/^1$/m);
+	$psql_standby{stdout} = '';
+}
+
+sub check_conflict_log
+{
+	my $message          = shift;
+	my $old_log_location = $log_location;
+
+	$log_location = wait_for_log($node_standby, qr/$message/, $log_location);
+
+	cmp_ok($log_location, '>', $old_log_location,
+		"$sect: logfile contains terminated connection due to recovery conflict"
+	);
+}
+
+sub check_conflict_stat
+{
+	# Stats can't easily be checked before 15, requires waiting for stats to
+	# be reported to stats collector and then those messages need to be
+	# processed. Dealt with here to reduce intra-branch difference in the
+	# tests.
+}
+
+sub pump_until
+{
+	my ($proc, $timeout, $stream, $until) = @_;
+	$proc->pump_nb();
+	while (1)
+	{
+		last if $$stream =~ /$until/;
+		if ($timeout->is_expired)
+		{
+			diag("pump_until: timeout expired when searching for \"$until\" with stream: \"$$stream\"");
+			return 0;
+		}
+		if (not $proc->pumpable())
+		{
+			diag("pump_until: process terminated unexpectedly when searching for \"$until\" with stream: \"$$stream\"");
+			return 0;
+		}
+		$proc->pump();
+	}
+	return 1;
+}
+
+sub wait_for_log
+{
+	my ($self, $regexp, $offset) = @_;
+	$offset = 0 unless defined $offset;
+
+	my $max_attempts = 10 * $TestLib::timeout_default;
+	my $attempts     = 0;
+
+	while ($attempts < $max_attempts)
+	{
+		my $log = TestLib::slurp_file($self->logfile, $offset);
+
+		return $offset+length($log) if ($log =~ m/$regexp/);
+
+		# Wait 0.1 second before retrying.
+		usleep(100_000);
+
+		$attempts++;
+	}
+
+	croak "timed out waiting for match: $regexp";
+}
-- 
2.35.1.677.gabf474a5dd
