Hi,

On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
> Hello,
> 
> 15.01.2024 12:00, Alexander Lakhin wrote:
> > If this approach looks promising to you, maybe we could add a submodule to
> > perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
> > 019_replslot_limit) as well.
> > 
> > Personally I think that having such a functionality for using in tests
> > might be useful not only to avoid some "problematic" behaviour but also to
> > test the opposite cases.
> 
> After spending a few days on it, I've discovered two more issues:
> https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com
> 
> (The latter is not related to bgwriter directly, but it was discovered
> thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)
> 
> So it becomes clear that the 035 test is not the only one, which might
> suffer from bgwriter's activity,

Yeah... thanks for sharing!

> and inventing a way to stop bgwriter/
> control it is a different subject, getting out of scope of the current
> issue.

Agree.

> 15.01.2024 11:49, Bertrand Drouvot wrote:
> > We did a few things in this thread, so to sum up what we've discovered:
> > 
> > - a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
> > - we need to launch the vacuum(s) only if we are sure we got a newer XID 
> > horizon
> > ( proposal in in v6 attached)
> > - we need a way to control how frequent xl_running_xacts are emmitted (to 
> > ensure
> > they are not triggered in a middle of an active slot invalidation test).
> > 
> > I'm not sure it's possible to address Tom's concern and keep the test 
> > "predictable".
> > 
> > So, I think I'd vote for Michael's proposal to implement a 
> > superuser-settable
> > developer GUC (as sending a SIGSTOP on the bgwriter (and bypass 
> > $windows_os) would
> > still not address Tom's concern anyway).
> > 
> > Another option would be to "sacrifice" the full predictablity of the test 
> > (in
> > favor of real-world behavior testing)?
> > 
> > [1]: 
> > https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal
> 
> So, now we have the test 035 failing due to nondeterministic vacuum
> activity in the first place, and due to bgwriter's activity in the second.

Yeah, that's also my understanding.

> Maybe it would be a right move to commit the fix, and then think about
> more rare failures.

+1

> Though I have a couple of question regarding the fix left, if you don't
> mind:
> 1) The test has minor defects in the comments, that I noted before [1];
> would you like to fix them in passing?
> 
> > BTW, it looks like the comment:
> > # One way to produce recovery conflict is to create/drop a relation and
> > # launch a vacuum on pg_class with hot_standby_feedback turned off on the 
> > standby.
> > in scenario 3 is a copy-paste error.

Nice catch, thanks! Fixed in v7 attached.

> > Also, there are two "Scenario 4" in this test.

D'oh! Fixed in v7.

> > 
> 
> 2) Shall we align the 035_standby_logical_decoding with
> 031_recovery_conflict in regard to improving stability of vacuum?

Yeah, I think that could make sense.

> I see the following options for this:
> a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
> b) use FREEZE and autovacuum = off in both tests;
> c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
>  autovacuum = off in both.
>

I'd vote for a) as I've the feeling it's "easier" to understand (and I'm not
sure using FREEZE would give full "stabilization predictability", at least for
035_standby_logical_decoding.pl). That said I did not test what the outcome 
would
be for 031_recovery_conflict.pl by making use of a).

> I've re-tested the v6 patch today and confirmed that it makes the test
> more stable. I ran (with bgwriter_delay = 10000 in temp.config) 20 tests in
> parallel and got failures ('inactiveslot slot invalidation is logged with
> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
> (With unlimited CPU, when average test duration is around 70 seconds.)
> 
> But with v6 applied, 60 iterations succeeded.

Nice! Thanks for the testing!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From b45f0c55d17cddfac7c99d11000b819c8b09fa56 Mon Sep 17 00:00:00 2001
From: bdrouvot <bdrou...@gmail.com>
Date: Tue, 9 Jan 2024 05:08:35 +0000
Subject: [PATCH v7] Fix 035_standby_logical_decoding.pl race condition

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation on
the standby.

While at it, also fixing some typos/bad test description in it.
---
 .../t/035_standby_logical_decoding.pl         | 68 ++++++++++---------
 1 file changed, 35 insertions(+), 33 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..94406ad960 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -238,6 +238,25 @@ sub check_for_invalidation
 	) or die "Timed out waiting confl_active_logicalslot to be updated";
 }
 
+# Launch $sql and wait for a new snapshot that has a newer horizon before
+# doing the vacuum with $vac_option on $to_vac.
+sub wait_until_vacuum_can_remove
+{
+	my ($vac_option, $sql, $to_vac) = @_;
+
+	# get the current xid horizon
+	my $xid_horizon = $node_primary->safe_psql('testdb', qq[select pg_snapshot_xmin(pg_current_snapshot());]);
+
+	# launch our sql
+	$node_primary->safe_psql('testdb', qq[$sql]);
+
+	$node_primary->poll_query_until('testdb',
+		"SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0")
+	  or die "new snapshot does not have a newer horizon";
+
+	$node_primary->safe_psql('testdb', qq[VACUUM $vac_option verbose $to_vac;
+										  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal;]);
+}
 ########################
 # Initialize primary node
 ########################
@@ -248,6 +267,7 @@ $node_primary->append_conf(
 wal_level = 'logical'
 max_replication_slots = 4
 max_wal_senders = 4
+autovacuum = off
 });
 $node_primary->dump_info;
 $node_primary->start;
@@ -468,13 +488,8 @@ reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 	0, 1);
 
 # This should trigger the conflict
-$node_primary->safe_psql(
-	'testdb', qq[
-  CREATE TABLE conflict_test(x integer, y text);
-  DROP TABLE conflict_test;
-  VACUUM full pg_class;
-  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
-]);
+wait_until_vacuum_can_remove('full', 'CREATE TABLE conflict_test(x integer, y text);
+								 DROP TABLE conflict_test;', 'pg_class');
 
 $node_primary->wait_for_replay_catchup($node_standby);
 
@@ -550,13 +565,8 @@ reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 'row_removal_',
 	0, 1);
 
 # This should trigger the conflict
-$node_primary->safe_psql(
-	'testdb', qq[
-  CREATE TABLE conflict_test(x integer, y text);
-  DROP TABLE conflict_test;
-  VACUUM pg_class;
-  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
-]);
+wait_until_vacuum_can_remove('', 'CREATE TABLE conflict_test(x integer, y text);
+							 DROP TABLE conflict_test;', 'pg_class');
 
 $node_primary->wait_for_replay_catchup($node_standby);
 
@@ -582,19 +592,15 @@ check_pg_recvlogical_stderr($handle,
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
-# One way to produce recovery conflict is to create/drop a relation and
-# launch a vacuum on pg_class with hot_standby_feedback turned off on the standby.
+# One way to produce recovery conflict on a shared catalog table is to create/drop
+# a role and launch a vacuum on pg_authid with hot_standby_feedback turned off on
+# the standby.
 reactive_slots_change_hfs_and_wait_for_xmins('row_removal_',
 	'shared_row_removal_', 0, 1);
 
 # Trigger the conflict
-$node_primary->safe_psql(
-	'testdb', qq[
-  CREATE ROLE create_trash;
-  DROP ROLE create_trash;
-  VACUUM pg_authid;
-  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
-]);
+wait_until_vacuum_can_remove('', 'CREATE ROLE create_trash;
+							 DROP ROLE create_trash;', 'pg_authid');
 
 $node_primary->wait_for_replay_catchup($node_standby);
 
@@ -625,14 +631,10 @@ reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
 	'no_conflict_', 0, 1);
 
 # This should not trigger a conflict
-$node_primary->safe_psql(
-	'testdb', qq[
-  CREATE TABLE conflict_test(x integer, y text);
-  INSERT INTO conflict_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;
-  UPDATE conflict_test set x=1, y=1;
-  VACUUM conflict_test;
-  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
-]);
+wait_until_vacuum_can_remove('', 'CREATE TABLE conflict_test(x integer, y text);
+							 INSERT INTO conflict_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;
+							 UPDATE conflict_test set x=1, y=1;', 'conflict_test');
+
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # message should not be issued
@@ -671,7 +673,7 @@ $node_standby->restart;
 
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
-# Scenario 4: conflict due to on-access pruning.
+# Scenario 5: conflict due to on-access pruning.
 ##################################################
 
 # get the position to search from in the standby logfile
@@ -711,7 +713,7 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
-# Scenario 5: incorrect wal_level on primary.
+# Scenario 6: incorrect wal_level on primary.
 ##################################################
 
 # get the position to search from in the standby logfile
-- 
2.34.1

Reply via email to