On Wed, Jun 12, 2019 at 01:42:01PM -0400, Alvaro Herrera wrote:
> I looked at the buildfarm failures for the recoveryCheck stage.  It
> looks like there is only one failure for branch master after this
> commit, which was chipmunk saying:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-05-12%2020%3A37%3A11
> AFAICS this is wholly unrelated to the problem at hand.

Yes, that's unrelated.

This failure is interesting, still it has happened only once per what
I can see.  I think that this points out a rare race condition in test
007_sync_rep.pl because of this sequence which reorders the standbys:
# Stop and start standbys to rearrange the order of standbys
# in WalSnd array. Now, if standbys have the same priority,
# standby2 is selected preferentially and standby3 is next.
$node_standby_1->stop;
$node_standby_2->stop;
$node_standby_3->stop;

$node_standby_2->start;
$node_standby_3->start;

The failure actually indicates that even if standby3 has started
after standby2, standby3 has registered back into the WAL sender array
of the primary before standby2 had the occasion to do it, but we have
no guarantee that the ordering is actually right.  So this has messed
up with the expected set of sync standbys when these are not directly
listed.  I think that this can happen as well when initializing the
standbys at the top of the test, but the window is much, much
narrower and basically impossible to reach.

Using a combo of safe_psql('checkpoint') + wait_for_catchup() makes
the test faster, but that's much more costly than using just
poll_query_until() on pg_stat_replication to make sure that each
standby is registered on the primary's WAL sender array.  In short,
something like the attached should improve the stability of the test.

Thoughts?
--
Michael
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index bba47da17a..653a6bc842 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -27,6 +27,23 @@ sub test_sync_state
 	return;
 }
 
+# Start a standby and check that it is registered within the WAL sender
+# array of the given primary.  This polls the primary's pg_stat_replication
+# until the standby is confirmed as registered.
+sub start_standby_and_wait
+{
+	my ($master, $standby) = @_;
+	my $master_name  = $master->name;
+	my $standby_name = $standby->name;
+	my $query =
+	  "SELECT count(1) = 1 FROM pg_stat_replication WHERE application_name = '$standby_name'";
+
+	$standby->start;
+
+	print "### Waiting for standby \"$standby_name\" on \"$master_name\"\n";
+	$master->poll_query_until('postgres', $query);
+}
+
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
@@ -36,23 +53,27 @@ my $backup_name = 'master_backup';
 # Take backup
 $node_master->backup($backup_name);
 
+# Create all the standbys.  Their status on the primary is checked to
+# ensure the ordering of each one of them in the WAL sender array of the
+# primary.
+
 # Create standby1 linking to master
 my $node_standby_1 = get_new_node('standby1');
 $node_standby_1->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
-$node_standby_1->start;
+start_standby_and_wait($node_master, $node_standby_1);
 
 # Create standby2 linking to master
 my $node_standby_2 = get_new_node('standby2');
 $node_standby_2->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
-$node_standby_2->start;
+start_standby_and_wait($node_master, $node_standby_2);
 
 # Create standby3 linking to master
 my $node_standby_3 = get_new_node('standby3');
 $node_standby_3->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
-$node_standby_3->start;
+start_standby_and_wait($node_master, $node_standby_3);
 
 # Check that sync_state is determined correctly when
 # synchronous_standby_names is specified in old syntax.
@@ -82,8 +103,10 @@ $node_standby_1->stop;
 $node_standby_2->stop;
 $node_standby_3->stop;
 
-$node_standby_2->start;
-$node_standby_3->start;
+# Make sure that each standby reports back to the primary in
+# the wanted order.
+start_standby_and_wait($node_master, $node_standby_2);
+start_standby_and_wait($node_master, $node_standby_3);
 
 # Specify 2 as the number of sync standbys.
 # Check that two standbys are in 'sync' state.
@@ -94,7 +117,7 @@ standby3|3|sync),
 	'2(standby1,standby2,standby3)');
 
 # Start standby1
-$node_standby_1->start;
+start_standby_and_wait($node_master, $node_standby_1);
 
 # Create standby4 linking to master
 my $node_standby_4 = get_new_node('standby4');

Attachment: signature.asc
Description: PGP signature

Reply via email to