On Wed, Jun 12, 2019 at 01:42:01PM -0400, Alvaro Herrera wrote: > I looked at the buildfarm failures for the recoveryCheck stage. It > looks like there is only one failure for branch master after this > commit, which was chipmunk saying: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-05-12%2020%3A37%3A11 > AFAICS this is wholly unrelated to the problem at hand.
Yes, that's unrelated. This failure is interesting, still it has happened only once per what I can see. I think that this points out a rare race condition in test 007_sync_rep.pl because of this sequence which reorders the standbys: # Stop and start standbys to rearrange the order of standbys # in WalSnd array. Now, if standbys have the same priority, # standby2 is selected preferentially and standby3 is next. $node_standby_1->stop; $node_standby_2->stop; $node_standby_3->stop; $node_standby_2->start; $node_standby_3->start; The failure actually indicates that even if standby3 has started after standby2, standby3 has registered back into the WAL sender array of the primary before standby2 had the occasion to do it, but we have no guarantee that the ordering is actually right. So this has messed up with the expected set of sync standbys when these are not directly listed. I think that this can happen as well when initializing the standbys at the top of the test, but the window is much, much narrower and basically impossible to reach. Using a combo of safe_psql('checkpoint') + wait_for_catchup() makes the test faster, but that's much more costly than using just poll_query_until() on pg_stat_replication to make sure that each standby is registered on the primary's WAL sender array. In short, something like the attached should improve the stability of the test. Thoughts? -- Michael
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl index bba47da17a..653a6bc842 100644 --- a/src/test/recovery/t/007_sync_rep.pl +++ b/src/test/recovery/t/007_sync_rep.pl @@ -27,6 +27,23 @@ sub test_sync_state return; } +# Start a standby and check that it is registered within the WAL sender +# array of the given primary. This polls the primary's pg_stat_replication +# until the standby is confirmed as registered. +sub start_standby_and_wait +{ + my ($master, $standby) = @_; + my $master_name = $master->name; + my $standby_name = $standby->name; + my $query = + "SELECT count(1) = 1 FROM pg_stat_replication WHERE application_name = '$standby_name'"; + + $standby->start; + + print "### Waiting for standby \"$standby_name\" on \"$master_name\"\n"; + $master->poll_query_until('postgres', $query); +} + # Initialize master node my $node_master = get_new_node('master'); $node_master->init(allows_streaming => 1); @@ -36,23 +53,27 @@ my $backup_name = 'master_backup'; # Take backup $node_master->backup($backup_name); +# Create all the standbys. Their status on the primary is checked to +# ensure the ordering of each one of them in the WAL sender array of the +# primary. + # Create standby1 linking to master my $node_standby_1 = get_new_node('standby1'); $node_standby_1->init_from_backup($node_master, $backup_name, has_streaming => 1); -$node_standby_1->start; +start_standby_and_wait($node_master, $node_standby_1); # Create standby2 linking to master my $node_standby_2 = get_new_node('standby2'); $node_standby_2->init_from_backup($node_master, $backup_name, has_streaming => 1); -$node_standby_2->start; +start_standby_and_wait($node_master, $node_standby_2); # Create standby3 linking to master my $node_standby_3 = get_new_node('standby3'); $node_standby_3->init_from_backup($node_master, $backup_name, has_streaming => 1); -$node_standby_3->start; +start_standby_and_wait($node_master, $node_standby_3); # Check that sync_state is determined correctly when # synchronous_standby_names is specified in old syntax. @@ -82,8 +103,10 @@ $node_standby_1->stop; $node_standby_2->stop; $node_standby_3->stop; -$node_standby_2->start; -$node_standby_3->start; +# Make sure that each standby reports back to the primary in +# the wanted order. +start_standby_and_wait($node_master, $node_standby_2); +start_standby_and_wait($node_master, $node_standby_3); # Specify 2 as the number of sync standbys. # Check that two standbys are in 'sync' state. @@ -94,7 +117,7 @@ standby3|3|sync), '2(standby1,standby2,standby3)'); # Start standby1 -$node_standby_1->start; +start_standby_and_wait($node_master, $node_standby_1); # Create standby4 linking to master my $node_standby_4 = get_new_node('standby4');
signature.asc
Description: PGP signature