Hi all,

The TAP tests of pg_rewind use hardcoded sleeps in two places to wait for events to complete. In HEAD, those are:
1) Wait 1s for the standby to catch up.
2) Wait 2s for the promotion of the standby.

However, after discussion with a colleague, we noticed that those values may not be enough in slow environments: up to 10s is sometimes needed after promotion for the tests to pass. More generally, the current approach is not reliable because it depends on how quickly the environment can handle standby replay and promotion (I expect issues with that on Windows, btw, and on low-spec machines).

Attached is a patch that improves this sleep logic as follows:
1) To ensure that the standby has caught up, check the replay position on the standby and compare it with the current WAL position of the master.
2) To ensure that promotion is effective, use pg_is_in_recovery() and keep polling until we are sure that the standby is out of recovery.

In each case the patch makes a maximum of 30 attempts, waiting 1s before each attempt, so environments get a margin of up to 30s to handle replay and promotion properly.

Note that this patch adds a small routine called command_result to TestLib.pm, which returns the result of the command together with its stdout and stderr. That's really handy, and I am planning to use something like that in the replication test suite as well.

Regards,
--
Michael

diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index e6a5b9b..86e9def 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -183,7 +183,7 @@ sub create_standby
 	# Base backup is taken with xlog files included
 	system_or_bail("pg_basebackup -D $test_standby_datadir -p $port_master -x >>$log_path 2>&1");
 	append_to_file("$test_standby_datadir/recovery.conf", qq(
-primary_conninfo='$connstr_master'
+primary_conninfo='$connstr_master application_name=rewind_standby'
 standby_mode=on
 recovery_target_timeline='latest'
 ));
@@ -191,8 +191,29 @@ recovery_target_timeline='latest'
 	# Start standby
 	system_or_bail("pg_ctl -w -D $test_standby_datadir -o \"-k $tempdir_short --listen-addresses='' -p $port_standby\" start >>$log_path 2>&1");
 
-	# sleep a bit to make sure the standby has caught up.
-	sleep 1;
+	# Wait until the standby has caught up with the primary by comparing
+	# WAL positions on both nodes. Note that this is fine
+	my $max_attempts = 30;
+	my $attempts = 0;
+	while ($attempts < $max_attempts)
+	{
+		# Wait a bit before proceeding.
+		sleep 1;
+		$attempts++;
+
+		my $query = "SELECT pg_current_xlog_location() = replay_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';";
+		my $cmd = ['psql', '-At', '-c', "$query", '-d', "$connstr_master" ];
+		my ($res, $stdout, $stderr) = command_result($cmd);
+		chomp($stdout);
+		if ($stdout eq "t")
+		{
+			last;
+		}
+	}
+	if ($attempts == $max_attempts)
+	{
+		die "Maximum number of attempts reached when waiting for standby to catch up";
+	}
 }
 
 sub promote_standby
@@ -201,9 +222,31 @@ sub promote_standby
 	# up standby
 
 	# Now promote slave and insert some new data on master, this will put
-	# the master out-of-sync with the standby.
+	# the master out-of-sync with the standby. Be sure that we leave here
+	# with a standby actually ready for the next operations.
 	system_or_bail("pg_ctl -w -D $test_standby_datadir promote >>$log_path 2>&1");
-	sleep 2;
+	my $max_attempts = 30;
+	my $attempts = 0;
+	while ($attempts < $max_attempts)
+	{
+		# Wait a bit before proceeding, promotion may have not taken effect
+		# in such a short time.
+		sleep 1;
+		$attempts++;
+
+		my $query = "SELECT pg_is_in_recovery()";
+		my $cmd = ['psql', '-At', '-c', "$query", '-d', "$connstr_standby" ];
+		my ($res, $stdout, $stderr) = command_result($cmd);
+		chomp($stdout);
+		if ($stdout eq "f")
+		{
+			last;
+		}
+	}
+	if ($attempts == $max_attempts)
+	{
+		die "Maximum number of attempts reached when waiting for promotion of standby";
+	}
 }
 
 sub run_pg_rewind
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 003cd9a..3c6a49e 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -16,6 +16,7 @@ our @EXPORT = qw(
   command_ok
   command_fails
   command_exit_is
+  command_result
   program_help_ok
   program_version_ok
   program_options_handling_ok
@@ -161,6 +162,14 @@ sub command_exit_is
 	is($h->result(0), $expected, $test_name);
 }
 
+sub command_result
+{
+	my ($cmd) = @_;
+	my ($stdout, $stderr);
+	my $result = run $cmd, '>', \$stdout, '2>', \$stderr;
+	return ($result, $stdout, $stderr);
+}
+
 sub program_help_ok
 {
 	my ($cmd) = @_;
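
The two retry loops in the patch above have the same shape, so they could probably share one helper. What follows is only a rough sketch of that idea, not part of the patch: poll_until is a hypothetical name, and the psql call is replaced by a stand-in closure so that the snippet runs on its own without a server. It takes the outcome from the return value of the check rather than from the attempt counter. As far as I can tell, a check like "$attempts == $max_attempts" after the loop would also die when the very last attempt succeeds, and the sketch avoids that.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the patch: call $check up to
# $max_attempts times, sleeping $delay seconds before each call.
# Returns 1 as soon as $check returns true, 0 if every attempt failed.
sub poll_until
{
	my ($check, $max_attempts, $delay) = @_;

	foreach my $attempt (1 .. $max_attempts)
	{
		sleep $delay if $delay > 0;
		return 1 if $check->($attempt);
	}
	return 0;
}

# Stand-in for a psql query such as "SELECT pg_is_in_recovery()"
# (an assumption for the demo): it answers "t" for the first two
# calls and "f" afterwards, as a node leaving recovery would.
my $calls = 0;
my $fake_query = sub { return ++$calls < 3 ? "t" : "f"; };

# Delay of 0 so that the demo finishes immediately; the patch waits 1s.
my $promoted = poll_until(sub { $fake_query->() eq "f" }, 30, 0);
print "promoted=$promoted after $calls calls\n";
```

In the tests themselves the closure would wrap command_result and compare the chomped stdout with "t" or "f", and the caller would die with its own message when poll_until returns 0.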

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers