[HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

Noah Misch Wed, 02 Sep 2015 23:26:23 -0700

My AIX buildfarm members have failed the BinInstallCheck step on and off since
inception.  It became more frequent when I added animals sungazer and tern
alongside the older hornet and mandrill.  The animals share a machine with
each other and with dozens of other developers.  I setpriority() the animals
to the lowest available priority, so they probably lose the CPU for long
periods.  Separately, this machine has slow filesystem metadata operations.
For example, git-new-workdir takes ~50s for a PostgreSQL tree.


The pg_rewind suite has failed a few times when crash recovery took longer
than the 60s pg_ctl default timeout.  Disabling fsync (commit 7d7a103) reduced
median crash recovery time by 75%, which may suffice.  If not, I'll be
inclined to add --timeout=900 to each pg_ctl invocation.


The pg_ctl suite has failed with "not ok 12 - second pg_ctl start succeeds".
You can reproduce that by adding "sleep 3;" between that test and the one
before it.  The timing dependency comes from the pg_ctl "slop" time:

    /*
     * Make sanity checks.  If it's for a standalone backend
     * (negative PID), or the recorded start time is before
     * pg_ctl started, then either we are looking at the wrong
     * data directory, or this is a pre-existing pidfile that
     * hasn't (yet?) been overwritten by our child postmaster.
     * Allow 2 seconds slop for possible cross-process clock
     * skew.
     */

The "second pg_ctl start succeeds" tested-for behavior is actually a minor bug
that we'd ideally fix as described in the last paragraph of the commit 3c485ca
log message:

    All of this could be improved if we rewrote start_postmaster() so that it
    could report the child postmaster's PID, so that we'd know a-priori the
    correct PID to test with postmaster_is_alive().  That looks like a bit too
    much change for so late in the 9.1 development cycle, unfortunately.

I recommend we invert the test expectation and, pending the ideal pg_ctl fix,
add the "sleep 3" to avoid falling within the time slop:

--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -35,6 +35,7 @@ close CONF;
 command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
        'pg_ctl start -w');
-command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
-       'second pg_ctl start succeeds');
+sleep 3;    # bridge test_postmaster_connection() slop threshold
+command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
+       'second pg_ctl start fails');
 command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
        'pg_ctl stop -w');


Alternatively, I could just remove the test.

crake failed the same way, once:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06

