My AIX buildfarm members have failed the BinInstallCheck step on and off since inception. It became more frequent when I added animals sungazer and tern alongside the older hornet and mandrill. The animals share a machine with each other and with dozens of other developers. I setpriority() the animals to the lowest available priority, so they probably lose the CPU for long periods. Separately, this machine has slow filesystem metadata operations. For example, git-new-workdir takes ~50s for a PostgreSQL tree.

The pg_rewind suite has failed a few times when crash recovery took longer than the 60s pg_ctl default timeout. Disabling fsync (commit 7d7a103) reduced median crash recovery time by 75%, which may suffice. If not, I'll be inclined to add --timeout=900 to each pg_ctl invocation.

The pg_ctl suite has failed with "not ok 12 - second pg_ctl start succeeds". You can reproduce that by adding "sleep 3;" between that test and the one before it. The timing dependency comes from the pg_ctl "slop" time:

    /*
     * Make sanity checks.  If it's for a standalone backend
     * (negative PID), or the recorded start time is before
     * pg_ctl started, then either we are looking at the wrong
     * data directory, or this is a pre-existing pidfile that
     * hasn't (yet?) been overwritten by our child postmaster.
     * Allow 2 seconds slop for possible cross-process clock
     * skew.
     */

The "second pg_ctl start succeeds" tested-for behavior is actually a minor bug that we'd ideally fix as described in the last paragraph of the commit 3c485ca log message:

    All of this could be improved if we rewrote start_postmaster() so that it
    could report the child postmaster's PID, so that we'd know a-priori the
    correct PID to test with postmaster_is_alive().  That looks like a bit too
    much change for so late in the 9.1 development cycle, unfortunately.
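To make the timing dependency concrete, here is a rough model of the sanity check that the quoted comment describes. It is a Python sketch, not pg_ctl's actual C code: the function name, parameter names, and the exact comparison are assumptions inferred from the comment text, and treating a PID of 0 as invalid is likewise an assumption.

```python
SLOP_SECONDS = 2  # from the quoted comment: "Allow 2 seconds slop"


def pidfile_looks_current(pidfile_pid, pidfile_start, pg_ctl_start,
                          slop=SLOP_SECONDS):
    """Model of the pidfile sanity check (hypothetical helper).

    pidfile_pid   -- PID recorded in postmaster.pid
    pidfile_start -- start time recorded in postmaster.pid (epoch seconds)
    pg_ctl_start  -- time at which this pg_ctl invocation started
    """
    # A non-positive PID is rejected; the comment names the negative case
    # (standalone backend), and rejecting 0 is an assumption of this sketch.
    if pidfile_pid <= 0:
        return False
    # A start time more than `slop` seconds before pg_ctl started means a
    # pre-existing pidfile (or the wrong data directory).
    if pidfile_start < pg_ctl_start - slop:
        return False
    return True


# First postmaster wrote its pidfile at t=100.
# A second "pg_ctl start" one second later lands inside the slop window,
# so the old pidfile is mistaken for the new child's.
quick_retry = pidfile_looks_current(4242, 100, 101)

# With "sleep 3" in between, the old pidfile falls outside the window.
delayed_retry = pidfile_looks_current(4242, 100, 104)
```

Under this model, quick_retry is accepted and delayed_retry is rejected, which matches the observation that the test's outcome depends on how much time passes between the two "pg_ctl start" calls.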

I recommend we invert the test expectation and, pending the ideal pg_ctl fix, add the "sleep 3" to avoid falling within the time slop:

--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -35,6 +35,7 @@ close CONF;
 command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
 	'pg_ctl start -w');
-command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
-	'second pg_ctl start succeeds');
+sleep 3;    # bridge test_postmaster_connection() slop threshold
+command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
+	'second pg_ctl start fails');
 command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
 	'pg_ctl stop -w');

Alternately, I could just remove the test.

crake failed the same way, once:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06