<Bounced from commits to have it on hackers>

Hello Tom,

Well, I think it's mostly about valgrind making everything really slow. Since we have seen some passes from skink recently, perhaps there was also a component of more-load-on-the-machine-than-usual. But in the end this is just evidence for my point that regression tests have to be very much not timing-sensitive. We run them under all kinds of out-of-the-ordinary stress.

Attached is an attempt at improving the situation:

(1) there are added comments to explain the whys, which is to provide
coverage for pgbench time-related features... while still not being
time-sensitive, which is a challenge.

(2) the test now only expects "progress: \d" from the output, so it is enough that one progress is shown, whenever it is shown.

(3) if the test is detected to have gone AWOL, detailed log checks are
coldly skipped.

This would have passed on "skink" under the special conditions it encountered.

I cannot guaranty that it would pass under any circumstance, though.

If it still encounters a failure, ISTM that it should only be a missing "progress:" in the output, which has not been encountered so far.

If it occurs, a few options would remain, none of them very convincing:

 - give the test some more time, eg 3 seconds (hmmm... could still fail
   after any time...)

 - drop the "progress:" expectation (hmmm... but then it does not check
   anything).

 - make the "progress:" output check conditional to the running time
   (hmmm... it would require changing the command_checks_all function,
    and there is still a chance that the bench was stuck for 2 seconds
    then given time to realize that it has to stop right now...)

 - use an even simpler transaction, eg "SELECT;" which is less likely to
   get stuck (but still could get stuck...).

For the random-ness related test (--sample-rate), we could add a feature to pgbench which allows to control the random seed, so that the number of samples could be known in advance for testing purposes.

--
Fabien.
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 11bc0fe..5043504 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -4,13 +4,14 @@ use warnings;
 use PostgresNode;
 use TestLib;
 use Test::More;
+use Time::HiRes qw{time};
 
 # start a pgbench specific server
 my $node = get_new_node('main');
 $node->init;
 $node->start;
 
-# invoke pgbench
+# invoke pgbench, return elapsed time
 sub pgbench
 {
 	my ($opts, $stat, $out, $err, $name, $files) = @_;
@@ -32,10 +33,13 @@ sub pgbench
 			append_to_file($filename, $$files{$fn});
 		}
 	}
+
+	my $t0 = time();
 	$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
 
 	# cleanup?
 	#unlink @filenames or die "cannot unlink files (@filenames): $!";
+	return time() - $t0;
 }
 
 # Test concurrent insertion into table with UNIQUE oid column.  DDL expects
@@ -445,14 +449,53 @@ sub check_pgbench_logs
 	ok(unlink(@logs), "remove log files");
 }
 
-# with sampling rate
+# note: --progress-timestamp is not tested
+
+# The point of this test is coverage of various time-related features
+# (-T, -P, --aggregate-interval, --rate, --latency-limit...), so it is
+# somehow time sensitive.
+# The checks performed are quite loose so as to still pass even under
+# degraded (high load, slow host, valgrind run) testing conditions.
+# Maybe it might fail if no every second progress report is
+# shown over 2 seconds...
+my $elapsed = pgbench(
+	'-T 2 -P 1 -l --log-prefix=001_pgbench_log_1 --aggregate-interval=1'
+	  . ' -S -b se@2 --rate=20 --latency-limit=1000 -j ' . $nthreads
+	  . ' -c 3 -r',
+	0,
+	[   qr{type: multiple},
+		qr{clients: 3},
+		qr{threads: $nthreads},
+		# the shown duration is really -T argument value
+		qr{duration: 2 s},
+		qr{script 1: .* select only},
+		qr{script 2: .* select only},
+		qr{statement latencies in milliseconds},
+		qr{FROM pgbench_accounts} ],
+	[	qr{vacuum},
+		# hopefully at least expect some progress report?
+		qr{progress: \d\b} ],
+	'pgbench progress');
+
+# if the test has gone AWOL, coldly skip these detailed checks.
+if (abs($elapsed - 2.0) < 0.5)
+{
+	# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+	# only 1 or as many as 3 progress reports per thread.
+	check_pgbench_logs('001_pgbench_log_1', $nthreads, 1, 3,
+		qr{^\d+ \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+}
+
+# with sampling rate, not time sensitive
 pgbench(
 '-n -S -t 50 -c 2 --log --log-prefix=001_pgbench_log_2 --sampling-rate=0.5',
 	0,
 	[ qr{select only}, qr{processed: 100/100} ],
-	[qr{^$}],
+	[ qr{^$} ],
 	'pgbench logs');
 
+# random 50% of 2*50 transactions, accept between 8 and 92
+# probability of failure is about 2 * 2^-42 (?)
 check_pgbench_logs('001_pgbench_log_2', 1, 8, 92,
 	qr{^0 \d{1,2} \d+ \d \d+ \d+$});
 
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to