Re: [HACKERS] pgbench randomness initialization

2018-03-02 Thread Tom Lane
Fabien COELHO  writes:
>> Hm ... so I tried to replicate this problem, and failed to: the log files
>> get made under the VPATH build directory, as desired, even without this
>> patch.  Am I doing something wrong, or is this platform-dependent somehow?

> As I recall, it indeed works if the source directories are rw, but fails 
> if they are ro because then the local mkdir fails. So you would have to do 
> a vpath build with sources that are ro to get the issue the patch is 
> fixing.

Ah, right you are.  Apparently, if the source is rw, the temporary files
in question are made there but immediately deleted, so the bug isn't
obvious.

Fix verified and pushed.

regards, tom lane



Re: [HACKERS] pgbench randomness initialization

2018-02-28 Thread Fabien COELHO


Hello Tom,


Fabien COELHO  writes:

This is a simple patch that does what it says on the tin. I ran into
trouble with the pgbench TAP test *even before applying the patch*, but
only because I was doing a VPATH build as a user without 'write'
on the source tree (001_pgbench_with_server.pl tried to make pgbench
create log files there). Bad me. Oddly, that was the only test in the
whole tree to have such an issue, so here I add a pre-patch to fix that.
Now my review needs a review. :)



Yep. I find the multiple chdir solution a little bit too extreme.



ISTM that it should rather add the correct path to --log-prefix by
prepending $node->basedir, like the pgbench function does for -f scripts.
See attached.


Hm ... so I tried to replicate this problem, and failed to: the log files
get made under the VPATH build directory, as desired, even without this
patch.  Am I doing something wrong, or is this platform-dependent somehow?


As I recall, it indeed works if the source directories are rw, but fails 
if they are ro because then the local mkdir fails. So you would have to do 
a vpath build with sources that are ro to get the issue the patch is 
fixing. Otherwise, the issue would have been cought earlier by the 
buildfarm, which I guess is doing vpath compilation and full validation.


--
Fabien.



Re: [HACKERS] pgbench randomness initialization

2018-02-28 Thread Tom Lane
Fabien COELHO  writes:
>> This is a simple patch that does what it says on the tin. I ran into
>> trouble with the pgbench TAP test *even before applying the patch*, but
>> only because I was doing a VPATH build as a user without 'write'
>> on the source tree (001_pgbench_with_server.pl tried to make pgbench
>> create log files there). Bad me. Oddly, that was the only test in the
>> whole tree to have such an issue, so here I add a pre-patch to fix that.
>> Now my review needs a review. :)

> Yep. I find the multiple chdir solution a little bit too extreme.

> ISTM that it should rather add the correct path to --log-prefix by 
> prepending $node->basedir, like the pgbench function does for -f scripts.
> See attached.

Hm ... so I tried to replicate this problem, and failed to: the log files
get made under the VPATH build directory, as desired, even without this
patch.  Am I doing something wrong, or is this platform-dependent somehow?

regards, tom lane



Re: Re: [HACKERS] pgbench randomness initialization

2018-01-24 Thread Fabien COELHO



Here is a rebase, plus some more changes:

I have improved the error message to tell from where the value was 
provided.


I have removed the test to the exact values produced from the expression 
test run.


I have added a test which run from the same seed value several times
and checks that the output values are the same.


This version adds a :random_seed script variable, for information.


Rebased.

--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 3dd492c..aa1a9e0 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -680,6 +680,45 @@ pgbench  options  d
  
 
  
+  --random-seed=SEED
+  
+   
+Set random generator seed.  Seeds the system random number generator,
+which then produces a sequence of initial generator states, one for
+each thread.
+Values for SEED may be:
+time (the default, the seed is based on the current time),
+rand (use a strong random source, failing if none
+is available), or any unsigned octal (012470),
+decimal (5432) or
+hexedecimal (0x1538) integer value.
+The random generator is invoked explicitly from a pgbench script
+(random... functions) or implicitly (for instance option
+--rate uses it to schedule transactions).
+When explicitely set, the value used for seeding is shown on the terminal.
+Any value allowed for SEED may also be
+provided through the environment variable
+PGBENCH_RANDOM_SEED.
+To ensure that the provided seed impacts all possible uses, put this option
+first or use the environment variable.
+  
+  
+Setting the seed explicitly allows to reproduce a pgbench
+run exactly, as far as random numbers are concerned.
+As the random state is managed per thread, this means the exact same
+pgbench run for an identical invocation if there is one
+client per thread and there are no external or data dependencies.
+From a statistical viewpoint reproducing runs exactly is a bad idea because
+it can hide the performance variability or improve performance unduly,
+e.g. by hitting the same pages as a previous run.
+However, it may also be of great help for debugging, for instance
+re-running a tricky case which leads to an error.
+Use wisely.
+   
+  
+ 
+
+ 
   --sampling-rate=rate
   

@@ -874,14 +913,19 @@ pgbench  options  d
 
  
   
-scale 
-   current scale factor
-  
-
-  
 client_id 
unique number identifying the client session (starts from zero)
   
+
+  
+random_seed 
+   random generator seed
+  
+
+  
+scale 
+   current scale factor
+  
  
 

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 31ea6ca..c49a48a 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -146,6 +146,9 @@ int64		latency_limit = 0;
 char	   *tablespace = NULL;
 char	   *index_tablespace = NULL;
 
+/* random seed used when calling srandom() */
+int64 random_seed = -1;
+
 /*
  * end of configurable parameters
  */
@@ -560,6 +563,7 @@ usage(void)
 		   "  --log-prefix=PREFIX  prefix for transaction time log file\n"
 		   "   (default: \"pgbench_log\")\n"
 		   "  --progress-timestamp use Unix epoch timestamps for progress\n"
+		   "  --random-seed=SEED   set random seed (\"time\", \"rand\", integer)\n"
 		   "  --sampling-rate=NUM  fraction of transactions to log (e.g., 0.01 for 1%%)\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug  print debugging output\n"
@@ -4362,6 +4366,57 @@ printResults(TState *threads, StatsData *total, instr_time total_time,
 	}
 }
 
+/* call srandom based on some seed. NULL triggers the default behavior. */
+static void
+set_random_seed(const char *seed, const char *origin)
+{
+	unsigned int iseed;
+
+	if (seed == NULL || strcmp(seed, "time") == 0)
+	{
+		/* rely on current time */
+		instr_time	now;
+		INSTR_TIME_SET_CURRENT(now);
+		iseed = (unsigned int) INSTR_TIME_GET_MICROSEC(now);
+	}
+	else if (strcmp(seed, "rand") == 0)
+	{
+		/* use some "strong" random source */
+		if (!pg_strong_random(&iseed, sizeof(iseed)))
+		{
+			fprintf(stderr, "cannot seed random from a strong source\n");
+			exit(1);
+		}
+	}
+	else
+	{
+		/* parse uint seed value */
+		char garbage;
+		if (!
+			/* hexa */
+			((seed[0] == '0' && (seed[1] == 'x' || seed[1] == 'X') &&
+			 sscanf(seed, "%x%c", &iseed, &garbage) == 1) ||
+			 /* octal */
+			(seed[0] == '0' && seed[1] != 'x' && seed[1] != 'X' &&
+			 sscanf(seed, "%o%c", &iseed, &garbage) == 1) ||
+			 /* decimal */
+			 (seed[0] != '0' &&
+			  sscanf(seed, "%u%c", &iseed, &garbage) == 1)))
+		{
+			fprintf(stderr,

Re: Re: [HACKERS] pgbench randomness initialization

2018-01-18 Thread Fabien COELHO



Here is a rebase, plus some more changes:

I have improved the error message to tell from where the value was provided.

I have removed the test to the exact values produced from the expression test 
run.


I have added a test which run from the same seed value several times
and checks that the output values are the same.


This version adds a :random_seed script variable, for information.

--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 3dd492c..aa1a9e0 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -680,6 +680,45 @@ pgbench  options  d
  
 
  
+  --random-seed=SEED
+  
+   
+Set random generator seed.  Seeds the system random number generator,
+which then produces a sequence of initial generator states, one for
+each thread.
+Values for SEED may be:
+time (the default, the seed is based on the current time),
+rand (use a strong random source, failing if none
+is available), or any unsigned octal (012470),
+decimal (5432) or
+hexedecimal (0x1538) integer value.
+The random generator is invoked explicitly from a pgbench script
+(random... functions) or implicitly (for instance option
+--rate uses it to schedule transactions).
+When explicitely set, the value used for seeding is shown on the terminal.
+Any value allowed for SEED may also be
+provided through the environment variable
+PGBENCH_RANDOM_SEED.
+To ensure that the provided seed impacts all possible uses, put this option
+first or use the environment variable.
+  
+  
+Setting the seed explicitly allows to reproduce a pgbench
+run exactly, as far as random numbers are concerned.
+As the random state is managed per thread, this means the exact same
+pgbench run for an identical invocation if there is one
+client per thread and there are no external or data dependencies.
+From a statistical viewpoint reproducing runs exactly is a bad idea because
+it can hide the performance variability or improve performance unduly,
+e.g. by hitting the same pages as a previous run.
+However, it may also be of great help for debugging, for instance
+re-running a tricky case which leads to an error.
+Use wisely.
+   
+  
+ 
+
+ 
   --sampling-rate=rate
   

@@ -874,14 +913,19 @@ pgbench  options  d
 
  
   
-scale 
-   current scale factor
-  
-
-  
 client_id 
unique number identifying the client session (starts from zero)
   
+
+  
+random_seed 
+   random generator seed
+  
+
+  
+scale 
+   current scale factor
+  
  
 

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 31ea6ca..c49a48a 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -146,6 +146,9 @@ int64		latency_limit = 0;
 char	   *tablespace = NULL;
 char	   *index_tablespace = NULL;
 
+/* random seed used when calling srandom() */
+int64 random_seed = -1;
+
 /*
  * end of configurable parameters
  */
@@ -560,6 +563,7 @@ usage(void)
 		   "  --log-prefix=PREFIX  prefix for transaction time log file\n"
 		   "   (default: \"pgbench_log\")\n"
 		   "  --progress-timestamp use Unix epoch timestamps for progress\n"
+		   "  --random-seed=SEED   set random seed (\"time\", \"rand\", integer)\n"
 		   "  --sampling-rate=NUM  fraction of transactions to log (e.g., 0.01 for 1%%)\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug  print debugging output\n"
@@ -4362,6 +4366,57 @@ printResults(TState *threads, StatsData *total, instr_time total_time,
 	}
 }
 
+/* call srandom based on some seed. NULL triggers the default behavior. */
+static void
+set_random_seed(const char *seed, const char *origin)
+{
+	unsigned int iseed;
+
+	if (seed == NULL || strcmp(seed, "time") == 0)
+	{
+		/* rely on current time */
+		instr_time	now;
+		INSTR_TIME_SET_CURRENT(now);
+		iseed = (unsigned int) INSTR_TIME_GET_MICROSEC(now);
+	}
+	else if (strcmp(seed, "rand") == 0)
+	{
+		/* use some "strong" random source */
+		if (!pg_strong_random(&iseed, sizeof(iseed)))
+		{
+			fprintf(stderr, "cannot seed random from a strong source\n");
+			exit(1);
+		}
+	}
+	else
+	{
+		/* parse uint seed value */
+		char garbage;
+		if (!
+			/* hexa */
+			((seed[0] == '0' && (seed[1] == 'x' || seed[1] == 'X') &&
+			 sscanf(seed, "%x%c", &iseed, &garbage) == 1) ||
+			 /* octal */
+			(seed[0] == '0' && seed[1] != 'x' && seed[1] != 'X' &&
+			 sscanf(seed, "%o%c", &iseed, &garbage) == 1) ||
+			 /* decimal */
+			 (seed[0] != '0' &&
+			  sscanf(seed, "%u%c", &iseed, &garbage) == 1)))
+		{
+			fprintf(stderr,
+	"error

Re: Re: [HACKERS] pgbench randomness initialization

2018-01-10 Thread Fabien COELHO


Here is a rebase, plus some more changes:

I have improved the error message to tell from where the value was 
provided.


I have removed the test to the exact values produced from the expression 
test run.


I have added a test which run from the same seed value several times
and checks that the output values are the same.

--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 3dd492c..22ebb51 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -680,6 +680,45 @@ pgbench  options  d
  
 
  
+  --random-seed=SEED
+  
+   
+Set random generator seed.  Seeds the system random number generator,
+which then produces a sequence of initial generator states, one for
+each thread.
+Values for SEED may be:
+time (the default, the seed is based on the current time),
+rand (use a strong random source, failing if none
+is available), or any unsigned octal (012470),
+decimal (5432) or
+hexedecimal (0x1538) integer value.
+The random generator is invoked explicitly from a pgbench script
+(random... functions) or implicitly (for instance option
+--rate uses it to schedule transactions).
+When explicitely set, the value used for seeding is shown on the terminal.
+Any value allowed for SEED may also be
+provided through the environment variable
+PGBENCH_RANDOM_SEED.
+To ensure that the provided seed impacts all possible uses, put this option
+first or use the environment variable.
+  
+  
+Setting the seed explicitly allows to reproduce a pgbench
+run exactly, as far as random numbers are concerned.
+As the random state is managed per thread, this means the exact same
+pgbench run for an identical invocation if there is one
+client per thread and there are no external or data dependencies.
+From a statistical viewpoint reproducing runs exactly is a bad idea because
+it can hide the performance variability or improve performance unduly,
+e.g. by hitting the same pages as a previous run.
+However, it may also be of great help for debugging, for instance
+re-running a tricky case which leads to an error.
+Use wisely.
+   
+  
+ 
+
+ 
   --sampling-rate=rate
   

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 31ea6ca..206dfd5 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -560,6 +560,7 @@ usage(void)
 		   "  --log-prefix=PREFIX  prefix for transaction time log file\n"
 		   "   (default: \"pgbench_log\")\n"
 		   "  --progress-timestamp use Unix epoch timestamps for progress\n"
+		   "  --random-seed=SEED   set random seed (\"time\", \"rand\", integer)\n"
 		   "  --sampling-rate=NUM  fraction of transactions to log (e.g., 0.01 for 1%%)\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug  print debugging output\n"
@@ -4362,6 +4363,55 @@ printResults(TState *threads, StatsData *total, instr_time total_time,
 	}
 }
 
+/* call srandom based on some seed. NULL triggers the default behavior. */
+static void
+set_random_seed(const char *seed, const char *origin)
+{
+	unsigned int iseed;
+
+	if (seed == NULL || strcmp(seed, "time") == 0)
+	{
+		/* rely on current time */
+		instr_time	now;
+		INSTR_TIME_SET_CURRENT(now);
+		iseed = (unsigned int) INSTR_TIME_GET_MICROSEC(now);
+	}
+	else if (strcmp(seed, "rand") == 0)
+	{
+		/* use some "strong" random source */
+		if (!pg_strong_random(&iseed, sizeof(iseed)))
+		{
+			fprintf(stderr, "cannot seed random from a strong source\n");
+			exit(1);
+		}
+	}
+	else
+	{
+		/* parse uint seed value */
+		char garbage;
+		if (!
+			/* hexa */
+			((seed[0] == '0' && (seed[1] == 'x' || seed[1] == 'X') &&
+			 sscanf(seed, "%x%c", &iseed, &garbage) == 1) ||
+			 /* octal */
+			(seed[0] == '0' && seed[1] != 'x' && seed[1] != 'X' &&
+			 sscanf(seed, "%o%c", &iseed, &garbage) == 1) ||
+			 /* decimal */
+			 (seed[0] != '0' &&
+			  sscanf(seed, "%u%c", &iseed, &garbage) == 1)))
+		{
+			fprintf(stderr,
+	"error while scanning '%s' from %s, expecting an unsigned integer, 'time' or 'rand'\n",
+	seed, origin);
+			exit(1);
+		}
+	}
+
+	if (seed != NULL)
+		fprintf(stderr, "setting random seed to %u\n", iseed);
+	srandom(iseed);
+}
+
 
 int
 main(int argc, char **argv)
@@ -4404,6 +4454,7 @@ main(int argc, char **argv)
 		{"progress-timestamp", no_argument, NULL, 6},
 		{"log-prefix", required_argument, NULL, 7},
 		{"foreign-keys", no_argument, NULL, 8},
+		{"random-seed", required_argument, NULL, 9},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -4472,6 +4523,9 @@ main(int argc, char **argv)
 	state = (CState *) pg_malloc(sizeof(CState));
 	memset(state, 0, sizeof(CState));
 
+	/* set random seed early, because it may be used while parsing scripts.

Re: Re: [HACKERS] pgbench randomness initialization

2018-01-10 Thread Fabien COELHO



This is a simple patch that does what it says on the tin. I ran into
trouble with the pgbench TAP test *even before applying the patch*, but
only because I was doing a VPATH build as a user without 'write'
on the source tree (001_pgbench_with_server.pl tried to make pgbench
create log files there). Bad me. Oddly, that was the only test in the
whole tree to have such an issue, so here I add a pre-patch to fix that.
Now my review needs a review. :)


Yep. I find the multiple chdir solution a little bit too extreme.

ISTM that it should rather add the correct path to --log-prefix by 
prepending $node->basedir, like the pgbench function does for -f scripts.


See attached.

--
Fabien.diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index e579334..ba7e363 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -611,24 +611,26 @@ sub check_pgbench_logs
 	ok(unlink(@logs), "remove log files");
 }
 
+my $bdir = $node->basedir;
+
 # with sampling rate
 pgbench(
-'-n -S -t 50 -c 2 --log --log-prefix=001_pgbench_log_2 --sampling-rate=0.5',
+"-n -S -t 50 -c 2 --log --log-prefix=$bdir/001_pgbench_log_2 --sampling-rate=0.5",
 	0,
 	[ qr{select only}, qr{processed: 100/100} ],
 	[qr{^$}],
 	'pgbench logs');
 
-check_pgbench_logs('001_pgbench_log_2', 1, 8, 92,
+check_pgbench_logs("$bdir/001_pgbench_log_2", 1, 8, 92,
 	qr{^0 \d{1,2} \d+ \d \d+ \d+$});
 
 # check log file in some detail
 pgbench(
-	'-n -b se -t 10 -l --log-prefix=001_pgbench_log_3',
+	"-n -b se -t 10 -l --log-prefix=$bdir/001_pgbench_log_3",
 	0, [ qr{select only}, qr{processed: 10/10} ],
 	[qr{^$}], 'pgbench logs contents');
 
-check_pgbench_logs('001_pgbench_log_3', 1, 10, 10,
+check_pgbench_logs("$bdir/001_pgbench_log_3", 1, 10, 10,
 	qr{^\d \d{1,2} \d+ \d \d+ \d+$});
 
 # done


Re: Re: [HACKERS] pgbench randomness initialization

2018-01-09 Thread Fabien COELHO


Hello Chapman,

Thanks for the review,


The tests assume that stdlib random/srandom behavior is standard thus
deterministic between platform.


Is the behavior of srandom() and the system generator really so precisely
specified that seed 5432 will produce the same values hardcoded in the
tests on all platforms? [...]


Good question.

I'm hoping that in practice it would be the same, or that their would be 
very few cases (eg linux vs windows vs macos...). I was counting on the 
the buildfarm to tell me if I'm wrong, and fix it if needed.



To have the test run pgbench twice with the same seed and compare the
results sounds safer.


Interesting idea. The test script would be more complicated to do that, 
though. I would prefer to bet on "random" determinism (:-) and resort to 
such a solution only if it is not the case. Or maybe just put back some 
"\d+" to keep it simple.


This is a debatable strategy.

I did some wordsmithing of the doc, which I hope was not presumptuous of 
me, only as a conversation starter. [...]


Thanks for the doc improvements.


The documentation doesn't say that --random-seed= (or PGBENCH_RANDOM_SEED=)
will have the same effect as 'time', and indeed, I really think it should
be unset (defaulting to 'time'), or 'time', or 'rand', or an integer,
or an error.


Ok, done.

The error, right now, says only "expecting an unsigned integer"; it 
should also mention time and rand.


Ok, done.

Should 'rand' be something that conveys more about its meaning, 'strong' 
perhaps?


Hmmm. "Random means random":-). I have no opinion about whether it would 
be better. ISTM that "strong" would require some explanations. I let is as 
"rand" for now.



The documentation doesn't mention the allowed range of the unsigned
integer (which I get, because 'unsigned int' is exactly the signature
for srandom, but somebody might read "unsigned integer" in a more
generic sense).


Ok. I extended so that it works with octal, decimal and hexadecimal, and 
updated the doc accordingly. I did not add range information though.



I'm not sure what would be a better way to say it.
The base (only decimal, as now in the code) isn't specified either.


Sure.

Maybe the documentation should mention that the output now includes the 
random seed being used, so that (even without planning ahead) [...]


It does so only if the seed is explicitely set, otherwise it keeps the 
previous behavior of being silent. I added a sentence about that.



Something more may need to be said in the doc about reproducibility. I think
the real story is that a run can be reproduced if the number of clients is
equal to the number of threads.


Yes, good point. I tried to hide the issue under the "as far as random 
numbers are concerned". I've tried to improve this point in the doc.


Although each thread has its own generator state, each client does not 
(if there is more than one per thread), and as soon as real select() 
delays start happening in CSTATE_WAIT_RESULT, the clients dealt out to a 
given thread may not be hitting that thread's generator in a 
deterministic order.


Yep. This may evolve, for instance the error handling patch needs to 
restart transactions so it adds a state into the client.


--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 1519fe7..7e81a51 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -680,6 +680,45 @@ pgbench  options  d
  
 
  
+  --random-seed=SEED
+  
+   
+Set random generator seed.  Seeds the system random number generator,
+which then produces a sequence of initial generator states, one for
+each thread.
+Values for SEED may be:
+time (the default, the seed is based on the current time),
+rand (use a strong random source, failing if none
+is available), or any unsigned octal (012470),
+decimal (5432) or
+hexedecimal (0x1538) integer value.
+The random generator is invoked explicitly from a pgbench script
+(random... functions) or implicitly (for instance option
+--rate uses it to schedule transactions).
+When explicitely set, the value used for seeding is shown on the terminal.
+Any value allowed for SEED may also be
+provided through the environment variable
+PGBENCH_RANDOM_SEED.
+To ensure that the provided seed impacts all possible uses, put this option
+first or use the environment variable.
+  
+  
+Setting the seed explicitly allows to reproduce a pgbench
+run exactly, as far as random numbers are concerned.
+As the random state is managed per thread, this means the exact same
+pgbench run for an identical invocation if there is one
+client per thread and there are no external or data dependencies.
+From a statistical viewpoint reproducing runs exactly is a bad idea because
+it can hide the perf

Re: Re: [HACKERS] pgbench randomness initialization

2018-01-09 Thread Chapman Flack
On 01/02/18 05:57, Fabien COELHO wrote:
>> Here is a new version which output use used seed when a seed is
>> explicitely set with an option or from the environment.


This is a simple patch that does what it says on the tin. I ran into
trouble with the pgbench TAP test *even before applying the patch*, but
only because I was doing a VPATH build as a user without 'write'
on the source tree (001_pgbench_with_server.pl tried to make pgbench
create log files there). Bad me. Oddly, that was the only test in the
whole tree to have such an issue, so here I add a pre-patch to fix that.
Now my review needs a review. :)

With that taken care of, the added tests all pass for me. I wonder, though:

> The tests assume that stdlib random/srandom behavior is standard thus
> deterministic between platform.

Is the behavior of srandom() and the system generator really so precisely
specified that seed 5432 will produce the same values hardcoded in the
tests on all platforms? It does on mine, but could it produce spurious
test failures for others? I guess the (or a) spec would be here:

http://pubs.opengroup.org/onlinepubs/7908799/xsh/initstate.html

It specifies a "non-linear additive feedback random-number generator
employing a default state array size of 31 long integers", but it does
not pin down parameters or claim only one candidate exists.

To have the test run pgbench twice with the same seed and compare the
results sounds safer.

This revised pgbench-seed-4.patch changes the code not at all, and
the TAP test only in whitespace. I did some wordsmithing of the doc,
which I hope was not presumptuous of me, only as a conversation starter.
I expanded the second sentence because on my first reading I wasn't quite
sure of its meaning. Once I had looked at the code, I could see that the
sentence was economical and precise already, but I tried to find a version
that would also have been clear to the me-before-I-looked.

The documentation doesn't say that --random-seed= (or PGBENCH_RANDOM_SEED=)
will have the same effect as 'time', and indeed, I really think it should
be unset (defaulting to 'time'), or 'time', or 'rand', or an integer,
or an error. The error, right now, says only "expecting an unsigned
integer"; it should also mention time and rand. Should 'rand' be something
that conveys more about its meaning, 'strong' perhaps?

The documentation doesn't mention the allowed range of the unsigned
integer (which I get, because 'unsigned int' is exactly the signature
for srandom, but somebody might read "unsigned integer" in a more
generic sense). I'm not sure what would be a better way to say it.
The base (only decimal, as now in the code) isn't specified either.

Maybe the documentation should mention that the output now includes the
random seed being used, so that (even without planning ahead) a session
can be re-run with the same seed if necessary. It could just say "an
unsigned integer in decimal, as it is shown in pgbench's output" or
something like that.

Something more may need to be said in the doc about reproducibility. I think
the real story is that a run can be reproduced if the number of clients is
equal to the number of threads. Although each thread has its own generator
state, each client does not (if there is more than one per thread), and as
soon as real select() delays start happening in CSTATE_WAIT_RESULT, the
clients dealt out to a given thread may not be hitting that thread's
generator in a deterministic order.

-Chap
>From 4fe8f8563fedcb04b167db476adaf3bc398f879d Mon Sep 17 00:00:00 2001
From: Chapman Flack 
Date: Mon, 8 Jan 2018 23:15:01 -0500
Subject: [PATCH 1/2] Survive VPATH build without w on source tree.

A couple tests pass log options to pgbench, with a bare filename
for --log-prefix, so it would be created in the current directory.
In a VPATH build, inconveniently, the Makefile chdirs into the
original source tree before running this, and if the build user has
no write access there, the tests fail. Chdir back to a nice writable
$TESTDIR/tmp_check to avoid that.
---
 src/bin/pgbench/t/001_pgbench_with_server.pl | 5 +
 1 file changed, 5 insertions(+)

diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 3dd080e..8305db3 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -5,6 +5,11 @@ use PostgresNode;
 use TestLib;
 use Test::More;
 
+if ( exists $ENV{'TESTDIR'} ) {
+	chdir $ENV{'TESTDIR'};
+	-d 'tmp_check' && -w 'tmp_check' && chdir 'tmp_check';
+}
+
 # start a pgbench specific server
 my $node = get_new_node('main');
 $node->init;
-- 
2.7.3

>From 33d5ec2b863fe3e8198d1c89de91bc36b52db5cc Mon Sep 17 00:00:00 2001
From: Chapman Flack 
Date: Tue, 9 Jan 2018 01:29:43 -0500
Subject: [PATCH 2/2] New pgbench --random-seed option for reproducibility.

---
 doc/src/sgml/ref/pgbench.sgml| 33 +
 src/bin/pgbench/pgbench.c| 53 

Re: [HACKERS] pgbench randomness initialization

2018-01-02 Thread Fabien COELHO


Here is a new version which output use used seed when a seed is explicitely 
set with an option or from the environment.


It is even better without xml typos, with simpler coding and the doc in 
the right place... Sorry for the noise.


--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 1519fe7..ec07fa3 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -680,6 +680,37 @@ pgbench  options  d
  
 
  
+  --random-seed=SEED
+  
+   
+Set random generator seed.  This random generator is used to initialize
+per-thread random generator states.
+Expected values for SEED are:
+time (the default, the seed is based on the current time),
+rand (use a strong random source if available),
+or any unsigned integer value.
+The random generator is invoked explicitely from a pgbench script
+(random... functions) or implicitely (for instance option
+--rate uses random to schedule transactions).
+The random generator seed may also be provided through environment variable
+PGBENCH_RANDOM_SEED.
+To ensure that the provided seed impacts all possible uses, put this option
+first or use the environment variable.
+  
+  
+Setting the seed explicitely allows to reproduce a pgbench
+run exactly, as far as random numbers are concerned.
+From a statistical viewpoint this is a bad idea because it can hide the
+performance variability or improve performance unduly, e.g. by hitting
+the same pages than a previous run.
+However it may also be of great help for debugging, for instance
+re-running a tricky case which leads to an error.
+Use wisely.
+   
+  
+ 
+
+ 
   --sampling-rate=rate
   

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e065f7b..4eb82f7 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -557,6 +557,7 @@ usage(void)
 		   "  --log-prefix=PREFIX  prefix for transaction time log file\n"
 		   "   (default: \"pgbench_log\")\n"
 		   "  --progress-timestamp use Unix epoch timestamps for progress\n"
+		   "  --random-seed=SEED   set random seed (\"time\", \"rand\", integer)\n"
 		   "  --sampling-rate=NUM  fraction of transactions to log (e.g., 0.01 for 1%%)\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug  print debugging output\n"
@@ -4010,6 +4011,46 @@ printResults(TState *threads, StatsData *total, instr_time total_time,
 	}
 }
 
+/* call srandom based on some seed. NULL triggers the default behavior. */
+static void
+set_random_seed(const char *seed)
+{
+	unsigned int iseed;
+
+	if (seed == NULL || *seed == '\0' || strcmp(seed, "time") == 0)
+	{
+		/* rely on current time */
+		instr_time	now;
+		INSTR_TIME_SET_CURRENT(now);
+		iseed = (unsigned int) INSTR_TIME_GET_MICROSEC(now);
+	}
+	else if (strcmp(seed, "rand") == 0)
+	{
+		/* use some "strong" random source */
+		if (!pg_strong_random(&iseed, sizeof(iseed)))
+		{
+			fprintf(stderr, "cannot seed random from a strong source\n");
+			exit(1);
+		}
+	}
+	else
+	{
+		/* parse seed value coming either from option or environment */
+		char garbage;
+		if (sscanf(seed, "%u%c", &iseed, &garbage) != 1)
+		{
+			fprintf(stderr,
+	"error while scanning '%s', expecting an unsigned integer\n",
+	seed);
+			exit(1);
+		}
+	}
+
+	if (seed != NULL)
+		fprintf(stderr, "setting random seed to %u\n", iseed);
+	srandom(iseed);
+}
+
 
 int
 main(int argc, char **argv)
@@ -4052,6 +4093,7 @@ main(int argc, char **argv)
 		{"progress-timestamp", no_argument, NULL, 6},
 		{"log-prefix", required_argument, NULL, 7},
 		{"foreign-keys", no_argument, NULL, 8},
+		{"random-seed", required_argument, NULL, 9},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -4120,6 +4162,9 @@ main(int argc, char **argv)
 	state = (CState *) pg_malloc(sizeof(CState));
 	memset(state, 0, sizeof(CState));
 
+	/* set random seed early, because it may be used while parsing scripts. */
+	set_random_seed(getenv("PGBENCH_RANDOM_SEED"));
+
 	while ((c = getopt_long(argc, argv, "iI:h:nvp:dqb:SNc:j:Crs:t:T:U:lf:D:F:M:P:R:L:", long_options, &optindex)) != -1)
 	{
 		char	   *script;
@@ -4392,6 +4437,10 @@ main(int argc, char **argv)
 initialization_option_set = true;
 foreign_keys = true;
 break;
+			case 9:/* random-seed */
+benchmarking_option_set = true;
+set_random_seed(optarg);
+break;
 			default:
 fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 exit(1);
@@ -4698,10 +4747,6 @@ main(int argc, char **argv)
 	}
 	PQfinish(con);
 
-	/* set random seed */
-	INSTR_TIME_SET_CURRENT(start_time);
-	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
-
 	/* set up thread data structures */
 	threads = (TState *) pg_malloc(sizeof(TState) * nthreads);
 	nclients_dealt = 0

Re: [HACKERS] pgbench randomness initialization

2018-01-02 Thread Fabien COELHO


Hello Alvaro,

I revive this patch because controlling the seed is useful for tap testing 
pgbench.


The output should include the random seed used, whether it was passed 
with --random-seed, environment variable or randomly determined.  That 
way, the user that later wants to verify why a particular run caused 
some particular misbehavior knows what seed to use to reproduce that 
run.


Yep.

Here is a new version which output use used seed when a seed is 
explicitely set with an option or from the environment.


However, the default (current) behavior remains silent, so no visible 
changes unless tinkering with it.


The patch also allows to use a "strong" random for seeding the PRNG, 
thanks to pg_strong_random().


The tests assume that stdlib random/srandom behavior is standard thus 
deterministic between platform.


--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 1519fe7..49dda81 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -761,6 +761,37 @@ pgbench  options  d

   
  
+
+ 
+  --random-seed=SEED
+  
+   
+Set random generator seed.  This random generator is used to initialize
+per-thread random generator states.
+Expected values for SEED are:
+time (the default, the seed is based on the current time),
+rand (use a strong random source if available),
+or any unsigned integer value.
+The random generator is invoked explicitely from a pgbench script
+(random... functions) or implicitely (for instance option
+--rate uses random to schedule transactions).
+The random generator seed may also be provided through environment variable
+PGBENCH_RANDOM_SEED.
+To ensure that the provided seed impacts all possible uses, put this option
+first or use the environment variable.
+  
+  
+Setting the seed explicitely allows to reproduce a pgbench
+run exactly, as far as random numbers are concerned.
+From a statistical viewpoint this is a bad idea because it can hide the
+performance variability or improve performance unduly, e.g. by hitting
+the same pages than a previous run.
+However it may also be of great help for debugging, for instance
+re-running a tricky case which leads to an error.
+Use wisely.
+   
+  
+ 
 

 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e065f7b..fdd731d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -557,6 +557,7 @@ usage(void)
 		   "  --log-prefix=PREFIX  prefix for transaction time log file\n"
 		   "   (default: \"pgbench_log\")\n"
 		   "  --progress-timestamp use Unix epoch timestamps for progress\n"
+		   "  --random-seed=SEED   set random seed (\"time\", \"rand\", integer)\n"
 		   "  --sampling-rate=NUM  fraction of transactions to log (e.g., 0.01 for 1%%)\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug  print debugging output\n"
@@ -4010,6 +4011,49 @@ printResults(TState *threads, StatsData *total, instr_time total_time,
 	}
 }
 
+/* call srandom based on some seed. NULL triggers the default behavior. */
+static void
+set_random_seed(const char *seed)
+{
+	unsigned int iseed;
+
+	if (seed == NULL)
+		seed = getenv("PGBENCH_RANDOM_SEED");
+
+	if (seed == NULL || *seed == '\0' || strcmp(seed, "time") == 0)
+	{
+		/* rely on current time */
+		instr_time	now;
+		INSTR_TIME_SET_CURRENT(now);
+		iseed = (unsigned int) INSTR_TIME_GET_MICROSEC(now);
+	}
+	else if (strcmp(seed, "rand") == 0)
+	{
+		/* use some "strong" random source */
+		if (!pg_strong_random(&iseed, sizeof(iseed)))
+		{
+			fprintf(stderr, "cannot seed random from a strong source\n");
+			exit(1);
+		}
+	}
+	else
+	{
+		/* parse seed value coming either from option or environment */
+		char garbage;
+		if (sscanf(seed, "%u%c", &iseed, &garbage) != 1)
+		{
+			fprintf(stderr,
+	"error while scanning '%s', expecting an unsigned integer\n",
+	seed);
+			exit(1);
+		}
+	}
+
+	if (seed != NULL && *seed != '\0')
+		fprintf(stderr, "setting random seed to %u\n", iseed);
+	srandom(iseed);
+}
+
 
 int
 main(int argc, char **argv)
@@ -4052,6 +4096,7 @@ main(int argc, char **argv)
 		{"progress-timestamp", no_argument, NULL, 6},
 		{"log-prefix", required_argument, NULL, 7},
 		{"foreign-keys", no_argument, NULL, 8},
+		{"random-seed", required_argument, NULL, 9},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -4120,6 +4165,9 @@ main(int argc, char **argv)
 	state = (CState *) pg_malloc(sizeof(CState));
 	memset(state, 0, sizeof(CState));
 
+	/* set random seed early, because it may be used while parsing scripts. */
+	set_random_seed(NULL);
+
 	while ((c = getopt_long(argc, argv, "iI:h:nvp:dqb:SNc:j:Crs:t:T:U:lf:D:F:M:P:R:L:", long_options, &optindex)) != -1)
 	{
 		char	   *script;
@@ -4392,6 +4440,10 @@ main(int argc, c