Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-08-01 Thread Fabien COELHO


Hello,

>> Version one is k' = 1 + (a * k + b) modulo n with a prime with
>> respect to n, n being the number of keys. This is nearly possible,
>> but for the modulo operator which is currently missing, and that I'm
>> planning to submit for this very reason, but probably another time.
>
> That's pretty crude,

Yep. It is very simple, it is much better than nothing, and for a database
test it may be good enough.

> although I don't object to a modulo operator.  It would be nice to be
> able to use a truly random permutation, which is not hard to generate
> but probably requires O(n) storage, likely a problem for large scale
> factors.

That is indeed the actual issue in my mind. I was thinking of permutations
with a formula, which are not so easy to find and may end up looking like
(a*k+b)%n anyway. I had the same issue for generating random data for a
schema (see http://www.coelho.net/datafiller.html).

> Maybe somebody who knows more math than I do (like you, probably!) can
> come up with something more clever.

I can certainly suggest other formulas, but that does not mean beautiful
code, thus would probably be rejected. I'll see.


An alternative to this whole process may be to hash/modulo a non uniform 
random value.


  id = 1 + hash(some-random()) % n

But the hashing changes the distribution as it adds collisions, so I have 
to think about how to be able to control the distribution in that case, 
and what hash function to use.
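
To make the collision concern concrete, here is a minimal C sketch of the
1 + hash(x) % n idea. The thread does not settle on a hash function, so the
splitmix64 finalizer is used purely as a stand-in; mix64, hashed_key and
count_unreached are hypothetical names, not pgbench code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in hash (assumption): the splitmix64 finalizer. */
static uint64_t
mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= UINT64_C(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x *= UINT64_C(0x94d049bb133111eb);
    x ^= x >> 31;
    return x;
}

/* id = 1 + hash(k) % n, for n > 0; the result is always in 1..n. */
static int64_t
hashed_key(int64_t k, int64_t n)
{
    return 1 + (int64_t) (mix64((uint64_t) k) % (uint64_t) n);
}

/*
 * Count the ids in 1..n that no k in 1..n maps to.  A permutation would
 * give 0; a hash leaves holes because several k collide on the same id.
 * Returns -1 if the scratch array cannot be allocated.
 */
static int64_t
count_unreached(int64_t n)
{
    unsigned char *seen = calloc((size_t) n + 1, 1);
    int64_t missing = 0;
    int64_t i;

    if (seen == NULL)
        return -1;
    for (i = 1; i <= n; i++)
        seen[hashed_key(i, n)] = 1;
    for (i = 1; i <= n; i++)
        if (!seen[i])
            missing++;
    free(seen);
    return missing;
}
```

Calling count_unreached(n) for a large n shows how far the mapping is from a
permutation: every unreached id is a key that can never be drawn, and its
probability mass is folded onto some other id.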


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-08-01 Thread Mitsumasa KONDO
Hi,

2014-08-01 16:26 GMT+09:00 Fabien COELHO coe...@cri.ensmp.fr


>> Maybe somebody who knows more math than I do (like you, probably!) can
>> come up with something more clever.
>
> I can certainly suggest other formula, but that does not mean beautiful
> code, thus would probably be rejected. I'll see.
>
> An alternative to this whole process may be to hash/modulo a non uniform
> random value.
>
>   id = 1 + hash(some-random()) % n
>
> But the hashing changes the distribution as it adds collisions, so I have
> to think about how to be able to control the distribution in that case, and
> what hash function to use.

I think that we have to choose a reproducible method, because a
benchmark always needs robust and reproducible results. And if we
implement this idea, we might need a more accurate random generator,
such as the Mersenne Twister algorithm.  The erand48 algorithm is slow and
not very accurate.

By the way, I don't know how this topic relates to the command line
option... Well, whatever...

Regards,
--
Mitsumasa KONDO


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-31 Thread Robert Haas
On Wed, Jul 30, 2014 at 4:18 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:
>> nor am I in favor of patch B.
>
> Yep. Would providing these as additional contrib files be more acceptable?
> Something like tpc-b-gauss.sql... Otherwise there is no example available
> to show the feature.

To be honest, it just feels like clutter to me.  If we added examples
for every feature that is as significant as this one is, we'd end up
with twice the installation footprint, and most of it would be stuff
nobody ever looked at.  I think the documentation is good enough that
people will be able to understand how to use this feature, which is
good enough for me.

One thing that might still be worth doing is including the standard
pgbench scripts in the pgbench documentation.  Then we could say
things like "and you could also modify these".  Right now I tend to
end up cut-and-pasting from the source code, which is fine if you're a
hacker but not so user-friendly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-31 Thread Robert Haas
On Wed, Jul 30, 2014 at 9:00 PM, Mitsumasa KONDO
kondo.mitsum...@gmail.com wrote:
> Hmm... It doesn't have harm for pgbench source code. And, in general,
> checking script is useful for avoiding bug.

Not if nobody runs it, or if people run it but don't know what the
output should look like.  I think anyone who knows enough to find bugs
by running these scripts probably doesn't need the scripts.

> No, patch B is still needed. Please tell me the reason. I don't like
> deciding by someones feeling,
> and it needs logical reason. Our documentation is better than the past. I
> think it can easy to understand decile probability.
> This part of the discussion is needed to continue...
>
>> Would providing these as additional contrib files be more acceptable?
>> Something like tpc-b-gauss.sql... Otherwise there is no example available
>> to show the feature.
>
> I agree the test script and including command line options. It's not harm,
> and it's useful.

As to all of this, I simply don't agree that the stuff has enough
value to justify including it.  Now, of course, that is subjective:
one person may think it has enough value, while another person may
think that it does not have enough value.  So it just comes down to a
question of opinion, and we make those judgements of opinion all the
time.  If we included everything that everyone who works on the code
wants included, we'd end up with a bloated mess of stuff that nobody
cares about; indeed, we have a significant amount of stuff in the
source code that IMHO looks like somebody's debugging leftovers that
should have been removed before commit.  I don't want to add more
unless there is clear and convincing evidence that a significant
number of people want it, and that is not the case here.

Now, if we get a few reports from people saying, "hey, I was doing some
benchmarking with pgbench, and I found the new gaussian feature to be
really excellent, but it sucked that there was no command-line option
for it", we can go back and add one.  No problem!  But in the meantime,
we've added the core of the feature without cluttering up the list of
command-line options with things that may or may not prove to be
useful.

One of the concerns that I have about the proposal of simply slapping
a gaussian or exponential modifier onto \setrandom aid 1 :naccounts is
that, while it will allow you to make part of the relation hot and
another part of the relation cold, you really can't get any more
fine-grained than that.  If you use exponential, all the hot accounts
will be near the beginning of the relation, and if you use gaussian,
they'll all be in the middle.  I'm not sure exactly what will happen after
some updating has happened; I'm guessing some of the keys will still
be in their original location and others will have been pushed to the
end of the relation following relation-extension.  But there's no way,
with those command line options, to for example have 5 hot spots
distributed uniformly through the relation; or even to have the end of
the relation rather than the beginning or the middle as the hot spot.
You can do those things with the newly-enhanced \setrandom *and a custom
script* but not with just a command-line option.  So that makes me
think that people who find these new facilities useful might not get
all that much use out of the command-line option anyway; and we can't
have a command-line option for every behavior anyone ever wants.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-31 Thread Fabien COELHO


Hello Robert,

[...]

> One of the concerns that I have about the proposal of simply slapping a
> gaussian or exponential modifier onto \setrandom aid 1 :naccounts is
> that, while it will allow you to make part of the relation hot and
> another part of the relation cold, you really can't get any more
> fine-grained than that. If you use exponential, all the hot accounts
> will be near the beginning of the relation, and if you use gaussian,
> they'll all be in the middle.


That is a very good remark. Although I thought of it, I do not have a very 
good solution yet:-)


From a testing perspective, if we assume that keys have no semantics, a
reasonable assumption is that the distribution of access for actual
realistic workloads is probably exponential (or gaussian, anyway hardly
uniform), but without direct correlation between key values.


In order to simulate that, we would have to apply a fixed (pseudo-)random
permutation to the exponential-drawn key values. This is a non-trivial
problem. The version zero of solving it is to do nothing... it is the
current status;-) Version one is k' = 1 + (a * k + b) modulo n with a
prime with respect to n (that is, a and n share no common factor), n being
the number of keys. This is nearly possible, but for the modulo operator
which is currently missing, and that I'm planning to submit for this very
reason, but probably another time.
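
As an illustration of "version one", here is a small C sketch of the affine
mapping. It only restates the formula above; affine_permute and
affine_is_permutation are hypothetical helper names, and the code assumes
a >= 1, b >= 0 and that a * k + b does not overflow int64.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * k' = 1 + (a * k + b) modulo n, for k in 1..n.
 * Assumes a >= 1, b >= 0, n >= 1 and that a * k + b fits in int64.
 */
static int64_t
affine_permute(int64_t k, int64_t a, int64_t b, int64_t n)
{
    return 1 + (a * k + b) % n;
}

/*
 * Check by brute force whether k -> k' hits every value of 1..n exactly
 * once.  Returns 1 if so, 0 if two keys collide, -1 on allocation failure.
 */
static int
affine_is_permutation(int64_t a, int64_t b, int64_t n)
{
    unsigned char *seen = calloc((size_t) n + 1, 1);
    int64_t k;
    int ok = 1;

    if (seen == NULL)
        return -1;
    for (k = 1; k <= n && ok; k++)
    {
        int64_t target = affine_permute(k, a, b, n);

        if (seen[target])
            ok = 0;             /* two keys map to the same value */
        seen[target] = 1;
    }
    free(seen);
    return ok;
}
```

With n = 10, a = 7 and b = 3 the mapping is a permutation (keys 1, 2, 3 go to
1, 8, 5), whereas a = 4 shares the factor 2 with n and makes keys 1 and 6
collide on 8. The mapping is cheap and needs no storage, but neighbouring
keys still land a fixed stride apart, which is the crudeness discussed in
this thread.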


> I'm not sure exactly will happen after some updating has happened; I'm
> guessing some of the keys will still be in their original location and
> others will have been pushed to the end of the relation following
> relation-extension.


This is not too bad a side effect. What matters most in the long term is not
the key value correlation, but the actual storage correlation, i.e. whether
two required tuples are in the same page or not. At the beginning of a
simulation, with close key numbers being picked up with an exponential
distribution, the correlation is more than what would be expected.
However, once a significant amount of the table has been updated, this
initial artificial correlation is going to fade, and the test should
become more realistic.


--
Fabien.




Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-31 Thread Robert Haas
On Thu, Jul 31, 2014 at 10:01 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
>> One of the concerns that I have about the proposal of simply slapping a
>> gaussian or exponential modifier onto \setrandom aid 1 :naccounts is that,
>> while it will allow you to make part of the relation hot and another part of
>> the relation cold, you really can't get any more fine-grained than that. If
>> you use exponential, all the hot accounts will be near the beginning of the
>> relation, and if you use gaussian, they'll all be in the middle.
>
> That is a very good remark. Although I thought of it, I do not have a very
> good solution yet:-)
>
> From a testing perspective, if we assume that keys have no semantics, a
> reasonable assumption is that the distribution of access for actual
> realistic workloads is probably exponential (of gaussian, anyway hardly
> uniform), but without direct correlation between key values.
>
> In order to simulate that, we would have to apply a fixed (pseudo-)random
> permutation to the exponential-drawn key values. This is a non trivial
> problem. The version zero of solving it is to do nothing... it is the
> current status;-) Version one is k' = 1 + (a * k + b) modulo n with a
> prime with respect to n, n being the number of keys. This is nearly
> possible, but for the modulo operator which is currently missing, and that
> I'm planning to submit for this very reason, but probably another time.

That's pretty crude, although I don't object to a modulo operator.  It
would be nice to be able to use a truly random permutation, which is
not hard to generate but probably requires O(n) storage, likely a
problem for large scale factors.  Maybe somebody who knows more math
than I do (like you, probably!) can come up with something more
clever.
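
For comparison, the "truly random permutation" with O(n) storage can be
sketched in C with a Fisher-Yates shuffle. This is only an illustration of
the trade-off, not a proposal for pgbench: make_permutation and next_u64 are
hypothetical names, and the splitmix64 generator is an arbitrary choice made
so that a given seed always reproduces the same permutation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Small deterministic generator (splitmix64), an assumption of this sketch. */
static uint64_t
next_u64(uint64_t *state)
{
    uint64_t z = (*state += UINT64_C(0x9e3779b97f4a7c15));

    z = (z ^ (z >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94d049bb133111eb);
    return z ^ (z >> 31);
}

/*
 * Return a malloc'd array perm[0..n-1] holding the keys 1..n in shuffled
 * order, or NULL if n <= 0 or on allocation failure.  The caller frees it.
 * Storage is one int64 per key, which is the O(n) cost mentioned above.
 */
static int64_t *
make_permutation(int64_t n, uint64_t seed)
{
    int64_t *perm;
    int64_t i;

    if (n <= 0)
        return NULL;
    perm = malloc((size_t) n * sizeof(int64_t));
    if (perm == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        perm[i] = i + 1;

    /* Fisher-Yates: swap each slot with a random slot at or below it. */
    for (i = n - 1; i > 0; i--)
    {
        /* the modulo introduces a tiny bias, acceptable for a sketch */
        int64_t j = (int64_t) (next_u64(&seed) % (uint64_t) (i + 1));
        int64_t tmp = perm[i];

        perm[i] = perm[j];
        perm[j] = tmp;
    }
    return perm;
}
```

A drawn key k in 1..n would then be replaced by perm[k - 1]. At 8 bytes per
key, 100,000 accounts per scale-factor unit means roughly 800 kB of memory
per unit, which is where large scale factors become a problem.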

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-30 Thread Robert Haas
On Tue, Jul 29, 2014 at 4:41 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
>> Attached B patch does turn incorrect setrandom syntax into errors instead
>> of ignoring extra parameters.
>>
>> First A patch is repeated to help commitfest references.
>
> Oops, I applied the change on the wrong part:-(
>
> Here is the change on part A which checks setrandom syntax, and B for
> completeness.

I've committed the changes to pgbench.c and the documentation changes
with some further wordsmithing.  I don't think including the other
changes in patch A is a good idea, nor am I in favor of patch B.  But
thanks for your and Kondo-san's hard work on this; I think this will
be quite useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-30 Thread Fabien COELHO


Hello Robert,

> I've committed the changes to pgbench.c and the documentation changes
> with some further wordsmithing.


Ok, thanks a lot for your reviews and your help with improving the 
documentation.



> I don't think including the other changes in patch A is a good idea,


Fine. It was mostly for testing and checking purposes.


> nor am I in favor of patch B.


Yep. Would providing these as additional contrib files be more acceptable? 
Something like tpc-b-gauss.sql... Otherwise there is no example 
available to show the feature.


Thanks again,

--
Fabien.




Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-30 Thread Mitsumasa KONDO
Hi,

2014-07-31 5:18 GMT+09:00 Fabien COELHO coe...@cri.ensmp.fr:

>> I've committed the changes to pgbench.c and the documentation changes
>> with some further wordsmithing.
>
> Ok, thanks a lot for your reviews and your help with improving the
> documentation.

Yeah, thanks to everyone involved.

>> I don't think including the other changes in patch A is a good idea,
>
> Fine. It was mostly for testing and checking purposes.

Hmm... It does no harm to the pgbench source code. And, in general,
a checking script is useful for avoiding bugs.

>> nor am I in favor of patch B.
>
> Yep.

No, patch B is still needed. Please tell me the reason. I don't like
decisions based on someone's feeling;
there needs to be a logical reason. Our documentation is better than in the
past. I think the decile probabilities are now easy to understand.
This part of the discussion needs to continue...

> Would providing these as additional contrib files be more acceptable?
> Something like tpc-b-gauss.sql... Otherwise there is no example available
> to show the feature.

I agree with the test script and with including the command line options.
It does no harm, and it's useful.

Best regards,
--
Mitsumasa KONDO


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-29 Thread Fabien COELHO


Hello Robert,

>> I wish to agree, but my interpretation of the previous code is that
>> they were ignored before, so ISTM that we are stuck with keeping the
>> same unfortunate behavior.
>
> I don't agree.  I'm not in a huge hurry to fix all the places where
> pgbench currently lacks error checks just because I don't have enough to
> do (hint: I do have enough to do), but when we're adding more
> complicated syntax in one particular place, bringing the error checks in
> that portion of the code up to scratch is an eminently sensible thing to
> do, and we should do it.

Ok. I'm in favor of that anyway. It is just that I was afraid that changing
behavior, however poor the said behavior, could be a blocker.

> Also, please stop changing the title of this thread every other post.
> It breaks threading for me (and anyone else using gmail), and that
> makes the thread hard to follow.

Sorry. It does not break my mailer which relies on internal headers, but
I'll try to be compatible with this gmail feature in the future.


--
Fabien.




Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-29 Thread Fabien COELHO


Hello Robert,


>>> 3. Similarly, I suggest that the use of gaussian or uniform be an
>>> error when argc < 6 OR argc > 6.  I also suggest that the
>>> parenthesized distribution type be dropped from the error message in
>>> all cases.
>>
>> I wish to agree, but my interpretation of the previous code is that they
>> were ignored before, so ISTM that we are stuck with keeping the same
>> unfortunate behavior.
>
> I don't agree.


Attached B patch does turn incorrect setrandom syntax into errors instead 
of ignoring extra parameters.


First A patch is repeated to help commitfest references.

--
Fabien.

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..e07206a 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	Assert(threshold > 0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if threshold > 0),
+	 * rand in [0, 1)
+	 */
+	Assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold < stdev <= threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test. To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -threshold || stdev >= threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + threshold) / (threshold * 2.0);
+
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1391,7 @@ top:
 			char	   *var;
 			int64		min,
 		max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1437,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max >= min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * Generate random number functions need to be able to subtract
+			 * max from min and add one to the result without overflowing.
+			 * Since we know max >= min, we can detect overflow just by checking
+			 * for a negative result. But we must check both that the subtraction
+			 * doesn't overflow, and that adding one to the 

Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-29 Thread Fabien COELHO


> Attached B patch does turn incorrect setrandom syntax into errors instead of
> ignoring extra parameters.
>
> First A patch is repeated to help commitfest references.


Oops, I applied the change on the wrong part:-(

Here is the change on part A which checks setrandom syntax, and B for 
completeness.


--
Fabien.

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..16e44bd 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	Assert(threshold > 0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if threshold > 0),
+	 * rand in [0, 1)
+	 */
+	Assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold < stdev <= threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test. To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -threshold || stdev >= threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + threshold) / (threshold * 2.0);
+
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1391,7 @@ top:
 			char	   *var;
 			int64		min,
 		max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1437,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max >= min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * Generate random number functions need to be able to subtract
+			 * max from min and add one to the result without overflowing.
+			 * Since we know max >= min, we can detect overflow just by checking
+			 * for a negative result. But we must check both that the subtraction
+			 * doesn't overflow, and that adding one to the result doesn't overflow either.
 			 */
 			if (max - min < 0 || (max - min) + 1 < 0)
 			{
@@ -1377,10 +1450,64 @@ top:
 return true;
 			}
 
+			if (argc == 4 || /* uniform without or with uniform keyword */
+				(argc == 5 && pg_strcasecmp(argv[4], "uniform") == 0))
+			

Re: [HACKERS] gaussian distribution pgbench

2014-07-28 Thread Heikki Linnakangas

On 07/17/2014 11:13 PM, Fabien COELHO wrote:



>>> However, ISTM that it is not the purpose of pgbench documentation to be a
>>> primer about what is an exponential or gaussian distribution, so the idea
>>> would yet be to have a relatively compact explanation, and that the
>>> interested but clueless reader would document h..self from wikipedia or a
>>> text book or a friend or a math teacher (who could be a friend as well:-).
>>
>> Well, I think it's a balance.  I agree that the pgbench documentation
>> shouldn't try to substitute for a text book or a math teacher, but I
>> also think that you shouldn't necessarily need to refer to a text book
>> or a math teacher in order to figure out how to use pgbench.  Saying
>> "it's complicated, so we don't have to explain it" would be a cop out;
>> we need to *make* it simple.  And if there's no way to do that, then
>> IMHO we should reject the patch in favor of some future patch that
>> implements something that will be easy for users to understand.
>>
>>> [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
>>> starting vacuum...end.
>>> transaction type: Exponential distribution TPC-B (sort of)
>>> scaling factor: 1
>>> exponential threshold: 10.0
>>>
>>> decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
>>> highest/lowest percent of the range: 9.5% 0.0%
>>
>> I don't have a clue what that means.  None.
>>
>>> Maybe we could add in front of the decile/percent
>>>
>>> distribution of increasing account key values selected by pgbench:
>>
>> I still wouldn't know what that meant.  And it misses the point
>> anyway: if the documentation is good, this will be unnecessary.  If
>> the documentation is bad, a printout that tries to illustrate it by
>> example is not an acceptable substitute.
>
> The decile description is quite classic when discussing statistics.


IMHO we should include a diagram for each distribution. A diagram would
be much easier to understand than a decile or verbal explanation.


The only problem is that the build infrastructure doesn't currently 
support including images in the docs. That's been discussed before, and 
I think we even used to have a couple of images there a long time ago. 
Now would be a good time to bite the bullet and add the support.
We got fairly close to a consensus on how to do it in this thread: 
www.postgresql.org/message-id/flat/20120712181636.gc11...@momjian.us. 
The biggest problem was choosing an editor that has a fairly stable file 
format, so that we don't get huge diffs every time someone moves a line 
in a diagram. One work-around for that is to use graphviz and/or gnuplot 
as the source format, instead of a graphical editor.


- Heikki





Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-28 Thread Robert Haas
On Wed, Jul 23, 2014 at 12:39 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:
>> 3. Similarly, I suggest that the use of gaussian or uniform be an
>> error when argc < 6 OR argc > 6.  I also suggest that the
>> parenthesized distribution type be dropped from the error message in
>> all cases.
>
> I wish to agree, but my interpretation of the previous code is that they
> were ignored before, so ISTM that we are stuck with keeping the same
> unfortunate behavior.

I don't agree.  I'm not in a huge hurry to fix all the places where
pgbench currently lacks error checks just because I don't have enough
to do (hint: I do have enough to do), but when we're adding more
complicated syntax in one particular place, bringing the error checks
in that portion of the code up to scratch is an eminently sensible
thing to do, and we should do it.

Also, please stop changing the title of this thread every other post.
It breaks threading for me (and anyone else using gmail), and that
makes the thread hard to follow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] gaussian distribution pgbench -- splits Bv6

2014-07-25 Thread Mitsumasa KONDO
Thanks for modifying the patch! I confirmed that it seems to be fine.

I think that our latest patch addresses all community comments.
So it is really ready for committer now.

Best regards,
--
Mitsumasa KONDO


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-24 Thread Mitsumasa KONDO
Hi,

Thank you for your great documentation and fixes!!!
They are very helpful for understanding our feature.

I added two features in gauss_B_4.patch.

1) Add gaussianProbability() function
It is the same as exponentialProbability(), and the behavior is the same as
before.

2) Add result of max/min percent of the range
It is almost the same as the --exponential option's result. However, the max
percent of the range is the center of the distribution
and the min percent of the range is the outermost side of the distribution.
Here is the output example,

+ pgbench_account's aid selected with a truncated gaussian distribution
+ standard deviation threshold: 5.0
+ decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%
+ probability of max/min percent of the range: 4.0% 0.0%


I also added an explanation of this to the documentation.

I really appreciate your work!!!


Best regards,

--

Mitsumasa KONDO


gauss_B_5.patch
Description: Binary data



Re: [HACKERS] gaussian distribution pgbench -- splits Bv6

2014-07-24 Thread Fabien COELHO



Thank you for your great documentation and fixes!!!
They are very helpful for understanding our feature.


Hopefully it will help make it, or part of it, pass through.


I added two features in gauss_B_4.patch.

1) Add gaussianProbability() function
It is the counterpart of exponentialProbability(), and the behavior is the
same as before.


Ok, that is better for readability and easy reuse.


2) Add the max/min percent of the range to the result
It is almost the same as the --exponential option's result. However, the max
percent of the range is at the center of the distribution
and the min percent of the range is at the far edges of the distribution.
Here is an output example:


Ok, good, that makes it homogeneous with the exponential case.


+ pgbench_account's aid selected with a truncated gaussian distribution
+ standard deviation threshold: 5.0
+ decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%
+ probability of max/min percent of the range: 4.0% 0.0%



I also added an explanation of this to the documentation.


This is a definite improvement. I tested these minor changes and 
everything seems ok.


Attached is a very small update. One word removed from the doc, and one 
redundant declaration removed from the code.


I also have a problem with assert & Assert.  I finally figured out that 
Assert is not compiled in by default, thus it is generally ignored. So it 
is more for debugging purposes when activated than for guarding against 
some unexpected user errors.


--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index e07206a..0247a05 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -173,6 +174,11 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian/exponential distribution tests */
+double		threshold;  /* threshold for gaussian or exponential */
+bool		use_gaussian = false;
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -294,11 +300,11 @@ static int	num_commands = 0;	/* total number of Command structs */
 static int	debug = 0;			/* debug flag */
 
 /* default scenario */
-static char *tpc_b = {
+static char *tpc_b_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -312,11 +318,11 @@ static char *tpc_b = {
 };
 
 /* -N case */
-static char *simple_update = {
+static char *simple_update_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -328,9 +334,9 @@ static char *simple_update = {
 };
 
 /* -S case */
-static char *select_only = {
+static char *select_only_fmt = {
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
@@ -377,6 +383,8 @@ usage(void)
 		   "  -v, --vacuum-all         vacuum all four standard tables before tests\n"
 		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
 		   "  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+		   "  --exponential=NUM        exponential distribution with NUM threshold parameter\n"
+		   "  --gaussian=NUM           gaussian distribution with NUM threshold parameter\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug              print debugging output\n"
 		   "  -h, --host=HOSTNAME      database server host or socket directory\n"
@@ -2329,6 +2337,30 @@ process_builtin(char *tb)
 	return my_commands;
 }
 
+/*
+ * compute the probability of the truncated gaussian random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double gaussianProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (0.50 * (erf (threshold * (1.0 - 1.0 / slots * (2.0 * i - 2.0)) / sqrt(2.0)) -
+		erf (threshold * (1.0 - 1.0 / slots * 2.0 * i) / sqrt(2.0))) /
+		erf (threshold / sqrt(2.0)));
+}
+
+/*
+ * compute the probability of the truncated exponential random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double exponentialProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (exp(- threshold * (i - 1) / slots) - exp(- threshold * i / slots)) /
+		

Re: [HACKERS] gaussian distribution pgbench -- splits Bv6

2014-07-24 Thread Alvaro Herrera
Fabien COELHO wrote:

 I also have a problem with assert & Assert.  I finally figured out
 that Assert is not compiled in by default, thus it is generally
 ignored. So it is more for debugging purposes when activated than
 for guarding against some unexpected user errors.

Yes, Assert() is for debugging during development.  If you need to deal
with user error, use regular if () and exit() as appropriate (ereport()
in the backend).  We mostly avoid assert() in our own code.
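
To illustrate the distinction for a frontend program such as pgbench, here is a hedged sketch of how a threshold argument might be validated. MIN_GAUSSIAN_THRESHOLD comes from the patch; the helper names parse_gaussian_threshold and gaussian_threshold_or_exit, and the wording of the error message, are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MIN_GAUSSIAN_THRESHOLD 2.0	/* same constant as in the patch */

/*
 * Parse a user-supplied threshold. Returns 1 and stores the value on
 * success, 0 if the text is not a number or is below the minimum.
 * (Hypothetical helper, for illustration only.)
 */
static int
parse_gaussian_threshold(const char *arg, double *threshold)
{
	char	   *end;
	double		value;

	/* programmer error: compiled out with NDEBUG, so never rely on it */
	assert(arg != NULL && threshold != NULL);

	value = strtod(arg, &end);
	/* the negated >= comparison also rejects NaN */
	if (end == arg || *end != '\0' || !(value >= MIN_GAUSSIAN_THRESHOLD))
		return 0;
	*threshold = value;
	return 1;
}

/* user error: always checked, reported on stderr, then exit */
static double
gaussian_threshold_or_exit(const char *arg)
{
	double		threshold;

	if (!parse_gaussian_threshold(arg, &threshold))
	{
		fprintf(stderr, "gaussian threshold must be at least %.1f (got \"%s\")\n",
				MIN_GAUSSIAN_THRESHOLD, arg);
		exit(1);
	}
	return threshold;
}
```

The assert() documents an invariant for developers, while the if/exit path is what an end user would actually hit when passing something like 1.5 or a non-numeric string.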

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] gaussian distribution pgbench -- part 1/2

2014-07-23 Thread Robert Haas
On Thu, Jul 17, 2014 at 12:09 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 pgbench with gaussian & exponential, part 1 of 2.

 This patch is a subset of the previous patch which only adds the two
 new \setrandom gaussian and exponential variants, but not the
 adapted pgbench test cases, as suggested by Fujii Masao.
 There is no new code nor code changes.

 The corresponding documentation has been yet again extended with respect
 to the initial patch, so that what is achieved is hopefully unambiguous
 (there are two mathematical formulas, tasty!), in answer to Andres Freund's
 comments, and partly to Robert Haas's comments as well.

 This patch also provides several sql/pgbench scripts and a README, so
 that the feature can be tested. I do not know whether these scripts
 should make it to postgresql. I would say yes, otherwise there is no way
 to test...

 Part 2, which provides adapted pgbench test cases, will come later.

Some review comments:

1. I suggest that getExponentialrand and getGaussianrand be renamed to
getExponentialRand and getGaussianRand.

2. I suggest that the code be changed so that the branch currently
labeled as /* uniform with extra argument */ become a hard error
instead of a warning.

3. Similarly, I suggest that the use of gaussian or uniform be an
error when argc < 6 OR argc > 6.  I also suggest that the
parenthesized distribution type be dropped from the error message in
all cases.

4. This question mark seems like it should be a period:

+		 * value fails the test? To be on the safe side, let us try over.

5. With regards to the following paragraph:

  <para>
+  The default random distribution is uniform, that is all values in the
+  range are drawn with equal probability. The gaussian and exponential
+  options allow to change this default. The mandatory
+  <replaceable>threshold</> double value controls the actual distribution
+  with gaussian or exponential.
+ </para>

This paragraph needs a bit of copy-editing.  Here's an attempt: "By
default, all values in the range are drawn with equal probability.
The <literal>gaussian</> and <literal>exponential</> options modify
this behavior; each requires a mandatory threshold which determines
the precise shape of the distribution."  The following paragraph
should be changed to begin with "For a Gaussian distribution" and the
one after "For an exponential distribution".

6. Overall, I think the documentation here looks much better now, but
I suggest adding one or two examples to the Gaussian section.  Like
this: "for example, if threshold is 2.0, 68% of the values will fall in
the middle third of the interval; with a threshold of 3.0, 99.7% of
the values will fall in the middle third of the interval."  These
numbers are fabricated, and the middle third of the interval might not
be the best part to talk about, but you get the idea (I hope).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-23 Thread Fabien COELHO


Hello Robert,


Some review comments:


Thanks a lot for your feedback.

Please find attached two new parts of the patch (A for setrandom 
extension, B for pgbench embedded test case extension).



1. I suggest that getExponentialrand and getGaussianrand be renamed to
getExponentialRand and getGaussianRand.


Done.

It was named like that because getrand was used for the uniform case.



2. I suggest that the code be changed so that the branch currently
labeled as /* uniform with extra argument */ become a hard error
instead of a warning.

3. Similarly, I suggest that the use of gaussian or uniform be an
error when argc < 6 OR argc > 6.  I also suggest that the
parenthesized distribution type be dropped from the error message in
all cases.


I wish to agree, but my interpretation of the previous code is that they 
were ignored before, so ISTM that we are stuck with keeping the same 
unfortunate behavior.



4. This question mark seems like it should be a period:

+  * value fails the test? To be on the safe side, let us try over.


Indeed.


5. With regards to the following paragraph:

 <para>
+  The default random distribution is uniform, that is all values in the
+  range are drawn with equal probability. The gaussian and exponential
+  options allow to change this default. The mandatory
+  <replaceable>threshold</> double value controls the actual distribution
+  with gaussian or exponential.
+ </para>

This paragraph needs a bit of copy-editing.  Here's an attempt: "By
default, all values in the range are drawn with equal probability.
The <literal>gaussian</> and <literal>exponential</> options modify
this behavior; each requires a mandatory threshold which determines
the precise shape of the distribution."  The following paragraph
should be changed to begin with "For a Gaussian distribution" and the
one after "For an exponential distribution".


Ok. I've kept "uniform" in the first sentence, because it is both
an option name and significant in terms of probabilities.


6. Overall, I think the documentation here looks much better now, but
I suggest adding one or two examples to the Gaussian section.  Like
this: "for example, if threshold is 2.0, 68% of the values will fall in
the middle third of the interval; with a threshold of 3.0, 99.7% of
the values will fall in the middle third of the interval."  These
numbers are fabricated, and the middle third of the interval might not
be the best part to talk about, but you get the idea (I hope).


Done with threshold value 4.0 so I have a middle quarter and a middle 
half.


--
Fabien.

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 0000000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..e07206a 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	Assert(threshold > 0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if threshold > 0),
+	 * rand in [0, 1)
+	 */
+	Assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold < stdev <= threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. 

Re: [HACKERS] gaussian distribution pgbench

2014-07-22 Thread Fabien COELHO


Please find attached 2 patches, which are a split of the patch discussed in 
this thread.


Please find attached a very minor improvement to apply a code (variable 
name) simplification directly in patch A so as to avoid a change in patch 
B. The cumulated patch is the same as previous.



(A) add gaussian & exponential options to pgbench \setrandom
   the patch includes sql test files.

There is no change in the *code* from previous already reviewed submissions, 
so I do not think that it needs another review on that account.


However I have (yet again) reworked the *documentation* (for Andres Freund & 
Robert Haas), in particular both descriptions now follow the same structure 
(introduction, formula, intuition, rule of thumb and constraint). I have 
differentiated the concept and the option by putting the latter in <literal> 
tags, and added a link to the corresponding wikipedia pages.



Please bear in mind that:
1. English is not my native language.
2. this is not easy reading... this is maths, to read slowly:-)
3. word smithing contributions are welcome.

I assume somehow that a user interested in gaussian & exponential 
distributions must know a little bit about probabilities...




(B) add pgbench test variants with gauss & exponential.

I have reworked the patch so as to avoid copy pasting the 3 test cases, as 
requested by Andres Freund, thus this is new, although quite simple, code. I 
have also added explanations in the documentation about how to interpret the 
decile outputs, so as to hopefully address Robert Haas's comments.


--
Fabien.

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 0000000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..379ef24 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -98,6 +99,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +474,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialrand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	assert(threshold > 0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if threshold > 0),
+	 * rand in [0, 1)
+	 */
+	assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianrand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold < stdev <= threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+ 		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test? To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -threshold || stdev >= threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) 

Re: [HACKERS] gaussian distribution pgbench

2014-07-18 Thread Mitsumasa KONDO
2014-07-18 5:13 GMT+09:00 Fabien COELHO coe...@cri.ensmp.fr:


  However, ISTM that it is not the purpose of pgbench documentation to be a
 primer about what is an exponential or gaussian distribution, so the idea
 would yet be to have a relatively compact explanation, and that the
 interested but clueless reader would document h..self from wikipedia or a
 text book or a friend or a math teacher (who could be a friend as
 well:-).


 Well, I think it's a balance.  I agree that the pgbench documentation
 shouldn't try to substitute for a text book or a math teacher, but I
 also think that you shouldn't necessarily need to refer to a text book
 or a math teacher in order to figure out how to use pgbench.  Saying
 "it's complicated, so we don't have to explain it" would be a cop out;
 we need to *make* it simple.  And if there's no way to do that, then
 IMHO we should reject the patch in favor of some future patch that
 implements something that will be easy for users to understand.

   [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
 starting vacuum...end.
 transaction type: Exponential distribution TPC-B (sort of)
 scaling factor: 1
 exponential threshold: 10.0

 decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
 highest/lowest percent of the range: 9.5% 0.0%


 I don't have a clue what that means.  None.


 Maybe we could add in front of the decile/percent

 distribution of increasing account key values selected by pgbench:


 I still wouldn't know what that meant.  And it misses the point
 anyway: if the documentation is good, this will be unnecessary.  If
 the documentation is bad, a printout that tries to illustrate it by
 example is not an acceptable substitute.


 The decile description is quite classic when discussing statistics.

Yeah, maybe. Fabien-san and I find it hard to believe that he doesn't know
what decile percentages are.
However, I think more description of the deciles is needed.

For example, when we set the number of transactions to 10,000 (-t 10000),
the range of aid is 100,000,
and --exponential is 10, the decile percents are as follows, as you know.

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

They mean that,
#number of accesses in each range of aid (from decile percents):
  1 to 10,000       = 6,320 times
  10,001 to 20,000  = 2,330 times
  20,001 to 30,000  = 860 times
  ...
  90,001 to 100,000 = 0 times

#number of accesses in each range of aid (from highest/lowest percent of the
range):
  1 to 1,000        = 950 times
  ...
  99,001 to 100,000 = 0 times

that's all.

This information makes the distribution of access probability easy to
understand, doesn't it?
Maybe Fabien-san and I have a background in mathematics, so we think decile
percentages are common sense.
But if they are not common sense, I agree with adding these explanations
to the documentation.

Best regards,
--
Mitsumasa KONDO


Re: [HACKERS] gaussian distribution pgbench

2014-07-18 Thread Fabien COELHO



For example, when we set the number of transactions to 10,000 (-t 10000),
the range of aid is 100,000,
and --exponential is 10, the decile percents are as follows, as you know.

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

They mean that,
#number of accesses in each range of aid (from decile percents):
 1 to 10,000       = 6,320 times
 10,001 to 20,000  = 2,330 times
 20,001 to 30,000  = 860 times
 ...
 90,001 to 100,000 = 0 times

#number of accesses in each range of aid (from highest/lowest percent of the
range):
 1 to 1,000        = 950 times
 ...
 99,001 to 100,000 = 0 times

that's all.

This information makes the distribution of access probability easy to
understand, doesn't it?
Maybe Fabien-san and I have a background in mathematics, so we think decile
percentages are common sense.
But if they are not common sense, I agree with adding these explanations
to the documentation.


What we are talking about is the summary at the end of the run, which is 
expected to be compact, hence the terse few lines.


I'm not sure how to make it explicit without extending the summary too 
much, so it would not be a summary anymore:-)


My initial assumption is that anyone interested enough in changing the 
default uniform distribution for a test would know about decile, but that 
seems to be optimistic.


Maybe it would be okay to keep a terse summary but to expand the 
documentation to explain what it means, as you suggested above...


--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2014-07-18 Thread Fabien COELHO


Please find attached 2 patches, which are a split of the patch discussed 
in this thread.


(A) add gaussian & exponential options to pgbench \setrandom
the patch includes sql test files.

There is no change in the *code* from previous already reviewed 
submissions, so I do not think that it needs another review on that 
account.


However I have (yet again) reworked the *documentation* (for Andres Freund 
& Robert Haas), in particular both descriptions now follow the same 
structure (introduction, formula, intuition, rule of thumb and 
constraint). I have differentiated the concept and the option by putting 
the latter in <literal> tags, and added a link to the corresponding 
wikipedia pages.



Please bear in mind that:
 1. English is not my native language.
 2. this is not easy reading... this is maths, to read slowly:-)
 3. word smithing contributions are welcome.

I assume somehow that a user interested in gaussian & exponential 
distributions must know a little bit about probabilities...




(B) add pgbench test variants with gauss & exponential.

I have reworked the patch so as to avoid copy pasting the 3 test cases, as 
requested by Andres Freund, thus this is new, although quite simple, code. 
I have also added explanations in the documentation about how to interpret 
the decile outputs, so as to hopefully address Robert Haas's comments.


--
Fabien.

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 0000000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..a80c0a5 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -98,6 +99,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +474,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-exp_threshold).
+ */
+static int64
+getExponentialrand(TState *thread, int64 min, int64 max, double exp_threshold)
+{
+	double cut, uniform, rand;
+	assert(exp_threshold > 0.0);
+	cut = exp(-exp_threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if exp_threshold > 0),
+	 * rand in [0, 1)
+	 */
+	assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / exp_threshold;
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -stdev_threshold < stdev <= stdev_threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+ 		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test? To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -stdev_threshold || stdev >= stdev_threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + stdev_threshold) / (stdev_threshold * 2.0);
+
+	/* return int64 random number within between min and max */
+	return 

Re: [HACKERS] gaussian distribution pgbench

2014-07-17 Thread Robert Haas
On Wed, Jul 16, 2014 at 12:57 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 Well, I think the feedback has been pretty clear, honestly.  Here's
 what I'm unhappy about: I can't understand what these options are
 actually doing.

 We can try to improve the documentation, once more!

 However, ISTM that it is not the purpose of pgbench documentation to be a
 primer about what is an exponential or gaussian distribution, so the idea
 would yet be to have a relatively compact explanation, and that the
 interested but clueless reader would document h..self from wikipedia or a
 text book or a friend or a math teacher (who could be a friend as well:-).

Well, I think it's a balance.  I agree that the pgbench documentation
shouldn't try to substitute for a text book or a math teacher, but I
also think that you shouldn't necessarily need to refer to a text book
or a math teacher in order to figure out how to use pgbench.  Saying
"it's complicated, so we don't have to explain it" would be a cop out;
we need to *make* it simple.  And if there's no way to do that, then
IMHO we should reject the patch in favor of some future patch that
implements something that will be easy for users to understand.

  [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
 starting vacuum...end.
 transaction type: Exponential distribution TPC-B (sort of)
 scaling factor: 1
 exponential threshold: 10.0

 decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
 highest/lowest percent of the range: 9.5% 0.0%

 I don't have a clue what that means.  None.

 Maybe we could add in front of the decile/percent

 distribution of increasing account key values selected by pgbench:

I still wouldn't know what that meant.  And it misses the point
anyway: if the documentation is good, this will be unnecessary.  If
the documentation is bad, a printout that tries to illustrate it by
example is not an acceptable substitute.

 Here is an example of an explanation that would make sense to me.
 This is not the actual behavior of your patch, I'm quite sure, so this
 is just an example of the *kind* of explanation that I think is
 needed:

 This is more or less the approximate behavior of the patch, but for 1% of
 the range, not 50%. However I'm not sure that the current documentation is
 so bad.

I think it isn't, because in the system I described, a larger value
indicates a flatter distribution, but in the documentation, a smaller
value indicates a flatter distribution.  That having been said, I
agree the current documentation for the exponential distribution is
not too bad.  But this part does not make sense:

+  A crude approximation of the distribution is that the most frequent 1%
+  values are drawn <replaceable>threshold</>% of the time.
+  The closer to 0.0 the threshold, the flatter (more uniform) the access
+  distribution.

Given the first statement, I'd expect the lowest possible threshold to
be 0.01, not 0.

The documentation for the Gaussian distribution is in somewhat worse
shape.  Unlike the documentation for exponential, it makes no attempt
at all to give the user a clear idea what the distribution actually
looks like.  The closest it comes is this:

+  In other worlds, the larger the <replaceable>threshold</>,
+  the narrower the access range around the middle.

But that's not really very close - there's no way for a user to judge
what impact the threshold parameter actually has except to try it.
Unlike the discussion of exponential, which contains a fairly-precise
mathematical characterization of the behavior, the Gaussian stuff has
nothing except a hand-wavy explanation that a higher threshold skews
the distribution more.  (Also, the English expression is "in other
words" not "in other worlds" - but in fact the phrase has no business
in that sentence at all, because it is not reiterating the contents of
the previous sentence in different language, but rather making a new
point entirely.  And the following sentence does not start with a
capital letter, though maybe that's because it was intended to be
incorporated into this sentence somehow.)

I think that you also need to consider which instances of the words
"gaussian" and "exponential" are referring to the option and which are
referring to the abstract mathematical concept.  When you're talking
about the option, you should use all lower-case (as you've done) but
with <literal> tags or similar.  When you're referring to the
mathematical distribution, "Gaussian" should be capitalized.

BTW, I agree with both Heikki's suggestion that we make these options
to setrandom only and not expose command-line options for them, and
with Andres's critique that the documentation of those options is far
too repetitive.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] gaussian distribution pgbench

2014-07-17 Thread Fabien COELHO



However, ISTM that it is not the purpose of pgbench documentation to be a
primer about what is an exponential or gaussian distribution, so the idea
would yet be to have a relatively compact explanation, and that the
interested but clueless reader would document h..self from wikipedia or a
text book or a friend or a math teacher (who could be a friend as well:-).


Well, I think it's a balance.  I agree that the pgbench documentation
shouldn't try to substitute for a text book or a math teacher, but I
also think that you shouldn't necessarily need to refer to a text book
or a math teacher in order to figure out how to use pgbench.  Saying
"it's complicated, so we don't have to explain it" would be a cop-out;
we need to *make* it simple.  And if there's no way to do that, then
IMHO we should reject the patch in favor of some future patch that
implements something that will be easy for users to understand.


 [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.0

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%


I don't have a clue what that means.  None.


Maybe we could add in front of the decile/percent

distribution of increasing account key values selected by pgbench:


I still wouldn't know what that meant.  And it misses the point
anyway: if the documentation is good, this will be unnecessary.  If
the documentation is bad, a printout that tries to illustrate it by
example is not an acceptable substitute.


The decile description is quite classic when discussing statistics.


Here is an example of an explanation that would make sense to me.
This is not the actual behavior of your patch, I'm quite sure, so this
is just an example of the *kind* of explanation that I think is
needed:


This is more or less the approximate behavior of the patch, but for 1% of
the range, not 50%. However I'm not sure that the current documentation is
so bad.


I think it isn't, because in the system I described, a larger value
indicates a flatter distribution, but in the documentation, a smaller
value indicates a flatter distribution.


Ok. But the general thrust was ok.

That having been said, I agree the current documentation for the 
exponential distribution is not too bad.  But this part does not make 
sense:


+  A crude approximation of the distribution is that the most frequent 1%
+  values are drawn <replaceable>threshold</>% of the time.


I'm trying to be nice to the reader by providing intuitive 
information. I do not seem to succeed:-) I'm attempting to say that when 
you draw from a range, say 1 to 1000, the first 1%, i.e. values 1 to 10,
are drawn about threshold% of the time.

If I draw from one hundred values:

\setrandom x 1 100 exponential 10.0

The 1 will be drawn about 10% of the time, and the 99 next values will 
share the remaining 90%.



+  The closer to 0.0 the threshold, the flatter (more uniform) the access
+  distribution.

Given the first statement, I'd expect the lowest possible threshold to
be 0.01, not 0.


This is in the sense of epsilon, a small number close to 0 but different 
from 0. The lowest possible threshold is the smallest 
strictly positive value representable as a double.



The documentation for the Gaussian distribution is in somewhat worse
shape.  Unlike the documentation for exponential, it makes no attempt
at all to give the user a clear idea what the distribution actually
looks like.  The closest it comes is this:

+  In other worlds, the larger the <replaceable>threshold</>,
+  the narrower the access range around the middle.

But that's not really very close - there's no way for a user to judge
what impact the threshold parameter actually has except to try it.
Unlike the discussion of exponential, which contains a fairly-precise
mathematical characterization of the behavior,


I have now added a precise formula for Gaussian. Even with the formula, 
you may still want to see the deciles to get an intuition.


I think that we assumed that the reader would know that a gaussian 
distribution is the classic bell-shaped distribution, and if not .?he 
would not be interested anyway.



the Gaussian stuff has
nothing except a hand-wavy explanation that a higher threshold skews
the distribution more.  (Also, the English expression is "in other
words" not "in other worlds" - but in fact the phrase has no business
in that sentence at all, because it is not reiterating the contents of
the previous sentence in different language, but rather making a new
point entirely.  And the following sentence does not start with a
capital letter, though maybe that's because it was intended to be
incorporated into this sentence somehow.)

I think that you also need to consider which instances of the words
gaussian and exponential are referring to the option and 

Re: [HACKERS] gaussian distribution pgbench -- part 1/2

2014-07-16 Thread Fabien COELHO


pgbench with gaussian & exponential, part 1 of 2.

This patch is a subset of the previous patch which only adds the two
new \setrandom gaussian and exponential variants, but not the
adapted pgbench test cases, as suggested by Fujii Masao.
There is no new code nor code changes.

The corresponding documentation has yet again been extended with respect
to the initial patch, so that what is achieved is hopefully unambiguous
(there are two mathematical formulas, tasty!), in answer to Andres Freund's
comments, and partly to Robert Haas's comments as well.

This patch also provides several sql/pgbench scripts and a README, so
that the feature can be tested. I do not know whether these scripts
should make it to postgresql. I would say yes, otherwise there is no way
to test...

Part 2, which provides the adapted pgbench test cases, will come later.

--
Fabien.

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 000..4b8fd59
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test < test-init.sql
+./pgbench -f test-XXX-run.sql -t 100 -P 1 test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..a80c0a5 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -98,6 +99,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +474,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-exp_threshold).
+ */
+static int64
+getExponentialrand(TState *thread, int64 min, int64 max, double exp_threshold)
+{
+	double cut, uniform, rand;
+	assert(exp_threshold > 0.0);
+	cut = exp(-exp_threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if exp_threshold > 0),
+	 * rand in [0, 1)
+	 */
+	assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / exp_threshold;
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -stdev_threshold < stdev <= stdev_threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) >= 2 => r <= e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+ 		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test? To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -stdev_threshold || stdev >= stdev_threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + stdev_threshold) / (stdev_threshold * 2.0);
+
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1392,7 @@ top:
 			char	   *var;
 			int64		min,
 		max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1438,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max > min, we
-			 * can detect overflow just by checking for a negative result. But
-		

Re: [HACKERS] gaussian distribution pgbench

2014-07-15 Thread Fabien COELHO


Hello Robert,


Well, I think the feedback has been pretty clear, honestly.  Here's
what I'm unhappy about: I can't understand what these options are
actually doing.


We can try to improve the documentation, once more!

However, ISTM that it is not the purpose of pgbench documentation to be a 
primer about what is an exponential or gaussian distribution, so the idea 
would yet be to have a relatively compact explanation, and that the 
interested but clueless reader would document h..self from wikipedia or a 
text book or a friend or a math teacher (who could be a friend as well:-).



 [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.0

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%


I don't have a clue what that means.  None.


Maybe we could add in front of the decile/percent

distribution of increasing account key values selected by pgbench:


Here is an example of an explanation that would make sense to me.
This is not the actual behavior of your patch, I'm quite sure, so this
is just an example of the *kind* of explanation that I think is
needed:


This is more or less the approximate behavior of the patch, but for 1% of 
the range, not 50%. However I'm not sure that the current documentation is 
so bad.



The --exponential option causes pgbench to select lower-numbered
account IDs exponentially more frequently than higher-numbered account
IDs.  The argument to --exponential controls the strength of the
preference for lower-numbered account IDs, with a smaller value
indicating a stronger preference.  Specifically, it is the percentage
of the total number of account IDs which will receive half the total
accesses.  For example, with --exponential=10, half the accesses will
be to the smallest 10 percent of the account IDs; half the remaining
accesses will be to the next-smallest 10 percent of account IDs, and
so on.  --exponential=50 therefore represents a completely flat
distribution; larger values are not allowed.


--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2014-07-14 Thread Robert Haas
On Sun, Jul 13, 2014 at 2:27 AM, Mitsumasa KONDO
kondo.mitsum...@gmail.com wrote:
 I still agree with Fabien-san. I cannot understand why our logical proposal
 isn't accepted...

Well, I think the feedback has been pretty clear, honestly.  Here's
what I'm unhappy about: I can't understand what these options are
actually doing.

And this isn't helping me a bit:

  [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
 starting vacuum...end.
 transaction type: Exponential distribution TPC-B (sort of)
 scaling factor: 1
 exponential threshold: 10.0

 decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
 highest/lowest percent of the range: 9.5% 0.0%

I don't have a clue what that means.  None.

Here is an example of an explanation that would make sense to me.
This is not the actual behavior of your patch, I'm quite sure, so this
is just an example of the *kind* of explanation that I think is
needed:

The --exponential option causes pgbench to select lower-numbered
account IDs exponentially more frequently than higher-numbered account
IDs.  The argument to --exponential controls the strength of the
preference for lower-numbered account IDs, with a smaller value
indicating a stronger preference.  Specifically, it is the percentage
of the total number of account IDs which will receive half the total
accesses.  For example, with --exponential=10, half the accesses will
be to the smallest 10 percent of the account IDs; half the remaining
accesses will be to the next-smallest 10 percent of account IDs, and
so on.  --exponential=50 therefore represents a completely flat
distribution; larger values are not allowed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] gaussian distribution pgbench

2014-07-13 Thread Mitsumasa KONDO
Hi,

2014-07-04 19:05 GMT+09:00 Andres Freund and...@2ndquadrant.com:

 On 2014-07-04 11:59:23 +0200, Fabien COELHO wrote:
 
  Yea. I certainly disagree with the patch in its current state because it
  copies the same 15 lines several times with a two word difference.
  Independent of whether we want those options, I don't think that's going
  to fly.
 
  I liked a simple static string for the different variants, which means
  replication. Factorizing out the (large) common part will mean malloc &
  sprintf. Well, why not.

 It sucks from a maintenance POV. And I don't see the overhead of malloc
 being relevant here...

  OTOH, we've almost reached consensus on supporting gaussian
  and exponential options in \setrandom. So I think that you should
  separate those two features into two patches, and we should apply
  the \setrandom one first. Then we can discuss whether the other patch
  should be applied or not.
 
  Sounds like a good plan.
 
  Sigh. I'll do that as it seems to be a blocker...

I still agree with Fabien-san. I cannot understand why our logical proposal
isn't accepted...

I think we also need documentation about the actual mathematical
 behaviour of the randomness generators.
  + <para>
  +  With the gaussian option, the larger the <replaceable>threshold</>,
  +  the more frequently values close to the middle of the interval are drawn,
  +  and the less frequently values close to the <replaceable>min</> and
  +  <replaceable>max</> bounds.
  +  In other worlds, the larger the <replaceable>threshold</>,
  +  the narrower the access range around the middle.
  +  the smaller the threshold, the smoother the access pattern
  +  distribution. The minimum threshold is 2.0 for performance.
  + </para>

 The only way to actually understand the distribution here is to create a
 table, insert random values, and then look at the result. That's not a
 good thing.

That's right. That is why we added command-line options that make the
parametrized distributions easy to understand.
When you want to see the effect of a distribution parameter, you can use
the command-line option as follows.

 [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.0
decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=5
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 5.0
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
highest/lowest percent of the range: 4.9% 0.0%

If you have a better method than ours, please share it with us.


  The caveat that I have is that without these options there is:
 
  (1) no feedback about the actual distributions in the final summary, which
  depend on the threshold value, and
 
  (2) no included means to test the feature, so the first patch is less
  meaningful if the feature cannot be used simply and requires a custom
 script.

 I personally agree that we likely want that as an additional
 feature. Even if just because it makes the results easier to compare.

If we can have a positive and logical discussion, I will agree with the
proposal to separate the patches.
However, I think that the hacker most opposed to it decided based on his feelings...
Actually, he didn't answer our proposal about understanding the
parametrized distribution...
So I also think it is a blocker. The command-line feature is also needed.
Besides, is there another good method? Please share it with us.

Best regards,
--
Mitsumasa KONDO


Re: [HACKERS] gaussian distribution pgbench

2014-07-04 Thread Fabien COELHO


Yea. I certainly disagree with the patch in its current state because 
it copies the same 15 lines several times with a two word difference. 
Independent of whether we want those options, I don't think that's going 
to fly.


I liked a simple static string for the different variants, which means 
replication. Factorizing out the (large) common part will mean malloc & 
sprintf. Well, why not.



OTOH, we've almost reached consensus on supporting gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.



Sounds like a good plan.


Sigh. I'll do that as it seems to be a blocker...

The caveat that I have is that without these options there is:

(1) no feedback about the actual distributions in the final summary, which 
depend on the threshold value, and


(2) no included means to test the feature, so the first patch is less 
meaningful if the feature cannot be used simply and requires a custom 
script.


--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2014-07-04 Thread Andres Freund
On 2014-07-04 11:59:23 +0200, Fabien COELHO wrote:
 
 Yea. I certainly disagree with the patch in its current state because it
 copies the same 15 lines several times with a two word difference.
 Independent of whether we want those options, I don't think that's going
 to fly.
 
 I liked a simple static string for the different variants, which means
 replication. Factorizing out the (large) common part will mean malloc &
 sprintf. Well, why not.

It sucks from a maintenance POV. And I don't see the overhead of malloc
being relevant here...

 OTOH, we've almost reached consensus on supporting gaussian
 and exponential options in \setrandom. So I think that you should
 separate those two features into two patches, and we should apply
 the \setrandom one first. Then we can discuss whether the other patch
 should be applied or not.
 
 Sounds like a good plan.
 
 Sigh. I'll do that as it seems to be a blocker...

I think we also need documentation about the actual mathematical
behaviour of the randomness generators.

 + <para>
 +  With the gaussian option, the larger the <replaceable>threshold</>,
 +  the more frequently values close to the middle of the interval are drawn,
 +  and the less frequently values close to the <replaceable>min</> and
 +  <replaceable>max</> bounds.
 +  In other worlds, the larger the <replaceable>threshold</>,
 +  the narrower the access range around the middle.
 +  the smaller the threshold, the smoother the access pattern
 +  distribution. The minimum threshold is 2.0 for performance.
 + </para>

The only way to actually understand the distribution here is to create a
table, insert random values, and then look at the result. That's not a
good thing.

 The caveat that I have is that without these options there is:
 
 (1) no feedback about the actual distributions in the final summary, which
 depend on the threshold value, and
 
 (2) no included means to test the feature, so the first patch is less
 meaningful if the feature cannot be used simply and requires a custom script.

I personally agree that we likely want that as an additional
feature. Even if just because it makes the results easier to compare.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] gaussian distribution pgbench

2014-07-03 Thread Fabien COELHO


Hello Gavin,


 decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
 probability of first/last percent of the range: 11.3% 0.0%


I would suggest that probabilities should NEVER be expressed in percentages! 
As a percentage probability looks weird, and is never used for serious 
statistical work - in my experience at least.


I think probabilities should be expressed in the range 0 ... 1 - i.e. 0.35 
rather than 35%.


I could agree about the mathematics, but ISTM that 11.5% is more 
readable and intuitive than 0.115.


I could change "probability" and replace it with "frequency" or maybe 
"occurrence", what would you think about that?


--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2014-07-03 Thread Gavin Flower

On 03/07/14 20:58, Fabien COELHO wrote:


Hello Gavin,


 decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
 probability of first/last percent of the range: 11.3% 0.0%


I would suggest that probabilities should NEVER be expressed in 
percentages! As a percentage probability looks weird, and is never 
used for serious statistical work - in my experience at least.


I think probabilities should be expressed in the range 0 ... 1 - i.e. 
0.35 rather than 35%.


I could agree about the mathematics, but ISTM that 11.5% is more 
readable and intuitive than 0.115.


I could change "probability" and replace it with "frequency" or maybe 
"occurrence", what would you think about that?




You may well be hitting a situation, where you meet opposition whatever 
you do!  :-)


"frequency" implies a positive integer (though "relative frequency" 
might be okay) - and if you use "occurrence", someone else is bound to 
complain...


Though, I'd opt for "relative frequency", if you can't use values in the 
range 0 ... 1 for probabilities, if %'s are used - so long as it does 
not generate a flame war.


I suspect it may not be worth the grief to change.


Cheers,
Gavin






Re: [HACKERS] gaussian distribution pgbench

2014-07-03 Thread Fujii Masao
On Wed, Jul 2, 2014 at 6:05 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:

 Hello Mitsumasa-san,

 And I'm also interested in your decile percents output, like the
 following:
 decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%


 Sure, I'm really fine with that.


 I think that it is easier than before. Sum of decile percents is just
 100%.


 That's a good property:-)

 However, I don't like the highest/lowest percentage because users will
 confuse it with the decile percentages, and nobody can understand these
 figures. I could not understand 4.9%, 0.0% when I first saw them. Then I
 checked the source code and understood:( It's not good design... #Why does
 this parameter use 100?


 What else? People have ten fingers and like powers of 10, and are used to
 percents?


 So I'd like to remove it if you like. It will be more simple.


 I think that for the exponential distribution it helps, especially for high
 threshold, to have the lowest/highest percent density. For low thresholds,
 the decile is also definitely useful. So I'm fine with both outputs as you
 have put them.

 I have just updated the wording so that it may be clearer:

  decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
  probability of first/last percent of the range: 11.3% 0.0%


 Attached patch is fixed version, please confirm it.


 Attached a v15 which just fixes a typo and the above wording update. I'm
 validating it for committers.


 #Of course, World Cup is being held now. I'm not hurry at all.


 I'm not a soccer kind of person, so it does not influence my
 availability.:-)


 Suggested commit message:

 Add drawing random integers with a Gaussian or truncated exponential
 distribution to pgbench.

 Test variants with these distributions are also provided and triggered
 with options --gaussian=... and --exponential=

IIRC we've not reached consensus about whether we should support
such options in pgbench. Several hackers objected to supporting them.
OTOH, we've almost reached consensus on supporting gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.

Regards,

-- 
Fujii Masao




Re: [HACKERS] gaussian distribution pgbench

2014-07-03 Thread Andres Freund
On 2014-07-03 21:27:53 +0900, Fujii Masao wrote:
  Add drawing random integers with a Gaussian or truncated exponential
  distribution to pgbench.
 
  Test variants with these distributions are also provided and triggered
  with options --gaussian=... and --exponential=
 
 IIRC we've not reached consensus about whether we should support
 such options in pgbench. Several hackers objected to supporting them.

Yea. I certainly disagree with the patch in its current state because
it copies the same 15 lines several times with a two word
difference. Independent of whether we want those options, I don't think
that's going to fly.

 OTOH, we've almost reached consensus on supporting gaussian
 and exponential options in \setrandom. So I think that you should
 separate those two features into two patches, and we should apply
 the \setrandom one first. Then we can discuss whether the other patch
 should be applied or not.

Sounds like a good plan.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] gaussian distribution pgbench

2014-07-02 Thread Fabien COELHO


Hello Mitsumasa-san,


And I'm also interested in your decile percents output, like the
following:
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%


Sure, I'm really fine with that.


I think that it is easier than before. Sum of decile percents is just 100%.


That's a good property:-)

However, I don't like the highest/lowest percentage because users will 
confuse it with the decile percentages, and nobody can understand these 
figures. I could not understand 4.9%, 0.0% when I first saw them. 
Then I checked the source code and understood:( It's not good 
design... #Why does this parameter use 100?


What else? People have ten fingers and like powers of 10, and are used to 
percents?



So I'd like to remove it if you like. It will be more simple.


I think that for the exponential distribution it helps, especially for 
high threshold, to have the lowest/highest percent density. For low 
thresholds, the decile is also definitely useful. So I'm fine with both 
outputs as you have put them.


I have just updated the wording so that it may be clearer:

 decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
 probability of first/last percent of the range: 11.3% 0.0%


The attached patch is the fixed version; please check it.


Attached a v15 which just fixes a typo and the above wording update. I'm 
validating it for committers.



#Of course, the World Cup is being held now. I'm not in a hurry at all.


I'm not a soccer kind of person, so it does not influence my 
availability.:-)



Suggested commit message:

Add drawing random integers with a Gaussian or truncated exponential 
distribution to pgbench.


Test variants with these distributions are also provided and triggered
with options --gaussian=... and --exponential=


Have a nice day/night,

--
Fabien.
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..3541b7e 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -98,6 +99,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -171,6 +174,14 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian distribution tests: */
+double		stdev_threshold;   /* standard deviation threshold */
+bool		use_gaussian = false;
+
+/* exponential distribution tests: */
+double		exp_threshold;   /* threshold for exponential */
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -332,6 +343,88 @@ static char *select_only = {
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
+/* --exponential case */
+static char *exponential_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -N case */
+static char *exponential_simple_update = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -S case */
+static char *exponential_select_only = {
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+};
+
+/* --gaussian case */
+static char *gaussian_tpc_b = {
+	\\set nbranches 

Re: [HACKERS] gaussian distribution pgbench

2014-07-02 Thread Fabien COELHO




I have just updated the wording so that it may be clearer:


Oops, I have sent the wrong patch, without the wording fix. Here is the 
real updated version, which I tested.



probability of first/last percent of the range: 11.3% 0.0%


--
Fabien.
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..f8ad17e 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -98,6 +99,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -171,6 +174,14 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian distribution tests: */
+double		stdev_threshold;   /* standard deviation threshold */
+bool		use_gaussian = false;
+
+/* exponential distribution tests: */
+double		exp_threshold;   /* threshold for exponential */
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -332,6 +343,88 @@ static char *select_only = {
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
+/* --exponential case */
+static char *exponential_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -N case */
+static char *exponential_simple_update = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -S case */
+static char *exponential_select_only = {
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+};
+
+/* --gaussian case */
+static char *gaussian_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --gaussian with -N case */
+static char *gaussian_simple_update = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --gaussian with -S case */
+static char *gaussian_select_only = {
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	\\setrandom aid 1 :naccounts 

Re: [HACKERS] gaussian distribution pgbench

2014-07-02 Thread Gavin Flower

On 02/07/14 21:05, Fabien COELHO wrote:


Hello Mitsumasa-san,


And I'm also interested in your decile percents output, like the
following:
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%


Sure, I'm really fine with that.

I think that it is easier to understand than before. The sum of the decile 
percents is exactly 100%.


That's a good property:-)

However, I would rather not keep the highest/lowest percentage, because 
users will confuse it with the decile percentages, and nobody can 
understand these digits. I could not understand 4.9%, 0.0% when I first 
saw them. Only after checking the source code did I understand them :( 
It's not good design... #Why does this parameter use 100?


What else? People have ten fingers and like powers of 10, and are used 
to percents?



So I'd like to remove it, if you agree. It will be simpler.


I think that for the exponential distribution it helps, especially for 
high threshold, to have the lowest/highest percent density. For low 
thresholds, the decile is also definitely useful. So I'm fine with 
both outputs as you have put them.


I have just updated the wording so that it may be clearer:

 decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
 probability of first/last percent of the range: 11.3% 0.0%


The attached patch is the fixed version; please check it.


Attached a v15 which just fixes a typo and the above wording update. 
I'm validating it for committers.



#Of course, the World Cup is being held now. I'm not in a hurry at all.


I'm not a soccer kind of person, so it does not influence my 
availability.:-)



Suggested commit message:

Add drawing random integers with a Gaussian or truncated exponential 
distribution to pgbench.


Test variants with these distributions are also provided and triggered
with options --gaussian=... and --exponential=


Have a nice day/night,



I would suggest that probabilities should NEVER be expressed in 
percentages! As a percentage probability looks weird, and is never used 
for serious statistical work - in my experience at least.


I think probabilities should be expressed in the range 0 ... 1 - i.e. 
0.35 rather than 35%.



Cheers,
Gavin


Re: [HACKERS] gaussian distribution pgbench

2014-06-17 Thread Mitsumasa KONDO
Hello Fabien-san,

I have checked your v13 patch and tested the new algorithm for generating
the exponential distribution. It works fine, with less or no overhead
compared to the previous version.
Great work! And I agree with your proposal.

And I'm also interested in your decile percents output, like the
following:

 [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=20
 ~
 decile percents: 86.5% 11.7% 1.6% 0.2% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
 ~
 [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
 ~
 decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
 ~
 [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=5
 ~
 decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
 ~
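
The decile figures above can be reproduced with a short calculation. The sketch below is in Python and assumes that the drawn value follows a density proportional to exp(-threshold * x), truncated to x in [0, 1) and renormalized. The function name is hypothetical and this is not code taken from the patch.

```python
import math


def exponential_decile_percents(threshold):
    """Percent of draws expected in each tenth of the key range.

    Assumption: density proportional to exp(-threshold * x) on [0, 1),
    renormalized so that the ten deciles sum to 100%.
    """
    norm = 1.0 - math.exp(-threshold)
    percents = []
    for i in range(10):
        lower = math.exp(-threshold * i / 10.0)
        upper = math.exp(-threshold * (i + 1) / 10.0)
        percents.append(100.0 * (lower - upper) / norm)
    return percents


for threshold in (20, 10, 5):
    line = " ".join("%.1f%%" % p for p in exponential_decile_percents(threshold))
    print("--exponential=%d decile percents: %s" % (threshold, line))
```

Under that assumption the printed rows agree with the three outputs quoted above, which suggests the threshold simply controls how fast the density decays across the range.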

I think this output makes it easy to understand the exponential distribution
for a given exponential parameter, so I agree with it. I have therefore added
a decile percents output for the gaussian distribution as well.
Here are some examples.

 [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --gaussian=20
 ~
 decile percents: 0.0% 0.0% 0.0% 0.0% 50.0% 50.0% 0.0% 0.0% 0.0% 0.0%
 ~
 [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --gaussian=10
 ~
 decile percents: 0.0% 0.0% 0.0% 2.3% 47.7% 47.7% 2.3% 0.0% 0.0% 0.0%
 ~
 [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --gaussian=5
 ~
 decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%
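
The gaussian rows can be checked the same way. This Python sketch assumes a standard normal distribution truncated to [-threshold, +threshold] and stretched over the key range, so that each decile covers threshold/5 standard deviations. The function name is hypothetical and the patch may compute the figures differently.

```python
import math


def gaussian_decile_percents(threshold):
    """Percent of draws expected in each tenth of the key range.

    Assumption: standard normal truncated to [-threshold, threshold],
    mapped linearly onto the range.
    """
    def cdf(x):
        # Standard normal cumulative distribution function.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    norm = cdf(threshold) - cdf(-threshold)
    edges = [-threshold + 2.0 * threshold * i / 10.0 for i in range(11)]
    return [100.0 * (cdf(edges[i + 1]) - cdf(edges[i])) / norm
            for i in range(10)]


for threshold in (20, 10, 5):
    line = " ".join("%.1f%%" % p for p in gaussian_decile_percents(threshold))
    print("--gaussian=%d decile percents: %s" % (threshold, line))
```

With --gaussian=5 each decile spans one standard deviation, which is why the familiar 34.1%, 13.6% and 2.1% figures show up.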

I think that it is easier to understand than before. The sum of the decile percents is exactly 100%.


However, I would rather not keep the highest/lowest percentage, because users
will confuse it with the decile percentages, and nobody can understand these
digits.

Here is an example with exponential=5:
 [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=5
 ~
 decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
 highest/lowest percent of the range: 4.9% 0.0%
 ~
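
For what it is worth, the 4.9% and 0.0% figures are consistent with the probability of drawing from the first and from the last hundredth of the range, which would explain the 100. A minimal Python check, under the assumption of a truncated exponential density proportional to exp(-threshold * x) on [0, 1) (the function name is hypothetical):

```python
import math


def exponential_first_last_percent(threshold):
    """Percent of draws in the first and in the last 1% of the range."""
    norm = 1.0 - math.exp(-threshold)
    first = 100.0 * (1.0 - math.exp(-threshold / 100.0)) / norm
    last = 100.0 * (math.exp(-threshold * 0.99) - math.exp(-threshold)) / norm
    return first, last


print("first/last percent: %.1f%% %.1f%%" % exponential_first_last_percent(5))
```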

I could not understand 4.9%, 0.0% when I first saw them.
Only after checking the source code did I understand them :( It's not good design...
#Why does this parameter use 100?
So I'd like to remove it, if you agree. It will be simpler.

The attached patch is the fixed version; please check it.
#Of course, the World Cup is being held now. I'm not in a hurry at all.

Best regards,
-- 
Mitsumasa KONDO
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***************
*** 41,46 ****
--- 41,47 ----
  #include <math.h>
  #include <signal.h>
  #include <sys/time.h>
+ #include <assert.h>
  #ifdef HAVE_SYS_SELECT_H
  #include <sys/select.h>
  #endif
***************
*** 98,103 **** static int	pthread_join(pthread_t th, void **thread_return);
--- 99,106 ----
  #define LOG_STEP_SECONDS	5	/* seconds between log messages */
  #define DEFAULT_NXACTS	10		/* default nxacts */
  
+ #define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+ 
  int			nxacts = 0;			/* number of transactions per client */
  int			duration = 0;		/* duration in seconds */
  
***************
*** 171,176 **** bool		is_connect;			/* establish connection for each transaction */
--- 174,187 ----
  bool		is_latencies;		/* report per-command latencies */
  int			main_pid;			/* main process id used in log filename */
  
+ /* gaussian distribution tests: */
+ double		stdev_threshold;   /* standard deviation threshold */
+ bool		use_gaussian = false;
+ 
+ /* exponential distribution tests: */
+ double		exp_threshold;   /* threshold for exponential */
+ bool		use_exponential = false;
+ 
  char	   *pghost = "";
  char	   *pgport = "";
  char	   *login = NULL;
***************
*** 332,337 **** static char *select_only = {
--- 343,430 ----
  	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
  
+ /* --exponential case */
+ static char *exponential_tpc_b = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ 	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --exponential with -N case */
+ static char *exponential_simple_update = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	SELECT 

Re: [HACKERS] gaussian distribution pgbench

2014-04-01 Thread Fabien COELHO


Please find attached an updated version v13 for this patch.

I have (I hope) significantly improved the documentation, including a not so 
helpful mathematical explanation of the actual meaning of the threshold 
value. If a native English speaker could check the documentation, it would 
be nice!


I have improved the implementation of the exponential distribution so as 
to avoid a loop, which makes it possible to lift the minimum threshold value 
constraint, and the exponential pgbench summary now displays decile and 
first/last percent drawing probabilities. However, the same simplification 
cannot be applied to the gaussian distribution part, which must rely on a 
loop and thus needs a minimal threshold for performance. I have also checked 
(see the 4 attached scripts) the actual distribution against the computed 
probabilities.
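
To illustrate how an exponential draw can avoid a loop, here is a Python sketch of inverse transform sampling on a truncated exponential. It is an assumption about the general technique, not a copy of the patch's C code, and the function and variable names are hypothetical.

```python
import math
import random


def exponential_int(rng, low, high, threshold):
    """Draw an integer in [low, high], most often near low, without a loop.

    Assumption: inverse transform of a density proportional to
    exp(-threshold * x) truncated to [0, 1).
    """
    if threshold <= 0.0:
        raise ValueError("threshold must be positive")
    cut = math.exp(-threshold)      # tail mass beyond the truncation point
    uniform = 1.0 - rng.random()    # uniform in (0, 1]
    unit = -math.log(cut + (1.0 - cut) * uniform) / threshold  # in [0, 1)
    # min() guards against floating-point rounding at the upper edge.
    return min(high, low + int((high - low + 1) * unit))


rng = random.Random(42)
sample = [exponential_int(rng, 1, 1000, 5.0) for _ in range(20000)]
first_decile = sum(1 for v in sample if v <= 100) / float(len(sample))
print("share of draws in the first decile: %.3f" % first_decile)
```

Because every uniform draw maps to a value inside the range, no rejection is needed, so a small threshold costs nothing extra. For threshold 5 the first-decile share should come out close to the 39.6% reported by the summary.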



I disagree with the suggestion to remove the included gaussian & 
exponential test variants, because (1) it would mean removing the 
specific summaries as well, which are essential to help feel how the 
feature works; (2) the corresponding code in the source is rather 
straightforward; (3) the tests correspond to the schema and data created 
with -i, so it makes sense that they are stored in pgbench; (4) in order 
for this feature to be used, it is best that it is available directly and 
simply from pgbench, and does not have to be sought elsewhere.



If this is a commit blocker, then the embedded scripts will have to be 
removed, but I really think that they add significant value to pgbench 
and its non-uniform features because they make it easy to test.



If Mitsumasa-san agrees with these proposed changes, I would suggest
applying this patch.

--
Fabien
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 7c1e59e..eb1ecb3 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -98,6 +99,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -169,6 +172,14 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian distribution tests: */
+double		stdev_threshold;   /* standard deviation threshold */
+bool		use_gaussian = false;
+
+/* exponential distribution tests: */
+double		exp_threshold;   /* threshold for exponential */
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -330,6 +341,88 @@ static char *select_only = {
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
+/* --exponential case */
+static char *exponential_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -N case */
+static char *exponential_simple_update = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -S case */
+static char *exponential_select_only = {
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+};
+
+/* --gaussian case */
+static char *gaussian_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	\\set naccounts  

Re: [HACKERS] gaussian distribution pgbench

2014-03-18 Thread KONDO Mitsumasa
(2014/03/17 22:37), Tom Lane wrote:
 KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp writes:
 (2014/03/17 18:02), Heikki Linnakangas wrote:
 On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote:
 There is an infinite number of variants of the TPC-B test that we could include
 in pgbench. If we start adding every one of them, we're quickly going to have
 hundreds of options to choose the workload. I'd like to keep pgbench simple.
 These two new test variants, gaussian and exponential, are not that special that
 they'd deserve to be included in the program itself.
 
 Well, I add only two options, and they are major distributions that are seen in
 real database systems more often than the uniform distribution. I'm afraid I
 think you are too worried; hundreds of options will not be added. And pgbench
 is still simple.
 
 FWIW, I concur with Heikki on this.  Adding new versions of \setrandom is
 useful functionality.  Embedding them in the standard test is not,
 because that just makes it (even) less standard.  And pgbench has too darn
 many switches already.
Hmm, I cooled down and looked at the pgbench options. I can understand his
arguments: there are many switches already, and the option list will keep
growing unless we stop adding new options. However, I think that the people who
added options in the past thought they would be useful for improving PostgreSQL
performance. But now those options stand in the way of a new option like my
feature, which can create a benchmark distribution closer to real systems. I
think that is very unfortunate, and it also tends to stop progress in improving
PostgreSQL performance, not only pgbench. And if we remove the command line
options, I think new features will tend to be rejected. That is not good either.

By the way, if we remove the command line options, it becomes difficult to
understand the gaussian distribution, because the threshold parameter is very
sensitive, and the summary output is also a very useful feature. Analyzing and
visualizing pgbench_history using SQL is difficult and laborious.

What do you think about this problem? It has not been discussed yet.

 [mitsu-ko@pg-rex31 pgbench]$ ./pgbench --gaussian=2
 ~
 access probability of top 20%, 10% and 5% records: 0.32566 0.16608 0.08345
 ~
 [mitsu-ko@pg-rex31 pgbench]$ ./pgbench --gaussian=4
 ~
 access probability of top 20%, 10% and 5% records: 0.57633 0.31086 0.15853
 ~
 [mitsu-ko@pg-rex31 pgbench]$ ./pgbench --gaussian=10
 ~
 access probability of top 20%, 10% and 5% records: 0.95450 0.68269 0.38292
 ~
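
The three access probabilities printed above are consistent with a standard normal truncated to [-threshold, +threshold]: the middle fraction f of the records then receives erf(threshold * f / sqrt(2)) / erf(threshold / sqrt(2)) of the accesses. The Python sketch below checks this. The formula is inferred from the printed numbers, not taken from the patch, and the function name is hypothetical.

```python
import math


def gaussian_center_probability(threshold, fraction):
    """Probability of hitting the middle `fraction` of the records.

    Assumption: standard normal truncated to [-threshold, threshold]
    and stretched linearly over the whole key range.
    """
    inside = math.erf(threshold * fraction / math.sqrt(2.0))
    total = math.erf(threshold / math.sqrt(2.0))
    return inside / total


for threshold in (2, 4, 10):
    probs = [gaussian_center_probability(threshold, f)
             for f in (0.20, 0.10, 0.05)]
    print("--gaussian=%d: %s" % (threshold,
                                 " ".join("%.5f" % p for p in probs)))
```

This also shows how sensitive the parameter is: going from threshold 2 to threshold 10 moves the share of the middle 20% of records from about a third of the accesses to more than 95%.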

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] gaussian distribution pgbench

2014-03-18 Thread KONDO Mitsumasa

(2014/03/17 23:29), Robert Haas wrote:

On Sat, Mar 15, 2014 at 4:50 AM, Mitsumasa KONDO
kondo.mitsum...@gmail.com wrote:

There are explanations and computations as comments in the code. If it is
about the documentation, I'm not sure that a very precise mathematical
definition will help a lot of people, and might rather hinder understanding,
so the doc focuses on an intuitive explanation instead.


Yeah, I think that we had better explain only the information necessary for
using this feature. If we add mathematical theory to the docs, it will be too
difficult for users.  And it's a waste.


Well, if you *don't* include at least *some* mathematical description
of what the feature does in the documentation, then users who need to
understand it will have to read the source code to figure it out,
which is going to be even more difficult.

I have fixed this problem. Please see the v12 patch. I think it doesn't include
a mathematical description, but users will be able to understand the feature
intuitively from the explanation in the documentation.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] gaussian distribution pgbench

2014-03-18 Thread KONDO Mitsumasa
And I found another useful aspect of this feature. The following results are for
the '--gaussian=20' and '--gaussian=2' cases, with the same PostgreSQL settings.

 [mitsu-ko@pg-rex31 pgbench]$ ./pgbench -c8 -j4 --gaussian=20 -T30 -P 5
 starting vacuum...end.
 progress: 5.0 s, 4285.8 tps, lat 1.860 ms stddev 0.425
 progress: 10.0 s, 4249.2 tps, lat 1.879 ms stddev 0.372
 progress: 15.0 s, 4230.3 tps, lat 1.888 ms stddev 0.430
 progress: 20.0 s, 4247.3 tps, lat 1.880 ms stddev 0.400
 LOG:  checkpoints are occurring too frequently (12 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 progress: 25.0 s, 4269.0 tps, lat 1.870 ms stddev 0.427
 progress: 30.0 s, 4318.1 tps, lat 1.849 ms stddev 0.415
 transaction type: Gaussian distribution TPC-B (sort of)
 scaling factor: 10
 standard deviation threshold: 20.0
 access probability of top 20%, 10% and 5% records: 0.99994 0.95450 0.68269
 query mode: simple
 number of clients: 8
 number of threads: 4
 duration: 30 s
 number of transactions actually processed: 128008
 latency average: 1.871 ms
 latency stddev: 0.412 ms
 tps = 4266.266374 (including connections establishing)
 tps = 4267.312022 (excluding connections establishing)


 [mitsu-ko@pg-rex31 pgbench]$ ./pgbench -c8 -j4 --gaussian=2 -T30 -P 5
 starting vacuum...end.
 LOG:  checkpoints are occurring too frequently (13 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 LOG:  checkpoints are occurring too frequently (1 second apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 LOG:  checkpoints are occurring too frequently (2 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 progress: 5.0 s, 3927.9 tps, lat 2.030 ms stddev 0.691
 LOG:  checkpoints are occurring too frequently (2 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 LOG:  checkpoints are occurring too frequently (1 second apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 LOG:  checkpoints are occurring too frequently (2 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 progress: 10.0 s, 4045.8 tps, lat 1.974 ms stddev 0.835
 LOG:  checkpoints are occurring too frequently (2 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 LOG:  checkpoints are occurring too frequently (1 second apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 LOG:  checkpoints are occurring too frequently (2 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 progress: 15.0 s, 4042.5 tps, lat 1.976 ms stddev 0.613
 LOG:  checkpoints are occurring too frequently (2 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 LOG:  checkpoints are occurring too frequently (2 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 LOG:  checkpoints are occurring too frequently (2 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 progress: 20.0 s, 4103.9 tps, lat 1.946 ms stddev 0.540
 LOG:  checkpoints are occurring too frequently (1 second apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 LOG:  checkpoints are occurring too frequently (2 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 LOG:  checkpoints are occurring too frequently (2 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 progress: 25.0 s, 4003.1 tps, lat 1.995 ms stddev 0.526
 LOG:  checkpoints are occurring too frequently (2 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 LOG:  checkpoints are occurring too frequently (1 second apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 LOG:  checkpoints are occurring too frequently (2 seconds apart)
 HINT:  Consider increasing the configuration parameter checkpoint_segments.
 progress: 30.0 s, 4025.5 tps, lat 1.984 ms stddev 0.568
 transaction type: Gaussian distribution TPC-B (sort of)
 scaling factor: 10
 standard deviation threshold: 2.0
 access probability of top 20%, 10% and 5% records: 0.32566 0.16608 0.08345
 query mode: simple
 number of clients: 8
 number of threads: 4
 duration: 30 s
 number of transactions actually processed: 120752
 latency average: 1.984 ms
 latency stddev: 0.638 ms
 tps = 4024.823433 (including connections establishing)
 tps = 4025.87 (excluding connections establishing)

In the '--gaussian=2' benchmark, checkpoints happen more frequently than in the
'--gaussian=20' benchmark, because the former updates a larger range of records,
so that its full-page-write WAL is bigger than the latter's.

Re: [HACKERS] gaussian distribution pgbench

2014-03-18 Thread Heikki Linnakangas

On 03/18/2014 11:57 AM, KONDO Mitsumasa wrote:

I think that this feature will also be useful for studying new buffer
replacement algorithms, checkpoint strategies, and so on.


Sure. No doubt about that.


If we remove this option, it will be really disappointing..


As long as we get the \setrandom changes in, you can easily do these 
tests using a custom script. There's nothing wrong with using a custom 
script, it will be just as useful for exploring buffer replacement 
algorithms, checkpoints etc. as a built-in option.


- Heikki




Re: [HACKERS] gaussian distribution pgbench

2014-03-17 Thread KONDO Mitsumasa

Hi Heikki-san,

(2014/03/17 14:39), KONDO Mitsumasa wrote:

(2014/03/15 15:53), Fabien COELHO wrote:


Hello Heikki,


A couple of comments:

* There should be an explicit \setrandom ... uniform option too, even though
you get that implicitly if you don't specify the distribution

Fixed. We can now use \setrandom val min max uniform without error messages.


* What exactly does the threshold mean? The docs informally explain that the
larger the threshold, the more frequent values close to the middle of the
interval are drawn, but that's pretty vague.


There are explanations and computations as comments in the code. If it is about
the documentation, I'm not sure that a very precise mathematical definition will
help a lot of people, and might rather hinder understanding, so the doc focuses
on an intuitive explanation instead.

I added more detailed information to the documentation. Is it OK? Please check it.


* Does min and max really make sense for gaussian and exponential
distributions? For gaussian, I would expect mean and standard deviation as the
parameters, not min/max/threshold.


Yes... and no:-) The aim is to draw an integer primary key from a table, so it
must be in a specified range. This is approximated by drawing a double value
with the expected distribution (gaussian or exponential) and project it
carefully onto integers. If it is out of range, there is a loop and another
value is drawn. The minimal threshold constraint (2.0) ensures that the
probability of looping is low.

It makes sense. Please see the picture I attached yesterday.


* How about setting the variable as a float instead of integer? Would seem more
natural to me. At least as an option.


Which variable? The values set by setrandom are mostly used for primary keys. We
really want integers in a range.

Oh, I see. He was talking about the documentation.

The documentation was wrong.
The threshold parameter must be a double, and I have fixed the documentation.

By the way, you seem to want to remove the --gaussian=NUM and --exponential=NUM
command options. Can you tell me the objective reason? I think pgbench is the
benchmark tool for PostgreSQL, and its default benchmark is a TPC-B-like benchmark.
This is written in the documentation, and the default benchmark was not changed by
my patch. So we need not remove the command options; they are just one more variety
of benchmark option. Maybe there is some misunderstanding about my patch...

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***
*** 98,103  static int	pthread_join(pthread_t th, void **thread_return);
--- 98,106 
  #define LOG_STEP_SECONDS	5	/* seconds between log messages */
  #define DEFAULT_NXACTS	10		/* default nxacts */
  
+ #define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+ #define MIN_EXPONENTIAL_THRESHOLD	2.0	/* minimum threshold for exp */
+ 
  int			nxacts = 0;			/* number of transactions per client */
  int			duration = 0;		/* duration in seconds */
  
***
*** 169,174  bool		is_connect;			/* establish connection for each transaction */
--- 172,185 
  bool		is_latencies;		/* report per-command latencies */
  int			main_pid;			/* main process id used in log filename */
  
+ /* gaussian distribution tests: */
+ double		stdev_threshold;   /* standard deviation threshold */
+ bool		use_gaussian = false;
+ 
+ /* exponential distribution tests: */
+ double		exp_threshold;   /* threshold for exponential */
+ bool		use_exponential = false;
+ 
  char	   *pghost = "";
  char	   *pgport = "";
  char	   *login = NULL;
***
*** 330,335  static char *select_only = {
--- 341,428 
  	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
  
+ /* --exponential case */
+ static char *exponential_tpc_b = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ 	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --exponential with -N case */
+ static char *exponential_simple_update = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	UPDATE 

Re: [HACKERS] gaussian distribution pgbench

2014-03-17 Thread Heikki Linnakangas

On 03/15/2014 08:53 AM, Fabien COELHO wrote:

* Does min and max really make sense for gaussian and exponential
distributions? For gaussian, I would expect mean and standard deviation as
the parameters, not min/max/threshold.

Yes... and no:-) The aim is to draw an integer primary key from a table,
so it must be in a specified range.


Well, I don't agree with that aim. It's useful for choosing a primary 
key, as in the pgbench TPC-B workload, but a gaussian distributed random 
number could be used for many other things too. For example:


\setrandom foo ... gaussian

select * from cheese where weight  :foo

And :foo should be a float, not an integer. That's what I was trying to 
say earlier, when I said that the variable should be a float. If you 
need an integer, just cast or round it in the query.


I realize that the current \setrandom sets the variable to an integer, 
so gaussian/exponential would be different. But so what? An option to 
generate uniformly distributed floats would be handy too, though.



This is approximated by drawing a
double value with the expected distribution (gaussian or exponential) and
project it carefully onto integers. If it is out of range, there is a loop
and another value is drawn. The minimal threshold constraint (2.0) ensures
that the probability of looping is low.


Well, that's one way to constrain it to the given range, but there 
are many other ways to do it. Like, clamp it to the min/max if it's out 
of range. I don't think we need to choose any particular method, you can 
handle that in the test script.


- Heikki




Re: [HACKERS] gaussian distribution pgbench

2014-03-17 Thread Heikki Linnakangas

On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote:

By the way, you seem to want to remove the --gaussian=NUM and --exponential=NUM
command options. Can you tell me the objective reason? I think pgbench is the
benchmark tool for PostgreSQL, and its default benchmark is a TPC-B-like benchmark.
This is written in the documentation, and the default benchmark was not changed by
my patch. So we need not remove the command options; they are just one more variety
of benchmark option. Maybe there is some misunderstanding about my patch...


There is an infinite number of variants of the TPC-B test that we could 
include in pgbench. If we start adding every one of them, we're quickly 
going to have hundreds of options to choose the workload. I'd like to 
keep pgbench simple. These two new test variants, gaussian and 
exponential, are not that special that they'd deserve to be included in 
the program itself.


pgbench already has a mechanism for running custom scripts, in which you 
can specify whatever workload you want. Let's use that. If it's missing 
something you need to specify the workload you want, let's enhance the 
script language.


The features we're missing, which makes it difficult to write the 
gaussian and exponential variants as custom scripts, is the capability 
to create random numbers with a non-uniform distribution. That's the 
feature we should include in pgbench.


(Actually, you could do the Box-Muller transformation as part of the 
query, to convert the uniform random variable to a gaussian one. Then 
you wouldn't need any changes to pgbench. But I agree that \setrandom 
... gaussian would be quite handy)


- Heikki




Re: [HACKERS] gaussian distribution pgbench

2014-03-17 Thread KONDO Mitsumasa

(2014/03/17 17:46), Heikki Linnakangas wrote:

On 03/15/2014 08:53 AM, Fabien COELHO wrote:

* Does min and max really make sense for gaussian and exponential
distributions? For gaussian, I would expect mean and standard deviation as
the parameters, not min/max/threshold.

Yes... and no:-) The aim is to draw an integer primary key from a table,
so it must be in a specified range.


Well, I don't agree with that aim. It's useful for choosing a primary key, as in
the pgbench TPC-B workload, but a gaussian distributed random number could be
used for many other things too. For example:

\setrandom foo ... gaussian

select * from cheese where weight  :foo

And :foo should be a float, not an integer. That's what I was trying to say
earlier, when I said that the variable should be a float. If you need an
integer, just cast or round it in the query.

I realize that the current \setrandom sets the variable to an integer, so
gaussian/exponential would be different. But so what? An option to generate
uniformly distributed floats would be handy too, though.
Well, that seems to be a new feature. If you want it to be a double, add
'\setrandomd' as a double random generator in pgbench. I would agree with that.



This is approximated by drawing a
double value with the expected distribution (gaussian or exponential) and
project it carefully onto integers. If it is out of range, there is a loop
and another value is drawn. The minimal threshold constraint (2.0) ensures
that the probability of looping is low.


Well, that's one way to constrain it to the given range, but there are many
other ways to do it. Like, clamp it to the min/max if it's out of range.

That method is too heavy. Client-side calculation must be light.


I don't
think we need to choose any particular method, you can handle that in the test
script.

I think our implementation is the best way to achieve it.
It is fast and robust, because the probability of looping is low.

If you have a better idea, please tell us.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] gaussian distribution pgbench

2014-03-17 Thread KONDO Mitsumasa

(2014/03/17 18:02), Heikki Linnakangas wrote:

On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote:

By the way, you seem to want to remove the --gaussian=NUM and --exponential=NUM
command options. Can you tell me the objective reason? I think pgbench is the
benchmark tool for PostgreSQL, and its default benchmark is a TPC-B-like benchmark.
This is written in the documentation, and the default benchmark was not changed by
my patch. So we need not remove the command options; they are just one more variety
of benchmark option. Maybe there is some misunderstanding about my patch...


There is an infinite number of variants of the TPC-B test that we could include
in pgbench. If we start adding every one of them, we're quickly going to have
hundreds of options to choose the workload. I'd like to keep pgbench simple.
These two new test variants, gaussian and exponential, are not that special that
they'd deserve to be included in the program itself.
Well, I am adding only two options, and they are major distributions that are
seen in real database systems more often than the uniform distribution. I'm afraid
you are worrying too much; hundreds of options will not be added, and pgbench is
still simple.



pgbench already has a mechanism for running custom scripts, in which you can
specify whatever workload you want. Let's use that. If it's missing something
you need to specify the workload you want, let's enhance the script language.
I have not seen many users who use pgbench custom scripts. Gaussian and
exponential distributions are much better for measuring real system performance,
so I would like to use them as command-line options. With the current pgbench, we
can only vary the database size, which is not a realistic situation. We want to
forecast the required system by calculating the size of the hot spot or the
distribution of the access pattern.


I'd really like to include it, from my heart :)  Please...

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] gaussian distribution pgbench

2014-03-17 Thread Fujii Masao
On Mon, Mar 17, 2014 at 7:07 PM, KONDO Mitsumasa
kondo.mitsum...@lab.ntt.co.jp wrote:
 (2014/03/17 18:02), Heikki Linnakangas wrote:

 On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote:

 By the way, you seem to want to remove the --gaussian=NUM and --exponential=NUM
 command options. Can you tell me the objective reason? I think pgbench is the
 benchmark tool for PostgreSQL, and its default benchmark is a TPC-B-like
 benchmark. This is written in the documentation, and the default benchmark was
 not changed by my patch. So we need not remove the command options; they are
 just one more variety of benchmark option. Maybe there is some misunderstanding
 about my patch...


 There is an infinite number of variants of the TPC-B test that we could
 include in pgbench. If we start adding every one of them, we're quickly going
 to have hundreds of options to choose the workload. I'd like to keep pgbench
 simple. These two new test variants, gaussian and exponential, are not that
 special that they'd deserve to be included in the program itself.

 Well, I am adding only two options, and they are major distributions that are
 seen in real database systems more often than the uniform distribution. I'm
 afraid you are worrying too much; hundreds of options will not be added, and
 pgbench is still simple.


 pgbench already has a mechanism for running custom scripts, in which you can
 specify whatever workload you want. Let's use that. If it's missing something
 you need to specify the workload you want, let's enhance the script language.

 I have not seen many users who use pgbench custom scripts. Gaussian and
 exponential distributions are much better for measuring real system
 performance, so I would like to use them as command-line options. With the
 current pgbench, we can only vary the database size, which is not a realistic
 situation. We want to forecast the required system by calculating the size of
 the hot spot or the distribution of the access pattern.

 I'd really like to include it, from my heart :)  Please...

I have no strong opinion about the command-line option for gaussian,
but I think that we should focus on \setrandom gaussian first. Even
after that's committed, we can implement that command-line option
later if many people think that's necessary.

Regards,

-- 
Fujii Masao




Re: [HACKERS] gaussian distribution pgbench

2014-03-17 Thread Tom Lane
KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp writes:
 (2014/03/17 18:02), Heikki Linnakangas wrote:
 On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote:
 There is an infinite number of variants of the TPC-B test that we could include
 in pgbench. If we start adding every one of them, we're quickly going to have
 hundreds of options to choose the workload. I'd like to keep pgbench simple.
 These two new test variants, gaussian and exponential, are not that special
 that they'd deserve to be included in the program itself.

 Well, I am adding only two options, and they are major distributions that are
 seen in real database systems more often than the uniform distribution. I'm
 afraid you are worrying too much; hundreds of options will not be added, and
 pgbench is still simple.

FWIW, I concur with Heikki on this.  Adding new versions of \setrandom is
useful functionality.  Embedding them in the standard test is not,
because that just makes it (even) less standard.  And pgbench has too darn
many switches already.

regards, tom lane




Re: [HACKERS] gaussian distribution pgbench

2014-03-17 Thread Robert Haas
On Sat, Mar 15, 2014 at 4:50 AM, Mitsumasa KONDO
kondo.mitsum...@gmail.com wrote:
 There are explanations and computations as comments in the code. If it is
 about the documentation, I'm not sure that a very precise mathematical
 definition will help a lot of people, and might rather hinder understanding,
 so the doc focuses on an intuitive explanation instead.

 Yeah, I think we had better explain only the information necessary for
 using this feature. If we add mathematical theory to the docs, it will be too
 difficult for users, and it would be a waste.

Well, if you *don't* include at least *some* mathematical description
of what the feature does in the documentation, then users who need to
understand it will have to read the source code to figure it out,
which is going to be even more difficult.
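
As an illustration of how short such a description could be, here is one
possible formulation. It is derived from the getGaussianrand and
getExponentialrand code in the posted patch rather than taken from its
documentation, so treat it as a sketch under that assumption. Here t is the
threshold, x in [0, 1) is the normalized draw, and phi and Phi denote the
standard normal density and cumulative distribution function.

```latex
% Density of the normalized draw x on [0, 1), with threshold t
f_{\mathrm{gaussian}}(x) = \frac{2t\,\varphi\bigl(t\,(2x - 1)\bigr)}{2\,\Phi(t) - 1},
\qquad
f_{\mathrm{exponential}}(x) = \frac{t\,e^{-t x}}{1 - e^{-t}},
\qquad 0 \le x < 1

% Projection of x onto the integer keys min..max
k = \mathit{min} + \bigl\lfloor (\mathit{max} - \mathit{min} + 1)\, x \bigr\rfloor
```

A larger t concentrates the gaussian draws around the middle of the interval
and the exponential draws near min.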

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] gaussian distribution pgbench

2014-03-16 Thread KONDO Mitsumasa

(2014/03/15 15:53), Fabien COELHO wrote:


Hello Heikki,


A couple of comments:

* There should be an explicit \setrandom ... uniform option too, even though
you get that implicitly if you don't specify the distribution


Indeed. I agree. I suggested it, but it got lost.


* What exactly does the threshold mean? The docs informally explain that the
larger the threshold, the more frequent values close to the middle of the
interval are drawn, but that's pretty vague.


There are explanations and computations as comments in the code. If it is about
the documentation, I'm not sure that a very precise mathematical definition will
help a lot of people, and might rather hinder understanding, so the doc focuses
on an intuitive explanation instead.


* Does min and max really make sense for gaussian and exponential
distributions? For gaussian, I would expect mean and standard deviation as the
parameters, not min/max/threshold.


Yes... and no:-) The aim is to draw an integer primary key from a table, so it
must be in a specified range. This is approximated by drawing a double value
with the expected distribution (gaussian or exponential) and project it
carefully onto integers. If it is out of range, there is a loop and another
value is drawn. The minimal threshold constraint (2.0) ensures that the
probability of looping is low.


* How about setting the variable as a float instead of integer? Would seem more
natural to me. At least as an option.


Which variable? The values set by setrandom are mostly used for primary keys. We
really want integers in a range.

Oh, I see. He was talking about the documentation.

+   Moreover, set gaussian or exponential with threshold integer value,
+   we can get gaussian or exponential random in integer value between
+   <replaceable>min</replaceable> and <replaceable>max</replaceable> bounds inclusive.

It should read:
+   Moreover, set gaussian or exponential with threshold double value,
+   we can get gaussian or exponential random in integer value between
+   <replaceable>min</replaceable> and <replaceable>max</replaceable> bounds inclusive.


And I am going to make the documentation easier for users to understand.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] gaussian distribution pgbench

2014-03-15 Thread Fabien COELHO


Hello Heikki,


A couple of comments:

* There should be an explicit \setrandom ... uniform option too, even 
though you get that implicitly if you don't specify the distribution


Indeed. I agree. I suggested it, but it got lost.

* What exactly does the threshold mean? The docs informally explain that 
the larger the threshold, the more frequent values close to the middle of the 
interval are drawn, but that's pretty vague.


There are explanations and computations as comments in the code. If it is 
about the documentation, I'm not sure that a very precise mathematical 
definition will help a lot of people, and might rather hinder 
understanding, so the doc focuses on an intuitive explanation instead.


* Does min and max really make sense for gaussian and exponential 
distributions? For gaussian, I would expect mean and standard deviation as 
the parameters, not min/max/threshold.


Yes... and no:-) The aim is to draw an integer primary key from a table, 
so it must be in a specified range. This is approximated by drawing a 
double value with the expected distribution (gaussian or exponential) and 
project it carefully onto integers. If it is out of range, there is a loop 
and another value is drawn. The minimal threshold constraint (2.0) ensures 
that the probability of looping is low.


* How about setting the variable as a float instead of integer? Would seem 
more natural to me. At least as an option.


Which variable? The values set by setrandom are mostly used for primary 
keys. We really want integers in a range.


--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2014-03-15 Thread Mitsumasa KONDO
Oh, sorry, I forgot to include the URLs of the pictures I referred to.

http://en.wikipedia.org/wiki/Normal_distribution
http://en.wikipedia.org/wiki/Exponential_distribution

regards,
--
Mitsumasa KONDO


2014-03-15 17:50 GMT+09:00 Mitsumasa KONDO kondo.mitsum...@gmail.com:

 Hi

 2014-03-15 15:53 GMT+09:00 Fabien COELHO coe...@cri.ensmp.fr:


 Hello Heikki,


  A couple of comments:

 * There should be an explicit \setrandom ... uniform option too, even
 though you get that implicitly if you don't specify the distribution


 Indeed. I agree. I suggested it, but it got lost.

 OK. If we follow the SQL grammar, you are right. I will add it.


  * What exactly does the threshold mean? The docs informally explain
 that the larger the threshold, the more frequent values close to the middle
 of the interval are drawn, but that's pretty vague.


 There are explanations and computations as comments in the code. If it is
 about the documentation, I'm not sure that a very precise mathematical
 definition will help a lot of people, and might rather hinder
 understanding, so the doc focuses on an intuitive explanation instead.

 Yeah, I think we had better explain only the information necessary for
 using this feature. If we add mathematical theory to the docs, it will be too
 difficult for users, and it would be a waste.


  * Does min and max really make sense for gaussian and exponential
 distributions? For gaussian, I would expect mean and standard deviation as
 the parameters, not min/max/threshold.


 Yes... and no:-) The aim is to draw an integer primary key from a table,
 so it must be in a specified range. This is approximated by drawing a
 double value with the expected distribution (gaussian or exponential) and
 project it carefully onto integers. If it is out of range, there is a loop
 and another value is drawn. The minimal threshold constraint (2.0) ensures
 that the probability of looping is low.

 I think it is difficult to understand from our text alone, so I created a
 picture that will help you understand it.
 Please have a look.



  * How about setting the variable as a float instead of integer? Would
 seem more natural to me. At least as an option.


 Which variable? The values set by setrandom are mostly used for primary
 keys. We really want integers in a range.

 I think he meant the threshold parameter. The threshold is a very sensitive
 parameter, so it needs to be a double. I think you will agree when you see
 the attached picture.

 regards,
 --
 Mitsumasa KONDO
 NTT Open Source Software Center



Re: [HACKERS] gaussian distribution pgbench

2014-03-15 Thread Fabien COELHO


Nice drawing!


 * How about setting the variable as a float instead of integer? Would
seem more natural to me. At least as an option.


Which variable? The values set by setrandom are mostly used for primary
keys. We really want integers in a range.


I think he meant the threshold parameter. The threshold is a very sensitive
parameter, so it needs to be a double. I think you will agree when you see
the attached picture.


I'm sure that the threshold must be a double, but I thought it was already 
the case, because of atof, the static variables which are declared double, 
and the threshold function parameters which are declared double as well, 
and the putVariable uses a %lf format...


Possibly I'm missing something?

--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2014-03-15 Thread Mitsumasa KONDO
2014-03-15 19:04 GMT+09:00 Fabien COELHO coe...@cri.ensmp.fr:


 Nice drawing!


   * How about setting the variable as a float instead of integer? Would
 seem more natural to me. At least as an option.


 Which variable? The values set by setrandom are mostly used for primary
 keys. We really want integers in a range.


 I think he meant the threshold parameter. The threshold is a very sensitive
 parameter, so it needs to be a double. I think you will agree when you see
 the attached picture.

 Oh, sorry. That was addressed to Heikki, not to you.


 I'm sure that the threshold must be a double, but I thought it was already
 the case, because of atof, the static variables which are declared double,
 and the threshold function parameters which are declared double as well,
 and the putVariable uses a %lf format...

I think that's correct. When we read a double argument with scanf(), we can use
the %lf format.
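
To make the "threshold is a double" point concrete, a threshold argument could
be parsed and validated along the following lines. This is a hypothetical
helper, not code from the patch: it uses strtod() rather than the atof()
mentioned above, because strtod() makes it possible to detect malformed input,
and it reuses the MIN_GAUSSIAN_THRESHOLD value of 2.0 from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MIN_GAUSSIAN_THRESHOLD 2.0	/* same value as in the patch */

/* Hypothetical helper, not part of the patch: parse a threshold argument as
 * a double, rejecting malformed input and values below the minimum. */
static bool
parse_threshold(const char *text, double *result)
{
	char	   *end;
	double		value;

	assert(text != NULL && result != NULL);
	value = strtod(text, &end);
	if (end == text || *end != '\0')
		return false;			/* not a number, or trailing characters */
	if (!(value >= MIN_GAUSSIAN_THRESHOLD))
		return false;			/* below the minimum, or NaN */
	*result = value;
	return true;
}
```

On the format question: scanf() needs %lf to store into a double, while in
printf() both %f and %lf print a double.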


 Possibly I'm missing something?

Sorry. I think nothing is missing.

regards,
--
Mitsumasa KONDO


Re: [HACKERS] gaussian distribution pgbench

2014-03-14 Thread Fabien COELHO


Well, when we set '--gaussian=NUM' or '--exponential=NUM' on the command line, we
can see the access probability of the top N records in the final output. This
output is as follows,


Indeed. I had forgotten this point. This is significant information that 
I would not like to lose.


This feature helps users understand the bias of the distribution when tuning the
threshold parameter.
Without it, the distribution of the access pattern is difficult to understand,
and it cannot be provided for a custom script, because the range of the
distribution (min, max, and SQL pattern) is unknown there. So I think the present
UI is not bad and should not change.


Ok. I agree with this argument.

--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2014-03-14 Thread Heikki Linnakangas

On 03/13/2014 04:00 PM, Fujii Masao wrote:

On Thu, Mar 13, 2014 at 10:51 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

IMHO we should just implement the \setrandom changes, and not add any of
these options to modify the standard test workload. If someone wants to run
TPC-B workload with gaussian or exponential distribution, they can implement
it as a custom script. The docs include the script for the standard TPC-B
workload; just copy-paste that and modify the \setrandom lines.


Yeah, I'm OK with this.


So I took a look at the \setrandom parts of this patch to see if that's 
ready for commit, without any of the changes to modify the standard 
TPC-B workload. Attached is a patch with just those parts; everyone 
please focus on this.


A couple of comments:

* There should be an explicit \setrandom ... uniform option too, even 
though you get that implicitly if you don't specify the distribution


* What exactly does the threshold mean? The docs informally explain 
that the larger the threshold, the more frequent values close to the 
middle of the interval are drawn, but that's pretty vague.


* Does min and max really make sense for gaussian and exponential 
distributions? For gaussian, I would expect mean and standard deviation 
as the parameters, not min/max/threshold.


* How about setting the variable as a float instead of integer? Would 
seem more natural to me. At least as an option.


- Heikki
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 7c1e59e..a7713af 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,9 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+#define MIN_EXPONENTIAL_THRESHOLD	2.0	/* minimum threshold for exp */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -469,6 +472,79 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/* random number generator: exponential distribution from min to max inclusive */
+static int64
+getExponentialrand(TState *thread, int64 min, int64 max, double exp_threshold)
+{
+	double		rand;
+
+	/*
+	 * Get user specified random number in this loop. This loop is executed until
+	 * the number in the expected range. As the minimum threshold is 2.0, the
+	 * probability of a retry is at worst 13.5% as - ln(0.135) ~ 2.0 ;
+	 * For a 5.0 threshold, it is about e^{-5} ~ 0.7%.
+	 */
+	do
+	{
+		/* as pg_erand48 is in [0, 1), uniform is in (0, 1] */
+		double uniform = 1.0 - pg_erand48(thread->random_state);
+		/* rand is in [0 LARGE) */
+		rand = - log(uniform);
+	} while (rand >= exp_threshold);
+
+	/* rand in [0, exp_threshold), normalized to [0,1) */
+	rand /= exp_threshold;
+
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -stdev_threshold <= stdev < stdev_threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) >= 2 => r <= e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/* we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test? To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -stdev_threshold || stdev >= stdev_threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + stdev_threshold) / (stdev_threshold * 2.0);
+
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1312,6 +1388,7 @@ top:
 			char	   *var;
 			int64		min,
 		max;
+			double		threshold = 0;
 			char		

Re: [HACKERS] gaussian distribution pgbench

2014-03-13 Thread Fujii Masao
On Tue, Mar 11, 2014 at 1:49 PM, KONDO Mitsumasa
kondo.mitsum...@lab.ntt.co.jp wrote:
> (2014/03/09 1:49), Fabien COELHO wrote:
>>
>> Hello Mitsumasa-san,
>>
>>> New \setrandom interface is here.
>>>  \setrandom var min max [gaussian threshold | exponential threshold]
>>>
>>> Attached patch realizes this interface, but it has little bit ugly
>>> coding in executeStatement() and process_commands()..
>>
>> I think it is not too bad. The ignore extra arguments on the line is a
>> little pre-existing mess anyway.
>
> All right.
>
>>> What do you think?
>>
>> I'm okay with this UI and its implementation.
>
> OK.

Should we have the same discussion about the UI of the command-line options?
The patch adds two options, --gaussian and --exponential, but this UI
seems a bit inconsistent with the UI for \setrandom. Instead,
we could use something like --distribution=[uniform | gaussian | exponential].
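
As a rough illustration of the single-option idea, the option value could be mapped to an enum as sketched below. This is only an assumption about how it might be done: DistKind and parse_distribution() are hypothetical names, and the patch itself uses two booleans (use_gaussian, use_exponential) instead. The threshold would still have to be supplied separately.

```c
#include <string.h>

/* Hypothetical enum; the patch itself uses two booleans instead. */
typedef enum
{
	DIST_UNIFORM,
	DIST_GAUSSIAN,
	DIST_EXPONENTIAL,
	DIST_INVALID
} DistKind;

/* Map the value of a hypothetical --distribution=NAME option to an enum. */
static DistKind
parse_distribution(const char *name)
{
	if (strcmp(name, "uniform") == 0)
		return DIST_UNIFORM;
	if (strcmp(name, "gaussian") == 0)
		return DIST_GAUSSIAN;
	if (strcmp(name, "exponential") == 0)
		return DIST_EXPONENTIAL;
	return DIST_INVALID;		/* caller reports the error */
}
```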

Regards,

-- 
Fujii Masao




Re: [HACKERS] gaussian distribution pgbench

2014-03-13 Thread Heikki Linnakangas

On 03/13/2014 03:17 PM, Fujii Masao wrote:

On Tue, Mar 11, 2014 at 1:49 PM, KONDO Mitsumasa
kondo.mitsum...@lab.ntt.co.jp wrote:

(2014/03/09 1:49), Fabien COELHO wrote:


I'm okay with this UI and its implementation.


OK.


We should do the same discussion for the UI of command-line option?
The patch adds two options --gaussian and --exponential, but this UI
seems to be a bit inconsistent with the UI for \setrandom. Instead,
we can use something like --distribution=[uniform | gaussian | exponential].


IMHO we should just implement the \setrandom changes, and not add any of 
these options to modify the standard test workload. If someone wants to 
run TPC-B workload with gaussian or exponential distribution, they can 
implement it as a custom script. The docs include the script for the 
standard TPC-B workload; just copy-paste that and modify the \setrandom
lines.


- Heikki




Re: [HACKERS] gaussian distribution pgbench

2014-03-13 Thread Fujii Masao
On Thu, Mar 13, 2014 at 10:51 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
> On 03/13/2014 03:17 PM, Fujii Masao wrote:
>>
>> On Tue, Mar 11, 2014 at 1:49 PM, KONDO Mitsumasa
>> kondo.mitsum...@lab.ntt.co.jp wrote:
>>>
>>> (2014/03/09 1:49), Fabien COELHO wrote:
>>>>
>>>> I'm okay with this UI and its implementation.
>>>
>>> OK.
>>
>> We should do the same discussion for the UI of command-line option?
>> The patch adds two options --gaussian and --exponential, but this UI
>> seems to be a bit inconsistent with the UI for \setrandom. Instead,
>> we can use something like --distribution=[uniform | gaussian |
>> exponential].
>
> IMHO we should just implement the \setrandom changes, and not add any of
> these options to modify the standard test workload. If someone wants to run
> TPC-B workload with gaussian or exponential distribution, they can implement
> it as a custom script. The docs include the script for the standard TPC-B
> workload; just copy-paste that and modify the \setrandom lines.

Yeah, I'm OK with this.

Regards,

-- 
Fujii Masao




Re: [HACKERS] gaussian distribution pgbench

2014-03-13 Thread Fabien COELHO


We should do the same discussion for the UI of command-line option? The 
patch adds two options --gaussian and --exponential, but this UI seems 
to be a bit inconsistent with the UI for \setrandom.
Instead, we can use something like --distribution=[uniform | gaussian | 
exponential].


Hmmm. That is possible, obviously.

Note that it does not need to resort to a custom script, if one can do
something like --define=exp_threshold=5.6. If so, maybe one simpler
named variable could be used, say threshold, instead of separate names
for each option.


However there is a catch: currently the option makes it possible to check that the
threshold is large enough so as to avoid loops in the generator. So this
means moving the check into the generator, and doing it over and over.
Possibly this is a good idea, because otherwise a custom script could
circumvent the check. Well, the current status is that the check can be
avoided with --define...
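
To make the idea concrete, a check done inside the generator path might look like the sketch below. The two MIN_* constants are the ones defined in the patch; threshold_is_acceptable() is a hypothetical helper, and how the error would be reported is left open.

```c
#include <stdbool.h>

#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
#define MIN_EXPONENTIAL_THRESHOLD	2.0	/* minimum threshold for exp */

/*
 * Hypothetical helper: returns true when the threshold is large enough to
 * keep the retry loop of the corresponding generator cheap. A NaN threshold
 * is rejected because the comparison is false.
 */
static bool
threshold_is_acceptable(bool is_gaussian, double threshold)
{
	double		min_threshold = is_gaussian ?
		MIN_GAUSSIAN_THRESHOLD : MIN_EXPONENTIAL_THRESHOLD;

	return threshold >= min_threshold;
}
```

Doing this on every \setrandom call costs one comparison, which is small next to the log() and sqrt() calls of the draw itself.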


Also, a shorter, possibly additional, name would be nice, maybe something
like: --dist=exp|gauss|uniform? Not sure. I like long options not to be
too long.


--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2014-03-13 Thread KONDO Mitsumasa

Hi,

(2014/03/14 4:21), Fabien COELHO wrote:



We should do the same discussion for the UI of command-line option? The patch
adds two options --gaussian and --exponential, but this UI seems to be a bit
inconsistent with the UI for \setrandom.
Instead, we can use something like --distribution=[uniform | gaussian |
exponential].


Hmmm. That is possible, obviously.

Note that it does not need to resort to a custom script, if one can do something
like --define=exp_threshold=5.6.

Yeah, the threshold parameter is needed by the distribution-generating algorithms
in my patch. And it is important that we can control the distribution pattern with this
parameter.



If so, maybe one simpler named variable could
be used, say threshold, instead of separate names for each options.

If we make threshold a separate option, I think it becomes difficult to understand what
this parameter depends on. "threshold" is a very general term, and
when we add other new features, it will be difficult to understand which parameter
belongs to which feature and when it is needed.



However there is a catch: currently the option allows to check that the 
threshold
is large enough so as to avoid loops in the generator. So this mean moving the
check in the generator, and doing it over and over. Possibly this is a good 
idea,
because otherwise a custom script could circumvent the check. Well, the current
status is that the check can be avoided with --define...

Also, a shorter possibly additional name, would be nice, maybe something like:
--dist=exp|gauss|uniform? Not sure. I like long options not to be too long.

Well, if we run the standard benchmark in pgbench, we do not need to set any option, because it
is the default benchmark and it uses the uniform distribution. If we run
the other benchmarks in pgbench, such as '-S' or '-N', we need to set an option,
because they are non-standard benchmark settings, and the same applies to the gaussian and
exponential distributions. So the present UI is consistent and in line with pgbench
history.


> I like long options not to be too long.

Yes, I think so too. The present UI is very simple and convenient for combinations
such as '-S' with '--gaussian'. So I hope the UI will not change.


ex)
pgbench -S --gaussian=5
pgbench -N --exponential=2 --sampling-rate=0.8

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] gaussian distribution pgbench

2014-03-13 Thread KONDO Mitsumasa

(2014/03/13 23:00), Fujii Masao wrote:

On Thu, Mar 13, 2014 at 10:51 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

On 03/13/2014 03:17 PM, Fujii Masao wrote:


On Tue, Mar 11, 2014 at 1:49 PM, KONDO Mitsumasa
kondo.mitsum...@lab.ntt.co.jp wrote:


(2014/03/09 1:49), Fabien COELHO wrote:



I'm okay with this UI and its implementation.



OK.



We should do the same discussion for the UI of command-line option?
The patch adds two options --gaussian and --exponential, but this UI
seems to be a bit inconsistent with the UI for \setrandom. Instead,
we can use something like --distribution=[uniform | gaussian |
exponential].



IMHO we should just implement the \setrandom changes, and not add any of
these options to modify the standard test workload. If someone wants to run
TPC-B workload with gaussian or exponential distribution, they can implement
it as a custom script. The docs include the script for the standard TPC-B
workload; just copy-paster that and modify the \setrandom lines.

Well, when we set '--gaussian=NUM' or '--exponential=NUM' on the command line, we can
see the access probability of the top N records in the final output. This output
looks as follows:



[mitsu-ko@localhost pgbench]$ ./pgbench --exponential=10 postgres
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.0
access probability of top 20%, 10% and 5% records: 0.86466 0.63212 0.39347
~
This feature helps the user understand the bias of the distribution when tuning the threshold
parameter.
Without this feature, it is difficult to understand the distribution of the access
pattern, and it cannot be implemented for a custom script, because the range of the distribution
(min, max, and SQL pattern) is unknown for a custom script. So I think the present UI
is not bad and should not change.


Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] gaussian distribution pgbench

2014-03-10 Thread KONDO Mitsumasa

(2014/03/09 1:49), Fabien COELHO wrote:


Hello Mitsumasa-san,


New \setrandom interface is here.
 \setrandom var min max [gaussian threshold | exponential threshold]



Attached patch realizes this interface, but it has little bit ugly codeing in
executeStatement() and process_commands()..


I think it is not too bad. The ignore extra arguments on the line is a little
pre-existing mess anyway.

All right.


What do you think?


I'm okay with this UI and its implementation.

OK.

The attached patch updates the documentation. I don't like complex sentences,
so I use the para tag a lot. If you are happy with this documentation, please mark the patch as ready for
committer.


Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center






*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***************
*** 98,103 **** static int	pthread_join(pthread_t th, void **thread_return);
--- 98,106 ----
  #define LOG_STEP_SECONDS	5	/* seconds between log messages */
  #define DEFAULT_NXACTS	10		/* default nxacts */
  
+ #define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+ #define MIN_EXPONENTIAL_THRESHOLD	2.0	/* minimum threshold for exp */
+ 
  int			nxacts = 0;			/* number of transactions per client */
  int			duration = 0;		/* duration in seconds */
  
***************
*** 169,174 **** bool		is_connect;			/* establish connection for each transaction */
--- 172,185 ----
  bool		is_latencies;		/* report per-command latencies */
  int			main_pid;			/* main process id used in log filename */
  
+ /* gaussian distribution tests: */
+ double		stdev_threshold;   /* standard deviation threshold */
+ bool		use_gaussian = false;
+ 
+ /* exponential distribution tests: */
+ double		exp_threshold;   /* threshold for exponential */
+ bool		use_exponential = false;
+ 
  char	   *pghost = "";
  char	   *pgport = "";
  char	   *login = NULL;
***************
*** 330,335 **** static char *select_only = {
--- 341,428 ----
  	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
  
+ /* --exponential case */
+ static char *exponential_tpc_b = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ 	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --exponential with -N case */
+ static char *exponential_simple_update = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --exponential with -S case */
+ static char *exponential_select_only = {
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ };
+ 
+ /* --gaussian case */
+ static char *gaussian_tpc_b = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ 	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --gaussian with -N case */
+ static char *gaussian_simple_update = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"

Re: [HACKERS] gaussian distribution pgbench

2014-03-08 Thread Fabien COELHO


Hello Mitsumasa-san,


New \setrandom interface is here.
 \setrandom var min max [gaussian threshold | exponential threshold]


Attached patch realizes this interface, but it has little bit ugly codeing in 
executeStatement() and process_commands()..


I think it is not too bad. The ignore extra arguments on the line is a 
little pre-existing mess anyway.



What do you think?


I'm okay with this UI and its implementation.

--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2014-03-06 Thread KONDO Mitsumasa

Hi,

(2014/03/04 17:42), KONDO Mitsumasa wrote:
> (2014/03/04 17:28), Fabien COELHO wrote:
>>> OK. I'm not sure which idea is the best. So I will wait for comments from the community :)
>> Hmmm. Maybe you can do what Tom voted for, he is the committer:-)
> Yeah, but he might change his mind after our discussion. So I will wait until tomorrow,
> and if there are no comments, I will start to implement what Tom voted for.

I have created a patch with the revised UI. If we agree on this interface,
I will also start to update the documentation.


New \setrandom interface is here.
  \setrandom var min max [gaussian threshold | exponential threshold]

The attached patch implements this interface, but it has somewhat ugly code in
executeStatement() and process_commands(). It looks like the following.

if (argc == 4)
{
    ... /* uniform */
}
else if (argv[4] is "gaussian" or "exponential")
{
    ... /* gaussian or exponential */
}
else
{
    ... /* uniform with extra arguments */
}

This is because a pgbench custom script allows extra comments or extra arguments on
a line. For example, the following cases are all accepted.

  \setrandom var min max #hoge   -- uniform random
  \setrandom var min max #hoge1 #hoge2  -- uniform random
  \setrandom var min max gaussian threshold #hoge  --gaussian random

Other cases are classified as follows.
  \setrandom var min max gaussian #hoge -- uniform
  \setrandom var min max max2 gaussian threshold -- uniform
  \setrandom var min gaussian #hoge -- ERROR

However, if we get the grammar wrong in a pgbench custom script,
pgbench reports an error on the user's terminal. So I don't think this is a particular
problem.
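
To make the classification concrete, here is a small standalone sketch of the rules listed above. It is not the patch code: SetRandomKind, is_value() and classify_setrandom() are hypothetical names, and it already applies the correction from the follow-up message, where "gaussian" followed by a non-numeric token is an ERROR rather than uniform. A token starting with ':' is assumed to be a variable reference and is accepted as a value.

```c
#include <stdlib.h>
#include <string.h>

typedef enum
{
	SR_UNIFORM,
	SR_GAUSSIAN,
	SR_EXPONENTIAL,
	SR_ERROR
} SetRandomKind;

/* true if s is a :variable reference or parses completely as a number */
static int
is_value(const char *s)
{
	char	   *end;

	if (s[0] == ':')
		return s[1] != '\0';
	(void) strtod(s, &end);
	return end != s && *end == '\0';
}

/*
 * argv[0] = "setrandom", argv[1] = variable, argv[2] = min, argv[3] = max,
 * argv[4] = optional distribution keyword, argv[5] = its threshold.
 */
static SetRandomKind
classify_setrandom(int argc, const char *const *argv)
{
	int			is_gauss;
	int			is_exp;

	if (argc < 4 || !is_value(argv[2]) || !is_value(argv[3]))
		return SR_ERROR;
	if (argc == 4)
		return SR_UNIFORM;

	is_gauss = strcmp(argv[4], "gaussian") == 0;
	is_exp = strcmp(argv[4], "exponential") == 0;
	if (!is_gauss && !is_exp)
		return SR_UNIFORM;		/* extra arguments are ignored */

	if (argc < 6 || !is_value(argv[5]))
		return SR_ERROR;		/* keyword without a usable threshold */
	return is_gauss ? SR_GAUSSIAN : SR_EXPONENTIAL;
}
```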

What do you think?

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***************
*** 98,103 **** static int	pthread_join(pthread_t th, void **thread_return);
--- 98,106 ----
  #define LOG_STEP_SECONDS	5	/* seconds between log messages */
  #define DEFAULT_NXACTS	10		/* default nxacts */
  
+ #define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+ #define MIN_EXPONENTIAL_THRESHOLD	2.0	/* minimum threshold for exp */
+ 
  int			nxacts = 0;			/* number of transactions per client */
  int			duration = 0;		/* duration in seconds */
  
***************
*** 169,174 **** bool		is_connect;			/* establish connection for each transaction */
--- 172,185 ----
  bool		is_latencies;		/* report per-command latencies */
  int			main_pid;			/* main process id used in log filename */
  
+ /* gaussian distribution tests: */
+ double		stdev_threshold;   /* standard deviation threshold */
+ bool		use_gaussian = false;
+ 
+ /* exponential distribution tests: */
+ double		exp_threshold;   /* threshold for exponential */
+ bool		use_exponential = false;
+ 
  char	   *pghost = "";
  char	   *pgport = "";
  char	   *login = NULL;
***************
*** 330,335 **** static char *select_only = {
--- 341,428 ----
  	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
  
+ /* --exponential case */
+ static char *exponential_tpc_b = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ 	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --exponential with -N case */
+ static char *exponential_simple_update = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --exponential with -S case */
+ static char *exponential_select_only = {
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ };
+ 
+ /* --gaussian case */
+ static char *gaussian_tpc_b = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	\\setrandom aid 

Re: [HACKERS] gaussian distribution pgbench

2014-03-06 Thread KONDO Mitsumasa

(2014/03/07 16:02), KONDO Mitsumasa wrote:

And other cases are classified under following.
   \setrandom var min max gaussian #hoge -- uniform

Oh, that is wrong... It should be:
\setrandom var min max gaussian #hoge -- ERROR

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] gaussian distribution pgbench

2014-03-04 Thread Fabien COELHO


OK. I'm not sure which idia is the best. So I wait for comments in 
community:)


Hmmm. Maybe you can do what Tom voted for, he is the committer:-)

--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2014-03-04 Thread KONDO Mitsumasa

(2014/03/04 17:28), Fabien COELHO wrote:

OK. I'm not sure which idia is the best. So I wait for comments in community:)

Hmmm. Maybe you can do what Tom voted for, he is the committer:-)

Yeah, but he might change his mind after our discussion. So I will wait until tomorrow,
and if there are no comments, I will start implementing what Tom voted for.


Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center





Re: [HACKERS] gaussian distribution pgbench

2014-03-03 Thread KONDO Mitsumasa

(2014/03/03 16:51), Fabien COELHO wrote:
>>> \setrandom foo 1 10 [uniform]
>>> \setrandom foo 1 :size gaussian 3.6
>>> \setrandom foo 1 100 exponential 7.2
>> It's good design. I think it will become more low overhead at part of parsing
>> in pgbench, because comparison of strings will be reduced (maybe). And I'd like
>> to remove [uniform], because we have to have compatibility for old scripts,
>> and random function always gets uniform distribution in common sense of
>> programming.
>
> I just put uniform as an optional default, hence the brackets.

All right, I misunderstood. However, if we choose this format, I'd like to
remove it, because pgbench needs to check the number of arguments. If we allow
the bracketed keyword, it will not be simple.

> Otherwise, what I would have in mind if this would be designed from scratch:
>
>   \set foo 124
>   \set foo string value (?)
>   \set foo :variable
>   \set foo 12 + :shift
>
> And then
>
>   \set foo uniform 1 10
>   \set foo gaussian 1 10 4.2
>   \set foo exponential 1 100 5.2
>
> or maybe functions could be repended with something like uniform.
> But that would be for another life:-)

I don't agree with that. It adds more overhead to the parsing part and is more complex
for the user.

>> However, new grammar is little bit long in user script. It seems trade-off
>> that are visibility of scripts and user writing cost.
>
> Yep.

OK. I'm not sure which idea is the best, so I will wait for comments from the community :)

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] gaussian distribution pgbench

2014-03-02 Thread Fabien COELHO


Hello Alvaro & Tom,


Alvaro Herrera alvhe...@2ndquadrant.com writes:

Seems that in the review so far, Fabien has focused mainly in the
mathematical properties of the new random number generation.  That seems
perfectly fine, but no comment has been made about the chosen UI for the
feature.
 Per the few initial messages in the thread, in the patch as submitted 
you ask for a gaussian random number by using \setgaussian, and 
exponential via \setexp.  Is this the right UI?


I thought it would be both concise & clear to have that as another form of
\set*.


If I had designed it from the start, I think I might have put only \set
with some functions such as uniform, gaussian and so on. But once
there is a set and a setrandom for uniform, this suggested that other settings
would have their own set commands as well. Also, the number of expected
arguments is not the same, so it may make the parsing code less obvious.
Finally, this is not a heavily used language, so I would emphasize
simpler code over more elegant features, for once.


Currently you get an evenly distributed number with \setrandom.  There 
is nothing that makes it obvious on \setgaussian by itself that it 
produces random numbers.


Well, gaussian or exp are kind of a clue, at least to my 
mathematically-oriented mind.


Perhaps we should simply add a new argument to \setrandom, instead of 
creating new commands for each distribution?  I would guess that, in 
the future, we're going to want other distributions as well.


+1 for an argument to \setrandom instead of separate commands.



Not sure what it would look like; perhaps
\setrandom foo 1 10 gaussian


There is an additional argument expected. That would make:

  \setrandom foo 1 10 [uniform]
  \setrandom foo 1 :size gaussian 3.6
  \setrandom foo 1 100 exponential 7.2


FWIW, I think this style is sufficient; the others seem overcomplicated
for not much gain.  I'm not strongly attached to that position though.


If there is a change, I agree that one simple style is enough, especially 
as the parsing code is rather low-level already.


So I'm basically fine with the current status of the patch, but I would
be okay with a \setrandom as well.

--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2014-03-02 Thread KONDO Mitsumasa

(2014/03/02 22:32), Fabien COELHO wrote:

Alvaro Herrera alvhe...@2ndquadrant.com writes:

Seems that in the review so far, Fabien has focused mainly in the
mathematical properties of the new random number generation.  That seems
perfectly fine, but no comment has been made about the chosen UI for the
feature.
 Per the few initial messages in the thread, in the patch as submitted you ask
for a gaussian random number by using \setgaussian, and exponential via
\setexp.  Is this the right UI?

I thought it would be both concise & clear to have that as another form of
\set*.

Yeah, but we got only two or three? concise. So I agree with discussing the
UI.


There is an additional argument expected. That would make:

   \setrandom foo 1 10 [uniform]
   \setrandom foo 1 :size gaussian 3.6
   \setrandom foo 1 100 exponential 7.2

It's a good design. I think it will lower the parsing overhead in
pgbench, because the number of string comparisons will be reduced (maybe). And I'd like to
remove [uniform], because we have to keep compatibility with old scripts, and
a random function is always expected to give a uniform distribution in the common sense of programming.


However, the new grammar is a little bit long in user scripts. It seems to be a trade-off
between the readability of scripts and the writing cost for the user.


Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] gaussian distribution pgbench

2014-03-02 Thread Fabien COELHO



   \setrandom foo 1 10 [uniform]
   \setrandom foo 1 :size gaussian 3.6
   \setrandom foo 1 100 exponential 7.2
It's good design. I think it will become more low overhead at part of parsing 
in pgbench, because comparison of strings will be redeced(maybe). And I'd 
like to remove [uniform], beacause we have to have compatibility for old 
scripts, and random function always gets uniform distribution in common sense 
of programming.


I just put uniform as an optional default, hence the brackets.

Otherwise, what I would have in mind if this would be designed from 
scratch:


  \set foo 124
  \set foo string value (?)
  \set foo :variable
  \set foo 12 + :shift

And then

  \set foo uniform 1 10
  \set foo gaussian 1 10 4.2
  \set foo exponential 1 100 5.2

or maybe functions could be repended with something like uniform.
But that would be for another life:-)

However, new grammer is little bit long in user script. It seems trade-off 
that are visibility of scripts and user writing cost.


Yep.

--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2014-03-01 Thread Alvaro Herrera
Seems that in the review so far, Fabien has focused mainly in the
mathematical properties of the new random number generation.  That seems
perfectly fine, but no comment has been made about the chosen UI for the
feature.  Per the few initial messages in the thread, in the patch as
submitted you ask for a gaussian random number by using \setgaussian,
and exponential via \setexp.  Is this the right UI?  Currently you get
an evenly distributed number with \setrandom.  There is nothing that
makes it obvious on \setgaussian by itself that it produces random
numbers.  Perhaps we should simply add a new argument to \setrandom,
instead of creating new commands for each distribution?  I would guess
that, in the future, we're going to want other distributions as well.

Not sure what it would look like; perhaps
\setrandom foo 1 10 gaussian
or 
\setrandom foo 1 10 dist=gaussian
or
\setrandom(gaussian) foo 1 10
or
\setrandom(dist=gaussian) foo 1 10

I think we could easily support

\set distrib gaussian
\setrandom(dist=:distrib) foo 1 10

so that it can be changed for a bunch of commands easily.

Or maybe I'm going overboard, everybody else is happy with \setgaussian,
and should just use that?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] gaussian distribution pgbench

2014-03-01 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
> Seems that in the review so far, Fabien has focused mainly in the
> mathematical properties of the new random number generation.  That seems
> perfectly fine, but no comment has been made about the chosen UI for the
> feature.  Per the few initial messages in the thread, in the patch as
> submitted you ask for a gaussian random number by using \setgaussian,
> and exponential via \setexp.  Is this the right UI?  Currently you get
> an evenly distributed number with \setrandom.  There is nothing that
> makes it obvious on \setgaussian by itself that it produces random
> numbers.  Perhaps we should simply add a new argument to \setrandom,
> instead of creating new commands for each distribution?  I would guess
> that, in the future, we're going to want other distributions as well.

+1 for an argument to \setrandom instead of separate commands.

> Not sure what it would look like; perhaps
> \setrandom foo 1 10 gaussian

FWIW, I think this style is sufficient; the others seem overcomplicated
for not much gain.  I'm not strongly attached to that position though.

regards, tom lane




Re: [HACKERS] gaussian distribution pgbench

2014-02-23 Thread Fabien COELHO


Gaussian Pgbench v8 patch by Mitsumasa KONDO review & patch v9.

* The purpose of the patch is to allow a pgbench script to draw from normally
  distributed or exponentially distributed integer values instead of uniformly
  distributed.

  This is a valuable contribution to enable pgbench to generate more realistic
  loads, which are seldom uniform in practice.

* Very minor change

  I have updated the patch (v9) based on Mitsumasa's latest v8:
  - remove one spurious space in the help message.

* Compilation

  The patch applies cleanly and compiles against current head.

* Check

  I have checked that the aid values are skewed depending on the
  parameters by looking at the aid distribution in the pgbench_history
  table after a run.

* Mathematical soundness

  I've checked the mathematical soundness of the methods involved.

  I'm fine with casting doubles to integers for having the expected
  distribution on integers.

  Although there is a retry loop for finding a suitable value, the looping
  probability is low thanks to the minimum threshold parameter required.

* Conclusion

  I suggest applying this patch, which provides a useful and more realistic
  testing capability to pgbench.

--
Fabien.
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index a836acf..35edd27 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,9 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+#define MIN_EXPONENTIAL_THRESHOLD	2.0	/* minimum threshold for exp */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -169,6 +172,14 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian distribution tests: */
+double		stdev_threshold;   /* standard deviation threshold */
+bool		use_gaussian = false;
+
+/* exponential distribution tests: */
+double		exp_threshold;   /* threshold for exponential */
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -330,6 +341,88 @@ static char *select_only = {
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
+/* --exponential case */
+static char *exponential_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setexponential aid 1 :naccounts :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -N case */
+static char *exponential_simple_update = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setexponential aid 1 :naccounts :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -S case */
+static char *exponential_select_only = {
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setexponential aid 1 :naccounts :exp_threshold\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+};
+
+/* --gaussian case */
+static char *gaussian_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setgaussian aid 1 :naccounts :stdev_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+	INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, 

Re: [HACKERS] gaussian distribution pgbench

2014-02-17 Thread KONDO Mitsumasa

(2014/02/16 7:38), Fabien COELHO wrote:

   I have updated the patch (v7) based on Mitsumasa's latest v6:
   - some code simplifications & formula changes.
   - I've added explicit looping probability computations in comments
     to show the (low) looping probability of the iterative search.
   - I've tried to clarify the sgml documentation.
   - I've removed the 5.0 default value as it was not used anymore.
   - I've renamed some variables to match the naming style around.
Thank you for your detailed review and for fixing some of the code! I checked
your modified version; it seems better than the previous version, and it is
very helpful for the documentation.


* Mathematical soundness

   I've checked again the mathematical soundness for the methods involved.

   After further thought, I'm not that sure that there is not a bias induced
   by taking the second value based on cos when the first based on sin
   has failed the test. So I removed the cos computation for the gaussian
   version, and simplified the code accordingly. This means that it may be a
   little less efficient, but I'm more confident that there is no bias.
I tried to confirm which method is better. However, at the end of the day it
does not matter, because other parts of the pgbench client have a bigger
overhead. We like simple implementations, so I agree with your modified
version. I also tested this version: there is no measurable overhead in
generating gaussian and exponential random numbers with the minimum threshold,
which is the situation with the most overhead.



* Conclusion

   If Mitsumasa-san is okay with the changes I have made, I would suggest
   to accept this patch.
The attached patch, based on v7, adds output at the end of the pgbench result
showing the access probability of records when the exponential option is used.
It is calculated by a definite integral of e^-x.

If you check it and see no problem, please mark it ready for committer.
Ishii-san will review this patch :)

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***
*** 98,103  static int	pthread_join(pthread_t th, void **thread_return);
--- 98,106 
  #define LOG_STEP_SECONDS	5	/* seconds between log messages */
  #define DEFAULT_NXACTS	10		/* default nxacts */
  
+ #define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+ #define MIN_EXPONENTIAL_THRESHOLD	2.0	/* minimum threshold for exp */
+ 
  int			nxacts = 0;			/* number of transactions per client */
  int			duration = 0;		/* duration in seconds */
  
***
*** 169,174  bool		is_connect;			/* establish connection for each transaction */
--- 172,185 
  bool		is_latencies;		/* report per-command latencies */
  int			main_pid;			/* main process id used in log filename */
  
+ /* gaussian distribution tests: */
+ double		stdev_threshold;   /* standard deviation threshold */
+ bool		use_gaussian = false;
+ 
+ /* exponential distribution tests: */
+ double		exp_threshold;   /* threshold for exponential */
+ bool		use_exponential = false;
+ 
  char	   *pghost = "";
  char	   *pgport = "";
  char	   *login = NULL;
***
*** 330,335  static char *select_only = {
--- 341,428 
  	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
  
+ /* --exponential case */
+ static char *exponential_tpc_b = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setexponential aid 1 :naccounts :exp_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ 	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --exponential with -N case */
+ static char *exponential_simple_update = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setexponential aid 1 :naccounts :exp_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --exponential with -S case */
+ static char *exponential_select_only = {
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setexponential aid 1 :naccounts :exp_threshold\n"
+ 

Re: [HACKERS] gaussian distribution pgbench

2014-02-15 Thread Fabien COELHO


Gaussian Pgbench v6 patch by Mitsumasa KONDO review & patch v7.

* The purpose of the patch is to allow a pgbench script to draw from normally
  distributed or exponentially distributed integer values instead of uniformly
  distributed.

  This is a valuable contribution to enable pgbench to generate more realistic
  loads, which are seldom uniform in practice.

* Changes

  I have updated the patch (v7) based on Mitsumasa's latest v6:
  - some code simplifications & formula changes.
  - I've added explicit looping probability computations in comments
    to show the (low) looping probability of the iterative search.
  - I've tried to clarify the sgml documentation.
  - I've removed the 5.0 default value as it was not used anymore.
  - I've renamed some variables to match the naming style around.

* Compilation

  The patch applies and compiles against current head. It works as expected,
  although there is little feedback from the script to show that. By looking
  at the aid distribution in the pgbench_history table after a run, I
  could check that the aid values are indeed skewed, depending on the
  parameters.

* Mathematical soundness

  I've checked again the mathematical soundness for the methods involved.

  After further thought, I'm not that sure that there is not a bias induced
  by taking the second value based on cos when the first based on sin
  has failed the test. So I removed the cos computation for the gaussian version,
  and simplified the code accordingly. This means that it may be a little
  less efficient, but I'm more confident that there is no bias.

* Conclusion

  If Mitsumasa-san is okay with the changes I have made, I would suggest
  to accept this patch.

--
Fabien.
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 16b7ab5..afe4a32 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -106,6 +106,9 @@ extern int	optind;
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+#define MIN_EXPONENTIAL_THRESHOLD	2.0	/* minimum threshold for exp */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -177,6 +180,14 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian distribution tests: */
+double		stdev_threshold;   /* standard deviation threshold */
+bool		use_gaussian = false;
+
+/* exponential distribution tests: */
+double		exp_threshold;   /* threshold for exponential */
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -338,6 +349,88 @@ static char *select_only = {
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
+/* --exponential case */
+static char *exponential_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setexponential aid 1 :naccounts :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -N case */
+static char *exponential_simple_update = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setexponential aid 1 :naccounts :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -S case */
+static char *exponential_select_only = {
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setexponential aid 1 :naccounts :exp_threshold\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+};
+
+/* --gaussian case */
+static char *gaussian_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setgaussian aid 1 :naccounts :stdev_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	\\setrandom 

Re: [HACKERS] gaussian distribution pgbench

2014-02-14 Thread Mitsumasa KONDO
I have added an exponential distribution random generator (and a little bit
of refactoring :) ).
I use the inverse transform method to create the distribution. It is a very
simple method: values are generated by -log(rand()). We can control the slope
of the distribution using the threshold parameter, in the same way as the
gaussian threshold.

usage example
  pgbench --exponential=NUM -S

The attached graph was created with exponential threshold = 5. We can see the
exponential distribution in the graph. It supports the -S and -N options and
custom scripts. If we put
\setexponential [var] [min] [max] [threshold] in a transaction pattern file,
we get the distribution we want.

We do not have much time left to fix it... But I think most of the patch
is complete.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


gaussian_and_exponential_pgbench_v6.patch
Description: Binary data
attachment: exponential=5.png

gnuplot.sh
Description: Bourne shell script



Re: [HACKERS] gaussian distribution pgbench

2014-02-13 Thread KONDO Mitsumasa

Hi Fabien,

Thank you very much for your very detailed and useful comments!
I read your comments, and I agree with most of your advice :)

The attached patch addresses your comments. The changes are:
 - Remove the redundant long options.
   - We can use --gaussian=NUM -S or --gaussian=NUM -N.
 - Add sentences to the documentation.
 - Separate the uniform and gaussian random generation into two functions.
   - getGaussianrand() is created.
 - Specify the ranges of random numbers more strictly, e.g. (0,1) or [0,1).
   - Please see the comments in the source code for details :).
 - Fix typos.
 - Use the cos() and sin() functions when generating gaussian random numbers.
 - Add a fast sqrt calculation algorithm.
 - Reuse the sqrt result and pre-generate random numbers to reduce the
   calculation cost.
   - Measurements for this method are below. It is a little faster than the
     non-reuse method, and the gaussian distribution is still good.


* Settings
 shared_buffers = 1024MB

* Test script
 pgbench -i -s 1
 pgbench --gaussian=2 -T 30 -S -c8 -j4 -n
 pgbench --gaussian=2 -T 30 -S -c8 -j4 -n
 pgbench --gaussian=2 -T 30 -S -c8 -j4 -n

* Result
  method           |  try1  |  try2  |  try3  |
 ------------------+--------+--------+--------|
  reuse method     | 44189  | 44453  | 44013  |
  non-reuse method | 43567  | 43635  | 43508  |



(2014/02/09 21:32), Fabien COELHO wrote:

   This is a valuable contribution to enable pgbench to generate more realistic
   loads, which are seldom uniform in practice.

Thanks!


   However, ISTM that other distributions such as an exponential one would make
   more sense,
I can easily create an exponential distribution. Here, I assume the general
exponential distribution f(x) = lambda * exp(-lambda * x).

What do you think of the following interface?

custom script: \setexp [varname] min max threshold
command  : --exponential=NUM(threshold)

To keep the implementation simple, I don't want to expose a lambda variable, so
lambda is always 1; the threshold is enough to control the distribution. The
threshold parameter bounds the generated value, and the resulting distribution
is projected onto 'aid' by the same method as for the gaussian case. If you
think this is OK, I will implement the following tomorrow, and also write the
parsing part for this command...


do
{
   rand = 1.0 - pg_erand48(thread->random_state);
   rand = -log(rand);
} while (rand > exp_threshold);

return rand / exp_threshold;



   and also the values should be further randomized so that
   neighboring values are not more likely to be drawn. The latter point is non
   trivial.
That's right, but I worry that the gaussian shape and the benchmark
reproducibility might be lost if we re-randomize the access pattern, because
Postgres storage manages records page by page, and it is difficult to achieve
randomness of access across pages rather than records. To solve this problem,
we would need a smart shuffling projection function that still preserves the
gaussian distribution. I think that will be difficult, and it should be
implemented in another patch in the future.




* Mathematical soundness

   We want to derive a discrete normal distribution from a uniform one.
   Well, normal distributions are for continuous variables... Anyway, this is
   done by computing a continuous normal distribution which is then projected
   onto integers. I'm basically fine with that.

   The system uses a Box-Muller transform (1958) to do this transformation.
   The Ziggurat method seems to be preferred for this purpose, *but* it would
   require precalculated tables which depend on the target values. So I'm
   fine with the Box-Muller transform for pgbench.
Yes, that's right. I selected a simple and relatively fast algorithm, namely
the Box-Muller transform.



   The BM method uses 2 uniformly distributed numbers to derive 2 normally
   distributed numbers. The implementation computes one of these, and loops
   until one matches a threshold criterion.

   More explanations, at least in comments, are needed about this threshold
   and its meaning. It is required to be more than 2. My guess is that it allows
   to limit the number of iterations of the while loop,

Yes. This loop rarely iterates, because the minimum stdev_threshold is 2.
The probability of a retry is then about 4.6 percent (the two-sided tail of a
normal distribution beyond 2 standard deviations), so it should not be a problem.


   but in what proportion
   is unclear. The documentation also does not help the user to understand
   this value and its meaning.

Yes, it is a heuristic method, so I added comments to the documentation.



   What I think it is: it is the deviation for the FURTHEST point around the
   mean, that is the actual deviation associated to the min and max target
   values. The minimum value of 2 implies that there are at least 4 stddev
   lengths between min & max, with the most likely mean in the middle.

Correct!
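
The interpretation confirmed above can be sketched as a projection step: a deviation in [-threshold, threshold) is rescaled to [0, 1) and then mapped onto [min, max], so min and max sit threshold standard deviations away from the mean. The function project_deviation() is a hypothetical name for this illustration and is not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a deviation dev in [-threshold, threshold) onto an integer in
 * [min, max].  dev = 0 (the mean) lands in the middle of the range.
 */
static int64_t
project_deviation(double dev, double threshold, int64_t min, int64_t max)
{
	double		unit;

	assert(min <= max && threshold > 0.0);
	assert(dev >= -threshold && dev < threshold);

	unit = (dev + threshold) / (2.0 * threshold);	/* now in [0, 1) */
	return min + (int64_t) ((max - min + 1) * unit);
}
```

With threshold 2 the whole key range spans 4 standard deviations; a larger threshold squeezes more standard deviations into the same range, so the accesses cluster more tightly around the middle.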


   If the threshold test fails, one of the 2 uniform numbers is redrawn and a new
   candidate value is tested. I'm not at ease about why only 1 value is redrawn
   and not both, some explanations would be welcome. Also, on the other hand,
   why not 

Re: [HACKERS] gaussian distribution pgbench

2014-02-13 Thread KONDO Mitsumasa

Sorry, the previously attached patch has a small bug.
Please use the latest one.

 134 - return min + (int64) (max - min + 1) * rand;
 134 + return min + (int64)((max - min + 1) * rand);
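
The two lines differ only in what the cast applies to. The sketch below shows one case where the results differ, under the assumption that the function returns int64 as getrand() does: in the first form the sum is computed as a double and truncated toward zero only on return, which rounds the wrong way when min is negative. Both function names are invented for this example, and the outer cast in the first function merely makes the implicit conversion at return explicit.

```c
#include <assert.h>
#include <stdint.h>

/* Old form: the cast binds to (max - min + 1) only, so the sum is a double. */
static int64_t
scale_cast_range_only(int64_t min, int64_t max, double rand)
{
	assert(min <= max);
	return (int64_t) (min + (int64_t) (max - min + 1) * rand);
}

/* Fixed form: the product is truncated to an integer offset before adding min. */
static int64_t
scale_cast_product(int64_t min, int64_t max, double rand)
{
	assert(min <= max);
	return min + (int64_t) ((max - min + 1) * rand);
}
```

For example, with min = -5000, max = 5000 and rand = 0.00005 the offset is 0.50005: the old form truncates -4999.49995 to -4999, while the fixed form yields -5000. For non-negative min the two forms agree as long as the values are small enough to be represented exactly in a double.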

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***
*** 176,181  int			progress_nthreads = 0; /* number of threads for progress report */
--- 176,183 
  bool		is_connect;			/* establish connection for each transaction */
  bool		is_latencies;		/* report per-command latencies */
  int			main_pid;			/* main process id used in log filename */
+ double		stdev_threshold = 5;		/* standard deviation threshold */
+ bool		gaussian_option = false;	/* use gaussian distribution random generator */
  
  char	   *pghost = "";
  char	   *pgport = "";
***
*** 338,346  static char *select_only = {
--- 340,390 
  	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
  
+ /* --gaussian case */
+ static char *gaussian_tpc_b = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setgaussian aid 1 :naccounts :stdev_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ 	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --gaussian with -N case */
+ static char *gaussian_simple_update = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setgaussian aid 1 :naccounts :stdev_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --gaussian with -S case */
+ static char *gaussian_select_only = {
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setgaussian aid 1 :naccounts :stdev_threshold\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ };
+ 
  /* Function prototypes */
  static void setalarm(int seconds);
  static void *threadRun(void *arg);
+ static inline double sqrtd(const double x);
  
  static void
  usage(void)
***
*** 381,386  usage(void)
--- 425,431 
  		 -v, --vacuum-all vacuum all four standard tables before tests\n
  		 --aggregate-interval=NUM aggregate data over NUM seconds\n
  		 --sampling-rate=NUM  fraction of transactions to log (e.g. 0.01 for 1%%)\n
+ 		 --gaussian=NUM   gaussian distribution with NUM standard deviation threshold\n
  		   \nCommon options:\n
  		 -d, --debug  print debugging output\n
  		 -h, --host=HOSTNAME  database server host or socket directory\n
***
*** 477,482  getrand(TState *thread, int64 min, int64 max)
--- 522,597 
  	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
  }
  
+ /* random number generator: gaussian distribution from min to max inclusive */
+ static int64
+ getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold)
+ {
+ 	double		stdev;
+ 	double		rand;
+ 	static double	rand1;
+ 	static double	rand2;
+ 	static double	var_sqrt;
+ 	static bool	reuse = false;
+ 	
+ 	/*
+ 	 * Get user specified random number(-stdev_threshold < stdev <= stdev_threshold) 
+ 	 * in this loop. This loop is executed until appeared ranged number we want.
+ 	 * However, this loop could not almost go on, because min stdev_threshold is 2
+ 	 * then the possibility of retry-loop is under 4 percent. And possibility of
+ 	 * re-retry-loop is under 1.6 percent. And it doesn't happen frequentry even if
+ 	 * we also think about the cycle of the trigonometric function.
+  	 */
+ 	do
+ 	{
+ 		/* reuse pre calculation result as possible */
+ 		if(!reuse)
+ 		{
+ 			/* 
+  			 * pg_erand48 generates [0,1) random number. However rand1 
+  			 * needs (0,1) random number because log(0) cannot calculate.
+  			 * And rand2 also needs (0,1) random number in strictly. But
+  			 * normalization cost is high and we can substitute (0,1] at
+  			 * rand1 and [0,1) at rand2, so we use approximate calculation.
+  			 */
+ 			rand1 = 1.0 - pg_erand48(thread->random_state);
+ 			rand2 = pg_erand48(thread->random_state);
+ 		
+ 			 /* 

Re: [HACKERS] gaussian distribution pgbench

2014-02-09 Thread Fabien COELHO


Hello,


I have revised my gaussian pgbench patch, as requested by the community.


With a lot of delay for which I apologise, please find hereafter the 
review.


Gaussian Pgbench v3 patch by Mitsumasa KONDO review

* The purpose of the patch is to allow a pgbench script to draw from normally
  distributed integer values instead of uniformly distributed.

  This is a valuable contribution to enable pgbench to generate more realistic
  loads, which are seldom uniform in practice.

  However, ISTM that other distributions such as an exponential one would make
  more sense, and also the values should be further randomized so that
  neighboring values are not more likely to be drawn. The latter point is non
  trivial.

* Compilation

  The patch applies and compiles against current head. It works as expected,
  although there is little feedback from the script to show that.

* Mathematical soundness

  We want to derive a discrete normal distribution from a uniform one.
  Well, normal distributions are for continuous variables... Anyway, this is
  done by computing a continuous normal distribution which is then projected
  onto integers. I'm basically fine with that.

  The system uses a Box-Muller transform (1958) to do this transformation.
  The Ziggurat method seems to be preferred for this purpose, *but* it would
  require precalculated tables which depend on the target values. So I'm
  fine with the Box-Muller transform for pgbench.

  The BM method uses 2 uniformly distributed numbers to derive 2 normally
  distributed numbers. The implementation computes one of these, and loops
  until one matches a threshold criterion.

  More explanations, at least in comments, are needed about this threshold
  and its meaning. It is required to be more than 2. My guess is that it allows
  to limit the number of iterations of the while loop, but in what proportion
  is unclear. The documentation also does not help the user to understand
  this value and its meaning.

  What I think it is: it is the deviation for the FURTHEST point around the
  mean, that is the actual deviation associated to the min and max target
  values. The minimum value of 2 implies that there are at least 4 stddev
  lengths between min & max, with the most likely mean in the middle.

  If the threshold test fails, one of the 2 uniform numbers is redrawn and a new
  candidate value is tested. I'm not at ease about why only 1 value is redrawn
  and not both, some explanations would be welcome. Also, on the other hand,
  why not test the other possible value (with cos) if the first one fails?

  Also, as suggested above, I would like some explanations about how much this
  while loop may iterate without success, say with the expected average number
  of iterations with its explanation in a comment.

* Implementation

  Random values :
  double rand1 = 1.0 - rand; // instead of the LONG_MAX computation & limits.h
  rand2 should be in (0, 1], but it is in [0, 1), use 1.0 - ... as well?!

  What is called stdev* in getrand() is really the chosen deviation from
  the target mean, so it would make more sense to name it dev.

  I do not think that the getrand refactoring was such a good idea. I'm sorry
  if I may have suggested that in a previous comment.
  The new getrand possibly ignores its parameters, h. ISTM that it would
  be much simpler in the code to have a separate and clean getrand_normal
  or getrand_gauss called for \setgaussian, and that's it. This would
  allow to get rid of DistType and all of getrand changes in the code.

  There are heavy constant computations (sqrt(log())) within the while
  loop which could be moved out of the loop.

  ISTM that the while condition would be easier to read as:

 while ( dev < - threshold || threshold < dev )

  Maybe the \\setgaussian argument handling may be transformed into a function,
  so that it could be used easily later for some other distribution (say some
  setexp:-)

* Options

  ISTM that the test options would be better if made orthogonal, i.e. not to
  have three --gaussian* options. I would suggest to have only one
  --gaussian=NUM which would trigger gaussian tests with this threshold,
  and --gaussian=3.5 --select-only would use the select-only variant,
  and so on.

* Typos

  gausian -> gaussian
  patern -> pattern

* Conclusion :

 - this is a valuable patch to help create more realistic load and make pgbench
   a more useful tool. I'm greatly in favor of having such a functionality.

 - it seems to me that the patch should be further improved before being
   committed, in particular I would suggest:

   (1) improve the explanations in the code and in the documentation, especially
   about what is the deviation threshold and its precise link to generated
   values.

   (2) simplify the code with a separate gaussian getrand, and simpler or
   more efficient code here and there, see comments above.

   (3) use only one option to trigger gaussian tests.

   (bonus) \setexp would be nice :-)

--

Re: [HACKERS] gaussian distribution pgbench

2013-12-19 Thread Peter Geoghegan
On Thu, Nov 21, 2013 at 9:13 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 So what I'd actually like to see is \setgaussian, for use in custom scripts.

+1. I'd really like to be able to run a benchmark with a Gaussian and
uniform distribution side-by-side for comparative purposes - we need
to know that we're not optimizing one at the expense of the other.
Sure, DBT-2 gets you a non-uniform distribution, but it has serious
baggage from it being a tool primarily intended for measuring the
relative performance of different database systems. pgbench would be
pretty worthless for measuring the relative strengths and weaknesses
of different database systems, but it is not bad at informing the
optimization efforts of hackers. pgbench is a de facto standard for
that kind of thing, so we should make it incrementally better for that
kind of thing. No standard industry benchmark is likely to replace it
for this purpose, because such optimizations require relatively narrow
focus.

Sometimes I want to maximally pessimize the number of FPIs generated.
Other times I do not. Getting a sense of how something affects a
variety of distributions would be very valuable, not least since
normal distributions abound in nature.


-- 
Peter Geoghegan




Re: [HACKERS] gaussian distribution pgbench

2013-12-19 Thread Gavin Flower

On 20/12/13 09:36, Peter Geoghegan wrote:

On Thu, Nov 21, 2013 at 9:13 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

So what I'd actually like to see is \setgaussian, for use in custom scripts.

+1. I'd really like to be able to run a benchmark with a Gaussian and
uniform distribution side-by-side for comparative purposes - we need
to know that we're not optimizing one at the expense of the other.
Sure, DBT-2 gets you a non-uniform distribution, but it has serious
baggage from it being a tool primarily intended for measuring the
relative performance of different database systems. pgbench would be
pretty worthless for measuring the relative strengths and weaknesses
of different database systems, but it is not bad at informing the
optimization efforts of hackers. pgbench is a de facto standard for
that kind of thing, so we should make it incrementally better for that
kind of thing. No standard industry benchmark is likely to replace it
for this purpose, because such optimizations require relatively narrow
focus.

Sometimes I want to maximally pessimize the number of FPIs generated.
Other times I do not. Getting a sense of how something affects a
variety of distributions would be very valuable, not least since
normal distributions abound in nature.


Curious, wouldn't the common usage pattern tend to favour a skewed 
distribution, such as the Poisson distribution? (It has been over 40 
years since I studied this area, so there may be better candidates.)


Just that gut feeling and experience tend to make me think that the 
Normal distribution may often not be the best for database access 
simulation.



Cheers,
Gavin






Re: [HACKERS] gaussian distribution pgbench

2013-12-19 Thread Gregory Smith

On 12/19/13 5:52 PM, Gavin Flower wrote:
Curious, wouldn't the common usage pattern tend to favour a skewed 
distribution, such as the Poisson distribution? (It has been over 40 
years since I studied this area, so there may be better candidates.)




Some people like database load testing with a Pareto principle 
distribution, where 80% of the activity hammers 20% of the rows such 
that locking becomes important.  (That's one specific form of Pareto 
distribution.)  The standard pgbench load indirectly gets you quite a bit 
of that due to all the contention on the branches table. Targeting all 
of that at a single table can be more realistic.
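
As a rough illustration of the 80/20 idea, here is a minimal Python
sketch of a two-bucket "hot/cold" key picker. It is only an
approximation and not a true Pareto distribution; the name hot_cold_key
and its parameters are hypothetical and are not part of pgbench.

```python
import random


def hot_cold_key(rng, n, hot_frac=0.2, hot_prob=0.8):
    """Pick a key in [1, n]; hot_prob of the picks land in the first
    hot_frac of the keys (a crude 80/20 rule, not a real Pareto law)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    hot_n = max(1, int(n * hot_frac))  # size of the hot key range
    if hot_n >= n or rng.random() < hot_prob:
        # Uniform pick inside the hot range.
        return rng.randint(1, min(hot_n, n))
    # Uniform pick inside the remaining cold range.
    return rng.randint(hot_n + 1, n)
```

With the defaults, roughly 80% of the generated keys fall in the first
20% of the table, which is enough to make row locking visible.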


My last round of reviewing a pgbench change left me pretty worn out with 
wanting to extend that code much further.  Adding in some new 
probability distributions would be fine though, that's a narrow change.  
We shouldn't get too excited about pgbench remaining a great tool for 
too much longer though.  pgbench is fast approaching a wall nowadays, 
where it's hard for any single client machine to fully overload today's 
larger servers.  You basically need a second large server to generate 
load, whereas what people really want is a bunch of coordinated small 
clients.  (That sort of wall was in early versions too, it just got 
pushed upward a lot by the multi-worker changes in 9.0 coming around the 
same time desktop core counts really skyrocketed)


pgbench started as a clone of a now abandoned Java project called 
JDBCBench.  I've been seriously considering a move back toward that 
direction lately.  Nowadays spinning up ten machines to run load 
generation is trivial.  The idea of extending pgbench's C code to 
support multiple clients running at the same time and collating all of 
their results is not a project I'd be excited about.  It should remain a 
perfectly fine tool for PostgreSQL developers to find code hotspots, but 
that's only so useful.


(At this point someone normally points out Tsung solved all of those 
problems years ago if you'd only give it a chance.  I think it's kind of 
telling that work on sysbench is rewriting the whole thing so you can 
use Lua for your test scripts.)





Re: [HACKERS] gaussian distribution pgbench

2013-11-22 Thread Fabien COELHO


3. That said, this could be handy. But it would be even more handy if you 
could get Gaussian random numbers with \setrandom, so that you could use this 
with custom scripts. And once you implement that, do we actually need the -g 
flag anymore? If you want TPC-B transactions with gaussian distribution, you 
can write a custom script to do that. The documentation includes a full 
script that corresponds to the built-in TPC-B script.


So what I'd actually like to see is \setgaussian, for use in custom scripts.


Indeed, great idea! That looks pretty elegant! It would be something like:

  \setgauss var min max sigma

I'm not sure whether sigma should be relative to max-min, or absolute.
I would say relative is better...
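
A minimal sketch of what such a command could compute, written in
Python rather than pgbench's C. It assumes sigma is relative to
max-min and that out-of-range draws are simply rejected and redrawn;
the name set_gauss and the rejection strategy are illustrative
assumptions, not what the patch does.

```python
import random


def set_gauss(rng, lo, hi, sigma_rel):
    """Draw an integer in [lo, hi] from a gaussian centered on the
    middle of the range, with sigma given relative to (hi - lo)."""
    if lo > hi or sigma_rel <= 0:
        raise ValueError("need lo <= hi and sigma_rel > 0")
    center = (lo + hi) / 2.0
    sigma = sigma_rel * (hi - lo)
    while True:
        # Rejection sampling: redraw until the rounded value is in range.
        k = int(round(rng.gauss(center, sigma)))
        if lo <= k <= hi:
            return k
```

A small relative sigma concentrates accesses on the middle keys, while
a large one makes the rejection loop behave more and more like a
uniform draw.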

A concern I raised is that what one should really want is a 
pseudo-randomized (discretized) gaussian, i.e. you want the probability 
of each value to follow a gaussian distribution, *but* with no direct 
frequency correlation between neighbors. Otherwise, you may have 
unwanted/unrealistic positive cache effects. Maybe this could be 
achieved by an independent built-in, say either:


  \randomize var min max [parameter ?]
  \randomize var min max val [parameter]

Which would mean: take variable var, which must be in [min,max], and apply a 
pseudo-random transformation whose result is also in [min,max].
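
One simple candidate for such a transformation is the affine map
k' = (a * k + b) modulo n, with a coprime to n. The Python sketch below
only illustrates that idea; the function name randomize and the default
constants a and b are hypothetical choices.

```python
from math import gcd


def randomize(val, lo, hi, a=7919, b=104729):
    """Map val in [lo, hi] to another value in [lo, hi] using the
    affine permutation (a * k + b) mod n over the n keys."""
    if not lo <= val <= hi:
        raise ValueError("val must be in [lo, hi]")
    n = hi - lo + 1
    if gcd(a, n) != 1:
        # a must be coprime to n, otherwise the map is not a bijection.
        raise ValueError("a must be coprime to the number of keys")
    return lo + (a * (val - lo) + b) % n
```

This keeps the per-value probabilities intact because the map is a
bijection, but it is crude: neighboring inputs always land a fixed
stride of a apart (modulo n), so it is not a truly random permutation.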


From a probabilistic point of view, it seems to me that a randomized 
(discretized) exponential would be more relevant for modeling a server 
load.


  \setexp var min max lambda...
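
A minimal sketch of one way to compute such a value, in Python, using
inverse-transform sampling of an exponential truncated to the range.
The name set_exp and the choice of truncating (rather than rejecting)
are assumptions made for illustration only.

```python
import math
import random


def set_exp(rng, lo, hi, lam):
    """Draw an integer in [lo, hi]; small values are the most likely,
    and a larger lam makes the distribution more skewed."""
    if lo > hi or lam <= 0:
        raise ValueError("need lo <= hi and lam > 0")
    n = hi - lo + 1
    u = rng.random()  # uniform in [0, 1)
    # Inverse CDF of an exponential truncated to [0, 1).
    x = -math.log(1.0 - u * (1.0 - math.exp(-lam))) / lam
    # Scale to the key range; min() guards against float rounding.
    return min(hi, lo + int(x * n))
```

As with the gaussian case, the most frequent keys are neighbors here,
so the output would still need a scattering step to avoid the cache
effects mentioned above.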

--
Fabien.




Re: [HACKERS] gaussian distribution pgbench

2013-11-21 Thread Heikki Linnakangas

On 30.09.2013 07:12, KONDO Mitsumasa wrote:

(2013/09/27 5:29), Peter Eisentraut wrote:

This patch no longer applies.

I will try to recreate this patch for the next commit fest.
If you have a nice idea, please send it to me!


A few thoughts on this:

1. DBT-2 uses a non-uniform distribution. You can use that instead of 
pgbench.


2. Do we really want to add everything and the kitchen sink to pgbench? 
Every addition is small when considered alone, but we'll soon end up with a 
monster. So I'm inclined to reject this patch on those grounds.


3. That said, this could be handy. But it would be even more handy if 
you could get Gaussian random numbers with \setrandom, so that you could 
use this with custom scripts. And once you implement that, do we 
actually need the -g flag anymore? If you want TPC-B transactions with 
gaussian distribution, you can write a custom script to do that. The 
documentation includes a full script that corresponds to the built-in 
TPC-B script.


So what I'd actually like to see is \setgaussian, for use in custom scripts.

- Heikki




Re: [HACKERS] gaussian distribution pgbench

2013-09-29 Thread KONDO Mitsumasa

Sorry for my delayed reply.
Since I was on vacation last week, I replied from gmail.
However, that post to pgsql-hackers was stalled :-(

(2013/09/21 6:05), Kevin Grittner wrote:
 You had accidentally added to the CF In Progress.
Oh, I had completely mistaken the CF schedule :-)
Maybe Horiguchi-san is in the same situation...

However, because you moved it, I became the first submitter in the next CF.
Thank you for moving it!
--
Mitsumasa KONDO
NTT Open Source Software Center








Re: [HACKERS] gaussian distribution pgbench

2013-09-29 Thread KONDO Mitsumasa

Sorry for my delayed reply.
Since I was on vacation last week, I replied from gmail.
However, that post to pgsql-hackers was stalled :-(

(2013/09/21 7:54), Fabien COELHO wrote:

 However this pattern induces stronger cache effects which are maybe not too
 realistic, because neighboring keys in the middle are more likely to be chosen.

I think that your opinion is right. However, in effect, it is a
pseudo-benchmark, so I think that such a simple mechanism is also necessary.



 Have you considered adding a randomization layer, that is once you have a key
 in [1 .. n] centered around n/2, then you perform a pseudo-random
 transformation into the same domain so that key values are scattered over the
 whole domain?

Yes, I have also considered this for the patch. It can be realized by adding a
linear mapping array created by a random generator. However, the current
erand48 algorithm is an old algorithm with low accuracy, so I do not know
whether it would work well. If we implement it, we may need a more accurate
random generator algorithm such as Mersenne Twister.
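
The linear mapping array idea can be sketched in Python as follows.
Python's random module happens to use Mersenne Twister as its core
generator; the names build_mapping and scatter are hypothetical, and
the sketch needs O(n) memory for the array.

```python
import random


def build_mapping(n, seed):
    """Build a shuffled array holding each key 1..n exactly once."""
    rng = random.Random(seed)  # Mersenne Twister based generator
    mapping = list(range(1, n + 1))
    rng.shuffle(mapping)
    return mapping


def scatter(key, mapping):
    """Translate a key in [1, n] through the mapping array."""
    return mapping[key - 1]
```

Building the array once with a fixed seed keeps runs reproducible; the
gaussian-distributed key is then passed through scatter() before being
used in the query.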


Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center





Re: [HACKERS] gaussian distribution pgbench

2013-09-29 Thread KONDO Mitsumasa
(2013/09/27 5:29), Peter Eisentraut wrote:
 This patch no longer applies.
I will try to recreate this patch for the next commit fest.
If you have a nice idea, please send it to me!

Regards,
-- 
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] gaussian distribution pgbench

2013-09-26 Thread Peter Eisentraut
On 9/20/13 2:42 AM, KONDO Mitsumasa wrote:
 I created a gaussian distribution pgbench patch that can access records with
 gaussian frequency, and I submitted it to this commit fest.

This patch no longer applies.





Re: [HACKERS] gaussian distribution pgbench

2013-09-23 Thread Mitsumasa KONDO
 However this pattern induces stronger cache effects which are maybe not
 too realistic, because neighboring keys in the middle are more likely to
 be chosen.

I think that your opinion is right. However, in effect, it is a
pseudo-benchmark, so I think that such a simple mechanism is also
necessary.


 Have you considered adding a randomization layer, that is once you have
 a key in [1 .. n] centered around n/2, then you perform a pseudo-random
 transformation into the same domain so that key values are scattered over
 the whole domain?

Yes, I have also considered this for the patch. It can be realized by adding
a linear mapping array created by a random generator. However, the current
erand48 algorithm is an old algorithm with low accuracy, so I do not know
whether it would work well. If we implement it, we may need a more accurate
random generator algorithm such as Mersenne Twister.


Regards,

--

Mitsumasa KONDO


Re: [HACKERS] gaussian distribution pgbench

2013-09-23 Thread Mitsumasa KONDO
 You had accidentally added to the CF In Progress.

Oh, I had completely mistaken the CF schedule :-)

Maybe Horiguchi-san is in the same situation...


However, because you moved it, I became the first submitter in the next CF.

Thank you for moving it :-)

--

Mitsumasa KONDO


Re: [HACKERS] gaussian distribution pgbench

2013-09-20 Thread Kevin Grittner
KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote:

 I created a gaussian distribution pgbench patch that can access
 records with gaussian frequency, and I submitted it to this commit fest.

Thanks!

I have moved this to the Open CommitFest, though.

https://commitfest.postgresql.org/action/commitfest_view/open

You had accidentally added to the CF In Progress.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] gaussian distribution pgbench

2013-09-20 Thread Fabien COELHO


Hello Mitsumasa,


In general transaction workloads, clients rarely access all records equally.
I think gaussian-distributed access patterns cover most transaction patterns
in general. My patch approximately realizes this access pattern.


That is great! I was just looking for something like that!

I have not looked at the patch yet, but from the plots you sent, it seems 
that it is a gaussian distribution over the keys. However this pattern 
induces stronger cache effects which are maybe not too realistic, because 
neighboring keys in the middle are more likely to be chosen.


It seems to me that this is not desirable.

Have you considered adding a randomization layer, that is once you have 
a key in [1 .. n] centered around n/2, then you perform a pseudo-random 
transformation into the same domain so that key values are scattered over 
the whole domain?


--
Fabien.

