Re: [HACKERS] gaussian distribution pgbench -- splits v4
Hi, 2014-08-01 16:26 GMT+09:00 Fabien COELHO > > Maybe somebody who knows more math than I do (like you, probably!) can >> come up with something more clever. >> > > I can certainly suggest other formula, but that does not mean beautiful > code, thus would probably be rejected. I'll see. > > An alternative to this whole process may be to hash/modulo a non uniform > random value. > > id = 1 + hash(some-random()) % n > > But the hashing changes the distribution as it adds collisions, so I have > to think about how to be able to control the distribution in that case, and > what hash function to use. I think that we have to consider and select reproducible method, because benchmark is always needed robust and reproducible result. And if we realize this idea, we might need more accurate random generator that is like Mersenne twister algorithm. erand48 algorithm is slow and not accurate very much. By the way, I don't know relativeness of this topic and command line option... Well whatever... Regards, -- Mitsumasa KONDO
Re: [HACKERS] gaussian distribution pgbench -- splits v4
Hello, Version one is "k' = 1 + (a * k + b) modulo n" with "a" prime with respect to "n", "n" being the number of keys. This is nearly possible, but for the modulo operator which is currently missing, and that I'm planning to submit for this very reason, but probably another time. That's pretty crude, Yep. It is very simple, it is much better than nothing, and for a database test is may be "good enough". although I don't object to a modulo operator. It would be nice to be able to use a truly random permutation, which is not hard to generate but probably requires O(n) storage, likely a problem for large scale factors. That is indeed the actual issue in my mind. I was thinking of permutations with a formula, which are not so easy to find and may end-up looking like "(a*k+b)%n" anyway. I had the same issue for generating random data for a schema (see http://www.coelho.net/datafiller.html). Maybe somebody who knows more math than I do (like you, probably!) can come up with something more clever. I can certainly suggest other formula, but that does not mean beautiful code, thus would probably be rejected. I'll see. An alternative to this whole process may be to hash/modulo a non uniform random value. id = 1 + hash(some-random()) % n But the hashing changes the distribution as it adds collisions, so I have to think about how to be able to control the distribution in that case, and what hash function to use. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench -- splits v4
On Thu, Jul 31, 2014 at 10:01 AM, Fabien COELHO wrote: >> One of the concerns that I have about the proposal of simply slapping a >> gaussian or exponential modifier onto \setrandom aid 1 :naccounts is that, >> while it will allow you to make part of the relation hot and another part of >> the relation cold, you really can't get any more fine-grained than that. If >> you use exponential, all the hot accounts will be near the beginning of the >> relation, and if you use gaussian, they'll all be in the middle. > > That is a very good remark. Although I thought of it, I do not have a very > good solution yet:-) > > From a testing perspective, if we assume that keys have no semantics, a > reasonable assumption is that the distribution of access for actual > realistic workloads is probably exponential (of gaussian, anyway hardly > uniform), but without direct correlation between key values. > > In order to simulate that, we would have to apply a fixed (pseudo-)random > permutation to the exponential-drawn key values. This is a non trivial > problem. The version zero of solving it is to do nothing... it is the > current status;-) Version one is "k' = 1 + (a * k + b) modulo n" with "a" > prime with respect to "n", "n" being the number of keys. This is nearly > possible, but for the modulo operator which is currently missing, and that > I'm planning to submit for this very reason, but probably another time. That's pretty crude, although I don't object to a modulo operator. It would be nice to be able to use a truly random permutation, which is not hard to generate but probably requires O(n) storage, likely a problem for large scale factors. Maybe somebody who knows more math than I do (like you, probably!) can come up with something more clever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench -- splits v4
Hello Robert, [...] One of the concerns that I have about the proposal of simply slapping a gaussian or exponential modifier onto \setrandom aid 1 :naccounts is that, while it will allow you to make part of the relation hot and another part of the relation cold, you really can't get any more fine-grained than that. If you use exponential, all the hot accounts will be near the beginning of the relation, and if you use gaussian, they'll all be in the middle. That is a very good remark. Although I thought of it, I do not have a very good solution yet:-) From a testing perspective, if we assume that keys have no semantics, a reasonable assumption is that the distribution of access for actual realistic workloads is probably exponential (of gaussian, anyway hardly uniform), but without direct correlation between key values. In order to simulate that, we would have to apply a fixed (pseudo-)random permutation to the exponential-drawn key values. This is a non trivial problem. The version zero of solving it is to do nothing... it is the current status;-) Version one is "k' = 1 + (a * k + b) modulo n" with "a" prime with respect to "n", "n" being the number of keys. This is nearly possible, but for the modulo operator which is currently missing, and that I'm planning to submit for this very reason, but probably another time. I'm not sure exactly will happen after some updating has happened; I'm guessing some of the keys will still be in their original location and others will have been pushed to the end of the relation following relation-extension. This is a not too bad side. What matters most in the long term is not the key value correlation, but the actual storage correlation, i.e. whether two tuples required are in the same page or not. At the beginning of a simulation, with close key numbers being picked up with an exponential distribution, the correlation is more that what would be expected. However, once a significant amount of the table has been updated, this initial artificial correlation is going to fade, and the test should become more realistic. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench -- splits v4
On Wed, Jul 30, 2014 at 9:00 PM, Mitsumasa KONDO wrote: > Hmm... It doesn't have harm for pgbench source code. And, in general, > checking script is useful for avoiding bug. Not if nobody runs it, or if people run it but don't know what the output should look like. I think anyone who knows enough to find bugs by running these scripts probably doesn't need the scripts. > No, patch B is still needed. Please tell me the reason. I don't like > deciding by someones feeling, > and it needs logical reason. Our documentation is better than the past. I > think it can easy to understand decile probability. > This part of the discussion is needed to continue... > >> Would providing these as additional contrib files be more acceptable? >> Something like "tpc-b-gauss.sql"... Otherwise there is no example available >> to show the feature. > > I agree the test script and including command line options. It's not harm, > and it's useful. As to all of this, I simply don't agree that the stuff has enough value to justify including it. Now, of course, that is subjective: one person may think it has enough value, while another person may think that it does not have enough value. So it just comes down to a question of opinion, and we make those judgements of opinion all the time. If we included everything that everyone who works on the code wants included, we'd end up with a bloated mess of stuff that nobody cares about; indeed, we have a significant amount of stuff in the source code that IMHO looks like somebody's debugging leftovers that should have been removed before commit. I don't want to add more unless there is clear and convincing evidence that a significant number of people want it, and that is not the case here. Now, if we get a few reports from people saying, hey, I was doing some benchmarking with pgbench, and I found the new gaussian feature to be really excellent, but it sucked that there was no command-line option for it, we can go back and add one. No problem! But in the meantime, we've added the core of the feature without cluttering up the list of command-line options with things that may or may not prove to be useful. One of the concerns that I have about the proposal of simply slapping a gaussian or exponential modifier onto \setrandom aid 1 :naccounts is that, while it will allow you to make part of the relation hot and another part of the relation cold, you really can't get any more fine-grained than that. If you use exponential, all the hot accounts will be near the beginning of the relation, and if you use gaussian, they'll all be in the middle. I'm not sure exactly will happen after some updating has happened; I'm guessing some of the keys will still be in their original location and others will have been pushed to the end of the relation following relation-extension. But there's no way, with those command line options, to for example have 5 hot spots distributed uniformly through the relation; or even to have the end of the relation rather than the beginning or the middle as the hot spot. You can do those things with the newly-enhanced \setrand *and a custom script* but not with just a command-line option. So that makes me think that people who find these new facilities useful might not get all that much use out of the command-line option anyway; and we can't have a command-line option for every behavior anyone ever wants. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench -- splits v4
On Wed, Jul 30, 2014 at 4:18 PM, Fabien COELHO wrote: >> nor am I in favor of patch B. > > Yep. Would providing these as additional contrib files be more acceptable? > Something like "tpc-b-gauss.sql"... Otherwise there is no example available > to show the feature. To be honest, it just feels like clutter to me. If we added examples for every feature that is as significant as this one is, we'd end up with twice the installation footprint, and most of it would be stuff nobody ever looked at. I think the documentation is good enough that people will be able to understand how to use this feature, which is good enough for me. One thing that might still be worth doing is including the standard pgbench scripts in the pgbench documentation. Then we could say things like "and you could also modify these". Right now I tend to end up cut-and-pasting from the source code, which is fine if you're a hacker but not so user-friendly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench -- splits v4
Hi, 2014-07-31 5:18 GMT+09:00 Fabien COELHO : > > I've committed the changes to pgbench.c and the documentation changes >> with some further wordsmithing. >> > > Ok, thanks a lot for your reviews and your help with improving the > documentation. Yeah, thanks for all relative members. > I don't think including the other changes in patch A is a good idea, >> > > Fine. It was mostly for testing and checking purposes. Hmm... It doesn't have harm for pgbench source code. And, in general, checking script is useful for avoiding bug. nor am I in favor of patch B. >> > > Yep. No, patch B is still needed. Please tell me the reason. I don't like deciding by someones feeling, and it needs logical reason. Our documentation is better than the past. I think it can easy to understand decile probability. This part of the discussion is needed to continue... Would providing these as additional contrib files be more acceptable? > Something like "tpc-b-gauss.sql"... Otherwise there is no example available > to show the feature. > I agree the test script and including command line options. It's not harm, and it's useful. Best regards, -- Mitsumasa KONDO
Re: [HACKERS] gaussian distribution pgbench -- splits v4
Hello Robert, I've committed the changes to pgbench.c and the documentation changes with some further wordsmithing. Ok, thanks a lot for your reviews and your help with improving the documentation. I don't think including the other changes in patch A is a good idea, Fine. It was mostly for testing and checking purposes. nor am I in favor of patch B. Yep. Would providing these as additional contrib files be more acceptable? Something like "tpc-b-gauss.sql"... Otherwise there is no example available to show the feature. Thanks again, -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench -- splits v4
On Tue, Jul 29, 2014 at 4:41 AM, Fabien COELHO wrote: >> Attached B patch does turn incorrect setrandom syntax into errors instead >> of ignoring extra parameters. >> >> First A patch is repeated to help commitfest references. > > Oops, I applied the change on the wrong part:-( > > Here is the change on part A which checks setrandom syntax, and B for > completeness. I've committed the changes to pgbench.c and the documentation changes with some further wordsmithing. I don't think including the other changes in patch A is a good idea, nor am I in favor of patch B. But thanks for your and Kondo-san's hard work on this; I think this will be quite useful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench -- splits v4
Attached B patch does turn incorrect setrandom syntax into errors instead of ignoring extra parameters. First A patch is repeated to help commitfest references. Oops, I applied the change on the wrong part:-( Here is the change on part A which checks setrandom syntax, and B for completeness. -- Fabien.diff --git a/contrib/pgbench/README b/contrib/pgbench/README new file mode 100644 index 000..6881256 --- /dev/null +++ b/contrib/pgbench/README @@ -0,0 +1,5 @@ +# gaussian and exponential tests +# with XXX as "expo" or "gauss" +psql test < test-init.sql +./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test +psql test < test-XXX-check.sql diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 4aa8a50..16e44bd 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -98,6 +98,8 @@ static int pthread_join(pthread_t th, void **thread_return); #define LOG_STEP_SECONDS 5 /* seconds between log messages */ #define DEFAULT_NXACTS 10 /* default nxacts */ +#define MIN_GAUSSIAN_THRESHOLD 2.0 /* minimum threshold for gauss */ + int nxacts = 0; /* number of transactions per client */ int duration = 0; /* duration in seconds */ @@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max) return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state)); } +/* + * random number generator: exponential distribution from min to max inclusive. + * the threshold is so that the density of probability for the last cut-off max + * value is exp(-threshold). + */ +static int64 +getExponentialRand(TState *thread, int64 min, int64 max, double threshold) +{ + double cut, uniform, rand; + Assert(threshold > 0.0); + cut = exp(-threshold); + /* erand in [0, 1), uniform in (0, 1] */ + uniform = 1.0 - pg_erand48(thread->random_state); + /* + * inner expresion in (cut, 1] (if threshold > 0), + * rand in [0, 1) + */ + Assert((1.0 - cut) != 0.0); + rand = - log(cut + (1.0 - cut) * uniform) / threshold; + /* return int64 random number within between min and max */ + return min + (int64)((max - min + 1) * rand); +} + +/* random number generator: gaussian distribution from min to max inclusive */ +static int64 +getGaussianRand(TState *thread, int64 min, int64 max, double threshold) +{ + double stdev; + double rand; + + /* + * Get user specified random number from this loop, with + * -threshold < stdev <= threshold + * + * This loop is executed until the number is in the expected range. + * + * As the minimum threshold is 2.0, the probability of looping is low: + * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average + * sinus multiplier as 2/pi, we have a 8.6% looping probability in the + * worst case. For a 5.0 threshold value, the looping probability + * is about e^{-5} * 2 / pi ~ 0.43%. + */ + do + { + /* + * pg_erand48 generates [0,1), but for the basic version of the + * Box-Muller transform the two uniformly distributed random numbers + * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller) + */ + double rand1 = 1.0 - pg_erand48(thread->random_state); + double rand2 = 1.0 - pg_erand48(thread->random_state); + + /* Box-Muller basic form transform */ + double var_sqrt = sqrt(-2.0 * log(rand1)); + stdev = var_sqrt * sin(2.0 * M_PI * rand2); + + /* + * we may try with cos, but there may be a bias induced if the previous + * value fails the test. To be on the safe side, let us try over. + */ + } + while (stdev < -threshold || stdev >= threshold); + + /* stdev is in [-threshold, threshold), normalization to [0,1) */ + rand = (stdev + threshold) / (threshold * 2.0); + + /* return int64 random number within between min and max */ + return min + (int64)((max - min + 1) * rand); +} + /* call PQexec() and exit() on failure */ static void executeStatement(PGconn *con, const char *sql) @@ -1319,6 +1391,7 @@ top: char *var; int64 min, max; + double threshold = 0; char res[64]; if (*argv[2] == ':') @@ -1364,11 +1437,11 @@ top: } /* - * getrand() needs to be able to subtract max from min and add one - * to the result without overflowing. Since we know max > min, we - * can detect overflow just by checking for a negative result. But - * we must check both that the subtraction doesn't overflow, and - * that adding one to the result doesn't overflow either. + * Generate random number functions need to be able to subtract + * max from min and add one to the result without overflowing. + * Since we know max > min, we can detect overflow just by checking + * for a negative result. But we must check both that the subtraction + * doesn't overflow, and that adding one to the result doesn't overflow either. */ if (max - min < 0 || (max - min) + 1 < 0) { @@ -1377,10 +1450,64 @@ top: return true; } + if (argc == 4 || /* uniform without or with "uniform" keyword */ +(argc == 5 && pg_strcasecmp(ar
Re: [HACKERS] gaussian distribution pgbench -- splits v4
Hello Robert, 3. Similarly, I suggest that the use of gaussian or uniform be an error when argc < 6 OR argc > 6. I also suggest that the parenthesized distribution type be dropped from the error message in all cases. I wish to agree, but my interpretation of the previous code is that they were ignored before, so ISTM that we are stuck with keeping the same unfortunate behavior. I don't agree. Attached B patch does turn incorrect setrandom syntax into errors instead of ignoring extra parameters. First A patch is repeated to help commitfest references. -- Fabien.diff --git a/contrib/pgbench/README b/contrib/pgbench/README new file mode 100644 index 000..6881256 --- /dev/null +++ b/contrib/pgbench/README @@ -0,0 +1,5 @@ +# gaussian and exponential tests +# with XXX as "expo" or "gauss" +psql test < test-init.sql +./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test +psql test < test-XXX-check.sql diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 4aa8a50..e07206a 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -98,6 +98,8 @@ static int pthread_join(pthread_t th, void **thread_return); #define LOG_STEP_SECONDS 5 /* seconds between log messages */ #define DEFAULT_NXACTS 10 /* default nxacts */ +#define MIN_GAUSSIAN_THRESHOLD 2.0 /* minimum threshold for gauss */ + int nxacts = 0; /* number of transactions per client */ int duration = 0; /* duration in seconds */ @@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max) return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state)); } +/* + * random number generator: exponential distribution from min to max inclusive. + * the threshold is so that the density of probability for the last cut-off max + * value is exp(-threshold). + */ +static int64 +getExponentialRand(TState *thread, int64 min, int64 max, double threshold) +{ + double cut, uniform, rand; + Assert(threshold > 0.0); + cut = exp(-threshold); + /* erand in [0, 1), uniform in (0, 1] */ + uniform = 1.0 - pg_erand48(thread->random_state); + /* + * inner expresion in (cut, 1] (if threshold > 0), + * rand in [0, 1) + */ + Assert((1.0 - cut) != 0.0); + rand = - log(cut + (1.0 - cut) * uniform) / threshold; + /* return int64 random number within between min and max */ + return min + (int64)((max - min + 1) * rand); +} + +/* random number generator: gaussian distribution from min to max inclusive */ +static int64 +getGaussianRand(TState *thread, int64 min, int64 max, double threshold) +{ + double stdev; + double rand; + + /* + * Get user specified random number from this loop, with + * -threshold < stdev <= threshold + * + * This loop is executed until the number is in the expected range. + * + * As the minimum threshold is 2.0, the probability of looping is low: + * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average + * sinus multiplier as 2/pi, we have a 8.6% looping probability in the + * worst case. For a 5.0 threshold value, the looping probability + * is about e^{-5} * 2 / pi ~ 0.43%. + */ + do + { + /* + * pg_erand48 generates [0,1), but for the basic version of the + * Box-Muller transform the two uniformly distributed random numbers + * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller) + */ + double rand1 = 1.0 - pg_erand48(thread->random_state); + double rand2 = 1.0 - pg_erand48(thread->random_state); + + /* Box-Muller basic form transform */ + double var_sqrt = sqrt(-2.0 * log(rand1)); + stdev = var_sqrt * sin(2.0 * M_PI * rand2); + + /* + * we may try with cos, but there may be a bias induced if the previous + * value fails the test. To be on the safe side, let us try over. + */ + } + while (stdev < -threshold || stdev >= threshold); + + /* stdev is in [-threshold, threshold), normalization to [0,1) */ + rand = (stdev + threshold) / (threshold * 2.0); + + /* return int64 random number within between min and max */ + return min + (int64)((max - min + 1) * rand); +} + /* call PQexec() and exit() on failure */ static void executeStatement(PGconn *con, const char *sql) @@ -1319,6 +1391,7 @@ top: char *var; int64 min, max; + double threshold = 0; char res[64]; if (*argv[2] == ':') @@ -1364,11 +1437,11 @@ top: } /* - * getrand() needs to be able to subtract max from min and add one - * to the result without overflowing. Since we know max > min, we - * can detect overflow just by checking for a negative result. But - * we must check both that the subtraction doesn't overflow, and - * that adding one to the result doesn't overflow either. + * Generate random number functions need to be able to subtract + * max from min and add one to the result without overflowing. + * Since we know max > min, we can detect overflow just by checking + * for a negative result. But we must check both that the subtraction + * doesn't overflow, and that a
Re: [HACKERS] gaussian distribution pgbench -- splits v4
Hello Robert, I wish to agree, but my interpretation of the previous code is that they were ignored before, so ISTM that we are stuck with keeping the same unfortunate behavior. I don't agree. I'm not in a huge hurry to fix all the places where pgbench currently lacks error checks just because I don't have enough to do (hint: I do have enough to do), but when we're adding more complicated syntax in one particular place, bringing the error checks in that portion of the code up to scratch is an eminently sensible thing to do, and we should do it. Ok. I'm in favor of that anyway. It is just that was afraid that changing behavior, however poor the said behavior, could be a blocker. Also, please stop changing the title of this thread every other post. It breaks threading for me (and anyone else using gmail), and that makes the thread hard to follow. Sorry. It does not break my mailer which relies on internal headers, but I'll try to be compatible with this gmail "features" in the future. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench -- splits v4
On Wed, Jul 23, 2014 at 12:39 PM, Fabien COELHO wrote: >> 3. Similarly, I suggest that the use of gaussian or uniform be an >> error when argc < 6 OR argc > 6. I also suggest that the >> parenthesized distribution type be dropped from the error message in >> all cases. > > I wish to agree, but my interpretation of the previous code is that they > were ignored before, so ISTM that we are stuck with keeping the same > unfortunate behavior. I don't agree. I'm not in a huge hurry to fix all the places where pgbench currently lacks error checks just because I don't have enough to do (hint: I do have enough to do), but when we're adding more complicated syntax in one particular place, bringing the error checks in that portion of the code up to scratch is an eminently sensible thing to do, and we should do it. Also, please stop changing the title of this thread every other post. It breaks threading for me (and anyone else using gmail), and that makes the thread hard to follow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench -- splits v4
Hi, Thank you for your grate documentation and fix working!!! It becomes very helpful for understanding our feature. I add two feature in gauss_B_4.patch. 1) Add gaussianProbability() function It is same as exponentialProbability(). And the feature is as same as before. 2) Add result of "max/min percent of the range" It is almost same as --exponential option's result. However, max percent of the range is center of distribution and min percent of the range is most side of distribution. Here is the output example, + pgbench_account's aid selected with a truncated gaussian distribution + standard deviation threshold: 5.0 + decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0% + probability of max/min percent of the range: 4.0% 0.0% And I add the explanation about this in the document. I'm very appreciate for your works!!! Best regards, -- Mitsumasa KONDO gauss_B_5.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench -- splits v4
Hello Robert, Some review comments: Thanks a lot for your return. Please find attached two new parts of the patch (A for setrandom extension, B for pgbench embedded test case extension). 1. I suggest that getExponentialrand and getGaussianrand be renamed to getExponentialRand and getGaussianRand. Done. It was named like that because "getrand" was used for the uniform case. 2. I suggest that the code be changed so that the branch currently labeled as /* uniform with extra argument */ become a hard error instead of a warning. 3. Similarly, I suggest that the use of gaussian or uniform be an error when argc < 6 OR argc > 6. I also suggest that the parenthesized distribution type be dropped from the error message in all cases. I wish to agree, but my interpretation of the previous code is that they were ignored before, so ISTM that we are stuck with keeping the same unfortunate behavior. 4. This question mark seems like it should be a period: + * value fails the test? To be on the safe side, let us try over. Indeed. 5. With regards to the following paragraph: + The default random distribution is uniform, that is all values in the + range are drawn with equal probability. The gaussian and exponential + options allow to change this default. The mandatory + threshold double value controls the actual distribution + with gaussian or exponential. + This paragraph needs a bit of copy-editing. Here's an attempt: "By default, all values in the range are drawn with equal probability. The gaussian and exponential options modify this behavior; each requires a mandatory threshold which determines the precise shape of the distribution." The following paragraph should be changed to begin with "For a Gaussian distribution" and the one after "For an exponential distribution". Ok. I've kept "uniform" in the first sentence, because this is both an option name and it is significant in term of probabilities. 6. Overall, I think the documentation here looks much better now, but I suggest adding one or two example to the Gaussian section. Like this: for example, if threshold is 2.0, 68% of the values will fall in the middle third of the interval; with a threshold of 3.0, 99.7% of the values will fall in the middle third of the interval. These numbers are fabricated, and the middle third of the interval might not be the best part to talk about, but you get the idea (I hope). Done with threshold value 4.0 so I have a "middle quarter" and a "middle half". -- Fabien.diff --git a/contrib/pgbench/README b/contrib/pgbench/README new file mode 100644 index 000..6881256 --- /dev/null +++ b/contrib/pgbench/README @@ -0,0 +1,5 @@ +# gaussian and exponential tests +# with XXX as "expo" or "gauss" +psql test < test-init.sql +./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test +psql test < test-XXX-check.sql diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 4aa8a50..e07206a 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -98,6 +98,8 @@ static int pthread_join(pthread_t th, void **thread_return); #define LOG_STEP_SECONDS 5 /* seconds between log messages */ #define DEFAULT_NXACTS 10 /* default nxacts */ +#define MIN_GAUSSIAN_THRESHOLD 2.0 /* minimum threshold for gauss */ + int nxacts = 0; /* number of transactions per client */ int duration = 0; /* duration in seconds */ @@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max) return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state)); } +/* + * random number generator: exponential distribution from min to max inclusive. + * the threshold is so that the density of probability for the last cut-off max + * value is exp(-threshold). + */ +static int64 +getExponentialRand(TState *thread, int64 min, int64 max, double threshold) +{ + double cut, uniform, rand; + Assert(threshold > 0.0); + cut = exp(-threshold); + /* erand in [0, 1), uniform in (0, 1] */ + uniform = 1.0 - pg_erand48(thread->random_state); + /* + * inner expresion in (cut, 1] (if threshold > 0), + * rand in [0, 1) + */ + Assert((1.0 - cut) != 0.0); + rand = - log(cut + (1.0 - cut) * uniform) / threshold; + /* return int64 random number within between min and max */ + return min + (int64)((max - min + 1) * rand); +} + +/* random number generator: gaussian distribution from min to max inclusive */ +static int64 +getGaussianRand(TState *thread, int64 min, int64 max, double threshold) +{ + double stdev; + double rand; + + /* + * Get user specified random number from this loop, with + * -threshold < stdev <= threshold + * + * This loop is executed until the number is in the expected range. + * + * As the minimum threshold is 2.0, the probability of looping is low: + * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average + * sinus multiplier as 2/pi, we have a 8.6% looping probability in the + * worst case. For a 5.0