Hi Febien,
Thank you very much for your very detail and useful comments!
I read your comment, I agree most of your advice:)
Attached patch is fixed for your comment. That are...
- Remove redundant long-option.
- We can use "--gaussian=NUM -S" or "--gaussian=NUMN -N" options.
- Add sentence in document
- Separate two random generate function which are uniform and gaussian.
- getGaussianrand() is created.
- Fix ranged random number more strictly, ex. (0,1) or [0,1).
- Please see comment of source code in detail:).
- Fix typo.
- Use cos() and sin() function when we generate gaussian random number.
- Add fast sqrt calculation algorithm.
- Reuse sqrt result and pre generate random number for reducing calculation
cost.
- Experience of this method is under following. It will be little-bit faster
than non-reuse method. And distribution of gaussian is still good.
* Settings
shared_buffers = 1024MB
* Test script
pgbench -i -s 1
pgbench --gaussian=2 -T 30 -S -c8 -j4 -n
pgbench --gaussian=2 -T 30 -S -c8 -j4 -n
pgbench --gaussian=2 -T 30 -S -c8 -j4 -n
* Result
method | try1 | try2 | try3 |
--------------------------------------------|
reuse method | 44189 | 44453 | 44013 |
non-reuse method | 43567 | 43635 | 43508 |
(2014/02/09 21:32), Fabien COELHO wrote:
This is a valuable contribution to enable pgbench to generate more realistic
loads, which is seldom uniform in practice.
Thanks!
However, ISTM that other distributions such an exponantial one would make
more sense,
I can easy to create exponential distribution. Here, I assume exponential
distribution that is f(x) = lambda * exp^(-lambda * x) in general.
What do you think under following interface?
custom script: \setexp [varname] min max threshold
command : --exponential=NUM(threshold)
I don't want to use lambda variable for simple implementation. So lambda is
always 1. Because it can enough to control distribution by threshold. Threshold
parameter is f(x) value. And using created distribution projects to 'aid' by same
method. If you think OK, I will impliment under followings tomorrow, and also
create parseing part of this function...
do
{
rand = 1.0 - pg_erand48(thread->random_state);
rand = -log(rand);
}while( rand > exp_threshold)
return rand / exp_threshold;
and also the values should be further randomized so that
neighboring values are not more likely to be drawn. The latest point is non
trivial.
That's right, but I worry about gaussian randomness and benchmark reproducibility
might be disappeared when we re-randomized access pattern, because Postgres
storage method manages records by each pages and it is difficult to realize
access randomness in whole pages, not record. If we solve this problem, we have
to need algorithm for smart shuffule projection function that is still having
gaussian randomized. I think it will be difficult, and it have to impement in
another patch in the future.
* Mathematical soundness
We want to derive a discrete normal distribution from a uniform one.
Well, normal distributions are for continuous variables... Anyway, this is
done by computing a continuous normal distribution which is then projected
onto integers. I'm basically fine with that.
The system uses a Box-Muller transform (1958) to do this transformation.
The Ziggurat method seems to be prefered for this purpose, *but* it would
require precalculated tables which depends on the target values. So I'm
fine with the Box-Muller transform for pgbench.
Yes, that's right. I selected simple and relatively faster algorithm, that is
Box-Muller transform.
The BM method uses 2 uniformly distributed numbers to derive 2 normally
distributed numbers. The implementation computes one of these, and loops
over till one match a threshold criterion.
More explanations, at least in comments, are needed about this threshold
and its meaning. It is required to be more than 2. I guess is that it allows
to limit the number of iterations of the while loop,
Yes. This loop could not almost go on, because min stdev_threshold is 2.
The possibility of retry-loop is under 4 percent. It might not be problem.
but in what proportion
is unclear. The documentation does not also help the user to understand
this value and its meaning.
Yes, it is huristic method. So I added the comments in document.
What I think it is: it is the deviation for the FURTHEST point around the
mean, that is the actual deviation associated to the "min" and "max" target
values. The 2 minimum value induces that there is a least 4 stddev lengths
between min & max, with the most likely mean in the middle.
Correct!
If the threshold test fails, one of the 2 uniform number is redrawn, a new
candidate value is tested. I'm not at ease about why only 1 value is redrawn
and not both, some explanations would be welcome. Also, on the other hand,
why not test the other possible value (with cos) if the first one fails?
Yes, I think so too. So I fixed this partan and it will be better. Past
implementations are not good:(
Also, as suggested above, I would like some explanations about how much this
while loop may iterate without success, say with the expected average number
of iterations with its explanation in a comment.
I add my comments in source code.
* Implementation
Random values :
double rand1 = 1.0 - rand; // instead of the LONG_MAX computation & limits.h
rand2 should be in (0, 1], but it is in [0, 1), use "1.0 - ..." as well?!
It's more smart method. I change to this method.
What is called "stdev*" in getrand() is really the chosen deviation from
the target mean, so it would make more sense to name it "dev".
Hmm, I like stdev*. Short variable makes us more confuse:( And it's not big
problem.
I do not think that the getrand refactoring was such a good idea. I'm sorry
if I may have suggested that in a previous comment.
The new getrand possibly ignores its parameters, hmmmm. ISTM that it would
be much simpler in the code to have a separate and clean "getrand_normal"
or "getrand_gauss" called for "\setgaussian", and that's it. This would
allow to get rid of DistType and all of getrand changes in the code.
I separate two function that are getrand() and getGaussianrand(), it becomes more
clear I think.
There are heavy constants computations (sqrt(log()) within the while
loop which would be moved out of the loop.
ISTM that the while condition would be easier to read as:
while ( dev < - threshold || threshold < dev )
OK, fixed.
Maybe the \\setgaussian argument handling may be transformed into a function,
so that it could be used easily later for some other distribution (say some
setexp:-)
* Options
ISTM that the test options would be better if made orthogonal, i.e. not to
have three --gaussian* options. I would suggest to have only one
--gaussian=NUM which would trigger gaussian tests with this threshold,
and --gaussian=3.5 --select-only would use the select-only variant,
and so on.
Agreed. Fixed.
* Typos
gausian -> gaussian
patern -> pattern
Oh, fixed.
* Conclusion :
- this is a valuable patch to help create more realistic load and make pgbench
a more useful tool. I'm greatly in favor of having such a functionality.
- it seems to me that the patch should be further improved before being
committed, in particular I would suggest:
(1) improve the explanations in the code and in the documentation,
especially
about what is the "deviation threshold" and its precise link to generated
values.
(2) simplify the code with a separate gaussian getrand, and simpler or
more efficient code here and there, see comments above.
(3) use only one option to trigger gaussian tests.
(bonus) \setexp would be a nice:-)
Thank you for your comments. They make my patch more polished:)
I think my patch is fixed for supporting all your comments, but it might not be
fixed
as you think. And if you notice other part, please send me.
Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***************
*** 176,181 **** int progress_nthreads = 0; /* number of threads for progress report */
--- 176,183 ----
bool is_connect; /* establish connection for each transaction */
bool is_latencies; /* report per-command latencies */
int main_pid; /* main process id used in log filename */
+ double stdev_threshold = 5; /* standard deviation threshold */
+ bool gaussian_option = false; /* use gaussian distribution random generator */
char *pghost = "";
char *pgport = "";
***************
*** 338,346 **** static char *select_only = {
--- 340,390 ----
"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
};
+ /* --gaussian case */
+ static char *gaussian_tpc_b = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setgaussian aid 1 :naccounts :stdev_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+ /* --gaussian with -N case */
+ static char *gaussian_simple_update = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setgaussian aid 1 :naccounts :stdev_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+ /* --gaussian with -S case */
+ static char *gaussian_select_only = {
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setgaussian aid 1 :naccounts :stdev_threshold\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ };
+
/* Function prototypes */
static void setalarm(int seconds);
static void *threadRun(void *arg);
+ static inline double sqrtd(const double x);
static void
usage(void)
***************
*** 381,386 **** usage(void)
--- 425,431 ----
" -v, --vacuum-all vacuum all four standard tables before tests\n"
" --aggregate-interval=NUM aggregate data over NUM seconds\n"
" --sampling-rate=NUM fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+ " --gaussian=NUM gaussian distribution with NUM standard deviation threshold\n"
"\nCommon options:\n"
" -d, --debug print debugging output\n"
" -h, --host=HOSTNAME database server host or socket directory\n"
***************
*** 477,482 **** getrand(TState *thread, int64 min, int64 max)
--- 522,597 ----
return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
}
+ /* random number generator: gaussian distribution from min to max inclusive */
+ static int64
+ getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold)
+ {
+ double stdev;
+ double rand;
+ static double rand1;
+ static double rand2;
+ static double var_sqrt;
+ static bool reuse = false;
+
+ /*
+ * Get user specified random number(-stdev_threshold < stdev <= stdev_threshold)
+ * in this loop. This loop is executed until appeared ranged number we want.
+ * However, this loop could not almost go on, because min stdev_threshold is 2
+ * then the possibility of retry-loop is under 4 percent. And possibility of
+ * re-retry-loop is under 1.6 percent. And it doesn't happen frequentry even if
+ * we also think about the cycle of the trigonometric function.
+ */
+ do
+ {
+ /* reuse pre calculation result as possible */
+ if(!reuse)
+ {
+ /*
+ * pg_erand48 generates [0,1) random number. However rand1
+ * needs (0,1) random number because log(0) cannot calculate.
+ * And rand2 also needs (0,1) random number in strictly. But
+ * normalization cost is high and we can substitute (0,1] at
+ * rand1 and [0,1) at rand2, so we use approximate calculation.
+ */
+ rand1 = 1.0 - pg_erand48(thread->random_state);
+ rand2 = pg_erand48(thread->random_state);
+
+ /* Box-Muller transform */
+ var_sqrt = sqrtd(-2.0 * log(rand1));
+ stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+ reuse = true;
+ }
+ else
+ {
+ stdev = var_sqrt * cos(2.0 * M_PI * rand2);
+ reuse = false;
+ }
+ } while (stdev < -stdev_threshold || stdev >= stdev_threshold);
+
+ /* normalization to [0,1) */
+ rand = (stdev + stdev_threshold) / (stdev_threshold * 2.0);
+
+ /* return int64 random number within between min and max */
+ return min + (int64) (max - min + 1) * rand;
+ }
+
+ /*
+ * fast sqrt algorithm: reference from Fast inverse square root algorithms.
+ */
+ static inline double
+ sqrtd(const double x)
+ {
+ double x_half = 0.5 * x;
+ long long int tmp = 0x5FE6EB50C7B537AAl - ( *(long long int*)&x >> 1);
+ double x_result = *(double*)&tmp;
+
+ x_result *= (1.5 - (x_half * x_result * x_result));
+ /* retry this calculation, it becomes higher precision at sqrt */
+ x_result *= (1.5 - (x_half * x_result * x_result));
+
+ return x_result * x;
+ }
+
/* call PQexec() and exit() on failure */
static void
executeStatement(PGconn *con, const char *sql)
***************
*** 1391,1396 **** top:
--- 1506,1601 ----
st->listen = 1;
}
+ else if (pg_strcasecmp(argv[0], "setgaussian") == 0)
+ {
+ char *var;
+ char *endptr;
+ int64 min;
+ int64 max;
+ double stdev_threshold;
+ char res[64];
+
+ if (*argv[2] == ':')
+ {
+ if((var = getVariable(st, argv[2] + 1)) == NULL)
+ {
+ fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[2]);
+ st->ecnt++;
+ return true;
+ }
+ min = strtoint64(var);
+ }
+ else
+ min = strtoint64(argv[2]);
+ #ifdef NOT_USED
+ if (min < 0)
+ {
+ fprintf(stderr, "%s: invalid minimum number %d\n", argv[0], min);
+ st->ecnt++;
+ return;
+ }
+ #endif
+ if (*argv[3] == ':')
+ {
+ if((var = getVariable(st, argv[3] + 1)) == NULL)
+ {
+ fprintf(stderr, "%s: invalid maximum number %s\n", argv[0], argv[3]);
+ st->ecnt++;
+ return true;
+ }
+ max = strtoint64(var);
+ }
+ else
+ max = strtoint64(argv[3]);
+
+ /* check if min and max are appropriate value */
+ if(max < min)
+ {
+ fprintf(stderr, "%s: maximum is less than minimum\n", argv[0]);
+ st->ecnt++;
+ return true;
+ }
+
+ /* for not overflowing when generating random number */
+ if(max - min < 0 || (max - min) + 1 < 0)
+ {
+ fprintf(stderr, "%s: range too large\n", argv[0]);
+ st->ecnt++;
+ return true;
+ }
+
+ if(*argv[4] == ':')
+ {
+ if((var = getVariable(st, argv[4] + 1)) == NULL)
+ {
+ fprintf(stderr, "%s: invalid gaussian threshold number %s\n", argv[0], argv[4]);
+ st->ecnt++;
+ return true;
+ }
+ stdev_threshold = strtod(var, NULL);
+ }
+ else
+ stdev_threshold = strtod(argv[4], &endptr);
+
+ if ( stdev_threshold < 2)
+ {
+ fprintf(stderr, "%s: gaussian threshold must be more than 2\n,", argv[4]);
+ st->ecnt++;
+ return true;
+ }
+ #ifdef DEBUG
+ printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getGaussianrand(thread, min, max, stdev_threshold));
+ #endif
+ snprintf(res, sizeof(res), INT64_FORMAT, getGaussianrand(thread, min, max, stdev_threshold));
+
+ if(!putVariable(st, argv[0], argv[1], res))
+ {
+ st->ecnt++;
+ return true;
+ }
+
+ st->listen = 1;
+ }
else if (pg_strcasecmp(argv[0], "set") == 0)
{
char *var;
***************
*** 1915,1920 **** process_commands(char *buf)
--- 2120,2137 ----
fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
my_commands->argv[0], my_commands->argv[j]);
}
+ else if (pg_strcasecmp(my_commands->argv[0], "setgaussian") == 0)
+ {
+ if (my_commands->argc < 5)
+ {
+ fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
+ exit(1);
+ }
+
+ for (j = 5; j < my_commands->argc; j++)
+ fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
+ my_commands->argv[0], my_commands->argv[j]);
+ }
else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
{
if (my_commands->argc < 3)
***************
*** 2188,2203 **** printResults(int ttype, int normal_xacts, int nclients,
(INSTR_TIME_GET_DOUBLE(conn_total_time) / nthreads));
if (ttype == 0)
! s = "TPC-B (sort of)";
else if (ttype == 2)
! s = "Update only pgbench_accounts";
else if (ttype == 1)
! s = "SELECT only";
else
s = "Custom query";
printf("transaction type: %s\n", s);
printf("scaling factor: %d\n", scale);
printf("query mode: %s\n", QUERYMODE[querymode]);
printf("number of clients: %d\n", nclients);
printf("number of threads: %d\n", nthreads);
--- 2405,2447 ----
(INSTR_TIME_GET_DOUBLE(conn_total_time) / nthreads));
if (ttype == 0)
! {
! if(gaussian_option)
! s = "TPC-B (sort of)";
! else
! s = "Gaussian distributed TPC-B (sort of)";
! }
else if (ttype == 2)
! {
! if(gaussian_option)
! s = "Gaussian distributed update only pgbench_accounts";
! else
! s = "Update only pgbench_accounts";
! }
else if (ttype == 1)
! {
! if(gaussian_option)
! s = "Gaussian distributed SELECT only";
! else
! s = "SELECT only";
! }
else
s = "Custom query";
printf("transaction type: %s\n", s);
printf("scaling factor: %d\n", scale);
+
+ /* output in only gaussian distributed benchmark */
+ if(gaussian_option)
+ {
+ printf("standard deviation threshold: %.5f\n", stdev_threshold);
+ printf("access probability of top 20%%, 10%% and 5%% records: %.5f %.5f %.5f\n",
+ (double) ((erf (stdev_threshold * 0.2 / sqrt(2.0))) / (erf (stdev_threshold / sqrt(2.0)))),
+ (double) ((erf (stdev_threshold * 0.1 / sqrt(2.0))) / (erf (stdev_threshold / sqrt(2.0)))),
+ (double) ((erf (stdev_threshold * 0.05 / sqrt(2.0))) / (erf (stdev_threshold / sqrt(2.0))))
+ );
+ }
+
printf("query mode: %s\n", QUERYMODE[querymode]);
printf("number of clients: %d\n", nclients);
printf("number of threads: %d\n", nthreads);
***************
*** 2327,2332 **** main(int argc, char **argv)
--- 2571,2577 ----
{"unlogged-tables", no_argument, &unlogged_tables, 1},
{"sampling-rate", required_argument, NULL, 4},
{"aggregate-interval", required_argument, NULL, 5},
+ {"gaussian", required_argument, NULL, 6},
{"rate", required_argument, NULL, 'R'},
{NULL, 0, NULL, 0}
};
***************
*** 2606,2611 **** main(int argc, char **argv)
--- 2851,2865 ----
}
#endif
break;
+ case 6:
+ gaussian_option = true;
+ stdev_threshold = atof(optarg);
+ if(stdev_threshold < 2)
+ {
+ fprintf(stderr, "--gaussian=NUM must be more than 2: %f\n", stdev_threshold);
+ exit(1);
+ }
+ break;
default:
fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
exit(1);
***************
*** 2803,2808 **** main(int argc, char **argv)
--- 3057,3073 ----
}
}
+ /* set :stdev_threshold variable */
+ if(getVariable(&state[0], "stdev_threshold") == NULL)
+ {
+ snprintf(val, sizeof(val), "%lf", stdev_threshold);
+ for (i = 0; i < nclients; i++)
+ {
+ if (!putVariable(&state[i], "startup", "stdev_threshold", val))
+ exit(1);
+ }
+ }
+
if (!is_no_vacuum)
{
fprintf(stderr, "starting vacuum...");
***************
*** 2828,2844 **** main(int argc, char **argv)
switch (ttype)
{
case 0:
! sql_files[0] = process_builtin(tpc_b);
num_files = 1;
break;
case 1:
! sql_files[0] = process_builtin(select_only);
num_files = 1;
break;
case 2:
! sql_files[0] = process_builtin(simple_update);
num_files = 1;
break;
--- 3093,3118 ----
switch (ttype)
{
case 0:
! if(gaussian_option)
! sql_files[0] = process_builtin(gaussian_tpc_b);
! else
! sql_files[0] = process_builtin(tpc_b);
num_files = 1;
break;
case 1:
! if(gaussian_option)
! sql_files[0] = process_builtin(gaussian_select_only);
! else
! sql_files[0] = process_builtin(select_only);
num_files = 1;
break;
case 2:
! if(gaussian_option)
! sql_files[0] = process_builtin(simple_update);
! else
! sql_files[0] = process_builtin(gaussian_simple_update);
num_files = 1;
break;
*** a/doc/src/sgml/pgbench.sgml
--- b/doc/src/sgml/pgbench.sgml
***************
*** 320,325 **** pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
--- 320,342 ----
</varlistentry>
<varlistentry>
+ <term><option>--gaussian</option><replaceable>standard deviation</></term>
+ <listitem>
+ <para>
+ Gaussian distribution pgbench option. Need the standard deviation threshold.
+ Standard deviation threshold can control distribution of access patern that
+ is used by aid in pgbench_accounts table. If we set larger standard deviation
+ threshold, pgbench access patern limited more specific records. On the other
+ hands, if you set smaller standard deviation, pgbench access patern will be
+ more gently distribution. Standard deviation threshold must be higher than 2.
+ This rule is needed for realizing realistic calculation costs. If you add
+ '-N' or '-S' options, you can execute gaussian distribution pgbench in these
+ benchmarks.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
<term><option>-j</option> <replaceable>threads</></term>
<term><option>--jobs=</option><replaceable>threads</></term>
<listitem>
***************
*** 770,775 **** pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
--- 787,818 ----
<varlistentry>
<term>
+ <literal>\setgaussian <replaceable>varname</> <replaceable>min</> <replaceable>max</> <replaceable>
+ standard deviation threshold</literal>
+ </term>
+
+ <listitem>
+ <para>
+ Sets variable <replaceable>varname</> to a gaussian random integer value
+ between the limits <replaceable>min</> and <replaceable>max</> inclusive.
+ Each limit can be either an integer constant or a
+ <literal>:</><replaceable>variablename</> reference to a variable
+ having an integer value. Standard deviation threshold controls
+ distribution of access patern. If we set larger value in standard
+ deviation threshold, more frequentry access patern will be more
+ limited ranges. Min standard deviation threshold is 2. This rule
+ needs for realizing realistic calculation costs.
+ </para>
+
+ <para>
+ Example:
+ <programlisting>
+ \setgaussian aid 1 :naccounts 5
+ </programlisting></para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>
<literal>\sleep <replaceable>number</> [ us | ms | s ]</literal>
</term>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers