On Wed, 31 Mar 2021 at 09:02, Fabien COELHO <[email protected]> wrote:
>
> >> First, I have a thing against erand48.
> >
> Also, there is a 64 bits seed provided to the function which instantly
> ignores 16 of them, which looks pretty silly to me.
>
Yeah, that was copied from set_random_seed().
> At least, I suggest that two 48-bits prng could be initialized with parts
> of the seed and used in different places, eg for r & m.
>
That could work. I'd certainly feel better about that than
implementing a whole new PRNG.
> Also, the seed could be used to adjust the rotation, maybe.
>
Perhaps. I'm not sure it's really necessary though.
> >> I'm really at odds with FULL SHIFT 1, because it means that up to 1/256 of
> >> values are kept out of STEERING. [...]
> >
> > Ah, that's a good point. Something else that also concerned me there was
> > that it might lead to 2 consecutive full shifts with nothing in between,
> > which would lead to less uniform randomness (like the Irwin-Hall
> > distribution). I just did a quick test without the first full shift, and
> > the results do appear to be better,
>
> Indeed, it makes sense to me.
>
OK, attached is an update making this change and simplifying the
rotate code, which hopefully just leaves the question of what (if
anything) to do with pg_erand48().
> >> Third, I think that the rotate code can be simplified, in particular
> >> the ?: should be avoided because it may induce branches quite damaging
> >> to processor performance.
> >
> > Yeah, I wondered about that. Perhaps there's a "trick" that can be
> > used to simplify it. Pre-computing the number of bits in the mask
> > would probably help.
>
> See pg_popcount64().
>
Actually, I used pg_leftmost_one_pos64() to calculate the mask length,
allowing the mask to be computed from that, so there is no longer a
need for compute_mask(), which seems like a neat little
simplification.
Regards,
Dean
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
new file mode 100644
index 50cf22b..84d9566
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1057,7 +1057,7 @@ pgbench <optional> <replaceable>options<
<row>
<entry> <literal>default_seed</literal> </entry>
- <entry>seed used in hash functions by default</entry>
+ <entry>seed used in hash and pseudorandom permutation functions by default</entry>
</row>
<row>
@@ -1866,6 +1866,24 @@ SELECT 4 AS four \; SELECT 5 AS five \as
<row>
<entry role="func_table_entry"><para role="func_signature">
+ <function>permute</function> ( <parameter>i</parameter>, <parameter>size</parameter> [, <parameter>seed</parameter> ] )
+ <returnvalue>integer</returnvalue>
+ </para>
+ <para>
+ Permuted value of <parameter>i</parameter>, in the range
+ <literal>[0, size)</literal>. This is the new position of
+ <parameter>i</parameter> (modulo <parameter>size</parameter>) in a
+ pseudorandom permutation of the integers <literal>0...size-1</literal>,
+ parameterized by <parameter>seed</parameter>.
+ </para>
+ <para>
+ <literal>permute(0, 4)</literal>
+ <returnvalue>an integer between 0 and 3</returnvalue>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
<function>pi</function> ()
<returnvalue>double</returnvalue>
</para>
@@ -2071,29 +2089,70 @@ f(x) = PHI(2.0 * parameter * (x - mu) /
</listitem>
</itemizedlist>
+ <note>
+ <para>
+ When designing a benchmark which selects rows non-uniformly, be aware
+ that the rows chosen may be correlated with other data such as IDs from
+ a sequence or the physical row ordering, which may skew performance
+ measurements.
+ </para>
+ <para>
+ To avoid this, you may wish to use the <function>permute</function>
+ function, or some other additional step with similar effect, to shuffle
+ the selected rows and remove such correlations.
+ </para>
+ </note>
+
<para>
Hash functions <literal>hash</literal>, <literal>hash_murmur2</literal> and
<literal>hash_fnv1a</literal> accept an input value and an optional seed parameter.
In case the seed isn't provided the value of <literal>:default_seed</literal>
is used, which is initialized randomly unless set by the command-line
- <literal>-D</literal> option. Hash functions can be used to scatter the
- distribution of random functions such as <literal>random_zipfian</literal> or
- <literal>random_exponential</literal>. For instance, the following pgbench
- script simulates possible real world workload typical for social media and
- blogging platforms where few accounts generate excessive load:
+ <literal>-D</literal> option.
+ </para>
+
+ <para>
+ <literal>permute</literal> accepts an input value, a size, and an optional
+ seed parameter. It generates a pseudorandom permutation of integers in
+ the range <literal>[0, size)</literal>, and returns the index of the input
+ value in the permuted values. The permutation chosen is parameterized by
+ the seed, which defaults to <literal>:default_seed</literal>, if not
+ specified. Unlike the hash functions, <literal>permute</literal> ensures
+ that there are no collisions or holes in the output values. Input values
+ outside the interval are interpreted modulo the size. The function raises
+ an error if the size is not positive. <function>permute</function> can be
+ used to scatter the distribution of non-uniform random functions such as
+ <literal>random_zipfian</literal> or <literal>random_exponential</literal>
+ so that values drawn more often are not trivially correlated. For
+ instance, the following <application>pgbench</application> script
+ simulates a possible real world workload typical for social media and
+ blogging platforms where a few accounts generate excessive load:
<programlisting>
-\set r random_zipfian(0, 100000000, 1.07)
-\set k abs(hash(:r)) % 1000000
+\set size 1000000
+\set r random_zipfian(1, :size, 1.07)
+\set k 1 + permute(:r, :size)
</programlisting>
In some cases several distinct distributions are needed which don't correlate
- with each other and this is when implicit seed parameter comes in handy:
+ with each other and this is when the optional seed parameter comes in handy:
<programlisting>
-\set k1 abs(hash(:r, :default_seed + 123)) % 1000000
-\set k2 abs(hash(:r, :default_seed + 321)) % 1000000
+\set k1 1 + permute(:r, :size, :default_seed + 123)
+\set k2 1 + permute(:r, :size, :default_seed + 321)
+</programlisting>
+
+ A similar behavior can also be approximated with <function>hash</function>:
+
+<programlisting>
+\set size 1000000
+\set r random_zipfian(1, 100 * :size, 1.07)
+\set k 1 + abs(hash(:r)) % :size
</programlisting>
+
+ However, since <function>hash</function> generates collisions, some values
+ will not be reachable and others will be more frequent than expected from
+ the original distribution.
</para>
<para>
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
new file mode 100644
index 4d529ea..56f75cc
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -19,6 +19,7 @@
#define PGBENCH_NARGS_VARIABLE (-1)
#define PGBENCH_NARGS_CASE (-2)
#define PGBENCH_NARGS_HASH (-3)
+#define PGBENCH_NARGS_PERMUTE (-4)
PgBenchExpr *expr_parse_result;
@@ -370,6 +371,9 @@ static const struct
{
"hash_fnv1a", PGBENCH_NARGS_HASH, PGBENCH_HASH_FNV1A
},
+ {
+ "permute", PGBENCH_NARGS_PERMUTE, PGBENCH_PERMUTE
+ },
/* keep as last array element */
{
NULL, 0, 0
@@ -479,6 +483,19 @@ make_func(yyscan_t yyscanner, int fnumbe
{
PgBenchExpr *var = make_variable("default_seed");
args = make_elist(var, args);
+ }
+ break;
+
+ /* pseudorandom permutation function with optional seed argument */
+ case PGBENCH_NARGS_PERMUTE:
+ if (len < 2 || len > 3)
+ expr_yyerror_more(yyscanner, "unexpected number of arguments",
+ PGBENCH_FUNCTIONS[fnumber].fname);
+
+ if (len == 2)
+ {
+ PgBenchExpr *var = make_variable("default_seed");
+ args = make_elist(var, args);
}
break;
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
new file mode 100644
index 48ce171..a90b01d
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -66,6 +66,7 @@
#include "getopt_long.h"
#include "libpq-fe.h"
#include "pgbench.h"
+#include "port/pg_bitutils.h"
#include "portability/instr_time.h"
#ifndef M_PI
@@ -1128,6 +1129,106 @@ getHashMurmur2(int64 val, uint64 seed)
}
/*
+ * Pseudorandom permutation function
+ *
+ * For small sizes, this generates each of the (size!) possible permutations
+ * of integers in the range [0, size) with roughly equal probability. Once
+ * the size is larger than 16, the number of possible permutations exceeds the
+ * number of distinct states of the internal pseudorandom number generator,
+ * and so not all possible permutations can be generated, but the permutations
+ * chosen should continue to give the appearance of being random.
+ *
+ * THIS FUNCTION IS NOT CRYPTOGRAPHICALLY SECURE.
+ * DO NOT USE FOR SUCH PURPOSE.
+ */
+static int64
+permute(const int64 val, const int64 isize, const int64 seed)
+{
+ RandomState random_state;
+ uint64 size;
+ uint64 v;
+ int masklen;
+ uint64 mask;
+ int i;
+
+ if (isize < 2)
+ return 0; /* nothing to permute */
+
+ /* Initialize random state with low-order bits of seed */
+ random_state.xseed[0] = seed & 0xFFFF;
+ random_state.xseed[1] = (seed >> 16) & 0xFFFF;
+ random_state.xseed[2] = (seed >> 32) & 0xFFFF;
+
+ /* Computations are performed on unsigned values */
+ size = (uint64) isize;
+ v = (uint64) val % size;
+
+ /* Mask to work modulo largest power of 2 less than or equal to size */
+ masklen = pg_leftmost_one_pos64(size);
+ mask = (((uint64) 1) << masklen) - 1;
+
+ /*
+ * Permute the input value by applying 4 rounds of pseudorandom bijective
+ * transformations. The intention here is to distribute each input
+ * uniformly randomly across the range, and separate adjacent inputs
+ * approximately uniformly randomly from each other, leading to a fairly
+ * random overall choice of permutation.
+ *
+ * To separate adjacent inputs, we multiply by a random number modulo
+ * (mask + 1), which is a power of 2. For this to be a bijection, the
+ * multiplier must be odd. Since this is known to lead to less randomness
+ * in the lower bits, we also apply a rotation that shifts the topmost bit
+ * into the least significant bit. In the special cases where size <= 3,
+ * mask = 1 and each of these operations is actually a no-op, so we also
+ * XOR with a different random number to inject additional randomness.
+ * Since the size is generally not a power of 2, we apply this bijection
+ * on overlapping upper and lower halves of the input.
+ *
+ * To distribute the inputs uniformly across the range, we then also apply
+ * a random offset modulo the full range.
+ *
+ * Taken together, these operations resemble a modified linear
+ * congruential generator, as is commonly used in pseudorandom number
+ * generators. Empirically, it is found that for small sizes it selects
+ * each of the (size!) possible permutations with roughly equal
+ * probability. For larger sizes, not all permutations can be generated,
+ * but the intended random spread is still produced.
+ */
+ for (i = 0; i < 4; i++)
+ {
+ uint64 m,
+ r,
+ t;
+
+ /* Random multiply (by an odd number), XOR and rotate of lower half */
+ m = (uint64) (pg_erand48(random_state.xseed) * (mask + 1)) | 1;
+ r = (uint64) (pg_erand48(random_state.xseed) * (mask + 1));
+ if (v <= mask)
+ {
+ v = ((v * m) ^ r) & mask;
+ v = ((v << 1) & mask) | (v >> (masklen - 1));
+ }
+
+ /* Random multiply (by an odd number), XOR and rotate of upper half */
+ m = (uint64) (pg_erand48(random_state.xseed) * (mask + 1)) | 1;
+ r = (uint64) (pg_erand48(random_state.xseed) * (mask + 1));
+ t = size - 1 - v;
+ if (t <= mask)
+ {
+ t = ((t * m) ^ r) & mask;
+ t = ((t << 1) & mask) | (t >> (masklen - 1));
+ v = size - 1 - t;
+ }
+
+ /* Random offset */
+ r = (uint64) (pg_erand48(random_state.xseed) * size);
+ v = (v + r) % size;
+ }
+
+ return (int64) v;
+}
+
+/*
* Initialize the given SimpleStats struct to all zeroes
*/
static void
@@ -2475,6 +2576,29 @@ evalStandardFunc(CState *st,
return true;
}
+ case PGBENCH_PERMUTE:
+ {
+ int64 val,
+ size,
+ seed;
+
+ Assert(nargs == 3);
+
+ if (!coerceToInt(&vargs[0], &val) ||
+ !coerceToInt(&vargs[1], &size) ||
+ !coerceToInt(&vargs[2], &seed))
+ return false;
+
+ if (size <= 0)
+ {
+ pg_log_error("permute size parameter must be greater than zero");
+ return false;
+ }
+
+ setIntValue(retval, permute(val, size, seed));
+ return true;
+ }
+
default:
/* cannot get here */
Assert(0);
diff --git a/src/bin/pgbench/pgbench.h b/src/bin/pgbench/pgbench.h
new file mode 100644
index 3a9d89e..6ce1c98
--- a/src/bin/pgbench/pgbench.h
+++ b/src/bin/pgbench/pgbench.h
@@ -99,7 +99,8 @@ typedef enum PgBenchFunction
PGBENCH_IS,
PGBENCH_CASE,
PGBENCH_HASH_FNV1A,
- PGBENCH_HASH_MURMUR2
+ PGBENCH_HASH_MURMUR2,
+ PGBENCH_PERMUTE
} PgBenchFunction;
typedef struct PgBenchExpr PgBenchExpr;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
new file mode 100644
index 82a46c7..cf50975
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -4,6 +4,7 @@ use warnings;
use PostgresNode;
use TestLib;
use Test::More;
+use Config;
# start a pgbench specific server
my $node = get_new_node('main');
@@ -483,6 +484,17 @@ pgbench(
qr{command=98.: int 5432\b}, # :random_seed
qr{command=99.: int -9223372036854775808\b}, # min int
qr{command=100.: int 9223372036854775807\b}, # max int
+ # pseudorandom permutation tests
+ qr{command=101.: boolean true\b},
+ qr{command=102.: boolean true\b},
+ qr{command=103.: boolean true\b},
+ qr{command=104.: boolean true\b},
+ qr{command=105.: boolean true\b},
+ qr{command=109.: boolean true\b},
+ qr{command=110.: boolean true\b},
+ qr{command=111.: boolean true\b},
+ qr{command=112.: int 9223372036854775797\b},
+ qr{command=113.: boolean true\b},
],
'pgbench expressions',
{
@@ -610,6 +622,33 @@ SELECT :v0, :v1, :v2, :v3;
-- minint constant parsing
\set min debug(-9223372036854775808)
\set max debug(-(:min + 1))
+-- parametric pseudorandom permutation function
+\set t debug(permute(0, 2) + permute(1, 2) = 1)
+\set t debug(permute(0, 3) + permute(1, 3) + permute(2, 3) = 3)
+\set t debug(permute(0, 4) + permute(1, 4) + permute(2, 4) + permute(3, 4) = 6)
+\set t debug(permute(0, 5) + permute(1, 5) + permute(2, 5) + permute(3, 5) + permute(4, 5) = 10)
+\set t debug(permute(0, 16) + permute(1, 16) + permute(2, 16) + permute(3, 16) + \
+ permute(4, 16) + permute(5, 16) + permute(6, 16) + permute(7, 16) + \
+ permute(8, 16) + permute(9, 16) + permute(10, 16) + permute(11, 16) + \
+ permute(12, 16) + permute(13, 16) + permute(14, 16) + permute(15, 16) = 120)
+-- random sanity checks
+\set size random(2, 1000)
+\set v random(0, :size - 1)
+\set p permute(:v, :size)
+\set t debug(0 <= :p and :p < :size and :p = permute(:v + :size, :size) and :p <> permute(:v + 1, :size))
+-- actual values
+\set t debug(permute(:v, 1) = 0)
+\set t debug(permute(0, 2, 5432) = 0 and permute(1, 2, 5432) = 1 and \
+ permute(0, 2, 5433) = 1 and permute(1, 2, 5433) = 0)
+-- 63 bits tests
+\set size debug(:max - 10)
+\set t debug(permute(:size-1, :size, 5432) = 5453903545070026760 and \
+ permute(:size-2, :size, 5432) = 3852510547464191995 and \
+ permute(:size-3, :size, 5432) = 6519944788385497068 and \
+ permute(:size-4, :size, 5432) = 2393897006749810651 and \
+ permute(:size-5, :size, 5432) = 435256285874192331 and \
+ permute(:size-6, :size, 5432) = 5260571010402451383 and \
+ permute(:size-7, :size, 5432) = 5245374096631496614)
}
});
@@ -1048,6 +1087,10 @@ SELECT LEAST(} . join(', ', (':i') x 256
'bad boolean', 2,
[qr{malformed variable.*trueXXX}], q{\set b :badtrue or true}
],
+ [
+ 'invalid permute size', 2,
+ [qr{permute size parameter must be greater than zero}], q{\set i permute(0, 0)}
+ ],
# GSET
[
diff --git a/src/bin/pgbench/t/002_pgbench_no_server.pl b/src/bin/pgbench/t/002_pgbench_no_server.pl
new file mode 100644
index e38c7d7..4027e68
--- a/src/bin/pgbench/t/002_pgbench_no_server.pl
+++ b/src/bin/pgbench/t/002_pgbench_no_server.pl
@@ -341,6 +341,16 @@ my @script_tests = (
'set i',
[ qr{set i 1 }, qr{\^ error found here} ],
{ 'set_i_op' => "\\set i 1 +\n" }
+ ],
+ [
+ 'not enough arguments to permute',
+ [qr{unexpected number of arguments \(permute\)}],
+ { 'bad-permute-1.sql' => "\\set i permute(1)\n" }
+ ],
+ [
+ 'too many arguments to permute',
+ [qr{unexpected number of arguments \(permute\)}],
+ { 'bad-permute-2.sql' => "\\set i permute(1, 2, 3, 4)\n" }
],);
for my $t (@script_tests)