Hello Peter,
> This patch was marked as RFC on 2019-03-30, but since then there have
> been a couple more issues pointed out in a review by Thomas Munro, and
> it went through 2019-09 and 2019-11 without any attention. Is the RFC
> status still appropriate?
Thomas's review was about comment and documentation wording and asked
for explanations, which I think I addressed; the code did not actually
change, so I'm not sure that "needs review" is really warranted, but do
as you see fit.
> I read the whole thread, and I still don't know what this patch is
> supposed to do. I know what the words in the subject line mean, but I
> don't know how this helps a pgbench user run better benchmarks. I feel
> this is also the sentiment expressed by others earlier in the thread.
> You indicated that this functionality makes sense to those who want
> it, but so far only two people, namely the patch author and the
> reviewer, have participated in the discussion on the substance of this
> patch. So either the feature is extremely niche, or nobody understands
> it. I think you ought to take about three steps back and explain this
> in more basic terms, even just in email at first, so that we can then
> discuss what to put into the documentation.
Here is the motivation for this feature:
When you design a benchmark to test performance, you want to avoid
unwanted correlations which may unduly impact performance one way or
another. For the default pgbench benchmarks, accounts are chosen
uniformly, so there is no problem. However, if you want to test a
non-uniform distribution, which is what most people would encounter in
practice, things are different.
For instance, you would replace random with random_exponential in the
default benchmarks. If you do that, accounts with lower ids come out
more often. However, this creates an unwanted correlation and
performance effect: why should the frequently chosen accounts just
happen to be those with small ids, which also happen to reside together
in the first few pages of the table? In order to avoid these effects,
you need to shuffle the chosen account ids, so that the frequent
accounts are not specifically those at the beginning of the table.
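To make this concrete, here is a minimal sketch of such a non-uniform
selection in a custom pgbench script (the table and sizing follow the
built-in tpcb-like script; the exponential parameter 10.0 is an
arbitrary illustrative value):

    \set aid random_exponential(1, 100000 * :scale, 10.0)
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

With this script, aid 1 is drawn far more often than the largest aids,
so all the hot rows sit in the first pages of pgbench_accounts.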
Basically, as soon as you have a non-uniform random selection step, you
want to add some shuffling, otherwise you are going to skew your
benchmark measurements. Nobody should ever use a non-uniform generator
to select rows without some shuffling. I have no doubt that many people
do, without realizing that they are skewing their performance data.
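For illustration, assuming the patch exposes a function along the lines
of permute(value, size, seed), mapping 0 .. size-1 onto itself (the
name and exact signature here are assumptions for the sketch, not
necessarily what the patch finally uses), the fix is a single extra
line between drawing the id and using it:

    \set size 100000 * :scale
    \set aid random_exponential(1, :size, 10.0)
    -- scatter the hot ids over the whole table; 5432 is an arbitrary seed
    \set aid 1 + permute(:aid - 1, :size, 5432)
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

The access pattern is still exponential, but which rows are hot is now
decided by the seeded permutation rather than by physical position in
the table.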
Once you realize that, you can try to invent your own shuffling, but
frankly it is not as easy as it looks to come up with something
non-trivial that does not itself introduce unwanted correlations. I
went through a lot of (very) bad designs before arriving at the one in
the patch: you want something fast and you want good mixing, which are
contradictory objectives. There is also the question of being able to
change the shuffling for a given size, for instance to check that there
are no unwanted effects, hence the seeding. Good seeded shuffling is
what an encryption algorithm does, but there the problem is somewhat
simpler because such algorithms usually work on power-of-two sizes.
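As a concrete example of why the seed matters: with the script sketched
above saved as hot_accounts.sql (a hypothetical file name, with the
literal seed replaced by a :seed variable), two runs exercise two
different shufflings of the same table:

    pgbench -n -f hot_accounts.sql -D seed=1 bench
    pgbench -n -f hot_accounts.sql -D seed=2 bench

If the results differ markedly between seeds, the measurement was being
driven by which particular rows happened to be hot, not by the
distribution itself.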
In conclusion, a seeded shuffle step is needed as soon as you start
using non-uniform random generators. Providing one in pgbench, a tool
designed to run possibly non-uniform benchmarks against postgres, looks
like common courtesy. Not providing one helps people design bad
benchmarks, possibly without realizing it, which is outright
thoughtlessness.
I have no doubt that the documentation should be improved to explain
this in a concise but clear way.
--
Fabien.