Hello Peter,

This patch was marked as RFC on 2019-03-30, but since then a couple
more issues were pointed out in a review by Thomas Munro, and it went
through the 2019-09 and 2019-11 commitfests without any attention. Is
the RFC status still appropriate?

Thomas's review was about comment and documentation wording and asked
for explanations, which I think I addressed; the code did not actually
change, so I'm not sure that "needs review" status is really warranted,
but do as you see fit.

I read the whole thread, and I still don't know what this patch is supposed to do. I know what the words in the subject line mean, but I don't know how this helps a pgbench user run better benchmarks. I feel this is also the sentiment expressed by others earlier in the thread. You indicated that this functionality makes sense to those who want it, but so far only two people, namely the patch author and the reviewer, have participated in the discussion on the substance of this patch. So either the feature is extremely niche, or nobody understands it. I think you ought to take about three steps back and explain this in more basic terms, even just in email at first, so that we can then discuss what to put into the documentation.

Here is the motivation for this feature:

When you design a benchmark to test performance, you want to avoid unwanted correlations that may unduly affect performance one way or another. For the default pgbench benchmarks, account ids are chosen uniformly, so there is no problem. However, if you want to test a non-uniform distribution, which is what most people would encounter in practice, things are different.
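
For reference, this is essentially what the built-in select-only script does, with account ids drawn uniformly over the whole table:

    -- default-style selection: every account id is equally likely
    \set aid random(1, 100000 * :scale)
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;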

For instance, you would replace random with random_exponential in the default benchmarks. If you do that, all accounts with lower ids come out more often. However, this creates an unwanted correlation and performance effect: why should the frequently accessed accounts just happen to be those with small ids, which in turn just happen to reside together in the first few pages of the table? To avoid these effects, you need to shuffle the chosen account ids, so that the frequent accounts are not specifically those at the beginning of the table.
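
Concretely, the skewed variant of the same script would look like this (the parameter value is mine; any non-uniform generator has the same issue):

    -- skewed selection: small aids come out much more often, and
    -- those rows sit together at the head of pgbench_accounts
    \set aid random_exponential(1, 100000 * :scale, 2.5)
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;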

Basically, as soon as you have a non-uniform random selection step, you want to add some shuffling, otherwise you are going to skew your benchmark measurements. Nobody should ever use a non-uniform generator to select rows without some shuffling. I have no doubt that many people do, without realizing that they are skewing their performance data.
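
As a sketch of what this looks like with the permutation function this patch adds (I write it as permute(i, size), returning a value in [0, size); the exact name and signature vary across patch versions):

    -- same skewed draw, but scattered over the whole key space:
    -- frequent ids are no longer the physically clustered small ones
    \set size 100000 * :scale
    \set r random_exponential(1, :size, 2.5)
    \set aid 1 + permute(:r, :size)
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;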

Once you realize that, you can try to invent your own shuffling, but frankly it is not as easy as it looks to come up with something non-trivial that does not generate unwanted correlations. I went through a lot of (very) bad designs before arriving at the one in the patch: you want something fast and you want good mixing, and these are contradictory objectives. There is also the question of being able to change the shuffling for a given size, for instance to check that there are no unwanted effects, hence the seeding. Good seeded shuffling is what an encryption algorithm does, but there the problem is somewhat simpler, because such algorithms usually work on power-of-two sizes, whereas a benchmark table can have any size.
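
To illustrate the pitfall, here is the kind of "obvious" homemade shuffle one might try first: an affine map modulo the table size (a hypothetical sketch; :seed would be supplied with -D seed=...). It is a bijection whenever the multiplier is coprime to the size, so it looks like a shuffle, but the frequent ids merely end up at a fixed stride instead of at the head of the table, which is still a correlation.

    -- naive seeded "shuffle": bijective only if the multiplier is
    -- coprime to :size, and frequent ids stay regularly spaced
    \set size 100000 * :scale
    \set r random_exponential(1, :size, 2.5)
    \set aid 1 + ((:r - 1) * 1048573 + :seed) % :size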

In conclusion, a seeded shuffle step is needed as soon as you start using non-uniform random generators. Providing one in pgbench, a tool designed to run possibly non-uniform benchmarks against postgres, looks like common courtesy. Not providing one helps people design bad benchmarks, possibly without realizing it, which is outright thoughtlessness.

I have no doubt that the documentation should be improved to explain that in a concise but clear way.

--
Fabien.

