Hello Peter,

This patch was marked as RFC on 2019-03-30, but since then a couple
more issues were pointed out in a review by Thomas Munro, and it went
through the 2019-09 and 2019-11 commitfests without any attention. Is
the RFC status still appropriate?

Thomas's review was about comment and documentation wording and asked
for explanations, which I think I addressed; the code did not actually
change, so I'm not sure that "needs review" status is really warranted,
but do as you see fit.

I read the whole thread, and I still don't know what this patch is supposed to do. I know what the words in the subject line mean, but I don't know how this helps a pgbench user run better benchmarks. I feel this is also the sentiment expressed by others earlier in the thread. You indicated that this functionality makes sense to those who want it, but so far only two people, namely the patch author and the reviewer, have participated in the discussion on the substance of this patch. So either the feature is extremely niche, or nobody understands it. I think you ought to take about three steps back and explain this in more basic terms, even just in email at first, so that we can then discuss what to put into the documentation.

Here is the motivation for this feature:

When you design a benchmark to test performance, you want to avoid unwanted correlations that may unduly affect performance one way or another. For the default pgbench benchmarks, account ids are chosen uniformly, so there is no problem. However, if you want to test a non-uniform distribution, which is what most people would encounter in practice, things are different.
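
For reference, this is essentially what the built-in select-only script does, with account ids drawn uniformly over the whole table:

    -- default-style selection: every account id is equally likely
    \set aid random(1, 100000 * :scale)
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;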

For instance, you would replace random with random_exponential in the default benchmarks. If you do that, all accounts with lower ids come out more often. However, this creates an unwanted correlation and performance effect: why should the frequently accessed accounts just happen to be those with small ids, which in turn just happen to reside together in the first few pages of the table? To avoid these effects, you need to shuffle the chosen account ids, so that the frequent accounts are not specifically those at the beginning of the table.
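
Concretely, the skewed variant of the same script would look like this (the parameter value is mine; any non-uniform generator has the same issue):

    -- skewed selection: small aids come out much more often, and
    -- those rows sit together at the head of pgbench_accounts
    \set aid random_exponential(1, 100000 * :scale, 2.5)
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;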

Basically, as soon as you have a non-uniform random selection step, you want to add some shuffling, otherwise you are going to skew your benchmark measurements. Nobody should ever use a non-uniform generator to select rows without some shuffling. I have no doubt that many people do, without realizing that they are skewing their performance data.
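
As a sketch of what this looks like with the permutation function this patch adds (I write it as permute(i, size), returning a value in [0, size); the exact name and signature vary across patch versions):

    -- same skewed draw, but scattered over the whole key space:
    -- frequent ids are no longer the physically clustered small ones
    \set size 100000 * :scale
    \set r random_exponential(1, :size, 2.5)
    \set aid 1 + permute(:r, :size)
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;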

Once you realize that, you can try to invent your own shuffling, but frankly it is not as easy as it looks to come up with something non-trivial that does not generate unwanted correlations. I went through a lot of (very) bad designs before arriving at the one in the patch: you want something fast and you want good mixing, and these are contradictory objectives. There is also the question of being able to change the shuffling for a given size, for instance to check that there are no unwanted effects, hence the seeding. Good seeded shuffling is what an encryption algorithm does, but there the problem is somewhat simpler, because such algorithms usually work on power-of-two sizes, whereas a benchmark table can have any size.
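
To illustrate the pitfall, here is the kind of "obvious" homemade shuffle one might try first: an affine map modulo the table size (a hypothetical sketch; :seed would be supplied with -D seed=...). It is a bijection whenever the multiplier is coprime to the size, so it looks like a shuffle, but the frequent ids merely end up at a fixed stride instead of at the head of the table, which is still a correlation.

    -- naive seeded "shuffle": bijective only if the multiplier is
    -- coprime to :size, and frequent ids stay regularly spaced
    \set size 100000 * :scale
    \set r random_exponential(1, :size, 2.5)
    \set aid 1 + ((:r - 1) * 1048573 + :seed) % :size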

In conclusion, a seeded shuffle step is needed as soon as you start using non-uniform random generators. Providing one in pgbench, a tool designed to run possibly non-uniform benchmarks against postgres, looks like common courtesy. Not providing one helps people design bad benchmarks, possibly without realizing it, which is outright thoughtlessness.

I have no doubt that the documentation should be improved to explain that in a concise but clear way.

--
Fabien.

