Hello Andres,

Using pgbench -Mprepared -n -c 8 -j 8 -S pgbench_100 -T 10 -r -P1
e.g. shows pgbench using 189% CPU on my 4/8 core/thread laptop. That's
a pretty significant share.

Fine, but what is the corresponding server load? 211%? 611%? And what actual
time is spent in pgbench itself, vs libpq and syscalls?

System-wide, pgbench, including libpq, accounts for about 22% of the whole system.

Hmmm. I guess that the consistency between 189% CPU on 4 cores/8 threads and 22% overall load is that 189/800 = 23.6% ~ 22%.

Given the simplicity of the select-only transaction, the load is CPU bound, so the 8 postgres server processes should saturate the 4-core CPU, and pgbench & postgres are competing for CPU time. The overall load is probably 100%, i.e. 22% pgbench vs 78% postgres (assuming system time is included); 78/22 = 3.5, i.e. pgbench on one core would saturate postgres on 3.5 cores on a CPU-bound load.

I'm not shocked by these results under near-worst-case conditions (i.e. the server side has very little to do).

It seems quite consistent with the really worst-case example I reported (empty query, cannot do less). Looking at the same empty-sql-query load through "htop", I have 95% postgres and 75% pgbench. This is not fully consistent with "time", which reports 55% pgbench overall, over 2/3 of which in system and under 1/3 in pgbench itself, which must be divided between pgbench's actual code and external libpq/lib* stuff.

Yet again, pgbench code is not the issue from my point of view, because time is spent mostly elsewhere and any other driver would have to do the same.

As far as I can tell there's a number of things that are wrong:

Sure, I agree that things could be improved.

- prepared statement names are recomputed for every query execution

I'm not sure it is a big issue, but the name should indeed be precomputed somewhere.

- variable name lookup is done for every command, rather than once, when
 parsing commands

Hmmm. The names of variables are not all known in advance, e.g. with \gset. Possibly it does not matter, because the names of the variables actually used are known. Each used variable could get a number, so that using a variable becomes an array access at the corresponding index.

- a lot of string->int->string type back and forths

Yep, that is a pain. ISTM that strings are exchanged at the protocol level, but this is libpq's design, not pgbench's.

As far as variable values are concerned, AFAICR conversions are performed on demand only, and just once.

Overall, my point is that even if all pgbench-specific costs were wiped out, it would not change the final result (pgbench load) much, because most of the time is spent in libpq and the system. Any other test driver would incur the same cost.

Conclusion: pgbench-specific overheads are typically (much) below 10% of the
total client-side cpu cost of a transaction, while over 90% of the cpu cost
is spent in libpq and system, for the worst case do-nothing query.

I don't buy that that's the actual worst case, or even remotely close to it.

Hmmm. I'm not sure I can do much worse than 3 complex expressions against one empty SQL query. Ok, I could use 27 complex expressions to reach 50-50, but the 3-to-1 complex-expression-to-empty-SQL ratio already seems ok for a realistic worst-case test script.
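For concreteness, such a worst-case script could look something like the following untested sketch: three expression-heavy \set meta-commands per near-empty SQL command (the exact expressions are made up, and "SELECT;" stands in for the do-nothing query):

```
-- hypothetical worst-case pgbench script: three complex \set
-- expressions against one near-empty SQL command
\set a (random(1, 100000) * 7 + 3) % 1000
\set b (:a * :a + random(1, 10)) / 3
\set c abs(:a - :b) + random_gaussian(1, 100, 2.5)
SELECT;
```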

I e.g. see higher pgbench overhead for the *modify* case than for
pgbench's readonly case. And that's because some of the meta
commands are slow, in particular everything related to variables. And
the modify case just has more variables.

Hmmm. WRT \set and expressions, the two main costs seem to be the large switch and the variable management. Yet again, I still interpret the figures I collected as showing that these costs are small compared to libpq/system overheads, and that the overall result is below postgres's own CPU cost (on a per-client basis).

+   12.35%  pgbench  pgbench                [.] threadRun
+    3.54%  pgbench  pgbench                [.] dopr.constprop.0

~ 21%, probably some inlining has been performed, because I would have
expected to see significant time in "advanceConnectionState".

Yea, there's plenty inlining.  Note dopr() is string processing.

Which is a pain, no doubt about that. Some of it has been taken out of pgbench already, e.g. comparing commands vs using an enum.

+    2.95%  pgbench  libpq.so.5.13          [.] PQsendQueryPrepared
+    2.15%  pgbench  libpq.so.5.13          [.] pqPutInt
+    4.47%  pgbench  libpq.so.5.13          [.] pqParseInput3
+    1.66%  pgbench  libpq.so.5.13          [.] pqPutMsgStart
+    1.63%  pgbench  libpq.so.5.13          [.] pqGetInt

~ 13%

A lot of that is really stupid. We need to improve libpq: PQsendQueryGuts (attributed to PQsendQueryPrepared here) builds the command with many separate pqPut* calls, which reside in another translation unit; that is pretty sad.

Indeed, I'm definitely convinced that libpq costs are high and should be reduced where possible. Now, yet again, they are much smaller than the time spent in the system to send and receive the data on a local socket, so somehow they could be interpreted as good enough, even if not that good.

+    3.16%  pgbench  libc-2.28.so           [.] __strcmp_avx2
+    2.95%  pgbench  libc-2.28.so           [.] malloc
+    1.85%  pgbench  libc-2.28.so           [.] ppoll
+    1.85%  pgbench  libc-2.28.so           [.] __strlen_avx2
+    1.85%  pgbench  libpthread-2.28.so     [.] __libc_recv

~ 11%, the str* functions are a pain… Not sure who is calling, though:
pgbench or libpq.

Both. Most of the strcmp is from getQueryParams()/getVariable(). The
dopr() is from pg_*printf, which is mostly attributable to
preparedStatementName() and getVariable().

Hmmm. Frankly, I can optimize pgbench code pretty easily, but I'm not sure about maintainability, and, as I said many times, about the real effect it would have, because these costs are a minor part of the client-side benchmark cost.

This is basically 47% pgbench, 53% lib*, on the sample provided. I'm unclear
about where system time is measured.

It was excluded in this profile, both to reduce profiling costs, and to
focus on pgbench.

Ok.

If we take my other figures and round up, for a running pgbench we have 1/6 actual pgbench, 1/6 libpq, 2/3 system.

If I get a factor-of-10 speedup in actual pgbench code (let us assume I'm that good :-), then the overall gain is 1/6 - 1/6/10 = 15%. Although I could do it, and it would be some fun, the code would get ugly (not too bad, but probably less maintainable, with a partial typing phase and expression compilation), and my bet is that however good the patch, it would be rejected.

Do you see an error in my evaluation of pgbench's actual costs and their contribution to the overall performance of running a benchmark?

If yes, which one is it?

If not, do you think it advisable to spend time improving the evaluator & variable handling, and possibly other places, for an overall 15% gain?

Also, what would be the likelihood of such an optimization patch passing?

I could possibly do a limited variable-management improvement patch; I have funny ideas to speed things up, some outlined above, some others even more terrible.

--
Fabien.
