Hello Andres,

Using pgbench -Mprepared -n -c 8 -j 8 -S pgbench_100 -T 10 -r -P1
e.g. shows pgbench using 189% CPU on my 4/8 core/thread laptop. That's
a pretty significant share.

Fine, but what is the corresponding server load? 211%? 611%? And what actual
time is spent in pgbench itself, vs libpq and syscalls?

System-wide, pgbench, including libpq, accounts for about 22% of the whole system.

Hmmm. I guess that the consistency between 189% CPU on 4 cores/8 threads and 22% overall load is that 189/800 = 23.6% ~ 22%.

Given the simplicity of the select-only transaction, the load is CPU bound, so the 8 postgres server processes should saturate the 4-core CPU, and pgbench & postgres are competing for CPU time. The overall load is probably 100%, i.e. 22% pgbench vs 78% postgres (assuming system time is included); 78/22 = 3.5, i.e. pgbench on one core would saturate postgres on 3.5 cores on a CPU-bound load.

I'm not shocked by these results under near-worst-case conditions (i.e. the server side has very little to do).

It seems quite consistent with the really worst-case example I reported (empty query, cannot do less). Looking at the same empty-sql-query load through "htop", I have 95% postgres and 75% pgbench. This is not fully consistent with "time", which reports 55% pgbench overall, over 2/3 of which in system and under 1/3 in pgbench itself, which must be divided between pgbench's actual code and external libpq/lib* stuff.

Yet again, pgbench code is not the issue from my point of view, because time is spent mostly elsewhere and any other driver would have to do the same.

As far as I can tell there's a number of things that are wrong:

Sure, I agree that things could be improved.

- prepared statement names are recomputed for every query execution

I'm not sure it is a big issue, but the name should indeed be precomputed somewhere.

- variable name lookup is done for every command, rather than once, when
 parsing commands

Hmmm. The names of variables are not all known in advance, e.g. with \gset. Possibly it does not matter, because the names of the variables actually used are known. Each used variable could get a number, so that using a variable becomes an array access at the corresponding index.

- a lot of string->int->string type back and forths

Yep, that is a pain. ISTM that strings are exchanged at the protocol level, but this is libpq's design, not pgbench's.

As far as variable values are concerned, AFAICR conversions are performed on demand only, and just once.

Overall, my point is that even if all pgbench-specific costs were wiped out, it would not change the final result (pgbench load) much, because most of the time is spent in libpq and the system. Any other test driver would incur the same cost.

Conclusion: pgbench-specific overheads are typically (much) below 10% of the
total client-side cpu cost of a transaction, while over 90% of the cpu cost
is spent in libpq and system, for the worst case do-nothing query.

I don't buy that that's the actual worst case, or even remotely close to it.

Hmmm. I'm not sure I can do much worse than 3 complex expressions against one empty SQL query. Ok, I could use 27 complex expressions to reach 50-50, but the 3-to-1 complex-expression-to-empty-SQL ratio already seems ok for a realistic worst-case test script.
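For concreteness, such a worst-case script could look something like the following untested sketch: three expression-heavy \set meta-commands per near-empty SQL command (the exact expressions are made up, and "SELECT;" stands in for the do-nothing query):

```
-- hypothetical worst-case pgbench script: three complex \set
-- expressions against one near-empty SQL command
\set a (random(1, 100000) * 7 + 3) % 1000
\set b (:a * :a + random(1, 10)) / 3
\set c abs(:a - :b) + random_gaussian(1, 100, 2.5)
SELECT;
```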

I e.g. see higher pgbench overhead for the *modify* case than for
pgbench's readonly case. And that's because some of the meta
commands are slow, in particular everything related to variables. And
the modify case just has more variables.

Hmmm. WRT \set and expressions, the two main costs seem to be the large switch and the variable management. Yet again, I still interpret the figures I collected as showing that these costs are small compared to libpq/system overheads, and that the overall result is below postgres's own CPU cost (on a per-client basis).

+   12.35%  pgbench  pgbench                [.] threadRun
+    3.54%  pgbench  pgbench                [.] dopr.constprop.0

~ 21%, probably some inlining has been performed, because I would have
expected to see significant time in "advanceConnectionState".

Yea, there's plenty inlining.  Note dopr() is string processing.

Which is a pain, no doubt about that. Some of it has been taken out of pgbench already, e.g. comparing commands vs using an enum.

+    2.95%  pgbench  libpq.so.5.13          [.] PQsendQueryPrepared
+    2.15%  pgbench  libpq.so.5.13          [.] pqPutInt
+    4.47%  pgbench  libpq.so.5.13          [.] pqParseInput3
+    1.66%  pgbench  libpq.so.5.13          [.] pqPutMsgStart
+    1.63%  pgbench  libpq.so.5.13          [.] pqGetInt

~ 13%

A lot of that is really stupid. We need to improve libpq: PQsendQueryGuts (attributed to PQsendQueryPrepared here) builds the command with many separate pqPut* calls, which reside in another translation unit; that is pretty sad.

Indeed, I'm definitely convinced that libpq costs are high and should be reduced where possible. Now, yet again, they are much smaller than the time spent in the system to send and receive the data on a local socket, so somehow they could be interpreted as good enough, even if not that good.

+    3.16%  pgbench  libc-2.28.so           [.] __strcmp_avx2
+    2.95%  pgbench  libc-2.28.so           [.] malloc
+    1.85%  pgbench  libc-2.28.so           [.] ppoll
+    1.85%  pgbench  libc-2.28.so           [.] __strlen_avx2
+    1.85%  pgbench  libpthread-2.28.so     [.] __libc_recv

~ 11%, the str* functions are a pain… Not sure who is calling, though:
pgbench or libpq.

Both. Most of the strcmp is from getQueryParams()/getVariable(). The
dopr() is from pg_*printf, which is mostly attributable to
preparedStatementName() and getVariable().

Hmmm. Frankly, I can optimize pgbench code pretty easily, but I'm not sure about maintainability, and, as I said many times, about the real effect it would have, because these costs are a minor part of the client-side benchmark cost.

This is basically 47% pgbench, 53% lib*, on the sample provided. I'm unclear
about where system time is measured.

It was excluded in this profile, both to reduce profiling costs, and to
focus on pgbench.

Ok.

If we take my other figures and round up, for a running pgbench we have 1/6 actual pgbench, 1/6 libpq, 2/3 system.

If I get a factor-of-10 speedup in actual pgbench code (let us assume I'm that good :-), then the overall gain is 1/6 - 1/6/10 = 15%. Although I could do it, and it would be some fun, the code would get ugly (not too bad, but probably less maintainable, with a partial typing phase and expression compilation), and my bet is that however good the patch, it would be rejected.

Do you see an error in my evaluation of pgbench's actual costs and their contribution to the overall performance of running a benchmark?

If yes, which one is it?

If not, do you think it advisable to spend time improving the evaluator & variable handling, and possibly other places, for an overall 15% gain?

Also, what would be the likelihood of such an optimization patch passing?

I could possibly do a limited variable-management improvement patch; I have funny ideas to speed things up, some outlined above, some others even more terrible.

--
Fabien.
