Dear Robert and Greg,

I think so.  If it doesn't get fixed now, it's not likely to get fixed
later.  And the fact that nobody understands why it's happening is
kinda worrisome...

Possibly, but I thing that it is not my fault:-)

So, I spent some time at performance debugging...

First, I finally succeeded in reproducing Greg Smith spikes on my ubuntu laptop. It needs short transactions and many clients per thread so as to be a spike. With "pgbench" standard transactions and not too many clients per thread it is more of a little bump, or even it is not there at all.

After some poking around, and pursuing various red herrings, I resorted to measure the delay for calling "PQfinish()", which is really the only special thing going around at the end of pgbench run...

BINGO!

In his tests Greg is using one thread. Once the overall timer is exceeded, clients start closing their connections by calling PQfinish once their transactions are done. This call takes between a few us and a few ... ms. So if some client closing time hit the bad figures, the transactions of other clients are artificially delayed by this time, and it seems they have a high latency, but it is really because the thread is in another client's PQfinish and was not available to process the data. If you have one thread per client, no problem, especially as the final PQfinish() time is not measured at all by pgbench:-) Also, the more clients, the higher the spike because more are to be stopped and may hit the bad figures.

Here is a trace, with the simple SELECT transaction.

  sh> ./pgbench --rate=14000 -T 10 -r -l -c 30 -S bench
  ...

  sh> less pgbench_log.*

 # short transactions, about 0.250 ms
 ...
 20 4849 241 0 1371916583.455400
 21 4844 256 0 1371916583.455470
 22 4832 348 0 1371916583.455569
 23 4829 218 0 1371916583.455627

   ** TIMER EXCEEDED **

 25 4722 390 0 1371916583.455820
 25 done in 1560 [1,2]              # BING, 1560 us for PQfinish()
 26 4557 1969 0 1371916583.457407
 26 done in 21 [1,2]
 27 4372 1969 0 1371916583.457447
 27 done in 19 [1,2]
 28 4009 1910 0 1371916583.457486
 28 done in 1445 [1,2]              # BOUM
 2 interrupted in 1300 [0,0]        # BANG
 3 interrupted in 15 [0,0]
 4 interrupted in 40 [0,0]
 5 interrupted in 203 [0,0]         # boum?
 6 interrupted in 1352 [0,0]        # ZIP
 7 interrupted in 18 [0,0]
 ...
 23 interrupted in 12 [0,0]

 ## the cumulated PQfinish() time above is about 6 ms which
 ## appears as an apparent latency for the next clients:

  0 4880 6521 0 1371916583.462157
  0 done in 9 [1,2]
  1 4877 6397 0 1371916583.462194
  1 done in 9 [1,2]
 24 4807 6796 0 1371916583.462217
 24 done in 9 [1,2]
 ...

Note that the bad measures also appear when there is no throttling:

  sh> ./pgbench -T 10 -r -l -c 30 -S bench
  sh> grep 'done.*[0-9][0-9][0-9]' pgbench_log.*
   0 done in 1974 [1,2]
   1 done in 312 [1,2]
   3 done in 2159 [1,2]
   7 done in 409 [1,2]
  11 done in 393 [1,2]
  12 done in 2212 [1,2]
  13 done in 1458 [1,2]
  ## other clients execute PQfinish in less than 100 us

This "done" is issued by my instrumented version of clientDone().

The issue does also appear if I instrument pgbench from master, without anything from the throttling patch at all:

  sh> git diff master
  diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
  index 1303217..7c5ea81 100644
  --- a/contrib/pgbench/pgbench.c
  +++ b/contrib/pgbench/pgbench.c
  @@ -869,7 +869,15 @@ clientDone(CState *st, bool ok)

          if (st->con != NULL)
          {
  +               instr_time now;
  +               int64 s0, s1;
  +               INSTR_TIME_SET_CURRENT(now);
  +               s0 = INSTR_TIME_GET_MICROSEC(now);
                  PQfinish(st->con);
  +               INSTR_TIME_SET_CURRENT(now);
  +               s1 = INSTR_TIME_GET_MICROSEC(now);
  +               fprintf(stderr, "%d done in %ld [%d,%d]\n",
  +                               st->id, s1-s0, st->listen, st->state);
                  st->con = NULL;
          }
          return false;                           /* always false */

  sh> ./pgbench -T 10 -r -l -c 30 -S bench 2> x.err

  sh> grep 'done.*[0-9][0-9][0-9]' x.err
  14 done in 1985 [1,2]
  16 done in 276 [1,2]
  17 done in 1418 [1,2]

So my argumented conclusion is that the issue is somewhere within PQfinish(), possibly in interaction with pgbench doings, but is *NOT* related in any way to the throttling patch, as it is preexisting it. Gregs stumbled upon it because he looked at latencies.

I'll submit a slightly improved v12 shortly.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to