Hi,

I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench.  I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on.  During each run, I ran this psql script in another window and
captured the output:

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5

Then, I used a little shell-scripting to count up the number of times
each wait event occurred in the output.
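The counting is nothing fancy; with the \watch output captured to a
file (waits.out here is just a placeholder name), something along
these lines produces tallies in the format shown below:

# keep only rows containing the column separator, then count duplicates
grep '|' waits.out | sort | uniq -c | sort -n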
First, I tried scale factor 3000 with 32 clients and got these
results:

      1  LWLockTranche   | buffer_mapping
      9  LWLockNamed     | CLogControlLock
     14  LWLockNamed     | ProcArrayLock
     16  Lock            | tuple
     25  LWLockNamed     | CheckpointerCommLock
     49  LWLockNamed     | WALBufMappingLock
    122  LWLockTranche   | clog
    182  Lock            | transactionid
    287  LWLockNamed     | XidGenLock
   1300  Client          | ClientRead
   1375  LWLockTranche   | buffer_content
   3990  Lock            | extend
  21014  LWLockNamed     | WALWriteLock
  28497                  |
  58279  LWLockTranche   | wal_insert

tps = 1150.803133 (including connections establishing)

What jumps out here, at least to me, is that there is furious
contention on both the wal_insert locks and on WALWriteLock.
Apparently, the system simply can't get WAL on disk fast enough to
keep up with this workload.  Relation extension locks and
buffer_content locks are also pretty common, both ahead of
ClientRead, a relatively uncommon wait event on this test.  The load
average on the system was only about 3 during this test, indicating
that most processes are in fact spending most of their time off-CPU.
The first thing I tried was switching to unlogged tables, which
produces these results:

      1  BufferPin       | BufferPin
      1  LWLockTranche   | lock_manager
      2  LWLockTranche   | buffer_mapping
      8  LWLockNamed     | ProcArrayLock
      9  LWLockNamed     | CheckpointerCommLock
      9  LWLockNamed     | CLogControlLock
     11  LWLockTranche   | buffer_content
     37  LWLockTranche   | clog
    153  Lock            | tuple
    388  LWLockNamed     | XidGenLock
    827  Lock            | transactionid
   1267  Client          | ClientRead
  20631  Lock            | extend
  91767                  |

tps = 1223.239416 (including connections establishing)

If you don't look at the TPS number, these results look like a vast
improvement.  The overall amount of time spent not waiting for
anything is now much higher, and the problematic locks have largely
disappeared from the picture.  However, the load average now shoots up
to about 30, because most of the time that the backends are "not
waiting for anything" they are in fact in kernel wait state D; that
is, they're stuck doing I/O.  This suggests that we might want to
consider advertising a wait state when a backend is doing I/O, so we
can measure this sort of thing.
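One crude way to watch that from another window, assuming the
backends show up under the process name "postgres", is something
along these lines, which counts processes currently in
uninterruptible sleep:

# count postgres processes whose kernel state is D (uninterruptible sleep)
ps -C postgres -o state= | grep -c '^D'

That's obviously a much blunter instrument than a proper I/O wait
event would be.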

Next, I tried lowering the scale factor to something that fits in
shared buffers.  Here are the results at scale factor 300:

     14  Lock            | tuple
     22  LWLockTranche   | lock_manager
     39  LWLockNamed     | WALBufMappingLock
    331  LWLockNamed     | CLogControlLock
    461  LWLockNamed     | ProcArrayLock
    536  Lock            | transactionid
    552  Lock            | extend
    716  LWLockTranche   | buffer_content
    763  LWLockNamed     | XidGenLock
   2113  LWLockNamed     | WALWriteLock
   6190  LWLockTranche   | wal_insert
  25002  Client          | ClientRead
  78466                  |

tps = 27651.562835 (including connections establishing)

Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work.  ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare.  Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples.  It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.
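(To put rough numbers on that: this run has about 115,000 samples in
total, so 25002 ClientRead samples is roughly 22%, while the largest
wait event other than ClientRead and the WAL-related ones, XidGenLock
at 763 samples, is well under 1%.)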

However, I was curious about what's going on with CLogControlLock,
pursuant to the other thread on that topic, so I reran this with
unlogged tables.  At scale factor 300, 32 clients, unlogged tables, I
get these results:

      6  LWLockTranche   | wal_insert
     12  Lock            | tuple
     26  LWLockTranche   | lock_manager
     39  LWLockTranche   | buffer_content
    353  LWLockNamed     | CLogControlLock
    377  Lock            | extend
    491  Lock            | transactionid
    500  LWLockNamed     | ProcArrayLock
    696  LWLockNamed     | XidGenLock
  27685  Client          | ClientRead
  84930                  |

WAL contention is eliminated, and CLog contention doesn't increase.
Next, I tried ramping up the client count to see what effect that had.
These next three results are all at scale factor 300 with unlogged
tables.
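For anyone wanting to reproduce the setup, the invocations are
nothing exotic; something along these lines, where the database name
and thread count are placeholders rather than necessarily what I
used:

# initialize with unlogged tables at scale factor 300
pgbench -i -s 300 --unlogged-tables bench
# 30-minute run at 64 clients; repeat with -c 96 and -c 192
pgbench -c 64 -j 16 -T 1800 bench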

64 clients:

      1  BufferPin       | BufferPin
      2  LWLockTranche   | buffer_mapping
     51  LWLockTranche   | wal_insert
     52  Lock            | relation
    132  LWLockTranche   | lock_manager
    154  LWLockTranche   | buffer_content
    209  Lock            | tuple
    940  Lock            | extend
   1412  LWLockNamed     | XidGenLock
   2337  LWLockNamed     | CLogControlLock
   2813  LWLockNamed     | ProcArrayLock
   3859  Lock            | transactionid
  83760  Client          | ClientRead
 133840                  |

tps = 36672.484524 (including connections establishing)

96 clients:

      1  LWLockTranche   | proc
     53  LWLockTranche   | wal_insert
    305  LWLockTranche   | buffer_content
    391  LWLockTranche   | lock_manager
    797  Lock            | tuple
   1074  Lock            | extend
   2492  LWLockNamed     | XidGenLock
   4185  LWLockNamed     | ProcArrayLock
   6003  LWLockNamed     | CLogControlLock
  11348  Lock            | transactionid
 142881  Client          | ClientRead
 174011                  |

tps = 37214.257661 (including connections establishing)

192 clients:

      1  LWLockTranche   | buffer_mapping
      2  BufferPin       | BufferPin
      4  LWLockTranche   | clog
      8  LWLockTranche   | proc
     87  LWLockTranche   | wal_insert
    708  LWLockTranche   | buffer_content
   1892  LWLockTranche   | lock_manager
   2225  Lock            | extend
   4796  LWLockNamed     | ProcArrayLock
   5105  LWLockNamed     | XidGenLock
   7982  Lock            | tuple
  11882  LWLockNamed     | CLogControlLock
  64091  Lock            | transactionid
 275481                  |
 308627  Client          | ClientRead

tps = 35018.906863 (including connections establishing)

There are a couple of interesting things here.  First, CLogControlLock
contention becomes more significant as the number of clients goes
up. At 32 clients, it is the 6th-most frequent wait event; at 192
clients, it's moved up to 4th, and it's clearly growing at a faster
rate than some of the others.  As we go from 32 to 64 to 96 to 192
clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples,
but CLogControlLock is captured in 353, 2337, 6003, 11882 samples, so
it goes from being only 2/3 of ProcArrayLock to being more than twice
ProcArrayLock. However, in absolute terms, it's still not very
significant.  Even at 192 clients, where it's the fourth-most-common
wait event, it appears in only 1.7% of samples, which means backends
are only waiting for it about 1.7% of the time.  It appears that, on
this system, this workload is NEVER going to get meaningfully faster
by improving CLogControlLock.
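(That 1.7% is just 11882 CLogControlLock samples out of roughly
683,000 total samples in the 192-client run.)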

Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources.  Third,
Lock/transactionid and Lock/tuple become more and more common as the
number of clients increases; these happen when two different pgbench
threads decide to hit the same row at the same time.  Due to the
birthday paradox this increases pretty quickly as the number of
clients ramps up, but it's not really a server issue: there's nothing
the server can do about the possibility that two or more clients pick
the same random number at the same time.
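As a rough back-of-the-envelope check, assuming the standard
TPC-B-like script: at scale factor 300, pgbench_branches has only 300
rows, and every transaction updates one of them, so 192 clients give
192*191/2 = 18336 pairs of clients, each pair with a 1-in-300 chance
of having picked the same branch.  If all 192 transactions were in
flight at once, that's on the order of 60 colliding pairs at any
given moment, so a lot of tuple and transactionid waiting is exactly
what you'd expect.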

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

