Robert,

This contention on WAL reminds me of a similar scenario I've heard about.

The fix there was that whichever process ended up doing the write would be responsible for writing out the buffers of everyone blocked behind it.

For example, say you have several threads: A, B, C, D.

While A is writing to WAL and holding the lock, B arrives and of course blocks; then C comes along and blocks as well, and then D.

Finally A finishes its write and then ....

Now you have two options for resolving this:

1) A drops its lock, B picks up the lock... B writes its buffer and then drops the lock. Then C gets the lock, writes its buffer, drops the lock, then finally D gets the lock, writes its buffer and then drops the lock.

2) A writes out B's, C's, and D's buffers as well, then drops the lock; B, C, and D wake up, see that their buffers have already been written, and just return. This greatly speeds up the system. (Just be careful that A doesn't do "too much work", otherwise you can get a sort of livelock when too many threads are blocked behind it: generally, issue only one additional flush on behalf of other threads, and do not loop until the queue is empty.) A rough sketch of what I mean is below.
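
Here is a rough sketch of the shape of option 2 in plain pthreads. All of the names and the queue structure are invented for illustration (this is obviously not the actual WAL code); the point is just that whichever thread ends up doing the write takes one batch of queued requests along with it:

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Request
{
    struct Request *next;
    const void     *buf;
    size_t          len;
    bool            done;
} Request;

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;
static Request *pending = NULL;     /* requests not yet written out */
static bool     writer_active = false;

/* The actual (slow) write; assumed to exist elsewhere. */
extern void do_write(const void *buf, size_t len);

void
group_flush(Request *req)
{
    pthread_mutex_lock(&qlock);

    /* Enqueue ourselves. */
    req->done = false;
    req->next = pending;
    pending = req;

    /*
     * If someone else is already writing, wait.  They (or the writer
     * after them) will take our request as part of their batch.
     */
    while (!req->done && writer_active)
        pthread_cond_wait(&qcond, &qlock);

    if (req->done)
    {
        /* A previous writer flushed our buffer for us; nothing to do. */
        pthread_mutex_unlock(&qlock);
        return;
    }

    /*
     * Nobody is writing and our request is still pending: become the
     * writer.  Take exactly the batch that is queued right now; that is
     * the "one additional flush, don't loop until the queue is empty"
     * rule that avoids the livelock.
     */
    writer_active = true;
    Request *batch = pending;
    pending = NULL;
    pthread_mutex_unlock(&qlock);

    for (Request *r = batch; r != NULL; r = r->next)
        do_write(r->buf, r->len);           /* the expensive part */

    pthread_mutex_lock(&qlock);
    for (Request *r = batch; r != NULL; r = r->next)
        r->done = true;
    writer_active = false;
    pthread_cond_broadcast(&qcond);
    pthread_mutex_unlock(&qlock);
}

The one-batch rule is what keeps a steady stream of newcomers from pinning one unlucky thread to the write path forever.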

I'm not sure whether this is actually possible with the way WAL is implemented (or whether this strategy is already implemented), but if it isn't done already it's definitely worth looking into, as it can speed things up enormously.

On 10/6/16 11:38 AM, Robert Haas wrote:
Hi,

I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench.  I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on.  During each run, I ran this psql script in another window and
captured the output:

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5

Then, I used a little shell-scripting to count up the number of times
each wait event occurred in the output.  First, I tried scale factor
3000 with 32 clients and got these results:

       1  LWLockTranche   | buffer_mapping
       9  LWLockNamed     | CLogControlLock
      14  LWLockNamed     | ProcArrayLock
      16  Lock            | tuple
      25  LWLockNamed     | CheckpointerCommLock
      49  LWLockNamed     | WALBufMappingLock
     122  LWLockTranche   | clog
     182  Lock            | transactionid
     287  LWLockNamed     | XidGenLock
    1300  Client          | ClientRead
    1375  LWLockTranche   | buffer_content
    3990  Lock            | extend
   21014  LWLockNamed     | WALWriteLock
   28497                  |
   58279  LWLockTranche   | wal_insert

tps = 1150.803133 (including connections establishing)

What jumps out here, at least to me, is that there is furious
contention on both the wal_insert locks and on WALWriteLock.
Apparently, the system simply can't get WAL on disk fast enough to
keep up with this workload.  Relation extension locks and
buffer_content locks are also pretty common, both ahead of
ClientRead, a relatively uncommon wait event on this test.  The load
average on the system was only about 3 during this test, indicating
that most processes are in fact spending most of their time off-CPU.
The first thing I tried was switching to unlogged tables, which
produces these results:

       1  BufferPin       | BufferPin
       1  LWLockTranche   | lock_manager
       2  LWLockTranche   | buffer_mapping
       8  LWLockNamed     | ProcArrayLock
       9  LWLockNamed     | CheckpointerCommLock
       9  LWLockNamed     | CLogControlLock
      11  LWLockTranche   | buffer_content
      37  LWLockTranche   | clog
     153  Lock            | tuple
     388  LWLockNamed     | XidGenLock
     827  Lock            | transactionid
    1267  Client          | ClientRead
   20631  Lock            | extend
   91767                  |

tps = 1223.239416 (including connections establishing)

If you don't look at the TPS number, these results look like a vast
improvement.  The overall amount of time spent not waiting for
anything is now much higher, and the problematic locks have largely
disappeared from the picture.  However, the load average now shoots up
to about 30, because most of the time that the backends are "not
waiting for anything" they are in fact in kernel wait state D; that
is, they're stuck doing I/O.  This suggests that we might want to
consider advertising a wait state when a backend is doing I/O, so we
can measure this sort of thing.
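
Roughly, I imagine bracketing the actual read() and write() calls with the
same kind of report that the lock waits already use. The names below are
invented purely to illustrate the idea; they are not the real reporting
interface:

#include <unistd.h>

/* Invented names, for illustration only. */
typedef enum
{
    WAIT_IO_NONE,
    WAIT_IO_READ,
    WAIT_IO_WRITE
} IoWaitKind;

extern void report_io_wait_start(IoWaitKind kind);  /* would be visible in pg_stat_activity */
extern void report_io_wait_end(void);

static ssize_t
instrumented_read(int fd, void *buf, size_t len)
{
    ssize_t     nread;

    report_io_wait_start(WAIT_IO_READ);
    nread = read(fd, buf, len);
    report_io_wait_end();

    return nread;
}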

Next, I tried lowering the scale factor to something that fits in
shared buffers.  Here are the results at scale factor 300:

      14  Lock            | tuple
      22  LWLockTranche   | lock_manager
      39  LWLockNamed     | WALBufMappingLock
     331  LWLockNamed     | CLogControlLock
     461  LWLockNamed     | ProcArrayLock
     536  Lock            | transactionid
     552  Lock            | extend
     716  LWLockTranche   | buffer_content
     763  LWLockNamed     | XidGenLock
    2113  LWLockNamed     | WALWriteLock
    6190  LWLockTranche   | wal_insert
   25002  Client          | ClientRead
   78466                  |

tps = 27651.562835 (including connections establishing)

Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work.  ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare.  Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples.  It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.

However, I was curious about what's going on with CLogControlLock,
pursuant to the other thread on that topic, so I reran this with
unlogged tables.  At scale factor 300, 32 clients, unlogged tables, I
get these results:

       6  LWLockTranche   | wal_insert
      12  Lock            | tuple
      26  LWLockTranche   | lock_manager
      39  LWLockTranche   | buffer_content
     353  LWLockNamed     | CLogControlLock
     377  Lock            | extend
     491  Lock            | transactionid
     500  LWLockNamed     | ProcArrayLock
     696  LWLockNamed     | XidGenLock
   27685  Client          | ClientRead
   84930                  |

WAL contention is eliminated, and CLog contention doesn't increase.
Next, I tried ramping up the client count to see what effect that had.
These next three results are all at scale factor 300 with unlogged
tables.

64 clients:

       1  BufferPin       | BufferPin
       2  LWLockTranche   | buffer_mapping
      51  LWLockTranche   | wal_insert
      52  Lock            | relation
     132  LWLockTranche   | lock_manager
     154  LWLockTranche   | buffer_content
     209  Lock            | tuple
     940  Lock            | extend
    1412  LWLockNamed     | XidGenLock
    2337  LWLockNamed     | CLogControlLock
    2813  LWLockNamed     | ProcArrayLock
    3859  Lock            | transactionid
   83760  Client          | ClientRead
  133840                  |

tps = 36672.484524 (including connections establishing)

96 clients:

       1  LWLockTranche   | proc
      53  LWLockTranche   | wal_insert
     305  LWLockTranche   | buffer_content
     391  LWLockTranche   | lock_manager
     797  Lock            | tuple
    1074  Lock            | extend
    2492  LWLockNamed     | XidGenLock
    4185  LWLockNamed     | ProcArrayLock
    6003  LWLockNamed     | CLogControlLock
   11348  Lock            | transactionid
  142881  Client          | ClientRead
  174011                  |

tps = 37214.257661 (including connections establishing)

192 clients:

       1  LWLockTranche   | buffer_mapping
       2  BufferPin       | BufferPin
       4  LWLockTranche   | clog
       8  LWLockTranche   | proc
      87  LWLockTranche   | wal_insert
     708  LWLockTranche   | buffer_content
    1892  LWLockTranche   | lock_manager
    2225  Lock            | extend
    4796  LWLockNamed     | ProcArrayLock
    5105  LWLockNamed     | XidGenLock
    7982  Lock            | tuple
   11882  LWLockNamed     | CLogControlLock
   64091  Lock            | transactionid
  275481                  |
  308627  Client          | ClientRead

tps = 35018.906863 (including connections establishing)

There are a couple of interesting things here.  First, CLogControlLock
contention becomes more significant as the number of clients goes
up. At 32 clients, it is the 6th-most frequent wait event; at 192
clients, it's moved up to 4th, and it's clearly growing at a faster
rate than some of the others.  As we go from 32 to 64 to 96 to 192
clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples,
but CLogControlLock is captured in 353, 2337, 6003, 11882 samples, so
it goes from being only 2/3 of ProcArrayLock to being more than twice
ProcArrayLock. However, in absolute terms, it's still not very
significant.  Even at 192 clients, where it's the fourth-most-common
wait event, it appears in only 1.7% of samples, which means backends
are only waiting for it about 1.7% of the time.  It appears that, on
this system, this workload is NEVER going to get meaningfully faster
by improving CLogControlLock.

Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources.  Third,
Lock/transactionid and Lock/tuple become more and more common as the
number of clients increases; these happen when two different pgbench
threads decide to hit the same row at the same time.  Due to the
birthday paradox this increases pretty quickly as the number of
clients ramps up, but it's not really a server issue: there's nothing
the server can do about the possibility that two or more clients pick
the same random number at the same time.
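
Back-of-the-envelope, assuming the standard TPC-B-like script where every
transaction updates one of the pgbench_branches rows (300 of them at this
scale factor): the chance that at least two of k in-flight transactions
have picked the same branch row is roughly

    1 - exp(-k*(k-1)/(2*300))

which already comes out to about 0.8 at 32 clients and is effectively 1
well before 192 clients, so the growth in those wait events is about what
you'd expect.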



