Hi, I decided to do some testing on hydra (IBM-provided community resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using the newly-enhanced wait event stuff to try to get an idea of what we're waiting for during pgbench. I did 30-minute pgbench runs with various configurations, but all had max_connections = 200, shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit = off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints = on. During each run, I ran this psql script in another window and captured the output:
\t select wait_event_type, wait_event from pg_stat_activity where pid != pg_backend_pid() \watch 0.5 Then, I used a little shell-scripting to count up the number of times each wait event occurred in the output. First, I tried scale factor 3000 with 32 clients and got these results: 1 LWLockTranche | buffer_mapping 9 LWLockNamed | CLogControlLock 14 LWLockNamed | ProcArrayLock 16 Lock | tuple 25 LWLockNamed | CheckpointerCommLock 49 LWLockNamed | WALBufMappingLock 122 LWLockTranche | clog 182 Lock | transactionid 287 LWLockNamed | XidGenLock 1300 Client | ClientRead 1375 LWLockTranche | buffer_content 3990 Lock | extend 21014 LWLockNamed | WALWriteLock 28497 | 58279 LWLockTranche | wal_insert tps = 1150.803133 (including connections establishing) What jumps out here is, at least to me, is that there is furious contention on both the wal_insert locks and on WALWriteLock. Apparently, the system simply can't get WAL on disk fast enough to keep up with this workload. Relation extension locks and buffer_content locks also are also pretty common, both ahead of ClientRead, a relatively uncommon wait event on this test. The load average on the system was only about 3 during this test, indicating that most processes are in fact spending most of their time off-CPU. The first thing I tried was switching to unlogged tables, which produces these results: 1 BufferPin | BufferPin 1 LWLockTranche | lock_manager 2 LWLockTranche | buffer_mapping 8 LWLockNamed | ProcArrayLock 9 LWLockNamed | CheckpointerCommLock 9 LWLockNamed | CLogControlLock 11 LWLockTranche | buffer_content 37 LWLockTranche | clog 153 Lock | tuple 388 LWLockNamed | XidGenLock 827 Lock | transactionid 1267 Client | ClientRead 20631 Lock | extend 91767 | tps = 1223.239416 (including connections establishing) If you don't look at the TPS number, these results look like a vast improvement. The overall amount of time spent not waiting for anything is now much higher, and the problematic locks have largely disappeared from the picture. However, the load average now shoots up to about 30, because most of the time that the backends are "not waiting for anything" they are in fact in kernel wait state D; that is, they're stuck doing I/O. This suggests that we might want to consider advertising a wait state when a backend is doing I/O, so we can measure this sort of thing. Next, I tried lowering the scale factor to something that fits in shared buffers. Here are the results at scale factor 300: 14 Lock | tuple 22 LWLockTranche | lock_manager 39 LWLockNamed | WALBufMappingLock 331 LWLockNamed | CLogControlLock 461 LWLockNamed | ProcArrayLock 536 Lock | transactionid 552 Lock | extend 716 LWLockTranche | buffer_content 763 LWLockNamed | XidGenLock 2113 LWLockNamed | WALWriteLock 6190 LWLockTranche | wal_insert 25002 Client | ClientRead 78466 | tps = 27651.562835 (including connections establishing) Obviously, there's a vast increase in TPS, and the backends seem to spend most of their time actually doing work. ClientRead is now the overwhelmingly dominant wait event, although wal_insert and WALWriteLock contention is clearly still a significant problem. Contention on other locks is apparently quite rare. Notice that client reads are really significant here - more than 20% of the time we sample what a backend is doing, it's waiting for the next query. It seems unlikely that performance on this workload can be improved very much by optimizing anything other than WAL writing, because no other wait event appears in more than 1% of the samples. It's not clear how much of the WAL-related stuff is artificial lock contention and how much of it is the limited speed at which the disk can rotate. However, I was curious about what's going on with CLogControlLock, persuant to the other thread on that topic, so I reran this with unlogged tables. At scale factor 300, 32 clients, unlogged tables, I get these results: 6 LWLockTranche | wal_insert 12 Lock | tuple 26 LWLockTranche | lock_manager 39 LWLockTranche | buffer_content 353 LWLockNamed | CLogControlLock 377 Lock | extend 491 Lock | transactionid 500 LWLockNamed | ProcArrayLock 696 LWLockNamed | XidGenLock 27685 Client | ClientRead 84930 | WAL contention is eliminated, and CLog contention doesn't increase. Next, I tried ramping up the client count to see what effect that had. These next three results are all at scale factor 300 with unlogged tables. 64 clients: 1 BufferPin | BufferPin 2 LWLockTranche | buffer_mapping 51 LWLockTranche | wal_insert 52 Lock | relation 132 LWLockTranche | lock_manager 154 LWLockTranche | buffer_content 209 Lock | tuple 940 Lock | extend 1412 LWLockNamed | XidGenLock 2337 LWLockNamed | CLogControlLock 2813 LWLockNamed | ProcArrayLock 3859 Lock | transactionid 83760 Client | ClientRead 133840 | tps = 36672.484524 (including connections establishing) 96 clients: 1 LWLockTranche | proc 53 LWLockTranche | wal_insert 305 LWLockTranche | buffer_content 391 LWLockTranche | lock_manager 797 Lock | tuple 1074 Lock | extend 2492 LWLockNamed | XidGenLock 4185 LWLockNamed | ProcArrayLock 6003 LWLockNamed | CLogControlLock 11348 Lock | transactionid 142881 Client | ClientRead 174011 | tps = 37214.257661 (including connections establishing) 192 clients: 1 LWLockTranche | buffer_mapping 2 BufferPin | BufferPin 4 LWLockTranche | clog 8 LWLockTranche | proc 87 LWLockTranche | wal_insert 708 LWLockTranche | buffer_content 1892 LWLockTranche | lock_manager 2225 Lock | extend 4796 LWLockNamed | ProcArrayLock 5105 LWLockNamed | XidGenLock 7982 Lock | tuple 11882 LWLockNamed | CLogControlLock 64091 Lock | transactionid 275481 | 308627 Client | ClientRead tps = 35018.906863 (including connections establishing) There are a couple of interesting things here. First, CLogControlLock contention becomes more significant as the number of the clients goes up. At 32 clients, it is the 6th-most frequent wait event; at 192 clients, it's moved up to 4th, and it's clearly growing at a faster rate than some of the others. As we go from 32 to 64 to 96 to 192 clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples, but CLogControLock is captured in 353, 2337, 6003, 11882 samples, so it goes from being only 2/3 of ProcArrayLock to being more than twice ProcArrayLock. However, in absolute terms, it's still not very significant. Even at 192 clients, where it's the fourth-most-common wait event, it appears in only 1.7% of samples, which means backends are only waiting for it about 1.7% of the time. It appears that, on this system, this workload is NEVER going to get meaningfully faster by improving CLogControlLock. Second, ClientRead becomes a bigger and bigger issue as the number of clients increases; by 192 clients, it appears in 45% of the samples. That basically means that pgbench is increasingly unable to keep up with the server; for whatever reason, it suffers more than the server does from the increasing lack of CPU resources. Third, Lock/transactionid and Lock/tuple become more and more common as the number of clients increases; these happen when two different pgbench threads deciding to hit the same row at the same time. Due to the birthday paradox this increases pretty quickly as the number of clients ramps up, but it's not really a server issue: there's nothing the server can do about the possibility that two or more clients pick the same random number at the same time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers