Here is a patch that applies over the "reducing the overhead of frequent table locks" (fastlock-v3) patch and allows heavyweight VXID locks to spring into existence only when someone wants to wait on them. I believe there is a large benefit to be had from this optimization, because the combination of these two patches virtually eliminates lock manager traffic on "pgbench -S" workloads. However, there are several flies in the ointment.
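To recap the general idea in toy form before getting to those (all names here are invented, the lock-manager calls are stand-in no-ops, and plenty of real bookkeeping is omitted; this is an illustration of the concept, not the patch's code): the holder merely advertises its VXID in its own shared-memory slot, and the heavyweight lock is created, on the holder's behalf, by the first backend that actually needs to wait for it.

/*
 * Toy illustration of "create the heavyweight lock only when somebody
 * actually needs to wait on it".  Invented names; not the patch's code.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct toy_vxid
{
    int         backend_id;
    uint32_t    local_xid;
} toy_vxid;

typedef struct toy_backend_slot
{
    atomic_flag mutex;              /* per-backend spinlock; init with ATOMIC_FLAG_INIT */
    bool        vxid_valid;         /* does this backend currently hold its VXID lock? */
    bool        lock_materialized;  /* has a waiter pushed it into the lock table? */
    toy_vxid    vxid;
} toy_backend_slot;

/* Stand-ins for the real lock manager; deliberately empty. */
static void heavyweight_lock_insert(toy_vxid v) { (void) v; }
static void heavyweight_lock_release(toy_vxid v) { (void) v; }
static void heavyweight_lock_wait(toy_vxid v) { (void) v; }

static void
slot_lock(toy_backend_slot *slot)
{
    while (atomic_flag_test_and_set_explicit(&slot->mutex, memory_order_acquire))
        /* spin */ ;
}

static void
slot_unlock(toy_backend_slot *slot)
{
    atomic_flag_clear_explicit(&slot->mutex, memory_order_release);
}

/* Holder: starting a transaction is just a couple of stores, no lock manager traffic. */
static void
vxid_lock_acquire_fast(toy_backend_slot *me, toy_vxid vxid)
{
    slot_lock(me);
    me->vxid = vxid;
    me->vxid_valid = true;
    me->lock_materialized = false;
    slot_unlock(me);
}

/* Waiter: only now does the heavyweight lock spring into existence. */
static void
vxid_lock_wait(toy_backend_slot *holder, toy_vxid vxid)
{
    slot_lock(holder);
    if (holder->vxid_valid &&
        holder->vxid.backend_id == vxid.backend_id &&
        holder->vxid.local_xid == vxid.local_xid)
    {
        if (!holder->lock_materialized)
        {
            heavyweight_lock_insert(vxid);      /* created on the holder's behalf */
            holder->lock_materialized = true;
        }
        slot_unlock(holder);
        heavyweight_lock_wait(vxid);            /* block until the holder releases it */
        return;
    }
    slot_unlock(holder);                        /* transaction already gone; nothing to wait for */
}

/* Holder, at transaction end: release the heavyweight lock only if someone created it. */
static void
vxid_lock_release(toy_backend_slot *me)
{
    bool        need_release;
    toy_vxid    v;

    slot_lock(me);
    me->vxid_valid = false;
    need_release = me->lock_materialized;
    me->lock_materialized = false;
    v = me->vxid;
    slot_unlock(me);

    if (need_release)
        heavyweight_lock_release(v);            /* drop the lock a waiter created for us */
}

The point is that the common path (begin transaction, do work, commit, with nobody ever waiting on the VXID) never touches the lock manager at all.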
1. It's a bit of a kludge. I leave it to readers of the patch to determine exactly what about it they find kludgey, but the set is probably not empty. I suspect that MyProc->fpLWLock needs to be renamed to something a bit more generic if we're going to use it like this, but I don't immediately know what to call it. Also, the mechanism whereby we take SInvalWriteLock to work out the mapping from BackendId to PGPROC * is not exactly awesome. I don't think it matters from a performance point of view, because operations that need VXID locks are sufficiently rare that the additional lwlock traffic won't matter a bit. However, we could avoid it altogether if we rejiggered the mechanism for allocating PGPROCs and backend IDs. Right now, we allocate PGPROCs off of linked lists, except for auxiliary processes, which allocate them by scanning a three-element array for an empty slot. Then, when the PGPROC subscribes to sinval, the sinval mechanism assigns a backend ID by scanning for the lowest unused one in the ProcState array. If we changed the logic for allocating PGPROCs to mimic what the sinval queue currently does, the backend ID could simply be defined as the offset into the PGPROC array, and translating between a backend ID and a PGPROC * would become a matter of pointer arithmetic. I'm not sure whether that's worth doing.

2. Bad things happen with large numbers of connections. This patch increases peak performance, but as the number of concurrent connections climbs past the number of CPU cores, performance drops off faster with the patch than without it. For example, on the 32-core loaner from Nate Boley, with 80 pgbench -S clients, unpatched HEAD runs at ~36K TPS; with fastlock it jumps to about ~99K TPS; with this patch also applied, it drops to about ~64K TPS, even though nearly all the contention on the lock manager locks has been eliminated. On Stefan Kaltenbrunner's 40-core box, he actually saw performance drop below unpatched HEAD with this patch applied! This is immensely counterintuitive. What is going on?

Profiling reveals that the system spends enormous amounts of CPU time in s_lock. LWLOCK_STATS reveals that the only lwlock with a significant amount of blocking is BufFreelistLock, but that doesn't explain the high CPU utilization. In fact, the problem appears to be with the LWLocks that are frequently acquired in *shared* mode. There is no actual lock conflict, but each LWLock is protected by a spinlock which must be acquired and released just to bump the shared-locker count. In HEAD, everything bottlenecks on the lock manager locks, so it's not really possible for enough traffic to build up on any single spinlock to have a serious impact on performance; the locks being sought there are exclusive, so when they are contended, processes simply get descheduled. But with the exclusive locks out of the way, everyone very quickly lines up to acquire shared buffer manager locks, buffer content locks, and so on, and large pile-ups ensue, leading to massive cache line contention and tons of CPU usage.

My initial thought was that this was contention over the root block of the index on the pgbench_accounts table and the buffer mapping lock protecting it, but instrumentation showed otherwise. I hacked up the system to report how often each lwlock's spinlock exceeded spins_per_delay.
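Before looking at the numbers, here is a minimal, self-contained sketch in plain C11 of the shared-acquisition pattern at issue (toy names again; this is not code from PostgreSQL or from either patch). Shared lockers never conflict with one another, yet every acquire and release must take the same spinlock just to adjust the shared-locker count, so on a many-core machine they all fight over a single cache line:

/*
 * Toy sketch: an "lwlock"-shaped structure whose shared-locker count is
 * protected by a single spinlock.  Initialize as:
 *     toy_lwlock lk = { ATOMIC_FLAG_INIT, 0, false };
 */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct toy_lwlock
{
    atomic_flag mutex;          /* spinlock protecting the fields below */
    int         shared;         /* number of shared holders */
    bool        exclusive;      /* true if an exclusive holder exists */
} toy_lwlock;

static void
toy_spin_acquire(toy_lwlock *lock)
{
    /* TAS loop: under contention, this is where the spinning happens */
    while (atomic_flag_test_and_set_explicit(&lock->mutex, memory_order_acquire))
        /* spin */ ;
}

static void
toy_spin_release(toy_lwlock *lock)
{
    atomic_flag_clear_explicit(&lock->mutex, memory_order_release);
}

static void
toy_lwlock_acquire_shared(toy_lwlock *lock)
{
    for (;;)
    {
        toy_spin_acquire(lock);     /* every shared locker bounces this cache line */
        if (!lock->exclusive)
        {
            lock->shared++;         /* the only real work done under the spinlock */
            toy_spin_release(lock);
            return;
        }
        toy_spin_release(lock);
        /* a real implementation would sleep on a wait queue here; we just retry */
    }
}

static void
toy_lwlock_release_shared(toy_lwlock *lock)
{
    toy_spin_acquire(lock);         /* and bounces it again on release */
    lock->shared--;
    toy_spin_release(lock);
}

There is no logical conflict anywhere in that path, but with dozens of cores calling the acquire and release functions at high frequency, the TAS loop on the single mutex is exactly the sort of spinning that shows up as s_lock time in the profile.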
The following is the end of a report showing the locks with the greatest amounts of excess spinning:

lwlock 0: shacq 0 exacq 191032 blk 42554 spin 272
lwlock 41: shacq 5982347 exacq 11937 blk 1825 spin 4217
lwlock 38: shacq 6443278 exacq 11960 blk 1726 spin 4440
lwlock 47: shacq 6106601 exacq 12096 blk 1555 spin 4497
lwlock 34: shacq 6423317 exacq 11896 blk 1863 spin 4776
lwlock 45: shacq 6455173 exacq 12052 blk 1825 spin 4926
lwlock 39: shacq 6867446 exacq 12067 blk 1899 spin 5071
lwlock 44: shacq 6824502 exacq 12040 blk 1655 spin 5153
lwlock 37: shacq 6727304 exacq 11935 blk 2077 spin 5252
lwlock 46: shacq 6862206 exacq 12017 blk 2046 spin 5352
lwlock 36: shacq 6854326 exacq 11920 blk 1914 spin 5441
lwlock 43: shacq 7184761 exacq 11874 blk 1863 spin 5625
lwlock 48: shacq 7612458 exacq 12109 blk 2029 spin 5780
lwlock 35: shacq 7150616 exacq 11916 blk 2026 spin 5782
lwlock 33: shacq 7536878 exacq 11985 blk 2105 spin 6273
lwlock 40: shacq 7199089 exacq 12068 blk 2305 spin 6290
lwlock 456: shacq 36258224 exacq 0 blk 0 spin 54264
lwlock 42: shacq 43012736 exacq 11851 blk 10675 spin 62017
lwlock 4: shacq 72516569 exacq 190 blk 196 spin 341914
lwlock 5: shacq 145042917 exacq 0 blk 0 spin 798891
grand total: shacq 544277977 exacq 181886079 blk 82371 spin 1338135

So the majority (60%) of the excess spinning appears to be due to SInvalReadLock, and a good chunk (25%) is due to ProcArrayLock. Everything else is peanuts by comparison, though I am guessing that third and fourth place (5% and 4%, respectively) are in fact the buffer mapping lock that covers the pgbench_accounts_pkey root index block, and the content lock on that buffer.

What is to be done?

The SInvalReadLock acquisitions are all attributable, I believe, to AcceptInvalidationMessages(), which is called in a number of places, but in particular after every heavyweight lock acquisition. I think we need a quick way to short-circuit the lock acquisition there when no work is to be done, which is to say nearly always. Indeed, Noah Misch just proposed something along these lines on another thread ("Make relation_openrv atomic wrt DDL"), though I think this data may cast a new light on the details.

I haven't tracked down where the ProcArrayLock acquisitions are coming from. The realistic possibilities appear to be TransactionIdIsInProgress(), TransactionIdIsActive(), GetOldestXmin(), and GetSnapshotData(). Nor do I have a clear idea what to do about them. The remaining candidates are mild by comparison, so I won't analyze them further for the moment.

Another way to attack the problem would be some more general mechanism to make shared-lwlock acquisition cheaper, such as having 3 or 4 shared-locker counts per lwlock, each with its own spinlock, so that at least in the case where there's no real lwlock contention the spin-waiters can spread out across all of them (sketched below). But I'm not sure it's really worth it, considering that only a handful of cases appear to be severely affected, and we probably need to see what happens when we fix some of those cases first. If throughput goes up, we're good. If the spinlock pile-up just shifts somewhere it's not so easily eliminated, we might need to either eliminate the problem cases one by one or come up with some more general mechanism after all.
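For concreteness, here is a rough standalone sketch of that last idea, in the same toy C11 style as the sketch earlier in this mail (same #includes; all names invented; not proposed code for PostgreSQL): each lock carries several shared-locker counts, each guarded by its own spinlock, so uncontended shared acquisitions spread across several cache lines instead of piling onto one.

/*
 * Toy sketch: striped shared-locker counts.  NUM_STRIPES, shared_stripe,
 * and striped_lwlock are invented names.
 */
#define NUM_STRIPES 4

typedef struct shared_stripe
{
    _Alignas(64) atomic_flag mutex; /* per-stripe spinlock, on its own cache line */
    int         shared;             /* shared holders counted in this stripe */
} shared_stripe;

typedef struct striped_lwlock
{
    shared_stripe stripes[NUM_STRIPES];
    bool        exclusive;
} striped_lwlock;

/* Shared acquire touches only one stripe, chosen from some per-backend value. */
static void
striped_acquire_shared(striped_lwlock *lock, int my_id)
{
    shared_stripe *s = &lock->stripes[my_id % NUM_STRIPES];

    for (;;)
    {
        while (atomic_flag_test_and_set_explicit(&s->mutex, memory_order_acquire))
            /* spin */ ;
        /* 'exclusive' only changes while every stripe mutex is held, including ours */
        if (!lock->exclusive)
        {
            s->shared++;
            atomic_flag_clear_explicit(&s->mutex, memory_order_release);
            return;
        }
        atomic_flag_clear_explicit(&s->mutex, memory_order_release);
        /* a real implementation would sleep here rather than retry */
    }
}

/* Exclusive acquisition pays the price: it must lock every stripe and find
 * all of their shared counts at zero.  (Release would clear 'exclusive'
 * under all stripe mutexes as well.) */
static bool
striped_try_acquire_exclusive(striped_lwlock *lock)
{
    bool        ok = true;
    int         i;

    for (i = 0; i < NUM_STRIPES; i++)
        while (atomic_flag_test_and_set_explicit(&lock->stripes[i].mutex, memory_order_acquire))
            /* spin */ ;

    for (i = 0; i < NUM_STRIPES; i++)
        if (lock->stripes[i].shared != 0)
            ok = false;
    if (lock->exclusive)
        ok = false;
    if (ok)
        lock->exclusive = true;

    for (i = 0; i < NUM_STRIPES; i++)
        atomic_flag_clear_explicit(&lock->stripes[i].mutex, memory_order_release);

    return ok;
}

Exclusive acquisition obviously gets more expensive, which may be tolerable for a lock like SInvalReadLock that, judging from the report above, is taken here almost entirely in shared mode; but that trade-off is part of why I'm not sure the added complexity is worth it.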
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: lazyvxid-v1.patch