Here is a patch that applies over the "reducing the overhead of frequent table locks" (fastlock-v3) patch and allows heavyweight VXID locks to spring into existence only when someone wants to wait on them. I believe there is a large benefit to be had from this optimization, because the combination of these two patches virtually eliminates lock manager traffic on "pgbench -S" workloads. However, there are several flies in the ointment.
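To recap the general idea in toy form before getting to those (all names here are invented, the lock-manager calls are stand-in no-ops, and plenty of real bookkeeping is omitted; this is an illustration of the concept, not the patch's code): the holder merely advertises its VXID in its own shared-memory slot, and the heavyweight lock is created, on the holder's behalf, by the first backend that actually needs to wait for it.

/*
 * Toy illustration of "create the heavyweight lock only when somebody
 * actually needs to wait on it".  Invented names; not the patch's code.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct toy_vxid
{
    int         backend_id;
    uint32_t    local_xid;
} toy_vxid;

typedef struct toy_backend_slot
{
    atomic_flag mutex;              /* per-backend spinlock; init with ATOMIC_FLAG_INIT */
    bool        vxid_valid;         /* does this backend currently hold its VXID lock? */
    bool        lock_materialized;  /* has a waiter pushed it into the lock table? */
    toy_vxid    vxid;
} toy_backend_slot;

/* Stand-ins for the real lock manager; deliberately empty. */
static void heavyweight_lock_insert(toy_vxid v) { (void) v; }
static void heavyweight_lock_release(toy_vxid v) { (void) v; }
static void heavyweight_lock_wait(toy_vxid v) { (void) v; }

static void
slot_lock(toy_backend_slot *slot)
{
    while (atomic_flag_test_and_set_explicit(&slot->mutex, memory_order_acquire))
        /* spin */ ;
}

static void
slot_unlock(toy_backend_slot *slot)
{
    atomic_flag_clear_explicit(&slot->mutex, memory_order_release);
}

/* Holder: starting a transaction is just a couple of stores, no lock manager traffic. */
static void
vxid_lock_acquire_fast(toy_backend_slot *me, toy_vxid vxid)
{
    slot_lock(me);
    me->vxid = vxid;
    me->vxid_valid = true;
    me->lock_materialized = false;
    slot_unlock(me);
}

/* Waiter: only now does the heavyweight lock spring into existence. */
static void
vxid_lock_wait(toy_backend_slot *holder, toy_vxid vxid)
{
    slot_lock(holder);
    if (holder->vxid_valid &&
        holder->vxid.backend_id == vxid.backend_id &&
        holder->vxid.local_xid == vxid.local_xid)
    {
        if (!holder->lock_materialized)
        {
            heavyweight_lock_insert(vxid);      /* created on the holder's behalf */
            holder->lock_materialized = true;
        }
        slot_unlock(holder);
        heavyweight_lock_wait(vxid);            /* block until the holder releases it */
        return;
    }
    slot_unlock(holder);                        /* transaction already gone; nothing to wait for */
}

/* Holder, at transaction end: release the heavyweight lock only if someone created it. */
static void
vxid_lock_release(toy_backend_slot *me)
{
    bool        need_release;
    toy_vxid    v;

    slot_lock(me);
    me->vxid_valid = false;
    need_release = me->lock_materialized;
    me->lock_materialized = false;
    v = me->vxid;
    slot_unlock(me);

    if (need_release)
        heavyweight_lock_release(v);            /* drop the lock a waiter created for us */
}

The point is that the common path (begin transaction, do work, commit, with nobody ever waiting on the VXID) never touches the lock manager at all.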
1. It's a bit of a kludge. I leave it to readers of the patch to determine exactly what about it they find kludgey, but the set is probably not empty. I suspect that MyProc->fpLWLock needs to be renamed to something a bit more generic if we're going to use it like this, but I don't immediately know what to call it. Also, the mechanism whereby we take SInvalWriteLock to work out the mapping from BackendId to PGPROC * is not exactly awesome. I don't think it matters from a performance point of view, because operations that need VXID locks are sufficiently rare that the additional lwlock traffic won't matter a bit. However, we could avoid it altogether if we rejiggered the mechanism for allocating PGPROCs and backend IDs. Right now, we allocate PGPROCs off of linked lists, except for auxiliary processes, which allocate them by scanning a three-element array for an empty slot. Then, when the PGPROC subscribes to sinval, the sinval mechanism assigns a backend ID by scanning for the lowest unused one in the ProcState array. If we changed the logic for allocating PGPROCs to mimic what the sinval queue currently does, the backend ID could simply be defined as the offset into the PGPROC array, and translating between a backend ID and a PGPROC * would become a matter of pointer arithmetic. I'm not sure whether that's worth doing.

2. Bad things happen with large numbers of connections. This patch increases peak performance, but as the number of concurrent connections climbs past the number of CPU cores, performance drops off faster with the patch than without it. For example, on the 32-core loaner from Nate Boley, with 80 pgbench -S clients, unpatched HEAD runs at ~36K TPS; with fastlock it jumps to about ~99K TPS; with this patch also applied, it drops to about ~64K TPS, even though nearly all the contention on the lock manager locks has been eliminated. On Stefan Kaltenbrunner's 40-core box, he actually saw performance drop below unpatched HEAD with this patch applied! This is immensely counterintuitive. What is going on?

Profiling reveals that the system spends enormous amounts of CPU time in s_lock. LWLOCK_STATS reveals that the only lwlock with a significant amount of blocking is BufFreelistLock, but that doesn't explain the high CPU utilization. In fact, the problem appears to be with the LWLocks that are frequently acquired in *shared* mode. There is no actual lock conflict, but each LWLock is protected by a spinlock which must be acquired and released just to bump the shared-locker count. In HEAD, everything bottlenecks on the lock manager locks, so it's not really possible for enough traffic to build up on any single spinlock to have a serious impact on performance; the locks being sought there are exclusive, so when they are contended, processes simply get descheduled. But with the exclusive locks out of the way, everyone very quickly lines up to acquire shared buffer manager locks, buffer content locks, and so on, and large pile-ups ensue, leading to massive cache line contention and tons of CPU usage.

My initial thought was that this was contention over the root block of the index on the pgbench_accounts table and the buffer mapping lock protecting it, but instrumentation showed otherwise. I hacked up the system to report how often each lwlock's spinlock exceeded spins_per_delay.
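Before looking at the numbers, here is a minimal, self-contained sketch in plain C11 of the shared-acquisition pattern at issue (toy names again; this is not code from PostgreSQL or from either patch). Shared lockers never conflict with one another, yet every acquire and release must take the same spinlock just to adjust the shared-locker count, so on a many-core machine they all fight over a single cache line:

/*
 * Toy sketch: an "lwlock"-shaped structure whose shared-locker count is
 * protected by a single spinlock.  Initialize as:
 *     toy_lwlock lk = { ATOMIC_FLAG_INIT, 0, false };
 */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct toy_lwlock
{
    atomic_flag mutex;          /* spinlock protecting the fields below */
    int         shared;         /* number of shared holders */
    bool        exclusive;      /* true if an exclusive holder exists */
} toy_lwlock;

static void
toy_spin_acquire(toy_lwlock *lock)
{
    /* TAS loop: under contention, this is where the spinning happens */
    while (atomic_flag_test_and_set_explicit(&lock->mutex, memory_order_acquire))
        /* spin */ ;
}

static void
toy_spin_release(toy_lwlock *lock)
{
    atomic_flag_clear_explicit(&lock->mutex, memory_order_release);
}

static void
toy_lwlock_acquire_shared(toy_lwlock *lock)
{
    for (;;)
    {
        toy_spin_acquire(lock);     /* every shared locker bounces this cache line */
        if (!lock->exclusive)
        {
            lock->shared++;         /* the only real work done under the spinlock */
            toy_spin_release(lock);
            return;
        }
        toy_spin_release(lock);
        /* a real implementation would sleep on a wait queue here; we just retry */
    }
}

static void
toy_lwlock_release_shared(toy_lwlock *lock)
{
    toy_spin_acquire(lock);         /* and bounces it again on release */
    lock->shared--;
    toy_spin_release(lock);
}

There is no logical conflict anywhere in that path, but with dozens of cores calling the acquire and release functions at high frequency, the TAS loop on the single mutex is exactly the sort of spinning that shows up as s_lock time in the profile.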
The following is the end of a report showing the locks with the greatest amounts of excess spinning:

lwlock 0: shacq 0 exacq 191032 blk 42554 spin 272
lwlock 41: shacq 5982347 exacq 11937 blk 1825 spin 4217
lwlock 38: shacq 6443278 exacq 11960 blk 1726 spin 4440
lwlock 47: shacq 6106601 exacq 12096 blk 1555 spin 4497
lwlock 34: shacq 6423317 exacq 11896 blk 1863 spin 4776
lwlock 45: shacq 6455173 exacq 12052 blk 1825 spin 4926
lwlock 39: shacq 6867446 exacq 12067 blk 1899 spin 5071
lwlock 44: shacq 6824502 exacq 12040 blk 1655 spin 5153
lwlock 37: shacq 6727304 exacq 11935 blk 2077 spin 5252
lwlock 46: shacq 6862206 exacq 12017 blk 2046 spin 5352
lwlock 36: shacq 6854326 exacq 11920 blk 1914 spin 5441
lwlock 43: shacq 7184761 exacq 11874 blk 1863 spin 5625
lwlock 48: shacq 7612458 exacq 12109 blk 2029 spin 5780
lwlock 35: shacq 7150616 exacq 11916 blk 2026 spin 5782
lwlock 33: shacq 7536878 exacq 11985 blk 2105 spin 6273
lwlock 40: shacq 7199089 exacq 12068 blk 2305 spin 6290
lwlock 456: shacq 36258224 exacq 0 blk 0 spin 54264
lwlock 42: shacq 43012736 exacq 11851 blk 10675 spin 62017
lwlock 4: shacq 72516569 exacq 190 blk 196 spin 341914
lwlock 5: shacq 145042917 exacq 0 blk 0 spin 798891
grand total: shacq 544277977 exacq 181886079 blk 82371 spin 1338135

So the majority (60%) of the excess spinning appears to be due to SInvalReadLock, and a good chunk (25%) is due to ProcArrayLock. Everything else is peanuts by comparison, though I am guessing that third and fourth place (5% and 4%, respectively) are in fact the buffer mapping lock that covers the pgbench_accounts_pkey root index block, and the content lock on that buffer.

What is to be done?

The SInvalReadLock acquisitions are all attributable, I believe, to AcceptInvalidationMessages(), which is called in a number of places, but in particular after every heavyweight lock acquisition. I think we need a quick way to short-circuit the lock acquisition there when no work is to be done, which is to say nearly always. Indeed, Noah Misch just proposed something along these lines on another thread ("Make relation_openrv atomic wrt DDL"), though I think this data may cast a new light on the details.

I haven't tracked down where the ProcArrayLock acquisitions are coming from. The realistic possibilities appear to be TransactionIdIsInProgress(), TransactionIdIsActive(), GetOldestXmin(), and GetSnapshotData(). Nor do I have a clear idea what to do about them. The remaining candidates are mild by comparison, so I won't analyze them further for the moment.

Another way to attack the problem would be some more general mechanism to make shared-lwlock acquisition cheaper, such as having 3 or 4 shared-locker counts per lwlock, each with its own spinlock, so that at least in the case where there's no real lwlock contention the spin-waiters can spread out across all of them (sketched below). But I'm not sure it's really worth it, considering that only a handful of cases appear to be severely affected, and we probably need to see what happens when we fix some of those cases first. If throughput goes up, we're good. If the spinlock pile-up just shifts somewhere it's not so easily eliminated, we might need to either eliminate the problem cases one by one or come up with some more general mechanism after all.
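For concreteness, here is a rough standalone sketch of that last idea, in the same toy C11 style as the sketch earlier in this mail (same #includes; all names invented; not proposed code for PostgreSQL): each lock carries several shared-locker counts, each guarded by its own spinlock, so uncontended shared acquisitions spread across several cache lines instead of piling onto one.

/*
 * Toy sketch: striped shared-locker counts.  NUM_STRIPES, shared_stripe,
 * and striped_lwlock are invented names.
 */
#define NUM_STRIPES 4

typedef struct shared_stripe
{
    _Alignas(64) atomic_flag mutex; /* per-stripe spinlock, on its own cache line */
    int         shared;             /* shared holders counted in this stripe */
} shared_stripe;

typedef struct striped_lwlock
{
    shared_stripe stripes[NUM_STRIPES];
    bool        exclusive;
} striped_lwlock;

/* Shared acquire touches only one stripe, chosen from some per-backend value. */
static void
striped_acquire_shared(striped_lwlock *lock, int my_id)
{
    shared_stripe *s = &lock->stripes[my_id % NUM_STRIPES];

    for (;;)
    {
        while (atomic_flag_test_and_set_explicit(&s->mutex, memory_order_acquire))
            /* spin */ ;
        /* 'exclusive' only changes while every stripe mutex is held, including ours */
        if (!lock->exclusive)
        {
            s->shared++;
            atomic_flag_clear_explicit(&s->mutex, memory_order_release);
            return;
        }
        atomic_flag_clear_explicit(&s->mutex, memory_order_release);
        /* a real implementation would sleep here rather than retry */
    }
}

/* Exclusive acquisition pays the price: it must lock every stripe and find
 * all of their shared counts at zero.  (Release would clear 'exclusive'
 * under all stripe mutexes as well.) */
static bool
striped_try_acquire_exclusive(striped_lwlock *lock)
{
    bool        ok = true;
    int         i;

    for (i = 0; i < NUM_STRIPES; i++)
        while (atomic_flag_test_and_set_explicit(&lock->stripes[i].mutex, memory_order_acquire))
            /* spin */ ;

    for (i = 0; i < NUM_STRIPES; i++)
        if (lock->stripes[i].shared != 0)
            ok = false;
    if (lock->exclusive)
        ok = false;
    if (ok)
        lock->exclusive = true;

    for (i = 0; i < NUM_STRIPES; i++)
        atomic_flag_clear_explicit(&lock->stripes[i].mutex, memory_order_release);

    return ok;
}

Exclusive acquisition obviously gets more expensive, which may be tolerable for a lock like SInvalReadLock that, judging from the report above, is taken here almost entirely in shared mode; but that trade-off is part of why I'm not sure the added complexity is worth it.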
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: lazyvxid-v1.patch