On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <f...@phlo.org> wrote:
> On Jun 12, 2011, at 23:39, Robert Haas wrote:
>> So, the majority (60%) of the excess spinning appears to be due to
>> SInvalReadLock. A good chunk is due to ProcArrayLock (25%).
>
> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
> so I guess that two adjacent LWLocks currently share one cache line.
>
> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
> index 5, so if I'm not mistaken, exactly the two locks where you saw
> the largest contention are on the same cache line...
>
> Might make sense to try and see if these numbers change if you
> either make LWLockPadded 64 bytes or arrange the locks differently...
I fooled around with this a while back and saw no benefit. It's possible a more careful test would turn up something, but I think the only real way forward here is going to be to eliminate some of that locking altogether.

SInvalReadLock is a good example of a lock which seems barely to be necessary at all. Except on extraordinarily rare occasions, it is only ever taken in share mode. Only SICleanupQueue() needs to lock out readers, and in the (probably common) case where all backends are reasonably close to being caught up, it wouldn't even matter if the determination of minMsgNum were a little bit fuzzy. I've been straining my brain trying to figure out whether there's some way to rewrite this logic so that it can run concurrently with readers; then we could get rid of SInvalReadLock altogether. I have a feeling if I were 25% smarter I could do it, but I'm not.

So my fallback position is to try to find some way to optimize the common case where there are no messages to be read. We're already allowing readers and writers to run in parallel, so it must not be important for a reader to see a message that a writer is in the process of writing, but has not yet finished writing. I believe the idea here is that sinval messages should be emitted before releasing the related locks, so if we've successfully acquired a lock on an object, then any pertinent sinval messages must already be completely written.

What I'm thinking we could do is just keep a global counter indicating how many messages have been written, and each backend could keep a counter indicating the highest-numbered message it has so far read. When a message is written, the global counter is bumped. When a backend goes to AcceptInvalidationMessages(), it compares the global counter to its own counter, and if they're the same, it doesn't bother taking SInvalReadLock and just exits quickly. There is an obvious (potential) memory-ordering problem here.
If it's possible for a backend doing AcceptInvalidationMessages() to see a stale version of the counter, then it might fail to read messages from the queue that it really ought to have seen. Protecting the global counter with a spinlock defeats the purpose of doing any of this in the first place, because the whole problem is too many people fighting over the spinlock protecting SInvalReadLock. It should be sufficient, though, for the writing backend to execute a memory-barrier operation just after bumping the global counter and for the reading backend to execute one just before. As it happens, LWLockRelease() is such a barrier, since it must acquire and release a spinlock, so we should be able to just bump the counter right before releasing the LWLock and call it good.

On the reading side, the immediately preceding lock acquisition seems like it ought to be sufficient - surely it can't be possible to acquire a heavyweight lock on an object without seeing memory updates that some other process made before releasing the same heavyweight lock.

A possible objection here is that a 32-bit counter might wrap around, giving a false negative, and a read from a 64-bit counter might not be atomic. In practice, the chances of a problem here seem remote, but I think we can guard against it easily enough. We can put each backend's counter of messages read into shared memory, protected by a per-backend lock (probably the same LWLock the fast relation lock patch already introduces). When a backend compares its counter to the global counter, it does so while holding this lock. When SICleanupQueue() resets a backend, it grabs the lock and sets that backend's counter to a special value (like 0) that we guarantee will never match the global counter (which we can cause to wrap from 2^32-1 to 1). So then AcceptInvalidationMessages() - in the common case where there are no invalidation messages to process - will only need the per-backend LWLock rather than fighting over SInvalReadLock.
If anyone has read this far, you have my undying gratitude... especially if you respond with comments.

ProcArrayLock looks like a tougher nut to crack - there's simply no way, with the system we have right now, that you can take a snapshot without locking the list of running processes. I'm not sure what to do about that, but we're probably going to have to come up with something, because it seems clear that once we eliminate the lock manager LWLock contention, this is a major bottleneck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers