On Jul 7, 2011, at 03:35, Robert Haas wrote:
> Some poking around suggests that the problem isn't that spinlocks are
> routinely contended - it seems that we nearly always get the spinlock
> right off the bat. I'm wondering if the problem may be not so much that
> we have continuous spinlock contention, but rather that every once in a
> while a process gets time-sliced out while it holds a spinlock. If it's
> an important spinlock (like the one protecting SInvalReadLock), the
> system will quickly evolve into a state where every single processor is
> doing nothing but trying to get that spinlock. Even after the hapless
> lock-holder gets to run again and lets go of the lock, you have a whole
> pile of other backends who are sitting there firing off lock xchgb in a
> tight loop, and they can only get it one at a time, so you have
> ferocious cache line contention until the backlog clears.
Pondering this some more, I came up with another idea, pretty much
orthogonal to the shared counter partitioning I posted a patch for.

If the problem indeed isn't spin lock contention, but rather losing
control of the CPU while holding a spin lock, then why not try to get
rid of the spin lock entirely? On Linux, that's easy - this is exactly
what futexes are about. But it occurred to me that kernel support isn't
actually needed to do that - futexes can effectively be emulated in
user-space using just a semaphore, and without doing a syscall except
when there's contention.

The algorithm is quite straightforward, if one assumes a lock-free
implementation of a queue (more on that below).

LockAcquire:
  (1) CAS the lock state to increment the reader count or set the
      exclusive bit, in a loop, while the lock looks free. If
      successful, we're done; otherwise we continue with (2).
  (2) Add ourselves to the wait queue.
      [ Since we've added ourselves to the queue, we *must* now
        decrement the semaphore no matter what, to keep the
        increment/decrement calls balanced. We're careful to maintain
        that invariant below. ]
  (3) Fence (AKA issue a full memory barrier).
  (4) Re-check whether the lock still looks taken. If it does, we
      decrement the semaphore (PGSemaphoreLock), and (upon wake-up)
      restart at (1). Otherwise, continue with (5).
  (5) Remove the first waiter from the queue and increment her
      semaphore. Rinse and repeat until we have either removed
      ourselves from the queue or the queue is empty.
  (6) Decrement our semaphore.
      [ (6) is necessary in the general case, see the remark below (2).
        But we can of course detect the case where we'd increment our
        own semaphore in (5) only to decrement it again in (6), and
        skip both operations. ]

LockRelease:
  (1) Set the lock state to 0, i.e. release the lock.
  (2) Fence (AKA issue a full memory barrier).
  (3) If the lock still looks free, remove the first waiter from the
      queue and increment her semaphore. Rinse and repeat while the
      lock looks free and the queue is non-empty.
      [ From a correctness POV, we only have to wake up one waiter
        here, and that only if there isn't one who has been woken up
        but hasn't yet retried to take the lock. In reality, the
        decision whether and how many waiters to wake up would depend
        on their desired lock level, and some variant of what we
        currently call releaseOK. ]

The fencing steps (LockAcquire:3) and (LockRelease:2) guarantee that
if we block in LockAcquire(), a lock holder will see our queue entry
and thus will wake us up eventually. Because we use a semaphore and
not, say, a simple signal, we don't have to worry about the precise
ordering of block and unblock operations - we just need to ensure
they're balanced.

Now on to the enqueue() and dequeue() primitives that the algorithm
above depends on. There are multiple published algorithms for
lock-free queues. Some googling turned up

  "An optimistic approach to lock-free FIFO queues",
  E. Ladan-Mozes, N. Shavit, DOI 10.1007/s00446-007-0050-0

and

  "CAS-Based Lock-Free Algorithm for Shared Deques",
  M.M. Michael

The second one looks promising, since it only requires a single CAS to
enqueue and dequeue entries in the common case. Thus, it should be
just as efficient as our current spin-lock-based queue in the
uncontended case (and much better in the contended case, of course).

[ Our case is, in fact, simpler than the generic setting that these
  algorithms support. We only ever enqueue our *own* proc array entry
  (so allocation and re-use of queue entries isn't much of an issue),
  and we always *process* the queue after enqueuing something - either
  directly in LockAcquire or later in LockRelease. We thus don't
  really need to support concurrent removal of queue entries, but
  might get away with simply skipping queue processing if we detect
  that somebody else is in the process of doing so. I think I have an
  idea for a simpler lock-less queue that fits our needs, but I
  haven't yet ironed out all the details (as it turns out, removing
  the very last entry from the queue is the hard case). ]
Thoughts, comments and criticism are all very welcome!

best regards,
Florian Pflug

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers