Hi,

Thanks for working on this - I think it's something we need to improve.

On 2022-10-31 14:38:23 +0400, Pavel Borisov wrote:
> If the lock state contains references to the queue head and tail, we can
> implement a lockless queue of waiters for the LWLock. Adding new items to
> the queue head or tail can be done with a single CAS operation (adding to
> the tail will also require further setting the reference from the previous
> tail). Given that there could be only one lock releaser, which wakes up
> waiters in the queue, we can handle all the concurrency issues with
> reasonable complexity.

Right now waiter wakeups happen *after* the lock is released. I suspect
that is at least part of the reason for the regression you're seeing. It
also looks like you're going to need a substantially higher number of
atomic operations - right now the queue processing needs O(1) atomic ops,
but your approach ends up with O(waiters) atomic ops.

I suspect it might be worth going halfway, i.e. put the list head/tail in
the atomic variable, but process the list with a lock held, after the lock
is released.
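To make the O(waiters) point concrete: with the head/tail indexes packed
into the state word, the release path has to unlink each waiter with its
own CAS on that word. A rough, untested sketch - all names invented, index
0x3FFFF assumed reserved to mean "no waiter", tail maintenance and memory
ordering ignored:

    #include <stdatomic.h>
    #include <stdint.h>

    #define HEAD_MASK    UINT64_C(0x3FFFF)  /* 18-bit head index */
    #define WAITER_NONE  0x3FFFFu           /* reserved "empty" index */

    typedef struct WaiterSlot
    {
        _Atomic uint32_t next;  /* procarray index of the next waiter */
        /* ... semaphore used to wake the owning process ... */
    } WaiterSlot;

    extern WaiterSlot waiter_slots[];       /* one per procarray entry */
    extern void wake_process(uint32_t idx); /* posts its semaphore */

    static void
    release_wakeup(_Atomic uint64_t *state)
    {
        uint64_t    s = atomic_load(state);

        for (;;)
        {
            uint32_t    head = (uint32_t) (s & HEAD_MASK);
            uint64_t    new_s;

            if (head == WAITER_NONE)
                break;

            /* make the next waiter the new queue head */
            new_s = (s & ~HEAD_MASK) |
                (uint64_t) atomic_load(&waiter_slots[head].next);

            /* one CAS per waiter just to unlink it */
            if (atomic_compare_exchange_weak(state, &s, new_s))
            {
                wake_process(head);
                s = new_s;
            }
            /* on failure the CAS reloaded s; retry */
        }
    }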
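Whereas the halfway version keeps the lock-free enqueue (and the smaller
struct), but lets the releaser splice the whole list out under an ordinary
wait-list lock once the lock itself is already dropped - back to O(1)
atomics. Equally untested, reusing the invented declarations above and
still ignoring the tail index:

    #define LOCK_HELD    (UINT64_C(1) << 63)    /* made-up lock bit */

    extern void spin_lock(volatile int *l);
    extern void spin_unlock(volatile int *l);

    static volatile int waitlist_lock;          /* not in the state word */

    static void
    lw_release(_Atomic uint64_t *state)
    {
        uint64_t    old;
        uint64_t    s;
        uint32_t    head;

        /* drop the lock itself first - a single atomic op */
        old = atomic_fetch_and(state, ~LOCK_HELD);
        if ((old & HEAD_MASK) == WAITER_NONE)
            return;                     /* fast path: no waiters */

        /* Waiters still enqueue with a CAS on *state; only the releaser
         * takes this lock, to detach the whole queue in one go. */
        spin_lock(&waitlist_lock);
        s = atomic_load(state);
        while (!atomic_compare_exchange_weak(state, &s,
                                             (s & ~HEAD_MASK) | WAITER_NONE))
            ;                           /* one CAS, retried only on races */
        head = (uint32_t) (s & HEAD_MASK);
        spin_unlock(&waitlist_lock);

        /* wake the detached waiters; no further atomics on the state */
        while (head != WAITER_NONE)
        {
            uint32_t    next = atomic_load(&waiter_slots[head].next);

            wake_process(head);
            head = next;
        }
    }

Obviously the devil is in the races between the splice and a concurrent
enqueuer updating the tail, but it's the atomic-op count on the release
path that matters for the regression.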
> Removing the queue spinlock bit and the corresponding contention should
> give the performance gain on high concurrent LWLock usage scenarios.
> Currently, the maximum size of procarray is 2^18 (as stated in
> buf_internals.h), so with the use of the 64-bit LWLock state variable, we
> can store the procarray indexes for both head and tail of the queue, all
> the existing machinery, and even have some spare bits for future flags.

Another advantage: It shrinks the LWLock struct, which makes it cheaper to
make locks more granular...

> I can not understand why the performance of a lockless queue patch has a
> minor regression in the region of around 20 connections, even when
> compared to the current master branch.

I suspect that's the point where the additional atomic operations hurt the
most, while not yet providing a substantial gain.

Greetings,

Andres Freund