Re: Lockless queue of waiters in LWLock
On Sat, 20 Jan 2024 at 07:28, vignesh C wrote:
> On Sat, 26 Nov 2022 at 00:24, Pavel Borisov wrote:
> > Hi, hackers!
> > In the measurements above in the thread, I've been using a LIFO wake
> > queue in the primary lockless patch (it was attached as the previous
> > versions of the patch) and an "inverted wake queue" (in fact FIFO) as
> > the alternative benchmarking option. I think using the latter is more
> > fair and natural, and provided they show no difference in speed, I'd
> > make the main patch use it (attached as v6). No other changes from
> > v5, though.
>
> There has not been much interest in this as the thread has been idle
> for more than a year now. I'm not sure if we should take it forward or
> not. I would prefer to close this in the current commitfest unless
> someone wants to take it further.

I have returned this patch in the commitfest as nobody had shown any
interest in pursuing it. Feel free to add a new entry when someone wants
to work on this more actively.

Regards,
Vignesh
Re: Lockless queue of waiters in LWLock
On Sat, 26 Nov 2022 at 00:24, Pavel Borisov wrote:
> Hi, hackers!
> In the measurements above in the thread, I've been using a LIFO wake
> queue in the primary lockless patch (it was attached as the previous
> versions of the patch) and an "inverted wake queue" (in fact FIFO) as
> the alternative benchmarking option. I think using the latter is more
> fair and natural, and provided they show no difference in speed, I'd
> make the main patch use it (attached as v6). No other changes from v5,
> though.

There has not been much interest in this as the thread has been idle for
more than a year now. I'm not sure if we should take it forward or not.
I would prefer to close this in the current commitfest unless someone
wants to take it further.

Regards,
Vignesh
Re: Lockless queue of waiters in LWLock
Hi, hackers!

In the measurements above in the thread, I've been using a LIFO wake
queue in the primary lockless patch (it was attached as the previous
versions of the patch) and an "inverted wake queue" (in fact FIFO) as
the alternative benchmarking option. I think using the latter is more
fair and natural, and provided they show no difference in speed, I'd
make the main patch use it (attached as v6). No other changes from v5,
though.

Regards,
Pavel.

v6-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data
Re: Lockless queue of waiters in LWLock
Hi, hackers!

Andres Freund recently committed his nice LWLock optimization
a4adc31f6902f6f. So I've rebased the patch on top of the current master
(PFA v5).

Regards,
Pavel Borisov, Supabase.

v5-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data
Re: Lockless queue of waiters in LWLock
Hi, hackers!

I've noticed that the alignment requirements for using
pg_atomic_init_u64() for the LWLock state (there's an assertion that it
is aligned to 8 bytes) are not met by the code in SimpleLruInit() on
32-bit architectures where MAXIMUM_ALIGNOF = 4. Fixed this in the v4
patch (PFA).

Regards,
Pavel.

v4-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data
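For illustration only (this is not the patch's actual fix): the constraint is that a pg_atomic_uint64 must sit at an 8-byte-aligned address even on platforms whose MAXIMUM_ALIGNOF is 4, so a chunk that is only guaranteed 4-byte alignment has to be over-allocated and rounded up before the 64-bit atomic state is placed in it. A minimal standalone sketch, with all names invented:

/*
 * Sketch of rounding a possibly 4-byte-aligned buffer up to the 8-byte
 * boundary that pg_atomic_init_u64()-style code asserts.  Not the
 * patch's code; the buffer here just stands in for a shared-memory
 * chunk laid out with MAXIMUM_ALIGNOF = 4.
 */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define U64_ALIGN 8

static void *
align_up(void *p, uintptr_t align)
{
    uintptr_t addr = (uintptr_t) p;

    return (void *) ((addr + align - 1) & ~(align - 1));
}

int
main(void)
{
    /* over-allocate so the rounded-up pointer still fits in the chunk */
    char       *raw = malloc(sizeof(uint64_t) + U64_ALIGN - 1);
    uint64_t   *state = align_up(raw, U64_ALIGN);

    /* this is the property the assertion in question checks */
    assert(((uintptr_t) state % U64_ALIGN) == 0);

    *state = 0;
    free(raw);
    return 0;
}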
Re: Lockless queue of waiters in LWLock
Hi, Pavel!

On Fri, Nov 11, 2022 at 2:40 PM Pavel Borisov wrote:
> I've done some more measurements to check the hypotheses regarding the
> performance of the previous patch v2, and explain the results of the
> tests in [1].
>
> The results below are the same (tps vs connections) plots as in [1],
> and the test is identical to the insert test in this thread [2].
> Additionally, in each case, there is a plot with results relative to
> Andres Freund's patch [3]. Log plots are good for seeing details in
> the range of 20-30 connections, but they somewhat hide the fact that
> the effect in the range of 500+ connections is much more significant
> overall, so I'd recommend looking at the linear plots as well.

Thank you for doing all the experiments! BTW, sometimes it's hard to
distinguish so many lines on a jpg picture. Could I ask you to post the
same graphs in png and also post the raw data in csv format?

> I'm also planning to do the same tests on an ARM server when a free
> one becomes available to me.
> Thoughts?

ARM tests would be great. We definitely need to check this on more than
just one architecture. Please check with and without LSE instructions.
They could lead to a dramatic speedup [1]. However, most precompiled
binaries are distributed without them, so both cases seem important to
me so far.

From what we have so far, I think we could try combining multiple
strategies to achieve the best result. 2x1ms is one of the leaders
before ~200 connections, and 1x1ms is one of the leaders after. We could
implement a simple heuristic to switch between 1 and 2 retries, similar
to what we have for spin delays. But let's have the ARM results first.

Links
1. https://akorotkov.github.io/blog/2021/04/30/arm/

--
Regards,
Alexander Korotkov
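As an illustration of the kind of switch-between-retry-counts heuristic Alexander mentions, here is a hypothetical sketch in C. None of these names exist in PostgreSQL; the 1024-acquisition window and the 10% threshold are arbitrary assumptions, chosen only to show the shape of the idea.

/*
 * Hypothetical sketch: pick 1 or 2 lock-acquisition retries based on how
 * often recent acquisitions had to sleep on the semaphore, loosely
 * modeled on how spin-delay tuning adapts.  Illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct RetryHeuristic
{
    int         cur_retries;    /* currently 1 or 2 */
    int         wins;           /* acquisitions that did not sleep */
    int         losses;         /* acquisitions that slept on the semaphore */
} RetryHeuristic;

void
retry_heuristic_report(RetryHeuristic *h, bool slept)
{
    if (slept)
        h->losses++;
    else
        h->wins++;

    /* Re-evaluate after every 1024 acquisitions (arbitrary window). */
    if (h->wins + h->losses >= 1024)
    {
        /*
         * If more than ~10% of acquisitions ended up sleeping, an extra
         * immediate retry is likely cheaper than a semaphore round trip;
         * otherwise keep the cheaper single attempt.
         */
        h->cur_retries = (h->losses * 10 > h->wins) ? 2 : 1;
        h->wins = h->losses = 0;
    }
}

int
main(void)
{
    RetryHeuristic h = {1, 0, 0};

    /* simulate a workload where ~20% of acquisitions had to sleep */
    for (int i = 0; i < 10000; i++)
        retry_heuristic_report(&h, (i % 5) == 0);

    printf("chosen number of retries: %d\n", h.cur_retries);    /* prints 2 */
    return 0;
}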
Re: Lockless queue of waiters in LWLock
CFbot isn't happy because of the additional patches in the previous
message, so here I attach the v3 patch alone.

Regards,
Pavel

v3-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data
Re: Lockless queue of waiters in LWLock
Hi,

On 2022-11-05 12:05:43 +0300, Alexander Korotkov wrote:
> On Fri, Nov 4, 2022 at 10:07 PM Andres Freund wrote:
> > The use of cmpxchg vs lock inc/lock add/xadd is one of the major
> > reasons why lwlocks are slower than a spinlock (but obviously are
> > better under contention nonetheless).
> >
> > I have a benchmark program that starts a thread for each physical
> > core and just increments a counter on an atomic value.
>
> Thank you for this insight! I didn't know xadd is much cheaper than
> cmpxchg unless there are retries.

The magnitude of the effect is somewhat surprising, I agree. Some
difference makes sense to me, but...

> I also wonder how cmpxchg becomes faster with higher concurrency.

If you're referring to the leading 32/64, that's not concurrency; those
are 32-bit/64-bit values. Sorry for not being clearer on that.

Greetings,

Andres Freund
Re: Lockless queue of waiters in LWLock
Hi, Andres!

On Fri, Nov 4, 2022 at 10:07 PM Andres Freund wrote:
> The use of cmpxchg vs lock inc/lock add/xadd is one of the major
> reasons why lwlocks are slower than a spinlock (but obviously are
> better under contention nonetheless).
>
> I have a benchmark program that starts a thread for each physical core
> and just increments a counter on an atomic value.

Thank you for this insight! I didn't know xadd is much cheaper than
cmpxchg unless there are retries. I also wonder how cmpxchg becomes
faster with higher concurrency.

--
Regards,
Alexander Korotkov
Re: Lockless queue of waiters in LWLock
Hi,

On 2022-11-03 14:50:11 +0400, Pavel Borisov wrote:
> Or maybe there is another explanation for the now-small performance
> difference around 20 connections described in [0]?
> Thoughts?

Using xadd is quite a bit cheaper than cmpxchg, and now every lock
release uses a compare-exchange, I think.

In the past I had a more complicated version of LWLockAcquire which
tried to use an xadd to acquire locks. IIRC (and this is long enough ago
that I might not) that proved to be a benefit, but I was worried about
the complexity, and just getting in the version that didn't always use a
spinlock was the higher priority.

The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons
why lwlocks are slower than a spinlock (but obviously are better under
contention nonetheless).

I have a benchmark program that starts a thread for each physical core
and just increments a counter on an atomic value. On my dual Xeon Gold
5215 workstation:

cmpxchg:
  32: throughput per thread: 0.55M/s, total: 11.02M/s
  64: throughput per thread: 0.63M/s, total: 12.68M/s
lock add:
  32: throughput per thread: 2.10M/s, total: 41.98M/s
  64: throughput per thread: 2.12M/s, total: 42.40M/s
xadd:
  32: throughput per thread: 2.10M/s, total: 41.91M/s
  64: throughput per thread: 2.04M/s, total: 40.71M/s

and even when there's no contention, with every thread just updating its
own cacheline:

cmpxchg:
  32: throughput per thread: 88.83M/s, total: 1776.51M/s
  64: throughput per thread: 96.46M/s, total: 1929.11M/s
lock add:
  32: throughput per thread: 166.07M/s, total: 3321.31M/s
  64: throughput per thread: 165.86M/s, total: 3317.22M/s
add (no lock):
  32: throughput per thread: 530.78M/s, total: 10615.62M/s
  64: throughput per thread: 531.22M/s, total: 10624.35M/s
xadd:
  32: throughput per thread: 165.88M/s, total: 3317.51M/s
  64: throughput per thread: 165.93M/s, total: 3318.53M/s

Greetings,

Andres Freund
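For readers who want to reproduce the flavor of this comparison, the following is a minimal sketch of such a microbenchmark using C11 atomics and pthreads. It is not Andres's actual program; timing and CPU pinning are omitted, and it only shows the two increment strategies being compared (a plain fetch-add versus a compare-exchange retry loop).

/*
 * Minimal sketch: one thread per core hammering a single shared atomic
 * counter.  Compile with e.g. "cc -O2 -pthread"; pass any argument to
 * use the fetch-add worker instead of the cmpxchg worker.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 8              /* set to the number of physical cores */
#define ITERS    10000000UL

static _Atomic uint64_t counter;

/* "xadd"/"lock add" style: a single locked read-modify-write. */
static void *
worker_fetch_add(void *arg)
{
    (void) arg;
    for (unsigned long i = 0; i < ITERS; i++)
        atomic_fetch_add(&counter, 1);
    return NULL;
}

/* "cmpxchg" style: read, compute, try to install, retry on conflict. */
static void *
worker_cmpxchg(void *arg)
{
    (void) arg;
    for (unsigned long i = 0; i < ITERS; i++)
    {
        uint64_t    old = atomic_load(&counter);

        while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
            ;                   /* 'old' is refreshed on each failure */
    }
    return NULL;
}

int
main(int argc, char **argv)
{
    pthread_t   tids[NTHREADS];
    void       *(*worker) (void *) =
        (argc > 1) ? worker_fetch_add : worker_cmpxchg;

    (void) argv;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);

    printf("final counter: %llu\n",
           (unsigned long long) atomic_load(&counter));
    return 0;
}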
Re: Lockless queue of waiters in LWLock
Hi, Pavel!

On Thu, Nov 3, 2022 at 1:51 PM Pavel Borisov wrote:
> > I'm planning to gather more detailed statistics on different
> > LWLockAcquire calls soon to understand prospects for further
> > optimizations.
>
> So, I've made more measurements.
>
> 1. Applied measuring patch 0001 to the patch with the lockless queue
> optimization (v2) from [0] in this thread and ran the same concurrent
> insert test described in [1] on 20 pgbench connections.
> The new results for the ProcArray lwlock are as follows:
>
> exacq 45132            // Overall number of exclusive locks taken
> ex_attempt[0] 20755    // Exclusive locks taken immediately
> ex_attempt[1] 18800    // Exclusive locks taken after one wait on the semaphore
> ex_attempt[2] 5577     // Exclusive locks taken after two or more waits on the semaphore
> shacq 494871           // .. same stats for shared locks
> sh_attempt[0] 463211   // ..
> sh_attempt[1] 29767    // ..
> sh_attempt[2] 1893     // .. same stats for shared locks
> sh_wake_calls 31070    // Number of calls to wake up shared waiters
> sh_wakes 36190         // Number of shared waiters woken up
> GroupClearXid 55300    // Number of calls of ProcArrayGroupClearXid
> EndTransactionInternal: 236193 // Number of calls of ProcArrayEndTransactionInternal
>
> 2. Applied measuring patch 0002 to Andres Freund's patch v3 from [2]
> and ran the same concurrent insert test described in [1] on 20 pgbench
> connections.
> The results for the ProcArray lwlock are as follows:
>
> exacq 49300            // Overall number of exclusive locks taken
> ex_attempt1[0] 18353   // Exclusive locks taken immediately by the first call of LWLockAttemptLock in the LWLockAcquire loop
> ex_attempt2[0] 18144   // Exclusive locks taken immediately by the second call of LWLockAttemptLock in the LWLockAcquire loop
> ex_attempt1[1] 9985    // Exclusive locks taken after one wait on the semaphore by the first call of LWLockAttemptLock in the LWLockAcquire loop
> ex_attempt2[1] 1838    // Exclusive locks taken after one wait on the semaphore by the second call of LWLockAttemptLock in the LWLockAcquire loop
> ex_attempt1[2] 823     // Exclusive locks taken after two or more waits on the semaphore by the first call of LWLockAttemptLock in the LWLockAcquire loop
> ex_attempt2[2] 157     // Exclusive locks taken after two or more waits on the semaphore by the second call of LWLockAttemptLock in the LWLockAcquire loop
> shacq 508131           // .. same stats for shared locks
> sh_attempt1[0] 469410  // ..
> sh_attempt2[0] 27858   // ..
> sh_attempt1[1] 10309   // ..
> sh_attempt2[1] 460     // ..
> sh_attempt1[2] 90      // ..
> sh_attempt2[2] 4       // .. same stats for shared locks
> dequeue self 48461     // Number of dequeue_self calls
> sh_wake_calls 27560    // Number of calls to wake up shared waiters
> sh_wakes 19408         // Number of shared waiters woken up
> GroupClearXid 65021    // Number of calls of ProcArrayGroupClearXid
> EndTransactionInternal: 249003 // Number of calls of ProcArrayEndTransactionInternal
>
> It seems that the two calls in each loop in Andres's (and master's)
> code help evade semaphore-waiting loops that may be relatively
> expensive. The probable reason for this is that the small delay between
> these two calls is sometimes enough for concurrent takers to free the
> spinlock for the queue modification. Could we get even more performance
> if we do three or more tries to take the lock in the queue? Will this
> degrade performance in some other cases?

Thank you for gathering the statistics. Let me do some relative analysis
of that.

*Lockless queue patch*
1. Shared lockers
1.1. 93.60% of them acquire the lock without sleeping on the semaphore
1.2. 6.02% of them acquire the lock after sleeping on the semaphore 1 time
1.3. 0.38% of them acquire the lock after sleeping on the semaphore 2 or more times
2. Exclusive lockers
2.1. 45.99% of them acquire the lock without sleeping on the semaphore
2.2. 41.66% of them acquire the lock after sleeping on the semaphore 1 time
2.3. 12.36% of them acquire the lock after sleeping on the semaphore 2 or more times
In general, about 10% of lockers sleep on the semaphore.

*Andres's patch*
1. Shared lockers
1.1. 97.86% of them acquire the lock without sleeping on the semaphore
     (92.38% do this immediately and 5.48% after queuing)
1.2. 2.12% of them acquire the lock after sleeping on the semaphore 1 time
     (2.03% do this immediately and 0.09% after queuing)
1.3. 0.02% of them acquire the lock after sleeping on the semaphore 2 or more times
     (0.02% do this immediately and 0.00% after queuing)
2. Exclusive lockers
2.1. 74.03% of them acquire the lock without sleeping on the semaphore
     (37.23% do this immediately and 36.80% after queuing)
2.2. 23.98% of them acquire the lock after sleeping on the semaphore 1 time
     (20.25% do this immediately and 3.73% after queuing)
2.3. 1.99% of them acquire the lock after sleeping on the semaphore 2 or more times
     (1.67% do this immediately and 0.32% after queuing)
In general, about 4% of lockers sleep on the semaphore.
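As a quick sanity check, the shared-locker figures for Andres's patch can be recomputed directly from the raw counters quoted above with a tiny throwaway program (not part of any patch):

/* Recompute two of the shared-locker figures for Andres's patch from
 * the raw counters quoted earlier (shacq, sh_attempt1[0], sh_attempt2[0]). */
#include <stdio.h>

int
main(void)
{
    double      shacq = 508131;     /* total shared acquisitions */
    double      imm = 469410;       /* sh_attempt1[0]: first attempt succeeded */
    double      queued = 27858;     /* sh_attempt2[0]: second attempt, after queuing */

    printf("immediately, no sleep: %.2f%%\n", 100.0 * imm / shacq);
    printf("no sleep overall:      %.2f%%\n", 100.0 * (imm + queued) / shacq);
    /* prints ~92.38% and ~97.86%, matching the analysis above */
    return 0;
}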
Re: Lockless queue of waiters in LWLock
Hi, hackers!

> I'm planning to gather more detailed statistics on different
> LWLockAcquire calls soon to understand prospects for further
> optimizations.

So, I've made more measurements.

1. Applied measuring patch 0001 to the patch with the lockless queue
optimization (v2) from [0] in this thread and ran the same concurrent
insert test described in [1] on 20 pgbench connections.
The new results for the ProcArray lwlock are as follows:

exacq 45132            // Overall number of exclusive locks taken
ex_attempt[0] 20755    // Exclusive locks taken immediately
ex_attempt[1] 18800    // Exclusive locks taken after one wait on the semaphore
ex_attempt[2] 5577     // Exclusive locks taken after two or more waits on the semaphore
shacq 494871           // .. same stats for shared locks
sh_attempt[0] 463211   // ..
sh_attempt[1] 29767    // ..
sh_attempt[2] 1893     // .. same stats for shared locks
sh_wake_calls 31070    // Number of calls to wake up shared waiters
sh_wakes 36190         // Number of shared waiters woken up
GroupClearXid 55300    // Number of calls of ProcArrayGroupClearXid
EndTransactionInternal: 236193 // Number of calls of ProcArrayEndTransactionInternal

2. Applied measuring patch 0002 to Andres Freund's patch v3 from [2]
and ran the same concurrent insert test described in [1] on 20 pgbench
connections.
The results for the ProcArray lwlock are as follows:

exacq 49300            // Overall number of exclusive locks taken
ex_attempt1[0] 18353   // Exclusive locks taken immediately by the first call of LWLockAttemptLock in the LWLockAcquire loop
ex_attempt2[0] 18144   // Exclusive locks taken immediately by the second call of LWLockAttemptLock in the LWLockAcquire loop
ex_attempt1[1] 9985    // Exclusive locks taken after one wait on the semaphore by the first call of LWLockAttemptLock in the LWLockAcquire loop
ex_attempt2[1] 1838    // Exclusive locks taken after one wait on the semaphore by the second call of LWLockAttemptLock in the LWLockAcquire loop
ex_attempt1[2] 823     // Exclusive locks taken after two or more waits on the semaphore by the first call of LWLockAttemptLock in the LWLockAcquire loop
ex_attempt2[2] 157     // Exclusive locks taken after two or more waits on the semaphore by the second call of LWLockAttemptLock in the LWLockAcquire loop
shacq 508131           // .. same stats for shared locks
sh_attempt1[0] 469410  // ..
sh_attempt2[0] 27858   // ..
sh_attempt1[1] 10309   // ..
sh_attempt2[1] 460     // ..
sh_attempt1[2] 90      // ..
sh_attempt2[2] 4       // .. same stats for shared locks
dequeue self 48461     // Number of dequeue_self calls
sh_wake_calls 27560    // Number of calls to wake up shared waiters
sh_wakes 19408         // Number of shared waiters woken up
GroupClearXid 65021    // Number of calls of ProcArrayGroupClearXid
EndTransactionInternal: 249003 // Number of calls of ProcArrayEndTransactionInternal

It seems that the two calls in each loop in Andres's (and master's) code
help evade semaphore-waiting loops that may be relatively expensive. The
probable reason for this is that the small delay between these two calls
is sometimes enough for concurrent takers to free the spinlock for the
queue modification. Could we get even more performance if we do three or
more tries to take the lock in the queue? Will this degrade performance
in some other cases?

Or maybe there is another explanation for the now-small performance
difference around 20 connections described in [0]?
Thoughts?

Regards,
Pavel Borisov

[0] https://www.postgresql.org/message-id/CALT9ZEF7q%2BSarz1MjrX-fM7OsoU7CK16%3DONoGCOkY3Efj%2BGrnw%40mail.gmail.com
[1] https://www.postgresql.org/message-id/CALT9ZEEz%2B%3DNepc5eti6x531q64Z6%2BDxtP3h-h_8O5HDdtkJcPw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20221031235114.ftjkife57zil7ryw%40awork3.anarazel.de

0001-Print-extended-lwlock_stats-and-proc_stats-on-CAS-re.patch
Description: Binary data
0002-Print-extended-lwlock_stats-and-proc_stats.patch
Description: Binary data
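For readers unfamiliar with the double-attempt pattern counted by the ex_attempt1[]/ex_attempt2[] statistics above, here is a heavily simplified sketch of that structure. It is illustration only: the declared primitives are stand-in declarations rather than PostgreSQL's actual signatures, the real LWLockAcquire carries considerably more bookkeeping (wait-event reporting, error recovery, and so on), and the fragment compiles as a translation unit but is not a complete program.

/*
 * Simplified sketch of the double-attempt structure: try the lock,
 * queue ourselves, try again, and only sleep on the semaphore if the
 * second attempt also fails.  The second attempt is what the
 * ex_attempt2[]/sh_attempt2[] counters above are counting.
 */
#include <stdbool.h>

/* Hypothetical stand-ins for the real primitives. */
typedef struct LWLock LWLock;
typedef enum { LW_SHARED, LW_EXCLUSIVE } LWLockMode;

extern bool LWLockAttemptLock(LWLock *lock, LWLockMode mode);
extern void LWLockQueueSelf(LWLock *lock, LWLockMode mode);
extern void LWLockDequeueSelf(LWLock *lock);
extern void WaitOnSemaphore(void);

void
lwlock_acquire_sketch(LWLock *lock, LWLockMode mode)
{
    for (;;)
    {
        /* First attempt: the common, uncontended path. */
        if (LWLockAttemptLock(lock, mode))
            return;

        /* Put ourselves on the wait queue before re-checking. */
        LWLockQueueSelf(lock, mode);

        /*
         * Second attempt: the lock may have been released while we were
         * queuing.  If it succeeds, we must undo the queuing.
         */
        if (LWLockAttemptLock(lock, mode))
        {
            LWLockDequeueSelf(lock);
            return;
        }

        /* Both attempts failed: sleep until the releaser wakes us. */
        WaitOnSemaphore();
    }
}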
Re: Lockless queue of waiters in LWLock
Hi Andres,

Thank you for your feedback.

On Tue, Nov 1, 2022 at 2:15 AM Andres Freund wrote:
> On 2022-10-31 14:38:23 +0400, Pavel Borisov wrote:
> > If the lock state contains references to the queue head and tail, we
> > can implement a lockless queue of waiters for the LWLock. Adding new
> > items to the queue head or tail can be done with a single CAS
> > operation (adding to the tail will also require further setting the
> > reference from the previous tail). Given that there can be only one
> > lock releaser, which wakes up waiters in the queue, we can handle all
> > the concurrency issues with reasonable complexity.
>
> Right now waiter wakeups happen *after* the lock is released.

That makes sense. The patch makes the locker hold the lock while it's
processing the queue. So, the lock is held for a longer time.

> I suspect that is at least part of the reason for the regression you're
> seeing. It also looks like you're going to need a substantially higher
> number of atomic operations - right now the queue processing needs O(1)
> atomic ops, but your approach ends up with O(waiters) atomic ops.

Hmm... In the patch, queue processing calls CAS once after processing
the queue. There could be retries to process the queue parts that were
added concurrently, but I doubt it ends up with O(waiters) atomic ops.
Pavel, I think we could gather some statistics to check how many retries
we have on average.

> I suspect it might be worth going halfway, i.e. put the list head/tail
> in the atomic variable, but process the list with a lock held, after
> the lock is released.

Good idea. We only allow one locker at a time to process the queue
anyway.

--
Regards,
Alexander Korotkov
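A minimal sketch of the "halfway" idea being agreed on here, using C11 atomics: waiters push themselves onto a list whose head lives in a single atomic variable, and the sole releaser detaches the whole list with one exchange after the lock itself has been dropped, then walks it without further atomics. All names are invented for illustration; this is neither the patch's nor PostgreSQL's code. Note that, as written, it wakes waiters in LIFO order, which touches the LIFO-versus-FIFO wake-order question that also comes up in this thread.

/*
 * Sketch: lock-free push of waiters, single-exchange detach by the one
 * releaser, wakeups performed after the detach with no further atomics.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

typedef struct Waiter
{
    struct Waiter *next;
    /* ... semaphore, requested mode, etc. would live here ... */
} Waiter;

static _Atomic(Waiter *) wait_list = NULL;

/* Called by a backend that failed to get the lock: O(1) atomics. */
static void
enqueue_self(Waiter *me)
{
    Waiter     *head = atomic_load(&wait_list);

    do
        me->next = head;
    while (!atomic_compare_exchange_weak(&wait_list, &head, me));
}

/* Called by the releaser after the lock itself has been released. */
static void
wake_waiters(void (*wake) (Waiter *))
{
    /* Detach the whole list in one atomic operation. */
    Waiter     *w = atomic_exchange(&wait_list, NULL);

    /* The detached list is private now; no atomics needed to walk it. */
    while (w != NULL)
    {
        Waiter     *next = w->next;

        wake(w);
        w = next;
    }
}

static void
print_wake(Waiter *w)
{
    printf("waking %p\n", (void *) w);
}

int
main(void)
{
    Waiter      a = {NULL}, b = {NULL};

    enqueue_self(&a);
    enqueue_self(&b);
    wake_waiters(print_wake);   /* wakes b, then a: LIFO order */
    return 0;
}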
Re: Lockless queue of waiters in LWLock
Hi,

Thanks for working on this - I think it's something we need to improve.

On 2022-10-31 14:38:23 +0400, Pavel Borisov wrote:
> If the lock state contains references to the queue head and tail, we
> can implement a lockless queue of waiters for the LWLock. Adding new
> items to the queue head or tail can be done with a single CAS operation
> (adding to the tail will also require further setting the reference
> from the previous tail). Given that there can be only one lock
> releaser, which wakes up waiters in the queue, we can handle all the
> concurrency issues with reasonable complexity.

Right now waiter wakeups happen *after* the lock is released. I suspect
that is at least part of the reason for the regression you're seeing. It
also looks like you're going to need a substantially higher number of
atomic operations - right now the queue processing needs O(1) atomic
ops, but your approach ends up with O(waiters) atomic ops.

I suspect it might be worth going halfway, i.e. put the list head/tail
in the atomic variable, but process the list with a lock held, after the
lock is released.

> Removing the queue spinlock bit and the corresponding contention should
> give a performance gain in highly concurrent LWLock usage scenarios.
> Currently, the maximum size of the procarray is 2^18 (as stated in
> buf_internals.h), so with the use of a 64-bit LWLock state variable, we
> can store the procarray indexes for both the head and the tail of the
> queue, all the existing machinery, and even have some spare bits for
> future flags.

Another advantage: it shrinks the LWLock struct, which makes it cheaper
to make locks more granular...

> I cannot understand why the performance of the lockless queue patch has
> a minor regression in the region of around 20 connections, even when
> compared to the current master branch.

I suspect that's the point where the additional atomic operations hurt
the most, while not yet providing a substantial gain.

Greetings,

Andres Freund
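To make the quoted layout concrete, here is a small illustration of packing the queue head and tail procarray indexes, plus spare flag bits, into one 64-bit state word, together with a single-CAS push onto the queue head. The field widths, names, and sentinel value are assumptions made for this sketch; the patch's actual definitions may differ, and the tail would be handled symmetrically.

/*
 * Illustration: one 64-bit state word holding the waiter-queue head and
 * tail as procarray indexes (2^18 possible entries fit comfortably in
 * 20 bits each), leaving the upper bits for lock flags and future use.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_IDX_BITS   20
#define QUEUE_IDX_MASK   ((UINT64_C(1) << QUEUE_IDX_BITS) - 1)
#define QUEUE_IDX_NONE   QUEUE_IDX_MASK         /* sentinel: empty queue */

#define STATE_HEAD_SHIFT 0
#define STATE_TAIL_SHIFT QUEUE_IDX_BITS         /* bits 40..63 stay free */

static inline uint64_t
state_get_head(uint64_t state)
{
    return (state >> STATE_HEAD_SHIFT) & QUEUE_IDX_MASK;
}

static inline uint64_t
state_set_head(uint64_t state, uint64_t idx)
{
    state &= ~(QUEUE_IDX_MASK << STATE_HEAD_SHIFT);
    return state | (idx << STATE_HEAD_SHIFT);
}

/*
 * Add ourselves to the queue head with a single CAS, as the proposal
 * describes.  On success the caller would also remember the previous
 * head (state_get_head(old)) as its "next" link; on failure it simply
 * retries with the refreshed state.
 */
static bool
push_head(_Atomic uint64_t *state, uint64_t my_proc_idx)
{
    uint64_t    old = atomic_load(state);
    uint64_t    newstate = state_set_head(old, my_proc_idx);

    return atomic_compare_exchange_strong(state, &old, newstate);
}

int
main(void)
{
    /* start with an empty queue: both head and tail set to the sentinel */
    _Atomic uint64_t state =
        (QUEUE_IDX_NONE << STATE_TAIL_SHIFT) | QUEUE_IDX_NONE;
    bool        ok = push_head(&state, 42);

    assert(ok);
    assert(state_get_head(atomic_load(&state)) == 42);
    (void) ok;
    return 0;
}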