Re: Lockless queue of waiters in LWLock

2024-01-26 Thread vignesh C
On Sat, 20 Jan 2024 at 07:28, vignesh C  wrote:
>
> On Sat, 26 Nov 2022 at 00:24, Pavel Borisov  wrote:
> >
> > Hi, hackers!
> > In the measurements above in the thread, I've been using a LIFO wake
> > queue in the primary lockless patch (attached as the previous versions
> > of the patch) and an "inverted wake queue" (in fact FIFO) as the
> > alternative benchmarking option. I think using the latter is fairer and
> > more natural, and since they show no difference in speed, I've made the
> > main patch use it (attached as v6). No other changes from v5, though.
>
> There has not been much interest in this, as the thread has been idle
> for more than a year now. I'm not sure if we should take it forward or
> not. I would prefer to close this in the current commitfest unless
> someone wants to take it further.

I have returned this patch in the commitfest, as nobody has shown any
interest in pursuing it. Feel free to add a new entry when someone
wants to work on it more actively.

Regards,
Vignesh




Re: Lockless queue of waiters in LWLock

2024-01-19 Thread vignesh C
On Sat, 26 Nov 2022 at 00:24, Pavel Borisov  wrote:
>
> Hi, hackers!
> In the measurements above in the thread, I've been using a LIFO wake
> queue in the primary lockless patch (attached as the previous versions
> of the patch) and an "inverted wake queue" (in fact FIFO) as the
> alternative benchmarking option. I think using the latter is fairer and
> more natural, and since they show no difference in speed, I've made the
> main patch use it (attached as v6). No other changes from v5, though.

There has not been much interest in this, as the thread has been idle
for more than a year now. I'm not sure if we should take it forward or
not. I would prefer to close this in the current commitfest unless
someone wants to take it further.

Regards,
Vignesh




Re: Lockless queue of waiters in LWLock

2022-11-25 Thread Pavel Borisov
Hi, hackers!
In the measurements above in the thread, I've been using a LIFO wake
queue in the primary lockless patch (attached as the previous versions
of the patch) and an "inverted wake queue" (in fact FIFO) as the
alternative benchmarking option. I think using the latter is fairer and
more natural, and since they show no difference in speed, I've made the
main patch use it (attached as v6). No other changes from v5, though.

Regards,
Pavel.


v6-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data


Re: Lockless queue of waiters in LWLock

2022-11-24 Thread Pavel Borisov
Hi, hackers!
Andres Freund recently committed his nice LWLock optimization
a4adc31f6902f6f. So I've rebased the patch on top of the current
master (PFA v5).

Regards,
Pavel Borisov,
Supabase.


v5-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data


Re: Lockless queue of waiters in LWLock

2022-11-17 Thread Pavel Borisov
Hi, hackers!

I've noticed that the alignment requirement for using
pg_atomic_init_u64() for the LWLock state (there's an assertion that it's
aligned at 8 bytes) is not satisfied by the code in SimpleLruInit()
on 32-bit architectures where MAXIMUM_ALIGNOF = 4.
Fixed this in the v4 patch (PFA).
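
As an illustration of the kind of adjustment this needs (a sketch only,
not necessarily what the v4 patch does): when the shared-memory layout
only guarantees MAXIMUM_ALIGNOF (4 bytes in this case), one can reserve
a few extra bytes and round the address up to an 8-byte boundary before
placing the 64-bit atomic state there.

#include <stdint.h>

/* Round a shared-memory pointer up to an 8-byte boundary so that a
 * 64-bit atomic placed there satisfies pg_atomic_init_u64()'s
 * alignment assertion even when MAXIMUM_ALIGNOF is 4. */
static inline void *
align_up_8(void *ptr)
{
    uintptr_t   p = (uintptr_t) ptr;

    return (void *) ((p + 7) & ~(uintptr_t) 7);
}

/* Usage sketch: request sizeof(<state>) + 8 bytes from the allocator,
 * then place the atomic at align_up_8(raw_chunk). */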

Regards,
Pavel.


v4-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data


Re: Lockless queue of waiters in LWLock

2022-11-11 Thread Alexander Korotkov
Hi, Pavel!

On Fri, Nov 11, 2022 at 2:40 PM Pavel Borisov  wrote:
> I've done some more measurements to check the hypotheses regarding the
> performance of the previous patch (v2) and to explain the results of the
> tests in [1].
>
> The results below are the same (tps vs connections) plots as in [1],
> and the test is identical to the insert test in this thread [2].
> Additionally, in each case, there is a plot with results relative to
> Andres Freund's patch [3]. Log plots are good for seeing details in
> the range of 20-30 connections, but they somewhat hide the fact that
> the effect in the range of 500+ connections is much more significant
> overall, so I'd recommend looking at the linear plots as well.

Thank you for doing all the experiments!

BTW, sometimes it's hard to distinguish so many lines in a jpg
picture.  Could I ask you to post the same graphs in png and also post
the raw data in csv format?

> I'm also planning to do the same tests on an ARM server when the free
> one comes available to me.
> Thoughts?

ARM tests would be great.  We definitely need to check this on more
than just one architecture.  Please check with and without LSE
instructions; they can lead to a dramatic speedup [1], although most
precompiled binaries are distributed without them.  So, both cases seem
important to me so far.

From what we have so far, I think we could try combining the multiple
strategies to achieve the best result.  2x1ms is one of the leaders
before ~200 connections, and 1x1ms is one of the leaders after.  We
could implement a simple heuristic to switch between 1 and 2 retries,
similar to what we have for spin delays.  But let's have the ARM results
first.
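
To sketch what such a heuristic could look like (all names below are
hypothetical; the idea is loosely modeled on the spins_per_delay
adaptation in s_lock.c):

#include <stdbool.h>

static int  lwlock_retry_budget = 2;    /* lock attempts per pass: 1 or 2 */
static int  second_try_successes = 0;   /* how often the 2nd attempt got the lock */
static int  acquire_samples = 0;

/* Call once per acquisition, reporting whether the second attempt (if
 * made) succeeded.  Every 1000 acquisitions, keep the extra attempt
 * only if it pays off often enough. */
static inline void
adapt_retry_budget(bool second_attempt_succeeded)
{
    if (second_attempt_succeeded)
        second_try_successes++;

    if (++acquire_samples >= 1000)
    {
        lwlock_retry_budget = (second_try_successes >= 50) ? 2 : 1;
        second_try_successes = 0;
        acquire_samples = 0;
    }
}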


Links
1. https://akorotkov.github.io/blog/2021/04/30/arm/

--
Regards,
Alexander Korotkov




Re: Lockless queue of waiters in LWLock

2022-11-11 Thread Pavel Borisov
CFbot isn't happy because of the additional patches in the previous
message, so here I attach the v3 patch alone.

Regards,
Pavel


v3-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data


Re: Lockless queue of waiters in LWLock

2022-11-05 Thread Andres Freund
Hi,

On 2022-11-05 12:05:43 +0300, Alexander Korotkov wrote:
> On Fri, Nov 4, 2022 at 10:07 PM Andres Freund  wrote:
> > The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons
> > why lwlocks are slower than a spinlock (but obviously are better under
> > contention nonetheless).
> >
> >
> > I have a benchmark program that starts a thread for each physical core and
> > just increments a counter on an atomic value.
> 
> Thank you for this insight!  I didn't know xadd is much cheaper than
> cmpxchg unless there are retries.

The magnitude of the effect is somewhat surprising, I agree. Some difference
makes sense to me, but...


> I also wonder how cmpxchg becomes faster with higher concurrency.

If you're referring to the leading 32/64, that's not concurrency; those
are 32/64-bit value widths. Sorry for not being clearer about that.

Greetings,

Andres Freund




Re: Lockless queue of waiters in LWLock

2022-11-05 Thread Alexander Korotkov
Hi, Andres!

On Fri, Nov 4, 2022 at 10:07 PM Andres Freund  wrote:
> The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons why
> lwlocks are slower than a spinlock (but obviously are better under contention
> nonetheless).
>
>
> I have a benchmark program that starts a thread for each physical core and
> just increments a counter on an atomic value.

Thank you for this insight!  I didn't know xadd is much cheaper than
cmpxchg unless there are retries.  I also wonder how cmpxchg becomes
faster with higher concurrency.

--
Regards,
Alexander Korotkov




Re: Lockless queue of waiters in LWLock

2022-11-04 Thread Andres Freund
Hi,

On 2022-11-03 14:50:11 +0400, Pavel Borisov wrote:
> Or maybe there is another explanation for the small performance
> difference around 20 connections described in [0]?
> Thoughts?

Using xadd is quite a bit cheaper than cmpxchg, and now every lock release
uses a compare-exchange, I think.

In the past I had a more complicated version of LWLockAcquire which tried to
use an xadd to acquire locks. IIRC (and this is long enough ago that I might
not remember correctly) that proved to be a benefit, but I was worried about
the complexity. And
just getting in the version that didn't always use a spinlock was the higher
priority.

The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons why
lwlocks are slower than a spinlock (but obviously are better under contention
nonetheless).


I have a benchmark program that starts a thread for each physical core and
just increments a counter on an atomic value.
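
A minimal sketch of that kind of microbenchmark (not the actual program;
the thread count and iteration count are arbitrary, and it only shows
the 64-bit case):

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 32
#define ITERS    10000000ULL

static _Atomic uint64_t counter;
static bool use_cmpxchg;

static void *
worker(void *arg)
{
    (void) arg;

    for (uint64_t i = 0; i < ITERS; i++)
    {
        if (use_cmpxchg)
        {
            /* cmpxchg variant: retry loop around compare-exchange */
            uint64_t old = atomic_load_explicit(&counter, memory_order_relaxed);

            while (!atomic_compare_exchange_weak_explicit(&counter, &old, old + 1,
                                                          memory_order_relaxed,
                                                          memory_order_relaxed))
                ;
        }
        else
        {
            /* xadd / lock add variant: a single fetch-add */
            atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
        }
    }
    return NULL;
}

int
main(int argc, char **argv)
{
    pthread_t   threads[NTHREADS];

    use_cmpxchg = (argc > 1);   /* any argument selects the cmpxchg variant */

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    printf("final counter: %llu\n", (unsigned long long) counter);
    return 0;
}

Timing the two variants against wall-clock time gives per-thread
throughput numbers of the kind shown below.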

On my dual Xeon Gold 5215 workstation:

cmpxchg:
32: throughput per thread: 0.55M/s, total: 11.02M/s
64: throughput per thread: 0.63M/s, total: 12.68M/s

lock add:
32: throughput per thread: 2.10M/s, total: 41.98M/s
64: throughput per thread: 2.12M/s, total: 42.40M/s

xadd:
32: throughput per thread: 2.10M/s, total: 41.91M/s
64: throughput per thread: 2.04M/s, total: 40.71M/s


and even when there's no contention, every thread just updating its own
cacheline:

cmpxchg:
32: throughput per thread: 88.83M/s, total: 1776.51M/s
64: throughput per thread: 96.46M/s, total: 1929.11M/s

lock add:
32: throughput per thread: 166.07M/s, total: 3321.31M/s
64: throughput per thread: 165.86M/s, total: 3317.22M/s

add (no lock):
32: throughput per thread: 530.78M/s, total: 10615.62M/s
64: throughput per thread: 531.22M/s, total: 10624.35M/s

xadd:
32: throughput per thread: 165.88M/s, total: 3317.51M/s
64: throughput per thread: 165.93M/s, total: 3318.53M/s


Greetings,

Andres Freund




Re: Lockless queue of waiters in LWLock

2022-11-04 Thread Alexander Korotkov
Hi, Pavel!

On Thu, Nov 3, 2022 at 1:51 PM Pavel Borisov  wrote:
> > I'm planning to gather more detailed statistics on different
> > LWLockAcquire calls soon to understand prospects for further
> > optimizations.
>
> So, I've made more measurements.
>
> 1. Applied measuring patch 0001 to the patch with the lockless queue
> optimization (v2) from [0] in this thread and ran the same concurrent
> insert test described in [1] on 20 pgbench connections.
> The new results for ProcArray lwlock are as follows:
> exacq 45132            // Overall number of exclusive locks taken
> ex_attempt[0] 20755    // Exclusive locks taken immediately
> ex_attempt[1] 18800    // Exclusive locks taken after waiting on the semaphore once
> ex_attempt[2] 5577     // Exclusive locks taken after waiting on the semaphore two or more times
> shacq 494871           // The same stats for shared locks
> sh_attempt[0] 463211
> sh_attempt[1] 29767
> sh_attempt[2] 1893
> sh_wake_calls 31070    // Number of calls to wake up shared waiters
> sh_wakes 36190         // Number of shared waiters woken up
> GroupClearXid 55300    // Number of calls of ProcArrayGroupClearXid
> EndTransactionInternal 236193  // Number of calls of ProcArrayEndTransactionInternal
>
> 2. Applied measuring patch 0002 to Andres Freund's patch v3 from [2]
> and ran the same concurrent insert test described in [1] on 20 pgbench
> connections.
> The results for ProcArray lwlock are as follows:
> exacq 49300            // Overall number of exclusive locks taken
> ex_attempt1[0] 18353   // Exclusive locks taken immediately by the first call of LWLockAttemptLock in the LWLockAcquire loop
> ex_attempt2[0] 18144   // Exclusive locks taken immediately by the second call of LWLockAttemptLock in the LWLockAcquire loop
> ex_attempt1[1] 9985    // Exclusive locks taken after waiting on the semaphore once, by the first call of LWLockAttemptLock
> ex_attempt2[1] 1838    // Exclusive locks taken after waiting on the semaphore once, by the second call of LWLockAttemptLock
> ex_attempt1[2] 823     // Exclusive locks taken after waiting on the semaphore two or more times, by the first call of LWLockAttemptLock
> ex_attempt2[2] 157     // Exclusive locks taken after waiting on the semaphore two or more times, by the second call of LWLockAttemptLock
> shacq 508131           // The same stats for shared locks
> sh_attempt1[0] 469410
> sh_attempt2[0] 27858
> sh_attempt1[1] 10309
> sh_attempt2[1] 460
> sh_attempt1[2] 90
> sh_attempt2[2] 4
> dequeue self 48461     // Number of dequeue_self calls
> sh_wake_calls 27560    // Number of calls to wake up shared waiters
> sh_wakes 19408         // Number of shared waiters woken up
> GroupClearXid 65021    // Number of calls of ProcArrayGroupClearXid
> EndTransactionInternal 249003  // Number of calls of ProcArrayEndTransactionInternal
>
> It seems that the two calls in each loop in Andres's (and master's) code
> help avoid the semaphore-wait loops that may be relatively expensive.
> The probable reason for this is that the small delay between these two
> calls is sometimes enough for concurrent takers to free the spinlock for
> the queue modification. Could we get even more performance if we do
> three or more tries to take the lock in the queue? Will this degrade
> performance in some other cases?

Thank you for gathering the statistics. Let me do some relative
analysis of that.

*Lockless queue patch*

1. Shared lockers
1.1. 93.60% of them acquire lock without sleeping on semaphore
1.2. 6.02% of them acquire lock after sleeping on semaphore 1 time
1.3. 0.38% of them acquire lock after sleeping on semaphore 2 or more times
2. Exclusive lockers
2.1. 45.99% of them acquire lock without sleeping on semaphore
2.2. 41.66% of them acquire lock after sleeping on semaphore 1 time
2.3. 12.36% of them acquire lock after sleeping on semaphore 2 or more times

In general, about 10% of lockers sleep on the semaphore.

*Andres's patch*

1. Shared lockers
1.1. 97.86% of them acquire lock without sleeping on the semaphore
(92.38% do this immediately and 5.48% after queuing)
1.2. 2.12% of them acquire lock after sleeping on semaphore 1 time
(2.03% do this immediately and 0.09% after queuing)
1.3. 0.02% of them acquire lock after sleeping on semaphore 2 or more
times (0.02% do this immediately and 0.00% after queuing)
2. Exclusive lockers
2.1. 74.03% of them acquire lock without sleeping on the semaphore
(37.23% do this immediately and 36.80% after queuing)
2.2. 23.98% of them acquire lock after sleeping on semaphore 1 time
(20.25% do this immediately and 3.73% after queuing)
2.3. 1.99% of them acquire lock after sleeping on semaphore 2 or more
times (1.67% do this immediately and 0.32% after queuing)

In general, about 4% of lockers sleep on the semaphore.

Re: Lockless queue of waiters in LWLock

2022-11-03 Thread Pavel Borisov
Hi, hackers!
> I'm planning to gather more detailed statistics on different
> LWLockAcquire calls soon to understand prospects for further
> optimizations.

So, I've made more measurements.

1. Applied measuring patch 0001 to the patch with the lockless queue
optimization (v2) from [0] in this thread and ran the same concurrent
insert test described in [1] on 20 pgbench connections.
The new results for ProcArray lwlock are as follows:
exacq 45132            // Overall number of exclusive locks taken
ex_attempt[0] 20755    // Exclusive locks taken immediately
ex_attempt[1] 18800    // Exclusive locks taken after waiting on the semaphore once
ex_attempt[2] 5577     // Exclusive locks taken after waiting on the semaphore two or more times
shacq 494871           // The same stats for shared locks
sh_attempt[0] 463211
sh_attempt[1] 29767
sh_attempt[2] 1893
sh_wake_calls 31070    // Number of calls to wake up shared waiters
sh_wakes 36190         // Number of shared waiters woken up
GroupClearXid 55300    // Number of calls of ProcArrayGroupClearXid
EndTransactionInternal 236193  // Number of calls of ProcArrayEndTransactionInternal

2. Applied measuring patch 0002 to Andres Freund's patch v3 from [2]
and ran the same concurrent insert test described in [1] on 20 pgbench
connections.
The results for ProcArray lwlock are as follows:
exacq 49300            // Overall number of exclusive locks taken
ex_attempt1[0] 18353   // Exclusive locks taken immediately by the first call of LWLockAttemptLock in the LWLockAcquire loop
ex_attempt2[0] 18144   // Exclusive locks taken immediately by the second call of LWLockAttemptLock in the LWLockAcquire loop
ex_attempt1[1] 9985    // Exclusive locks taken after waiting on the semaphore once, by the first call of LWLockAttemptLock
ex_attempt2[1] 1838    // Exclusive locks taken after waiting on the semaphore once, by the second call of LWLockAttemptLock
ex_attempt1[2] 823     // Exclusive locks taken after waiting on the semaphore two or more times, by the first call of LWLockAttemptLock
ex_attempt2[2] 157     // Exclusive locks taken after waiting on the semaphore two or more times, by the second call of LWLockAttemptLock
shacq 508131           // The same stats for shared locks
sh_attempt1[0] 469410
sh_attempt2[0] 27858
sh_attempt1[1] 10309
sh_attempt2[1] 460
sh_attempt1[2] 90
sh_attempt2[2] 4
dequeue self 48461     // Number of dequeue_self calls
sh_wake_calls 27560    // Number of calls to wake up shared waiters
sh_wakes 19408         // Number of shared waiters woken up
GroupClearXid 65021    // Number of calls of ProcArrayGroupClearXid
EndTransactionInternal 249003  // Number of calls of ProcArrayEndTransactionInternal

It seems that the two calls in each loop in Andres's (and master's) code
help avoid the semaphore-wait loops that may be relatively expensive.
The probable reason for this is that the small delay between these two
calls is sometimes enough for concurrent takers to free the spinlock for
the queue modification. Could we get even more performance if we do
three or more tries to take the lock in the queue? Will this degrade
performance in some other cases?

Or maybe there is another explanation for the small performance
difference around 20 connections described in [0]?
Thoughts?

Regards,
Pavel Borisov

[0] https://www.postgresql.org/message-id/CALT9ZEF7q%2BSarz1MjrX-fM7OsoU7CK16%3DONoGCOkY3Efj%2BGrnw%40mail.gmail.com
[1] https://www.postgresql.org/message-id/CALT9ZEEz%2B%3DNepc5eti6x531q64Z6%2BDxtP3h-h_8O5HDdtkJcPw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20221031235114.ftjkife57zil7ryw%40awork3.anarazel.de


0001-Print-extended-lwlock_stats-and-proc_stats-on-CAS-re.patch
Description: Binary data


0002-Print-extended-lwlock_stats-and-proc_stats.patch
Description: Binary data


Re: Lockless queue of waiters in LWLock

2022-11-01 Thread Alexander Korotkov
Hi Andres,

Thank you for your feedback.

On Tue, Nov 1, 2022 at 2:15 AM Andres Freund  wrote:
> On 2022-10-31 14:38:23 +0400, Pavel Borisov wrote:
> > If the lock state contains references to the queue head and tail, we can
> > implement a lockless queue of waiters for the LWLock.  Adding new items to
> > the queue head or tail can be done with a single CAS operation (adding to
> > the tail will also require further setting the reference from the previous
> > tail).  Given that there could be only one lock releaser, which wakes up
> > waiters in the queue, we can handle all the concurrency issues with
> > reasonable complexity.
>
> Right now waiter wakeups happen *after* the lock is released.

That makes sense. The patch makes the locker hold the lock while it's
processing the queue, so the lock is held for a longer time.

> I suspect that is
> at least part of the reason for the regression you're seeing. It also looks
> like you're going to need a substantially higher number of atomic operations -
> right now the queue processing needs O(1) atomic ops, but your approach ends
> up with O(waiters) atomic ops.

Hmm... In the patch, queue processing calls CAS once after processing
the queue. There could be retries to process the parts of the queue that
were added concurrently, but I doubt it ends up with O(waiters) atomic
ops. Pavel, I think we could gather some statistics to check how many
retries we have on average.

> I suspect it might be worth going halfway, i.e. put the list head/tail in the
> atomic variable, but process the list with a lock held, after the lock is
> released.

Good idea. We only allow one locker at a time to process the queue anyway.

--
Regards,
Alexander Korotkov




Re: Lockless queue of waiters in LWLock

2022-10-31 Thread Andres Freund
Hi,

Thanks for working on this - I think it's something we need to improve.


On 2022-10-31 14:38:23 +0400, Pavel Borisov wrote:
> If the lock state contains references to the queue head and tail, we can
> implement a lockless queue of waiters for the LWLock.  Adding new items to
> the queue head or tail can be done with a single CAS operation (adding to
> the tail will also require further setting the reference from the previous
> tail).  Given that there could be only one lock releaser, which wakes up
> waiters in the queue, we can handle all the concurrency issues with
> reasonable complexity.

Right now waiter wakeups happen *after* the lock is released. I suspect that
is at least part of the reason for the regression you're seeing. It also looks
like you're going to need a substantially higher number of atomic operations:
right now the queue processing needs O(1) atomic ops, but your approach ends
up with O(waiters) atomic ops.

I suspect it might be worth going halfway, i.e. put the list head/tail in the
atomic variable, but process the list with a lock held, after the lock is
released.
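
For illustration, a rough sketch of that halfway scheme (the names and
types are made up, and a pthread mutex/condvar stands in for the PGPROC
semaphore; this is not meant as the actual lwlock.c change):

#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>
#include <pthread.h>

typedef struct waiter
{
    struct waiter  *next;
    pthread_mutex_t mtx;
    pthread_cond_t  wakeup;     /* stand-in for the per-backend semaphore */
} waiter;

typedef struct lw_sketch
{
    _Atomic(uint32_t)   state;          /* lock bits, CAS-managed as today */
    _Atomic(waiter *)   wait_head;      /* lockless LIFO of waiters */
    pthread_mutex_t     wakeup_lock;    /* held only while walking the list */
} lw_sketch;

/* Enqueueing yourself is a single CAS, no spinlock. */
static void
enqueue_self(lw_sketch *lock, waiter *w)
{
    waiter *head = atomic_load(&lock->wait_head);

    do
        w->next = head;
    while (!atomic_compare_exchange_weak(&lock->wait_head, &head, w));
}

/* Called by the releaser *after* the lock bits have been cleared. */
static void
wake_waiters(lw_sketch *lock)
{
    waiter *w;

    pthread_mutex_lock(&lock->wakeup_lock);
    /* Detach the whole pending list with one atomic exchange. */
    w = atomic_exchange(&lock->wait_head, (waiter *) NULL);
    while (w != NULL)
    {
        waiter *next = w->next; /* read before the waiter can go away */

        pthread_mutex_lock(&w->mtx);
        pthread_cond_signal(&w->wakeup);
        pthread_mutex_unlock(&w->mtx);
        w = next;
    }
    pthread_mutex_unlock(&lock->wakeup_lock);
}

The queue itself stays lockless, while the O(waiters) walk happens off
the fast path, after the lock word has already been released.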


> Removing the queue spinlock bit and the corresponding contention should
> give a performance gain in highly concurrent LWLock usage scenarios.

> Currently, the maximum size of procarray is 2^18 (as stated in
> buf_internals.h), so with the use of the 64-bit LWLock state variable, we
> can store the procarray indexes for both head and tail of the queue, all
> the existing machinery, and even have some spare bits for future flags.

Another advantage: It shrinks the LWLock struct, which makes it cheaper to
make locks more granular...
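
To make the packing in the quoted paragraph concrete, a layout along
these lines would fit (18 bits per index; the exact bit assignment here
is an illustration, not what the patch actually uses):

#include <stdint.h>

#define WAITER_BITS         18      /* a procarray index fits in 18 bits */
#define WAITER_MASK         ((UINT64_C(1) << WAITER_BITS) - 1)
#define QUEUE_HEAD_SHIFT    0
#define QUEUE_TAIL_SHIFT    WAITER_BITS
#define FLAG_SHIFT          (2 * WAITER_BITS)   /* 28 bits left for lock/flag bits */

static inline uint32_t
queue_head(uint64_t state)
{
    return (uint32_t) ((state >> QUEUE_HEAD_SHIFT) & WAITER_MASK);
}

static inline uint32_t
queue_tail(uint64_t state)
{
    return (uint32_t) ((state >> QUEUE_TAIL_SHIFT) & WAITER_MASK);
}

static inline uint64_t
set_queue_head(uint64_t state, uint32_t procno)
{
    return (state & ~(WAITER_MASK << QUEUE_HEAD_SHIFT)) |
           ((uint64_t) procno << QUEUE_HEAD_SHIFT);
}

Pushing a waiter is then a single compare-and-swap of the whole 64-bit
state with a new head (or tail) index.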


> I cannot understand why the lockless queue patch has a minor performance
> regression in the region of around 20 connections, even when compared
> to the current master branch.

I suspect that's the point where the additional atomic operations hurt the
most, while not yet providing a substantial gain.

Greetings,

Andres Freund