Re: spin_delay() for ARM
On Sat, 18 Apr 2020 at 03:30, Thomas Munro wrote:
> On Sat, Apr 18, 2020 at 2:00 AM Ants Aasma wrote:
> > On Thu, 16 Apr 2020 at 10:33, Pavel Stehule wrote:
> > > As far as I know, pgbench cannot be used for testing spinlock
> > > problems.
> > >
> > > Maybe you can see this issue when you a) use a higher number of
> > > clients (hundreds, thousands) and b) decrease shared memory, so
> > > there will be pressure on the related spinlock.
> >
> > There really aren't many spinlocks left that could be tickled by a
> > normal workload. I looked for a way to trigger spinlock contention
> > when I prototyped a patch to replace spinlocks with futexes. The only
> > one that I could figure out a way to make contended was the lock
> > protecting parallel btree scan. A highly parallel index-only scan on
> > a fully cached index should create at least some spinlock contention.
>
> I suspect the snapshot-too-old "mutex_threshold" spinlock can become
> contended under workloads that generate a high rate of
> heap_page_prune_opt() calls with old_snapshot_threshold enabled. One
> way to do that is with a bunch of concurrent index scans that hit the
> heap in random order. Some notes about that:
>
> https://www.postgresql.org/message-id/flat/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com

Thanks all for the inputs. I will keep these two particular scenarios in
mind, and try to get some bandwidth on this soon.

--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: spin_delay() for ARM
On Sat, Apr 18, 2020 at 2:00 AM Ants Aasma wrote:
> On Thu, 16 Apr 2020 at 10:33, Pavel Stehule wrote:
> > As far as I know, pgbench cannot be used for testing spinlock
> > problems.
> >
> > Maybe you can see this issue when you a) use a higher number of
> > clients (hundreds, thousands) and b) decrease shared memory, so there
> > will be pressure on the related spinlock.
>
> There really aren't many spinlocks left that could be tickled by a
> normal workload. I looked for a way to trigger spinlock contention
> when I prototyped a patch to replace spinlocks with futexes. The only
> one that I could figure out a way to make contended was the lock
> protecting parallel btree scan. A highly parallel index-only scan on a
> fully cached index should create at least some spinlock contention.

I suspect the snapshot-too-old "mutex_threshold" spinlock can become
contended under workloads that generate a high rate of
heap_page_prune_opt() calls with old_snapshot_threshold enabled. One way
to do that is with a bunch of concurrent index scans that hit the heap
in random order. Some notes about that:

https://www.postgresql.org/message-id/flat/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com
Re: spin_delay() for ARM
On Thu, Apr 16, 2020 at 3:18 AM Amit Khandekar wrote:
> Not relevant to the PAUSE stuff: note that when the parallel clients go
> from 24 to 32 (which equals the machine's CPU count), the TPS shoots
> from 454189 to 1097592, which is more than double, with just a 30%
> increase in parallel sessions.

I've seen stuff like this too. For instance, check out the graph from
this 2012 blog post:

http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-about-64.html

You can see that the performance growth is basically on a straight line
up to about 16 cores, but then it kinks downward until about 28, after
which it kinks sharply upward until about 36 cores. I think this has
something to do with the process scheduling behavior of Linux, because I
vaguely recall a discussion where somebody benchmarked the same hardware
on both Linux and one of the BSD systems, and the effect didn't appear
on BSD. The BSDs had other problems, like a huge drop-off at higher core
counts, but they didn't show that effect.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: spin_delay() for ARM
On Thu, 16 Apr 2020 at 10:33, Pavel Stehule wrote:
> As far as I know, pgbench cannot be used for testing spinlock problems.
>
> Maybe you can see this issue when you a) use a higher number of clients
> (hundreds, thousands) and b) decrease shared memory, so there will be
> pressure on the related spinlock.

There really aren't many spinlocks left that could be tickled by a
normal workload. I looked for a way to trigger spinlock contention when
I prototyped a patch to replace spinlocks with futexes. The only one
that I could figure out a way to make contended was the lock protecting
parallel btree scan. A highly parallel index-only scan on a fully cached
index should create at least some spinlock contention.

Regards,
Ants Aasma
Re: spin_delay() for ARM
On Thu, 16 Apr 2020 at 9:18, Amit Khandekar wrote:

> On Mon, 13 Apr 2020 at 20:16, Amit Khandekar wrote:
> > On Sat, 11 Apr 2020 at 04:18, Tom Lane wrote:
> > >
> > > I wrote:
> > > > A more useful test would be to directly experiment with contended
> > > > spinlocks.  As I recall, we had some test cases laying about when
> > > > we were fooling with the spin delay stuff on Intel --- maybe
> > > > resurrecting one of those would be useful?
> > >
> > > The last really significant performance testing we did in this area
> > > seems to have been in this thread:
> > >
> > > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> > >
> > > A relevant point from that is Haas' comment
> > >
> > >     I think optimizing spinlocks for machines with only a few CPUs
> > >     is probably pointless.  Based on what I've seen so far, spinlock
> > >     contention even at 16 CPUs is negligible pretty much no matter
> > >     what you do.  Whether your implementation is fast or slow isn't
> > >     going to matter, because even an inefficient implementation will
> > >     account for only a negligible percentage of the total CPU time -
> > >     much less than 1% - as opposed to a 64-core machine, where it's
> > >     not that hard to find cases where spin-waits consume the
> > >     *majority* of available CPU time (recall previous discussion of
> > >     lseek).
> >
> > Yeah, will check if I find some machines with large cores.
>
> I got hold of a 32-CPU VM (actually a 16-core machine, but with
> hyperthreading the CPU count was 32). It was an Intel Xeon, 3 GHz CPU,
> with 15 GB of available memory. Hypervisor: KVM. Single NUMA node.
>
> PG parameters changed: shared_buffers = 8GB; max_connections = 1000.
>
> I compared pgbench results with HEAD versus PAUSE removed like this:
>
> perform_spin_delay(SpinDelayStatus *status)
> {
> -	/* CPU-specific delay each time through the loop */
> -	SPIN_DELAY();
>
> Ran with an increasing number of parallel clients:
> pgbench -S -c $num -j $num -T 60 -M prepared
>
> But couldn't find any significant change in the TPS numbers with or
> without PAUSE:
>
> Clients    HEAD       Without_PAUSE
> 8          246264     247264
> 16         399939     399549
> 24         454189     453244
> 32         1097592    1098844
> 40         1090424    1087984
> 48         1068645    1075173
> 64         1035035    1039973
> 96         976578     970699
>
> Maybe it will indeed show some difference only with around 64 cores, or
> perhaps a bare-metal machine would help; but as of now I didn't get
> such a machine. Anyway, I thought why not archive the results with
> whatever I have.
>
> Not relevant to the PAUSE stuff: note that when the parallel clients go
> from 24 to 32 (which equals the machine's CPU count), the TPS shoots
> from 454189 to 1097592, which is more than double, with just a 30%
> increase in parallel sessions. I was not expecting this much gain
> because, in the contended scenario, the pgbench processes are already
> taking around 20% of the total CPU time of the pgbench run. Maybe later
> on I will get a chance to run with a customized pgbench script that
> runs a server function that keeps running an index scan on
> pgbench_accounts, so as to make the pgbench clients almost idle.

As far as I know, pgbench cannot be used for testing spinlock problems.

Maybe you can see this issue when you a) use a higher number of clients
(hundreds, thousands) and b) decrease shared memory, so there will be
pressure on the related spinlock.

Regards

Pavel

> Thanks
> -Amit Khandekar
Re: spin_delay() for ARM
On Mon, 13 Apr 2020 at 20:16, Amit Khandekar wrote:
> On Sat, 11 Apr 2020 at 04:18, Tom Lane wrote:
> >
> > I wrote:
> > > A more useful test would be to directly experiment with contended
> > > spinlocks.  As I recall, we had some test cases laying about when
> > > we were fooling with the spin delay stuff on Intel --- maybe
> > > resurrecting one of those would be useful?
> >
> > The last really significant performance testing we did in this area
> > seems to have been in this thread:
> >
> > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> >
> > A relevant point from that is Haas' comment
> >
> >     I think optimizing spinlocks for machines with only a few CPUs is
> >     probably pointless.  Based on what I've seen so far, spinlock
> >     contention even at 16 CPUs is negligible pretty much no matter
> >     what you do.  Whether your implementation is fast or slow isn't
> >     going to matter, because even an inefficient implementation will
> >     account for only a negligible percentage of the total CPU time -
> >     much less than 1% - as opposed to a 64-core machine, where it's
> >     not that hard to find cases where spin-waits consume the
> >     *majority* of available CPU time (recall previous discussion of
> >     lseek).
>
> Yeah, will check if I find some machines with large cores.

I got hold of a 32-CPU VM (actually a 16-core machine, but with
hyperthreading the CPU count was 32). It was an Intel Xeon, 3 GHz CPU,
with 15 GB of available memory. Hypervisor: KVM. Single NUMA node.

PG parameters changed: shared_buffers = 8GB; max_connections = 1000.

I compared pgbench results with HEAD versus PAUSE removed like this:

perform_spin_delay(SpinDelayStatus *status)
{
-	/* CPU-specific delay each time through the loop */
-	SPIN_DELAY();

Ran with an increasing number of parallel clients:
pgbench -S -c $num -j $num -T 60 -M prepared

But couldn't find any significant change in the TPS numbers with or
without PAUSE:

Clients    HEAD       Without_PAUSE
8          246264     247264
16         399939     399549
24         454189     453244
32         1097592    1098844
40         1090424    1087984
48         1068645    1075173
64         1035035    1039973
96         976578     970699

Maybe it will indeed show some difference only with around 64 cores, or
perhaps a bare-metal machine would help; but as of now I didn't get such
a machine. Anyway, I thought why not archive the results with whatever I
have.

Not relevant to the PAUSE stuff: note that when the parallel clients go
from 24 to 32 (which equals the machine's CPU count), the TPS shoots
from 454189 to 1097592, which is more than double, with just a 30%
increase in parallel sessions. I was not expecting this much gain
because, in the contended scenario, the pgbench processes are already
taking around 20% of the total CPU time of the pgbench run. Maybe later
on I will get a chance to run with a customized pgbench script that runs
a server function that keeps running an index scan on pgbench_accounts,
so as to make the pgbench clients almost idle.

Thanks
-Amit Khandekar
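[For context, the fragment being diffed above is from PostgreSQL's
src/backend/storage/lmgr/s_lock.c. The following is a simplified,
self-contained sketch of that loop, not the verbatim PostgreSQL code:
the SPINS_PER_DELAY constant, the usleep() call, and the fallback
SPIN_DELAY() definition are placeholders standing in for the real,
adaptive implementation. It shows where the removed SPIN_DELAY() call
sits relative to the spin/sleep decision.]

```c
#include <unistd.h>

/* CPU hint instruction; the else branch is a placeholder no-op */
#if defined(__x86_64__) || defined(__i386__)
#define SPIN_DELAY() __asm__ __volatile__("pause")
#else
#define SPIN_DELAY() ((void) 0)
#endif

#define SPINS_PER_DELAY 1000	/* placeholder; the real value is adaptive */

typedef struct SpinDelayStatus
{
	int		spins;			/* spins since the last sleep */
	int		delays;			/* number of times we have slept */
} SpinDelayStatus;

void
perform_spin_delay(SpinDelayStatus *status)
{
	/*
	 * CPU-specific delay each time through the loop; this is the call the
	 * pgbench comparison above removed.
	 */
	SPIN_DELAY();

	/* after spinning for a while, sleep instead of burning CPU */
	if (++(status->spins) >= SPINS_PER_DELAY)
	{
		usleep(1000);		/* stand-in for pg_usleep() with backoff */
		status->spins = 0;
		status->delays++;
	}
}
```

Removing SPIN_DELAY() only changes the behavior of the busy-wait phase;
the eventual fall-through to sleeping is unaffected, which is consistent
with the TPS numbers barely moving.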
Re: spin_delay() for ARM
On Sat, 11 Apr 2020 at 04:18, Tom Lane wrote:
>
> I wrote:
> > A more useful test would be to directly experiment with contended
> > spinlocks.  As I recall, we had some test cases laying about when
> > we were fooling with the spin delay stuff on Intel --- maybe
> > resurrecting one of those would be useful?
>
> The last really significant performance testing we did in this area
> seems to have been in this thread:
>
> https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
>
> A relevant point from that is Haas' comment
>
>     I think optimizing spinlocks for machines with only a few CPUs is
>     probably pointless.  Based on what I've seen so far, spinlock
>     contention even at 16 CPUs is negligible pretty much no matter what
>     you do.  Whether your implementation is fast or slow isn't going to
>     matter, because even an inefficient implementation will account for
>     only a negligible percentage of the total CPU time - much less than
>     1% - as opposed to a 64-core machine, where it's not that hard to
>     find cases where spin-waits consume the *majority* of available CPU
>     time (recall previous discussion of lseek).

Yeah, will check if I find some machines with large cores.

> So I wonder whether this patch is getting ahead of the game.  It does
> seem that ARM systems with a couple dozen cores exist, but are they
> common enough to optimize for yet?  Can we even find *one* to test on
> and verify that this is a win and not a loss?  (Also, seeing that
> there are so many different ARM vendors, results from just one
> chipset might not be too trustworthy ...)

Ok. Yes, it would be worth waiting to see if there are others in the
community with ARM systems that have implemented YIELD. Maybe after that
we might gain some confidence. I also hope that I will get one soon to
test, but right now I have one that does not support it, so it will be
just a no-op.
--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: spin_delay() for ARM
On Sat, 11 Apr 2020 at 00:47, Andres Freund wrote:
>
> Hi,
>
> On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> > On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
> > using a sample C program (attached spin.c). Here, many child
> > processes (many more than CPUs) wait in a tight loop for a shared
> > variable to become 0, while the parent process continuously
> > increments a sequence number for a fixed amount of time, after which
> > it sets the shared variable to 0. The child's tight loop calls PAUSE
> > in each iteration. What I hoped was that because of PAUSE in the
> > children, the parent process would get a larger share of the CPU,
> > due to which, in a given time, the sequence number would reach a
> > higher value. Also, I expected the CPU cycles spent by child
> > processes to drop, thanks to PAUSE. None of this happened. There was
> > no change.
>
> > Possibly, this test case is not right. Probably the process
> > preemption occurs only within the set of hyperthreads attached to a
> > single core. And in my test case, the parent process is the only one
> > that is ready to run. Still, I have attached the program (spin.c)
> > for archival, in case somebody with a YIELD-supporting ARM machine
> > wants to use it to test YIELD.
>
> PAUSE doesn't operate on the level of the CPU scheduler. So the OS
> won't just schedule another process - you won't see different CPU usage
> if you measure it purely as the time running.

Yeah, I thought that the OS scheduling would be an *indirect*
consequence of PAUSE slowing down the CPU, but it looks like that does
not happen.

> You should be able to see a difference if you measure with a profiler
> that shows you data from the CPU's performance monitoring unit.

Hmm, I had tried with perf and could see PAUSE itself consuming 5% CPU.
But I haven't yet played with per-process figures.

--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: spin_delay() for ARM
I wrote:
> A more useful test would be to directly experiment with contended
> spinlocks.  As I recall, we had some test cases laying about when
> we were fooling with the spin delay stuff on Intel --- maybe
> resurrecting one of those would be useful?

The last really significant performance testing we did in this area
seems to have been in this thread:

https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com

A relevant point from that is Haas' comment

    I think optimizing spinlocks for machines with only a few CPUs is
    probably pointless.  Based on what I've seen so far, spinlock
    contention even at 16 CPUs is negligible pretty much no matter what
    you do.  Whether your implementation is fast or slow isn't going to
    matter, because even an inefficient implementation will account for
    only a negligible percentage of the total CPU time - much less than
    1% - as opposed to a 64-core machine, where it's not that hard to
    find cases where spin-waits consume the *majority* of available CPU
    time (recall previous discussion of lseek).

So I wonder whether this patch is getting ahead of the game.  It does
seem that ARM systems with a couple dozen cores exist, but are they
common enough to optimize for yet?  Can we even find *one* to test on
and verify that this is a win and not a loss?  (Also, seeing that there
are so many different ARM vendors, results from just one chipset might
not be too trustworthy ...)

			regards, tom lane
Re: spin_delay() for ARM
Andres Freund writes:
> On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> > On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
> > using a sample C program (attached spin.c).

> PAUSE doesn't operate on the level of the CPU scheduler. So the OS
> won't just schedule another process - you won't see different CPU usage
> if you measure it purely as the time running. You should be able to see
> a difference if you measure with a profiler that shows you data from
> the CPU's performance monitoring unit.

A more useful test would be to directly experiment with contended
spinlocks.  As I recall, we had some test cases laying about when
we were fooling with the spin delay stuff on Intel --- maybe
resurrecting one of those would be useful?

			regards, tom lane
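[One way to experiment with contended spinlocks outside the server is a
userspace microbenchmark. The sketch below is my own construction, not
one of the old test cases referred to above: it builds a minimal
test-and-set spinlock out of C11 atomics, in the spirit of s_lock. The
inner wait loop is exactly where PAUSE/YIELD would go, so timing
run_contention_test() with and without the cpu_relax() hint gives a
direct comparison; the names run_contention_test and cpu_relax are
hypothetical.]

```c
#include <pthread.h>
#include <stdatomic.h>

/* Minimal test-and-set spinlock; illustrative only. */
static atomic_flag slock = ATOMIC_FLAG_INIT;
static long shared_counter = 0;
static long iters_per_thread;

static inline void
cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ __volatile__("pause");
#elif defined(__aarch64__)
	__asm__ __volatile__("yield");
#endif							/* otherwise: plain busy-wait */
}

static void *
worker(void *unused)
{
	(void) unused;
	for (long i = 0; i < iters_per_thread; i++)
	{
		/* spin until the lock is acquired; the hint goes in this loop */
		while (atomic_flag_test_and_set_explicit(&slock,
												 memory_order_acquire))
			cpu_relax();
		shared_counter++;		/* tiny critical section -> max contention */
		atomic_flag_clear_explicit(&slock, memory_order_release);
	}
	return NULL;
}

/* Run nthreads workers (nthreads <= 64); return the final counter. */
long
run_contention_test(int nthreads, long iters)
{
	pthread_t	tid[64];

	iters_per_thread = iters;
	shared_counter = 0;
	for (int i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	return shared_counter;
}
```

With a thread count well above the core count, the wall-clock time of
run_contention_test() is dominated by the spin-wait, which is the regime
in which the hint instruction could plausibly matter.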
Re: spin_delay() for ARM
Hi,

On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
> using a sample C program (attached spin.c). Here, many child processes
> (many more than CPUs) wait in a tight loop for a shared variable to
> become 0, while the parent process continuously increments a sequence
> number for a fixed amount of time, after which it sets the shared
> variable to 0. The child's tight loop calls PAUSE in each iteration.
> What I hoped was that because of PAUSE in the children, the parent
> process would get a larger share of the CPU, due to which, in a given
> time, the sequence number would reach a higher value. Also, I expected
> the CPU cycles spent by child processes to drop, thanks to PAUSE. None
> of this happened. There was no change.

> Possibly, this test case is not right. Probably the process preemption
> occurs only within the set of hyperthreads attached to a single core.
> And in my test case, the parent process is the only one that is ready
> to run. Still, I have attached the program (spin.c) for archival, in
> case somebody with a YIELD-supporting ARM machine wants to use it to
> test YIELD.

PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't
just schedule another process - you won't see different CPU usage if you
measure it purely as the time running. You should be able to see a
difference if you measure with a profiler that shows you data from the
CPU's performance monitoring unit.

Greetings,

Andres Freund
spin_delay() for ARM
Hi,

We use (an equivalent of) the PAUSE instruction in spin_delay() for
Intel architectures. The goal is to slow down the spinlock tight loop
and thus prevent it from eating CPU and causing CPU starvation, so that
other processes get their fair share of the CPU time. Intel
documentation [1] clearly mentions this, along with other benefits of
PAUSE, like low power consumption and avoidance of memory-order
violations when exiting the loop.

Similarly, the ARM architecture has the YIELD instruction, which is also
clearly documented [2]. It explicitly says that it is a way to hint to
the CPU that it is being called in a spinlock loop and that this process
can be preempted. But for ARM, we are not using any kind of spin delay.

For PG spinlocks, the goal of both of these instructions is the same,
and both architectures recommend using them in spinlock loops. Also, I
found multiple places where YIELD is already used in the same
situations: the Linux kernel [3]; OpenJDK [4][5].

Now, for ARM implementations that don't implement YIELD, it runs as a
no-op. Unfortunately, the ARM machine I have does not implement YIELD.
But recently there have been some ARM implementations that are
hyperthreaded, so they are expected to actually do the YIELD, although
the docs do not explicitly say that YIELD has to be implemented only by
hyperthreaded implementations.

I ran some pgbench tests to test PAUSE/YIELD on the respective
architectures, once with the instruction present and once with it
removed. I didn't see any change in the TPS numbers; they were more or
less the same. For ARM, this was expected because my ARM machine does
not implement YIELD.

On my Intel Xeon machine with 8 cores, I tried to test PAUSE also using
a sample C program (attached spin.c).
Here, many child processes (many more than CPUs) wait in a tight loop
for a shared variable to become 0, while the parent process continuously
increments a sequence number for a fixed amount of time, after which it
sets the shared variable to 0. The child's tight loop calls PAUSE in
each iteration.

What I hoped was that because of PAUSE in the children, the parent
process would get a larger share of the CPU, due to which, in a given
time, the sequence number would reach a higher value. Also, I expected
the CPU cycles spent by child processes to drop, thanks to PAUSE. None
of this happened. There was no change. Possibly, this test case is not
right. Probably the process preemption occurs only within the set of
hyperthreads attached to a single core. And in my test case, the parent
process is the only one that is ready to run. Still, I have attached the
program (spin.c) for archival, in case somebody with a YIELD-supporting
ARM machine wants to use it to test YIELD.

Nevertheless, I think that because we have clear documentation that
strongly recommends using it, and because it has been used in other
cases such as the Linux kernel and OpenJDK, we should start using YIELD
for spin_delay() on ARM. Attached is the trivial patch
(spin_delay_for_arm.patch). To start with, it contains changes only for
aarch64. I haven't yet added changes in configure[.in] for making sure
yield compiles successfully (YIELD is present in manuals from ARMv6
onwards). Before that, I thought of getting some comments, so I didn't
do the configure changes yet.
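[The aarch64 change described above presumably looks something like the
following. This is a sketch modeled on the existing x86 PAUSE-based
definition in src/include/storage/s_lock.h, not the actual
spin_delay_for_arm.patch text; the non-ARM fallback branch is only there
so the snippet compiles anywhere.]

```c
/*
 * Sketch of a SPIN_DELAY() definition for aarch64.  YIELD hints to the
 * CPU that this thread is in a spin-wait loop; on cores that do not
 * implement it, the instruction executes as a NOP.
 */
#if defined(__aarch64__)
static __inline__ void
spin_delay(void)
{
	__asm__ __volatile__(" yield \n");
}
#define SPIN_DELAY() spin_delay()
#else
#define SPIN_DELAY() ((void) 0)	/* hypothetical no-op fallback */
#endif
```

The call site would be unchanged: perform_spin_delay() already invokes
SPIN_DELAY() once per iteration, so only the per-architecture definition
needs to be supplied.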
[1] https://c9x.me/x86/html/file_module_x86_id_232.html
[2] https://developer.arm.com/docs/100076/0100/instruction-set-reference/a64-general-instructions/yield
[3] https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/processor.h#L259
[4] http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html
[5] http://mail.openjdk.java.net/pipermail/aarch64-port-dev/2017-August/004880.html

--
Thanks,
-Amit Khandekar
Huawei Technologies

/*
 * Sample program to test the effect of the PAUSE/YIELD instruction in a
 * highly contended scenario.  The Intel and ARM docs recommend the use of
 * PAUSE and YIELD, respectively, in spinlock tight loops.
 *
 * This program can be run with:
 *     gcc -O3 -o spin spin.c -lrt ; ./spin [number_of_processes]
 * By default, 4 processes are spawned.
 *
 * Child processes wait in a tight loop for a shared variable to become 0,
 * while the parent process continuously increments a sequence number for a
 * fixed amount of time, after which it sets the shared variable to 0.  The
 * child tight loop calls YIELD/PAUSE in each iteration.
 *
 * The intention is to create a number of processes much larger than the
 * available CPUs, so that the scheduler hopefully pre-empts the processes
 * because of the PAUSE, and the main process gets a larger CPU share,
 * because of which it will increment its sequence number more times.  So
 * the expectation is that with PAUSE, the program will end up with a much
 * higher sequence number than without PAUSE.  Similarly, the child
 * processes should use fewer CPU cycles with PAUSE than without PAUSE.
 *
 * Author: Amit Khandekar
 */
#include
#include
#include
#include