Re: spin_delay() for ARM
On Sat, 18 Apr 2020 at 03:30, Thomas Munro wrote:
> On Sat, Apr 18, 2020 at 2:00 AM Ants Aasma wrote:
> > On Thu, 16 Apr 2020 at 10:33, Pavel Stehule wrote:
> > > As far as I know, pgbench cannot be used for testing spinlock
> > > problems.
> > >
> > > Maybe you can see this issue when you a) use a higher number of
> > > clients (hundreds, thousands) and b) decrease shared memory, so
> > > there will be pressure on the related spinlock.
> >
> > There really aren't many spinlocks left that could be tickled by a
> > normal workload. I looked for a way to trigger spinlock contention
> > when I prototyped a patch to replace spinlocks with futexes. The only
> > one that I could figure out a way to make contended was the lock
> > protecting parallel btree scan. A highly parallel index-only scan on
> > a fully cached index should create at least some spinlock contention.
>
> I suspect the snapshot-too-old "mutex_threshold" spinlock can become
> contended under workloads that generate a high rate of
> heap_page_prune_opt() calls with old_snapshot_threshold enabled. One
> way to do that is with a bunch of concurrent index scans that hit the
> heap in random order. Some notes about that:
>
> https://www.postgresql.org/message-id/flat/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com

Thanks all for the inputs. I will keep these two particular scenarios in
mind, and try to get some bandwidth on this soon.

--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: spin_delay() for ARM
On Sat, Apr 18, 2020 at 2:00 AM Ants Aasma wrote:
> On Thu, 16 Apr 2020 at 10:33, Pavel Stehule wrote:
> > As far as I know, pgbench cannot be used for testing spinlock
> > problems.
> >
> > Maybe you can see this issue when you a) use a higher number of
> > clients (hundreds, thousands) and b) decrease shared memory, so there
> > will be pressure on the related spinlock.
>
> There really aren't many spinlocks left that could be tickled by a
> normal workload. I looked for a way to trigger spinlock contention
> when I prototyped a patch to replace spinlocks with futexes. The only
> one that I could figure out a way to make contended was the lock
> protecting parallel btree scan. A highly parallel index-only scan on a
> fully cached index should create at least some spinlock contention.

I suspect the snapshot-too-old "mutex_threshold" spinlock can become
contended under workloads that generate a high rate of
heap_page_prune_opt() calls with old_snapshot_threshold enabled. One way
to do that is with a bunch of concurrent index scans that hit the heap
in random order. Some notes about that:

https://www.postgresql.org/message-id/flat/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com
Re: spin_delay() for ARM
On Thu, Apr 16, 2020 at 3:18 AM Amit Khandekar wrote:
> Not relevant to the PAUSE stuff: note that when the parallel clients go
> from 24 to 32 (which equals the machine's CPU count), the TPS shoots
> from 454189 to 1097592, which is more than double, with just a 30%
> increase in parallel sessions.

I've seen stuff like this too. For instance, check out the graph from
this 2012 blog post:

http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-about-64.html

You can see that the performance growth is basically on a straight line
up to about 16 cores, but then it kinks downward until about 28, after
which it kinks sharply upward until about 36 cores. I think this has
something to do with the process scheduling behavior of Linux, because I
vaguely recall a discussion where somebody benchmarked the same hardware
on both Linux and one of the BSD systems, and the effect didn't appear
on BSD. The BSDs had other problems, like a huge drop-off at higher core
counts, but they didn't show that effect.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: spin_delay() for ARM
On Thu, 16 Apr 2020 at 10:33, Pavel Stehule wrote:
> As far as I know, pgbench cannot be used for testing spinlock problems.
>
> Maybe you can see this issue when you a) use a higher number of clients
> (hundreds, thousands) and b) decrease shared memory, so there will be
> pressure on the related spinlock.

There really aren't many spinlocks left that could be tickled by a
normal workload. I looked for a way to trigger spinlock contention when
I prototyped a patch to replace spinlocks with futexes. The only one
that I could figure out a way to make contended was the lock protecting
parallel btree scan. A highly parallel index-only scan on a fully cached
index should create at least some spinlock contention.

Regards,
Ants Aasma
Re: spin_delay() for ARM
On Thu, 16 Apr 2020 at 9:18, Amit Khandekar wrote:

> On Mon, 13 Apr 2020 at 20:16, Amit Khandekar wrote:
> > On Sat, 11 Apr 2020 at 04:18, Tom Lane wrote:
> > >
> > > I wrote:
> > > > A more useful test would be to directly experiment with contended
> > > > spinlocks.  As I recall, we had some test cases laying about when
> > > > we were fooling with the spin delay stuff on Intel --- maybe
> > > > resurrecting one of those would be useful?
> > >
> > > The last really significant performance testing we did in this area
> > > seems to have been in this thread:
> > >
> > > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> > >
> > > A relevant point from that is Haas' comment
> > >
> > >     I think optimizing spinlocks for machines with only a few CPUs
> > >     is probably pointless.  Based on what I've seen so far, spinlock
> > >     contention even at 16 CPUs is negligible pretty much no matter
> > >     what you do.  Whether your implementation is fast or slow isn't
> > >     going to matter, because even an inefficient implementation will
> > >     account for only a negligible percentage of the total CPU time -
> > >     much less than 1% - as opposed to a 64-core machine, where it's
> > >     not that hard to find cases where spin-waits consume the
> > >     *majority* of available CPU time (recall previous discussion of
> > >     lseek).
> >
> > Yeah, will check if I find some machines with large cores.
>
> I got hold of a 32-CPU VM (actually a 16-core machine, but with
> hyperthreading the CPU count was 32). It was an Intel Xeon, 3 GHz CPU,
> with 15 GB of available memory. Hypervisor: KVM. Single NUMA node.
>
> PG parameters changed: shared_buffers = 8GB; max_connections = 1000.
>
> I compared pgbench results with HEAD versus PAUSE removed like this:
>
> perform_spin_delay(SpinDelayStatus *status)
> {
> -	/* CPU-specific delay each time through the loop */
> -	SPIN_DELAY();
>
> Ran with an increasing number of parallel clients:
> pgbench -S -c $num -j $num -T 60 -M prepared
>
> But couldn't find any significant change in the TPS numbers with or
> without PAUSE:
>
> Clients    HEAD       Without_PAUSE
> 8          246264     247264
> 16         399939     399549
> 24         454189     453244
> 32         1097592    1098844
> 40         1090424    1087984
> 48         1068645    1075173
> 64         1035035    1039973
> 96         976578     970699
>
> Maybe it will indeed show some difference only with around 64 cores, or
> perhaps a bare-metal machine would help; but as of now I didn't get
> such a machine. Anyway, I thought why not archive the results with
> whatever I have.
>
> Not relevant to the PAUSE stuff: note that when the parallel clients go
> from 24 to 32 (which equals the machine's CPU count), the TPS shoots
> from 454189 to 1097592, which is more than double, with just a 30%
> increase in parallel sessions. I was not expecting this much gain
> because, in the contended scenario, the pgbench processes are already
> taking around 20% of the total CPU time of the pgbench run. Maybe later
> on I will get a chance to run with a customized pgbench script that
> runs a server function that keeps running an index scan on
> pgbench_accounts, so as to make the pgbench clients almost idle.

As far as I know, pgbench cannot be used for testing spinlock problems.

Maybe you can see this issue when you a) use a higher number of clients
(hundreds, thousands) and b) decrease shared memory, so there will be
pressure on the related spinlock.

Regards

Pavel

> Thanks
> -Amit Khandekar
Re: spin_delay() for ARM
On Mon, 13 Apr 2020 at 20:16, Amit Khandekar wrote:
> On Sat, 11 Apr 2020 at 04:18, Tom Lane wrote:
> >
> > I wrote:
> > > A more useful test would be to directly experiment with contended
> > > spinlocks.  As I recall, we had some test cases laying about when
> > > we were fooling with the spin delay stuff on Intel --- maybe
> > > resurrecting one of those would be useful?
> >
> > The last really significant performance testing we did in this area
> > seems to have been in this thread:
> >
> > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> >
> > A relevant point from that is Haas' comment
> >
> >     I think optimizing spinlocks for machines with only a few CPUs is
> >     probably pointless.  Based on what I've seen so far, spinlock
> >     contention even at 16 CPUs is negligible pretty much no matter
> >     what you do.  Whether your implementation is fast or slow isn't
> >     going to matter, because even an inefficient implementation will
> >     account for only a negligible percentage of the total CPU time -
> >     much less than 1% - as opposed to a 64-core machine, where it's
> >     not that hard to find cases where spin-waits consume the
> >     *majority* of available CPU time (recall previous discussion of
> >     lseek).
>
> Yeah, will check if I find some machines with large cores.

I got hold of a 32-CPU VM (actually a 16-core machine, but with
hyperthreading the CPU count was 32). It was an Intel Xeon, 3 GHz CPU,
with 15 GB of available memory. Hypervisor: KVM. Single NUMA node.

PG parameters changed: shared_buffers = 8GB; max_connections = 1000.

I compared pgbench results with HEAD versus PAUSE removed like this:

perform_spin_delay(SpinDelayStatus *status)
{
-	/* CPU-specific delay each time through the loop */
-	SPIN_DELAY();

Ran with an increasing number of parallel clients:
pgbench -S -c $num -j $num -T 60 -M prepared

But couldn't find any significant change in the TPS numbers with or
without PAUSE:

Clients    HEAD       Without_PAUSE
8          246264     247264
16         399939     399549
24         454189     453244
32         1097592    1098844
40         1090424    1087984
48         1068645    1075173
64         1035035    1039973
96         976578     970699

Maybe it will indeed show some difference only with around 64 cores, or
perhaps a bare-metal machine would help; but as of now I didn't get such
a machine. Anyway, I thought why not archive the results with whatever I
have.

Not relevant to the PAUSE stuff: note that when the parallel clients go
from 24 to 32 (which equals the machine's CPU count), the TPS shoots
from 454189 to 1097592, which is more than double, with just a 30%
increase in parallel sessions. I was not expecting this much gain
because, in the contended scenario, the pgbench processes are already
taking around 20% of the total CPU time of the pgbench run. Maybe later
on I will get a chance to run with a customized pgbench script that runs
a server function that keeps running an index scan on pgbench_accounts,
so as to make the pgbench clients almost idle.

Thanks
-Amit Khandekar
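[For context, the fragment being diffed above is from PostgreSQL's
src/backend/storage/lmgr/s_lock.c. The following is a simplified,
self-contained sketch of that loop, not the verbatim PostgreSQL code:
the SPINS_PER_DELAY constant, the usleep() call, and the fallback
SPIN_DELAY() definition are placeholders standing in for the real,
adaptive implementation. It shows where the removed SPIN_DELAY() call
sits relative to the spin/sleep decision.]

```c
#include <unistd.h>

/* CPU hint instruction; the else branch is a placeholder no-op */
#if defined(__x86_64__) || defined(__i386__)
#define SPIN_DELAY() __asm__ __volatile__("pause")
#else
#define SPIN_DELAY() ((void) 0)
#endif

#define SPINS_PER_DELAY 1000	/* placeholder; the real value is adaptive */

typedef struct SpinDelayStatus
{
	int		spins;			/* spins since the last sleep */
	int		delays;			/* number of times we have slept */
} SpinDelayStatus;

void
perform_spin_delay(SpinDelayStatus *status)
{
	/*
	 * CPU-specific delay each time through the loop; this is the call the
	 * pgbench comparison above removed.
	 */
	SPIN_DELAY();

	/* after spinning for a while, sleep instead of burning CPU */
	if (++(status->spins) >= SPINS_PER_DELAY)
	{
		usleep(1000);		/* stand-in for pg_usleep() with backoff */
		status->spins = 0;
		status->delays++;
	}
}
```

Removing SPIN_DELAY() only changes the behavior of the busy-wait phase;
the eventual fall-through to sleeping is unaffected, which is consistent
with the TPS numbers barely moving.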
Re: spin_delay() for ARM
On Sat, 11 Apr 2020 at 04:18, Tom Lane wrote:
>
> I wrote:
> > A more useful test would be to directly experiment with contended
> > spinlocks.  As I recall, we had some test cases laying about when
> > we were fooling with the spin delay stuff on Intel --- maybe
> > resurrecting one of those would be useful?
>
> The last really significant performance testing we did in this area
> seems to have been in this thread:
>
> https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
>
> A relevant point from that is Haas' comment
>
>     I think optimizing spinlocks for machines with only a few CPUs is
>     probably pointless.  Based on what I've seen so far, spinlock
>     contention even at 16 CPUs is negligible pretty much no matter what
>     you do.  Whether your implementation is fast or slow isn't going to
>     matter, because even an inefficient implementation will account for
>     only a negligible percentage of the total CPU time - much less than
>     1% - as opposed to a 64-core machine, where it's not that hard to
>     find cases where spin-waits consume the *majority* of available CPU
>     time (recall previous discussion of lseek).

Yeah, will check if I find some machines with large cores.

> So I wonder whether this patch is getting ahead of the game.  It does
> seem that ARM systems with a couple dozen cores exist, but are they
> common enough to optimize for yet?  Can we even find *one* to test on
> and verify that this is a win and not a loss?  (Also, seeing that
> there are so many different ARM vendors, results from just one
> chipset might not be too trustworthy ...)

Ok. Yes, it would be worth waiting to see if there are others in the
community with ARM systems that have implemented YIELD. Maybe after that
we might gain some confidence. I also hope that I will get one soon to
test, but right now I have one that does not support it, so it will be
just a no-op.
--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: spin_delay() for ARM
On Sat, 11 Apr 2020 at 00:47, Andres Freund wrote:
>
> Hi,
>
> On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> > On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
> > using a sample C program (attached spin.c). Here, many child
> > processes (many more than CPUs) wait in a tight loop for a shared
> > variable to become 0, while the parent process continuously
> > increments a sequence number for a fixed amount of time, after which
> > it sets the shared variable to 0. The child's tight loop calls PAUSE
> > in each iteration. What I hoped was that because of PAUSE in the
> > children, the parent process would get a larger share of the CPU,
> > due to which, in a given time, the sequence number would reach a
> > higher value. Also, I expected the CPU cycles spent by child
> > processes to drop, thanks to PAUSE. None of this happened. There was
> > no change.
>
> > Possibly, this test case is not right. Probably the process
> > preemption occurs only within the set of hyperthreads attached to a
> > single core. And in my test case, the parent process is the only one
> > that is ready to run. Still, I have attached the program (spin.c)
> > for archival, in case somebody with a YIELD-supporting ARM machine
> > wants to use it to test YIELD.
>
> PAUSE doesn't operate on the level of the CPU scheduler. So the OS
> won't just schedule another process - you won't see different CPU usage
> if you measure it purely as the time running.

Yeah, I thought that the OS scheduling would be an *indirect*
consequence of PAUSE slowing down the CPU, but it looks like that does
not happen.

> You should be able to see a difference if you measure with a profiler
> that shows you data from the CPU's performance monitoring unit.

Hmm, I had tried with perf and could see PAUSE itself consuming 5% CPU.
But I haven't yet played with per-process figures.

--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: spin_delay() for ARM
I wrote:
> A more useful test would be to directly experiment with contended
> spinlocks.  As I recall, we had some test cases laying about when
> we were fooling with the spin delay stuff on Intel --- maybe
> resurrecting one of those would be useful?

The last really significant performance testing we did in this area
seems to have been in this thread:

https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com

A relevant point from that is Haas' comment

    I think optimizing spinlocks for machines with only a few CPUs is
    probably pointless.  Based on what I've seen so far, spinlock
    contention even at 16 CPUs is negligible pretty much no matter what
    you do.  Whether your implementation is fast or slow isn't going to
    matter, because even an inefficient implementation will account for
    only a negligible percentage of the total CPU time - much less than
    1% - as opposed to a 64-core machine, where it's not that hard to
    find cases where spin-waits consume the *majority* of available CPU
    time (recall previous discussion of lseek).

So I wonder whether this patch is getting ahead of the game.  It does
seem that ARM systems with a couple dozen cores exist, but are they
common enough to optimize for yet?  Can we even find *one* to test on
and verify that this is a win and not a loss?  (Also, seeing that there
are so many different ARM vendors, results from just one chipset might
not be too trustworthy ...)

			regards, tom lane
Re: spin_delay() for ARM
Andres Freund writes:
> On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> > On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
> > using a sample C program (attached spin.c).

> PAUSE doesn't operate on the level of the CPU scheduler. So the OS
> won't just schedule another process - you won't see different CPU usage
> if you measure it purely as the time running. You should be able to see
> a difference if you measure with a profiler that shows you data from
> the CPU's performance monitoring unit.

A more useful test would be to directly experiment with contended
spinlocks.  As I recall, we had some test cases laying about when
we were fooling with the spin delay stuff on Intel --- maybe
resurrecting one of those would be useful?

			regards, tom lane
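[One way to experiment with contended spinlocks outside the server is a
userspace microbenchmark. The sketch below is my own construction, not
one of the old test cases referred to above: it builds a minimal
test-and-set spinlock out of C11 atomics, in the spirit of s_lock. The
inner wait loop is exactly where PAUSE/YIELD would go, so timing
run_contention_test() with and without the cpu_relax() hint gives a
direct comparison; the names run_contention_test and cpu_relax are
hypothetical.]

```c
#include <pthread.h>
#include <stdatomic.h>

/* Minimal test-and-set spinlock; illustrative only. */
static atomic_flag slock = ATOMIC_FLAG_INIT;
static long shared_counter = 0;
static long iters_per_thread;

static inline void
cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ __volatile__("pause");
#elif defined(__aarch64__)
	__asm__ __volatile__("yield");
#endif							/* otherwise: plain busy-wait */
}

static void *
worker(void *unused)
{
	(void) unused;
	for (long i = 0; i < iters_per_thread; i++)
	{
		/* spin until the lock is acquired; the hint goes in this loop */
		while (atomic_flag_test_and_set_explicit(&slock,
												 memory_order_acquire))
			cpu_relax();
		shared_counter++;		/* tiny critical section -> max contention */
		atomic_flag_clear_explicit(&slock, memory_order_release);
	}
	return NULL;
}

/* Run nthreads workers (nthreads <= 64); return the final counter. */
long
run_contention_test(int nthreads, long iters)
{
	pthread_t	tid[64];

	iters_per_thread = iters;
	shared_counter = 0;
	for (int i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	return shared_counter;
}
```

With a thread count well above the core count, the wall-clock time of
run_contention_test() is dominated by the spin-wait, which is the regime
in which the hint instruction could plausibly matter.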
Re: spin_delay() for ARM
Hi,

On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
> using a sample C program (attached spin.c). Here, many child processes
> (many more than CPUs) wait in a tight loop for a shared variable to
> become 0, while the parent process continuously increments a sequence
> number for a fixed amount of time, after which it sets the shared
> variable to 0. The child's tight loop calls PAUSE in each iteration.
> What I hoped was that because of PAUSE in the children, the parent
> process would get a larger share of the CPU, due to which, in a given
> time, the sequence number would reach a higher value. Also, I expected
> the CPU cycles spent by child processes to drop, thanks to PAUSE. None
> of this happened. There was no change.

> Possibly, this test case is not right. Probably the process preemption
> occurs only within the set of hyperthreads attached to a single core.
> And in my test case, the parent process is the only one that is ready
> to run. Still, I have attached the program (spin.c) for archival, in
> case somebody with a YIELD-supporting ARM machine wants to use it to
> test YIELD.

PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't
just schedule another process - you won't see different CPU usage if you
measure it purely as the time running. You should be able to see a
difference if you measure with a profiler that shows you data from the
CPU's performance monitoring unit.

Greetings,

Andres Freund
spin_delay() for ARM
Hi,

We use (an equivalent of) the PAUSE instruction in spin_delay() for
Intel architectures. The goal is to slow down the spinlock tight loop
and thus prevent it from eating CPU and causing CPU starvation, so that
other processes get their fair share of the CPU time. Intel
documentation [1] clearly mentions this, along with other benefits of
PAUSE, like low power consumption and avoidance of memory-order
violations when exiting the loop.

Similarly, the ARM architecture has the YIELD instruction, which is also
clearly documented [2]. It explicitly says that it is a way to hint to
the CPU that it is being called in a spinlock loop and that this process
can be preempted. But for ARM, we are not using any kind of spin delay.

For PG spinlocks, the goal of both of these instructions is the same,
and both architectures recommend using them in spinlock loops. Also, I
found multiple places where YIELD is already used in the same
situations: the Linux kernel [3]; OpenJDK [4][5].

Now, for ARM implementations that don't implement YIELD, it runs as a
no-op. Unfortunately, the ARM machine I have does not implement YIELD.
But recently there have been some ARM implementations that are
hyperthreaded, so they are expected to actually do the YIELD, although
the docs do not explicitly say that YIELD has to be implemented only by
hyperthreaded implementations.

I ran some pgbench tests to test PAUSE/YIELD on the respective
architectures, once with the instruction present and once with it
removed. I didn't see any change in the TPS numbers; they were more or
less the same. For ARM, this was expected because my ARM machine does
not implement YIELD.

On my Intel Xeon machine with 8 cores, I tried to test PAUSE also using
a sample C program (attached spin.c).
Here, many child processes (many more than CPUs) wait in a tight loop
for a shared variable to become 0, while the parent process continuously
increments a sequence number for a fixed amount of time, after which it
sets the shared variable to 0. The child's tight loop calls PAUSE in
each iteration.

What I hoped was that because of PAUSE in the children, the parent
process would get a larger share of the CPU, due to which, in a given
time, the sequence number would reach a higher value. Also, I expected
the CPU cycles spent by child processes to drop, thanks to PAUSE. None
of this happened. There was no change. Possibly, this test case is not
right. Probably the process preemption occurs only within the set of
hyperthreads attached to a single core. And in my test case, the parent
process is the only one that is ready to run. Still, I have attached the
program (spin.c) for archival, in case somebody with a YIELD-supporting
ARM machine wants to use it to test YIELD.

Nevertheless, I think that because we have clear documentation that
strongly recommends using it, and because it has been used in other
cases such as the Linux kernel and OpenJDK, we should start using YIELD
for spin_delay() on ARM. Attached is the trivial patch
(spin_delay_for_arm.patch). To start with, it contains changes only for
aarch64. I haven't yet added changes in configure[.in] for making sure
yield compiles successfully (YIELD is present in manuals from ARMv6
onwards). Before that, I thought of getting some comments, so I didn't
do the configure changes yet.
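[The aarch64 change described above presumably looks something like the
following. This is a sketch modeled on the existing x86 PAUSE-based
definition in src/include/storage/s_lock.h, not the actual
spin_delay_for_arm.patch text; the non-ARM fallback branch is only there
so the snippet compiles anywhere.]

```c
/*
 * Sketch of a SPIN_DELAY() definition for aarch64.  YIELD hints to the
 * CPU that this thread is in a spin-wait loop; on cores that do not
 * implement it, the instruction executes as a NOP.
 */
#if defined(__aarch64__)
static __inline__ void
spin_delay(void)
{
	__asm__ __volatile__(" yield \n");
}
#define SPIN_DELAY() spin_delay()
#else
#define SPIN_DELAY() ((void) 0)	/* hypothetical no-op fallback */
#endif
```

The call site would be unchanged: perform_spin_delay() already invokes
SPIN_DELAY() once per iteration, so only the per-architecture definition
needs to be supplied.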
[1] https://c9x.me/x86/html/file_module_x86_id_232.html
[2] https://developer.arm.com/docs/100076/0100/instruction-set-reference/a64-general-instructions/yield
[3] https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/processor.h#L259
[4] http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html
[5] http://mail.openjdk.java.net/pipermail/aarch64-port-dev/2017-August/004880.html

--
Thanks,
-Amit Khandekar
Huawei Technologies

/*
 * Sample program to test the effect of the PAUSE/YIELD instruction in a
 * highly contended scenario.  The Intel and ARM docs recommend the use of
 * PAUSE and YIELD, respectively, in spinlock tight loops.
 *
 * This program can be run with:
 *     gcc -O3 -o spin spin.c -lrt ; ./spin [number_of_processes]
 * By default, 4 processes are spawned.
 *
 * Child processes wait in a tight loop for a shared variable to become 0,
 * while the parent process continuously increments a sequence number for a
 * fixed amount of time, after which it sets the shared variable to 0.  The
 * child tight loop calls YIELD/PAUSE in each iteration.
 *
 * The intention is to create a number of processes much larger than the
 * available CPUs, so that the scheduler hopefully pre-empts the processes
 * because of the PAUSE, and the main process gets a larger CPU share,
 * because of which it will increment its sequence number more times.  So
 * the expectation is that with PAUSE, the program will end up with a much
 * higher sequence number than without PAUSE.  Similarly, the child
 * processes should use fewer CPU cycles with PAUSE than without PAUSE.
 *
 * Author: Amit Khandekar
 */
#include
#include
#include
#include