On Sun, 29 Nov 2020 at 22:23, Alexander Korotkov <aekorot...@gmail.com> wrote:
> On Sat, Nov 28, 2020 at 1:31 PM Alexander Korotkov <aekorot...@gmail.com> wrote:
> > I guess that might depend on the implementation of CAS and TAS.  I bet
> > usage of CAS in spinlock gives advantage when ldxr/stxr are used, but
> > not when swpal/casa are used.  I found out that I can force clang to
> > use swpal/casa by setting "-march=armv8-a+lse".  I'm going to make
> > some experiments on a multicore AWS graviton2 instance with different
> > atomic implementations.
>
> I've made some benchmarks on c6gd.16xlarge ec2 instance with graviton2
> processor of 64 virtual CPUs (graphs and raw results are attached).
> I've analyzed two patches: spinlock using cas by Krunal Bauskar, and
> my implementation of lwlock using ldrex/strex.  My arm lwlock patch
> has the same idea as my previous patch for power: we can put lwlock
> attempt logic between ldrex and strex.  Unlike my previous power
> patch, the arm patch doesn't contain assembly: instead I've used
> C-wrappers over ldrex/strex.
>
> The first series of experiments I've made using standard compiling
> options.  So, LSE instructions from ARM v8.1 weren't used.  Atomics
> were implemented using ldrex/strex pair.
>
> In the read-only benchmark, both spinlock (cas-spinlock graph) and
> lwlock (ldrex-strex-lwlock graph) patches give observable performance
> gain of similar value.  However, performance of the combination of these
> patches (ldrex-strex-lwlock-cas-spinlock graph) is close to
> performance of the unpatched version.  That could be counterintuitive, but
> I've rechecked that multiple times.
>
> In the read-write benchmark, both spinlock and lwlock patches give
> more significant performance gain, and lwlock patch gives more effect
> than spinlock patch.  Notably, the combination of patches now
> gives some cumulative effect instead of counterintuitive slowdown.
>
> Then I've tried to compile postgres with LSE instructions using
> "-march=armv8-a+lse" flag with clang (graphs with -lse suffix).
> The effect of LSE is HUGE!!!  Unpatched version with LSE is several
> times faster than any version without LSE on high concurrency.  In both
> read-only and read-write benchmarks spinlock patch doesn't show any
> significant difference.  The lwlock patch shows a great slowdown with
> LSE.  Notably, in read-write benchmark, lwlock patch shows worse
> results than unpatched version without LSE.  Probably, combining
> different atomics implementations isn't a good idea.
>
> It seems that ARM Kunpeng 920 should support ARM v8.1.  I wonder if
> the published benchmark results were made with LSE.  I suspect that
> they were not.  It would be nice to repeat the same benchmarks with LSE.
> I'd like to ask Krunal Bauskar and Amit Khandekar to repeat these
> benchmarks with LSE.
>
> My preliminary conclusions are as follows:
> 1) Since the effect of LSE is so huge, we should advise users of
> multicore ARM servers to compile PostgreSQL with LSE support.  We
> probably should provide separate packaging for ARM v8.1 and higher
> (packages for ARM v8 are still needed for raspberry etc).
> 2) It seems that atomics in ARM v8.1 become very similar to x86
> atomics, and they don't need special optimizations.  And I think ARM
> v8 processors don't have so many cores and aren't so heavily used in
> high-concurrency environments.  So, special optimizations for ARM v8
> probably aren't worth it.
>

Thanks for the detailed results.

1. The results we shared are without LSE enabled, so they use the
   traditional load/store (exclusive) approach.

2. As you pointed out, LSE is available only starting with arm-v8.1, but
   not all machines tagged aarch64 are arm-v8.1 compatible.  This means
   we would need a separate package.  A more optimal way would be to
   compile pgsql with gcc-9.4 (or gcc-10.x, where it is the default) with
   -moutline-atomics, which emits both traditional and LSE code and
   selects between them dynamically depending on the target machine
   (I have blogged about it in the MySQL context:
   https://mysqlonarm.github.io/ARM-LSE-and-MySQL/).

3. The problem with the GCC approach is that a lot of distros still
   don't ship gcc 9.4 as the default.  To use this approach:
   * PGSQL will have to roll out its packages using gcc-9.4+ only, so
     that they are compatible with all aarch64 machines
   * but this continues to affect all other users who tend to build
     pgsql using the standard distro-provided compiler (unless they
     upgrade the compiler).

--------------------

So given all the permutations and combinations, I think we could
approach the problem as follows:

* Enable use of CAS, as it is known to have optimal performance (vs TAS)
* Even with LSE enabled, CAS continues to perform on par with or
  marginally better than TAS
* Add a patch to compile pgsql with outline-atomics if the GCC in use
  supports it, so that the dynamic 2-way compatible code is emitted.

--------------------

Alexander,

We will surely benchmark using LSE on Kunpeng 920 and share the results.

I am a bit surprised to see things scale by 4-5x just by switching to
LSE (my working experience with LSE, in the MySQL context and in
micro-benchmarking, didn't show that great an improvement from switching
to LSE).  Maybe some more hotspots (beyond s_lock) are getting addressed
with the use of LSE.

>
> Links
> 1. https://www.postgresql.org/message-id/CAB10pyamDkTFWU_BVGeEVmkc8%3DEhgCjr6QBk02SCdJtKpHkdFw%40mail.gmail.com
> 2. https://www.postgresql.org/message-id/CAPpHfdsKrh7c7P8-5eG-qW3VQobybbwqH%3DgL5Ck%2BdOES-gBbFg%40mail.gmail.com
>
> ------
> Regards,
> Alexander Korotkov
>

--
Regards,
Krunal Bauskar
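
PS: for readers following the TAS vs CAS discussion, here is a minimal
sketch of the two lock-acquire styles using plain C11 atomics.  This is
not the PostgreSQL s_lock code or the patch itself; the demo_* names are
hypothetical and only for illustration.  Built for aarch64 with
"-march=armv8-a+lse" I would expect the exchange to map to a swp-family
instruction and the compare-exchange to a cas-family instruction, and
without LSE both to become load/store-exclusive loops, but please check
the generated assembly for your compiler.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word for illustration: 0 = free, 1 = held. */
typedef atomic_int demo_spinlock;

/* TAS style: unconditionally swap in 1 and inspect the old value.
 * The lock word is written even when the lock is already held. */
static inline bool demo_try_lock_tas(demo_spinlock *lock)
{
    return atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0;
}

/* CAS style: store 1 only if the current value is 0.
 * A failed attempt leaves the lock word unchanged. */
static inline bool demo_try_lock_cas(demo_spinlock *lock)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        lock, &expected, 1,
        memory_order_acquire,   /* ordering on success */
        memory_order_relaxed);  /* ordering on failure */
}

/* Release the lock with a release store. */
static inline void demo_unlock(demo_spinlock *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

Both functions return true when the lock was acquired, so a spin loop
would simply retry (with backoff) while they return false.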