Re: Improving spin-lock implementation on ARM.
Wondering if we can take this to completion (any idea what more we could do?). On Thu, 10 Dec 2020 at 14:48, Krunal Bauskar wrote: > > On Tue, 8 Dec 2020 at 14:33, Krunal Bauskar > wrote: > >> >> >> On Thu, 3 Dec 2020 at 21:32, Tom Lane wrote: >> >>> Krunal Bauskar writes: >>> > Any updates or further inputs on this. >>> >>> As far as LSE goes: my take is that tampering with the >>> compiler/platform's default optimization options requires *very* >>> strong evidence, which we have not got and likely won't get. Users >>> who are building for specific hardware can choose to supply custom >>> CFLAGS, of course. But we shouldn't presume to do that for them, >>> because we don't know what they are building for, or with what. >>> >>> I'm very willing to consider the CAS spinlock patch, but it still >>> feels like there's not enough evidence to show that it's a universal >>> win. The way to move forward on that is to collect more measurements >>> on additional ARM-based platforms. And I continue to think that >>> pgbench is only a very crude tool for testing spinlock performance; >>> we should look at other tests. >>> >> >> Thanks Tom. >> >> Given pg-bench limited option I decided to try things with sysbench to >> expose >> the real contention using zipfian type (zipfian pattern causes part of >> the database >> to get updated there-by exposing main contention point). >> >> >> >> *Baseline for 256 threads update-index use-case:* >> - 44.24%174935 postgres postgres [.] >> s_lock >> transactions: >> transactions:5587105 (92988.40 per sec.) >> >> *Patched for 256 threads update-index use-case:* >> 0.02%80 postgres postgres [.] s_lock >> transactions: >> transactions:10288781 (171305.24 per sec.) >> >> *perf diff* >> >> * 0.02%+44.22% postgres [.] s_lock* >> >> >> As we see from the above result s_lock is exposing major contention that >> could be relaxed using the >> said cas patch. Performance improvement in range of 80% is observed. >> >> Taking this guideline we decided to run it for all scalability for update >> and non-update use-case. >> Check the attached graph. Consistent improvement is observed. >> >> I presume this should help re-establish that for major contention cases >> existing tas approach will always give up. >> >> >> --- >> >> Unfortunately, I don't have access to different ARM arch except for >> Kunpeng or Graviton2 where >> we have already proved the value of the patch. >> [ref: Apple M1 as per your evaluation patch doesn't show regression for >> select. Maybe if possible can you try update scenarios too]. >> >> Do you know anyone from the community who has access to other ARM arches >> we can request them to evaluate? >> But since it is has proven on 2 independent ARM arch I am pretty >> confident it will scale with other ARM arches too. >> >> > > Any direction on how we can proceed on this? > > * We have tested it with both cloud vendors that provide ARM instances. > * We have tested it with Apple M1 (partially at-least) > * Ampere use to provide instance on packet.com but now it is an > evaluation program only. > > No other active arm instance offering a cloud provider. > > Given our evaluation so far has proven to be +ve can we think of > considering it on basis of the available > data which is quite encouraging with 80% improvement seen for heavy > contention use-cases. > > > >> >>> From a system structural standpoint, I seriously dislike that lwlock.c >>> patch: putting machine-specific variant implementations into that file >>> seems like a disaster for maintainability. So it would need to show a >>> very significant gain across a range of hardware before I'd want to >>> consider adopting it ... and it has not shown that. >>> >>> regards, tom lane >>> >> >> >> -- >> Regards, >> Krunal Bauskar >> > > > -- > Regards, > Krunal Bauskar > -- Regards, Krunal Bauskar
Re: Improving spin-lock implementation on ARM.
On Tue, 8 Dec 2020 at 14:33, Krunal Bauskar wrote: > > > On Thu, 3 Dec 2020 at 21:32, Tom Lane wrote: > >> Krunal Bauskar writes: >> > Any updates or further inputs on this. >> >> As far as LSE goes: my take is that tampering with the >> compiler/platform's default optimization options requires *very* >> strong evidence, which we have not got and likely won't get. Users >> who are building for specific hardware can choose to supply custom >> CFLAGS, of course. But we shouldn't presume to do that for them, >> because we don't know what they are building for, or with what. >> >> I'm very willing to consider the CAS spinlock patch, but it still >> feels like there's not enough evidence to show that it's a universal >> win. The way to move forward on that is to collect more measurements >> on additional ARM-based platforms. And I continue to think that >> pgbench is only a very crude tool for testing spinlock performance; >> we should look at other tests. >> > > Thanks Tom. > > Given pg-bench limited option I decided to try things with sysbench to > expose > the real contention using zipfian type (zipfian pattern causes part of the > database > to get updated there-by exposing main contention point). > > > > *Baseline for 256 threads update-index use-case:* > - 44.24%174935 postgres postgres [.] s_lock > transactions: > transactions:5587105 (92988.40 per sec.) > > *Patched for 256 threads update-index use-case:* > 0.02%80 postgres postgres [.] s_lock > transactions: > transactions:10288781 (171305.24 per sec.) > > *perf diff* > > * 0.02%+44.22% postgres [.] s_lock* > > > As we see from the above result s_lock is exposing major contention that > could be relaxed using the > said cas patch. Performance improvement in range of 80% is observed. > > Taking this guideline we decided to run it for all scalability for update > and non-update use-case. > Check the attached graph. Consistent improvement is observed. > > I presume this should help re-establish that for major contention cases > existing tas approach will always give up. > > > --- > > Unfortunately, I don't have access to different ARM arch except for > Kunpeng or Graviton2 where > we have already proved the value of the patch. > [ref: Apple M1 as per your evaluation patch doesn't show regression for > select. Maybe if possible can you try update scenarios too]. > > Do you know anyone from the community who has access to other ARM arches > we can request them to evaluate? > But since it is has proven on 2 independent ARM arch I am pretty confident > it will scale with other ARM arches too. > > Any direction on how we can proceed on this? * We have tested it with both cloud vendors that provide ARM instances. * We have tested it with Apple M1 (partially at-least) * Ampere use to provide instance on packet.com but now it is an evaluation program only. No other active arm instance offering a cloud provider. Given our evaluation so far has proven to be +ve can we think of considering it on basis of the available data which is quite encouraging with 80% improvement seen for heavy contention use-cases. > >> From a system structural standpoint, I seriously dislike that lwlock.c >> patch: putting machine-specific variant implementations into that file >> seems like a disaster for maintainability. So it would need to show a >> very significant gain across a range of hardware before I'd want to >> consider adopting it ... and it has not shown that. >> >> regards, tom lane >> > > > -- > Regards, > Krunal Bauskar > -- Regards, Krunal Bauskar
Re: Improving spin-lock implementation on ARM.
On Thu, 3 Dec 2020 at 21:32, Tom Lane wrote: > Krunal Bauskar writes: > > Any updates or further inputs on this. > > As far as LSE goes: my take is that tampering with the > compiler/platform's default optimization options requires *very* > strong evidence, which we have not got and likely won't get. Users > who are building for specific hardware can choose to supply custom > CFLAGS, of course. But we shouldn't presume to do that for them, > because we don't know what they are building for, or with what. > > I'm very willing to consider the CAS spinlock patch, but it still > feels like there's not enough evidence to show that it's a universal > win. The way to move forward on that is to collect more measurements > on additional ARM-based platforms. And I continue to think that > pgbench is only a very crude tool for testing spinlock performance; > we should look at other tests. > Thanks Tom. Given pg-bench limited option I decided to try things with sysbench to expose the real contention using zipfian type (zipfian pattern causes part of the database to get updated there-by exposing main contention point). *Baseline for 256 threads update-index use-case:* - 44.24%174935 postgres postgres [.] s_lock transactions: transactions:5587105 (92988.40 per sec.) *Patched for 256 threads update-index use-case:* 0.02%80 postgres postgres [.] s_lock transactions: transactions:10288781 (171305.24 per sec.) *perf diff* * 0.02%+44.22% postgres [.] s_lock* As we see from the above result s_lock is exposing major contention that could be relaxed using the said cas patch. Performance improvement in range of 80% is observed. Taking this guideline we decided to run it for all scalability for update and non-update use-case. Check the attached graph. Consistent improvement is observed. I presume this should help re-establish that for major contention cases existing tas approach will always give up. --- Unfortunately, I don't have access to different ARM arch except for Kunpeng or Graviton2 where we have already proved the value of the patch. [ref: Apple M1 as per your evaluation patch doesn't show regression for select. Maybe if possible can you try update scenarios too]. Do you know anyone from the community who has access to other ARM arches we can request them to evaluate? But since it is has proven on 2 independent ARM arch I am pretty confident it will scale with other ARM arches too. > > From a system structural standpoint, I seriously dislike that lwlock.c > patch: putting machine-specific variant implementations into that file > seems like a disaster for maintainability. So it would need to show a > very significant gain across a range of hardware before I'd want to > consider adopting it ... and it has not shown that. > > regards, tom lane > -- Regards, Krunal Bauskar
Re: Improving spin-lock implementation on ARM.
On Sat, 5 Dec 2020 at 02:55, Alexander Korotkov wrote: > > On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar wrote: > > Let me know what do you think about this analysis and any specific > > direction that we should consider to help move forward. > > BTW, it would be also nice to benchmark my lwlock patch on the > Kunpeng. I'm very optimistic about this patch, but it wouldn't be > fair to completely throw it away. It still might be useful for > LSE-disabled builds. Coincidentally I was also looking at some hotspot locations around LWLockAcquire() and LWLockAttempt() for read-only work-loads on both arm64 and x86, and had narrowed down to the place where pg_atomic_compare_exchange_u32() is called. So it's likely we are working on the same hotspot area. When I get a chance, I do plan to look at your patch while I myself am trying to see if we can do some optimizations. Although, this is unrelated to the optimization of this mail thread, so this will need a different mail thread. -- Thanks, -Amit Khandekar Huawei Technologies
Re: Improving spin-lock implementation on ARM.
On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar wrote: > Let me know what do you think about this analysis and any specific direction > that we should consider to help move forward. BTW, it would be also nice to benchmark my lwlock patch on the Kunpeng. I'm very optimistic about this patch, but it wouldn't be fair to completely throw it away. It still might be useful for LSE-disabled builds. -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar wrote: > 1. CAS patch (applied on the baseline) >- Kunpeng: 10-45% improvement observed [1] >- Graviton2: 30-50% improvement observed [2] What does lower boundary of improvement mean? Does it mean minimal improvement observed? Obviously not, because there is no improvement with a low number of clients or read-only benchmark. >- M1: Only select results are available cas continue to maintain a > marginal gain but not significant. [3] Read-only benchmark doesn't involve spinlocks (I've just rechecked this). So, this difference is purely speculative. Also, Tom observed the regression [1]. The benchmark is artificial, but it may correspond to some real workload with heavily-loaded spinlocks. And that might have an explanation. ldrex/strex themselves don't work as memory barrier (this is why compiler adds explicit memory barrier afterwards). And I bet ldrex unpaired with strex could see an outdated value. On high-contended spinlocks that may cause too pessimistic waits. > 2. Let's ignore CAS for a sec and just think of LSE independently >- Kunpeng: regression observed Yeah, it's sad that there are ARM chips, where making the same action in a single instruction is slower than the same actions in multiple instructions. >- Graviton2: gain observed Have to say it's 5.5x improvement, which is dramatically more than CAS-loop patch can give. >- M1: regression observed > [while lse probably is default explicitly enabling it with +lse causes > regression on the head itself [4]. > client=2/4: 1816/714 vs 892/610] This is plain false. LSE was enabled in all the versions tested in M1 [2]. So not LSE instructions or +lse option caused the regression, but lack of other options enabled by default in Apple's clang. > Let me know what do you think about this analysis and any specific direction > that we should consider to help move forward. In order to move forward with ARM-optimized spinlocks I would investigate other options (at least [3]). Regarding CAS spinlock patch I can propose the following steps to clarify the things. 1. Run Tom's spinlock test on ARM machines other than M1. 2. Try to construct a pgbench script, which produces heavily-loaded spinlock without custom C-function, and also run it across various ARM machines. Links 1. https://www.postgresql.org/message-id/741389.1606530957%40sss.pgh.pa.us 2. https://www.postgresql.org/message-id/1274781.1606760475%40sss.pgh.pa.us 3. https://linux-concepts.blogspot.com/2018/05/spinlock-implementation-in-arm.html -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Thu, Dec 3, 2020 at 7:02 PM Tom Lane wrote: > From a system structural standpoint, I seriously dislike that lwlock.c > patch: putting machine-specific variant implementations into that file > seems like a disaster for maintainability. So it would need to show a > very significant gain across a range of hardware before I'd want to > consider adopting it ... and it has not shown that. The current shape of the lwlock patch is experimental. I had quite a beautiful (in my opinion) idea to wrap platform-dependent parts of CAS-loops into macros. Then we could provide different low-level implementations of CAS-loops for Power, ARM and rest platforms with single code for LWLockAttempLock() and others. However, I see that modern ARM tends to efficiently implement LSE. Power doesn't seem to be very popular. So, I'm going to give up with this for now. -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
Krunal Bauskar writes: > Any updates or further inputs on this. As far as LSE goes: my take is that tampering with the compiler/platform's default optimization options requires *very* strong evidence, which we have not got and likely won't get. Users who are building for specific hardware can choose to supply custom CFLAGS, of course. But we shouldn't presume to do that for them, because we don't know what they are building for, or with what. I'm very willing to consider the CAS spinlock patch, but it still feels like there's not enough evidence to show that it's a universal win. The way to move forward on that is to collect more measurements on additional ARM-based platforms. And I continue to think that pgbench is only a very crude tool for testing spinlock performance; we should look at other tests. >From a system structural standpoint, I seriously dislike that lwlock.c patch: putting machine-specific variant implementations into that file seems like a disaster for maintainability. So it would need to show a very significant gain across a range of hardware before I'd want to consider adopting it ... and it has not shown that. regards, tom lane
Re: Improving spin-lock implementation on ARM.
Any updates or further inputs on this. On Wed, 2 Dec 2020 at 09:27, Krunal Bauskar wrote: > > > On Tue, 1 Dec 2020 at 22:19, Tom Lane wrote: > >> Alexander Korotkov writes: >> > On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar >> wrote: >> >> I would request you guys to re-think it from this perspective to help >> ensure that PGSQL can scale well on ARM. >> >> s_lock becomes a top-most function and LSE is not a universal solution >> but CAS surely helps ease the main bottleneck. >> >> > CAS patch isn't proven to be a universal solution as well. We have >> > tested the patch on just a few processors, and Tom has seen the >> > regression [1]. The benchmark used by Tom was artificial, but the >> > results may be relevant for some real-life workload. >> >> Yeah. I think that the main conclusion from what we've seen here is >> that on smaller machines like M1, a standard pgbench benchmark just >> isn't capable of driving PG into serious spinlock contention. (That >> reflects very well on the work various people have done over the years >> to get rid of spinlock contention, because ten or so years ago it was >> a huge problem on this size of machine. But evidently, not any more.) >> Per the results others have posted, nowadays you need dozens of cores >> and hundreds of client threads to measure any such issue with pgbench. >> >> So that is why I experimented with a special test that does nothing >> except pound on one spinlock. Sure it's artificial, but if you want >> to see the effects of different spinlock implementations then it's >> just too hard to get any results with pgbench's regular scripts. >> >> And that's why it disturbs me that the CAS-spinlock patch showed up >> worse in that environment. The fact that it's not visible in the >> regular pgbench test just means that the effect is too small to >> measure in that test. But in a test where we *can* measure an effect, >> it's not looking good. >> >> It would be interesting to see some results from the same test I did >> on other processors. I suspect the results would look a lot different >> from mine ... but we won't know unless someone does it. Or, if someone >> wants to propose some other test case, let's have a look. >> >> > I'm expressing just my personal opinion, other committers can have >> > different opinions. I don't particularly think this topic is >> > necessarily a non-starter. But I do think that given ambiguity we've >> > observed in the benchmark, much more research is needed to push this >> > topic forward. >> >> Yeah. I'm not here to say "do nothing". But I think we need results >> from more machines and more test cases to convince ourselves whether >> there's a consistent, worthwhile win from any specific patch. >> > > I think there is > *an ambiguity with lse and that has been the* > *source of some confusion* so let's make another attempt to > understand all the observations and then define the next steps. > > - > > > *1. CAS patch (applied on the baseline)* - Kunpeng: 10-45% improvement > observed [1] >- Graviton2: 30-50% improvement observed [2] >- M1: Only select results are available cas continue to maintain a > marginal gain but not significant. [3] > [inline with what we observed with Kunpeng and Graviton2 for select > results too]. > > > *2. Let's ignore CAS for a sec and just think of LSE independently* - > Kunpeng: regression observed >- Graviton2: gain observed >- M1: regression observed > [while lse probably is default explicitly enabling it with +lse > causes regression on the head itself [4]. > client=2/4: 1816/714 vs 892/610] > >There is enough reason not to immediately consider enabling LSE given > its unable to perform consistently on all hardware. > - > > With those 2 aspects clear let's evaluate what options we have in hand > > > *1. Enable CAS approach* *- What we gain:* pgsql scale on > Kunpeng/Graviton2 > (m1 awaiting read-write result but may marginally scale [[5]: "but > the patched numbers are only about a few percent better"]) >*- What we lose:* Nothing for now. > > > *2. LSE:* *- What we gain: *Scaled workload with Graviton2 > * - What we lose:* regression on M1 and Kunpeng. > > Let's think of both approaches independently. > > - Enabling CAS would help us scale on all hardware (Kunpeng/Graviton2/M1) > - Enabling LSE would help us scale only on some but regress on others. > [LSE could be considered in the future once it stabilizes and all > hardware adapts to it] > > --- > > *Let me know what do you think about this analysis and any specific > direction that we should consider to help move forward.* > > --- > > Links: > [1]: > https://www.postgresql.org/message-id/attach
Re: Improving spin-lock implementation on ARM.
> On 01/12/2020, 19:08, "Alexander Korotkov" wrote: >BTW, what number of clients did you use? I can't find it in your message. Sure. Important params seem to be: Pgbench: Clients: 256 pgbench_jobs : 32 Scale: 1000 fill_factor: 90 Postgresql: shared buffers: 31GB max_connections: 1024 Thanks! Tsahi.
Re: Improving spin-lock implementation on ARM.
On Tue, 1 Dec 2020 at 22:19, Tom Lane wrote: > Alexander Korotkov writes: > > On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar > wrote: > >> I would request you guys to re-think it from this perspective to help > ensure that PGSQL can scale well on ARM. > >> s_lock becomes a top-most function and LSE is not a universal solution > but CAS surely helps ease the main bottleneck. > > > CAS patch isn't proven to be a universal solution as well. We have > > tested the patch on just a few processors, and Tom has seen the > > regression [1]. The benchmark used by Tom was artificial, but the > > results may be relevant for some real-life workload. > > Yeah. I think that the main conclusion from what we've seen here is > that on smaller machines like M1, a standard pgbench benchmark just > isn't capable of driving PG into serious spinlock contention. (That > reflects very well on the work various people have done over the years > to get rid of spinlock contention, because ten or so years ago it was > a huge problem on this size of machine. But evidently, not any more.) > Per the results others have posted, nowadays you need dozens of cores > and hundreds of client threads to measure any such issue with pgbench. > > So that is why I experimented with a special test that does nothing > except pound on one spinlock. Sure it's artificial, but if you want > to see the effects of different spinlock implementations then it's > just too hard to get any results with pgbench's regular scripts. > > And that's why it disturbs me that the CAS-spinlock patch showed up > worse in that environment. The fact that it's not visible in the > regular pgbench test just means that the effect is too small to > measure in that test. But in a test where we *can* measure an effect, > it's not looking good. > > It would be interesting to see some results from the same test I did > on other processors. I suspect the results would look a lot different > from mine ... but we won't know unless someone does it. Or, if someone > wants to propose some other test case, let's have a look. > > > I'm expressing just my personal opinion, other committers can have > > different opinions. I don't particularly think this topic is > > necessarily a non-starter. But I do think that given ambiguity we've > > observed in the benchmark, much more research is needed to push this > > topic forward. > > Yeah. I'm not here to say "do nothing". But I think we need results > from more machines and more test cases to convince ourselves whether > there's a consistent, worthwhile win from any specific patch. > I think there is *an ambiguity with lse and that has been the* *source of some confusion* so let's make another attempt to understand all the observations and then define the next steps. - *1. CAS patch (applied on the baseline)* - Kunpeng: 10-45% improvement observed [1] - Graviton2: 30-50% improvement observed [2] - M1: Only select results are available cas continue to maintain a marginal gain but not significant. [3] [inline with what we observed with Kunpeng and Graviton2 for select results too]. *2. Let's ignore CAS for a sec and just think of LSE independently* - Kunpeng: regression observed - Graviton2: gain observed - M1: regression observed [while lse probably is default explicitly enabling it with +lse causes regression on the head itself [4]. client=2/4: 1816/714 vs 892/610] There is enough reason not to immediately consider enabling LSE given its unable to perform consistently on all hardware. - With those 2 aspects clear let's evaluate what options we have in hand *1. Enable CAS approach* *- What we gain:* pgsql scale on Kunpeng/Graviton2 (m1 awaiting read-write result but may marginally scale [[5]: "but the patched numbers are only about a few percent better"]) *- What we lose:* Nothing for now. *2. LSE:* *- What we gain: *Scaled workload with Graviton2 * - What we lose:* regression on M1 and Kunpeng. Let's think of both approaches independently. - Enabling CAS would help us scale on all hardware (Kunpeng/Graviton2/M1) - Enabling LSE would help us scale only on some but regress on others. [LSE could be considered in the future once it stabilizes and all hardware adapts to it] --- *Let me know what do you think about this analysis and any specific direction that we should consider to help move forward.* --- Links: [1]: https://www.postgresql.org/message-id/attachment/116612/Screenshot%20from%202020-12-01%2017-55-21.png [2]: https://www.postgresql.org/message-id/attachment/116521/arm-rw.png [3]: https://www.postgresql.org/message-id/1367116.1606802480%40sss.pgh.pa.us [4]: https://www.postgresql.org/message-id
Re: Improving spin-lock implementation on ARM.
> On 01/12/2020, 16:59, "Alexander Korotkov" wrote: > On Tue, Dec 1, 2020 at 1:10 PM Amit Khandekar wrote: > > FWIW, here is an earlier discussion on the same (also added the >> proposal author here) : Thanks for looping me in! >> >> > https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com > > >Thank you for pointing! I wonder why the effect of LSE on Graviton2 >observed by Tsahi Zidenberg is so modest. It's probably because he >runs the tests with a low number of clients. There are multiple possible reasons why I saw a smaller effect of LSE, but I think an important one was that I used a 32-core instance rather than a 64-core one. The reason I did so, was that 32-cores gave me better absolute results than 64 cores, and I didn't want to feel like I could misguide anyone. The 64-core instance results is a good example for the benefit of LSE. LSE becomes most important in edges, and with adversarial workloads. If multiple CPUs try to acquire a lock simultaneously - LSE ensures one CPU will indeed get the lock (with just one transaction), while LDRX/STRX could have multiple CPUS looping and no-one acquiring a lock. This is why I believe just looking at "reasonable" benchmarks misses out on effects real customers will run into. Happy to see another arm-optimization thread so quickly :) Thank you! Tsahi.
Re: Improving spin-lock implementation on ARM.
On Tue, Dec 1, 2020 at 7:57 PM Zidenberg, Tsahi wrote: > > On 01/12/2020, 16:59, "Alexander Korotkov" wrote: > > On Tue, Dec 1, 2020 at 1:10 PM Amit Khandekar > > wrote: > > > FWIW, here is an earlier discussion on the same (also added the > >> proposal author here) : > > Thanks for looping me in! > > >> > >> > > https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com > > > > > >Thank you for pointing! I wonder why the effect of LSE on Graviton2 > >observed by Tsahi Zidenberg is so modest. It's probably because he > >runs the tests with a low number of clients. > > There are multiple possible reasons why I saw a smaller effect of LSE, but I > think an important one was that I > used a 32-core instance rather than a 64-core one. The reason I did so, was > that 32-cores gave me better > absolute results than 64 cores, and I didn't want to feel like I could > misguide anyone. BTW, what number of clients did you use? I can't find it in your message. -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
Alexander Korotkov writes: > On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar wrote: >> I would request you guys to re-think it from this perspective to help ensure >> that PGSQL can scale well on ARM. >> s_lock becomes a top-most function and LSE is not a universal solution but >> CAS surely helps ease the main bottleneck. > CAS patch isn't proven to be a universal solution as well. We have > tested the patch on just a few processors, and Tom has seen the > regression [1]. The benchmark used by Tom was artificial, but the > results may be relevant for some real-life workload. Yeah. I think that the main conclusion from what we've seen here is that on smaller machines like M1, a standard pgbench benchmark just isn't capable of driving PG into serious spinlock contention. (That reflects very well on the work various people have done over the years to get rid of spinlock contention, because ten or so years ago it was a huge problem on this size of machine. But evidently, not any more.) Per the results others have posted, nowadays you need dozens of cores and hundreds of client threads to measure any such issue with pgbench. So that is why I experimented with a special test that does nothing except pound on one spinlock. Sure it's artificial, but if you want to see the effects of different spinlock implementations then it's just too hard to get any results with pgbench's regular scripts. And that's why it disturbs me that the CAS-spinlock patch showed up worse in that environment. The fact that it's not visible in the regular pgbench test just means that the effect is too small to measure in that test. But in a test where we *can* measure an effect, it's not looking good. It would be interesting to see some results from the same test I did on other processors. I suspect the results would look a lot different from mine ... but we won't know unless someone does it. Or, if someone wants to propose some other test case, let's have a look. > I'm expressing just my personal opinion, other committers can have > different opinions. I don't particularly think this topic is > necessarily a non-starter. But I do think that given ambiguity we've > observed in the benchmark, much more research is needed to push this > topic forward. Yeah. I'm not here to say "do nothing". But I think we need results from more machines and more test cases to convince ourselves whether there's a consistent, worthwhile win from any specific patch. regards, tom lane
Re: Improving spin-lock implementation on ARM.
On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar wrote: > I would request you guys to re-think it from this perspective to help ensure > that PGSQL can scale well on ARM. > s_lock becomes a top-most function and LSE is not a universal solution but > CAS surely helps ease the main bottleneck. CAS patch isn't proven to be a universal solution as well. We have tested the patch on just a few processors, and Tom has seen the regression [1]. The benchmark used by Tom was artificial, but the results may be relevant for some real-life workload. I'm expressing just my personal opinion, other committers can have different opinions. I don't particularly think this topic is necessarily a non-starter. But I do think that given ambiguity we've observed in the benchmark, much more research is needed to push this topic forward. Links. 1. https://www.postgresql.org/message-id/741389.1606530957%40sss.pgh.pa.us -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Tue, 1 Dec 2020 at 20:25, Alexander Korotkov wrote: > On Tue, Dec 1, 2020 at 3:44 PM Krunal Bauskar > wrote: > > I have completed benchmarking with lse. > > > > Graph attached. > > Thank you for benchmarking. > > Now I agree with this comment by Tom Lane > > > In general, I'm pretty skeptical of *all* the results posted so far on > > this thread, because everybody seems to be testing exactly one machine. > > If there's one thing that it's safe to assume about ARM, it's that > > there are a lot of different implementations; and this area seems very > > very likely to differ across implementations. > > Different ARM implementations look too different. As you pointed out, > LSE is enabled in gcc-10 by default. I doubt we can accept a patch, > which gives benefits for specific platform and only when the compiler > isn't very modern. Also, we didn't cover all ARM planforms. Given > they are so different, we can't guarantee that patch doesn't cause > regression of some ARM. Additionally, the effect of the CAS patch > even for Kunpeng seems modest. It makes the drop off of TPS more > smooth, but it doesn't change the trend. > There are 2 parts: ** Does CAS patch help scale PGSQL. Yes.* ** Is LSE beneficial for all architectures. Probably No.* The patch addresses only the former one which is true for all cases. (Enabling LSE should be an independent process). gcc-10 made it default but when I read [1] it quotes that canonical decided to remove it as default as part of* Ubuntu-20.04 which means LSE has not proven the test of canonical (probably).* Also, most of the distro has not yet started shipping GCC-10 which is way far before it makes it to all distro. So if we keep the LSE effect aside and just look at the patch from performance improvement it surely helps achieve a good gain. I see an improvement in the range of 10-40%. Amit during his independent testing also observed the gain in the same range and your testing with G-2 has re-attested the same point. Pardon me if this is modest as per pgsql standards. With 1024 scalability PGSQL on other arches (beyond ARM) struggle to scale so there is something more inherent that needs to be addressed from a generic perspective. Also, the said patch non-only helps pgbench kind of workload but other workloads too. -- I would request you guys to re-think it from this perspective to help ensure that PGSQL can scale well on ARM. s_lock becomes a top-most function and LSE is not a universal solution but CAS surely helps ease the main bottleneck. And surely let me know if more data is needed. Link: [1]: https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com > -- > Regards, > Alexander Korotkov > -- Regards, Krunal Bauskar
Re: Improving spin-lock implementation on ARM.
On Tue, Dec 1, 2020 at 1:10 PM Amit Khandekar wrote: > On Tue, 1 Dec 2020 at 15:33, Krunal Bauskar wrote: > > What I meant was outline-atomics support was added in GCC-9.4 and was made > > default in gcc-10. > > LSE support is present for quite some time. > > FWIW, here is an earlier discussion on the same (also added the > proposal author here) : > > https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com Thank you for pointing! I wonder why the effect of LSE on Graviton2 observed by Tsahi Zidenberg is so modest. It's probably because he runs the tests with a low number of clients. -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Tue, Dec 1, 2020 at 3:44 PM Krunal Bauskar wrote: > I have completed benchmarking with lse. > > Graph attached. Thank you for benchmarking. Now I agree with this comment by Tom Lane > In general, I'm pretty skeptical of *all* the results posted so far on > this thread, because everybody seems to be testing exactly one machine. > If there's one thing that it's safe to assume about ARM, it's that > there are a lot of different implementations; and this area seems very > very likely to differ across implementations. Different ARM implementations look too different. As you pointed out, LSE is enabled in gcc-10 by default. I doubt we can accept a patch, which gives benefits for specific platform and only when the compiler isn't very modern. Also, we didn't cover all ARM planforms. Given they are so different, we can't guarantee that patch doesn't cause regression of some ARM. Additionally, the effect of the CAS patch even for Kunpeng seems modest. It makes the drop off of TPS more smooth, but it doesn't change the trend. -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Tue, 1 Dec 2020 at 02:16, Alexander Korotkov wrote: > On Mon, Nov 30, 2020 at 9:21 PM Tom Lane wrote: > > Alexander Korotkov writes: > > > I tend to think that LSE is enabled by default in Apple's clang based > > > on your previous message[1]. In order to dispel the doubts could you > > > please provide assembly of SpinLockAcquire for following clang > > > options. > > > "-O2" > > > "-O2 -march=armv8-a+lse" > > > "-O2 -march=armv8-a" > > > > Huh. Those options make exactly zero difference to the code generated > > for SpinLockAcquire/SpinLockRelease; it's the same as I showed upthread, > > for either the HEAD definition of TAS() or the CAS patch's version. > > > > So now I'm at a loss as to the reason for the performance difference > > I got. -march=armv8-a+lse does make a difference to code generation > > someplace, because the overall size of the postgres executable changes > > by 16kB or so. One might argue that the performance difference is due > > to better code elsewhere than the spinlocks ... but the test I'm running > > is basically just > > > > while (count-- > 0) > > { > > XLogGetLastRemovedSegno(); > > > > CHECK_FOR_INTERRUPTS(); > > } > > > > so it's hard to see where a non-spinlock-related code change would come > > in. That loop itself definitely generates the same code either way. > > > > I did find this interesting output from "clang -v": > > > > -target-cpu vortex -target-feature +v8.3a -target-feature +fp-armv8 > -target-feature +neon -target-feature +crc -target-feature +crypto > -target-feature +fullfp16 -target-feature +ras -target-feature +lse > -target-feature +rdm -target-feature +rcpc -target-feature +zcm > -target-feature +zcz -target-feature +sha2 -target-feature +aes > > > > whereas adding -march=armv8-a+lse changes that to just > > > > -target-cpu vortex -target-feature +neon -target-feature +lse > -target-feature +zcm -target-feature +zcz > > > > On the whole, that would make one think that -march=armv8-a+lse > > should produce worse code than the default settings. > > Great, thanks. > > So, I think the following hypothesis isn't disproved yet. > 1) On ARM with LSE support, PostgreSQL built with LSE is faster than > PostgreSQL built without LSE. Even if the latter is patched with > anything considered in this thread. > 2) None of the patches considered in this thread give a clear > advantage for PostgreSQL built with LSE. > > To further confirm this let's wait for Kunpeng 920 tests by Krunal > Bauskar and Amit Khandekar. Also it would be nice if someone will run > benchmarks similar to [1] on Apple M1. > -- I have completed benchmarking with lse. Graph attached. * For me, lse failed to perform. With lse enabled performance with higher scalability was infact less than baseline (with TAS). [This is true with and without patch (cas)]. So lse is unable to push things. * Graph traces all 4 lines (baseline (tas), baseline-patched (cas), baseline-lse (tas+lse), baseline-patched-lse (cas+lse)) - for select, there is no clear winner but baseline-patched-lse (cas+lse) perform better than others (2-4%). - for update/tpcb like baseline-patched (cas) continue to outperform all the 3 options (10-40%). In fact, with lse enabled a regression (compared to baseline/head) is observed. [I am not surprised since MySQL showed the same behavior with lse of-course with a lesser percentage]. I have repeated the test 2 times to confirm the behavior. Also, to reduce noise I normally run 4 rounds discarding 1st round and taking the median of the last 3 rounds. In all, I see consistent numbers. conf: https://github.com/mysqlonarm/benchmark-suites/blob/master/pgsql-pbench/conf/pgsql.cnf/postgresql.conf --- >From all the observations till now following points are clear: * Patch doesn't seem to make a significant difference with lse enabled. Ir-respective of the machine (Kunpeng, Graviton2, M1). [infact introduces regression with Kunpeng]. * Patch helps generate significant +ve difference in performace with lse disabled. Observed and Confirmed with both Kunpeng, Graviton2. (not possible to disable lse on m1). * Does the patch makes things worse with lse enabled. * Graviton2: No (based on Alexander graph) * M1: No (based on Tom's graph. Wondering why was debug build used (--enable-debug?)) * Kunpeng: regression observed. -- Should we enable lse? Based on the comment from [1] I see pgsql aligns with decision made by compiler/distribution vendors. If the compiler/distribution vendors has not made something default then there could be reason for it. - Without lse as we have seen patch continue to score. Effect of
Re: Improving spin-lock implementation on ARM.
On Tue, 1 Dec 2020 at 15:33, Krunal Bauskar wrote: > What I meant was outline-atomics support was added in GCC-9.4 and was made > default in gcc-10. > LSE support is present for quite some time. FWIW, here is an earlier discussion on the same (also added the proposal author here) : https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com
Re: Improving spin-lock implementation on ARM.
On Tue, 1 Dec 2020 at 15:16, Alexander Korotkov wrote: > On Tue, Dec 1, 2020 at 6:26 AM Krunal Bauskar > wrote: > > On Tue, 1 Dec 2020 at 02:31, Alexander Korotkov > wrote: > >> BTW, how do you get that required gcc version is 9.4? I've managed to > >> use LSE with gcc 9.3. > > > > Did they backported it to 9.3? > > I am just looking at the gcc guide. > > https://gcc.gnu.org/gcc-9/changes.html > > > > GCC 9.4 > > > > Target Specific Changes > > > > AArch64 > > > > The option -moutline-atomics has been added to aid deployment of the > Large System Extensions (LSE) > > No, you've misread this release notes item. This item relates to a > new option for LSE support. See the rest of the description. > Some confusion here. What I meant was outline-atomics support was added in GCC-9.4 and was made default in gcc-10. LSE support is present for quite some time. In one of my up-thread reply I mentioned about the outline atomic that would help us enable lse if underline architecture support it. 2. As you pointed out LSE is enabled starting only with arm-v8.1 but not all aarch64 tag machines are arm-v8.1 compatible. This means we would need a separate package or a more optimal way would be to compile pgsql with gcc-9.4 (or gcc-10.x (default)) with -moutline-atomics that would emit both traditional and lse code and flow would dynamically select depending on the target machine (I have blogged about it in MySQL context https://mysqlonarm.github.io/ARM-LSE-and-MySQL/) Problem with this approach is it would need gcc-9.4+ (I also ready gcc-9.3.1 would do) or gcc10.x but most of the distro doesn't support it so user using old compiler especially default one shipped with OS would not be able to take advantage of this. > > "When the option is specified code is emitted to detect the presence > of LSE instructions at runtime and use them for standard atomic > operations. For more information please refer to the documentation." > > LSE support itself was introduced in gcc 6. > https://gcc.gnu.org/gcc-6/changes.html > * The ARMv8.1-A architecture and the Large System Extensions are now > supported. They can be used by specifying the -march=armv8.1-a option. > Additionally, the +lse option extension can be used in a similar > fashion to other option extensions. The Large System Extensions > introduce new instructions that are used in the implementation of > atomic operations. > > But, -moutline-atomics looks very interesting itself. I think we > should add this option to our arm64 packages if we haven't already. > > -- > Regards, > Alexander Korotkov > -- Regards, Krunal Bauskar
Re: Improving spin-lock implementation on ARM.
On Tue, Dec 1, 2020 at 9:01 AM Tom Lane wrote: > I did what I could in this department. It's late and I'm not going to > have time to run read/write benchmarks before bed, but here are some > results for the "pgbench -S" cases. I tried to match your testing > choices, but could not entirely: > > * Configure options are --enable-debug, --disable-cassert, no other > special configure options or CFLAG choices. > > * I have not been able to find a way to make Apple's compiler not > use the LSE spinlock instructions, so all of these correspond to > your LSE cases. > > * I used shared_buffers = 1GB, because this machine only has 16GB > RAM so 32GB is clearly out of reach. Also I used pgbench scale > factor 100 not 1000. Since we're trying to measure contention > effects not I/O speed, I don't think a huge test case is appropriate. > > * I still haven't gotten pgbench to work with -j settings above 128, > so these runs use -j equal to half -c. Shouldn't really affect > conclusions about scaling. (BTW, I see a similar limitation on > macOS Catalina x86_64, so whatever that is, it's not new.) > > * Otherwise, the first plot shows median of three results from > "pgbench -S -M prepared -T 120 -c $n -j $j", as you had it. > The right-hand plot shows all three of the values in yerrorbars > format, just to give a sense of the noise level. > > Clearly, there is something weird going on at -c 4. There's a cluster > of results around 180K TPS, and another cluster around 210-220K TPS, > and nothing in between. I suspect that the scheduler is doing > something bogus with sometimes putting pgbench onto the slow cores. > Anyway, I believe that the apparent gap between HEAD and the other > curves at -c 4 is probably an artifact: HEAD had two 180K-ish results > and one 220K-ish result, while the other curves had the reverse, so > the medians are different but there's probably not any non-chance > effect there. > > Bottom line is that these patches don't appear to do much of > anything on M1, as you surmised. Great, thank you for your research! -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Tue, Dec 1, 2020 at 6:26 AM Krunal Bauskar wrote: > On Tue, 1 Dec 2020 at 02:31, Alexander Korotkov wrote: >> BTW, how do you get that required gcc version is 9.4? I've managed to >> use LSE with gcc 9.3. > > Did they backported it to 9.3? > I am just looking at the gcc guide. > https://gcc.gnu.org/gcc-9/changes.html > > GCC 9.4 > > Target Specific Changes > > AArch64 > > The option -moutline-atomics has been added to aid deployment of the Large > System Extensions (LSE) No, you've misread this release notes item. This item relates to a new option for LSE support. See the rest of the description. "When the option is specified code is emitted to detect the presence of LSE instructions at runtime and use them for standard atomic operations. For more information please refer to the documentation." LSE support itself was introduced in gcc 6. https://gcc.gnu.org/gcc-6/changes.html * The ARMv8.1-A architecture and the Large System Extensions are now supported. They can be used by specifying the -march=armv8.1-a option. Additionally, the +lse option extension can be used in a similar fashion to other option extensions. The Large System Extensions introduce new instructions that are used in the implementation of atomic operations. But, -moutline-atomics looks very interesting itself. I think we should add this option to our arm64 packages if we haven't already. -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
Alexander Korotkov writes: > 2) None of the patches considered in this thread give a clear > advantage for PostgreSQL built with LSE. Yeah, I think so. > To further confirm this let's wait for Kunpeng 920 tests by Krunal > Bauskar and Amit Khandekar. Also it would be nice if someone will run > benchmarks similar to [1] on Apple M1. I did what I could in this department. It's late and I'm not going to have time to run read/write benchmarks before bed, but here are some results for the "pgbench -S" cases. I tried to match your testing choices, but could not entirely: * Configure options are --enable-debug, --disable-cassert, no other special configure options or CFLAG choices. * I have not been able to find a way to make Apple's compiler not use the LSE spinlock instructions, so all of these correspond to your LSE cases. * I used shared_buffers = 1GB, because this machine only has 16GB RAM so 32GB is clearly out of reach. Also I used pgbench scale factor 100 not 1000. Since we're trying to measure contention effects not I/O speed, I don't think a huge test case is appropriate. * I still haven't gotten pgbench to work with -j settings above 128, so these runs use -j equal to half -c. Shouldn't really affect conclusions about scaling. (BTW, I see a similar limitation on macOS Catalina x86_64, so whatever that is, it's not new.) * Otherwise, the first plot shows median of three results from "pgbench -S -M prepared -T 120 -c $n -j $j", as you had it. The right-hand plot shows all three of the values in yerrorbars format, just to give a sense of the noise level. Clearly, there is something weird going on at -c 4. There's a cluster of results around 180K TPS, and another cluster around 210-220K TPS, and nothing in between. I suspect that the scheduler is doing something bogus with sometimes putting pgbench onto the slow cores. Anyway, I believe that the apparent gap between HEAD and the other curves at -c 4 is probably an artifact: HEAD had two 180K-ish results and one 220K-ish result, while the other curves had the reverse, so the medians are different but there's probably not any non-chance effect there. Bottom line is that these patches don't appear to do much of anything on M1, as you surmised. regards, tom lane
Re: Improving spin-lock implementation on ARM.
On Tue, 1 Dec 2020 at 02:31, Alexander Korotkov wrote: > On Mon, Nov 30, 2020 at 7:00 AM Krunal Bauskar > wrote: > > 3. Problem with GCC approach is still a lot of distro don't support gcc > 9.4 as default. > > To use this approach: > > * PGSQL will have to roll out its packages using gcc-9.4+ only so > that they are compatible with all aarch64 machines > > * but this continues to affect all other users who tend to build > pgsql using standard distro based compiler. (unless they upgrade compiler). > > I think if a user, who runs PostgreSQL on a multicore machine with > high-concurrent load, can take care about installing the appropriate > package/using the appropriate compiler (especially if we publish > explicit instructions for that). In the same way such advanced users > tune Linux kernel etc. > > BTW, how do you get that required gcc version is 9.4? I've managed to > use LSE with gcc 9.3. > Did they backported it to 9.3? I am just looking at the gcc guide. https://gcc.gnu.org/gcc-9/changes.html GCC 9.4Target Specific ChangesAArch64 - The option -moutline-atomics has been added to aid deployment of the Large System Extensions (LSE) > > -- > Regards, > Alexander Korotkov > -- Regards, Krunal Bauskar
Re: Improving spin-lock implementation on ARM.
On Mon, Nov 30, 2020 at 7:00 AM Krunal Bauskar wrote: > 3. Problem with GCC approach is still a lot of distro don't support gcc 9.4 > as default. > To use this approach: > * PGSQL will have to roll out its packages using gcc-9.4+ only so that > they are compatible with all aarch64 machines > * but this continues to affect all other users who tend to build pgsql > using standard distro based compiler. (unless they upgrade compiler). I think if a user, who runs PostgreSQL on a multicore machine with high-concurrent load, can take care about installing the appropriate package/using the appropriate compiler (especially if we publish explicit instructions for that). In the same way such advanced users tune Linux kernel etc. BTW, how do you get that required gcc version is 9.4? I've managed to use LSE with gcc 9.3. -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Mon, Nov 30, 2020 at 9:21 PM Tom Lane wrote: > Alexander Korotkov writes: > > I tend to think that LSE is enabled by default in Apple's clang based > > on your previous message[1]. In order to dispel the doubts could you > > please provide assembly of SpinLockAcquire for following clang > > options. > > "-O2" > > "-O2 -march=armv8-a+lse" > > "-O2 -march=armv8-a" > > Huh. Those options make exactly zero difference to the code generated > for SpinLockAcquire/SpinLockRelease; it's the same as I showed upthread, > for either the HEAD definition of TAS() or the CAS patch's version. > > So now I'm at a loss as to the reason for the performance difference > I got. -march=armv8-a+lse does make a difference to code generation > someplace, because the overall size of the postgres executable changes > by 16kB or so. One might argue that the performance difference is due > to better code elsewhere than the spinlocks ... but the test I'm running > is basically just > > while (count-- > 0) > { > XLogGetLastRemovedSegno(); > > CHECK_FOR_INTERRUPTS(); > } > > so it's hard to see where a non-spinlock-related code change would come > in. That loop itself definitely generates the same code either way. > > I did find this interesting output from "clang -v": > > -target-cpu vortex -target-feature +v8.3a -target-feature +fp-armv8 > -target-feature +neon -target-feature +crc -target-feature +crypto > -target-feature +fullfp16 -target-feature +ras -target-feature +lse > -target-feature +rdm -target-feature +rcpc -target-feature +zcm > -target-feature +zcz -target-feature +sha2 -target-feature +aes > > whereas adding -march=armv8-a+lse changes that to just > > -target-cpu vortex -target-feature +neon -target-feature +lse -target-feature > +zcm -target-feature +zcz > > On the whole, that would make one think that -march=armv8-a+lse > should produce worse code than the default settings. Great, thanks. So, I think the following hypothesis isn't disproved yet. 1) On ARM with LSE support, PostgreSQL built with LSE is faster than PostgreSQL built without LSE. Even if the latter is patched with anything considered in this thread. 2) None of the patches considered in this thread give a clear advantage for PostgreSQL built with LSE. To further confirm this let's wait for Kunpeng 920 tests by Krunal Bauskar and Amit Khandekar. Also it would be nice if someone will run benchmarks similar to [1] on Apple M1. Links 1. https://www.postgresql.org/message-id/CAPpHfdsGqVd6EJ4mr_RZVE5xSiCNBy4MuSvdTrKmTpM0eyWGpg%40mail.gmail.com -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
Alexander Korotkov writes: > I tend to think that LSE is enabled by default in Apple's clang based > on your previous message[1]. In order to dispel the doubts could you > please provide assembly of SpinLockAcquire for following clang > options. > "-O2" > "-O2 -march=armv8-a+lse" > "-O2 -march=armv8-a" Huh. Those options make exactly zero difference to the code generated for SpinLockAcquire/SpinLockRelease; it's the same as I showed upthread, for either the HEAD definition of TAS() or the CAS patch's version. So now I'm at a loss as to the reason for the performance difference I got. -march=armv8-a+lse does make a difference to code generation someplace, because the overall size of the postgres executable changes by 16kB or so. One might argue that the performance difference is due to better code elsewhere than the spinlocks ... but the test I'm running is basically just while (count-- > 0) { XLogGetLastRemovedSegno(); CHECK_FOR_INTERRUPTS(); } so it's hard to see where a non-spinlock-related code change would come in. That loop itself definitely generates the same code either way. I did find this interesting output from "clang -v": -target-cpu vortex -target-feature +v8.3a -target-feature +fp-armv8 -target-feature +neon -target-feature +crc -target-feature +crypto -target-feature +fullfp16 -target-feature +ras -target-feature +lse -target-feature +rdm -target-feature +rcpc -target-feature +zcm -target-feature +zcz -target-feature +sha2 -target-feature +aes whereas adding -march=armv8-a+lse changes that to just -target-cpu vortex -target-feature +neon -target-feature +lse -target-feature +zcm -target-feature +zcz On the whole, that would make one think that -march=armv8-a+lse should produce worse code than the default settings. regards, tom lane
Re: Improving spin-lock implementation on ARM.
On Mon, Nov 30, 2020 at 9:08 AM Tom Lane wrote: > Krunal Bauskar writes: > > On Mon, 30 Nov 2020 at 10:14, Tom Lane wrote: > >> The results I posted at [1] seem to contradict this for Apple's new > >> machines. > > > For the results you saw on Mac-Mini was LSE enabled by default. > > Hmm, I don't know how to get Apple's clang to admit what its default > settings are ... anybody? > > However, it does accept "-march=armv8-a+lse", and that seems to > not be the default, because I get different results from my spinlock- > pounding test than I did yesterday. Abbreviating into a table: > > --- CFLAGS=-O2 --- --- CFLAGS="-O2 -march=armv8-a+lse" > --- > > TPS HEADCAS patch HEADCAS patch > > clients=1 2127217426122722 > clients=2 1816859 892 950 > clients=4 714 519 610 468 > clients=8 - - 108 185 > > Unfortunately, that still doesn't lead me to think that either LSE > or CAS are net wins on this hardware. It's quite clear that LSE > makes the uncontended case a good bit faster, but the contended case > is a lot worse, so is that really a tradeoff we want? I tend to think that LSE is enabled by default in Apple's clang based on your previous message[1]. In order to dispel the doubts could you please provide assembly of SpinLockAcquire for following clang options. "-O2" "-O2 -march=armv8-a+lse" "-O2 -march=armv8-a" Thank you! Links 1. https://www.postgresql.org/message-id/663376.1606432828%40sss.pgh.pa.us -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Mon, Nov 30, 2020 at 9:20 AM Krunal Bauskar wrote: > Some of us may be surprised by the fact that enabling lse is causing > regression (1816 -> 892 or 714 -> 610) with HEAD itself. > While lse is meant to improve the performance. This, unfortunately, is not > always the case at-least based on my previous experience with LSE.too. I doubt this is correct. As Tom shown upthread, Apple clang have LSE enabled by default [1]. It might happen that --CFLAGS="-O2 -march=armv8-a+lse" disables some other optimizations, which are enabled by default in Apple clang... Links 1. https://www.postgresql.org/message-id/663376.1606432828%40sss.pgh.pa.us -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Mon, 30 Nov 2020 at 11:38, Tom Lane wrote: > Krunal Bauskar writes: > > On Mon, 30 Nov 2020 at 10:14, Tom Lane wrote: > >> The results I posted at [1] seem to contradict this for Apple's new > >> machines. > > > For the results you saw on Mac-Mini was LSE enabled by default. > > Hmm, I don't know how to get Apple's clang to admit what its default > settings are ... anybody? > > However, it does accept "-march=armv8-a+lse", and that seems to > not be the default, because I get different results from my spinlock- > pounding test than I did yesterday. Abbreviating into a table: > > --- CFLAGS=-O2 --- --- CFLAGS="-O2 > -march=armv8-a+lse" --- > > TPS HEADCAS patch HEADCAS patch > > clients=1 2127217426122722 > clients=2 1816859 892 950 > clients=4 714 519 610 468 > clients=8 - - 108 185 > Thanks for trying this Tom. - Some of us may be surprised by the fact that enabling lse is causing regression (1816 -> 892 or 714 -> 610) with HEAD itself. While lse is meant to improve the performance. This, unfortunately, is not always the case at-least based on my previous experience with LSE.too. I am still wondering why CAS is slower than TAS on M1. What is special on M1 that other ARM archs has not picked up. Tom, Sorry to bother you again but this is arising a lot of curiosity about M1. Whenever you get time can do some micro-benchmarking on M1 (to understand TAS vs CAS). Also, if you can share assembly code is emitted for the TAS vs CAS. > > Unfortunately, that still doesn't lead me to think that either LSE > or CAS are net wins on this hardware. It's quite clear that LSE > makes the uncontended case a good bit faster, but the contended case > is a lot worse, so is that really a tradeoff we want? > > > * I would also suggest if possible try with higher scalability (more > than 4 > > to check if with increase scalability CAS out-perform). > > As I said yesterday, running more than 4 processes is just going > to bring the low-performance cores into the equation, which is likely > to swamp any interesting comparison. I did run the test with "-c 8" > today, as shown in the right-hand columns, and the results seem > to bear that out. > > regards, tom lane > -- Regards, Krunal Bauskar
Re: Improving spin-lock implementation on ARM.
Krunal Bauskar writes: > On Mon, 30 Nov 2020 at 10:14, Tom Lane wrote: >> The results I posted at [1] seem to contradict this for Apple's new >> machines. > For the results you saw on Mac-Mini was LSE enabled by default. Hmm, I don't know how to get Apple's clang to admit what its default settings are ... anybody? However, it does accept "-march=armv8-a+lse", and that seems to not be the default, because I get different results from my spinlock- pounding test than I did yesterday. Abbreviating into a table: --- CFLAGS=-O2 --- --- CFLAGS="-O2 -march=armv8-a+lse" --- TPS HEADCAS patch HEADCAS patch clients=1 2127217426122722 clients=2 1816859 892 950 clients=4 714 519 610 468 clients=8 - - 108 185 Unfortunately, that still doesn't lead me to think that either LSE or CAS are net wins on this hardware. It's quite clear that LSE makes the uncontended case a good bit faster, but the contended case is a lot worse, so is that really a tradeoff we want? > * I would also suggest if possible try with higher scalability (more than 4 > to check if with increase scalability CAS out-perform). As I said yesterday, running more than 4 processes is just going to bring the low-performance cores into the equation, which is likely to swamp any interesting comparison. I did run the test with "-c 8" today, as shown in the right-hand columns, and the results seem to bear that out. regards, tom lane
Re: Improving spin-lock implementation on ARM.
On Mon, 30 Nov 2020 at 10:14, Tom Lane wrote: > Krunal Bauskar writes: > > So given all the permutations and combinations, I think we could approach > > the problem as follows: > > > * Enable use of CAS as it is known to have optimal performance (vs TAS) > > The results I posted at [1] seem to contradict this for Apple's new > machines. > > In general, I'm pretty skeptical of *all* the results posted so far on > this thread, because everybody seems to be testing exactly one machine. > If there's one thing that it's safe to assume about ARM, it's that > there are a lot of different implementations; and this area seems very > very likely to differ across implementations. > For the results you saw on Mac-Mini was LSE enabled by default. Maybe clang does it on mac by default to harvest performance. If that is the case then your results are still inline with what Alexander posted. On other hand, Peter saw some % improvement on M1 MacBook. -- * Can you please check/confirm the LSE enable status on MAC (default). * Also, if possible run some quick micro-benchmark to help understand how TAS and CAS perform on MAC. It would be a big surprise to learn that M1 can execute TAS better than CAS (without lse). * I would also suggest if possible try with higher scalability (more than 4 to check if with increase scalability CAS out-perform). Results I have seen to date (in different contexts too), CAS has outperformed on Kunpeng, Graviton2, (Other ARM arch), ... (matches with theory too unless LSE is enabled). If M1 goes the other way around that would be a big surprise for the arm community too. > > I don't have a big problem with catering for a few different spinlock > implementations on ARM ... but it's sure unclear how we could decide > which one to use. > > regards, tom lane > > [1] https://www.postgresql.org/message-id/741389.1606530...@sss.pgh.pa.us > -- Regards, Krunal Bauskar
Re: Improving spin-lock implementation on ARM.
Krunal Bauskar writes: > So given all the permutations and combinations, I think we could approach > the problem as follows: > * Enable use of CAS as it is known to have optimal performance (vs TAS) The results I posted at [1] seem to contradict this for Apple's new machines. In general, I'm pretty skeptical of *all* the results posted so far on this thread, because everybody seems to be testing exactly one machine. If there's one thing that it's safe to assume about ARM, it's that there are a lot of different implementations; and this area seems very very likely to differ across implementations. I don't have a big problem with catering for a few different spinlock implementations on ARM ... but it's sure unclear how we could decide which one to use. regards, tom lane [1] https://www.postgresql.org/message-id/741389.1606530...@sss.pgh.pa.us
Re: Improving spin-lock implementation on ARM.
On Sun, 29 Nov 2020 at 22:23, Alexander Korotkov wrote: > On Sat, Nov 28, 2020 at 1:31 PM Alexander Korotkov > wrote: > > I guess that might depend on the implementation of CAS and TAS. I bet > > usage of CAS in spinlock gives advantage when ldxr/stxr are used, but > > not when swpal/casa are used. I found out that I can force clang to > > use swpal/casa by setting "-march=armv8-a+lse". I'm going to make > > some experiments on a multicore AWS graviton2 instance with different > > atomic implementation. > > I've made some benchmarks on c6gd.16xlarge ec2 instance with graviton2 > processor of 64 virtual CPUs (graphs and raw results are attached). > I've analyzed two patches: spinlock using cas by Krunal Bauskar, and > my implementation of lwlock using lwrex/strex. My arm lwlock patch > has the same idea as my previous patch for power: we can put lwlock > attempt logic between lwrex and strex. In spite of my previous power > patch, the arm patch doesn't contain assembly: instead I've used > C-wrappers over lwrex/strex. > > The first series of experiments I've made using standard compiling > options. So, LSE instructions from ARM v8.1 weren't used. Atomics > were implemented using lwrex/strex pair. > > In the read-only benchmark, both spinlock (cas-spinlock graph) and > lwlock (ldrew-strex-lwlock graph) patches give observable performance > gain of similar value. However, performance of combination of these > patches (ldrew-strex-lwlock-cas-spinlock graph) is close to > performance of unpatched version. That could be counterintuitive, but > I've rechecked that multiple times. > > In the read-write benchmark, both spinlock and lwlock patches give > more significant performance gain, and lwlock patch gives more effect > than spinlock patch. Noticeable, that combination of patches now > gives some cumulative effect instead of counterintuitive slowdown. > > Then I've tried to compile postgres with LSE instruction using > "-march=armv8-a+lse" flag with clang (graphs with -lse suffix). The > effect of LSE is HUGE!!! Unpatched version with LSE is times faster > than any version without LSE on high concurrency. In the both > read-only and read-write benchmarks spinlock patch doesn't show any > significant difference. The lwlock patch shows a great slowdown with > LSE. Noticeable, in read-write benchmark, lwlock patch shows worse > results than unpatched version without LSE. Probably, combining > different atomics implementations isn't a good idea. > > It seems that ARM Kunpeng 920 should support ARM v8.1. I wonder if > the published benchmarks results were made with LSE. I suspect that > it was not. It would be nice to repeat the same benchmarks with LSE. > I'd like to ask Krunal Bauskar and Amit Khandekar to repeat these > benchmarks with LSE. > > My preliminary conclusions are so: > 1) Since the effect of LSE is so huge, we should advise users of > multicore ARM servers to compile PostgreSQL with LSE support. We > probably should provide separate packaging for ARM v8.1 and higher > (packages for ARM v8 are still needed for raspberry etc). > 2) It seems that atomics in ARM v8.1 becomes very similar to x86 > atomics, and it doesn't need special optimizations. And I think ARM > v8 processors don't have so many cores and aren't so heavily used in > high-concurrent environments. So, special optimizations for ARM v8 > probably aren't worth it. > Thanks for the detailed results. 1. Results we shared are w/o lse enabled so using traditional store/load approach. 2. As you pointed out LSE is enabled starting only with arm-v8.1 but not all aarch64 tag machines are arm-v8.1 compatible. This means we would need a separate package or a more optimal way would be to compile pgsql with gcc-9.4 (or gcc-10.x (default)) with -moutline-atomics that would emit both traditional and lse code and flow would dynamically select depending on the target machine (I have blogged about it in MySQL context https://mysqlonarm.github.io/ARM-LSE-and-MySQL/) 3. Problem with GCC approach is still a lot of distro don't support gcc 9.4 as default. To use this approach: * PGSQL will have to roll out its packages using gcc-9.4+ only so that they are compatible with all aarch64 machines * but this continues to affect all other users who tend to build pgsql using standard distro based compiler. (unless they upgrade compiler). So given all the permutations and combinations, I think we could approach the problem as follows: * Enable use of CAS as it is known to have optimal performance (vs TAS) * Even with LSE enabled, CAS to continue to perform (on par or marginally better than TAS) * Add a patch to compile pgsql with outline-atomics if set GCC supports it so the dynamic 2-way compatible code is emitted. Alexander, We will surely benchmark using LSE on Kunpeng 920 and share the result. I am a bit surprised to see things scale by 4-5x times just by switchin
Re: Improving spin-lock implementation on ARM.
On Thu, Nov 26, 2020 at 7:35 AM Krunal Bauskar wrote: > * x86 uses optimized xchg operation. > ARM too started supporting it (using Large System Extension) with > ARM-v8.1 but since it not supported with ARM-v8, GCC default tends > to roll more generic load-store assembly code. > > * gcc-9.4+ onwards there is support for outline-atomics that could emit > both the variants of the code (load-store and cas/swp) and based on > underlying supported architecture proper variant it used but still a lot > of distros don't support GCC-9.4 as the default compiler. I've checked that gcc 9.3 with "-march=armv8-a+lse" option uses LSE atomics... -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Sat, Nov 28, 2020 at 5:36 AM Tom Lane wrote: > So at least on Apple's hardware, it seems like the CAS > implementation might be a shade faster when uncontended, > but it's very clearly worse when there is contention for > the spinlock. That's interesting, because the argument > that CAS should involve strictly less work seems valid ... > but that's what I'm getting. > > It might be useful to try this on other ARM platforms, > but I lack the energy right now (plus the only other > thing I've got is a Raspberry Pi, which might not be > something we particularly care about performance-wise). I guess that might depend on the implementation of CAS and TAS. I bet usage of CAS in spinlock gives advantage when ldxr/stxr are used, but not when swpal/casa are used. I found out that I can force clang to use swpal/casa by setting "-march=armv8-a+lse". I'm going to make some experiments on a multicore AWS graviton2 instance with different atomic implementation. -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
I wrote: > It might be that this hardware is capable of showing a difference with a > better-tuned pgbench test, but with an untuned pgbench run, we just aren't > sufficiently sensitive to the spinlock properties. (Which I guess is good > news, really.) It occurred to me that if we don't insist on a semi-realistic test case, it's not that hard to just pound on a spinlock and see what happens. I made up a simple C function (attached) to repeatedly call XLogGetLastRemovedSegno, which is basically just a spinlock acquire/release. Using this as a "transaction": $ cat bench.sql select drive_spinlocks(5); I get this with HEAD: $ pgbench -f bench.sql -n -T 60 -c 1 bench transaction type: bench.sql scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 127597 latency average = 0.470 ms tps = 2126.479699 (including connections establishing) tps = 2126.595015 (excluding connections establishing) $ pgbench -f bench.sql -n -T 60 -c 2 bench transaction type: bench.sql scaling factor: 1 query mode: simple number of clients: 2 number of threads: 1 duration: 60 s number of transactions actually processed: 108979 latency average = 1.101 ms tps = 1816.051930 (including connections establishing) tps = 1816.150556 (excluding connections establishing) $ pgbench -f bench.sql -n -T 60 -c 4 bench transaction type: bench.sql scaling factor: 1 query mode: simple number of clients: 4 number of threads: 1 duration: 60 s number of transactions actually processed: 42862 latency average = 5.601 ms tps = 714.202152 (including connections establishing) tps = 714.237301 (excluding connections establishing) (With only 4 high-performance cores, it's probably not interesting to go further; involving the slower cores will just confuse matters.) And this with the patch: $ pgbench -f bench.sql -n -T 60 -c 1 bench transaction type: bench.sql scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 130455 latency average = 0.460 ms tps = 2174.098284 (including connections establishing) tps = 2174.217097 (excluding connections establishing) $ pgbench -f bench.sql -n -T 60 -c 2 bench transaction type: bench.sql scaling factor: 1 query mode: simple number of clients: 2 number of threads: 1 duration: 60 s number of transactions actually processed: 51533 latency average = 2.329 ms tps = 858.765176 (including connections establishing) tps = 858.811132 (excluding connections establishing) $ pgbench -f bench.sql -n -T 60 -c 4 bench transaction type: bench.sql scaling factor: 1 query mode: simple number of clients: 4 number of threads: 1 duration: 60 s number of transactions actually processed: 31154 latency average = 7.705 ms tps = 519.116788 (including connections establishing) tps = 519.144375 (excluding connections establishing) So at least on Apple's hardware, it seems like the CAS implementation might be a shade faster when uncontended, but it's very clearly worse when there is contention for the spinlock. That's interesting, because the argument that CAS should involve strictly less work seems valid ... but that's what I'm getting. It might be useful to try this on other ARM platforms, but I lack the energy right now (plus the only other thing I've got is a Raspberry Pi, which might not be something we particularly care about performance-wise). regards, tom lane /* create function drive_spinlocks(count int) returns void strict volatile language c as '.../spinlocktest.so'; */ #include "postgres.h" #include "access/xlog.h" #include "fmgr.h" #include "miscadmin.h" PG_MODULE_MAGIC; /* * drive_spinlocks(count int) returns void */ PG_FUNCTION_INFO_V1(drive_spinlocks); Datum drive_spinlocks(PG_FUNCTION_ARGS) { int32 count = PG_GETARG_INT32(0); while (count-- > 0) { XLogGetLastRemovedSegno(); CHECK_FOR_INTERRUPTS(); } PG_RETURN_VOID(); }
Re: Improving spin-lock implementation on ARM.
Peter Eisentraut writes: > I tried this on a M1 MacBook Air. I cannot reproduce these results. > The unpatched numbers are about in the neighborhood of what you showed, > but the patched numbers are only about a few percent better, not the > 1.5x or 2x change that you showed. After redoing the test, I can't find any outside-the-noise difference at all between HEAD and the patch. So clearly, I screwed up yesterday. The most likely theory is that I managed to measure an assert-enabled build of HEAD. It might be that this hardware is capable of showing a difference with a better-tuned pgbench test, but with an untuned pgbench run, we just aren't sufficiently sensitive to the spinlock properties. (Which I guess is good news, really.) One thing that did hold up is that the thermal performance of this box is pretty ridiculous. After being beat on for a solid hour, the fan still hasn't turned on to any noticeable level, and the enclosure is only a little warm to the touch. Try that with Intel hardware ;-) regards, tom lane
Re: Improving spin-lock implementation on ARM.
On 2020-11-26 23:55, Tom Lane wrote: ... and, after retrieving my jaw from the floor, I present the attached. Apple's chips evidently like this style of spinlock a LOT better. The difference is so remarkable that I wonder if I made a mistake somewhere. Can anyone else replicate these results? I tried this on a M1 MacBook Air. I cannot reproduce these results. The unpatched numbers are about in the neighborhood of what you showed, but the patched numbers are only about a few percent better, not the 1.5x or 2x change that you showed.
Re: Improving spin-lock implementation on ARM.
On Fri, Nov 27, 2020 at 11:55 AM Michael Paquier wrote: > Not planning to buy one here, anything I have read on that tells that > it is worth a performance study. Another interesting area for experiments is AWS graviton2 instances. Specification says it supports arm v8.2, so it should have swpal/casa instructions as well. -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Fri, Nov 27, 2020 at 02:50:30AM -0500, Tom Lane wrote: > Yeah, that wasn't making sense to me either. The most likely explanation > seems to be that I messed up the test somehow ... but I don't see where. > So, again, I'm wondering if anyone else can replicate or refute this. I do find your results extremely surprising not only for 4, but for all tests with connection numbers lower than 32. With a scale factor of 100 that's suspiciously a lot of difference. > I can't be the only geek around here who sprang for an M1. Not planning to buy one here, anything I have read on that tells that it is worth a performance study. -- Michael signature.asc Description: PGP signature
Re: Improving spin-lock implementation on ARM.
Alexander Korotkov writes: > On Fri, Nov 27, 2020 at 1:55 AM Tom Lane wrote: >> ... and, after retrieving my jaw from the floor, I present the >> attached. Apple's chips evidently like this style of spinlock a LOT >> better. The difference is so remarkable that I wonder if I made a >> mistake somewhere. Can anyone else replicate these results? > Results look very surprising to me. I didn't expect there would be > any very busy spin-lock when the number of clients is as low as 4. Yeah, that wasn't making sense to me either. The most likely explanation seems to be that I messed up the test somehow ... but I don't see where. So, again, I'm wondering if anyone else can replicate or refute this. I can't be the only geek around here who sprang for an M1. regards, tom lane
Re: Improving spin-lock implementation on ARM.
On Fri, Nov 27, 2020 at 2:20 AM Tom Lane wrote: > Alexander Korotkov writes: > > On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas wrote: > >> Is there some official ARM documentation, like a programmer's reference > >> manual or something like that, that would show a reference > >> implementation of a spinlock on ARM? It would be good to refer to an > >> authoritative source on this. > > > I've compared assembly output of gcc implementations of CAS and TAS. > > FWIW, I see quite different assembly using Apple's clang on their M1 > processor. What I get for SpinLockAcquire on HEAD is (lock pointer > initially in x0): Yep, arm v8.1 implements single-instruction atomic operations swpal and casa, which much more look like x86 atomic instructions rather than loops of ldxr/stlxr. So, all the reasoning upthread shouldn't work here, but the advantage is much more huge. -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Fri, Nov 27, 2020 at 1:55 AM Tom Lane wrote: > > Krunal Bauskar writes: > > On Thu, 26 Nov 2020 at 10:50, Tom Lane wrote: > >> Also, exactly what hardware/software platform were these curves > >> obtained on? > > > Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for > > server and 8 for client) [2 numa nodes] > > Storage: 3.2 TB NVMe SSD > > OS: CentOS Linux release 7.6 > > PGSQL: baseline = Release Tag 13.1 > > Hmm, might not be the sort of hardware ordinary mortals can get their > hands on. What's likely to be far more common ARM64 hardware in the > near future is Apple's new gear. So I thought I'd try this on the new > M1 mini I just got. > > ... and, after retrieving my jaw from the floor, I present the > attached. Apple's chips evidently like this style of spinlock a LOT > better. The difference is so remarkable that I wonder if I made a > mistake somewhere. Can anyone else replicate these results? Results look very surprising to me. I didn't expect there would be any very busy spin-lock when the number of clients is as low as 4. Especially in read-only pgbench. I don't have an M1 at hand. Could you do some profiling to identify the source of such a huge difference. -- Regards, Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
Alexander Korotkov writes: > On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas wrote: >> Is there some official ARM documentation, like a programmer's reference >> manual or something like that, that would show a reference >> implementation of a spinlock on ARM? It would be good to refer to an >> authoritative source on this. > I've compared assembly output of gcc implementations of CAS and TAS. FWIW, I see quite different assembly using Apple's clang on their M1 processor. What I get for SpinLockAcquire on HEAD is (lock pointer initially in x0): mov x19, x0 mov w8, #1 swpal w8, w8, [x0] cbz w8, LBB0_2 adrpx1, l_.str@PAGE add x1, x1, l_.str@PAGEOFF adrpx3, l___func__.foo@PAGE add x3, x3, l___func__.foo@PAGEOFF mov x0, x19 mov w2, #12 bl _s_lock LBB0_2: ... lock is acquired while SpinLockRelease is just stlrwzr, [x19] With the patch, I get mov x19, x0 mov w8, #0 mov w9, #1 casaw8, w9, [x0] cmp w8, #0 ; =0 b.eqLBB0_2 adrpx1, l_.str@PAGE add x1, x1, l_.str@PAGEOFF adrpx3, l___func__.foo@PAGE add x3, x3, l___func__.foo@PAGEOFF mov x0, x19 mov w2, #12 bl _s_lock LBB0_2: ... lock is acquired and SpinLockRelease is the same. Don't know much of anything about ARM assembly, so I don't know if these instructions are late-model-only. regards, tom lane
Re: Improving spin-lock implementation on ARM.
Krunal Bauskar writes: > On Thu, 26 Nov 2020 at 10:50, Tom Lane wrote: >> Also, exactly what hardware/software platform were these curves >> obtained on? > Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for > server and 8 for client) [2 numa nodes] > Storage: 3.2 TB NVMe SSD > OS: CentOS Linux release 7.6 > PGSQL: baseline = Release Tag 13.1 Hmm, might not be the sort of hardware ordinary mortals can get their hands on. What's likely to be far more common ARM64 hardware in the near future is Apple's new gear. So I thought I'd try this on the new M1 mini I just got. ... and, after retrieving my jaw from the floor, I present the attached. Apple's chips evidently like this style of spinlock a LOT better. The difference is so remarkable that I wonder if I made a mistake somewhere. Can anyone else replicate these results? Test conditions are absolutely brain dead: Today's HEAD (dcfff74fb), no special build options All server parameters are out-of-the-box defaults, except I had to raise max_connections for the larger client counts pgbench scale factor 100 Read-only tests are like pgbench -S -T 60 -c 32 -j 16 bench Quoted figure is median of three runs; except for the lowest client count, results were quite repeatable. (I speculate that at -c 4, the scheduler might've been doing something funny about sometimes using the slow cores instead of fast cores.) Read-write tests are like pgbench -T 300 -c 16 -j 8 bench I didn't have the patience to run three full repetitions, but again the numbers seemed pretty repeatable. I used -j equal to half -c, except I could not get -j above 128 to work, so the larger client counts have -j 128. Did not try to run down that problem yet, but I'm probably hitting some ulimit somewhere. (I did have to raise "ulimit -n" to get these results.) Anyway, this seems to be a slam-dunk win on M1. regards, tom lane
Re: Improving spin-lock implementation on ARM.
On Thu, 26 Nov 2020 at 16:02, Heikki Linnakangas wrote: > On 26/11/2020 06:30, Krunal Bauskar wrote: > > Improving spin-lock implementation on ARM. > > > > > > * Spin-Lock is known to have a significant effect on performance > >with increasing scalability. > > > > * Existing Spin-Lock implementation for ARM is sub-optimal due to > >use of TAS (test and swap) > > > > * TAS is implemented on ARM as load-store so even if the lock is not > free, > >store operation will execute to replace the same value. > >This redundant operation (mainly store) is costly. > > > > * CAS is implemented on ARM as load-check-store-check that means if the > >lock is not free, check operation, post-load will cause the loop to > >return there-by saving on costlier store operation. [1] > > Can you add some code comments to explain that why CAS is cheaper than > TAS on ARM? > 1. As Alexey too pointed out in followup email CAS = load value -> check if the value is expected -> if yes then replace (store new value) -> else exit/break TAS = load value -> store new value This means each TAS would be converted to 2 operations that are LOAD and STORE and of-course STORE is costlier in terms of latency. CAS ensures optimization in this regard with an early check. Let's look at some micro-benchmarking. I implemented a simple spin-loop using both approaches and as expected with increase scalability, CAS continues to out-perform TAS by a margin up to 50%. TAS Running 128 parallelism Elapsed time: 1.34271 s Running 256 parallelism Elapsed time: 3.6487 s Running 512 parallelism Elapsed time: 11.3725 s Running 1024 parallelism Elapsed time: 43.5376 s CAS Running 128 parallelism Elapsed time: 1.00131 s Running 256 parallelism Elapsed time: 2.53202 s Running 512 parallelism Elapsed time: 7.66829 s Running 1024 parallelism Elapsed time: 22.6294 s This could be also observed from the perf profiling TAS: 15.57 │44: ldxr w0, [x19] 83.93 │ stxr w1, w21, [x19] CAS: 81.29 │58: ↓ b.ne cc 9.86 │cc: ldaxr w0, [x22] 8.84 │ cmp w0, #0x0 │↑ b.ne 58 │ stxr w1, w20, [x22] *In TAS: STORE is pretty costly.* 2. I have added the needed comment in the patch. Updated patch attached. -- Thanks for taking look at this and surely let me know if any more info is needed. > > Is there some official ARM documentation, like a programmer's reference > manual or something like that, that would show a reference > implementation of a spinlock on ARM? It would be good to refer to an > authoritative source on this. > > - Heikki > -- Regards, Krunal Bauskar #include #include #include #include #include #define likely(x) __builtin_expect((x),1) #define unlikely(x) __builtin_expect((x),0) const size_t data_block_size = 64*1024; char data_block[64*1024]; size_t k_iter=100; unsigned long global_crcval = 0; /* spin-lock */ uint32_t lock_unit; void lock() { while (true) { uint32_t expected = 0; #ifdef CAS if (!__atomic_compare_exchange_n(&lock_unit, &expected, 1, false, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE)) { #elif TAS if (__sync_lock_test_and_set(&lock_unit, 1)) { #endif /* Spin-and-Retry as the lock is acquired by some other thread */ __asm__ __volatile__("" ::: "memory"); continue; } break; } } void unlock() { #ifdef CAS __atomic_store_n(&lock_unit, 0, __ATOMIC_RELEASE); #else __sync_lock_release(&lock_unit); #endif } /* workload */ void workload_execute() { for (size_t i = 0; i < k_iter; ++i) { lock(); /* Each thread try to take lock -> execute critical section -> unlock */ memset(data_block, rand() % 255, data_block_size); unsigned long crcval = 0; crc32(crcval, (const unsigned char *)data_block, data_block_size); global_crcval += crcval; unlock(); } } int main(int argc, char *argv[]) { if (argc != 2) { std::cerr << "usage: " << std::endl; return 1; } size_t num_of_threads = atol(argv[1]); std::thread* handles[num_of_threads]; auto start = std::chrono::high_resolution_clock::now(); for (size_t i = 0; i < num_of_threads; ++i) { handles[i] = new std::thread(workload_execute); } for (size_t i = 0; i < num_of_threads; ++i) { handles[i]->join(); } auto finish = std::chrono::high_resolution_clock::now(); for (size_t i = 0; i < num_of_threads; ++i) { delete handles[i]; } std::chrono::duration elapsed = finish - start; std::cout << "Elapsed time: " << elapsed.count() << "
Re: Improving spin-lock implementation on ARM.
On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas wrote: > On 26/11/2020 06:30, Krunal Bauskar wrote: > > Improving spin-lock implementation on ARM. > > > > > > * Spin-Lock is known to have a significant effect on performance > >with increasing scalability. > > > > * Existing Spin-Lock implementation for ARM is sub-optimal due to > >use of TAS (test and swap) > > > > * TAS is implemented on ARM as load-store so even if the lock is not free, > >store operation will execute to replace the same value. > >This redundant operation (mainly store) is costly. > > > > * CAS is implemented on ARM as load-check-store-check that means if the > >lock is not free, check operation, post-load will cause the loop to > >return there-by saving on costlier store operation. [1] > > Can you add some code comments to explain that why CAS is cheaper than > TAS on ARM? > > Is there some official ARM documentation, like a programmer's reference > manual or something like that, that would show a reference > implementation of a spinlock on ARM? It would be good to refer to an > authoritative source on this. Let me add my 2 cents. I've compared assembly output of gcc implementations of CAS and TAS. The sample C-program is attached. I've compiled it on raspberry pi 4 using gcc 9.3.0. The inner loop of CAS is as follows. So, if the value loaded by ldaxr doesn't match expected value, then we immediately quit the loop. .L3: ldxrw3, [x0] cmp w3, w1 bne .L4 stlxr w4, w2, [x0] cbnzw4, .L3 .L4: The inner loop of TAS is as follows. So it really does "stxr" unconditionally. In principle, it could check if a new value matches the observed value and there is nothing to do, but it doesn't. Moreover, stxr might fail, then we can spend multiple loops of ldxr/stxr due to concurrent memory access. AFAIK, those concurrent accesses could reflect not only lock release, but also other unsuccessful lock attempts. So, good chance for extra loops to be useless. .L7: ldxrw2, [x0] stxrw3, w1, [x0] cbnzw3, .L7 I've also googled for spinlock implementation on arm and found a blog post about spinlock implementation in linux kernel [1]. Surprisingly it doesn't use the trick to skip stxr if the lock is busy. Instead, they use some arm-specific power-saving option called WFE. So, I think it's quite clear that switching from TAS to CAS on arm would be a win. But there could be other options to do this even better. Links 1. https://linux-concepts.blogspot.com/2018/05/spinlock-implementation-in-arm.html -- Regards, Alexander Korotkov test.c Description: Binary data
Re: Improving spin-lock implementation on ARM.
On 26/11/2020 06:30, Krunal Bauskar wrote: Improving spin-lock implementation on ARM. * Spin-Lock is known to have a significant effect on performance with increasing scalability. * Existing Spin-Lock implementation for ARM is sub-optimal due to use of TAS (test and swap) * TAS is implemented on ARM as load-store so even if the lock is not free, store operation will execute to replace the same value. This redundant operation (mainly store) is costly. * CAS is implemented on ARM as load-check-store-check that means if the lock is not free, check operation, post-load will cause the loop to return there-by saving on costlier store operation. [1] Can you add some code comments to explain that why CAS is cheaper than TAS on ARM? Is there some official ARM documentation, like a programmer's reference manual or something like that, that would show a reference implementation of a spinlock on ARM? It would be good to refer to an authoritative source on this. - Heikki
Re: Improving spin-lock implementation on ARM.
On Thu, 26 Nov 2020 at 10:55, Krunal Bauskar wrote: > Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for > server and 8 for client) [2 numa nodes] > Storage: 3.2 TB NVMe SSD > OS: CentOS Linux release 7.6 > PGSQL: baseline = Release Tag 13.1 > Invocation suite: > https://github.com/mysqlonarm/benchmark-suites/tree/master/pgsql-pbench (Uses > pgbench) Using the same hardware, attached are my improvement figures, which are pretty much in line with your figures. Except that, I did not run for more than 400 number of clients. And, I am getting some improvement even for select-only workloads, in case of 200-400 clients. For read-write load, I had seen that the s_lock() contention was caused when the XLogFlush() uses the spinlock. But for read-only case, I have not analyzed where the improvement occurred. The .png files in the attached tar have the graphs for head versus patch. The GUCs that I changed : work_mem=64MB shared_buffers=128GB maintenance_work_mem = 1GB min_wal_size = 20GB max_wal_size = 100GB checkpoint_timeout = 60min checkpoint_completion_target = 0.9 full_page_writes = on synchronous_commit = on effective_io_concurrency = 200 log_checkpoints = on For backends, 64 CPUs were allotted (covering 2 NUMA nodes) , and for pgbench clients a separate set of 28 CPUs were allotted on a different socket. Server was pre_warmed(). results.tar.gz Description: application/gzip
Re: Improving spin-lock implementation on ARM.
On Thu, 26 Nov 2020 at 10:50, Tom Lane wrote: > Michael Paquier writes: > > On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote: > >> (Thanks to Amit Khandekar for rigorously performance testing this patch > >> with different combinations). > > > For the simple-update and tpcb-like graphs, do you have any actual > > numbers to share between 128 and 1024 connections? > > Also, exactly what hardware/software platform were these curves > obtained on? > Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for server and 8 for client) [2 numa nodes] Storage: 3.2 TB NVMe SSD OS: CentOS Linux release 7.6 PGSQL: baseline = Release Tag 13.1 Invocation suite: https://github.com/mysqlonarm/benchmark-suites/tree/master/pgsql-pbench (Uses pgbench) > regards, tom lane > -- Regards, Krunal Bauskar
Re: Improving spin-lock implementation on ARM.
Michael Paquier writes: > On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote: >> (Thanks to Amit Khandekar for rigorously performance testing this patch >> with different combinations). > For the simple-update and tpcb-like graphs, do you have any actual > numbers to share between 128 and 1024 connections? Also, exactly what hardware/software platform were these curves obtained on? regards, tom lane
Re: Improving spin-lock implementation on ARM.
scalability baseline patched ---- -- updatetpcb update tpcb -- 128 107932 78554 108081 78569 25682877 64682101543 73774 51255174 46494 77886 61105 1024 32267 27020 33170 30597 configuration: https://github.com/mysqlonarm/benchmark-suites/blob/master/pgsql-pbench/conf/pgsql.cnf/postgresql.conf On Thu, 26 Nov 2020 at 10:36, Michael Paquier wrote: > On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote: > > (Thanks to Amit Khandekar for rigorously performance testing this patch > > with different combinations). > > For the simple-update and tpcb-like graphs, do you have any actual > numbers to share between 128 and 1024 connections? The blue lines > look like they are missing some measurements in-between, so it is hard > to tell if this is an actual improvement or just some lack of data. > -- > Michael > -- Regards, Krunal Bauskar
Re: Improving spin-lock implementation on ARM.
On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote: > (Thanks to Amit Khandekar for rigorously performance testing this patch > with different combinations). For the simple-update and tpcb-like graphs, do you have any actual numbers to share between 128 and 1024 connections? The blue lines look like they are missing some measurements in-between, so it is hard to tell if this is an actual improvement or just some lack of data. -- Michael signature.asc Description: PGP signature
Improving spin-lock implementation on ARM.
Improving spin-lock implementation on ARM. * Spin-Lock is known to have a significant effect on performance with increasing scalability. * Existing Spin-Lock implementation for ARM is sub-optimal due to use of TAS (test and swap) * TAS is implemented on ARM as load-store so even if the lock is not free, store operation will execute to replace the same value. This redundant operation (mainly store) is costly. * CAS is implemented on ARM as load-check-store-check that means if the lock is not free, check operation, post-load will cause the loop to return there-by saving on costlier store operation. [1] * x86 uses optimized xchg operation. ARM too started supporting it (using Large System Extension) with ARM-v8.1 but since it not supported with ARM-v8, GCC default tends to roll more generic load-store assembly code. * gcc-9.4+ onwards there is support for outline-atomics that could emit both the variants of the code (load-store and cas/swp) and based on underlying supported architecture proper variant it used but still a lot of distros don't support GCC-9.4 as the default compiler. * In light of this, we would like to propose a CAS-based approach based on our local testing has shown improvement in the range of 10-40%. (attaching graph). * Patch enables CAS based approach if the CAS is supported depending on existing compiled flag HAVE_GCC__ATOMIC_INT32_CAS (Thanks to Amit Khandekar for rigorously performance testing this patch with different combinations). [1]: https://godbolt.org/z/jqbEsa P.S: Sorry if I missed any standard pgsql protocol since I am just starting with pgsql. --- Krunal Bauskar #mysqlonarm Huawei Technologies diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h index 31a5ca6..940fdcd 100644 --- a/src/include/storage/s_lock.h +++ b/src/include/storage/s_lock.h @@ -321,7 +321,24 @@ tas(volatile slock_t *lock) * than other widths. */ #if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64) -#ifdef HAVE_GCC__SYNC_INT32_TAS + +#ifdef HAVE_GCC__ATOMIC_INT32_CAS +/* just reusing the same flag to avoid re-declaration of default tas functions below */ +#define HAS_TEST_AND_SET + +#define TAS(lock) cas(lock) +typedef int slock_t; + +static __inline__ int +cas(volatile slock_t *lock) +{ + slock_t expected = 0; + return !(__atomic_compare_exchange_n(lock, &expected, (slock_t) 1, +false, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE)); +} + +#define S_UNLOCK(lock) __atomic_store_n(lock, (slock_t) 0, __ATOMIC_RELEASE); +#elif HAVE_GCC__SYNC_INT32_TAS #define HAS_TEST_AND_SET #define TAS(lock) tas(lock)