Improving spin-lock implementation on ARM.
------------------------------------------------------------
* The spin-lock implementation is known to have a significant effect on
performance, and the effect grows as workloads scale to more cores.
* The existing spin-lock implementation for ARM is sub-optimal due to its
use of TAS (test-and-set).
* TAS is implemented on ARM as a load-store sequence, so even if the lock
is not free, the store still executes and rewrites the same value.
This redundant operation (mainly the store) is costly.
* CAS is implemented on ARM as load-check-store-check, which means that if
the lock is not free, the check that follows the load makes the loop
return immediately, thereby saving the costlier store operation [1]
(see the first sketch after this list).
* x86 uses the optimized xchg instruction. ARM started supporting an
equivalent (via the Large System Extensions) with ARMv8.1, but since it
is not supported on ARMv8.0, GCC by default emits the more generic
load-store assembly code (see the second sketch after this list).
* From gcc-9.4 onwards there is support for outline-atomics, which emits
both variants of the code (load-store and CAS/SWP) and selects the proper
variant at run time based on the underlying architecture, but many
distros still don't ship gcc-9.4 (or newer) as the default compiler.
* In light of this, we would like to propose a CAS-based approach; our
local testing has shown improvements in the range of 10-40%
(graph attached).
* The patch enables the CAS-based approach when CAS is available, gated on
the existing compile-time flag HAVE_GCC__ATOMIC_INT32_CAS.
(Thanks to Amit Khandekar for rigorously performance-testing this patch
with different combinations.)
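
To make the difference concrete, here is a minimal standalone sketch (not
the patch itself; the lock word and function names are made up for
illustration) contrasting the two GCC builtins:

#include <stdbool.h>

typedef int slock_t;

/* TAS flavour: unconditionally stores 1 and returns the previous value,
 * so the store half of the exclusive pair runs even when the lock is
 * already taken. */
static inline bool
try_lock_tas(volatile slock_t *lock)
{
    return __sync_lock_test_and_set(lock, 1) == 0;
}

/* CAS flavour: stores 1 only if the lock currently holds 0; a failed
 * comparison returns without attempting the store. */
static inline bool
try_lock_cas(volatile slock_t *lock)
{
    slock_t expected = 0;
    return __atomic_compare_exchange_n(lock, &expected, 1,
                                       false, __ATOMIC_ACQUIRE,
                                       __ATOMIC_ACQUIRE);
}

On ARMv8.0 the TAS variant compiles to a load-exclusive/store-exclusive
(LDAXR/STXR) loop whose store runs on every spin, while the CAS variant
branches out right after the load when the comparison fails; that
difference is what [1] shows.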
[1]: https://godbolt.org/z/jqbEsa
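
The toolchain behaviour can be checked with a one-function probe like the
one below (the file and function names are made up; the flags are standard
GCC options for AArch64):

/* Roughly:
 *   gcc -O2 -march=armv8-a    -S probe.c  -> LDAXR/STXR retry loop
 *   gcc -O2 -march=armv8.1-a  -S probe.c  -> single CAS instruction (LSE)
 *   gcc -O2 -moutline-atomics -S probe.c  -> call into a libgcc helper
 *                                            that picks the variant at
 *                                            run time (gcc-9.4+)
 */
int
probe(int *lock)
{
    int expected = 0;
    return __atomic_compare_exchange_n(lock, &expected, 1,
                                       false, __ATOMIC_ACQUIRE,
                                       __ATOMIC_ACQUIRE);
}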
P.S.: Apologies if I missed any standard pgsql conventions; I am just
getting started with pgsql.
---
Krunal Bauskar
#mysqlonarm
Huawei Technologies
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 31a5ca6..940fdcd 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -321,7 +321,24 @@ tas(volatile slock_t *lock)
* than other widths.
*/
#if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64)
-#ifdef HAVE_GCC__SYNC_INT32_TAS
+
+#ifdef HAVE_GCC__ATOMIC_INT32_CAS
+/* just reusing the same flag to avoid re-declaration of default tas functions below */
+#define HAS_TEST_AND_SET
+
+#define TAS(lock) cas(lock)
+typedef int slock_t;
+
+static __inline__ int
+cas(volatile slock_t *lock)
+{
+ slock_t expected = 0;
+ return !(__atomic_compare_exchange_n(lock, &expected, (slock_t) 1,
+ false, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
+}
+
+#define S_UNLOCK(lock) __atomic_store_n(lock, (slock_t) 0, __ATOMIC_RELEASE);
+#elif HAVE_GCC__SYNC_INT32_TAS
#define HAS_TEST_AND_SET
#define TAS(lock) tas(lock)