Improving spin-lock implementation on ARM.
------------------------------------------------------------

* The spin-lock implementation is known to have a significant effect on
  performance and scalability as concurrency increases.

* The existing spin-lock implementation for ARM is sub-optimal because it
  uses TAS (test-and-set).

* TAS is implemented on ARM as an exclusive load-store loop, so even if the
  lock is not free, the store operation still executes and rewrites the same
  value. This redundant operation (mainly the store) is costly.

* CAS is implemented on ARM as load-check-store-check, which means that if
  the lock is not free, the check right after the load causes the loop to
  exit early, saving the costlier store operation. [1]
  (A short standalone sketch contrasting the two builtins follows this list.)

* x86 uses an optimized xchg operation.
  ARM too started supporting a comparable single-instruction atomic (via the
  Large System Extension) with ARMv8.1, but since it is not available on
  ARMv8.0, GCC by default tends to emit the more generic load-store assembly
  code.

* From gcc-9.4 onwards there is support for outline atomics, which emits
  both variants of the code (load-store and CAS/SWP) and selects the proper
  variant at run time based on the underlying architecture, but a lot of
  distros still don't ship GCC 9.4+ as the default compiler.

* In light of this, we would like to propose a CAS-based approach; our local
  testing has shown an improvement in the range of 10-40%
  (graph attached).

* The patch enables the CAS-based approach when CAS is supported, keyed off
  the existing compile-time flag HAVE_GCC__ATOMIC_INT32_CAS.
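
To make the difference concrete, here is a small standalone sketch (not part
of the patch) of the two GCC builtins involved; the function names are just
illustrative, and the instruction sequences in the comments are approximate
(they depend on compiler version and -march flags):

#include <stdbool.h>

typedef int slock_t;

/*
 * TAS-style acquire: __sync_lock_test_and_set unconditionally stores 1, so
 * the exclusive load-store loop (roughly: ldaxr; stxr; cbnz) performs the
 * store even when the lock is already held.
 */
static inline int
tas_style(volatile slock_t *lock)
{
	return __sync_lock_test_and_set(lock, 1) != 0;
}

/*
 * CAS-style acquire: __atomic_compare_exchange_n compares against the
 * expected value 0 right after the load and bails out before the store when
 * the lock is busy (roughly: ldaxr; cmp; b.ne; stxr; cbnz), avoiding the
 * redundant store.
 *
 * With -march=armv8.1-a (or outline atomics selecting it at run time) both
 * builtins can instead map to single-instruction atomics (swpa / casa).
 */
static inline int
cas_style(volatile slock_t *lock)
{
	slock_t expected = 0;

	return !__atomic_compare_exchange_n(lock, &expected, (slock_t) 1,
										false, __ATOMIC_ACQUIRE,
										__ATOMIC_ACQUIRE);
}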

(Thanks to Amit Khandekar for rigorously performance-testing this patch
 with different combinations.)

[1]: https://godbolt.org/z/jqbEsa

P.S.: Sorry if I missed any standard pgsql protocol, as I am just starting
with pgsql.

---
Krunal Bauskar
#mysqlonarm
Huawei Technologies
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 31a5ca6..940fdcd 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -321,7 +321,24 @@ tas(volatile slock_t *lock)
  * than other widths.
  */
 #if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64)
-#ifdef HAVE_GCC__SYNC_INT32_TAS
+
+#ifdef HAVE_GCC__ATOMIC_INT32_CAS
+/* just reusing the same flag to avoid re-declaration of default tas functions below */
+#define HAS_TEST_AND_SET
+
+#define TAS(lock) cas(lock)
+typedef int slock_t;
+
+static __inline__ int
+cas(volatile slock_t *lock)
+{
+	slock_t expected = 0;
+	return !(__atomic_compare_exchange_n(lock, &expected, (slock_t) 1,
+				false, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
+}
+
+#define S_UNLOCK(lock) __atomic_store_n(lock, (slock_t) 0, __ATOMIC_RELEASE);
+#elif HAVE_GCC__SYNC_INT32_TAS
 #define HAS_TEST_AND_SET
 
 #define TAS(lock) tas(lock)
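
For reference, here is a minimal standalone sketch of the acquire/release
pattern the patched TAS()/S_UNLOCK() map onto; the function names below are
only for illustration and are not part of PostgreSQL:

#include <stdbool.h>

typedef int slock_t;

/*
 * Spin until the CAS-based acquire succeeds.  "expected" is re-initialized
 * to 0 on every iteration because __atomic_compare_exchange_n writes the
 * observed value back into it on failure.
 */
static inline void
spin_acquire(volatile slock_t *lock)
{
	for (;;)
	{
		slock_t expected = 0;

		if (__atomic_compare_exchange_n(lock, &expected, (slock_t) 1,
										false, __ATOMIC_ACQUIRE,
										__ATOMIC_ACQUIRE))
			break;
		/* A real spin loop would add delay/back-off here, as s_lock() does. */
	}
}

/* Release with a plain atomic store, matching the proposed S_UNLOCK(). */
static inline void
spin_release(volatile slock_t *lock)
{
	__atomic_store_n(lock, (slock_t) 0, __ATOMIC_RELEASE);
}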
