v1->v2
- Fix a bug in OVS_UNLOCK_ELISION
- Add a check for glibc version >= 2.21 (OVS_CHECK_GLIBC_TSX)
- Add a check for CPU TSX support (OVS_CHECK_RTM; a runtime-detection sketch
  follows this list)
- Enable the OVS_*_ELISION macros only when the CPU has TSX and glibc does
  not already provide lock elision (glibc >= 2.21 elides locks in
  pthread_mutex)
- Add a 20% mutation test-cmap testcase
- List failed testcases below
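
For reference only (not part of the patch), TSX/RTM support can also be
detected at run time via CPUID instead of grepping /proc/cpuinfo; RTM is
reported in CPUID leaf 7, sub-leaf 0, EBX bit 11. A minimal sketch using
GCC/Clang's <cpuid.h> (the cpu_has_rtm() name is invented for this example):

#include <cpuid.h>
#include <stdbool.h>
#include <stdio.h>

static bool
cpu_has_rtm(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Make sure CPUID leaf 7 exists at all. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx) || eax < 7) {
        return false;
    }

    /* RTM is CPUID.(EAX=7,ECX=0):EBX bit 11. */
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return (ebx >> 11) & 1;
}

int
main(void)
{
    printf("RTM supported: %s\n", cpu_has_rtm() ? "yes" : "no");
    return 0;
}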

The patch (for reference; it does not fully pass "make check" yet) shows the
preliminary results of enabling RTM (Restricted Transactional Memory). A
successful transactional execution elides the lock, i.e., the lock is
bypassed, exposing more concurrency. However, transactions may abort for
several reasons, such as data conflicts, I/O operations, or syscalls. When a
transaction aborts, execution falls back to the original locking mechanism.
Thus, the performance improvement depends on the abort rate and on the
overhead of speculative execution and rollback.
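
For reference only (not part of the patch), the elision pattern can be
sketched with GCC's RTM intrinsics from <immintrin.h> (build with -mrtm on a
TSX-capable CPU). The sketch below wraps a trivial test-and-set spinlock so
it stays self-contained; the helper names and the retry budget are invented
for illustration:

#include <immintrin.h>   /* _xbegin, _xend, _xabort, _xtest, _XBEGIN_STARTED */
#include <stdatomic.h>

#define ELISION_RETRIES  3     /* arbitrary retry budget for this sketch */
#define ABORT_LOCK_BUSY  0xff  /* explicit abort code: lock already held */

typedef struct { atomic_int locked; } elided_spinlock;

static void
elided_lock(elided_spinlock *l)
{
    for (int i = 0; i < ELISION_RETRIES; i++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            /* Transaction started: only *read* the lock word.  If it is
             * free, return while still inside the transaction; any writer
             * that later takes the lock aborts us automatically. */
            if (atomic_load(&l->locked) == 0) {
                return;                   /* lock elided */
            }
            _xabort(ABORT_LOCK_BUSY);     /* lock busy: abort explicitly */
        }
        if (!(status & _XABORT_RETRY)) {
            break;                        /* hardware advises against retrying */
        }
    }
    /* Fallback: plain test-and-set spin lock. */
    while (atomic_exchange(&l->locked, 1)) {
        /* spin */
    }
}

static void
elided_unlock(elided_spinlock *l)
{
    if (_xtest()) {
        _xend();                          /* commit the elided critical section */
    } else {
        atomic_store(&l->locked, 0);
    }
}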

The patch adds OVS_LOCK_ELISION, OVS_TRYLOCK_ELISION, and OVS_UNLOCK_ELISION.
Experiments show that for cmap, TM gives no observable improvement at 2%
mutations. At 5% mutations and above, cmap search with TM shows better
numbers. The most significant improvement is in hmap search, where TM gives
roughly a 2x-3x speedup.

Results below are from the test-cmap benchmark.
$ tests/ovstest test-cmap benchmark 2000000 4 2 1
Benchmarking with n=2000000, 4 threads, 2.00% mutations, batch size 1:
             < baseline >   < TM >
cmap insert     293 ms      293 ms
cmap iterate    37 ms       37 ms
batch search    197 ms      186 ms
cmap destroy    177 ms      166 ms

cmap insert     284 ms      283 ms
cmap iterate    38 ms       38 ms
cmap search     121 ms      120 ms *
cmap destroy    168 ms      245 ms

hmap insert     141 ms      164 ms
hmap iterate    138 ms      134 ms
hmap search     780 ms      414 ms *
hmap destroy    126 ms      122 ms

Benchmarking with n=2000000, 4 threads, 5.00% mutations, batch size 1:
cmap insert     296 ms      298 ms
cmap iterate    38 ms       38 ms
batch search    239 ms      332 ms
cmap destroy    162 ms      159 ms

cmap insert     285 ms      284 ms
cmap iterate    37 ms       37 ms
cmap search     143 ms      126 ms *
cmap destroy    163 ms      159 ms

hmap insert     130 ms      128 ms
hmap iterate    140 ms      134 ms
hmap search     1577 ms     489 ms *
hmap destroy    117 ms      108 ms

Benchmarking with n=2000000, 4 threads, 10.00% mutations, batch size 1:
cmap insert     301 ms      295 ms
cmap iterate    37 ms       37 ms
batch search    355 ms      380 ms
cmap destroy    148 ms      145 ms

cmap insert     286 ms      288 ms
cmap iterate    37 ms       37 ms
cmap search     240 ms      133 ms *
cmap destroy    146 ms      148 ms

hmap insert     133 ms      145 ms
hmap iterate    139 ms      134 ms
hmap search     2087 ms     575 ms *
hmap destroy    94 ms       87 ms

Benchmarking with n=2000000, 4 threads, 20.00% mutations, batch size 1:
cmap insert     295 ms      297 ms
cmap iterate    37 ms       38 ms
batch search    834 ms      568 ms
cmap destroy    126 ms      124 ms

cmap insert     328 ms      283 ms
cmap iterate    37 ms       37 ms
cmap search     518 ms      164 ms *
cmap destroy    146 ms      124 ms

hmap insert     148 ms      128 ms
hmap iterate    134 ms      135 ms
hmap search     3034 ms     904 ms *
hmap destroy    60 ms       58 ms

=== Analysis of Abort rate ===
Linux perf exposes PMU counters for transactional memory, for example:
$ perf record -e tx-start,tx-abort,tx-commit,tx-conflict <program>
The 'perf report' results of the 2%, 5%, and 10% TM experiments show:
              2%       5%     10%
tx-start     10K      14K     16K  (transactions started)
tx-abort      7K      13K     14K  (all aborts)
tx-commit    10K      14K     16K  (transactions committed)
tx-conflict   7K      13K     14K  (aborts due to conflicts with other CPUs)
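
Dividing tx-abort by tx-start suggests an abort rate of roughly 70% at 2%
mutations, rising to about 90% at the 5% and 10% mutation rates, with nearly
all aborts attributed to conflicts with other CPUs.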

Note:
While the test-cmap benchmark works fine, the patch still fails a few
test cases under "make check". Failed cases:
  717 ofproto - del group (OpenFlow 1.5), and the following 3
  816 ofproto-dpif - active-backup bonding, and 817, 818
  939 ofproto-dpif megaflow - normal, active-backup bonding, and 940, 955

Signed-off-by: William Tu <u9012...@gmail.com>
---
 configure.ac      |   2 +
 lib/ovs-thread.c  | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 m4/openvswitch.m4 |  21 +++++++++++
 3 files changed, 130 insertions(+)

diff --git a/configure.ac b/configure.ac
index 49aa182..5c3bd15 100644
--- a/configure.ac
+++ b/configure.ac
@@ -113,6 +113,8 @@ OVS_CHECK_PKIDIR
 OVS_CHECK_RUNDIR
 OVS_CHECK_DBDIR
 OVS_CHECK_BACKTRACE
+OVS_CHECK_RTM
+OVS_CHECK_GLIBC_TSX
 OVS_CHECK_PERF_EVENT
 OVS_CHECK_VALGRIND
 OVS_CHECK_SOCKET_LIBS
diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c
index b0e10ee..6fde904 100644
--- a/lib/ovs-thread.c
+++ b/lib/ovs-thread.c
@@ -52,6 +52,110 @@ static const char *must_not_fork;
 /* True if we created any threads beyond the main initial thread. */
 static bool multithreaded;
 
+#if defined(HAVE_RTM) && !defined(HAVE_GLIBC_TSX)
+/* Intel Transactional Memory (TSX) support */
+#define _XBEGIN_STARTED         (~0u)
+#define _ABORT_EXPLICIT         (1 << 0)
+#define _ABORT_RETRY            (1 << 1)
+#define _ABORT_CONFLICT         (1 << 2)
+#define _ABORT_CAPACITY         (1 << 3)
+#define _ABORT_DEBUG            (1 << 4)
+#define _ABORT_NESTED           (1 << 5)
+#define _XABORT_CODE(x)         (((x) >> 24) & 0xff)
+#define _ABORT_LOCK_BUSY        0xff
+
+#define MAX_RETRY_XBEGIN        10
+#define MAX_ABORT_RETRY         5
+
+#define __force_inline __attribute__((__always_inline__)) inline
+
+/* See: Intel 64 and IA-32 Architectures Software Developer's Manual
+ * Instruction Set Reference. */
+
+static __force_inline int _xbegin(void)
+{
+    int ret = _XBEGIN_STARTED;
+    __asm__ volatile (".byte 0xc7,0xf8 ; .long 0" : "+a" (ret) :: "memory");
+    return ret;
+}
+
+static __force_inline void _xend(void)
+{
+    __asm__ volatile (".byte 0x0f,0x01,0xd5" ::: "memory");
+}
+
+static __force_inline void _xabort(const unsigned int status)
+{
+     __asm__ volatile (".byte 0xc6,0xf8,%P0" :: "i" (status) : "memory");
+}
+
+static __force_inline int _xtest(void)
+{
+    unsigned char out;
+    /* return 1 if RTM_ACTIVE = 1 */
+    __asm__ volatile (".byte 0x0f,0x01,0xd6 ; setnz %0" : "=r" (out) :: "memory");
+    return out;
+}
+
+#define OVS_LOCK_ELISION(mutex) \
+{\
+    unsigned status; \
+    int i; \
+    int abort_retry = 0; \
+    for(i = 0; i < MAX_RETRY_XBEGIN; i++) { \
+        if ((status = _xbegin()) == _XBEGIN_STARTED) { \
+            if ((mutex)->__data.__lock == 0) {\
+                return; \
+            }\
+            _xabort(_ABORT_LOCK_BUSY);\
+        } \
+        if (!(status & _ABORT_RETRY)) { \
+            if (abort_retry >= MAX_ABORT_RETRY) {\
+                break; \
+            }\
+            abort_retry++; \
+        } \
+    } \
+}
+
+/* Same as OVS_LOCK_ELISION, but has return value */
+#define OVS_TRYLOCK_ELISION(mutex) \
+{\
+    unsigned status; \
+    int i; \
+    int abort_retry = 0; \
+    for(i = 0; i < MAX_RETRY_XBEGIN; i++) { \
+        if ((status = _xbegin()) == _XBEGIN_STARTED) { \
+            if ((mutex)->__data.__lock == 0) {\
+                return 0; \
+            }\
+            _xabort(_ABORT_LOCK_BUSY);\
+        } \
+        if (!(status & _ABORT_RETRY)) { \
+            if (abort_retry >= MAX_ABORT_RETRY) {\
+                break; \
+            }\
+            abort_retry++; \
+        } \
+    } \
+}
+
+#define OVS_UNLOCK_ELISION(mutex) \
+{\
+    if (((mutex)->__data.__lock == 0)) { \
+        if (_xtest() == 1) {\
+            _xend();\
+        }\
+        return;\
+    }\
+}
+
+#else /* HAVE_RTM disable below */
+#define OVS_LOCK_ELISION(mutex)
+#define OVS_TRYLOCK_ELISION(mutex)
+#define OVS_UNLOCK_ELISION(mutex)
+#endif
+
 #define LOCK_FUNCTION(TYPE, FUN) \
     void \
     ovs_##TYPE##_##FUN##_at(const struct ovs_##TYPE *l_, \
@@ -60,6 +164,7 @@ static bool multithreaded;
     { \
         struct ovs_##TYPE *l = CONST_CAST(struct ovs_##TYPE *, l_); \
         int error; \
+        OVS_LOCK_ELISION(&l->lock); \
  \
         /* Verify that 'l' was initialized. */ \
         if (OVS_UNLIKELY(!l->where)) { \
@@ -85,6 +190,7 @@ LOCK_FUNCTION(rwlock, wrlock);
     { \
         struct ovs_##TYPE *l = CONST_CAST(struct ovs_##TYPE *, l_); \
         int error; \
+        OVS_TRYLOCK_ELISION(&l->lock); \
  \
         /* Verify that 'l' was initialized. */ \
         if (OVS_UNLIKELY(!l->where)) { \
@@ -112,6 +218,7 @@ TRY_LOCK_FUNCTION(rwlock, trywrlock);
     { \
         struct ovs_##TYPE *l = CONST_CAST(struct ovs_##TYPE *, l_); \
         int error; \
+        OVS_UNLOCK_ELISION(&l->lock); \
  \
         /* Verify that 'l' was initialized. */ \
         ovs_assert(l->where); \
diff --git a/m4/openvswitch.m4 b/m4/openvswitch.m4
index 0149c30..8245da9 100644
--- a/m4/openvswitch.m4
+++ b/m4/openvswitch.m4
@@ -309,6 +309,27 @@ AC_DEFUN([OVS_CHECK_BACKTRACE],
                   [AC_DEFINE([HAVE_BACKTRACE], [1],
                              [Define to 1 if you have backtrace(3).])])])
 
+dnl Defines HAVE_RTM if CPU supports TSX
+AC_DEFUN([OVS_CHECK_RTM],
+  [ if (test -e /proc/cpuinfo && cat /proc/cpuinfo | grep rtm > /dev/null); then
+        AC_DEFINE([HAVE_RTM], [1], [Define to 1 if CPU has Transactional Memory.])
+    fi])
+
+dnl Defines HAVE_GLIBC_TSX if glibc supports TSX
+AC_DEFUN([OVS_CHECK_GLIBC_TSX],
+  [AC_RUN_IFELSE(
+    [AC_LANG_PROGRAM([],
+    [[
+      #if __GLIBC__ >= 2 && __GLIBC_MINOR__ >= 21
+      return 0;
+      #else
+      return 1;
+      #endif
+    ]])],
+    [AC_DEFINE([HAVE_GLIBC_TSX], [1], [Defines HAVE_GLIBC_TSX if glibc supports TSX])],
+    [], [])
+  ])
+
 dnl Defines HAVE_PERF_EVENT if linux/perf_event.h is found.
 AC_DEFUN([OVS_CHECK_PERF_EVENT],
   [AC_CHECK_HEADERS([linux/perf_event.h])])
-- 
2.5.0
