v1->v2 - Fix a bug at UNLOCK_ELISION - Add checking if glibc version >= 2.21 (OVS_CHECK_GLIBC_TSX) - Add checking of whether cpu has TSX support (OVS_CHECK_RTM) - Enable LOCK_ELISION only when CPU has TSX and glibc doesn't (if glibc version >= 2.21, then using phtread_mutex has lock elision) - Add 20% mutation test-cmap testcase - List failed testcases below
The patch (for references, not fully passed check yet) shows the preliminary results of enabling RTM (Restricted Transactional Memory). A successful transactional execution elides the lock, i.e., the lock is by-passed, and exposes concurrency. However, transactions might abort due to several reasons such as data conflicts, I/O operations, syscall, etc. When transaction aborts, it falls back to the original locking mechanisms. Thus, the performance improvement depends on the abort rate, and the overhead of speculative execution and rollback. The patch adds OVS_LOCK_ELISION, OVS_TRYLOCK_ELISION, and OVS_UNLOCK_ELISION. Experiments show that for cmap, the TM does not seem to have observable improvements at 2% mutations. For cmap search over 5% mutation, the cmap search time of TM shows better numbers. The most significant improvement is at hmap search, where with TM, there is around x2 or x3 speedup. Results are shown by using test-cmap benchmark. $ tests/ovstest test-cmap benchmark 20000000 4 2 1 Benchmarking with n=2000000, 4 threads, 2.00% mutations, batch size 1: < baseline > < TM > cmap insert 293 ms 293 ms cmap iterate 37 ms 37 ms batch search 197 ms 186 ms cmap destroy 177 ms 166 ms cmap insert 284 ms 283 ms cmap iterate 38 ms 38 ms cmap search 121 ms 120 ms * cmap destroy 168 ms 245 ms hmap insert 141 ms 164 ms hmap iterate 138 ms 134 ms hmap search 780 ms 414 ms * hmap destroy 126 ms 122 ms Benchmarking with n=2000000, 4 threads, 5.00% mutations, batch size 1: cmap insert 296 ms 298 ms cmap iterate 38 ms 38 ms batch search 239 ms 332 ms cmap destroy 162 ms 159 ms cmap insert 285 ms 284 ms cmap iterate 37 ms 37 ms cmap search 143 ms 126 ms * cmap destroy 163 ms 159 ms hmap insert 130 ms 128 ms hmap iterate 140 ms 134 ms hmap search 1577 ms 489 ms * hmap destroy 117 ms 108 ms Benchmarking with n=2000000, 4 threads, 10.00% mutations, batch size 1: cmap insert 301 ms 295 ms cmap iterate 37 ms 37 ms batch search 355 ms 380 ms cmap destroy 148 ms 145 ms cmap insert 286 ms 288 ms cmap iterate 37 ms 37 ms cmap search 240 ms 133 ms * cmap destroy 146 ms 148 ms hmap insert 133 ms 145 ms hmap iterate 139 ms 134 ms hmap search 2087 ms 575 ms * hmap destroy 94 ms 87 ms Benchmarking with n=2000000, 4 threads, 20.00% mutations, batch size 1: cmap insert 295 ms 297 ms cmap iterate 37 ms 38 ms batch search 834 ms 568 ms cmap destroy 126 ms 124 ms cmap insert 328 ms 283 ms cmap iterate 37 ms 37 ms cmap search 518 ms 164 ms * cmap destroy 146 ms 124 ms hmap insert 148 ms 128 ms hmap iterate 134 ms 135 ms hmap search 3034 ms 904 ms * hmap destroy 60 ms 58 ms === Analysis of Abort rate === Linux Perf shows PMU counters for transactional memory as below. $ perf record -e tx-start,tx-abort,tx-commit,tx-conflict <program> The results from 'perf report' of the three TM experiments show: 2% 5% 10% tx-start 10K 14K 16K (count start transaction) tx-abort 7K 13K 14K (count all aborts) tx-commit 10K 14K 16K (count commit of transactions) tx-conflict 7K 13K 14K (count aborts due to conflict with other CPUs) Note: While test-cmap benchmark works fine, the patch fails a few cases when doing "make check". Failed cases: 717 ofproto - del group (OpenFlow 1.5), and the following 3 816 ofproto-dpif - active-backup bonding, and 817, 818 939 ofproto-dpif megaflow - normal, active-backup bonding, and 940, 955 Signed-off-by: William Tu <u9012...@gmail.com> --- configure.ac | 2 + lib/ovs-thread.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ m4/openvswitch.m4 | 21 +++++++++++ 3 files changed, 130 insertions(+) diff --git a/configure.ac b/configure.ac index 49aa182..5c3bd15 100644 --- a/configure.ac +++ b/configure.ac @@ -113,6 +113,8 @@ OVS_CHECK_PKIDIR OVS_CHECK_RUNDIR OVS_CHECK_DBDIR OVS_CHECK_BACKTRACE +OVS_CHECK_RTM +OVS_CHECK_GLIBC_TSX OVS_CHECK_PERF_EVENT OVS_CHECK_VALGRIND OVS_CHECK_SOCKET_LIBS diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c index b0e10ee..6fde904 100644 --- a/lib/ovs-thread.c +++ b/lib/ovs-thread.c @@ -52,6 +52,110 @@ static const char *must_not_fork; /* True if we created any threads beyond the main initial thread. */ static bool multithreaded; +#if defined(HAVE_RTM) && !defined(HAVE_GLIBC_TSX) +/* Intel Transactional Memory (TSX) support */ +#define _XBEGIN_STARTED (~0u) +#define _ABORT_EXPLICIT (1 << 0) +#define _ABORT_RETRY (1 << 1) +#define _ABORT_CONFLICT (1 << 2) +#define _ABORT_CAPACITY (1 << 3) +#define _ABORT_DEBUG (1 << 4) +#define _ABORT_NESTED (1 << 5) +#define _XABORT_CODE(x) (((x) >> 24) & 0xff) +#define _ABORT_LOCK_BUSY 0xff + +#define MAX_RETRY_XBEGIN 10 +#define MAX_ABORT_RETRY 5 + +#define __force_inline __attribute__((__always_inline__)) inline + +/* See: Intel 64 and IA-32 Architectures Software Developer's Manual + * Instruction Set Reference. */ + +static __force_inline int _xbegin(void) +{ + int ret = _XBEGIN_STARTED; + __asm__ volatile (".byte 0xc7,0xf8 ; .long 0" : "+a" (ret) :: "memory"); + return ret; +} + +static __force_inline void _xend(void) +{ + __asm__ volatile (".byte 0x0f,0x01,0xd5" ::: "memory"); +} + +static __force_inline void _xabort(const unsigned int status) +{ + __asm__ volatile (".byte 0xc6,0xf8,%P0" :: "i" (status) : "memory"); +} + +static __force_inline int _xtest(void) +{ + unsigned char out; + /* return 1 if RTM_ACTIVE = 1 */ + __asm__ volatile (".byte 0x0f,0x01,0xd6 ; setnz %0" : "=r" (out) :: "memory"); + return out; +} + +#define OVS_LOCK_ELISION(mutex) \ +{\ + unsigned status; \ + int i; \ + int abort_retry = 0; \ + for(i = 0; i < MAX_RETRY_XBEGIN; i++) { \ + if ((status = _xbegin()) == _XBEGIN_STARTED) { \ + if ((mutex)->__data.__lock == 0) {\ + return; \ + }\ + _xabort(_ABORT_LOCK_BUSY);\ + } \ + if (!(status & _ABORT_RETRY)) { \ + if (abort_retry >= MAX_ABORT_RETRY) {\ + break; \ + }\ + abort_retry++; \ + } \ + } \ +} + +/* Same as OVS_LOCK_ELISION, but has return value */ +#define OVS_TRYLOCK_ELISION(mutex) \ +{\ + unsigned status; \ + int i; \ + int abort_retry = 0; \ + for(i = 0; i < MAX_RETRY_XBEGIN; i++) { \ + if ((status = _xbegin()) == _XBEGIN_STARTED) { \ + if ((mutex)->__data.__lock == 0) {\ + return 0; \ + }\ + _xabort(_ABORT_LOCK_BUSY);\ + } \ + if (!(status & _ABORT_RETRY)) { \ + if (abort_retry >= MAX_ABORT_RETRY) {\ + break; \ + }\ + abort_retry++; \ + } \ + } \ +} + +#define OVS_UNLOCK_ELISION(mutex) \ +{\ + if (((mutex)->__data.__lock == 0)) { \ + if (_xtest() == 1) {\ + _xend();\ + }\ + return;\ + }\ +} + +#else /* HAVE_RTM disable below */ +#define OVS_LOCK_ELISION(mutex) +#define OVS_TRYLOCK_ELISION(mutex) +#define OVS_UNLOCK_ELISION(mutex) +#endif + #define LOCK_FUNCTION(TYPE, FUN) \ void \ ovs_##TYPE##_##FUN##_at(const struct ovs_##TYPE *l_, \ @@ -60,6 +164,7 @@ static bool multithreaded; { \ struct ovs_##TYPE *l = CONST_CAST(struct ovs_##TYPE *, l_); \ int error; \ + OVS_LOCK_ELISION(&l->lock); \ \ /* Verify that 'l' was initialized. */ \ if (OVS_UNLIKELY(!l->where)) { \ @@ -85,6 +190,7 @@ LOCK_FUNCTION(rwlock, wrlock); { \ struct ovs_##TYPE *l = CONST_CAST(struct ovs_##TYPE *, l_); \ int error; \ + OVS_TRYLOCK_ELISION(&l->lock); \ \ /* Verify that 'l' was initialized. */ \ if (OVS_UNLIKELY(!l->where)) { \ @@ -112,6 +218,7 @@ TRY_LOCK_FUNCTION(rwlock, trywrlock); { \ struct ovs_##TYPE *l = CONST_CAST(struct ovs_##TYPE *, l_); \ int error; \ + OVS_UNLOCK_ELISION(&l->lock); \ \ /* Verify that 'l' was initialized. */ \ ovs_assert(l->where); \ diff --git a/m4/openvswitch.m4 b/m4/openvswitch.m4 index 0149c30..8245da9 100644 --- a/m4/openvswitch.m4 +++ b/m4/openvswitch.m4 @@ -309,6 +309,27 @@ AC_DEFUN([OVS_CHECK_BACKTRACE], [AC_DEFINE([HAVE_BACKTRACE], [1], [Define to 1 if you have backtrace(3).])])]) +dnl Defines HAVE_RTM if CPU supports TSX +AC_DEFUN([OVS_CHECK_RTM], + [ if (test -e /proc/cpuinfo && cat /proc/cpuinfo | grep rtm > /dev/null); then + AC_DEFINE([HAVE_RTM], [1], [Define to 1 if CPU has Transactional Memory.]) + fi]) + +dnl Defines HAVE_GLIBC_TSX if glibc supports TSX +AC_DEFUN([OVS_CHECK_GLIBC_TSX], + [AC_RUN_IFELSE( + [AC_LANG_PROGRAM([], + [[ + #if __GLIBC__ >= 2 && __GLIBC_MINOR__ >= 21 + return 0; + #else + return 1; + #endif + ]])], + [AC_DEFINE([HAVE_GLIBC_TSX], [1], [Defines HAVE_GLIBC_TSX if glibc supports TSX])], + [], []) + ]) + dnl Defines HAVE_PERF_EVENT if linux/perf_event.h is found. AC_DEFUN([OVS_CHECK_PERF_EVENT], [AC_CHECK_HEADERS([linux/perf_event.h])]) -- 2.5.0 _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev