> > > > > > > > > > > On Wed, Apr 10, 2019 at 06:20:04AM -0500, Honnappa Nagarahalli wrote:
> > > > > > Add RCU library supporting quiescent state based memory reclamation method.
> > > > > > This library helps identify the quiescent state of the reader
> > > > > > threads so that the writers can free the memory associated
> > > > > > with the lock less data structures.
> > > > >
> > > > > I don't see any sign of read-side markers (rcu_read_lock() and
> > > > > rcu_read_unlock() in the Linux kernel, userspace RCU, etc.).
> > > > >
> > > > > Yes, strictly speaking, these are not needed for QSBR to operate, but they
> > > > These APIs would be empty for QSBR.
> > > >
> > > > > make it way easier to maintain and debug code using RCU. For example,
> > > > > given the read-side markers, you can check for errors like having a call
> > > > > to rte_rcu_qsbr_quiescent() in the middle of a reader quite easily.
> > > > > Without those read-side markers, life can be quite hard and you will
> > > > > really hate yourself for failing to have provided them.
> > > >
> > > > Want to make sure I understood this, do you mean the application would
> > > > mark before and after accessing the shared data structure on the reader
> > > > side?
> > > >
> > > > rte_rcu_qsbr_lock()
> > > > <begin access shared data structure>
> > > > ...
> > > > ...
> > > > <end access shared data structure>
> > > > rte_rcu_qsbr_unlock()
> > >
> > > Yes, that is the idea.
> > >
> > > > If someone is debugging this code, they have to make sure that there is
> > > > an unlock for every lock and there is no call to rte_rcu_qsbr_quiescent
> > > > in between.
> > > > It sounds good to me. Obviously, they will not add any additional cycles
> > > > as well.
> > > > Please let me know if my understanding is correct.
> > >
> > > Yes. And in some sort of debug mode, you could capture the counter at
> > > rte_rcu_qsbr_lock() time and check it at rte_rcu_qsbr_unlock() time.
> > > If the counter has advanced too far (more than one, if I am not too
> > > confused) there is a bug. Also in debug mode, you could have
> > > rte_rcu_qsbr_lock() increment a per-thread counter and
> > > rte_rcu_qsbr_unlock() decrement it.
> > > If the counter is non-zero at a quiescent state, there is a bug.
> > > And so on.
> > >
> > Added this in V5
> >
> > <snip>
> >
> > > > > > +
> > > > > > +/* Get the memory size of QSBR variable */
> > > > > > +size_t __rte_experimental
> > > > > > +rte_rcu_qsbr_get_memsize(uint32_t max_threads)
> > > > > > +{
> > > > > > +	size_t sz;
> > > > > > +
> > > > > > +	if (max_threads == 0) {
> > > > > > +		rte_log(RTE_LOG_ERR, rcu_log_type,
> > > > > > +			"%s(): Invalid max_threads %u\n",
> > > > > > +			__func__, max_threads);
> > > > > > +		rte_errno = EINVAL;
> > > > > > +
> > > > > > +		return 1;
> > > > > > +	}
> > > > > > +
> > > > > > +	sz = sizeof(struct rte_rcu_qsbr);
> > > > > > +
> > > > > > +	/* Add the size of quiescent state counter array */
> > > > > > +	sz += sizeof(struct rte_rcu_qsbr_cnt) * max_threads;
> > > > > > +
> > > > > > +	/* Add the size of the registered thread ID bitmap array */
> > > > > > +	sz += RTE_QSBR_THRID_ARRAY_SIZE(max_threads);
> > > > > > +
> > > > > > +	return RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
> > > > >
> > > > > Given that you align here, should you also align in the earlier steps
> > > > > in the computation of sz?
> > > >
> > > > Agree. I will remove the align here and keep the earlier one as the
> > > > intent is to align the thread ID array.
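
For anyone following along, here is a minimal sketch of how an application could size,
allocate and initialize the QS variable with this API. The use of rte_zmalloc() and the
error-handling conventions are just one possible choice on the application side, not
something the library mandates:

    #include <rte_malloc.h>
    #include <rte_rcu_qsbr.h>

    static struct rte_rcu_qsbr *
    alloc_qsbr(uint32_t max_threads)
    {
        struct rte_rcu_qsbr *v;
        size_t sz;

        /* Size covers the structure, the per-thread counter array and
         * the registered thread ID bitmap.
         */
        sz = rte_rcu_qsbr_get_memsize(max_threads);
        if (sz == 1) /* invalid max_threads */
            return NULL;

        /* Cache line alignment matches the layout the library expects */
        v = rte_zmalloc(NULL, sz, RTE_CACHE_LINE_SIZE);
        if (v == NULL)
            return NULL;

        if (rte_rcu_qsbr_init(v, max_threads) != 0) {
            rte_free(v);
            return NULL;
        }

        return v;
    }

The same sz value can equally be folded into one larger allocation if the QS variable is
embedded in a bigger application structure.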
> > >
> > > Sounds good!
> >
> > Added this in V5
> >
> > > > > > +}
> > > > > > +
> > > > > > +/* Initialize a quiescent state variable */
> > > > > > +int __rte_experimental
> > > > > > +rte_rcu_qsbr_init(struct rte_rcu_qsbr *v, uint32_t max_threads)
> > > > > > +{
> > > > > > +	size_t sz;
> > > > > > +
> > > > > > +	if (v == NULL) {
> > > > > > +		rte_log(RTE_LOG_ERR, rcu_log_type,
> > > > > > +			"%s(): Invalid input parameter\n", __func__);
> > > > > > +		rte_errno = EINVAL;
> > > > > > +
> > > > > > +		return 1;
> > > > > > +	}
> > > > > > +
> > > > > > +	sz = rte_rcu_qsbr_get_memsize(max_threads);
> > > > > > +	if (sz == 1)
> > > > > > +		return 1;
> > > > > > +
> > > > > > +	/* Set all the threads to offline */
> > > > > > +	memset(v, 0, sz);
> > > > >
> > > > > We calculate sz here, but it looks like the caller must also calculate
> > > > > it in order to correctly allocate the memory referenced by the "v"
> > > > > argument to this function, with bad things happening if the two
> > > > > calculations get different results. Should "v" instead be allocated
> > > > > within this function to avoid this sort of problem?
> > > >
> > > > The earlier version allocated the memory within this library. However,
> > > > it was decided to go with the current implementation as it provides
> > > > flexibility for the application to manage the memory as it sees fit.
> > > > For ex: it could allocate this as part of another structure in a single
> > > > allocation. This also falls in line with the approach taken in other
> > > > libraries.
> > >
> > > So the allocator APIs vary too much to allow a pointer to the desired
> > > allocator function to be passed in? Or do you also want to allow static
> > > allocation? If the latter, would a DEFINE_RTE_RCU_QSBR() be of use?
> > >
> > This is done to allow allocation of memory for the QS variable as part of
> > another, bigger data structure. This will help in not fragmenting the
> > memory. For ex:
> >
> > struct xyz {
> >	rte_ring *ring;
> >	rte_rcu_qsbr *v;
> >	abc *t;
> > };
> > struct xyz c;
> >
> > Memory for the above structure can be allocated in one chunk by
> > calculating the size required.
> >
> > In some use cases static allocation might be enough as 'max_threads'
> > might be a compile time constant. I am not sure how to support both
> > dynamic and static 'max_threads'.
>
> Same thought here - it would be good to have a static initializer
> (DEFINE_RTE_RCU_QSBR), but that means a new compile-time limit
> ('max_threads') - the thing that we try to avoid.
>
> > > > > > +	v->max_threads = max_threads;
> > > > > > +	v->num_elems = RTE_ALIGN_MUL_CEIL(max_threads,
> > > > > > +			RTE_QSBR_THRID_ARRAY_ELM_SIZE) /
> > > > > > +			RTE_QSBR_THRID_ARRAY_ELM_SIZE;
> > > > > > +	v->token = RTE_QSBR_CNT_INIT;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/* Register a reader thread to report its quiescent state
> > > > > > + * on a QS variable.
> > > > > > + */
> > > > > > +int __rte_experimental
> > > > > > +rte_rcu_qsbr_thread_register(struct rte_rcu_qsbr *v, unsigned int thread_id)
> > > > > > +{
> > > > > > +	unsigned int i, id, success;
> > > > > > +	uint64_t old_bmap, new_bmap;
> > > > > > +
> > > > > > +	if (v == NULL || thread_id >= v->max_threads) {
> > > > > > +		rte_log(RTE_LOG_ERR, rcu_log_type,
> > > > > > +			"%s(): Invalid input parameter\n", __func__);
> > > > > > +		rte_errno = EINVAL;
> > > > > > +
> > > > > > +		return 1;
> > > > > > +	}
> > > > > > +
> > > > > > +	id = thread_id & RTE_QSBR_THRID_MASK;
> > > > > > +	i = thread_id >> RTE_QSBR_THRID_INDEX_SHIFT;
> > > > > > +
> > > > > > +	/* Make sure that the counter for registered threads does not
> > > > > > +	 * go out of sync. Hence, additional checks are required.
> > > > > > +	 */
> > > > > > +	/* Check if the thread is already registered */
> > > > > > +	old_bmap = __atomic_load_n(RTE_QSBR_THRID_ARRAY_ELM(v, i),
> > > > > > +					__ATOMIC_RELAXED);
> > > > > > +	if (old_bmap & 1UL << id)
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	do {
> > > > > > +		new_bmap = old_bmap | (1UL << id);
> > > > > > +		success = __atomic_compare_exchange(
> > > > > > +				RTE_QSBR_THRID_ARRAY_ELM(v, i),
> > > > > > +				&old_bmap, &new_bmap, 0,
> > > > > > +				__ATOMIC_RELEASE, __ATOMIC_RELAXED);
> > > > > > +
> > > > > > +		if (success)
> > > > > > +			__atomic_fetch_add(&v->num_threads,
> > > > > > +						1, __ATOMIC_RELAXED);
> > > > > > +		else if (old_bmap & (1UL << id))
> > > > > > +			/* Someone else registered this thread.
> > > > > > +			 * Counter should not be incremented.
> > > > > > +			 */
> > > > > > +			return 0;
> > > > > > +	} while (success == 0);
> > > > >
> > > > > This would be simpler if threads were required to register themselves.
> > > > > Maybe you have use cases requiring registration of other threads, but
> > > > > this capability is adding significant complexity, so it might be worth
> > > > > some thought.
> > > >
> > > > It was simple earlier (__atomic_fetch_or). The complexity is added as
> > > > 'num_threads' should not go out of sync.
> > >
> > > Hmmm...
> > >
> > > So threads are allowed to register other threads? Or is there some other
> > > reason that concurrent registration is required?
> > >
> > Yes, control plane threads can register the fast path threads, though I am
> > not sure how useful it is. I did not want to add the restriction. I expect
> > that reader threads will register themselves. The reader threads require
> > concurrent registration as they all will be running in parallel.
> > If the requirement of keeping track of the number of threads registered
> > currently goes away, then this function will be simple.
> >
> > <snip>
> >
> > > > > > diff --git a/lib/librte_rcu/rte_rcu_qsbr.h b/lib/librte_rcu/rte_rcu_qsbr.h
> > > > > > new file mode 100644
> > > > > > index 000000000..ff696aeab
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_rcu/rte_rcu_qsbr.h
> > > > > > @@ -0,0 +1,554 @@
> > > > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > > > + * Copyright (c) 2018 Arm Limited
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef _RTE_RCU_QSBR_H_
> > > > > > +#define _RTE_RCU_QSBR_H_
> > > > > > +
> > > > > > +/**
> > > > > > + * @file
> > > > > > + * RTE Quiescent State Based Reclamation (QSBR)
> > > > > > + *
> > > > > > + * Quiescent State (QS) is any point in the thread execution
> > > > > > + * where the thread does not hold a reference to a data structure
> > > > > > + * in shared memory.
> > > > > > + * While using lock-less data structures, the writer
> > > > > > + * can safely free memory once all the reader threads have entered
> > > > > > + * quiescent state.
> > > > > > + *
> > > > > > + * This library provides the ability for the readers to report quiescent
> > > > > > + * state and for the writers to identify when all the readers have
> > > > > > + * entered quiescent state.
> > > > > > + */
> > > > > > +
> > > > > > +#ifdef __cplusplus
> > > > > > +extern "C" {
> > > > > > +#endif
> > > > > > +
> > > > > > +#include <stdio.h>
> > > > > > +#include <stdint.h>
> > > > > > +#include <errno.h>
> > > > > > +#include <rte_common.h>
> > > > > > +#include <rte_memory.h>
> > > > > > +#include <rte_lcore.h>
> > > > > > +#include <rte_debug.h>
> > > > > > +#include <rte_atomic.h>
> > > > > > +
> > > > > > +extern int rcu_log_type;
> > > > > > +
> > > > > > +#if RTE_LOG_DP_LEVEL >= RTE_LOG_DEBUG
> > > > > > +#define RCU_DP_LOG(level, fmt, args...) \
> > > > > > +	rte_log(RTE_LOG_ ## level, rcu_log_type, \
> > > > > > +		"%s(): " fmt "\n", __func__, ## args)
> > > > > > +#else
> > > > > > +#define RCU_DP_LOG(level, fmt, args...)
> > > > > > +#endif
> > > > > > +
> > > > > > +/* Registered thread IDs are stored as a bitmap of 64b element array.
> > > > > > + * Given thread id needs to be converted to index into the array and
> > > > > > + * the id within the array element.
> > > > > > + */
> > > > > > +#define RTE_QSBR_THRID_ARRAY_ELM_SIZE (sizeof(uint64_t) * 8)
> > > > > > +#define RTE_QSBR_THRID_ARRAY_SIZE(max_threads) \
> > > > > > +	RTE_ALIGN(RTE_ALIGN_MUL_CEIL(max_threads, \
> > > > > > +		RTE_QSBR_THRID_ARRAY_ELM_SIZE) >> 3, RTE_CACHE_LINE_SIZE)
> > > > > > +#define RTE_QSBR_THRID_ARRAY_ELM(v, i) ((uint64_t *) \
> > > > > > +	((struct rte_rcu_qsbr_cnt *)(v + 1) + v->max_threads) + i)
> > > > > > +#define RTE_QSBR_THRID_INDEX_SHIFT 6
> > > > > > +#define RTE_QSBR_THRID_MASK 0x3f
> > > > > > +#define RTE_QSBR_THRID_INVALID 0xffffffff
> > > > > > +
> > > > > > +/* Worker thread counter */
> > > > > > +struct rte_rcu_qsbr_cnt {
> > > > > > +	uint64_t cnt;
> > > > > > +	/**< Quiescent state counter. Value 0 indicates the thread is offline */
> > > > > > +} __rte_cache_aligned;
> > > > > > +
> > > > > > +#define RTE_QSBR_CNT_THR_OFFLINE 0
> > > > > > +#define RTE_QSBR_CNT_INIT 1
> > > > > > +
> > > > > > +/* RTE Quiescent State variable structure.
> > > > > > + * This structure has two elements that vary in size based on the
> > > > > > + * 'max_threads' parameter.
> > > > > > + * 1) Quiescent state counter array
> > > > > > + * 2) Register thread ID array
> > > > > > + */
> > > > > > +struct rte_rcu_qsbr {
> > > > > > +	uint64_t token __rte_cache_aligned;
> > > > > > +	/**< Counter to allow for multiple concurrent quiescent state queries */
> > > > > > +
> > > > > > +	uint32_t num_elems __rte_cache_aligned;
> > > > > > +	/**< Number of elements in the thread ID array */
> > > > > > +	uint32_t num_threads;
> > > > > > +	/**< Number of threads currently using this QS variable */
> > > > > > +	uint32_t max_threads;
> > > > > > +	/**< Maximum number of threads using this QS variable */
> > > > > > +
> > > > > > +	struct rte_rcu_qsbr_cnt qsbr_cnt[0] __rte_cache_aligned;
> > > > > > +	/**< Quiescent state counter array of 'max_threads' elements */
> > > > > > +
> > > > > > +	/**< Registered thread IDs are stored in a bitmap array,
> > > > > > +	 * after the quiescent state counter array.
> > > > > > +	 */
> > > > > > +} __rte_cache_aligned;
> > > > > > +
> > <snip>
> >
> > > > > > +
> > > > > > +/* Check the quiescent state counter for registered threads only,
> > > > > > + * assuming that not all threads have registered.
> > > > > > + */
> > > > > > +static __rte_always_inline int
> > > > > > +__rcu_qsbr_check_selective(struct rte_rcu_qsbr *v, uint64_t t, bool wait)
> > > > > > +{
> > > > > > +	uint32_t i, j, id;
> > > > > > +	uint64_t bmap;
> > > > > > +	uint64_t c;
> > > > > > +	uint64_t *reg_thread_id;
> > > > > > +
> > > > > > +	for (i = 0, reg_thread_id = RTE_QSBR_THRID_ARRAY_ELM(v, 0);
> > > > > > +		i < v->num_elems;
> > > > > > +		i++, reg_thread_id++) {
> > > > > > +		/* Load the current registered thread bit map before
> > > > > > +		 * loading the reader thread quiescent state counters.
> > > > > > +		 */
> > > > > > +		bmap = __atomic_load_n(reg_thread_id, __ATOMIC_ACQUIRE);
> > > > > > +		id = i << RTE_QSBR_THRID_INDEX_SHIFT;
> > > > > > +
> > > > > > +		while (bmap) {
> > > > > > +			j = __builtin_ctzl(bmap);
> > > > > > +			RCU_DP_LOG(DEBUG,
> > > > > > +				"%s: check: token = %lu, wait = %d, Bit Map = 0x%lx, Thread ID = %d",
> > > > > > +				__func__, t, wait, bmap, id + j);
> > > > > > +			c = __atomic_load_n(
> > > > > > +					&v->qsbr_cnt[id + j].cnt,
> > > > > > +					__ATOMIC_ACQUIRE);
> > > > > > +			RCU_DP_LOG(DEBUG,
> > > > > > +				"%s: status: token = %lu, wait = %d, Thread QS cnt = %lu, Thread ID = %d",
> > > > > > +				__func__, t, wait, c, id+j);
> > > > > > +			/* Counter is not checked for wrap-around condition
> > > > > > +			 * as it is a 64b counter.
> > > > > > +			 */
> > > > > > +			if (unlikely(c != RTE_QSBR_CNT_THR_OFFLINE && c < t)) {
> > > > >
> > > > > This assumes that a 64-bit counter won't overflow, which is close
> > > > > enough to true given current CPU clock frequencies. ;-)
> > > > >
> > > > > > +				/* This thread is not in quiescent state */
> > > > > > +				if (!wait)
> > > > > > +					return 0;
> > > > > > +
> > > > > > +				rte_pause();
> > > > > > +				/* This thread might have unregistered.
> > > > > > +				 * Re-read the bitmap.
> > > > > > +				 */
> > > > > > +				bmap = __atomic_load_n(reg_thread_id,
> > > > > > +						__ATOMIC_ACQUIRE);
> > > > > > +
> > > > > > +				continue;
> > > > > > +			}
> > > > > > +
> > > > > > +			bmap &= ~(1UL << j);
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	return 1;
> > > > > > +}
> > > > > > +
> > > > > > +/* Check the quiescent state counter for all threads, assuming that
> > > > > > + * all the threads have registered.
> > > > > > + */
> > > > > > +static __rte_always_inline int
> > > > > > +__rcu_qsbr_check_all(struct rte_rcu_qsbr *v, uint64_t t, bool wait)
> > > > >
> > > > > Does checking the bitmap really take long enough to make this
> > > > > worthwhile as a separate function? I would think that the
> > > > > bitmap-checking time would be lost in the noise of cache misses from
> > > > > the ->cnt loads.
> > > >
> > > > It avoids accessing one cache line. I think this is where the savings
> > > > are (maybe in theory). This is the most probable use case.
> > > > On the other hand, __rcu_qsbr_check_selective() will result in savings
> > > > (depending on how many threads are currently registered) by avoiding
> > > > accessing unwanted counters.
> > >
> > > Do you really expect to be calling this function on any kind of fastpath?
> >
> > Yes.
> > For some of the libraries (rte_hash), the writer is on the fast path.
> >
> > > > > Sure, if you invoke __rcu_qsbr_check_selective() in a tight loop in
> > > > > the absence of readers, you might see __rcu_qsbr_check_all() being a
> > > > > bit faster. But is that really what DPDK does?
> > > >
> > > > I see improvements in the synthetic test case (similar to the one you
> > > > have described, around 27%). However, in the more practical test cases
> > > > I do not see any difference.
> > >
> > > If the performance improvement only occurs in a synthetic test case,
> > > does it really make sense to optimize for it?
> >
> > I had to fix a few issues in the performance test cases and added more to
> > do the comparison. These changes are in v5.
> > There are 4 performance tests involving this API.
> > 1) 1 Writer, 'N' readers
> >    Writer: qsbr_start, qsbr_check(wait = true)
> >    Readers: qsbr_quiescent
> > 2) 'N' writers
> >    Writers: qsbr_start, qsbr_check(wait == false)
> > 3) 1 Writer, 'N' readers (this test uses the lock-free rte_hash data structure)
> >    Writer: hash_del, qsbr_start, qsbr_check(wait = true), validate that
> >    the reader was able to complete its work successfully
> >    Readers: thread_online, hash_lookup, access the pointer - do some work
> >    on it, qsbr_quiescent, thread_offline
> > 4) Same as test 3) but qsbr_check(wait == false)
> >
> > There are 2 sets of these tests.
> > a) QS variable is created with number of threads same as number of
> >    readers - this will exercise __rcu_qsbr_check_all
> > b) QS variable is created with 128 threads, number of registered threads
> >    is same as in a) - this will exercise __rcu_qsbr_check_selective
> >
> > Following are the results on x86 (E5-2660 v4 @ 2.00GHz) comparing from
> > a) to b) (on x86 in my setup, the results are not very stable between runs)
> > 1) 25%
> > 2) -3%
> > 3) -0.4%
> > 4) 1.38%
> >
> > Following are the results on an Arm system comparing from a) to b)
> > (results are not pretty stable between runs)
^^^ Correction, on the Arm system, the results *are* stable (copy-paste error)
> > 1) -3.45%
> > 2) 0%
> > 3) -0.03%
> > 4) -0.04%
> >
> > Konstantin, is it possible to run the tests on your setup and look at
> > the results?
>
> I did run V5 on my box (SKX 2.1 GHz) with 17 lcores (1 physical core per
> thread).
> Didn't notice any significant fluctuations between runs, output below.
>
> >rcu_qsbr_perf_autotest
> Number of cores provided = 17
> Perf test with all reader threads registered
> --------------------------------------------
>
> Perf Test: 16 Readers/1 Writer('wait' in qsbr_check == true)
> Total RCU updates = 65707232899
> Cycles per 1000 updates: 18482
> Total RCU checks = 20000000
> Cycles per 1000 checks: 3794991
>
> Perf Test: 17 Readers
> Total RCU updates = 1700000000
> Cycles per 1000 updates: 2128
>
> Perf test: 17 Writers ('wait' in qsbr_check == false)
> Total RCU checks = 340000000
> Cycles per 1000 checks: 10030
>
> Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Blocking QSBR Check
> Following numbers include calls to rte_hash functions
> Cycles per 1 update(online/update/offline): 1984696
> Cycles per 1 check(start, check): 2619002
>
> Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Non-Blocking QSBR check
> Following numbers include calls to rte_hash functions
> Cycles per 1 update(online/update/offline): 2028030
> Cycles per 1 check(start, check): 2876667
>
> Perf test with some of reader threads registered
> ------------------------------------------------
>
> Perf Test: 16 Readers/1 Writer('wait' in qsbr_check == true)
> Total RCU updates = 68850073055
> Cycles per 1000 updates: 25490
> Total RCU checks = 20000000
> Cycles per 1000 checks: 5484403
>
> Perf Test: 17 Readers
> Total RCU updates = 1700000000
> Cycles per 1000 updates: 2127
>
> Perf test: 17 Writers ('wait' in qsbr_check == false)
> Total RCU checks = 340000000
> Cycles per 1000 checks: 10034
>
> Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Blocking QSBR Check
> Following numbers include calls to rte_hash functions
> Cycles per 1 update(online/update/offline): 3604489
> Cycles per 1 check(start, check): 7077372
>
> Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Non-Blocking QSBR check
> Following numbers include calls to rte_hash functions
> Cycles per 1 update(online/update/offline): 3936831
> Cycles per 1 check(start, check): 7262738
>
> Test OK

Thanks for running the test. From the numbers, the comparison from a) to b) is as follows:
1) -44%
2) 0.03%
3) -170%
4) -152%
The trend is the same between x86 and Arm. However, x86 shows a drastic improvement with the __rcu_qsbr_check_all function.

> Konstantin
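
To make the flow exercised by tests 3) and 4) above concrete, here is a rough sketch of the
writer/reader split. It is only an illustration: a plain pointer swap stands in for the
rte_hash delete, and the exact signatures (in particular rte_rcu_qsbr_start() returning the
token) are my shorthand for this patch set rather than a definitive statement of the API.

    #include <stdlib.h>
    #include <rte_rcu_qsbr.h>

    static struct rte_rcu_qsbr *v;  /* initialized via rte_rcu_qsbr_init() */
    static void *shared_obj;        /* object the readers dereference without locks */

    /* Writer side of tests 3)/4): unlink the object, then wait for the readers */
    static void
    writer_delete(void *new_obj, bool wait)
    {
        uint64_t token;
        void *old = __atomic_exchange_n(&shared_obj, new_obj, __ATOMIC_RELEASE);

        token = rte_rcu_qsbr_start(v);          /* start a new quiescent state query */
        if (rte_rcu_qsbr_check(v, token, wait)) /* wait == true blocks until all readers report QS */
            free(old);                          /* safe: no registered reader can still hold 'old' */
        /* with wait == false and a 0 return, the caller re-checks later before freeing */
    }

    /* Reader side of tests 3)/4): online -> lookup/use -> report QS -> offline */
    static void
    reader_iteration(unsigned int thread_id)
    {
        void *obj;

        rte_rcu_qsbr_thread_online(v, thread_id);  /* thread was registered beforehand */

        obj = __atomic_load_n(&shared_obj, __ATOMIC_ACQUIRE);
        /* ... do some work on obj ... */

        rte_rcu_qsbr_quiescent(v, thread_id);      /* no references held past this point */
        rte_rcu_qsbr_thread_offline(v, thread_id);
    }

In set a) above the qsbr_check ends up in __rcu_qsbr_check_all(), while in set b) it goes
through __rcu_qsbr_check_selective(), which is where the differences in the numbers come from.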