> > > > > > > > On Wed, Apr 10, 2019 at 06:20:04AM -0500, Honnappa Nagarahalli wrote:
> > > > > Add RCU library supporting quiescent state based memory reclamation
> > > > > method. This library helps identify the quiescent state of the reader
> > > > > threads so that the writers can free the memory associated with
> > > > > the lock less data structures.
> > > >
> > > > I don't see any sign of read-side markers (rcu_read_lock() and
> > > > rcu_read_unlock() in the Linux kernel, userspace RCU, etc.).
> > > >
> > > > Yes, strictly speaking, these are not needed for QSBR to operate, but they
> > > These APIs would be empty for QSBR.
> > >
> > > > make it way easier to maintain and debug code using RCU. For example,
> > > > given the read-side markers, you can check for errors like having a call to
> > > > rte_rcu_qsbr_quiescent() in the middle of a reader quite easily.
> > > > Without those read-side markers, life can be quite hard and you will
> > > > really hate yourself for failing to have provided them.
> > >
> > > Want to make sure I understood this, do you mean the application would mark
> > > before and after accessing the shared data structure on the reader side?
> > >
> > > rte_rcu_qsbr_lock()
> > > <begin access shared data structure>
> > > ...
> > > ...
> > > <end access shared data structure>
> > > rte_rcu_qsbr_unlock()
> >
> > Yes, that is the idea.
> >
> > > If someone is debugging this code, they have to make sure that there is
> > > an unlock for every lock and there is no call to rte_rcu_qsbr_quiescent
> > > in between.
> > > It sounds good to me. Obviously, they will not add any additional cycles
> > > as well.
> > > Please let me know if my understanding is correct.
> >
> > Yes. And in some sort of debug mode, you could capture the counter at
> > rte_rcu_qsbr_lock() time and check it at rte_rcu_qsbr_unlock() time. If the
> > counter has advanced too far (more than one, if I am not too confused)
> > there is a bug. Also in debug mode, you could have rte_rcu_qsbr_lock()
> > increment a per-thread counter and rte_rcu_qsbr_unlock() decrement it.
> > If the counter is non-zero at a quiescent state, there is a bug.
> > And so on.
> Added this in V5
> > <snip>
> > > > > +
> > > > > +/* Get the memory size of QSBR variable */
> > > > > +size_t __rte_experimental
> > > > > +rte_rcu_qsbr_get_memsize(uint32_t max_threads)
> > > > > +{
> > > > > +	size_t sz;
> > > > > +
> > > > > +	if (max_threads == 0) {
> > > > > +		rte_log(RTE_LOG_ERR, rcu_log_type,
> > > > > +			"%s(): Invalid max_threads %u\n",
> > > > > +			__func__, max_threads);
> > > > > +		rte_errno = EINVAL;
> > > > > +
> > > > > +		return 1;
> > > > > +	}
> > > > > +
> > > > > +	sz = sizeof(struct rte_rcu_qsbr);
> > > > > +
> > > > > +	/* Add the size of quiescent state counter array */
> > > > > +	sz += sizeof(struct rte_rcu_qsbr_cnt) * max_threads;
> > > > > +
> > > > > +	/* Add the size of the registered thread ID bitmap array */
> > > > > +	sz += RTE_QSBR_THRID_ARRAY_SIZE(max_threads);
> > > > > +
> > > > > +	return RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
> > > >
> > > > Given that you align here, should you also align in the earlier
> > > > steps in the computation of sz?
> > >
> > > Agree. I will remove the align here and keep the earlier one as the intent
> > > is to align the thread ID array.
> >
> > Sounds good!
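As an editor's sketch of the debug-mode read-side checks discussed above (this is illustration only, not the code added in v5; the names dbg_rcu_lock_cnt, dbg_rcu_lock(), dbg_rcu_unlock() and dbg_rcu_quiescent_check() are made up):

    #include <assert.h>

    /* Per-thread read-side nesting depth, maintained only in debug builds. */
    static __thread int dbg_rcu_lock_cnt;

    static inline void dbg_rcu_lock(void)
    {
    	dbg_rcu_lock_cnt++;
    }

    static inline void dbg_rcu_unlock(void)
    {
    	dbg_rcu_lock_cnt--;
    	/* More unlocks than locks is a bug. */
    	assert(dbg_rcu_lock_cnt >= 0);
    }

    /* Called from the quiescent-state report path in debug builds:
     * reporting a quiescent state while still inside a read-side
     * critical section is a bug.
     */
    static inline void dbg_rcu_quiescent_check(void)
    {
    	assert(dbg_rcu_lock_cnt == 0);
    }

In a non-debug build such markers would be compiled out (e.g. behind a build-time debug option, which is an assumption here), so they add no cycles in production, matching the expectation stated above.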
> Added this in V5
> > > > > +}
> > > > > +
> > > > > +/* Initialize a quiescent state variable */
> > > > > +int __rte_experimental
> > > > > +rte_rcu_qsbr_init(struct rte_rcu_qsbr *v, uint32_t max_threads)
> > > > > +{
> > > > > +	size_t sz;
> > > > > +
> > > > > +	if (v == NULL) {
> > > > > +		rte_log(RTE_LOG_ERR, rcu_log_type,
> > > > > +			"%s(): Invalid input parameter\n", __func__);
> > > > > +		rte_errno = EINVAL;
> > > > > +
> > > > > +		return 1;
> > > > > +	}
> > > > > +
> > > > > +	sz = rte_rcu_qsbr_get_memsize(max_threads);
> > > > > +	if (sz == 1)
> > > > > +		return 1;
> > > > > +
> > > > > +	/* Set all the threads to offline */
> > > > > +	memset(v, 0, sz);
> > > >
> > > > We calculate sz here, but it looks like the caller must also calculate it
> > > > in order to correctly allocate the memory referenced by the "v" argument
> > > > to this function, with bad things happening if the two calculations get
> > > > different results. Should "v" instead be allocated within this function
> > > > to avoid this sort of problem?
> > >
> > > Earlier version allocated the memory within this library. However, it was
> > > decided to go with the current implementation as it provides flexibility for
> > > the application to manage the memory as it sees fit. For ex: it could
> > > allocate this as part of another structure in a single allocation. This also
> > > falls in line with the approach taken in other libraries.
> >
> > So the allocator APIs vary too much to allow a pointer to the desired
> > allocator function to be passed in? Or do you also want to allow static
> > allocation? If the latter, would a DEFINE_RTE_RCU_QSBR() be of use?
>
> This is done to allow for allocation of memory for the QS variable as part of
> another, bigger data structure. This will help in not fragmenting the memory.
> For ex:
>
> struct xyz {
>     rte_ring *ring;
>     rte_rcu_qsbr *v;
>     abc *t;
> };
> struct xyz c;
>
> Memory for the above structure can be allocated in one chunk by calculating
> the size required.
>
> In some use cases static allocation might be enough as 'max_threads' might be
> a compile time constant. I am not sure how to support both dynamic and static
> 'max_threads'.
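For reference, a minimal sketch of the caller-side allocation pattern being discussed, assuming rte_zmalloc() is used and using a made-up wrapper function name; the point is only that the caller sizes the memory with rte_rcu_qsbr_get_memsize() and passes the same 'max_threads' to rte_rcu_qsbr_init():

    #include <rte_malloc.h>
    #include <rte_rcu_qsbr.h>	/* header added by this patch */

    #define MY_MAX_READER_THREADS 8	/* example value */

    static struct rte_rcu_qsbr *
    my_qsbr_alloc(void)
    {
    	size_t sz;
    	struct rte_rcu_qsbr *v;

    	sz = rte_rcu_qsbr_get_memsize(MY_MAX_READER_THREADS);
    	if (sz == 1)	/* error value per the patch above */
    		return NULL;

    	/* Cache-line aligned, zeroed allocation of exactly 'sz' bytes. */
    	v = rte_zmalloc("qsbr", sz, RTE_CACHE_LINE_SIZE);
    	if (v == NULL)
    		return NULL;

    	/* Must use the same 'max_threads' as the size calculation. */
    	if (rte_rcu_qsbr_init(v, MY_MAX_READER_THREADS) != 0) {
    		rte_free(v);
    		return NULL;
    	}
    	return v;
    }

The same pattern applies when the QS variable is placed at the tail of a larger structure, as in the struct xyz example above: the caller adds rte_rcu_qsbr_get_memsize() to the size of the enclosing structure and allocates the whole thing in one chunk.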
Same thought here - it would be good to have a static initializer (DEFINE_RTE_RCU_QSBR), but that means a new compile-time limit ('max_threads') - the thing that we try to avoid.

> > > > > +	v->max_threads = max_threads;
> > > > > +	v->num_elems = RTE_ALIGN_MUL_CEIL(max_threads,
> > > > > +			RTE_QSBR_THRID_ARRAY_ELM_SIZE) /
> > > > > +			RTE_QSBR_THRID_ARRAY_ELM_SIZE;
> > > > > +	v->token = RTE_QSBR_CNT_INIT;
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/* Register a reader thread to report its quiescent state
> > > > > + * on a QS variable.
> > > > > + */
> > > > > +int __rte_experimental
> > > > > +rte_rcu_qsbr_thread_register(struct rte_rcu_qsbr *v, unsigned int thread_id)
> > > > > +{
> > > > > +	unsigned int i, id, success;
> > > > > +	uint64_t old_bmap, new_bmap;
> > > > > +
> > > > > +	if (v == NULL || thread_id >= v->max_threads) {
> > > > > +		rte_log(RTE_LOG_ERR, rcu_log_type,
> > > > > +			"%s(): Invalid input parameter\n", __func__);
> > > > > +		rte_errno = EINVAL;
> > > > > +
> > > > > +		return 1;
> > > > > +	}
> > > > > +
> > > > > +	id = thread_id & RTE_QSBR_THRID_MASK;
> > > > > +	i = thread_id >> RTE_QSBR_THRID_INDEX_SHIFT;
> > > > > +
> > > > > +	/* Make sure that the counter for registered threads does not
> > > > > +	 * go out of sync. Hence, additional checks are required.
> > > > > +	 */
> > > > > +	/* Check if the thread is already registered */
> > > > > +	old_bmap = __atomic_load_n(RTE_QSBR_THRID_ARRAY_ELM(v, i),
> > > > > +					__ATOMIC_RELAXED);
> > > > > +	if (old_bmap & 1UL << id)
> > > > > +		return 0;
> > > > > +
> > > > > +	do {
> > > > > +		new_bmap = old_bmap | (1UL << id);
> > > > > +		success = __atomic_compare_exchange(
> > > > > +				RTE_QSBR_THRID_ARRAY_ELM(v, i),
> > > > > +				&old_bmap, &new_bmap, 0,
> > > > > +				__ATOMIC_RELEASE, __ATOMIC_RELAXED);
> > > > > +
> > > > > +		if (success)
> > > > > +			__atomic_fetch_add(&v->num_threads,
> > > > > +						1, __ATOMIC_RELAXED);
> > > > > +		else if (old_bmap & (1UL << id))
> > > > > +			/* Someone else registered this thread.
> > > > > +			 * Counter should not be incremented.
> > > > > +			 */
> > > > > +			return 0;
> > > > > +	} while (success == 0);
> > > >
> > > > This would be simpler if threads were required to register themselves.
> > > > Maybe you have use cases requiring registration of other threads,
> > > > but this capability is adding significant complexity, so it might be
> > > > worth some thought.
> > >
> > > It was simple earlier (__atomic_fetch_or). The complexity is added as
> > > 'num_threads' should not go out of sync.
> >
> > Hmmm...
> >
> > So threads are allowed to register other threads? Or is there some other
> > reason that concurrent registration is required?
>
> Yes, control plane threads can register the fast path threads. Though, I am
> not sure how useful it is. I did not want to add the restriction. I
> expect that reader threads will register themselves. The reader threads
> require concurrent registration as they all will be running in parallel.
> If the requirement of keeping track of the number of threads registered
> currently goes away, then this function will be simple.
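To make the trade-off concrete, an editor's sketch of what a DEFINE_RTE_RCU_QSBR()-style macro could look like; this is purely hypothetical, not proposed code, and only illustrates why the backing storage forces 'max_threads' to be a compile-time constant:

    /* Size mirrors rte_rcu_qsbr_get_memsize(): fixed header, per-thread
     * counter array, then the registered-thread ID bitmap.
     */
    #define RTE_RCU_QSBR_SIZE(max_thr)				\
    	(sizeof(struct rte_rcu_qsbr) +				\
    	 sizeof(struct rte_rcu_qsbr_cnt) * (max_thr) +		\
    	 RTE_QSBR_THRID_ARRAY_SIZE(max_thr))

    /* Statically sized, cache-line aligned backing storage; 'max_thr'
     * must be a compile-time constant, and rte_rcu_qsbr_init(name, max_thr)
     * still has to run once before the variable is used.
     */
    #define DEFINE_RTE_RCU_QSBR(name, max_thr)			\
    	static uint8_t name##_mem[RTE_RCU_QSBR_SIZE(max_thr)]	\
    		__rte_cache_aligned;				\
    	static struct rte_rcu_qsbr *const name =		\
    		(struct rte_rcu_qsbr *)name##_mem

Usage would then be something like DEFINE_RTE_RCU_QSBR(g_qsbr, 64); at file scope followed by rte_rcu_qsbr_init(g_qsbr, 64); at startup, which is exactly where the new compile-time limit creeps in.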
> > <snip>
> > > > > diff --git a/lib/librte_rcu/rte_rcu_qsbr.h b/lib/librte_rcu/rte_rcu_qsbr.h
> > > > > new file mode 100644
> > > > > index 000000000..ff696aeab
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_rcu/rte_rcu_qsbr.h
> > > > > @@ -0,0 +1,554 @@
> > > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > > + * Copyright (c) 2018 Arm Limited
> > > > > + */
> > > > > +
> > > > > +#ifndef _RTE_RCU_QSBR_H_
> > > > > +#define _RTE_RCU_QSBR_H_
> > > > > +
> > > > > +/**
> > > > > + * @file
> > > > > + * RTE Quiescent State Based Reclamation (QSBR)
> > > > > + *
> > > > > + * Quiescent State (QS) is any point in the thread execution
> > > > > + * where the thread does not hold a reference to a data structure
> > > > > + * in shared memory. While using lock-less data structures, the writer
> > > > > + * can safely free memory once all the reader threads have entered
> > > > > + * quiescent state.
> > > > > + *
> > > > > + * This library provides the ability for the readers to report quiescent
> > > > > + * state and for the writers to identify when all the readers have
> > > > > + * entered quiescent state.
> > > > > + */
> > > > > +
> > > > > +#ifdef __cplusplus
> > > > > +extern "C" {
> > > > > +#endif
> > > > > +
> > > > > +#include <stdio.h>
> > > > > +#include <stdint.h>
> > > > > +#include <errno.h>
> > > > > +#include <rte_common.h>
> > > > > +#include <rte_memory.h>
> > > > > +#include <rte_lcore.h>
> > > > > +#include <rte_debug.h>
> > > > > +#include <rte_atomic.h>
> > > > > +
> > > > > +extern int rcu_log_type;
> > > > > +
> > > > > +#if RTE_LOG_DP_LEVEL >= RTE_LOG_DEBUG
> > > > > +#define RCU_DP_LOG(level, fmt, args...) \
> > > > > +	rte_log(RTE_LOG_ ## level, rcu_log_type, \
> > > > > +		"%s(): " fmt "\n", __func__, ## args)
> > > > > +#else
> > > > > +#define RCU_DP_LOG(level, fmt, args...)
> > > > > +#endif
> > > > > +
> > > > > +/* Registered thread IDs are stored as a bitmap of 64b element array.
> > > > > + * Given thread id needs to be converted to index into the array and
> > > > > + * the id within the array element.
> > > > > + */
> > > > > +#define RTE_QSBR_THRID_ARRAY_ELM_SIZE (sizeof(uint64_t) * 8)
> > > > > +#define RTE_QSBR_THRID_ARRAY_SIZE(max_threads) \
> > > > > +	RTE_ALIGN(RTE_ALIGN_MUL_CEIL(max_threads, \
> > > > > +		RTE_QSBR_THRID_ARRAY_ELM_SIZE) >> 3, RTE_CACHE_LINE_SIZE)
> > > > > +#define RTE_QSBR_THRID_ARRAY_ELM(v, i) ((uint64_t *) \
> > > > > +	((struct rte_rcu_qsbr_cnt *)(v + 1) + v->max_threads) + i)
> > > > > +#define RTE_QSBR_THRID_INDEX_SHIFT 6
> > > > > +#define RTE_QSBR_THRID_MASK 0x3f
> > > > > +#define RTE_QSBR_THRID_INVALID 0xffffffff
> > > > > +
> > > > > +/* Worker thread counter */
> > > > > +struct rte_rcu_qsbr_cnt {
> > > > > +	uint64_t cnt;
> > > > > +	/**< Quiescent state counter. Value 0 indicates the thread is offline */
> > > > > +} __rte_cache_aligned;
> > > > > +
> > > > > +#define RTE_QSBR_CNT_THR_OFFLINE 0
> > > > > +#define RTE_QSBR_CNT_INIT 1
> > > > > +
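As an editor's illustration of the RTE_QSBR_THRID_INDEX_SHIFT / RTE_QSBR_THRID_MASK arithmetic above (not part of the patch): thread_id 70, for example, maps to bitmap element 1 (70 >> 6) and bit 6 (70 & 0x3f). The helper name below is made up.

    #include <stdint.h>
    #include <stdbool.h>

    /* Test whether 'thread_id' is set in a bitmap laid out as an array of
     * 64-bit elements, as the header above does.
     */
    static inline bool
    qsbr_thread_bit_is_set(const uint64_t *bitmap, unsigned int thread_id)
    {
    	unsigned int elem = thread_id >> 6;   /* RTE_QSBR_THRID_INDEX_SHIFT */
    	unsigned int bit  = thread_id & 0x3f; /* RTE_QSBR_THRID_MASK */

    	return (bitmap[elem] & (1UL << bit)) != 0;
    }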
> > > > > +/* RTE Quiescent State variable structure.
> > > > > + * This structure has two elements that vary in size based on the
> > > > > + * 'max_threads' parameter.
> > > > > + * 1) Quiescent state counter array
> > > > > + * 2) Register thread ID array
> > > > > + */
> > > > > +struct rte_rcu_qsbr {
> > > > > +	uint64_t token __rte_cache_aligned;
> > > > > +	/**< Counter to allow for multiple concurrent quiescent state queries */
> > > > > +
> > > > > +	uint32_t num_elems __rte_cache_aligned;
> > > > > +	/**< Number of elements in the thread ID array */
> > > > > +	uint32_t num_threads;
> > > > > +	/**< Number of threads currently using this QS variable */
> > > > > +	uint32_t max_threads;
> > > > > +	/**< Maximum number of threads using this QS variable */
> > > > > +
> > > > > +	struct rte_rcu_qsbr_cnt qsbr_cnt[0] __rte_cache_aligned;
> > > > > +	/**< Quiescent state counter array of 'max_threads' elements */
> > > > > +
> > > > > +	/**< Registered thread IDs are stored in a bitmap array,
> > > > > +	 * after the quiescent state counter array.
> > > > > +	 */
> > > > > +} __rte_cache_aligned;
> > > > > +
> > <snip>
> > > > > +
> > > > > +/* Check the quiescent state counter for registered threads only,
> > > > > + * assuming that not all threads have registered.
> > > > > + */
> > > > > +static __rte_always_inline int
> > > > > +__rcu_qsbr_check_selective(struct rte_rcu_qsbr *v, uint64_t t, bool wait)
> > > > > +{
> > > > > +	uint32_t i, j, id;
> > > > > +	uint64_t bmap;
> > > > > +	uint64_t c;
> > > > > +	uint64_t *reg_thread_id;
> > > > > +
> > > > > +	for (i = 0, reg_thread_id = RTE_QSBR_THRID_ARRAY_ELM(v, 0);
> > > > > +		i < v->num_elems;
> > > > > +		i++, reg_thread_id++) {
> > > > > +		/* Load the current registered thread bit map before
> > > > > +		 * loading the reader thread quiescent state counters.
> > > > > +		 */
> > > > > +		bmap = __atomic_load_n(reg_thread_id, __ATOMIC_ACQUIRE);
> > > > > +		id = i << RTE_QSBR_THRID_INDEX_SHIFT;
> > > > > +
> > > > > +		while (bmap) {
> > > > > +			j = __builtin_ctzl(bmap);
> > > > > +			RCU_DP_LOG(DEBUG,
> > > > > +				"%s: check: token = %lu, wait = %d, Bit Map = 0x%lx, Thread ID = %d",
> > > > > +				__func__, t, wait, bmap, id + j);
> > > > > +			c = __atomic_load_n(
> > > > > +				&v->qsbr_cnt[id + j].cnt,
> > > > > +				__ATOMIC_ACQUIRE);
> > > > > +			RCU_DP_LOG(DEBUG,
> > > > > +				"%s: status: token = %lu, wait = %d, Thread QS cnt = %lu, Thread ID = %d",
> > > > > +				__func__, t, wait, c, id+j);
> > > > > +			/* Counter is not checked for wrap-around condition
> > > > > +			 * as it is a 64b counter.
> > > > > +			 */
> > > > > +			if (unlikely(c != RTE_QSBR_CNT_THR_OFFLINE && c < t)) {
> > > >
> > > > This assumes that a 64-bit counter won't overflow, which is close
> > > > enough to true given current CPU clock frequencies. ;-)
> > > >
> > > > > +				/* This thread is not in quiescent state */
> > > > > +				if (!wait)
> > > > > +					return 0;
> > > > > +
> > > > > +				rte_pause();
> > > > > +				/* This thread might have unregistered.
> > > > > +				 * Re-read the bitmap.
> > > > > +				 */
> > > > > +				bmap = __atomic_load_n(reg_thread_id,
> > > > > +						__ATOMIC_ACQUIRE);
> > > > > +
> > > > > +				continue;
> > > > > +			}
> > > > > +
> > > > > +			bmap &= ~(1UL << j);
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	return 1;
> > > > > +}
> > > > > +
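For clarity, an editor's sketch of the pointer arithmetic behind RTE_QSBR_THRID_ARRAY_ELM(), which __rcu_qsbr_check_selective() above uses to find the registered-thread bitmap: the bitmap starts immediately after the fixed part of struct rte_rcu_qsbr and the 'max_threads' cache-line sized counters. This helper is not part of the patch; it assumes the rte_rcu_qsbr.h from this series.

    #include <stdint.h>
    #include <rte_rcu_qsbr.h>	/* header added by this patch */

    /* Equivalent of RTE_QSBR_THRID_ARRAY_ELM(v, 0): skip the fixed header
     * and the per-thread counter array to reach the thread-ID bitmap.
     */
    static inline uint64_t *
    qsbr_thrid_bitmap(struct rte_rcu_qsbr *v)
    {
    	return (uint64_t *)((struct rte_rcu_qsbr_cnt *)(v + 1) +
    				v->max_threads);
    }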
> > > > > +/* Check the quiescent state counter for all threads, assuming that
> > > > > + * all the threads have registered.
> > > > > + */
> > > > > +static __rte_always_inline int
> > > > > +__rcu_qsbr_check_all(struct rte_rcu_qsbr *v, uint64_t t, bool wait)
> > > >
> > > > Does checking the bitmap really take long enough to make this
> > > > worthwhile as a separate function? I would think that the
> > > > bitmap-checking time would be lost in the noise of cache misses from
> > > > the ->cnt loads.
> > >
> > > It avoids accessing one cache line. I think this is where the savings are
> > > (may be in theory). This is the most probable use case.
> > > On the other hand, __rcu_qsbr_check_selective() will result in savings
> > > (depending on how many threads are currently registered) by avoiding
> > > accessing unwanted counters.
> >
> > Do you really expect to be calling this function on any kind of fastpath?
>
> Yes. For some of the libraries (rte_hash), the writer is on the fast path.
>
> > > > Sure, if you invoke __rcu_qsbr_check_selective() in a tight loop in
> > > > the absence of readers, you might see __rcu_qsbr_check_all() being a
> > > > bit faster. But is that really what DPDK does?
> > >
> > > I see improvements in the synthetic test case (similar to the one you
> > > have described, around 27%). However, in the more practical test cases I
> > > do not see any difference.
> >
> > If the performance improvement only occurs in a synthetic test case, does
> > it really make sense to optimize for it?
>
> I had to fix a few issues in the performance test cases and added more to do
> the comparison. These changes are in v5.
> There are 4 performance tests involving this API.
> 1) 1 Writer, 'N' readers
>    Writer: qsbr_start, qsbr_check(wait = true)
>    Readers: qsbr_quiescent
> 2) 'N' writers
>    Writers: qsbr_start, qsbr_check(wait == false)
> 3) 1 Writer, 'N' readers (this test uses the lock-free rte_hash data structure)
>    Writer: hash_del, qsbr_start, qsbr_check(wait = true), validate that the
>    reader was able to complete its work successfully
>    Readers: thread_online, hash_lookup, access the pointer - do some work on
>    it, qsbr_quiescent, thread_offline
> 4) Same as test 3) but qsbr_check(wait == false)
>
> There are 2 sets of these tests.
> a) QS variable is created with number of threads same as number of readers -
>    this will exercise __rcu_qsbr_check_all
> b) QS variable is created with 128 threads, number of registered threads is
>    same as in a) - this will exercise __rcu_qsbr_check_selective
>
> Following are the results on x86 (E5-2660 v4 @ 2.00GHz) comparing a) to b)
> (on x86 in my setup, the results are not very stable between runs):
> 1) 25%
> 2) -3%
> 3) -0.4%
> 4) 1.38%
>
> Following are the results on an Arm system comparing a) to b) (results are
> not very stable between runs):
> 1) -3.45%
> 2) 0%
> 3) -0.03%
> 4) -0.04%
>
> Konstantin, is it possible to run the tests on your setup and look at the
> results?

I did run V5 on my box (SKX 2.1 GHz) with 17 lcores (1 physical core per thread).
Didn't notice any significant fluctuations between runs, output below.
>rcu_qsbr_perf_autotest
Number of cores provided = 17
Perf test with all reader threads registered
--------------------------------------------

Perf Test: 16 Readers/1 Writer('wait' in qsbr_check == true)
Total RCU updates = 65707232899
Cycles per 1000 updates: 18482
Total RCU checks = 20000000
Cycles per 1000 checks: 3794991

Perf Test: 17 Readers
Total RCU updates = 1700000000
Cycles per 1000 updates: 2128

Perf test: 17 Writers ('wait' in qsbr_check == false)
Total RCU checks = 340000000
Cycles per 1000 checks: 10030

Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Blocking QSBR Check
Following numbers include calls to rte_hash functions
Cycles per 1 update(online/update/offline): 1984696
Cycles per 1 check(start, check): 2619002

Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Non-Blocking QSBR check
Following numbers include calls to rte_hash functions
Cycles per 1 update(online/update/offline): 2028030
Cycles per 1 check(start, check): 2876667

Perf test with some of reader threads registered
------------------------------------------------

Perf Test: 16 Readers/1 Writer('wait' in qsbr_check == true)
Total RCU updates = 68850073055
Cycles per 1000 updates: 25490
Total RCU checks = 20000000
Cycles per 1000 checks: 5484403

Perf Test: 17 Readers
Total RCU updates = 1700000000
Cycles per 1000 updates: 2127

Perf test: 17 Writers ('wait' in qsbr_check == false)
Total RCU checks = 340000000
Cycles per 1000 checks: 10034

Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Blocking QSBR Check
Following numbers include calls to rte_hash functions
Cycles per 1 update(online/update/offline): 3604489
Cycles per 1 check(start, check): 7077372

Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Non-Blocking QSBR check
Following numbers include calls to rte_hash functions
Cycles per 1 update(online/update/offline): 3936831
Cycles per 1 check(start, check): 7262738

Test OK

Konstantin
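To tie the numbers above back to the API, an editor's sketch of the reader/writer sequence used in test 3) earlier in this mail (1 writer, N readers over the lock-free rte_hash). The rte_rcu_qsbr_* prototypes follow the names used in this thread and may differ in detail from the v5 patch; 'entry' ownership and error handling are simplified.

    #include <stdlib.h>
    #include <rte_hash.h>
    #include <rte_rcu_qsbr.h>

    /* Reader: come online, use the entry, report quiescence, go offline. */
    static void
    reader_side(struct rte_rcu_qsbr *v, const struct rte_hash *h,
    	    const void *key, unsigned int thread_id)
    {
    	void *data;

    	rte_rcu_qsbr_thread_online(v, thread_id);
    	if (rte_hash_lookup_data(h, key, &data) >= 0) {
    		/* ... use 'data'; it cannot be freed under us ... */
    	}
    	rte_rcu_qsbr_quiescent(v, thread_id);	/* done with shared data */
    	rte_rcu_qsbr_thread_offline(v, thread_id);
    }

    /* Writer: unlink the entry, then wait for all registered readers to
     * pass through a quiescent state before freeing it.
     */
    static void
    writer_side(struct rte_rcu_qsbr *v, const struct rte_hash *h,
    	    const void *key, void *entry)
    {
    	uint64_t token;

    	rte_hash_del_key(h, key);		/* entry no longer reachable */
    	token = rte_rcu_qsbr_start(v);
    	rte_rcu_qsbr_check(v, token, true);	/* blocking check */
    	free(entry);				/* now safe to reclaim */
    }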