> > > > > > > > On Wed, Apr 10, 2019 at 06:20:04AM -0500, Honnappa Nagarahalli wrote:
> > > > > Add RCU library supporting quiescent state based memory reclamation
> > > > > method. This library helps identify the quiescent state of the reader
> > > > > threads so that the writers can free the memory associated with
> > > > > the lock less data structures.
> > > >
> > > > I don't see any sign of read-side markers (rcu_read_lock() and
> > > > rcu_read_unlock() in the Linux kernel, userspace RCU, etc.).
> > > >
> > > > Yes, strictly speaking, these are not needed for QSBR to operate, but they
> > > These APIs would be empty for QSBR.
> > >
> > > > make it way easier to maintain and debug code using RCU. For example,
> > > > given the read-side markers, you can check for errors like having a call to
> > > > rte_rcu_qsbr_quiescent() in the middle of a reader quite easily.
> > > > Without those read-side markers, life can be quite hard and you will
> > > > really hate yourself for failing to have provided them.
> > >
> > > Want to make sure I understood this, do you mean the application would mark
> > > before and after accessing the shared data structure on the reader side?
> > >
> > > rte_rcu_qsbr_lock()
> > > <begin access shared data structure>
> > > ...
> > > ...
> > > <end access shared data structure>
> > > rte_rcu_qsbr_unlock()
> >
> > Yes, that is the idea.
> >
> > > If someone is debugging this code, they have to make sure that there is
> > > an unlock for every lock and there is no call to rte_rcu_qsbr_quiescent
> > > in between.
> > > It sounds good to me. Obviously, they will not add any additional cycles
> > > as well.
> > > Please let me know if my understanding is correct.
> >
> > Yes. And in some sort of debug mode, you could capture the counter at
> > rte_rcu_qsbr_lock() time and check it at rte_rcu_qsbr_unlock() time. If the
> > counter has advanced too far (more than one, if I am not too confused)
> > there is a bug. Also in debug mode, you could have rte_rcu_qsbr_lock()
> > increment a per-thread counter and rte_rcu_qsbr_unlock() decrement it.
> > If the counter is non-zero at a quiescent state, there is a bug.
> > And so on.
> Added this in V5
> > <snip>
> > > > > +
> > > > > +/* Get the memory size of QSBR variable */
> > > > > +size_t __rte_experimental
> > > > > +rte_rcu_qsbr_get_memsize(uint32_t max_threads)
> > > > > +{
> > > > > +	size_t sz;
> > > > > +
> > > > > +	if (max_threads == 0) {
> > > > > +		rte_log(RTE_LOG_ERR, rcu_log_type,
> > > > > +			"%s(): Invalid max_threads %u\n",
> > > > > +			__func__, max_threads);
> > > > > +		rte_errno = EINVAL;
> > > > > +
> > > > > +		return 1;
> > > > > +	}
> > > > > +
> > > > > +	sz = sizeof(struct rte_rcu_qsbr);
> > > > > +
> > > > > +	/* Add the size of quiescent state counter array */
> > > > > +	sz += sizeof(struct rte_rcu_qsbr_cnt) * max_threads;
> > > > > +
> > > > > +	/* Add the size of the registered thread ID bitmap array */
> > > > > +	sz += RTE_QSBR_THRID_ARRAY_SIZE(max_threads);
> > > > > +
> > > > > +	return RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
> > > >
> > > > Given that you align here, should you also align in the earlier
> > > > steps in the computation of sz?
> > >
> > > Agree. I will remove the align here and keep the earlier one as the intent
> > > is to align the thread ID array.
> >
> > Sounds good!
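As an editor's sketch of the debug-mode read-side checks discussed above (this is illustration only, not the code added in v5; the names dbg_rcu_lock_cnt, dbg_rcu_lock(), dbg_rcu_unlock() and dbg_rcu_quiescent_check() are made up):

    #include <assert.h>

    /* Per-thread read-side nesting depth, maintained only in debug builds. */
    static __thread int dbg_rcu_lock_cnt;

    static inline void dbg_rcu_lock(void)
    {
    	dbg_rcu_lock_cnt++;
    }

    static inline void dbg_rcu_unlock(void)
    {
    	dbg_rcu_lock_cnt--;
    	/* More unlocks than locks is a bug. */
    	assert(dbg_rcu_lock_cnt >= 0);
    }

    /* Called from the quiescent-state report path in debug builds:
     * reporting a quiescent state while still inside a read-side
     * critical section is a bug.
     */
    static inline void dbg_rcu_quiescent_check(void)
    {
    	assert(dbg_rcu_lock_cnt == 0);
    }

In a non-debug build such markers would be compiled out (e.g. behind a build-time debug option, which is an assumption here), so they add no cycles in production, matching the expectation stated above.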
> Added this in V5
> > > > > +}
> > > > > +
> > > > > +/* Initialize a quiescent state variable */
> > > > > +int __rte_experimental
> > > > > +rte_rcu_qsbr_init(struct rte_rcu_qsbr *v, uint32_t max_threads)
> > > > > +{
> > > > > +	size_t sz;
> > > > > +
> > > > > +	if (v == NULL) {
> > > > > +		rte_log(RTE_LOG_ERR, rcu_log_type,
> > > > > +			"%s(): Invalid input parameter\n", __func__);
> > > > > +		rte_errno = EINVAL;
> > > > > +
> > > > > +		return 1;
> > > > > +	}
> > > > > +
> > > > > +	sz = rte_rcu_qsbr_get_memsize(max_threads);
> > > > > +	if (sz == 1)
> > > > > +		return 1;
> > > > > +
> > > > > +	/* Set all the threads to offline */
> > > > > +	memset(v, 0, sz);
> > > >
> > > > We calculate sz here, but it looks like the caller must also calculate it
> > > > in order to correctly allocate the memory referenced by the "v" argument
> > > > to this function, with bad things happening if the two calculations get
> > > > different results. Should "v" instead be allocated within this function
> > > > to avoid this sort of problem?
> > >
> > > Earlier version allocated the memory within this library. However, it was
> > > decided to go with the current implementation as it provides flexibility for
> > > the application to manage the memory as it sees fit. For ex: it could
> > > allocate this as part of another structure in a single allocation. This also
> > > falls in line with the approach taken in other libraries.
> >
> > So the allocator APIs vary too much to allow a pointer to the desired
> > allocator function to be passed in? Or do you also want to allow static
> > allocation? If the latter, would a DEFINE_RTE_RCU_QSBR() be of use?
>
> This is done to allow for allocation of memory for the QS variable as part of
> another, bigger data structure. This will help in not fragmenting the memory.
> For ex:
>
> struct xyz {
>     rte_ring *ring;
>     rte_rcu_qsbr *v;
>     abc *t;
> };
> struct xyz c;
>
> Memory for the above structure can be allocated in one chunk by calculating
> the size required.
>
> In some use cases static allocation might be enough as 'max_threads' might be
> a compile time constant. I am not sure how to support both dynamic and static
> 'max_threads'.
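For reference, a minimal sketch of the caller-side allocation pattern being discussed, assuming rte_zmalloc() is used and using a made-up wrapper function name; the point is only that the caller sizes the memory with rte_rcu_qsbr_get_memsize() and passes the same 'max_threads' to rte_rcu_qsbr_init():

    #include <rte_malloc.h>
    #include <rte_rcu_qsbr.h>	/* header added by this patch */

    #define MY_MAX_READER_THREADS 8	/* example value */

    static struct rte_rcu_qsbr *
    my_qsbr_alloc(void)
    {
    	size_t sz;
    	struct rte_rcu_qsbr *v;

    	sz = rte_rcu_qsbr_get_memsize(MY_MAX_READER_THREADS);
    	if (sz == 1)	/* error value per the patch above */
    		return NULL;

    	/* Cache-line aligned, zeroed allocation of exactly 'sz' bytes. */
    	v = rte_zmalloc("qsbr", sz, RTE_CACHE_LINE_SIZE);
    	if (v == NULL)
    		return NULL;

    	/* Must use the same 'max_threads' as the size calculation. */
    	if (rte_rcu_qsbr_init(v, MY_MAX_READER_THREADS) != 0) {
    		rte_free(v);
    		return NULL;
    	}
    	return v;
    }

The same pattern applies when the QS variable is placed at the tail of a larger structure, as in the struct xyz example above: the caller adds rte_rcu_qsbr_get_memsize() to the size of the enclosing structure and allocates the whole thing in one chunk.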
Same thought here - it would be good to have a static initializer (DEFINE_RTE_RCU_QSBR), but that means a new compile-time limit ('max_threads') - the thing that we try to avoid.

> > > > > +	v->max_threads = max_threads;
> > > > > +	v->num_elems = RTE_ALIGN_MUL_CEIL(max_threads,
> > > > > +			RTE_QSBR_THRID_ARRAY_ELM_SIZE) /
> > > > > +			RTE_QSBR_THRID_ARRAY_ELM_SIZE;
> > > > > +	v->token = RTE_QSBR_CNT_INIT;
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/* Register a reader thread to report its quiescent state
> > > > > + * on a QS variable.
> > > > > + */
> > > > > +int __rte_experimental
> > > > > +rte_rcu_qsbr_thread_register(struct rte_rcu_qsbr *v, unsigned int thread_id)
> > > > > +{
> > > > > +	unsigned int i, id, success;
> > > > > +	uint64_t old_bmap, new_bmap;
> > > > > +
> > > > > +	if (v == NULL || thread_id >= v->max_threads) {
> > > > > +		rte_log(RTE_LOG_ERR, rcu_log_type,
> > > > > +			"%s(): Invalid input parameter\n", __func__);
> > > > > +		rte_errno = EINVAL;
> > > > > +
> > > > > +		return 1;
> > > > > +	}
> > > > > +
> > > > > +	id = thread_id & RTE_QSBR_THRID_MASK;
> > > > > +	i = thread_id >> RTE_QSBR_THRID_INDEX_SHIFT;
> > > > > +
> > > > > +	/* Make sure that the counter for registered threads does not
> > > > > +	 * go out of sync. Hence, additional checks are required.
> > > > > +	 */
> > > > > +	/* Check if the thread is already registered */
> > > > > +	old_bmap = __atomic_load_n(RTE_QSBR_THRID_ARRAY_ELM(v, i),
> > > > > +					__ATOMIC_RELAXED);
> > > > > +	if (old_bmap & 1UL << id)
> > > > > +		return 0;
> > > > > +
> > > > > +	do {
> > > > > +		new_bmap = old_bmap | (1UL << id);
> > > > > +		success = __atomic_compare_exchange(
> > > > > +				RTE_QSBR_THRID_ARRAY_ELM(v, i),
> > > > > +				&old_bmap, &new_bmap, 0,
> > > > > +				__ATOMIC_RELEASE, __ATOMIC_RELAXED);
> > > > > +
> > > > > +		if (success)
> > > > > +			__atomic_fetch_add(&v->num_threads,
> > > > > +						1, __ATOMIC_RELAXED);
> > > > > +		else if (old_bmap & (1UL << id))
> > > > > +			/* Someone else registered this thread.
> > > > > +			 * Counter should not be incremented.
> > > > > +			 */
> > > > > +			return 0;
> > > > > +	} while (success == 0);
> > > >
> > > > This would be simpler if threads were required to register themselves.
> > > > Maybe you have use cases requiring registration of other threads,
> > > > but this capability is adding significant complexity, so it might be
> > > > worth some thought.
> > >
> > > It was simple earlier (__atomic_fetch_or). The complexity is added as
> > > 'num_threads' should not go out of sync.
> >
> > Hmmm...
> >
> > So threads are allowed to register other threads? Or is there some other
> > reason that concurrent registration is required?
>
> Yes, control plane threads can register the fast path threads. Though, I am
> not sure how useful it is. I did not want to add the restriction. I
> expect that reader threads will register themselves. The reader threads
> require concurrent registration as they all will be running in parallel.
> If the requirement of keeping track of the number of threads registered
> currently goes away, then this function will be simple.
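To make the trade-off concrete, an editor's sketch of what a DEFINE_RTE_RCU_QSBR()-style macro could look like; this is purely hypothetical, not proposed code, and only illustrates why the backing storage forces 'max_threads' to be a compile-time constant:

    /* Size mirrors rte_rcu_qsbr_get_memsize(): fixed header, per-thread
     * counter array, then the registered-thread ID bitmap.
     */
    #define RTE_RCU_QSBR_SIZE(max_thr)				\
    	(sizeof(struct rte_rcu_qsbr) +				\
    	 sizeof(struct rte_rcu_qsbr_cnt) * (max_thr) +		\
    	 RTE_QSBR_THRID_ARRAY_SIZE(max_thr))

    /* Statically sized, cache-line aligned backing storage; 'max_thr'
     * must be a compile-time constant, and rte_rcu_qsbr_init(name, max_thr)
     * still has to run once before the variable is used.
     */
    #define DEFINE_RTE_RCU_QSBR(name, max_thr)			\
    	static uint8_t name##_mem[RTE_RCU_QSBR_SIZE(max_thr)]	\
    		__rte_cache_aligned;				\
    	static struct rte_rcu_qsbr *const name =		\
    		(struct rte_rcu_qsbr *)name##_mem

Usage would then be something like DEFINE_RTE_RCU_QSBR(g_qsbr, 64); at file scope followed by rte_rcu_qsbr_init(g_qsbr, 64); at startup, which is exactly where the new compile-time limit creeps in.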
> > <snip>
> > > > > diff --git a/lib/librte_rcu/rte_rcu_qsbr.h b/lib/librte_rcu/rte_rcu_qsbr.h
> > > > > new file mode 100644
> > > > > index 000000000..ff696aeab
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_rcu/rte_rcu_qsbr.h
> > > > > @@ -0,0 +1,554 @@
> > > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > > + * Copyright (c) 2018 Arm Limited
> > > > > + */
> > > > > +
> > > > > +#ifndef _RTE_RCU_QSBR_H_
> > > > > +#define _RTE_RCU_QSBR_H_
> > > > > +
> > > > > +/**
> > > > > + * @file
> > > > > + * RTE Quiescent State Based Reclamation (QSBR)
> > > > > + *
> > > > > + * Quiescent State (QS) is any point in the thread execution
> > > > > + * where the thread does not hold a reference to a data structure
> > > > > + * in shared memory. While using lock-less data structures, the writer
> > > > > + * can safely free memory once all the reader threads have entered
> > > > > + * quiescent state.
> > > > > + *
> > > > > + * This library provides the ability for the readers to report quiescent
> > > > > + * state and for the writers to identify when all the readers have
> > > > > + * entered quiescent state.
> > > > > + */
> > > > > +
> > > > > +#ifdef __cplusplus
> > > > > +extern "C" {
> > > > > +#endif
> > > > > +
> > > > > +#include <stdio.h>
> > > > > +#include <stdint.h>
> > > > > +#include <errno.h>
> > > > > +#include <rte_common.h>
> > > > > +#include <rte_memory.h>
> > > > > +#include <rte_lcore.h>
> > > > > +#include <rte_debug.h>
> > > > > +#include <rte_atomic.h>
> > > > > +
> > > > > +extern int rcu_log_type;
> > > > > +
> > > > > +#if RTE_LOG_DP_LEVEL >= RTE_LOG_DEBUG
> > > > > +#define RCU_DP_LOG(level, fmt, args...) \
> > > > > +	rte_log(RTE_LOG_ ## level, rcu_log_type, \
> > > > > +		"%s(): " fmt "\n", __func__, ## args)
> > > > > +#else
> > > > > +#define RCU_DP_LOG(level, fmt, args...)
> > > > > +#endif
> > > > > +
> > > > > +/* Registered thread IDs are stored as a bitmap of 64b element array.
> > > > > + * Given thread id needs to be converted to index into the array and
> > > > > + * the id within the array element.
> > > > > + */
> > > > > +#define RTE_QSBR_THRID_ARRAY_ELM_SIZE (sizeof(uint64_t) * 8)
> > > > > +#define RTE_QSBR_THRID_ARRAY_SIZE(max_threads) \
> > > > > +	RTE_ALIGN(RTE_ALIGN_MUL_CEIL(max_threads, \
> > > > > +		RTE_QSBR_THRID_ARRAY_ELM_SIZE) >> 3, RTE_CACHE_LINE_SIZE)
> > > > > +#define RTE_QSBR_THRID_ARRAY_ELM(v, i) ((uint64_t *) \
> > > > > +	((struct rte_rcu_qsbr_cnt *)(v + 1) + v->max_threads) + i)
> > > > > +#define RTE_QSBR_THRID_INDEX_SHIFT 6
> > > > > +#define RTE_QSBR_THRID_MASK 0x3f
> > > > > +#define RTE_QSBR_THRID_INVALID 0xffffffff
> > > > > +
> > > > > +/* Worker thread counter */
> > > > > +struct rte_rcu_qsbr_cnt {
> > > > > +	uint64_t cnt;
> > > > > +	/**< Quiescent state counter. Value 0 indicates the thread is offline */
> > > > > +} __rte_cache_aligned;
> > > > > +
> > > > > +#define RTE_QSBR_CNT_THR_OFFLINE 0
> > > > > +#define RTE_QSBR_CNT_INIT 1
> > > > > +
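As an editor's illustration of the RTE_QSBR_THRID_INDEX_SHIFT / RTE_QSBR_THRID_MASK arithmetic above (not part of the patch): thread_id 70, for example, maps to bitmap element 1 (70 >> 6) and bit 6 (70 & 0x3f). The helper name below is made up.

    #include <stdint.h>
    #include <stdbool.h>

    /* Test whether 'thread_id' is set in a bitmap laid out as an array of
     * 64-bit elements, as the header above does.
     */
    static inline bool
    qsbr_thread_bit_is_set(const uint64_t *bitmap, unsigned int thread_id)
    {
    	unsigned int elem = thread_id >> 6;   /* RTE_QSBR_THRID_INDEX_SHIFT */
    	unsigned int bit  = thread_id & 0x3f; /* RTE_QSBR_THRID_MASK */

    	return (bitmap[elem] & (1UL << bit)) != 0;
    }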
> > > > > +/* RTE Quiescent State variable structure.
> > > > > + * This structure has two elements that vary in size based on the
> > > > > + * 'max_threads' parameter.
> > > > > + * 1) Quiescent state counter array
> > > > > + * 2) Register thread ID array
> > > > > + */
> > > > > +struct rte_rcu_qsbr {
> > > > > +	uint64_t token __rte_cache_aligned;
> > > > > +	/**< Counter to allow for multiple concurrent quiescent state queries */
> > > > > +
> > > > > +	uint32_t num_elems __rte_cache_aligned;
> > > > > +	/**< Number of elements in the thread ID array */
> > > > > +	uint32_t num_threads;
> > > > > +	/**< Number of threads currently using this QS variable */
> > > > > +	uint32_t max_threads;
> > > > > +	/**< Maximum number of threads using this QS variable */
> > > > > +
> > > > > +	struct rte_rcu_qsbr_cnt qsbr_cnt[0] __rte_cache_aligned;
> > > > > +	/**< Quiescent state counter array of 'max_threads' elements */
> > > > > +
> > > > > +	/**< Registered thread IDs are stored in a bitmap array,
> > > > > +	 * after the quiescent state counter array.
> > > > > +	 */
> > > > > +} __rte_cache_aligned;
> > > > > +
> > <snip>
> > > > > +
> > > > > +/* Check the quiescent state counter for registered threads only,
> > > > > + * assuming that not all threads have registered.
> > > > > + */
> > > > > +static __rte_always_inline int
> > > > > +__rcu_qsbr_check_selective(struct rte_rcu_qsbr *v, uint64_t t, bool wait)
> > > > > +{
> > > > > +	uint32_t i, j, id;
> > > > > +	uint64_t bmap;
> > > > > +	uint64_t c;
> > > > > +	uint64_t *reg_thread_id;
> > > > > +
> > > > > +	for (i = 0, reg_thread_id = RTE_QSBR_THRID_ARRAY_ELM(v, 0);
> > > > > +		i < v->num_elems;
> > > > > +		i++, reg_thread_id++) {
> > > > > +		/* Load the current registered thread bit map before
> > > > > +		 * loading the reader thread quiescent state counters.
> > > > > +		 */
> > > > > +		bmap = __atomic_load_n(reg_thread_id, __ATOMIC_ACQUIRE);
> > > > > +		id = i << RTE_QSBR_THRID_INDEX_SHIFT;
> > > > > +
> > > > > +		while (bmap) {
> > > > > +			j = __builtin_ctzl(bmap);
> > > > > +			RCU_DP_LOG(DEBUG,
> > > > > +				"%s: check: token = %lu, wait = %d, Bit Map = 0x%lx, Thread ID = %d",
> > > > > +				__func__, t, wait, bmap, id + j);
> > > > > +			c = __atomic_load_n(
> > > > > +				&v->qsbr_cnt[id + j].cnt,
> > > > > +				__ATOMIC_ACQUIRE);
> > > > > +			RCU_DP_LOG(DEBUG,
> > > > > +				"%s: status: token = %lu, wait = %d, Thread QS cnt = %lu, Thread ID = %d",
> > > > > +				__func__, t, wait, c, id+j);
> > > > > +			/* Counter is not checked for wrap-around condition
> > > > > +			 * as it is a 64b counter.
> > > > > +			 */
> > > > > +			if (unlikely(c != RTE_QSBR_CNT_THR_OFFLINE && c < t)) {
> > > >
> > > > This assumes that a 64-bit counter won't overflow, which is close
> > > > enough to true given current CPU clock frequencies. ;-)
> > > >
> > > > > +				/* This thread is not in quiescent state */
> > > > > +				if (!wait)
> > > > > +					return 0;
> > > > > +
> > > > > +				rte_pause();
> > > > > +				/* This thread might have unregistered.
> > > > > +				 * Re-read the bitmap.
> > > > > +				 */
> > > > > +				bmap = __atomic_load_n(reg_thread_id,
> > > > > +						__ATOMIC_ACQUIRE);
> > > > > +
> > > > > +				continue;
> > > > > +			}
> > > > > +
> > > > > +			bmap &= ~(1UL << j);
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	return 1;
> > > > > +}
> > > > > +
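For clarity, an editor's sketch of the pointer arithmetic behind RTE_QSBR_THRID_ARRAY_ELM(), which __rcu_qsbr_check_selective() above uses to find the registered-thread bitmap: the bitmap starts immediately after the fixed part of struct rte_rcu_qsbr and the 'max_threads' cache-line sized counters. This helper is not part of the patch; it assumes the rte_rcu_qsbr.h from this series.

    #include <stdint.h>
    #include <rte_rcu_qsbr.h>	/* header added by this patch */

    /* Equivalent of RTE_QSBR_THRID_ARRAY_ELM(v, 0): skip the fixed header
     * and the per-thread counter array to reach the thread-ID bitmap.
     */
    static inline uint64_t *
    qsbr_thrid_bitmap(struct rte_rcu_qsbr *v)
    {
    	return (uint64_t *)((struct rte_rcu_qsbr_cnt *)(v + 1) +
    				v->max_threads);
    }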
> > > > > +/* Check the quiescent state counter for all threads, assuming that
> > > > > + * all the threads have registered.
> > > > > + */
> > > > > +static __rte_always_inline int
> > > > > +__rcu_qsbr_check_all(struct rte_rcu_qsbr *v, uint64_t t, bool wait)
> > > >
> > > > Does checking the bitmap really take long enough to make this
> > > > worthwhile as a separate function? I would think that the
> > > > bitmap-checking time would be lost in the noise of cache misses from
> > > > the ->cnt loads.
> > >
> > > It avoids accessing one cache line. I think this is where the savings are
> > > (may be in theory). This is the most probable use case.
> > > On the other hand, __rcu_qsbr_check_selective() will result in savings
> > > (depending on how many threads are currently registered) by avoiding
> > > accessing unwanted counters.
> >
> > Do you really expect to be calling this function on any kind of fastpath?
>
> Yes. For some of the libraries (rte_hash), the writer is on the fast path.
>
> > > > Sure, if you invoke __rcu_qsbr_check_selective() in a tight loop in
> > > > the absence of readers, you might see __rcu_qsbr_check_all() being a
> > > > bit faster. But is that really what DPDK does?
> > >
> > > I see improvements in the synthetic test case (similar to the one you
> > > have described, around 27%). However, in the more practical test cases I
> > > do not see any difference.
> >
> > If the performance improvement only occurs in a synthetic test case, does
> > it really make sense to optimize for it?
>
> I had to fix a few issues in the performance test cases and added more to do
> the comparison. These changes are in v5.
> There are 4 performance tests involving this API.
> 1) 1 Writer, 'N' readers
>    Writer: qsbr_start, qsbr_check(wait = true)
>    Readers: qsbr_quiescent
> 2) 'N' writers
>    Writers: qsbr_start, qsbr_check(wait == false)
> 3) 1 Writer, 'N' readers (this test uses the lock-free rte_hash data structure)
>    Writer: hash_del, qsbr_start, qsbr_check(wait = true), validate that the
>    reader was able to complete its work successfully
>    Readers: thread_online, hash_lookup, access the pointer - do some work on
>    it, qsbr_quiescent, thread_offline
> 4) Same as test 3) but qsbr_check(wait == false)
>
> There are 2 sets of these tests.
> a) QS variable is created with number of threads same as number of readers -
>    this will exercise __rcu_qsbr_check_all
> b) QS variable is created with 128 threads, number of registered threads is
>    same as in a) - this will exercise __rcu_qsbr_check_selective
>
> Following are the results on x86 (E5-2660 v4 @ 2.00GHz) comparing a) to b)
> (on x86 in my setup, the results are not very stable between runs):
> 1) 25%
> 2) -3%
> 3) -0.4%
> 4) 1.38%
>
> Following are the results on an Arm system comparing a) to b) (results are
> not very stable between runs):
> 1) -3.45%
> 2) 0%
> 3) -0.03%
> 4) -0.04%
>
> Konstantin, is it possible to run the tests on your setup and look at the
> results?

I did run V5 on my box (SKX 2.1 GHz) with 17 lcores (1 physical core per thread).
Didn't notice any significant fluctuations between runs, output below.
>rcu_qsbr_perf_autotest
Number of cores provided = 17
Perf test with all reader threads registered
--------------------------------------------

Perf Test: 16 Readers/1 Writer('wait' in qsbr_check == true)
Total RCU updates = 65707232899
Cycles per 1000 updates: 18482
Total RCU checks = 20000000
Cycles per 1000 checks: 3794991

Perf Test: 17 Readers
Total RCU updates = 1700000000
Cycles per 1000 updates: 2128

Perf test: 17 Writers ('wait' in qsbr_check == false)
Total RCU checks = 340000000
Cycles per 1000 checks: 10030

Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Blocking QSBR Check
Following numbers include calls to rte_hash functions
Cycles per 1 update(online/update/offline): 1984696
Cycles per 1 check(start, check): 2619002

Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Non-Blocking QSBR check
Following numbers include calls to rte_hash functions
Cycles per 1 update(online/update/offline): 2028030
Cycles per 1 check(start, check): 2876667

Perf test with some of reader threads registered
------------------------------------------------

Perf Test: 16 Readers/1 Writer('wait' in qsbr_check == true)
Total RCU updates = 68850073055
Cycles per 1000 updates: 25490
Total RCU checks = 20000000
Cycles per 1000 checks: 5484403

Perf Test: 17 Readers
Total RCU updates = 1700000000
Cycles per 1000 updates: 2127

Perf test: 17 Writers ('wait' in qsbr_check == false)
Total RCU checks = 340000000
Cycles per 1000 checks: 10034

Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Blocking QSBR Check
Following numbers include calls to rte_hash functions
Cycles per 1 update(online/update/offline): 3604489
Cycles per 1 check(start, check): 7077372

Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Non-Blocking QSBR check
Following numbers include calls to rte_hash functions
Cycles per 1 update(online/update/offline): 3936831
Cycles per 1 check(start, check): 7262738

Test OK

Konstantin
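To tie the numbers above back to the API, an editor's sketch of the reader/writer sequence used in test 3) earlier in this mail (1 writer, N readers over the lock-free rte_hash). The rte_rcu_qsbr_* prototypes follow the names used in this thread and may differ in detail from the v5 patch; 'entry' ownership and error handling are simplified.

    #include <stdlib.h>
    #include <rte_hash.h>
    #include <rte_rcu_qsbr.h>

    /* Reader: come online, use the entry, report quiescence, go offline. */
    static void
    reader_side(struct rte_rcu_qsbr *v, const struct rte_hash *h,
    	    const void *key, unsigned int thread_id)
    {
    	void *data;

    	rte_rcu_qsbr_thread_online(v, thread_id);
    	if (rte_hash_lookup_data(h, key, &data) >= 0) {
    		/* ... use 'data'; it cannot be freed under us ... */
    	}
    	rte_rcu_qsbr_quiescent(v, thread_id);	/* done with shared data */
    	rte_rcu_qsbr_thread_offline(v, thread_id);
    }

    /* Writer: unlink the entry, then wait for all registered readers to
     * pass through a quiescent state before freeing it.
     */
    static void
    writer_side(struct rte_rcu_qsbr *v, const struct rte_hash *h,
    	    const void *key, void *entry)
    {
    	uint64_t token;

    	rte_hash_del_key(h, key);		/* entry no longer reachable */
    	token = rte_rcu_qsbr_start(v);
    	rte_rcu_qsbr_check(v, token, true);	/* blocking check */
    	free(entry);				/* now safe to reclaim */
    }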