On Mon, 2019-01-28 at 21:04 +0000, Eads, Gage wrote:
> Hey Ola,
> 
> > -----Original Message-----
> > From: Ola Liljedahl [mailto:ola.liljed...@arm.com]
> > Sent: Monday, January 28, 2019 6:29 AM
> > To: dev@dpdk.org; Eads, Gage <gage.e...@intel.com>; Honnappa Nagarahalli
> > <honnappa.nagaraha...@arm.com>; Richardson, Bruce <bruce.richard...@intel.com>
> > Cc: nd <n...@arm.com>; Ola Liljedahl <ola.liljed...@arm.com>
> > Subject: [RFC] lfring: lock-free ring buffer
> > 
> > Lock-free MP/MC ring buffer with SP/SC options.
> > The ring buffer metadata only maintains one set of head and tail
> > pointers. Tail is updated on enqueue, head is updated on dequeue.
> > The ring slots between head and tail always contain valid
> > (unconsumed) elements. Each ring slot consists of a struct of data
> > pointer ("ptr") and ring index ("idx"). Slot.idx is only updated on
> > enqueue with the new ring index. The slot index is used to ensure
> > that slots are written sequentially (important for FIFO ordering).
> > It is also used to synchronise multiple producers.
> > 
> > MP enqueue scans the ring from the tail until head+size, looking for
> > an empty slot as indicated by slot.idx equaling the ring index of
> > the previous lap (index - ring size). The empty slot is written
> > (.ptr = data, .idx = index) using CAS. If CAS fails, some other
> > thread wrote the slot and the thread continues with the next slot.
> > 
> > When all elements have been enqueued or if the ring buffer is full,
> > the thread updates the ring buffer tail pointer (unless it has
> > already been updated by some other thread). Thus this thread will
> > help other threads that have written ring slots but not yet updated
> > the tail pointer.
> > 
> > If a slot with an index different from current or previous lap is
> > found, it is assumed that the thread has fallen behind (e.g. been
> > preempted) and the local tail pointer is (conditionally) updated
> > from the ring buffer metadata in order to quickly catch up.
> > 
> > MP dequeue reads (up to) N slots from the head index and attempts to
> > commit the operation using CAS on the head pointer.
> > 
> > Enqueue N elements: N+1 CAS operations (but the last CAS operation
> > might not be necessary during contention as producers will help each
> > other). Dequeue N elements: 1 CAS operation.
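To make the slot layout concrete, it amounts to something like the
sketch below (illustrative names and types, not the actual patch code;
the (ptr, idx) pair is what the CAS writes as a single unit):

#include <stdbool.h>
#include <stdint.h>

struct lfring_slot {
        void *ptr;      /* user data pointer, written on enqueue */
        uintptr_t idx;  /* ring index of the lap that last wrote this slot */
};

/* A slot at ring index 'tail' is empty for the current lap when it
 * still carries the index from the previous lap (tail - ring size). */
static inline bool
slot_is_free(const struct lfring_slot *slot, uintptr_t tail, uintptr_t size)
{
        return slot->idx == tail - size;
}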
> > As the lock-free ring supports both MT-safe (MP/MC) and
> > single-threaded (SP/SC) operations, there is a question of whether
> > the modes should be chosen only when creating a ring buffer or if
> > the application can mix MT-safe and single-threaded enqueue (or
> > dequeue) calls. To follow the pattern set by rte_ring, specific
> > functions for MP and SP enqueue and MC and SC dequeue are made
> > available. The wisdom of this ability can be questioned. The
> > lock-free design allows threads to be blocked for long periods of
> > time without interfering with the operation of the ring buffer, but
> > mixing (lock-free) MT-safe and single-threaded calls requires that
> > there are no such threads that wake up at the wrong moment (when a
> > single-threaded operation is in progress).
> 
> I see this as a user error. The rte_ring documentation is clear that
> ring_sp_enqueue and ring_sc_dequeue functions are not MT-safe, and (if
> I'm reading this correctly) this scenario involves having 2+ threads
> executing an operation in parallel.
So the individual MP/SP/MC/SC functions should also be directly
available to the user? This is what I have implemented here (but not in
my original PROGRESS64 implementation, where MP/SP/MC/SC mode is fixed
at creation). But I don't like it.

> > 128-bit lock-free atomic compare exchange operations for AArch64
> > (ARMv8.0) and x86-64 are provided in lockfree16.h. I expect these
> > functions to find a different home and possibly a different API.
> > Note that a 'frail' version of atomic_compare_exchange is
> > implemented; this means that in the case of failure, the old value
> > returned is not guaranteed to be atomically read (as this is not
> > possible on ARMv8.0 without also writing to the location). Atomically
> > read values are often not necessary; it is a successful compare and
> > exchange operation which provides atomicity.
> > 
> > Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
> 
> A few high-level thoughts:
> 1. The enqueue and dequeue functions are not compatible with the
> rte_ring's bulk semantics,
I wasn't considering 100% compatibility with other modules that already
use rte_ring; the code rather reflects the requirements of my own use
cases. If we want this, then we need to provide the same bulk
enqueue/dequeue behaviour as rte_ring.

> where either all pointers or none are operated on. This is necessary
> to implement the ring API of course, but in a broader sense I think
> it's necessary for usability in general. For example, how would a user
> (or the ring mempool handler) distinguish between partial dequeues
> that occur because the ring is empty vs. those that occur due to a
> race? dequeue_bulk_mc could return 0 even though the ring is full, if
> its first CAS fails and the head has advanced beyond its local tail
> variable.
Tail could be re-read on a CAS(&head) failure. This is how I originally
coded the ring buffer enqueue and dequeue functions. But a CAS(&head)
failure doesn't by itself mean that *tail* has changed. So I hoisted the
read of tail before the CAS-loop, which made the code slightly faster
since we avoid unnecessarily reading a location which might have been
modified and thus could cause a cache miss. Maybe this micro-optimisation
isn't always a good idea (it could lead to spurious queue full/empty
failures). Another solution is to only re-read tail if we see the queue
as empty (or we can't acquire as many slots as the user requested):

uint64_t head = __atomic_load_n(&r->head, __ATOMIC_RELAXED);
uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
do {
        actual = RTE_MIN((int64_t)(tail - head), (int64_t)nelems);
        if (unlikely(actual <= 0)) {
                /* Re-load tail only when queue looks empty */
                tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
                actual = RTE_MIN((int64_t)(tail - head), (int64_t)nelems);
                if (unlikely(actual <= 0))
                        return 0;
        }
        rte_lfring_deq_elems(elems, r->ring, mask, head, actual);
} while (!__atomic_compare_exchange_n(&r->head, &head, head + actual,
                                      /*weak*/ false,
                                      __ATOMIC_RELEASE, __ATOMIC_RELAXED));

> The user could retry dequeue, of course, but how many times until they
> feel confident that the ring is truly empty and give up? I don't think
> we can avoid reserving ring slots prior to enqueueing/dequeueing.
If tail is (re-)read inside the CAS-loop, there should be no spurious
queue empty failures and no need to repeatedly call the dequeue function
"just to be sure". The bulk dequeue/enqueue behaviour is separate.
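For rte_ring-style all-or-nothing bulk semantics, the same loop could be
adapted along the lines of the untested sketch below (reusing the names
from the snippet above): either all nelems elements are acquired or the
function returns 0, and since tail is re-read before failing, a 0 return
really does mean there weren't enough elements at that point.

uint64_t head = __atomic_load_n(&r->head, __ATOMIC_RELAXED);
uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
do {
        if (unlikely((int64_t)(tail - head) < (int64_t)nelems)) {
                /* Too few elements visible: re-load tail before
                 * failing the whole (all-or-nothing) operation */
                tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
                if ((int64_t)(tail - head) < (int64_t)nelems)
                        return 0;
        }
        rte_lfring_deq_elems(elems, r->ring, mask, head, nelems);
} while (!__atomic_compare_exchange_n(&r->head, &head, head + nelems,
                                      /*weak*/ false,
                                      __ATOMIC_RELEASE, __ATOMIC_RELAXED));
return nelems;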
> 2. On the plus side, the enqueue function design that allows it to use
> 1/2 the atomics of mine appears to be independent of reserving ring
> slots, and should transfer over fairly cleanly. I'm a little concerned
> about the performance variability it introduces (i.e. if the thread
> gets into "catch up" mode),
If a thread has to catch up, it means it is using a stale (head/tail)
index (more than one ring lap behind). Better to try to load a fresh
value (if one is available) than to iterate over the ring until it has
caught up. So I think this is the better/faster design. Catch-up mode is
not triggered by finding an occupied slot for the current lap (one that
was just written by some other thread). Or at least this is the idea. If
we find a freshly written slot, we just move to the next slot.

> particularly for larger rings, since real-time software values
> predictability. What if the reload criteria was instead something
> like:
> 
> #define ENQ_RETRY_LIMIT 32 //arbitrary
> 
> if (old.e.idx != tail - size) {
>         if (++fail_cnt < ENQ_RETRY_LIMIT) {
>                 tail++;
>         } else {
>                 fail_cnt = 0;
>                 tail = rte_lfring_reload(...);
>         }
>         goto restart;
> }
> fail_cnt = 0;
There are three cases (the slot index must be between q->tail and
q->head + q->size):
slot.idx == tail - size: slot is free, try to write it
slot.idx == tail: slot has just been written (by another thread), skip
to the next slot (tail++)
none of the above: the thread is behind (at least one lap), re-load tail
from q->tail
I think using the retry count actually delays catching up to a fresh
position.

Expressed in code, that three-way classification would look roughly like
this (a sketch with illustrative names, not the actual patch code):

#include <stdint.h>

enum slot_state {
        SLOT_FREE,   /* previous lap's index: free, try to CAS-write it */
        SLOT_FRESH,  /* current lap's index: just written, skip (tail++) */
        SLOT_STALE   /* anything else: 1+ laps behind, re-load q->tail */
};

static inline enum slot_state
classify_slot(uintptr_t slot_idx, uintptr_t tail, uintptr_t size)
{
        if (slot_idx == tail - size)
                return SLOT_FREE;
        if (slot_idx == tail)
                return SLOT_FRESH;
        return SLOT_STALE;
}

> 3. Using a zero-length array to mark the start of the ring is a nice
> approach -- I'll incorporate that into the patchset.
> 
> At an algorithm level, I don't see any other differences.
> Implementation-wise, we'll need to nail the memory ordering flags to
> best support weak consistency machines, as you pointed out elsewhere.
There is no pre-acquisition of slots in enqueue and dequeue. That
separate step makes lock-freedom impossible (I think). With a merged
implementation, I think the NON-BLOCKING/LOCK-FREE mode must have
separate implementations also for SP/SC. In non-blocking/lock-free mode
(whether MP/MC or SP/SC), only one set of head/tail pointers is used. In
blocking mode, two sets of head/tail are used. There isn't a set of
SP/SC code that supports both non-blocking/lock-free and blocking
behaviours.

> Out of curiosity, is p64_lfring based on the nb ring patchset? I
> noticed it used a different algorithm until pretty recently.
I created the original lfring design (which isn't that different, I
think) back in November last year and it seemed to work, but I didn't
think it could guarantee FIFO order as elements were enqueued and
dequeued wherever a thread found a suitable slot, and in my use cases
FIFO order is an important characteristic (need to maintain
ingress-to-egress packet order). Already then I was considering adding
the index to the ring slots to enforce FIFO enqueue/dequeue order; I
knew this would be possible on ARMv7a (which has 2x32-bit CAS using
exclusives) and ARMv8/AArch64 (which has 2x64-bit CAS using exclusives).
I had seen this design idea on the Internet before (e.g. here
https://www.youtube.com/watch?v=HP2InVqgBFM; I actually stopped viewing
this presentation half-way because I wanted to iron out the details
myself), so it's not a novel idea (sorry). However I didn't have time to
do this change before the Christmas holidays and just came back to work
two weeks ago. So I would rather call this simultaneous "invention" (or
publication). Difficult to prove this of course...

> Thanks,
> Gage

</snip>
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl
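(The classification sketch above sits where the enqueue scan loop decides
between writing, skipping, and reloading; the retry-count variant would
only change how often the SLOT_STALE branch is taken.)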