> On Jan 29, 2019, at 11:17 PM, Eads, Gage <gage.e...@intel.com> wrote:
> 
> 
> 
>> -----Original Message-----
>> From: Ola Liljedahl [mailto:ola.liljed...@arm.com]
>> Sent: Monday, January 28, 2019 4:26 PM
>> To: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>; Richardson,
>> Bruce <bruce.richard...@intel.com>; Eads, Gage <gage.e...@intel.com>;
>> dev@dpdk.org
>> Cc: nd <n...@arm.com>
>> Subject: Re: [RFC] lfring: lock-free ring buffer
>> 
>>> On Mon, 2019-01-28 at 21:04 +0000, Eads, Gage wrote:
>>> Hey Ola,
>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Ola Liljedahl [mailto:ola.liljed...@arm.com]
>>>> Sent: Monday, January 28, 2019 6:29 AM
>>>> To: dev@dpdk.org; Eads, Gage <gage.e...@intel.com>; Honnappa
>>>> Nagarahalli <honnappa.nagaraha...@arm.com>; Richardson, Bruce
>>>> <bruce.richard...@intel.com>
>>>> Cc: nd <n...@arm.com>; Ola Liljedahl <ola.liljed...@arm.com>
>>>> Subject: [RFC] lfring: lock-free ring buffer
>>>> 
>>>> Lock-free MP/MC ring buffer with SP/SC options.
>>>> The ring buffer metadata only maintains one set of head and tail pointers.
>>>> Tail is updated on enqueue, head is updated on dequeue.
>>>> The ring slots between head and tail always contain valid (unconsumed)
>>>> elements.
>>>> Each ring slot consists of a struct with a data pointer ("ptr") and a ring
>>>> index ("idx"). Slot.idx is only updated on enqueue with the new ring index.
>>>> The slot index is used to ensure that slots are written sequentially
>>>> (important for FIFO ordering). It is also used to synchronise multiple
>>>> producers.
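>>>>
>>>> A rough sketch of the data layout (type and field names here are only
>>>> illustrative, not necessarily those used in the patch):
>>>>
>>>>     struct rte_lfring_elem {
>>>>         void *ptr;     /* user data pointer, written on enqueue */
>>>>         uintptr_t idx; /* ring index of the lap this slot was written for */
>>>>     };
>>>>
>>>>     struct rte_lfring {
>>>>         uint64_t head; /* updated on dequeue */
>>>>         uint64_t tail; /* updated on enqueue */
>>>>         uint64_t mask; /* ring size - 1, size is a power of two */
>>>>         struct rte_lfring_elem ring[0]; /* slots start here */
>>>>     };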
>>>> 
>>>> MP enqueue scans the ring from the tail until head+size, looking for an
>>>> empty slot as indicated by slot.idx equaling the ring index of the previous
>>>> lap (index - ring size). The empty slot is written (.ptr = data,
>>>> .idx = index) using CAS. If CAS fails, some other thread wrote the slot and
>>>> the thread continues with the next slot.
>>>> 
>>>> When all elements have been enqueued or if the ring buffer is full, the
>>>> thread updates the ring buffer tail pointer (unless it has already been
>>>> updated by some other thread). Thus this thread will help other threads
>>>> that have written ring slots but not yet updated the tail pointer.
>>>> 
>>>> If a slot with an index different from current or previous lap is
>>>> found, it is assumed that the thread has fallen behind (e.g. been
>>>> preempted) and the local tail pointer is (conditionally) updated
>>>> from the ring buffer metadata in order to quickly catch up.
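>>>>
>>>> In outline, the per-slot part of MP enqueue then looks roughly like the
>>>> pseudo-C below (lfring_cas_16() and rte_lfring_reload() are placeholder
>>>> names for the 128-bit CAS and the catch-up helper, and the ring-full check
>>>> against head + size is omitted for brevity):
>>>>
>>>>     uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_RELAXED);
>>>>     uint32_t i = 0;
>>>>     while (i != nelems) {
>>>>         struct rte_lfring_elem old, new;
>>>>         old = r->ring[tail & mask];
>>>>         if (old.idx != tail - size) {
>>>>             if (old.idx == tail) {
>>>>                 tail++; /* slot freshly written for this lap, skip it */
>>>>             } else {
>>>>                 /* fallen behind >= 1 lap, catch up from the shared tail */
>>>>                 tail = rte_lfring_reload(r, tail);
>>>>             }
>>>>             continue;
>>>>         }
>>>>         new.ptr = elems[i];
>>>>         new.idx = tail; /* mark the slot as written for this lap */
>>>>         if (lfring_cas_16(&r->ring[tail & mask], &old, &new))
>>>>             i++;    /* we own this slot, move to the next element */
>>>>         tail++;     /* on CAS failure another producer wrote the slot */
>>>>     }
>>>>     /* finally, advance r->tail to 'tail' unless it has already been
>>>>      * updated past that point by another (helping) producer */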
>>>> 
>>>> MP dequeue reads (up to) N slots from the head index and attempts to
>>>> commit the operation using CAS on the head pointer.
>>>> 
>>>> Enqueue N elements: N+1 CAS operations (but the last CAS operation might
>>>> not be necessary during contention as producers will help each other).
>>>> Dequeue N elements: 1 CAS operation.
>>>> 
>>>> As the lock-free ring supports both MT-safe (MP/MC) and single-threaded
>>>> (SP/SC) operations, there is a question whether the modes should be
>>>> chosen only when creating a ring buffer or if the application can mix
>>>> MT-safe and single-threaded enqueue (or dequeue) calls. To follow the
>>>> pattern set by rte_ring, specific functions for MP and SP enqueue and MC
>>>> and SC dequeue are made available. The wisdom of this ability can be
>>>> questioned. The lock-free design allows threads to be blocked for long
>>>> periods of time without interfering with the operation of the ring
>>>> buffer, but mixing (lock-free) MT-safe and single-threaded calls requires
>>>> that there are no such threads that wake up at the wrong moment (when a
>>>> single-threaded operation is in progress).
>>>> 
>>> I see this as a user error. The rte_ring documentation is clear that
>>> ring_sp_enqueue and ring_sc_dequeue functions are not MT-safe, and (if
>>> I'm reading this correctly) this scenario involves having 2+ threads
>>> executing an operation in parallel.
>> So the individual MP/SP/MC/SC functions should also be directly available
>> to the user? This is what I have implemented here (but not in my original
>> PROGRESS64 implementation where MP/SP/MC/SC mode is fixed at creation). But
>> I don't like it.
>> 
>>> 
>>>> 
>>>> 128-bit lock-free atomic compare exchange operations for AArch64
>>>> (ARMv8.0) and x86-64 are provided in lockfree16.h. I expect these
>>>> functions to find a different home and possibly a different API.
>>>> Note that a 'frail' version of atomic_compare_exchange is implemented;
>>>> this means that in the case of failure, the old value returned is not
>>>> guaranteed to be atomically read (as this is not possible on ARMv8.0
>>>> without also writing to the location). Atomically read values are often
>>>> not necessary; it is a successful compare and exchange operation which
>>>> provides atomicity.
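>>>>
>>>> The shape of the operation is that of the generic builtin below. This is
>>>> only a sketch: lockfree16.h carries its own implementations (cmpxchg16b
>>>> on x86-64, LDXP/STXP exclusives on ARMv8.0), and the 'frail' variant
>>>> additionally relaxes how the old value is returned on failure:
>>>>
>>>>     #include <stdbool.h>
>>>>
>>>>     static inline bool
>>>>     lockfree_compare_exchange_16(__int128 *var, __int128 *exp, __int128 neu)
>>>>     {
>>>>         /* strong compare-exchange on a 16-byte location; with GCC this
>>>>          * may need -mcx16 on x86-64 or fall back to libatomic */
>>>>         return __atomic_compare_exchange_n(var, exp, neu, /*weak=*/false,
>>>>                                            __ATOMIC_RELEASE,
>>>>                                            __ATOMIC_RELAXED);
>>>>     }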
>>>> 
>>>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>>> A few high-level thoughts:
>>> 1. The enqueue and dequeue functions are not compatible with the
>>> rte_ring's bulk semantics,
>> I wasn't considering 100% compatibility with other modules that already use
>> rte_ring; the code rather reflects the requirements of my own use cases. If
>> we want this, then we need to provide the same bulk enqueue/dequeue
>> behaviour as rte_ring.
>> 
>>> where either all pointers or none are operated on. This is necessary
>>> to implement the ring API of course, but in a broader sense I think
>>> it's necessary for usability in general. For example, how would a user
>>> (or the ring mempool handler) distinguish between partial dequeues
>>> that occur because the ring is empty vs. those that occur due to a
>>> race? dequeue_bulk_mc could return
>>> 0 even though the ring is full, if its first CAS fails and the head
>>> has advanced beyond its local tail variable.
>> Tail could be re-read on a CAS(&head) failure. This is how I originally
>> coded the ring buffer enqueue and dequeue functions. But CAS(&head) failure
>> doesn't by itself mean that *tail* has changed. So I hoisted the read of
>> tail before the CAS-loop, which made the code slightly faster since we
>> avoid unnecessarily reading a location which might have been modified and
>> thus could cause a cache miss. Maybe this micro-optimisation isn't always a
>> good idea (it could lead to spurious queue full/empty failures). Another
>> solution is to only re-read tail if we see the queue as empty (or we can't
>> acquire as many slots as the user requested).
>> 
>>         uint64_t head = __atomic_load_n(&r->head, __ATOMIC_RELAXED);
>>         uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
>>         do {
>>                 actual = RTE_MIN((int64_t)(tail - head), (int64_t)nelems);
>>                 if (unlikely(actual <= 0)) {
>>                         /* Re-load tail only when queue looks empty */
>>                         tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
>>                         actual = RTE_MIN((int64_t)(tail - head), (int64_t)nelems);
>>                         if (unlikely(actual <= 0))
>>                                 return 0;
>>                 }
>>                 rte_lfring_deq_elems(elems, r->ring, mask, head, actual);
>>         } while (!__atomic_compare_exchange_n(&r->head,
>>                                                &head,
>>                                                head + actual,
>>                                                /*weak*/false,
>>                                                __ATOMIC_RELEASE,
>>                                                __ATOMIC_RELAXED));
>>> The user could retry dequeue, of course, but how many times until
>>> they feel confident that the ring is truly empty and give up? I don't
>>> think we can avoid reserving ring slots prior to enqueueing/dequeueing.
>> If tail is (re-)read inside the CAS-loop, there should be no spurious queue
>> empty failures and no need to repeatedly call the dequeue function "just to
>> be sure". The bulk dequeue/enqueue behaviour is separate.
>> 
>>> 
>>> 2. On the plus side, the enqueue function design that allows it to use
>>> 1/2 the atomics of mine appears to be independent of reserving ring
>>> slots, and should transfer over fairly cleanly. I'm a little concerned
>>> about the performance variability it introduces (i.e. if the thread
>>> gets into "catch up" mode),
>> If a thread has to catch up, it means it is using a stale (head/tail) index
>> (more than one ring lap behind). Better to try to load a fresh value (if
>> one is available) than to iterate over the ring until it has caught up. So
>> I think this is the better/faster design.
>> 
>> Catch up mode is not triggered by finding an occupied slot for the current
>> lap (that was just written by some other thread). Or at least this is the
>> idea. If we find a freshly written slot, we just move to the next slot.
>> 
>>> particularly for larger rings, since real-time software values 
>>> predictability.
>>> What if the reload criteria was instead something like:
>>> 
>>> #define ENQ_RETRY_LIMIT 32 //arbitrary
>>> 
>>> if (old.e.idx != tail - size) {
>>>     if (++fail_cnt < ENQ_RETRY_LIMIT) {
>>>         tail++;
>>>     } else {
>>>         fail_cnt = 0;
>>>         tail = rte_lfring_reload(...);
>>>     }
>>>     goto restart;
>>> }
>>> fail_cnt = 0;
>> There are three cases (slot index must be between q->tail and
>> q->head + q->size):
>>   slot.idx == tail - size: slot is free, try to write it
>>   slot.idx == tail: slot has just been written (by other thread), skip to
>>                     next slot (tail++)
>>   none of the above: thread is behind (at least one lap), re-load tail
>>                      from q->tail
>> 
>> I think using the retry count actually delays catching up to a fresh
>> position.
>> 
> 
> Miscommunication here -- by "catch up", I meant the case where the thread is 
> behind but by less than one lap (the second case you describe). In the worst 
> case, the thread would have to read N-1 (N = ring size) entries before 
> reaching the next available entry, and N could easily be in the thousands. 
> That's the performance variability I was referring to, and why I suggested 
> capping the failed slot reads at 32. Maintaining a local fail_cnt variable is 
> a small price to pay (relative to a CAS) to prevent a ring enqueue latency 
> spike.
> 
> But you're right that we should still catch the 1+ lap behind case, so the 
> reload criteria could be:
> 
> #define ENQ_RETRY_LIMIT 32 //arbitrary
> 
> if (old.e.idx != tail - size) {
>    if (++fail_cnt < ENQ_RETRY_LIMIT && old.e.idx == tail) {
>        tail++;
>    } else {
>        fail_cnt = 0;
>        tail = rte_lfring_reload(...);
>    }
>    goto restart;
> }
> fail_cnt = 0;
> 

Scratch that. This proposal breaks lock-freedom.

>>> 
>>> 3. Using a zero-length array to mark the start of the ring is a nice
>>> approach -- I'll incorporate that into the patchset.
>>> 
>>> At an algorithm level, I don't see any other differences.
>>> Implementation-wise, we'll need to nail the memory ordering flags to
>>> best support weak consistency machines, as you pointed out elsewhere.
>> There is no pre-acquisition of slots in enqueue and dequeue. That separate
>> step makes lock-freedom impossible (I think).
>> 
> 
> Can you elaborate? I don't currently see any other way to support rte_ring 
> bulk semantics, which is necessary for creating a non-blocking mempool 
> handler, so we should clear up any doubt.
> 
> In the NB ring patchset each thread reserves a number of slots before 
> performing the enq/deq, but doesn't reserve *specific* slots (unlike 
> rte_ring). This reservation is atomic, so that we never over-subscribe valid 
> ring entries (for dequeue) or unused ring entries (for enqueue). This 
> guarantees that the enq/deq operation will eventually complete, regardless of 
> the behavior of other threads. This is why the enqueue loop doesn't check if 
> space is available and the dequeue loop doesn't check if any more valid 
> entries remain.
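> 
> For illustration only (names and fields here are invented, not the
> patchset's actual code), the dequeue-side reservation step can be thought
> of as something like:
> 
>    /* Atomically reserve up to n of the currently valid entries.
>     * Returns how many entries this thread now owns (possibly 0). */
>    static unsigned int
>    nb_reserve_dequeue(struct nb_ring *r, unsigned int n)
>    {
>        uint64_t avail = __atomic_load_n(&r->valid_count, __ATOMIC_RELAXED);
>        unsigned int take;
> 
>        do {
>            take = avail < n ? (unsigned int)avail : n;
>            if (take == 0)
>                return 0; /* ring currently empty */
>        } while (!__atomic_compare_exchange_n(&r->valid_count, &avail,
>                                              avail - take, /*weak=*/true,
>                                              __ATOMIC_ACQUIRE,
>                                              __ATOMIC_RELAXED));
>        return take; /* the dequeue loop can now run to completion */
>    }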
> 
> This sort of reservation should be compatible with lfring, but requires 
> changes (e.g. two sets of head/tail pointers).
> 
>> With a merged implementation, I think the NON-BLOCKING/LOCK-FREE mode must
>> have separate implementations also for SP/SC. In non-blocking/lock-free
>> mode (whether MP/MC or SP/SC), only one set of head/tail pointers is used.
>> In blocking mode, two sets of head/tail are used. There isn't a set of
>> SP/SC code that supports both non-blocking/lock-free and blocking
>> behaviours.
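>>
>> Illustrative only, the difference in metadata would be roughly:
>>
>>     /* lock-free mode: slots between head and tail are always valid */
>>     struct lf_meta {
>>         uint64_t head; /* consumer index, CAS-updated on dequeue */
>>         uint64_t tail; /* producer index, updated on enqueue */
>>     };
>>
>>     /* blocking (rte_ring-style) mode: each side has a two-stage update */
>>     struct blk_meta {
>>         uint64_t prod_head, prod_tail;
>>         uint64_t cons_head, cons_tail;
>>     };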
>> 
> 
> Sure, I don't see a problem with that. The NB ring patchset has separate NB 
> SP/SC code from the blocking SP/SC code as well.
> 
>>> 
>>> Out of curiosity, is p64_lfring based on the nb ring patchset? I
>>> noticed it used a different algorithm until pretty recently.
>> I created the original lfring design (which isn't that different, I think)
>> back in November last year and it seemed to work, but I didn't think it
>> could guarantee FIFO order as elements were enqueued and dequeued wherever
>> a thread found a suitable slot, and in my use cases FIFO order is an
>> important characteristic (need to maintain ingress-to-egress packet order).
>> Already then I was considering adding the index to the ring slots to
>> enforce FIFO enqueue/dequeue order. I knew this would be possible on ARMv7a
>> (which has 2x32-bit CAS using exclusives) and ARMv8/AArch64 (which has
>> 2x64-bit CAS using exclusives). I had seen this design idea on the Internet
>> before (e.g. here https://www.youtube.com/watch?v=HP2InVqgBFM, I actually
>> stopped viewing this presentation half-way because I wanted to iron out the
>> details myself), so it's not a novel idea (sorry).
>> However, I didn't have time to do this change before the Christmas holidays
>> and just came back to work two weeks ago. So I would rather call this
>> simultaneous "invention" (or publication). Difficult to prove this of
>> course...
>> 
>>> 
>>> Thanks,
>>> Gage
>>> 
>>> </snip>
>> --
>> Ola Liljedahl, Networking System Architect, Arm Phone +46706866373, Skype
>> ola.liljedahl
> 
