> -----Original Message-----
> From: Ola Liljedahl [mailto:ola.liljed...@arm.com]
> Sent: Wednesday, January 30, 2019 5:36 AM
> To: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>; Richardson,
> Bruce <bruce.richard...@intel.com>; Eads, Gage <gage.e...@intel.com>;
> dev@dpdk.org
> Cc: nd <n...@arm.com>
> Subject: Re: [RFC] lfring: lock-free ring buffer
>
> On Wed, 2019-01-30 at 05:17 +0000, Eads, Gage wrote:
> <SNIP>
> >
> > > > 2. On the plus side, the enqueue function design that allows it
> > > > to use 1/2 the atomics of mine appears to be independent of
> > > > reserving ring slots, and should transfer over fairly cleanly.
> > > > I'm a little concerned about the performance variability it
> > > > introduces (i.e. if the thread gets into "catch up" mode),
> > > If a thread has to catch up, it means it is using a stale
> > > (head/tail) index (more than one ring lap behind). Better to try
> > > to load a fresh value (if one is available) than to iterate over
> > > the ring until it has caught up. So I think this is the
> > > better/faster design.
> > >
> > > Catch-up mode is not triggered by finding an occupied slot for the
> > > current lap (one that was just written by some other thread). Or
> > > at least this is the idea. If we find a freshly written slot, we
> > > just move to the next slot.
> > >
> > > > particularly for larger rings, since real-time software values
> > > > predictability. What if the reload criteria were instead
> > > > something like:
> > > >
> > > > #define ENQ_RETRY_LIMIT 32 //arbitrary
> > > >
> > > > if (old.e.idx != tail - size) {
> > > > 	if (++fail_cnt < ENQ_RETRY_LIMIT) {
> > > > 		tail++;
> > > > 	} else {
> > > > 		fail_cnt = 0;
> > > > 		tail = rte_lfring_reload(...);
> > > > 	}
> > > > 	goto restart;
> > > > }
> > > > fail_cnt = 0;
> > > There are three cases (the slot index must lie between q->tail and
> > > q->head + q->size):
> > > - slot.idx == tail - size: the slot is free, try to write it
> > > - slot.idx == tail: the slot has just been written (by another
> > >   thread), skip to the next slot (tail++)
> > > - none of the above: the thread is behind (at least one lap), so
> > >   re-load tail from q->tail
> > > I think using the retry count actually delays catching up to a
> > > fresh position.
> > >
> > Miscommunication here -- by "catch up", I meant the case where the
> > thread is behind, but by less than one lap (the second case you
> > describe). In the worst case, the thread would have to read N-1
> > (N = ring size) entries before reaching the next available entry,
> > and N could easily be in the thousands. That's the performance
> > variability I was referring to, and why I suggested capping the
> > failed slot reads at 32. Maintaining a local fail_cnt variable is a
> > small price to pay (relative to a CAS) to prevent a ring enqueue
> > latency spike.
> OK, I misunderstood. We can reload the private tail from the shared
> ring buffer tail earlier. You could actually do this on every failed
> CAS, but I think that would be overkill. Your design with a retry
> limit is better; we need to find out what a suitable value for the
> retry limit is.

Too late for lfring v2 unfortunately, but this idea will actually break
lock-freedom. For example, say the tail index is 33 slots behind the
next available slot. An enqueueing thread would get stuck retrying
slots from tail index -> tail index + 32 until another thread updates
the tail index.

> > But you're right that we should still catch the 1+ lap behind case,
> > so the reload criteria could be:
> >
> > #define ENQ_RETRY_LIMIT 32 //arbitrary
> >
> > if (old.e.idx != tail - size) {
> > 	if (++fail_cnt < ENQ_RETRY_LIMIT && old.e.idx == tail) {
> > 		tail++;
> > 	} else {
> > 		fail_cnt = 0;
> > 		tail = rte_lfring_reload(...);
> > 	}
> > 	goto restart;
> > }
> > fail_cnt = 0;
> Agree.
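To make the agreed-upon scheme concrete, here is a rough, untested,
self-contained sketch of how the three cases plus the retry limit could
fit into one enqueue loop. The names are illustrative, not lfring v2's
or the NB ring patchset's actual API, and to stay compilable with plain
C11 atomics each slot packs a 32-bit value and a 32-bit lap-tagged
index into one 64-bit word, where the real design would use a 128-bit
{pointer, index} CAS:

#include <stdbool.h>
#include <stdint.h>
#include <stdatomic.h>

#define ENQ_RETRY_LIMIT 32 /* arbitrary, needs tuning */

struct lfring {
	uint32_t size;           /* number of slots, power of two */
	uint32_t mask;           /* size - 1 */
	_Atomic uint32_t head;   /* consumer position (dequeue not shown) */
	_Atomic uint32_t tail;   /* shared producer position */
	_Atomic uint64_t ring[]; /* slot i must start as index i - size */
};

/* A slot packs a 32-bit value with a 32-bit lap-tagged index. */
#define SLOT(val, idx) (((uint64_t)(val) << 32) | (uint32_t)(idx))
#define SLOT_IDX(s)    ((uint32_t)(s))

static inline uint32_t
lfring_reload(struct lfring *r)
{
	return atomic_load_explicit(&r->tail, memory_order_relaxed);
}

/* Move the shared tail forward to at least 'to' (monotonic maximum). */
static void
lfring_update_tail(struct lfring *r, uint32_t to)
{
	uint32_t old = atomic_load_explicit(&r->tail, memory_order_relaxed);

	while ((int32_t)(to - old) > 0 &&
	       !atomic_compare_exchange_weak_explicit(&r->tail, &old, to,
			memory_order_release, memory_order_relaxed))
		;
}

static bool
lfring_enq_one(struct lfring *r, uint32_t val)
{
	uint32_t tail = lfring_reload(r);
	uint32_t fail_cnt = 0;

	for (;;) {
		uint64_t old;

		/* Conservative full check against the consumer position
		 * (the signed compare tolerates a stale private tail). */
		if ((int32_t)(tail - atomic_load_explicit(&r->head,
				memory_order_acquire)) >= (int32_t)r->size)
			return false;

		old = atomic_load_explicit(&r->ring[tail & r->mask],
					   memory_order_acquire);
		if (SLOT_IDX(old) == tail - r->size) {
			/* Case 1: slot is free for this lap, try to
			 * claim it. */
			if (atomic_compare_exchange_strong_explicit(
					&r->ring[tail & r->mask], &old,
					SLOT(val, tail),
					memory_order_release,
					memory_order_relaxed)) {
				lfring_update_tail(r, tail + 1);
				return true;
			}
			/* Lost the race; re-examine the same slot. */
		} else if (SLOT_IDX(old) == tail &&
			   ++fail_cnt < ENQ_RETRY_LIMIT) {
			/* Case 2: freshly written for the current lap by
			 * another thread -- less than a lap behind, step
			 * to the next slot. */
			tail++;
		} else {
			/* Case 3: at least one lap behind, or too many
			 * failed reads -- refresh from the shared tail
			 * instead of walking the whole ring. */
			fail_cnt = 0;
			tail = lfring_reload(r);
		}
	}
}

The dequeue side and initialization are omitted, and the lock-freedom
caveat above still applies: if the shared tail falls far enough behind,
a thread can keep bouncing between case 2 and the reload path until
some other thread advances r->tail.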
>
> > > > 3. Using a zero-length array to mark the start of the ring is a
> > > > nice approach -- I'll incorporate that into the patchset.
> > > >
> > > > At an algorithm level, I don't see any other differences.
> > > > Implementation-wise, we'll need to nail down the memory ordering
> > > > flags to best support weakly consistent machines, as you pointed
> > > > out elsewhere.
> > > There is no pre-acquisition of slots in enqueue and dequeue. That
> > > separate step makes lock-freedom impossible (I think).
> > >
> > Can you elaborate?
> I think lock-freedom is impossible with the initial acquisition of
> elements. This acquisition creates a side effect that cannot be undone
> or helped by other threads.
>
> You can still implement a "non-blocking" ring buffer (like your
> original design) which hides any delay in threads that access the
> ring buffer but isn't properly lock-free (which could be considered
> unnecessary in a DPDK environment; threads may get delayed but
> shouldn't die or block forever).

I think I see what you're saying. For example, if a dequeueing thread
reserves N elements and then is preempted, it has effectively reduced
the number of available elements by N for that period.

Practically speaking, I don't think this is a problem. Not just because
threads shouldn't die or block forever, but because if a thread is
preempted after reserving N ring slots (but before enqueueing to or
dequeueing from them), the net effect is that those N buffers are taken
out of the system. This isn't really different from that thread being
preempted *outside* of the ring functions while holding N buffers --
the application should be designed to be resilient to that.

> > I don't currently see any other way to support rte_ring bulk
> > semantics, which is necessary for creating a non-blocking mempool
> > handler, so we should clear up any doubt.
> OK, I wasn't aware of this strict dependency on bulk enqueue and
> dequeue. Bulk dequeue (mempool allocates buffers from the ring) should
> be easy to support though. Bulk enqueue (mempool frees buffers to the
> ring) will work as long as the ring is large enough to hold all free
> buffers, so it can never become full (we need to reload the head/tail
> pointers at the appropriate places to avoid spurious full/empty
> status). I assume this is the case, i.e. initially all free buffers
> will be stored in the ring?
>
> > In the NB ring patchset, each thread reserves a number of slots
> > before performing the enq/deq, but doesn't reserve *specific* slots
> > (unlike rte_ring). This reservation is atomic, so that we never
> > over-subscribe valid ring entries (for dequeue) or unused ring
> > entries (for enqueue). This guarantees that the enq/deq operation
> > will eventually complete, regardless of the behavior of other
> > threads. This is why the enqueue loop doesn't check if space is
> > available and the dequeue loop doesn't check if any more valid
> > entries remain.
> I initially thought these acquisitions were just for compatibility
> with rte_ring (update the same metadata so that the user can mix
> MP/MC and SP/SC calls) but realise they are there for the bulk
> enqueue/dequeue. But doing this acquisition means each enqueue or
> dequeue will update two metadata variables, which creates more
> contention and less scalability. I think it would be good if we could
> provide the bulk behaviour *without* this initial acquisition and
> only update one metadata location per enqueue/dequeue operation.

Agreed. I see lfring v2 has avoided this on the dequeue side, but I
think it's unavoidable on the enqueue side -- since we can't write the
ring entries and CAS the tail pointer in one big atomic operation,
AFAICT slot reservations are unavoidable there to achieve bulk
semantics.
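To spell out what that reservation step amounts to: conceptually it is
one CAS loop over a shared count of available entries, which either
claims all n requested entries or fails the whole bulk operation. The
NB ring patchset expresses this with head/tail index pairs rather than
the explicit credit counters below, so treat the names and the
representation as illustrative only -- a minimal C11 sketch, not the
patchset's code:

#include <stdbool.h>
#include <stdint.h>
#include <stdatomic.h>

/*
 * Two credit counters, one per direction: producers need unused
 * entries, consumers need valid entries. Neither counter refers to
 * specific slots, matching the non-specific reservation described
 * above.
 */
struct nb_counts {
	_Atomic uint32_t free_ents;  /* unused entries, taken by enqueue */
	_Atomic uint32_t valid_ents; /* filled entries, taken by dequeue */
};

/*
 * Atomically take n credits from *count, or fail the entire bulk
 * operation if fewer than n are available (rte_ring-style
 * all-or-nothing semantics). The CAS ensures we never over-subscribe.
 */
static bool
nb_reserve(_Atomic uint32_t *count, uint32_t n)
{
	uint32_t avail = atomic_load_explicit(count, memory_order_relaxed);

	do {
		if (avail < n)
			return false;
	} while (!atomic_compare_exchange_weak_explicit(count, &avail,
			avail - n, memory_order_acquire,
			memory_order_relaxed));
	return true;
}

/*
 * Return n credits in the opposite direction once the slots have
 * actually been written (enqueue) or drained (dequeue).
 */
static void
nb_release(_Atomic uint32_t *count, uint32_t n)
{
	atomic_fetch_add_explicit(count, n, memory_order_release);
}

A bulk enqueue of n entries would then start with
nb_reserve(&c->free_ents, n) and finish with
nb_release(&c->valid_ents, n) once the slots are written. The release
is the second metadata update per operation that creates the extra
contention noted above, and the window between reserve and release is
exactly the side effect other threads cannot help with -- which is why
this step rules out strict lock-freedom.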
> > This sort of reservation should be compatible with lfring, but
> > requires changes (e.g. two sets of head/tail pointers).
> >
> <SNIP>
>
> --
> Ola Liljedahl, Networking System Architect, Arm
> Phone +46706866373, Skype ola.liljedahl