On 2025-09-20, 14:01, "Konstantin Ananyev" <[email protected]> wrote:

> > > > > To avoid information loss I combined reply to two Wathsala replies
> > > > > into one.
> > > > >
> > > > > > The function __rte_ring_headtail_move_head() assumes that the
> > > > > > barrier (fence) between the load of the head and the load-acquire
> > > > > > of the opposing tail guarantees the following: if a first thread
> > > > > > reads tail and then writes head and a second thread reads the new
> > > > > > value of head and then reads tail, then it should observe the
> > > > > > same (or a later) value of tail.
> > > > > >
> > > > > > This assumption is incorrect under the C11 memory model. If the
> > > > > > barrier (fence) is intended to establish a total ordering of ring
> > > > > > operations, it fails to do so. Instead, the current implementation
> > > > > > only enforces a partial ordering, which can lead to unsafe
> > > > > > interleavings. In particular, some partial orders can cause
> > > > > > underflows in free slot or available element computations,
> > > > > > potentially resulting in data corruption.
> > > > >
> > > > > Hmm... sounds exactly like the problem from the patch we discussed
> > > > > earlier that year:
> > > > > https://patchwork.dpdk.org/project/dpdk/patch/[email protected]/
> > > > > In two words:
> > > > > "... thread can see 'latest' 'cons.head' value, with 'previous'
> > > > > value for 'prod.tail' or vice versa.
> > > > > In other words: 'cons.head' value depends on 'prod.tail', so before
> > > > > making latest 'cons.head' value visible to other threads, we need
> > > > > to ensure that latest 'prod.tail' is also visible."
> > > > > Is that the one?
> > > >
> > > > Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc (LDAR),
> > > > which is why we didn't catch it earlier. A fuller explanation, with
> > > > Herd7 simulations, is in the blog post linked in the cover letter.
> > > >
> > > > https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order
> > >
> > > I see, so now it is reproducible with core rte_ring on real HW.
> > >
> > > > > > The issue manifests when a CPU first acts as a producer and later
> > > > > > as a consumer. In this scenario, the barrier assumption may fail
> > > > > > when another core takes the consumer role. A Herd7 litmus test in
> > > > > > C11 can demonstrate this violation. The problem has not been
> > > > > > widely observed so far because:
> > > > > > (a) on strong memory models (e.g., x86-64) the assumption holds,
> > > > > >     and
> > > > > > (b) on relaxed models with RCsc semantics the ordering is still
> > > > > >     strong enough to prevent hazards.
> > > > > > The problem becomes visible only on weaker models, when
> > > > > > load-acquire is implemented with RCpc semantics (e.g. some AArch64
> > > > > > CPUs which support the LDAPR and LDAPUR instructions).
> > > > > >
> > > > > > Three possible solutions exist:
> > > > > > 1. Strengthen ordering by upgrading release/acquire semantics to
> > > > > >    sequential consistency. This requires using seq-cst for stores,
> > > > > >    loads, and CAS operations. However, this approach introduces a
> > > > > >    significant performance penalty on relaxed-memory architectures.
> > > > > >
> > > > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > > >    happens-before relationship between threads of the same role,
> > > > > >    by converting the CAS and the preceding load of the head to
> > > > > >    release and acquire respectively. This approach makes the
> > > > > >    original barrier assumption unnecessary and allows its removal.
> > > > >
> > > > > For the sake of clarity, can you outline what the exact code changes
> > > > > would be for approach #2? Same as in that patch:
> > > > > https://patchwork.dpdk.org/project/dpdk/patch/[email protected]/
> > > > > Or something different?
> > > >
> > > > Sorry, I missed the latter half of your comment before.
> > > > Yes, you have proposed the same solution there.
> > >
> > > Ok, thanks for confirmation.
> > >
> > > > > > 3. Retain partial ordering but ensure only safe partial orders are
> > > > > >    committed. This can be done by detecting underflow conditions
> > > > > >    (producer < consumer) and quashing the update in such cases.
> > > > > >    This approach makes the original barrier assumption unnecessary
> > > > > >    and allows its removal.
> > > > > >
> > > > > > This patch implements solution (3) for performance reasons.
> > > > > >
> > > > > > Signed-off-by: Wathsala Vithanage <[email protected]>
> > > > > > Signed-off-by: Ola Liljedahl <[email protected]>
> > > > > > Reviewed-by: Honnappa Nagarahalli <[email protected]>
> > > > > > Reviewed-by: Dhruv Tripathi <[email protected]>
> > > > > > ---
> > > > > >  lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> > > > > > index b9388af0da..e5ac1f6b9e 100644
> > > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
> > > > > >  		/* Reset n to the initial burst count */
> > > > > >  		n = max;
> > > > > >
> > > > > > -		/* Ensure the head is read before tail */
> > > > > > -		rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > > -
> > > > > >  		/* load-acquire synchronize with store-release of ht->tail
> > > > > >  		 * in update_tail.
> > > > > >  		 */
> > > > >
> > > > > But then cons.head can be read before prod.tail (and vice versa),
> > > > > right?
> > > >
> > > > Right, we let it happen but eliminate any resulting states that are
> > > > semantically incorrect at the end.
> > >
> > > Two comments here:
> > > 1) I think it is probably safer to do the check like that:
> > >    If (*entries > ring->capacity) ...
> > Yes, this might be another way of handling underflow situations. We
> > could study this.
> >
> > I have used the check for negative without problems in my ring buffer
> > implementations
> > https://github.com/ARM-software/progress64/blob/master/src/p64_ringbuf.c
> > but can't say that has been battle-tested.
>
> My thought was about the case (probably hypothetical) when the difference
> between the stale tail and the head is bigger than 2^31 + 1.
>
> > > 2) My concern is that without forcing a proper read ordering
> > >    (cons.head first, then prod.tail) we re-introduce a window for all
> > >    sorts of ABA-like problems.
> >
> > Head and tail indexes are monotonically increasing so I don't see a risk
> > for ABA-like problems.
>
> I understand that, but with current CPU speeds it can take rte_ring just a
> few seconds to wrap around head/tail values. If the user is doing something
> really fancy - like using the rte_ring ZC API (i.e. just moving head/tail
> without reading actual objects) - that can probably happen even faster
> (less than a second?).
> Are we sure that the stale tail value will never persist that long?
> Let's say the user is calling move_head() in a loop till it succeeds?
>
> > Indeed, adding a monotonically increasing tag to pointers is the common
> > way of avoiding ABA problems in lock-free designs.
>
> Yep, using 64-bit values for head/tail counters will help to avoid these
> concerns.
> But it will probably break HTS/RTS modes, plus it is an ABI change for sure.
>
> Actually, after another thought, I have one more concern here:
>
> +	/*
> +	 * Ensure the entries calculation was not based on a stale
> +	 * and unsafe stail observation that causes underflow.
> +	 */
> +	if ((int)*entries < 0)
> +		*entries = 0;
> +
>
> With that change, it might return invalid information back to the user
> about the number of free/occupied entries in the ring.
> Plus rte_ring_enqueue() now might fail even when there are enough free
> entries in the ring (same for dequeue).

How do you (or the thread) know there are enough free (or used) entries?
Do you assume sequentially consistent behaviour (a total order of memory
accesses)? Otherwise, you would need to explicitly create a happens-before
relation between threads, e.g. a consumer which made room in the ring buffer
must synchronize-with the producer that there is now room for more elements.
That synchronize-with edge will ensure the producer reads a fresh value of
stail. But without it, how can a thread know the state of the ring buffer
that is being manipulated by another thread?
> That looks like a change in our public API behavior that might break many
> things.
> There are quite a few places where the caller expects the enqueue/dequeue
> operation to always succeed (let's say there always should be enough free
> space in the ring).

Single-threaded scenarios are not a problem. Do you have a multithreaded
scenario where the caller expects enqueue/dequeue to always succeed? How are
the threads involved in such a scenario synchronizing with each other?

> For example: rte_mempool works like that.
> I am pretty sure there are quite a few other places like that inside DPDK,
> not to mention third-party code.
>
> Considering all of the above, I am actually more in favor of combining
> approaches #2 and #3 for the final patch:
> establish a safe partial order (#2) and keep the check from #3 (should it
> become an assert()/verify()?)

I agree that using acquire/release for all prod/cons_head accesses will make
it easier to reason about the ring buffer state. Sequential consistency
(total order) is the easiest to reason about and often seems to be desired
and expected by programmers (e.g. "I'll just add a barrier here to ensure A
happens before B in this thread, now there is a total order...").

- Ola

> Another thing to note: whatever final approach we choose -
> we need to make sure that the problem is addressed across all other
> rte_ring flavors/modes too (generic implementation, rts/hts mode, soring).
>
> Konstantin

