> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Sunday, 19 May 2024 17.14
>
> On Sat, 18 May 2024 16:00:55 +0200
> Morten Brørup <m...@smartsharesystems.com> wrote:
>
> > > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > > Sent: Friday, 17 May 2024 18.18
> > >
> > > On Fri, 17 May 2024 08:44:42 +0200
> > > Morten Brørup <m...@smartsharesystems.com> wrote:
> > >
> > > > > From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> > > > > Sent: Friday, 17 May 2024 06.27
> > > > >
> > > > > + Bruce for feedback on x86 architecture
> > > > >
> > > > > > On May 16, 2024, at 10:30 PM, Stephen Hemminger
> > > > > > <step...@networkplumber.org> wrote:
> > > > > >
> > > > > > On Fri, 17 May 2024 02:45:12 +0000
> > > > > > Honnappa Nagarahalli <honnappa.nagaraha...@arm.com> wrote:
> > > > > >
> > > > > >>> + * A counter is 64 bit value that is safe from split read/write
> > > > > >>> + * on 32 bit platforms. It assumes that only one cpu at a time
> > > > > >> If we are defining the counter in this manner, then the
> > > > > >> implementation cannot be generic. I think architectures will have
> > > > > >> constraints if they have to ensure the 64b variables are not split.
> > > > > >>
> > > > > >> I think we at least need the counter to be aligned on 8B boundary
> > > > > >> to have generic code.
> > > > > >
> > > > > > The C standard has always guaranteed that read and write to
> > > > > > unsigned long will not be split.
> > > > >
> > > > > As I understand, this is true only if the variable is an atomic
> > > > > variable. If not, there is no difference between atomic variables
> > > > > and non-atomic variables.
> > > > >
> > > > > > Therefore if arch is 64 bit native there is no need for atomics
> > > > >
> > > > > At least on ARM architecture, if the variable is not aligned on 8B
> > > > > boundary, the load or store are not atomic. I am sure it is the same
> > > > > on other architectures.
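To make the alignment point concrete, here is a minimal C11 sketch (the type name counter64_t is made up for illustration, not an existing DPDK type) of forcing the 8-byte alignment that makes a plain 64-bit load/store a single, non-split access on 64-bit ARM and x86:

```c
#include <stdalign.h>
#include <stdint.h>

/* Illustrative only: a 64-bit counter wrapper that guarantees the
 * 8-byte alignment discussed above, so that aligned 64-bit loads and
 * stores are single-copy atomic on 64-bit native architectures. */
typedef struct {
	alignas(8) uint64_t val;
} counter64_t;
```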
> > > After reading this: Who's afraid of a big bad optimizing compiler?
> > > https://lwn.net/Articles/793253/
> >
> > Very interesting article!
> >
> > Looks like you are right, and atomic or read/write once is required.
> >
> > I don't like the performance tradeoff (for 64 bit architectures) in
> > the v7 patch.
> > For single-thread updated counters, we MUST have a high performance
> > counter_add(), not using atomic read-modify-write.
>
> The fundamental issue is that for SW drivers, having a totally safe
> reset function requires an atomic operation in the fast path for
> increment. Otherwise the following (highly unlikely) race is possible:
>
>	CPU A				CPU B
>	load counter (value = X)
>					store counter = 0
>	store counter (value = X + 1)
>
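To spell the race out in C: below is a sequential simulation of that interleaving, not real concurrent code (all names are made up). CPU A's non-atomic increment is a load, an add, and a store; CPU B's reset lands between the load and the store, so the reset is silently lost:

```c
#include <stdint.h>

/* Sequential simulation of the lost-reset race described above.
 * The counter field stands in for a shared SW driver statistic. */
struct sim {
	uint64_t counter;
};

static uint64_t simulate_race(uint64_t x)
{
	struct sim s = { .counter = x };

	/* CPU A: load counter (value = X) */
	uint64_t tmp = s.counter;

	/* CPU B: store counter = 0 (the reset) */
	s.counter = 0;

	/* CPU A: store counter (value = X + 1) -- overwrites the reset */
	s.counter = tmp + 1;

	return s.counter;
}
```

The final value is X + 1 rather than 0 or 1, i.e. the reset never happened from the reader's point of view.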
Yes, this is why I suggest the "offset" method, option 3.

Depending on which CPU wins the race, reset() will set "offset" to
either X or X + 1 using your example here.
"offset" is atomic, so reading it cannot race writing it.
And thus: if counter = X was visible at reset(), fetch() will return
counter - offset = X - X = 0, and if counter = X + 1 was visible at
reset(), fetch() will return counter - offset = (X + 1) - (X + 1) = 0.

> >
> > IMO calling counter_fetch() MUST be possible from another thread.
> > This requires that the fast path thread stores the counter atomically
> > (or using volatile), to ensure that the counter is not only kept in a
> > CPU register, but stored in memory visible by other threads.
> >
> > For counter_reset(), we have multiple options:
> > 0. Don't support counter resetting.
>
> Not really an option?
> We could reject it in the SW drivers. But better not to.

Agree.

> > 1. Only allow calling counter_reset() in the fast path thread updating
> > the counter. This introduces no further requirements.
>
> Not a good restriction

Not good, but the alternatives seem to be worse.

> > 2. Allow calling counter_reset() from another thread, thereby zeroing
> > the "counter" variable. This introduces a requirement for the "counter"
> > to be thread-safe, so counter_add() must atomically read-modify-write
> > the counter, which has a performance cost.
>
> Bad for performance.

> > 3. Allow calling counter_reset() from another thread, and introduce an
> > "offset" variable for counter_fetch() and counter_reset() to provide
> > thread-safe fetch/reset from other threads, using the consume-release
> > pattern.
>
> Too confusing

Using offsets for pseudo-reset is a common design pattern. And the
performance is excellent.
Perhaps the Linux kernel doesn't use this design pattern, but that is
not a valid reason to disqualify it.

> >
> > I don't like option 2.
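For the record, a minimal C11 sketch of how I imagine option 3 (names and memory orders are illustrative; a real implementation would use release/acquire where cross-thread ordering matters). "value" has a single writer, the fast path thread, so counter_add() needs no locked read-modify-write; reset() merely records the currently visible value in "offset", and fetch() subtracts it:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sketch of the "offset" pseudo-reset, option 3. */
struct counter {
	_Atomic uint64_t value;   /* written by the fast path thread only */
	_Atomic uint64_t offset;  /* written by reset(), read by fetch() */
};

static inline void counter_add(struct counter *c, uint64_t n)
{
	/* Single writer: plain load + relaxed store, no locked RMW. */
	atomic_store_explicit(&c->value,
		atomic_load_explicit(&c->value, memory_order_relaxed) + n,
		memory_order_relaxed);
}

static inline uint64_t counter_fetch(struct counter *c)
{
	uint64_t v = atomic_load_explicit(&c->value, memory_order_relaxed);
	uint64_t o = atomic_load_explicit(&c->offset, memory_order_relaxed);

	return v - o;
}

static inline void counter_reset(struct counter *c)
{
	/* Record the currently visible value; fetch() then reports 0. */
	atomic_store_explicit(&c->offset,
		atomic_load_explicit(&c->value, memory_order_relaxed),
		memory_order_relaxed);
}
```

If reset() races a concurrent add, "offset" captures either X or X + 1, so a subsequent fetch() reports 0 or 1, never a bogus huge value.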
> > I consider counters too important and frequently used in the fast
> > path, to compromise on performance for counters.
>
> Agree.
>
> > For counters updated by multiple fast path threads,
> > atomic_fetch_add_explicit() of the "counter" variable seems
> > unavoidable.
>
> Not at all worried about overhead in slow (fetch) path.

Agree. Fetch() is slow path, so performance is "nice to have", but
nothing more.

> > > Perhaps introducing rte_read_once and rte_write_once is good idea?
> > > Several drivers are implementing it already.
> >
> > The read_once/write_once are easier to use, but they lack the
> > flexibility (regarding barriers and locking) provided by their
> > atomic_explicit alternatives, which will impact performance in some
> > use cases.
>
> They solve the compiler reordering problem but do nothing about cpu
> ordering.
> Also, there is no such increment.

Agree. They are designed for hardware access, not multithreading.

> > We should strive for the highest possible performance, which means
> > that we shouldn't introduce functions or design patterns preventing
> > this.
> > Please note: Security vs. performance is another matter - we certainly
> > don't want to promote insecure code for the benefit of performance.
> > But for "ease of coding" vs. performance, I prefer performance.
> >
> > That said, I agree that introducing rte_read/write_once functions for
> > use by drivers to access hardware makes sense, to eliminate copy-paste
> > variants in drivers.
> > But how can we prevent such functions from being used for other
> > purposes, where atomic_explicit should be used?
>
> Looking at x86 result on godbolt shows that using atomic only adds a
> single locked operation.
> Perhaps this can be ameliorated by doing bulk add at end of loop, as
> many drivers do already.

Absolutely; whenever possible, local counters should be maintained
inside the loop, and added to the public counters at the end of a loop.
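A sketch of that bulk-add pattern, for clarity (all names are hypothetical, not from any existing driver):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared counter, updated by multiple fast path threads. */
static _Atomic uint64_t rx_packets;

static void process_burst(size_t n)
{
	uint64_t local = 0;

	for (size_t i = 0; i < n; i++) {
		/* ... per-packet work would go here ... */
		local++;
	}

	/* One locked add per burst instead of one per packet. */
	atomic_fetch_add_explicit(&rx_packets, local, memory_order_relaxed);
}
```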
Please note that application counters might be spread all over memory.
E.g. per-flow in a flow structure, per QoS class in a QoS class
structure, per subscriber in a subscriber structure, etc.
And a burst of packets might touch multiple of these.
My point is: Atomic read-modify-write of counters will cause serious
stalling, waiting for memory access.