On Sat, 18 May 2024 16:00:55 +0200
Morten Brørup <m...@smartsharesystems.com> wrote:

> > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > Sent: Friday, 17 May 2024 18.18
> > 
> > On Fri, 17 May 2024 08:44:42 +0200
> > Morten Brørup <m...@smartsharesystems.com> wrote:
> >   
> > > > From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> > > > Sent: Friday, 17 May 2024 06.27
> > > >
> > > > + Bruce for feedback on x86 architecture
> > > >  
> > > > > On May 16, 2024, at 10:30 PM, Stephen Hemminger  
> > <step...@networkplumber.org>  
> > > > wrote:  
> > > > >
> > > > > On Fri, 17 May 2024 02:45:12 +0000
> > > > > Honnappa Nagarahalli <honnappa.nagaraha...@arm.com> wrote:
> > > > >  
> > > > >>> + * A counter is 64 bit value that is safe from split read/write
> > > > >>> + * on 32 bit platforms. It assumes that only one cpu at a time  
> > > > >> If we are defining the counter in this manner, then  
> > implementation cannot  
> > > > be generic. I think architectures will have constraints if they have  
> > to ensure  
> > > > the 64b variables are not split.  
> > > > >>
> > > > >> I think we at least need the counter to be aligned on 8B boundary  
> > to have  
> > > > generic code.  
> > > > >
> > > > > The C standard has always guaranteed that read and write to  
> > unsigned log  
> > > > will not be split.
> > > > As I understand, this is true only if the variable is an atomic  
> > variable. If  
> > > > not, there is no difference between atomic variables and non-atomic  
> > variables.  
> > > >  
> > > > > Therefore if arch is 64 bit native there is no need for atomics  
> > > > At least on ARM architecture, if the variable is not aligned on 8B  
> > boundary,  
> > > > the load or store are not atomic. I am sure it is the same on other
> > > > architectures.  
> > 
> > After reading this: Who's afraid of a big bad optimizing compiler?
> >  https://lwn.net/Articles/793253/  
> 
> Very interesting article!
> 
> > 
> > Looks like you are right, and atomic or read/write once is required.  
> 
> I don't like the performance tradeoff (for 64 bit architectures) in the v7 
> patch.
> For single-tread updated counters, we MUST have a high performance 
> counter_add(), not using atomic read-modify-write.

The fundamental issue is that for SW drivers, having a totally safe reset 
function requires
an atomic operation in the fast path for increment. Otherwise the following 
(highly unlikely) race
is possible:
        
        CPU A                                   CPU B
                load counter (value = X)
                                                store counter = 0
                store counter (value = X + 1)


> 
> IMO calling counter_fetch() MUST be possible from another thread.
> This requires that the fast path thread stores the counter atomically (or 
> using volatile), to ensure that the counter is not only kept in a CPU 
> register, but stored in memory visible by other threads.
> 
> For counter_reset(), we have multiple options:
> 0. Don't support counter resetting. Not really on option?

We could reject it in the SW drivers. But better not to.

> 1. Only allow calling counter_reset() in the fast path thread updating the 
> counter. This introduces no further requirements.

Not a good restriction

> 2. Allow calling counter_reset() from another thread, thereby zeroing the 
> "counter" variable. This introduces a requirement for the "counter" to be 
> thread-safe, so counter_add() must atomically read-modify-write the counter, 
> which has a performance cost.

> 3. Allow calling counter_reset() from another thread, and introduce an 
> "offset" variable for counter_fetch() and counter_reset() to provide 
> thread-safe fetch/reset from other threads, using the consume-release pattern.

Too confusing

> 
> I don't like option 2.
> I consider counters too important and frequently used in the fast path, to 
> compromise on performance for counters.

Agree.

> 
> For counters updated by multiple fast path threads, 
> atomic_fetch_add_explicit() of the "counter" variable seems unavoidable.

Not at all worried about overhead in slow (fetch) path.

> 
> > Perhaps introducing rte_read_once and rte_write_once is good idea?
> > Several drivers are implementing it already.  
> 
> The read_once/write_once are easier to use, but they lack the flexibility 
> (regarding barriers and locking) provided by their atomic_explicit 
> alternatives, which will impact performance in some use cases.

They solve the compiler reordering problem but do nothing about cpu ordering.
Also, there is no such increment.

> 
> We should strive for the highest possible performance, which means that we 
> shouldn't introduce functions or design patterns preventing this.
> Please note: Security vs. performance is another matter - we certainly don't 
> want to promote insecure code for the benefit of performance. But for "ease 
> of coding" vs. performance, I prefer performance.
> 
> That said, I agree that introducing rte_read/write_once functions for use by 
> drivers to access hardware makes sense, to eliminate copy-paste variants in 
> drivers.
> But how can we prevent such functions from being used for other purposes, 
> where atomic_explicit should be used?
> 

Looking at x86 result on godbolt shows that using atomic only adds a single 
locked operation.
Perhaps this can be ameliorated by doing bulk add at end of loop, like many 
drivers were already.


Reply via email to