On Sat, Apr 20, 2019 at 10:26:15AM +1000, Nicholas Piggin wrote:
> Paul E. McKenney's on April 20, 2019 4:26 am:
> > On Fri, Apr 19, 2019 at 08:00:17PM +0200, Peter Zijlstra wrote:
> >> On Fri, Apr 19, 2019 at 01:21:45PM -0400, Alan Stern wrote:
> >> > Index: usb-devel/Documentation/atomic_t.txt
> >> > ===================================================================
> >> > --- usb-devel.orig/Documentation/atomic_t.txt
> >> > +++ usb-devel/Documentation/atomic_t.txt
> >> > @@ -171,7 +171,10 @@ The barriers:
> >> >    smp_mb__{before,after}_atomic()
> >> >  
> >> >  only apply to the RMW ops and can be used to augment/upgrade the ordering
> >> > -inherent to the used atomic op. These barriers provide a full smp_mb().
> >> > +inherent to the used atomic op. Unlike normal smp_mb() barriers, they order
> >> > +only the RMW op itself against the instructions preceding the
> >> > +smp_mb__before_atomic() or following the smp_mb__after_atomic(); they do
> >> > +not order instructions on the other side of the RMW op at all.
> >> 
> >> Now it is I who is confused; what?
> >> 
> >>    x = 1;
> >>    smp_mb__before_atomic();
> >>    atomic_add(1, &a);
> >>    y = 1;
> >> 
> >> the stores to both x and y will be ordered as if an smp_mb() were
> >> there. There is no order between a and y otoh.
> > 
> > Let's look at x86.  And a slightly different example:
> > 
> >     x = 1;
> >     smp_mb__before_atomic();
> >     atomic_add(1, &a);
> >     r1 = y;
> > 
> > The atomic_add() asm does not have the "memory" constraint, which is
> > completely legitimate because atomic_add() does not return a value,
> > and thus guarantees no ordering.  The compiler is therefore within
> > its rights to transform the code into the following:
> > 
> >     x = 1;
> >     smp_mb__before_atomic();
> >     r1 = y;
> >     atomic_add(1, &a);
> > 
> > But x86's smp_mb__before_atomic() is just a compiler barrier, and
> > x86 is further allowed to reorder prior stores with later loads.
> > The CPU can therefore execute this code as follows:
> > 
> >     r1 = y;
> >     x = 1;
> >     smp_mb__before_atomic();
> >     atomic_add(1, &a);
> > 
> > So in general, the ordering is guaranteed only to the atomic itself,
> > not to accesses on the other side of the atomic.
> 
> That's interesting. I don't think that's what all our code expects.
> I had the same idea as Peter.
> 
> IIRC the primitive was originally introduced exactly so x86 would not
> need to have the unnecessary hardware barrier with sequences like
> 
>   smp_mb();
>   ...
>   atomic_inc(&v);
> 
> The "new" semantics are a bit subtle. One option might be just to
> replace it entirely with atomic_xxx_mb() ?

Hmmm...  There are more than 2,000 uses of atomic_inc() in the kernel.
There are about 300-400 uses total of smp_mb__before_atomic() and
smp_mb__after_atomic() combined.

So what are our options?

1.      atomic_xxx_mb() as you say.

        From a quick scan of smp_mb__before_atomic() uses, we need this
        for atomic_inc(), atomic_dec(), atomic_add(), atomic_sub(),
        clear_bit(), set_bit(), test_bit(), atomic_long_dec(),
        atomic_long_add(), refcount_dec(), cmpxchg_relaxed(),
        set_tsk_thread_flag(), clear_bit_unlock().

        Another random look identifies atomic_andnot().

        And atomic_set(): set_preempt_state().  This fails
        on x86, s390, and TSO friends, does it not?  Or is
        this ARM-only?  Still, why not just smp_mb() before and
        after?  Same issue in __kernfs_new_node(), bio_cnt_set(),
        sbitmap_queue_update_wake_batch(), ...

        Ditto for atomic64_set() in __ceph_dir_set_complete().

        Ditto for atomic_read() in rvt_qp_is_avail().  This function
        has a couple of other oddly placed smp_mb__before_atomic().

        And atomic_cmpxchg(): msc_buffer_alloc().  This instance
        of smp_mb__before_atomic() can be removed unless I am missing
        something subtle.  Ditto for kvm_vcpu_exiting_guest_mode(),
        pv_kick_node(), __sbq_wake_up(), ...

        And lock acquisition??? acm_read_bulk_callback().

        In nfnl_acct_fill_info(), an smp_mb__before_atomic() after
        an atomic64_xchg()???  Also before a clear_bit(), but the
        clear_bit() is inside an "if".

        There are a few cases that would see added overhead.  For example,
        svc_get_next_xprt() has the following:

                smp_mb__before_atomic();
                clear_bit(SP_CONGESTED, &pool->sp_flags);
                clear_bit(RQ_BUSY, &rqstp->rq_flags);
                smp_mb__after_atomic();

        And xs_sock_reset_connection_flags() has this:

                smp_mb__before_atomic();
                clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
                clear_bit(XPRT_CLOSING, &xprt->state);
                xs_sock_reset_state_flags(xprt);  /* Also a clear_bit(). */
                smp_mb__after_atomic();

        Yeah, there are more than a few misuses, aren't there?  :-/
        A coccinelle script seems in order, perhaps run by the 0day
        test robot.

        But there are a number of helper functions whose purpose
        seems to be to wrap an atomic in smp_mb__before_atomic() and
        smp_mb__after_atomic(), so some of the atomic_xxx_mb() functions
        might be a good idea just for improved readability.

2.      Add a checkpatch.pl warning about these barriers, with a message
        explaining that they provide only partial ordering (against the
        RMW op itself, not accesses on the other side of it).

3.      Make non-value-returning atomics provide full ordering.
        This would of course need some benchmarking, but would be a
        simple change to make and would eliminate a large class of
        potential bugs.  My guess is that the loss in performance
        would be non-negligible, but who knows?

4.      Your idea here!

                                                        Thanx, Paul
