On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> On 11/11, Peter Zijlstra wrote:
> >
> > On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
> >
> > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
> >
> > I did wonder the same thing, it would simplify a number of things if
> > this were so.
> 
> Yes, me too.
> 
> Sometimes I even think it should have both ACQUIRE + RELEASE semantics.
> IOW, it should be "equivalent" to spin_lock() + spin_unlock().
> 
> Consider this code:
> 
>       object_t *object;
>       spinlock_t lock;
> 
>       void update(void)
>       {
>               object_t *o;
> 
>               spin_lock(&lock);
>               o = READ_ONCE(object);
>               if (o) {
>                       BUG_ON(o->dead);
>                       do_something(o);
>               }
>               spin_unlock(&lock);
>       }
> 
>       void destroy(void) // can be called only once, can't race with itself
>       {
>               object_t *o;
> 
>               o = object;
>               object = NULL;
> 
>               /*
>                * pairs with lock/ACQUIRE. The next update() must see
>                * object == NULL after spin_lock();
>                */
>               smp_mb();
> 
>               spin_unlock_wait(&lock);
> 
>               /*
>                * pairs with unlock/RELEASE. The previous update() has
>                * already passed BUG_ON(o->dead).
>                *
>                * (Yes, yes, in this particular case it is not needed,
>                *  we can rely on the control dependency).
>                */
>               smp_mb();
> 
>               o->dead = true;
>       }
> 
> I believe the code above is correct and it needs the barriers on both sides.
> 

Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
only guarantee that the memory operations following spin_lock() can't
be reordered before the *LOAD* part of spin_lock(), not the *STORE*
part, i.e. the case below can happen (assuming spin_lock() is
implemented as an ll/sc loop):

        spin_lock(&lock):
          r1 = *lock; // LL, r1 == 0
        o = READ_ONCE(object); // could be reordered here.
          *lock = 1; // SC

This reordering is allowed by the ACQUIRE semantics of spin_lock(), and
the current implementation of spin_lock() on PPC allows it to happen.

(Cc PPC maintainers for their opinions on this one)
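
To make the LL/SC structure explicit, here is a hypothetical C-level
sketch of such a lock. load_linked()/store_conditional() are made-up
stand-ins for lwarx/stwcx. (not real kernel primitives), and
acquire_barrier() stands in for PPC's lwsync:

        static inline void sketch_spin_lock(int *lock)
        {
                int old;

                do {
                        old = load_linked(lock);          /* LL: the LOAD part */
                } while (old != 0 ||                      /* lock held, retry */
                         !store_conditional(lock, 1));    /* SC: the STORE part */

                /*
                 * The ACQUIRE barrier (lwsync on PPC) orders the LL load
                 * against everything in the critical section, but it does
                 * not order the SC store against a later load, so
                 * o = READ_ONCE(object) can be satisfied before the lock
                 * word is visibly set, exactly as sketched above.
                 */
                acquire_barrier();      /* hypothetical; lwsync on PPC */
        }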

Therefore the case below can happen:

        CPU 1                   CPU 2                   CPU 3
        ==================      ====================    ==============
                                                        spin_unlock(&lock);
                                spin_lock(&lock):
                                  r1 = *lock; // r1 == 0;
                                o = READ_ONCE(object); // reordered here
        object = NULL;
        smp_mb();
        spin_unlock_wait(&lock);
                                  *lock = 1;
        smp_mb();               
        o->dead = true;
                                if (o) // true
                                  BUG_ON(o->dead); // true!!


To show this, I also translated this situation into a PPC litmus test
for herd[1]:

        PPC spin-lock-wait
        "
        r1: local variable of lock
        r2: constant 1
        r3: constant 0 or NULL
        r4: local variable of object, i.e. o
        r5: local variable of *o (simulating ->dead, as I didn't know
            how to express fields of a structure in herd ;-()
        r13: the address of lock, i.e. &lock
        r14: the address of object, i.e. &object
        "
        {
        0:r1=0;0:r2=1;0:r3=0;0:r13=lock;0:r14=object;
        1:r1=0;1:r2=1;1:r3=0;1:r4=0;1:r5=0;1:r13=lock;1:r14=object;
        2:r1=0;2:r13=lock;
        lock=1; object=old; old=0;
        }

        P0              | P1                 | P2 ;
        ld r4,0(r14)    | Lock:              | stw r1,0(r13);
        std r3,0(r14)   | lwarx r1,r3,r13    | ;
                        | cmpwi r1,0         | ;
        sync            | bne Lock           | ;
                        | stwcx. r2,r3,r13   | ;
        Wait:           | bne Lock           | ;
        lwz r1,0(r13)   | lwsync             | ;
        cmpwi r1,0      | ld r4,0(r14)       | ;
        bne Wait        | cmpwi r4,0         | ;
                        | beq Fail           | ;
        sync            | lwz r5, 0(r4)      | ;
        stw r2,0(r4)    | Fail:              | ;
                        | lwsync             | ;
                        | stw r3, 0(r13)     | ;

        exists
        (1:r4=old /\ 1:r5=1)

The result shows that (1:r4=old /\ 1:r5=1) can indeed happen:

        Test spin-lock-wait Allowed
        States 3
        1:r4=0; 1:r5=0;
        1:r4=old; 1:r5=0;
        1:r4=old; 1:r5=1;
        Loop Ok
        Witnesses
        Positive: 18 Negative: 108
        Condition exists (1:r4=old /\ 1:r5=1)
        Observation spin-lock-wait Sometimes 18 108
        Hash=244f8c0f91df5a5ed985500ed7230272

Please note that I use backward jumps in this litmus test, which are
only supported by herd (not by ppcmem[2]), so it takes a while to get
the result. I'm also not that confident of my familiarity with these
tools, so maybe Paul and Will can help check my translation and usage
here ;-)


IIUC, the problem here is that spin_unlock_wait() can be implemented
with only LOAD operations, whereas to have RELEASE semantics a
primitive must have a *STORE* part; therefore spin_unlock_wait() cannot
be a RELEASE.

So.. we probably need to take the lock here.
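
As a concrete (untested) sketch, here is destroy() with the lock
actually taken instead of the smp_mb() + spin_unlock_wait() + smp_mb()
sequence (reusing object, lock and update() from the example above):

        void destroy(void) // can be called only once, can't race with itself
        {
                object_t *o;

                o = object;
                object = NULL;

                /*
                 * The store to object above may be reordered into the
                 * critical section, but not past the RELEASE of
                 * spin_unlock(), so the next update() must see
                 * object == NULL.  And our spin_lock() can only succeed
                 * after the previous update()'s RELEASE, so that
                 * update() has already passed BUG_ON(o->dead) before we
                 * set it below.
                 */
                spin_lock(&lock);
                spin_unlock(&lock);

                o->dead = true;
        }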

> If it is wrong, then task_work_run() is buggy: it relies on mb() implied by
> cmpxchg() before spin_unlock_wait() the same way: the next task_work_cancel()
> should see the result of our cmpxchg(), it must not try to read work->next or
> work->func.
> 
> Hmm. Not sure I really understand what I am trying to say... Perhaps in fact
> I mean that unlock_wait() should be removed because it is too subtle for me ;)
> 

;-)

I think it's OK for it to be an ACQUIRE (with a proper barrier), or
even just a control dependency pairing with spin_unlock(). For example,
the following snippet in do_exit() is OK (except that the smp_mb()
there is then redundant), unless I'm missing something subtle:


        /*
         * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
         * when the following two conditions become true.
         *   - There is a race condition on mmap_sem (it is acquired by
         *     exit_mm()), and
         *   - an SMI occurs before setting TASK_RUNNING
         *     (or the hypervisor of a virtual machine switches to another guest).
         *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD.
         *
         * To avoid this, we have to wait for tsk->pi_lock, which is held
         * by try_to_wake_up(), to be released.
         */
        smp_mb();
        raw_spin_unlock_wait(&tsk->pi_lock);

        /* causes final put_task_struct in finish_task_switch(). */
        tsk->state = TASK_DEAD;
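
To spell out that pairing, here is a heavily simplified and hypothetical
sketch (not the real try_to_wake_up() path); the point is only that the
waker's state store sits before the RELEASE of its unlock, while the
exiting task's TASK_DEAD store is ordered after the loads of
raw_spin_unlock_wait() by the control dependency:

        /* Waker, simplified model: */
        raw_spin_lock(&tsk->pi_lock);
        if (tsk->state & TASK_NORMAL)
                tsk->state = TASK_RUNNING;      /* store before the RELEASE below */
        raw_spin_unlock(&tsk->pi_lock);         /* RELEASE */

        /* Exiting task, as in the snippet above: */
        raw_spin_unlock_wait(&tsk->pi_lock);    /* loads the lock word until it's free */
        tsk->state = TASK_DEAD;                 /* control dependency orders this store
                                                   after those loads, which is what
                                                   pairs with the waker's RELEASE */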

Ref:
[1]: https://github.com/herd/herdtools
[2]: http://www.cl.cam.ac.uk/~pes20/ppcmem/index.html

Regards,
Boqun
