On Thu, Oct 27, 2016 at 10:36:00PM +0200, Thomas Gleixner wrote:
> > +   new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
> > +   if (!new_owner) {
> > +           /*
> > +            * This is the case where futex_lock_pi() has not yet or failed
> > +            * to acquire the lock but still has the futex_q enqueued. So
> > +            * the futex state has a 'waiter' while the rt_mutex state does
> > +            * not.
> > +            *
> > +            * Even though there still is pi_state for this futex, we can
> > +            * clear FUTEX_WAITERS. Either:
> > +            *
> > +            *  - we or futex_lock_pi() will drop the last reference and
> > +            *    clean up this pi_state,
> > +            *
> > +            *  - userspace acquires the futex through its fastpath
> > +            *    and the above pi_state cleanup still happens,
> > +            *
> > +            *  - or futex_lock_pi() will re-set the WAITERS bit in
> > +            *    fixup_owner().
> > +            */
> > +           newval = 0;
> > +           /*
> > +            * Since pi_state->owner must point to a valid task, and
> > +            * task_pid_vnr(pi_state->owner) must match TID_MASK, use
> > +            * init_task.
> > +            */
> > +           new_owner = &init_task;
> 
> So that waiter which is now spinning on pi_mutex->lock has already set the
> waiters bit, which you undo. So you created the following scenario:
> 
> CPU0                    CPU1                          CPU2
> 
> TID 0x1001              TID 0x1000                    TID 0x1002
> 
>                         Acquires futex in user space
>                         futex value = 0x1000;
> 
> futex_lock_pi()
>   
>   lock_hb()
>   set_waiter_bit()
>   --> futex value = 0x40001000;
> 
>                         futex_unlock_pi()
>                          lock_hb()
>   attach_to_pi_owner()
>     rt_mutex_init_proxy_locked()
>   queue_me()
>     unlock_hb()
>                          unlock_hb()
>   rt_mutex_lock()        wake_futex_pi()
>                          lock(pi_mutex->lock);
>    lock(pi_mutex->lock)          new_owner is NULL;
>                           --> futex value = 0;
>                          rt_mutex_futex_unlock(pi_mutex);
>                          unlock(pi_mutex->lock);
>    acquire_rt_mutex()    return to user space
>                                                       Acquires futex in user space
>                                                       --> futex value = 0x1002
>   fixup_owner()
>     fixup_pi_state_owner()
>        uval = 0x1002;
>        newval = 0x40001001;
>        cmpxchg(uval, newval) succeeds
>        --> futex value = 0x40001001
> 
> Voila. Inconsistent state .... TID 0x1002 is borked now.

Urgh, right.

> The other option we have is to set the futex value to FUTEX_WAITERS instead
> of 0.

Yeah, I initially did that but didn't really like it; I then went on to
convince myself setting it to 0 was good, silly me. Leaking the WAITERS
bit is _by_far_ the simplest option though.
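
For the record, the shape I'd played with in wake_futex_pi() looked
roughly like the below; a sketch only, mirroring the hunk quoted at the
top, not a tested patch:

        new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
        if (!new_owner) {
                /*
                 * Sketch: no rt_mutex waiter yet; leak FUTEX_WAITERS
                 * instead of clearing the word, so a userspace fastpath
                 * acquisition is forced back into the kernel, where the
                 * stale pi_state gets cleaned up.
                 */
                newval = FUTEX_WAITERS;
                new_owner = &init_task;
        }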

Ok, I went and implemented various broken and discarded alternatives
while re-learning everything about futexes that I had forgotten over the
past few weeks, and while trying to figure out wtf the below does.

I also tried to create some 4+ CPU races that would hit holes in the
below, no luck so far.

> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2246,16 +2246,27 @@ static int fixup_pi_state_owner(u32 __us
>       if (get_futex_value_locked(&uval, uaddr))
>               goto handle_fault;
>  
> +     /*
> +      * If wake_futex_pi() set the futex to 0 and made init_task the
> +      * transient owner another task might have acquired the futex
> +      * in user space.
> +      */

True, but that doesn't explain why we do this. Naively leaving
pi_state->owner set while returning -EAGAIN shouldn't be a problem,
because put_pi_state() should clean that up.

_However_, that does rt_mutex_proxy_unlock() on it as well, and _that_
is a problem, because pi_state->owner (init_task here) is not the
rt_mutex owner.

Then again, I think we can hit this very problem through any of the
other put_pi_state() calls after setting ->owner = &init_task. Which
would argue for special-casing this in put_pi_state() instead.
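
Something like the below is what I mean; an untested sketch of the
put_pi_state() tail, keyed off owner == &init_task:

        if (pi_state->owner) {
                raw_spin_lock_irq(&pi_state->owner->pi_lock);
                list_del_init(&pi_state->list);
                raw_spin_unlock_irq(&pi_state->owner->pi_lock);

                /*
                 * Sketch: init_task is only a transient placeholder and
                 * never was the rt_mutex owner, so don't proxy-unlock it.
                 */
                if (pi_state->owner != &init_task)
                        rt_mutex_proxy_unlock(&pi_state->pi_mutex,
                                              pi_state->owner);
        }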

> +     if (oldowner == &init_task && uval != 0) {
> +             raw_spin_lock(&pi_state->owner->pi_lock);
> +             list_del_init(&pi_state->list);
> +             raw_spin_unlock(&pi_state->owner->pi_lock);
> +             pi_state->owner = NULL;
> +             return -EAGAIN;
>       }
>  
> +     newval = (uval & FUTEX_OWNER_DIED) | newtid;
> +
> +     if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
> +             goto handle_fault;
> +
> +     if (curval != uval)
> +             goto retry;

This is slightly weird in that we lose the obvious cmpxchg loop
construct, so I'd write it differently. Also, that
get_futex_value_locked() call is entirely superfluous the second time
around; we got curval after all.
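
That is, something along these lines (sketch only), keeping the loop
explicit and reusing curval as the next expected value:

        for (;;) {
                newval = (uval & FUTEX_OWNER_DIED) | newtid;

                if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
                        goto handle_fault;
                if (curval == uval)
                        break;
                uval = curval;
        }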

> +
>       /*
>        * We fixed up user space. Now we need to fix the pi_state
>        * itself.
> @@ -2679,6 +2690,10 @@ static int futex_lock_pi(u32 __user *uad
>  
>  out_put_key:
>       put_futex_key(&q.key);
> +
> +     if (ret == -EAGAIN)
> +             goto retry;
> +

And this is far too clever and really needs a comment. The crucial
point is that this comes after unqueue_me_pi(), which drops the
pi_state, and then loops back to look up the pi_state again, which,
hopefully, has by now been completely destroyed -- and therefore we hit
the regular attach_to_pi_owner() path, fixing up our 'funny' state.

This is where I was playing with 4+ CPU scenarios, to see if we could
somehow keep the pi_state alive and not make progress.

I think we can do the retry slightly earlier, right after
unqueue_me_pi(), and add a retry_queue: label right before
futex_lock_pi_atomic(); that would avoid dropping the hb->lock (and
avoids the put/get_futex_key).
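
Roughly like so; a very rough sketch, retry_queue is a made-up label and
I'm hand-waving the hb->lock handling around the jump:

retry_private:
        hb = queue_lock(&q);

retry_queue:
        ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);

        /* ... queue_me(), rt_mutex_lock(), fixup_owner(), unqueue_me_pi() ... */

        if (ret == -EAGAIN)
                goto retry_queue;       /* keep the futex_key, skip put/get */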

>  out:
>       if (to)
>               destroy_hrtimer_on_stack(&to->timer);

Also, since fixup_pi_state_owner() can now return -EAGAIN, all its
callers need to handle that.
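
For instance, fixup_owner() simply forwards the return value, so both
futex_lock_pi() and futex_wait_requeue_pi() would now see -EAGAIN coming
out of it; sketch of the relevant fixup_owner() bit:

        if (locked) {
                /*
                 * Got the lock; fixup_pi_state_owner() may now return
                 * -EAGAIN, which is passed straight up to the callers.
                 */
                if (q->pi_state->owner != current)
                        ret = fixup_pi_state_owner(uaddr, q, current);
                goto out;
        }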


I'll try and do a patch that does all that and attempt to write coherent
comments on our fancy new state.
