On Mon, 2014-01-06 at 12:52 -0800, Darren Hart wrote: > On Thu, 2014-01-02 at 07:05 -0800, Davidlohr Bueso wrote: > > From: Davidlohr Bueso <davidl...@hp.com> > > > > In futex_wake() there is clearly no point in taking the hb->lock if we know > > beforehand that there are no tasks to be woken. While the hash bucket's > > plist > > head is a cheap way of knowing this, we cannot rely 100% on it as there is a > > racy window between the futex_wait call and when the task is actually added > > to > > the plist. To this end, we couple it with the spinlock check as tasks > > trying to > > enter the critical region are most likely potential waiters that will be > > added > > to the plist, thus preventing tasks sleeping forever if wakers don't > > acknowledge > > all possible waiters. > > > > Furthermore, the futex ordering guarantees are preserved, ensuring that > > waiters > > either observe the changed user space value before blocking or is woken by a > > concurrent waker. For wakers, this is done by relying on the barriers in > > get_futex_key_refs() -- for archs that do have implicit mb in atomic_inc() > > we > > do NOT have implicit mb in atomic_inc() > ^
oh, yes! > > Sorry to be a pedant, but this is gnarly stuff and we have to get the > documentation right. > Absolutely! > > explicitly add them through a new futex_get_mm function. For waiters we rely > > on the fact that spin_lock calls already update the head counter, so > > spinners > > are visible even if the lock hasn't been acquired yet. > > > > For more details please refer to the updated comments in the code and > > related > > discussion: https://lkml.org/lkml/2013/11/26/556 > > > > Special thanks to tglx for careful review and feedback. > > > > Cc: Ingo Molnar <mi...@kernel.org> > > Cc: Darren Hart <dvh...@linux.intel.com> > > Cc: Peter Zijlstra <pet...@infradead.org> > > Cc: Thomas Gleixner <t...@linutronix.de> > > Cc: Paul E. McKenney <paul...@linux.vnet.ibm.com> > > Cc: Mike Galbraith <efa...@gmx.de> > > Cc: Jeff Mahoney <je...@suse.com> > > Suggested-by: Linus Torvalds <torva...@linux-foundation.org> > > Cc: Scott Norton <scott.nor...@hp.com> > > Cc: Tom Vaden <tom.va...@hp.com> > > Cc: Aswin Chandramouleeswaran <as...@hp.com> > > Cc: Waiman Long <waiman.l...@hp.com> > > Cc: Jason Low <jason.l...@hp.com> > > Signed-off-by: Davidlohr Bueso <davidl...@hp.com> > > --- > > kernel/futex.c | 113 > > +++++++++++++++++++++++++++++++++++++++++++++------------ > > 1 file changed, 90 insertions(+), 23 deletions(-) > > > > diff --git a/kernel/futex.c b/kernel/futex.c > > index fcc6850..5b4d09e 100644 > > --- a/kernel/futex.c > > +++ b/kernel/futex.c > > @@ -75,17 +75,20 @@ > > * The waiter reads the futex value in user space and calls > > * futex_wait(). This function computes the hash bucket and acquires > > * the hash bucket lock. After that it reads the futex user space value > > - * again and verifies that the data has not changed. If it has not > > - * changed it enqueues itself into the hash bucket, releases the hash > > - * bucket lock and schedules. > > + * again and verifies that the data has not changed. If it has not changed > > + * it enqueues itself into the hash bucket, releases the hash bucket lock > > + * and schedules. > > * > > * The waker side modifies the user space value of the futex and calls > > - * futex_wake(). This functions computes the hash bucket and acquires > > - * the hash bucket lock. Then it looks for waiters on that futex in the > > - * hash bucket and wakes them. > > + * futex_wake(). This function computes the hash bucket and acquires the > > + * hash bucket lock. Then it looks for waiters on that futex in the hash > > + * bucket and wakes them. > > * > > - * Note that the spin_lock serializes waiters and wakers, so that the > > - * following scenario is avoided: > > + * In scenarios where wakeups are called and no tasks are blocked on a > > futex, > > > "wakeups are called" reads awkwardly to me. Perhaps: > > "In futex wake up scenarios where no tasks are blocked on the > futex, ..." > I have no particular preference, so I'll update it. > > > + * taking the hb spinlock can be avoided and simply return. In order for > > this > > + * optimization to work, ordering guarantees must exist so that the waiter > > + * being added to the list is acknowledged when the list is concurrently > > being > > + * checked by the waker, avoiding scenarios like the following: > > * > > * CPU 0 CPU 1 > > * val = *futex; > > @@ -106,24 +109,50 @@ > > * This would cause the waiter on CPU 0 to wait forever because it > > * missed the transition of the user space value from val to newval > > * and the waker did not find the waiter in the hash bucket queue. > > - * The spinlock serializes that: > > + * > > + * The correct serialization ensures that a waiter either observes > > + * the changed user space value before blocking or is woken by a > > + * concurrent waker: > > * > > * CPU 0 CPU 1 > > * val = *futex; > > * sys_futex(WAIT, futex, val); > > * futex_wait(futex, val); > > - * lock(hash_bucket(futex)); > > - * uval = *futex; > > - * *futex = newval; > > - * sys_futex(WAKE, futex); > > - * futex_wake(futex); > > - * lock(hash_bucket(futex)); > > + * > > + * waiters++; > > + * mb(); (A) <-- paired with -. > > + * | > > + * lock(hash_bucket(futex)); | > > + * | > > + * uval = *futex; | > > + * | *futex = newval; > > + * | sys_futex(WAKE, futex); > > + * | futex_wake(futex); > > + * | > > + * `-------> mb(); (B) > > * if (uval == val) > > - * queue(); > > + * queue(); > > * unlock(hash_bucket(futex)); > > - * schedule(); if (!queue_empty()) > > - * wake_waiters(futex); > > - * unlock(hash_bucket(futex)); > > + * schedule(); if (waiters) > > + * lock(hash_bucket(futex)); > > + * wake_waiters(futex); > > + * unlock(hash_bucket(futex)); > > + * > > + * Where (A) orders the waiters increment and the futex value read -- this > > + * is guaranteed by the head counter in the hb spinlock; and where (B) > > + * orders the write to futex and the waiters read. > > + * > > + * This yields the following case (where X:=waiters, Y:=futex): > > + * > > + * X = Y = 0 > > + * > > + * w[X]=1 w[Y]=1 > > + * MB MB > > + * r[Y]=y r[X]=x > > + * > > + * Which guarantees that x==0 && y==0 is impossible; which translates back > > into > > + * the guarantee that we cannot both miss the futex variable change and the > > + * enqueue. > > */ > > > > int __read_mostly futex_cmpxchg_enabled; > > @@ -211,6 +240,38 @@ static unsigned long __read_mostly futex_hashsize; > > > > static struct futex_hash_bucket *futex_queues; > > > > +static inline void futex_get_mm(union futex_key *key) > > +{ > > + atomic_inc(&key->private.mm->mm_count); > > +#ifdef CONFIG_SMP > > + /* > > + * Ensure futex_get_mm() implies a full barrier such that > > + * get_futex_key() implies a full barrier. This is relied upon > > + * as full barrier (B), see the ordering comment above. > > + */ > > + smp_mb__after_atomic_inc(); > > +#endif > > +} > > + > > +static inline bool hb_waiters_pending(struct futex_hash_bucket *hb) > > +{ > > +#ifdef CONFIG_SMP > > + /* > > + * Tasks trying to enter the critical region are most likely > > + * potential waiters that will be added to the plist. Ensure > > + * that wakers won't miss to-be-slept tasks in the window between > > + * the wait call and the actual plist_add. > > + */ > > + if (spin_is_locked(&hb->lock)) > > + return true; > > + smp_rmb(); /* Make sure we check the lock state first */ > > + > > + return !plist_head_empty(&hb->chain); > > +#else > > + return true; > > +#endif > > +} > > > I thought someone, Peter Z?, had commented on these CONFIG_SMP bits. Are > they really necessary? Does smp_mb__after_atomic_inc() and smp_rmb() not > already just do the right thing as far as we're concerned here? I don't think so. Thomas and I agreed that this was in fact the way to go. I rechecked old email and didn't notice any objections to CONFIG_SMP. Also for things like hb_waiters_pending we definitely need it. > > + > > /* > > * We hash on the keys returned from get_futex_key (see below). > > */ > > @@ -245,10 +306,10 @@ static void get_futex_key_refs(union futex_key *key) > > > > switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { > > case FUT_OFF_INODE: > > - ihold(key->shared.inode); > > + ihold(key->shared.inode); /* implies MB (B) */ > > break; > > case FUT_OFF_MMSHARED: > > - atomic_inc(&key->private.mm->mm_count); > > + futex_get_mm(key); /* implies MB (B) */ > > break; > > } > > } > > @@ -322,7 +383,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union > > futex_key *key, int rw) > > if (!fshared) { > > key->private.mm = mm; > > key->private.address = address; > > - get_futex_key_refs(key); > > + get_futex_key_refs(key); /* implies MB (B) */ > > return 0; > > } > > > > @@ -1052,6 +1113,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, > > int nr_wake, u32 bitset) > > goto out; > > > > > Given the subtlety of the implementation - I think it would be good to > explicitly annotate the get_futex_key() call site in futex_wake() as > providing the MB (B). > > Similar comment for futex_wait() and futex_requeue() for MB (A). > > These will also raise the appropriate red flags for people looking to > optimize or modify these paths in the future. It would be good to have > it in the top level futex_* function to make the MB placement and > relationship explicitly clear. > Something quite similar was already there for v2 but PeterZ's feedback made me update the main documentation at the top of futex.c to as it is now... > > > hb = hash_futex(&key); > > + > > + /* Make sure we really have tasks to wakeup */ > > + if (!hb_waiters_pending(hb)) > > + goto out_put_key; > > + > > spin_lock(&hb->lock); > > > > plist_for_each_entry_safe(this, next, &hb->chain, list) { > > @@ -1072,6 +1138,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int > > nr_wake, u32 bitset) > > } > > > > spin_unlock(&hb->lock); > > +out_put_key: > > put_futex_key(&key); > > out: > > return ret; > > @@ -1535,7 +1602,7 @@ static inline struct futex_hash_bucket > > *queue_lock(struct futex_q *q) > > hb = hash_futex(&q->key); > > q->lock_ptr = &hb->lock; > > > > - spin_lock(&hb->lock); > > + spin_lock(&hb->lock); /* implies MB (A) */ > > return hb; > > } > > > > Functionally, this looks correct to me and Davidlohr's testing has been > well documented. > > Reviewed-by: Darren Hart <dvh...@linux.intel.com> Thanks for looking into this Darren! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/