[PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2014-01-12 Thread Davidlohr Bueso
From: Davidlohr Bueso 

In futex_wake() there is clearly no point in taking the hb->lock if we know
beforehand that there are no tasks to be woken. While the hash bucket's plist
head is a cheap way of knowing this, we cannot rely on it 100% as there is a
race window between the futex_wait call and the task actually being added to
the plist. To this end, we couple it with the spinlock check, as tasks trying
to enter the critical region are most likely potential waiters that will be
added to the plist, thus preventing tasks from sleeping forever if wakers don't
acknowledge all possible waiters.
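
As an illustration of the coupled check on the waker side (the corresponding
hunk is truncated in the diff below, so the helper and field names here are
illustrative rather than authoritative):

static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
{
	/*
	 * Tasks trying to enter the critical region are potential
	 * waiters about to be added to the plist, so couple the
	 * plist check with the spinlock state.
	 */
	return spin_is_locked(&hb->lock) || !plist_head_empty(&hb->chain);
}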

Furthermore, the futex ordering guarantees are preserved, ensuring that waiters
either observe the changed user space value before blocking or are woken by a
concurrent waker. For wakers, this is done by relying on the barriers in
get_futex_key_refs() -- for archs that do not have implicit mb in atomic_inc(),
we explicitly add them through a new futex_get_mm function. For waiters we rely
on the fact that spin_lock calls already update the head counter, so spinners
are visible even if the lock hasn't been acquired yet.

For more details please refer to the updated comments in the code and related
discussion: https://lkml.org/lkml/2013/11/26/556

Special thanks to tglx for careful review and feedback.

Cc: Ingo Molnar 
Reviewed-by: Darren Hart 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Paul E. McKenney 
Cc: Mike Galbraith 
Cc: Jeff Mahoney 
Suggested-by: Linus Torvalds 
Cc: Scott Norton 
Cc: Tom Vaden 
Cc: Aswin Chandramouleeswaran 
Cc: Waiman Long 
Cc: Jason Low 
Signed-off-by: Davidlohr Bueso 
---
 kernel/futex.c | 115 +
 1 file changed, 91 insertions(+), 24 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index fcc6850..be6399a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -75,17 +75,20 @@
  * The waiter reads the futex value in user space and calls
  * futex_wait(). This function computes the hash bucket and acquires
  * the hash bucket lock. After that it reads the futex user space value
- * again and verifies that the data has not changed. If it has not
- * changed it enqueues itself into the hash bucket, releases the hash
- * bucket lock and schedules.
+ * again and verifies that the data has not changed. If it has not changed
+ * it enqueues itself into the hash bucket, releases the hash bucket lock
+ * and schedules.
  *
  * The waker side modifies the user space value of the futex and calls
- * futex_wake(). This functions computes the hash bucket and acquires
- * the hash bucket lock. Then it looks for waiters on that futex in the
- * hash bucket and wakes them.
+ * futex_wake(). This function computes the hash bucket and acquires the
+ * hash bucket lock. Then it looks for waiters on that futex in the hash
+ * bucket and wakes them.
  *
- * Note that the spin_lock serializes waiters and wakers, so that the
- * following scenario is avoided:
+ * In futex wake up scenarios where no tasks are blocked on a futex, taking
+ * the hb spinlock can be avoided and the syscall can simply return. In order
+ * for this optimization to work, ordering guarantees must exist so that the
+ * waiter being added to the list is acknowledged when the list is concurrently
+ * being checked by the waker, avoiding scenarios like the following:
  *
 * CPU 0                               CPU 1
  * val = *futex;
@@ -106,24 +109,52 @@
  * This would cause the waiter on CPU 0 to wait forever because it
  * missed the transition of the user space value from val to newval
  * and the waker did not find the waiter in the hash bucket queue.
- * The spinlock serializes that:
+ *
+ * The correct serialization ensures that a waiter either observes
+ * the changed user space value before blocking or is woken by a
+ * concurrent waker:
  *
 * CPU 0                               CPU 1
  * val = *futex;
  * sys_futex(WAIT, futex, val);
  *   futex_wait(futex, val);
- *   lock(hash_bucket(futex));
- *   uval = *futex;
- * *futex = newval;
- * sys_futex(WAKE, futex);
- *   futex_wake(futex);
- *   lock(hash_bucket(futex));
+ *
+ *   waiters++;
+ *   mb(); (A) <-- paired with -.
+ *                              |
+ *   lock(hash_bucket(futex));  |
+ *                              |
+ *   uval = *futex;             |
+ *                              |        *futex = newval;
+ *                              |        sys_futex(WAKE, futex);
+ *                              |          futex_wake(futex);
+ *                              |
+ *                              `------->   mb(); (B)
 *   if (uval == val)
- *      queue();
+ *     queue();
 * unlock(hash_bucket(futex));
- * schedule();                         if (!queue_empty())
- *                                       wake_waiters(futex);
- *   unlock(
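
The diff is cut off above; roughly, the waker-side fast path in futex_wake()
that results (an illustrative sketch following the changelog, not the missing
hunk itself):

	hb = hash_futex(&key);

	/* Make sure we really have tasks to wakeup */
	if (!hb_waiters_pending(hb))
		goto out_put_key;

	spin_lock(&hb->lock);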

Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-26 Thread Thomas Gleixner
On Mon, 25 Nov 2013, Darren Hart wrote:
> On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
> > So now your code melts down to:
> > 
> >  write(hb->waiters)   |   write(uaddr)
> >  mb                   |   read(hb->waiters)
> >  read(uaddr)
> > 
> > I fear you simply managed to make the window small enough that your
> > testing was no longer able to expose it.
> 
> Does seem to be the case.

Actually not. It's protected by an atomic_inc() between the
write(uaddr) and read(hb->waiters) on the waker side.

It took me a while to realize that get_futex_key_refs() is providing
the protection inadvertently. But as I explained, we cannot rely on
that for all architectures.
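
For reference, the helper in question looked roughly like this at the time
(an abridged sketch, not a verbatim copy):

static void get_futex_key_refs(union futex_key *key)
{
	if (!key->both.ptr)
		return;

	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
	case FUT_OFF_INODE:
		ihold(key->shared.inode);
		break;
	case FUT_OFF_MMSHARED:
		/* a LOCK-prefixed op on x86, hence a full barrier there */
		atomic_inc(&key->private.mm->mm_count);
		break;
	}
}

On x86 the atomic_inc() orders the surrounding loads and stores because of the
LOCK prefix, which is what inadvertently closed the window; nothing in
Documentation/atomic_ops.txt promises that on other architectures.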

Thanks,

tglx



Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Mon, 25 Nov 2013, Darren Hart wrote:
> On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
> > > which restores the ordering guarantee, which the hash bucket lock
> > > provided so far.
> > 
> > Actually that's not true by design, it just happens to work.
> > 
> > atomic_inc() on x86 is a "lock incl".
> > 
> >  The LOCK prefix just guarantees that the cache line which is affected
> >  by the INCL is locked. And it guarantees that locked operations
> >  serialize all outstanding load and store operations.
> > 
> >  But Documentation/atomic_ops.txt says about atomic_inc():
> > 
> >  "One very important aspect of these two routines is that they DO NOT
> >   require any explicit memory barriers.  They need only perform the
> >   atomic_t counter update in an SMP safe manner."
> > 
> >  So while this has a barrier on x86, it's not guaranteed.
> 
> 
> But it is guaranteed to be "in an SMP safe manner"... which I guess just
> means that two writes will not intermix bytes, but no guarantee that the
> value will be visible to other CPUs unless some kind of barrier is
> explicitly imposed.
> 
> Correct?

Yep, that's what it says.
 
> > So now your code melts down to:
> > 
> >  write(hb->waiters)   |   write(uaddr)
> >  mb                   |   read(hb->waiters)
> >  read(uaddr)
> > 
> > I fear you simply managed to make the window small enough that your
> > testing was no longer able to expose it.
> 
> Does seem to be the case.

You must be aware that between the write(uaddr) and the
read(hb->waiters) is the syscall, i.e. a user/kernel space
transition.

sysenter/syscall have no documented barriers inside, but we don't know
whether the actual hw implementation provides one, or whether the memory
ops between modifying uaddr and reaching the read(hb->waiters) point,
including the real atomic op on the waker side, are good enough to
paper over the issue.

Thanks,

tglx





Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Darren Hart
On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
> On Sat, 23 Nov 2013, Thomas Gleixner wrote:
> > On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
> > So with the atomic ops you are changing that to:
> > 
> > CPU 0                               CPU 1
> > 
> > val = *futex;
> > futex_wait(futex, val);
> > 
> > spin_lock(&hb->lock);
> > 
> > atomic_inc(&hb->waiters);
> > uval = *futex;
> >                                     *futex = newval;
> > 
> > if (uval != val) {                  futex_wake();
> >    spin_unlock(&hb->lock);          if (!atomic_read(&hb->waiters))
> >    return;                             return;
> > }
> > spin_lock(&hb->lock);
> > plist_add(hb, self);
> > spin_unlock(&hb->lock);
> > schedule();                         wakeup_waiters(hb);
> > ...
> > 
> > which restores the ordering guarantee, which the hash bucket lock
> > provided so far.
> 
> Actually that's not true by design, it just happens to work.
> 
> atomic_inc() on x86 is a "lock incl".
> 
>  The LOCK prefix just guarantees that the cache line which is affected
>  by the INCL is locked. And it guarantees that locked operations
>  serialize all outstanding load and store operations.
> 
>  But Documentation/atomic_ops.txt says about atomic_inc():
> 
>  "One very important aspect of these two routines is that they DO NOT
>   require any explicit memory barriers.  They need only perform the
>   atomic_t counter update in an SMP safe manner."
> 
>  So while this has a barrier on x86, it's not guaranteed.


But it is guaranteed to be "in an SMP safe manner"... which I guess just
means that two writes will not intermix bytes, but no guarantee that the
value will be visible to other CPUs unless some kind of barrier is
explicitly imposed.

Correct?


> atomic_read() is a simple read.
> 
>  This does not guarantee anything. And if you read
>  Documentation/atomic_ops.txt it's well documented:
> 
>  "*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***"
> 
> 
> So now your code melts down to:
> 
>  write(hb->waiters)   |   write(uaddr)
>  mb                   |   read(hb->waiters)
>  read(uaddr)
> 
> I fear you simply managed to make the window small enough that your
> testing was no longer able to expose it.

Does seem to be the case.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel




Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Mon, 25 Nov 2013, Davidlohr Bueso wrote:
> On Mon, 2013-11-25 at 18:32 +0100, Thomas Gleixner wrote:
> > If the smp_mb() is heavy weight, then it will hurt massively in the
> > case where the hash bucket is not empty, because we add the price for
> > the smp_mb() just for no gain.
> > 
> > In that context it would also be helpful to measure the overhead on
> > x86 for the !empty case.
> 
> Absolutely, I will add these comparisons. If we do notice that we end up
> hurting the !empty case, would the current patch using atomic ops still
> be considered? We have made sure that none of the changes in this set
> affects performance on other workloads/smaller systems.

Please read my last reply to the atomic ops approach.

Aside of that we need numbers for a significant range of !x86.

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Sat, 23 Nov 2013, Thomas Gleixner wrote:
> On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
> So with the atomic ops you are changing that to:
> 
> CPU 0                               CPU 1
> 
> val = *futex;
> futex_wait(futex, val);
> 
> spin_lock(&hb->lock);
> 
> atomic_inc(&hb->waiters);
> uval = *futex;
>                                     *futex = newval;
> 
> if (uval != val) {                  futex_wake();
>    spin_unlock(&hb->lock);          if (!atomic_read(&hb->waiters))
>    return;                             return;
> }
> spin_lock(&hb->lock);
> plist_add(hb, self);
> spin_unlock(&hb->lock);
> schedule();                         wakeup_waiters(hb);
> ...
> 
> which restores the ordering guarantee, which the hash bucket lock
> provided so far.

Actually that's not true by design, it just happens to work.

atomic_inc() on x86 is a "lock incl".

 The LOCK prefix just guarantees that the cache line which is affected
 by the INCL is locked. And it guarantees that locked operations
 serialize all outstanding load and store operations.

 But Documentation/atomic_ops.txt says about atomic_inc():

 "One very important aspect of these two routines is that they DO NOT
  require any explicit memory barriers.  They need only perform the
  atomic_t counter update in an SMP safe manner."

 So while this has a barrier on x86, it's not guaranteed.

atomic_read() is a simple read.

 This does not guarantee anything. And if you read
 Documentation/atomic_ops.txt it's well documented:

 "*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***"


So now your code melts down to:

 write(hb->waiters)   |   write(uaddr)
 mb                   |   read(hb->waiters)
 read(uaddr)

I fear you simply managed to make the window small enough that your
testing was no longer able to expose it.
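
A portable variant would have to make the ordering explicit, e.g. (schematic
only, not the posted patch):

	/* waiter side */
	atomic_inc(&hb->waiters);
	smp_mb__after_atomic_inc();	/* pairs with the smp_mb() below */
	uval = *uaddr;			/* then compare uval against val */

	/* waker side: user space has stored newval to *uaddr */
	smp_mb();			/* pairs with the barrier above */
	if (!atomic_read(&hb->waiters))
		return;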

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Davidlohr Bueso
On Mon, 2013-11-25 at 18:32 +0100, Thomas Gleixner wrote:
> On Mon, 25 Nov 2013, Peter Zijlstra wrote:
> > On Mon, Nov 25, 2013 at 05:23:51PM +0100, Thomas Gleixner wrote:
> > > On Sat, 23 Nov 2013, Linus Torvalds wrote:
> > > 
> > > > On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  
> > > > wrote:
> > > > >
> > > > > Now the question is why we queue the waiter _AFTER_ reading the user
> > > > > space value. The comment in the code is pretty non sensical:
> > > > >
> > > > >* On the other hand, we insert q and release the hash-bucket only
> > > > >* after testing *uaddr.  This guarantees that futex_wait() will NOT
> > > > >* absorb a wakeup if *uaddr does not match the desired values
> > > > >* while the syscall executes.
> > > > >
> > > > > There is no reason why we cannot queue _BEFORE_ reading the user space
> > > > > value. We just have to dequeue in all the error handling cases, but
> > > > > for the fast path it does not matter at all.
> > > > >
> > > > > CPU 0                         CPU 1
> > > > >
> > > > > val = *futex;
> > > > > futex_wait(futex, val);
> > > > >
> > > > > spin_lock(&hb->lock);
> > > > >
> > > > > plist_add(hb, self);
> > > > > smp_wmb();
> > > > >
> > > > > uval = *futex;
> > > > >                               *futex = newval;
> > > > >                               futex_wake();
> > > > >
> > > > >                               smp_rmb();
> > > > >                               if (plist_empty(hb))
> > > > >                                  return;
> > > > > ...
> > > > 
> > > > This would seem to be a nicer approach indeed, without needing the
> > > > extra atomics.
> > > 
> > > I went through the issue with Peter and he noticed, that we need
> > > smp_mb() in both places. That's what we have right now with the
> > > spin_lock() and it is required as we need to guarantee that
> > > 
> > >  The waiter observes the change to the uaddr value after it added
> > >  itself to the plist
> > > 
> > >  The waker observes plist not empty if the change to uaddr was made
> > >  after the waiter checked the value.
> > > 
> > > 
> > >   write(plist)        |   write(futex_uaddr)
> > >   mb()                |   mb()
> > >   read(futex_uaddr)   |   read(plist)
> > > 
> > > The spin_lock mb() on the waiter side does not help here because it
> > > happens before the write(plist) and not after it.
> > 
> > Ah, note that spin_lock() is only a smp_mb() on x86; in general it's an
> > ACQUIRE barrier, which is weaker than a full mb and will not suffice in
> > this case even if it were in the right place.
> 
> So now the question is whether this lockless empty check optimization
> which seems to be quite nice on x86 for a particular workload will
> have any negative side effects on other architectures.
> 
> If the smp_mb() is heavy weight, then it will hurt massively in the
> case where the hash bucket is not empty, because we add the price for
> the smp_mb() just for no gain.
> 
> In that context it would also be helpful to measure the overhead on
> x86 for the !empty case.

Absolutely, I will add these comparisons. If we do notice that we end up
hurting the !empty case, would the current patch using atomic ops still
be considered? We have made sure that none of the changes in this set
affects performance on other workloads/smaller systems.

Thanks,
Davidlohr



Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Peter Zijlstra
On Mon, Nov 25, 2013 at 06:32:55PM +0100, Thomas Gleixner wrote:
> In that context it would also be helpful to measure the overhead on
> x86 for the !empty case.

Yes, because mfence is by no means a cheap instruction on x86.


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Mon, 25 Nov 2013, Peter Zijlstra wrote:
> On Mon, Nov 25, 2013 at 05:23:51PM +0100, Thomas Gleixner wrote:
> > On Sat, 23 Nov 2013, Linus Torvalds wrote:
> > 
> > > On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  
> > > wrote:
> > > >
> > > > Now the question is why we queue the waiter _AFTER_ reading the user
> > > > space value. The comment in the code is pretty non sensical:
> > > >
> > > >* On the other hand, we insert q and release the hash-bucket only
> > > >* after testing *uaddr.  This guarantees that futex_wait() will NOT
> > > >* absorb a wakeup if *uaddr does not match the desired values
> > > >* while the syscall executes.
> > > >
> > > > There is no reason why we cannot queue _BEFORE_ reading the user space
> > > > value. We just have to dequeue in all the error handling cases, but
> > > > for the fast path it does not matter at all.
> > > >
> > > > CPU 0                          CPU 1
> > > >
> > > > val = *futex;
> > > > futex_wait(futex, val);
> > > >
> > > > spin_lock(&hb->lock);
> > > >
> > > > plist_add(hb, self);
> > > > smp_wmb();
> > > >
> > > > uval = *futex;
> > > >                                *futex = newval;
> > > >                                futex_wake();
> > > >
> > > >                                smp_rmb();
> > > >                                if (plist_empty(hb))
> > > >                                   return;
> > > > ...
> > > 
> > > This would seem to be a nicer approach indeed, without needing the
> > > extra atomics.
> > 
> > I went through the issue with Peter and he noticed, that we need
> > smp_mb() in both places. That's what we have right now with the
> > spin_lock() and it is required as we need to guarantee that
> > 
> >  The waiter observes the change to the uaddr value after it added
> >  itself to the plist
> > 
> >  The waker observes plist not empty if the change to uaddr was made
> >  after the waiter checked the value.
> > 
> > 
> > write(plist)        |   write(futex_uaddr)
> > mb()                |   mb()
> > read(futex_uaddr)   |   read(plist)
> > 
> > The spin_lock mb() on the waiter side does not help here because it
> > happens before the write(plist) and not after it.
> 
> Ah, note that spin_lock() is only a smp_mb() on x86; in general it's an
> ACQUIRE barrier, which is weaker than a full mb and will not suffice in
> this case even if it were in the right place.

So now the question is whether this lockless empty check optimization
which seems to be quite nice on x86 for a particular workload will
have any negative side effects on other architectures.

If the smp_mb() is heavy weight, then it will hurt massively in the
case where the hash bucket is not empty, because we add the price for
the smp_mb() just for no gain.

In that context it would also be helpful to measure the overhead on
x86 for the !empty case.

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Peter Zijlstra
On Mon, Nov 25, 2013 at 05:23:51PM +0100, Thomas Gleixner wrote:
> On Sat, 23 Nov 2013, Linus Torvalds wrote:
> 
> > On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  wrote:
> > >
> > > Now the question is why we queue the waiter _AFTER_ reading the user
> > > space value. The comment in the code is pretty non sensical:
> > >
> > >* On the other hand, we insert q and release the hash-bucket only
> > >* after testing *uaddr.  This guarantees that futex_wait() will NOT
> > >* absorb a wakeup if *uaddr does not match the desired values
> > >* while the syscall executes.
> > >
> > > There is no reason why we cannot queue _BEFORE_ reading the user space
> > > value. We just have to dequeue in all the error handling cases, but
> > > for the fast path it does not matter at all.
> > >
> > > CPU 0                            CPU 1
> > >
> > > val = *futex;
> > > futex_wait(futex, val);
> > >
> > > spin_lock(&hb->lock);
> > >
> > > plist_add(hb, self);
> > > smp_wmb();
> > >
> > > uval = *futex;
> > >                                  *futex = newval;
> > >                                  futex_wake();
> > >
> > >                                  smp_rmb();
> > >                                  if (plist_empty(hb))
> > >                                     return;
> > > ...
> > 
> > This would seem to be a nicer approach indeed, without needing the
> > extra atomics.
> 
> I went through the issue with Peter and he noticed, that we need
> smp_mb() in both places. That's what we have right now with the
> spin_lock() and it is required as we need to guarantee that
> 
>  The waiter observes the change to the uaddr value after it added
>  itself to the plist
> 
>  The waker observes plist not empty if the change to uaddr was made
>  after the waiter checked the value.
> 
> 
>   write(plist)        |   write(futex_uaddr)
>   mb()                |   mb()
>   read(futex_uaddr)   |   read(plist)
> 
> The spin_lock mb() on the waiter side does not help here because it
> happens before the write(plist) and not after it.

Ah, note that spin_lock() is only a smp_mb() on x86; in general it's an
ACQUIRE barrier, which is weaker than a full mb and will not suffice in
this case even if it were in the right place.


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Sat, 23 Nov 2013, Linus Torvalds wrote:

> On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  wrote:
> >
> > Now the question is why we queue the waiter _AFTER_ reading the user
> > space value. The comment in the code is pretty non sensical:
> >
> >* On the other hand, we insert q and release the hash-bucket only
> >* after testing *uaddr.  This guarantees that futex_wait() will NOT
> >* absorb a wakeup if *uaddr does not match the desired values
> >* while the syscall executes.
> >
> > There is no reason why we cannot queue _BEFORE_ reading the user space
> > value. We just have to dequeue in all the error handling cases, but
> > for the fast path it does not matter at all.
> >
> > CPU 0                              CPU 1
> >
> > val = *futex;
> > futex_wait(futex, val);
> >
> > spin_lock(&hb->lock);
> >
> > plist_add(hb, self);
> > smp_wmb();
> >
> > uval = *futex;
> >                                    *futex = newval;
> >                                    futex_wake();
> >
> >                                    smp_rmb();
> >                                    if (plist_empty(hb))
> >                                       return;
> > ...
> 
> This would seem to be a nicer approach indeed, without needing the
> extra atomics.

I went through the issue with Peter and he noticed, that we need
smp_mb() in both places. That's what we have right now with the
spin_lock() and it is required as we need to guarantee that

 The waiter observes the change to the uaddr value after it added
 itself to the plist

 The waker observes plist not empty if the change to uaddr was made
 after the waiter checked the value.


write(plist)        |   write(futex_uaddr)
mb()                |   mb()
read(futex_uaddr)   |   read(plist)

The spin_lock mb() on the waiter side does not help here because it
happens before the write(plist) and not after it.
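
Concretely, with queueing first the required pairing would look like this
(schematic only, not a patch):

	/* waiter (futex_wait) */
	plist_add(&q->list, &hb->chain);
	smp_mb();			/* (A) pairs with (B) */
	uval = *uaddr;
	/* dequeue and bail out if uval != val */

	/* waker (futex_wake); user space stored newval to *uaddr */
	smp_mb();			/* (B) pairs with (A) */
	if (plist_head_empty(&hb->chain))
		return;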

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Sat, 23 Nov 2013, Davidlohr Bueso wrote:
> On Sat, 2013-11-23 at 19:46 -0800, Linus Torvalds wrote:
> > On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  wrote:
> > >
> > > Now the question is why we queue the waiter _AFTER_ reading the user
> > > space value. The comment in the code is pretty non sensical:
> > >
> > >* On the other hand, we insert q and release the hash-bucket only
> > >* after testing *uaddr.  This guarantees that futex_wait() will NOT
> > >* absorb a wakeup if *uaddr does not match the desired values
> > >* while the syscall executes.
> > >
> > > There is no reason why we cannot queue _BEFORE_ reading the user space
> > > value. We just have to dequeue in all the error handling cases, but
> > > for the fast path it does not matter at all.
> > >
> > > CPU 0   CPU 1
> > >
> > > val = *futex;
> > > futex_wait(futex, val);
> > >
> > > spin_lock(&hb->lock);
> > >
> > > plist_add(hb, self);
> > > smp_wmb();
> > >
> > > uval = *futex;
> > > *futex = newval;
> > > futex_wake();
> > >
> > > smp_rmb();
> > > if (plist_empty(hb))
> > >return;
> > > ...
> > 
> > This would seem to be a nicer approach indeed, without needing the
> > extra atomics.
> 
> Yep, I think we can all agree that doing this optimization without atomic
> ops is a big plus.
> 
> > 
> > Davidlohr, mind trying Thomas' approach?
> 
> I just took a quick look and it seems pretty straightforward, but not
> without some details to consider. We basically have to redo/reorder
> futex_wait_setup(), which checks that uval == val, and
> futex_wait_queue_me(), which adds the task to the list and blocks. Now,
> both futex_wait() and futex_wait_requeue_pi() have this logic, but since
> we don't use futex_wake() to wake up tasks on pi futex_qs, I believe it's
> ok to only change futex_wait(), while the order of the uval checking
> doesn't matter for futex_wait_requeue_pi() so it can stay as is.

There is no mechanism which prevents a futex_wake() call on the inner
futex of the wait_requeue_pi mechanism. So no, we have to change both.

Futexes are no place for belief. Either you understand them
completely or you just leave them alone.

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-23 Thread Davidlohr Bueso
On Sat, 2013-11-23 at 19:46 -0800, Linus Torvalds wrote:
> On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  wrote:
> >
> > Now the question is why we queue the waiter _AFTER_ reading the user
> > space value. The comment in the code is pretty non sensical:
> >
> >* On the other hand, we insert q and release the hash-bucket only
> >* after testing *uaddr.  This guarantees that futex_wait() will NOT
> >* absorb a wakeup if *uaddr does not match the desired values
> >* while the syscall executes.
> >
> > There is no reason why we cannot queue _BEFORE_ reading the user space
> > value. We just have to dequeue in all the error handling cases, but
> > for the fast path it does not matter at all.
> >
> > CPU 0                              CPU 1
> >
> > val = *futex;
> > futex_wait(futex, val);
> >
> > spin_lock(&hb->lock);
> >
> > plist_add(hb, self);
> > smp_wmb();
> >
> > uval = *futex;
> >                                    *futex = newval;
> >                                    futex_wake();
> >
> >                                    smp_rmb();
> >                                    if (plist_empty(hb))
> >                                       return;
> > ...
> 
> This would seem to be a nicer approach indeed, without needing the
> extra atomics.

Yep, I think we can all agree that doing this optimization without atomic
ops is a big plus.

> 
> Davidlohr, mind trying Thomas' approach?

I just took a quick look and it seems pretty straightforward, but not
without some details to consider. We basically have to redo/reorder
futex_wait_setup(), which checks that uval == val, and
futex_wait_queue_me(), which adds the task to the list and blocks. Now,
both futex_wait() and futex_wait_requeue_pi() have this logic, but since
we don't use futex_wake() to wake up tasks on pi futex_qs, I believe it's
ok to only change futex_wait(), while the order of the uval checking
doesn't matter for futex_wait_requeue_pi() so it can stay as is.

The following is the general skeleton of what Thomas is probably
referring to (yes, it needs comments, cleanups, refactoring, etc, etc).
Futexes are easy to screw up but at least this passes a kernel build and
all futextests on a large box.

static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
  ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to = NULL;
struct restart_block *restart;
struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
int prio, ret = 0;
u32 uval;

if (!bitset)
return -EINVAL;
q.bitset = bitset;

if (abs_time) {
to = &timeout;

hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
  CLOCK_REALTIME : CLOCK_MONOTONIC,
  HRTIMER_MODE_ABS);
hrtimer_init_sleeper(to, current);
hrtimer_set_expires_range_ns(&to->timer, *abs_time,
 current->timer_slack_ns);
}
retry:  
ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_READ);
if (unlikely(ret != 0))
goto out;

retry_private:
hb = queue_lock(&q);
prio = min(current->normal_prio, MAX_RT_PRIO);

plist_node_init(&q.list, prio);
plist_add(&q.list, &hb->chain);
q.task = current;
/* do NOT drop the lock here */
smp_wmb();

ret = get_futex_value_locked(&uval, uaddr);
if (ret) {
plist_del(&q.list, &hb->chain);
spin_unlock(&hb->lock);

ret = get_user(uval, uaddr);
if (ret)
goto out_put_key;

if (!(flags & FLAGS_SHARED))
goto retry_private;

put_futex_key(&q.key);
goto retry;

}

if (uval != val) {
plist_del(&q.list, &hb->chain);
spin_unlock(&hb->lock);
ret = -EWOULDBLOCK;
goto out_put_key;
}

set_current_state(TASK_INTERRUPTIBLE);
spin_unlock(&hb->lock);

/* Arm the timer */
if (to) {
hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
if (!hrtimer_active(&to->timer))
to->task = NULL;
}

/*
 * If we have been removed from the hash list, then another task
 * has tried to wake us, and we can skip the call to schedule().
 */
if (likely(!plist_node_empty(&q.list))) {
/*
 * If the timer has already expired, current will already be
 * flagged for rescheduling. Only call schedule if there
 * is no timeout, or if it has yet to expire.
 */
if 

Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-23 Thread Linus Torvalds
On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  wrote:
>
> Now the question is why we queue the waiter _AFTER_ reading the user
> space value. The comment in the code is pretty non sensical:
>
>* On the other hand, we insert q and release the hash-bucket only
>* after testing *uaddr.  This guarantees that futex_wait() will NOT
>* absorb a wakeup if *uaddr does not match the desired values
>* while the syscall executes.
>
> There is no reason why we cannot queue _BEFORE_ reading the user space
> value. We just have to dequeue in all the error handling cases, but
> for the fast path it does not matter at all.
>
> CPU 0                                CPU 1
>
> val = *futex;
> futex_wait(futex, val);
>
> spin_lock(&hb->lock);
>
> plist_add(hb, self);
> smp_wmb();
>
> uval = *futex;
>                                      *futex = newval;
>                                      futex_wake();
>
>                                      smp_rmb();
>                                      if (plist_empty(hb))
>                                         return;
> ...

This would seem to be a nicer approach indeed, without needing the
extra atomics.

Davidlohr, mind trying Thomas' approach?

 Linus


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-23 Thread Thomas Gleixner
On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
> On Fri, 2013-11-22 at 17:25 -0800, Linus Torvalds wrote:
> > On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso  wrote:
> > > In futex_wake() there is clearly no point in taking the hb->lock if
> > > we know beforehand that there are no tasks to be woken. This comes
> > > at the smaller cost of doing some atomic operations to keep track of
> > > the list's size.
> > 
> > Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> > need the serialization from locking, then afaik you can just do a
> > "plist_head_empty()" without holding the lock.
> 
> I remember this being the original approach, but after noticing some
> strange behavior we quickly decided it wasn't the path. And sure enough,
> I just double checked and tried the patch without atomic ops and can see
> things being off: one of the futextest performance cases is stuck
> blocked on a futex and I couldn't reboot the machine either -- nothing
apparent in dmesg, just not 100% functional. The thing is, we can
avoid taking the lock only if nobody else is trying to add itself to the
list.

Right. The reason is:

CPU 0                               CPU 1

val = *futex;
futex_wait(futex, val);

spin_lock(&hb->lock);

uval = *futex;
                                    *futex = newval;

if (uval != val) {                  futex_wake();
   spin_unlock(&hb->lock);          if (plist_empty(hb))
   return;                             return;
}

plist_add(hb, self);
spin_unlock(&hb->lock);
schedule();
...

So the waiter on CPU0 gets queued after the user space value changed
and the waker on CPU1 returns because the plist is empty. The waiter
therefore waits forever.

So with the atomic ops you are changing that to:

CPU 0                               CPU 1

val = *futex;
futex_wait(futex, val);

spin_lock(&hb->lock);

atomic_inc(&hb->waiters);
uval = *futex;
                                    *futex = newval;

if (uval != val) {                  futex_wake();
   spin_unlock(&hb->lock);          if (!atomic_read(&hb->waiters))
   return;                             return;
}
spin_lock(&hb->lock);
plist_add(hb, self);
spin_unlock(&hb->lock);
schedule();                         wakeup_waiters(hb);
...

which restores the ordering guarantee, which the hash bucket lock
provided so far.

Now the question is why we queue the waiter _AFTER_ reading the user
space value. The comment in the code is pretty non sensical:

   * On the other hand, we insert q and release the hash-bucket only
   * after testing *uaddr.  This guarantees that futex_wait() will NOT
   * absorb a wakeup if *uaddr does not match the desired values
   * while the syscall executes.

There is no reason why we cannot queue _BEFORE_ reading the user space
value. We just have to dequeue in all the error handling cases, but
for the fast path it does not matter at all.

CPU 0                               CPU 1

val = *futex;
futex_wait(futex, val);

spin_lock(&hb->lock);

plist_add(hb, self);
smp_wmb();

uval = *futex;
                                    *futex = newval;
                                    futex_wake();

                                    smp_rmb();
                                    if (plist_empty(hb))
                                       return;

spin_lock(&hb->lock);
if (uval != val) {
   plist_del(self);
   spin_unlock(&hb->lock);
   return;
}

spin_unlock(&hb->lock);
schedule();                         wakeup_waiters(hb);
...

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Darren Hart
On Fri, 2013-11-22 at 19:19 -0800, Davidlohr Bueso wrote:
> On Fri, 2013-11-22 at 17:25 -0800, Linus Torvalds wrote:
> > On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso  wrote:
> > > In futex_wake() there is clearly no point in taking the hb->lock if
> > > we know beforehand that there are no tasks to be woken. This comes
> > > at the smaller cost of doing some atomic operations to keep track of
> > > the list's size.
> > 
> > Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> > need the serialization from locking, then afaik you can just do a
> > "plist_head_empty()" without holding the lock.
> 
> I remember this being the original approach, but after noticing some
> strange behavior we quickly decided it wasn't the path. And sure enough,
> I just double checked and tried the patch without atomic ops and can see
> things being off: one of the futextest performance cases is stuck
> blocked on a futex and I couldn't reboot the machine either -- nothing
> apparent in dmesg, just not 100% functional. The thing is, we can
> avoid taking the lock only if nobody else is trying to add itself to the
> list.

In your usage, the worst case scenario is that you detect 0 when locking
may have blocked and found a waiter. Correct?

In this case, you return 0, instead of 1 (or more).

This suggests to me a bug in the futextest testcase. Which test
specifically hung up waiting?

Futex hangs are almost always bad userspace code (my bad userspace code
in this case ;-)

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel




Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Darren Hart
On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> In futex_wake() there is clearly no point in taking the hb->lock if
> we know beforehand that there are no tasks to be woken. This comes
> at the smaller cost of doing some atomic operations to keep track of
> the list's size. Specifically, increment the counter when an element is
> added to the list, and decrement when it is removed. Of course, if the
> counter is 0, then there are no tasks blocked on a futex. Some special
> considerations:
> 
> - increment the counter at queue_lock() as we always end up calling
>   queue_me() which adds the element to the list. Upon any error,
>   queue_unlock() is called for housekeeping, for which we decrement
>   to mach the increment done in queue_lock().

  ^match

> 
> - decrement the counter at __unqueue_me() to reflect when an element is
>   removed from the queue for wakeup related purposes.

__unqueue_futex (not __unqueue_me)


> @@ -999,6 +1001,10 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int 
> nr_wake, u32 bitset)
>   goto out;
>  
>   hb = hash_futex(&key);
> + /* make sure we really have tasks to wakeup */

Nit, but please use proper sentence formatting for consistency with the
rest of the comments in futex.c (most of them anyway).

/* Make sure we really have tasks to wake up. */

Now... I'm not thrilled with adding atomics if we don't need to,
especially for an optimization since the atomics themselves cause enough
problems, especially across a large number of CPUs... I'll respond to
Linus's thread though.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel




Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Hart, Darren
On Fri, 2013-11-22 at 21:40 -0800, Darren Hart wrote:
> On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> > In futex_wake() there is clearly no point in taking the hb->lock if
> > we know beforehand that there are no tasks to be woken. This comes
> > at the smaller cost of doing some atomic operations to keep track of
> > the list's size. Specifically, increment the counter when an element is
> > added to the list, and decrement when it is removed. Of course, if the
> > counter is 0, then there are no tasks blocked on a futex. Some special
> > considerations:
> > 
> > - increment the counter at queue_lock() as we always end up calling
> >   queue_me() which adds the element to the list. Upon any error,
> >   queue_unlock() is called for housekeeping, for which we decrement
> >   to mach the increment done in queue_lock().
> > 
> > - decrement the counter at __unqueue_me() to reflect when an element is
> >   removed from the queue for wakeup related purposes.
> 
> What is the problem you are trying to solve here?

Apologies, too quick on the trigger. I see plenty of detail in 0/5. Will
spend some time reviewing that.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Darren Hart
On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> In futex_wake() there is clearly no point in taking the hb->lock if
> we know beforehand that there are no tasks to be woken. This comes
> at the smaller cost of doing some atomic operations to keep track of
> the list's size. Specifically, increment the counter when an element is
> added to the list, and decrement when it is removed. Of course, if the
> counter is 0, then there are no tasks blocked on a futex. Some special
> considerations:
> 
> - increment the counter at queue_lock() as we always end up calling
>   queue_me() which adds the element to the list. Upon any error,
>   queue_unlock() is called for housekeeping, for which we decrement
>   to mach the increment done in queue_lock().
> 
> - decrement the counter at __unqueue_me() to reflect when an element is
>   removed from the queue for wakeup related purposes.

What is the problem you are trying to solve here?

The fast path in futexes is not calling into the kernel at all; if
you've made the syscall, you've already lost. What kind of improvement
are you seeing by adding this optimization? Is it worth the introduction
of atomic operations into a file known to the state of California to
increase the risk of liver failure?

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel




Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Waiman Long

On 11/22/2013 08:25 PM, Linus Torvalds wrote:

On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso  wrote:

In futex_wake() there is clearly no point in taking the hb->lock if
we know beforehand that there are no tasks to be woken. This comes
at the smaller cost of doing some atomic operations to keep track of
the list's size.

Hmm. Why? Afaik, you only care about "empty or not". And if you don't
need the serialization from locking, then afaik you can just do a
"plist_head_empty()" without holding the lock.

NOTE!

The "list_empty()" function is very much designed to work even without
holding a lock (as long as the head itself exists reliably, of course)
BUT you have to then guarantee yourself that your algorithm doesn't
have any races wrt other CPU's adding an entry to the list at the same
time. Not holding a lock obviously means that you are not serialized
against that.. We've had problems with people doing

 if (!list_empty(waiters))
 wake_up_list(..)

because they wouldn't wake people up who just got added.

But considering that your atomic counter checking has the same lack of
serialization, at least the plist_head_empty() check shouldn't be any
worse than that counter thing.. And doesn't need any steenking atomic
ops or a new counter field.

 Linus


Before the patch, a waker will wake up one or more waiters who called
spin_lock() before the waker. If we just use plist_head_empty(), we will
miss waiters who have called spin_lock() but have not yet gotten the lock
or inserted themselves into the wait list. By adding an atomic counter for
checking purposes, we again have a single point where any waiter who passes
that point (incrementing the atomic counter) can be woken up by a waker
that checks the atomic counter after the increment. This acts sort of like
serialization without using a lock.


-Longman


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Davidlohr Bueso
On Fri, 2013-11-22 at 17:25 -0800, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso  wrote:
> > In futex_wake() there is clearly no point in taking the hb->lock if
> > we know beforehand that there are no tasks to be woken. This comes
> > at the smaller cost of doing some atomic operations to keep track of
> > the list's size.
> 
> Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> need the serialization from locking, then afaik you can just do a
> "plist_head_empty()" without holding the lock.

I remember this being the original approach, but after noticing some
strange behavior we quickly decided it wasn't the path. And sure enough,
I just double checked and tried the patch without atomic ops and can see
things being off: one of the futextest performance cases is stuck
blocked on a futex and I couldn't reboot the machine either -- nothing
apparent in dmesg, just not 100% functional. The thing is, we can
avoid taking the lock only if nobody else is trying to add itself to the
list.

Thanks,
Davidlohr



Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Jason Low
On Fri, Nov 22, 2013 at 5:25 PM, Linus Torvalds
 wrote:
> On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso  wrote:
>> In futex_wake() there is clearly no point in taking the hb->lock if
>> we know beforehand that there are no tasks to be woken. This comes
>> at the smaller cost of doing some atomic operations to keep track of
>> the list's size.
>
> Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> need the serialization from locking, then afaik you can just do a
> "plist_head_empty()" without holding the lock.
>
> NOTE!
>
> The "list_empty()" function is very much designed to work even without
> holding a lock (as long as the head itself exists reliably, of course)
> BUT you have to then guarantee yourself that your algorithm doesn't
> have any races wrt other CPU's adding an entry to the list at the same
> time. Not holding a lock obviously means that you are not serialized
> against that.. We've had problems with people doing
>
> if (!list_empty(waiters))
> wake_up_list(..)
>
> because they wouldn't wake people up who just got added.
>
> But considering that your atomic counter checking has the same lack of
> serialization, at least the plist_head_empty() check shouldn't be any
> worse than that counter thing.. And doesn't need any steenking atomic
> ops or a new counter field.

Hi Linus,

In this patch, since we do the atomic increment before taking the
hb->lock in situations where we may potentially add a task to the
list, we would skip the wakeup attempt only if the list is empty and
no other thread holds the hb->lock in the process of adding a task to
it. Would this be enough to protect it from any race condition?

Thanks,
Jason


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Linus Torvalds
On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso  wrote:
> In futex_wake() there is clearly no point in taking the hb->lock if
> we know beforehand that there are no tasks to be woken. This comes
> at the smaller cost of doing some atomic operations to keep track of
> the list's size.

Hmm. Why? Afaik, you only care about "empty or not". And if you don't
need the serialization from locking, then afaik you can just do a
"plist_head_empty()" without holding the lock.

NOTE!

The "list_empty()" function is very much designed to work even without
holding a lock (as long as the head itself exists reliably, of course)
BUT you have to then guarantee yourself that your algorithm doesn't
have any races wrt other CPU's adding an entry to the list at the same
time. Not holding a lock obviously means that you are not serialized
against that.. We've had problems with people doing

if (!list_empty(waiters))
wake_up_list(..)

because they wouldn't wake people up who just got added.

But considering that your atomic counter checking has the same lack of
serialization, at least the plist_head_empty() check shouldn't be any
worse than that counter thing.. And doesn't need any steenking atomic
ops or a new counter field.
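
Side by side, the two unserialized checks would look like this (hypothetical
snippets, neither taken verbatim from the patch):

	/* counter check -- no barrier implied by atomic_read() */
	if (atomic_read(&hb->len) == 0)
		goto out_put_key;

	/* list check -- likewise no barrier, and no atomics needed */
	if (plist_head_empty(&hb->chain))
		goto out_put_key;

The difference the patch leans on is where the increment happens: hb->len is
bumped in queue_lock() before spin_lock(&hb->lock) is taken, so a concurrent
enqueuer becomes visible to the waker earlier than it would on the plist.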

Linus


[PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Davidlohr Bueso
In futex_wake() there is clearly no point in taking the hb->lock if
we know beforehand that there are no tasks to be woken. This comes
at the smaller cost of doing some atomic operations to keep track of
the list's size. Specifically, increment the counter when an element is
added to the list, and decrement when it is removed. Of course, if the
counter is 0, then there are no tasks blocked on a futex. Some special
considerations:

- increment the counter at queue_lock() as we always end up calling
  queue_me() which adds the element to the list. Upon any error,
  queue_unlock() is called for housekeeping, for which we decrement
  to mach the increment done in queue_lock().

- decrement the counter at __unqueue_me() to reflect when an element is
  removed from the queue for wakeup related purposes.

Cc: Ingo Molnar 
Cc: Darren Hart 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Mike Galbraith 
Cc: Jeff Mahoney 
Cc: Linus Torvalds 
Cc: Scott Norton 
Cc: Tom Vaden 
Cc: Aswin Chandramouleeswaran 
Signed-off-by: Waiman Long 
Signed-off-by: Jason Low 
Signed-off-by: Davidlohr Bueso 
---
 kernel/futex.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 5fa9eb0..2f9dd5d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -152,6 +152,7 @@ static const struct futex_q futex_q_init = {
  * waiting on a futex.
  */
 struct futex_hash_bucket {
+   atomic_t len;
spinlock_t lock;
struct plist_head chain;
 };
@@ -843,6 +844,7 @@ static void __unqueue_futex(struct futex_q *q)
 
hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
plist_del(&q->list, &hb->chain);
+   atomic_dec(&hb->len);
 }
 
 /*
@@ -999,6 +1001,10 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int 
nr_wake, u32 bitset)
goto out;
 
hb = hash_futex(&key);
+   /* make sure we really have tasks to wakeup */
+   if (atomic_read(&hb->len) == 0)
+   goto out_put_key;
+
spin_lock(&hb->lock);
 
plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1019,6 +1025,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int 
nr_wake, u32 bitset)
}
 
spin_unlock(&hb->lock);
+out_put_key:
put_futex_key(&key);
 out:
return ret;
@@ -1137,7 +1144,9 @@ void requeue_futex(struct futex_q *q, struct 
futex_hash_bucket *hb1,
 */
if (likely(&hb1->chain != &hb2->chain)) {
plist_del(&q->list, &hb1->chain);
+   atomic_dec(&hb1->len);
plist_add(&q->list, &hb2->chain);
+   atomic_inc(&hb2->len);
q->lock_ptr = &hb2->lock;
}
get_futex_key_refs(key2);
@@ -1480,6 +1489,8 @@ static inline struct futex_hash_bucket *queue_lock(struct 
futex_q *q)
struct futex_hash_bucket *hb;
 
hb = hash_futex(&q->key);
+   atomic_inc(&hb->len);
+
q->lock_ptr = &hb->lock;
 
spin_lock(&hb->lock);
@@ -1491,6 +1502,7 @@ queue_unlock(struct futex_hash_bucket *hb)
__releases(&hb->lock)
 {
spin_unlock(&hb->lock);
+   atomic_dec(&hb->len);
 }
 
 /**
@@ -,6 +2234,7 @@ int handle_early_requeue_pi_wakeup(struct 
futex_hash_bucket *hb,
 * Unqueue the futex_q and determine which it was.
 */
plist_del(&q->list, &hb->chain);
+   atomic_dec(&hb->len);
 
/* Handle spurious wakeups gracefully */
ret = -EWOULDBLOCK;
@@ -2749,6 +2762,7 @@ static int __init futex_init(void)
for (i = 0; i < futex_hashsize; i++) {
plist_head_init(&futex_queues[i].chain);
spin_lock_init(&futex_queues[i].lock);
+   atomic_set(&futex_queues[i].len, 0);
}
 
return 0;
-- 
1.8.1.4
