Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)

2016-10-25 Thread Kent Overstreet
On Tue, Oct 25, 2016 at 09:55:21AM -0700, Eric Wheeler wrote:
> Sure, I'll put it up with my -rc2 pull request to Jens.  
> 
> A couple of sanity checks (for my understanding at least):
> 
> Why does bch_data_insert_start() no longer need to call 
> set_gc_sectors(op->c) now that bch_cache_set_alloc and bch_gc_thread do?  

Before, the gc thread wasn't being explicitly signalled that it was time to run
- it was just being woken up, meaning that a spurious wakeup would cause gc to
run. At best it was sketchy and fragile, for multiple reasons - wait_event() is
a much better mechanism.

So with wait_event() gc is explicitly checking if it's time to run - the wakeup
doesn't really make anything happen by itself; it just pokes the gc thread so
it's able to notice that it should run.

When we're signalling a thread this way, we're in effect setting a global
variable that says "gc should run now", then kicking the gc thread so it can
check the "gc should run" variable. But wakeups aren't synchronous - our call to
wake_up() doesn't make the gc thread check that variable before it returns; all
we know when the wake_up() call returns is that the gc thread is going to check
that variable at some point in the future. So we can't set the "gc should run"
variable, wake up the gc thread, and then set it back to "gc shouldn't run yet" -
most of the time the gc thread won't run before we set the variable back to
"gc shouldn't run yet", so it'll never see that it was supposed to run and it'll
go back to sleep.

So the way you make this work is to have the gc thread set the variable back to
"gc shouldn't run yet" _after_ it's seen it and decided to run.

> Does bch_cache_set_alloc() even need to call set_gc_sectors since 
> bch_gc_thread() does before calling bch_btree_gc?

Yes, because the gc thread only resets the counter when it's decided to run - we
don't want it to run right away at startup.
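
(For context, since the macro isn't quoted anywhere in this thread: set_gc_sectors()
just re-arms the countdown that triggers gc - conceptually something like the sketch
below, where the threshold is a placeholder for whatever fraction of the cache size
bcache actually uses:

	/* Sketch, not the real definition: re-arm the "sectors until gc" countdown. */
	#define set_gc_sectors(c)	atomic_set(&(c)->sectors_to_gc, SOME_THRESHOLD)

Calling it in bch_cache_set_alloc() arms the counter at startup without waking gc,
and the gc thread re-arms it each time it decides to run.)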

> Also I'm curious, why change invalidate_needs_gc from a bitfield? 

Bitfields are particularly unsafe for multiple threads to access - the compiler
has to emit read/modify/write instructions to update them, and those can clobber
adjacent data. A bare int isn't safe _in general_ for multiple threads to access
without a lock either, but for what it's doing here it's fine.
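
A tiny illustration of the problem (hypothetical struct, not bcache's): two flags
packed into one word mean that updating either flag is a load/modify/store of the
whole word, so concurrent writers can silently undo each other's update:

	struct flags {
		unsigned a:1;	/* written by thread 1 */
		unsigned b:1;	/* written by thread 2 - same underlying word */
	};

	/*
	 * "f->a = 1" compiles to roughly:
	 *	tmp = load(word containing a and b); tmp |= 1; store(tmp);
	 * If thread 2 does "f->b = 1" concurrently, one store can overwrite the
	 * other's bit. With full-width fields ("unsigned a; unsigned b;") each
	 * store only touches its own location, which is what the patch relies on.
	 */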


Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)

2016-10-25 Thread Eric Wheeler
On Mon, 24 Oct 2016, Kent Overstreet wrote:
> On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote:
> > On Wed, 19 Oct 2016, Peter Zijlstra wrote:
> > 
> > > Subject: sched: Better explain sleep/wakeup
> > > From: Peter Zijlstra 
> > > Date: Wed Oct 19 15:45:27 CEST 2016
> > > 
> > > There were a few questions wrt how sleep-wakeup works. Try and explain
> > > it more.
> > > 
> > > Requested-by: Will Deacon 
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > > ---
> > > include/linux/sched.h |   52 
> > > ++
> > > kernel/sched/core.c   |   15 +++---
> > > 2 files changed, 44 insertions(+), 23 deletions(-)
> > > 
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> > > #define set_task_state(tsk, state_value)  \
> > >   do {\
> > >   (tsk)->task_state_change = _THIS_IP_;   \
> > > - smp_store_mb((tsk)->state, (state_value));  \
> > > + smp_store_mb((tsk)->state, (state_value));  \
> > >   } while (0)
> > > 
> > > -/*
> > > - * set_current_state() includes a barrier so that the write of 
> > > current->state
> > > - * is correctly serialised wrt the caller's subsequent test of whether to
> > > - * actually sleep:
> > > - *
> > > - *   set_current_state(TASK_UNINTERRUPTIBLE);
> > > - *   if (do_i_need_to_sleep())
> > > - *   schedule();
> > > - *
> > > - * If the caller does not need such serialisation then use 
> > > __set_current_state()
> > > - */
> > > #define __set_current_state(state_value)  \
> > >   do {\
> > >   current->task_state_change = _THIS_IP_; \
> > > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> > > #define set_current_state(state_value)\
> > >   do {\
> > >   current->task_state_change = _THIS_IP_; \
> > > - smp_store_mb(current->state, (state_value));\
> > > + smp_store_mb(current->state, (state_value));\
> > >   } while (0)
> > > 
> > > #else
> > > 
> > > +/*
> > > + * @tsk had better be current, or you get to keep the pieces.
> > 
> > That reminds me we were getting rid of the set_task_state() calls. Bcache 
> > was
> > pending, being only user in the kernel that doesn't actually use current; 
> > but
> > instead breaks newly (yet blocked/uninterruptible) created garbage 
> > collection
> > kthread. I cannot figure out why this is done (ie purposely accounting the
> > load avg. Furthermore gc kicks in in very specific scenarios obviously, such
> > as as by the allocator task, so I don't see why bcache gc should want to be
> > interruptible.
> > 
> > Kent, Jens, can we get rid of this?
> 
> Here's a patch that just fixes the way gc gets woken up. Eric, you want to 
> take
> this?

Sure, I'll put it up with my -rc2 pull request to Jens.  

A couple of sanity checks (for my understanding at least):

Why does bch_data_insert_start() no longer need to call 
set_gc_sectors(op->c) now that bch_cache_set_alloc and bch_gc_thread do?  

Does bch_cache_set_alloc() even need to call set_gc_sectors since 
bch_gc_thread() does before calling bch_btree_gc?

Also I'm curious, why change invalidate_needs_gc from a bitfield? 

-Eric

> 
> -- >8 --
> Subject: [PATCH] bcache: Make gc wakeup saner
> 
> This lets us ditch a set_task_state() call.
> 
> Signed-off-by: Kent Overstreet 
> ---
>  drivers/md/bcache/bcache.h  |  4 ++--
>  drivers/md/bcache/btree.c   | 39 ---
>  drivers/md/bcache/btree.h   |  3 +--
>  drivers/md/bcache/request.c |  4 +---
>  drivers/md/bcache/super.c   |  2 ++
>  5 files changed, 26 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 6b420a55c7..c3ea03c9a1 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -425,7 +425,7 @@ struct cache {
>* until a gc finishes - otherwise we could pointlessly burn a ton of
>* cpu
>*/
> -	unsigned		invalidate_needs_gc:1;
> +	unsigned		invalidate_needs_gc;
>  
> 	bool			discard; /* Get rid of? */
>  
> @@ -593,8 +593,8 @@ struct cache_set {
>  
> 	/* Counts how many sectors bio_insert has added to the cache */
> 	atomic_t		sectors_to_gc;
> +	wait_queue_head_t	gc_wait;
>  
> -	wait_queue_head_t	moving_gc_wait;
> 	struct keybuf		moving_gc_keys;
> 	/* Number of moving GC bios in flight */
> 	struct semaphore	moving_in_flight;
> diff --git a/drivers/md/bcache/btree.c 

Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)

2016-10-24 Thread Kent Overstreet
On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote:
> On Wed, 19 Oct 2016, Peter Zijlstra wrote:
> 
> > Subject: sched: Better explain sleep/wakeup
> > From: Peter Zijlstra 
> > Date: Wed Oct 19 15:45:27 CEST 2016
> > 
> > There were a few questions wrt how sleep-wakeup works. Try and explain
> > it more.
> > 
> > Requested-by: Will Deacon 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > ---
> > include/linux/sched.h |   52 
> > ++
> > kernel/sched/core.c   |   15 +++---
> > 2 files changed, 44 insertions(+), 23 deletions(-)
> > 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> > #define set_task_state(tsk, state_value)\
> > do {\
> > (tsk)->task_state_change = _THIS_IP_;   \
> > -   smp_store_mb((tsk)->state, (state_value));  \
> > +   smp_store_mb((tsk)->state, (state_value));  \
> > } while (0)
> > 
> > -/*
> > - * set_current_state() includes a barrier so that the write of 
> > current->state
> > - * is correctly serialised wrt the caller's subsequent test of whether to
> > - * actually sleep:
> > - *
> > - * set_current_state(TASK_UNINTERRUPTIBLE);
> > - * if (do_i_need_to_sleep())
> > - * schedule();
> > - *
> > - * If the caller does not need such serialisation then use 
> > __set_current_state()
> > - */
> > #define __set_current_state(state_value)\
> > do {\
> > current->task_state_change = _THIS_IP_; \
> > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> > #define set_current_state(state_value)  \
> > do {\
> > current->task_state_change = _THIS_IP_; \
> > -   smp_store_mb(current->state, (state_value));\
> > +   smp_store_mb(current->state, (state_value));\
> > } while (0)
> > 
> > #else
> > 
> > +/*
> > + * @tsk had better be current, or you get to keep the pieces.
> 
> That reminds me we were getting rid of the set_task_state() calls. Bcache was
> pending, being only user in the kernel that doesn't actually use current; but
> instead breaks newly (yet blocked/uninterruptible) created garbage collection
> kthread. I cannot figure out why this is done (ie purposely accounting the
> load avg. Furthermore gc kicks in in very specific scenarios obviously, such
> as as by the allocator task, so I don't see why bcache gc should want to be
> interruptible.
> 
> Kent, Jens, can we get rid of this?

Here's a patch that just fixes the way gc gets woken up. Eric, you want to take
this?

-- >8 --
Subject: [PATCH] bcache: Make gc wakeup saner

This lets us ditch a set_task_state() call.

Signed-off-by: Kent Overstreet 
---
 drivers/md/bcache/bcache.h  |  4 ++--
 drivers/md/bcache/btree.c   | 39 ---
 drivers/md/bcache/btree.h   |  3 +--
 drivers/md/bcache/request.c |  4 +---
 drivers/md/bcache/super.c   |  2 ++
 5 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 6b420a55c7..c3ea03c9a1 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -425,7 +425,7 @@ struct cache {
 * until a gc finishes - otherwise we could pointlessly burn a ton of
 * cpu
 */
-	unsigned		invalidate_needs_gc:1;
+	unsigned		invalidate_needs_gc;
 
 	bool			discard; /* Get rid of? */
 
@@ -593,8 +593,8 @@ struct cache_set {
 
 	/* Counts how many sectors bio_insert has added to the cache */
 	atomic_t		sectors_to_gc;
+	wait_queue_head_t	gc_wait;
 
-	wait_queue_head_t	moving_gc_wait;
 	struct keybuf		moving_gc_keys;
 	/* Number of moving GC bios in flight */
 	struct semaphore	moving_in_flight;
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 81d3db40cd..2efdce0724 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1757,32 +1757,34 @@ static void bch_btree_gc(struct cache_set *c)
bch_moving_gc(c);
 }
 
-static int bch_gc_thread(void *arg)
+static bool gc_should_run(struct cache_set *c)
 {
-   struct cache_set *c = arg;
struct cache *ca;
unsigned i;
 
-   while (1) {
-again:
-   bch_btree_gc(c);
+   for_each_cache(ca, c, i)
+   if (ca->invalidate_needs_gc)
+   return true;
 
-   set_current_state(TASK_INTERRUPTIBLE);
-   if (kthread_should_stop())
-   break;
+	if (atomic_read(&c->sectors_to_gc) < 0)
+   return true;
 
-   
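
For context, since the archived diff cuts off here, a reconstructed sketch - not
the verbatim patch - of how the gc thread's main loop reads once it is driven by
the wait_event()-based wakeup discussed above (gc_should_run() is the helper from
the hunk above, set_gc_sectors() the counter re-arm discussed in this thread):

	static int bch_gc_thread(void *arg)
	{
		struct cache_set *c = arg;

		while (1) {
			/* Sleep until something explicitly asks for gc (or we're stopping). */
			wait_event_interruptible(c->gc_wait,
				   kthread_should_stop() || gc_should_run(c));

			if (kthread_should_stop())
				break;

			/* Re-arm the trigger only once we've decided to run. */
			set_gc_sectors(c);
			bch_btree_gc(c);
		}

		return 0;
	}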

Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)

2016-10-24 Thread Kent Overstreet
On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote:
> On Wed, 19 Oct 2016, Peter Zijlstra wrote:
> 
> > Subject: sched: Better explain sleep/wakeup
> > From: Peter Zijlstra 
> > Date: Wed Oct 19 15:45:27 CEST 2016
> > 
> > There were a few questions wrt how sleep-wakeup works. Try and explain
> > it more.
> > 
> > Requested-by: Will Deacon 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > ---
> > include/linux/sched.h |   52 
> > ++
> > kernel/sched/core.c   |   15 +++---
> > 2 files changed, 44 insertions(+), 23 deletions(-)
> > 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> > #define set_task_state(tsk, state_value)\
> > do {\
> > (tsk)->task_state_change = _THIS_IP_;   \
> > -   smp_store_mb((tsk)->state, (state_value));  \
> > +   smp_store_mb((tsk)->state, (state_value));  \
> > } while (0)
> > 
> > -/*
> > - * set_current_state() includes a barrier so that the write of 
> > current->state
> > - * is correctly serialised wrt the caller's subsequent test of whether to
> > - * actually sleep:
> > - *
> > - * set_current_state(TASK_UNINTERRUPTIBLE);
> > - * if (do_i_need_to_sleep())
> > - * schedule();
> > - *
> > - * If the caller does not need such serialisation then use 
> > __set_current_state()
> > - */
> > #define __set_current_state(state_value)\
> > do {\
> > current->task_state_change = _THIS_IP_; \
> > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> > #define set_current_state(state_value)  \
> > do {\
> > current->task_state_change = _THIS_IP_; \
> > -   smp_store_mb(current->state, (state_value));\
> > +   smp_store_mb(current->state, (state_value));\
> > } while (0)
> > 
> > #else
> > 
> > +/*
> > + * @tsk had better be current, or you get to keep the pieces.
> 
> That reminds me we were getting rid of the set_task_state() calls. Bcache was
> pending, being only user in the kernel that doesn't actually use current; but
> instead breaks newly (yet blocked/uninterruptible) created garbage collection
> kthread. I cannot figure out why this is done (ie purposely accounting the
> load avg. Furthermore gc kicks in in very specific scenarios obviously, such
> as as by the allocator task, so I don't see why bcache gc should want to be
> interruptible.
> 
> Kent, Jens, can we get rid of this?
> 
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 76f7534d1dd1..6e3c358b5759 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1798,7 +1798,6 @@ int bch_gc_thread_start(struct cache_set *c)
>if (IS_ERR(c->gc_thread))
>return PTR_ERR(c->gc_thread);
> 
> -   set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
>return 0;
> }

Actually, that code looks broken, or at least stupid. Let me do a proper fix...


ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)

2016-10-23 Thread Davidlohr Bueso

On Wed, 19 Oct 2016, Peter Zijlstra wrote:


Subject: sched: Better explain sleep/wakeup
From: Peter Zijlstra 
Date: Wed Oct 19 15:45:27 CEST 2016

There were a few questions wrt how sleep-wakeup works. Try and explain
it more.

Requested-by: Will Deacon 
Signed-off-by: Peter Zijlstra (Intel) 
---
include/linux/sched.h |   52 ++
kernel/sched/core.c   |   15 +++---
2 files changed, 44 insertions(+), 23 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
#define set_task_state(tsk, state_value)\
do {\
(tsk)->task_state_change = _THIS_IP_;\
-   smp_store_mb((tsk)->state, (state_value));   \
+   smp_store_mb((tsk)->state, (state_value));   \
} while (0)

-/*
- * set_current_state() includes a barrier so that the write of current->state
- * is correctly serialised wrt the caller's subsequent test of whether to
- * actually sleep:
- *
- * set_current_state(TASK_UNINTERRUPTIBLE);
- * if (do_i_need_to_sleep())
- * schedule();
- *
- * If the caller does not need such serialisation then use 
__set_current_state()
- */
#define __set_current_state(state_value)\
do {\
current->task_state_change = _THIS_IP_;  \
@@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
#define set_current_state(state_value)  \
do {\
current->task_state_change = _THIS_IP_;  \
-   smp_store_mb(current->state, (state_value)); \
+   smp_store_mb(current->state, (state_value)); \
} while (0)

#else

+/*
+ * @tsk had better be current, or you get to keep the pieces.


That reminds me we were getting rid of the set_task_state() calls. Bcache was
pending, being the only user in the kernel that doesn't actually operate on
current, but instead on the newly created (yet blocked/uninterruptible) garbage
collection kthread. I cannot figure out why this is done (ie purposely
accounting it in the load avg). Furthermore gc kicks in in very specific
scenarios obviously, such as by the allocator task, so I don't see why bcache gc
should want to be interruptible.

Kent, Jens, can we get rid of this?

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 76f7534d1dd1..6e3c358b5759 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1798,7 +1798,6 @@ int bch_gc_thread_start(struct cache_set *c)
   if (IS_ERR(c->gc_thread))
   return PTR_ERR(c->gc_thread);

-   set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
   return 0;
}

Thanks,
Davidlohr


+ *
+ * The only reason is that computing current can be more expensive than
+ * using a pointer that's already available.
+ *
+ * Therefore, see set_current_state().
+ */
#define __set_task_state(tsk, state_value)  \
do { (tsk)->state = (state_value); } while (0)
#define set_task_state(tsk, state_value)\
@@ -299,11 +296,34 @@ extern char ___assert_task_state[1 - 2*!
 * is correctly serialised wrt the caller's subsequent test of whether to
 * actually sleep:
 *
+ *   for (;;) {
 *  set_current_state(TASK_UNINTERRUPTIBLE);
- * if (do_i_need_to_sleep())
- * schedule();
+ * if (!need_sleep)
+ * break;
+ *
+ * schedule();
+ *   }
+ *   __set_current_state(TASK_RUNNING);
+ *
+ * If the caller does not need such serialisation (because, for instance, the
+ * condition test and condition change and wakeup are under the same lock) then
+ * use __set_current_state().
+ *
+ * The above is typically ordered against the wakeup, which does:
+ *
+ * need_sleep = false;
+ * wake_up_state(p, TASK_UNINTERRUPTIBLE);
+ *
+ * Where wake_up_state() (and all other wakeup primitives) imply enough
+ * barriers to order the store of the variable against wakeup.
+ *
+ * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
+ * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
+ * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
+ *
+ * This is obviously fine, since they both store the exact same value.
 *
- * If the caller does not need such serialisation then use 
__set_current_state()
+ * Also see the comments of try_to_wake_up().
 */
#define __set_current_state(state_value)\
do { current->state = (state_value); } while (0)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2000,14 +2000,15 @@ static void ttwu_queue(struct task_struc
 * @state: the mask of task states that can be woken
 * @wake_flags: wake modifier flags (WF_*)
 *
- * Put 

Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

2016-10-19 Thread Peter Zijlstra
On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> > if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > first = true;
> > __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > }
> > +
> > +   set_task_state(task, state);
> 
> With this change, we no longer hold the lock when we set the task
> state, and it's ordered strictly *after* setting the HANDOFF flag.
> Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> the wakeup, but then we come in and overwrite the task state?
> 
> I'm struggling to work out whether that's an issue, but it certainly
> feels odd and is a change from the previous behaviour.

OK, so after a discussion on IRC the problem appears to have been
unfamiliarity with the basic sleep/wakeup scheme. Mutex used to be the
odd duck out for being fully serialized by wait_lock.

The below adds a few words on how the 'normal' sleep/wakeup scheme
works.


---
Subject: sched: Better explain sleep/wakeup
From: Peter Zijlstra 
Date: Wed Oct 19 15:45:27 CEST 2016

There were a few questions wrt how sleep-wakeup works. Try and explain
it more.

Requested-by: Will Deacon 
Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/sched.h |   52 ++
 kernel/sched/core.c   |   15 +++---
 2 files changed, 44 insertions(+), 23 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
 #define set_task_state(tsk, state_value)   \
do {\
(tsk)->task_state_change = _THIS_IP_;   \
-   smp_store_mb((tsk)->state, (state_value));  \
+   smp_store_mb((tsk)->state, (state_value));  \
} while (0)
 
-/*
- * set_current_state() includes a barrier so that the write of current->state
- * is correctly serialised wrt the caller's subsequent test of whether to
- * actually sleep:
- *
- * set_current_state(TASK_UNINTERRUPTIBLE);
- * if (do_i_need_to_sleep())
- * schedule();
- *
- * If the caller does not need such serialisation then use 
__set_current_state()
- */
 #define __set_current_state(state_value)   \
do {\
current->task_state_change = _THIS_IP_; \
@@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
 #define set_current_state(state_value) \
do {\
current->task_state_change = _THIS_IP_; \
-   smp_store_mb(current->state, (state_value));\
+   smp_store_mb(current->state, (state_value));\
} while (0)
 
 #else
 
+/*
+ * @tsk had better be current, or you get to keep the pieces.
+ *
+ * The only reason is that computing current can be more expensive than
+ * using a pointer that's already available.
+ *
+ * Therefore, see set_current_state().
+ */
 #define __set_task_state(tsk, state_value) \
do { (tsk)->state = (state_value); } while (0)
 #define set_task_state(tsk, state_value)   \
@@ -299,11 +296,34 @@ extern char ___assert_task_state[1 - 2*!
  * is correctly serialised wrt the caller's subsequent test of whether to
  * actually sleep:
  *
+ *   for (;;) {
  * set_current_state(TASK_UNINTERRUPTIBLE);
- * if (do_i_need_to_sleep())
- * schedule();
+ * if (!need_sleep)
+ * break;
+ *
+ * schedule();
+ *   }
+ *   __set_current_state(TASK_RUNNING);
+ *
+ * If the caller does not need such serialisation (because, for instance, the
+ * condition test and condition change and wakeup are under the same lock) then
+ * use __set_current_state().
+ *
+ * The above is typically ordered against the wakeup, which does:
+ *
+ * need_sleep = false;
+ * wake_up_state(p, TASK_UNINTERRUPTIBLE);
+ *
+ * Where wake_up_state() (and all other wakeup primitives) imply enough
+ * barriers to order the store of the variable against wakeup.
+ *
+ * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
+ * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
+ * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
+ *
+ * This is obviously fine, since they both store the exact same value.
  *
- * If the caller does not need such serialisation then use 
__set_current_state()
+ * Also see the comments of try_to_wake_up().
  */
 #define __set_current_state(state_value)   \
do { current->state = (state_value); } while (0)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2000,14 +2000,15 @@ static void ttwu_queue(struct task_struc
  * @state: the mask of task states that can be woken
  * @wake_flags: wake modifier flags 

Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

2016-10-18 Thread Peter Zijlstra
On Mon, Oct 17, 2016 at 07:16:50PM -0400, Waiman Long wrote:
> >+++ b/kernel/locking/mutex.c
> >@@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> >
> > lock_contended(&lock->dep_map, ip);
> >
> >+set_task_state(task, state);
> 
> Do we want to set the state here? I am not sure if it is OK to set the task
> state without ever calling schedule().

That's entirely fine; note how we'll set it back to RUNNING at the end.

> > for (;;) {
> >+/*
> >+ * Once we hold wait_lock, we're serialized against
> >+ * mutex_unlock() handing the lock off to us, do a trylock
> >+ * before testing the error conditions to make sure we pick up
> >+ * the handoff.
> >+ */
> > if (__mutex_trylock(lock, first))
> >-break;
> >+goto acquired;
> >
> > /*
> >- * got a signal? (This code gets eliminated in the
> >- * TASK_UNINTERRUPTIBLE case.)
> >+ * Check for signals and wound conditions while holding
> >+ * wait_lock. This ensures the lock cancellation is ordered
> >+ * against mutex_unlock() and wake-ups do not go missing.
> >  */
> > if (unlikely(signal_pending_state(state, task))) {
> > ret = -EINTR;
> >@@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> > goto err;
> > }
> >
> >-__set_task_state(task, state);
> > spin_unlock_mutex(&lock->wait_lock, flags);
> > schedule_preempt_disabled();
> >-spin_lock_mutex(&lock->wait_lock, flags);
> >
> > if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > first = true;
> > __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > }
> >+
> >+set_task_state(task, state);
> 
> I would suggest keep the __set_task_state() above and change
> set_task_state(task, state) to set_task_state(task, TASK_RUNNING) to provide
> the memory barrier. Then we don't need adding __set_task_state() calls
> below.

set_task_state(RUNNING) doesn't make sense, ever.

See the comment near set_task_state() for the reason it has a barrier.

We need it here because when we do that trylock (or optimistic spin) we
need to have set the state and done a barrier, otherwise we can miss a
wakeup and get stuck.
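
To spell the hazard out, a generic sketch of the sleeper pattern (not the mutex
code itself; need_to_wait() is a stand-in for the real condition):

	static void wait_for_event(void)
	{
		for (;;) {
			set_current_state(TASK_UNINTERRUPTIBLE);	/* store + barrier */
			if (!need_to_wait())				/* re-check *after* the store */
				break;
			schedule();
		}
		__set_current_state(TASK_RUNNING);
	}

	/*
	 * If the order were reversed - check need_to_wait() first, then set the
	 * state - the waker could make the condition true and call
	 * wake_up_process() in between: the wakeup would find the task still
	 * TASK_RUNNING and do nothing, and the task would then mark itself
	 * TASK_UNINTERRUPTIBLE and sleep with nobody left to wake it.
	 */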

> >+/*
> >+ * Here we order against unlock; we must either see it change
> >+ * state back to RUNNING and fall through the next schedule(),
> >+ * or we must see its unlock and acquire.
> >+ */
> >+if (__mutex_trylock(lock, first))
> >+break;
> >+
> 
> I don't think we need a trylock here since we are going to do it at the top
> of the loop within wait_lock anyway.

The idea was to avoid the wait-time of that lock acquire; also, this is
a place-holder for the optimistic spin site for the next patch.




Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

2016-10-17 Thread Waiman Long

On 10/07/2016 10:52 AM, Peter Zijlstra wrote:

Doesn't really matter yet, but pull the HANDOFF and trylock out from
under the wait_lock.

The intention is to add an optimistic spin loop here, which requires
we do not hold the wait_lock, so shuffle code around in preparation.

Also clarify the purpose of taking the wait_lock in the wait loop; it's
tempting to want to avoid it altogether, but the cancellation cases
need it to avoid losing wakeups.

Suggested-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
---
  kernel/locking/mutex.c |   30 +-
  1 file changed, 25 insertions(+), 5 deletions(-)



--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,

	lock_contended(&lock->dep_map, ip);

+   set_task_state(task, state);


Do we want to set the state here? I am not sure if it is OK to set the 
task state without ever calling schedule().



for (;;) {
+   /*
+* Once we hold wait_lock, we're serialized against
+* mutex_unlock() handing the lock off to us, do a trylock
+* before testing the error conditions to make sure we pick up
+* the handoff.
+*/
if (__mutex_trylock(lock, first))
-   break;
+   goto acquired;

/*
-* got a signal? (This code gets eliminated in the
-* TASK_UNINTERRUPTIBLE case.)
+* Check for signals and wound conditions while holding
+* wait_lock. This ensures the lock cancellation is ordered
+* against mutex_unlock() and wake-ups do not go missing.
 */
if (unlikely(signal_pending_state(state, task))) {
ret = -EINTR;
@@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
goto err;
}

-   __set_task_state(task, state);
	spin_unlock_mutex(&lock->wait_lock, flags);
	schedule_preempt_disabled();
-	spin_lock_mutex(&lock->wait_lock, flags);

	if (!first && __mutex_waiter_is_first(lock, &waiter)) {
first = true;
__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
}
+
+   set_task_state(task, state);


I would suggest keeping the __set_task_state() above and changing
set_task_state(task, state) to set_task_state(task, TASK_RUNNING) to
provide the memory barrier. Then we don't need to add __set_task_state()
calls below.



+   /*
+* Here we order against unlock; we must either see it change
+* state back to RUNNING and fall through the next schedule(),
+* or we must see its unlock and acquire.
+*/
+   if (__mutex_trylock(lock, first))
+   break;
+


I don't think we need a trylock here since we are going to do it at the 
top of the loop within wait_lock anyway.



+	spin_lock_mutex(&lock->wait_lock, flags);
	}
+	spin_lock_mutex(&lock->wait_lock, flags);
+acquired:
	__set_task_state(task, TASK_RUNNING);

	mutex_remove_waiter(lock, &waiter, task);
@@ -682,6 +701,7 @@ __mutex_lock_common(struct mutex *lock,
return 0;

  err:
+   __set_task_state(task, TASK_RUNNING);
	mutex_remove_waiter(lock, &waiter, task);
	spin_unlock_mutex(&lock->wait_lock, flags);
	debug_mutex_free_waiter(&waiter);




Cheers,
Longman


Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

2016-10-17 Thread Peter Zijlstra
On Mon, Oct 17, 2016 at 09:45:01PM +0800, Boqun Feng wrote:
> But __mutex_set_flag() and __mutex_trylock() actually touch the same
> atomic word? So we don't need extra things to order them?

Right.. in any case brain is confused. I'll look again at it later.


Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

2016-10-17 Thread Boqun Feng
On Mon, Oct 17, 2016 at 03:24:08PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 17, 2016 at 12:44:49PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> > > Hi Peter,
> > > 
> > > I'm struggling to get my head around the handoff code after this change...
> > > 
> > > On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> > > > --- a/kernel/locking/mutex.c
> > > > +++ b/kernel/locking/mutex.c
> > > > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> > > >  
> > > > lock_contended(&lock->dep_map, ip);
> > > >  
> > > > +   set_task_state(task, state);
> > > > for (;;) {
> > > > +   /*
> > > > +* Once we hold wait_lock, we're serialized against
> > > > +* mutex_unlock() handing the lock off to us, do a 
> > > > trylock
> > > > +* before testing the error conditions to make sure we 
> > > > pick up
> > > > +* the handoff.
> > > > +*/
> > > > if (__mutex_trylock(lock, first))
> > > > -   break;
> > > > +   goto acquired;
> > > >  
> > > > /*
> > > > -* got a signal? (This code gets eliminated in the
> > > > -* TASK_UNINTERRUPTIBLE case.)
> > > > +* Check for signals and wound conditions while holding
> > > > +* wait_lock. This ensures the lock cancellation is 
> > > > ordered
> > > > +* against mutex_unlock() and wake-ups do not go 
> > > > missing.
> > > >  */
> > > > if (unlikely(signal_pending_state(state, task))) {
> > > > ret = -EINTR;
> > > > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> > > > goto err;
> > > > }
> > > >  
> > > > -   __set_task_state(task, state);
> > > > spin_unlock_mutex(&lock->wait_lock, flags);
> > > > schedule_preempt_disabled();
> > > > -   spin_lock_mutex(&lock->wait_lock, flags);
> > > >  
> > > > if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > > > first = true;
> > > > __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > > > }
> > > > +
> > > > +   set_task_state(task, state);
> > > 
> > > With this change, we no longer hold the lock when we set the task
> > > state, and it's ordered strictly *after* setting the HANDOFF flag.
> > > Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> > > the wakeup, but then we come in and overwrite the task state?
> > > 
> > > I'm struggling to work out whether that's an issue, but it certainly
> > > feels odd and is a change from the previous behaviour.
> > 
> > Right, so I think the code is fine, since in that case the
> > __mutex_trylock() must see the handoff and we'll break the loop and
> > (re)set the state to RUNNING.
> > 
> > But you're right in that its slightly odd. I'll reorder them and put the
> > set_task_state() above the !first thing.
> 
> 
> Humm,.. we might actually rely on this order, since the MB implied by
> set_task_state() is the only thing that separates the store of
> __mutex_set_flag() from the load of __mutex_trylock(), and those should
> be ordered I think.
> 

But __mutex_set_flag() and __mutex_trylock() actually touch the same
atomic word? So we don't need extra things to order them?

Regards,
Boqun

> Argh, completely messed up my brain. I'll not touch it and think on this
> again tomorrow.
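
For reference, both helpers do touch the same atomic word in this series: the
HANDOFF flag is kept in the low bits of lock->owner, which is also the word
__mutex_trylock() cmpxchg()s. A rough userspace analogue of that layout, to
make the point concrete (the fake_* names and the C11 atomics are illustrative
stand-ins, not the kernel source):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FAKE_FLAG_HANDOFF 0x02UL

struct fake_mutex {
        _Atomic uintptr_t owner;        /* owning task pointer | flag bits */
};

static void fake_set_flag(struct fake_mutex *lock, uintptr_t flag)
{
        /* read-modify-write of the owner word */
        atomic_fetch_or(&lock->owner, flag);
}

static bool fake_trylock(struct fake_mutex *lock, uintptr_t me)
{
        /* owner holding only the handoff bit: unlocked, handed off to us */
        uintptr_t expected = FAKE_FLAG_HANDOFF;

        /* cmpxchg on the very same owner word the flag was OR'ed into, so
         * the two accesses are ordered by coherence on that one location */
        return atomic_compare_exchange_strong(&lock->owner, &expected, me);
}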




Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

2016-10-17 Thread Peter Zijlstra
On Mon, Oct 17, 2016 at 12:44:49PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> > Hi Peter,
> > 
> > I'm struggling to get my head around the handoff code after this change...
> > 
> > On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> > > --- a/kernel/locking/mutex.c
> > > +++ b/kernel/locking/mutex.c
> > > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> > >  
> > >   lock_contended(&lock->dep_map, ip);
> > >  
> > > + set_task_state(task, state);
> > >   for (;;) {
> > > + /*
> > > +  * Once we hold wait_lock, we're serialized against
> > > +  * mutex_unlock() handing the lock off to us, do a trylock
> > > +  * before testing the error conditions to make sure we pick up
> > > +  * the handoff.
> > > +  */
> > >   if (__mutex_trylock(lock, first))
> > > - break;
> > > + goto acquired;
> > >  
> > >   /*
> > > -  * got a signal? (This code gets eliminated in the
> > > -  * TASK_UNINTERRUPTIBLE case.)
> > > +  * Check for signals and wound conditions while holding
> > > +  * wait_lock. This ensures the lock cancellation is ordered
> > > +  * against mutex_unlock() and wake-ups do not go missing.
> > >*/
> > >   if (unlikely(signal_pending_state(state, task))) {
> > >   ret = -EINTR;
> > > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> > >   goto err;
> > >   }
> > >  
> > > - __set_task_state(task, state);
> > >   spin_unlock_mutex(&lock->wait_lock, flags);
> > >   schedule_preempt_disabled();
> > > - spin_lock_mutex(&lock->wait_lock, flags);
> > >  
> > >   if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > >   first = true;
> > >   __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > >   }
> > > +
> > > + set_task_state(task, state);
> > 
> > With this change, we no longer hold the lock when we set the task
> > state, and it's ordered strictly *after* setting the HANDOFF flag.
> > Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> > the wakeup, but then we come in and overwrite the task state?
> > 
> > I'm struggling to work out whether that's an issue, but it certainly
> > feels odd and is a change from the previous behaviour.
> 
> Right, so I think the code is fine, since in that case the
> __mutex_trylock() must see the handoff and we'll break the loop and
> (re)set the state to RUNNING.
> 
> But you're right in that its slightly odd. I'll reorder them and put the
> set_task_state() above the !first thing.


Humm,.. we might actually rely on this order, since the MB implied by
set_task_state() is the only thing that separates the store of
__mutex_set_flag() from the load of __mutex_trylock(), and those should
be ordered I think.

Argh, completely messed up my brain. I'll not touch it and think on this
again tomorrow.
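
Laid out, the waiter-side sequence being referred to, taken from the hunk
above and annotated (no new code, just the three statements in order):

        __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);    /* store to the owner word        */

        set_task_state(task, state);                   /* implies a full memory barrier  */

        if (__mutex_trylock(lock, first))              /* load/cmpxchg of the owner word */
                break;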


Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

2016-10-17 Thread Peter Zijlstra
On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> Hi Peter,
> 
> I'm struggling to get my head around the handoff code after this change...
> 
> On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> > --- a/kernel/locking/mutex.c
> > +++ b/kernel/locking/mutex.c
> > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> >  
> > lock_contended(&lock->dep_map, ip);
> >  
> > +   set_task_state(task, state);
> > for (;;) {
> > +   /*
> > +* Once we hold wait_lock, we're serialized against
> > +* mutex_unlock() handing the lock off to us, do a trylock
> > +* before testing the error conditions to make sure we pick up
> > +* the handoff.
> > +*/
> > if (__mutex_trylock(lock, first))
> > -   break;
> > +   goto acquired;
> >  
> > /*
> > -* got a signal? (This code gets eliminated in the
> > -* TASK_UNINTERRUPTIBLE case.)
> > +* Check for signals and wound conditions while holding
> > +* wait_lock. This ensures the lock cancellation is ordered
> > +* against mutex_unlock() and wake-ups do not go missing.
> >  */
> > if (unlikely(signal_pending_state(state, task))) {
> > ret = -EINTR;
> > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> > goto err;
> > }
> >  
> > -   __set_task_state(task, state);
> > spin_unlock_mutex(&lock->wait_lock, flags);
> > schedule_preempt_disabled();
> > -   spin_lock_mutex(&lock->wait_lock, flags);
> >  
> > if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > first = true;
> > __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > }
> > +
> > +   set_task_state(task, state);
> 
> With this change, we no longer hold the lock when we set the task
> state, and it's ordered strictly *after* setting the HANDOFF flag.
> Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> the wakeup, but then we come in and overwrite the task state?
> 
> I'm struggling to work out whether that's an issue, but it certainly
> feels odd and is a change from the previous behaviour.

Right, so I think the code is fine, since in that case the
__mutex_trylock() must see the handoff and we'll break the loop and
(re)set the state to RUNNING.

But you're right in that it's slightly odd. I'll reorder them and put the
set_task_state() above the !first thing.
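
As a sketch, the interleaving being discussed and how it resolves under this
argument (a hypothetical trace; the unlocker column summarizes the unlock
slowpath rather than quoting it):

/*
 * waiter                                     unlocker
 * ------                                     --------
 * __mutex_set_flag(lock, HANDOFF)
 *                                            sees HANDOFF, hands owner to waiter
 *                                            wakes the waiter (TASK_RUNNING)
 * set_task_state(task, state)
 *   (the wakeup's RUNNING is overwritten)
 * __mutex_trylock(lock, first) succeeds:
 *   the handoff is already visible, so the
 *   loop breaks and sets TASK_RUNNING again
 */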




Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop

2016-10-13 Thread Will Deacon
Hi Peter,

I'm struggling to get my head around the handoff code after this change...

On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
>  
>   lock_contended(&lock->dep_map, ip);
>  
> + set_task_state(task, state);
>   for (;;) {
> + /*
> +  * Once we hold wait_lock, we're serialized against
> +  * mutex_unlock() handing the lock off to us, do a trylock
> +  * before testing the error conditions to make sure we pick up
> +  * the handoff.
> +  */
>   if (__mutex_trylock(lock, first))
> - break;
> + goto acquired;
>  
>   /*
> -  * got a signal? (This code gets eliminated in the
> -  * TASK_UNINTERRUPTIBLE case.)
> +  * Check for signals and wound conditions while holding
> +  * wait_lock. This ensures the lock cancellation is ordered
> +  * against mutex_unlock() and wake-ups do not go missing.
>*/
>   if (unlikely(signal_pending_state(state, task))) {
>   ret = -EINTR;
> @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
>   goto err;
>   }
>  
> - __set_task_state(task, state);
>   spin_unlock_mutex(&lock->wait_lock, flags);
>   schedule_preempt_disabled();
> - spin_lock_mutex(&lock->wait_lock, flags);
>  
>   if (!first && __mutex_waiter_is_first(lock, &waiter)) {
>   first = true;
>   __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
>   }
> +
> + set_task_state(task, state);

With this change, we no longer hold the lock when we set the task
state, and it's ordered strictly *after* setting the HANDOFF flag.
Doesn't that mean that the unlock code can see the HANDOFF flag, issue
the wakeup, but then we come in and overwrite the task state?

I'm struggling to work out whether that's an issue, but it certainly
feels odd and is a change from the previous behaviour.

Will


[PATCH -v4 6/8] locking/mutex: Restructure wait loop

2016-10-07 Thread Peter Zijlstra
Doesn't really matter yet, but pull the HANDOFF and trylock out from
under the wait_lock.

The intention is to add an optimistic spin loop here, which requires
we do not hold the wait_lock, so shuffle code around in preparation.

Also clarify the purpose of taking the wait_lock in the wait loop; it's
tempting to want to avoid it altogether, but the cancellation cases
need it to avoid losing wakeups.

Suggested-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/locking/mutex.c |   30 +-
 1 file changed, 25 insertions(+), 5 deletions(-)

--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
 
lock_contended(&lock->dep_map, ip);
 
+   set_task_state(task, state);
for (;;) {
+   /*
+* Once we hold wait_lock, we're serialized against
+* mutex_unlock() handing the lock off to us, do a trylock
+* before testing the error conditions to make sure we pick up
+* the handoff.
+*/
if (__mutex_trylock(lock, first))
-   break;
+   goto acquired;
 
/*
-* got a signal? (This code gets eliminated in the
-* TASK_UNINTERRUPTIBLE case.)
+* Check for signals and wound conditions while holding
+* wait_lock. This ensures the lock cancellation is ordered
+* against mutex_unlock() and wake-ups do not go missing.
 */
if (unlikely(signal_pending_state(state, task))) {
ret = -EINTR;
@@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
goto err;
}
 
-   __set_task_state(task, state);
spin_unlock_mutex(&lock->wait_lock, flags);
schedule_preempt_disabled();
-   spin_lock_mutex(&lock->wait_lock, flags);
 
if (!first && __mutex_waiter_is_first(lock, &waiter)) {
first = true;
__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
}
+
+   set_task_state(task, state);
+   /*
+* Here we order against unlock; we must either see it change
+* state back to RUNNING and fall through the next schedule(),
+* or we must see its unlock and acquire.
+*/
+   if (__mutex_trylock(lock, first))
+   break;
+
+   spin_lock_mutex(&lock->wait_lock, flags);
}
+   spin_lock_mutex(&lock->wait_lock, flags);
+acquired:
__set_task_state(task, TASK_RUNNING);
 
mutex_remove_waiter(lock, &waiter, task);
@@ -682,6 +701,7 @@ __mutex_lock_common(struct mutex *lock,
return 0;
 
 err:
+   __set_task_state(task, TASK_RUNNING);
mutex_remove_waiter(lock, &waiter, task);
spin_unlock_mutex(&lock->wait_lock, flags);
debug_mutex_free_waiter(&waiter);
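
For readability, the wait loop as it stands after this patch, reconstructed
from the hunks above (a reading aid, not a verbatim copy of mutex.c; the
ww_mutex handling and the code before and after the loop are elided):

        set_task_state(task, state);
        for (;;) {
                /* trylock first: once we hold wait_lock we're serialized
                 * against mutex_unlock() handing the lock off to us */
                if (__mutex_trylock(lock, first))
                        goto acquired;

                if (unlikely(signal_pending_state(state, task))) {
                        ret = -EINTR;
                        goto err;
                }

                /* (ww_mutex wound check here, also branching to err) */

                spin_unlock_mutex(&lock->wait_lock, flags);
                schedule_preempt_disabled();

                if (!first && __mutex_waiter_is_first(lock, &waiter)) {
                        first = true;
                        __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
                }

                set_task_state(task, state);
                /* order against unlock: either see it set us back to RUNNING
                 * and fall through the next schedule(), or see its unlock
                 * and acquire */
                if (__mutex_trylock(lock, first))
                        break;

                spin_lock_mutex(&lock->wait_lock, flags);
        }
        spin_lock_mutex(&lock->wait_lock, flags);
acquired:
        __set_task_state(task, TASK_RUNNING);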



