Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)
On Tue, Oct 25, 2016 at 09:55:21AM -0700, Eric Wheeler wrote:
> Sure, I'll put it up with my -rc2 pull request to Jens.
>
> A couple of sanity checks (for my understanding at least):
>
> Why does bch_data_insert_start() no longer need to call
> set_gc_sectors(op->c) now that bch_cache_set_alloc and bch_gc_thread do?

Before, the gc thread wasn't being explicitly signalled that it was time to run - it was just being woken up, meaning that a spurious wakeup would cause gc to run. At best it was sketchy and fragile, for multiple reasons - wait_event() is a much better mechanism.

So with wait_event() gc explicitly checks whether it's time to run - the wakeup doesn't make anything happen by itself, it just pokes the gc thread so it's able to notice that it should run.

When you signal a thread this way, you're in effect setting a global variable that says "gc should run now", then kicking the gc thread so it can check that variable. But wakeups aren't synchronous - our call to wake_up() doesn't make the gc thread check that variable before it returns; all we know when wake_up() returns is that the gc thread is going to check that variable at some point in the future.

So we can't set the "gc should run" variable, wake up the gc thread, and then set it back to "gc shouldn't run yet" - what'll happen most of the time is that the gc thread won't run before we set the variable back to "gc shouldn't run yet", so it'll never see that it was supposed to run and it'll go back to sleep. The way you make this work is that the gc thread has to set the variable back to "gc shouldn't run yet" _after_ it's seen it and decided to run.

> Does bch_cache_set_alloc() even need to call set_gc_sectors since
> bch_gc_thread() does before calling bch_btree_gc?

Yes, because the gc thread only resets the counter when it's decided to run - we don't want it to run right away at startup.

> Also I'm curious, why change invalidate_needs_gc from a bitfield?
Bitfields are particularly unsafe for multiple threads to access - the compiler has to emit read/modify/write instructions for them, which can clobber adjacent data. A bare int is also not, in general, safe for multiple threads to access without a lock, but for what it's doing here it's fine.
Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)
On Mon, 24 Oct 2016, Kent Overstreet wrote: > On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote: > > On Wed, 19 Oct 2016, Peter Zijlstra wrote: > > > > > Subject: sched: Better explain sleep/wakeup > > > From: Peter Zijlstra > > > Date: Wed Oct 19 15:45:27 CEST 2016 > > > > > > There were a few questions wrt how sleep-wakeup works. Try and explain > > > it more. > > > > > > Requested-by: Will Deacon > > > Signed-off-by: Peter Zijlstra (Intel) > > > --- > > > include/linux/sched.h | 52 > > > ++ > > > kernel/sched/core.c | 15 +++--- > > > 2 files changed, 44 insertions(+), 23 deletions(-) > > > > > > --- a/include/linux/sched.h > > > +++ b/include/linux/sched.h > > > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*! > > > #define set_task_state(tsk, state_value) \ > > > do {\ > > > (tsk)->task_state_change = _THIS_IP_; \ > > > - smp_store_mb((tsk)->state, (state_value)); \ > > > + smp_store_mb((tsk)->state, (state_value)); \ > > > } while (0) > > > > > > -/* > > > - * set_current_state() includes a barrier so that the write of > > > current->state > > > - * is correctly serialised wrt the caller's subsequent test of whether to > > > - * actually sleep: > > > - * > > > - * set_current_state(TASK_UNINTERRUPTIBLE); > > > - * if (do_i_need_to_sleep()) > > > - * schedule(); > > > - * > > > - * If the caller does not need such serialisation then use > > > __set_current_state() > > > - */ > > > #define __set_current_state(state_value) \ > > > do {\ > > > current->task_state_change = _THIS_IP_; \ > > > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*! > > > #define set_current_state(state_value)\ > > > do {\ > > > current->task_state_change = _THIS_IP_; \ > > > - smp_store_mb(current->state, (state_value));\ > > > + smp_store_mb(current->state, (state_value));\ > > > } while (0) > > > > > > #else > > > > > > +/* > > > + * @tsk had better be current, or you get to keep the pieces. 
> > > > That reminds me we were getting rid of the set_task_state() calls. Bcache > > was > > pending, being only user in the kernel that doesn't actually use current; > > but > > instead breaks newly (yet blocked/uninterruptible) created garbage > > collection > > kthread. I cannot figure out why this is done (ie purposely accounting the > > load avg. Furthermore gc kicks in in very specific scenarios obviously, such > > as as by the allocator task, so I don't see why bcache gc should want to be > > interruptible. > > > > Kent, Jens, can we get rid of this? > > Here's a patch that just fixes the way gc gets woken up. Eric, you want to > take > this? Sure, I'll put it up with my -rc2 pull request to Jens. A couple of sanity checks (for my understanding at least): Why does bch_data_insert_start() no longer need to call set_gc_sectors(op->c) now that bch_cache_set_alloc and bch_gc_thread do? Does bch_cache_set_alloc() even need to call set_gc_sectors since bch_gc_thread() does before calling bch_btree_gc? Also I'm curious, why change invalidate_needs_gc from a bitfield? -Eric > > -- >8 -- > Subject: [PATCH] bcache: Make gc wakeup saner > > This lets us ditch a set_task_state() call. > > Signed-off-by: Kent Overstreet > --- > drivers/md/bcache/bcache.h | 4 ++-- > drivers/md/bcache/btree.c | 39 --- > drivers/md/bcache/btree.h | 3 +-- > drivers/md/bcache/request.c | 4 +--- > drivers/md/bcache/super.c | 2 ++ > 5 files changed, 26 insertions(+), 26 deletions(-) > > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h > index 6b420a55c7..c3ea03c9a1 100644 > --- a/drivers/md/bcache/bcache.h > +++ b/drivers/md/bcache/bcache.h > @@ -425,7 +425,7 @@ struct cache { >* until a gc finishes - otherwise we could pointlessly burn a ton of >* cpu >*/ > - unsignedinvalidate_needs_gc:1; > + unsignedinvalidate_needs_gc; > > booldiscard; /* Get rid of? 
*/ > > @@ -593,8 +593,8 @@ struct cache_set { > > /* Counts how many sectors bio_insert has added to the cache */ > atomic_tsectors_to_gc; > + wait_queue_head_t gc_wait; > > - wait_queue_head_t moving_gc_wait; > struct keybuf moving_gc_keys; > /* Number of moving GC bios in flight */ > struct semaphoremoving_in_flight; > diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c > index 81d3db40cd..2efdce0724 100644 > --- a/drivers/md/bcache/btree.c >
Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)
On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote: > On Wed, 19 Oct 2016, Peter Zijlstra wrote: > > > Subject: sched: Better explain sleep/wakeup > > From: Peter Zijlstra > > Date: Wed Oct 19 15:45:27 CEST 2016 > > > > There were a few questions wrt how sleep-wakeup works. Try and explain > > it more. > > > > Requested-by: Will Deacon > > Signed-off-by: Peter Zijlstra (Intel) > > --- > > include/linux/sched.h | 52 > > ++ > > kernel/sched/core.c | 15 +++--- > > 2 files changed, 44 insertions(+), 23 deletions(-) > > > > --- a/include/linux/sched.h > > +++ b/include/linux/sched.h > > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*! > > #define set_task_state(tsk, state_value)\ > > do {\ > > (tsk)->task_state_change = _THIS_IP_; \ > > - smp_store_mb((tsk)->state, (state_value)); \ > > + smp_store_mb((tsk)->state, (state_value)); \ > > } while (0) > > > > -/* > > - * set_current_state() includes a barrier so that the write of > > current->state > > - * is correctly serialised wrt the caller's subsequent test of whether to > > - * actually sleep: > > - * > > - * set_current_state(TASK_UNINTERRUPTIBLE); > > - * if (do_i_need_to_sleep()) > > - * schedule(); > > - * > > - * If the caller does not need such serialisation then use > > __set_current_state() > > - */ > > #define __set_current_state(state_value)\ > > do {\ > > current->task_state_change = _THIS_IP_; \ > > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*! > > #define set_current_state(state_value) \ > > do {\ > > current->task_state_change = _THIS_IP_; \ > > - smp_store_mb(current->state, (state_value));\ > > + smp_store_mb(current->state, (state_value));\ > > } while (0) > > > > #else > > > > +/* > > + * @tsk had better be current, or you get to keep the pieces. > > That reminds me we were getting rid of the set_task_state() calls. 
> Bcache was pending, being the only user in the kernel that doesn't actually
> use current, but instead breaks the newly created (yet
> blocked/uninterruptible) garbage collection kthread. I cannot figure out why
> this is done (ie purposely accounting the load avg). Furthermore gc kicks in
> in very specific scenarios obviously, such as by the allocator task, so I
> don't see why bcache gc should want to be interruptible.
>
> Kent, Jens, can we get rid of this?

Here's a patch that just fixes the way gc gets woken up. Eric, you want to take this?

-- >8 --
Subject: [PATCH] bcache: Make gc wakeup saner

This lets us ditch a set_task_state() call.

Signed-off-by: Kent Overstreet
---
 drivers/md/bcache/bcache.h  |  4 ++--
 drivers/md/bcache/btree.c   | 39 ---
 drivers/md/bcache/btree.h   |  3 +--
 drivers/md/bcache/request.c |  4 +---
 drivers/md/bcache/super.c   |  2 ++
 5 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 6b420a55c7..c3ea03c9a1 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -425,7 +425,7 @@ struct cache {
	 * until a gc finishes - otherwise we could pointlessly burn a ton of
	 * cpu
	 */
-	unsigned		invalidate_needs_gc:1;
+	unsigned		invalidate_needs_gc;

	bool			discard; /* Get rid of? */

@@ -593,8 +593,8 @@ struct cache_set {

	/* Counts how many sectors bio_insert has added to the cache */
	atomic_t		sectors_to_gc;
+	wait_queue_head_t	gc_wait;

-	wait_queue_head_t	moving_gc_wait;
	struct keybuf		moving_gc_keys;
	/* Number of moving GC bios in flight */
	struct semaphore	moving_in_flight;
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 81d3db40cd..2efdce0724 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1757,32 +1757,34 @@ static void bch_btree_gc(struct cache_set *c)
	bch_moving_gc(c);
 }

-static int bch_gc_thread(void *arg)
+static bool gc_should_run(struct cache_set *c)
 {
-	struct cache_set *c = arg;
	struct cache *ca;
	unsigned i;

-	while (1) {
-again:
-		bch_btree_gc(c);
+	for_each_cache(ca, c, i)
+		if (ca->invalidate_needs_gc)
+			return true;

-		set_current_state(TASK_INTERRUPTIBLE);
-		if (kthread_should_stop())
-			break;
+	if (atomic_read(&c->sectors_to_gc) < 0)
+		return true;
-
Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)
On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote: > On Wed, 19 Oct 2016, Peter Zijlstra wrote: > > > Subject: sched: Better explain sleep/wakeup > > From: Peter Zijlstra > > Date: Wed Oct 19 15:45:27 CEST 2016 > > > > There were a few questions wrt how sleep-wakeup works. Try and explain > > it more. > > > > Requested-by: Will Deacon > > Signed-off-by: Peter Zijlstra (Intel) > > --- > > include/linux/sched.h | 52 > > ++ > > kernel/sched/core.c | 15 +++--- > > 2 files changed, 44 insertions(+), 23 deletions(-) > > > > --- a/include/linux/sched.h > > +++ b/include/linux/sched.h > > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*! > > #define set_task_state(tsk, state_value)\ > > do {\ > > (tsk)->task_state_change = _THIS_IP_; \ > > - smp_store_mb((tsk)->state, (state_value)); \ > > + smp_store_mb((tsk)->state, (state_value)); \ > > } while (0) > > > > -/* > > - * set_current_state() includes a barrier so that the write of > > current->state > > - * is correctly serialised wrt the caller's subsequent test of whether to > > - * actually sleep: > > - * > > - * set_current_state(TASK_UNINTERRUPTIBLE); > > - * if (do_i_need_to_sleep()) > > - * schedule(); > > - * > > - * If the caller does not need such serialisation then use > > __set_current_state() > > - */ > > #define __set_current_state(state_value)\ > > do {\ > > current->task_state_change = _THIS_IP_; \ > > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*! > > #define set_current_state(state_value) \ > > do {\ > > current->task_state_change = _THIS_IP_; \ > > - smp_store_mb(current->state, (state_value));\ > > + smp_store_mb(current->state, (state_value));\ > > } while (0) > > > > #else > > > > +/* > > + * @tsk had better be current, or you get to keep the pieces. > > That reminds me we were getting rid of the set_task_state() calls. 
> Bcache was pending, being the only user in the kernel that doesn't actually
> use current, but instead breaks the newly created (yet
> blocked/uninterruptible) garbage collection kthread. I cannot figure out why
> this is done (ie purposely accounting the load avg). Furthermore gc kicks in
> in very specific scenarios obviously, such as by the allocator task, so I
> don't see why bcache gc should want to be interruptible.
>
> Kent, Jens, can we get rid of this?
>
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 76f7534d1dd1..6e3c358b5759 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1798,7 +1798,6 @@ int bch_gc_thread_start(struct cache_set *c)
> 	if (IS_ERR(c->gc_thread))
> 		return PTR_ERR(c->gc_thread);
>
> -	set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
> 	return 0;
> }

Actually, that code looks broken, or at least stupid. Let me do a proper fix...
ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)
On Wed, 19 Oct 2016, Peter Zijlstra wrote:

> Subject: sched: Better explain sleep/wakeup
> From: Peter Zijlstra
> Date: Wed Oct 19 15:45:27 CEST 2016
>
> There were a few questions wrt how sleep-wakeup works. Try and explain
> it more.
>
> [...]
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
>
> [...]
>
>  #else
>
> +/*
> + * @tsk had better be current, or you get to keep the pieces.
> + *
> + * The only reason is that computing current can be more expensive than
> + * using a pointer that's already available.
> + *
> + * Therefore, see set_current_state().
> + */
>  #define __set_task_state(tsk, state_value)		\
>  	do { (tsk)->state = (state_value); } while (0)

That reminds me we were getting rid of the set_task_state() calls.
Bcache was pending, being the only user in the kernel that doesn't
actually use current, but instead a newly created (yet
blocked/uninterruptible) garbage collection kthread. I cannot figure out
why this is done (ie purposely accounting the load avg?). Furthermore gc
kicks in in very specific scenarios obviously, such as by the allocator
task, so I don't see why bcache gc should want to be interruptible.

Kent, Jens, can we get rid of this?

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 76f7534d1dd1..6e3c358b5759 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1798,7 +1798,6 @@ int bch_gc_thread_start(struct cache_set *c)
 	if (IS_ERR(c->gc_thread))
 		return PTR_ERR(c->gc_thread);
 
-	set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
 	return 0;
 }

Thanks,
Davidlohr
Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:

> >  		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> >  			first = true;
> >  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> >  		}
> > +
> > +		set_task_state(task, state);
>
> With this change, we no longer hold the lock when we set the task
> state, and it's ordered strictly *after* setting the HANDOFF flag.
> Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> the wakeup, but then we come in and overwrite the task state?
>
> I'm struggling to work out whether that's an issue, but it certainly
> feels odd and is a change from the previous behaviour.

OK, so after a discussion on IRC the problem appears to have been
unfamiliarity with the basic sleep/wakeup scheme. Mutex used to be the
odd duck out for being fully serialized by wait_lock.

The below adds a few words on how the 'normal' sleep/wakeup scheme
works.

---
Subject: sched: Better explain sleep/wakeup
From: Peter Zijlstra
Date: Wed Oct 19 15:45:27 CEST 2016

There were a few questions wrt how sleep-wakeup works. Try and explain
it more.

Requested-by: Will Deacon
Signed-off-by: Peter Zijlstra (Intel)
---
 include/linux/sched.h |   52 ++
 kernel/sched/core.c   |   15 +++---
 2 files changed, 44 insertions(+), 23 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
 #define set_task_state(tsk, state_value)		\
 	do {						\
 		(tsk)->task_state_change = _THIS_IP_;	\
-		smp_store_mb((tsk)->state, (state_value));	\
+		smp_store_mb((tsk)->state, (state_value));	\
 	} while (0)
 
-/*
- * set_current_state() includes a barrier so that the write of current->state
- * is correctly serialised wrt the caller's subsequent test of whether to
- * actually sleep:
- *
- *	set_current_state(TASK_UNINTERRUPTIBLE);
- *	if (do_i_need_to_sleep())
- *		schedule();
- *
- * If the caller does not need such serialisation then use __set_current_state()
- */
 #define __set_current_state(state_value)		\
 	do {						\
 		current->task_state_change = _THIS_IP_;	\
@@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
 #define set_current_state(state_value)			\
 	do {						\
 		current->task_state_change = _THIS_IP_;	\
-		smp_store_mb(current->state, (state_value));	\
+		smp_store_mb(current->state, (state_value));	\
 	} while (0)
 
 #else
 
+/*
+ * @tsk had better be current, or you get to keep the pieces.
+ *
+ * The only reason is that computing current can be more expensive than
+ * using a pointer that's already available.
+ *
+ * Therefore, see set_current_state().
+ */
 #define __set_task_state(tsk, state_value)		\
 	do { (tsk)->state = (state_value); } while (0)
 #define set_task_state(tsk, state_value)		\
@@ -299,11 +296,34 @@ extern char ___assert_task_state[1 - 2*!
  * is correctly serialised wrt the caller's subsequent test of whether to
  * actually sleep:
  *
+ *   for (;;) {
  *	set_current_state(TASK_UNINTERRUPTIBLE);
- *	if (do_i_need_to_sleep())
- *		schedule();
+ *	if (!need_sleep)
+ *		break;
+ *
+ *	schedule();
+ *   }
+ *   __set_current_state(TASK_RUNNING);
+ *
+ * If the caller does not need such serialisation (because, for instance, the
+ * condition test and condition change and wakeup are under the same lock) then
+ * use __set_current_state().
+ *
+ * The above is typically ordered against the wakeup, which does:
+ *
+ *	need_sleep = false;
+ *	wake_up_state(p, TASK_UNINTERRUPTIBLE);
+ *
+ * Where wake_up_state() (and all other wakeup primitives) imply enough
+ * barriers to order the store of the variable against wakeup.
+ *
+ * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
+ * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
+ * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
+ *
+ * This is obviously fine, since they both store the exact same value.
  *
- * If the caller does not need such serialisation then use __set_current_state()
+ * Also see the comments of try_to_wake_up().
  */
 #define __set_current_state(state_value)		\
 	do { current->state = (state_value); } while (0)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2000,14 +2000,15 @@ static void ttwu_queue(struct task_struc
  * @state: the mask of
Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
On Mon, Oct 17, 2016 at 07:16:50PM -0400, Waiman Long wrote:

> > +++ b/kernel/locking/mutex.c
> > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> >
> >  	lock_contended(&lock->dep_map, ip);
> >
> > +	set_task_state(task, state);
>
> Do we want to set the state here? I am not sure if it is OK to set the
> task state without ever calling schedule().

That's entirely fine, note how we'll set it back to RUNNING at the end.

> >  	for (;;) {
> > +		/*
> > +		 * Once we hold wait_lock, we're serialized against
> > +		 * mutex_unlock() handing the lock off to us, do a trylock
> > +		 * before testing the error conditions to make sure we pick up
> > +		 * the handoff.
> > +		 */
> >  		if (__mutex_trylock(lock, first))
> > -			break;
> > +			goto acquired;
> >
> >  		/*
> > -		 * got a signal? (This code gets eliminated in the
> > -		 * TASK_UNINTERRUPTIBLE case.)
> > +		 * Check for signals and wound conditions while holding
> > +		 * wait_lock. This ensures the lock cancellation is ordered
> > +		 * against mutex_unlock() and wake-ups do not go missing.
> >  		 */
> >  		if (unlikely(signal_pending_state(state, task))) {
> >  			ret = -EINTR;
> > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> >  			goto err;
> >  		}
> >
> > -		__set_task_state(task, state);
> >  		spin_unlock_mutex(&lock->wait_lock, flags);
> >  		schedule_preempt_disabled();
> > -		spin_lock_mutex(&lock->wait_lock, flags);
> >
> >  		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> >  			first = true;
> >  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> >  		}
> > +
> > +		set_task_state(task, state);
>
> I would suggest keep the __set_task_state() above and change
> set_task_state(task, state) to set_task_state(task, TASK_RUNNING) to
> provide the memory barrier. Then we don't need adding __set_task_state()
> calls below.

set_task_state(RUNNING) doesn't make sense, ever. See the comment near
set_task_state() for the reason it has a barrier.

We need it here because when we do that trylock (or optimistic spin) we
need to have set the state and done a barrier, otherwise we can miss a
wakeup and get stuck.

> > +		/*
> > +		 * Here we order against unlock; we must either see it change
> > +		 * state back to RUNNING and fall through the next schedule(),
> > +		 * or we must see its unlock and acquire.
> > +		 */
> > +		if (__mutex_trylock(lock, first))
> > +			break;
> > +
>
> I don't think we need a trylock here since we are going to do it at the
> top of the loop within wait_lock anyway.

The idea was to avoid the wait-time of that lock acquire, also, this is
a place-holder for the optimistic spin site for the next patch.
Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
On 10/07/2016 10:52 AM, Peter Zijlstra wrote:

> Doesn't really matter yet, but pull the HANDOFF and trylock out from
> under the wait_lock.
>
> The intention is to add an optimistic spin loop here, which requires
> we do not hold the wait_lock, so shuffle code around in preparation.
>
> Also clarify the purpose of taking the wait_lock in the wait loop, its
> tempting to want to avoid it altogether, but the cancellation cases
> need it to avoid losing wakeups.
>
> Suggested-by: Waiman Long
> Signed-off-by: Peter Zijlstra (Intel)
> ---
>  kernel/locking/mutex.c |   30 +-
>  1 file changed, 25 insertions(+), 5 deletions(-)
>
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
>
>  	lock_contended(&lock->dep_map, ip);
>
> +	set_task_state(task, state);

Do we want to set the state here? I am not sure if it is OK to set the
task state without ever calling schedule().

>  	for (;;) {
> +		/*
> +		 * Once we hold wait_lock, we're serialized against
> +		 * mutex_unlock() handing the lock off to us, do a trylock
> +		 * before testing the error conditions to make sure we pick up
> +		 * the handoff.
> +		 */
>  		if (__mutex_trylock(lock, first))
> -			break;
> +			goto acquired;
>
>  		/*
> -		 * got a signal? (This code gets eliminated in the
> -		 * TASK_UNINTERRUPTIBLE case.)
> +		 * Check for signals and wound conditions while holding
> +		 * wait_lock. This ensures the lock cancellation is ordered
> +		 * against mutex_unlock() and wake-ups do not go missing.
>  		 */
>  		if (unlikely(signal_pending_state(state, task))) {
>  			ret = -EINTR;
> @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
>  			goto err;
>  		}
>
> -		__set_task_state(task, state);
>  		spin_unlock_mutex(&lock->wait_lock, flags);
>  		schedule_preempt_disabled();
> -		spin_lock_mutex(&lock->wait_lock, flags);
>
>  		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
>  			first = true;
>  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
>  		}
> +
> +		set_task_state(task, state);

I would suggest keep the __set_task_state() above and change
set_task_state(task, state) to set_task_state(task, TASK_RUNNING) to
provide the memory barrier. Then we don't need adding __set_task_state()
calls below.

> +		/*
> +		 * Here we order against unlock; we must either see it change
> +		 * state back to RUNNING and fall through the next schedule(),
> +		 * or we must see its unlock and acquire.
> +		 */
> +		if (__mutex_trylock(lock, first))
> +			break;
> +

I don't think we need a trylock here since we are going to do it at the
top of the loop within wait_lock anyway.

> +		spin_lock_mutex(&lock->wait_lock, flags);
>  	}
> +	spin_lock_mutex(&lock->wait_lock, flags);
> +acquired:
>  	__set_task_state(task, TASK_RUNNING);
>
>  	mutex_remove_waiter(lock, &waiter, task);
> @@ -682,6 +701,7 @@ __mutex_lock_common(struct mutex *lock,
>  	return 0;
>
> err:
> +	__set_task_state(task, TASK_RUNNING);
>  	mutex_remove_waiter(lock, &waiter, task);
>  	spin_unlock_mutex(&lock->wait_lock, flags);
>  	debug_mutex_free_waiter(&waiter);

Cheers,
Longman
Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
On Mon, Oct 17, 2016 at 09:45:01PM +0800, Boqun Feng wrote:
> But __mutex_set_flag() and __mutex_trylock() actually touch the same
> atomic word? So we don't need extra things to order them?

Right.. in any case brain is confused. I'll look again at it later.
Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
On Mon, Oct 17, 2016 at 03:24:08PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 17, 2016 at 12:44:49PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> > > Hi Peter,
> > >
> > > I'm struggling to get my head around the handoff code after this
> > > change...
> > >
> > > On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> > > > --- a/kernel/locking/mutex.c
> > > > +++ b/kernel/locking/mutex.c
> > > > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> > > >
> > > >  	lock_contended(&lock->dep_map, ip);
> > > >
> > > > +	set_task_state(task, state);
> > > >  	for (;;) {
> > > > +		/*
> > > > +		 * Once we hold wait_lock, we're serialized against
> > > > +		 * mutex_unlock() handing the lock off to us, do a trylock
> > > > +		 * before testing the error conditions to make sure we pick up
> > > > +		 * the handoff.
> > > > +		 */
> > > >  		if (__mutex_trylock(lock, first))
> > > > -			break;
> > > > +			goto acquired;
> > > >
> > > >  		/*
> > > > -		 * got a signal? (This code gets eliminated in the
> > > > -		 * TASK_UNINTERRUPTIBLE case.)
> > > > +		 * Check for signals and wound conditions while holding
> > > > +		 * wait_lock. This ensures the lock cancellation is ordered
> > > > +		 * against mutex_unlock() and wake-ups do not go missing.
> > > >  		 */
> > > >  		if (unlikely(signal_pending_state(state, task))) {
> > > >  			ret = -EINTR;
> > > > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> > > >  			goto err;
> > > >  		}
> > > >
> > > > -		__set_task_state(task, state);
> > > >  		spin_unlock_mutex(&lock->wait_lock, flags);
> > > >  		schedule_preempt_disabled();
> > > > -		spin_lock_mutex(&lock->wait_lock, flags);
> > > >
> > > >  		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > > >  			first = true;
> > > >  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > > >  		}
> > > > +
> > > > +		set_task_state(task, state);
> > >
> > > With this change, we no longer hold the lock when we set the task
> > > state, and it's ordered strictly *after* setting the HANDOFF flag.
> > > Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> > > the wakeup, but then we come in and overwrite the task state?
> > >
> > > I'm struggling to work out whether that's an issue, but it certainly
> > > feels odd and is a change from the previous behaviour.
> >
> > Right, so I think the code is fine, since in that case the
> > __mutex_trylock() must see the handoff and we'll break the loop and
> > (re)set the state to RUNNING.
> >
> > But you're right in that its slightly odd. I'll reorder them and put the
> > set_task_state() above the !first thing.
>
> Humm,.. we might actually rely on this order, since the MB implied by
> set_task_state() is the only thing that separates the store of
> __mutex_set_flag() from the load of __mutex_trylock(), and those should
> be ordered I think.

But __mutex_set_flag() and __mutex_trylock() actually touch the same
atomic word? So we don't need extra things to order them?

Regards,
Boqun

> Argh, completely messed up my brain. I'll not touch it and think on this
> again tomorrow.
Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
On Mon, Oct 17, 2016 at 12:44:49PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> > Hi Peter,
> >
> > I'm struggling to get my head around the handoff code after this
> > change...
> >
> > On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> > > --- a/kernel/locking/mutex.c
> > > +++ b/kernel/locking/mutex.c
> > > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> > >
> > >  	lock_contended(&lock->dep_map, ip);
> > >
> > > +	set_task_state(task, state);
> > >  	for (;;) {
> > > +		/*
> > > +		 * Once we hold wait_lock, we're serialized against
> > > +		 * mutex_unlock() handing the lock off to us, do a trylock
> > > +		 * before testing the error conditions to make sure we pick up
> > > +		 * the handoff.
> > > +		 */
> > >  		if (__mutex_trylock(lock, first))
> > > -			break;
> > > +			goto acquired;
> > >
> > >  		/*
> > > -		 * got a signal? (This code gets eliminated in the
> > > -		 * TASK_UNINTERRUPTIBLE case.)
> > > +		 * Check for signals and wound conditions while holding
> > > +		 * wait_lock. This ensures the lock cancellation is ordered
> > > +		 * against mutex_unlock() and wake-ups do not go missing.
> > >  		 */
> > >  		if (unlikely(signal_pending_state(state, task))) {
> > >  			ret = -EINTR;
> > > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> > >  			goto err;
> > >  		}
> > >
> > > -		__set_task_state(task, state);
> > >  		spin_unlock_mutex(&lock->wait_lock, flags);
> > >  		schedule_preempt_disabled();
> > > -		spin_lock_mutex(&lock->wait_lock, flags);
> > >
> > >  		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > >  			first = true;
> > >  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > >  		}
> > > +
> > > +		set_task_state(task, state);
> >
> > With this change, we no longer hold the lock when we set the task
> > state, and it's ordered strictly *after* setting the HANDOFF flag.
> > Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> > the wakeup, but then we come in and overwrite the task state?
> >
> > I'm struggling to work out whether that's an issue, but it certainly
> > feels odd and is a change from the previous behaviour.
>
> Right, so I think the code is fine, since in that case the
> __mutex_trylock() must see the handoff and we'll break the loop and
> (re)set the state to RUNNING.
>
> But you're right in that its slightly odd. I'll reorder them and put the
> set_task_state() above the !first thing.

Humm,.. we might actually rely on this order, since the MB implied by
set_task_state() is the only thing that separates the store of
__mutex_set_flag() from the load of __mutex_trylock(), and those should
be ordered I think.

Argh, completely messed up my brain. I'll not touch it and think on this
again tomorrow.
Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> Hi Peter,
>
> I'm struggling to get my head around the handoff code after this change...
>
> On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> > --- a/kernel/locking/mutex.c
> > +++ b/kernel/locking/mutex.c
> > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> >
> >  	lock_contended(&lock->dep_map, ip);
> >
> > +	set_task_state(task, state);
> >  	for (;;) {
> > +		/*
> > +		 * Once we hold wait_lock, we're serialized against
> > +		 * mutex_unlock() handing the lock off to us, do a trylock
> > +		 * before testing the error conditions to make sure we pick up
> > +		 * the handoff.
> > +		 */
> >  		if (__mutex_trylock(lock, first))
> > -			break;
> > +			goto acquired;
> >
> >  		/*
> > -		 * got a signal? (This code gets eliminated in the
> > -		 * TASK_UNINTERRUPTIBLE case.)
> > +		 * Check for signals and wound conditions while holding
> > +		 * wait_lock. This ensures the lock cancellation is ordered
> > +		 * against mutex_unlock() and wake-ups do not go missing.
> >  		 */
> >  		if (unlikely(signal_pending_state(state, task))) {
> >  			ret = -EINTR;
> > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> >  			goto err;
> >  		}
> >
> > -		__set_task_state(task, state);
> >  		spin_unlock_mutex(&lock->wait_lock, flags);
> >  		schedule_preempt_disabled();
> > -		spin_lock_mutex(&lock->wait_lock, flags);
> >
> >  		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> >  			first = true;
> >  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> >  		}
> > +
> > +		set_task_state(task, state);
>
> With this change, we no longer hold the lock when we set the task
> state, and it's ordered strictly *after* setting the HANDOFF flag.
> Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> the wakeup, but then we come in and overwrite the task state?
>
> I'm struggling to work out whether that's an issue, but it certainly
> feels odd and is a change from the previous behaviour.
Right, so I think the code is fine, since in that case the __mutex_trylock() must see the handoff and we'll break the loop and (re)set the state to RUNNING.

But you're right in that it's slightly odd. I'll reorder them and put the set_task_state() above the !first thing.
Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
Hi Peter,

I'm struggling to get my head around the handoff code after this change...

On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
>
>  	lock_contended(&lock->dep_map, ip);
>
> +	set_task_state(task, state);
>  	for (;;) {
> +		/*
> +		 * Once we hold wait_lock, we're serialized against
> +		 * mutex_unlock() handing the lock off to us, do a trylock
> +		 * before testing the error conditions to make sure we pick up
> +		 * the handoff.
> +		 */
>  		if (__mutex_trylock(lock, first))
> -			break;
> +			goto acquired;
>
>  		/*
> -		 * got a signal? (This code gets eliminated in the
> -		 * TASK_UNINTERRUPTIBLE case.)
> +		 * Check for signals and wound conditions while holding
> +		 * wait_lock. This ensures the lock cancellation is ordered
> +		 * against mutex_unlock() and wake-ups do not go missing.
>  		 */
>  		if (unlikely(signal_pending_state(state, task))) {
>  			ret = -EINTR;
> @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
>  			goto err;
>  		}
>
> -		__set_task_state(task, state);
>  		spin_unlock_mutex(&lock->wait_lock, flags);
>  		schedule_preempt_disabled();
> -		spin_lock_mutex(&lock->wait_lock, flags);
>
>  		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
>  			first = true;
>  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
>  		}
> +
> +		set_task_state(task, state);

With this change, we no longer hold the lock when we set the task state, and it's ordered strictly *after* setting the HANDOFF flag. Doesn't that mean that the unlock code can see the HANDOFF flag, issue the wakeup, but then we come in and overwrite the task state?

I'm struggling to work out whether that's an issue, but it certainly feels odd and is a change from the previous behaviour.

Will
[PATCH -v4 6/8] locking/mutex: Restructure wait loop
Doesn't really matter yet, but pull the HANDOFF and trylock out from under the wait_lock. The intention is to add an optimistic spin loop here, which requires we do not hold the wait_lock, so shuffle code around in preparation.

Also clarify the purpose of taking the wait_lock in the wait loop: it's tempting to want to avoid it altogether, but the cancellation cases need it to avoid losing wakeups.

Suggested-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/locking/mutex.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
 
 	lock_contended(&lock->dep_map, ip);
 
+	set_task_state(task, state);
 	for (;;) {
+		/*
+		 * Once we hold wait_lock, we're serialized against
+		 * mutex_unlock() handing the lock off to us, do a trylock
+		 * before testing the error conditions to make sure we pick up
+		 * the handoff.
+		 */
 		if (__mutex_trylock(lock, first))
-			break;
+			goto acquired;
 
 		/*
-		 * got a signal? (This code gets eliminated in the
-		 * TASK_UNINTERRUPTIBLE case.)
+		 * Check for signals and wound conditions while holding
+		 * wait_lock. This ensures the lock cancellation is ordered
+		 * against mutex_unlock() and wake-ups do not go missing.
 		 */
 		if (unlikely(signal_pending_state(state, task))) {
 			ret = -EINTR;
@@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
 			goto err;
 		}
 
-		__set_task_state(task, state);
 		spin_unlock_mutex(&lock->wait_lock, flags);
 		schedule_preempt_disabled();
-		spin_lock_mutex(&lock->wait_lock, flags);
 
 		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
 			first = true;
 			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
 		}
+
+		set_task_state(task, state);
+		/*
+		 * Here we order against unlock; we must either see it change
+		 * state back to RUNNING and fall through the next schedule(),
+		 * or we must see its unlock and acquire.
+		 */
+		if (__mutex_trylock(lock, first))
+			break;
+
+		spin_lock_mutex(&lock->wait_lock, flags);
 	}
+	spin_lock_mutex(&lock->wait_lock, flags);
+acquired:
 	__set_task_state(task, TASK_RUNNING);
 
 	mutex_remove_waiter(lock, &waiter, task);
@@ -682,6 +701,7 @@ __mutex_lock_common(struct mutex *lock,
 	return 0;
 
 err:
+	__set_task_state(task, TASK_RUNNING);
 	mutex_remove_waiter(lock, &waiter, task);
 	spin_unlock_mutex(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);