Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Mon, Oct 12, 2015 at 09:09:24PM +0800, Boqun Feng wrote: > On Mon, Oct 12, 2015 at 01:54:38PM +0200, Peter Zijlstra wrote: > > On Mon, Oct 12, 2015 at 05:06:36PM +0800, Boqun Feng wrote: > > > Understood. > > > > > > But, IMO, the position of this section is already misleading: > > > > > > (*) Implicit kernel memory barriers. > > > - Locking functions. > > > - Interrupt disabling functions. > > >->- Sleep and wake-up functions.<- > > > - Miscellaneous functions. > > > > > > I read it as that sleep and wake-up functions provide some kernel memory > > > barriers which we can use *externally*(outside sleep/wakeup themselves). > > > > I think it is useful to state that the primitives handle the ordering > > between the waker and wakee wrt the 'blocking' state. > > > > I agree that's useful, however, the 'blocking' state is something > internal for sleep and wakeup, right? Not entirely; its also the @cond thing in wait queues. IE: for (;;) set_current_state(TASK_INTERRUPTIBLE); if (@cond) break; schedule(); } __set_current_state(TASK_RUNNING); vs. @cond = true; wake_up_process(p); So we guarantee that 'p' will see the @cond stores IF it does the wakeup. (If it does not, ie. 'p' wasn't sleeping, any guarantee is out the window). > Not sure whether the users of > wake_up() and wait_event() will care much about this or they need to > understand that detailedly to use wake_up() and wait_event() correctly. I think its mostly natural; but it explains why you don't have to do: wait_event(wq, @cond); vs. @cond = true; smp_wmb(); wake_up(wq); (or worse...) > > But I've not put much thought into wording. I wanted to finish process > > order 'comment' patch first. > > Of course. Actually your 'comment' patch is the reason why I think this > section may be removed. Yes, that is another option, referring to the comment, once that's sorted. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Mon, Oct 12, 2015 at 01:54:38PM +0200, Peter Zijlstra wrote: > On Mon, Oct 12, 2015 at 05:06:36PM +0800, Boqun Feng wrote: > > Understood. > > > > But, IMO, the position of this section is already misleading: > > > > (*) Implicit kernel memory barriers. > > - Locking functions. > > - Interrupt disabling functions. > >->- Sleep and wake-up functions.<- > > - Miscellaneous functions. > > > > I read it as that sleep and wake-up functions provide some kernel memory > > barriers which we can use *externally*(outside sleep/wakeup themselves). > > I think it is useful to state that the primitives handle the ordering > between the waker and wakee wrt the 'blocking' state. > I agree that's useful, however, the 'blocking' state is something internal for sleep and wakeup, right? Not sure whether the users of wake_up() and wait_event() will care much about this or they need to understand that detailedly to use wake_up() and wait_event() correctly. I treat this part of memory-barriers.txt as an API document to describe the implicit barriers in some primitives, which can be used *externally* by someone, but anyway, that's just my own opinion ;-) > But I've not put much thought into wording. I wanted to finish process > order 'comment' patch first. Of course. Actually your 'comment' patch is the reason why I think this section may be removed. Regards, Boqun signature.asc Description: PGP signature
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Mon, Oct 12, 2015 at 05:06:36PM +0800, Boqun Feng wrote: > Understood. > > But, IMO, the position of this section is already misleading: > > (*) Implicit kernel memory barriers. > - Locking functions. > - Interrupt disabling functions. >->- Sleep and wake-up functions.<- > - Miscellaneous functions. > > I read it as that sleep and wake-up functions provide some kernel memory > barriers which we can use *externally*(outside sleep/wakeup themselves). I think it is useful to state that the primitives handle the ordering between the waker and wakee wrt the 'blocking' state. But I've not put much thought into wording. I wanted to finish process order 'comment' patch first. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Sun, Oct 11, 2015 at 05:40:44PM -0700, Paul E. McKenney wrote: > On Sun, Oct 11, 2015 at 11:26:40PM +0800, Boqun Feng wrote: > > On Tue, Oct 06, 2015 at 06:06:50PM +0200, Peter Zijlstra wrote: > > > On Thu, Sep 24, 2015 at 09:21:22PM +0800, Boqun Feng wrote: > > > > > Included in it are some of the details on this subject, because a > > > > > wakeup > > > > > has two prior states that are of importance, the tasks own prior state > > > > > and the wakeup state, both should be considered in the 'program order' > > > > > flow. > > > > > > > > > > > > > Great and very helpful ;-) > > > > > > > > > So maybe we can reduce the description in memory-barriers to this > > > > > 'split' program order guarantee, where a woken task must observe both > > > > > its own prior state and its wakee state. > > > > ^ > > > > I think you mean "waker" here, right? > > > > > > Yes. > > > > > > > And the waker is not necessarily the same task who set the @cond to > > > > true, right? > > > > > > It should be. > > > > > > > If so, I feel like it's really hard to *use* this 'split' > > > > program order guarantee in other places than sleep/wakeup itself. Could > > > > you give an example? Thank you. > > > > > > It was not meant to be used in any other scenario; the 'split' PO really > > > is part of the whole sleep/wakeup. It does not apply to anything else. > > > > Got it. So at this point, I think it's better to remove the entire > > "Sleep and wake-up functions" section in memory-barriers.txt. Because > > this order guarantee is not for other users except sleep/wakeup. Any > > concern, Paul? > > The concern I have with just removing it is that it is all too easy for > people to assume that they provide ordering. So we should at least have Understood. But, IMO, the position of this section is already misleading: (*) Implicit kernel memory barriers. - Locking functions. - Interrupt disabling functions. ->- Sleep and wake-up functions.<- - Miscellaneous functions. I read it as that sleep and wake-up functions provide some kernel memory barriers which we can use *externally*(outside sleep/wakeup themselves). So how about something like: (*) Barriers only for internal use - Sleep and wake-up functions. Regards, Boqun > a section stating clearly that ordering is not guaranteed without help > from locks, release-acquire, explicit memory barriers, etc. > > Thanx, Paul > signature.asc Description: PGP signature
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Sun, Oct 11, 2015 at 11:26:40PM +0800, Boqun Feng wrote: > On Tue, Oct 06, 2015 at 06:06:50PM +0200, Peter Zijlstra wrote: > > On Thu, Sep 24, 2015 at 09:21:22PM +0800, Boqun Feng wrote: > > > > Included in it are some of the details on this subject, because a wakeup > > > > has two prior states that are of importance, the tasks own prior state > > > > and the wakeup state, both should be considered in the 'program order' > > > > flow. > > > > > > > > > > Great and very helpful ;-) > > > > > > > So maybe we can reduce the description in memory-barriers to this > > > > 'split' program order guarantee, where a woken task must observe both > > > > its own prior state and its wakee state. > > > ^ > > > I think you mean "waker" here, right? > > > > Yes. > > > > > And the waker is not necessarily the same task who set the @cond to > > > true, right? > > > > It should be. > > > > > If so, I feel like it's really hard to *use* this 'split' > > > program order guarantee in other places than sleep/wakeup itself. Could > > > you give an example? Thank you. > > > > It was not meant to be used in any other scenario; the 'split' PO really > > is part of the whole sleep/wakeup. It does not apply to anything else. > > Got it. So at this point, I think it's better to remove the entire > "Sleep and wake-up functions" section in memory-barriers.txt. Because > this order guarantee is not for other users except sleep/wakeup. Any > concern, Paul? The concern I have with just removing it is that it is all too easy for people to assume that they provide ordering. So we should at least have a section stating clearly that ordering is not guaranteed without help from locks, release-acquire, explicit memory barriers, etc. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Tue, Oct 06, 2015 at 06:06:50PM +0200, Peter Zijlstra wrote: > On Thu, Sep 24, 2015 at 09:21:22PM +0800, Boqun Feng wrote: > > > Included in it are some of the details on this subject, because a wakeup > > > has two prior states that are of importance, the tasks own prior state > > > and the wakeup state, both should be considered in the 'program order' > > > flow. > > > > > > > Great and very helpful ;-) > > > > > So maybe we can reduce the description in memory-barriers to this > > > 'split' program order guarantee, where a woken task must observe both > > > its own prior state and its wakee state. > > ^ > > I think you mean "waker" here, right? > > Yes. > > > And the waker is not necessarily the same task who set the @cond to > > true, right? > > It should be. > > > If so, I feel like it's really hard to *use* this 'split' > > program order guarantee in other places than sleep/wakeup itself. Could > > you give an example? Thank you. > > It was not meant to be used in any other scenario; the 'split' PO really > is part of the whole sleep/wakeup. It does not apply to anything else. Got it. So at this point, I think it's better to remove the entire "Sleep and wake-up functions" section in memory-barriers.txt. Because this order guarantee is not for other users except sleep/wakeup. Any concern, Paul? Regards, Boqun signature.asc Description: PGP signature
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Wed, Oct 07, 2015 at 01:10:24PM +0200, Peter Zijlstra wrote: > On Tue, Oct 06, 2015 at 06:04:50PM +0200, Peter Zijlstra wrote: > > That said, it would be good if Paul (or anyone really) can explain to me > > the reason for: 5af4692a75da ("smp: Make control dependencies work on > > Alpha, improve documentation"). The Changelog simply states that Alpha > > needs the mb, but not how/why etc. "Alpha AXP Architecture Reference Manual, Second Edition", Sites & Witek, page 5-20, Section 5.6.3, "Implied Barriers": In Alpha AXP, there are no implied barriers. If an implied barrier is needed for functionally correct access to shared data, it must be written as an explicit instruction. (Software must explicitly include any needed MB, WMB, or CALL_PAL IMB instructions.) On exactly how the hardware might choose not to respect control dependencies, I must defer to someone who knows about the Alpha hardware. On my last opportunity to discuss this with Alpha architects, I was concerned only about data dependencies, and didn't think to ask about control dependencies. > Also, I suppose Documentation/circular-buffer.txt needs help. Who wrote that??? ;-) How about the following? Thanx, Paul diff --git a/Documentation/circular-buffers.txt b/Documentation/circular-buffers.txt index 88951b179262..c71c0cab7bbb 100644 --- a/Documentation/circular-buffers.txt +++ b/Documentation/circular-buffers.txt @@ -161,7 +161,7 @@ The producer will look something like this: unsigned long head = buffer->head; /* The spin_unlock() and next spin_lock() provide needed ordering. */ - unsigned long tail = ACCESS_ONCE(buffer->tail); + unsigned long tail = READ_ONCE_CTRL(buffer->tail); if (CIRC_SPACE(head, tail, buffer->size) >= 1) { /* insert one item into the buffer */ @@ -222,7 +222,7 @@ This will instruct the CPU to make sure the index is up to date before reading the new item, and then it shall make sure the CPU has finished reading the item before it writes the new tail pointer, which will erase the item. -Note the use of ACCESS_ONCE() and smp_load_acquire() to read the +Note the use of READ_ONCE_CTRL() and smp_load_acquire() to read the opposition index. This prevents the compiler from discarding and reloading its cached value - which some compilers will do across smp_read_barrier_depends(). This isn't strictly needed if you can -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Tue, Oct 06, 2015 at 06:04:50PM +0200, Peter Zijlstra wrote: > That said, it would be good if Paul (or anyone really) can explain to me > the reason for: 5af4692a75da ("smp: Make control dependencies work on > Alpha, improve documentation"). The Changelog simply states that Alpha > needs the mb, but not how/why etc. Also, I suppose Documentation/circular-buffer.txt needs help. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
> On Mon, Sep 21, 2015 at 07:46:11PM +0200, Oleg Nesterov wrote: > > In short, I got lost ;) Now I don't even understand why we do not need > > another rmb() between p->on_rq and p->on_cpu. Suppose a thread T does > > > > set_current_state(...); > > schedule(); > > > > it can be preempted in between, after that we have "on_rq && !on_cpu". > > Then it gets CPU again and calls schedule() which clears on_rq. > > > > What guarantees that if ttwu() sees on_rq == 0 cleared by schedule() > > then it can _not_ still see the old value of on_cpu == 0? I think you're right. Does the below adequately explain things? I'll have another look tomorrow to see if I still agree with myself, but for now I think I've convinced myself you're right. --- Subject: sched: Fix race in try_to_wake_up() vs schedule() Oleg noticed that its possible to falsely observe p->on_cpu == 0 such that we'll prematurely continue with the wakeup and effectively run p on two CPUs at the same time. Even though the overlap is very limited; the task is in the middle of being scheduled out; it could still result in corruption of the scheduler data structures. CPU0CPU1 set_current_state(...) context_switch(X, Y) prepare_lock_switch(Y) Y->on_cpu = 1; finish_lock_switch(X) store_release(X->on_cpu, 0); try_to_wake_up(X) LOCK(p->pi_lock); t = X->on_cpu; // 0 context_switch(Y, X) prepare_lock_switch(X) X->on_cpu = 1; finish_lock_switch(Y) store_release(Y->on_cpu, 0); schedule(); deactivate_task(X); X->on_rq = 0; if (X->on_rq) // false if (t) while (X->on_cpu) cpu_relax(); context_switch(X, ..) finish_lock_switch(X) store_release(X->on_cpu, 0); Avoid the load of X->on_cpu being hoisted over the X->on_rq load. Reported-by: Oleg Nesterov Signed-off-by: Peter Zijlstra (Intel) --- kernel/sched/core.c | 19 +++ 1 file changed, 19 insertions(+) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2084,6 +2084,25 @@ try_to_wake_up(struct task_struct *p, un #ifdef CONFIG_SMP /* +* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be +* possible to, falsely, observe p->on_cpu == 0. +* +* One must be running (->on_cpu == 1) in order to remove oneself +* from the runqueue. +* +* [S] ->on_cpu = 1; [L] ->on_rq +* UNLOCK rq->lock +* RMB +* LOCK rq->lock +* [S] ->on_rq = 0;[L] ->on_cpu +* +* Pairs with the full barrier implied in the UNLOCK+LOCK on rq->lock +* from the consecutive calls to schedule(); the first switching to our +* task, the second putting it to sleep. +*/ + smp_rmb(); + + /* * If the owning (remote) cpu is still in the middle of schedule() with * this task as prev, wait until its done referencing the task. */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Tue, Oct 06, 2015 at 06:24:23PM +0200, Peter Zijlstra wrote: > On Tue, Oct 06, 2015 at 06:04:50PM +0200, Peter Zijlstra wrote: > > On Mon, Sep 21, 2015 at 07:46:11PM +0200, Oleg Nesterov wrote: > > > On 09/18, Peter Zijlstra wrote: > > > > > > > > the text is correct, right? > > > > > > Yes, it looks good to me and helpful. > > > > > > But damn. I forgot why exactly try_to_wake_up() needs rmb() after > > > ->on_cpu check... It looks reasonable in any case, but I do not > > > see any strong reason immediately. > > > > I read it like the smp_rmb() we have for > > acquire__after_spin_is_unlocked. Except, as you note below, we need to > > need an smp_read_barrier_depends for control barriers as well > > > Yes, but I'm not sure we should go write: > > > > while (READ_ONCE_CTRL(p->on_cpu)) > > cpu_relax(); > > > > Or: > > > > while (p->on_cpu) > > cpu_relax(); > > > > smp_read_barrier_depends(); > > > > It seems to me that doing the smp_mb() (for Alpha) inside the loop might > > be sub-optimal. > > And also referring to: > > lkml.kernel.org/r/20150812133109.ga8...@redhat.com > > Do we want something like this? > > #define smp_spin_acquire(cond) do { \ > while (cond)\ > cpu_relax();\ > smp_read_barrier_depends(); /* ctrl */ \ > smp_rmb(); /* ctrl + rmb := acquire */ \ > } while (0) > > And use it like: > > smp_spin_acquire(raw_spin_is_locked(&task->pi_lock)); > > That might work for your task_work_run() and the scheduler case, > although it might be somewhat awkward for sem_wait_array(). I could *really* use something like this for implementing power-saving busy loops for arch/arm64 (i.e. in the qrwlock code). We have a WFE instruction (wait for event) that can stop the processor clock and resume it when the exclusive monitor is cleared (i.e. a cacheline migrates to another CPU). That means we can implement a targetted wake-up when an unlocker writes to a node in a queued lock, which isn't something expressible with cpu_relax alone. Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Tue, Oct 06, 2015 at 06:04:50PM +0200, Peter Zijlstra wrote: > On Mon, Sep 21, 2015 at 07:46:11PM +0200, Oleg Nesterov wrote: > > On 09/18, Peter Zijlstra wrote: > > > > > > the text is correct, right? > > > > Yes, it looks good to me and helpful. > > > > But damn. I forgot why exactly try_to_wake_up() needs rmb() after > > ->on_cpu check... It looks reasonable in any case, but I do not > > see any strong reason immediately. > > I read it like the smp_rmb() we have for > acquire__after_spin_is_unlocked. Except, as you note below, we need to > need an smp_read_barrier_depends for control barriers as well > Yes, but I'm not sure we should go write: > > while (READ_ONCE_CTRL(p->on_cpu)) > cpu_relax(); > > Or: > > while (p->on_cpu) > cpu_relax(); > > smp_read_barrier_depends(); > > It seems to me that doing the smp_mb() (for Alpha) inside the loop might > be sub-optimal. And also referring to: lkml.kernel.org/r/20150812133109.ga8...@redhat.com Do we want something like this? #define smp_spin_acquire(cond) do { \ while (cond)\ cpu_relax();\ smp_read_barrier_depends(); /* ctrl */ \ smp_rmb(); /* ctrl + rmb := acquire */ \ } while (0) And use it like: smp_spin_acquire(raw_spin_is_locked(&task->pi_lock)); That might work for your task_work_run() and the scheduler case, although it might be somewhat awkward for sem_wait_array(). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Thu, Sep 24, 2015 at 09:21:22PM +0800, Boqun Feng wrote: > > Included in it are some of the details on this subject, because a wakeup > > has two prior states that are of importance, the tasks own prior state > > and the wakeup state, both should be considered in the 'program order' > > flow. > > > > Great and very helpful ;-) > > > So maybe we can reduce the description in memory-barriers to this > > 'split' program order guarantee, where a woken task must observe both > > its own prior state and its wakee state. > ^ > I think you mean "waker" here, right? Yes. > And the waker is not necessarily the same task who set the @cond to > true, right? It should be. > If so, I feel like it's really hard to *use* this 'split' > program order guarantee in other places than sleep/wakeup itself. Could > you give an example? Thank you. It was not meant to be used in any other scenario; the 'split' PO really is part of the whole sleep/wakeup. It does not apply to anything else. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Mon, Sep 21, 2015 at 07:46:11PM +0200, Oleg Nesterov wrote: > On 09/18, Peter Zijlstra wrote: > > > > the text is correct, right? > > Yes, it looks good to me and helpful. > > But damn. I forgot why exactly try_to_wake_up() needs rmb() after > ->on_cpu check... It looks reasonable in any case, but I do not > see any strong reason immediately. I read it like the smp_rmb() we have for acquire__after_spin_is_unlocked. Except, as you note below, we need to need an smp_read_barrier_depends for control barriers as well (I'm starting to think we're having more control deps what we were thinking...) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1947,7 +1947,13 @@ try_to_wake_up(struct task_struct *p, un while (p->on_cpu) cpu_relax(); /* -* Pairs with the smp_wmb() in finish_lock_switch(). +* Combined with the control dependency above, we have an effective +* smp_load_acquire() without the need for full barriers. +* +* Pairs with the smp_store_release() in finish_lock_switch(). +* +* This ensures that tasks getting woken will be fully ordered against +* their previous state and preserve Program Order. */ smp_rmb(); --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1073,6 +1073,9 @@ static inline void finish_lock_switch(st * We must ensure this doesn't happen until the switch is completely * finished. * +* In particular, the load of prev->state in finish_task_switch() must +* happen before this. +* * Pairs with the control dependency and rmb in try_to_wake_up(). */ smp_store_release(&prev->on_cpu, 0); Updates the comments to clarify the release/acquire pair on p->on_cpu. > Say, > > p->sched_contributes_to_load = !!task_contributes_to_load(p); > p->state = TASK_WAKING; > > we can actually do this before "while (p->on_cpu)", afaics. However > we must not do this before the previous p->on_rq check. No, we must not touch the task before p->on_cpu is cleared, up until that point the task is owned by the 'previous' CPU. > So perhaps this rmb() helps to ensure task_contributes_to_load() can't > happen before p->on_rq check... > > As for "p->state = TASK_WAKING" we have the control dependency in both > cases. But the modern fashion suggests to use _CTRL(). Yes, but I'm not sure we should go write: while (READ_ONCE_CTRL(p->on_cpu)) cpu_relax(); Or: while (p->on_cpu) cpu_relax(); smp_read_barrier_depends(); It seems to me that doing the smp_mb() (for Alpha) inside the loop might be sub-optimal. That said, it would be good if Paul (or anyone really) can explain to me the reason for: 5af4692a75da ("smp: Make control dependencies work on Alpha, improve documentation"). The Changelog simply states that Alpha needs the mb, but not how/why etc. > Although cpu_relax() > should imply barrier(), but afaik this is not documented. I think we're relying on that in many places.. > In short, I got lost ;) Now I don't even understand why we do not need > another rmb() between p->on_rq and p->on_cpu. Suppose a thread T does > > set_current_state(...); > schedule(); > > it can be preempted in between, after that we have "on_rq && !on_cpu". > Then it gets CPU again and calls schedule() which clears on_rq. > > What guarantees that if ttwu() sees on_rq == 0 cleared by schedule() > then it can _not_ still see the old value of on_cpu == 0? Right, let me go have a think about that ;-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
Hi Peter, On Thu, Sep 17, 2015 at 03:01:26PM +0200, Peter Zijlstra wrote: > On Thu, Sep 10, 2015 at 07:55:57PM +0200, Oleg Nesterov wrote: > > On 09/10, Boqun Feng wrote: > > > > > > On Wed, Sep 09, 2015 at 12:28:22PM -0700, Paul E. McKenney wrote: > > > > My feeling is > > > > that we should avoid saying too much about the internals of wait_event() > > > > and wake_up(). > > > > I feel the same. I simply can't understand what we are trying to > > document ;) > > So I've been sitting on this for a while and figured I'd finish it now. > > It are some notes on the scheduler locking and how it provides program > order guarantees on SMP systems. > > Included in it are some of the details on this subject, because a wakeup > has two prior states that are of importance, the tasks own prior state > and the wakeup state, both should be considered in the 'program order' > flow. > Great and very helpful ;-) > So maybe we can reduce the description in memory-barriers to this > 'split' program order guarantee, where a woken task must observe both > its own prior state and its wakee state. ^ I think you mean "waker" here, right? And the waker is not necessarily the same task who set the @cond to true, right? If so, I feel like it's really hard to *use* this 'split' program order guarantee in other places than sleep/wakeup itself. Could you give an example? Thank you. Regards, Boqun signature.asc Description: PGP signature
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On 09/18, Peter Zijlstra wrote: > > the text is correct, right? Yes, it looks good to me and helpful. But damn. I forgot why exactly try_to_wake_up() needs rmb() after ->on_cpu check... It looks reasonable in any case, but I do not see any strong reason immediately. Say, p->sched_contributes_to_load = !!task_contributes_to_load(p); p->state = TASK_WAKING; we can actually do this before "while (p->on_cpu)", afaics. However we must not do this before the previous p->on_rq check. So perhaps this rmb() helps to ensure task_contributes_to_load() can't happen before p->on_rq check... As for "p->state = TASK_WAKING" we have the control dependency in both cases. But the modern fashion suggests to use _CTRL(). Although cpu_relax() should imply barrier(), but afaik this is not documented. In short, I got lost ;) Now I don't even understand why we do not need another rmb() between p->on_rq and p->on_cpu. Suppose a thread T does set_current_state(...); schedule(); it can be preempted in between, after that we have "on_rq && !on_cpu". Then it gets CPU again and calls schedule() which clears on_rq. What guarantees that if ttwu() sees on_rq == 0 cleared by schedule() then it can _not_ still see the old value of on_cpu == 0? Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Thu, Sep 17, 2015 at 07:01:11PM +0200, Oleg Nesterov wrote: > On 09/17, Peter Zijlstra wrote: > > > > Included in it are some of the details on this subject, because a wakeup > > has two prior states that are of importance, the tasks own prior state > > and the wakeup state, both should be considered in the 'program order' > > flow. > > Great. Just one question, > > > + * BLOCKING -- aka. SLEEP + WAKEUP > > + * > > + * For blocking things are a little more interesting, because when we > > dequeue > > + * the task, we don't need to acquire the old rq lock in order to migrate > > it. > > + * > > + * Say CPU0 does a wait_event() and CPU1 does the wake() and migrates the > > task > > + * to CPU2 (the most complex example): > > + * > > + * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (sched_ttwu_pending) > > + * > > + * X->state = UNINTERRUPTIBLE > > + * MB > > + * if (cond) > > + * break > > + *cond = true > > + * > > + * WMB WMB (aka smp_mb__before_spinlock) > > Yes, both CPU's do WMB-aka-smp_mb__before_spinlock... > > But afaics in this particular case we do not really need them? > So perhaps we should not even mention them? > > Because (if I am right) this can confuse the reader who will try > to understand how/where do we rely on these barriers. Good point. Initially I put all barriers in, but now that we've figured out which are important (the text is correct, right? please double check) we can remove the rest. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On 09/17, Peter Zijlstra wrote: > > Included in it are some of the details on this subject, because a wakeup > has two prior states that are of importance, the tasks own prior state > and the wakeup state, both should be considered in the 'program order' > flow. Great. Just one question, > + * BLOCKING -- aka. SLEEP + WAKEUP > + * > + * For blocking things are a little more interesting, because when we dequeue > + * the task, we don't need to acquire the old rq lock in order to migrate it. > + * > + * Say CPU0 does a wait_event() and CPU1 does the wake() and migrates the > task > + * to CPU2 (the most complex example): > + * > + * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (sched_ttwu_pending) > + * > + * X->state = UNINTERRUPTIBLE > + * MB > + * if (cond) > + * break > + *cond = true > + * > + * WMB WMB (aka smp_mb__before_spinlock) Yes, both CPU's do WMB-aka-smp_mb__before_spinlock... But afaics in this particular case we do not really need them? So perhaps we should not even mention them? Because (if I am right) this can confuse the reader who will try to understand how/where do we rely on these barriers. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Thu, Sep 10, 2015 at 07:55:57PM +0200, Oleg Nesterov wrote: > On 09/10, Boqun Feng wrote: > > > > On Wed, Sep 09, 2015 at 12:28:22PM -0700, Paul E. McKenney wrote: > > > My feeling is > > > that we should avoid saying too much about the internals of wait_event() > > > and wake_up(). > > I feel the same. I simply can't understand what we are trying to > document ;) So I've been sitting on this for a while and figured I'd finish it now. It are some notes on the scheduler locking and how it provides program order guarantees on SMP systems. Included in it are some of the details on this subject, because a wakeup has two prior states that are of importance, the tasks own prior state and the wakeup state, both should be considered in the 'program order' flow. So maybe we can reduce the description in memory-barriers to this 'split' program order guarantee, where a woken task must observe both its own prior state and its wakee state. --- kernel/sched/core.c | 137 1 file changed, 137 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6ab415aa15c4..4fffde6f7c08 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1897,6 +1897,143 @@ static void ttwu_queue(struct task_struct *p, int cpu) raw_spin_unlock(&rq->lock); } +/* + * Notes on process order guarantees on SMP systems. + * + * + * PREEMPTION/MIGRATION + * + * Regular preemption/migration is safe because as long as the task is runnable + * migrations involve both rq locks, albeit not (necessarily) at the same time. + * + * So we get (we allow 3 CPU migrations): + * + * CPU0CPU1CPU2 + * + * LOCK rq(0)->lock + * sched-out X + * sched-in Y + * UNLOCK rq(0)->lock + * + * LOCK rq(0)->lock // MB against CPU0 + * dequeue X + * UNLOCK rq(0)->lock + * + * LOCK rq(1)->lock + * enqueue X + * UNLOCK rq(1)->lock + * + * LOCK rq(1)->lock // MB against CPU2 + * sched-out Z + * sched-in X + * UNLOCK rq(1)->lock + * + * and the first LOCK rq(0) on CPU2 gives a full order against the UNLOCK rq(0) + * on CPU0. Similarly the LOCK rq(1) on CPU1 provides full order against the + * UNLOCK rq(1) on CPU2, therefore by the time task X runs on CPU1 it must + * observe the state it left behind on CPU0. + * + * + * BLOCKING -- aka. SLEEP + WAKEUP + * + * For blocking things are a little more interesting, because when we dequeue + * the task, we don't need to acquire the old rq lock in order to migrate it. + * + * Say CPU0 does a wait_event() and CPU1 does the wake() and migrates the task + * to CPU2 (the most complex example): + * + * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (sched_ttwu_pending) + * + * X->state = UNINTERRUPTIBLE + * MB + * if (cond) + * break + *cond = true + * + * WMB WMB (aka smp_mb__before_spinlock) + * LOCK rq(0)->lock LOCK X->pi_lock + * + * dequeue X + *while (X->on_cpu) + * cpu_relax() + * sched-out X + * WMB + * X->on_cpu = 0 + *RMB + *X->state = WAKING + *set_task_cpu(X,1) + * WMB + * ti(X)->cpu = 2 + * + *llist_add(X, rq(2)) // MB + * llist_del_all() // MB + * + * LOCK rq(2)->lock + * enqueue X + * X->state = RUNNING + * UNLOCK rq(2)->lock + * + * LOCK rq(2)->lock + * sched-out Z + * sched-in X + * UNLOCK rq(1)->lock + * + * if (cond) // _TRUE_ + *UNLOCK X->pi_lock + * UNLOCK rq(0)->lock + * + * So in this case the scheduler does not provide an obvious full barrier; but + * stores on CPU0 cannot get delayed past the MB, the llist primitives order + * between CPU1 and CPU2 and X's loads on CPU2 cannot get before the last LOCK. + * + * Which again leads to the guarantee that by the time X gets to run on CPU2 + * it must observe the state it left behind on CPU0. + * + * However; for blocking there is a second guarantee we must provide, namely we + * must observe the state that lead to our wakeup. That is, not only must X + * observe its own prior state, it must also observe the @cond store. + * + * This too is achieved in the above, purely by the llist primitives ordering + * CPU1 to CPU2. + * + * There is
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
Hi Oleg, On Thu, Sep 10, 2015 at 07:55:57PM +0200, Oleg Nesterov wrote: > On 09/10, Boqun Feng wrote: > > > > On Wed, Sep 09, 2015 at 12:28:22PM -0700, Paul E. McKenney wrote: > > > My feeling is > > > that we should avoid saying too much about the internals of wait_event() > > > and wake_up(). > > I feel the same. I simply can't understand what we are trying to > document ;) > What I think we should document here is what memory ordering guarantee we can rely on with these sleep/wakeup primitives, or what kind of barriers these primitives imply. Because the structure of the memory-barriers.txt here is: (*) Implicit kernel memory barriers. - Locking functions. - Interrupt disabling functions. ->- Sleep and wake-up functions.<- - Miscellaneous functions. > For example, > > > A STORE-LOAD barrier is implied after setting task state by wait-related > > functions: > > > > prepare_to_wait(); > > prepare_to_wait_exclusive(); > > prepare_to_wait_event(); > > I won't argue, but to me this looks misleading too. > > Yes, prepare_to_wait()->set_current_state() implies mb() and thus > a STORE-LOAD barrier. > > But this has nothing to do with the explanation above. We do not > need this barrier to avoid the race with wake_up(). Again, again, > we can safely rely on wq->lock and acquire/release semantics. > Yes, you are right. prepare_to_wait*() should be put here. What should be put here is set_current_state(), whose STORE-LOAD barrier pairs with the STORE-LOAD barrier of wake_up_process(). > This barrier is only needed if you do, say, > > CONDITION = 1; > > if (waitqueue_active(wq)) > wake_up(wq); > > And note that the code above is wrong without another mb() after > CONDITION = 1. > Understood, I admit I didn't realize this before. To summarize, we have three kinds of data related to sleep/wakeup: * CONDITIONs: global data used to indicate events * task states * wait queues(may not be used, if users use set_current_state() + schedule() to sleep and wake_up_process() to wake up) IIUC, the race on wait queues are almost avoided because of wq->locks, and if a wait queue is used, race on task states are avoided because states are readed and written with a wq->lock held in sleep/wakeup functions. So only in two cases we need STORE-LOAD barriers to avoid the race: 1. no wait queue used(e.g. rcu_boost_kthread), we need STORE-LOAD to order accesses to task states and CONDITIONs, so we have barriers in wake_up_process() and set_current_state(). 2. wait queue accessed without a wq->lock held(e.g. your example), we need STORE-LOAD to order accesses to wait queues and CONDITIONs Since case #1 still exists in kernel, we'd better keep this section in memory-barriers.txt, however, I'm not sure whether we should mention case #2 in this section. Here is a modified version, without mentioning case #2: SLEEP AND WAKE-UP FUNCTIONS --- Sleeping and waking on an event flagged in global data can be viewed as an interaction between two pieces of data: the task state of the task waiting for the event and the global data used to indicate the event. To make sure that these appear to happen in the right order, the primitives to begin the process of going to sleep, and the primitives to initiate a wake up imply certain barriers. If a wait queue is used, all accesses to task states are protected by the lock of the wait queue, so the race on task states are avoided. However, if no wait queue used, we need some memory ordering guantanee to avoid the race between sleepers/wakees and wakers. The memory ordering requirement here can be expressed by two STORE-LOAD barriers(barriers which can guarantee a STORE perceding it can never be reordered after a LOAD following it). One STORE-LOAD barrier is needed on the sleeper/wakee side, before reading a variable used to indicate the event and after setting the state of the current task. Another STORE-LOAD barrier is needed on the waker side, before reading the state of the wakee task and after setting a variable used to indicate the event. These two barriers can pair with each other to avoid race conditions between sleepers/wakees and wakers: sleepr/wakee on CPU 1 waker on CPU 2 { wakee->state = TASK_RUNNING, event_indicated = 0 } STORE current->state=TASK_INTERRUPTIBLE c = LOAD event_indicated STORE event_indicated=1 s = LOAD wakee->state assert(!(c==0 && s == TASK_RUNNING)); A STORE-LOAD barrier is implied after setting task state in set_current_state() and before reading task state in wake_up_process() Make sure call set_current_state() before read the global data used to indica
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On 09/10, Boqun Feng wrote: > > On Wed, Sep 09, 2015 at 12:28:22PM -0700, Paul E. McKenney wrote: > > My feeling is > > that we should avoid saying too much about the internals of wait_event() > > and wake_up(). I feel the same. I simply can't understand what we are trying to document ;) For example, > A STORE-LOAD barrier is implied after setting task state by wait-related > functions: > > prepare_to_wait(); > prepare_to_wait_exclusive(); > prepare_to_wait_event(); I won't argue, but to me this looks misleading too. Yes, prepare_to_wait()->set_current_state() implies mb() and thus a STORE-LOAD barrier. But this has nothing to do with the explanation above. We do not need this barrier to avoid the race with wake_up(). Again, again, we can safely rely on wq->lock and acquire/release semantics. This barrier is only needed if you do, say, CONDITION = 1; if (waitqueue_active(wq)) wake_up(wq); And note that the code above is wrong without another mb() after CONDITION = 1. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Wed, Sep 09, 2015 at 12:28:22PM -0700, Paul E. McKenney wrote: > On Tue, Sep 08, 2015 at 09:14:01AM +0800, Boqun Feng wrote: > > Two examples for barriers in wake_up() and co. in memory-barriers.txt > > are misleading, along with their explanations: > > > > 1. The example which wanted to explain the write barrier in > > wake_up() and co. [spotted by Oleg Nesterov ] > > > > 2. The example which wanted to explain that the write barriers in > > wake_up() and co. only exist iff a wakeup actually occurs. > > > > For example #1, according to Oleg Nesterov: > > > > > > > > The barrier occurs before the task state is cleared > > > > > > is not actually right. This is misleading. What is really important is > > > that > > > we have a barrier before we _read_ the task state. And again, again, the > > > fact that we actually have the write barrier is just the implementation > > > detail. > > > > > > > And the example #2 is actually an example which could explain that the > > barriers in wait_event() and co. only exist iff a sleep actually occurs. > > > > Further more, these barriers are only used for the correctness of > > sleeping and waking up, i.e. they exist only to guarantee the ordering > > of memory accesses to the task states and the global variables > > indicating an event. Users can't rely on them for other things, so > > memory-barriers.txt had better to call this out and remove the > > misleading examples. > > > > This patch removes the misleading examples along with their > > explanations, calls it out that those implied barriers are only for > > sleep and wakeup related variables and adds a new example to explain the > > implied barrier in wake_up() and co. > > > > Signed-off-by: Boqun Feng > > At this point, I would favor replacing that entire section with a short > paragraph describing what guarantees are provided, perhaps with an example > showing what added barriers/locks/whatever are required. My feeling is > that we should avoid saying too much about the internals of wait_event() > and wake_up(). Good idea! However, I think a little more words help understand. So I keep the original first paragraph and also add a paragraph for NOTE which, I think, may be a little redundant although ;-) How about the following new whole section? SLEEP AND WAKE-UP FUNCTIONS --- Sleeping and waking on an event flagged in global data can be viewed as an interaction between two pieces of data: the task state of the task waiting for the event and the global data used to indicate the event. To make sure that these appear to happen in the right order, the primitives to begin the process of going to sleep, and the primitives to initiate a wake up imply certain barriers. The memory ordering requirement here can be expressed by two STORE-LOAD barriers(barriers which can guarantee a STORE perceding it can never be reordered after a LOAD following it). One STORE-LOAD barrier is needed on the sleeper/wakee side, before reading a variable used to indicate the event and after setting the state of the current task. Another STORE-LOAD barrier is needed on the waker side, before reading the state of the wakee task and after setting a variable used to indicate the event. These two barriers can pair with each other to avoid race conditions between sleepers/wakees and wakers: sleepr/wakee on CPU 1 waker on CPU 2 { wakee->state = TASK_RUNNING, event_indicated = 0 } STORE current->state=TASK_INTERRUPTIBLE c = LOAD event_indicated STORE event_indicated=1 s = LOAD wakee->state assert(!(c==0 && s == TASK_RUNNING)); A STORE-LOAD barrier is implied after setting task state by wait-related functions: prepare_to_wait(); prepare_to_wait_exclusive(); prepare_to_wait_event(); A STORE-LOAD barrier is implied before reading task state by wake-related functions: complete(); wake_up(); wake_up_all(); wake_up_bit(); wake_up_interruptible(); wake_up_interruptible_all(); wake_up_interruptible_nr(); wake_up_interruptible_poll(); wake_up_interruptible_sync(); wake_up_interruptible_sync_poll(); wake_up_locked(); wake_up_locked_poll(); wake_up_nr(); wake_up_poll(); wake_up_process(); Make sure an appropriate wake-related function is called after setting a global data used to indicate a event. [!] Note that these implied barriers are only for the correctness of sleep and wake-up. So don't rely on these barriers for things that are neither the task states nor the global variables indicating the events. git log --stat for this is: 1 file changed, 29 insertions(+), 108 deletions(-) , which I t
Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
On Tue, Sep 08, 2015 at 09:14:01AM +0800, Boqun Feng wrote: > Two examples for barriers in wake_up() and co. in memory-barriers.txt > are misleading, along with their explanations: > > 1.The example which wanted to explain the write barrier in > wake_up() and co. [spotted by Oleg Nesterov ] > > 2.The example which wanted to explain that the write barriers in > wake_up() and co. only exist iff a wakeup actually occurs. > > For example #1, according to Oleg Nesterov: > > > > > The barrier occurs before the task state is cleared > > > > is not actually right. This is misleading. What is really important is that > > we have a barrier before we _read_ the task state. And again, again, the > > fact that we actually have the write barrier is just the implementation > > detail. > > > > And the example #2 is actually an example which could explain that the > barriers in wait_event() and co. only exist iff a sleep actually occurs. > > Further more, these barriers are only used for the correctness of > sleeping and waking up, i.e. they exist only to guarantee the ordering > of memory accesses to the task states and the global variables > indicating an event. Users can't rely on them for other things, so > memory-barriers.txt had better to call this out and remove the > misleading examples. > > This patch removes the misleading examples along with their > explanations, calls it out that those implied barriers are only for > sleep and wakeup related variables and adds a new example to explain the > implied barrier in wake_up() and co. > > Signed-off-by: Boqun Feng At this point, I would favor replacing that entire section with a short paragraph describing what guarantees are provided, perhaps with an example showing what added barriers/locks/whatever are required. My feeling is that we should avoid saying too much about the internals of wait_event() and wake_up(). Or am I missing something? Thanx, Paul > --- > Documentation/memory-barriers.txt | 42 > +-- > 1 file changed, 18 insertions(+), 24 deletions(-) > > diff --git a/Documentation/memory-barriers.txt > b/Documentation/memory-barriers.txt > index eafa6a5..07de72f 100644 > --- a/Documentation/memory-barriers.txt > +++ b/Documentation/memory-barriers.txt > @@ -1948,6 +1948,10 @@ these appear to happen in the right order, the > primitives to begin the process > of going to sleep, and the primitives to initiate a wake up imply certain > barriers. > > +[!] Note that these implied barriers are only for the correctness of sleep > and > +wake-up. So don't rely on these barriers for things that are neither the task > +states nor the global variables indicating the events. > + > Firstly, the sleeper normally follows something like this sequence of events: > > for (;;) { > @@ -1997,32 +2001,22 @@ or: > event_indicated = 1; > wake_up_process(event_daemon); > > -A write memory barrier is implied by wake_up() and co. if and only if they > wake > -something up. The barrier occurs before the task state is cleared, and so > sits > -between the STORE to indicate the event and the STORE to set TASK_RUNNING: > - > - CPU 1 CPU 2 > - === === > - set_current_state();STORE event_indicated > - smp_store_mb(); wake_up(); > - STORE current->state > - STORE current->state > - LOAD event_indicated > +A memory barrier is implied by wake_up() and co. if and only if they wake > +something up. The memory barrier here is not necessary to be a general > barrier, > +it only needs to guarantee a STORE preceding this barrier can never be > +reordered after a LOAD following this barrier(i.e. a STORE-LOAD barrier). > This > +barrier guarantees that the event has been indicated before the waker read > the > +wakee's task state: > > -To repeat, this write memory barrier is present if and only if something > -is actually awakened. To see this, consider the following sequence of > -events, where X and Y are both initially zero: > + CPU 1 > + === > + STORE event_indicated; > + wake_up_process(wakee); > + > + LOAD wakee->state; > > - CPU 1 CPU 2 > - === === > - X = 1; STORE event_indicated > - smp_mb(); wake_up(); > - Y = 1; wait_event(wq, Y == 1); > - wake_up();load from Y sees 1, no memory barrier > - load from X might see 0 > - > -In contrast, if a wakeup does occur, CPU 2's load from X would be guaranteed > -to see 1. > +This barrier pairs with the general barrier implied by set_cur