Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
On Thu, Nov 10, 2016 at 01:08:56AM +0100, Rafael J. Wysocki wrote: > On Wed, Nov 2, 2016 at 6:07 AM, Brian Norriswrote: > > I can test this and send it in proper form if that looks preferable. > > It does to me as per the discussion at the LPC. > > Are you still going to submit it? Yes, it just moved down my mental list of things. Will do today.
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
On Thu, Nov 10, 2016 at 01:08:56AM +0100, Rafael J. Wysocki wrote: > On Wed, Nov 2, 2016 at 6:07 AM, Brian Norris wrote: > > I can test this and send it in proper form if that looks preferable. > > It does to me as per the discussion at the LPC. > > Are you still going to submit it? Yes, it just moved down my mental list of things. Will do today.
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
On Wed, Nov 2, 2016 at 6:07 AM, Brian Norriswrote: > + more genpd folks > > On Wed, Nov 02, 2016 at 04:51:08AM +0100, Rafael J. Wysocki wrote: >> On Tuesday, November 01, 2016 12:04:28 AM Dmitry Torokhov wrote: >> > On Mon, Oct 31, 2016 at 10:25 PM, Rafael J. Wysocki >> > wrote: >> > > On Thursday, October 27, 2016 09:05:34 AM Brian Norris wrote: >> > >> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c >> > >> index c58563581345..eaf6b53463a5 100644 >> > >> --- a/drivers/base/power/main.c >> > >> +++ b/drivers/base/power/main.c >> > >> @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device >> > >> *dev, pm_message_t state, bool a >> > >> >> > >> dpm_wait_for_children(dev, async); >> > >> >> > >> + if (async_error) >> > >> + goto Complete; >> > >> + >> > > >> > > This is a second chech for async_error in this routine and is the first >> > > one >> > > really needed after adding this? >> > >> > There is really no point in waiting for children to be suspended if >> > error has already been signalled; that's what first check achieves. >> > The 2nd check ensures that we abort suspend if any of the children >> > failed to suspend. >> > >> > I'd say both checks are needed (well, 1st is helpful, 2nd is essential). >> >> OK, fair enough. > > Sort of agreed, although I'm still not sure how helpful the 1st one is; > kinda serves to complicate things, for little real benefit IMO (you > don't save much time by "not waiting" -- either the child quickly > notices the same error and complete()'s quickly, or else you're going to > wait for that child in the end anyway). > > I think it's also important to ask why we do this optimization in the > {late,noirq} cases, but we don't do this in __device_suspend(). As > demonstrated by the $subject bug, I think we would yield fewer bugs by > sharing code structure (if not the code itself) among the similar > phases. > > I'm happy for you to take my current patch, of course, but I think some > further effort on making this consistent might be warranted. Either put > all of these short-circuit checks after the wait_for_children(), or else > add the same short-circuit for the missing case (__device_suspend()). > i.e., this (untested) patch: > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > index e44944f4be77..2932a5bd892f 100644 > --- a/drivers/base/power/main.c > +++ b/drivers/base/power/main.c > @@ -1027,6 +1027,8 @@ static int __device_suspend_noirq(struct device *dev, > pm_message_t state, bool a > TRACE_DEVICE(dev); > TRACE_SUSPEND(0); > > + dpm_wait_for_children(dev, async); > + > if (async_error) > goto Complete; > > @@ -1038,8 +1040,6 @@ static int __device_suspend_noirq(struct device *dev, > pm_message_t state, bool a > if (dev->power.syscore || dev->power.direct_complete) > goto Complete; > > - dpm_wait_for_children(dev, async); > - > if (dev->pm_domain) { > info = "noirq power domain "; > callback = pm_noirq_op(>pm_domain->ops, state); > @@ -1174,6 +1174,8 @@ static int __device_suspend_late(struct device *dev, > pm_message_t state, bool as > > __pm_runtime_disable(dev, false); > > + dpm_wait_for_children(dev, async); > + > if (async_error) > goto Complete; > > @@ -1185,8 +1187,6 @@ static int __device_suspend_late(struct device *dev, > pm_message_t state, bool as > if (dev->power.syscore || dev->power.direct_complete) > goto Complete; > > - dpm_wait_for_children(dev, async); > - > if (dev->pm_domain) { > info = "late power domain "; > callback = pm_late_early_op(>pm_domain->ops, state); > > --- > > I can test this and send it in proper form if that looks preferable. It does to me as per the discussion at the LPC. Are you still going to submit it? Thanks, Rafael
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
On Wed, Nov 2, 2016 at 6:07 AM, Brian Norris wrote: > + more genpd folks > > On Wed, Nov 02, 2016 at 04:51:08AM +0100, Rafael J. Wysocki wrote: >> On Tuesday, November 01, 2016 12:04:28 AM Dmitry Torokhov wrote: >> > On Mon, Oct 31, 2016 at 10:25 PM, Rafael J. Wysocki >> > wrote: >> > > On Thursday, October 27, 2016 09:05:34 AM Brian Norris wrote: >> > >> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c >> > >> index c58563581345..eaf6b53463a5 100644 >> > >> --- a/drivers/base/power/main.c >> > >> +++ b/drivers/base/power/main.c >> > >> @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device >> > >> *dev, pm_message_t state, bool a >> > >> >> > >> dpm_wait_for_children(dev, async); >> > >> >> > >> + if (async_error) >> > >> + goto Complete; >> > >> + >> > > >> > > This is a second chech for async_error in this routine and is the first >> > > one >> > > really needed after adding this? >> > >> > There is really no point in waiting for children to be suspended if >> > error has already been signalled; that's what first check achieves. >> > The 2nd check ensures that we abort suspend if any of the children >> > failed to suspend. >> > >> > I'd say both checks are needed (well, 1st is helpful, 2nd is essential). >> >> OK, fair enough. > > Sort of agreed, although I'm still not sure how helpful the 1st one is; > kinda serves to complicate things, for little real benefit IMO (you > don't save much time by "not waiting" -- either the child quickly > notices the same error and complete()'s quickly, or else you're going to > wait for that child in the end anyway). > > I think it's also important to ask why we do this optimization in the > {late,noirq} cases, but we don't do this in __device_suspend(). As > demonstrated by the $subject bug, I think we would yield fewer bugs by > sharing code structure (if not the code itself) among the similar > phases. > > I'm happy for you to take my current patch, of course, but I think some > further effort on making this consistent might be warranted. Either put > all of these short-circuit checks after the wait_for_children(), or else > add the same short-circuit for the missing case (__device_suspend()). > i.e., this (untested) patch: > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > index e44944f4be77..2932a5bd892f 100644 > --- a/drivers/base/power/main.c > +++ b/drivers/base/power/main.c > @@ -1027,6 +1027,8 @@ static int __device_suspend_noirq(struct device *dev, > pm_message_t state, bool a > TRACE_DEVICE(dev); > TRACE_SUSPEND(0); > > + dpm_wait_for_children(dev, async); > + > if (async_error) > goto Complete; > > @@ -1038,8 +1040,6 @@ static int __device_suspend_noirq(struct device *dev, > pm_message_t state, bool a > if (dev->power.syscore || dev->power.direct_complete) > goto Complete; > > - dpm_wait_for_children(dev, async); > - > if (dev->pm_domain) { > info = "noirq power domain "; > callback = pm_noirq_op(>pm_domain->ops, state); > @@ -1174,6 +1174,8 @@ static int __device_suspend_late(struct device *dev, > pm_message_t state, bool as > > __pm_runtime_disable(dev, false); > > + dpm_wait_for_children(dev, async); > + > if (async_error) > goto Complete; > > @@ -1185,8 +1187,6 @@ static int __device_suspend_late(struct device *dev, > pm_message_t state, bool as > if (dev->power.syscore || dev->power.direct_complete) > goto Complete; > > - dpm_wait_for_children(dev, async); > - > if (dev->pm_domain) { > info = "late power domain "; > callback = pm_late_early_op(>pm_domain->ops, state); > > --- > > I can test this and send it in proper form if that looks preferable. It does to me as per the discussion at the LPC. Are you still going to submit it? Thanks, Rafael
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
+ more genpd folks On Wed, Nov 02, 2016 at 04:51:08AM +0100, Rafael J. Wysocki wrote: > On Tuesday, November 01, 2016 12:04:28 AM Dmitry Torokhov wrote: > > On Mon, Oct 31, 2016 at 10:25 PM, Rafael J. Wysocki> > wrote: > > > On Thursday, October 27, 2016 09:05:34 AM Brian Norris wrote: > > >> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > > >> index c58563581345..eaf6b53463a5 100644 > > >> --- a/drivers/base/power/main.c > > >> +++ b/drivers/base/power/main.c > > >> @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device > > >> *dev, pm_message_t state, bool a > > >> > > >> dpm_wait_for_children(dev, async); > > >> > > >> + if (async_error) > > >> + goto Complete; > > >> + > > > > > > This is a second chech for async_error in this routine and is the first > > > one > > > really needed after adding this? > > > > There is really no point in waiting for children to be suspended if > > error has already been signalled; that's what first check achieves. > > The 2nd check ensures that we abort suspend if any of the children > > failed to suspend. > > > > I'd say both checks are needed (well, 1st is helpful, 2nd is essential). > > OK, fair enough. Sort of agreed, although I'm still not sure how helpful the 1st one is; kinda serves to complicate things, for little real benefit IMO (you don't save much time by "not waiting" -- either the child quickly notices the same error and complete()'s quickly, or else you're going to wait for that child in the end anyway). I think it's also important to ask why we do this optimization in the {late,noirq} cases, but we don't do this in __device_suspend(). As demonstrated by the $subject bug, I think we would yield fewer bugs by sharing code structure (if not the code itself) among the similar phases. I'm happy for you to take my current patch, of course, but I think some further effort on making this consistent might be warranted. Either put all of these short-circuit checks after the wait_for_children(), or else add the same short-circuit for the missing case (__device_suspend()). i.e., this (untested) patch: diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index e44944f4be77..2932a5bd892f 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1027,6 +1027,8 @@ static int __device_suspend_noirq(struct device *dev, pm_message_t state, bool a TRACE_DEVICE(dev); TRACE_SUSPEND(0); + dpm_wait_for_children(dev, async); + if (async_error) goto Complete; @@ -1038,8 +1040,6 @@ static int __device_suspend_noirq(struct device *dev, pm_message_t state, bool a if (dev->power.syscore || dev->power.direct_complete) goto Complete; - dpm_wait_for_children(dev, async); - if (dev->pm_domain) { info = "noirq power domain "; callback = pm_noirq_op(>pm_domain->ops, state); @@ -1174,6 +1174,8 @@ static int __device_suspend_late(struct device *dev, pm_message_t state, bool as __pm_runtime_disable(dev, false); + dpm_wait_for_children(dev, async); + if (async_error) goto Complete; @@ -1185,8 +1187,6 @@ static int __device_suspend_late(struct device *dev, pm_message_t state, bool as if (dev->power.syscore || dev->power.direct_complete) goto Complete; - dpm_wait_for_children(dev, async); - if (dev->pm_domain) { info = "late power domain "; callback = pm_late_early_op(>pm_domain->ops, state); --- I can test this and send it in proper form if that looks preferable. P.S. To get slightly off-topic here (but speaking of noirq bugs): I noticed the genpd code has comments like this scattered all over: * This function is only called in "noirq" and "syscore" stages of system power * transitions, so it need not acquire locks (all of the "noirq" callbacks are * executed sequentially, so it is guaranteed that it will never run twice in * parallel). Isn't that no longer true, now that noirq suspend can be asynchronous? Maybe we should grep for the phrase "need not acquire locks" throughout the kernel, in order to find low-hanging fruit for race conditions :)
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
+ more genpd folks On Wed, Nov 02, 2016 at 04:51:08AM +0100, Rafael J. Wysocki wrote: > On Tuesday, November 01, 2016 12:04:28 AM Dmitry Torokhov wrote: > > On Mon, Oct 31, 2016 at 10:25 PM, Rafael J. Wysocki > > wrote: > > > On Thursday, October 27, 2016 09:05:34 AM Brian Norris wrote: > > >> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > > >> index c58563581345..eaf6b53463a5 100644 > > >> --- a/drivers/base/power/main.c > > >> +++ b/drivers/base/power/main.c > > >> @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device > > >> *dev, pm_message_t state, bool a > > >> > > >> dpm_wait_for_children(dev, async); > > >> > > >> + if (async_error) > > >> + goto Complete; > > >> + > > > > > > This is a second chech for async_error in this routine and is the first > > > one > > > really needed after adding this? > > > > There is really no point in waiting for children to be suspended if > > error has already been signalled; that's what first check achieves. > > The 2nd check ensures that we abort suspend if any of the children > > failed to suspend. > > > > I'd say both checks are needed (well, 1st is helpful, 2nd is essential). > > OK, fair enough. Sort of agreed, although I'm still not sure how helpful the 1st one is; kinda serves to complicate things, for little real benefit IMO (you don't save much time by "not waiting" -- either the child quickly notices the same error and complete()'s quickly, or else you're going to wait for that child in the end anyway). I think it's also important to ask why we do this optimization in the {late,noirq} cases, but we don't do this in __device_suspend(). As demonstrated by the $subject bug, I think we would yield fewer bugs by sharing code structure (if not the code itself) among the similar phases. I'm happy for you to take my current patch, of course, but I think some further effort on making this consistent might be warranted. Either put all of these short-circuit checks after the wait_for_children(), or else add the same short-circuit for the missing case (__device_suspend()). i.e., this (untested) patch: diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index e44944f4be77..2932a5bd892f 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1027,6 +1027,8 @@ static int __device_suspend_noirq(struct device *dev, pm_message_t state, bool a TRACE_DEVICE(dev); TRACE_SUSPEND(0); + dpm_wait_for_children(dev, async); + if (async_error) goto Complete; @@ -1038,8 +1040,6 @@ static int __device_suspend_noirq(struct device *dev, pm_message_t state, bool a if (dev->power.syscore || dev->power.direct_complete) goto Complete; - dpm_wait_for_children(dev, async); - if (dev->pm_domain) { info = "noirq power domain "; callback = pm_noirq_op(>pm_domain->ops, state); @@ -1174,6 +1174,8 @@ static int __device_suspend_late(struct device *dev, pm_message_t state, bool as __pm_runtime_disable(dev, false); + dpm_wait_for_children(dev, async); + if (async_error) goto Complete; @@ -1185,8 +1187,6 @@ static int __device_suspend_late(struct device *dev, pm_message_t state, bool as if (dev->power.syscore || dev->power.direct_complete) goto Complete; - dpm_wait_for_children(dev, async); - if (dev->pm_domain) { info = "late power domain "; callback = pm_late_early_op(>pm_domain->ops, state); --- I can test this and send it in proper form if that looks preferable. P.S. To get slightly off-topic here (but speaking of noirq bugs): I noticed the genpd code has comments like this scattered all over: * This function is only called in "noirq" and "syscore" stages of system power * transitions, so it need not acquire locks (all of the "noirq" callbacks are * executed sequentially, so it is guaranteed that it will never run twice in * parallel). Isn't that no longer true, now that noirq suspend can be asynchronous? Maybe we should grep for the phrase "need not acquire locks" throughout the kernel, in order to find low-hanging fruit for race conditions :)
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
On Tuesday, November 01, 2016 12:04:28 AM Dmitry Torokhov wrote: > On Mon, Oct 31, 2016 at 10:25 PM, Rafael J. Wysocki> wrote: > > On Thursday, October 27, 2016 09:05:34 AM Brian Norris wrote: > >> Consider two devices, A and B, where B is a child of A, and B utilizes > >> asynchronous suspend (it does not matter whether A is sync or async). If > >> B fails to suspend_noirq() or suspend_late(), or is interrupted by a > >> wakeup (pm_wakeup_pending()), then it aborts and sets the async_error > >> variable. However, device A does not (immediately) check the async_error > >> variable; it may continue to run its own suspend_noirq()/suspend_late() > >> callback. This is bad. > >> > >> We can resolve this problem by checking the async_error flag after > >> waiting for children to suspend, using the same logic for the noirq and > >> late suspend cases as we already do for __device_suspend(). > >> > >> It's easy to observe this erroneous behavior by, for example, forcing a > >> device to sleep a bit in its suspend_noirq() (to ensure the parent is > >> waiting for the child to complete), then return an error, and watch the > >> parent suspend_noirq() still get called. (Or similarly, fake a wakeup > >> event at the right (or is it wrong?) time.) > >> > >> Fixes: de377b397272 ("PM / sleep: Asynchronous threads for suspend_late") > >> Fixes: 28b6fd6e3779 ("PM / sleep: Asynchronous threads for suspend_noirq") > >> Reported-by: Jeffy Chen > >> Signed-off-by: Brian Norris > >> Reviewed-by: Dmitry Torokhov > >> --- > >> v2: s/early/late/ in commit message > >> > >> drivers/base/power/main.c | 6 ++ > >> 1 file changed, 6 insertions(+) > >> > >> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > >> index c58563581345..eaf6b53463a5 100644 > >> --- a/drivers/base/power/main.c > >> +++ b/drivers/base/power/main.c > >> @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device > >> *dev, pm_message_t state, bool a > >> > >> dpm_wait_for_children(dev, async); > >> > >> + if (async_error) > >> + goto Complete; > >> + > > > > This is a second chech for async_error in this routine and is the first one > > really needed after adding this? > > There is really no point in waiting for children to be suspended if > error has already been signalled; that's what first check achieves. > The 2nd check ensures that we abort suspend if any of the children > failed to suspend. > > I'd say both checks are needed (well, 1st is helpful, 2nd is essential). OK, fair enough. Thanks, Rafael
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
On Tuesday, November 01, 2016 12:04:28 AM Dmitry Torokhov wrote: > On Mon, Oct 31, 2016 at 10:25 PM, Rafael J. Wysocki > wrote: > > On Thursday, October 27, 2016 09:05:34 AM Brian Norris wrote: > >> Consider two devices, A and B, where B is a child of A, and B utilizes > >> asynchronous suspend (it does not matter whether A is sync or async). If > >> B fails to suspend_noirq() or suspend_late(), or is interrupted by a > >> wakeup (pm_wakeup_pending()), then it aborts and sets the async_error > >> variable. However, device A does not (immediately) check the async_error > >> variable; it may continue to run its own suspend_noirq()/suspend_late() > >> callback. This is bad. > >> > >> We can resolve this problem by checking the async_error flag after > >> waiting for children to suspend, using the same logic for the noirq and > >> late suspend cases as we already do for __device_suspend(). > >> > >> It's easy to observe this erroneous behavior by, for example, forcing a > >> device to sleep a bit in its suspend_noirq() (to ensure the parent is > >> waiting for the child to complete), then return an error, and watch the > >> parent suspend_noirq() still get called. (Or similarly, fake a wakeup > >> event at the right (or is it wrong?) time.) > >> > >> Fixes: de377b397272 ("PM / sleep: Asynchronous threads for suspend_late") > >> Fixes: 28b6fd6e3779 ("PM / sleep: Asynchronous threads for suspend_noirq") > >> Reported-by: Jeffy Chen > >> Signed-off-by: Brian Norris > >> Reviewed-by: Dmitry Torokhov > >> --- > >> v2: s/early/late/ in commit message > >> > >> drivers/base/power/main.c | 6 ++ > >> 1 file changed, 6 insertions(+) > >> > >> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > >> index c58563581345..eaf6b53463a5 100644 > >> --- a/drivers/base/power/main.c > >> +++ b/drivers/base/power/main.c > >> @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device > >> *dev, pm_message_t state, bool a > >> > >> dpm_wait_for_children(dev, async); > >> > >> + if (async_error) > >> + goto Complete; > >> + > > > > This is a second chech for async_error in this routine and is the first one > > really needed after adding this? > > There is really no point in waiting for children to be suspended if > error has already been signalled; that's what first check achieves. > The 2nd check ensures that we abort suspend if any of the children > failed to suspend. > > I'd say both checks are needed (well, 1st is helpful, 2nd is essential). OK, fair enough. Thanks, Rafael
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
On Mon, Oct 31, 2016 at 10:25 PM, Rafael J. Wysockiwrote: > On Thursday, October 27, 2016 09:05:34 AM Brian Norris wrote: >> Consider two devices, A and B, where B is a child of A, and B utilizes >> asynchronous suspend (it does not matter whether A is sync or async). If >> B fails to suspend_noirq() or suspend_late(), or is interrupted by a >> wakeup (pm_wakeup_pending()), then it aborts and sets the async_error >> variable. However, device A does not (immediately) check the async_error >> variable; it may continue to run its own suspend_noirq()/suspend_late() >> callback. This is bad. >> >> We can resolve this problem by checking the async_error flag after >> waiting for children to suspend, using the same logic for the noirq and >> late suspend cases as we already do for __device_suspend(). >> >> It's easy to observe this erroneous behavior by, for example, forcing a >> device to sleep a bit in its suspend_noirq() (to ensure the parent is >> waiting for the child to complete), then return an error, and watch the >> parent suspend_noirq() still get called. (Or similarly, fake a wakeup >> event at the right (or is it wrong?) time.) >> >> Fixes: de377b397272 ("PM / sleep: Asynchronous threads for suspend_late") >> Fixes: 28b6fd6e3779 ("PM / sleep: Asynchronous threads for suspend_noirq") >> Reported-by: Jeffy Chen >> Signed-off-by: Brian Norris >> Reviewed-by: Dmitry Torokhov >> --- >> v2: s/early/late/ in commit message >> >> drivers/base/power/main.c | 6 ++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c >> index c58563581345..eaf6b53463a5 100644 >> --- a/drivers/base/power/main.c >> +++ b/drivers/base/power/main.c >> @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device *dev, >> pm_message_t state, bool a >> >> dpm_wait_for_children(dev, async); >> >> + if (async_error) >> + goto Complete; >> + > > This is a second chech for async_error in this routine and is the first one > really needed after adding this? There is really no point in waiting for children to be suspended if error has already been signalled; that's what first check achieves. The 2nd check ensures that we abort suspend if any of the children failed to suspend. I'd say both checks are needed (well, 1st is helpful, 2nd is essential). > >> if (dev->pm_domain) { >> info = "noirq power domain "; >> callback = pm_noirq_op(>pm_domain->ops, state); >> @@ -1187,6 +1190,9 @@ static int __device_suspend_late(struct device *dev, >> pm_message_t state, bool as >> >> dpm_wait_for_children(dev, async); >> >> + if (async_error) >> + goto Complete; >> + > > Same question. > >> if (dev->pm_domain) { >> info = "late power domain "; >> callback = pm_late_early_op(>pm_domain->ops, state); >> > > Thanks, > Rafael > Thanks. -- Dmitry
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
On Mon, Oct 31, 2016 at 10:25 PM, Rafael J. Wysocki wrote: > On Thursday, October 27, 2016 09:05:34 AM Brian Norris wrote: >> Consider two devices, A and B, where B is a child of A, and B utilizes >> asynchronous suspend (it does not matter whether A is sync or async). If >> B fails to suspend_noirq() or suspend_late(), or is interrupted by a >> wakeup (pm_wakeup_pending()), then it aborts and sets the async_error >> variable. However, device A does not (immediately) check the async_error >> variable; it may continue to run its own suspend_noirq()/suspend_late() >> callback. This is bad. >> >> We can resolve this problem by checking the async_error flag after >> waiting for children to suspend, using the same logic for the noirq and >> late suspend cases as we already do for __device_suspend(). >> >> It's easy to observe this erroneous behavior by, for example, forcing a >> device to sleep a bit in its suspend_noirq() (to ensure the parent is >> waiting for the child to complete), then return an error, and watch the >> parent suspend_noirq() still get called. (Or similarly, fake a wakeup >> event at the right (or is it wrong?) time.) >> >> Fixes: de377b397272 ("PM / sleep: Asynchronous threads for suspend_late") >> Fixes: 28b6fd6e3779 ("PM / sleep: Asynchronous threads for suspend_noirq") >> Reported-by: Jeffy Chen >> Signed-off-by: Brian Norris >> Reviewed-by: Dmitry Torokhov >> --- >> v2: s/early/late/ in commit message >> >> drivers/base/power/main.c | 6 ++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c >> index c58563581345..eaf6b53463a5 100644 >> --- a/drivers/base/power/main.c >> +++ b/drivers/base/power/main.c >> @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device *dev, >> pm_message_t state, bool a >> >> dpm_wait_for_children(dev, async); >> >> + if (async_error) >> + goto Complete; >> + > > This is a second chech for async_error in this routine and is the first one > really needed after adding this? There is really no point in waiting for children to be suspended if error has already been signalled; that's what first check achieves. The 2nd check ensures that we abort suspend if any of the children failed to suspend. I'd say both checks are needed (well, 1st is helpful, 2nd is essential). > >> if (dev->pm_domain) { >> info = "noirq power domain "; >> callback = pm_noirq_op(>pm_domain->ops, state); >> @@ -1187,6 +1190,9 @@ static int __device_suspend_late(struct device *dev, >> pm_message_t state, bool as >> >> dpm_wait_for_children(dev, async); >> >> + if (async_error) >> + goto Complete; >> + > > Same question. > >> if (dev->pm_domain) { >> info = "late power domain "; >> callback = pm_late_early_op(>pm_domain->ops, state); >> > > Thanks, > Rafael > Thanks. -- Dmitry
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
Hi Rafael, On Tue, Nov 01, 2016 at 05:25:39AM +0100, Rafael J. Wysocki wrote: > On Thursday, October 27, 2016 09:05:34 AM Brian Norris wrote: > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > > index c58563581345..eaf6b53463a5 100644 > > --- a/drivers/base/power/main.c > > +++ b/drivers/base/power/main.c > > @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device *dev, > > pm_message_t state, bool a > > > > dpm_wait_for_children(dev, async); > > > > + if (async_error) > > + goto Complete; > > + > > This is a second chech for async_error in this routine and is the first one > really needed after adding this? Maybe not? I confess I'm not 100% sure on all the reasons for the code structure as-is, but it looks like we're trying to catch pending wakeups early, and because that procedure utilizes 'async_error' to stash the -EBUSY, it seemingly makes sense to check if it's non-zero before overwriting it. But then, that's all kind of racy, since there can be multiple writers to that variable, no? So it can't matter *that* much if we clobber the error, as long as we abort somewhere. Anyway, maybe it's best if dpm_wait_for_children() just moves to be first thing in this function (after the tracepoints). That seems just as correct to me, and shouldn't waste any additional time suspending devices for a failed system suspend attempt -- as long as we're still catching wakeups before we suspend the current device. (That also incidentally matches the structure of __device_suspend() more closely. Why did this all get out of sync (pun unintended) when copied from the suspend() to the suspend_{late,noirq}() phase?) All in all, the short response is that I wrote the smallest patch that fixes the bug, AFAICT. But actually I think the above would be both shorter and better. I'll give that a go. > > if (dev->pm_domain) { > > info = "noirq power domain "; > > callback = pm_noirq_op(>pm_domain->ops, state); > > @@ -1187,6 +1190,9 @@ static int __device_suspend_late(struct device *dev, > > pm_message_t state, bool as > > > > dpm_wait_for_children(dev, async); > > > > + if (async_error) > > + goto Complete; > > + > > Same question. Same answer :) Brian > > if (dev->pm_domain) { > > info = "late power domain "; > > callback = pm_late_early_op(>pm_domain->ops, state); > > > > Thanks, > Rafael >
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
Hi Rafael, On Tue, Nov 01, 2016 at 05:25:39AM +0100, Rafael J. Wysocki wrote: > On Thursday, October 27, 2016 09:05:34 AM Brian Norris wrote: > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > > index c58563581345..eaf6b53463a5 100644 > > --- a/drivers/base/power/main.c > > +++ b/drivers/base/power/main.c > > @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device *dev, > > pm_message_t state, bool a > > > > dpm_wait_for_children(dev, async); > > > > + if (async_error) > > + goto Complete; > > + > > This is a second chech for async_error in this routine and is the first one > really needed after adding this? Maybe not? I confess I'm not 100% sure on all the reasons for the code structure as-is, but it looks like we're trying to catch pending wakeups early, and because that procedure utilizes 'async_error' to stash the -EBUSY, it seemingly makes sense to check if it's non-zero before overwriting it. But then, that's all kind of racy, since there can be multiple writers to that variable, no? So it can't matter *that* much if we clobber the error, as long as we abort somewhere. Anyway, maybe it's best if dpm_wait_for_children() just moves to be first thing in this function (after the tracepoints). That seems just as correct to me, and shouldn't waste any additional time suspending devices for a failed system suspend attempt -- as long as we're still catching wakeups before we suspend the current device. (That also incidentally matches the structure of __device_suspend() more closely. Why did this all get out of sync (pun unintended) when copied from the suspend() to the suspend_{late,noirq}() phase?) All in all, the short response is that I wrote the smallest patch that fixes the bug, AFAICT. But actually I think the above would be both shorter and better. I'll give that a go. > > if (dev->pm_domain) { > > info = "noirq power domain "; > > callback = pm_noirq_op(>pm_domain->ops, state); > > @@ -1187,6 +1190,9 @@ static int __device_suspend_late(struct device *dev, > > pm_message_t state, bool as > > > > dpm_wait_for_children(dev, async); > > > > + if (async_error) > > + goto Complete; > > + > > Same question. Same answer :) Brian > > if (dev->pm_domain) { > > info = "late power domain "; > > callback = pm_late_early_op(>pm_domain->ops, state); > > > > Thanks, > Rafael >
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
On Thursday, October 27, 2016 09:05:34 AM Brian Norris wrote: > Consider two devices, A and B, where B is a child of A, and B utilizes > asynchronous suspend (it does not matter whether A is sync or async). If > B fails to suspend_noirq() or suspend_late(), or is interrupted by a > wakeup (pm_wakeup_pending()), then it aborts and sets the async_error > variable. However, device A does not (immediately) check the async_error > variable; it may continue to run its own suspend_noirq()/suspend_late() > callback. This is bad. > > We can resolve this problem by checking the async_error flag after > waiting for children to suspend, using the same logic for the noirq and > late suspend cases as we already do for __device_suspend(). > > It's easy to observe this erroneous behavior by, for example, forcing a > device to sleep a bit in its suspend_noirq() (to ensure the parent is > waiting for the child to complete), then return an error, and watch the > parent suspend_noirq() still get called. (Or similarly, fake a wakeup > event at the right (or is it wrong?) time.) > > Fixes: de377b397272 ("PM / sleep: Asynchronous threads for suspend_late") > Fixes: 28b6fd6e3779 ("PM / sleep: Asynchronous threads for suspend_noirq") > Reported-by: Jeffy Chen> Signed-off-by: Brian Norris > Reviewed-by: Dmitry Torokhov > --- > v2: s/early/late/ in commit message > > drivers/base/power/main.c | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > index c58563581345..eaf6b53463a5 100644 > --- a/drivers/base/power/main.c > +++ b/drivers/base/power/main.c > @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device *dev, > pm_message_t state, bool a > > dpm_wait_for_children(dev, async); > > + if (async_error) > + goto Complete; > + This is a second chech for async_error in this routine and is the first one really needed after adding this? > if (dev->pm_domain) { > info = "noirq power domain "; > callback = pm_noirq_op(>pm_domain->ops, state); > @@ -1187,6 +1190,9 @@ static int __device_suspend_late(struct device *dev, > pm_message_t state, bool as > > dpm_wait_for_children(dev, async); > > + if (async_error) > + goto Complete; > + Same question. > if (dev->pm_domain) { > info = "late power domain "; > callback = pm_late_early_op(>pm_domain->ops, state); > Thanks, Rafael
Re: [PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
On Thursday, October 27, 2016 09:05:34 AM Brian Norris wrote: > Consider two devices, A and B, where B is a child of A, and B utilizes > asynchronous suspend (it does not matter whether A is sync or async). If > B fails to suspend_noirq() or suspend_late(), or is interrupted by a > wakeup (pm_wakeup_pending()), then it aborts and sets the async_error > variable. However, device A does not (immediately) check the async_error > variable; it may continue to run its own suspend_noirq()/suspend_late() > callback. This is bad. > > We can resolve this problem by checking the async_error flag after > waiting for children to suspend, using the same logic for the noirq and > late suspend cases as we already do for __device_suspend(). > > It's easy to observe this erroneous behavior by, for example, forcing a > device to sleep a bit in its suspend_noirq() (to ensure the parent is > waiting for the child to complete), then return an error, and watch the > parent suspend_noirq() still get called. (Or similarly, fake a wakeup > event at the right (or is it wrong?) time.) > > Fixes: de377b397272 ("PM / sleep: Asynchronous threads for suspend_late") > Fixes: 28b6fd6e3779 ("PM / sleep: Asynchronous threads for suspend_noirq") > Reported-by: Jeffy Chen > Signed-off-by: Brian Norris > Reviewed-by: Dmitry Torokhov > --- > v2: s/early/late/ in commit message > > drivers/base/power/main.c | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > index c58563581345..eaf6b53463a5 100644 > --- a/drivers/base/power/main.c > +++ b/drivers/base/power/main.c > @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device *dev, > pm_message_t state, bool a > > dpm_wait_for_children(dev, async); > > + if (async_error) > + goto Complete; > + This is a second chech for async_error in this routine and is the first one really needed after adding this? > if (dev->pm_domain) { > info = "noirq power domain "; > callback = pm_noirq_op(>pm_domain->ops, state); > @@ -1187,6 +1190,9 @@ static int __device_suspend_late(struct device *dev, > pm_message_t state, bool as > > dpm_wait_for_children(dev, async); > > + if (async_error) > + goto Complete; > + Same question. > if (dev->pm_domain) { > info = "late power domain "; > callback = pm_late_early_op(>pm_domain->ops, state); > Thanks, Rafael
[PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
Consider two devices, A and B, where B is a child of A, and B utilizes asynchronous suspend (it does not matter whether A is sync or async). If B fails to suspend_noirq() or suspend_late(), or is interrupted by a wakeup (pm_wakeup_pending()), then it aborts and sets the async_error variable. However, device A does not (immediately) check the async_error variable; it may continue to run its own suspend_noirq()/suspend_late() callback. This is bad. We can resolve this problem by checking the async_error flag after waiting for children to suspend, using the same logic for the noirq and late suspend cases as we already do for __device_suspend(). It's easy to observe this erroneous behavior by, for example, forcing a device to sleep a bit in its suspend_noirq() (to ensure the parent is waiting for the child to complete), then return an error, and watch the parent suspend_noirq() still get called. (Or similarly, fake a wakeup event at the right (or is it wrong?) time.) Fixes: de377b397272 ("PM / sleep: Asynchronous threads for suspend_late") Fixes: 28b6fd6e3779 ("PM / sleep: Asynchronous threads for suspend_noirq") Reported-by: Jeffy ChenSigned-off-by: Brian Norris Reviewed-by: Dmitry Torokhov --- v2: s/early/late/ in commit message drivers/base/power/main.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index c58563581345..eaf6b53463a5 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device *dev, pm_message_t state, bool a dpm_wait_for_children(dev, async); + if (async_error) + goto Complete; + if (dev->pm_domain) { info = "noirq power domain "; callback = pm_noirq_op(>pm_domain->ops, state); @@ -1187,6 +1190,9 @@ static int __device_suspend_late(struct device *dev, pm_message_t state, bool as dpm_wait_for_children(dev, async); + if (async_error) + goto Complete; + if (dev->pm_domain) { info = "late power domain "; callback = pm_late_early_op(>pm_domain->ops, state); -- 2.8.0.rc3.226.g39d4020
[PATCH v2 2/2] PM / sleep: don't suspend parent when async child suspend_{noirq,late} fails
Consider two devices, A and B, where B is a child of A, and B utilizes asynchronous suspend (it does not matter whether A is sync or async). If B fails to suspend_noirq() or suspend_late(), or is interrupted by a wakeup (pm_wakeup_pending()), then it aborts and sets the async_error variable. However, device A does not (immediately) check the async_error variable; it may continue to run its own suspend_noirq()/suspend_late() callback. This is bad. We can resolve this problem by checking the async_error flag after waiting for children to suspend, using the same logic for the noirq and late suspend cases as we already do for __device_suspend(). It's easy to observe this erroneous behavior by, for example, forcing a device to sleep a bit in its suspend_noirq() (to ensure the parent is waiting for the child to complete), then return an error, and watch the parent suspend_noirq() still get called. (Or similarly, fake a wakeup event at the right (or is it wrong?) time.) Fixes: de377b397272 ("PM / sleep: Asynchronous threads for suspend_late") Fixes: 28b6fd6e3779 ("PM / sleep: Asynchronous threads for suspend_noirq") Reported-by: Jeffy Chen Signed-off-by: Brian Norris Reviewed-by: Dmitry Torokhov --- v2: s/early/late/ in commit message drivers/base/power/main.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index c58563581345..eaf6b53463a5 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1040,6 +1040,9 @@ static int __device_suspend_noirq(struct device *dev, pm_message_t state, bool a dpm_wait_for_children(dev, async); + if (async_error) + goto Complete; + if (dev->pm_domain) { info = "noirq power domain "; callback = pm_noirq_op(>pm_domain->ops, state); @@ -1187,6 +1190,9 @@ static int __device_suspend_late(struct device *dev, pm_message_t state, bool as dpm_wait_for_children(dev, async); + if (async_error) + goto Complete; + if (dev->pm_domain) { info = "late power domain "; callback = pm_late_early_op(>pm_domain->ops, state); -- 2.8.0.rc3.226.g39d4020