Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

2018-09-06 Thread Neeraj Upadhyay




On 09/06/2018 01:48 PM, Thomas Gleixner wrote:

On Thu, 6 Sep 2018, Neeraj Upadhyay wrote:

On 09/05/2018 06:47 PM, Thomas Gleixner wrote:

On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:

On 09/05/2018 05:53 PM, Thomas Gleixner wrote:

And looking closer this is a general issue. Just that the TEARDOWN state
makes it simple to observe. It's universaly broken, when the first
teardown
callback fails because, st->state is only decremented _AFTER_ the
callback
returns success, but undo_cpu_down() increments unconditionally.


As per my understanding, there are 2 problems here; one is fixed with your
patch, and other is cpuhp_reset_state() is used during rollback from
non-AP to
AP state, which seem to result in 2 increments of st->state (one increment
done by cpuhp_reset_state() and another by cpu_thread_fun()) .

And how did your hack fix that up magically? I'll have a look later today.

Thanks,

tglx


The hack fixes it by not calling cpuhp_reset_state() and doing rollback state
reset inline in  _cpu_down().


And how is that any different from the proper patch I sent? It ends up in
the same state. So I have a hard time to understand your blurb above where
you claim that my patch just solves one of two problems?

Thanks,

tglx



Yes, your patch solves the problem related to smpboot unparking being 
skipped during rollback and with the hack we end up in same state. The 
second thing, which I am referring to, is that there is one additional 
state increment. I missed the part that, it could be required, so that 
we reach CPUHP_AP_ONLINE_IDLE before calling __cpuhp_kick_ap(). So, it's 
not a problem.


* 	cpuhp_reset_state() does one state increment and we reach 
CPUHP_AP_ONLINE_IDLE.


if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
cpuhp_reset_state(st, prev_state);
__cpuhp_kick_ap(st);
}

CPUHP_TEARDOWN_CPU,
CPUHP_AP_ONLINE_IDLE,
CPUHP_AP_SMPBOOT_THREADS,


* cpuhp_thread_fun() does one more increment before invoking state 
callback (so, we skip CPUHP_AP_ONLINE_IDLE) and we reach 
CPUHP_AP_SMPBOOT_THREADS:


static void cpuhp_thread_fun(unsigned int cpu)

if (bringup) {
st->state++;
state = st->state;



--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

2018-09-06 Thread Thomas Gleixner
On Thu, 6 Sep 2018, Neeraj Upadhyay wrote:
> On 09/05/2018 06:47 PM, Thomas Gleixner wrote:
> > On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
> > > On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
> > > > And looking closer this is a general issue. Just that the TEARDOWN state
> > > > makes it simple to observe. It's universaly broken, when the first
> > > > teardown
> > > > callback fails because, st->state is only decremented _AFTER_ the
> > > > callback
> > > > returns success, but undo_cpu_down() increments unconditionally.
> > > > 
> > > As per my understanding, there are 2 problems here; one is fixed with your
> > > patch, and other is cpuhp_reset_state() is used during rollback from
> > > non-AP to
> > > AP state, which seem to result in 2 increments of st->state (one increment
> > > done by cpuhp_reset_state() and another by cpu_thread_fun()) .
> > And how did your hack fix that up magically? I'll have a look later today.
> > 
> > Thanks,
> > 
> > tglx
> 
> The hack fixes it by not calling cpuhp_reset_state() and doing rollback state
> reset inline in  _cpu_down().

And how is that any different from the proper patch I sent? It ends up in
the same state. So I have a hard time to understand your blurb above where
you claim that my patch just solves one of two problems?

Thanks,

tglx



Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

2018-09-05 Thread Neeraj Upadhyay




On 09/05/2018 06:47 PM, Thomas Gleixner wrote:

On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:

On 09/05/2018 05:53 PM, Thomas Gleixner wrote:

And looking closer this is a general issue. Just that the TEARDOWN state
makes it simple to observe. It's universaly broken, when the first teardown
callback fails because, st->state is only decremented _AFTER_ the callback
returns success, but undo_cpu_down() increments unconditionally.


As per my understanding, there are 2 problems here; one is fixed with your
patch, and other is cpuhp_reset_state() is used during rollback from non-AP to
AP state, which seem to result in 2 increments of st->state (one increment
done by cpuhp_reset_state() and another by cpu_thread_fun()) .

And how did your hack fix that up magically? I'll have a look later today.

Thanks,

tglx


The hack fixes it by not calling cpuhp_reset_state() and doing rollback 
state reset inline in  _cpu_down().





--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

2018-09-05 Thread Sudeep Holla
On Wed, Sep 05, 2018 at 02:23:46PM +0200, Thomas Gleixner wrote:
> On Wed, 5 Sep 2018, Thomas Gleixner wrote:
> > On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> > >   ret = cpuhp_down_callbacks(cpu, st, target);
> > >   if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> > > - cpuhp_reset_state(st, prev_state);
> > > + /*
> > > +  * As st->last is not set, cpuhp_reset_state() increments
> > > +  * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> > > +  * skipped during rollback. So, don't use it here.
> > > +  */
> > > + st->rollback = true;
> > > + st->target = prev_state;
> > > + st->bringup = !st->bringup;
> > 
> > No, this is just papering over the actual problem.
> > 
> > The state inconsistency happens in take_cpu_down() when it returns with a
> > failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> > and st->state is then incremented in undo_cpu_down().
> > 
> > That's the real issue and we need to analyze the whole cpu_down rollback
> > logic first.
> 
> And looking closer this is a general issue. Just that the TEARDOWN state
> makes it simple to observe. It's universaly broken, when the first teardown
> callback fails because, st->state is only decremented _AFTER_ the callback
> returns success, but undo_cpu_down() increments unconditionally.
> 
> Patch below.

This patch fixes the issue reported @[1]. Lorenzo did some debugging and
I wanted to have a look at it at some point but this discussion drew my
attention and sounded very similar[2]. So I did a quick test with this
patch and it fixes the issue.

--
Regards,
Sudeep

[1] 
https://lore.kernel.org/lkml/camuhmdvg868lgl5xtg5dp5rrekxoo+8fry+etjimxgwzcp+...@mail.gmail.com/
[2] https://lore.kernel.org/lkml/20180823131505.GA31558@red-moon/


Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

2018-09-05 Thread Thomas Gleixner
On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
> On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
> >
> > And looking closer this is a general issue. Just that the TEARDOWN state
> > makes it simple to observe. It's universaly broken, when the first teardown
> > callback fails because, st->state is only decremented _AFTER_ the callback
> > returns success, but undo_cpu_down() increments unconditionally.
> >
>
> As per my understanding, there are 2 problems here; one is fixed with your
> patch, and other is cpuhp_reset_state() is used during rollback from non-AP to
> AP state, which seem to result in 2 increments of st->state (one increment
> done by cpuhp_reset_state() and another by cpu_thread_fun()) .

And how did your hack fix that up magically? I'll have a look later today.

Thanks,

tglx


Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

2018-09-05 Thread Thomas Gleixner
On Wed, 5 Sep 2018, Mukesh Ojha wrote:
> On 9/5/2018 5:03 PM, Thomas Gleixner wrote:
> > > + st->rollback = true;
> > > + st->target = prev_state;
> > > + st->bringup = !st->bringup;
> > No, this is just papering over the actual problem.
> > 
> > The state inconsistency happens in take_cpu_down() when it returns with a
> > failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> > and st->state is then incremented in undo_cpu_down().
> > 
> > That's the real issue and we need to analyze the whole cpu_down rollback
> > logic first.
> 
> Could this be done like below ?

IOW, more papering over the real root cause.

> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index aa7fe85..47bce90 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -802,17 +802,18 @@ static int take_cpu_down(void *_param)
>     int err, cpu = smp_processor_id();
>     int ret;

  ^^^

Please fix your mailer or your editor. That patch is whitespace damaged.

Thanks,

tglx

Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

2018-09-05 Thread Neeraj Upadhyay




On 09/05/2018 05:53 PM, Thomas Gleixner wrote:

On Wed, 5 Sep 2018, Thomas Gleixner wrote:

On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:

ret = cpuhp_down_callbacks(cpu, st, target);
if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
-   cpuhp_reset_state(st, prev_state);
+   /*
+* As st->last is not set, cpuhp_reset_state() increments
+* st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
+* skipped during rollback. So, don't use it here.
+*/
+   st->rollback = true;
+   st->target = prev_state;
+   st->bringup = !st->bringup;

No, this is just papering over the actual problem.

The state inconsistency happens in take_cpu_down() when it returns with a
failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
and st->state is then incremented in undo_cpu_down().

That's the real issue and we need to analyze the whole cpu_down rollback
logic first.

And looking closer this is a general issue. Just that the TEARDOWN state
makes it simple to observe. It's universaly broken, when the first teardown
callback fails because, st->state is only decremented _AFTER_ the callback
returns success, but undo_cpu_down() increments unconditionally.

Patch below.

Thanks,

tglx
As per my understanding, there are 2 problems here; one is fixed with 
your patch, and other is cpuhp_reset_state() is used during rollback 
from non-AP to AP state, which seem to result in 2 increments of 
st->state (one increment done by cpuhp_reset_state() and another by 
cpu_thread_fun()) .


--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -916,7 +916,8 @@ static int cpuhp_down_callbacks(unsigned
ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
if (ret) {
st->target = prev_state;
-   undo_cpu_down(cpu, st);
+   if (st->state < prev_state)
+   undo_cpu_down(cpu, st);
break;
}
}
@@ -969,7 +970,7 @@ static int __ref _cpu_down(unsigned int
 * to do the further cleanups.
 */
ret = cpuhp_down_callbacks(cpu, st, target);
-   if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
+   if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) {
cpuhp_reset_state(st, prev_state);
__cpuhp_kick_ap(st);
}


--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

2018-09-05 Thread Thomas Gleixner
On Wed, 5 Sep 2018, Thomas Gleixner wrote:
> On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> > ret = cpuhp_down_callbacks(cpu, st, target);
> > if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> > -   cpuhp_reset_state(st, prev_state);
> > +   /*
> > +* As st->last is not set, cpuhp_reset_state() increments
> > +* st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> > +* skipped during rollback. So, don't use it here.
> > +*/
> > +   st->rollback = true;
> > +   st->target = prev_state;
> > +   st->bringup = !st->bringup;
> 
> No, this is just papering over the actual problem.
> 
> The state inconsistency happens in take_cpu_down() when it returns with a
> failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> and st->state is then incremented in undo_cpu_down().
> 
> That's the real issue and we need to analyze the whole cpu_down rollback
> logic first.

And looking closer this is a general issue. Just that the TEARDOWN state
makes it simple to observe. It's universaly broken, when the first teardown
callback fails because, st->state is only decremented _AFTER_ the callback
returns success, but undo_cpu_down() increments unconditionally.

Patch below.

Thanks,

tglx

--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -916,7 +916,8 @@ static int cpuhp_down_callbacks(unsigned
ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
if (ret) {
st->target = prev_state;
-   undo_cpu_down(cpu, st);
+   if (st->state < prev_state)
+   undo_cpu_down(cpu, st);
break;
}
}
@@ -969,7 +970,7 @@ static int __ref _cpu_down(unsigned int
 * to do the further cleanups.
 */
ret = cpuhp_down_callbacks(cpu, st, target);
-   if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
+   if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) {
cpuhp_reset_state(st, prev_state);
__cpuhp_kick_ap(st);
}


Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

2018-09-05 Thread Mukesh Ojha




On 9/5/2018 5:03 PM, Thomas Gleixner wrote:

On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:

If takedown_cpu() fails during _cpu_down(), st->state is reset,
by calling cpuhp_reset_state(). This results in an additional
increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
state being skipped during rollback. Fix this by not calling
cpuhp_reset_state() and doing the state reset directly in
_cpu_down().

Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
Signed-off-by: Neeraj Upadhyay 
---
  kernel/cpu.c | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index aa7fe85..9f49edb 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int 
tasks_frozen,
 */
ret = cpuhp_down_callbacks(cpu, st, target);
if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
-   cpuhp_reset_state(st, prev_state);
+   /*
+* As st->last is not set, cpuhp_reset_state() increments
+* st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
+* skipped during rollback. So, don't use it here.
+*/
+   st->rollback = true;
+   st->target = prev_state;
+   st->bringup = !st->bringup;

No, this is just papering over the actual problem.

The state inconsistency happens in take_cpu_down() when it returns with a
failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
and st->state is then incremented in undo_cpu_down().

That's the real issue and we need to analyze the whole cpu_down rollback
logic first.


Could this be done like below ?

diff --git a/kernel/cpu.c b/kernel/cpu.c
index aa7fe85..47bce90 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -802,17 +802,18 @@ static int take_cpu_down(void *_param)
    int err, cpu = smp_processor_id();
    int ret;

-   /* Ensure this CPU doesn't handle any more interrupts. */
-   err = __cpu_disable();
-   if (err < 0)
-   return err;
-
    /*
 * We get here while we are in CPUHP_TEARDOWN_CPU state and we 
must not

 * do this step again.
 */
    WARN_ON(st->state != CPUHP_TEARDOWN_CPU);
    st->state--;
+
+   /* Ensure this CPU doesn't handle any more interrupts. */
+   err = __cpu_disable();
+   if (err < 0)
+   return err;
+
    /* Invoke the former CPU_DYING callbacks */

Thanks,
Mukesh











Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

2018-09-05 Thread Thomas Gleixner
On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> If takedown_cpu() fails during _cpu_down(), st->state is reset,
> by calling cpuhp_reset_state(). This results in an additional
> increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
> state being skipped during rollback. Fix this by not calling
> cpuhp_reset_state() and doing the state reset directly in
> _cpu_down().
> 
> Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
> Signed-off-by: Neeraj Upadhyay 
> ---
>  kernel/cpu.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index aa7fe85..9f49edb 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int 
> tasks_frozen,
>*/
>   ret = cpuhp_down_callbacks(cpu, st, target);
>   if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> - cpuhp_reset_state(st, prev_state);
> + /*
> +  * As st->last is not set, cpuhp_reset_state() increments
> +  * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> +  * skipped during rollback. So, don't use it here.
> +  */
> + st->rollback = true;
> + st->target = prev_state;
> + st->bringup = !st->bringup;

No, this is just papering over the actual problem.

The state inconsistency happens in take_cpu_down() when it returns with a
failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
and st->state is then incremented in undo_cpu_down().

That's the real issue and we need to analyze the whole cpu_down rollback
logic first.

Thanks,

tglx

























Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

2018-09-04 Thread Mukesh Ojha




On 9/4/2018 12:03 PM, Neeraj Upadhyay wrote:

If takedown_cpu() fails during _cpu_down(), st->state is reset,
by calling cpuhp_reset_state(). This results in an additional
increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
state being skipped during rollback. Fix this by not calling
cpuhp_reset_state() and doing the state reset directly in
_cpu_down().

Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
Signed-off-by: Neeraj Upadhyay 
---
  kernel/cpu.c | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index aa7fe85..9f49edb 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int 
tasks_frozen,
 */
ret = cpuhp_down_callbacks(cpu, st, target);
if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
-   cpuhp_reset_state(st, prev_state);
+   /*
+* As st->last is not set, cpuhp_reset_state() increments
+* st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
+* skipped during rollback. So, don't use it here.
+*/
+   st->rollback = true;
+   st->target = prev_state;
+   st->bringup = !st->bringup;


Acked-by: Mukesh Ojha 

looks valid for this case.


__cpuhp_kick_ap(st);
}