Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-16 Thread Ingo Molnar

* Robin Holt  wrote:

> I have the patches sort-of finished.  The patch set starts by
> moving the halt/shutdown/reboot functions over to a new
> kernel/reboot.c, next patch gets a checkpatch.pl cleanup to
> work, third patch is essentially the below patch against the
> new file, and the fourth patch introduces a kernel boot parameter.
> 
> That said, I don't like them because of the 'stable' marking for
> these patches.  I think I am going to submit them with the
> existing patch first in the series, then introduce the kernel parameter,
> then move them to kernel/reboot.c, and finally pass checkpatch.pl.
> 
> Does that sound alright?

Yeah, that ordering sounds right.

If there are no objections from others I'll first apply the first patch (with a 
-stable tag), test it for a day, then apply the rest.

Even patch #1 probably won't make it for v3.9-final [there are too many potential 
downsides IMHO], but this could be one of the cases where marking a patch for 
-stable and merging it in the merge window is legit.

Thanks,

Ingo


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-15 Thread Oleg Nesterov
On 04/15, Robin Holt wrote:
>
> On Sat, Apr 13, 2013 at 06:30:22PM +0200, Oleg Nesterov wrote:
> > On 04/12, Robin Holt wrote:
> > >
> > > +void migrate_to_boot_cpu(void)
> > > +{
> > > + /* The boot cpu is always logical cpu 0 */
> > > + int reboot_cpu_id = 0;
> > > +
> > > + /* Make certain the cpu I'm about to reboot on is online */
> > > + if (!cpu_online(reboot_cpu_id))
> > > + reboot_cpu_id = smp_processor_id();
> > > +
> > > + /* Make certain I only run on the appropriate processor */
> > > + set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));
> >
> > This is only theoretical, but perhaps it makes sense to set
> > PF_THREAD_BOUND before set_cpus_allowed_ptr() ? To prevent the
> > race with another thread doing sched_setaffinity().
>
> I don't quite understand this comment.  We are migrating our own thread.
> How does setting PF_THREAD_BOUND have any effect?

Suppose that another thread does, say, sys_sched_setaffinity(our_pid)
right after set_cpus_allowed_ptr(). If we set PF_THREAD_BOUND the
next set_cpus_allowed() will fail.
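
A sketch of the idea, assuming set_cpus_allowed_ptr() keeps permitting
the change when p == current even with the flag set (PF_THREAD_BOUND was
later renamed PF_NO_SETAFFINITY):

	/* Sketch only: mark the task before pinning it, so a concurrent
	 * sched_setaffinity() from another thread fails with -EINVAL
	 * instead of racing with the migration below. */
	current->flags |= PF_THREAD_BOUND;

	/* Make certain I only run on the appropriate processor */
	set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));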

Oleg.



Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-15 Thread Robin Holt
On Mon, Apr 15, 2013 at 11:04:08AM -0500, Robin Holt wrote:
> On Sat, Apr 13, 2013 at 06:30:22PM +0200, Oleg Nesterov wrote:
> > On 04/12, Robin Holt wrote:
> > >
> > > +void migrate_to_boot_cpu(void)
> > > +{
> > > + /* The boot cpu is always logical cpu 0 */
> > > + int reboot_cpu_id = 0;
> > > +
> > > + /* Make certain the cpu I'm about to reboot on is online */
> > > + if (!cpu_online(reboot_cpu_id))
> > > + reboot_cpu_id = smp_processor_id();
> > > +
> > > + /* Make certain I only run on the appropriate processor */
> > > + set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));
> > 
> > This is only theoretical, but perhaps it makes sense to set
> > PF_THREAD_BOUND before set_cpus_allowed_ptr() ? To prevent the
> > race with another thread doing sched_setaffinity().
> 
> I don't quite understand this comment.  We are migrating our own thread.
> How does setting PF_THREAD_BOUND have any effect?

I should have taken a few more minutes.  I understand now.  Will
do.

Robin


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-15 Thread Robin Holt
On Sat, Apr 13, 2013 at 06:30:22PM +0200, Oleg Nesterov wrote:
> On 04/12, Robin Holt wrote:
> >
> > +void migrate_to_boot_cpu(void)
> > +{
> > +   /* The boot cpu is always logical cpu 0 */
> > +   int reboot_cpu_id = 0;
> > +
> > +   /* Make certain the cpu I'm about to reboot on is online */
> > +   if (!cpu_online(reboot_cpu_id))
> > +   reboot_cpu_id = smp_processor_id();
> > +
> > +   /* Make certain I only run on the appropriate processor */
> > +   set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));
> 
> This is only theoretical, but perhaps it makes sense to set
> PF_THREAD_BOUND before set_cpus_allowed_ptr() ? To prevent the
> race with another thread doing sched_setaffinity().

I don't quite understand this comment.  We are migrating our own thread.
How does setting PF_THREAD_BOUND have any effect?

Robin


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-15 Thread Robin Holt
I have the patches sort-of finished.  The patch set starts by
moving the halt/shutdown/reboot functions over to a new
kernel/reboot.c, next patch gets a checkpatch.pl cleanup to
work, third patch is essentially the below patch against the
new file, and the fourth patch introduces a kernel boot parameter.

That said, I don't like them because of the 'stable' marking for
these patches.  I think I am going to submit them with the
existing patch first in the series, then introduce the kernel parameter,
then move them to kernel/reboot.c, and finally pass checkpatch.pl.

Does that sound alright?

Robin

On Mon, Apr 15, 2013 at 07:02:16AM -0500, Robin Holt wrote:
> On Mon, Apr 15, 2013 at 12:16:44PM +0200, Ingo Molnar wrote:
> > 
> > * Robin Holt  wrote:
> > 
> > > From 1767003c943325e52ac78cac6fdbaf2ab63d Mon Sep 17 00:00:00 2001
> > > From: Robin Holt 
> > > Date: Wed, 3 Apr 2013 13:52:00 -0500
> > > Subject: [PATCH] Migrate shutdown/reboot to boot cpu.
> > > 
> > > We recently noticed that reboot of a 1024 cpu machine takes approx 16
> > > minutes of just stopping the cpus.  The slowdown was tracked to commit
> > > f96972f.
> > > 
> > > The current implementation does all the work of hot removing the cpus
> > > before halting the system.  We are switching to just migrating to the
> > > boot cpu and then continuing with shutdown/reboot.
> > > 
> > > This also has the effect of not breaking x86's command line parameter for
> > > specifying the reboot cpu.  Note, this code was shamelessly copied from
> > > arch/x86/kernel/reboot.c with bits removed pertaining to the reboot_cpu
> > > command line parameter.
> > > 
> > > Signed-off-by: Robin Holt 
> > > To: Shawn Guo 
> > > To: Ingo Molnar 
> > > To: Russ Anderson 
> > > Cc: Andrew Morton 
> > > Cc: "H. Peter Anvin" 
> > > Cc: Lai Jiangshan 
> > > Cc: Linus Torvalds 
> > > Cc: Linux Kernel Mailing List 
> > > Cc: Michel Lespinasse 
> > > Cc: Oleg Nesterov 
> > > Cc: "Paul E. McKenney" 
> > > Cc: Paul Mackerras 
> > > Cc: Peter Zijlstra 
> > > Cc: Robin Holt 
> > > Cc: "ru...@rustcorp.com.au" 
> > > Cc: Tejun Heo 
> > > Cc: the arch/x86 maintainers 
> > > Cc: Thomas Gleixner 
> > > Cc: 
> > > ---
> > >  kernel/sys.c | 17 +++--
> > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/sys.c b/kernel/sys.c
> > > index 0da73cf..4d1047d 100644
> > > --- a/kernel/sys.c
> > > +++ b/kernel/sys.c
> > > @@ -357,6 +357,19 @@ int unregister_reboot_notifier(struct notifier_block 
> > > *nb)
> > >  }
> > >  EXPORT_SYMBOL(unregister_reboot_notifier);
> > >  
> > > +void migrate_to_boot_cpu(void)
> > > +{
> > > + /* The boot cpu is always logical cpu 0 */
> > > + int reboot_cpu_id = 0;
> > > +
> > > + /* Make certain the cpu I'm about to reboot on is online */
> > > + if (!cpu_online(reboot_cpu_id))
> > > + reboot_cpu_id = smp_processor_id();
> > > +
> > > + /* Make certain I only run on the appropriate processor */
> > > + set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));
> > > +}
> > 
> > I guess you suspect what I'm going to suggest next? :-)
> > 
> > While I think something like this commit would be acceptable as a minimal 
> > regression fix, it would be really lovely to add a second patch as well, 
> > which 
> > removes the same code from 
> > arch/x86/kernel/reboot.c:native_machine_shutdown() and 
> > merged it with the kernel/sys.c version? That way all platforms gained a 
> > reboot_cpu command line, and we'd have less code duplication as well. 
> > Win-win.
> > 
> > ( While at it, it might make sense to move the reboot/shutdown related bits 
> > from 
> >   kernel/sys.c to a new kernel/shutdown.c or kernel/reboot.c file or so. )
> 
> That all sounds good.  I will work on that this morning.
> 
> Robin


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-15 Thread Robin Holt
On Mon, Apr 15, 2013 at 12:16:44PM +0200, Ingo Molnar wrote:
> 
> * Robin Holt  wrote:
> 
> > From 1767003c943325e52ac78cac6fdbaf2ab63d Mon Sep 17 00:00:00 2001
> > From: Robin Holt 
> > Date: Wed, 3 Apr 2013 13:52:00 -0500
> > Subject: [PATCH] Migrate shutdown/reboot to boot cpu.
> > 
> > We recently noticed that reboot of a 1024 cpu machine takes approx 16
> > minutes of just stopping the cpus.  The slowdown was tracked to commit
> > f96972f.
> > 
> > The current implementation does all the work of hot removing the cpus
> > before halting the system.  We are switching to just migrating to the
> > boot cpu and then continuing with shutdown/reboot.
> > 
> > This also has the effect of not breaking x86's command line parameter for
> > specifying the reboot cpu.  Note, this code was shamelessly copied from
> > arch/x86/kernel/reboot.c with bits removed pertaining to the reboot_cpu
> > command line parameter.
> > 
> > Signed-off-by: Robin Holt 
> > To: Shawn Guo 
> > To: Ingo Molnar 
> > To: Russ Anderson 
> > Cc: Andrew Morton 
> > Cc: "H. Peter Anvin" 
> > Cc: Lai Jiangshan 
> > Cc: Linus Torvalds 
> > Cc: Linux Kernel Mailing List 
> > Cc: Michel Lespinasse 
> > Cc: Oleg Nesterov 
> > Cc: "Paul E. McKenney" 
> > Cc: Paul Mackerras 
> > Cc: Peter Zijlstra 
> > Cc: Robin Holt 
> > Cc: "ru...@rustcorp.com.au" 
> > Cc: Tejun Heo 
> > Cc: the arch/x86 maintainers 
> > Cc: Thomas Gleixner 
> > Cc: 
> > ---
> >  kernel/sys.c | 17 +++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 0da73cf..4d1047d 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -357,6 +357,19 @@ int unregister_reboot_notifier(struct notifier_block 
> > *nb)
> >  }
> >  EXPORT_SYMBOL(unregister_reboot_notifier);
> >  
> > +void migrate_to_boot_cpu(void)
> > +{
> > +   /* The boot cpu is always logical cpu 0 */
> > +   int reboot_cpu_id = 0;
> > +
> > +   /* Make certain the cpu I'm about to reboot on is online */
> > +   if (!cpu_online(reboot_cpu_id))
> > +   reboot_cpu_id = smp_processor_id();
> > +
> > +   /* Make certain I only run on the appropriate processor */
> > +   set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));
> > +}
> 
> I guess you suspect what I'm going to suggest next? :-)
> 
> While I think something like this commit would be acceptable as a minimal 
> regression fix, it would be really lovely to add a second patch as well, 
> which 
> removes the same code from arch/x86/kernel/reboot.c:native_machine_shutdown() 
> and 
> merged it with the kernel/sys.c version? That way all platforms gained a 
> reboot_cpu command line, and we'd have less code duplication as well. Win-win.
> 
> ( While at it, it might make sense to move the reboot/shutdown related bits 
> from 
>   kernel/sys.c to a new kernel/shutdown.c or kernel/reboot.c file or so. )

That all sounds good.  I will work on that this morning.

Robin


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-15 Thread Ingo Molnar

* Robin Holt  wrote:

> From 1767003c943325e52ac78cac6fdbaf2ab63d Mon Sep 17 00:00:00 2001
> From: Robin Holt 
> Date: Wed, 3 Apr 2013 13:52:00 -0500
> Subject: [PATCH] Migrate shutdown/reboot to boot cpu.
> 
> We recently noticed that reboot of a 1024 cpu machine takes approx 16
> minutes of just stopping the cpus.  The slowdown was tracked to commit
> f96972f.
> 
> The current implementation does all the work of hot removing the cpus
> before halting the system.  We are switching to just migrating to the
> boot cpu and then continuing with shutdown/reboot.
> 
> This also has the effect of not breaking x86's command line parameter for
> specifying the reboot cpu.  Note, this code was shamelessly copied from
> arch/x86/kernel/reboot.c with bits removed pertaining to the reboot_cpu
> command line parameter.
> 
> Signed-off-by: Robin Holt 
> To: Shawn Guo 
> To: Ingo Molnar 
> To: Russ Anderson 
> Cc: Andrew Morton 
> Cc: "H. Peter Anvin" 
> Cc: Lai Jiangshan 
> Cc: Linus Torvalds 
> Cc: Linux Kernel Mailing List 
> Cc: Michel Lespinasse 
> Cc: Oleg Nesterov 
> Cc: "Paul E. McKenney" 
> Cc: Paul Mackerras 
> Cc: Peter Zijlstra 
> Cc: Robin Holt 
> Cc: "ru...@rustcorp.com.au" 
> Cc: Tejun Heo 
> Cc: the arch/x86 maintainers 
> Cc: Thomas Gleixner 
> Cc: 
> ---
>  kernel/sys.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 0da73cf..4d1047d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -357,6 +357,19 @@ int unregister_reboot_notifier(struct notifier_block *nb)
>  }
>  EXPORT_SYMBOL(unregister_reboot_notifier);
>  
> +void migrate_to_boot_cpu(void)
> +{
> + /* The boot cpu is always logical cpu 0 */
> + int reboot_cpu_id = 0;
> +
> + /* Make certain the cpu I'm about to reboot on is online */
> + if (!cpu_online(reboot_cpu_id))
> + reboot_cpu_id = smp_processor_id();
> +
> + /* Make certain I only run on the appropriate processor */
> + set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));
> +}

I guess you suspect what I'm going to suggest next? :-)

While I think something like this commit would be acceptable as a minimal 
regression fix, it would be really lovely to add a second patch as well, which 
removes the same code from arch/x86/kernel/reboot.c:native_machine_shutdown() 
and 
merged it with the kernel/sys.c version? That way all platforms gained a 
reboot_cpu command line, and we'd have less code duplication as well. Win-win.

( While at it, it might make sense to move the reboot/shutdown related bits 
from 
  kernel/sys.c to a new kernel/shutdown.c or kernel/reboot.c file or so. )
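
A rough sketch of the merged, generic version (hypothetical; the
reboot_cpu value would be filled in by the "reboot_cpu=" parsing moved
over from arch/x86/kernel/reboot.c):

	/* kernel/reboot.c (sketch): one implementation for all platforms */
	static unsigned int reboot_cpu;	/* 0 unless "reboot_cpu=" overrides */

	void migrate_to_reboot_cpu(void)
	{
		unsigned int cpu = reboot_cpu;

		/* Fall back to any online cpu if the requested one is gone */
		if (!cpu_online(cpu))
			cpu = cpumask_first(cpu_online_mask);

		set_cpus_allowed_ptr(current, cpumask_of(cpu));
	}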

Thanks,

Ingo


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-13 Thread Srivatsa S. Bhat
On 04/12/2013 03:01 PM, Robin Holt wrote:
>  kernel/sys.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 0da73cf..4d1047d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -357,6 +357,19 @@ int unregister_reboot_notifier(struct notifier_block *nb)
>  }
>  EXPORT_SYMBOL(unregister_reboot_notifier);
> 
> +void migrate_to_boot_cpu(void)
> +{
> + /* The boot cpu is always logical cpu 0 */
> + int reboot_cpu_id = 0;
> +
> + /* Make certain the cpu I'm about to reboot on is online */
> + if (!cpu_online(reboot_cpu_id))
> + reboot_cpu_id = smp_processor_id();
> +

If CPU 0 is offline, there is no point in binding, right?

[Fenghua (in CC) added the support to offline CPU0 on x86 Intel platforms.
So it's possible that CPU0 is offline when you try a reboot.]
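
A sketch of that variant (hypothetical, same helper as the patch): skip
the bind entirely when the boot cpu is gone, rather than binding to
whatever cpu we happen to be running on:

	void migrate_to_boot_cpu(void)
	{
		/* CPU0 may have been hot-removed on x86; binding to the
		 * current cpu would then buy us nothing. */
		if (!cpu_online(0))
			return;

		/* Make certain I only run on the boot processor */
		set_cpus_allowed_ptr(current, cpumask_of(0));
	}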

> + /* Make certain I only run on the appropriate processor */
> + set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));
> +}
> +
>  /**
>   *   kernel_restart - reboot the system
>   *   @cmd: pointer to buffer containing command to execute for restart
> @@ -368,7 +381,7 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
>  void kernel_restart(char *cmd)
>  {
>   kernel_restart_prepare(cmd);
> - disable_nonboot_cpus();
> + migrate_to_boot_cpu();
>   syscore_shutdown();
>   if (!cmd)
>   printk(KERN_EMERG "Restarting system.\n");
> @@ -414,7 +427,7 @@ void kernel_power_off(void)
>   kernel_shutdown_prepare(SYSTEM_POWER_OFF);
>   if (pm_power_off_prepare)
>   pm_power_off_prepare();
> - disable_nonboot_cpus();
> + migrate_to_boot_cpu();

Okay, so you are touching poweroff also. Restart was only recently altered
by Shawn, so we can assume that his fix was necessary only for his platform.
However, for poweroff, I see the commit below in the git log, which added
the disable_nonboot_cpus() call.

commit 4047727e5ae33f9b8d2b7766d1994ea6e5ec2991
Author: Mark Lord 
Date:   Mon Oct 1 01:20:10 2007 -0700

Fix SMP poweroff hangs

It's an old commit, so perhaps the issue no longer applies, but I thought
I should bring it to your attention, just in case.

>   syscore_shutdown();
>   printk(KERN_EMERG "Power down.\n");
>   kmsg_dump(KMSG_DUMP_POWEROFF);
> 

Regards,
Srivatsa S. Bhat



Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-13 Thread Oleg Nesterov
On 04/12, Robin Holt wrote:
>
> +void migrate_to_boot_cpu(void)
> +{
> + /* The boot cpu is always logical cpu 0 */
> + int reboot_cpu_id = 0;
> +
> + /* Make certain the cpu I'm about to reboot on is online */
> + if (!cpu_online(reboot_cpu_id))
> + reboot_cpu_id = smp_processor_id();
> +
> + /* Make certain I only run on the appropriate processor */
> + set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));

This is only theoretical, but perhaps it makes sense to set
PF_THREAD_BOUND before set_cpus_allowed_ptr() ? To prevent the
race with another thread doing sched_setaffinity().

Oleg.



Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-12 Thread Robin Holt
Meant to send this to Shawn.  Too early in the morning.

Robin

On Fri, Apr 12, 2013 at 04:31:49AM -0500, Robin Holt wrote:
> On Fri, Apr 12, 2013 at 11:39:51AM +0530, Srivatsa S. Bhat wrote:
> > On 04/12/2013 11:07 AM, Ingo Molnar wrote:
> > > 
> > > * Robin Holt  wrote:
> > > 
> > >> For the v3.9 release, can we consider my awful patch?
> > > 
> > > How about trying what I suggested, to make reboot affine to the boot CPU 
> > > explicitly, not by shutting down all the other CPUs, but by 
> > > set_cpus_allowed() or 
> > > so?
> > > 
> > 
> > I agree, that sounds like the right thing to do for 3.9. Of course, it 
> > would be
> > nice if Shawn could verify that doing that doesn't break his platform due to
> > some unknown corner case.
> > 
> > > That should solve the regression, without the ugly special-casing - while 
> > > giving 
> > > time to address the hot-unplug performance bottleneck.
> > > 
> > > Once that is done disable_nonboot_cpus() can be used again for reboot.
> > > 
> > > (But no strong feelings either way - both solutions are a workaround in a 
> > > sense.)
> 
> 
> From 1767003c943325e52ac78cac6fdbaf2ab63d Mon Sep 17 00:00:00 2001
> From: Robin Holt 
> Date: Wed, 3 Apr 2013 13:52:00 -0500
> Subject: [PATCH] Migrate shutdown/reboot to boot cpu.
> 
> We recently noticed that reboot of a 1024 cpu machine takes approx 16
> minutes of just stopping the cpus.  The slowdown was tracked to commit
> f96972f.
> 
> The current implementation does all the work of hot removing the cpus
> before halting the system.  We are switching to just migrating to the
> boot cpu and then continuing with shutdown/reboot.
> 
> This also has the effect of not breaking x86's command line parameter for
> specifying the reboot cpu.  Note, this code was shamelessly copied from
> arch/x86/kernel/reboot.c with bits removed pertaining to the reboot_cpu
> command line parameter.
> 
> Signed-off-by: Robin Holt 
> To: Shawn Guo 
> To: Ingo Molnar 
> To: Russ Anderson 
> Cc: Andrew Morton 
> Cc: "H. Peter Anvin" 
> Cc: Lai Jiangshan 
> Cc: Linus Torvalds 
> Cc: Linux Kernel Mailing List 
> Cc: Michel Lespinasse 
> Cc: Oleg Nesterov 
> Cc: "Paul E. McKenney" 
> Cc: Paul Mackerras 
> Cc: Peter Zijlstra 
> Cc: Robin Holt 
> Cc: "ru...@rustcorp.com.au" 
> Cc: Tejun Heo 
> Cc: the arch/x86 maintainers 
> Cc: Thomas Gleixner 
> Cc: 
> ---
>  kernel/sys.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 0da73cf..4d1047d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -357,6 +357,19 @@ int unregister_reboot_notifier(struct notifier_block *nb)
>  }
>  EXPORT_SYMBOL(unregister_reboot_notifier);
>  
> +void migrate_to_boot_cpu(void)
> +{
> + /* The boot cpu is always logical cpu 0 */
> + int reboot_cpu_id = 0;
> +
> + /* Make certain the cpu I'm about to reboot on is online */
> + if (!cpu_online(reboot_cpu_id))
> + reboot_cpu_id = smp_processor_id();
> +
> + /* Make certain I only run on the appropriate processor */
> + set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));
> +}
> +
>  /**
>   *   kernel_restart - reboot the system
>   *   @cmd: pointer to buffer containing command to execute for restart
> @@ -368,7 +381,7 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
>  void kernel_restart(char *cmd)
>  {
>   kernel_restart_prepare(cmd);
> - disable_nonboot_cpus();
> + migrate_to_boot_cpu();
>   syscore_shutdown();
>   if (!cmd)
>   printk(KERN_EMERG "Restarting system.\n");
> @@ -414,7 +427,7 @@ void kernel_power_off(void)
>   kernel_shutdown_prepare(SYSTEM_POWER_OFF);
>   if (pm_power_off_prepare)
>   pm_power_off_prepare();
> - disable_nonboot_cpus();
> + migrate_to_boot_cpu();
>   syscore_shutdown();
>   printk(KERN_EMERG "Power down.\n");
>   kmsg_dump(KMSG_DUMP_POWEROFF);
> -- 
> 1.8.1.2


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-12 Thread Robin Holt
On Fri, Apr 12, 2013 at 11:39:51AM +0530, Srivatsa S. Bhat wrote:
> On 04/12/2013 11:07 AM, Ingo Molnar wrote:
> > 
> > * Robin Holt  wrote:
> > 
> >> For the v3.9 release, can we consider my awful patch?
> > 
> > How about trying what I suggested, to make reboot affine to the boot CPU 
> > explicitly, not by shutting down all the other CPUs, but by 
> > set_cpus_allowed() or 
> > so?
> > 
> 
> I agree, that sounds like the right thing to do for 3.9. Of course, it would 
> be
> nice if Shawn could verify that doing that doesn't break his platform due to
> some unknown corner case.
> 
> > That should solve the regression, without the ugly special-casing - while 
> > giving 
> > time to address the hot-unplug performance bottleneck.
> > 
> > Once that is done disable_nonboot_cpus() can be used again for reboot.
> > 
> > (But no strong feelings either way - both solutions are a workaround in a 
> > sense.)


From 1767003c943325e52ac78cac6fdbaf2ab63d Mon Sep 17 00:00:00 2001
From: Robin Holt 
Date: Wed, 3 Apr 2013 13:52:00 -0500
Subject: [PATCH] Migrate shutdown/reboot to boot cpu.

We recently noticed that reboot of a 1024 cpu machine takes approx 16
minutes of just stopping the cpus.  The slowdown was tracked to commit
f96972f.

The current implementation does all the work of hot removing the cpus
before halting the system.  We are switching to just migrating to the
boot cpu and then continuing with shutdown/reboot.

This also has the effect of not breaking x86's command line parameter for
specifying the reboot cpu.  Note, this code was shamelessly copied from
arch/x86/kernel/reboot.c with bits removed pertaining to the reboot_cpu
command line parameter.

Signed-off-by: Robin Holt 
To: Shawn Guo 
To: Ingo Molnar 
To: Russ Anderson 
Cc: Andrew Morton 
Cc: "H. Peter Anvin" 
Cc: Lai Jiangshan 
Cc: Linus Torvalds 
Cc: Linux Kernel Mailing List 
Cc: Michel Lespinasse 
Cc: Oleg Nesterov 
Cc: "Paul E. McKenney" 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Robin Holt 
Cc: "ru...@rustcorp.com.au" 
Cc: Tejun Heo 
Cc: the arch/x86 maintainers 
Cc: Thomas Gleixner 
Cc: 
---
 kernel/sys.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 0da73cf..4d1047d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -357,6 +357,19 @@ int unregister_reboot_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL(unregister_reboot_notifier);
 
+void migrate_to_boot_cpu(void)
+{
+   /* The boot cpu is always logical cpu 0 */
+   int reboot_cpu_id = 0;
+
+   /* Make certain the cpu I'm about to reboot on is online */
+   if (!cpu_online(reboot_cpu_id))
+   reboot_cpu_id = smp_processor_id();
+
+   /* Make certain I only run on the appropriate processor */
+   set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));
+}
+
 /**
  * kernel_restart - reboot the system
  * @cmd: pointer to buffer containing command to execute for restart
@@ -368,7 +381,7 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
 void kernel_restart(char *cmd)
 {
kernel_restart_prepare(cmd);
-   disable_nonboot_cpus();
+   migrate_to_boot_cpu();
syscore_shutdown();
if (!cmd)
printk(KERN_EMERG "Restarting system.\n");
@@ -414,7 +427,7 @@ void kernel_power_off(void)
kernel_shutdown_prepare(SYSTEM_POWER_OFF);
if (pm_power_off_prepare)
pm_power_off_prepare();
-   disable_nonboot_cpus();
+   migrate_to_boot_cpu();
syscore_shutdown();
printk(KERN_EMERG "Power down.\n");
kmsg_dump(KMSG_DUMP_POWEROFF);
-- 
1.8.1.2


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-12 Thread Srivatsa S. Bhat
On 04/12/2013 11:07 AM, Ingo Molnar wrote:
> 
> * Robin Holt  wrote:
> 
>> For the v3.9 release, can we consider my awful patch?
> 
> How about trying what I suggested, to make reboot affine to the boot CPU 
> explicitly, not by shutting down all the other CPUs, but by 
> set_cpus_allowed() or 
> so?
> 

I agree, that sounds like the right thing to do for 3.9. Of course, it would be
nice if Shawn could verify that doing that doesn't break his platform due to
some unknown corner case.

> That should solve the regression, without the ugly special-casing - while 
> giving 
> time to address the hot-unplug performance bottleneck.
> 
> Once that is done disable_nonboot_cpus() can be used again for reboot.
> 
> (But no strong feelings either way - both solutions are a workaround in a 
> sense.)
> 
 
Regards,
Srivatsa S. Bhat



Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Ingo Molnar

* Robin Holt  wrote:

> For the v3.9 release, can we consider my awful patch?

How about trying what I suggested, to make reboot affine to the boot CPU
explicitly, not by shutting down all the other CPUs, but by
set_cpus_allowed() or so?

That should solve the regression, without the ugly special-casing - while
giving time to address the hot-unplug performance bottleneck.

Once that is done disable_nonboot_cpus() can be used again for reboot.

(But no strong feelings either way - both solutions are a workaround in a 
sense.)

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Robin Holt
On Thu, Apr 11, 2013 at 03:08:20PM -0500, Russ Anderson wrote:
> On Thu, Apr 11, 2013 at 08:15:27PM +0530, Srivatsa S. Bhat wrote:
> > On 04/11/2013 07:53 PM, Russ Anderson wrote:
> > > On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:
> > >>
> > >> One more thing we have to note is that, there are 4 notifiers for taking 
> > >> a
> > >> CPU offline:
> > >>
> > >> CPU_DOWN_PREPARE
> > >> CPU_DYING
> > >> CPU_DEAD
> > >> CPU_POST_DEAD
> > >>
> > >> The first can be run in parallel as mentioned above. The second is run in
> > >> parallel in the stop_machine() phase as shown in Russ' patch. But the 
> > >> third
> > >> and fourth set of notifications all end up running only on CPU0, which 
> > >> will
> > >> again slow down things.
> > > 
> > > In my testing the third and fourth set were a small part of the overall
> > > time.  Less than 10%, with cpu notifiers 90+% of the time.
> > 
> > *All* of them are cpu notifiers! All of them invoke __cpu_notify() 
> > internally.
> > So how did you differentiate between them and find out that the third and
> > fourth sets take less time?
> 
> I reran a test on a 1024 cpu system, using my test patch to only call
> __stop_machine() once.  Added printks to show the kernel timestamp
> at various points.
> 
> When calling disable_nonboot_cpus() and enable_nonboot_cpus() just after
> booting the system:
>  The loop calling __cpu_notify(CPU_DOWN_PREPARE) took 376.6 seconds.
>  The loop calling cpu_notify_nofail(CPU_DEAD) took 8.1 seconds.
> 
> My guess is that notifiers do more work in the CPU_DOWN_PREPARE case.
> 
> I also added a loop calling a new notifier (CPU_TEST) which none of
> notifiers would recognize, to measure the time it took to spin through
> the call chain without the notifiers doing any work.  It took
> 0.0067 seconds.
> 
> On the actual reboot, as the system was shutting down:
>  The loop calling __cpu_notify(CPU_DOWN_PREPARE) took 333.8 seconds.
>  The loop calling cpu_notify_nofail(CPU_DEAD) took 2.7 seconds.

How about if you take the notifier_call_chain function, copy it to
kernel/sys.c, and time each notifier_call() callout individually?
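
A minimal sketch of what that instrumentation could look like - a trimmed
private copy of notifier_call_chain() with a ktime_get() pair around each
callout (the function name and message format are illustrative, not from
an actual patch):

static int timed_notifier_call_chain(struct notifier_block **nl,
				     unsigned long val, void *v)
{
	struct notifier_block *nb = rcu_dereference_raw(*nl);
	int ret = NOTIFY_DONE;

	while (nb) {
		ktime_t t0 = ktime_get();

		ret = nb->notifier_call(nb, val, v);
		pr_info("notifier %pf took %lld us\n",
			nb->notifier_call,
			ktime_to_us(ktime_sub(ktime_get(), t0)));
		if ((ret & NOTIFY_STOP_MASK) == NOTIFY_STOP_MASK)
			break;
		nb = rcu_dereference_raw(nb->next);
	}
	return ret;
}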

Robin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Srivatsa S. Bhat
On 04/12/2013 01:38 AM, Russ Anderson wrote:
> On Thu, Apr 11, 2013 at 08:15:27PM +0530, Srivatsa S. Bhat wrote:
>> On 04/11/2013 07:53 PM, Russ Anderson wrote:
>>> On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:

 One more thing we have to note is that, there are 4 notifiers for taking a
 CPU offline:

 CPU_DOWN_PREPARE
 CPU_DYING
 CPU_DEAD
 CPU_POST_DEAD

 The first can be run in parallel as mentioned above. The second is run in
 parallel in the stop_machine() phase as shown in Russ' patch. But the third
 and fourth set of notifications all end up running only on CPU0, which will
 again slow down things.
>>>
>>> In my testing the third and fourth set were a small part of the overall
>>> time.  Less than 10%, with cpu notifiers 90+% of the time.
>>
>> *All* of them are cpu notifiers! All of them invoke __cpu_notify() 
>> internally.
>> So how did you differentiate between them and find out that the third and
>> fourth sets take less time?
> 
> I reran a test on a 1024 cpu system, using my test patch to only call
> __stop_machine() once.  Added printks to show the kernel timestamp
> at various points.
> 
> When calling disable_nonboot_cpus() and enable_nonboot_cpus() just after
> booting the system:
>  The loop calling __cpu_notify(CPU_DOWN_PREPARE) took 376.6 seconds.
>  The loop calling cpu_notify_nofail(CPU_DEAD) took 8.1 seconds.
> 
> My guess is that notifiers do more work in the CPU_DOWN_PREPARE case.
> 
> I also added a loop calling a new notifier (CPU_TEST) which none of
> notifiers would recognize, to measure the time it took to spin through
> the call chain without the notifiers doing any work.  It took
> 0.0067 seconds.
> 
> On the actual reboot, as the system was shutting down:
>  The loop calling __cpu_notify(CPU_DOWN_PREPARE) took 333.8 seconds.
>  The loop calling cpu_notify_nofail(CPU_DEAD) took 2.7 seconds.
> 
> I don't know how many notifiers are on the chain, or if there is
> one heavy hitter accounting for much of the time in the
> CPU_DOWN_PREPARE case.
> 
> 
> FWIW, the overall cpu stop times are somewhat longer than what I
> measured before.  Not sure if the difference is due to changes in
> my test patch, other kernel changes pulled in, or some difference
> on the test system.
> 
> 

Thanks a lot for reporting the time taken at each stage. It's extremely
useful. So, we can drop the idea of taking CPUs down in multiple rounds
like 512, 256 etc. And, like you mentioned earlier, just running the
CPU_DOWN_PREPARE notifiers in parallel should give us all the performance
improvement. Or perhaps, we can instrument the code in kernel/notifier.c
(notifier_call_chain) to find out if there is a rogue notifier which
contributes most to the ~300 seconds.

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Russ Anderson
On Thu, Apr 11, 2013 at 08:15:27PM +0530, Srivatsa S. Bhat wrote:
> On 04/11/2013 07:53 PM, Russ Anderson wrote:
> > On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:
> >>
> >> One more thing we have to note is that, there are 4 notifiers for taking a
> >> CPU offline:
> >>
> >> CPU_DOWN_PREPARE
> >> CPU_DYING
> >> CPU_DEAD
> >> CPU_POST_DEAD
> >>
> >> The first can be run in parallel as mentioned above. The second is run in
> >> parallel in the stop_machine() phase as shown in Russ' patch. But the third
> >> and fourth set of notifications all end up running only on CPU0, which will
> >> again slow down things.
> > 
> > In my testing the third and fourth set were a small part of the overall
> > time.  Less than 10%, with cpu notifiers 90+% of the time.
> 
> *All* of them are cpu notifiers! All of them invoke __cpu_notify() internally.
> So how did you differentiate between them and find out that the third and
> fourth sets take less time?

I reran a test on a 1024 cpu system, using my test patch to only call
__stop_machine() once.  Added printks to show the kernel timestamp
at various points.

When calling disable_nonboot_cpus() and enable_nonboot_cpus() just after
booting the system:
 The loop calling __cpu_notify(CPU_DOWN_PREPARE) took 376.6 seconds.
 The loop calling cpu_notify_nofail(CPU_DEAD) took 8.1 seconds.

My guess is that notifiers do more work in the CPU_DOWN_PREPARE case.

I also added a loop calling a new notifier (CPU_TEST) which none of
notifiers would recognize, to measure the time it took to spin through
the call chain without the notifiers doing any work.  It took
0.0067 seconds.
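
For reference, a sketch of what such a no-op baseline can look like.
CPU_TEST is an invented action value that no registered callback matches,
and this assumes it lives in kernel/cpu.c where __cpu_notify() is visible;
it is not Russ's actual test patch:

#define CPU_TEST	0x00ff	/* hypothetical: matched by no notifier */

static void time_empty_chain_walk(void)
{
	ktime_t t0 = ktime_get();
	int cpu;

	for_each_online_cpu(cpu)
		__cpu_notify(CPU_TEST, (void *)(long)cpu, -1, NULL);

	pr_info("empty chain walk took %lld us\n",
		ktime_to_us(ktime_sub(ktime_get(), t0)));
}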

On the actual reboot, as the system was shutting down:
 The loop calling __cpu_notify(CPU_DOWN_PREPARE) took 333.8 seconds.
 The loop calling cpu_notify_nofail(CPU_DEAD) took 2.7 seconds.

I don't know how many notifiers are on the chain, or if there is
one heavy hitter accounting for much of the time in the
CPU_DOWN_PREPARE case.


FWIW, the overall cpu stop times are somewhat longer than what I
measured before.  Not sure if the difference is due to changes in
my test patch, other kernel changes pulled in, or some difference
on the test system.


> >  So you may
> > not need the added complexity, or at least fix the cpu notifier part
> > first.
> > 
> 
> To make the 3rd and 4th run fast, the only thing we need to do is take CPUs
> offline in smaller steps, like 512, 256 etc.. It doesn't add any extra
> complexity over and above what is necessary to make the cpu notifiers run
> in parallel in the first place.
> 
> Regards,
> Srivatsa S. Bhat

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc  r...@sgi.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Srivatsa S. Bhat
On 04/11/2013 07:53 PM, Russ Anderson wrote:
> On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:
>> On 04/11/2013 11:01 AM, Paul Mackerras wrote:
>>> On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
 The optimal solution would be to just speed up the
 disable_nonboot_cpus() code so much that it isn't an issue. That would
 be good for suspending too, although I guess suspend isn't a big issue
 if you have a thousand CPU's.

 Has anybody checked whether we could do the cpu_down() on non-boot
 CPU's in parallel? Right now we serialize the thing completely, with
>>>
>>> I thought Srivatsa S. Bhat had a patchset that did exactly that.
>>> Srivatsa?
>>>
>>
>> Thanks for the CC, Paul! Adding some more people to CC.
>>
>> Actually, my patchset was about removing stop_machine() from the CPU
>> offline path.
>> http://lwn.net/Articles/538819/
> 
> I certainly agree with the intent.
> 

Thank you!

>> And here is the performance improvement I had measured in the version
>> prior to that:
>> http://article.gmane.org/gmane.linux.kernel/1435249
>>
>> I'm planning to revive this patchset after the 3.10 merge window closes,
>> because it depends on doing a tree-wide sweep, and I think it's a little
>> late to do it in time for the upcoming 3.10 merge window itself.
>>
>> Anyway, that's about removing stop_machine from CPU hotplug.
>>
>> Coming to bulk CPU hotplug, yes, I had ideas similar to what Russ suggested.
>> But I believe we can do more than that.
>>
>> As Russ pointed out, the notifiers are not thread-safe, so calling them
>> in parallel with different CPUs as arguments isn't going to work.
>>
>> So, first, we can convert all the CPU hotplug notifiers to take a cpumask
>> instead of a single CPU. So assuming that there are 'n' notifiers in total,
>> the number of function calls would become n, instead of n*1024.
>> But that itself most likely won't give us much benefit over the for-loop
>> that Russ has done in his patch, because it'll simply do longer processing
>> in each of those 'n' notifiers, by iterating over the cpumask inside each
>> notifier.
> 
> As an alternative, how about each cpu have their own notifier list?
> Then one task per cpu can spin through that cpu's notifier list, 
> allowing them to run in parallel.
> 
> I don't know if that would be a faster solution than adding cpumask
> to notifiers, but my guess is it may.
> 

That might not work out well because those notifiers will have to lock
against each other. That is, notifier callback A cannot run as A(cpuX)
and A(cpuY) in parallel. They will have to serialize themselves, which
will make the whole effort useless. But, as I mentioned earlier, A(cpuX)
and B(cpuX) can run in parallel without additional serialization, if A
and B are completely different callbacks (ie., belonging to different
subsystems).
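
To illustrate the point: a typical callback serializes on its own
subsystem lock, so two instances of the same callback just queue behind
each other (all names here are invented):

static DEFINE_MUTEX(subsys_a_lock);

static int subsys_a_cpu_callback(struct notifier_block *nb,
				 unsigned long action, void *hcpu)
{
	mutex_lock(&subsys_a_lock);	/* A(cpuY) blocks here while
					 * A(cpuX) is still running */
	/* ... update subsystem A's per-cpu state for hcpu ... */
	mutex_unlock(&subsys_a_lock);
	return NOTIFY_OK;
}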

>> Now comes the interesting thing:
>>
>> Consider a notifier chain that looks like this:
>> Priority 0: A->B->C->D
>>
>> We can't invoke say notifier callback A simultaneously on 2 CPUs with 2
>> different hotcpus as argument. *However*, since A, B, C, D all (more or less)
>> belong to different subsystems, we can call A, B, C and D in parallel on
>> different CPUs. They won't even serialize amongst themselves because they
>> take locks (if any) of different subsystems. And since they are of same
>> priority, the ordering (A after B or B after A) doesn't matter as well.
>>
>> So with this, if we combine the idea I wrote above about giving a cpumask
>> to each of these notifiers to work with, we end up in this:
>>
>> CPU 0  CPU 1  CPU2   
>> A(cpumask) B(cpumask) C(cpumask) 
>>
>> So, for example, the CPU_DOWN_PREPARE notification can be processed in 
>> parallel
>> on multiple CPUs at a time, for a given cpumask! That should definitely
>> give us a good speed-up.
>>
>> One more thing we have to note is that, there are 4 notifiers for taking a
>> CPU offline:
>>
>> CPU_DOWN_PREPARE
>> CPU_DYING
>> CPU_DEAD
>> CPU_POST_DEAD
>>
>> The first can be run in parallel as mentioned above. The second is run in
>> parallel in the stop_machine() phase as shown in Russ' patch. But the third
>> and fourth set of notifications all end up running only on CPU0, which will
>> again slow down things.
> 
> In my testing the third and fourth set were a small part of the overall
> time.  Less than 10%, with cpu notifiers 90+% of the time.

*All* of them are cpu notifiers! All of them invoke __cpu_notify() internally.
So how did you differentiate between them and find out that the third and
fourth sets take less time?

>  So you may
> not need the added complexity, or at least fix the cpu notifier part
> first.
> 

To make the 3rd and 4th run fast, the only thing we need to do is take CPUs
offline in smaller steps, like 512, 256 etc.. It doesn't add any extra
complexity over and above what is necessary to make the cpu notifiers run
in parallel in the first place.


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Russ Anderson
On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:
> On 04/11/2013 11:01 AM, Paul Mackerras wrote:
> > On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
> >> The optimal solution would be to just speed up the
> >> disable_nonboot_cpus() code so much that it isn't an issue. That would
> >> be good for suspending too, although I guess suspend isn't a big issue
> >> if you have a thousand CPU's.
> >>
> >> Has anybody checked whether we could do the cpu_down() on non-boot
> >> CPU's in parallel? Right now we serialize the thing completely, with
> > 
> > I thought Srivatsa S. Bhat had a patchset that did exactly that.
> > Srivatsa?
> > 
> 
> Thanks for the CC, Paul! Adding some more people to CC.
> 
> Actually, my patchset was about removing stop_machine() from the CPU
> offline path.
> http://lwn.net/Articles/538819/

I certainly agree with the intent.

> And here is the performance improvement I had measured in the version
> prior to that:
> http://article.gmane.org/gmane.linux.kernel/1435249
> 
> I'm planning to revive this patchset after the 3.10 merge window closes,
> because it depends on doing a tree-wide sweep, and I think it's a little
> late to do it in time for the upcoming 3.10 merge window itself.
> 
> Anyway, that's about removing stop_machine from CPU hotplug.
> 
> Coming to bulk CPU hotplug, yes, I had ideas similar to what Russ suggested.
> But I believe we can do more than that.
> 
> As Russ pointed out, the notifiers are not thread-safe, so calling them
> in parallel with different CPUs as arguments isn't going to work.
> 
> So, first, we can convert all the CPU hotplug notifiers to take a cpumask
> instead of a single CPU. So assuming that there are 'n' notifiers in total,
> the number of function calls would become n, instead of n*1024.
> But that itself most likely won't give us much benefit over the for-loop
> that Russ has done in his patch, because it'll simply do longer processing
> in each of those 'n' notifiers, by iterating over the cpumask inside each
> notifier.

As an alternative, how about each cpu have their own notifier list?
Then one task per cpu can spin through that cpu's notifier list, 
allowing them to run in parallel.

I don't know if that would be a faster solution than adding cpumask
to notifiers, but my guess is it may.
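
A sketch of that shape - one chain per CPU, each walked by a task on that
CPU (names invented; as Srivatsa notes in his reply, the same callback
would still have to serialize against itself across CPUs):

static DEFINE_PER_CPU(struct raw_notifier_head, cpu_chain_pcpu);

static int cpu_notify_on(int cpu, unsigned long val)
{
	/* run by one task per cpu, so the chains are walked in parallel */
	return raw_notifier_call_chain(&per_cpu(cpu_chain_pcpu, cpu),
				       val, (void *)(long)cpu);
}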

> Now comes the interesting thing:
> 
> Consider a notifier chain that looks like this:
> Priority 0: A->B->C->D
> 
> We can't invoke say notifier callback A simultaneously on 2 CPUs with 2
> different hotcpus as argument. *However*, since A, B, C, D all (more or less)
> belong to different subsystems, we can call A, B, C and D in parallel on
> different CPUs. They won't even serialize amongst themselves because they
> take locks (if any) of different subsystems. And since they are of same
> priority, the ordering (A after B or B after A) doesn't matter as well.
> 
> So with this, if we combine the idea I wrote above about giving a cpumask
> to each of these notifiers to work with, we end up in this:
> 
> CPU 0  CPU 1  CPU2   
> A(cpumask) B(cpumask) C(cpumask) 
> 
> So, for example, the CPU_DOWN_PREPARE notification can be processed in 
> parallel
> on multiple CPUs at a time, for a given cpumask! That should definitely
> give us a good speed-up.
> 
> One more thing we have to note is that, there are 4 notifiers for taking a
> CPU offline:
> 
> CPU_DOWN_PREPARE
> CPU_DYING
> CPU_DEAD
> CPU_POST_DEAD
> 
> The first can be run in parallel as mentioned above. The second is run in
> parallel in the stop_machine() phase as shown in Russ' patch. But the third
> and fourth set of notifications all end up running only on CPU0, which will
> again slow down things.

In my testing the third and fourth set were a small part of the overall
time.  Less than 10%, with cpu notifiers 90+% of the time.  So you may
not need the added complexity, or at least fix the cpu notifier part
first.

> So I suggest taking down the 1024 CPUs in multiple phases, like a binary 
> search.
> First, take 512 CPUs down, then 256 CPUs, then 128 CPUs etc. So at every bulk
> CPU hotplug, we have enough online CPUs to handle the notifier load, and that
> helps speed things up. Moreover, a handful of calls to stop_machine() is OK
> because, stop_machine() takes progressively lesser and lesser time because
> lesser CPUs are online on each iteration (and hence it reduces the
> synchronization overhead of the stop-machine phase).
> 
> The only downside to this whole idea of running the notifiers of a given
> priority in parallel, is error handling - if a notifier fails, it would be
> troublesome to rollback I guess. But if we forget that for a moment, we can
> give this idea a try!

Yes.

> Regards,
> Srivatsa S. Bhat

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc  r...@sgi.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Robin Holt
For the v3.9 release, can we consider my awful patch?

Thanks,
Robin

On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:
> On 04/11/2013 11:01 AM, Paul Mackerras wrote:
> > On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
> >> The optimal solution would be to just speed up the
> >> disable_nonboot_cpus() code so much that it isn't an issue. That would
> >> be good for suspending too, although I guess suspend isn't a big issue
> >> if you have a thousand CPU's.
> >>
> >> Has anybody checked whether we could do the cpu_down() on non-boot
> >> CPU's in parallel? Right now we serialize the thing completely, with
> > 
> > I thought Srivatsa S. Bhat had a patchset that did exactly that.
> > Srivatsa?
> > 
> 
> Thanks for the CC, Paul! Adding some more people to CC.
> 
> Actually, my patchset was about removing stop_machine() from the CPU
> offline path.
> http://lwn.net/Articles/538819/
> 
> And here is the performance improvement I had measured in the version
> prior to that:
> http://article.gmane.org/gmane.linux.kernel/1435249
> 
> I'm planning to revive this patchset after the 3.10 merge window closes,
> because it depends on doing a tree-wide sweep, and I think it's a little
> late to do it in time for the upcoming 3.10 merge window itself.
> 
> Anyway, that's about removing stop_machine from CPU hotplug.
> 
> Coming to bulk CPU hotplug, yes, I had ideas similar to what Russ suggested.
> But I believe we can do more than that.
> 
> As Russ pointed out, the notifiers are not thread-safe, so calling them
> in parallel with different CPUs as arguments isn't going to work.
> 
> So, first, we can convert all the CPU hotplug notifiers to take a cpumask
> instead of a single CPU. So assuming that there are 'n' notifiers in total,
> the number of function calls would become n, instead of n*1024.
> But that itself most likely won't give us much benefit over the for-loop
> that Russ has done in his patch, because it'll simply do longer processing
> in each of those 'n' notifiers, by iterating over the cpumask inside each
> notifier.
> 
> Now comes the interesting thing:
> 
> Consider a notifier chain that looks like this:
> Priority 0: A->B->C->D
> 
> We can't invoke say notifier callback A simultaneously on 2 CPUs with 2
> different hotcpus as argument. *However*, since A, B, C, D all (more or less)
> belong to different subsystems, we can call A, B, C and D in parallel on
> different CPUs. They won't even serialize amongst themselves because they
> take locks (if any) of different subsystems. And since they are of same
> priority, the ordering (A after B or B after A) doesn't matter as well.
> 
> So with this, if we combine the idea I wrote above about giving a cpumask
> to each of these notifiers to work with, we end up in this:
> 
> CPU 0  CPU 1  CPU2   
> A(cpumask) B(cpumask) C(cpumask) 
> 
> So, for example, the CPU_DOWN_PREPARE notification can be processed in 
> parallel
> on multiple CPUs at a time, for a given cpumask! That should definitely
> give us a good speed-up.
> 
> One more thing we have to note is that, there are 4 notifiers for taking a
> CPU offline:
> 
> CPU_DOWN_PREPARE
> CPU_DYING
> CPU_DEAD
> CPU_POST_DEAD
> 
> The first can be run in parallel as mentioned above. The second is run in
> parallel in the stop_machine() phase as shown in Russ' patch. But the third
> and fourth set of notifications all end up running only on CPU0, which will
> again slow down things.
> 
> So I suggest taking down the 1024 CPUs in multiple phases, like a binary 
> search.
> First, take 512 CPUs down, then 256 CPUs, then 128 CPUs etc. So at every bulk
> CPU hotplug, we have enough online CPUs to handle the notifier load, and that
> helps speed things up. Moreover, a handful of calls to stop_machine() is OK
> because, stop_machine() takes progressively lesser and lesser time because
> lesser CPUs are online on each iteration (and hence it reduces the
> synchronization overhead of the stop-machine phase).
> 
> The only downside to this whole idea of running the notifiers of a given
> priority in parallel, is error handling - if a notifier fails, it would be
> troublesome to rollback I guess. But if we forget that for a moment, we can
> give this idea a try!
> 
> Regards,
> Srivatsa S. Bhat
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Srivatsa S. Bhat
On 04/11/2013 11:01 AM, Paul Mackerras wrote:
> On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
>> The optimal solution would be to just speed up the
>> disable_nonboot_cpus() code so much that it isn't an issue. That would
>> be good for suspending too, although I guess suspend isn't a big issue
>> if you have a thousand CPU's.
>>
>> Has anybody checked whether we could do the cpu_down() on non-boot
>> CPU's in parallel? Right now we serialize the thing completely, with
> 
> I thought Srivatsa S. Bhat had a patchset that did exactly that.
> Srivatsa?
> 

Thanks for the CC, Paul! Adding some more people to CC.

Actually, my patchset was about removing stop_machine() from the CPU
offline path.
http://lwn.net/Articles/538819/

And here is the performance improvement I had measured in the version
prior to that:
http://article.gmane.org/gmane.linux.kernel/1435249

I'm planning to revive this patchset after the 3.10 merge window closes,
because it depends on doing a tree-wide sweep, and I think it's a little
late to do it in time for the upcoming 3.10 merge window itself.

Anyway, that's about removing stop_machine from CPU hotplug.

Coming to bulk CPU hotplug, yes, I had ideas similar to what Russ suggested.
But I believe we can do more than that.

As Russ pointed out, the notifiers are not thread-safe, so calling them
in parallel with different CPUs as arguments isn't going to work.

So, first, we can convert all the CPU hotplug notifiers to take a cpumask
instead of a single CPU. So assuming that there are 'n' notifiers in total,
the number of function calls would become n, instead of n*1024.
But that itself most likely won't give us much benefit over the for-loop
that Russ has done in his patch, because it'll simply do longer processing
in each of those 'n' notifiers, by iterating over the cpumask inside each
notifier.
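
A sketch of that interface change (the helper and the idea of passing the
mask through the hcpu argument are invented for illustration; this assumes
kernel/cpu.c, where cpu_chain lives):

static int cpu_notify_mask(unsigned long val, const struct cpumask *mask)
{
	/* one walk of the chain covers the whole batch: n callouts
	 * instead of n * 1024, each callback iterating the mask */
	return __raw_notifier_call_chain(&cpu_chain, val, (void *)mask,
					 -1, NULL);
}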

Now comes the interesting thing:

Consider a notifier chain that looks like this:
Priority 0: A->B->C->D

We can't invoke say notifier callback A simultaneously on 2 CPUs with 2
different hotcpus as argument. *However*, since A, B, C, D all (more or less)
belong to different subsystems, we can call A, B, C and D in parallel on
different CPUs. They won't even serialize amongst themselves because they
take locks (if any) of different subsystems. And since they are of same
priority, the ordering (A after B or B after A) doesn't matter as well.

So with this, if we combine the idea I wrote above about giving a cpumask
to each of these notifiers to work with, we end up in this:

CPU 0  CPU 1  CPU2   
A(cpumask) B(cpumask) C(cpumask) 

So, for example, the CPU_DOWN_PREPARE notification can be processed in parallel
on multiple CPUs at a time, for a given cpumask! That should definitely
give us a good speed-up.
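
A sketch of that fan-out using work items: each CPU runs a *different*
callback over the whole mask, so no callback ever races with itself
(hotcpu_cb_t, struct cb_work and fan_out() are invented for illustration):

typedef int (*hotcpu_cb_t)(unsigned long action,
			   const struct cpumask *mask);

struct cb_work {
	struct work_struct work;
	hotcpu_cb_t cb;			/* one of A, B, C, D */
	unsigned long action;		/* e.g. CPU_DOWN_PREPARE */
	const struct cpumask *mask;
};

static void cb_work_fn(struct work_struct *work)
{
	struct cb_work *w = container_of(work, struct cb_work, work);

	w->cb(w->action, w->mask);
}

static void fan_out(hotcpu_cb_t *cbs, int n, unsigned long action,
		    const struct cpumask *mask)
{
	struct cb_work *w = kcalloc(n, sizeof(*w), GFP_KERNEL);
	int i, cpu = cpumask_first(cpu_online_mask);

	if (!w)
		return;
	for (i = 0; i < n; i++) {
		w[i].cb = cbs[i];
		w[i].action = action;
		w[i].mask = mask;
		INIT_WORK(&w[i].work, cb_work_fn);
		/* spread distinct callbacks over distinct online CPUs */
		queue_work_on(cpu, system_wq, &w[i].work);
		cpu = cpumask_next(cpu, cpu_online_mask);
		if (cpu >= nr_cpu_ids)
			cpu = cpumask_first(cpu_online_mask);
	}
	for (i = 0; i < n; i++)		/* wait for the whole priority level */
		flush_work(&w[i].work);
	kfree(w);
}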

One more thing we have to note is that, there are 4 notifiers for taking a
CPU offline:

CPU_DOWN_PREPARE
CPU_DYING
CPU_DEAD
CPU_POST_DEAD

The first can be run in parallel as mentioned above. The second is run in
parallel in the stop_machine() phase as shown in Russ' patch. But the third
and fourth set of notifications all end up running only on CPU0, which will
again slow down things.

So I suggest taking down the 1024 CPUs in multiple phases, like a binary search.
First, take 512 CPUs down, then 256 CPUs, then 128 CPUs etc. So at every bulk
CPU hotplug, we have enough online CPUs to handle the notifier load, and that
helps speed things up. Moreover, a handful of calls to stop_machine() is OK
because stop_machine() takes progressively less and less time as fewer
CPUs are online on each iteration (and hence it reduces the
synchronization overhead of the stop-machine phase).
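
A sketch of that stepping, assuming a hypothetical bulk_cpu_down(mask)
primitive of the kind sketched above existed:

/* hypothetical: offline every CPU in @mask in one operation */
extern int bulk_cpu_down(const struct cpumask *mask);

static void stepped_offline(void)
{
	cpumask_var_t batch;

	if (!alloc_cpumask_var(&batch, GFP_KERNEL))
		return;

	while (num_online_cpus() > 1) {
		unsigned int take = num_online_cpus() / 2 ?: 1;
		int cpu;

		cpumask_clear(batch);
		for_each_online_cpu(cpu) {
			if (cpu == 0)
				continue;	/* keep the boot CPU */
			cpumask_set_cpu(cpu, batch);
			if (cpumask_weight(batch) == take)
				break;
		}
		bulk_cpu_down(batch);	/* 512, then 256, then 128, ... */
	}
	free_cpumask_var(batch);
}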

The only downside to this whole idea of running the notifiers of a given
priority in parallel, is error handling - if a notifier fails, it would be
troublesome to rollback I guess. But if we forget that for a moment, we can
give this idea a try!

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-11 Thread Ingo Molnar

* Robin Holt  wrote:

> On Thu, Apr 11, 2013 at 07:03:58AM -0500, Robin Holt wrote:
> > On Thu, Apr 11, 2013 at 02:00:27PM +0200, Ingo Molnar wrote:
> > > 
> > > * Robin Holt  wrote:
> > > 
> > > > > Ok, so it looks profilable.
> > > > > 
> > > > > The result above is not surprising: most CPUs sit in idle and don't 
> > > > > do anything, 
> > > > > while the loop goes on, right?
> > > > > 
> > > > > The interesting thing to profile would be the parallel bring-down, 
> > > > > with the 
> > > > > simplest global lock solution you mentioned. In that case most CPUs 
> > > > > should be 
> > > > > doing 'something' all the time - maybe spinning on the lock, maybe 
> > > > > something else, 
> > > > > right?
> > > > 
> > > > Again, mostly looks idle.
> > > 
> > > Forgot to suggest:
> > > 
> > >   perf record -a /sbin/reboot
> > 
> > I used perf record -a /sbin/reboot -f -d -n
> 
> OK.  Looking at Russ' patch, I understand now why it is looking idle.
> We are still serially doing the DOWN_PREPARE, etc.  All those other cpus
> are still sitting idle.
> 
> Can we call the __cpu_down functions from an smp_call_function()?

I think the kthread_park() will generally schedule.

But ... whether it's an IPI or a wakeup should matter little: wakeups are IPI 
based (sometimes faster, mwait based).

So the main overhead is the serial loop - if that's done in parallel, and
then all CPUs are waited for in a second loop, then much of the work can go
on in parallel.
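
As a sketch, the two-loop shape would be (error handling ignored; note
that cpu_down() today still serializes internally on cpu_add_remove_lock,
which is exactly what would have to be relaxed for this to help):

static struct work_struct down_work[NR_CPUS];

static void do_cpu_down(struct work_struct *work)
{
	cpu_down(work - down_work);	/* array index == cpu number */
}

static void parallel_down_all(void)
{
	cpumask_var_t targets;
	int cpu;

	if (!alloc_cpumask_var(&targets, GFP_KERNEL))
		return;
	cpumask_copy(targets, cpu_online_mask);
	cpumask_clear_cpu(0, targets);		/* keep the boot CPU */

	for_each_cpu(cpu, targets) {		/* loop 1: launch them all */
		INIT_WORK(&down_work[cpu], do_cpu_down);
		schedule_work(&down_work[cpu]);
	}
	for_each_cpu(cpu, targets)		/* loop 2: wait for them all */
		flush_work(&down_work[cpu]);

	free_cpumask_var(targets);
}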

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-11 Thread Robin Holt
On Thu, Apr 11, 2013 at 07:03:58AM -0500, Robin Holt wrote:
> On Thu, Apr 11, 2013 at 02:00:27PM +0200, Ingo Molnar wrote:
> > 
> > * Robin Holt  wrote:
> > 
> > > > Ok, so it looks profilable.
> > > > 
> > > > The result above is not surprising: most CPUs sit in idle and don't do 
> > > > anything, 
> > > > while the loop goes on, right?
> > > > 
> > > > The interesting thing to profile would be the parallel bring-down, with 
> > > > the 
> > > > simplest global lock solution you mentioned. In that case most CPUs 
> > > > should be 
> > > > doing 'something' all the time - maybe spinning on the lock, maybe 
> > > > something else, 
> > > > right?
> > > 
> > > Again, mostly looks idle.
> > 
> > Forgot to suggest:
> > 
> >   perf record -a /sbin/reboot
> 
> I used perf record -a /sbin/reboot -f -d -n

OK.  Looking at Russ' patch, I understand now why it is looking idle.
We are still serially doing the DOWN_PREPARE, etc.  All those other cpus
are still sitting idle.

Can we call the __cpu_down functions from an smp_call_function()?

Robin
> 
> Robin
> > 
> > ... to capture remote CPU activity too.
> > 
> > > Events: 5M cycles
> > >  31.69%  swapper  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
> > >  14.22%  swapper  [kernel.kallsyms]  [k] load_balance
> > >  12.95%  swapper  [kernel.kallsyms]  [k] ktime_get
> > >   4.64%  swapper  [kernel.kallsyms]  [k] idle_cpu
> > >   3.46%  swapper  [kernel.kallsyms]  [k] uv_read_rtc
> > >   2.26%  swapper  [kernel.kallsyms]  [k] ktime_get_update_offsets
> > >   2.25%  swapper  [kernel.kallsyms]  [k] rcu_check_callbacks
> > >   1.72%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> > >   1.57%  swapper  [kernel.kallsyms]  [k] native_write_msr_safe
> > >   1.53%  swapper  [kernel.kallsyms]  [k] native_safe_halt
> > >   1.52%  swapper  [kernel.kallsyms]  [k] apic_timer_interrupt
> > >   1.52%  swapper  [kernel.kallsyms]  [k] update_blocked_averages
> > >   1.51%  swapper  [kernel.kallsyms]  [k] __lock_text_start
> > >   1.48%  swapper  [kernel.kallsyms]  [k] rcu_process_gp_end
> > >   1.40%  swapper  [kernel.kallsyms]  [k] rcu_process_callbacks
> > >   1.19%   reboot  [kernel.kallsyms]  [k] kmem_cache_alloc_node
> > >   0.63%  swapper  [kernel.kallsyms]  [k] check_for_new_grace_period
> > >   0.58%  swapper  [kernel.kallsyms]  [k] rebalance_domains
> > >   0.55%  swapper  [kernel.kallsyms]  [k] cpumask_next_and
> > >   0.54%  swapper  [kernel.kallsyms]  [k] __tick_nohz_idle_enter
> > >   0.53%  swapper  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
> > >   0.49%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock
> > 
> > If even perf record -a shows a mostly idle system, then the overhead must 
> > be in 
> > sleep/wakeup latencies - for that the next step would be to figure out 
> > where all 
> > the waiting happens, for example via call-graph context-switch profiling:
> > 
> >perf stat --null perf record -a -g -e sched:sched_switch /sbin/reboot
> > 
> > (the perf stat --null will tell us the runtime of the whole operation.)
> > 
> > Thanks,
> > 
> > Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-11 Thread Robin Holt
On Thu, Apr 11, 2013 at 02:00:27PM +0200, Ingo Molnar wrote:
> 
> * Robin Holt  wrote:
> 
> > > Ok, so it looks profilable.
> > > 
> > > The result above is not surprising: most CPUs sit in idle and don't do 
> > > anything, 
> > > while the loop goes on, right?
> > > 
> > > The interesting thing to profile would be the parallel bring-down, with 
> > > the 
> > > simplest global lock solution you mentioned. In that case most CPUs 
> > > should be 
> > > doing 'something' all the time - maybe spinning on the lock, maybe 
> > > something else, 
> > > right?
> > 
> > Again, mostly looks idle.
> 
> Forgot to suggest:
> 
>   perf record -a /sbin/reboot

I used perf record -a /sbin/reboot -f -d -n

Robin
> 
> ... to capture remote CPU activity too.
> 
> > Events: 5M cycles
> >  31.69%  swapper  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
> >  14.22%  swapper  [kernel.kallsyms]  [k] load_balance
> >  12.95%  swapper  [kernel.kallsyms]  [k] ktime_get
> >   4.64%  swapper  [kernel.kallsyms]  [k] idle_cpu
> >   3.46%  swapper  [kernel.kallsyms]  [k] uv_read_rtc
> >   2.26%  swapper  [kernel.kallsyms]  [k] ktime_get_update_offsets
> >   2.25%  swapper  [kernel.kallsyms]  [k] rcu_check_callbacks
> >   1.72%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >   1.57%  swapper  [kernel.kallsyms]  [k] native_write_msr_safe
> >   1.53%  swapper  [kernel.kallsyms]  [k] native_safe_halt
> >   1.52%  swapper  [kernel.kallsyms]  [k] apic_timer_interrupt
> >   1.52%  swapper  [kernel.kallsyms]  [k] update_blocked_averages
> >   1.51%  swapper  [kernel.kallsyms]  [k] __lock_text_start
> >   1.48%  swapper  [kernel.kallsyms]  [k] rcu_process_gp_end
> >   1.40%  swapper  [kernel.kallsyms]  [k] rcu_process_callbacks
> >   1.19%   reboot  [kernel.kallsyms]  [k] kmem_cache_alloc_node
> >   0.63%  swapper  [kernel.kallsyms]  [k] check_for_new_grace_period
> >   0.58%  swapper  [kernel.kallsyms]  [k] rebalance_domains
> >   0.55%  swapper  [kernel.kallsyms]  [k] cpumask_next_and
> >   0.54%  swapper  [kernel.kallsyms]  [k] __tick_nohz_idle_enter
> >   0.53%  swapper  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
> >   0.49%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock
> 
> If even perf record -a shows a mostly idle system, then the overhead must be 
> in 
> sleep/wakeup latencies - for that the next step would be to figure out where 
> all 
> the waiting happens, for example via call-graph context-switch profiling:
> 
>perf stat --null perf record -a -g -e sched:sched_switch /sbin/reboot
> 
> (the perf stat --null will tell us the runtime of the whole operation.)
> 
> Thanks,
> 
>   Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-11 Thread Ingo Molnar

* Robin Holt  wrote:

> > Ok, so it looks profilable.
> > 
> > The result above is not surprising: most CPUs sit in idle and don't do 
> > anything, 
> > while the loop goes on, right?
> > 
> > The interesting thing to profile would be the parallel bring-down, with the 
> > simplest global lock solution you mentioned. In that case most CPUs should 
> > be 
> > doing 'something' all the time - maybe spinning on the lock, maybe 
> > something else, 
> > right?
> 
> Again, mostly looks idle.

Forgot to suggest:

  perf record -a /sbin/reboot

... to capture remote CPU activity too.

> Events: 5M cycles
>  31.69%  swapper  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
>  14.22%  swapper  [kernel.kallsyms]  [k] load_balance
>  12.95%  swapper  [kernel.kallsyms]  [k] ktime_get
>   4.64%  swapper  [kernel.kallsyms]  [k] idle_cpu
>   3.46%  swapper  [kernel.kallsyms]  [k] uv_read_rtc
>   2.26%  swapper  [kernel.kallsyms]  [k] ktime_get_update_offsets
>   2.25%  swapper  [kernel.kallsyms]  [k] rcu_check_callbacks
>   1.72%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>   1.57%  swapper  [kernel.kallsyms]  [k] native_write_msr_safe
>   1.53%  swapper  [kernel.kallsyms]  [k] native_safe_halt
>   1.52%  swapper  [kernel.kallsyms]  [k] apic_timer_interrupt
>   1.52%  swapper  [kernel.kallsyms]  [k] update_blocked_averages
>   1.51%  swapper  [kernel.kallsyms]  [k] __lock_text_start
>   1.48%  swapper  [kernel.kallsyms]  [k] rcu_process_gp_end
>   1.40%  swapper  [kernel.kallsyms]  [k] rcu_process_callbacks
>   1.19%   reboot  [kernel.kallsyms]  [k] kmem_cache_alloc_node
>   0.63%  swapper  [kernel.kallsyms]  [k] check_for_new_grace_period
>   0.58%  swapper  [kernel.kallsyms]  [k] rebalance_domains
>   0.55%  swapper  [kernel.kallsyms]  [k] cpumask_next_and
>   0.54%  swapper  [kernel.kallsyms]  [k] __tick_nohz_idle_enter
>   0.53%  swapper  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
>   0.49%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock

If even perf record -a shows a mostly idle system, then the overhead must be
in sleep/wakeup latencies - for that the next step would be to figure out
where all the waiting happens, for example via call-graph context-switch
profiling:

   perf stat --null perf record -a -g -e sched:sched_switch /sbin/reboot

(the perf stat --null will tell us the runtime of the whole operation.)

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-11 Thread Robin Holt
> Ok, so it looks profilable.
> 
> The result above is not surprising: most CPUs sit in idle and don't do 
> anything, 
> while the loop goes on, right?
> 
> The interesting thing to profile would be the parallel bring-down, with the 
> simplest global lock solution you mentioned. In that case most CPUs should be 
> doing 'something' all the time - maybe spinning on the lock, maybe something 
> else, 
> right?

Again, mostly looks idle.

Events: 5M cycles
 31.69%  swapper  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
 14.22%  swapper  [kernel.kallsyms]  [k] load_balance
 12.95%  swapper  [kernel.kallsyms]  [k] ktime_get
  4.64%  swapper  [kernel.kallsyms]  [k] idle_cpu
  3.46%  swapper  [kernel.kallsyms]  [k] uv_read_rtc
  2.26%  swapper  [kernel.kallsyms]  [k] ktime_get_update_offsets
  2.25%  swapper  [kernel.kallsyms]  [k] rcu_check_callbacks
  1.72%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  1.57%  swapper  [kernel.kallsyms]  [k] native_write_msr_safe
  1.53%  swapper  [kernel.kallsyms]  [k] native_safe_halt
  1.52%  swapper  [kernel.kallsyms]  [k] apic_timer_interrupt
  1.52%  swapper  [kernel.kallsyms]  [k] update_blocked_averages
  1.51%  swapper  [kernel.kallsyms]  [k] __lock_text_start
  1.48%  swapper  [kernel.kallsyms]  [k] rcu_process_gp_end
  1.40%  swapper  [kernel.kallsyms]  [k] rcu_process_callbacks
  1.19%   reboot  [kernel.kallsyms]  [k] kmem_cache_alloc_node
  0.63%  swapper  [kernel.kallsyms]  [k] check_for_new_grace_period
  0.58%  swapper  [kernel.kallsyms]  [k] rebalance_domains
  0.55%  swapper  [kernel.kallsyms]  [k] cpumask_next_and
  0.54%  swapper  [kernel.kallsyms]  [k] __tick_nohz_idle_enter
  0.53%  swapper  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
  0.49%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock

Robin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-11 Thread Ingo Molnar

* Robin Holt  wrote:

> I had the machine booted as 512 cpus.
> I tweaked the kernel like this:
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 39c9c4a..b42bd4f 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -368,8 +368,10 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
>   */
>  void kernel_restart(char *cmd)
>  {
> -   kernel_restart_prepare(cmd);
> +   // kernel_restart_prepare(cmd);
> disable_nonboot_cpus();
> +   enable_nonboot_cpus();
> +   return;
> if (!cmd)
> printk(KERN_EMERG "Restarting system.\n");
> else
> 
> perf record -a /sbin/reboot -d -f -n
> 
> The top of 'perf report' has:
> Events: 14M cycles
>  22.58%  swapper  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
>  10.52%  swapper  [kernel.kallsyms]  [k] load_balance
>   4.96%  swapper  [kernel.kallsyms]  [k] ktime_get
>   4.12%  swapper  [kernel.kallsyms]  [k] update_blocked_averages
>   3.55%  swapper  [kernel.kallsyms]  [k] idle_cpu
>   1.97%  swapper  [kernel.kallsyms]  [k] uv_read_rtc
>   0.98%  swapper  [kernel.kallsyms]  [k] rcu_process_gp_end
>   0.84%  swapper  [kernel.kallsyms]  [k] apic_timer_interrupt
>   0.84%  swapper  [kernel.kallsyms]  [k] __lock_text_start
>   0.84%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>   0.73%  swapper  [kernel.kallsyms]  [k] native_safe_halt
>   0.56%  swapper  [kernel.kallsyms]  [k] rcu_check_callbacks
>   0.56%  swapper  [kernel.kallsyms]  [k] native_write_msr_safe
>   0.44%  swapper  [kernel.kallsyms]  [k] cpumask_next_and
>   0.42%   reboot  [kernel.kallsyms]  [k] kmem_cache_alloc_node

Ok, so it looks profilable.

The result above is not surprising: most CPUs sit in idle and don't do
anything, while the loop goes on, right?

The interesting thing to profile would be the parallel bring-down, with the
simplest global lock solution you mentioned. In that case most CPUs should
be doing 'something' all the time - maybe spinning on the lock, maybe
something else, right?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-11 Thread Ingo Molnar

* Robin Holt h...@sgi.com wrote:

 I had the machine booted as 512 cpus.
 I tweaked the kernel like this:
 
 diff --git a/kernel/sys.c b/kernel/sys.c
 index 39c9c4a..b42bd4f 100644
 --- a/kernel/sys.c
 +++ b/kernel/sys.c
 @@ -368,8 +368,10 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
   */
  void kernel_restart(char *cmd)
  {
 -   kernel_restart_prepare(cmd);
 +   // kernel_restart_prepare(cmd);
 disable_nonboot_cpus();
 +   enable_nonboot_cpus();
 +   return;
 if (!cmd)
 printk(KERN_EMERG Restarting system.\n);
 else
 
 perf record -a /sbin/reboot -d -f -n
 
 The top of 'perf report' has:
 Events: 14M cycles
 22.58%  swapper  [kernel.kallsyms]   [k] 
 update_cfs_rq_blocked_load
 10.52%  swapper  [kernel.kallsyms]   [k] load_balance
  4.96%  swapper  [kernel.kallsyms]   [k] ktime_get
  4.12%  swapper  [kernel.kallsyms]   [k] 
 update_blocked_averages
  3.55%  swapper  [kernel.kallsyms]   [k] idle_cpu
  1.97%  swapper  [kernel.kallsyms]   [k] uv_read_rtc
  0.98%  swapper  [kernel.kallsyms]   [k] 
 rcu_process_gp_end
  0.84%  swapper  [kernel.kallsyms]   [k] 
 apic_timer_interrupt
  0.84%  swapper  [kernel.kallsyms]   [k] __lock_text_start
  0.84%  swapper  [kernel.kallsyms]   [k] 
 _raw_spin_lock_irqsave
  0.73%  swapper  [kernel.kallsyms]   [k] native_safe_halt
  0.56%  swapper  [kernel.kallsyms]   [k] 
 rcu_check_callbacks
  0.56%  swapper  [kernel.kallsyms]   [k] 
 native_write_msr_safe
  0.44%  swapper  [kernel.kallsyms]   [k] cpumask_next_and
  0.42%   reboot  [kernel.kallsyms]   [k] 
 kmem_cache_alloc_node

Ok, so it looks profilable.

The result above is not surprising: most CPUs sit in idle and don't do 
anything, 
while the loop goes on, right?

The interesting thing to profile would be the parallel bring-down, with the 
simplest global lock solution you mentioned. In that case most CPUs should be 
doing 'something' all the time - maybe spinning on the lock, maybe something 
else, 
right?

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-11 Thread Robin Holt
 Ok, so it looks profilable.
 
 The result above is not surprising: most CPUs sit in idle and don't do 
 anything, 
 while the loop goes on, right?
 
 The interesting thing to profile would be the parallel bring-down, with the 
 simplest global lock solution you mentioned. In that case most CPUs should be 
 doing 'something' all the time - maybe spinning on the lock, maybe something 
 else, 
 right?

Again, mostly looks idle.

Events: 5M cycles
31.69%  swapper  [kernel.kallsyms][k] 
update_cfs_rq_blocked_load
14.22%  swapper  [kernel.kallsyms][k] load_balance
12.95%  swapper  [kernel.kallsyms][k] ktime_get
 4.64%  swapper  [kernel.kallsyms][k] idle_cpu
 3.46%  swapper  [kernel.kallsyms][k] uv_read_rtc
 2.26%  swapper  [kernel.kallsyms][k] 
ktime_get_update_offsets
 2.25%  swapper  [kernel.kallsyms][k] rcu_check_callbacks
 1.72%  swapper  [kernel.kallsyms][k] _raw_spin_lock_irqsave
 1.57%  swapper  [kernel.kallsyms][k] native_write_msr_safe
 1.53%  swapper  [kernel.kallsyms][k] native_safe_halt
 1.52%  swapper  [kernel.kallsyms][k] apic_timer_interrupt
 1.52%  swapper  [kernel.kallsyms][k] 
update_blocked_averages
 1.51%  swapper  [kernel.kallsyms][k] __lock_text_start
 1.48%  swapper  [kernel.kallsyms][k] rcu_process_gp_end
 1.40%  swapper  [kernel.kallsyms][k] rcu_process_callbacks
 1.19%   reboot  [kernel.kallsyms][k] kmem_cache_alloc_node
 0.63%  swapper  [kernel.kallsyms][k] 
check_for_new_grace_period
 0.58%  swapper  [kernel.kallsyms][k] rebalance_domains
 0.55%  swapper  [kernel.kallsyms][k] cpumask_next_and
 0.54%  swapper  [kernel.kallsyms][k] __tick_nohz_idle_enter
 0.53%  swapper  [kernel.kallsyms][k] 
perf_adjust_freq_unthr_context
 0.49%  swapper  [kernel.kallsyms][k] _raw_spin_lock

Robin

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-11 Thread Ingo Molnar

* Robin Holt h...@sgi.com wrote:

  Ok, so it looks profilable.
  
  The result above is not surprising: most CPUs sit in idle and don't do 
  anything, 
  while the loop goes on, right?
  
  The interesting thing to profile would be the parallel bring-down, with the 
  simplest global lock solution you mentioned. In that case most CPUs should 
  be 
  doing 'something' all the time - maybe spinning on the lock, maybe 
  something else, 
  right?
 
 Again, mostly looks idle.

Forgot to suggest:

  perf record -a /sbin/reboot

... to capture remote CPU activity too.

 Events: 5M cycles
 31.69%  swapper  [kernel.kallsyms][k] 
 update_cfs_rq_blocked_load
 14.22%  swapper  [kernel.kallsyms][k] load_balance
 12.95%  swapper  [kernel.kallsyms][k] ktime_get
  4.64%  swapper  [kernel.kallsyms][k] idle_cpu
  3.46%  swapper  [kernel.kallsyms][k] uv_read_rtc
  2.26%  swapper  [kernel.kallsyms][k] 
 ktime_get_update_offsets
  2.25%  swapper  [kernel.kallsyms][k] rcu_check_callbacks
  1.72%  swapper  [kernel.kallsyms][k] 
 _raw_spin_lock_irqsave
  1.57%  swapper  [kernel.kallsyms][k] 
 native_write_msr_safe
  1.53%  swapper  [kernel.kallsyms][k] native_safe_halt
  1.52%  swapper  [kernel.kallsyms][k] apic_timer_interrupt
  1.52%  swapper  [kernel.kallsyms][k] 
 update_blocked_averages
  1.51%  swapper  [kernel.kallsyms][k] __lock_text_start
  1.48%  swapper  [kernel.kallsyms][k] rcu_process_gp_end
  1.40%  swapper  [kernel.kallsyms][k] 
 rcu_process_callbacks
  1.19%   reboot  [kernel.kallsyms][k] 
 kmem_cache_alloc_node
  0.63%  swapper  [kernel.kallsyms][k] 
 check_for_new_grace_period
  0.58%  swapper  [kernel.kallsyms][k] rebalance_domains
  0.55%  swapper  [kernel.kallsyms][k] cpumask_next_and
  0.54%  swapper  [kernel.kallsyms][k] 
 __tick_nohz_idle_enter
  0.53%  swapper  [kernel.kallsyms][k] 
 perf_adjust_freq_unthr_context
  0.49%  swapper  [kernel.kallsyms][k] _raw_spin_lock

If even perf record -a shows a mostly idle system, then the overhead must be in 
sleep/wakeup latencies - for that the next step would be to figure out where 
all 
the waiting happens, for example via call-graph context-switch profiling:

   perf stat --null perf record -a -g -e sched:sched_switch /sbin/reboot

(the perf stat --null will tell us the runtime of the whole operation.)

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-11 Thread Robin Holt
On Thu, Apr 11, 2013 at 02:00:27PM +0200, Ingo Molnar wrote:
 
 * Robin Holt h...@sgi.com wrote:
 
   Ok, so it looks profilable.
   
   The result above is not surprising: most CPUs sit in idle and don't do 
   anything, 
   while the loop goes on, right?
   
   The interesting thing to profile would be the parallel bring-down, with 
   the 
   simplest global lock solution you mentioned. In that case most CPUs 
   should be 
   doing 'something' all the time - maybe spinning on the lock, maybe 
   something else, 
   right?
  
  Again, mostly looks idle.
 
 Forgot to suggest:
 
   perf record -a /sbin/reboot

I used perf record -a /sbin/reboot -f -d -n

Robin
 
 ... to capture remote CPU activity too.
 
  Events: 5M cycles
  31.69%  swapper  [kernel.kallsyms][k] 
  update_cfs_rq_blocked_load
  14.22%  swapper  [kernel.kallsyms][k] load_balance
  12.95%  swapper  [kernel.kallsyms][k] ktime_get
   4.64%  swapper  [kernel.kallsyms][k] idle_cpu
   3.46%  swapper  [kernel.kallsyms][k] uv_read_rtc
   2.26%  swapper  [kernel.kallsyms][k] 
  ktime_get_update_offsets
   2.25%  swapper  [kernel.kallsyms][k] 
  rcu_check_callbacks
   1.72%  swapper  [kernel.kallsyms][k] 
  _raw_spin_lock_irqsave
   1.57%  swapper  [kernel.kallsyms][k] 
  native_write_msr_safe
   1.53%  swapper  [kernel.kallsyms][k] native_safe_halt
   1.52%  swapper  [kernel.kallsyms][k] 
  apic_timer_interrupt
   1.52%  swapper  [kernel.kallsyms][k] 
  update_blocked_averages
   1.51%  swapper  [kernel.kallsyms][k] __lock_text_start
   1.48%  swapper  [kernel.kallsyms][k] rcu_process_gp_end
   1.40%  swapper  [kernel.kallsyms][k] 
  rcu_process_callbacks
   1.19%   reboot  [kernel.kallsyms][k] 
  kmem_cache_alloc_node
   0.63%  swapper  [kernel.kallsyms][k] 
  check_for_new_grace_period
   0.58%  swapper  [kernel.kallsyms][k] rebalance_domains
   0.55%  swapper  [kernel.kallsyms][k] cpumask_next_and
   0.54%  swapper  [kernel.kallsyms][k] 
  __tick_nohz_idle_enter
   0.53%  swapper  [kernel.kallsyms][k] 
  perf_adjust_freq_unthr_context
   0.49%  swapper  [kernel.kallsyms][k] _raw_spin_lock
 
 If even perf record -a shows a mostly idle system, then the overhead must be 
 in 
 sleep/wakeup latencies - for that the next step would be to figure out where 
 all 
 the waiting happens, for example via call-graph context-switch profiling:
 
perf stat --null perf record -a -g -e sched:sched_switch /sbin/reboot
 
 (the perf stat --null will tell us the runtime of the whole operation.)
 
 Thanks,
 
   Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-11 Thread Robin Holt
On Thu, Apr 11, 2013 at 07:03:58AM -0500, Robin Holt wrote:
 On Thu, Apr 11, 2013 at 02:00:27PM +0200, Ingo Molnar wrote:
  
  * Robin Holt h...@sgi.com wrote:
  
Ok, so it looks profilable.

The result above is not surprising: most CPUs sit in idle and don't do 
anything, 
while the loop goes on, right?

The interesting thing to profile would be the parallel bring-down, with 
the 
simplest global lock solution you mentioned. In that case most CPUs 
should be 
doing 'something' all the time - maybe spinning on the lock, maybe 
something else, 
right?
   
   Again, mostly looks idle.
  
  Forgot to suggest:
  
perf record -a /sbin/reboot
 
 I used perf record -a /sbin/reboot -f -d -n

OK.  Looking at Russ' patch, I understand now why it is looking idle.
We are still serially doing the DOWN_PREPARE, etc.  All those other cpus
are still sitting idle.

Can we call the __cpu_down functions from an smp_call_function()?

Robin
 
 Robin
  
  ... to capture remote CPU activity too.
  
   Events: 5M cycles
   31.69%  swapper  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
   14.22%  swapper  [kernel.kallsyms]  [k] load_balance
   12.95%  swapper  [kernel.kallsyms]  [k] ktime_get
    4.64%  swapper  [kernel.kallsyms]  [k] idle_cpu
    3.46%  swapper  [kernel.kallsyms]  [k] uv_read_rtc
    2.26%  swapper  [kernel.kallsyms]  [k] ktime_get_update_offsets
    2.25%  swapper  [kernel.kallsyms]  [k] rcu_check_callbacks
    1.72%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
    1.57%  swapper  [kernel.kallsyms]  [k] native_write_msr_safe
    1.53%  swapper  [kernel.kallsyms]  [k] native_safe_halt
    1.52%  swapper  [kernel.kallsyms]  [k] apic_timer_interrupt
    1.52%  swapper  [kernel.kallsyms]  [k] update_blocked_averages
    1.51%  swapper  [kernel.kallsyms]  [k] __lock_text_start
    1.48%  swapper  [kernel.kallsyms]  [k] rcu_process_gp_end
    1.40%  swapper  [kernel.kallsyms]  [k] rcu_process_callbacks
    1.19%   reboot  [kernel.kallsyms]  [k] kmem_cache_alloc_node
    0.63%  swapper  [kernel.kallsyms]  [k] check_for_new_grace_period
    0.58%  swapper  [kernel.kallsyms]  [k] rebalance_domains
    0.55%  swapper  [kernel.kallsyms]  [k] cpumask_next_and
    0.54%  swapper  [kernel.kallsyms]  [k] __tick_nohz_idle_enter
    0.53%  swapper  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
    0.49%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock
  
  If even perf record -a shows a mostly idle system, then the overhead must be 
  in sleep/wakeup latencies - for that the next step would be to figure out 
  where all the waiting happens, for example via call-graph context-switch 
  profiling:
  
 perf stat --null perf record -a -g -e sched:sched_switch /sbin/reboot
  
  (the perf stat --null will tell us the runtime of the whole operation.)
  
  Thanks,
  
  Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-11 Thread Ingo Molnar

* Robin Holt h...@sgi.com wrote:

 On Thu, Apr 11, 2013 at 07:03:58AM -0500, Robin Holt wrote:
  On Thu, Apr 11, 2013 at 02:00:27PM +0200, Ingo Molnar wrote:
   
   * Robin Holt h...@sgi.com wrote:
   
 Ok, so it looks profilable.
 
 The result above is not surprising: most CPUs sit in idle and don't do 
 anything, while the loop goes on, right?
 
 The interesting thing to profile would be the parallel bring-down, with the 
 simplest global lock solution you mentioned. In that case most CPUs should be 
 doing 'something' all the time - maybe spinning on the lock, maybe something 
 else, right?

Again, mostly looks idle.
   
   Forgot to suggest:
   
 perf record -a /sbin/reboot
  
  I used perf record -a /sbin/reboot -f -d -n
 
 OK.  Looking at Russ' patch, I understand now why it is looking idle.
 We are still serially doing the DOWN_PREPARE, etc.  All those other cpus
 are still sitting idle.
 
 Can we call the __cpu_down functions from an smp_call_function()?

I think the kthread_park() will generally schedule.

But ... whether it's an IPI or a wakeup should matter little: wakeups are IPI 
based (sometimes faster, mwait based).

So the main overhead is the serial loop - if that's done in parallel, and then 
all CPUs are waited for in a second loop, then much of the work can go on in 
parallel.
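
A minimal sketch of that two-phase shape (hypothetical - it assumes a 
reentrant cpu_down_fn(), which, as Russ noted, today's __cpu_notify() 
path does not allow):

struct down_work {
	struct work_struct work;	/* linux/workqueue.h */
	unsigned int cpu;
	int ret;
};

static void do_one_down(struct work_struct *w)
{
	struct down_work *dw = container_of(w, struct down_work, work);

	dw->ret = cpu_down_fn(dw->cpu);	/* hypothetical reentrant teardown */
}

static void parallel_cpu_down(const struct cpumask *mask)
{
	static struct down_work dw[NR_CPUS];
	unsigned int cpu;

	for_each_cpu(cpu, mask) {		/* loop 1: issue all requests */
		dw[cpu].cpu = cpu;
		INIT_WORK(&dw[cpu].work, do_one_down);
		schedule_work(&dw[cpu].work);
	}
	for_each_cpu(cpu, mask)			/* loop 2: wait for them all */
		flush_work(&dw[cpu].work);
}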

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Srivatsa S. Bhat
On 04/11/2013 11:01 AM, Paul Mackerras wrote:
 On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
 The optimal solution would be to just speed up the
 disable_nonboot_cpus() code so much that it isn't an issue. That would
 be good for suspending too, although I guess suspend isn't a big issue
 if you have a thousand CPU's.

 Has anybody checked whether we could do the cpu_down() on non-boot
 CPU's in parallel? Right now we serialize the thing completely, with
 
 I thought Srivatsa S. Bhat had a patchset that did exactly that.
 Srivatsa?
 

Thanks for the CC, Paul! Adding some more people to CC.

Actually, my patchset was about removing stop_machine() from the CPU
offline path.
http://lwn.net/Articles/538819/

And here is the performance improvement I had measured in the version
prior to that:
http://article.gmane.org/gmane.linux.kernel/1435249

I'm planning to revive this patchset after the 3.10 merge window closes,
because it depends on doing a tree-wide sweep, and I think it's a little
late to do it in time for the upcoming 3.10 merge window itself.

Anyway, that's about removing stop_machine from CPU hotplug.

Coming to bulk CPU hotplug, yes, I had ideas similar to what Russ suggested.
But I believe we can do more than that.

As Russ pointed out, the notifiers are not thread-safe, so calling them
in parallel with different CPUs as arguments isn't going to work.

So, first, we can convert all the CPU hotplug notifiers to take a cpumask
instead of a single CPU. So assuming that there are 'n' notifiers in total,
the number of function calls would become n, instead of n*1024.
But that itself most likely won't give us much benefit over the for-loop
that Russ has done in his patch, because it'll simply do longer processing
in each of those 'n' notifiers, by iterating over the cpumask inside each
notifier.

Now comes the interesting thing:

Consider a notifier chain that looks like this:
Priority 0: A->B->C->D

We can't invoke, say, notifier callback A simultaneously on 2 CPUs with 2
different hotcpus as argument. *However*, since A, B, C, D all (more or less)
belong to different subsystems, we can call A, B, C and D in parallel on
different CPUs. They won't even serialize amongst themselves because they
take locks (if any) of different subsystems. And since they are of same
priority, the ordering (A after B or B after A) doesn't matter as well.

So with this, if we combine the idea I wrote above about giving a cpumask
to each of these notifiers to work with, we end up in this:

CPU 0  CPU 1  CPU2   
A(cpumask) B(cpumask) C(cpumask) 

So, for example, the CPU_DOWN_PREPARE notification can be processed in parallel
on multiple CPUs at a time, for a given cpumask! That should definitely
give us a good speed-up.
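
A rough sketch of that dispatch, assuming the hypothetical cpumask-taking 
notifier variant described above (none of these helpers exist today):

struct nb_work {
	struct work_struct work;	/* linux/workqueue.h */
	struct notifier_block *nb;	/* one same-priority callback: A, B, ... */
	const struct cpumask *cpus;	/* all CPUs being taken down */
	int ret;
};

static void call_one_callback(struct work_struct *w)
{
	struct nb_work *nw = container_of(w, struct nb_work, work);

	/* assumed cpumask-based notifier signature */
	nw->ret = nw->nb->notifier_call(nw->nb, CPU_DOWN_PREPARE,
					(void *)nw->cpus);
}

static int notify_priority_level(struct nb_work *w, int n)
{
	int i;

	/* fan A, B, C, D out over the online CPUs (assumes contiguous ids) */
	for (i = 0; i < n; i++) {
		INIT_WORK(&w[i].work, call_one_callback);
		queue_work_on(i % num_online_cpus(), system_wq, &w[i].work);
	}
	for (i = 0; i < n; i++)		/* then wait for the whole level */
		flush_work(&w[i].work);
	for (i = 0; i < n; i++)
		if (notifier_to_errno(w[i].ret))
			return notifier_to_errno(w[i].ret);
	return 0;
}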

One more thing we have to note is that, there are 4 notifiers for taking a
CPU offline:

CPU_DOWN_PREPARE
CPU_DYING
CPU_DEAD
CPU_POST_DEAD

The first can be run in parallel as mentioned above. The second is run in
parallel in the stop_machine() phase as shown in Russ' patch. But the third
and fourth set of notifications all end up running only on CPU0, which will
again slow down things.

So I suggest taking down the 1024 CPUs in multiple phases, like a binary search.
First, take 512 CPUs down, then 256 CPUs, then 128 CPUs etc. So at every bulk
CPU hotplug, we have enough online CPUs to handle the notifier load, and that
helps speed things up. Moreover, a handful of calls to stop_machine() is OK
because stop_machine() takes progressively less time, as fewer CPUs are
online on each iteration (and hence it reduces the
synchronization overhead of the stop-machine phase).

The only downside to this whole idea of running the notifiers of a given
priority in parallel, is error handling - if a notifier fails, it would be
troublesome to rollback I guess. But if we forget that for a moment, we can
give this idea a try!
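
A sketch of that halving schedule, assuming a hypothetical cpumask-based 
bulk_cpu_down() along the lines of Russ' test patch:

static void phased_disable_nonboot_cpus(void)
{
	cpumask_var_t batch;
	unsigned int remaining = num_online_cpus() - 1;	/* spare the boot CPU */

	if (!alloc_cpumask_var(&batch, GFP_KERNEL))
		return;

	while (remaining) {
		/* take half of what is left each round: 512, 256, 128, ... */
		unsigned int take = max(remaining / 2, 1U);
		unsigned int cpu;

		cpumask_clear(batch);
		for_each_online_cpu(cpu) {
			if (cpu == 0 || !take)
				continue;
			cpumask_set_cpu(cpu, batch);
			take--;
			remaining--;
		}
		bulk_cpu_down(batch);	/* hypothetical: offlines all of 'batch' */
	}
	free_cpumask_var(batch);
}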

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Robin Holt
For the v3.9 release, can we consider my awful patch?

Thanks,
Robin

On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:
 On 04/11/2013 11:01 AM, Paul Mackerras wrote:
  On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
  The optimal solution would be to just speed up the
  disable_nonboot_cpus() code so much that it isn't an issue. That would
  be good for suspending too, although I guess suspend isn't a big issue
  if you have a thousand CPU's.
 
  Has anybody checked whether we could do the cpu_down() on non-boot
  CPU's in parallel? Right now we serialize the thing completely, with
  
  I thought Srivatsa S. Bhat had a patchset that did exactly that.
  Srivatsa?
  
 
 Thanks for the CC, Paul! Adding some more people to CC.
 
 Actually, my patchset was about removing stop_machine() from the CPU
 offline path.
 http://lwn.net/Articles/538819/
 
 And here is the performance improvement I had measured in the version
 prior to that:
 http://article.gmane.org/gmane.linux.kernel/1435249
 
 I'm planning to revive this patchset after the 3.10 merge window closes,
 because it depends on doing a tree-wide sweep, and I think its a little
 late to do it in time for the upcoming 3.10 merge window itself.
 
 Anyway, that's about removing stop_machine from CPU hotplug.
 
 Coming to bulk CPU hotplug, yes, I had ideas similar to what Russ suggested.
 But I believe we can do more than that.
 
 As Russ pointed out, the notifiers are not thread-safe, so calling them
 in parallel with different CPUs as arguments isn't going to work.
 
 So, first, we can convert all the CPU hotplug notifiers to take a cpumask
 instead of a single CPU. So assuming that there are 'n' notifiers in total,
 the number of function calls would become n, instead of n*1024.
 But that itself most likely won't give us much benefit over the for-loop
 that Russ has done in his patch, because it'll simply do longer processing
 in each of those 'n' notifiers, by iterating over the cpumask inside each
 notifier.
 
 Now comes the interesting thing:
 
 Consider a notifier chain that looks like this:
 Priority 0: A->B->C->D
 
 We can't invoke say notifier callback A simultaneously on 2 CPUs with 2
 different hotcpus as argument. *However*, since A, B, C, D all (more or less)
 belong to different subsystems, we can call A, B, C and D in parallel on
 different CPUs. They won't even serialize amongst themselves because they
 take locks (if any) of different subsystems. And since they are of same
 priority, the ordering (A after B or B after A) doesn't matter as well.
 
 So with this, if we combine the idea I wrote above about giving a cpumask
 to each of these notifiers to work with, we end up in this:
 
 CPU 0  CPU 1  CPU2   
 A(cpumask) B(cpumask) C(cpumask) 
 
 So, for example, the CPU_DOWN_PREPARE notification can be processed in parallel
 on multiple CPUs at a time, for a given cpumask! That should definitely
 give us a good speed-up.
 
 One more thing we have to note is that, there are 4 notifiers for taking a
 CPU offline:
 
 CPU_DOWN_PREPARE
 CPU_DYING
 CPU_DEAD
 CPU_POST_DEAD
 
 The first can be run in parallel as mentioned above. The second is run in
 parallel in the stop_machine() phase as shown in Russ' patch. But the third
 and fourth set of notifications all end up running only on CPU0, which will
 again slow down things.
 
 So I suggest taking down the 1024 CPUs in multiple phases, like a binary search.
 First, take 512 CPUs down, then 256 CPUs, then 128 CPUs etc. So at every bulk
 CPU hotplug, we have enough online CPUs to handle the notifier load, and that
 helps speed things up. Moreover, a handful of calls to stop_machine() is OK
 because, stop_machine() takes progressively lesser and lesser time because
 lesser CPUs are online on each iteration (and hence it reduces the
 synchronization overhead of the stop-machine phase).
 
 The only downside to this whole idea of running the notifiers of a given
 priority in parallel, is error handling - if a notifier fails, it would be
 troublesome to rollback I guess. But if we forget that for a moment, we can
 give this idea a try!
 
 Regards,
 Srivatsa S. Bhat
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Russ Anderson
On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:
 On 04/11/2013 11:01 AM, Paul Mackerras wrote:
  On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
  The optimal solution would be to just speed up the
  disable_nonboot_cpus() code so much that it isn't an issue. That would
  be good for suspending too, although I guess suspend isn't a big issue
  if you have a thousand CPU's.
 
  Has anybody checked whether we could do the cpu_down() on non-boot
  CPU's in parallel? Right now we serialize the thing completely, with
  
  I thought Srivatsa S. Bhat had a patchset that did exactly that.
  Srivatsa?
  
 
 Thanks for the CC, Paul! Adding some more people to CC.
 
 Actually, my patchset was about removing stop_machine() from the CPU
 offline path.
 http://lwn.net/Articles/538819/

I certainly agree with the intent.

 And here is the performance improvement I had measured in the version
 prior to that:
 http://article.gmane.org/gmane.linux.kernel/1435249
 
 I'm planning to revive this patchset after the 3.10 merge window closes,
 because it depends on doing a tree-wide sweep, and I think its a little
 late to do it in time for the upcoming 3.10 merge window itself.
 
 Anyway, that's about removing stop_machine from CPU hotplug.
 
 Coming to bulk CPU hotplug, yes, I had ideas similar to what Russ suggested.
 But I believe we can do more than that.
 
 As Russ pointed out, the notifiers are not thread-safe, so calling them
 in parallel with different CPUs as arguments isn't going to work.
 
 So, first, we can convert all the CPU hotplug notifiers to take a cpumask
 instead of a single CPU. So assuming that there are 'n' notifiers in total,
 the number of function calls would become n, instead of n*1024.
 But that itself most likely won't give us much benefit over the for-loop
 that Russ has done in his patch, because it'll simply do longer processing
 in each of those 'n' notifiers, by iterating over the cpumask inside each
 notifier.

As an alternative, how about each cpu having its own notifier list?
Then one task per cpu can spin through that cpu's notifier list, 
allowing them to run in parallel.

I don't know if that would be a faster solution than adding a cpumask
to notifiers, but my guess is it may.

 Now comes the interesting thing:
 
 Consider a notifier chain that looks like this:
 Priority 0: A->B->C->D
 
 We can't invoke say notifier callback A simultaneously on 2 CPUs with 2
 different hotcpus as argument. *However*, since A, B, C, D all (more or less)
 belong to different subsystems, we can call A, B, C and D in parallel on
 different CPUs. They won't even serialize amongst themselves because they
 take locks (if any) of different subsystems. And since they are of same
 priority, the ordering (A after B or B after A) doesn't matter as well.
 
 So with this, if we combine the idea I wrote above about giving a cpumask
 to each of these notifiers to work with, we end up in this:
 
 CPU 0  CPU 1  CPU2   
 A(cpumask) B(cpumask) C(cpumask) 
 
 So, for example, the CPU_DOWN_PREPARE notification can be processed in parallel
 on multiple CPUs at a time, for a given cpumask! That should definitely
 give us a good speed-up.
 
 One more thing we have to note is that, there are 4 notifiers for taking a
 CPU offline:
 
 CPU_DOWN_PREPARE
 CPU_DYING
 CPU_DEAD
 CPU_POST_DEAD
 
 The first can be run in parallel as mentioned above. The second is run in
 parallel in the stop_machine() phase as shown in Russ' patch. But the third
 and fourth set of notifications all end up running only on CPU0, which will
 again slow down things.

In my testing the third and fourth set were a small part of the overall
time.  Less than 10%, with cpu notifiers 90+% of the time.  So you may
not need the added complexity, or at least fix the cpu notifier part
first.

 So I suggest taking down the 1024 CPUs in multiple phases, like a binary search.
 First, take 512 CPUs down, then 256 CPUs, then 128 CPUs etc. So at every bulk
 CPU hotplug, we have enough online CPUs to handle the notifier load, and that
 helps speed things up. Moreover, a handful of calls to stop_machine() is OK
 because, stop_machine() takes progressively lesser and lesser time because
 lesser CPUs are online on each iteration (and hence it reduces the
 synchronization overhead of the stop-machine phase).
 
 The only downside to this whole idea of running the notifiers of a given
 priority in parallel, is error handling - if a notifier fails, it would be
 troublesome to rollback I guess. But if we forget that for a moment, we can
 give this idea a try!

Yes.

 Regards,
 Srivatsa S. Bhat

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc  r...@sgi.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Srivatsa S. Bhat
On 04/11/2013 07:53 PM, Russ Anderson wrote:
 On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:
 On 04/11/2013 11:01 AM, Paul Mackerras wrote:
 On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
 The optimal solution would be to just speed up the
 disable_nonboot_cpus() code so much that it isn't an issue. That would
 be good for suspending too, although I guess suspend isn't a big issue
 if you have a thousand CPU's.

 Has anybody checked whether we could do the cpu_down() on non-boot
 CPU's in parallel? Right now we serialize the thing completely, with

 I thought Srivatsa S. Bhat had a patchset that did exactly that.
 Srivatsa?


 Thanks for the CC, Paul! Adding some more people to CC.

 Actually, my patchset was about removing stop_machine() from the CPU
 offline path.
 http://lwn.net/Articles/538819/
 
 I certainly agree with the intent.
 

Thank you!

 And here is the performance improvement I had measured in the version
 prior to that:
 http://article.gmane.org/gmane.linux.kernel/1435249

 I'm planning to revive this patchset after the 3.10 merge window closes,
 because it depends on doing a tree-wide sweep, and I think its a little
 late to do it in time for the upcoming 3.10 merge window itself.

 Anyway, that's about removing stop_machine from CPU hotplug.

 Coming to bulk CPU hotplug, yes, I had ideas similar to what Russ suggested.
 But I believe we can do more than that.

 As Russ pointed out, the notifiers are not thread-safe, so calling them
 in parallel with different CPUs as arguments isn't going to work.

 So, first, we can convert all the CPU hotplug notifiers to take a cpumask
 instead of a single CPU. So assuming that there are 'n' notifiers in total,
 the number of function calls would become n, instead of n*1024.
 But that itself most likely won't give us much benefit over the for-loop
 that Russ has done in his patch, because it'll simply do longer processing
 in each of those 'n' notifiers, by iterating over the cpumask inside each
 notifier.
 
 As an alternative, how about each cpu have their own notifier list?
 Then one task per cpu can spin through that cpu's notifier list, 
 allowing them to run in parallel.
 
 I don't know if that would be a faster solution than adding cpumask
 to notifiers, but it my guess is it may.
 

That might not work out well because those notifiers will have to lock
against each other. That is, notifier callback A cannot run as A(cpuX)
and A(cpuY) in parallel. They will have to serialize themselves, which
will make the whole effort useless. But, as I mentioned earlier, A(cpuX)
and B(cpuX) can run in parallel without additional serialization, if A
and B are completely different callbacks (ie., belonging to different
subsystems).

 Now comes the interesting thing:

 Consider a notifier chain that looks like this:
 Priority 0: A->B->C->D

 We can't invoke say notifier callback A simultaneously on 2 CPUs with 2
 different hotcpus as argument. *However*, since A, B, C, D all (more or less)
 belong to different subsystems, we can call A, B, C and D in parallel on
 different CPUs. They won't even serialize amongst themselves because they
 take locks (if any) of different subsystems. And since they are of same
 priority, the ordering (A after B or B after A) doesn't matter as well.

 So with this, if we combine the idea I wrote above about giving a cpumask
 to each of these notifiers to work with, we end up in this:

 CPU 0  CPU 1  CPU2   
 A(cpumask) B(cpumask) C(cpumask) 

 So, for example, the CPU_DOWN_PREPARE notification can be processed in parallel
 on multiple CPUs at a time, for a given cpumask! That should definitely
 give us a good speed-up.

 One more thing we have to note is that, there are 4 notifiers for taking a
 CPU offline:

 CPU_DOWN_PREPARE
 CPU_DYING
 CPU_DEAD
 CPU_POST_DEAD

 The first can be run in parallel as mentioned above. The second is run in
 parallel in the stop_machine() phase as shown in Russ' patch. But the third
 and fourth set of notifications all end up running only on CPU0, which will
 again slow down things.
 
 In my testing the third and fourth set were a small part of the overall
 time.  Less than 10%, with cpu notifiers 90+% of the time.

*All* of them are cpu notifiers! All of them invoke __cpu_notify() internally.
So how did you differentiate between them and find out that the third and
fourth sets take less time?

  So you may
 not need the added complexity, or at least fix the cpu notifier part
 first.
 

To make the 3rd and 4th run fast, the only thing we need to do is take CPUs
offline in smaller steps, like 512, 256 etc.. It doesn't add any extra
complexity over and above what is necessary to make the cpu notifiers run
in parallel in the first place.

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Russ Anderson
On Thu, Apr 11, 2013 at 08:15:27PM +0530, Srivatsa S. Bhat wrote:
 On 04/11/2013 07:53 PM, Russ Anderson wrote:
  On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:
 
  One more thing we have to note is that, there are 4 notifiers for taking a
  CPU offline:
 
  CPU_DOWN_PREPARE
  CPU_DYING
  CPU_DEAD
  CPU_POST_DEAD
 
  The first can be run in parallel as mentioned above. The second is run in
  parallel in the stop_machine() phase as shown in Russ' patch. But the third
  and fourth set of notifications all end up running only on CPU0, which will
  again slow down things.
  
  In my testing the third and fourth set were a small part of the overall
  time.  Less than 10%, with cpu notifiers 90+% of the time.
 
 *All* of them are cpu notifiers! All of them invoke __cpu_notify() internally.
 So how did you differentiate between them and find out that the third and
 fourth sets take less time?

I reran a test on a 1024 cpu system, using my test patch to only call
__stop_machine() once.  Added printks to show the kernel timestamp
at various points.

When calling disable_nonboot_cpus() and enable_nonboot_cpus() just after
booting the system:
 The loop calling __cpu_notify(CPU_DOWN_PREPARE) took 376.6 seconds.
 The loop calling cpu_notify_nofail(CPU_DEAD) took 8.1 seconds.

My guess is that notifiers do more work in the CPU_DOWN_PREPARE case.

I also added a loop calling a new notifier (CPU_TEST) which none of the
notifiers would recognize, to measure the time it took to spin through
the call chain without the notifiers doing any work.  It took
0.0067 seconds.
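
For reference, a sketch of what that probe may have looked like (CPU_TEST 
is a made-up action value from the test patch, not an upstream constant; 
__cpu_notify() is the kernel/cpu.c helper used elsewhere in this thread):

#define CPU_TEST	0x00ff	/* assumed: matches no notifier's switch */

static void time_empty_notifier_walk(void)
{
	ktime_t t0 = ktime_get();
	int cpu;

	for_each_online_cpu(cpu)
		__cpu_notify(CPU_TEST, (void *)(long)cpu, -1, NULL);

	pr_info("empty notifier walk took %lld ns\n",
		ktime_to_ns(ktime_sub(ktime_get(), t0)));
}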

On the actual reboot, as the system was shutting down:
 The loop calling __cpu_notify(CPU_DOWN_PREPARE) took 333.8 seconds.
 The loop calling cpu_notify_nofail(CPU_DEAD) took 2.7 seconds.

I don't know how many notifiers are on the chain, or if there is
one heavy hitter accounting for much of the time in the
CPU_DOWN_PREPARE case.


FWIW, the overall cpu stop times are somewhat longer than what I
measured before.  Not sure if the difference is due to changes in
my test patch, other kernel changes pulled in, or some difference
on the test system.


   So you may
  not need the added complexity, or at least fix the cpu notifier part
  first.
  
 
 To make the 3rd and 4th run fast, the only thing we need to do is take CPUs
 offline in smaller steps, like 512, 256 etc.. It doesn't add any extra
 complexity over and above what is necessary to make the cpu notifiers run
 in parallel in the first place.
 
 Regards,
 Srivatsa S. Bhat

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc  r...@sgi.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Srivatsa S. Bhat
On 04/12/2013 01:38 AM, Russ Anderson wrote:
 On Thu, Apr 11, 2013 at 08:15:27PM +0530, Srivatsa S. Bhat wrote:
 On 04/11/2013 07:53 PM, Russ Anderson wrote:
 On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:

 One more thing we have to note is that, there are 4 notifiers for taking a
 CPU offline:

 CPU_DOWN_PREPARE
 CPU_DYING
 CPU_DEAD
 CPU_POST_DEAD

 The first can be run in parallel as mentioned above. The second is run in
 parallel in the stop_machine() phase as shown in Russ' patch. But the third
 and fourth set of notifications all end up running only on CPU0, which will
 again slow down things.

 In my testing the third and fourth set were a small part of the overall
 time.  Less than 10%, with cpu notifiers 90+% of the time.

 *All* of them are cpu notifiers! All of them invoke __cpu_notify() internally.
 So how did you differentiate between them and find out that the third and
 fourth sets take less time?
 
 I reran a test on a 1024 cpu system, using my test patch to only call
 __stop_machine() once.  Added printks to show the kernel timestamp
 at various points.
 
 When calling disable_nonboot_cpus() and enable_nonboot_cpus() just after
 booting the system:
  The loop calling __cpu_notify(CPU_DOWN_PREPARE) took 376.6 seconds.
  The loop calling cpu_notify_nofail(CPU_DEAD) took 8.1 seconds.
 
 My guess is that notifiers do more work in the CPU_DOWN_PREPARE case.
 
 I also added a loop calling a new notifier (CPU_TEST) which none of
 notifiers would recognize, to measure the time it took to spin through
 the call chain without the notifiers doing any work.  It took
 0.0067 seconds.
 
 On the actual reboot, as the system was shutting down:
  The loop calling __cpu_notify(CPU_DOWN_PREPARE) took 333.8 seconds.
  The loop calling cpu_notify_nofail(CPU_DEAD) took 2.7 seconds.
 
 I don't know how many notifiers are on the chain, or if there is
 one heavy hitter accounting for much of the time in the
 CPU_DOWN_PREPARE case.
 
 
 FWIW, the overall cpu stop times are somewhat longer than what I
 measured before.  Not sure if the difference is due to changes in
 my test patch, other kernel changes pulled in, or some difference
 on the test system.
 
 

Thanks a lot for reporting the time taken at each stage. It's extremely
useful. So, we can drop the idea of taking CPUs down in multiple rounds
like 512, 256 etc. And, like you mentioned earlier, just running the
CPU_DOWN_PREPARE notifiers in parallel (like we discussed earlier) should
give us all the performance improvement. Or perhaps, we can instrument
the code in kernel/notifier.c (notifier_call_chain) to find out if there
is a rogue notifier which contributes most to the ~300 seconds.

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Robin Holt
On Thu, Apr 11, 2013 at 03:08:20PM -0500, Russ Anderson wrote:
 On Thu, Apr 11, 2013 at 08:15:27PM +0530, Srivatsa S. Bhat wrote:
  On 04/11/2013 07:53 PM, Russ Anderson wrote:
   On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:
  
   One more thing we have to note is that, there are 4 notifiers for taking 
   a CPU offline:
  
   CPU_DOWN_PREPARE
   CPU_DYING
   CPU_DEAD
   CPU_POST_DEAD
  
   The first can be run in parallel as mentioned above. The second is run in
   parallel in the stop_machine() phase as shown in Russ' patch. But the 
   third and fourth set of notifications all end up running only on CPU0, 
   which will again slow down things.
   
   In my testing the third and fourth set were a small part of the overall
   time.  Less than 10%, with cpu notifiers 90+% of the time.
  
  *All* of them are cpu notifiers! All of them invoke __cpu_notify() internally.
  So how did you differentiate between them and find out that the third and
  fourth sets take less time?
 
 I reran a test on a 1024 cpu system, using my test patch to only call
 __stop_machine() once.  Added printks to show the kernel timestamp
 at various points.
 
 When calling disable_nonboot_cpus() and enable_nonboot_cpus() just after
 booting the system:
  The loop calling __cpu_notify(CPU_DOWN_PREPARE) took 376.6 seconds.
  The loop calling cpu_notify_nofail(CPU_DEAD) took 8.1 seconds.
 
 My guess is that notifiers do more work in the CPU_DOWN_PREPARE case.
 
 I also added a loop calling a new notifier (CPU_TEST) which none of
 notifiers would recognize, to measure the time it took to spin through
 the call chain without the notifiers doing any work.  It took
 0.0067 seconds.
 
 On the actual reboot, as the system was shutting down:
  The loop calling __cpu_notify(CPU_DOWN_PREPARE) took 333.8 seconds.
  The loop calling cpu_notify_nofail(CPU_DEAD) took 2.7 seconds.

How about if you take the notifier_call_chain function, copy it
to kernel/sys.c, and time each notifier_call() callout individually?
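
Something along those lines - a sketch that borrows the shape of 
notifier_call_chain() from kernel/notifier.c of that era, with the ktime 
and pr_info() lines as the only additions:

static int timed_notifier_call_chain(struct notifier_block **nl,
				     unsigned long val, void *v)
{
	struct notifier_block *nb = rcu_dereference_raw(*nl);
	int ret = NOTIFY_DONE;

	while (nb) {
		ktime_t t0 = ktime_get();

		ret = nb->notifier_call(nb, val, v);

		pr_info("notifier %pf took %lld ns\n", nb->notifier_call,
			ktime_to_ns(ktime_sub(ktime_get(), t0)));

		if ((ret & NOTIFY_STOP_MASK) == NOTIFY_STOP_MASK)
			break;
		nb = rcu_dereference_raw(nb->next);
	}
	return ret;
}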

Robin
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)

2013-04-11 Thread Ingo Molnar

* Robin Holt h...@sgi.com wrote:

 For the v3.9 release, can we consider my awful patch?

How about trying what I suggested, to make reboot affine to the boot CPU 
explicitly, not by shutting down all the other CPUs, but by set_cpus_allowed() 
or so?

That should solve the regression, without the ugly special-casing - while 
giving time to address the hot-unplug performance bottleneck.

Once that is done disable_nonboot_cpus() can be used again for reboot.

(But no strong feelings either way - both solutions are a workaround in a 
sense.)

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Paul Mackerras
On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
> The optimal solution would be to just speed up the
> disable_nonboot_cpus() code so much that it isn't an issue. That would
> be good for suspending too, although I guess suspend isn't a big issue
> if you have a thousand CPU's.
> 
> Has anybody checked whether we could do the cpu_down() on non-boot
> CPU's in parallel? Right now we serialize the thing completely, with

I thought Srivatsa S. Bhat had a patchset that did exactly that.
Srivatsa?

Paul.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Russ Anderson
On Wed, Apr 10, 2013 at 06:59:34PM +0200, Ingo Molnar wrote:
> * Russ Anderson  wrote:
> 
> > Yes, I have a test patch that replaces for_each_online_cpu(cpu) with a cpu 
> > bitmask in disable_nonboot_cpus().  The lower level routines already take a 
> > bitmask.  It allows __stop_machine() to be called just once.  That change 
> > reduces shutdown time on a 1024 cpu machine from 16 minutes to 4 minutes.  
> > Significant improvement, but not good enough.
> > 
> > The next significant bottleneck is __cpu_notify().  Tried creating worker 
> > threads to parallelize the shutdown, but the problem is __cpu_notify() is 
> > not 
> > thread safe.  Putting a lock around it caused all the worker threads to 
> > fight 
> > over the lock.
> 
> 4 minutes bootup is 240 seconds, with 1024 CPUs that's about 240 msecs per 
> CPU.
> 
> That sounds a lot, given that unlike bootup there's not much real work to be 
> done 
> during shutdown - we don't initialize anything, etc.
> 
> Maybe much of those 240 msecs are spent in some stupid udelay loop or so, 
> which 
> could be made parallel?

I was hoping for a stupid udelay when I first started looking
at this code, but found nothing obvious.

The bulk of the time (after making the cpu bitmask change) is
spent in __cpu_notify(), as explained above.


> Would it be possible to create a 'reboot but stop at the end and reactivate 
> all 
> CPUs again' reboot flag, so that it can all be NMI-profiled, to see where the 
> true 
> bottleneck is? A naked disable_nonboot_cpus() call in essence.

My testing was similar.  I hacked a kernel module to call
disable_nonboot_cpus() and enable_nonboot_cpus() and used
printks to narrow down the slow functions.  That points
at the cpu notifier call chain.  It's not clear if any
of the functions on the call chain take a long time, or
if just going sequentially through the list for all cpus
takes a long time.

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc  r...@sgi.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Russ Anderson
On Wed, Apr 10, 2013 at 10:29:12AM -0500, Russ Anderson wrote:
> On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
> > 
> > Yeah, we've had issues with ACPI in the past, so I do think we should
> > always reboot using the BP. Even if it almost certainly works on 99+%
> > of all machines on any random CPU.
> > 
> > The optimal solution would be to just speed up the
> > disable_nonboot_cpus() code so much that it isn't an issue. That would
> > be good for suspending too, although I guess suspend isn't a big issue
> > if you have a thousand CPU's.
> > 
> > Has anybody checked whether we could do the cpu_down() on non-boot
> > CPU's in parallel? Right now we serialize the thing completely, with
> > one single
> > 
> > for_each_online_cpu(cpu) {
> > ...
> > 
> > loop that does a synchronous _cpu_down() for each CPU. No wonder it
> > takes forever. We do __stop_machine() over and over and over again:
> > the whole thing is basically O(n**2) in CPU's.
> 
> Yes, I have a test patch that replaces for_each_online_cpu(cpu)
> with a cpu bitmask in disable_nonboot_cpus().  The lower level
> routines already take a bitmask.  It allows __stop_machine() to
> be called just once.  That change reduces shutdown time on a
> 1024 cpu machine from 16 minutes to 4 minutes.  Significant improvement,
> but not good enough.
> 
> The next significant bottleneck is __cpu_notify().  Tried creating
> worker threads to parallelize the shutdown, but the problem is
> __cpu_notify() is not thread safe.  Putting a lock around it
> caused all the worker threads to fight over the lock.
> 
> Note that __cpu_notify() has to be called for all cpus being
> shut down because the cpu_chain notifier call chain has cpu as a
> parameter.  The dilemma is that cpu_chain notifiers need to be called on
> all cpus, but cannot be done in parallel due to __cpu_notify() not being
> thread safe.  Spinning through the notifier chain sequentially for all
> cpus just takes a long time.
> 
> The real fix would be to make the cpu_chain notifier per cpu, or at
> least thread safe, so that all the cpus being shut down could do so
> in parallel.  That is a significant change with ramifications on
> other code.
> 
> I will post a patch shortly with the cpu bitmask change.  Changing
> __cpu_notify() will take more discussion.

Here is the test patch with the cpu bitmask change.  It results
in calling __stop_machine() just once.

After making feedback changes I'll formally submit the patch.

---
 kernel/cpu.c |   94 +++
 1 file changed, 56 insertions(+), 38 deletions(-)

Index: linux/kernel/cpu.c
===
--- linux.orig/kernel/cpu.c 2013-04-10 17:16:40.084960495 -0500
+++ linux/kernel/cpu.c  2013-04-10 17:17:27.988245160 -0500
@@ -262,10 +262,11 @@ static int __ref take_cpu_down(void *_pa
 }
 
 /* Requires cpu_add_remove_lock to be held */
-static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
+static int __ref _cpu_down(const cpumask_t *cpus_to_offline, int tasks_frozen)
 {
-   int err, nr_calls = 0;
+   int cpu = 0, err = 0, nr_calls = 0;
void *hcpu = (void *)(long)cpu;
+   cpumask_var_t cpus_offlined;
unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
struct take_cpu_down_param tcd_param = {
.mod = mod,
@@ -278,46 +279,65 @@ static int __ref _cpu_down(unsigned int
if (!cpu_online(cpu))
return -EINVAL;
 
+   if (!alloc_cpumask_var(&cpus_offlined, GFP_KERNEL))
+   return -ENOMEM;
+
cpu_hotplug_begin();
+   cpumask_clear(cpus_offlined);
+   cpumask_copy(cpus_offlined, cpus_to_offline);
 
-   err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);
-   if (err) {
-   nr_calls--;
-   __cpu_notify(CPU_DOWN_FAILED | mod, hcpu, nr_calls, NULL);
-   printk("%s: attempt to take down CPU %u failed\n",
+   for_each_cpu_mask(cpu, *cpus_to_offline) {
+   hcpu = (void *)(long)cpu;
+   if (!cpu_online(cpu))
+   continue;
+   tcd_param.hcpu = hcpu;
+   err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);
+   if (err) {
+   nr_calls--;
+   __cpu_notify(CPU_DOWN_FAILED | mod, hcpu, nr_calls, NULL);
+   pr_err("%s: attempt to take down CPU %u failed\n",
__func__, cpu);
-   goto out_release;
+   goto out_release;
+   }
+   smpboot_park_threads(cpu);
}
-   smpboot_park_threads(cpu);
 
-   err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
+   err = __stop_machine(take_cpu_down, &tcd_param, cpus_to_offline);
if (err) {
/* CPU didn't die: tell everyone.  Can't complain. */
-   

Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Robin Holt
> > I'm proposing to make 'reboot' overhead profilable, via a debug hack:
> > 
> > echo 1 > /proc/sys/kernel/magic_dont_fully_reboot_flag
> > 
> > perf record reboot
> > 
> > perf is using NMIs to profile - and since much of cpu_down() is with irqs 
> > disabled, NMI profiling would be needed to see inside the overhead.
> > 
> > (Assuming the 240 msecs is CPU overhead, not waiting for some IRQ/IPI 
> > event.)


I had the machine booted as 512 cpus.
I tweaked the kernel like this:

diff --git a/kernel/sys.c b/kernel/sys.c
index 39c9c4a..b42bd4f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -368,8 +368,10 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
  */
 void kernel_restart(char *cmd)
 {
-   kernel_restart_prepare(cmd);
+   // kernel_restart_prepare(cmd);
disable_nonboot_cpus();
+   enable_nonboot_cpus();
+   return;
if (!cmd)
printk(KERN_EMERG "Restarting system.\n");
else

perf record -a /sbin/reboot -d -f -n

The top of 'perf report' has:
Events: 14M cycles
 22.58%  swapper  [kernel.kallsyms]   [k] update_cfs_rq_blocked_load
 10.52%  swapper  [kernel.kallsyms]   [k] load_balance
  4.96%  swapper  [kernel.kallsyms]   [k] ktime_get
  4.12%  swapper  [kernel.kallsyms]   [k] update_blocked_averages
  3.55%  swapper  [kernel.kallsyms]   [k] idle_cpu
  1.97%  swapper  [kernel.kallsyms]   [k] uv_read_rtc
  0.98%  swapper  [kernel.kallsyms]   [k] rcu_process_gp_end
  0.84%  swapper  [kernel.kallsyms]   [k] apic_timer_interrupt
  0.84%  swapper  [kernel.kallsyms]   [k] __lock_text_start
  0.84%  swapper  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
  0.73%  swapper  [kernel.kallsyms]   [k] native_safe_halt
  0.56%  swapper  [kernel.kallsyms]   [k] rcu_check_callbacks
  0.56%  swapper  [kernel.kallsyms]   [k] native_write_msr_safe
  0.44%  swapper  [kernel.kallsyms]   [k] cpumask_next_and
  0.42%   reboot  [kernel.kallsyms]   [k] kmem_cache_alloc_node

The perf data is 676 MB.  I don't know how well it compresses, but the
lzma task has been running for a while.

Thanks,
Robin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread H. Peter Anvin
On 04/10/2013 09:59 AM, Ingo Molnar wrote:
> 
> 4 minutes bootup is 240 seconds, with 1024 CPUs that's about 240 msecs per 
> CPU.
> 
> That sounds a lot, given that unlike bootup there's not much real work to be 
> done 
> during shutdown - we don't initialize anything, etc.
> 
> Maybe much of those 240 msecs are spent in some stupid udelay loop or so, 
> which 
> could be made parallel?
> 

I wonder if it isn't calling stop_machine() 1024 times, each time
rendezvousing all the still-active CPUs.

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Robin Holt
On Wed, Apr 10, 2013 at 07:22:36PM +0200, Ingo Molnar wrote:
> 
> * Robin Holt  wrote:
> 
> > On Wed, Apr 10, 2013 at 06:59:34PM +0200, Ingo Molnar wrote:
> > > 
> > > * Russ Anderson  wrote:
> > > 
> > > > Yes, I have a test patch that replaces for_each_online_cpu(cpu) with a 
> > > > cpu 
> > > > bitmask in disable_nonboot_cpus().  The lower level routines already 
> > > > take a 
> > > > bitmask.  It allows __stop_machine() to be called just once.  That 
> > > > change 
> > > > reduces shutdown time on a 1024 cpu machine from 16 minutes to 4 minutes.  
> > > > Significant improvement, but not good enough.
> > > > 
> > > > The next significant bottleneck is __cpu_notify().  Tried creating 
> > > > worker 
> > > > threads to parallelize the shutdown, but the problem is __cpu_notify() 
> > > > is not 
> > > > thread safe.  Putting a lock around it caused all the worker threads to 
> > > > fight 
> > > > over the lock.
> > > 
> > > 4 minutes bootup is 240 seconds, with 1024 CPUs that's about 240 msecs 
> > > per CPU.
> > > 
> > > That sounds a lot, given that unlike bootup there's not much real work to 
> > > be done 
> > > during shutdown - we don't initialize anything, etc.
> > > 
> > > Maybe much of those 240 msecs are spent in some stupid udelay loop or so, 
> > > which 
> > > could be made parallel?
> > > 
> > > Would it be possible to create a 'reboot but stop at the end and 
> > > reactivate all 
> > > CPUs again' reboot flag, so that it can all be NMI-profiled, to see where 
> > > the true 
> > > bottleneck is? A naked disable_nonboot_cpus() call in essence.
> > 
> > What, exactly, are you proposing with the NMI profiling? [...]
> 
> I'm proposing to make 'reboot' overhead profilable, via a debug hack:
> 
> echo 1 > /proc/sys/kernel/magic_dont_fully_reboot_flag
> 
>   perf record reboot
> 
> perf is using NMIs to profile - and since much of cpu_down() is with irqs 
> disabled, NMI profiling would be needed to see inside the overhead.
> 
> (Assuming the 240 msecs is CPU overhead, not waiting for some IRQ/IPI event.)

Let me give it a try.

Robin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Ingo Molnar

* Robin Holt  wrote:

> On Wed, Apr 10, 2013 at 06:59:34PM +0200, Ingo Molnar wrote:
> > 
> > * Russ Anderson  wrote:
> > 
> > > Yes, I have a test patch that replaces for_each_online_cpu(cpu) with a 
> > > cpu 
> > > bitmask in disable_nonboot_cpus().  The lower level routines already take 
> > > a 
> > > bitmask.  It allows __stop_machine() to be called just once.  That change 
> > > reduces shutdown time on a 1024 cpu machine from 16 minutes to 4 minutes.  
> > > Significant improvement, but not good enough.
> > > 
> > > The next significant bottleneck is __cpu_notify().  Tried creating worker 
> > > threads to parallelize the shutdown, but the problem is __cpu_notify() is 
> > > not 
> > > thread safe.  Putting a lock around it caused all the worker threads to 
> > > fight 
> > > over the lock.
> > 
> > 4 minutes bootup is 240 seconds, with 1024 CPUs that's about 240 msecs per 
> > CPU.
> > 
> > That sounds a lot, given that unlike bootup there's not much real work to 
> > be done 
> > during shutdown - we don't initialize anything, etc.
> > 
> > Maybe much of those 240 msecs are spent in some stupid udelay loop or so, 
> > which 
> > could be made parallel?
> > 
> > Would it be possible to create a 'reboot but stop at the end and reactivate 
> > all 
> > CPUs again' reboot flag, so that it can all be NMI-profiled, to see where 
> > the true 
> > bottleneck is? A naked disable_nonboot_cpus() call in essence.
> 
> What, exactly, are you proposing with the NMI profiling? [...]

I'm proposing to make 'reboot' overhead profilable, via a debug hack:

echo 1 > /proc/sys/kernel/magic_dont_fully_reboot_flag

perf record reboot

perf is using NMIs to profile - and since much of cpu_down() is with irqs 
disabled, NMI profiling would be needed to see inside the overhead.

(Assuming the 240 msecs is CPU overhead, not waiting for some IRQ/IPI event.)

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Robin Holt
On Wed, Apr 10, 2013 at 06:59:34PM +0200, Ingo Molnar wrote:
> 
> * Russ Anderson  wrote:
> 
> > Yes, I have a test patch that replaces for_each_online_cpu(cpu) with a cpu 
> > bitmask in disable_nonboot_cpus().  The lower level routines already take a 
> > bitmask.  It allows __stop_machine() to be called just once.  That change 
> > reduces shutdown time on a 1024 cpu machine from 16 minutes to 4 minutes.  
> > Significant improvement, but not good enough.
> > 
> > The next significant bottleneck is __cpu_notify().  Tried creating worker 
> > threads to parallelize the shutdown, but the problem is __cpu_notify() is 
> > not 
> > thread safe.  Putting a lock around it caused all the worker threads to 
> > fight 
> > over the lock.
> 
> 4 minutes bootup is 240 seconds, with 1024 CPUs that's about 240 msecs per 
> CPU.
> 
> That sounds a lot, given that unlike bootup there's not much real work to be 
> done 
> during shutdown - we don't initialize anything, etc.
> 
> Maybe much of those 240 msecs are spent in some stupid udelay loop or so, 
> which 
> could be made parallel?
> 
> Would it be possible to create a 'reboot but stop at the end and reactivate 
> all 
> CPUs again' reboot flag, so that it can all be NMI-profiled, to see where the 
> true 
> bottleneck is? A naked disable_nonboot_cpus() call in essence.

What, exactly, are you proposing with the NMI profiling?  Currently,
if I NMI the system, I get dump_stack() output for all cpus.  Without
introducing a lock to serialize those, the stacks are really just a
jumbled mess.  With a lock, things are fairly slow.

Are you proposing something other than looking at stack dumps?

Thanks,
Robin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Ingo Molnar

* Russ Anderson  wrote:

> Yes, I have a test patch that replaces for_each_online_cpu(cpu) with a cpu 
> bitmask in disable_nonboot_cpus().  The lower level routines already take a 
> bitmask.  It allows __stop_machine() to be called just once.  That change 
> reduces shutdown time on a 1024 cpu machine from 16 minutes to 4 minutes.  
> Significant improvement, but not good enough.
> 
> The next significant bottleneck is __cpu_notify().  Tried creating worker 
> threads to parallelize the shutdown, but the problem is __cpu_notify() is not 
> thread safe.  Putting a lock around it caused all the worker threads to fight 
> over the lock.

4 minutes bootup is 240 seconds, with 1024 CPUs that's about 240 msecs per CPU.

That sounds a lot, given that unlike bootup there's not much real work to be 
done 
during shutdown - we don't initialize anything, etc.

Maybe much of those 240 msecs are spent in some stupid udelay loop or so, which 
could be made parallel?

Would it be possible to create a 'reboot but stop at the end and reactivate all 
CPUs again' reboot flag, so that it can all be NMI-profiled, to see where the 
true 
bottleneck is? A naked disable_nonboot_cpus() call in essence.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Russ Anderson
On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
> On Wed, Apr 10, 2013 at 4:16 AM, Ingo Molnar  wrote:
> >
> > I think rebooting on the same CPU where we booted up is something worth 
> > having in
> > general, as a firmware robustness feature. (assuming the CPU in question is 
> > still
> > online)
> 
> Yeah, we've had issues with ACPI in the past, so I do think we should
> always reboot using the BP. Even if it almost certainly works on 99+%
> of all machines on any random CPU.
> 
> The optimal solution would be to just speed up the
> disable_nonboot_cpus() code so much that it isn't an issue. That would
> be good for suspending too, although I guess suspend isn't a big issue
> if you have a thousand CPU's.
> 
> Has anybody checked whether we could do the cpu_down() on non-boot
> CPU's in parallel? Right now we serialize the thing completely, with
> one single
> 
> for_each_online_cpu(cpu) {
> ...
> 
> > loop that does a synchronous _cpu_down() for each CPU. No wonder it
> takes forever. We do __stop_machine() over and over and over again:
> the whole thing is basically O(n**2) in CPU's.

Yes, I have a test patch that replaces for_each_online_cpu(cpu)
with a cpu bitmask in disable_nonboot_cpus().  The lower level
routines already take a bitmask.  It allows __stop_machine() to
be called just once.  That change reduces shutdown time on a
1024 cpu machine from 16 minutes to 4 minutes.  Significant improvement,
but not good enough.

The next significant bottleneck is __cpu_notify().  Tried creating
worker threads to parallelize the shutdown, but the problem is
__cpu_notify() is not thread safe.  Putting a lock around it
caused all the worker threads to fight over the lock.

Note that __cpu_notify() has to be called for all cpus being
shut down because the cpu_chain notifier call chain has cpu as a
parameter.  The dilemma is that cpu_chain notifiers need to be called on
all cpus, but cannot be done in parallel due to __cpu_notify() not being
thread safe.  Spinning through the notifier chain sequentially for all
cpus just takes a long time.

The real fix would be to make the cpu_chain notifier per cpu, or at
least thread safe, so that all the cpus being shut down could do so
in parallel.  That is a significant change with ramifications on
other code.
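
A per-cpu variant might start out along these lines (purely illustrative - 
none of these helpers exist):

static DEFINE_PER_CPU(struct raw_notifier_head, percpu_cpu_chain);

int register_percpu_cpu_notifier(unsigned int cpu, struct notifier_block *nb)
{
	/* callers serialize registration, as with the global cpu_chain */
	return raw_notifier_chain_register(&per_cpu(percpu_cpu_chain, cpu), nb);
}

int percpu_cpu_notify(unsigned int cpu, unsigned long val)
{
	/*
	 * Each CPU's teardown walks only its own chain, so walks for
	 * different CPUs no longer contend with one another.
	 */
	return raw_notifier_call_chain(&per_cpu(percpu_cpu_chain, cpu),
				       val, (void *)(long)cpu);
}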

I will post a patch shortly with the cpu bitmask change.  Changing
__cpu_notify() will take more discussion.

> Linus

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc  r...@sgi.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Linus Torvalds
On Wed, Apr 10, 2013 at 4:16 AM, Ingo Molnar  wrote:
>
> I think rebooting on the same CPU where we booted up is something worth 
> having in
> general, as a firmware robustness feature. (assuming the CPU in question is 
> still
> online)

Yeah, we've had issues with ACPI in the past, so I do think we should
always reboot using the BP. Even if it almost certainly works on 99+%
of all machines on any random CPU.

The optimal solution would be to just speed up the
disable_nonboot_cpus() code so much that it isn't an issue. That would
be good for suspending too, although I guess suspend isn't a big issue
if you have a thousand CPU's.

Has anybody checked whether we could do the cpu_down() on non-boot
CPU's in parallel? Right now we serialize the thing completely, with
one single

for_each_online_cpu(cpu) {
...

loop that does a synchronous _cpu_down() for each CPU. No wonder it
takes forever. We do __stop_machine() over and over and over again:
the whole thing is basically O(n**2) in CPU's.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Robin Holt
On Wed, Apr 10, 2013 at 01:16:20PM +0200, Ingo Molnar wrote:
> 
> * Robin Holt  wrote:
> 
> > On Mon, Apr 08, 2013 at 09:11:06AM -0700, H. Peter Anvin wrote:
> > > On 04/08/2013 08:57 AM, Ingo Molnar wrote:
> > > > 
> > > > I think the original commit:
> > > > 
> > > >   f96972f2dc63 kernel/sys.c: call disable_nonboot_cpus() in 
> > > > kernel_restart()
> > > > 
> > > > actually regressed your 1024 CPU systems, and should possibly be 
> > > > reverted or fixed 
> > > > in some other fashion - such as by migrating to the primary CPU (on 
> > > > architectures 
> > > > that require that), instead of hotplug offlining every secondary CPU on 
> > > > every 
> > > > architecture!
> > > > 
> > > > Alternatively, disable_nonboot_cpus() could perhaps be improved to down 
> > > > CPUs in 
> > > > parallel: issue the CPU-down requests to every CPU, then wait for them 
> > > > to complete 
> > > > - instead of the loop over every CPU?
> > > > 
> > > > This would be the conceptual counter part to parallel boot up of CPUs - 
> > > > something 
> > > > SGI might be interested in as well?
> > > > 
> > > 
> > > Migrating to the boot processor and then calling stop_machine() to
> > > defang any other processors should be sufficient, no?
> > > 
> > > I don't know if there is any reason to deschedule all tasks?
> > 
> > My reading of the original commit indicated that some architecture's
> > firmware needs the boot cpu to be the one initiating reboot.
> > 
> > If that is correct, then I can not see why a stop_machine() implementation
> > will not work.
> > 
> > Since this is in generic kernel code, how can I proceed?
> 
> I think rebooting on the same CPU where we booted up is something worth 
> having in general, as a firmware robustness feature. (assuming the CPU in 
> question is still online)
> 
> We have similar constraints in the suspend code for example - some x86 
> firmware breaks if suspend related ACPI calls are not done on the boot CPU ...
> 
> So how about restoring the old "just reboot, don't shut down the others" 
> behavior, 
> extended with a "reboot on the CPU that booted up" reboot affinity logic?

Just want to be sure I am going the right direction, but in the shutdown and
reboot case, you would support something like:

diff --git a/kernel/sys.c b/kernel/sys.c
index 39c9c4a..35845c5 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -358,6 +358,18 @@ int unregister_reboot_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL(unregister_reboot_notifier);
 
+void migrate_to_boot_cpu(void)
+{
+   cpumask_t *shutdown_cpu_mask;
+
+   shutdown_cpu_mask = kzalloc(sizeof(cpumask_t), GFP_KERNEL);
+   if (shutdown_cpu_mask) {
+   cpumask_set_cpu(0, shutdown_cpu_mask);
+   cpumask_and(shutdown_cpu_mask, shutdown_cpu_mask, cpu_online_mask);
+   set_cpus_allowed_ptr(current, shutdown_cpu_mask);
+   }
+}
+
 /**
  * kernel_restart - reboot the system
  * @cmd: pointer to buffer containing command to execute for restart
@@ -369,7 +381,7 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
 void kernel_restart(char *cmd)
 {
kernel_restart_prepare(cmd);
-   disable_nonboot_cpus();
+   migrate_to_boot_cpu();
if (!cmd)
printk(KERN_EMERG "Restarting system.\n");
else
@@ -413,7 +425,7 @@ void kernel_power_off(void)
kernel_shutdown_prepare(SYSTEM_POWER_OFF);
if (pm_power_off_prepare)
pm_power_off_prepare();
-   disable_nonboot_cpus();
+   migrate_to_boot_cpu();
syscore_shutdown();
printk(KERN_EMERG "Power down.\n");
kmsg_dump(KMSG_DUMP_POWEROFF);
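
(As an aside: a sketch of an allocation-free variant, assuming the mask is
never modified - cpumask_of(0) yields a const mask containing only CPU 0:

set_cpus_allowed_ptr(current, cpumask_of(0));

though it skips the cpu_online_mask intersection above, so a cpu_online(0)
check would still be prudent.)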



Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Ingo Molnar

* Robin Holt  wrote:

> On Mon, Apr 08, 2013 at 09:11:06AM -0700, H. Peter Anvin wrote:
> > On 04/08/2013 08:57 AM, Ingo Molnar wrote:
> > > 
> > > I think the original commit:
> > > 
> > >   f96972f2dc63 kernel/sys.c: call disable_nonboot_cpus() in 
> > > kernel_restart()
> > > 
> > > actually regressed your 1024 CPU systems, and should possibly be reverted 
> > > or fixed 
> > > in some other fashion - such as by migrating to the primary CPU (on 
> > > architectures 
> > > that require that), instead of hotplug offlining every secondary CPU on 
> > > every 
> > > architecture!
> > > 
> > > Alternatively, disable_nonboot_cpus() could perhaps be improved to down 
> > > CPUs in 
> > > parallel: issue the CPU-down requests to every CPU, then wait for them to 
> > > complete 
> > > - instead of the loop over every CPU?
> > > 
> > > This would be the conceptual counter part to parallel boot up of CPUs - 
> > > something 
> > > SGI might be interested in as well?
> > > 
> > 
> > Migrating to the boot processor and then calling stop_machine() to
> > defang any other processors should be sufficient, no?
> > 
> > I don't know if there is any reason to deschedule all tasks?
> 
> My reading of the original commit indicated that some architecture's
> firmware needs the boot cpu to be the one initiating reboot.
> 
> If that is correct, then I can not see why a stop_machine() implementation
> will not work.
> 
> Since this is in generic kernel code, how can I proceed?

I think rebooting on the same CPU where we booted up is something worth having 
in 
general, as a firmware robustness feature. (assuming the CPU in question is 
still 
online)

We have similar constraints in the suspend code for example - some x86 firmware 
breaks if suspend related ACPI calls are not done on the boot CPU ...

So how about restoring the old "just reboot, don't shut down the others" 
behavior, 
extended with a "reboot on the CPU that booted up" reboot affinity logic?

That should fix the 1024 CPUs regression, and it should also keep those ARM 
systems working - without any special casing.

Of course I'd also be entirely happy about having true parallel shutdown...

It does not have to be entirely threaded: I bet most of the shutdown latency is 
in 
a few paranoia udelay()s or so, where some simple global lock could be dropped.

Thanks,

Ingo


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Russ Anderson
On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
> On Wed, Apr 10, 2013 at 4:16 AM, Ingo Molnar mi...@kernel.org wrote:
> >
> > I think rebooting on the same CPU where we booted up is something worth
> > having in general, as a firmware robustness feature. (assuming the CPU in
> > question is still online)
> 
> Yeah, we've had issues with ACPI in the past, so I do think we should
> always reboot using the BP. Even if it almost certainly works on 99+%
> of all machines on any random CPU.
> 
> The optimal solution would be to just speed up the
> disable_nonboot_cpus() code so much that it isn't an issue. That would
> be good for suspending too, although I guess suspend isn't a big issue
> if you have a thousand CPU's.
> 
> Has anybody checked whether we could do the cpu_down() on non-boot
> CPU's in parallel? Right now we serialize the thing completely, with
> one single
> 
> for_each_online_cpu(cpu) {
> ...
> 
> loop that does a synchronous _cpu_down() for each CPU. No wonder it
> takes forever. We do __stop_machine() over and over and over again:
> the whole thing is basically O(n**2) in CPU's.

Yes, I have a test patch that replaces for_each_online_cpu(cpu)
with a cpu bitmask in disable_nonboot_cpus().  The lower level
routines already take a bitmask.  It allows __stop_machine() to
be called just once.  That change reduces shutdown time on a
1024 cpu machine from 16 minutes to 4 minutes.  Significant improvement,
but not good enough.

The next significant bottleneck is __cpu_notify().  Tried creating
worker threads to parallelize the shutdown, but the problem is
__cpu_notify() is not thread safe.  Putting a lock around it
caused all the worker threads to fight over the lock.

Note that __cpu_notify() has to be called for all cpus being
shut down because the cpu_chain notifier call chain has cpu as a
parameter.  The dilemma is that cpu_chain notifiers need to be called on
all cpus, but cannot be done in parallel due to __cpu_notify() not being
thread safe.  Spinning through the notifier chain sequentially for all
cpus just takes a long time.

The real fix would be to make the cpu_chain notifier per cpu, or at
least thread safe, so that all the cpus being shut down could do so
in parallel.  That is a significant change with ramifications on
other code.
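
As a rough sketch of that per-cpu direction (hypothetical only: neither the
per-cpu chain nor this helper exists, it just illustrates how per-cpu notifier
heads would let the calls run concurrently):

static DEFINE_PER_CPU(struct raw_notifier_head, cpu_chain_pcpu);

static int __cpu_notify_one(unsigned long val, unsigned int cpu)
{
        void *hcpu = (void *)(long)cpu;

        /* Each CPU walks only its own chain, so several CPUs can run
         * their DOWN_PREPARE callbacks at the same time. */
        return raw_notifier_call_chain(&per_cpu(cpu_chain_pcpu, cpu),
                                       val, hcpu);
}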

I will post a patch shortly with the cpu bitmask change.  Changing
__cpu_notify() will take more discussion.

 Linus

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc  r...@sgi.com


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Ingo Molnar

* Russ Anderson r...@sgi.com wrote:

> Yes, I have a test patch that replaces for_each_online_cpu(cpu) with a cpu 
> bitmask in disable_nonboot_cpus().  The lower level routines already take a 
> bitmask.  It allows __stop_machine() to be called just once.  That change 
> reduces shutdown time on a 1024 cpu machine from 16 minutes to 4 minutes.  
> Significant improvement, but not good enough.
> 
> The next significant bottleneck is __cpu_notify().  Tried creating worker 
> threads to parallelize the shutdown, but the problem is __cpu_notify() is not 
> thread safe.  Putting a lock around it caused all the worker threads to fight 
> over the lock.

4 minutes shutdown is 240 seconds, with 1024 CPUs that's about 240 msecs per CPU.

That sounds a lot, given that unlike bootup there's not much real work to be 
done during shutdown - we don't initialize anything, etc.

Maybe much of those 240 msecs are spent in some stupid udelay loop or so, which 
could be made parallel?

Would it be possible to create a 'reboot but stop at the end and reactivate all 
CPUs again' reboot flag, so that it can all be NMI-profiled, to see where the 
true bottleneck is? A naked disable_nonboot_cpus() call in essence.

Thanks,

Ingo


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Robin Holt
On Wed, Apr 10, 2013 at 06:59:34PM +0200, Ingo Molnar wrote:
 
> * Russ Anderson r...@sgi.com wrote:
> 
> > Yes, I have a test patch that replaces for_each_online_cpu(cpu) with a cpu 
> > bitmask in disable_nonboot_cpus().  The lower level routines already take a 
> > bitmask.  It allows __stop_machine() to be called just once.  That change 
> > reduces shutdown time on a 1024 cpu machine from 16 minutes to 4 minutes.  
> > Significant improvement, but not good enough.
> > 
> > The next significant bottleneck is __cpu_notify().  Tried creating worker 
> > threads to parallelize the shutdown, but the problem is __cpu_notify() is 
> > not thread safe.  Putting a lock around it caused all the worker threads to 
> > fight over the lock.
> 
> 4 minutes shutdown is 240 seconds, with 1024 CPUs that's about 240 msecs per CPU.
> 
> That sounds a lot, given that unlike bootup there's not much real work to be 
> done during shutdown - we don't initialize anything, etc.
> 
> Maybe much of those 240 msecs are spent in some stupid udelay loop or so, 
> which could be made parallel?
> 
> Would it be possible to create a 'reboot but stop at the end and reactivate 
> all CPUs again' reboot flag, so that it can all be NMI-profiled, to see where 
> the true bottleneck is? A naked disable_nonboot_cpus() call in essence.

What, exactly, are you proposing with the NMI profiling?  Currently,
if I NMI the system, I get dump_stack() output for all cpus.  Without
introducing a lock to serialize those, the stacks are really just a
jumbled mess.  With a lock, things are fairly slow.

Are you proposing something other than looking at stack dumps?

Thanks,
Robin


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Ingo Molnar

* Robin Holt h...@sgi.com wrote:

> On Wed, Apr 10, 2013 at 06:59:34PM +0200, Ingo Molnar wrote:
> > 
> > * Russ Anderson r...@sgi.com wrote:
> > 
> > > Yes, I have a test patch that replaces for_each_online_cpu(cpu) with a 
> > > cpu bitmask in disable_nonboot_cpus().  The lower level routines already 
> > > take a bitmask.  It allows __stop_machine() to be called just once.  
> > > That change reduces shutdown time on a 1024 cpu machine from 16 minutes 
> > > to 4 minutes.  Significant improvement, but not good enough.
> > > 
> > > The next significant bottleneck is __cpu_notify().  Tried creating worker 
> > > threads to parallelize the shutdown, but the problem is __cpu_notify() is 
> > > not thread safe.  Putting a lock around it caused all the worker threads 
> > > to fight over the lock.
> > 
> > 4 minutes shutdown is 240 seconds, with 1024 CPUs that's about 240 msecs per 
> > CPU.
> > 
> > That sounds a lot, given that unlike bootup there's not much real work to 
> > be done during shutdown - we don't initialize anything, etc.
> > 
> > Maybe much of those 240 msecs are spent in some stupid udelay loop or so, 
> > which could be made parallel?
> > 
> > Would it be possible to create a 'reboot but stop at the end and reactivate 
> > all CPUs again' reboot flag, so that it can all be NMI-profiled, to see 
> > where the true bottleneck is? A naked disable_nonboot_cpus() call in essence.
> 
> What, exactly, are you proposing with the NMI profiling? [...]

I'm proposing to make 'reboot' overhead profilable, via a debug hack:

echo 1 > /proc/sys/kernel/magic_dont_fully_reboot_flag

perf record reboot

perf is using NMIs to profile - and since much of cpu_down() is with irqs 
disabled, NMI profiling would be needed to see inside the overhead.

(Assuming the 240 msecs is CPU overhead, not waiting for some IRQ/IPI event.)

Thanks,

Ingo


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Robin Holt
On Wed, Apr 10, 2013 at 07:22:36PM +0200, Ingo Molnar wrote:
 
> * Robin Holt h...@sgi.com wrote:
> 
> > On Wed, Apr 10, 2013 at 06:59:34PM +0200, Ingo Molnar wrote:
> > > 
> > > * Russ Anderson r...@sgi.com wrote:
> > > 
> > > > Yes, I have a test patch that replaces for_each_online_cpu(cpu) with a 
> > > > cpu bitmask in disable_nonboot_cpus().  The lower level routines 
> > > > already take a bitmask.  It allows __stop_machine() to be called just 
> > > > once.  That change reduces shutdown time on a 1024 cpu machine from 
> > > > 16 minutes to 4 minutes.  Significant improvement, but not good enough.
> > > > 
> > > > The next significant bottleneck is __cpu_notify().  Tried creating 
> > > > worker threads to parallelize the shutdown, but the problem is 
> > > > __cpu_notify() is not thread safe.  Putting a lock around it caused 
> > > > all the worker threads to fight over the lock.
> > > 
> > > 4 minutes shutdown is 240 seconds, with 1024 CPUs that's about 240 msecs 
> > > per CPU.
> > > 
> > > That sounds a lot, given that unlike bootup there's not much real work to 
> > > be done during shutdown - we don't initialize anything, etc.
> > > 
> > > Maybe much of those 240 msecs are spent in some stupid udelay loop or so, 
> > > which could be made parallel?
> > > 
> > > Would it be possible to create a 'reboot but stop at the end and 
> > > reactivate all CPUs again' reboot flag, so that it can all be 
> > > NMI-profiled, to see where the true bottleneck is? A naked 
> > > disable_nonboot_cpus() call in essence.
> > 
> > What, exactly, are you proposing with the NMI profiling? [...]
> 
> I'm proposing to make 'reboot' overhead profilable, via a debug hack:
> 
> echo 1 > /proc/sys/kernel/magic_dont_fully_reboot_flag
> 
> perf record reboot
> 
> perf is using NMIs to profile - and since much of cpu_down() is with irqs 
> disabled, NMI profiling would be needed to see inside the overhead.
> 
> (Assuming the 240 msecs is CPU overhead, not waiting for some IRQ/IPI event.)

Let me give it a try.

Robin


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread H. Peter Anvin
On 04/10/2013 09:59 AM, Ingo Molnar wrote:
 
> 4 minutes shutdown is 240 seconds, with 1024 CPUs that's about 240 msecs per 
> CPU.
> 
> That sounds a lot, given that unlike bootup there's not much real work to be 
> done during shutdown - we don't initialize anything, etc.
> 
> Maybe much of those 240 msecs are spent in some stupid udelay loop or so, 
> which could be made parallel?
 

I wonder if it isn't calling stop_machine() 1024 times, each time
rendezvousing all the still-active CPUs.

-hpa




Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Robin Holt
> > I'm proposing to make 'reboot' overhead profilable, via a debug hack:
> > 
> > echo 1 > /proc/sys/kernel/magic_dont_fully_reboot_flag
> > 
> > perf record reboot
> > 
> > perf is using NMIs to profile - and since much of cpu_down() is with irqs 
> > disabled, NMI profiling would be needed to see inside the overhead.
> > 
> > (Assuming the 240 msecs is CPU overhead, not waiting for some IRQ/IPI 
> > event.)


I had the machine booted as 512 cpus.
I tweaked the kernel like this:

diff --git a/kernel/sys.c b/kernel/sys.c
index 39c9c4a..b42bd4f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -368,8 +368,10 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
  */
 void kernel_restart(char *cmd)
 {
-	kernel_restart_prepare(cmd);
+	// kernel_restart_prepare(cmd);
 	disable_nonboot_cpus();
+	enable_nonboot_cpus();
+	return;
 	if (!cmd)
 		printk(KERN_EMERG "Restarting system.\n");
 	else

perf record -a /sbin/reboot -d -f -n

The top of 'perf report' has:
Events: 14M cycles
22.58%  swapper  [kernel.kallsyms]   [k] update_cfs_rq_blocked_load
10.52%  swapper  [kernel.kallsyms]   [k] load_balance
 4.96%  swapper  [kernel.kallsyms]   [k] ktime_get
 4.12%  swapper  [kernel.kallsyms]   [k] update_blocked_averages
 3.55%  swapper  [kernel.kallsyms]   [k] idle_cpu
 1.97%  swapper  [kernel.kallsyms]   [k] uv_read_rtc
 0.98%  swapper  [kernel.kallsyms]   [k] rcu_process_gp_end
 0.84%  swapper  [kernel.kallsyms]   [k] apic_timer_interrupt
 0.84%  swapper  [kernel.kallsyms]   [k] __lock_text_start
 0.84%  swapper  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
 0.73%  swapper  [kernel.kallsyms]   [k] native_safe_halt
 0.56%  swapper  [kernel.kallsyms]   [k] rcu_check_callbacks
 0.56%  swapper  [kernel.kallsyms]   [k] native_write_msr_safe
 0.44%  swapper  [kernel.kallsyms]   [k] cpumask_next_and
 0.42%   reboot  [kernel.kallsyms]   [k] kmem_cache_alloc_node

The perf data is 676 MB.  I don't know how well it compresses, but the
lzma task has been running for a while.

Thanks,
Robin


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Russ Anderson
On Wed, Apr 10, 2013 at 10:29:12AM -0500, Russ Anderson wrote:
> On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
> > 
> > Yeah, we've had issues with ACPI in the past, so I do think we should
> > always reboot using the BP. Even if it almost certainly works on 99+%
> > of all machines on any random CPU.
> > 
> > The optimal solution would be to just speed up the
> > disable_nonboot_cpus() code so much that it isn't an issue. That would
> > be good for suspending too, although I guess suspend isn't a big issue
> > if you have a thousand CPU's.
> > 
> > Has anybody checked whether we could do the cpu_down() on non-boot
> > CPU's in parallel? Right now we serialize the thing completely, with
> > one single
> > 
> > for_each_online_cpu(cpu) {
> > ...
> > 
> > loop that does a synchronous _cpu_down() for each CPU. No wonder it
> > takes forever. We do __stop_machine() over and over and over again:
> > the whole thing is basically O(n**2) in CPU's.
> 
> Yes, I have a test patch that replaces for_each_online_cpu(cpu)
> with a cpu bitmask in disable_nonboot_cpus().  The lower level
> routines already take a bitmask.  It allows __stop_machine() to
> be called just once.  That change reduces shutdown time on a
> 1024 cpu machine from 16 minutes to 4 minutes.  Significant improvement,
> but not good enough.
> 
> The next significant bottleneck is __cpu_notify().  Tried creating
> worker threads to parallelize the shutdown, but the problem is
> __cpu_notify() is not thread safe.  Putting a lock around it
> caused all the worker threads to fight over the lock.
> 
> Note that __cpu_notify() has to be called for all cpus being
> shut down because the cpu_chain notifier call chain has cpu as a
> parameter.  The dilemma is that cpu_chain notifiers need to be called on
> all cpus, but cannot be done in parallel due to __cpu_notify() not being
> thread safe.  Spinning through the notifier chain sequentially for all
> cpus just takes a long time.
> 
> The real fix would be to make the cpu_chain notifier per cpu, or at
> least thread safe, so that all the cpus being shut down could do so
> in parallel.  That is a significant change with ramifications on
> other code.
> 
> I will post a patch shortly with the cpu bitmask change.  Changing
> __cpu_notify() will take more discussion.

Here is the test patch with the cpu bitmask change.  It results
in calling __stop_machine() just once.

After making feedback changes I'll formally submit the patch.

---
 kernel/cpu.c |   94 +++
 1 file changed, 56 insertions(+), 38 deletions(-)

Index: linux/kernel/cpu.c
===
--- linux.orig/kernel/cpu.c	2013-04-10 17:16:40.084960495 -0500
+++ linux/kernel/cpu.c	2013-04-10 17:17:27.988245160 -0500
@@ -262,10 +262,11 @@ static int __ref take_cpu_down(void *_pa
 }
 
 /* Requires cpu_add_remove_lock to be held */
-static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
+static int __ref _cpu_down(const cpumask_t *cpus_to_offline, int tasks_frozen)
 {
-	int err, nr_calls = 0;
+	int cpu = 0, err = 0, nr_calls = 0;
 	void *hcpu = (void *)(long)cpu;
+	cpumask_var_t cpus_offlined;
 	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
 	struct take_cpu_down_param tcd_param = {
 		.mod = mod,
@@ -278,46 +279,65 @@ static int __ref _cpu_down(unsigned int
 	if (!cpu_online(cpu))
 		return -EINVAL;
 
+	if (!alloc_cpumask_var(&cpus_offlined, GFP_KERNEL))
+		return -ENOMEM;
+
 	cpu_hotplug_begin();
+	cpumask_clear(cpus_offlined);
+	cpumask_copy(cpus_offlined, cpus_to_offline);
 
-	err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);
-	if (err) {
-		nr_calls--;
-		__cpu_notify(CPU_DOWN_FAILED | mod, hcpu, nr_calls, NULL);
-		printk("%s: attempt to take down CPU %u failed\n",
+	for_each_cpu_mask(cpu, *cpus_to_offline) {
+		hcpu = (void *)(long)cpu;
+		if (!cpu_online(cpu))
+			continue;
+		tcd_param.hcpu = hcpu;
+		err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);
+		if (err) {
+			nr_calls--;
+			__cpu_notify(CPU_DOWN_FAILED | mod, hcpu, nr_calls, NULL);
+			pr_err("%s: attempt to take down CPU %u failed\n",
				__func__, cpu);
-		goto out_release;
+			goto out_release;
+		}
+		smpboot_park_threads(cpu);
 	}
-	smpboot_park_threads(cpu);
 
-	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
+	err = __stop_machine(take_cpu_down, &tcd_param, cpus_to_offline);
 	if (err) {
 		/* CPU didn't die: tell everyone.  Can't complain. */
-		smpboot_unpark_threads(cpu);
-

Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Russ Anderson
On Wed, Apr 10, 2013 at 06:59:34PM +0200, Ingo Molnar wrote:
> * Russ Anderson r...@sgi.com wrote:
> 
> > Yes, I have a test patch that replaces for_each_online_cpu(cpu) with a cpu 
> > bitmask in disable_nonboot_cpus().  The lower level routines already take a 
> > bitmask.  It allows __stop_machine() to be called just once.  That change 
> > reduces shutdown time on a 1024 cpu machine from 16 minutes to 4 minutes.  
> > Significant improvement, but not good enough.
> > 
> > The next significant bottleneck is __cpu_notify().  Tried creating worker 
> > threads to parallelize the shutdown, but the problem is __cpu_notify() is 
> > not thread safe.  Putting a lock around it caused all the worker threads 
> > to fight over the lock.
> 
> 4 minutes shutdown is 240 seconds, with 1024 CPUs that's about 240 msecs per 
> CPU.
> 
> That sounds a lot, given that unlike bootup there's not much real work to be 
> done during shutdown - we don't initialize anything, etc.
> 
> Maybe much of those 240 msecs are spent in some stupid udelay loop or so, 
> which could be made parallel?

I was hoping for a stupid udelay when I first started looking
at this code, but found nothing obvious.

The bulk of the time (after making the cpu bitmask change) is
spent in __cpu_notify(), as explained above.


> Would it be possible to create a 'reboot but stop at the end and reactivate 
> all CPUs again' reboot flag, so that it can all be NMI-profiled, to see where 
> the true bottleneck is? A naked disable_nonboot_cpus() call in essence.

My testing was similar.  I hacked a kernel module to call
disable_nonboot_cpus() and enable_nonboot_cpus() and used
printks to narrow down the slow functions.  That points
at the cpu notifier call chain.  It's not clear if any
of the functions on the call chain take a long time, or
just going sequentially through the list for all cpus just
takes a long time.
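
(One way to narrow that down, sketched here as hypothetical instrumentation
rather than an existing facility, is to time each chain invocation from the
caller:

ktime_t t0 = ktime_get();
err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);
pr_info("cpu%d DOWN_PREPARE chain took %lld us\n",
        cpu, ktime_to_us(ktime_sub(ktime_get(), t0)));

which would show whether a few callbacks dominate or the cost is spread
evenly across the chain.)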

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc  r...@sgi.com


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-10 Thread Paul Mackerras
On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
> The optimal solution would be to just speed up the
> disable_nonboot_cpus() code so much that it isn't an issue. That would
> be good for suspending too, although I guess suspend isn't a big issue
> if you have a thousand CPU's.
> 
> Has anybody checked whether we could do the cpu_down() on non-boot
> CPU's in parallel? Right now we serialize the thing completely, with

I thought Srivatsa S. Bhat had a patchset that did exactly that.
Srivatsa?

Paul.


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-08 Thread Russ Anderson
On Mon, Apr 08, 2013 at 05:57:01PM +0200, Ingo Molnar wrote:
> 
> * Robin Holt  wrote:
> 
> > We noticed that recently, reboot of a 1024 cpu machine takes approx 16
> > minutes of just stopping the cpus.  The slowdown was tracked to commit
> > f96972f which went into v3.7 and then to the stable trees.
> > 
> > x86 does not need to be running the boot cpu to pull reset and I don't
> > think it is really needed for shutdown either.
> > 
> > I decided to go the "simple" way and make this a config option that is
> > selected by the x86 arch.  I don't know which other arch's would also
> > benefit, if any.
> > 
> > Signed-off-by: Robin Holt 
> > To: Andrew Morton 
> > Cc: Russ Anderson 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: "H. Peter Anvin" 
> > Cc: Shawn Guo 
> > Cc: 
> > 
> > ---
> >  arch/x86/Kconfig| 3 +++
> >  kernel/Kconfig.shutdown | 3 +++
> >  kernel/sys.c| 4 
> >  3 files changed, 10 insertions(+)
> >  create mode 100644 kernel/Kconfig.shutdown
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 70c0f3d..9611942 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -120,6 +120,7 @@ config X86
> > select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
> > select OLD_SIGACTION if X86_32
> > select COMPAT_OLD_SIGACTION if IA32_EMULATION
> > +   select ARCH_SHUTDOWN_TO_ANY_CPU
> >  
> >  config INSTRUCTION_DECODER
> > def_bool y
> > @@ -839,6 +840,8 @@ config SCHED_MC
> >   making when dealing with multi-core CPU chips at a cost of slightly
> >   increased overhead in some places. If unsure say N here.
> >  
> > +source "kernel/Kconfig.shutdown"
> > +
> >  source "kernel/Kconfig.preempt"
> >  
> >  config X86_UP_APIC
> > diff --git a/kernel/Kconfig.shutdown b/kernel/Kconfig.shutdown
> > new file mode 100644
> > index 000..d79fc04
> > --- /dev/null
> > +++ b/kernel/Kconfig.shutdown
> > @@ -0,0 +1,3 @@
> > +
> > +config ARCH_SHUTDOWN_TO_ANY_CPU
> > +   bool
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 39c9c4a..c0b8880 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -369,7 +369,9 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
> >  void kernel_restart(char *cmd)
> >  {
> > kernel_restart_prepare(cmd);
> > +#ifndef CONFIG_ARCH_SHUTDOWN_TO_ANY_CPU
> > disable_nonboot_cpus();
> > +#endif
> > if (!cmd)
> > printk(KERN_EMERG "Restarting system.\n");
> > else
> > @@ -413,7 +415,9 @@ void kernel_power_off(void)
> > kernel_shutdown_prepare(SYSTEM_POWER_OFF);
> > if (pm_power_off_prepare)
> > pm_power_off_prepare();
> > +#ifndef CONFIG_ARCH_SHUTDOWN_TO_ANY_CPU
> > disable_nonboot_cpus();
> > +#endif
> > syscore_shutdown();
> > printk(KERN_EMERG "Power down.\n");
> > kmsg_dump(KMSG_DUMP_POWEROFF);
> 
> Hm, the 'fix' is a pretty ugly workaround that does not fix much IMHO.
> 
> I think the original commit:
> 
>   f96972f2dc63 kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()
> 
> actually regressed your 1024 CPU systems, and should possibly be reverted or 
> fixed 
> in some other fashion - such as by migrating to the primary CPU (on 
> architectures 
> that require that), instead of hotplug offlining every secondary CPU on every 
> architecture!

Sure.  There are multiple ways to fix this.
 
> Alternatively, disable_nonboot_cpus() could perhaps be improved to down CPUs 
> in 
> parallel: issue the CPU-down requests to every CPU, then wait for them to 
> complete 
> - instead of the loop over every CPU?

I took a look at this.  disable_nonboot_cpus() loops through all online cpus,
shutting them down one cpu thread at a time.  More frustrating, it ends up 
calling __stop_machine() to stop all the cpus, then loops back up to stop
the next thread.  The underlying code takes a cpu bitmask, so changing
disable_nonboot_cpus() to pass in a cpu bitmask and changing _cpu_down()
to accept it allows __stop_machine() to be called just once.  This change
reduced the shutdown time on a 1024 cpus system from 16 minutes down to 4.
A significant improvement, but not good enough.
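
A minimal sketch of the caller side of that change (illustrative; the
mask-taking _cpu_down() is the modification being described, not an existing
interface):

static int disable_nonboot_cpus_batched(void)
{
	cpumask_var_t mask;
	int err;

	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;

	/* Everything except the boot CPU in one call: a single
	 * __stop_machine() rendezvous instead of one per CPU. */
	cpumask_andnot(mask, cpu_online_mask, cpumask_of(0));
	err = _cpu_down(mask, 1);

	free_cpumask_var(mask);
	return err;
}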

The next significant bottleneck is __cpu_notify().  Tried creating worker
threads to parallelize the shutdown, but the problem is __cpu_notify() is
not thread safe.  Putting a lock around it caused all the worker threads
to fight over the lock.

Wondered if __cpu_notify() needed to be called for all cpus being shut down,
and it does because the cpu_chain notifier call chain has cpu as a parameter.
So the dilemma is that cpu_chain notifiers need to be called on all cpus, but
cannot be done in parallel due to __cpu_notify() not being thread safe.
Spinning through the notifier chain sequentially for all cpus just takes a
long time.

The real fix would be to make the cpu_chain notifier per cpu, or at
least thread safe, so that all the cpus being shut down could do so
in parallel.  That is a significant change with ramifications on
other code.

> This would be the conceptual counter part to parallel boot up of CPUs - something
> SGI might be interested in as well?

Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-08 Thread Robin Holt
On Mon, Apr 08, 2013 at 09:11:06AM -0700, H. Peter Anvin wrote:
> On 04/08/2013 08:57 AM, Ingo Molnar wrote:
> > 
> > I think the original commit:
> > 
> >   f96972f2dc63 kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()
> > 
> > actually regressed your 1024 CPU systems, and should possibly be reverted 
> > or fixed 
> > in some other fashion - such as by migrating to the primary CPU (on 
> > architectures 
> > that require that), instead of hotplug offlining every secondary CPU on 
> > every 
> > architecture!
> > 
> > Alternatively, disable_nonboot_cpus() could perhaps be improved to down 
> > CPUs in 
> > parallel: issue the CPU-down requests to every CPU, then wait for them to 
> > complete 
> > - instead of the loop over every CPU?
> > 
> > This would be the conceptual counter part to parallel boot up of CPUs - 
> > something 
> > SGI might be interested in as well?
> > 
> 
> Migrating to the boot processor and then calling stop_machine() to
> defang any other processors should be sufficient, no?
> 
> I don't know if there is any reason to deschedule all tasks?

My reading of the original commit indicated that some architecture's
firmware needs the boot cpu to be the one initiating reboot.

If that is correct, then I can not see why a stop_machine() implementation
will not work.

Since this is in generic kernel code, how can I proceed?

Robin


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-08 Thread Robin Holt
On Mon, Apr 08, 2013 at 05:57:01PM +0200, Ingo Molnar wrote:
> 
> * Robin Holt  wrote:
> 
> > We noticed that recently, reboot of a 1024 cpu machine takes approx 16
> > minutes of just stopping the cpus.  The slowdown was tracked to commit
> > f96972f which went into v3.7 and then to the stable trees.
> > 
> > x86 does not need to be running the boot cpu to pull reset and I don't
> > think it is really needed for shutdown either.
> > 
> > I decided to go the "simple" way and make this a config option that is
> > selected by the x86 arch.  I don't know which other arch's would also
> > benefit, if any.
> > 
> > Signed-off-by: Robin Holt 
> > To: Andrew Morton 
> > Cc: Russ Anderson 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: "H. Peter Anvin" 
> > Cc: Shawn Guo 
> > Cc: 
> > 
> > ---
> >  arch/x86/Kconfig| 3 +++
> >  kernel/Kconfig.shutdown | 3 +++
> >  kernel/sys.c| 4 
> >  3 files changed, 10 insertions(+)
> >  create mode 100644 kernel/Kconfig.shutdown
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 70c0f3d..9611942 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -120,6 +120,7 @@ config X86
> > select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
> > select OLD_SIGACTION if X86_32
> > select COMPAT_OLD_SIGACTION if IA32_EMULATION
> > +   select ARCH_SHUTDOWN_TO_ANY_CPU
> >  
> >  config INSTRUCTION_DECODER
> > def_bool y
> > @@ -839,6 +840,8 @@ config SCHED_MC
> >   making when dealing with multi-core CPU chips at a cost of slightly
> >   increased overhead in some places. If unsure say N here.
> >  
> > +source "kernel/Kconfig.shutdown"
> > +
> >  source "kernel/Kconfig.preempt"
> >  
> >  config X86_UP_APIC
> > diff --git a/kernel/Kconfig.shutdown b/kernel/Kconfig.shutdown
> > new file mode 100644
> > index 000..d79fc04
> > --- /dev/null
> > +++ b/kernel/Kconfig.shutdown
> > @@ -0,0 +1,3 @@
> > +
> > +config ARCH_SHUTDOWN_TO_ANY_CPU
> > +   bool
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 39c9c4a..c0b8880 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -369,7 +369,9 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
> >  void kernel_restart(char *cmd)
> >  {
> > kernel_restart_prepare(cmd);
> > +#ifndef CONFIG_ARCH_SHUTDOWN_TO_ANY_CPU
> > disable_nonboot_cpus();
> > +#endif
> > if (!cmd)
> > printk(KERN_EMERG "Restarting system.\n");
> > else
> > @@ -413,7 +415,9 @@ void kernel_power_off(void)
> > kernel_shutdown_prepare(SYSTEM_POWER_OFF);
> > if (pm_power_off_prepare)
> > pm_power_off_prepare();
> > +#ifndef CONFIG_ARCH_SHUTDOWN_TO_ANY_CPU
> > disable_nonboot_cpus();
> > +#endif
> > syscore_shutdown();
> > printk(KERN_EMERG "Power down.\n");
> > kmsg_dump(KMSG_DUMP_POWEROFF);
> 
> Hm, the 'fix' is a pretty ugly workaround that does not fix much IMHO.
> 
> I think the original commit:
> 
>   f96972f2dc63 kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()
> 
> actually regressed your 1024 CPU systems, and should possibly be reverted or 
> fixed 
> in some other fashion - such as by migrating to the primary CPU (on 
> architectures 
> that require that), instead of hotplug offlining every secondary CPU on every 
> architecture!
> 
> Alternatively, disable_nonboot_cpus() could perhaps be improved to down CPUs 
> in 
> parallel: issue the CPU-down requests to every CPU, then wait for them to 
> complete 
> - instead of the loop over every CPU?
> 
> This would be the conceptual counter part to parallel boot up of CPUs - 
> something 
> SGI might be interested in as well?

Interested, but even more so interested in parallelizing memory setup. ;)

How can we proceed with this?

Robin


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-08 Thread H. Peter Anvin
On 04/08/2013 08:57 AM, Ingo Molnar wrote:
> 
> I think the original commit:
> 
>   f96972f2dc63 kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()
> 
> actually regressed your 1024 CPU systems, and should possibly be reverted or 
> fixed 
> in some other fashion - such as by migrating to the primary CPU (on 
> architectures 
> that require that), instead of hotplug offlining every secondary CPU on every 
> architecture!
> 
> Alternatively, disable_nonboot_cpus() could perhaps be improved to down CPUs 
> in 
> parallel: issue the CPU-down requests to every CPU, then wait for them to 
> complete 
> - instead of the loop over every CPU?
> 
> This would be the conceptual counter part to parallel boot up of CPUs - 
> something 
> SGI might be interested in as well?
> 

Migrating to the boot processor and then calling stop_machine() to
defang any other processors should be sufficient, no?

I don't know if there is any reason to deschedule all tasks?
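
Roughly like this (a loose sketch, not the actual reboot path; it also uses a
plain cross-call rather than stop_machine(), since a handler that never
returns would stall stop_machine()'s rendezvous):

static void park_cpu(void *unused)
{
	/* Runs in IPI context on every other CPU; never returning is
	 * tolerable only because the machine is about to reset. */
	local_irq_disable();
	for (;;)
		cpu_relax();
}

static void reboot_from_boot_cpu(void)
{
	/* Pin ourselves to the boot CPU ... */
	set_cpus_allowed_ptr(current, cpumask_of(0));
	/* ... then defang everyone else without a full hotplug cycle. */
	smp_call_function(park_cpu, NULL, 0);
}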

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-08 Thread Ingo Molnar

* Robin Holt  wrote:

> We noticed that recently, reboot of a 1024 cpu machine takes approx 16
> minutes of just stopping the cpus.  The slowdown was tracked to commit
> f96972f which went into v3.7 and then to the stable trees.
> 
> x86 does not need to be running the boot cpu to pull reset and I don't
> think it is really needed for shutdown either.
> 
> I decided to go the "simple" way and make this a config option that is
> selected by the x86 arch.  I don't know which other arch's would also
> benefit, if any.
> 
> Signed-off-by: Robin Holt 
> To: Andrew Morton 
> Cc: Russ Anderson 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: Shawn Guo 
> Cc: 
> 
> ---
>  arch/x86/Kconfig| 3 +++
>  kernel/Kconfig.shutdown | 3 +++
>  kernel/sys.c| 4 
>  3 files changed, 10 insertions(+)
>  create mode 100644 kernel/Kconfig.shutdown
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 70c0f3d..9611942 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -120,6 +120,7 @@ config X86
>   select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
>   select OLD_SIGACTION if X86_32
>   select COMPAT_OLD_SIGACTION if IA32_EMULATION
> + select ARCH_SHUTDOWN_TO_ANY_CPU
>  
>  config INSTRUCTION_DECODER
>   def_bool y
> @@ -839,6 +840,8 @@ config SCHED_MC
> making when dealing with multi-core CPU chips at a cost of slightly
> increased overhead in some places. If unsure say N here.
>  
> +source "kernel/Kconfig.shutdown"
> +
>  source "kernel/Kconfig.preempt"
>  
>  config X86_UP_APIC
> diff --git a/kernel/Kconfig.shutdown b/kernel/Kconfig.shutdown
> new file mode 100644
> index 000..d79fc04
> --- /dev/null
> +++ b/kernel/Kconfig.shutdown
> @@ -0,0 +1,3 @@
> +
> +config ARCH_SHUTDOWN_TO_ANY_CPU
> + bool
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 39c9c4a..c0b8880 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -369,7 +369,9 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
>  void kernel_restart(char *cmd)
>  {
>   kernel_restart_prepare(cmd);
> +#ifndef CONFIG_ARCH_SHUTDOWN_TO_ANY_CPU
>   disable_nonboot_cpus();
> +#endif
>   if (!cmd)
>   printk(KERN_EMERG "Restarting system.\n");
>   else
> @@ -413,7 +415,9 @@ void kernel_power_off(void)
>   kernel_shutdown_prepare(SYSTEM_POWER_OFF);
>   if (pm_power_off_prepare)
>   pm_power_off_prepare();
> +#ifndef CONFIG_ARCH_SHUTDOWN_TO_ANY_CPU
>   disable_nonboot_cpus();
> +#endif
>   syscore_shutdown();
>   printk(KERN_EMERG "Power down.\n");
>   kmsg_dump(KMSG_DUMP_POWEROFF);

Hm, the 'fix' is a pretty ugly workaround that does not fix much IMHO.

I think the original commit:

  f96972f2dc63 kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()

actually regressed your 1024 CPU systems, and should possibly be reverted or 
fixed 
in some other fashion - such as by migrating to the primary CPU (on 
architectures 
that require that), instead of hotplug offlining every secondary CPU on every 
architecture!

Alternatively, disable_nonboot_cpus() could perhaps be improved to down CPUs in 
parallel: issue the CPU-down requests to every CPU, then wait for them to 
complete 
- instead of the loop over every CPU?

This would be the conceptual counter part to parallel boot up of CPUs - 
something 
SGI might be interested in as well?

Thanks,

Ingo


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-08 Thread Ingo Molnar

* Robin Holt h...@sgi.com wrote:

 We noticed that recently, reboot of a 1024 cpu machine takes approx 16
 minutes of just stopping the cpus.  The slowdown was tracked to commit
 f96972f which went into v3.7 and then to the stable trees.
 
 x86 does not need to be running the boot cpu to pull reset and I don't
 think it is really needed for shutdown either.
 
 I decided to go the simple way and make this a config option that is
 selected by the x86 arch.  I don't know which other arch's would also
 benefit, if any.
 
 Signed-off-by: Robin Holt h...@sgi.com
 To: Andrew Morton a...@linux-foundation.org
 Cc: Russ Anderson r...@sgi.com
 Cc: Thomas Gleixner t...@linutronix.de
 Cc: Ingo Molnar mi...@redhat.com
 Cc: H. Peter Anvin h...@zytor.com
 Cc: Shawn Guo shawn@linaro.org
 Cc: sta...@vger.kernel.org
 
 ---
  arch/x86/Kconfig| 3 +++
  kernel/Kconfig.shutdown | 3 +++
  kernel/sys.c| 4 
  3 files changed, 10 insertions(+)
  create mode 100644 kernel/Kconfig.shutdown
 
 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index 70c0f3d..9611942 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -120,6 +120,7 @@ config X86
   select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
   select OLD_SIGACTION if X86_32
   select COMPAT_OLD_SIGACTION if IA32_EMULATION
 + select ARCH_SHUTDOWN_TO_ANY_CPU
  
  config INSTRUCTION_DECODER
   def_bool y
 @@ -839,6 +840,8 @@ config SCHED_MC
 making when dealing with multi-core CPU chips at a cost of slightly
 increased overhead in some places. If unsure say N here.
  
 +source kernel/Kconfig.shutdown
 +
  source kernel/Kconfig.preempt
  
  config X86_UP_APIC
 diff --git a/kernel/Kconfig.shutdown b/kernel/Kconfig.shutdown
 new file mode 100644
 index 000..d79fc04
 --- /dev/null
 +++ b/kernel/Kconfig.shutdown
 @@ -0,0 +1,3 @@
 +
 +config ARCH_SHUTDOWN_TO_ANY_CPU
 + bool
 diff --git a/kernel/sys.c b/kernel/sys.c
 index 39c9c4a..c0b8880 100644
 --- a/kernel/sys.c
 +++ b/kernel/sys.c
 @@ -369,7 +369,9 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
  void kernel_restart(char *cmd)
  {
   kernel_restart_prepare(cmd);
 +#ifndef CONFIG_ARCH_SHUTDOWN_TO_ANY_CPU
   disable_nonboot_cpus();
 +#endif
   if (!cmd)
   printk(KERN_EMERG Restarting system.\n);
   else
 @@ -413,7 +415,9 @@ void kernel_power_off(void)
   kernel_shutdown_prepare(SYSTEM_POWER_OFF);
   if (pm_power_off_prepare)
   pm_power_off_prepare();
 +#ifndef CONFIG_ARCH_SHUTDOWN_TO_ANY_CPU
   disable_nonboot_cpus();
 +#endif
   syscore_shutdown();
   printk(KERN_EMERG Power down.\n);
   kmsg_dump(KMSG_DUMP_POWEROFF);

Hm, the 'fix' is a pretty ugly workaround that does not fix much IMHO.

I think the original commit:

  f96972f2dc63 kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()

actually regressed your 1024 CPU systems, and should possibly be reverted or 
fixed 
in some other fashion - such as by migrating to the primary CPU (on 
architectures 
that require that), instead of hotplug offlining every secondary CPU on every 
architecture!

Alternatively, disable_nonboot_cpus() could perhaps be improved to down CPUs in 
parallel: issue the CPU-down requests to every CPU, then wait for them to 
complete 
- instead of the loop over every CPU?

This would be the conceptual counter part to parallel boot up of CPUs - 
something 
SGI might be interested in as well?

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-08 Thread H. Peter Anvin
On 04/08/2013 08:57 AM, Ingo Molnar wrote:
 
 I think the original commit:
 
   f96972f2dc63 kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()
 
 actually regressed your 1024 CPU systems, and should possibly be reverted or 
 fixed 
 in some other fashion - such as by migrating to the primary CPU (on 
 architectures 
 that require that), instead of hotplug offlining every secondary CPU on every 
 architecture!
 
 Alternatively, disable_nonboot_cpus() could perhaps be improved to down CPUs 
 in 
 parallel: issue the CPU-down requests to every CPU, then wait for them to 
 complete 
 - instead of the loop over every CPU?
 
 This would be the conceptual counter part to parallel boot up of CPUs - 
 something 
 SGI might be interested in as well?
 

Migrating to the boot processor and then calling stop_machine() to
defang any other processors should be sufficient, no?

I don't know if there is any reason to deschedule all tasks?

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-08 Thread Robin Holt
On Mon, Apr 08, 2013 at 05:57:01PM +0200, Ingo Molnar wrote:
 
 * Robin Holt h...@sgi.com wrote:
 
  We noticed that recently, reboot of a 1024 cpu machine takes approx 16
  minutes of just stopping the cpus.  The slowdown was tracked to commit
  f96972f which went into v3.7 and then to the stable trees.
  
  x86 does not need to be running the boot cpu to pull reset and I don't
  think it is really needed for shutdown either.
  
  I decided to go the simple way and make this a config option that is
  selected by the x86 arch.  I don't know which other arch's would also
  benefit, if any.
  
  Signed-off-by: Robin Holt h...@sgi.com
  To: Andrew Morton a...@linux-foundation.org
  Cc: Russ Anderson r...@sgi.com
  Cc: Thomas Gleixner t...@linutronix.de
  Cc: Ingo Molnar mi...@redhat.com
  Cc: H. Peter Anvin h...@zytor.com
  Cc: Shawn Guo shawn@linaro.org
  Cc: sta...@vger.kernel.org
  
  ---
   arch/x86/Kconfig| 3 +++
   kernel/Kconfig.shutdown | 3 +++
   kernel/sys.c| 4 
   3 files changed, 10 insertions(+)
   create mode 100644 kernel/Kconfig.shutdown
  
  diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
  index 70c0f3d..9611942 100644
  --- a/arch/x86/Kconfig
  +++ b/arch/x86/Kconfig
  @@ -120,6 +120,7 @@ config X86
  select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
  select OLD_SIGACTION if X86_32
  select COMPAT_OLD_SIGACTION if IA32_EMULATION
  +   select ARCH_SHUTDOWN_TO_ANY_CPU
   
   config INSTRUCTION_DECODER
  def_bool y
  @@ -839,6 +840,8 @@ config SCHED_MC
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
   
  +source kernel/Kconfig.shutdown
  +
   source kernel/Kconfig.preempt
   
   config X86_UP_APIC
  diff --git a/kernel/Kconfig.shutdown b/kernel/Kconfig.shutdown
  new file mode 100644
  index 000..d79fc04
  --- /dev/null
  +++ b/kernel/Kconfig.shutdown
  @@ -0,0 +1,3 @@
  +
  +config ARCH_SHUTDOWN_TO_ANY_CPU
  +   bool
  diff --git a/kernel/sys.c b/kernel/sys.c
  index 39c9c4a..c0b8880 100644
  --- a/kernel/sys.c
  +++ b/kernel/sys.c
  @@ -369,7 +369,9 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
   void kernel_restart(char *cmd)
   {
  kernel_restart_prepare(cmd);
  +#ifndef CONFIG_ARCH_SHUTDOWN_TO_ANY_CPU
  disable_nonboot_cpus();
  +#endif
  if (!cmd)
  printk(KERN_EMERG Restarting system.\n);
  else
  @@ -413,7 +415,9 @@ void kernel_power_off(void)
  kernel_shutdown_prepare(SYSTEM_POWER_OFF);
  if (pm_power_off_prepare)
  pm_power_off_prepare();
  +#ifndef CONFIG_ARCH_SHUTDOWN_TO_ANY_CPU
  disable_nonboot_cpus();
  +#endif
  syscore_shutdown();
  printk(KERN_EMERG Power down.\n);
  kmsg_dump(KMSG_DUMP_POWEROFF);
 
 Hm, the 'fix' is a pretty ugly workaround that does not fix much IMHO.
 
 I think the original commit:
 
   f96972f2dc63 "kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()"
 
 actually regressed your 1024 CPU systems, and should possibly be reverted or 
 fixed in some other fashion - such as by migrating to the primary CPU (on 
 architectures that require that), instead of hotplug-offlining every secondary 
 CPU on every architecture!
 
 Alternatively, disable_nonboot_cpus() could perhaps be improved to down CPUs in 
 parallel: issue the CPU-down requests to every CPU, then wait for them to 
 complete - instead of the loop over every CPU?
 
 This would be the conceptual counterpart to the parallel boot-up of CPUs - 
 something SGI might be interested in as well?

Interested, but even more interested in parallelizing memory setup. ;)

How can we proceed with this?

Robin


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-08 Thread Robin Holt
On Mon, Apr 08, 2013 at 09:11:06AM -0700, H. Peter Anvin wrote:
 On 04/08/2013 08:57 AM, Ingo Molnar wrote:
  
  I think the original commit:
  
    f96972f2dc63 "kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()"
  
  actually regressed your 1024 CPU systems, and should possibly be reverted or 
  fixed in some other fashion - such as by migrating to the primary CPU (on 
  architectures that require that), instead of hotplug-offlining every secondary 
  CPU on every architecture!
  
  Alternatively, disable_nonboot_cpus() could perhaps be improved to down CPUs in 
  parallel: issue the CPU-down requests to every CPU, then wait for them to 
  complete - instead of the loop over every CPU?
  
  This would be the conceptual counterpart to the parallel boot-up of CPUs - 
  something SGI might be interested in as well?
  
 
 Migrating to the boot processor and then calling stop_machine() to
 defang any other processors should be sufficient, no?
 
 I don't know if there is any reason to deschedule all tasks?

My reading of the original commit indicated that some architectures'
firmware needs the boot cpu to be the one initiating reboot.

If that is correct, then I cannot see why a stop_machine() implementation
would not work.
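
For concreteness, a minimal sketch of what that might look like (assuming
the v3.9-era stop_machine() API; restart_on_boot_cpu() and __do_restart()
are invented names, not code from any posted patch):

#include <linux/cpumask.h>
#include <linux/reboot.h>
#include <linux/sched.h>
#include <linux/stop_machine.h>

/* Runs on the boot cpu from inside stop_machine(); every other online
 * cpu is parked in the stopper loop with interrupts disabled. */
static int __do_restart(void *cmd)
{
        machine_restart((char *)cmd);   /* does not return */
        return 0;
}

static void restart_on_boot_cpu(char *cmd)
{
        int cpu = cpumask_first(cpu_online_mask);

        /* Pin ourselves to the boot cpu before pulling reset. */
        set_cpus_allowed_ptr(current, cpumask_of(cpu));
        stop_machine(__do_restart, cmd, cpumask_of(cpu));
}

The secondary cpus never come out of the stopper loop, which should be
fine here since the very next thing that happens is reset.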

Since this is in generic kernel code, how can I proceed?

Robin


Re: [PATCH] Do not force shutdown/reboot to boot cpu.

2013-04-08 Thread Russ Anderson
On Mon, Apr 08, 2013 at 05:57:01PM +0200, Ingo Molnar wrote:
 
 * Robin Holt h...@sgi.com wrote:
 
  We noticed recently that rebooting a 1024-cpu machine takes approx 16
  minutes just to stop the cpus.  The slowdown was tracked to commit
  f96972f, which went into v3.7 and then into the stable trees.
  
  x86 does not need to be running on the boot cpu to pull reset, and I don't
  think it is really needed for shutdown either.
  
  I decided to go the simple way and make this a config option that is
  selected by the x86 arch.  I don't know which other arches would also
  benefit, if any.
  
  Signed-off-by: Robin Holt h...@sgi.com
  To: Andrew Morton a...@linux-foundation.org
  Cc: Russ Anderson r...@sgi.com
  Cc: Thomas Gleixner t...@linutronix.de
  Cc: Ingo Molnar mi...@redhat.com
  Cc: H. Peter Anvin h...@zytor.com
  Cc: Shawn Guo shawn@linaro.org
  Cc: sta...@vger.kernel.org
  
  ---
    arch/x86/Kconfig        | 3 +++
    kernel/Kconfig.shutdown | 3 +++
    kernel/sys.c            | 4 ++++
    3 files changed, 10 insertions(+)
    create mode 100644 kernel/Kconfig.shutdown
  
  diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
  index 70c0f3d..9611942 100644
  --- a/arch/x86/Kconfig
  +++ b/arch/x86/Kconfig
  @@ -120,6 +120,7 @@ config X86
  select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
  select OLD_SIGACTION if X86_32
  select COMPAT_OLD_SIGACTION if IA32_EMULATION
  +   select ARCH_SHUTDOWN_TO_ANY_CPU
   
   config INSTRUCTION_DECODER
  def_bool y
  @@ -839,6 +840,8 @@ config SCHED_MC
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
   
   +source "kernel/Kconfig.shutdown"
   +
    source "kernel/Kconfig.preempt"
   
   config X86_UP_APIC
  diff --git a/kernel/Kconfig.shutdown b/kernel/Kconfig.shutdown
  new file mode 100644
   index 0000000..d79fc04
  --- /dev/null
  +++ b/kernel/Kconfig.shutdown
  @@ -0,0 +1,3 @@
  +
  +config ARCH_SHUTDOWN_TO_ANY_CPU
  +   bool
  diff --git a/kernel/sys.c b/kernel/sys.c
  index 39c9c4a..c0b8880 100644
  --- a/kernel/sys.c
  +++ b/kernel/sys.c
  @@ -369,7 +369,9 @@ EXPORT_SYMBOL(unregister_reboot_notifier);
   void kernel_restart(char *cmd)
   {
  kernel_restart_prepare(cmd);
  +#ifndef CONFIG_ARCH_SHUTDOWN_TO_ANY_CPU
  disable_nonboot_cpus();
  +#endif
  if (!cmd)
   printk(KERN_EMERG "Restarting system.\n");
  else
  @@ -413,7 +415,9 @@ void kernel_power_off(void)
  kernel_shutdown_prepare(SYSTEM_POWER_OFF);
  if (pm_power_off_prepare)
  pm_power_off_prepare();
  +#ifndef CONFIG_ARCH_SHUTDOWN_TO_ANY_CPU
  disable_nonboot_cpus();
  +#endif
  syscore_shutdown();
   printk(KERN_EMERG "Power down.\n");
  kmsg_dump(KMSG_DUMP_POWEROFF);
 
 Hm, the 'fix' is a pretty ugly workaround that does not fix much IMHO.
 
 I think the original commit:
 
   f96972f2dc63 "kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()"
 
 actually regressed your 1024 CPU systems, and should possibly be reverted or 
 fixed in some other fashion - such as by migrating to the primary CPU (on 
 architectures that require that), instead of hotplug-offlining every secondary 
 CPU on every architecture!

Sure.  There are multiple ways to fix this.
 
 Alternatively, disable_nonboot_cpus() could perhaps be improved to down CPUs in 
 parallel: issue the CPU-down requests to every CPU, then wait for them to 
 complete - instead of the loop over every CPU?

I took a look at this.  disable_nonboot_cpus() loops through all online cpus,
shutting them down one cpu thread at a time.  More frustrating, it ends up 
calling __stop_machine() to stop all the cpus, then loops back up to stop
the next thread.  The underlying code takes a cpu bitmask, so changing
disable_nonboot_cpus() to pass in a cpu bitmask and changing _cpu_down()
to accept it allows __stop_machine() to be called just once.  This change
reduced the shutdown time on a 1024-cpu system from 16 minutes down to 4.
A significant improvement, but not good enough.
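
As a rough sketch of the shape of that change (illustrative only;
_cpu_down_mask() is an invented name and the error handling a real
patch would need is omitted):

#include <linux/cpu.h>
#include <linux/cpumask.h>

/* Hypothetical mask-based variant of _cpu_down(): one __stop_machine()
 * pass for the whole mask instead of one pass per dying cpu. */
int _cpu_down_mask(const struct cpumask *cpus, int tasks_frozen);

static int disable_nonboot_cpus_parallel(void)
{
        static cpumask_t mask;
        int first_cpu = cpumask_first(cpu_online_mask);

        /* Everything online except the boot cpu goes down in one pass. */
        cpumask_copy(&mask, cpu_online_mask);
        cpumask_clear_cpu(first_cpu, &mask);
        return _cpu_down_mask(&mask, 1);
}

Inside such a _cpu_down_mask(), the per-cpu notifier calls are what is
left serial, which is the next bottleneck described below.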

The next significant bottleneck is __cpu_notify().  I tried creating worker
threads to parallelize the shutdown, but the problem is that __cpu_notify()
is not thread safe.  Putting a lock around it caused all the worker threads
to fight over the lock.

I wondered whether __cpu_notify() needed to be called for all cpus being
shut down, and it does, because the cpu_chain notifier call chain takes the
cpu as a parameter.  So the dilemma is that the cpu_chain notifiers need to
be called for all cpus, but this cannot be done in parallel because
__cpu_notify() is not thread safe.  Spinning through the notifier chain
sequentially for all cpus just takes a long time.

The real fix would be to make the cpu_chain notifier per cpu, or at
least thread safe, so that all the cpus being shut down could do so
in parallel.  That is a significant change with ramifications for
other code.
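
A per-cpu chain might look something like this (purely illustrative --
nothing like it exists in the tree, and registration would have to
become per cpu as well):

#include <linux/notifier.h>
#include <linux/percpu.h>

/* One notifier head per cpu, so each dying cpu can walk its own chain
 * without serializing against the other cpus' notifier calls. */
static DEFINE_PER_CPU(struct raw_notifier_head, cpu_chain_pcpu);

static int __cpu_notify_pcpu(unsigned long val, unsigned int cpu)
{
        return raw_notifier_call_chain(&per_cpu(cpu_chain_pcpu, cpu),
                                       val, (void *)(long)cpu);
}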

 This would be the conceptual counterpart to the parallel boot-up of CPUs -