Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus

2016-03-07 Thread Peter Zijlstra
On Mon, Mar 07, 2016 at 12:38:16PM -0500, Chris Metcalf wrote:
> On 03/07/2016 04:48 AM, Peter Zijlstra wrote:
> I'm a little skeptical that a single percpu write is going to add much
> measurable overhead to this path. 

So that write is almost guaranteed to be a cacheline miss; those things
hurt and do show up in profiles.

> However, we can certainly adapt
> alternate approaches that stay away from the actual idle code.
> 
> One approach (diff appended) is to just test to see if the PC is
> actually in the architecture-specific halt code.  There are two downsides:
> 
> 1. It requires a small amount of per-architecture support.  I've provided
>the tile support as an example, since that's what I tested.  I expect
>x86 is a little more complicated since there are more idle paths and
>they don't currently run the idle instruction(s) at a fixed address, but
>it's unlikely to be too complicated on any platform.
>Still, adding anything per-architecture is certainly a downside.
> 
> 2. As proposed, my new alternate solution only handles the non-polling
>case, so if you are in the polling loop, we won't benefit from having
>the NMI backtrace code skip over you.  However my guess is that 99% of
>the time folks do choose to run the default non-polling mode, so this
>probably still achieves a pretty reasonable outcome.
> 
> A different approach that would handle downside #2 and probably make it
> easier to implement the architecture-specific code for more complicated
> platforms like x86 would be to use the SCHED_TEXT model and tag all the
> low-level idling functions as CPUIDLE_TEXT.  Then the "are we idling"
> test is just a range compare on the PC against __cpuidle_text_{start,end}.
> 
> We'd have to decide whether to make cpu_idle_poll() non-inline and just
> test for being in that function, or whether we could tag all of
> cpu_idle_loop() as being CPUIDLE_TEXT and just omit any backtrace
> whenever the PC is anywhere in that function.  Obviously if we have
> called out to more complicated code (e.g. Daniel's concern about calling
> out to power management code) the PC would no longer be in the CPUIDLE_TEXT
> at that point, so that might be OK too.

But the CPU would also not be idle if it's running pm code.

So I like the CPUIDLE_TEXT approach, since it has no impact on the
generated code.

An alternative option would be to inspect the stack: we already take a
stack dump, so you could say that any CPU with cpuidle_enter() in its
callchain is an 'idle' cpu.

Yet another option would be to look at rq->idle_state or any other state
cpuidle already tracks. The 'obvious' downside is relying on cpuidle,
which I understand isn't supported by everyone.


Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus

2016-03-07 Thread Chris Metcalf

On 03/07/2016 04:48 AM, Peter Zijlstra wrote:

On Tue, Mar 01, 2016 at 11:01:42AM -0500, Chris Metcalf wrote:

+++ b/kernel/sched/idle.c
@@ -52,15 +52,25 @@ static int __init cpu_idle_nopoll_setup(char *__unused)
  __setup("hlt", cpu_idle_nopoll_setup);
  #endif
+static DEFINE_PER_CPU(bool, cpu_idling);
+
+/* Was the cpu in the low-level idle code when interrupted? */
+bool in_cpu_idle(void)
+{
+   return this_cpu_read(cpu_idling);
+}
+
  static inline int cpu_idle_poll(void)
  {
rcu_idle_enter();
trace_cpu_idle_rcuidle(0, smp_processor_id());
local_irq_enable();
stop_critical_timings();
+   this_cpu_write(cpu_idling, true);
while (!tif_need_resched() &&
(cpu_idle_force_poll || tick_check_broadcast_expired()))
cpu_relax();
+   this_cpu_write(cpu_idling, false);
start_critical_timings();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
rcu_idle_exit();
@@ -89,7 +99,9 @@ void default_idle_call(void)
local_irq_enable();
} else {
stop_critical_timings();
+   this_cpu_write(cpu_idling, true);
arch_cpu_idle();
+   this_cpu_write(cpu_idling, false);
start_critical_timings();
}
  }

No, we're not going to add random crap here. This is actually considered
a fast path for some workloads.

There's already far too much fat in the whole going to idle and coming
out of idle. We should be trimming this, not adding to it.


I'm a little skeptical that a single percpu write is going to add much
measurable overhead to this path.  However, we can certainly adapt
alternate approaches that stay away from the actual idle code.

One approach (diff appended) is to just test to see if the PC is
actually in the architecture-specific halt code.  There are two downsides:

1. It requires a small amount of per-architecture support.  I've provided
   the tile support as an example, since that's what I tested.  I expect
   x86 is a little more complicated since there are more idle paths and
   they don't currently run the idle instruction(s) at a fixed address, but
   it's unlikely to be too complicated on any platform.
   Still, adding anything per-architecture is certainly a downside.

2. As proposed, my new alternate solution only handles the non-polling
   case, so if you are in the polling loop, we won't benefit from having
   the NMI backtrace code skip over you.  However my guess is that 99% of
   the time folks do choose to run the default non-polling mode, so this
   probably still achieves a pretty reasonable outcome.

A different approach that would handle downside #2 and probably make it
easier to implement the architecture-specific code for more complicated
platforms like x86 would be to use the SCHED_TEXT model and tag all the
low-level idling functions as CPUIDLE_TEXT.  Then the "are we idling"
test is just a range compare on the PC against __cpuidle_text_{start,end}.

We'd have to decide whether to make cpu_idle_poll() non-inline and just
test for being in that function, or whether we could tag all of
cpu_idle_loop() as being CPUIDLE_TEXT and just omit any backtrace
whenever the PC is anywhere in that function.  Obviously if we have
called out to more complicated code (e.g. Daniel's concern about calling
out to power management code) the PC would no longer be in the CPUIDLE_TEXT
at that point, so that might be OK too.

Let me know what you think is the right direction here.

Thanks!

diff --git a/arch/tile/include/asm/thread_info.h 
b/arch/tile/include/asm/thread_info.h
index 4b7cef9e94e0..93ec51a4853b 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -92,6 +92,9 @@ extern void smp_nap(void);
 /* Enable interrupts racelessly and nap forever: helper for arch_cpu_idle(). */
 extern void _cpu_idle(void);
 
+/* The address of the actual nap instruction. */
+extern long _cpu_idle_nap[];
+
 #else /* __ASSEMBLY__ */
 
 /*

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index b5f30d376ce1..a83a426f1755 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,11 @@ void arch_cpu_idle(void)
_cpu_idle();
 }
 
+bool arch_cpu_in_idle(struct pt_regs *regs)
+{
+   return regs->pc == (unsigned long)_cpu_idle_nap;
+}
+
 /*
  * Release a thread_info structure
  */
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index d2ca8c38f9c4..24462927fa49 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -279,6 +279,7 @@ void arch_cpu_idle_prepare(void);
 void arch_cpu_idle_enter(void);
 void arch_cpu_idle_exit(void);
 void arch_cpu_idle_dead(void);
+bool arch_cpu_in_idle(struct pt_regs *);
 
 DECLARE_PER_CPU(bool, cpu_dead_idle);
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 544a7133cbd1..d9dbab6526a9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -77,6 +77,7 @@ void 

Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus

2016-03-07 Thread Chris Metcalf

On 03/07/2016 03:26 AM, Daniel Thompson wrote:

Chris Metcalf wrote:
+static DEFINE_PER_CPU(bool, cpu_idling);
+
+/* Was the cpu in the low-level idle code when interrupted? */
+bool in_cpu_idle(void)
+{
+return this_cpu_read(cpu_idling);


I think we continue to need the code to identify a core that is
running an interrupt handler. Interrupts are not masked at the point
we set cpu_idling to false, meaning we can easily be interrupted before
we clear the flag.


Yes, good catch.  However, the point is mooted by PeterZ wanting to keep
any extra state-switching code out of the idle path.  See my reply to him
for more on that.


--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus

2016-03-07 Thread Peter Zijlstra
On Tue, Mar 01, 2016 at 11:01:42AM -0500, Chris Metcalf wrote:
> +++ b/kernel/sched/idle.c
> @@ -52,15 +52,25 @@ static int __init cpu_idle_nopoll_setup(char *__unused)
>  __setup("hlt", cpu_idle_nopoll_setup);
>  #endif
> +static DEFINE_PER_CPU(bool, cpu_idling);
> +
> +/* Was the cpu in the low-level idle code when interrupted? */
> +bool in_cpu_idle(void)
> +{
> + return this_cpu_read(cpu_idling);
> +}
> +
>  static inline int cpu_idle_poll(void)
>  {
>   rcu_idle_enter();
>   trace_cpu_idle_rcuidle(0, smp_processor_id());
>   local_irq_enable();
>   stop_critical_timings();
> + this_cpu_write(cpu_idling, true);
>   while (!tif_need_resched() &&
>   (cpu_idle_force_poll || tick_check_broadcast_expired()))
>   cpu_relax();
> + this_cpu_write(cpu_idling, false);
>   start_critical_timings();
>   trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>   rcu_idle_exit();
> @@ -89,7 +99,9 @@ void default_idle_call(void)
>   local_irq_enable();
>   } else {
>   stop_critical_timings();
> + this_cpu_write(cpu_idling, true);
>   arch_cpu_idle();
> + this_cpu_write(cpu_idling, false);
>   start_critical_timings();
>   }
>  }

No, we're not going to add random crap here. This is actually considered
a fast path for some workloads.

There's already far too much fat in the whole going to idle and coming
out of idle. We should be trimming this, not adding to it.


Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus

2016-03-07 Thread Daniel Thompson

On 01/03/16 23:01, Chris Metcalf wrote:

(+PeterZ, Rafael, and Daniel Lezcano for cpuidle and scheduler)

On 03/01/2016 09:23 AM, Daniel Thompson wrote:

On 29/02/16 21:40, Chris Metcalf wrote:

When doing an nmi backtrace of many cores, and most of them are idle,
the output is a little overwhelming and very uninformative. Suppress
messages for cpus that are idling when they are interrupted and
just emit one line, "NMI backtrace for N skipped: idle".


I can see this makes the logs more attractive but this is code for
emergency situations.

The idle task is responsible for certain power management activities.
How can you be sure the system is wedged because of bugs in that code?


It's a fair point, but as core count increases, you really run the risk
of losing the valuable data in a sea of data that isn't.  For example,
for the architecture I maintain, we have the TILE-Gx72, which is a
72-core chip.  If each core's register dump and backtrace is 40 lines,
we're up to around 3,000 lines of console output. Filtering that down by
a factor of 10x or more (if only a handful of cores happen to be active,
which is not uncommon) is a substantial usability improvement.


No objections to your use case. The output feels very verbose even with 
"only" eight cores.




That said, it's true that the original solution I offered (examining
just is_idle_task() plus interrupt nesting) is imprecise.  It is
relatively straightforward to add a bit of per-cpu state that is set at
the same moment we currently do stop/start_critical_timings(), which
would indicate much more specifically that the cpu was running the
idling code itself, and not anything more complex.  In that case if the
flag was set, you would know you were either sitting on a
processor-specific idle instruction in arch_cpu_idle(), or else polling
one or two memory locations in a tight loop in cpu_idle_poll(), which
presumably would offer sufficient precision to feel safe.

Here's an alternative version of the patch which incorporates this
idea.  Do you think this is preferable?  Thanks!


I prefer the approach taken by the new patch although I think the 
implementation might be buggy...




commit 5b6dca9bad908ae66fa764025c4e6046a6cc0262
Author: Chris Metcalf 
Date:   Mon Feb 29 11:56:32 2016 -0500

 nmi_backtrace: generate one-line reports for idle cpus
 When doing an nmi backtrace of many cores, and most of them are idle,
 the output is a little overwhelming and very uninformative.  Suppress
 messages for cpus that are idling when they are interrupted and
 just emit one line, "NMI backtrace for N skipped: idle".
 Signed-off-by: Chris Metcalf 

diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 786ad32631a6..b8c3c4cf88ad 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -206,6 +206,7 @@ static inline int cpuidle_enter_freeze(struct cpuidle_driver *drv,
  /* kernel/sched/idle.c */
  extern void sched_idle_set_state(struct cpuidle_state *idle_state);
  extern void default_idle_call(void);
+extern bool in_cpu_idle(void);

  #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
  void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev, atomic_t *a);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 544a7133cbd1..9aff315f278b 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -52,15 +52,25 @@ static int __init cpu_idle_nopoll_setup(char *__unused)
  __setup("hlt", cpu_idle_nopoll_setup);
  #endif

+static DEFINE_PER_CPU(bool, cpu_idling);
+
+/* Was the cpu in the low-level idle code when interrupted? */
+bool in_cpu_idle(void)
+{
+return this_cpu_read(cpu_idling);


I think we continue to need the code to identify a core that is running
an interrupt handler. Interrupts are not masked at the point we set
cpu_idling to false, meaning we can easily be interrupted before we clear
the flag.




+}
+
  static inline int cpu_idle_poll(void)
  {
  rcu_idle_enter();
  trace_cpu_idle_rcuidle(0, smp_processor_id());
  local_irq_enable();
  stop_critical_timings();
+this_cpu_write(cpu_idling, true);
  while (!tif_need_resched() &&
  (cpu_idle_force_poll || tick_check_broadcast_expired()))
  cpu_relax();
+this_cpu_write(cpu_idling, false);
  start_critical_timings();
  trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
  rcu_idle_exit();
@@ -89,7 +99,9 @@ void default_idle_call(void)
  local_irq_enable();
  } else {
  stop_critical_timings();
+this_cpu_write(cpu_idling, true);
  arch_cpu_idle();
+this_cpu_write(cpu_idling, false);
  start_critical_timings();
  }
  }
diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
index db63ac75eba0..75b5eacaa5d3 100644
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@ -17,6 +17,7 @@
  #include 
  #include 
  #include 
+#include 

  #ifdef arch_trigger_cpumask_backtrace
  /* For reliability, we're prepared to waste bits here. 

Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus

2016-03-01 Thread Chris Metcalf

(+PeterZ, Rafael, and Daniel Lezcano for cpuidle and scheduler)

On 03/01/2016 09:23 AM, Daniel Thompson wrote:

Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus

2016-03-01 Thread Chris Metcalf

(+PeterZ, Rafael, and Daniel Lezcano for cpuidle and scheduler)

On 03/01/2016 09:23 AM, Daniel Thompson wrote:

On 29/02/16 21:40, Chris Metcalf wrote:

When doing an nmi backtrace of many cores, and most of them are idle,
the output is a little overwhelming and very uninformative. Suppress
messages for cpus that are idling when they are interrupted and
just emit one line, "NMI backtrace for N skipped: idle".


I can see this makes the logs more attractive but this is code for 
emergency situations.


The idle task is responsible for certain power management activities. 
How can you be sure the system is wedged because of bugs in that code?


It's a fair point, but as core count increases, you really run the risk 
of losing the valuable data in a sea of data that isn't.  For example, 
for the architecture I maintain, we have the TILE-Gx72, which is a 
72-core chip.  If each core's register dump and backtrace is 40 lines, 
we're up to around 3,000 lines of console output. Filtering that down by 
a factor of 10x or more (if only a handful of cores happen to be active, 
which is not uncommon) is a substantial usability improvement.


That said, it's true that the original solution I offered (examining 
just is_idle_task() plus interrupt nesting) is imprecise.  It is 
relatively straightforward to add a bit of per-cpu state that is set at 
the same moment we currently do stop/start_critical_timings(), which 
would indicate much more specifically that the cpu was running the 
idling code itself, and not anything more complex.  In that case if the 
flag was set, you would know you were either sitting on a 
processor-specific idle instruction in arch_cpu_idle(), or else polling 
one or two memory locations in a tight loop in cpu_idle_poll(), which 
presumably would offer sufficient precision to feel safe.


Here's an alternative version of the patch which incorporates this 
idea.  Do you think this is preferable?  Thanks!


commit 5b6dca9bad908ae66fa764025c4e6046a6cc0262
Author: Chris Metcalf 
Date:   Mon Feb 29 11:56:32 2016 -0500

nmi_backtrace: generate one-line reports for idle cpus

When doing an nmi backtrace of many cores, and most of them are idle,
the output is a little overwhelming and very uninformative.  Suppress
messages for cpus that are idling when they are interrupted and
just emit one line, "NMI backtrace for N skipped: idle".

Signed-off-by: Chris Metcalf 


diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 786ad32631a6..b8c3c4cf88ad 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -206,6 +206,7 @@ static inline int cpuidle_enter_freeze(struct cpuidle_driver *drv,
 /* kernel/sched/idle.c */
 extern void sched_idle_set_state(struct cpuidle_state *idle_state);
 extern void default_idle_call(void);
+extern bool in_cpu_idle(void);
 
 #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED

 void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev, atomic_t *a);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 544a7133cbd1..9aff315f278b 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -52,15 +52,25 @@ static int __init cpu_idle_nopoll_setup(char *__unused)
 __setup("hlt", cpu_idle_nopoll_setup);
 #endif
 
+static DEFINE_PER_CPU(bool, cpu_idling);
+
+/* Was the cpu in the low-level idle code when interrupted? */
+bool in_cpu_idle(void)
+{
+   return this_cpu_read(cpu_idling);
+}
+
 static inline int cpu_idle_poll(void)
 {
rcu_idle_enter();
trace_cpu_idle_rcuidle(0, smp_processor_id());
local_irq_enable();
stop_critical_timings();
+   this_cpu_write(cpu_idling, true);
while (!tif_need_resched() &&
(cpu_idle_force_poll || tick_check_broadcast_expired()))
cpu_relax();
+   this_cpu_write(cpu_idling, false);
start_critical_timings();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
rcu_idle_exit();
@@ -89,7 +99,9 @@ void default_idle_call(void)
local_irq_enable();
} else {
stop_critical_timings();
+   this_cpu_write(cpu_idling, true);
arch_cpu_idle();
+   this_cpu_write(cpu_idling, false);
start_critical_timings();
}
 }
diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
index db63ac75eba0..75b5eacaa5d3 100644
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/cpuidle.h>
 
 #ifdef arch_trigger_cpumask_backtrace

 /* For reliability, we're prepared to waste bits here. */
@@ -151,11 +152,16 @@ bool nmi_cpu_backtrace(struct pt_regs *regs)
 
		/* Replace printk to write into the NMI seq */
this_cpu_write(printk_func, nmi_vprintk);
-   pr_warn("NMI backtrace for cpu %d\n", cpu);
-   if (regs)
-   show_regs(regs);
-   else
-   dump_stack();
+   

Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus

2016-03-01 Thread Daniel Thompson

On 29/02/16 21:40, Chris Metcalf wrote:

When doing an nmi backtrace of many cores, and most of them are idle,
the output is a little overwhelming and very uninformative.  Suppress
messages for cpus that are idling when they are interrupted and
just emit one line, "NMI backtrace for N skipped: idle".


I can see this makes the logs more attractive but this is code for 
emergency situations.


The idle task is responsible for certain power management activities. 
How can you be sure the system is wedged because of bugs in that code?



Daniel.


[PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus

2016-02-29 Thread Chris Metcalf
When doing an nmi backtrace of many cores, and most of them are idle,
the output is a little overwhelming and very uninformative.  Suppress
messages for cpus that are idling when they are interrupted and
just emit one line, "NMI backtrace for N skipped: idle".

Signed-off-by: Chris Metcalf 
---
 lib/nmi_backtrace.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
index db63ac75eba0..f878efc9e851 100644
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@ -142,6 +142,15 @@ static int nmi_vprintk(const char *fmt, va_list args)
	return seq_buf_used(&s->seq) - len;
 }
 
+static bool idling(void)
+{
+   if (!is_idle_task(current))
+   return false;
+
+   /* Account for nmi_enter() having been called once. */
+   return hardirq_count() == HARDIRQ_OFFSET && !in_softirq();
+}
+
 bool nmi_cpu_backtrace(struct pt_regs *regs)
 {
int cpu = smp_processor_id();
@@ -151,11 +160,16 @@ bool nmi_cpu_backtrace(struct pt_regs *regs)
 
/* Replace printk to write into the NMI seq */
this_cpu_write(printk_func, nmi_vprintk);
-   pr_warn("NMI backtrace for cpu %d\n", cpu);
-   if (regs)
-   show_regs(regs);
-   else
-   dump_stack();
+   if (idling()) {
+   pr_warn("NMI backtrace for cpu %d skipped: idle\n",
+   cpu);
+   } else {
+   pr_warn("NMI backtrace for cpu %d\n", cpu);
+   if (regs)
+   show_regs(regs);
+   else
+   dump_stack();
+   }
this_cpu_write(printk_func, printk_func_save);
 
cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
-- 
2.1.2


