Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus
On Mon, Mar 07, 2016 at 12:38:16PM -0500, Chris Metcalf wrote:
> On 03/07/2016 04:48 AM, Peter Zijlstra wrote:
> I'm a little skeptical that a single percpu write is going to add much
> measurable overhead to this path.

So that write is almost guaranteed to be a cacheline miss; those things hurt and do show up on profiles.

> However, we can certainly adapt alternate approaches that stay away
> from the actual idle code.
>
> One approach (diff appended) is to just test whether the PC is
> actually in the architecture-specific halt code. There are two downsides:
>
> 1. It requires a small amount of per-architecture support. I've provided
>    the tile support as an example, since that's what I tested. I expect
>    x86 is a little more complicated, since there are more idle paths and
>    they don't currently run the idle instruction(s) at a fixed address,
>    but it's unlikely to be too complicated on any platform.
>    Still, adding anything per-architecture is certainly a downside.
>
> 2. As proposed, my new alternate solution only handles the non-polling
>    case, so if you are in the polling loop, we won't benefit from having
>    the NMI backtrace code skip over you. However, my guess is that 99% of
>    the time folks choose to run the default non-polling mode, so this
>    probably still achieves a pretty reasonable outcome.
>
> A different approach, which would handle downside #2 and probably make it
> easier to implement the architecture-specific code for more complicated
> platforms like x86, would be to use the SCHED_TEXT model and tag all the
> low-level idling functions as CPUIDLE_TEXT. Then the "are we idling"
> test is just a range compare on the PC against __cpuidle_text_{start,end}.
>
> We'd have to decide whether to make cpu_idle_poll() non-inline and just
> test for being in that function, or whether we could tag all of
> cpu_idle_loop() as being CPUIDLE_TEXT and just omit any backtrace
> whenever the PC is anywhere in that function.
>
> Obviously if we have called out to more complicated code (e.g. Daniel's
> concern about calling out to power management code) the PC would no
> longer be in the CPUIDLE_TEXT at that point, so that might be OK too.

But the CPU would also not be idle if it's running pm code.

So I like the CPUIDLE_TEXT approach, since it has no impact on the generated code.

An alternative option could be to inspect the stack; we already take a stack dump, so you could say that everything that has cpuidle_enter() in its callchain is an 'idle' cpu.

Yet another option would be to look at rq->idle_state or any other state cpuidle already tracks. The 'obvious' downside is relying on cpuidle, which I understand isn't supported by everyone.
Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus
On 03/07/2016 04:48 AM, Peter Zijlstra wrote:
> On Tue, Mar 01, 2016 at 11:01:42AM -0500, Chris Metcalf wrote:
>> +++ b/kernel/sched/idle.c
>> @@ -52,15 +52,25 @@ static int __init cpu_idle_nopoll_setup(char *__unused)
>>  __setup("hlt", cpu_idle_nopoll_setup);
>>  #endif
>> +static DEFINE_PER_CPU(bool, cpu_idling);
>> +
>> +/* Was the cpu was in the low-level idle code when interrupted? */
>> +bool in_cpu_idle(void)
>> +{
>> +	return this_cpu_read(cpu_idling);
>> +}
>> +
>>  static inline int cpu_idle_poll(void)
>>  {
>>  	rcu_idle_enter();
>>  	trace_cpu_idle_rcuidle(0, smp_processor_id());
>>  	local_irq_enable();
>>  	stop_critical_timings();
>> +	this_cpu_write(cpu_idling, true);
>>  	while (!tif_need_resched() &&
>>  		(cpu_idle_force_poll || tick_check_broadcast_expired()))
>>  		cpu_relax();
>> +	this_cpu_write(cpu_idling, false);
>>  	start_critical_timings();
>>  	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>>  	rcu_idle_exit();
>> @@ -89,7 +99,9 @@ void default_idle_call(void)
>>  		local_irq_enable();
>>  	} else {
>>  		stop_critical_timings();
>> +		this_cpu_write(cpu_idling, true);
>>  		arch_cpu_idle();
>> +		this_cpu_write(cpu_idling, false);
>>  		start_critical_timings();
>>  	}
>>  }
>
> No, we're not going to add random crap here. This is actually
> considered a fast path for some workloads. There's already far too
> much fat in the whole going to idle and coming out of idle. We should
> be trimming this, not adding to it.

I'm a little skeptical that a single percpu write is going to add much measurable overhead to this path. However, we can certainly adapt alternate approaches that stay away from the actual idle code.

One approach (diff appended) is to just test whether the PC is actually in the architecture-specific halt code. There are two downsides:

1. It requires a small amount of per-architecture support. I've provided the tile support as an example, since that's what I tested. I expect x86 is a little more complicated, since there are more idle paths and they don't currently run the idle instruction(s) at a fixed address, but it's unlikely to be too complicated on any platform. Still, adding anything per-architecture is certainly a downside.

2. As proposed, my new alternate solution only handles the non-polling case, so if you are in the polling loop, we won't benefit from having the NMI backtrace code skip over you. However, my guess is that 99% of the time folks choose to run the default non-polling mode, so this probably still achieves a pretty reasonable outcome.

A different approach, which would handle downside #2 and probably make it easier to implement the architecture-specific code for more complicated platforms like x86, would be to use the SCHED_TEXT model and tag all the low-level idling functions as CPUIDLE_TEXT. Then the "are we idling" test is just a range compare on the PC against __cpuidle_text_{start,end}.

We'd have to decide whether to make cpu_idle_poll() non-inline and just test for being in that function, or whether we could tag all of cpu_idle_loop() as being CPUIDLE_TEXT and just omit any backtrace whenever the PC is anywhere in that function. Obviously if we have called out to more complicated code (e.g. Daniel's concern about calling out to power management code) the PC would no longer be in the CPUIDLE_TEXT at that point, so that might be OK too.

Let me know what you think is the right direction here. Thanks!

diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index 4b7cef9e94e0..93ec51a4853b 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -92,6 +92,9 @@ extern void smp_nap(void);
 /* Enable interrupts racelessly and nap forever: helper for arch_cpu_idle(). */
 extern void _cpu_idle(void);
 
+/* The address of the actual nap instruction. */
+extern long _cpu_idle_nap[];
+
 #else /* __ASSEMBLY__ */
 
 /*
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index b5f30d376ce1..a83a426f1755 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,11 @@ void arch_cpu_idle(void)
 	_cpu_idle();
 }
 
+bool arch_cpu_in_idle(struct pt_regs *regs)
+{
+	return regs->pc == (unsigned long)_cpu_idle_nap;
+}
+
 /*
  * Release a thread_info structure
  */
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index d2ca8c38f9c4..24462927fa49 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -279,6 +279,7 @@ void arch_cpu_idle_prepare(void);
 void arch_cpu_idle_enter(void);
 void arch_cpu_idle_exit(void);
 void arch_cpu_idle_dead(void);
+bool arch_cpu_in_idle(struct pt_regs *);
 
 DECLARE_PER_CPU(bool, cpu_dead_idle);
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 544a7133cbd1..d9dbab6526a9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -77,6 +77,7 @@ void
Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus
On 03/07/2016 03:26 AM, Daniel Thompson wrote:
> Chris Metcalf wrote:
>> +static DEFINE_PER_CPU(bool, cpu_idling);
>> +
>> +/* Was the cpu was in the low-level idle code when interrupted? */
>> +bool in_cpu_idle(void)
>> +{
>> +	return this_cpu_read(cpu_idling);
>
> I think we continue to need the code to identify a core that is
> running an interrupt handler. Interrupts are not masked at the point
> we set cpu_idling to false, meaning we can easily be preempted before
> we clear the flag.

Yes, good catch. However, this is mooted by PeterZ wanting to keep any extra state-switching code out of the idle path. See my reply to him for more on that.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus
On Tue, Mar 01, 2016 at 11:01:42AM -0500, Chris Metcalf wrote:
> +++ b/kernel/sched/idle.c
> @@ -52,15 +52,25 @@ static int __init cpu_idle_nopoll_setup(char *__unused)
>  __setup("hlt", cpu_idle_nopoll_setup);
>  #endif
> +static DEFINE_PER_CPU(bool, cpu_idling);
> +
> +/* Was the cpu was in the low-level idle code when interrupted? */
> +bool in_cpu_idle(void)
> +{
> +	return this_cpu_read(cpu_idling);
> +}
> +
>  static inline int cpu_idle_poll(void)
>  {
>  	rcu_idle_enter();
>  	trace_cpu_idle_rcuidle(0, smp_processor_id());
>  	local_irq_enable();
>  	stop_critical_timings();
> +	this_cpu_write(cpu_idling, true);
>  	while (!tif_need_resched() &&
>  		(cpu_idle_force_poll || tick_check_broadcast_expired()))
>  		cpu_relax();
> +	this_cpu_write(cpu_idling, false);
>  	start_critical_timings();
>  	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>  	rcu_idle_exit();
> @@ -89,7 +99,9 @@ void default_idle_call(void)
>  		local_irq_enable();
>  	} else {
>  		stop_critical_timings();
> +		this_cpu_write(cpu_idling, true);
>  		arch_cpu_idle();
> +		this_cpu_write(cpu_idling, false);
>  		start_critical_timings();
>  	}
>  }

No, we're not going to add random crap here. This is actually considered a fast path for some workloads.

There's already far too much fat in the whole going to idle and coming out of idle. We should be trimming this, not adding to it.
Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus
On 01/03/16 23:01, Chris Metcalf wrote:
> (+PeterZ, Rafael, and Daniel Lezcano for cpuidle and scheduler)
>
> On 03/01/2016 09:23 AM, Daniel Thompson wrote:
>> On 29/02/16 21:40, Chris Metcalf wrote:
>>> When doing an nmi backtrace of many cores, and most of them are idle,
>>> the output is a little overwhelming and very uninformative. Suppress
>>> messages for cpus that are idling when they are interrupted and just
>>> emit one line, "NMI backtrace for N skipped: idle".
>>
>> I can see this makes the logs more attractive but this is code for
>> emergency situations. The idle task is responsible for certain power
>> management activities. How can you be sure the system is wedged
>> because of bugs in that code?
>
> It's a fair point, but as core count increases, you really run the
> risk of losing the valuable data in a sea of data that isn't. For
> example, for the architecture I maintain, we have the TILE-Gx72, which
> is a 72-core chip. If each core's register dump and backtrace is 40
> lines, we're up to around 3,000 lines of console output. Filtering
> that down by a factor of 10x or more (if only a handful of cores
> happen to be active, which is not uncommon) is a substantial usability
> improvement.

No objections to your use case. The output feels very verbose even with "only" eight cores.

> That said, it's true that the original solution I offered (examining
> just is_idle_task() plus interrupt nesting) is imprecise. It is
> relatively straightforward to add a bit of per-cpu state that is set
> at the same moment we currently do stop/start_critical_timings(),
> which would indicate much more specifically that the cpu was running
> the idling code itself, and not anything more complex. In that case,
> if the flag was set, you would know you were either sitting on a
> processor-specific idle instruction in arch_cpu_idle(), or else
> polling one or two memory locations in a tight loop in
> cpu_idle_poll(), which presumably would offer sufficient precision to
> feel safe.
>
> Here's an alternative version of the patch which incorporates this
> idea. Do you think this is preferable? Thanks!

I prefer the approach taken by the new patch, although I think the implementation might be buggy...

> commit 5b6dca9bad908ae66fa764025c4e6046a6cc0262
> Author: Chris Metcalf
> Date:   Mon Feb 29 11:56:32 2016 -0500
>
>     nmi_backtrace: generate one-line reports for idle cpus
>
>     When doing an nmi backtrace of many cores, and most of them are idle,
>     the output is a little overwhelming and very uninformative. Suppress
>     messages for cpus that are idling when they are interrupted and just
>     emit one line, "NMI backtrace for N skipped: idle".
>
>     Signed-off-by: Chris Metcalf
>
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index 786ad32631a6..b8c3c4cf88ad 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -206,6 +206,7 @@ static inline int cpuidle_enter_freeze(struct cpuidle_driver *drv,
>  /* kernel/sched/idle.c */
>  extern void sched_idle_set_state(struct cpuidle_state *idle_state);
>  extern void default_idle_call(void);
> +extern bool in_cpu_idle(void);
>
>  #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
>  void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev, atomic_t *a);
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 544a7133cbd1..9aff315f278b 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -52,15 +52,25 @@ static int __init cpu_idle_nopoll_setup(char *__unused)
>  __setup("hlt", cpu_idle_nopoll_setup);
>  #endif
> +static DEFINE_PER_CPU(bool, cpu_idling);
> +
> +/* Was the cpu was in the low-level idle code when interrupted? */
> +bool in_cpu_idle(void)
> +{
> +	return this_cpu_read(cpu_idling);

I think we continue to need the code to identify a core that is running an interrupt handler. Interrupts are not masked at the point we set cpu_idling to false, meaning we can easily be preempted before we clear the flag.

> +}
> +
>  static inline int cpu_idle_poll(void)
>  {
>  	rcu_idle_enter();
>  	trace_cpu_idle_rcuidle(0, smp_processor_id());
>  	local_irq_enable();
>  	stop_critical_timings();
> +	this_cpu_write(cpu_idling, true);
>  	while (!tif_need_resched() &&
>  		(cpu_idle_force_poll || tick_check_broadcast_expired()))
>  		cpu_relax();
> +	this_cpu_write(cpu_idling, false);
>  	start_critical_timings();
>  	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>  	rcu_idle_exit();
> @@ -89,7 +99,9 @@ void default_idle_call(void)
>  		local_irq_enable();
>  	} else {
>  		stop_critical_timings();
> +		this_cpu_write(cpu_idling, true);
>  		arch_cpu_idle();
> +		this_cpu_write(cpu_idling, false);
>  		start_critical_timings();
>  	}
>  }
> diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
> index db63ac75eba0..75b5eacaa5d3 100644
> --- a/lib/nmi_backtrace.c
> +++ b/lib/nmi_backtrace.c
> @@ -17,6 +17,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #ifdef arch_trigger_cpumask_backtrace
>  /* For reliability, we're prepared to waste bits here.
Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus
(+PeterZ, Rafael, and Daniel Lezcano for cpuidle and scheduler)

On 03/01/2016 09:23 AM, Daniel Thompson wrote:
> On 29/02/16 21:40, Chris Metcalf wrote:
>> When doing an nmi backtrace of many cores, and most of them are idle,
>> the output is a little overwhelming and very uninformative. Suppress
>> messages for cpus that are idling when they are interrupted and just
>> emit one line, "NMI backtrace for N skipped: idle".
>
> I can see this makes the logs more attractive but this is code for
> emergency situations. The idle task is responsible for certain power
> management activities. How can you be sure the system is wedged
> because of bugs in that code?

It's a fair point, but as core count increases, you really run the risk of losing the valuable data in a sea of data that isn't. For example, for the architecture I maintain, we have the TILE-Gx72, which is a 72-core chip. If each core's register dump and backtrace is 40 lines, we're up to around 3,000 lines of console output. Filtering that down by a factor of 10x or more (if only a handful of cores happen to be active, which is not uncommon) is a substantial usability improvement.

That said, it's true that the original solution I offered (examining just is_idle_task() plus interrupt nesting) is imprecise. It is relatively straightforward to add a bit of per-cpu state that is set at the same moment we currently do stop/start_critical_timings(), which would indicate much more specifically that the cpu was running the idling code itself, and not anything more complex. In that case, if the flag was set, you would know you were either sitting on a processor-specific idle instruction in arch_cpu_idle(), or else polling one or two memory locations in a tight loop in cpu_idle_poll(), which presumably would offer sufficient precision to feel safe.

Here's an alternative version of the patch which incorporates this idea. Do you think this is preferable? Thanks!

commit 5b6dca9bad908ae66fa764025c4e6046a6cc0262
Author: Chris Metcalf
Date:   Mon Feb 29 11:56:32 2016 -0500

    nmi_backtrace: generate one-line reports for idle cpus
    
    When doing an nmi backtrace of many cores, and most of them are idle,
    the output is a little overwhelming and very uninformative. Suppress
    messages for cpus that are idling when they are interrupted and just
    emit one line, "NMI backtrace for N skipped: idle".
    
    Signed-off-by: Chris Metcalf

diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 786ad32631a6..b8c3c4cf88ad 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -206,6 +206,7 @@ static inline int cpuidle_enter_freeze(struct cpuidle_driver *drv,
 /* kernel/sched/idle.c */
 extern void sched_idle_set_state(struct cpuidle_state *idle_state);
 extern void default_idle_call(void);
+extern bool in_cpu_idle(void);
 
 #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
 void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev, atomic_t *a);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 544a7133cbd1..9aff315f278b 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -52,15 +52,25 @@ static int __init cpu_idle_nopoll_setup(char *__unused)
 __setup("hlt", cpu_idle_nopoll_setup);
 #endif
 
+static DEFINE_PER_CPU(bool, cpu_idling);
+
+/* Was the cpu was in the low-level idle code when interrupted? */
+bool in_cpu_idle(void)
+{
+	return this_cpu_read(cpu_idling);
+}
+
 static inline int cpu_idle_poll(void)
 {
 	rcu_idle_enter();
 	trace_cpu_idle_rcuidle(0, smp_processor_id());
 	local_irq_enable();
 	stop_critical_timings();
+	this_cpu_write(cpu_idling, true);
 	while (!tif_need_resched() &&
 		(cpu_idle_force_poll || tick_check_broadcast_expired()))
 		cpu_relax();
+	this_cpu_write(cpu_idling, false);
 	start_critical_timings();
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
 	rcu_idle_exit();
@@ -89,7 +99,9 @@ void default_idle_call(void)
 		local_irq_enable();
 	} else {
 		stop_critical_timings();
+		this_cpu_write(cpu_idling, true);
 		arch_cpu_idle();
+		this_cpu_write(cpu_idling, false);
 		start_critical_timings();
 	}
 }
diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
index db63ac75eba0..75b5eacaa5d3 100644
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 
 #ifdef arch_trigger_cpumask_backtrace
 /* For reliability, we're prepared to waste bits here. */
@@ -151,11 +152,16 @@ bool nmi_cpu_backtrace(struct pt_regs *regs)
 	/* Replace printk to write into the NMI seq */
 	this_cpu_write(printk_func, nmi_vprintk);
-	pr_warn("NMI backtrace for cpu %d\n", cpu);
-	if (regs)
-		show_regs(regs);
-	else
-
Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus
On 29/02/16 21:40, Chris Metcalf wrote:
> When doing an nmi backtrace of many cores, and most of them are idle,
> the output is a little overwhelming and very uninformative. Suppress
> messages for cpus that are idling when they are interrupted and just
> emit one line, "NMI backtrace for N skipped: idle".

I can see this makes the logs more attractive but this is code for
emergency situations. The idle task is responsible for certain power
management activities. How can you be sure the system is wedged because
of bugs in that code?

Daniel.
[PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus
When doing an nmi backtrace of many cores, and most of them are idle,
the output is a little overwhelming and very uninformative. Suppress
messages for cpus that are idling when they are interrupted and just
emit one line, "NMI backtrace for N skipped: idle".

Signed-off-by: Chris Metcalf
---
 lib/nmi_backtrace.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
index db63ac75eba0..f878efc9e851 100644
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@ -142,6 +142,15 @@ static int nmi_vprintk(const char *fmt, va_list args)
 	return seq_buf_used(&s->seq) - len;
 }

+static bool idling(void)
+{
+	if (!is_idle_task(current))
+		return false;
+
+	/* Account for nmi_enter() having been called once. */
+	return hardirq_count() == HARDIRQ_OFFSET && !in_softirq();
+}
+
 bool nmi_cpu_backtrace(struct pt_regs *regs)
 {
 	int cpu = smp_processor_id();
@@ -151,11 +160,16 @@ bool nmi_cpu_backtrace(struct pt_regs *regs)

 	/* Replace printk to write into the NMI seq */
 	this_cpu_write(printk_func, nmi_vprintk);
-	pr_warn("NMI backtrace for cpu %d\n", cpu);
-	if (regs)
-		show_regs(regs);
-	else
-		dump_stack();
+	if (idling()) {
+		pr_warn("NMI backtrace for cpu %d skipped: idle\n",
+			cpu);
+	} else {
+		pr_warn("NMI backtrace for cpu %d\n", cpu);
+		if (regs)
+			show_regs(regs);
+		else
+			dump_stack();
+	}

 	this_cpu_write(printk_func, printk_func_save);
 	cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
--
2.1.2