On Fri, Jan 25, 2019 at 8:02 AM Andi Kleen <a...@linux.intel.com> wrote:
>
> > [Fri Jan 25 10:28:53 2019] perf: interrupt took too long (2501 > 2500),
> > lowering kernel.perf_event_max_sample_rate to 79750
> > [Fri Jan 25 10:29:08 2019] perf: interrupt took too long (3136 > 3126),
> > lowering kernel.perf_event_max_sample_rate to 63750
> > [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (4140 > 3920),
> > lowering kernel.perf_event_max_sample_rate to 48250
> > [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (5231 > 5175),
> > lowering kernel.perf_event_max_sample_rate to 38000
> > [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (6736 > 6538),
> > lowering kernel.perf_event_max_sample_rate to 29500
>
> These are fairly normal.
>
> > [Fri Jan 25 10:32:44 2019] ------------[ cut here ]------------
> > [Fri Jan 25 10:32:44 2019] perfevents: irq loop stuck!
>
> I believe it's always possible to cause an irq loop. This happens when
> the PMU is programmed to cause PMIs on multiple counters
> too quickly. Maybe should just recover from it without printing such
> scary messages.
Yeah, a stuck loop looks really scary inside an NMI handler.
Should I just go ahead and send a patch to remove this warning?
Or perhaps turn it into a pr_info()?

>
> Right now the scary message is justified because it resets the complete
> PMU. Perhaps need to be a bit more selective resetting on only
> the events that loop.
>
> > [Fri Jan 25 10:32:44 2019] WARNING: CPU: 1 PID: 0 at
> > arch/x86/events/intel/core.c:2440 intel_pmu_handle_irq+0x158/0x170
>
> This looks independent.
>
> I would apply the following patch (cut'n'pasted, so may need manual apply)
> and then run with
>

I would like to help, as we have been seeing this warning for a rather
long time, but unfortunately the reproducer provided by Ravi doesn't
trigger any warning or crash here. Maybe I am not using the right
hardware to trigger it?

[    0.132136] Performance Events: PEBS fmt2+, Broadwell events, 16-deep LBR, full-width counters, Intel PMU driver.
[    0.133003] ... version:                3
[    0.134001] ... bit width:              48
[    0.135001] ... generic registers:      4
[    0.136001] ... value mask:             0000ffffffffffff
[    0.137001] ... max period:             00007fffffffffff
[    0.138001] ... fixed-purpose events:   3
[    0.139001] ... event mask:             000000070000000f

Thanks!
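As a cross-check of the boot log above, the masks can be decoded to confirm which counters this machine exposes. This is only a minimal sketch: it assumes that generic counters occupy the low bits of the event mask, that fixed-purpose counters start at bit 32, and that the max period is one bit narrower than the counter width. The function name `decode_pmu_masks` and its parameters are made up for illustration and are not kernel APIs.

```python
def decode_pmu_masks(event_mask: int, value_mask: int, fixed_base: int = 32):
    """Decode the PMU masks printed at boot (illustrative helper, not a kernel API).

    Assumption: generic counters sit in bits [0, fixed_base) of the event
    mask and fixed-purpose counters start at bit `fixed_base`.
    """
    generic_bits = event_mask & ((1 << fixed_base) - 1)
    fixed_bits = event_mask >> fixed_base

    generic = bin(generic_bits).count("1")   # number of generic counters
    fixed = bin(fixed_bits).count("1")       # number of fixed-purpose counters

    bit_width = value_mask.bit_length()      # counter width in bits
    # Assumption: the largest programmable period is one bit narrower
    # than the counter itself.
    max_period = (1 << (bit_width - 1)) - 1

    return generic, fixed, bit_width, max_period


# Values copied from the boot log above.
generic, fixed, bit_width, max_period = decode_pmu_masks(
    0x000000070000000F, 0x0000FFFFFFFFFFFF
)
print(f"generic={generic} fixed={fixed} width={bit_width} "
      f"max_period={max_period:016x}")
```

With the values from the log this gives 4 generic registers, 3 fixed-purpose events, a 48-bit width and a max period of 00007fffffffffff, which agrees with the other lines the kernel printed.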