On Tue, Apr 02, 2019 at 03:03:02PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 01, 2019 at 09:46:33PM +0000, Lendacky, Thomas wrote:
> > This patch series addresses issues with increased NMI latency in newer
> > AMD processors that can result in unknown NMI messages when PMC counters
> > are active.
> >
> > The following fixes are included in this series:
> >
> > - Resolve a race condition when disabling an overflowed PMC counter,
> >   specifically when updating the PMC counter with a new value.
> > - Resolve handling of active PMC counter overflows in the perf NMI
> >   handler and when to report that the NMI is not related to a PMC.
> > - Remove earlier workaround for spurious NMIs by re-ordering the
> >   PMC stop sequence to disable the PMC first and then remove the PMC
> >   bit from the active_mask bitmap. As part of disabling the PMC, the
> >   code will wait for an overflow to be reset.
> >
> > The last patch re-works the order of when the PMC is removed from the
> > active_mask. There was a comment from a long time ago about having
> > to clear the bit in active_mask before disabling the counter because
> > the perf NMI handler could re-enable the PMC again. Looking at the
> > handler today, I don't see that as possible, hence the reordering. The
> > question will be whether the Intel PMC support will now have issues.
> > There is still support for using x86_pmu_handle_irq() in the Intel
> > core.c file. Did Intel have any issues with spurious NMIs in the past?
> > Peter Z, any thoughts on this?
>
> I can't remember :/ I suppose we'll see if anything pops up after these
> here patches. At least then we get a chance to properly document things.
>
> > Also, I couldn't completely get rid of the "running" bit because it
> > is used by arch/x86/events/intel/p4.c. An old commit comment that
> > seems to indicate the p4 code suffered the spurious interrupts:
> > 03e22198d237 ("perf, x86: Handle in flight NMIs on P4 platform").
> > So maybe that partially answers my previous question...
>
> Yeah, the P4 code is magic, and I don't have any such machines left, nor
> do I think does Cyrill who wrote much of that.

It was so long ago :) What I remember off the top of my head is that some
of the counters were broken at the hardware level, so I had to use only
one counter instead of the two present in the system. And there were
spurious NMIs too. I think we can move this "running" bit to a per-cpu
variable declared inside the p4 code only, and so get rid of it in
cpu_hw_events?

> I have vague memories of the P4 thing crashing with Vince's perf_fuzzer,
> but maybe I'm wrong.

No, you're correct. p4 was crashing many times before we managed to make
it more or less stable. The main problem, though, is that finding a
working p4 box is really hard.

> Ideally we'd find a willing victim to maintain that thing, or possibly
> just delete it, dunno if anybody still cares.

As for me, I would rather mark this p4pmu code as deprecated until there
is a *real* need to support it.

>
> Anyway, I like these patches, but I cannot apply since you send them
> base64 encoded and my script chokes on that.