On Fri, Nov 14, 2014 at 4:52 PM, Luck, Tony <tony.l...@intel.com> wrote:
>> causes Tony's MCE stress test to fail, presumably when some CPU either
>> becomes permanently non-interruptable or otherwise wanders off into
>> the weeds.
>
> It might be that recent "improvements" I made to my test harness have
> messed things up. I trimmed one delay (between injection and consumption),
> but it turns out the other delay in the code never get executed (because we
> take a SIGBUS on consumption and then longjmp). So my test that used
> to pause a bit between iterations were running almost back to back
> consumption and injection of next error.

Hmm. Am I right that the timeout code in mce.c is overly aggressive, too?

>
> This meant the serial console was a huge bottleneck (especially as my
> development BIOS is also kicking its own debug junk onto the same port).
> Some of the errors pointed obliquely at console.
>
> I've slowed things back down to where they used to be, and things are
> ticking along nicely (with 0.6 second delay between iterations). Just
> passed the 2800 mark and still going. I'm leaving it running over the
> weekend - if it makes it into the 50k level I'm willing to call it good.
>

Phew :)

FWIW, I've confirmed that my code survives int3 from userspace, int3
from normal kernel code, and int3 from kernel with user gs.

I'm not completely thrilled with what it does to double_fault, though.
If we somehow get a double fault caused by an interrupt hitting
userspace with a bad kernel_stack, then we'll end up page faulting in
the double_fault prologue. I'm not convinced that this is worth
worrying about. It would be easy enough to fix, though, even if it
would further uglify the code.

--Andy