On Mon, Nov 17, 2014 at 3:16 PM, Luck, Tony <tony.l...@intel.com> wrote: >> I still wonder whether the timeout code is the real culprit. My patch >> will slow down entry into do_machine_check by tens of cycles, several >> cachelines, and possibly a couple of TLB misses. Given that the >> timing seemed marginal to me, it's possible (albeit not that likely) >> that it pushed the time needed for synchronization into the range of >> unreliability. > > I think we have very conservative timeouts. Here are the significant bits > from mce > > First SPINUNIT ... how many nanoseconds to spin "ndelay(SPINUNIT)" before > pounding on an atomic op to see if we are done: > #define SPINUNIT 100 /* 100ns */ > > Initialization of our timeout - comment above this claims we are aiming for > "one second" > > cfg->monarch_timeout = USEC_PER_SEC; > > Inside mce_start() - set up our timeout. Math says this will be 1e9 > > u64 timeout = (u64)mca_cfg.monarch_timeout * NSEC_PER_USEC;
Aha. I missed the extra multiplication. > > Now the actual loop: > while (atomic_read(&mce_callin) != cpus) { > if (mce_timed_out(&timeout)) { > atomic_set(&global_nwo, 0); > return -1; > } > ndelay(SPINUNIT); > } And I missed the ndelay. *sigh* reading comprehension wasn't so good there. >From a debugging POV, it would be kind of nice to know which of the four mce_timed_out calls is reporting the failure. IIUC, the first one would mean that a cpu didn't make it in to do_machine_check at all, whereas the second one would mean that a cpu got lost after entering do_machine_check. Also, the second one presumably knows how far it got (i.e. mce_executing and order). It could also be interesting to tweak mce_panic to not actually panic the machine but to try to return and stop the test instead. Then real debugging could be possible :) I still wish I could test this. The mce-inject code is completely useless here, since it calls do_machine_check directly instead of going through the MCE code. > > And the inside of mce_timed_out() ... we decrement our 1e9 by 100 for each > call: > > if ((s64)*t < SPINUNIT) { > if (mca_cfg.tolerant <= 1) > mce_panic("Timeout synchronizing machine check over > CPUs", > NULL, NULL); > cpu_missing = 1; > return 1; > } > *t -= SPINUNIT; > > Not sure what the worst case is for waiting for other processors to show up. > Perhaps it is > when the other cpu is in some deep C-state and we have to wait for it to > power the core > back on and resync clock. Maybe it is when a cpu has just started a "wbinvd" > instruction > with a cache entirely full of dirty lines that belong on some other processor > so have to > be written back across QPI (may when QPI links are congested with 10gbit > network traffic? > > But a full second seems like a lot ... you adding a handful of cache and TLB > misses shouldn't > push us over the edge. Agreed. > > -Tony > > -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/