On Fri, May 23, 2014 at 4:57 AM, Chen Yucong <sla...@gmail.com> wrote:
> If (mca_cfg.tolerant == 2 || mce_cfg.tolerant == 3), what can you do for
> it?

Maybe we need to look again at the effects of "tolerant" - and maybe
specify what happens at various levels. There are some obvious silly
bits of code (picking one that is my fault):

	if (cfg->tolerant < 3) {
		if (no_way_out)
			mce_panic("Fatal machine check on current CPU", &m, msg);
		if (worst == MCE_AR_SEVERITY) {
			/* schedule action before return to userland */
			mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
			set_thread_flag(TIF_MCE_NOTIFY);
		} else if (kill_it) {
			force_sig(SIGBUS, current);
		}
	}

Why is the MCE_AR_SEVERITY recovery code not even attempted if tolerant
is >=3? That block of code dates back to before there were any
recoverable cases ... so the insane option of just ignoring the error
and hoping that the end result wasn't too bad made some sort of sense
when compared against a machine crash and not getting any answer at all.

Or one that Andi pointed out years ago (and had a fix in a tree for):

	if (order == 1) {
		/* CHECKME: Can this race with a parallel hotplug? */
		int cpus = num_online_cpus();

		/*
		 * Monarch: Wait for everyone to go through their scanning
		 * loops.
		 */
		while (atomic_read(&mce_executing) <= cpus) {

What if some cpus were offline when this machine check arrived? Our
"offline" code doesn't do anything to the h/w to prevent those cpus
from joining in the machine check fun. So we'll see more than
num_online_cpus() processors arrive to process the machine check.

Andi's fix was in the start of do_machine_check() and just had each cpu
that showed up check whether it was listed as "online" by Linux. If not,
it just cleared MCG_STATUS and returned. I didn't apply it because I
thought we needed to be a bit more robust (what if the offline cpu
actually did have a problem? ... we should at least check that
MCG_STATUS.RIPV=1 before rashly returning ... perhaps even more tests
are needed if the cpu had never been online at all).

So I'm happy that you are taking an interest in machine check code.
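
To make the offline-cpu case above concrete, here is a rough userspace
sketch of the early-exit check with the extra RIPV test added. It is an
illustration only, not Andi's patch and not kernel code: struct fake_cpu,
enum stray_action, handle_stray_mce() and stray_mce_selftest() are made-up
names, and the "MCG_STATUS" here is just a field in a struct rather than
the real MSR. Only the RIPV bit position (bit 0) is taken from the
hardware definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of IA32_MCG_STATUS: restart IP valid. */
#define MCG_STATUS_RIPV (1ULL << 0)

/* Hypothetical stand-in for per-cpu state; not a kernel structure. */
struct fake_cpu {
	bool online;		/* what Linux believes about this cpu */
	uint64_t mcgstatus;	/* simulated MCG_STATUS contents */
};

/* Hypothetical result codes for the early check. */
enum stray_action {
	STRAY_NOT_STRAY,	/* cpu is online: run the normal handler */
	STRAY_IGNORED,		/* offline, RIPV=1: status cleared, just return */
	STRAY_FATAL,		/* offline, RIPV=0: not safe to simply return */
};

/*
 * Early check for a cpu that shows up in the machine check handler.
 * An offline cpu is only allowed to clear its status and leave when
 * RIPV says it can restart where it was interrupted.
 */
enum stray_action handle_stray_mce(struct fake_cpu *cpu)
{
	if (cpu->online)
		return STRAY_NOT_STRAY;

	if (!(cpu->mcgstatus & MCG_STATUS_RIPV))
		return STRAY_FATAL;	/* leave status intact for whoever reports it */

	cpu->mcgstatus = 0;		/* models clearing MCG_STATUS */
	return STRAY_IGNORED;
}

/* Small self-check showing the three outcomes. */
void stray_mce_selftest(void)
{
	struct fake_cpu online = { .online = true, .mcgstatus = MCG_STATUS_RIPV };
	struct fake_cpu benign = { .online = false, .mcgstatus = MCG_STATUS_RIPV };
	struct fake_cpu broken = { .online = false, .mcgstatus = 0 };

	assert(handle_stray_mce(&online) == STRAY_NOT_STRAY);
	assert(handle_stray_mce(&benign) == STRAY_IGNORED);
	assert(benign.mcgstatus == 0);
	assert(handle_stray_mce(&broken) == STRAY_FATAL);
}
```

The sketch deliberately stops at the RIPV test. It does not try to cover
a cpu that has never been online, and what STRAY_FATAL should turn into
(panic or something gentler) is left open.
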
I think there are places where it can be made a lot better. I don't think
that moving where mces_seen gets cleared is one of those places.

-Tony