>> mce_reign, which is only called by monarch CPU, can be used for system
>> panics as quickly as possible if there is a truly data corrupting error.
>> But Monarch CPU don't have to help all other CPU to clean mces_clean.
>> One advantage of Per-CPU is the isolation of errors propagation, being
>> so, why do not we clean mces_seen by Per-CPU?
>
> What kind of error propagations are you expecting/concerning here?
> Could you explain the problem more in detail?

Please do give us more detail on the scenario in which you see your new
version behaving better.

I'm sure the current code has no races w.r.t. clearing mces_seen. The
monarch clears them all in mce_reign() before clearing mce_executing at
the foot of mce_end() and allowing the others to run again.

Your code has the monarch release all the other cpus from the spinloop
in mce_end(), so they will all rush together through the final lines of
do_machine_check(). Some of them will have work to do if they saw
errors: they may have to send signals, or log the error. Others can fly
directly to the end of do_machine_check(), clear MCG_STATUS, and return
to executing whatever code was interrupted.

So it is possible that some processors will be out doing things that can
generate another machine check before others have finished their tasks
and got to the point of clearing mces_seen. (*)

-Tony

(*) Maybe that doesn't matter because they haven't zeroed MCG_STATUS
yet, so this second machine check will force those cpus to shut down.
See the MCIP description in section "15.3.1.2 IA32_MCG_STATUS MSR" of
the Software Developer's Manual.
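
To make the ordering concern above concrete, here is a rough toy model in
plain C. It is only a sketch and not kernel code: `struct sim`, `NCPUS`,
`has_work` and the two `stale_after_release_*()` helpers are hypothetical
names invented for illustration, and the real mces_seen is per-CPU data
rather than a flat array of flags. Each helper returns how many mces_seen
entries are still set at the moment the first cpu could be back running
interrupted code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCPUS 4 /* arbitrary size for the toy model */

/* Hypothetical stand-in for kernel state; not the real per-CPU data. */
struct sim {
    bool mces_seen[NCPUS]; /* entry left over from this rendezvous */
    bool has_work[NCPUS];  /* cpu must log or signal before it exits */
};

/* Start with every cpu having recorded something in mces_seen. */
static void sim_init(struct sim *s, const bool has_work[NCPUS])
{
    assert(s != NULL);
    for (size_t cpu = 0; cpu < NCPUS; cpu++) {
        s->mces_seen[cpu] = true;
        s->has_work[cpu] = has_work[cpu];
    }
}

/* Count the entries that are still set. */
static size_t count_stale(const struct sim *s)
{
    size_t n = 0;

    for (size_t cpu = 0; cpu < NCPUS; cpu++)
        n += s->mces_seen[cpu] ? 1 : 0;
    return n;
}

/* Current ordering: the monarch clears every entry, then releases. */
static size_t stale_after_release_monarch(struct sim *s)
{
    for (size_t cpu = 0; cpu < NCPUS; cpu++)
        s->mces_seen[cpu] = false;
    /* The other cpus would only be released at this point. */
    return count_stale(s);
}

/*
 * Proposed ordering: release first, each cpu clears its own entry on
 * the way out. Cpus with no work reach the end first.
 */
static size_t stale_after_release_percpu(struct sim *s)
{
    for (size_t cpu = 0; cpu < NCPUS; cpu++)
        if (!s->has_work[cpu])
            s->mces_seen[cpu] = false;
    /* Idle cpus are now back in interrupted code; busy ones are not. */
    return count_stale(s);
}
```

With a work pattern of { true, false, true, false }, the monarch variant
returns 0 and the per-cpu variant returns 2: two entries are still set
while two cpus are already free to take another machine check. The model
deliberately ignores MCIP in MCG_STATUS, which the footnote suggests may
make this window harmless.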