On Wed, Jan 06, 2021 at 09:41:02AM -0800, Paul E. McKenney wrote:
> The "Timeout: Not all CPUs entered broadcast exception handler" message
> will appear from time to time given enough systems, but this message does
> not identify which CPUs failed to enter the broadcast exception handler.
> This information would be valuable if available, for example, in order to
> correlated with other hardware-oriented error messages.

Because you're expecting that the CPUs which have not entered the
exception handler might have stuck earlier and that's the correlation
there?

> This commit

That's a tautology. :)

> therefore maintains a cpumask_t of CPUs that have entered this handler,
> and prints out which ones failed to enter in the event of a timeout.
> Build-tested only.
>
> Cc: Tony Luck <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: <[email protected]>
> Cc: <[email protected]>
> Reported-by: Jonathan Lemon <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 13d3f1c..44d2b99 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -878,6 +878,12 @@ static atomic_t mce_executing;
>  static atomic_t mce_callin;
>
>  /*
> + * Track which CPUs entered and not in order to print holdouts.
> + */
> +static cpumask_t mce_present_cpus;
> +static cpumask_t mce_missing_cpus;
> +
> +/*
>   * Check if a timeout waiting for other CPUs happened.
>   */
>  static int mce_timed_out(u64 *t, const char *msg)
> @@ -894,8 +900,12 @@ static int mce_timed_out(u64 *t, const char *msg)
>  	if (!mca_cfg.monarch_timeout)
>  		goto out;
>  	if ((s64)*t < SPINUNIT) {
> -		if (mca_cfg.tolerant <= 1)
> +		if (mca_cfg.tolerant <= 1) {
> +			if (!cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
> +				pr_info("%s: MCE holdout CPUs: %*pbl\n",
> +					__func__, cpumask_pr_args(&mce_missing_cpus));
>  			mce_panic(msg, NULL, NULL);
> +		}
>  		cpu_missing = 1;
>  		return 1;
>  	}
> @@ -1006,6 +1016,7 @@ static int mce_start(int *no_way_out)
>  	 * is updated before mce_callin.
>  	 */
>  	order = atomic_inc_return(&mce_callin);

Doesn't a single mce_callin_mask suffice?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
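For illustration only, here is a minimal user-space sketch of one way to read
the single-mask question above: keep just one mask of the CPUs that called in,
and derive the holdouts on the fly at timeout instead of storing a second
mask. Everything in it is an assumption made for the example, not kernel code:
mask_t is a plain 64-bit stand-in for cpumask_t, and mark_callin() and
holdouts() are hypothetical helpers, not existing kernel functions.

```c
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: one bit per CPU, at most 64 CPUs. */
typedef uint64_t mask_t;

/*
 * Record that @cpu has entered the handler by setting its bit in the
 * single call-in mask. Out-of-range CPU numbers are ignored.
 */
static mask_t mark_callin(mask_t callin, unsigned int cpu)
{
	if (cpu >= 64)
		return callin;
	return callin | ((mask_t)1 << cpu);
}

/*
 * Compute the holdouts at timeout: CPUs that are online but never
 * called in. No second persistent mask is needed for this.
 */
static mask_t holdouts(mask_t online, mask_t callin)
{
	return online & ~callin;
}
```

With CPUs 0-3 online and only CPUs 0 and 2 having called in, holdouts()
leaves bits 1 and 3 set, which are the CPUs a timeout message would report.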

