> -----Original Message-----
> From: linux-edac-ow...@vger.kernel.org <linux-edac-ow...@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Monday, March 11, 2019 1:21 PM
> To: Ghannam, Yazen <yazen.ghan...@amd.com>
> Cc: linux-e...@vger.kernel.org; Borislav Petkov <b...@suse.de>; Tony Luck <tony.l...@intel.com>; x...@kernel.org; linux-ker...@vger.kernel.org; ra...@milecki.pl; cle...@gmail.com
> Subject: Re: [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA errors on some Family 17h models
>
> On Thu, Mar 07, 2019 at 09:26:04PM +0000, Ghannam, Yazen wrote:
> > +static bool smca_filter_mce(struct mce *m)
> > +{
> > +	enum smca_bank_types bank_type = smca_get_bank_type(m->bank);
> > +	struct cpuinfo_x86 *c = &boot_cpu_data;
> > +	u8 xec = XEC(m->status, xec_mask);
> > +
> > +	/*
> > +	 * Spurious errors of this type may be reported.
> > +	 * See Family 17h Models 10h-2Fh Erratum #1114.
> > +	 */
> > +	if (c->x86 == 0x17 &&
> > +	    (c->x86_model >= 0x10 && c->x86_model <= 0x2F) &&
> > +	    bank_type == SMCA_IF && xec == 10)
> > +		return true;
>
> This is happening too late and we need it much earlier, from Rafal's dmesg:
>
> [ 1.070855] mce: [Hardware Error]: Machine check events logged
> [ 1.070860] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: d8200000000a0151
> [ 1.070863] mce: [Hardware Error]: TSC 73fa0765c MISC d01b0fff00000000 SYND 4a000000 IPID 100b000000000
> [ 1.071065] mce: [Hardware Error]: PROCESSOR 2:810f10 TIME 1543481411 SOCKET 0 APIC 2 microcode 810100b
>
> that's __print_mce() from the notifier.
>
> So we'd need a filter function which is called in do_machine_check() and
> machine_check_poll() right after we've collected enough info to be able
> to filter out the MCE based on the signature. In this case the extended
> error core and SMCA bank type suffices but we should put those functions
> late enough so that they can be used for other filtering later.
>
Okay, understood. Should I keep the filter in edac_mce_amd? I guess it's not necessary if the error is filtered out earlier.

> Alternatively, if this error type has a special bit in the mask registers so
> that you can disable it there ala
>
> 	if (c->x86_vendor == X86_VENDOR_AMD) {
> 		if (c->x86 == 15 && cfg->banks > 4) {
> 			/*
> 			 * disable GART TBL walk error reporting, which
> 			 * trips off incorrectly with the IOMMU & 3ware
> 			 * & Cerberus:
> 			 */
> 			clear_bit(10, (unsigned long *)&mce_banks[4].ctl);
>
>
> that would be even better but I'd guess it doesn't have a special bit...
>

Yes, that's right. Clearing a bit in MCA_CTL is not recommended in this case.

Thanks,
Yazen
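
For illustration only, the signature-based filtering discussed in this thread could be sketched in plain userspace C as below. This is not the kernel patch: every `sketch_*` name and the `struct sketch_mce` type are hypothetical stand-ins, and the sketch assumes the extended error code sits in MCA_STATUS[21:16] (a 0x3f mask after shifting by 16). The point it tries to show is the ordering: the filter runs in the handling path before anything is logged, so a matching error never reaches the notifier that prints it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's SMCA bank types and struct mce. */
enum sketch_bank_type { SKETCH_BANK_IF, SKETCH_BANK_OTHER };

struct sketch_mce {
	uint64_t status;            /* raw MCA_STATUS value */
	enum sketch_bank_type type; /* type of the bank that logged the error */
	uint8_t family;             /* CPU family, e.g. 0x17 */
	uint8_t model;              /* CPU model within the family */
};

/* Assumption: the extended error code is MCA_STATUS[21:16]. */
static uint8_t sketch_xec(uint64_t status)
{
	return (uint8_t)((status >> 16) & 0x3f);
}

/*
 * Return true when the error matches the signature from the quoted patch:
 * Family 17h, Models 10h-2Fh, IF bank, extended error code 10.
 */
static bool sketch_filter_mce(const struct sketch_mce *m)
{
	return m->family == 0x17 &&
	       m->model >= 0x10 && m->model <= 0x2f &&
	       m->type == SKETCH_BANK_IF &&
	       sketch_xec(m->status) == 10;
}

/*
 * Stand-in for the polling/handler path: the filter is applied before the
 * error is counted as logged. Returns true if the error was logged.
 */
static bool sketch_handle_mce(const struct sketch_mce *m, unsigned int *logged)
{
	if (sketch_filter_mce(m))
		return false;

	(*logged)++;
	return true;
}
```

With the status value from the quoted dmesg (d8200000000a0151), bits 21:16 decode to 10, so an error with that status from an IF bank on an affected model would be dropped by `sketch_handle_mce()` rather than logged.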