> -----Original Message----- > From: linux-edac-ow...@vger.kernel.org <linux-edac-ow...@vger.kernel.org> On > Behalf Of Borislav Petkov > Sent: Wednesday, April 10, 2019 12:26 PM > To: Ghannam, Yazen <yazen.ghan...@amd.com> > Cc: linux-e...@vger.kernel.org; linux-kernel@vger.kernel.org; > tony.l...@intel.com; x...@kernel.org > Subject: Re: [PATCH RESEND 2/5] x86/MCE: Handle MCA controls in a per_cpu way > > On Wed, Apr 10, 2019 at 04:58:12PM +0000, Ghannam, Yazen wrote: > > Yes, unused banks in the middle are counted in the MCG_CAP[Count] value. > > Good. > > > Okay, so you're saying the sysfs access should fail if a bank is > > disabled. Is that correct? > > Well, think about it. If a bank is not operational for whatever reason, > we should tell the user that. > > > Does "disabled" mean one or both of these? > > Unused = RAZ/WI in hardware > > Uninitialized = Not initialized by kernel due to quirks, etc. > > > > For an unused bank, it doesn't hurt to write MCA_CTL, but really > > there's no reason to do so and go through mce_restart(). > > Yes, but that bank is non-operational in some form. So we should prevent > all writes to it because, well, it is not going to do anything. And this > would be a good way to give feedback to the user that that is the case. > > > For an uninitialized bank, should we prevent users from overriding the > > kernel's settings? > > That all depends on the quirks. Whether we should allow them to be > overridden or not. I don't think we've ever thought about it, though. > > Let's look at one: > > if (c->x86_vendor == X86_VENDOR_AMD) { > if (c->x86 == 15 && cfg->banks > 4) { > /* > * disable GART TBL walk error reporting, which > * trips off incorrectly with the IOMMU & 3ware > * & Cerberus: > */ > clear_bit(10, (unsigned long *)&mce_banks[4].ctl); > > > Yah, so if the user reenables those GART errors, then she/he will see a > lot of MCEs reported and will maybe complain about it. And then we'll > say, but why did you enable them then. And she/he'll say: uh, didn't > know. Or, I was just poking at sysfs and this happened. > > Then we can say, well, don't do that then! :-) > > So my current position is, meh, who cares. But then I'm looking at > another quirk: > > if (c->x86_vendor == X86_VENDOR_INTEL) { > /* > * SDM documents that on family 6 bank 0 should not be written > * because it aliases to another special BIOS controlled > * register. > * But it's not aliased anymore on model 0x1a+ > * Don't ignore bank 0 completely because there could be a > * valid event later, merely don't write CTL0. > */ > > if (c->x86 == 6 && c->x86_model < 0x1A && cfg->banks > 0) > mce_banks[0].init = 0; > > > which basically prevents that bank from being reinitialized. So I guess > we have that functionality already - we simply need to pay attention to > w->init. > > Right?
Okay, I'm with you. So I'm thinking to add another patch to the set. This will set mce_bank.init=0 if we read MCA_CTL=0 from the hardware. Then we check if mce_bank.init=0 in the set/show functions and give a message if the bank is not used. How does that sound? Thanks, Yazen