On 1/15/20 2:54 PM, Stefan Reiter wrote:
> A forum user reported that our kernel does not boot on Threadripper 3000
> series CPUs, unless 'mce=off' is provided on the kernel commandline. [0]
>
> This is a known issue, which has been fixed in mainline kernels and
> backported to 5.4, 4.19 and 4.14 [1]. It is not, however, included in
> 5.3, nor in the Ubuntu builds. [2]
>
> This patch is the original one posted for 5.5, which is the same as the
> one ported to 5.4. It also applies cleanly to 5.3, and should work the
> same, seeing as the backports to older versions do not have functional
> changes either.
>
> [0] https://forum.proxmox.com/threads/bug-pve-wont-boot-properly.63432/
> [1] https://patchwork.kernel.org/project/linux-edac/list/?q=Allow+Reserved+types+to+be+overwritten+in+smca_banks
> [2] https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/eoan/log/?qt=grep&q=Allow+Reserved+types+to+be+overwritten+in+smca_banks
>
> Signed-off-by: Stefan Reiter <s.rei...@proxmox.com>
> ---
>
> Not sure if we usually include fixes like that, but I feel like this could avoid
> a lot of Forum threads once TR 3000 gets more commonplace :)

I'd like to include this! It'd also be great to post it to the
ubuntu-kernel list, and/or maybe even the lkml-stable list for backporting
to the next stable release; they probably want this too.

>
>  ...w-Reserved-types-to-be-overwritten-i.patch | 88 +++++++++++++++++++
>  1 file changed, 88 insertions(+)
>  create mode 100644 patches/kernel/0006-x86-MCE-AMD-Allow-Reserved-types-to-be-overwritten-i.patch
>
> diff --git a/patches/kernel/0006-x86-MCE-AMD-Allow-Reserved-types-to-be-overwritten-i.patch b/patches/kernel/0006-x86-MCE-AMD-Allow-Reserved-types-to-be-overwritten-i.patch
> new file mode 100644
> index 0000000..6f49ff6
> --- /dev/null
> +++ b/patches/kernel/0006-x86-MCE-AMD-Allow-Reserved-types-to-be-overwritten-i.patch
> @@ -0,0 +1,88 @@
> +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
> +From: Yazen Ghannam <yazen.ghan...@amd.com>
> +Date: Thu, 21 Nov 2019 08:15:08 -0600
> +Subject: [PATCH] x86/MCE/AMD: Allow Reserved types to be overwritten in
> + smca_banks[]
> +
> +Each logical CPU in Scalable MCA systems controls a unique set of MCA
> +banks in the system. These banks are not shared between CPUs. The bank
> +types and ordering will be the same across CPUs on currently available
> +systems.
> +
> +However, some CPUs may see a bank as Reserved/Read-as-Zero (RAZ) while
> +other CPUs do not. In this case, the bank seen as Reserved on one CPU is
> +assumed to be the same type as the bank seen as a known type on another
> +CPU.
> +
> +In general, this occurs when the hardware represented by the MCA bank
> +is disabled, e.g. disabled memory controllers on certain models, etc.
> +The MCA bank is disabled in the hardware, so there is no possibility of
> +getting an MCA/MCE from it even if it is assumed to have a known type.
> +
> +For example:
> +
> +Full system:
> +  Bank  |  Type seen on CPU0  |  Type seen on CPU1
> +  ------------------------------------------------
> +   0    |         LS          |          LS
> +   1    |         UMC         |          UMC
> +   2    |         CS          |          CS
> +
> +System with hardware disabled:
> +  Bank  |  Type seen on CPU0  |  Type seen on CPU1
> +  ------------------------------------------------
> +   0    |         LS          |          LS
> +   1    |         UMC         |          RAZ
> +   2    |         CS          |          CS
> +
> +For this reason, there is a single, global struct smca_banks[] that is
> +initialized at boot time. This array is initialized on each CPU as it
> +comes online. However, the array will not be updated if an entry already
> +exists.
> +
> +This works as expected when the first CPU (usually CPU0) has all
> +possible MCA banks enabled. But if the first CPU has a subset, then it
> +will save a "Reserved" type in smca_banks[]. Successive CPUs will then
> +not be able to update smca_banks[] even if they encounter a known bank
> +type.
> +
> +This may result in unexpected behavior. Depending on the system
> +configuration, a user may observe issues enumerating the MCA
> +thresholding sysfs interface. The issues may be as trivial as sysfs
> +entries not being available, or as severe as system hangs.
> +
> +For example:
> +
> +  Bank  |  Type seen on CPU0  |  Type seen on CPU1
> +  ------------------------------------------------
> +   0    |         LS          |          LS
> +   1    |         RAZ         |          UMC
> +   2    |         CS          |          CS
> +
> +Extend the smca_banks[] entry check to return if the entry is a
> +non-reserved type. Otherwise, continue so that CPUs that encounter a
> +known bank type can update smca_banks[].
> +
> +Fixes: 68627a697c19 ("x86/mce/AMD, EDAC/mce_amd: Enumerate Reserved SMCA bank type")
> +Signed-off-by: Yazen Ghannam <yazen.ghan...@amd.com>
> +Signed-off-by: Borislav Petkov <b...@suse.de>
> +---
> + arch/x86/kernel/cpu/mce/amd.c | 2 +-
> + 1 file changed, 1 insertion(+), 1 deletion(-)
> +
> +diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> +index 6ea7fdc82f3c..08e09c8c269f 100644
> +--- a/arch/x86/kernel/cpu/mce/amd.c
> ++++ b/arch/x86/kernel/cpu/mce/amd.c
> +@@ -266,7 +266,7 @@ static void smca_configure(unsigned int bank, unsigned int cpu)
> + 	smca_set_misc_banks_map(bank, cpu);
> +
> + 	/* Return early if this bank was already initialized. */
> +-	if (smca_banks[bank].hwid)
> ++	if (smca_banks[bank].hwid && smca_banks[bank].hwid->hwid_mcatype != 0)
> + 		return;
> +
> + 	if (rdmsr_safe_on_cpu(cpu, MSR_AMD64_SMCA_MCx_IPID(bank), &low, &high)) {
> +--
> +2.20.1
> +

_______________________________________________
pve-devel mailing list
pve-devel@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
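
For anyone who wants to see the effect of the one-line change without
reading the kernel source, here is a rough userspace model of the last
example table in the quoted commit message (CPU0 sees bank 1 as RAZ, CPU1
sees UMC). It is only a sketch: the type and function names (bank_type,
hwid_model, configure_bank, final_bank_type) and the 'fixed' switch are
made up for illustration and are not the kernel's definitions. It also
assumes that a Reserved bank is stored as a non-NULL entry whose type
value is 0, which is what the patched condition suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's types; Reserved is assumed to be 0. */
enum bank_type { BANK_RESERVED = 0, BANK_LS, BANK_UMC, BANK_CS };

struct hwid_model {
	enum bank_type mcatype;
};

#define NUM_BANKS 3
#define NUM_CPUS  2

/* One descriptor per bank type, indexed by enum bank_type. */
static const struct hwid_model hwid_table[] = {
	{ BANK_RESERVED }, { BANK_LS }, { BANK_UMC }, { BANK_CS },
};

/*
 * Model of the early-return check: 'fixed' selects the patched condition,
 * otherwise the old one is used.
 */
static void configure_bank(const struct hwid_model **banks, int bank,
			   enum bank_type seen, int fixed)
{
	assert(bank >= 0 && bank < NUM_BANKS);

	if (fixed) {
		/* Patched: only a known (non-Reserved) entry blocks the update. */
		if (banks[bank] && banks[bank]->mcatype != BANK_RESERVED)
			return;
	} else {
		/* Old: any existing entry, even a Reserved one, blocks the update. */
		if (banks[bank])
			return;
	}

	banks[bank] = &hwid_table[seen];
}

/* Bring CPU0 and then CPU1 online and report the type recorded for 'bank'. */
static enum bank_type final_bank_type(int bank, int fixed)
{
	/* Rows are CPUs, columns are banks: CPU0 sees bank 1 as Reserved/RAZ. */
	static const enum bank_type seen[NUM_CPUS][NUM_BANKS] = {
		{ BANK_LS, BANK_RESERVED, BANK_CS },
		{ BANK_LS, BANK_UMC,      BANK_CS },
	};
	const struct hwid_model *banks[NUM_BANKS] = { NULL, NULL, NULL };

	assert(bank >= 0 && bank < NUM_BANKS);

	for (int cpu = 0; cpu < NUM_CPUS; cpu++)
		for (int b = 0; b < NUM_BANKS; b++)
			configure_bank(banks, b, seen[cpu][b], fixed);

	return banks[bank]->mcatype;
}
```

In this model, final_bank_type(1, 0) stays at BANK_RESERVED because CPU0's
entry is never replaced, while final_bank_type(1, 1) ends up as BANK_UMC.
Banks 0 and 2 are recorded identically with either condition.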