On 3/5/19 1:19 PM, Hans de Goede wrote:
> Hi,
>
> On 05-03-19 17:02, Hans de Goede wrote:
>> Hi,
>>
>> On 05-03-19 15:06, Lendacky, Thomas wrote:
>>> On 3/3/19 4:57 AM, Hans de Goede wrote:
>>>> Hi,
>>>>
>>>> On 21-02-19 13:30, Hans de Goede wrote:
>>>>> Hi,
>>>>>
>>>>> On 19-02-19 22:47, Lendacky, Thomas wrote:
>>>>>> On 2/19/19 3:01 PM, Thomas Gleixner wrote:
>>>>>>> Hans,
>>>>>>>
>>>>>>> On Tue, 19 Feb 2019, Hans de Goede wrote:
>>>>>>>
>>>>>>> Cc+: ACPI/AMD folks
>>>>>>>
>>>>>>>> Various people are reporting false positive "do_IRQ: #.55 No irq handler for vector"
>>>>>>>> messages on AMD ryzen based laptops, see e.g.:
>>>>>>>>
>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1551605
>>>>>>>>
>>>>>>>> Which contains this dmesg snippet:
>>>>>>>>
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smp: Bringing up secondary CPUs ...
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: x86: Booting SMP configuration:
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: .... node #0, CPUs: #1
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: do_IRQ: 1.55 No irq handler for vector
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: #2
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: do_IRQ: 2.55 No irq handler for vector
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: #3
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: do_IRQ: 3.55 No irq handler for vector
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smp: Brought up 1 node, 4 CPUs
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smpboot: Max logical packages: 1
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smpboot: Total of 4 processors activated (15968.49 BogoMIPS)
>>>>>>>>
>>>>>>>> It seems that we get an IRQ for each CPU as we bring it online,
>>>>>>>> which feels to me like it is some sorta false-positive.
>>>>>>>
>>>>>>> Sigh, that looks like BIOS value add again.
>>>>>>>
>>>>>>> It's not a false positive. Something _IS_ sending a vector 55 to these CPUs
>>>>>>> for whatever reason.
>>>>>>>
>>>>>>
>>>>>> I remember seeing something like this in the past and it turned out to be
>>>>>> a BIOS issue. BIOS was enabling the APs to interact with the legacy 8259
>>>>>> interrupt controller when only the BSP should. During POST the APs were
>>>>>> exposed to ExtINT/INTR events as a result of the mis-configuration
>>>>>> (probably due to a UEFI timer-tick using the 8259) and this left a pending
>>>>>> ExtINT/INTR interrupt latched on the APs.
>>>>>>
>>>>>> When the APs were started by the OS, the latched ExtINT/INTR interrupt was
>>>>>> processed shortly after the OS enabled interrupts. The AP then queries the
>>>>>> 8259 to identify the vector number (which is the value of the 8259's ICW2
>>>>>> register + the IRQ level). The master 8259's ICW2 was set to 0x30 and,
>>>>>> since no interrupts are actually pending, the 8259 will respond with IRQ7
>>>>>> (spurious interrupt) yielding a vector of 0x37 or 55.
>>>>>>
>>>>>> The OS was not expecting vector 55 and printed the message.
>>>>>>
>>>>>> From the Intel Developer's Manual: Vol 3a, Section 10.5.1:
>>>>>> "Only one processor in the system should have an LVT entry configured to
>>>>>> use the ExtINT delivery mode."
>>>>>>
>>>>>> Not saying this is the problem, but very well could be.
>>>>>
>>>>> That sounds like a likely candidate, esp. also since this only happens
>>>>> once per CPU when we first online the CPU.
>>>>>
>>>>> Can you provide me with a patch with some printk-s / pr_debugs to
>>>>> test for this, then I can build a kernel with that patch added and
>>>>> we can see if your hypothesis is right.
>>>>
>>>> Ping? I like your theory, can you provide some help with debugging this
>>>> further (to prove that your theory is correct)?
>>>
>>> It's been a very long time since I dealt with this and I was only on the
>>> periphery. You might be able to print the LVT entries from the APIC and
>>> see if any of them have an un-masked ExtINT delivery mode. You would need
>>> to do this very early before Linux modifies any values.
>>
>> I'm afraid I'm not familiar enough with the interrupt / APIC parts of
>> the kernel to do something like this myself.
>>
>>> Or you can report the issue to the OEM and have them check their BIOS
>>> code to see if they are doing this.
>>
>> I will try to go this route, but I'm not really hopeful that will
>> lead to a solution.
>
> A similar issue is also reported here:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1551605
>
> There are multiple people with different vectors (so likely / possibly
> different bugs) commenting on that bug, but I just got confirmation
> that the vector 55 issue is also happening on an Acer system with an AMD
> A8 processor (I suspect a Ryzen, but that still needs to be confirmed).
>
> So this seems to be a generic issue with (some) AMD laptops and
> not specific to one OEM.
I also see that comment 17 is for an Intel based machine, which to me
implies that it really is a BIOS issue.

Thanks,
Tom

>
> Regards,
>
> Hans