On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
<chris.j.ar...@canonical.com> wrote:
>
> I've been able to repro with your patch and observed the WARN_ON when booting
> a VM on affected hardware and non affected hardware:

Ok, interesting. So the whole "we try to do an APIC ACK with the ISR
bit clear" seems to be a real issue.

> I modified the posted patch with the following:
> -       WARN_ON_ONCE(!(v & 1));
> +       WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);

Yes, makes sense, although I'm not sure what the vector translations
end up being. See below.

> And it showed vector = 1b when booting. However, when I run the reproducer on
> an affected machine I get the following WARNs before the hang:

Ok, so the boot-time thing I think happens because a device irq
happens but goes away immediately because the CPU that triggers it
also clears it immediately in the device initialization code, so it's
a level-triggered interrupt that goes away "on its own".

But vector 0x1b seems odd. I thought we mapped external interrupts to
0x20+ (FIRST_EXTERNAL_VECTOR). Ingo/Peter? Is there any sane interface
to look up the percpu apic vector data?

Chris, since this is repeatable for you, can you do

        int irq;
        irq = __this_cpu_read(vector_irq[vector]);

and print that out too? That *should* show the actual hardware irq,
although there are a few magic cases too (-1/-2 mean special things)

But the fact that you get the warning before the hang is much more
interesting.

> [   36.301299] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x93/0xa0()
> [   36.301301] ack_APIC_irq: vector = e1

Is this repeatable? Does it happen before *every* hang, or at least
often enough to be a good pattern?

> [   40.430533] ack_APIC_irq: vector = 22
>
> So vector = e1 then 22 before the hang.

Is it always the same ones? I assume that on different machines the
vector allocations would be different, but is it consistent on any
particular machine? That's assuming the whole warning is consistent at
all before the hang, of course.

> Anyway, maybe this sheds some more light on this issue. I can reproduce this
> at will, so let me know of other experiments to do.
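
For anybody following along, here is a rough userspace model of that
vector-to-irq lookup and of the extended warning text. It is only a
sketch, not kernel code: the table, the helper names and the irq number
45 used in the note below are hypothetical, and the names given to the
-1/-2 values are an assumption about what those "magic cases" are
called.

```c
#include <assert.h>
#include <stdio.h>

#define NR_VECTORS          256
#define NR_CPUS_MODEL       2     /* hypothetical: tiny model, not the kernel's NR_CPUS */
#define VECTOR_UNDEFINED    (-1)  /* assumption: the "-1" magic value */
#define VECTOR_RETRIGGERED  (-2)  /* assumption: the "-2" magic value */

/* Userspace stand-in for the per-CPU vector_irq[] table. */
static int vector_irq_model[NR_CPUS_MODEL][NR_VECTORS];

/* Mark every vector on every modelled CPU as unassigned. */
static void init_model(void)
{
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		for (int v = 0; v < NR_VECTORS; v++)
			vector_irq_model[cpu][v] = VECTOR_UNDEFINED;
}

/* Record that 'vector' on 'cpu' currently delivers 'irq'. */
static void set_irq(int cpu, unsigned int vector, int irq)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	assert(vector < NR_VECTORS);
	vector_irq_model[cpu][vector] = irq;
}

/* Models irq = __this_cpu_read(vector_irq[vector]) for one CPU. */
static int lookup_irq(int cpu, unsigned int vector)
{
	if (cpu < 0 || cpu >= NR_CPUS_MODEL || vector >= NR_VECTORS)
		return VECTOR_UNDEFINED;
	return vector_irq_model[cpu][vector];
}

/* Builds the warning text with the looked-up irq appended. */
static int format_ack_warning(char *buf, size_t len, int cpu,
			      unsigned int vector)
{
	return snprintf(buf, len, "ack_APIC_irq: vector = %x, irq = %d",
			vector, lookup_irq(cpu, vector));
}
```

The table is per CPU, so the same vector number can map to different
irqs (or to nothing) on different CPUs. In this model, after
set_irq(0, 0xe1, 45), looking up 0xe1 on CPU 1 still gives -1.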

Somebody else who knows the apic needs to also take a look, but I'd
love to hear what the actual hardware interrupt is (from that
"vector_irq[vector]" thing above).

I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they are
0xf0-0xff), so it sounds like an external one. But that then requires
the whole mapping table thing.

Ingo/Peter/Jiang - is there anything else useful we could print out? I
worry about the irq movement code. Can we add printk's to when an irq
is chasing from one CPU to another and doing that "move_in_progress"
thing? I've always been scared of that code.

                         Linus
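
To spell out the vector ranges used in the reasoning above, here is a
small self-contained sketch that sorts the three vectors seen in this
thread (0x1b, 0x22, 0xe1) into ranges. The 0x20 boundary is
FIRST_EXTERNAL_VECTOR as named above. The 0xf0 boundary is only taken
from the "0xf0-0xff" remark, and FIRST_SYSTEM_VECTOR_MODEL, the enum
and the helper names are made up for illustration.

```c
#include <assert.h>

#define FIRST_EXTERNAL_VECTOR      0x20  /* external interrupts start here */
#define FIRST_SYSTEM_VECTOR_MODEL  0xf0  /* assumption: from "0xf0-0xff" in the mail */

enum vector_class {
	VEC_EXCEPTION,  /* below 0x20: range reserved for CPU exceptions */
	VEC_EXTERNAL,   /* 0x20 up to 0xef: device interrupts via the mapping table */
	VEC_SYSTEM,     /* 0xf0 to 0xff: hardcoded SMP vectors (per the mail) */
	VEC_INVALID     /* not an 8-bit vector at all */
};

/* Sort a vector number into one of the ranges above. */
static enum vector_class classify_vector(unsigned int vector)
{
	if (vector > 0xff)
		return VEC_INVALID;
	if (vector < FIRST_EXTERNAL_VECTOR)
		return VEC_EXCEPTION;
	if (vector >= FIRST_SYSTEM_VECTOR_MODEL)
		return VEC_SYSTEM;
	return VEC_EXTERNAL;
}

/* Check the three vectors reported in this thread against the ranges. */
static void check_observed_vectors(void)
{
	assert(classify_vector(0x1b) == VEC_EXCEPTION); /* the odd boot-time one */
	assert(classify_vector(0x22) == VEC_EXTERNAL);
	assert(classify_vector(0xe1) == VEC_EXTERNAL);
}
```

By this split 0x22 and 0xe1 both need the per-CPU mapping table to get
back to a hardware irq, while 0x1b sits below the external range,
which is why it looks odd.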