Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
> On 1 Apr 2019, at 20:28, Vitaly Kuznetsov wrote: > > Liran Alon writes: > >>> On 1 Apr 2019, at 18:58, Vitaly Kuznetsov wrote: >>> >>> Liran Alon writes: >>> > On 1 Apr 2019, at 16:36, Vitaly Kuznetsov wrote: > > It was found that Hyper-V 2016 on KVM in some configurations (q35 machine > + > piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion > that > it is mishandling level-triggered interrupt performing EOI without fixing > the root cause. I would rephrase as: It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + piix4-usb-uhci) hangs on boot. Root-cause was that one of Hyper-V level-triggered interrupt handler performs EOI before fixing the root-cause. This results in IOAPIC keep re-raising the level-triggered interrupt after EOI because irq-line remains asserted. >>> >>> Ok, thanks for the suggestion. >>> > This causes immediate re-assertion and L2 VM (which is > supposedly expected to fix the cause of the interrupt) is not making any > progress. I don’t know why you assume this. From the trace we have examined, it seems that the EOI is performed by Hyper-V and not it’s guest This means that the handler for this level-triggered interrupt is on Hyper-V and not it’s guest. >>> >>> If you let it run (with e.g. this patch or by setting preemtion timer > >>> 0) you'll see that MMIO write fixing the cause of the interrupt is >>> happening from L2: >>> >>> (qemu) info pci: >>> >>> Bus 0, device 4, function 0: >>> USB controller: PCI device 8086:7112 >>> PCI subsystem 1af4:1100 >>> IRQ 23. >>> BAR4: I/O at 0x6060 [0x607f]. >>> id "" >>> >>> ... >>> 538597.212494: kvm_exit: reason VMRESUME rip 0xf80004250115 >>> info 0 0 >>> 538597.212499: kvm_entry:vcpu 0 >>> 538597.212506: kvm_exit: reason IO_INSTRUCTION rip >>> 0xf80e02ac6a27 info 60620009 0 >>> 538597.212507: kvm_nested_vmexit:rip f80e02ac6a27 reason >>> IO_INSTRUCTION info1 60620009 info2 0 int_info 0 int_info_err 0 >>> 538597.212509: kvm_fpu: unload >>> 538597.212511: kvm_userspace_exit: reason KVM_EXIT_IO (2) >>> 538597.212516: kvm_fpu: load >>> 538597.212518: kvm_pio: pio_read at 0x6062 size 2 count 1 val >>> 0x1 >>> 538597.212519: kvm_entry:vcpu 0 >>> 538597.212523: kvm_exit: reason IO_INSTRUCTION rip >>> 0xf80e02ac6a61 info 60640009 0 >>> 538597.212523: kvm_nested_vmexit:rip f80e02ac6a61 reason >>> IO_INSTRUCTION info1 60640009 info2 0 int_info 0 int_info_err 0 >>> 538597.212524: kvm_fpu: unload >>> 538597.212525: kvm_userspace_exit: reason KVM_EXIT_IO (2) >>> 538597.212528: kvm_fpu: load >>> 538597.212528: kvm_pio: pio_read at 0x6064 size 2 count 1 val >>> 0xf >>> ... >>> >>> and this happens after EOI from L1. >> >> I see that the L2 guest is doing I/O read to the device BAR4 but do these >> reads lower the irq-line? >> I would expect a write to lower the irq-line. >> >> Looking at uhci_port_read(), it seems that offset 0x02 and 0x04 just return >> a value. Doesn’t lower irq-line. >> (Even though offset 0x04 returns the "interrupt enable register”). >> In contrast, looking at uhci_port_write(), it seems that writing to either >> offset 0x02 or 0x04 could lower the irq-line. >> So you should look for pio_write to port 0x6062 or 0x6064 to see who >> is actually responsible for lowering the irq-line. > > Sorry, I probably like trimming traces too much. Writes happen too: > > [005] 538597.212532: kvm_exit: reason IO_INSTRUCTION rip > 0xf80e02ac6a8f info 60620001 0 > [005] 538597.212533: kvm_nested_vmexit:rip f80e02ac6a8f reason > IO_INSTRUCTION info1 60620001 info2 0 int_info 0 int_info_err 0 > [005] 538597.212534: kvm_pio: pio_write at 0x6062 size 2 count 1 > val 0x1 This clears bit UHCI_STS_USBINT from “status” register and zero “status2” register. > [005] 538597.212534: kvm_fpu: unload > [005] 538597.212535: kvm_userspace_exit: reason KVM_EXIT_IO (2) > [005] 538597.212543: kvm_fpu: load > [005] 538597.212544: kvm_entry:vcpu 0 > [005] 538597.212547: kvm_exit: reason IO_INSTRUCTION rip > 0xf80e02ac6a9c info 60640001 0 > [005] 538597.212548: kvm_nested_vmexit:rip f80e02ac6a9c reason > IO_INSTRUCTION info1 60640001 info2 0 int_info 0 int_info_err 0 > [005] 538597.212548: kvm_pio: pio_write at 0x6064 size 2 count 1 > val 0x0 This sets 0 in the “intr” register (Interrupt enable register). At this point it is indeed likely that uhci_update_irq() will lower the irq-line. I agree. :) > [005] 538597.212549: kvm_fpu: unload > [005] 538597.212550: kvm_userspace_exit: reason KVM_EXIT_IO (2) > > and this likely lowers the line. OK
Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
Liran Alon writes: >> On 1 Apr 2019, at 18:58, Vitaly Kuznetsov wrote: >> >> Liran Alon writes: >> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov wrote: It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that it is mishandling level-triggered interrupt performing EOI without fixing the root cause. >>> >>> I would rephrase as: >>> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + >>> piix4-usb-uhci) hangs on boot. >>> Root-cause was that one of Hyper-V level-triggered interrupt handler >>> performs EOI before fixing the root-cause. >>> This results in IOAPIC keep re-raising the level-triggered interrupt >>> after EOI because irq-line remains asserted. >> >> Ok, thanks for the suggestion. >> >>> This causes immediate re-assertion and L2 VM (which is supposedly expected to fix the cause of the interrupt) is not making any progress. >>> >>> I don’t know why you assume this. >>> From the trace we have examined, it seems that the EOI is performed by >>> Hyper-V and not it’s guest >>> This means that the handler for this level-triggered interrupt is on >>> Hyper-V and not it’s guest. >> >> If you let it run (with e.g. this patch or by setting preemtion timer > >> 0) you'll see that MMIO write fixing the cause of the interrupt is >> happening from L2: >> >> (qemu) info pci: >> >> Bus 0, device 4, function 0: >>USB controller: PCI device 8086:7112 >> PCI subsystem 1af4:1100 >> IRQ 23. >> BAR4: I/O at 0x6060 [0x607f]. >> id "" >> >> ... >> 538597.212494: kvm_exit: reason VMRESUME rip 0xf80004250115 >> info 0 0 >> 538597.212499: kvm_entry:vcpu 0 >> 538597.212506: kvm_exit: reason IO_INSTRUCTION rip >> 0xf80e02ac6a27 info 60620009 0 >> 538597.212507: kvm_nested_vmexit:rip f80e02ac6a27 reason >> IO_INSTRUCTION info1 60620009 info2 0 int_info 0 int_info_err 0 >> 538597.212509: kvm_fpu: unload >> 538597.212511: kvm_userspace_exit: reason KVM_EXIT_IO (2) >> 538597.212516: kvm_fpu: load >> 538597.212518: kvm_pio: pio_read at 0x6062 size 2 count 1 val >> 0x1 >> 538597.212519: kvm_entry:vcpu 0 >> 538597.212523: kvm_exit: reason IO_INSTRUCTION rip >> 0xf80e02ac6a61 info 60640009 0 >> 538597.212523: kvm_nested_vmexit:rip f80e02ac6a61 reason >> IO_INSTRUCTION info1 60640009 info2 0 int_info 0 int_info_err 0 >> 538597.212524: kvm_fpu: unload >> 538597.212525: kvm_userspace_exit: reason KVM_EXIT_IO (2) >> 538597.212528: kvm_fpu: load >> 538597.212528: kvm_pio: pio_read at 0x6064 size 2 count 1 val >> 0xf >> ... >> >> and this happens after EOI from L1. > > I see that the L2 guest is doing I/O read to the device BAR4 but do these > reads lower the irq-line? > I would expect a write to lower the irq-line. > > Looking at uhci_port_read(), it seems that offset 0x02 and 0x04 just return a > value. Doesn’t lower irq-line. > (Even though offset 0x04 returns the "interrupt enable register”). > In contrast, looking at uhci_port_write(), it seems that writing to either > offset 0x02 or 0x04 could lower the irq-line. > So you should look for pio_write to port 0x6062 or 0x6064 to see who > is actually responsible for lowering the irq-line. Sorry, I probably like trimming traces too much. Writes happen too: [005] 538597.212532: kvm_exit: reason IO_INSTRUCTION rip 0xf80e02ac6a8f info 60620001 0 [005] 538597.212533: kvm_nested_vmexit:rip f80e02ac6a8f reason IO_INSTRUCTION info1 60620001 info2 0 int_info 0 int_info_err 0 [005] 538597.212534: kvm_pio: pio_write at 0x6062 size 2 count 1 val 0x1 [005] 538597.212534: kvm_fpu: unload [005] 538597.212535: kvm_userspace_exit: reason KVM_EXIT_IO (2) [005] 538597.212543: kvm_fpu: load [005] 538597.212544: kvm_entry:vcpu 0 [005] 538597.212547: kvm_exit: reason IO_INSTRUCTION rip 0xf80e02ac6a9c info 60640001 0 [005] 538597.212548: kvm_nested_vmexit:rip f80e02ac6a9c reason IO_INSTRUCTION info1 60640001 info2 0 int_info 0 int_info_err 0 [005] 538597.212548: kvm_pio: pio_write at 0x6064 size 2 count 1 val 0x0 [005] 538597.212549: kvm_fpu: unload [005] 538597.212550: kvm_userspace_exit: reason KVM_EXIT_IO (2) and this likely lowers the line. I honestly have no idea how this all works on real hw but the comment in kernel ioapic says something about non-immediate delivery of the reasserted interrupt. True or not, this gives me some peace of mind :-) -- Vitaly
Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
> On 1 Apr 2019, at 18:58, Vitaly Kuznetsov wrote: > > Liran Alon writes: > >>> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov wrote: >>> >>> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + >>> piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that >>> it is mishandling level-triggered interrupt performing EOI without fixing >>> the root cause. >> >> I would rephrase as: >> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + >> piix4-usb-uhci) hangs on boot. >> Root-cause was that one of Hyper-V level-triggered interrupt handler >> performs EOI before fixing the root-cause. >> This results in IOAPIC keep re-raising the level-triggered interrupt >> after EOI because irq-line remains asserted. > > Ok, thanks for the suggestion. > >> >>> This causes immediate re-assertion and L2 VM (which is >>> supposedly expected to fix the cause of the interrupt) is not making any >>> progress. >> >> I don’t know why you assume this. >> From the trace we have examined, it seems that the EOI is performed by >> Hyper-V and not it’s guest >> This means that the handler for this level-triggered interrupt is on >> Hyper-V and not it’s guest. > > If you let it run (with e.g. this patch or by setting preemtion timer > > 0) you'll see that MMIO write fixing the cause of the interrupt is > happening from L2: > > (qemu) info pci: > > Bus 0, device 4, function 0: >USB controller: PCI device 8086:7112 > PCI subsystem 1af4:1100 > IRQ 23. > BAR4: I/O at 0x6060 [0x607f]. > id "" > > ... > 538597.212494: kvm_exit: reason VMRESUME rip 0xf80004250115 > info 0 0 > 538597.212499: kvm_entry:vcpu 0 > 538597.212506: kvm_exit: reason IO_INSTRUCTION rip > 0xf80e02ac6a27 info 60620009 0 > 538597.212507: kvm_nested_vmexit:rip f80e02ac6a27 reason > IO_INSTRUCTION info1 60620009 info2 0 int_info 0 int_info_err 0 > 538597.212509: kvm_fpu: unload > 538597.212511: kvm_userspace_exit: reason KVM_EXIT_IO (2) > 538597.212516: kvm_fpu: load > 538597.212518: kvm_pio: pio_read at 0x6062 size 2 count 1 val > 0x1 > 538597.212519: kvm_entry:vcpu 0 > 538597.212523: kvm_exit: reason IO_INSTRUCTION rip > 0xf80e02ac6a61 info 60640009 0 > 538597.212523: kvm_nested_vmexit:rip f80e02ac6a61 reason > IO_INSTRUCTION info1 60640009 info2 0 int_info 0 int_info_err 0 > 538597.212524: kvm_fpu: unload > 538597.212525: kvm_userspace_exit: reason KVM_EXIT_IO (2) > 538597.212528: kvm_fpu: load > 538597.212528: kvm_pio: pio_read at 0x6064 size 2 count 1 val > 0xf > ... > > and this happens after EOI from L1. I see that the L2 guest is doing I/O read to the device BAR4 but do these reads lower the irq-line? I would expect a write to lower the irq-line. Looking at uhci_port_read(), it seems that offset 0x02 and 0x04 just return a value. Doesn’t lower irq-line. (Even though offset 0x04 returns the "interrupt enable register”). In contrast, looking at uhci_port_write(), it seems that writing to either offset 0x02 or 0x04 could lower the irq-line. So you should look for pio_write to port 0x6062 or 0x6064 to see who is actually responsible for lowering the irq-line. Having said that, it does make sense that the layer (L1/L2) who interacts in any way with the device’s BAR is probably also the one who lowers the irq-line. I just fine it hard to believe that on a real physical hardware with this device, that there will be no issue. i.e. That hardware IOAPIC will delay the re-injection of the level-triggered interrupt (since the IOAPIC EOI) by so much that Hyper-V can make it to resume into it’s guest before it is injected. Maybe Hyper-V was never tested with such a real device? ;) > >> >>> >>> Gory details: >>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_kvm_msg184484.html=DwIDAg=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=Q0Ico0Nb_DGRDrDgXkjkRr-xjzIbOLteVOhDJXBD_pU=d_H4_-qzqGvyi8X7g_KA0hZ5a8zjfHQhe1BhLPIokcA= >> >> Maybe worth to note that one should read the entire thread to understand the >> analysis. >> > > Sure. > >>> >>> Turns out we were dealing with similar issues before; in-kernel IOAPIC >>> implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay >>> irq delivery duringeoi broadcast") which describes a very similar issue. >>> >>> Steal the idea from the above mentioned commit for IOAPIC implementation in >>> QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well. >>> >>> Signed-off-by: Vitaly Kuznetsov >>> --- >>> hw/intc/ioapic.c | 43 ++- >>> hw/intc/trace-events | 1 + >>> include/hw/i386/ioapic_internal.h | 3 +++ >>> 3 files changed, 46 insertions(+), 1 deletion(-) >>> >>> diff
Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
Liran Alon writes: >> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov wrote: >> >> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + >> piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that >> it is mishandling level-triggered interrupt performing EOI without fixing >> the root cause. > > I would rephrase as: > It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + > piix4-usb-uhci) hangs on boot. > Root-cause was that one of Hyper-V level-triggered interrupt handler performs > EOI before fixing the root-cause. > This results in IOAPIC keep re-raising the level-triggered interrupt > after EOI because irq-line remains asserted. Ok, thanks for the suggestion. > >> This causes immediate re-assertion and L2 VM (which is >> supposedly expected to fix the cause of the interrupt) is not making any >> progress. > > I don’t know why you assume this. > From the trace we have examined, it seems that the EOI is performed by > Hyper-V and not it’s guest > This means that the handler for this level-triggered interrupt is on > Hyper-V and not it’s guest. If you let it run (with e.g. this patch or by setting preemtion timer > 0) you'll see that MMIO write fixing the cause of the interrupt is happening from L2: (qemu) info pci: Bus 0, device 4, function 0: USB controller: PCI device 8086:7112 PCI subsystem 1af4:1100 IRQ 23. BAR4: I/O at 0x6060 [0x607f]. id "" ... 538597.212494: kvm_exit: reason VMRESUME rip 0xf80004250115 info 0 0 538597.212499: kvm_entry:vcpu 0 538597.212506: kvm_exit: reason IO_INSTRUCTION rip 0xf80e02ac6a27 info 60620009 0 538597.212507: kvm_nested_vmexit:rip f80e02ac6a27 reason IO_INSTRUCTION info1 60620009 info2 0 int_info 0 int_info_err 0 538597.212509: kvm_fpu: unload 538597.212511: kvm_userspace_exit: reason KVM_EXIT_IO (2) 538597.212516: kvm_fpu: load 538597.212518: kvm_pio: pio_read at 0x6062 size 2 count 1 val 0x1 538597.212519: kvm_entry:vcpu 0 538597.212523: kvm_exit: reason IO_INSTRUCTION rip 0xf80e02ac6a61 info 60640009 0 538597.212523: kvm_nested_vmexit:rip f80e02ac6a61 reason IO_INSTRUCTION info1 60640009 info2 0 int_info 0 int_info_err 0 538597.212524: kvm_fpu: unload 538597.212525: kvm_userspace_exit: reason KVM_EXIT_IO (2) 538597.212528: kvm_fpu: load 538597.212528: kvm_pio: pio_read at 0x6064 size 2 count 1 val 0xf ... and this happens after EOI from L1. > >> >> Gory details: >> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_kvm_msg184484.html=DwIDAg=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=Q0Ico0Nb_DGRDrDgXkjkRr-xjzIbOLteVOhDJXBD_pU=d_H4_-qzqGvyi8X7g_KA0hZ5a8zjfHQhe1BhLPIokcA= > > Maybe worth to note that one should read the entire thread to understand the > analysis. > Sure. >> >> Turns out we were dealing with similar issues before; in-kernel IOAPIC >> implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay >> irq delivery duringeoi broadcast") which describes a very similar issue. >> >> Steal the idea from the above mentioned commit for IOAPIC implementation in >> QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well. >> >> Signed-off-by: Vitaly Kuznetsov >> --- >> hw/intc/ioapic.c | 43 ++- >> hw/intc/trace-events | 1 + >> include/hw/i386/ioapic_internal.h | 3 +++ >> 3 files changed, 46 insertions(+), 1 deletion(-) >> >> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c >> index 9d75f84d3b..daf45cc8a8 100644 >> --- a/hw/intc/ioapic.c >> +++ b/hw/intc/ioapic.c >> @@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s) >>} >> } >> >> +#define SUCCESSIVE_IRQ_MAX_COUNT 1 >> + >> +static void ioapic_timer(void *opaque) >> +{ >> +IOAPICCommonState *s = opaque; >> + >> +ioapic_service(s); >> +} >> + >> static void ioapic_set_irq(void *opaque, int vector, int level) >> { >>IOAPICCommonState *s = opaque; >> @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector) >>trace_ioapic_clear_remote_irr(n, vector); >>s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR; > > This clear of remote-irr should happen only for level-triggered interrupts. > So we can make the code here more structured like KVM’s > __kvm_ioapic_update_eoi(). > It also have an extra-value of not advancing "s->irq_reassert[vector]” for > vectors which are edge-triggered which is a bit misleading. > >>if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) { >> -ioapic_service(s); >> +bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) >> & 1) >> +== IOAPIC_TRIGGER_LEVEL; > > Nit: I
Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
On 01/04/19 17:13, Vitaly Kuznetsov wrote: >>> +trace_ioapic_eoi_delayed_reassert(vector); >>> +timer_mod(s->timer, >>> + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + >>> + NANOSECONDS_PER_SECOND / 100); >> Should this be done only if the timer isn't pending? > Hm, maybe ... but how can this happen? To get here we need remote IRR > bit and we clear it so someone needs to re-set it. The source won't > probably be doing this (it is a level-triggered interrupt and it is > already pending - why re-asserting?) but even if it does > ioapic_service(s) will be called and when our timer fires we will just > do nothing (consequitive ioapic_service() doesn't hurt). Yeah, it's just for cleanliness. Let's change it to timer_mod_anticipate and call it a day. Paolo
Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
Paolo Bonzini writes: > On 01/04/19 15:36, Vitaly Kuznetsov wrote: ... >> static void ioapic_set_irq(void *opaque, int vector, int level) >> { >> IOAPICCommonState *s = opaque; >> @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector) >> trace_ioapic_clear_remote_irr(n, vector); >> s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR; >> if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) { >> -ioapic_service(s); >> +bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) >> & 1) >> +== IOAPIC_TRIGGER_LEVEL; >> + >> +++s->irq_reassert[vector]; >> +if (!level || >> +s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) >> { >> +ioapic_service(s); >> +} else { >> +/* >> + * Real hardware does not deliver the interrupt >> + * immediately during eoi broadcast, and this lets a >> + * buggy guest make slow progress even if it does >> not >> + * correctly handle a level-triggered interrupt. >> Emulate >> + * this behavior if we detect an interrupt storm. >> + */ >> +trace_ioapic_eoi_delayed_reassert(vector); >> +timer_mod(s->timer, >> + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + >> + NANOSECONDS_PER_SECOND / 100); > > Should this be done only if the timer isn't pending? Hm, maybe ... but how can this happen? To get here we need remote IRR bit and we clear it so someone needs to re-set it. The source won't probably be doing this (it is a level-triggered interrupt and it is already pending - why re-asserting?) but even if it does ioapic_service(s) will be called and when our timer fires we will just do nothing (consequitive ioapic_service() doesn't hurt). -- Vitaly
Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov wrote: > > It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + > piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that > it is mishandling level-triggered interrupt performing EOI without fixing > the root cause. I would rephrase as: It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + piix4-usb-uhci) hangs on boot. Root-cause was that one of Hyper-V level-triggered interrupt handler performs EOI before fixing the root-cause. This results in IOAPIC keep re-raising the level-triggered interrupt after EOI because irq-line remains asserted. > This causes immediate re-assertion and L2 VM (which is > supposedly expected to fix the cause of the interrupt) is not making any > progress. I don’t know why you assume this. From the trace we have examined, it seems that the EOI is performed by Hyper-V and not it’s guest This means that the handler for this level-triggered interrupt is on Hyper-V and not it’s guest. > > Gory details: > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_kvm_msg184484.html=DwIDAg=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=Q0Ico0Nb_DGRDrDgXkjkRr-xjzIbOLteVOhDJXBD_pU=d_H4_-qzqGvyi8X7g_KA0hZ5a8zjfHQhe1BhLPIokcA= Maybe worth to note that one should read the entire thread to understand the analysis. > > Turns out we were dealing with similar issues before; in-kernel IOAPIC > implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay > irq delivery duringeoi broadcast") which describes a very similar issue. > > Steal the idea from the above mentioned commit for IOAPIC implementation in > QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well. > > Signed-off-by: Vitaly Kuznetsov > --- > hw/intc/ioapic.c | 43 ++- > hw/intc/trace-events | 1 + > include/hw/i386/ioapic_internal.h | 3 +++ > 3 files changed, 46 insertions(+), 1 deletion(-) > > diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c > index 9d75f84d3b..daf45cc8a8 100644 > --- a/hw/intc/ioapic.c > +++ b/hw/intc/ioapic.c > @@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s) >} > } > > +#define SUCCESSIVE_IRQ_MAX_COUNT 1 > + > +static void ioapic_timer(void *opaque) > +{ > +IOAPICCommonState *s = opaque; > + > +ioapic_service(s); > +} > + > static void ioapic_set_irq(void *opaque, int vector, int level) > { >IOAPICCommonState *s = opaque; > @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector) >trace_ioapic_clear_remote_irr(n, vector); >s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR; This clear of remote-irr should happen only for level-triggered interrupts. So we can make the code here more structured like KVM’s __kvm_ioapic_update_eoi(). It also have an extra-value of not advancing "s->irq_reassert[vector]” for vectors which are edge-triggered which is a bit misleading. >if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) { > -ioapic_service(s); > +bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & > 1) > +== IOAPIC_TRIGGER_LEVEL; Nit: I would declare variable “trig_mode” and compare it later explicitly to IOAPIC_TRIGGER_LEVEL. > + > +++s->irq_reassert[vector]; Nit: I would rename “irq_reassert” to “irq_eoi” as it is named in KVM IOAPIC code. > +if (!level || > +s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) { > +ioapic_service(s); > +} else { > +/* > + * Real hardware does not deliver the interrupt > + * immediately during eoi broadcast, and this lets a > + * buggy guest make slow progress even if it does not > + * correctly handle a level-triggered interrupt. > Emulate > + * this behavior if we detect an interrupt storm. > + */ > +trace_ioapic_eoi_delayed_reassert(vector); > +timer_mod(s->timer, > + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + > + NANOSECONDS_PER_SECOND / 100); You need to zero here s->irq_reassert[vector]. I think you would avoid these kind of little mistakes if you will attempt to follow KVM’s commit code structure. > +} > +} else { > +s->irq_reassert[vector] = 0; >} >} >} > @@ -401,6 +431,8 @@ static void ioapic_realize(DeviceState *dev, Error **errp) >memory_region_init_io(>io_memory, OBJECT(s), _io_ops, s, > "ioapic", 0x1000); > > +
Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
On 01/04/19 15:36, Vitaly Kuznetsov wrote: > It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + > piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that > it is mishandling level-triggered interrupt performing EOI without fixing > the root cause. This causes immediate re-assertion and L2 VM (which is > supposedly expected to fix the cause of the interrupt) is not making any > progress. > > Gory details: https://www.spinics.net/lists/kvm/msg184484.html > > Turns out we were dealing with similar issues before; in-kernel IOAPIC > implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay > irq delivery duringeoi broadcast") which describes a very similar issue. > > Steal the idea from the above mentioned commit for IOAPIC implementation in > QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well. > > Signed-off-by: Vitaly Kuznetsov > --- > hw/intc/ioapic.c | 43 ++- > hw/intc/trace-events | 1 + > include/hw/i386/ioapic_internal.h | 3 +++ > 3 files changed, 46 insertions(+), 1 deletion(-) > > diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c > index 9d75f84d3b..daf45cc8a8 100644 > --- a/hw/intc/ioapic.c > +++ b/hw/intc/ioapic.c > @@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s) > } > } > > +#define SUCCESSIVE_IRQ_MAX_COUNT 1 > + > +static void ioapic_timer(void *opaque) > +{ > +IOAPICCommonState *s = opaque; > + > +ioapic_service(s); > +} > + > static void ioapic_set_irq(void *opaque, int vector, int level) > { > IOAPICCommonState *s = opaque; > @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector) > trace_ioapic_clear_remote_irr(n, vector); > s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR; > if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) { > -ioapic_service(s); > +bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & > 1) > +== IOAPIC_TRIGGER_LEVEL; > + > +++s->irq_reassert[vector]; > +if (!level || > +s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) { > +ioapic_service(s); > +} else { > +/* > + * Real hardware does not deliver the interrupt > + * immediately during eoi broadcast, and this lets a > + * buggy guest make slow progress even if it does not > + * correctly handle a level-triggered interrupt. > Emulate > + * this behavior if we detect an interrupt storm. > + */ > +trace_ioapic_eoi_delayed_reassert(vector); > +timer_mod(s->timer, > + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + > + NANOSECONDS_PER_SECOND / 100); Should this be done only if the timer isn't pending? Paolo
[Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that it is mishandling level-triggered interrupt performing EOI without fixing the root cause. This causes immediate re-assertion and L2 VM (which is supposedly expected to fix the cause of the interrupt) is not making any progress. Gory details: https://www.spinics.net/lists/kvm/msg184484.html Turns out we were dealing with similar issues before; in-kernel IOAPIC implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay irq delivery duringeoi broadcast") which describes a very similar issue. Steal the idea from the above mentioned commit for IOAPIC implementation in QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well. Signed-off-by: Vitaly Kuznetsov --- hw/intc/ioapic.c | 43 ++- hw/intc/trace-events | 1 + include/hw/i386/ioapic_internal.h | 3 +++ 3 files changed, 46 insertions(+), 1 deletion(-) diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c index 9d75f84d3b..daf45cc8a8 100644 --- a/hw/intc/ioapic.c +++ b/hw/intc/ioapic.c @@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s) } } +#define SUCCESSIVE_IRQ_MAX_COUNT 1 + +static void ioapic_timer(void *opaque) +{ +IOAPICCommonState *s = opaque; + +ioapic_service(s); +} + static void ioapic_set_irq(void *opaque, int vector, int level) { IOAPICCommonState *s = opaque; @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector) trace_ioapic_clear_remote_irr(n, vector); s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR; if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) { -ioapic_service(s); +bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) +== IOAPIC_TRIGGER_LEVEL; + +++s->irq_reassert[vector]; +if (!level || +s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) { +ioapic_service(s); +} else { +/* + * Real hardware does not deliver the interrupt + * immediately during eoi broadcast, and this lets a + * buggy guest make slow progress even if it does not + * correctly handle a level-triggered interrupt. Emulate + * this behavior if we detect an interrupt storm. + */ +trace_ioapic_eoi_delayed_reassert(vector); +timer_mod(s->timer, + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + + NANOSECONDS_PER_SECOND / 100); +} +} else { +s->irq_reassert[vector] = 0; } } } @@ -401,6 +431,8 @@ static void ioapic_realize(DeviceState *dev, Error **errp) memory_region_init_io(>io_memory, OBJECT(s), _io_ops, s, "ioapic", 0x1000); +s->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, ioapic_timer, s); + qdev_init_gpio_in(dev, ioapic_set_irq, IOAPIC_NUM_PINS); ioapics[ioapic_no] = s; @@ -408,6 +440,14 @@ static void ioapic_realize(DeviceState *dev, Error **errp) qemu_add_machine_init_done_notifier(>machine_done); } +static void ioapic_unrealize(DeviceState *dev, Error **errp) +{ +IOAPICCommonState *s = IOAPIC_COMMON(dev); + +timer_del(s->timer); +timer_free(s->timer); +} + static Property ioapic_properties[] = { DEFINE_PROP_UINT8("version", IOAPICCommonState, version, IOAPIC_VER_DEF), DEFINE_PROP_END_OF_LIST(), @@ -419,6 +459,7 @@ static void ioapic_class_init(ObjectClass *klass, void *data) DeviceClass *dc = DEVICE_CLASS(klass); k->realize = ioapic_realize; +k->unrealize = ioapic_unrealize; /* * If APIC is in kernel, we need to update the kernel cache after * migration, otherwise first 24 gsi routes will be invalid. diff --git a/hw/intc/trace-events b/hw/intc/trace-events index a28bdce925..90c9d07c1a 100644 --- a/hw/intc/trace-events +++ b/hw/intc/trace-events @@ -25,6 +25,7 @@ apic_mem_writel(uint64_t addr, uint32_t val) "0x%"PRIx64" = 0x%08x" ioapic_set_remote_irr(int n) "set remote irr for pin %d" ioapic_clear_remote_irr(int n, int vector) "clear remote irr for pin %d vector %d" ioapic_eoi_broadcast(int vector) "EOI broadcast for vector %d" +ioapic_eoi_delayed_reassert(int vector) "delayed reassert on EOI broadcast for vector %d" ioapic_mem_read(uint8_t addr, uint8_t regsel, uint8_t size, uint32_t val) "ioapic mem read addr 0x%"PRIx8" regsel: 0x%"PRIx8" size 0x%"PRIx8" retval 0x%"PRIx32 ioapic_mem_write(uint8_t addr, uint8_t regsel, uint8_t size, uint32_t val) "ioapic mem write