Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress

2019-04-01 Thread Liran Alon



> On 1 Apr 2019, at 20:28, Vitaly Kuznetsov  wrote:
> 
> Liran Alon  writes:
> 
>>> On 1 Apr 2019, at 18:58, Vitaly Kuznetsov  wrote:
>>> 
>>> Liran Alon  writes:
>>> 
> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov  wrote:
> 
> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine 
> +
> piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion 
> that
> it is mishandling level-triggered interrupt performing EOI without fixing
> the root cause.
 
 I would rephrase as:
 It was found that Hyper-V 2016 on KVM in some configurations (q35 machine 
 + piix4-usb-uhci) hangs on boot.
 Root-cause was that one of Hyper-V level-triggered interrupt handler 
 performs EOI before fixing the root-cause.
 This results in IOAPIC keep re-raising the level-triggered interrupt
 after EOI because irq-line remains asserted.
>>> 
>>> Ok, thanks for the suggestion.
>>> 
 
> This causes immediate re-assertion and L2 VM (which is
> supposedly expected to fix the cause of the interrupt) is not making any
> progress.
 
 I don’t know why you assume this.
 From the trace we have examined, it seems that the EOI is performed by 
 Hyper-V and not it’s guest
 This means that the handler for this level-triggered interrupt is on
 Hyper-V and not it’s guest.
>>> 
>>> If you let it run (with e.g. this patch or by setting preemtion timer >
>>> 0) you'll see that MMIO write fixing the cause of the interrupt is
>>> happening from L2:
>>> 
>>> (qemu) info pci:
>>> 
>>> Bus  0, device   4, function 0:
>>>   USB controller: PCI device 8086:7112
>>> PCI subsystem 1af4:1100
>>> IRQ 23.
>>> BAR4: I/O at 0x6060 [0x607f].
>>> id ""
>>> 
>>> ...
>>> 538597.212494: kvm_exit: reason VMRESUME rip 0xf80004250115 
>>> info 0 0
>>> 538597.212499: kvm_entry:vcpu 0
>>> 538597.212506: kvm_exit: reason IO_INSTRUCTION rip 
>>> 0xf80e02ac6a27 info 60620009 0
>>> 538597.212507: kvm_nested_vmexit:rip f80e02ac6a27 reason 
>>> IO_INSTRUCTION info1 60620009 info2 0 int_info 0 int_info_err 0
>>> 538597.212509: kvm_fpu:  unload
>>> 538597.212511: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
>>> 538597.212516: kvm_fpu:  load
>>> 538597.212518: kvm_pio:  pio_read at 0x6062 size 2 count 1 val 
>>> 0x1 
>>> 538597.212519: kvm_entry:vcpu 0
>>> 538597.212523: kvm_exit: reason IO_INSTRUCTION rip 
>>> 0xf80e02ac6a61 info 60640009 0
>>> 538597.212523: kvm_nested_vmexit:rip f80e02ac6a61 reason 
>>> IO_INSTRUCTION info1 60640009 info2 0 int_info 0 int_info_err 0
>>> 538597.212524: kvm_fpu:  unload
>>> 538597.212525: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
>>> 538597.212528: kvm_fpu:  load
>>> 538597.212528: kvm_pio:  pio_read at 0x6064 size 2 count 1 val 
>>> 0xf 
>>> ...
>>> 
>>> and this happens after EOI from L1.
>> 
>> I see that the L2 guest is doing I/O read to the device BAR4 but do these 
>> reads lower the irq-line?
>> I would expect a write to lower the irq-line.
>> 
>> Looking at uhci_port_read(), it seems that offset 0x02 and 0x04 just return 
>> a value. Doesn’t lower irq-line.
>> (Even though offset 0x04 returns the "interrupt enable register”).
>> In contrast, looking at uhci_port_write(), it seems that writing to either 
>> offset 0x02 or 0x04 could lower the irq-line.
>> So you should look for pio_write to port 0x6062 or 0x6064 to see who
>> is actually responsible for lowering the irq-line.
> 
> Sorry, I probably like trimming traces too much. Writes happen too:
> 
> [005] 538597.212532: kvm_exit: reason IO_INSTRUCTION rip 
> 0xf80e02ac6a8f info 60620001 0
> [005] 538597.212533: kvm_nested_vmexit:rip f80e02ac6a8f reason 
> IO_INSTRUCTION info1 60620001 info2 0 int_info 0 int_info_err 0
> [005] 538597.212534: kvm_pio:  pio_write at 0x6062 size 2 count 1 
> val 0x1 

This clears bit UHCI_STS_USBINT from “status” register and zero “status2” 
register.

> [005] 538597.212534: kvm_fpu:  unload
> [005] 538597.212535: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
> [005] 538597.212543: kvm_fpu:  load
> [005] 538597.212544: kvm_entry:vcpu 0
> [005] 538597.212547: kvm_exit: reason IO_INSTRUCTION rip 
> 0xf80e02ac6a9c info 60640001 0
> [005] 538597.212548: kvm_nested_vmexit:rip f80e02ac6a9c reason 
> IO_INSTRUCTION info1 60640001 info2 0 int_info 0 int_info_err 0
> [005] 538597.212548: kvm_pio:  pio_write at 0x6064 size 2 count 1 
> val 0x0 

This sets 0 in the “intr” register (Interrupt enable register).
At this point it is indeed likely that uhci_update_irq() will lower the 
irq-line. I agree. :)

> [005] 538597.212549: kvm_fpu:  unload
> [005] 538597.212550: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
> 
> and this likely lowers the line.

OK 

Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress

2019-04-01 Thread Vitaly Kuznetsov
Liran Alon  writes:

>> On 1 Apr 2019, at 18:58, Vitaly Kuznetsov  wrote:
>> 
>> Liran Alon  writes:
>> 
 On 1 Apr 2019, at 16:36, Vitaly Kuznetsov  wrote:
 
 It was found that Hyper-V 2016 on KVM in some configurations (q35 machine +
 piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that
 it is mishandling level-triggered interrupt performing EOI without fixing
 the root cause.
>>> 
>>> I would rephrase as:
>>> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + 
>>> piix4-usb-uhci) hangs on boot.
>>> Root-cause was that one of Hyper-V level-triggered interrupt handler 
>>> performs EOI before fixing the root-cause.
>>> This results in IOAPIC keep re-raising the level-triggered interrupt
>>> after EOI because irq-line remains asserted.
>> 
>> Ok, thanks for the suggestion.
>> 
>>> 
 This causes immediate re-assertion and L2 VM (which is
 supposedly expected to fix the cause of the interrupt) is not making any
 progress.
>>> 
>>> I don’t know why you assume this.
>>> From the trace we have examined, it seems that the EOI is performed by 
>>> Hyper-V and not it’s guest
>>> This means that the handler for this level-triggered interrupt is on
>>> Hyper-V and not it’s guest.
>> 
>> If you let it run (with e.g. this patch or by setting preemtion timer >
>> 0) you'll see that MMIO write fixing the cause of the interrupt is
>> happening from L2:
>> 
>> (qemu) info pci:
>> 
>>  Bus  0, device   4, function 0:
>>USB controller: PCI device 8086:7112
>>  PCI subsystem 1af4:1100
>>  IRQ 23.
>>  BAR4: I/O at 0x6060 [0x607f].
>>  id ""
>> 
>> ...
>>  538597.212494: kvm_exit: reason VMRESUME rip 0xf80004250115 
>> info 0 0
>>  538597.212499: kvm_entry:vcpu 0
>>  538597.212506: kvm_exit: reason IO_INSTRUCTION rip 
>> 0xf80e02ac6a27 info 60620009 0
>>  538597.212507: kvm_nested_vmexit:rip f80e02ac6a27 reason 
>> IO_INSTRUCTION info1 60620009 info2 0 int_info 0 int_info_err 0
>>  538597.212509: kvm_fpu:  unload
>>  538597.212511: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
>>  538597.212516: kvm_fpu:  load
>>  538597.212518: kvm_pio:  pio_read at 0x6062 size 2 count 1 val 
>> 0x1 
>>  538597.212519: kvm_entry:vcpu 0
>>  538597.212523: kvm_exit: reason IO_INSTRUCTION rip 
>> 0xf80e02ac6a61 info 60640009 0
>>  538597.212523: kvm_nested_vmexit:rip f80e02ac6a61 reason 
>> IO_INSTRUCTION info1 60640009 info2 0 int_info 0 int_info_err 0
>>  538597.212524: kvm_fpu:  unload
>>  538597.212525: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
>>  538597.212528: kvm_fpu:  load
>>  538597.212528: kvm_pio:  pio_read at 0x6064 size 2 count 1 val 
>> 0xf 
>> ...
>> 
>> and this happens after EOI from L1.
>
> I see that the L2 guest is doing I/O read to the device BAR4 but do these 
> reads lower the irq-line?
> I would expect a write to lower the irq-line.
>
> Looking at uhci_port_read(), it seems that offset 0x02 and 0x04 just return a 
> value. Doesn’t lower irq-line.
> (Even though offset 0x04 returns the "interrupt enable register”).
> In contrast, looking at uhci_port_write(), it seems that writing to either 
> offset 0x02 or 0x04 could lower the irq-line.
> So you should look for pio_write to port 0x6062 or 0x6064 to see who
> is actually responsible for lowering the irq-line.

Sorry, I probably like trimming traces too much. Writes happen too:

 [005] 538597.212532: kvm_exit: reason IO_INSTRUCTION rip 
0xf80e02ac6a8f info 60620001 0
 [005] 538597.212533: kvm_nested_vmexit:rip f80e02ac6a8f reason 
IO_INSTRUCTION info1 60620001 info2 0 int_info 0 int_info_err 0
 [005] 538597.212534: kvm_pio:  pio_write at 0x6062 size 2 count 1 
val 0x1 
 [005] 538597.212534: kvm_fpu:  unload
 [005] 538597.212535: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
 [005] 538597.212543: kvm_fpu:  load
 [005] 538597.212544: kvm_entry:vcpu 0
 [005] 538597.212547: kvm_exit: reason IO_INSTRUCTION rip 
0xf80e02ac6a9c info 60640001 0
 [005] 538597.212548: kvm_nested_vmexit:rip f80e02ac6a9c reason 
IO_INSTRUCTION info1 60640001 info2 0 int_info 0 int_info_err 0
 [005] 538597.212548: kvm_pio:  pio_write at 0x6064 size 2 count 1 
val 0x0 
 [005] 538597.212549: kvm_fpu:  unload
 [005] 538597.212550: kvm_userspace_exit:   reason KVM_EXIT_IO (2)

and this likely lowers the line. I honestly have no idea how this all
works on real hw but the comment in kernel ioapic says something about
non-immediate delivery of the reasserted interrupt. True or not, this
gives me some peace of mind :-)

-- 
Vitaly



Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress

2019-04-01 Thread Liran Alon



> On 1 Apr 2019, at 18:58, Vitaly Kuznetsov  wrote:
> 
> Liran Alon  writes:
> 
>>> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov  wrote:
>>> 
>>> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine +
>>> piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that
>>> it is mishandling level-triggered interrupt performing EOI without fixing
>>> the root cause.
>> 
>> I would rephrase as:
>> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + 
>> piix4-usb-uhci) hangs on boot.
>> Root-cause was that one of Hyper-V level-triggered interrupt handler 
>> performs EOI before fixing the root-cause.
>> This results in IOAPIC keep re-raising the level-triggered interrupt
>> after EOI because irq-line remains asserted.
> 
> Ok, thanks for the suggestion.
> 
>> 
>>> This causes immediate re-assertion and L2 VM (which is
>>> supposedly expected to fix the cause of the interrupt) is not making any
>>> progress.
>> 
>> I don’t know why you assume this.
>> From the trace we have examined, it seems that the EOI is performed by 
>> Hyper-V and not it’s guest
>> This means that the handler for this level-triggered interrupt is on
>> Hyper-V and not it’s guest.
> 
> If you let it run (with e.g. this patch or by setting preemtion timer >
> 0) you'll see that MMIO write fixing the cause of the interrupt is
> happening from L2:
> 
> (qemu) info pci:
> 
>  Bus  0, device   4, function 0:
>USB controller: PCI device 8086:7112
>  PCI subsystem 1af4:1100
>  IRQ 23.
>  BAR4: I/O at 0x6060 [0x607f].
>  id ""
> 
> ...
>  538597.212494: kvm_exit: reason VMRESUME rip 0xf80004250115 
> info 0 0
>  538597.212499: kvm_entry:vcpu 0
>  538597.212506: kvm_exit: reason IO_INSTRUCTION rip 
> 0xf80e02ac6a27 info 60620009 0
>  538597.212507: kvm_nested_vmexit:rip f80e02ac6a27 reason 
> IO_INSTRUCTION info1 60620009 info2 0 int_info 0 int_info_err 0
>  538597.212509: kvm_fpu:  unload
>  538597.212511: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
>  538597.212516: kvm_fpu:  load
>  538597.212518: kvm_pio:  pio_read at 0x6062 size 2 count 1 val 
> 0x1 
>  538597.212519: kvm_entry:vcpu 0
>  538597.212523: kvm_exit: reason IO_INSTRUCTION rip 
> 0xf80e02ac6a61 info 60640009 0
>  538597.212523: kvm_nested_vmexit:rip f80e02ac6a61 reason 
> IO_INSTRUCTION info1 60640009 info2 0 int_info 0 int_info_err 0
>  538597.212524: kvm_fpu:  unload
>  538597.212525: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
>  538597.212528: kvm_fpu:  load
>  538597.212528: kvm_pio:  pio_read at 0x6064 size 2 count 1 val 
> 0xf 
> ...
> 
> and this happens after EOI from L1.

I see that the L2 guest is doing I/O read to the device BAR4 but do these reads 
lower the irq-line?
I would expect a write to lower the irq-line.

Looking at uhci_port_read(), it seems that offset 0x02 and 0x04 just return a 
value. Doesn’t lower irq-line.
(Even though offset 0x04 returns the "interrupt enable register”).
In contrast, looking at uhci_port_write(), it seems that writing to either 
offset 0x02 or 0x04 could lower the irq-line.
So you should look for pio_write to port 0x6062 or 0x6064 to see who is 
actually responsible for lowering the irq-line.

Having said that, it does make sense that the layer (L1/L2) who interacts in 
any way with the device’s BAR is probably
also the one who lowers the irq-line.
I just fine it hard to believe that on a real physical hardware with this 
device, that there will be no issue. i.e. That hardware IOAPIC
will delay the re-injection of the level-triggered interrupt (since the IOAPIC 
EOI) by so much that Hyper-V can make it to resume into it’s guest before it is 
injected.
Maybe Hyper-V was never tested with such a real device? ;)

> 
>> 
>>> 
>>> Gory details: 
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_kvm_msg184484.html=DwIDAg=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=Q0Ico0Nb_DGRDrDgXkjkRr-xjzIbOLteVOhDJXBD_pU=d_H4_-qzqGvyi8X7g_KA0hZ5a8zjfHQhe1BhLPIokcA=
>> 
>> Maybe worth to note that one should read the entire thread to understand the 
>> analysis.
>> 
> 
> Sure.
> 
>>> 
>>> Turns out we were dealing with similar issues before; in-kernel IOAPIC
>>> implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay
>>> irq delivery duringeoi broadcast") which describes a very similar issue.
>>> 
>>> Steal the idea from the above mentioned commit for IOAPIC implementation in
>>> QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well.
>>> 
>>> Signed-off-by: Vitaly Kuznetsov 
>>> ---
>>> hw/intc/ioapic.c  | 43 ++-
>>> hw/intc/trace-events  |  1 +
>>> include/hw/i386/ioapic_internal.h |  3 +++
>>> 3 files changed, 46 insertions(+), 1 deletion(-)
>>> 
>>> diff 

Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress

2019-04-01 Thread Vitaly Kuznetsov
Liran Alon  writes:

>> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov  wrote:
>> 
>> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine +
>> piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that
>> it is mishandling level-triggered interrupt performing EOI without fixing
>> the root cause.
>
> I would rephrase as:
> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + 
> piix4-usb-uhci) hangs on boot.
> Root-cause was that one of Hyper-V level-triggered interrupt handler performs 
> EOI before fixing the root-cause.
> This results in IOAPIC keep re-raising the level-triggered interrupt
> after EOI because irq-line remains asserted.

Ok, thanks for the suggestion.

>
>> This causes immediate re-assertion and L2 VM (which is
>> supposedly expected to fix the cause of the interrupt) is not making any
>> progress.
>
> I don’t know why you assume this.
> From the trace we have examined, it seems that the EOI is performed by 
> Hyper-V and not it’s guest
> This means that the handler for this level-triggered interrupt is on
> Hyper-V and not it’s guest.

If you let it run (with e.g. this patch or by setting preemtion timer >
0) you'll see that MMIO write fixing the cause of the interrupt is
happening from L2:

(qemu) info pci:

  Bus  0, device   4, function 0:
USB controller: PCI device 8086:7112
  PCI subsystem 1af4:1100
  IRQ 23.
  BAR4: I/O at 0x6060 [0x607f].
  id ""

...
  538597.212494: kvm_exit: reason VMRESUME rip 0xf80004250115 
info 0 0
  538597.212499: kvm_entry:vcpu 0
  538597.212506: kvm_exit: reason IO_INSTRUCTION rip 
0xf80e02ac6a27 info 60620009 0
  538597.212507: kvm_nested_vmexit:rip f80e02ac6a27 reason 
IO_INSTRUCTION info1 60620009 info2 0 int_info 0 int_info_err 0
  538597.212509: kvm_fpu:  unload
  538597.212511: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
  538597.212516: kvm_fpu:  load
  538597.212518: kvm_pio:  pio_read at 0x6062 size 2 count 1 val 
0x1 
  538597.212519: kvm_entry:vcpu 0
  538597.212523: kvm_exit: reason IO_INSTRUCTION rip 
0xf80e02ac6a61 info 60640009 0
  538597.212523: kvm_nested_vmexit:rip f80e02ac6a61 reason 
IO_INSTRUCTION info1 60640009 info2 0 int_info 0 int_info_err 0
  538597.212524: kvm_fpu:  unload
  538597.212525: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
  538597.212528: kvm_fpu:  load
  538597.212528: kvm_pio:  pio_read at 0x6064 size 2 count 1 val 
0xf 
...

and this happens after EOI from L1.

>
>> 
>> Gory details: 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_kvm_msg184484.html=DwIDAg=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=Q0Ico0Nb_DGRDrDgXkjkRr-xjzIbOLteVOhDJXBD_pU=d_H4_-qzqGvyi8X7g_KA0hZ5a8zjfHQhe1BhLPIokcA=
>
> Maybe worth to note that one should read the entire thread to understand the 
> analysis.
>

Sure.

>> 
>> Turns out we were dealing with similar issues before; in-kernel IOAPIC
>> implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay
>> irq delivery duringeoi broadcast") which describes a very similar issue.
>> 
>> Steal the idea from the above mentioned commit for IOAPIC implementation in
>> QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well.
>> 
>> Signed-off-by: Vitaly Kuznetsov 
>> ---
>> hw/intc/ioapic.c  | 43 ++-
>> hw/intc/trace-events  |  1 +
>> include/hw/i386/ioapic_internal.h |  3 +++
>> 3 files changed, 46 insertions(+), 1 deletion(-)
>> 
>> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
>> index 9d75f84d3b..daf45cc8a8 100644
>> --- a/hw/intc/ioapic.c
>> +++ b/hw/intc/ioapic.c
>> @@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s)
>>}
>> }
>> 
>> +#define SUCCESSIVE_IRQ_MAX_COUNT 1
>> +
>> +static void ioapic_timer(void *opaque)
>> +{
>> +IOAPICCommonState *s = opaque;
>> +
>> +ioapic_service(s);
>> +}
>> +
>> static void ioapic_set_irq(void *opaque, int vector, int level)
>> {
>>IOAPICCommonState *s = opaque;
>> @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector)
>>trace_ioapic_clear_remote_irr(n, vector);
>>s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR;
>
> This clear of remote-irr should happen only for level-triggered interrupts.
> So we can make the code here more structured like KVM’s 
> __kvm_ioapic_update_eoi().
> It also have an extra-value of not advancing "s->irq_reassert[vector]” for 
> vectors which are edge-triggered which is a bit misleading.
>
>>if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) {
>> -ioapic_service(s);
>> +bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) 
>> & 1)
>> +== IOAPIC_TRIGGER_LEVEL;
>
> Nit: I 

Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress

2019-04-01 Thread Paolo Bonzini
On 01/04/19 17:13, Vitaly Kuznetsov wrote:
>>> +trace_ioapic_eoi_delayed_reassert(vector);
>>> +timer_mod(s->timer,
>>> +  qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
>>> +  NANOSECONDS_PER_SECOND / 100);
>> Should this be done only if the timer isn't pending?
> Hm, maybe ... but how can this happen? To get here we need remote IRR
> bit and we clear it so someone needs to re-set it. The source won't
> probably be doing this (it is a level-triggered interrupt and it is
> already pending - why re-asserting?) but even if it does
> ioapic_service(s) will be called and when our timer fires we will just
> do nothing (consequitive ioapic_service() doesn't hurt).

Yeah, it's just for cleanliness.  Let's change it to
timer_mod_anticipate and call it a day.

Paolo



Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress

2019-04-01 Thread Vitaly Kuznetsov
Paolo Bonzini  writes:

> On 01/04/19 15:36, Vitaly Kuznetsov wrote:
...
>>  static void ioapic_set_irq(void *opaque, int vector, int level)
>>  {
>>  IOAPICCommonState *s = opaque;
>> @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector)
>>  trace_ioapic_clear_remote_irr(n, vector);
>>  s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR;
>>  if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) {
>> -ioapic_service(s);
>> +bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) 
>> & 1)
>> +== IOAPIC_TRIGGER_LEVEL;
>> +
>> +++s->irq_reassert[vector];
>> +if (!level ||
>> +s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) 
>> {
>> +ioapic_service(s);
>> +} else {
>> +/*
>> + * Real hardware does not deliver the interrupt
>> + * immediately during eoi broadcast, and this lets a
>> + * buggy guest make slow progress even if it does 
>> not
>> + * correctly handle a level-triggered interrupt. 
>> Emulate
>> + * this behavior if we detect an interrupt storm.
>> + */
>> +trace_ioapic_eoi_delayed_reassert(vector);
>> +timer_mod(s->timer,
>> +  qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
>> +  NANOSECONDS_PER_SECOND / 100);
>
> Should this be done only if the timer isn't pending?

Hm, maybe ... but how can this happen? To get here we need remote IRR
bit and we clear it so someone needs to re-set it. The source won't
probably be doing this (it is a level-triggered interrupt and it is
already pending - why re-asserting?) but even if it does
ioapic_service(s) will be called and when our timer fires we will just
do nothing (consequitive ioapic_service() doesn't hurt).

-- 
Vitaly



Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress

2019-04-01 Thread Liran Alon



> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov  wrote:
> 
> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine +
> piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that
> it is mishandling level-triggered interrupt performing EOI without fixing
> the root cause.

I would rephrase as:
It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + 
piix4-usb-uhci) hangs on boot.
Root-cause was that one of Hyper-V level-triggered interrupt handler performs 
EOI before fixing the root-cause.
This results in IOAPIC keep re-raising the level-triggered interrupt after EOI 
because irq-line remains asserted.

> This causes immediate re-assertion and L2 VM (which is
> supposedly expected to fix the cause of the interrupt) is not making any
> progress.

I don’t know why you assume this.
From the trace we have examined, it seems that the EOI is performed by Hyper-V 
and not it’s guest
This means that the handler for this level-triggered interrupt is on Hyper-V 
and not it’s guest.

> 
> Gory details: 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_kvm_msg184484.html=DwIDAg=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=Q0Ico0Nb_DGRDrDgXkjkRr-xjzIbOLteVOhDJXBD_pU=d_H4_-qzqGvyi8X7g_KA0hZ5a8zjfHQhe1BhLPIokcA=

Maybe worth to note that one should read the entire thread to understand the 
analysis.

> 
> Turns out we were dealing with similar issues before; in-kernel IOAPIC
> implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay
> irq delivery duringeoi broadcast") which describes a very similar issue.
> 
> Steal the idea from the above mentioned commit for IOAPIC implementation in
> QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
> hw/intc/ioapic.c  | 43 ++-
> hw/intc/trace-events  |  1 +
> include/hw/i386/ioapic_internal.h |  3 +++
> 3 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> index 9d75f84d3b..daf45cc8a8 100644
> --- a/hw/intc/ioapic.c
> +++ b/hw/intc/ioapic.c
> @@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s)
>}
> }
> 
> +#define SUCCESSIVE_IRQ_MAX_COUNT 1
> +
> +static void ioapic_timer(void *opaque)
> +{
> +IOAPICCommonState *s = opaque;
> +
> +ioapic_service(s);
> +}
> +
> static void ioapic_set_irq(void *opaque, int vector, int level)
> {
>IOAPICCommonState *s = opaque;
> @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector)
>trace_ioapic_clear_remote_irr(n, vector);
>s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR;

This clear of remote-irr should happen only for level-triggered interrupts.
So we can make the code here more structured like KVM’s 
__kvm_ioapic_update_eoi().
It also have an extra-value of not advancing "s->irq_reassert[vector]” for 
vectors which are edge-triggered which is a bit misleading.

>if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) {
> -ioapic_service(s);
> +bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 
> 1)
> +== IOAPIC_TRIGGER_LEVEL;

Nit: I would declare variable “trig_mode” and compare it later explicitly to 
IOAPIC_TRIGGER_LEVEL.

> +
> +++s->irq_reassert[vector];

Nit: I would rename “irq_reassert” to “irq_eoi” as it is named in KVM IOAPIC 
code.

> +if (!level ||
> +s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) {
> +ioapic_service(s);
> +} else {
> +/*
> + * Real hardware does not deliver the interrupt
> + * immediately during eoi broadcast, and this lets a
> + * buggy guest make slow progress even if it does not
> + * correctly handle a level-triggered interrupt. 
> Emulate
> + * this behavior if we detect an interrupt storm.
> + */
> +trace_ioapic_eoi_delayed_reassert(vector);
> +timer_mod(s->timer,
> +  qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> +  NANOSECONDS_PER_SECOND / 100);

You need to zero here s->irq_reassert[vector].
I think you would avoid these kind of little mistakes if you will attempt to 
follow KVM’s commit code structure.

> +}
> +} else {
> +s->irq_reassert[vector] = 0;
>}
>}
>}
> @@ -401,6 +431,8 @@ static void ioapic_realize(DeviceState *dev, Error **errp)
>memory_region_init_io(>io_memory, OBJECT(s), _io_ops, s,
>  "ioapic", 0x1000);
> 
> +

Re: [Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress

2019-04-01 Thread Paolo Bonzini
On 01/04/19 15:36, Vitaly Kuznetsov wrote:
> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine +
> piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that
> it is mishandling level-triggered interrupt performing EOI without fixing
> the root cause. This causes immediate re-assertion and L2 VM (which is
> supposedly expected to fix the cause of the interrupt) is not making any
> progress.
> 
> Gory details: https://www.spinics.net/lists/kvm/msg184484.html
> 
> Turns out we were dealing with similar issues before; in-kernel IOAPIC
> implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay
> irq delivery duringeoi broadcast") which describes a very similar issue.
> 
> Steal the idea from the above mentioned commit for IOAPIC implementation in
> QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  hw/intc/ioapic.c  | 43 ++-
>  hw/intc/trace-events  |  1 +
>  include/hw/i386/ioapic_internal.h |  3 +++
>  3 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> index 9d75f84d3b..daf45cc8a8 100644
> --- a/hw/intc/ioapic.c
> +++ b/hw/intc/ioapic.c
> @@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s)
>  }
>  }
>  
> +#define SUCCESSIVE_IRQ_MAX_COUNT 1
> +
> +static void ioapic_timer(void *opaque)
> +{
> +IOAPICCommonState *s = opaque;
> +
> +ioapic_service(s);
> +}
> +
>  static void ioapic_set_irq(void *opaque, int vector, int level)
>  {
>  IOAPICCommonState *s = opaque;
> @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector)
>  trace_ioapic_clear_remote_irr(n, vector);
>  s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR;
>  if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) {
> -ioapic_service(s);
> +bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 
> 1)
> +== IOAPIC_TRIGGER_LEVEL;
> +
> +++s->irq_reassert[vector];
> +if (!level ||
> +s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) {
> +ioapic_service(s);
> +} else {
> +/*
> + * Real hardware does not deliver the interrupt
> + * immediately during eoi broadcast, and this lets a
> + * buggy guest make slow progress even if it does not
> + * correctly handle a level-triggered interrupt. 
> Emulate
> + * this behavior if we detect an interrupt storm.
> + */
> +trace_ioapic_eoi_delayed_reassert(vector);
> +timer_mod(s->timer,
> +  qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> +  NANOSECONDS_PER_SECOND / 100);

Should this be done only if the timer isn't pending?

Paolo



[Qemu-devel] [PATCH] ioapic: allow buggy guests mishandling level-triggered interrupts to make progress

2019-04-01 Thread Vitaly Kuznetsov
It was found that Hyper-V 2016 on KVM in some configurations (q35 machine +
piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that
it is mishandling level-triggered interrupt performing EOI without fixing
the root cause. This causes immediate re-assertion and L2 VM (which is
supposedly expected to fix the cause of the interrupt) is not making any
progress.

Gory details: https://www.spinics.net/lists/kvm/msg184484.html

Turns out we were dealing with similar issues before; in-kernel IOAPIC
implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay
irq delivery duringeoi broadcast") which describes a very similar issue.

Steal the idea from the above mentioned commit for IOAPIC implementation in
QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well.

Signed-off-by: Vitaly Kuznetsov 
---
 hw/intc/ioapic.c  | 43 ++-
 hw/intc/trace-events  |  1 +
 include/hw/i386/ioapic_internal.h |  3 +++
 3 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
index 9d75f84d3b..daf45cc8a8 100644
--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s)
 }
 }
 
+#define SUCCESSIVE_IRQ_MAX_COUNT 1
+
+static void ioapic_timer(void *opaque)
+{
+IOAPICCommonState *s = opaque;
+
+ioapic_service(s);
+}
+
 static void ioapic_set_irq(void *opaque, int vector, int level)
 {
 IOAPICCommonState *s = opaque;
@@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector)
 trace_ioapic_clear_remote_irr(n, vector);
 s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR;
 if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) {
-ioapic_service(s);
+bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1)
+== IOAPIC_TRIGGER_LEVEL;
+
+++s->irq_reassert[vector];
+if (!level ||
+s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) {
+ioapic_service(s);
+} else {
+/*
+ * Real hardware does not deliver the interrupt
+ * immediately during eoi broadcast, and this lets a
+ * buggy guest make slow progress even if it does not
+ * correctly handle a level-triggered interrupt. 
Emulate
+ * this behavior if we detect an interrupt storm.
+ */
+trace_ioapic_eoi_delayed_reassert(vector);
+timer_mod(s->timer,
+  qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
+  NANOSECONDS_PER_SECOND / 100);
+}
+} else {
+s->irq_reassert[vector] = 0;
 }
 }
 }
@@ -401,6 +431,8 @@ static void ioapic_realize(DeviceState *dev, Error **errp)
 memory_region_init_io(>io_memory, OBJECT(s), _io_ops, s,
   "ioapic", 0x1000);
 
+s->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, ioapic_timer, s);
+
 qdev_init_gpio_in(dev, ioapic_set_irq, IOAPIC_NUM_PINS);
 
 ioapics[ioapic_no] = s;
@@ -408,6 +440,14 @@ static void ioapic_realize(DeviceState *dev, Error **errp)
 qemu_add_machine_init_done_notifier(>machine_done);
 }
 
+static void ioapic_unrealize(DeviceState *dev, Error **errp)
+{
+IOAPICCommonState *s = IOAPIC_COMMON(dev);
+
+timer_del(s->timer);
+timer_free(s->timer);
+}
+
 static Property ioapic_properties[] = {
 DEFINE_PROP_UINT8("version", IOAPICCommonState, version, IOAPIC_VER_DEF),
 DEFINE_PROP_END_OF_LIST(),
@@ -419,6 +459,7 @@ static void ioapic_class_init(ObjectClass *klass, void 
*data)
 DeviceClass *dc = DEVICE_CLASS(klass);
 
 k->realize = ioapic_realize;
+k->unrealize = ioapic_unrealize;
 /*
  * If APIC is in kernel, we need to update the kernel cache after
  * migration, otherwise first 24 gsi routes will be invalid.
diff --git a/hw/intc/trace-events b/hw/intc/trace-events
index a28bdce925..90c9d07c1a 100644
--- a/hw/intc/trace-events
+++ b/hw/intc/trace-events
@@ -25,6 +25,7 @@ apic_mem_writel(uint64_t addr, uint32_t val) "0x%"PRIx64" = 
0x%08x"
 ioapic_set_remote_irr(int n) "set remote irr for pin %d"
 ioapic_clear_remote_irr(int n, int vector) "clear remote irr for pin %d vector 
%d"
 ioapic_eoi_broadcast(int vector) "EOI broadcast for vector %d"
+ioapic_eoi_delayed_reassert(int vector) "delayed reassert on EOI broadcast for 
vector %d"
 ioapic_mem_read(uint8_t addr, uint8_t regsel, uint8_t size, uint32_t val) 
"ioapic mem read addr 0x%"PRIx8" regsel: 0x%"PRIx8" size 0x%"PRIx8" retval 
0x%"PRIx32
 ioapic_mem_write(uint8_t addr, uint8_t regsel, uint8_t size, uint32_t val) 
"ioapic mem write