On 15/03/2022 09:09, Alex Williamson wrote:
[Cc +Alexey]

On Fri, 11 Mar 2022 12:35:45 -0600 (CST)
Timothy Pearson <tpear...@raptorengineering.com> wrote:

All,

I've been struggling for some time with what is looking like a
potential bug in QEMU/KVM on the POWER9 platform.  It appears that in
XIVE mode, when the in-kernel IRQ chip is enabled, an external device
that rapidly asserts IRQs via the legacy level-triggered INTx
mechanism only ever delivers a single interrupt to the KVM guest.

Changing any one of those items appears to avoid the glitch, e.g.
XICS mode with the in-kernel IRQ chip works (all interrupts are
passed through), and XIVE mode with the in-kernel IRQ chip disabled
also works.  We are also not seeing any problems in XIVE mode with
the in-kernel chip from MSI/MSI-X devices.
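
(For reference, the knobs being toggled here are the pseries
machine's interrupt-controller mode and the in-kernel irqchip; our
actual command lines are longer, but it is roughly along the lines
of:

  -machine pseries,ic-mode=xive,kernel-irqchip=on    <- one interrupt only
  -machine pseries,ic-mode=xics,kernel-irqchip=on    <- works
  -machine pseries,ic-mode=xive,kernel-irqchip=off   <- works

just to show which options are involved.)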

The device in question is a real-time card that needs to raise an
interrupt every 1ms.  It works perfectly on the host, but fails in
the guest -- with the in-kernel IRQ chip and XIVE enabled, it
receives exactly one interrupt, at which point the host continues to
see INTx+ but the guest sees INTx-, and the IRQ handler in the guest
kernel is never reentered.

We have also seen some very rare glitches where, over a long period
of time, we can enter a similar deadlock in XICS mode.  Disabling the
in-kernel IRQ chip in XIVE mode will also lead to the lockup with
this device, since the userspace IRQ emulation cannot keep up with
the rapid interrupt firing (measurements show around 100ms required
for processing each interrupt in user mode).

My understanding is that the resample mechanism does some clever
tricks with level-triggered IRQs, but that QEMU still needs to check
whether the IRQ is still asserted by the device on guest EOI.  Since
a failure there would explain these symptoms, I'm wondering if there
is a bug in either QEMU or KVM for POWER / pSeries (sPAPR) where the
IRQ is not resampled and therefore never re-fired in the guest.
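
In case it helps frame the question: my rough mental model of the
plumbing, sketched purely from the generic KVM/VFIO uAPI rather than
from the actual QEMU code (so treat it as an assumption, and the
function/parameter names below as placeholders), is that the VMM
wires a trigger eventfd and a resample eventfd roughly like this:

/*
 * Rough sketch of how a VMM can wire level-triggered INTx through
 * KVM, using only the generic uAPI (linux/kvm.h, linux/vfio.h).
 * This is my mental model of the plumbing, not the actual QEMU code.
 */
#include <stdlib.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/vfio.h>

static int wire_intx(int kvm_vm_fd, int vfio_dev_fd, unsigned int guest_irq)
{
    int trigger = eventfd(0, 0);   /* device asserts INTx -> inject IRQ  */
    int resample = eventfd(0, 0);  /* guest EOIs the IRQ  -> unmask INTx */
    if (trigger < 0 || resample < 0)
        return -1;

    /* 1. KVM: when 'trigger' fires, raise 'guest_irq' as a level IRQ,
     *    and signal 'resample' once the guest EOIs it.                 */
    struct kvm_irqfd irqfd = {
        .fd = trigger,
        .gsi = guest_irq,
        .flags = KVM_IRQFD_FLAG_RESAMPLE,
        .resamplefd = resample,
    };
    if (ioctl(kvm_vm_fd, KVM_IRQFD, &irqfd) < 0)
        return -1;

    /* 2. VFIO: signal 'trigger' whenever the device asserts INTx ...   */
    size_t sz = sizeof(struct vfio_irq_set) + sizeof(int);
    struct vfio_irq_set *set = calloc(1, sz);
    int ret = -1;
    if (!set)
        return -1;
    set->argsz = sz;
    set->index = VFIO_PCI_INTX_IRQ_INDEX;
    set->start = 0;
    set->count = 1;
    set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    memcpy(set->data, &trigger, sizeof(int));
    if (ioctl(vfio_dev_fd, VFIO_DEVICE_SET_IRQS, set) < 0)
        goto out;

    /* 3. ... and unmask (re-check and possibly re-fire) INTx whenever
     *    'resample' fires, i.e. on guest EOI.  If this unmask step
     *    never happens, a level IRQ fires exactly once and the line
     *    stays masked on the host side.                                */
    set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK;
    memcpy(set->data, &resample, sizeof(int));
    if (ioctl(vfio_dev_fd, VFIO_DEVICE_SET_IRQS, set) < 0)
        goto out;

    ret = 0;
out:
    free(set);
    return ret;
}

If that resample -> unmask step is what never happens with XIVE and
the in-kernel chip, it would match what we observe, but again this is
only my reading of the generic interfaces.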

Unfortunately I lack the resources at the moment to dig through the
QEMU codebase and try to find the bug.  Any IBMers here that might be
able to help out?  I can provide access to a test setup if desired.

Your experiments with in-kernel vs QEMU irqchip would suggest to me
that both the device and the generic INTx handling code are working
correctly, though it's hard to say that definitively given the massive
timing differences.

As an experiment, does anything change with the "nointxmask=1" vfio-pci
module option?
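
(For completeness, that's the vfio-pci "nointxmask" parameter, so
something like:

  options vfio-pci nointxmask=1

in /etc/modprobe.d/, or vfio-pci.nointxmask=1 on the kernel command
line if vfio-pci is built in, then reload the module before
re-testing.)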

Adding Alexey, I have zero XIVE knowledge myself. Thanks,

Sorry about the delay, I'll get to it; I just need to figure out first why the host crashes on >128GB VMs on POWER8 with passthrough :-/


--
Alexey
