On Tue, 23 Sep 2025 12:04:33 -0500 (CDT) Timothy Pearson <[email protected]> wrote:
> PCI devices prior to PCI 2.3 both use level interrupts and do not support
> interrupt masking, leading to a failure when passed through to a KVM guest
> on at least the ppc64 platform.  This failure manifests as receiving and
> acknowledging a single interrupt in the guest, while the device continues
> to assert the level interrupt indicating a need for further servicing.
>
> When lazy IRQ masking is used on DisINTx- (non-PCI 2.3) hardware, the
> following sequence occurs:
>
>  * Level IRQ assertion on device
>  * IRQ marked disabled in kernel
>  * Host interrupt handler exits without clearing the interrupt on the device
>  * Eventfd is delivered to userspace
>  * Guest processes IRQ and clears device interrupt
>  * Device de-asserts INTx, then re-asserts INTx while the interrupt is masked
>  * Newly asserted interrupt acknowledged by kernel VMM without being handled
>  * Software mask removed by VFIO driver
>  * Device INTx still asserted, host controller does not see new edge after EOI
>
> The behavior is now platform-dependent.  Some platforms (amd64) will
> continue to spew IRQs for as long as the INTx line remains asserted,
> therefore the IRQ will be handled by the host as soon as the mask is
> dropped.  Others (ppc64) will only send the one request, and if it is not
> handled no further interrupts will be sent.  The former behavior
> theoretically leaves the system vulnerable to interrupt storm, and the
> latter will result in the device stalling after receiving exactly one
> interrupt in the guest.
>
> Work around this by disabling lazy IRQ masking for DisINTx- INTx devices.
>
> Signed-off-by: Timothy Pearson <[email protected]>
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 123298a4dc8f..61d29f6b3730 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -304,9 +304,14 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev,
>  
>  	vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;
>  
> +	if (!vdev->pci_2_3)
> +		irq_set_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);
> +
>  	ret = request_irq(pdev->irq, vfio_intx_handler,
>  			  irqflags, ctx->name, ctx);
>  	if (ret) {
> +		if (!vdev->pci_2_3)
> +			irq_clear_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);
>  		vdev->irq_type = VFIO_PCI_NUM_IRQS;
>  		kfree(name);
>  		vfio_irq_ctx_free(vdev, ctx, 0);
> @@ -352,6 +357,8 @@ static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
>  		vfio_virqfd_disable(&ctx->unmask);
>  		vfio_virqfd_disable(&ctx->mask);
>  		free_irq(pdev->irq, ctx);
> +		if (!vdev->pci_2_3)
> +			irq_clear_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);
>  		if (ctx->trigger)
>  			eventfd_ctx_put(ctx->trigger);
>  		kfree(ctx->name);

As expected, I don't note any functional issues with this on x86.  I didn't
do a full statistical analysis, but I suspect this might slightly reduce
the mean interrupt rate (netperf TCP_RR) and increase the standard
deviation, though nothing sufficiently worrisome for a niche use case like
this.

Applied to vfio next branch for v6.18.  Thanks,

Alex
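For readers following the mechanism the patch relies on: IRQ_DISABLE_UNLAZY
tells the genirq core to mask the line at the interrupt controller
immediately when disable_irq_nosync() is called, instead of the default
lazy behavior of only marking the line disabled and masking it when the
next (unhandled) interrupt arrives.  Below is a minimal sketch of the same
set-before-request / clear-after-free pattern the patch applies; the
example_dev structure, example_handler, and can_mask_at_source flag are
hypothetical names invented for illustration and are not part of vfio.

#include <linux/interrupt.h>
#include <linux/irq.h>

/* Hypothetical device state, for illustration only. */
struct example_dev {
	unsigned int irq;
	bool can_mask_at_source;	/* e.g. PCI 2.3 DisINTx support */
};

static irqreturn_t example_handler(int irq, void *data)
{
	/*
	 * A real driver would quiesce or mask the device here; if it
	 * cannot, it may call disable_irq_nosync() and defer servicing,
	 * which is exactly where the unlazy disable matters.
	 */
	return IRQ_HANDLED;
}

static int example_request_intx(struct example_dev *edev)
{
	int ret;

	/*
	 * Force an immediate hardware mask on disable_irq_nosync()
	 * rather than the default lazy disable, because the device
	 * cannot stop asserting the level interrupt on its own.
	 */
	if (!edev->can_mask_at_source)
		irq_set_status_flags(edev->irq, IRQ_DISABLE_UNLAZY);

	/* Exclusive (non-shared) line, since the device cannot mask INTx. */
	ret = request_irq(edev->irq, example_handler, 0,
			  "example-intx", edev);
	if (ret && !edev->can_mask_at_source)
		irq_clear_status_flags(edev->irq, IRQ_DISABLE_UNLAZY);

	return ret;
}

static void example_free_intx(struct example_dev *edev)
{
	free_irq(edev->irq, edev);
	if (!edev->can_mask_at_source)
		irq_clear_status_flags(edev->irq, IRQ_DISABLE_UNLAZY);
}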
