----- Original Message -----
> From: "Alex Williamson" <[email protected]>
> To: "Timothy Pearson" <[email protected]>
> Cc: "kvm" <[email protected]>, "linuxppc-dev" 
> <[email protected]>
> Sent: Friday, September 19, 2025 1:56:03 PM
> Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI 
> devices

> On Tue, 9 Sep 2025 15:48:46 -0500 (CDT)
> Timothy Pearson <[email protected]> wrote:
> 
>> PCI devices prior to PCI 2.3 both use level interrupts and do not support
>> interrupt masking, leading to a failure when passed through to a KVM guest on
>> at least the ppc64 platform, which does not utilize the resample IRQFD. This
>> failure manifests as receiving and acknowledging a single interrupt in the 
>> guest
>> while leaving the host physical device VFIO IRQ pending.
>> 
>> Level interrupts in general require special handling due to their inherently
>> asynchronous nature; both the host and guest interrupt controller need to
>> remain in synchronization in order to coordinate mask and unmask operations.
>> When lazy IRQ masking is used on DisINTx- hardware, the following sequence
>> occurs:
>>
>>  * Level IRQ assertion on host
>>  * IRQ trigger within host interrupt controller, routed to VFIO driver
>>  * Host EOI with hardware level IRQ still asserted
>>  * Software mask of interrupt source by VFIO driver
>>  * Generation of event and IRQ trigger in KVM guest interrupt controller
>>  * Level IRQ deassertion on host
>>  * Guest EOI
>>  * Guest IRQ level deassertion
>>  * Removal of software mask by VFIO driver
>> 
>> Note that no actual state change occurs within the host interrupt controller,
>> unlike what would happen with either DisINTx+ hardware or message interrupts.
>> The host EOI is not fired with the hardware level IRQ deasserted, and the
>> level interrupt is not re-armed within the host interrupt controller, leading
>> to an unrecoverable stall of the device.
>> 
>> Work around this by disabling lazy IRQ masking for DisINTx- INTx devices.
> 
> I'm not really following here.  It's claimed above that no actual state
> change occurs within the host interrupt controller, but that's exactly
> what disable_irq_nosync() intends to do, mask the interrupt line at the
> controller.

While it seems that way on the surface (and this tripped me up originally), the 
actual call chain is:

disable_irq_nosync()
__disable_irq_nosync()
__disable_irq()
irq_disable()

Inside void irq_disable(), __irq_disable() is gated on 
irq_settings_disable_unlazy().  The lazy disable is intended to *not* touch the 
interrupt controller itself, instead lazy mode masks the interrupt at the 
device level (DisINT+ registers).  If the IRQ is set up to run in lazy mode, 
the interrupt is not disabled at the actual interrupt controller by 
disable_irq_nosync().

> The lazy optimization that's being proposed here should
> only change the behavior such that the interrupt is masked at the call
> to disable_irq_nosync() rather than at a subsequent re-assertion of the
> interrupt.  In any case, enable_irq() should mark the line enabled and
> reenable the controller if necessary.

If the interrupt was not disabled at the controller, then reenabling a level 
interrupt is not guaranteed to actually do anything (although it *might*).  The 
hardware in the interrupt controller will still "see" an active level assert 
for which it fired an interrupt without a prior acknowledge (or disable/enable 
cycle) from software, and can then decide to not re-raise the IRQ on a 
platform-specific basis.

The key here is that the interrupt controllers differ somewhat in behavior 
across various architectures.  On POWER, the controller will only raise the 
external processor interrupt once for each level interrupt when that interrupt 
changes state to asserted, and will only re-raise the external processor 
interrupt once an acknowledge for that interrupt has been sent to the interrupt 
controller hardware while the level interrupt is deasserted.  As a result, if 
the interrupt handler executes (acknowledging the interrupt), but does not 
first clear the interrupt on the device itself, the interrupt controller will 
never re-raise that interrupt -- from its perspective, it has issued another 
IRQ (because the device level interrupt was left asserted) and the associated 
handler has never completed.  Disabling the interrupt causes the controller to 
reassert the interrupt if the level interrupt is still asserted when the 
interrupt is reenabled at the controller level.

On other platforms the external processor interrupt itself is disabled until 
the interrupt handler has finished, and the controller doesn't auto-mask the 
level interrupts at the hardware level; instead, it will happily re-assert the 
processor interrupt if the interrupt was not cleared at the device level after 
IRQ acknowledge.  I suspect on those platforms this bug may be masked at the 
expense of a bunch of "spurious" / unwanted interrupts if the interrupt handler 
hasn't acked the interrupt at the device level; as long as the guest interrupt 
handler is able to somewhat rapidly clear the device interrupt, performance 
won't be impacted too much by the extra interrupt load, further hiding the bug 
on these platforms.

Again, this is also a specific and unusual case of an old level-driven 
interrupt device that doesn't support interrupt masking at the device level 
(i.e. the device is DisINT-), in combination with the VFIO driver.  Under that 
*specific* use case, the VFIO driver purposefully acnowledges the interrupt 
without first clearing the interrupt on the device, which then exposes the 
platform-specific differences in interrupt controller behavior.

> Also, contrary to above, when a device supports DisINT+ we're not
> manipulating the host controller.  We're able to mask the interrupt at
> the device.  MSI is edge triggered, we don't mask it, so it's not
> relevant to this discussion afaict.

That's correct -- I don't recall saying the opposite. ;)  If I did, I apologize.

> There may be good reason to disable the lazy masking behavior as you're
> proposing, but I'm not able to glean it from this discussion of the
> issue.

Does this help clarify anything?

Reply via email to