This is an RFC posted for design feedback rather than merge.

Problem
-------

In split-irqchip mode, KVM unconditionally advertises x2APIC Suppress EOI
Broadcast (SEOIB) support to the guest. This is wrong in two ways:

  - IOAPIC v0x11 has no EOI register, so advertising SEOIB is incorrect.
  - Even with IOAPIC v0x20, KVM ignores the guest's suppression request
    and continues to broadcast LAPIC EOIs to the userspace IOAPIC.

This can cause interrupt storms in guests that rely on Directed EOI
semantics (e.g. Windows with Credential Guard, which hangs during boot).

KVM fix
-------

KVM now exposes two new x2APIC API flags to give userspace control:

  - KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST
  - KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST

This patch
----------

This patch wires those flags into QEMU via a machine-level field
(kvm_lapic_seoib_state) with three policy states:

  - SEOIB_STATE_QUIRKED (default): legacy behavior, no flags set
  - SEOIB_STATE_RESPECTED: advertise SEOIB and honor guest suppression
  - SEOIB_STATE_NOT_ADVERTISED: hide SEOIB from guest (for IOAPIC v0x11)

The current implementation automatically selects a policy based on IOAPIC
version at VM power-on time, and migrates the state as a VMState subsection.
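
To make the three states concrete, here is a minimal standalone sketch of
how a policy could be picked from the IOAPIC version and translated into
x2APIC API flags. It is not the code from the patch. The `SeoibState`
typedef, both helper names, and the flag bit positions are assumptions
made for illustration, as is the choice of RESPECTED for IOAPIC v0x20 and
later.

```c
#include <stdint.h>

/*
 * ASSUMPTION: placeholder bit positions, for illustration only. The real
 * values are defined by the kernel's KVM UAPI headers.
 */
#define KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST  (1ULL << 2)
#define KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST (1ULL << 3)

/* The three policy states described above (typedef name is hypothetical). */
typedef enum {
    SEOIB_STATE_QUIRKED,
    SEOIB_STATE_RESPECTED,
    SEOIB_STATE_NOT_ADVERTISED,
} SeoibState;

/*
 * Hypothetical helper: pick a policy from the IOAPIC version.
 * v0x11 has no EOI register, so SEOIB must not be advertised; v0x20 and
 * later are assumed here to map to RESPECTED.
 */
static SeoibState seoib_state_for_ioapic_version(uint8_t version)
{
    return version >= 0x20 ? SEOIB_STATE_RESPECTED
                           : SEOIB_STATE_NOT_ADVERTISED;
}

/* Hypothetical helper: translate a policy into x2APIC API flags. */
static uint64_t seoib_state_to_flags(SeoibState state)
{
    switch (state) {
    case SEOIB_STATE_RESPECTED:
        return KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST;
    case SEOIB_STATE_NOT_ADVERTISED:
        return KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST;
    case SEOIB_STATE_QUIRKED:
    default:
        return 0; /* legacy behavior: no flags set */
    }
}
```

QUIRKED maps to an empty flag set, which is why it is the only state that
leaves KVM untouched and can still be changed later.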

Design challenges
-----------------

The KVM x2APIC API is one-way: once a flag is set, the VM cannot return
to the quirked state (consistent with other x2APIC API flags). This has
several implications:

  - During incoming migration, we must defer setting the flags until after
    the SEOIB state is loaded from the migration stream, since we cannot
    know the source VM's policy in advance.

  - Snapshot restore (loadvm) is problematic: if the running VM has already
    set the enable or disable flag, restoring a QUIRKED snapshot cannot
    revert the KVM state. A loadvm issued through QMP/HMP makes this worse,
    since it happens at runtime and cannot be anticipated at init time.

  - The flags only take effect when kvm_apic_set_version() is called inside
    KVM, which happens from limited paths such as IOAPIC initialization or
    APIC state setting. This requires setting the flags before IOAPIC is
    initialized, necessitating fetching the IOAPIC version from global
    properties rather than from the initialized device.

  - Automatic policy selection (current approach) changes behavior for
    existing machine types without user opt-in, breaking migratability
    from a new QEMU to an older QEMU for the same machine type.

  - Machine-type gating (restricting the new behavior to 10.2+ machine
    types) helps with older machine types, but still breaks migration to a
    destination with an older kernel that lacks the SEOIB API, again
    without user opt-in.
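
The one-way constraint behind the migration and loadvm items can be
sketched as a small check that a load path might run before accepting an
incoming state. This is illustrative only: the `SeoibState` typedef and
the `seoib_transition_allowed()` helper are hypothetical. The sketch also
conservatively assumes that switching between the two non-quirked states
is impossible, while the only thing known for sure is that a return to
QUIRKED is not possible.

```c
#include <stdbool.h>

/* The three policy states (typedef name is hypothetical). */
typedef enum {
    SEOIB_STATE_QUIRKED,
    SEOIB_STATE_RESPECTED,
    SEOIB_STATE_NOT_ADVERTISED,
} SeoibState;

/*
 * Hypothetical check: can a VM whose KVM state is `current` adopt the
 * state `wanted` found in a migration stream or snapshot?
 *
 * - Keeping the same state is always fine.
 * - Leaving QUIRKED is fine, because no flag has been set in KVM yet.
 * - Any other change would require clearing a flag, which KVM does not
 *   allow (ASSUMPTION: this includes RESPECTED <-> NOT_ADVERTISED).
 */
static bool seoib_transition_allowed(SeoibState current, SeoibState wanted)
{
    return current == wanted || current == SEOIB_STATE_QUIRKED;
}
```

Under this model, a destination that defers setting flags until the
stream is loaded is still in QUIRKED and can accept any source policy,
whereas a running VM in RESPECTED cannot load a QUIRKED snapshot.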

Preferred direction
-------------------

Given the above, I am leaning toward a KVM accelerator property (for
example, `seoib-policy=respected|not-advertised|quirked`) with the default
being quirked.

This approach:

  - Requires explicit user opt-in, at the cost of the user having to
    understand and configure the property.
  - Works with any machine type, so no machine-type gating is needed.
  - Removes the need to fetch the IOAPIC version from globals.
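
As a rough illustration of the property side, the sketch below maps the
proposed `seoib-policy` values to the three states. It is standalone C
rather than QEMU's QOM property machinery, and the `SeoibState` typedef
and the `seoib_policy_parse()` helper are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The three policy states (typedef name is hypothetical). */
typedef enum {
    SEOIB_STATE_QUIRKED,
    SEOIB_STATE_RESPECTED,
    SEOIB_STATE_NOT_ADVERTISED,
} SeoibState;

/*
 * Hypothetical parser for the proposed seoib-policy property value.
 * Returns true and stores the state in *out on success; returns false
 * and leaves *out untouched for an unknown value.
 */
static bool seoib_policy_parse(const char *value, SeoibState *out)
{
    static const struct {
        const char *name;
        SeoibState state;
    } table[] = {
        { "quirked",        SEOIB_STATE_QUIRKED },
        { "respected",      SEOIB_STATE_RESPECTED },
        { "not-advertised", SEOIB_STATE_NOT_ADVERTISED },
    };

    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (strcmp(value, table[i].name) == 0) {
            *out = table[i].state;
            return true;
        }
    }
    return false;
}
```

With quirked as the default, a command line that never mentions the
property keeps today's behavior, so existing configurations stay
migratable.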

Note: the current implementation also does not handle the QMP/HMP
loadvm corner cases.

I would appreciate feedback on the preferred approach.

Changes in v2:
  - Update flags in patch description to match kernel naming.

Khushit Shah (1):
  target/i386/kvm: Configure proper KVM SEOIB behavior

 hw/i386/x86-common.c         | 99 ++++++++++++++++++++++++++++++++++++
 hw/i386/x86.c                |  1 +
 hw/intc/ioapic.c             |  2 -
 include/hw/i386/x86.h        | 12 +++++
 include/hw/intc/ioapic.h     |  2 +
 include/system/system.h      |  1 +
 system/vl.c                  |  5 ++
 target/i386/kvm/kvm.c        | 43 ++++++++++++++++
 target/i386/kvm/kvm_i386.h   | 12 +++++
 target/i386/kvm/trace-events |  4 ++
 10 files changed, 179 insertions(+), 2 deletions(-)

-- 
2.39.3

