On Mon, 30 Sep 2024 16:50:21 +0100
David Woodhouse <dw...@infradead.org> wrote:

> On Sat, 2024-09-28 at 15:59 +0100, David Woodhouse wrote:
> > On Tue, 2024-07-02 at 05:17 +0000, Sandesh Patel wrote:  
> > > 
> > > The error is due to an invalid MSI-X routing entry being passed to KVM.
> > > 
> > > The VM boots fine if we attach a vIOMMU, but adding a vIOMMU can
> > > potentially result in I/O performance loss in the guest.
> > > I was interested to know if someone could boot a large Windows VM by
> > > some other means, like kvm-msi-ext-dest-id.  
> > 
> > I think I may (with Alex Graf's suggestion) have found the Windows bug
> > with Intel IOMMU.
> > 
> > It looks like when interrupt remapping is enabled with an AMD CPU,
> > Windows *assumes* it can generate AMD-style MSI messages even if the
> > IOMMU is an Intel one. If we put a little hack into the IOMMU interrupt
> > remapping to make it interpret an AMD-style message, Windows seems to
> > boot at least a little bit further than it did before...  
> 
> Sadly, Windows has *more* bugs than that.
> 
> The previous hack extracted the Interrupt Remapping Table Entry (IRTE)
> index from an AMD-style MSI message, and looked it up in the Intel
> IOMMU's IR Table.
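> 
> For the curious, a minimal sketch of what that lookup hack amounts to.
> This is my own illustration, not the actual patch; the helper name is
> made up, and the bit layouts are as I read the two specs (ignoring the
> Intel subhandle), so worth double-checking:
> 
> #include <stdint.h>
> 
> /* On AMD, a remappable MSI carries the per-device IRTE index in the
>  * low bits of the MSI *data* (data[10:0]). In the Intel remappable
>  * format, address bit 4 is set and the handle lives in address bits
>  * 19:5, with handle[15] in address bit 2. */
> static uint16_t irte_index_from_msi(uint64_t addr, uint32_t data)
> {
>     if (addr & (1 << 4)) {
>         /* Intel remappable format */
>         return ((addr >> 5) & 0x7fff) | (((addr >> 2) & 1) << 15);
>     }
>     /* Assume an AMD-style message: index taken from the data field */
>     return data & 0x7ff;
> }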
> 
> That works... for the MSIs generated by the I/O APIC.
> 
> However... in the Intel IOMMU model, there is a single global IRT, and
> each entry specifies which devices are permitted to invoke it. The AMD
> model is slightly nicer, in that it allows a per-device IRT.
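> 
> Schematically, the two schemes look something like this (field names
> abbreviated from the respective specs; these are illustrative structs,
> not real QEMU ones):
> 
> #include <stdint.h>
> 
> /* Intel: one global table, indexed by the handle in the MSI address.
>  * Each IRTE carries a source-id saying which requester may use it. */
> struct intel_irte {
>     uint32_t vector, dest_id;
>     uint16_t sid;             /* requester validation lives per-entry */
> };
> 
> /* AMD: the Device Table Entry points at a private interrupt table,
>  * so index 0 means something different for every device. */
> struct amd_dte {
>     uint64_t int_table_root;  /* per-device IRT base */
> };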
> 
> So for a PCI device, Windows just seems to configure each MSI vector in
> order, with IRTE #0, #1, and onwards. Because it's a per-device number
> space, right? Which means that the first MSI vector on a PCI device gets
> aliased to IRQ #0 on the I/O APIC.
> 
> I dumped the whole IRT, and it isn't just that Windows is using the
> wrong index; it hasn't even set up the correct destination in *any* of
> the entries. So we can't even do a nasty trick like scanning and
> finding the Nth entry which is valid for a particular source-id.
> 
> Happily, Windows has *more* bugs than that... if I run with
> `-cpu host,+hv-avic' then it puts the high bits of the target APIC ID
> into the high bits of the MSI address. This *ought* to mean that MSIs
> from devices miss the APIC (at 0x00000000FEExxxxx) and scribble over
> guest memory at addresses like 0x1FEE00004. But we can add yet
> *another* hack to catch that. For now I just hacked it to move the low
> 7 extra bits into the "right" place for the 15-bit extension.
> 
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -361,6 +361,14 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg)
>          return;
>      }
>      attrs.requester_id = pci_requester_id(dev);
> +    printf("Send MSI 0x%lx/0x%x from 0x%x\n", msg.address, msg.data, attrs.requester_id);
> +    if (msg.address >> 32) {
> +        uint64_t ext_id = msg.address >> 32;
> +        msg.address &= 0xffffffff;
> +        msg.address |= ext_id << 5;
> +        printf("Now 0x%lx/0x%x with ext_id %lx\n", msg.address, msg.data, ext_id);
> +    }
> +
>      address_space_stl_le(&dev->bus_master_as, msg.address, msg.data,
>                           attrs, NULL);
>  }
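> 
> To spell out the ext_id << 5 above: in the 15-bit extension scheme the
> extra destination bits live in MSI address bits 11:5, alongside the
> classic 8-bit destination in bits 19:12. A sketch of the decode on the
> consuming side (my illustration, helper name made up):
> 
> #include <stdint.h>
> 
> static uint32_t msi_ext_dest_apic_id(uint64_t addr)
> {
>     uint32_t dest_lo = (addr >> 12) & 0xff; /* classic 8-bit dest ID */
>     uint32_t dest_hi = (addr >> 5) & 0x7f;  /* extension: bits 14:8 */
>     return (dest_hi << 8) | dest_lo;
> }
> 
> So the 0xfee01020 message in the log below decodes to APIC ID 0x101,
> i.e. CPU 257.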
> 
> We also need to stop forcing Windows into logical (cluster) mode, and
> force it to use physical destination mode instead, since only physical
> destinations can carry APIC IDs above 255:
> 
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -158,7 +158,7 @@ static void init_common_fadt_data(MachineState *ms, Object *o,
>               * used
>               */
>              ((ms->smp.max_cpus > 8) ?
> -                        (1 << ACPI_FADT_F_FORCE_APIC_CLUSTER_MODEL) : 0),
> +                        (1 << ACPI_FADT_F_FORCE_APIC_PHYSICAL_DESTINATION_MODE) : 0),
>          .int_model = 1 /* Multiple APIC */,
>          .rtc_century = RTC_CENTURY,
>          .plvl2_lat = 0xfff /* C2 state not supported */,
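> 
> For reference, the two macros are bit positions in the FADT fixed
> feature flags (as I read the ACPI spec; worth double-checking):
> 
> #define ACPI_FADT_F_FORCE_APIC_CLUSTER_MODEL              18
> #define ACPI_FADT_F_FORCE_APIC_PHYSICAL_DESTINATION_MODE  19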
> 
> 
> So now, with *no* IOMMU configured, Windows Server 2022 is booting and
> using CPUs > 255:
>   Send MSI 0x1fee01000/0x41b0 from 0xfa
>   Now 0xfee01020/0x41b0 with ext_id 1
> 
> That trick obviously can't work for the I/O APIC, but I haven't managed
> to persuade Windows to target I/O APIC interrupts at any CPU other than
> #0 yet. I'm trying to make QEMU run with *only* higher APIC IDs, to
> test.
> 
> It may be that we need to advertise an Intel IOMMU that *only* has the
> I/O APIC behind it, and all the actual PCI devices are direct, so we
> can abuse that last Windows bug.

It's interesting as an experiment, to prove that Windows is riddled with bugs.
(Well, it could also serve as a starting point for reporting the issue to MS.)
But I'd rather Microsoft fixed these bugs on their side, instead of us putting
hacks in QEMU.

PS:
Given it's an AMD CPU, I doubt very much that using intel_iommu would be
accepted by Microsoft as a valid complaint, though.

