On Sat, 2024-09-28 at 15:59 +0100, David Woodhouse wrote:
> On Tue, 2024-07-02 at 05:17 +0000, Sandesh Patel wrote:
> > 
> > The error is due to invalid MSIX routing entry passed to KVM.
> > 
> > The VM boots fine if we attach a vIOMMU but adding a vIOMMU can
> > potentially result in IO performance loss in guest.
> > I was interested to know if someone could boot a large Windows VM by
> > some other means like kvm-msi-ext-dest-id.
> 
> I think I may (with Alex Graf's suggestion) have found the Windows bug
> with Intel IOMMU.
> 
> It looks like when interrupt remapping is enabled with an AMD CPU,
> Windows *assumes* it can generate AMD-style MSI messages even if the
> IOMMU is an Intel one. If we put a little hack into the IOMMU interrupt
> remapping to make it interpret an AMD-style message, Windows seems to
> boot at least a little bit further than it did before...
Sadly, Windows has *more* bugs than that.

The previous hack extracted the Interrupt Remapping Table Entry (IRTE)
index from an AMD-style MSI message, and looked it up in the Intel
IOMMU's IR Table. That works... for the MSIs generated by the I/O APIC.

However... in the Intel IOMMU model, there is a single global IRT, and
each entry specifies which devices are permitted to invoke it. The AMD
model is slightly nicer, in that it allows a per-device IRT.

So for a PCI device, Windows just seems to configure each MSI vector in
order, with IRTE #0, #1, onwards. Because it's a per-device number
space, right? Which means that the first MSI vector on a PCI device
gets aliased to IRQ #0 on the I/O APIC.

I dumped the whole IRT, and it isn't just that Windows is using the
wrong index; it hasn't even set up the correct destination in *any* of
the entries. So we can't even do a nasty trick like scanning and
finding the Nth entry which is valid for a particular source-id.

Happily, Windows has *more* bugs than that... if I run with
`-cpu host,+hv-avic' then it puts the high bits of the target APIC ID
into the high bits of the MSI address. This *ought* to mean that MSIs
from devices miss the APIC (at 0x00000000FEExxxxx) and scribble over
guest memory at addresses like 0x1FEE00004. But we can add yet
*another* hack to catch that. For now I just hacked it to move the low
7 extra bits into the "right" place for the 15-bit extension:

--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -361,6 +361,14 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg)
         return;
     }
     attrs.requester_id = pci_requester_id(dev);
+    printf("Send MSI 0x%lx/0x%x from 0x%x\n", msg.address, msg.data, attrs.requester_id);
+    if (msg.address >> 32) {
+        uint64_t ext_id = msg.address >> 32;
+        msg.address &= 0xffffffff;
+        msg.address |= ext_id << 5;
+        printf("Now 0x%lx/0x%x with ext_id %lx\n", msg.address, msg.data, ext_id);
+    }
+
     address_space_stl_le(&dev->bus_master_as, msg.address, msg.data, attrs, NULL);

We also need to stop forcing Windows to use logical mode, and force it
to use physical mode instead:

--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -158,7 +158,7 @@ static void init_common_fadt_data(MachineState *ms, Object *o,
              * used
              */
             ((ms->smp.max_cpus > 8) ?
-                (1 << ACPI_FADT_F_FORCE_APIC_CLUSTER_MODEL) : 0),
+                (1 << ACPI_FADT_F_FORCE_APIC_PHYSICAL_DESTINATION_MODE) : 0),
         .int_model = 1 /* Multiple APIC */,
         .rtc_century = RTC_CENTURY,
         .plvl2_lat = 0xfff /* C2 state not supported */,

So now, with *no* IOMMU configured, Windows Server 2022 is booting and
using CPUs > 255:

Send MSI 0x1fee01000/0x41b0 from 0xfa
Now 0xfee01020/0x41b0 with ext_id 1

That trick obviously can't work for the I/O APIC, but I haven't managed
to persuade Windows to target I/O APIC interrupts at any CPU other than
#0 yet. I'm trying to make QEMU run with *only* higher APIC IDs, to
test. It may be that we need to advertise an Intel IOMMU that *only*
has the I/O APIC behind it, and all the actual PCI devices are direct,
so we can abuse that last Windows bug.
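As an aside, for anyone who wants to play with the bit-twiddling
outside QEMU, here's a minimal standalone C sketch of the address
rewrite the pci.c hack above performs. It assumes the KVM/Hyper-V
15-bit extended destination ID layout (APIC ID bits 14:8 carried in
MSI address bits 11:5); the helper name is made up for illustration
and isn't a QEMU function:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative helper (not QEMU code): fold the extra APIC ID bits
 * that Windows puts above bit 32 of the MSI address back into the
 * 15-bit extended destination ID format, where destination ID bits
 * 14:8 live in MSI address bits 11:5.
 */
static uint64_t fold_ext_dest_id(uint64_t addr)
{
    uint64_t ext_id = addr >> 32;   /* APIC ID bits above 7:0 */

    if (ext_id) {
        addr &= 0xffffffffULL;      /* back into the 0xFEExxxxx window */
        addr |= ext_id << 5;        /* extra bits into address 11:5 */
    }
    return addr;
}

int main(void)
{
    /* The example from the log above: target APIC ID 0x101 (CPU 257). */
    uint64_t addr = 0x1fee01000ULL;

    printf("0x%" PRIx64 " -> 0x%" PRIx64 "\n", addr, fold_ext_dest_id(addr));
    return 0;
}

That prints "0x1fee01000 -> 0xfee01020": destination ID 0x01 in address
bits 19:12, plus ext_id 1 in bits 11:5, matching the "Now 0xfee01020
... with ext_id 1" line above.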