On Tue, 2024-10-01 at 14:33 +0100, Daniel P. Berrangé wrote:
> > It looks like when interrupt remapping is enabled with an AMD CPU,
> > Windows *assumes* it can generate AMD-style MSI messages even if the
> > IOMMU is an Intel one. If we put a little hack into the IOMMU interrupt
> > remapping to make it interpret an AMD-style message, Windows seems to
> > boot at least a little bit further than it did before...
>
> Rather than filling the intel IOMMU impl with hacks to make Windows
> boot on AMD virtualized CPUs, shouldn't we steer people to use the
> amd-iommu that QEMU already ships [1] ?
No, because there's no way to disable *DMA* translation on that. We
absolutely don't want to offer guests another level of DMA translation
under their control, because of the performance and security
implications.

The way we implement 'dma-translation=off' for the Intel IOMMU is a bit
of a hack, disabling all three of the SAGAW bits which advertise support
for 3-level, 4-level or 5-level page tables, and thus leaving the guest
without *any* workable DMA page table setup. (I have asked Intel to
officially bless this trick, FWIW.)

Linux *used* to panic when it saw this, but I fixed that when I added
the 'dma-translation=off' support to QEMU. Windows always just quietly
refrained from using such an IOMMU for DMA translation, while still
using it for Interrupt Remapping. Which was the point.

> Even if we hack the intel iommu, so current Windows boots, can we
> have confidence that future Windows releases will correctly boot
> on an intel iommu with AMD CPUs virtualized ?

I'm not really proposing that we hack the Intel IOMMU like this; it's a
proof of concept to help understand the Windows bugs. And it *only*
works for interrupts generated by the I/O APIC anyway. For real PCI MSI,
Windows still generates an AMD-style remappable MSI message but
*doesn't* actually program it into the IOMMU's table! Probably because
in AMD mode, the IRTE indices are per-device rather than global.

For PCI MSI(-X) we're actually better off without an IOMMU, because
then we see a *different* Windows bug — it just puts the high bits of
the APIC ID into the high bits of the MSI address instead. Obviously
such an MSI *ought* to miss the APIC at 0x00000000FEExxxxx completely
and just scribble over guest memory, but we can cope with that, as I
showed in a later email.

At this point I'm just hacking around and trying to understand how
Windows behaves; until I do that, I don't have any concrete suggestions
for if/how we should attempt to support it.
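To make the SAGAW trick concrete: it amounts to clearing one 5-bit field
in the VT-d Capability Register. A minimal sketch, assuming the VT-d
layout where SAGAW lives in bits 12:8 of the capability register (the
macro and function names here are invented for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* VT-d Capability Register: the SAGAW (Supported Adjusted Guest
 * Address Widths) field lives in bits 12:8.  Within it, bit 1 means
 * 3-level, bit 2 means 4-level and bit 3 means 5-level page tables. */
#define VTD_CAP_SAGAW(cap)  (((cap) >> 8) & 0x1f)

/* With all SAGAW bits clear (QEMU's dma-translation=off trick) the
 * guest has no workable DMA page table format at all, so the IOMMU
 * is only usable for Interrupt Remapping. */
static bool vtd_dma_translation_usable(uint64_t cap_reg)
{
    return VTD_CAP_SAGAW(cap_reg) != 0;
}
```

A guest that checks this field before touching the IOMMU, as current
Windows and fixed Linux do, simply skips DMA translation.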
There is a Design Change Request open with Microsoft already, to fix
some of this and use the KVM/Xen/Hyper-V 15-bit MSI extension sensibly.
Hopefully they can fix it, and we won't have to worry too much about
what future Windows versions will do, because they'll be a bit saner.
In the meantime we're trying to work out if it's even possible to make
today's versions of Windows work, without having to give them DMA
translation support.

With `-cpu host,+hv-avic` and a hack in pci_msi_trigger() to handle the
erroneous high bits in the MSI address, I do have Windows Server 2022
booting. I'm not sure what would happen if it ever tried to target an
I/O APIC interrupt at a CPU above 255, though.

FWIW I *already* wanted to rewrite QEMU's MSI handling before we gained
TCG X2APIC support, and now I want to rewrite it even more, even without
this Windows nonsense.

We should have a *single* translation function which covers KVM and TCG,
which includes IOMMUs, Xen's PIRQ hack, the 15-bit MSI extension, this
Windows bug (if we want to support it), and which will allow the IOMMU
to know whether to deliver an IRQ fault event or not. And which handles
the cookies needed for IOMMU invalidation, which needs to kick eventfd
assignments out of the KVM irq routing table.

When a guest programs a PCI device's MSI-X table, this function should
be called with deliver_now==false. If the translation succeeds, yay! It
should be put into the KVM routing table and the VFIO eventfd should be
attached (which will allow posted interrupts to work). If the
translation fails, QEMU should just listen on the VFIO eventfd for
itself.

When an MSI happens in real time, either because a VFIO eventfd fires or
because an emulated PCI device calls pci_msi_trigger(), it calls the
same function with deliver_now==true. If an IOMMU lookup *still* fails,
that's when the IOMMU will actually raise a fault.
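That two-phase flow could look roughly like this. It's a hypothetical
sketch, not existing QEMU code; the translator here is a trivial
stand-in that only checks for the APIC window, where the real one would
chain the IOMMU, Xen PIRQ, 15-bit and Windows-bug handlers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the proposed unified MSI path; none of
 * these names exist in QEMU today. */
struct msi_msg {
    uint64_t address;
    uint32_t data;
};

/* Stand-in translator: merely accepts anything targeting the APIC
 * window at 0x00000000FEExxxxx. */
static bool translate_msi(const struct msi_msg *in, struct msi_msg *out)
{
    if ((in->address & ~0xfffffULL) != 0xfee00000ULL) {
        return false;
    }
    *out = *in;
    return true;
}

/* deliver_now == false: the guest just programmed an MSI-X entry; on
 * success the result goes into the KVM routing table and the VFIO
 * eventfd is attached (posted interrupts work); on failure QEMU keeps
 * listening on the eventfd itself.
 *
 * deliver_now == true: the MSI actually fired (emulated device or
 * VFIO eventfd); a failed lookup at this point is what should raise
 * an IOMMU fault event. */
static bool msi_translate(const struct msi_msg *in, struct msi_msg *out,
                          bool deliver_now)
{
    bool ok = translate_msi(in, out);

    if (!ok && deliver_now) {
        /* iommu_report_fault(in); -- hypothetical fault path */
    }
    return ok;
}
```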
That function allows us to collect all the various MSI format nonsense
in *one* place and handle it cleanly, converting to the KVM X2APIC MSI
format which both KVM *and* the TCG X2APIC implementation accept. It
would have a comment which looks something like this...

(Signed-off-by: David Woodhouse <d...@amazon.co.uk> in case anyone gets
around to such a rewrite before I do, and/or just wants to nab this and
put it somewhere useful)

/*
 * ===================
 * MSI MESSAGE FORMATS
 * ===================
 *
 * Message Signaled Interrupts are simply DMA transactions from the device.
 * It really is just "write <these> 32 bits <here> when you want attention."
 * The MSI (or MSI-X) message configured in the device is just the 64 bits
 * of the address to write to, and the 32 bits to write there.
 *
 * You can use this to do polled I/O by telling the device to write into a
 * data structure of your own choosing, then checking to see when it does
 * so.
 *
 * Or you can tell the device to poke at MMIO on *another* device, for
 * example when it's finished receiving a packet and it's time for the next
 * device to process that packet.
 *
 * Or — and this one is *actually* how it's expected to be used by sane
 * operating systems — you can point it at a special region of "physical
 * memory" which isn't actually memory; it's really an MMIO device which
 * can be used to trigger interrupts.
 *
 * That MMIO device is called the APIC, and on x86 machines it lives at
 * 0x00000000FEExxxxx in the physical memory space (the real one in host
 * physical space, and a virtual one in guest physical space).
 *
 * When the APIC receives a write transaction, it looks at the low 24 bits
 * of the address, and the 32 bits of data, and that conveys all the
 * information about which interrupt vector to raise on which CPU, and a
 * few more details besides. Some of those details include special cases
 * like cluster delivery modes and ways to deliver NMI/INIT/etc. which we
 * won't go into here.
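(Stepping outside the comment for a moment: whether a 64-bit write
address even falls in that APIC window is a simple mask compare. A
sketch, with an invented helper name:)

```c
#include <stdbool.h>
#include <stdint.h>

/* The APIC claims the physical window 0xFEE00000-0xFEEFFFFF.  Note
 * that any nonzero high 32 bits make a real memory transaction miss
 * the APIC entirely. */
static bool addr_is_apic_msi(uint64_t phys_addr)
{
    return (phys_addr & ~0xfffffULL) == 0xfee00000ULL;
}
```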
 *
 * In the beginning, there was only one way of doing this. This is what
 * Intel documentation now calls "Compatibility Format" (§5.1.2.1 of the
 * VT-d spec). It has 8 bits for the Destination APIC ID, which are in
 * bits 12-19 of the MSI address (i.e. the XX in 0xfeeXX...). The *vector*
 * to be raised on that CPU is in the low 8 bits of the data written to
 * that address.
 *
 *
 * Compatibility Format
 * --------------------
 *
 * Address: 1111.1110.1110.dddd.dddd.0000.0000.rmxx
 *             0xFEE      . Dest ID .   Rsvd  .↑↑↑
 *                                             ||Don't Care
 *                                             |Destination Mode
 *                                             Redirection Hint
 *
 * Data:    0000.0000.0000.0000.TL00.0DDD.vvvv.vvvv
 *                Reserved     .↑↑     ↑ .  Vector
 *                              ||     Delivery Mode
 *                              Trigger Mode, Trigger Mode Level
 *
 * Crucially, this format has only 8 bits for the Destination ID. Since
 * 0xFF is the broadcast address, this allows only up to 255 CPUs to be
 * supported.
 *
 * For many years the Reserved bits in bits 4-11 of the address were
 * labelled in some Intel documentation as "Extended Destination ID", but
 * never used.
 *
 *
 * I/O APIC Redirection Table Entries
 * ----------------------------------
 *
 * The I/O APIC is just a device for turning line-level interrupts into
 * MSI messages. Each pin on the I/O APIC has a Redirection Table Entry
 * (RTE) which configures the MSI message to be sent. The 64 bits of the
 * RTE include all the fields of the Compatibility Format MSI, including
 * the Extended Destination ID, but basically shuffled into a strange
 * order for historical reasons. Creating a Compatibility Format MSI from
 * an I/O APIC RTE is basically just a series of masks and shifts to move
 * the bits into the right place. Linux will compose an MSI message as
 * appropriate for the actual APIC or IOMMU in use (we'll get to those),
 * then just shuffle the bits around to program the I/O APIC RTE.
 *
 *
 * Intel "Remappable Format"
 * -------------------------
 *
 * When Intel started supporting more than 255 CPUs, the 8-bit limit in
 * what was not yet called "Compatibility Format" became a problem.
 * To support the full 32 bits of logical X2APIC IDs they had to come up
 * with another solution. Since MSIs are basically just a DMA write, the
 * logical place for this was the IOMMU, which already intercepts DMA
 * writes from devices. So they invented "Interrupt Remapping". The
 * "Remappable Format" MSI does not directly encode which vector to send
 * to which CPU; instead it just identifies an index into an IOMMU table
 * (the Interrupt Remapping Table).
 *
 * The Interrupt Remapping Table Entry (IRTE) contains all the information
 * which was once present in the MSI address+data, but allows for a full
 * 32 bits of destination ID. (It can also be used for posted interrupts,
 * delivering the interrupt *directly* to a vCPU in guest mode.)
 *
 * To signal a Remappable Format MSI, Intel used bit 4 of the MSI address,
 * which is the lowest of the bits which were previously labelled
 * "Extended Destination ID". With an Intel IOMMU doing Interrupt
 * Remapping, you can either submit Remappable Format MSIs, *or*
 * Compatibility Format, and the IOMMU will only actually remap the
 * former. (It can be told to block the latter, for security reasons.)
 *
 * Intel calls the IRTE index the "handle". There are some legacy
 * multi-MSI devices which can't be explicitly configured with a different
 * address/data for each interrupt, but just add one to the data for each
 * consecutive MSI vector they generate. This *used* to correspond to
 * consecutive IRQ vectors on the same CPU. To cope with this, Intel added
 * a "Subhandle" in the low bits of the data, which *optionally* adds
 * those bits to the handle extracted from the MSI address:
 *
 * Address: 1111.1110.1110.hhhh.hhhh.hhhh.hhh1.shxx
 *             0xFEE      .   Handle[14:0]    .↑↑↑
 *                                             ||Don't Care
 *                                             |Handle[15]
 *                                             Subhandle Valid (SHV)
 *
 * Data:    0000.0000.0000.0000.ssss.ssss.ssss.ssss
 *                Reserved     .
 *                               Subhandle (if SHV==1 in address)
 *
 * There is a slight complexity here for the I/O APIC, which doesn't
 * *just* shuffle the bits around to generate an MSI, but also handles
 * EOI of line-level interrupts (and has to re-raise the IRQ if the line
 * is actually still asserted). For that, the I/O APIC interprets the RTE
 * bits with their original "compatibility" meaning. All those bits
 * actually end up in the low bits of the MSI data, so the OS has to
 * program those bits accordingly even though it sets SHV=0, so they're
 * actually *ignored* when generating the interrupt.
 *
 *
 * AMD Remappable MSI
 * ------------------
 *
 * AMD's IOMMU is completely different to Intel's, and they didn't make
 * things anywhere near as complicated. When the IOMMU is enabled, a
 * device cannot send "Compatibility Format" MSIs any more, so there is
 * no need to tell one format from the other. AMD just used the low 11
 * bits of the data as the IRTE index, and nothing else matters.
 *
 * Address: 1111.1110.1110.xxxx.xxxx.xxxx.xxxx.xxxx
 *             0xFEE      .        Don't Care
 *
 * Data:    xxxx.xxxx.xxxx.xxxx.xxxx.xiii.iiii.iiii
 *                Don't Care        .  IRTE Index
 *
 * The reason for using only 11 bits of IRTE index is that, as described
 * above, the I/O APIC actually *does* care about bit 11 of the MSI data
 * (or, more accurately, it cares about the RTE bit which gets shuffled
 * into bit 11 of the MSI data). That's the original "Trigger Mode" bit,
 * which the I/O APIC needs in order to re-raise level-triggered
 * interrupts which are EOI'd while they're still asserted.
 *
 * Although the Intel IOMMU has a single Interrupt Remapping Table and a
 * single number space for IRTE indices across the whole system, the AMD
 * IOMMU has a table per device. This, sadly, becomes important later.
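(Briefly interrupting the comment: extracting the IRTE index from the
two remappable formats above is just bit fiddling. A sketch with
invented function names, using the bit positions from the diagrams:)

```c
#include <stdbool.h>
#include <stdint.h>

/* Intel: bit 4 of the MSI address distinguishes Remappable Format
 * from Compatibility Format. */
static bool intel_msi_is_remappable(uint32_t addr_lo)
{
    return (addr_lo >> 4) & 1;
}

/* Intel: Handle[14:0] in address bits 19:5, Handle[15] in bit 2,
 * plus the Subhandle from the data if SHV (bit 3) is set. */
static uint32_t intel_msi_irte_index(uint32_t addr_lo, uint32_t data)
{
    uint32_t handle = ((addr_lo >> 5) & 0x7fff) |
                      (((addr_lo >> 2) & 1) << 15);

    if ((addr_lo >> 3) & 1) {   /* SHV: add the Subhandle */
        handle += data & 0xffff;
    }
    return handle;
}

/* AMD: the IRTE index is simply the low 11 bits of the data. */
static uint32_t amd_msi_irte_index(uint32_t data)
{
    return data & 0x7ff;
}
```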
 *
 * The 15-bit MSI extension
 * ------------------------
 *
 * The problem with IOMMUs is that they were designed to support DMA
 * translation, and there is no architectural way to disable that and
 * offer guests an IOMMU which *only* supports Interrupt Remapping. We
 * really don't want guests doing DMA translation, as it has severe
 * performance and security implications.
 *
 * So KVM, Hyper-V and Xen all define a virt extension which uses 7 of
 * the original "Extended Destination ID" bits to give support for up to
 * 32768 virtual CPUs. (This extension avoids the low bit, which Intel
 * used to indicate Remappable Format.) This format is exactly like the
 * Compatibility Format, except that bits 5-11 of the MSI address are
 * used as bits 8-14 of the destination APIC ID:
 *
 * Address: 1111.1110.1110.dddd.dddd.DDDD.DDD0.rmxx
 *             0xFEE      . Dest ID . ExtDest .↑↑↑
 *                                             ||Don't Care
 *                                             |Destination Mode
 *                                             Redirection Hint
 *
 * Data:    0000.0000.0000.0000.TL00.0DDD.vvvv.vvvv
 *                Reserved     .↑↑     ↑ .  Vector
 *                              ||     Delivery Mode
 *                              Trigger Mode, Trigger Mode Level
 *
 *
 * Xen MSI → PIRQ mapping
 * ----------------------
 *
 * All of the above are implementable in real hardware. Actual external
 * PCI devices can perform memory transactions to addresses in the
 * physical address range 0x00000000FEExxxxx, which reach the APIC and
 * cause interrupts to be injected into the relevant CPU.
 *
 * But Xen guests know that they are running in a virtual machine. So
 * they know that the PCI config space is a complete fiction. For
 * example, if they set up a BAR of a given device with a certain
 * address, that is a *guest* physical address. The hypervisor probably
 * doesn't even change anything on the device itself; it just adjusts the
 * EPT page tables to make the corresponding BAR *appear* in the guest
 * physical address space at the desired location.
 *
 * MSI messages are similarly fictional.
 * The guest configures a PCI device with its own vCPU APIC ID and
 * vector, and the real hardware wouldn't know what to do with that.
 * (Well, we could design an IOMMU which *could* cope with that, let
 * guests write directly to the PCI devices' MSI tables, and use the
 * resulting MSIs for posted interrupts as a first-class citizen, but
 * nobody's done that.)
 *
 * In practice, what happens is that the hypervisor registers its *own*
 * handler for the interrupt in question (routing it to a given vector on
 * a given *host* CPU). When that host interrupt handler is triggered,
 * the hypervisor injects an interrupt to the guest vCPU accordingly.
 * Most hypervisors, including Xen and KVM, do *not* have a mechanism to
 * simply write to guest memory *instead* of injecting an interrupt. So
 * if the guest configured the MSI to target an address outside the
 * 0x00000000FEExxxxx range, it just gets dropped. (Boo, no DPDK
 * polled-mode implementations abusing MSIs for memory writes, in virt
 * guests!)
 *
 * This means that we can abuse the high 32 bits of the address even in a
 * guest-visible way, right? Nothing would ever go wrong...
 *
 * Xen was the first to do this. It needed a way to map MSIs from PCI
 * devices to a 'PIRQ', which is a form of Xen paravirtualised interrupt
 * which binds to Xen Event Channels. By using vector #0, Xen guests
 * indicate a special MSI message which is to be routed to a PIRQ. The
 * actual PIRQ# is then in the original Destination ID field... and the
 * high bits of the address.
 *
 * (We'll gloss over the way that Xen snoops on these even while masked,
 * and actually unmasks the MSI when the guest binds to the corresponding
 * PIRQ, because there's only so much pain I can inflict on the reader in
 * one sitting.)
 *
 * AddrHi:  DDDD.DDDD.DDDD.DDDD.DDDD.DDDD.0000.0000
 *                  PIRQ#[31-8]         .   Rsvd
 *
 * AddrLo:  1111.1110.1110.dddd.dddd.0000.0000.xxxx
 *             0xFEE      .PIRQ[7-0].  Rsvd   .Don't Care
 *
 * Data:    xxxx.xxxx.xxxx.xxxx.xxxx.xxxx.0000.0000
 *                   Don't Care          .
 *                                          Vector == 0
 *
 *
 * KVM X2APIC MSI API
 * ------------------
 *
 * KVM has an ioctl() for injecting MSI interrupts, and routing table
 * entries which cause MSIs to be injected to the guest when triggered.
 * For convenience, KVM originally just used the Compatibility Format MSI
 * message as its userspace ABI for configuring these. This got less
 * convenient when X2APIC came along and we needed an extra 24 bits for
 * the Destination ID.
 *
 * KVM's solution was to abuse the high 32 bits of the address. If this
 * were a true memory transaction, such a write would miss the APIC
 * completely and scribble over guest memory at an address like
 * 0x00000100FEExxxxx. But in this case it's just an ABI between KVM and
 * userspace, using bits which would otherwise be completely redundant.
 * KVM uses the high 24 bits of the MSI address (bits 40-63) as the high
 * 24 bits of the destination ID.
 *
 * AddrHi:  DDDD.DDDD.DDDD.DDDD.DDDD.DDDD.0000.0000
 *           Destination ID bits 8-31   .   Rsvd
 *
 * AddrLo:  1111.1110.1110.dddd.dddd.0000.0000.rmxx
 *             0xFEE      . Dest ID .   Rsvd  .↑↑↑
 *                                             ||Don't Care
 *                                             |Destination Mode
 *                                             Redirection Hint
 *
 * Data:    0000.0000.0000.0000.TL00.0DDD.vvvv.vvvv
 *                Reserved     .↑↑     ↑ .  Vector
 *                              ||     Delivery Mode
 *                              Trigger Mode, Trigger Mode Level
 *
 * This hack is not visible to a KVM guest. What a KVM guest programs
 * into the MSI descriptors of passthrough or emulated PCI devices is
 * completely different, and (at this point in our tale of woe, at least)
 * never sets the high 32 bits of the target address to anything but
 * zero.
 *
 *
 * IOMMU interrupts
 * ----------------
 *
 * Since an IOMMU is responsible for remapping interrupts so they can
 * reach CPUs with higher APIC IDs, how do we actually configure the
 * events from the IOMMU itself?
 *
 * Intel uses the same format as the KVM X2APIC API (which may actually
 * have been why KVM did it that way). Since it's never going to be an
 * actual memory transaction, it's safe to abuse the high bits of the
 * address.
 * Intel offers { Data, Address, Upper Address } registers for each type
 * of event that the IOMMU can generate for itself, with the high 24 bits
 * of the destination ID in the higher 24 bits of the address, as shown
 * above.
 *
 * AMD's IOMMU uses a completely different 64-bit register format (e.g.
 * the XT IOMMU General Interrupt Control Register) which doesn't pretend
 * very hard to look like an MSI at all. But it just happens to have the
 * DestMode at bit 2, like in the MSI address. And it just happens to
 * have the Vector and Delivery Mode (from the low 9 bits of the MSI
 * data) in the low 9 bits of its high word (bits 32-40 of the register).
 * And then it just throws the actual destination ID in around them in
 * some other bits:
 *
 * Low32:   dddd.dddd.dddd.dddd.dddd.dddd.xxxx.xmxx
 *           Destination ID [23-0]      . DC . ↑↑
 *                                              |Don't Care
 *                                              Destination Mode
 *
 * High32:  DDDD.DDDD.xxxx.xxxx.xxxx.xxxD.vvvv.vvvv
 *          DestId[31-24]              ↑.  Vector
 *                                     Delivery Mode
 *
 *
 * Windows, part 1: Intel IOMMU with no DMA translation
 * ----------------------------------------------------
 *
 * As noted above, the 15-bit extension was invented to avoid the need
 * for an IOMMU, because it is undesirable to offer a virtual IOMMU to
 * guests with support for them to do their own additional level of DMA
 * translation.
 *
 * However, although Hyper-V exposes the 15-bit MSI feature, Windows as a
 * guest OS does not use it. In order to support Windows guests with more
 * than 255 vCPUs, a hack was found for the Intel IOMMU. Although there
 * is no official way to advertise that the IOMMU does not support DMA
 * translation, there *are* "Supported Adjusted Guest Address Width" bits
 * which advertise the ability to use 3-level, 4-level, and/or 5-level
 * page tables. If Windows encounters an IOMMU which sets *none* of these
 * bits, Windows will quietly refrain from attempting to use that IOMMU
 * for DMA translation, but will still use it for Interrupt Remapping.
 *
 * However, this only works correctly if Windows is running on an Intel
 * CPU.
 *
 * When Windows runs on an AMD CPU, it will happily configure and use the
 * Intel IOMMU, but misconfigures the MSI messages that it programs into
 * the devices. For I/O APIC interrupts, Windows programs the IRTE in the
 * Intel IOMMU correctly... but then configures the I/O APIC using the
 * AMD format (with the IRTE index where the vector would have been). A
 * hack to the virtual Intel IOMMU emulation can make it cope with this
 * bug... but sadly it *only* works for I/O APIC interrupts. For actual
 * PCI MSI, Windows still configures the device with an AMD-style
 * remappable MSI but *doesn't* actually configure the IRTE in the IOMMU
 * at all. This is probably because Intel's IRT is system-wide, while AMD
 * has one per device; Windows does seem to think it's using a separate
 * IRTE space, so the first MSI vector gets IRTE index 0, which conflicts
 * with I/O APIC pin 0.
 *
 * So for PCI, the hypervisor has no idea where Windows intended a given
 * MSI to be routed, and cannot work around the Windows bugs to support
 * >255 AMD vCPUs this way.
 *
 *
 * Windows, part 2: No IOMMU
 * -------------------------
 *
 * If we do *not* offer an IOMMU to a Windows guest which has CPUs with
 * high APIC IDs, we encounter a *different* Windows bug, which is easier
 * to work around. Windows doesn't use the 15-bit extension described
 * above, but it *does* just throw the high bits of the destination ID
 * into bits 32-55 of the MSI address.
 *
 * Done without negotiation or discovery of any hypervisor feature, this
 * arguably ought to cause the device to write to an address in guest
 * *memory* and miss the APIC at 0x00000000FEExxxxx altogether, but we
 * already admitted almost no hypervisors actually *do* that. (QEMU is
 * the exception here, because for *emulated* PCI devices,
 * pci_msi_trigger() does actually generate true write cycles in the
 * corresponding DMA address space.)
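(One more hop out of the comment: normalizing such a message is just a
matter of moving the stray destination bits, 32-55 of the address, up to
where the KVM X2APIC ABI expects them, bits 40-63. A hypothetical sketch
of the kind of fixup a pci_msi_trigger() hack can do; function names are
invented:)

```c
#include <stdint.h>

/* Windows (>255 APIC IDs, no IOMMU) puts destination ID bits 8-31
 * into MSI address bits 32-55.  The KVM X2APIC ABI wants them in
 * bits 40-63, so normalizing is just an 8-bit shift of the high
 * word. */
static uint64_t windows_msi_addr_to_kvm(uint64_t addr)
{
    uint64_t dest_hi = (addr >> 32) & 0xffffff;   /* dest ID bits 8-31 */

    return (addr & 0xffffffffULL) | (dest_hi << 40);
}

/* Recover the full 32-bit destination ID from a KVM-format address:
 * bits 0-7 from address bits 12-19, bits 8-31 from bits 40-63. */
static uint32_t kvm_msi_addr_dest(uint64_t addr)
{
    return (uint32_t)(((addr >> 12) & 0xff) |
                      (((addr >> 40) & 0xffffff) << 8));
}
```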
 *
 * We can cope with this Windows bug and even use it to our advantage, by
 * spotting the high bits in the MSI address and using them. It does
 * require that we have an API which is specifically for *MSI*, not to be
 * conflated with actual DMA writes. So QEMU's pci_msi_trigger() would
 * have to do things differently. But let's pretend, for the sake of
 * argument, that I'm typing this C-comment essay into a VMM other than
 * QEMU, which already does think that way and has a cleaner separation
 * of emulated-PCI vs. the VFIO or true emulation which can back it, and
 * *does* always handle MSIs explicitly.
 *
 * In that case, all the translation function has to do, in addition to
 * invoking all the IOMMU and Xen and 15-bit translators as QEMU's
 * kvm_arch_fixup_msi_route() function already does, is add one more
 * trivial special case. This format is the same as the KVM X2APIC API
 * format, with the top 32 bits of the address shifted by 8 bits:
 *
 * AddrHi:  0000.0000.DDDD.DDDD.DDDD.DDDD.DDDD.DDDD
 *            Rsvd   .  Destination ID bits 8-31
 *
 * AddrLo:  1111.1110.1110.dddd.dddd.0000.0000.rmxx
 *             0xFEE      . Dest ID .   Rsvd  .↑↑↑
 *                                             ||Don't Care
 *                                             |Destination Mode
 *                                             Redirection Hint
 *
 * Data:    0000.0000.0000.0000.TL00.0DDD.vvvv.vvvv
 *                Reserved     .↑↑     ↑ .  Vector
 *                              ||     Delivery Mode
 *                              Trigger Mode, Trigger Mode Level
 */
bool arch_translate_msi_message(struct kvm_irq_routing_entry *re,
                                const struct kvm_msi *in,
                                uint64_t *cookie, bool deliver_now)
{