Hi Jean,

On 1/25/24 19:48, Jean-Philippe Brucker wrote:
> Hi,
>
> On Thu, Jan 18, 2024 at 10:43:55AM +0100, Eric Auger wrote:
>> Hi Zhenzhong,
>> On 1/18/24 08:10, Duan, Zhenzhong wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.au...@redhat.com>
>>>> Cc: m...@redhat.com; c...@redhat.com
>>>> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling
>>>> for hotplugged devices
>>>>
>>>> In [1] we attempted to fix a case where a VFIO-PCI device protected
>>>> with a virtio-iommu was assigned to an x86 guest. On x86 the physical
>>>> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
>>>> virtio-iommu used to expose a 64b address space by default.
>>>> Hence the guest was trying to use the full 64b space and we hit
>>>> DMA MAP failures. To work around this issue we managed to pass
>>>> usable IOVA regions (excluding the out-of-range space) from VFIO
>>>> to the virtio-iommu device. This was made feasible by introducing
>>>> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
>>>> The latter gets called when the IOMMU MR is enabled, which causes
>>>> vfio_listener_region_add() to be called.
>>>>
>>>> However with VFIO-PCI hotplug, this technique fails due to the
>>>> race between the call to the callback in the add memory listener
>>>> and the virtio-iommu probe request. Indeed the probe request gets
>>>> called before the attach to the domain. So in that case the usable
>>>> regions are communicated after the probe request and fail to be
>>>> conveyed to the guest. To be honest the problem was hinted at by
>>>> Jean-Philippe in [1] and I should have listened to him more
>>>> carefully and tested with hotplug :-(
>>> It looks like the global virtio_iommu_config.bypass is never cleared
>>> in the guest. When the guest virtio_iommu driver enables the IOMMU,
>>> should it clear this bypass attribute?
>>> If it could be cleared in viommu_probe(), then qemu would call
>>> virtio_iommu_set_config() and then virtio_iommu_switch_address_space_all()
>>> to enable the IOMMU MR. Then both coldplugged and hotplugged devices
>>> would work.
>> This field is iommu wide, whereas the probe applies to a single device.
>> In general I would prefer not to depend on the MR enablement. We know
>> that the device is likely to be protected and we can collect its
>> requirements beforehand.
>>
>>> Intel iommu has a similar bit in the GCMD_REG.TE register. When the
>>> guest intel_iommu driver sets it during probe, on the qemu side
>>> vtd_address_space_refresh_all() is called to enable the IOMMU MRs.
>> interesting.
>>
>> Would be curious to get Jean-Philippe's pov.
> I'd rather not rely on this, it's hard to justify a driver change based
> only on QEMU internals. And QEMU can't count on the driver always clearing
> bypass. There could be situations where the guest can't afford to do it,
> like if an endpoint is owned by the firmware and has to keep running.
>
> There may be a separate argument for clearing bypass. With a coldplugged
> VFIO device the flow is:
>
> 1. Map the whole guest address space in VFIO to implement boot-bypass.
>    This allocates all guest pages, which takes a while and is wasteful.
>    I've actually crashed a host that way, when spawning a guest with too
>    much RAM.

interesting

> 2. Start the VM
> 3. When the virtio-iommu driver attaches a (non-identity) domain to the
>    assigned endpoint, then unmap the whole address space in VFIO, and most
>    pages are given back to the host.
>
> We can't disable boot-bypass because the BIOS needs it.
> But instead the flow could be:
>
> 1. Start the VM, with only the virtual endpoints. Nothing to pin.
> 2. The virtio-iommu driver disables bypass during boot

We needed this boot-bypass mode for booting with a virtio-blk-scsi device
protected with virtio-iommu, for instance. That was needed because we don't
have any virtio-iommu driver in edk2, as opposed to the intel iommu driver,
right?

> 3. Hotplug the VFIO device. With bypass disabled there is no need to pin
>    the whole guest address space, unless the guest explicitly asks for an
>    identity domain.
>
> However, I don't know if this is a realistic scenario that will actually
> be used.
>
> By the way, do you have an easy way to reproduce the issue described here?
> I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux
> just allocates 32-bit IOVAs.

I don't have a simple generic reproducer. It happens when assigning this
device: Ethernet Controller E810-C for QSFP (Ethernet Network Adapter
E810-C-Q2).
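For completeness, the setup is roughly the following; the host BDF and the
root port properties below are only placeholders and need to be adapted to
the actual system:

  qemu-system-x86_64 -M q35 \
    -device virtio-iommu \
    -device pcie-root-port,id=root0,slot=0 \
    ...

and the VFIO device is then hotplugged from the monitor:

  device_add vfio-pci,host=3b:00.0,id=vfio0,bus=root0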
I have not encountered that issue with another device yet. I see on the
guest side in dmesg:

[ 6.849292] ice 0000:00:05.0: Using 64-bit DMA addresses

That's emitted in dma-iommu.c iommu_dma_alloc_iova(). Looks like the guest
first tries to allocate an IOVA in the 32-bit AS and, if this fails, uses
the whole dma_limit. Seems the 32b IOVA allocation failed here ;-)

Thanks

Eric

>
>>>> For coldplugged devices the technique works because we make sure all
>>>> the IOMMU MRs are enabled once machine init is done: 94df5b2180
>>>> ("virtio-iommu: Fix 64kB host page size VFIO device assignment")
>>>> for the granule freeze. But I would be keen to get rid of this trick.
>>>>
>>>> Using an IOMMU MR Ops is impractical because this relies on the IOMMU
>>>> MR having been enabled and the corresponding vfio_listener_region_add()
>>>> having been executed. Instead this series proposes to replace the usage
>>>> of this API with the recently introduced PCIIOMMUOps: ba7d12eb8c
>>>> ("hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps"). That way, the
>>>> callback can be called earlier, once the usable IOVA regions have been
>>>> collected by VFIO, without the need for the IOMMU MR to be enabled.
>>>>
>>>> This looks cleaner. In the short term this may also be used for
>>>> passing the page size mask, which would allow us to get rid of the
>>>> hacky transient IOMMU MR enablement mentioned above.
>>>>
>>>> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
>>>> https://lore.kernel.org/all/20231019134651.842175-1-eric.au...@redhat.com/
>>>>
>>>> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
>>>>
>>>> Extra Notes:
>>>> With that series, the reserved memory regions are communicated in time
>>>> so that the virtio-iommu probe request grabs them. However this is not
>>>> sufficient. In some cases (my case), I still see some DMA MAP failures
>>>> and the guest keeps on using IOVA ranges outside the geometry of the
>>>> physical IOMMU. This is due to the fact that the VFIO-PCI device is in
>>>> the same iommu group as the pcie root port. Normally the kernel
>>>> iova_reserve_iommu_regions() (dma-iommu.c) is supposed to call
>>>> reserve_iova() for each reserved IOVA, which carves them out of the
>>>> allocator. When iommu_dma_init_domain() gets called for the hotplugged
>>>> vfio-pci device the iova domain is already allocated and set, and we
>>>> don't call iova_reserve_iommu_regions() again for the vfio-pci device.
>>>> So its corresponding reserved regions are not properly taken into
>>>> account.
>>> I suspect there is the same issue with coldplugged devices. If those
>>> devices are in the same group, iova_reserve_iommu_regions() is only
>>> called for the first device, and the other devices' reserved regions
>>> are missed.
>> Correct
>>> Curious how you get the passthrough device and the pcie root port into
>>> the same group.
>>> When I start an x86 guest with a passthrough device, I see the
>>> passthrough device and the pcie root port in different groups.
>>>
>>> -[0000:00]-+-00.0
>>>            +-01.0
>>>            +-02.0
>>>            +-03.0-[01]----00.0
>>>
>>> /sys/kernel/iommu_groups/3/devices:
>>> 0000:00:03.0
>>> /sys/kernel/iommu_groups/7/devices:
>>> 0000:01:00.0
>>>
>>> My qemu cmdline:
>>> -device pcie-root-port,id=root0,slot=0
>>> -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0
>> I just replayed the scenario:
>> - if you have a coldplugged vfio-pci device, the pci root port and the
>>   passthrough device end up in different iommu groups.
>>   On my end I use ioh3420, but you confirmed that's the same for the
>>   generic pcie-root-port.
>> - however if you hotplug the vfio-pci device, that's a different story:
>>   they end up in the same group. Don't ask me why. I tried with both
>>   virtio-iommu and intel iommu and I end up with the same topology.
>>   That looks really weird to me.
> It also took me a while to get hotplug to work on x86:
> pcie_cap_slot_plug_cb() didn't get called, instead it would call
> ich9_pm_device_plug_cb(). Not sure what I'm doing wrong.
> To work around that I instantiated a second pxb-pcie root bus and then a
> pcie root port on there. So my command-line looks like:
>
> -device virtio-iommu
> -device pxb-pcie,id=pcie.1,bus_nr=1
> -device pcie-root-port,chassis=2,id=pcie.2,bus=pcie.1
>
> device_add vfio-pci,host=00:04.0,bus=pcie.2
>
> And somehow pcieport and the assigned device do end up in separate IOMMU
> groups.
>
> Thanks,
> Jean
>