Hi Jean,

On 1/25/24 19:48, Jean-Philippe Brucker wrote:
> Hi,
>
> On Thu, Jan 18, 2024 at 10:43:55AM +0100, Eric Auger wrote:
>> Hi Zhenzhong,
>> On 1/18/24 08:10, Duan, Zhenzhong wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.au...@redhat.com>
>>>> Cc: m...@redhat.com; c...@redhat.com
>>>> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling
>>>> for hotplugged devices
>>>>
>>>> In [1] we attempted to fix a case where a VFIO-PCI device protected
>>>> with a virtio-iommu was assigned to an x86 guest. On x86 the physical
>>>> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
>>>> virtio-iommu used to expose a 64b address space by default.
>>>> Hence the guest was trying to use the full 64b space and we hit
>>>> DMA MAP failures. To work around this issue we managed to pass
>>>> usable IOVA regions (excluding the out of range space) from VFIO
>>>> to the virtio-iommu device. This was made feasible by introducing
>>>> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
>>>> The latter gets called when the IOMMU MR is enabled, which
>>>> causes vfio_listener_region_add() to be called.
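
For reference, the hook used in [1] amounts to something like the
following (rough sketch only; the name and signature here are
illustrative, not the exact code of that series):

    /*
     * Illustrative sketch: a class hook on the IOMMU memory region,
     * invoked from vfio_listener_region_add() once the IOMMU MR is
     * enabled, to push the usable IOVA ranges collected by VFIO into
     * the vIOMMU model. Name and signature are assumptions.
     */
    struct IOMMUMemoryRegionClass {
        /* ... existing callbacks (translate, notify_flag_changed, ...) */
        int (*iommu_set_iova_regions)(IOMMUMemoryRegion *iommu_mr,
                                      const Range *ranges,
                                      uint32_t nr_ranges,
                                      Error **errp);
    };
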
>>>>
>>>> However with VFIO-PCI hotplug, this technique fails due to the
>>>> race between the call to the callback in the add memory listener
>>>> and the virtio-iommu probe request. Indeed the probe request gets
>>>> called before the attach to the domain. So in that case the usable
>>>> regions are communicated after the probe request and fail to be
>>>> conveyed to the guest. To be honest, the problem was hinted at by
>>>> Jean-Philippe in [1] and I should have been more careful to listen
>>>> to him and to test with hotplug :-(
>>> It looks like the global virtio_iommu_config.bypass is never cleared in
>>> the guest. When the guest virtio_iommu driver enables the IOMMU, should
>>> it clear this bypass attribute?
>>> If it were cleared in viommu_probe(), then qemu would call
>>> virtio_iommu_set_config() and then virtio_iommu_switch_address_space_all()
>>> to enable the IOMMU MR. Then both coldplugged and hotplugged devices
>>> would work.
>> This field is iommu wide while the probe applies to a single device. In
>> general I would prefer not to be dependent on the MR enablement. We know
>> that the device is likely to be protected and we can collect its
>> requirements beforehand.
>>
>>> Intel IOMMU has a similar bit in the register GCMD_REG.TE. When the
>>> guest intel_iommu driver sets it at probe time,
>>> vtd_address_space_refresh_all() is called on the qemu side to enable
>>> the IOMMU MRs.
>> interesting.
>>
>> I would be curious to get Jean-Philippe's point of view.
> I'd rather not rely on this, it's hard to justify a driver change based
> only on QEMU internals. And QEMU can't count on the driver always clearing
> bypass. There could be situations where the guest can't afford to do it,
> like if an endpoint is owned by the firmware and has to keep running.
>
> There may be a separate argument for clearing bypass. With a coldplugged
> VFIO device the flow is:
>
> 1. Map the whole guest address space in VFIO to implement boot-bypass.
>    This allocates all guest pages, which takes a while and is wasteful.
>    I've actually crashed a host that way, when spawning a guest with too
>    much RAM.
interesting
> 2. Start the VM
> 3. When the virtio-iommu driver attaches a (non-identity) domain to the
>    assigned endpoint, then unmap the whole address space in VFIO, and most
>    pages are given back to the host.
>
> We can't disable boot-bypass because the BIOS needs it. But instead the
> flow could be:
>
> 1. Start the VM, with only the virtual endpoints. Nothing to pin.
> 2. The virtio-iommu driver disables bypass during boot
We needed this boot-bypass mode for booting with, for instance,
virtio-blk-scsi protected by the virtio-iommu.
That was needed because we don't have any virtio-iommu driver in edk2, as
opposed to the intel iommu driver, right?
> 3. Hotplug the VFIO device. With bypass disabled there is no need to pin
>    the whole guest address space, unless the guest explicitly asks for an
>    identity domain.
>
> However, I don't know if this is a realistic scenario that will actually
> be used.
>
> By the way, do you have an easy way to reproduce the issue described here?
> I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux
> just allocates 32-bit IOVAs.
I don't have a simple generic reproducer. It happens when assigning this
device:
Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2)

I have not encountered that issue with another device yet.
I see on guest side in dmesg:
[    6.849292] ice 0000:00:05.0: Using 64-bit DMA addresses

That's emitted in iommu_dma_alloc_iova() in dma-iommu.c.
It looks like the guest first tries to allocate an IOVA in the 32-bit AS
and, if this fails, uses the whole dma_limit.
It seems the 32-bit IOVA allocation failed here ;-)
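
For context, the allocation path in drivers/iommu/dma-iommu.c boils down
to something like this (simplified paraphrase, not the exact kernel code):

    /*
     * Simplified paraphrase of iommu_dma_alloc_iova(): try a 32-bit
     * IOVA first for PCI devices (unless iommu.forcedac=1), and only
     * fall back to the full dma_limit, printing the notice seen above,
     * when the 32-bit allocation fails.
     */
    unsigned long iova = 0;

    if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev) && !iommu_dma_forcedac) {
        iova = alloc_iova_fast(iovad, iova_len,
                               DMA_BIT_MASK(32) >> shift, false);
        if (!iova)
            dev_notice(dev, "Using %d-bit DMA addresses\n", bits_per(dma_limit));
    }
    if (!iova)
        iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift, true);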

Thanks

Eric





>
>>>> For coldplugged devices the technique works because we make sure all
>>>> the IOMMU MRs are enabled once machine init is done: 94df5b2180
>>>> ("virtio-iommu: Fix 64kB host page size VFIO device assignment")
>>>> for the granule freeze. But I would be keen to get rid of this trick.
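
That trick is roughly the following, registered from the virtio-iommu
realize path (simplified sketch, not the exact code of 94df5b2180):

    /*
     * Simplified sketch: a machine-init-done notifier transiently
     * disables bypass so the IOMMU MRs get enabled, which makes the
     * VFIO listener run and report its constraints (page size mask),
     * then bypass is restored.
     */
    static void virtio_iommu_freeze_granule(Notifier *notifier, void *data)
    {
        VirtIOIOMMU *s = container_of(notifier, VirtIOIOMMU, machine_done);

        if (s->config.bypass) {
            s->config.bypass = false;
            virtio_iommu_switch_address_space_all(s); /* VFIO listener runs */
            s->config.bypass = true;
            virtio_iommu_switch_address_space_all(s);
        }
        s->granule_frozen = true;
    }

    /* in the device realize function: */
    s->machine_done.notify = virtio_iommu_freeze_granule;
    qemu_add_machine_init_done_notifier(&s->machine_done);
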
>>>>
>>>> Using an IOMMU MR Ops is impractical because this relies on the IOMMU
>>>> MR having been enabled and the corresponding vfio_listener_region_add()
>>>> having been executed. Instead this series proposes to replace the use
>>>> of this API with the recently introduced PCIIOMMUOps: ba7d12eb8c
>>>> ("hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps"). That way, the
>>>> callback can be called earlier, once the usable IOVA regions have been
>>>> collected by VFIO, without the need for the IOMMU MR to be enabled.
>>>>
>>>> This looks cleaner. In the short term this may also be used for
>>>> passing the page size mask, which would allow us to get rid of the
>>>> hacky transient IOMMU MR enablement mentioned above.
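
For illustration, the PCIIOMMUOps path looks roughly as follows (only
get_address_space is provided by ba7d12eb8c; the extra callback below is
hypothetical and just shows where such a hook could live so it can run
before the IOMMU MR is enabled):

    /* Rough sketch, not the actual series. */
    static const PCIIOMMUOps virtio_iommu_ops = {
        /* existing op from ba7d12eb8c */
        .get_address_space = virtio_iommu_find_add_as,
        /* hypothetical: called by VFIO once usable IOVA ranges are known */
        .set_host_iova_ranges = virtio_iommu_set_host_iova_ranges,
    };

    /* at realize time, on the PCI bus the vIOMMU protects: */
    pci_setup_iommu(bus, &virtio_iommu_ops, s);
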
>>>>
>>>> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
>>>>    https://lore.kernel.org/all/20231019134651.842175-1-
>>>> eric.au...@redhat.com/
>>>>
>>>> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
>>>>
>>>> Extra Notes:
>>>> With that series, the reserved memory regions are communicated in time
>>>> so that the virtio-iommu probe request grabs them. However this is not
>>>> sufficient. In some cases (my case), I still see some DMA MAP failures
>>>> and the guest keeps on using IOVA ranges outside the geometry of the
>>>> physical IOMMU. This is due to the fact that the VFIO-PCI device is in
>>>> the same iommu group as the pcie root port. Normally the kernel
>>>> iova_reserve_iommu_regions() (dma-iommu.c) is supposed to call
>>>> reserve_iova() for each reserved IOVA, which carves them out of the
>>>> allocator. When iommu_dma_init_domain() gets called for the hotplugged
>>>> vfio-pci device, the iova domain is already allocated and set and we
>>>> don't call iova_reserve_iommu_regions() again for the vfio-pci device.
>>>> So its corresponding reserved regions are not properly taken into
>>>> account.
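
For context, iova_reserve_iommu_regions() essentially does the following
for each reserved region reported for the device (simplified paraphrase
of dma-iommu.c), and this carving is what the second device of the group
misses:

    /*
     * Simplified paraphrase: every reserved region of the device is
     * carved out of the IOVA allocator with reserve_iova(). If this is
     * skipped for the second device of the group, its reserved or
     * unusable ranges remain allocatable.
     */
    struct iommu_resv_region *region;
    LIST_HEAD(resv_regions);

    iommu_get_resv_regions(dev, &resv_regions);
    list_for_each_entry(region, &resv_regions, list) {
        unsigned long lo = iova_pfn(iovad, region->start);
        unsigned long hi = iova_pfn(iovad,
                                    region->start + region->length - 1);

        reserve_iova(iovad, lo, hi);
    }
    iommu_put_resv_regions(dev, &resv_regions);
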
>>> I suspect there is the same issue with coldplugged devices. If those
>>> devices are in the same group, iova_reserve_iommu_regions() is only
>>> called for the first device, but the other devices' reserved regions
>>> are missed.
>> Correct
>>> Curious how you get the passthrough device and the pcie root port into
>>> the same group.
>>> When I start an x86 guest with a passthrough device, I see the
>>> passthrough device and the pcie root port are in different groups.
>>>
>>> -[0000:00]-+-00.0
>>>            +-01.0
>>>            +-02.0
>>>            +-03.0-[01]----00.0
>>>
>>> /sys/kernel/iommu_groups/3/devices:
>>> 0000:00:03.0
>>> /sys/kernel/iommu_groups/7/devices:
>>> 0000:01:00.0
>>>
>>> My qemu cmdline:
>>> -device pcie-root-port,id=root0,slot=0
>>> -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0
>> I just replayed the scenario:
>> - if you have a coldplugged vfio-pci device, the pci root port and the
>> passed-through device end up in different iommu groups. On my end I use
>> ioh3420 but you confirmed that's the same for the generic pcie-root-port.
>> - however, if you hotplug the vfio-pci device, that's a different story:
>> they end up in the same group. Don't ask me why. I tried with
>> both virtio-iommu and intel iommu and I end up with the same topology.
>> That looks really weird to me.
> It also took me a while to get hotplug to work on x86:
> pcie_cap_slot_plug_cb() didn't get called; instead it would call
> ich9_pm_device_plug_cb(). Not sure what I'm doing wrong.
> To work around that I instantiated a second pxb-pcie root bus and then a
> pcie root port on there. So my command-line looks like:
>
>  -device virtio-iommu
>  -device pxb-pcie,id=pcie.1,bus_nr=1
>  -device pcie-root-port,chassis=2,id=pcie.2,bus=pcie.1
>
>  device_add vfio-pci,host=00:04.0,bus=pcie.2
>
> And somehow pcieport and the assigned device do end up in separate IOMMU
> groups.
>
> Thanks,
> Jean
>

