>-----Original Message-----
>From: Nicolin Chen <nicol...@nvidia.com>
>Subject: Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
>accelerated SMMUv3 to vfio-pci endpoints with iommufd
>
>On Tue, Jul 15, 2025 at 10:53:50AM +0000, Duan, Zhenzhong wrote:
>>
>>
>> >-----Original Message-----
>> >From: Shameer Kolothum <shameerali.kolothum.th...@huawei.com>
>> >Subject: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated
>> >SMMUv3 to vfio-pci endpoints with iommufd
>> >
>> >Accelerated SMMUv3 is only useful when the device can take advantage of
>> >the host's SMMUv3 in nested mode. To keep things simple and correct, we
>> >only allow this feature for vfio-pci endpoint devices that use the iommufd
>> >backend. We also allow non-endpoint emulated devices like PCI bridges and
>> >root ports, so that users can plug in these vfio-pci devices.
>> >
>> >Another reason for this limit is to avoid problems with IOTLB
>> >invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
>> >SID, making it difficult to trace the originating device. If we allowed
>> >emulated endpoint devices, QEMU would have to invalidate both its own
>> >software IOTLB and the host's hardware IOTLB, which could slow things
>> >down.
>> >
>> >Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
>> >translation (S1+S2), their get_address_space() callback must return the
>> >system address space to enable correct S2 mappings of guest RAM.
>> >
>> >So in short:
>> > - vfio-pci devices return the system address space
>> > - bridges and root ports return the IOMMU address space
>> >
>> >Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.
>>
>> So the translation result is a doorbell addr (gpa) for the guest?
>> IIUC, there should be a mapping from the guest doorbell addr (gpa) to the
>> host doorbell addr (hpa) in the stage2 page table? Where is this mapping set up?
>
>Yes and yes.
>
>On ARM, MSI is behind the IOMMU. When 2-stage translation is enabled,
>it goes through two stages as you understood.
>
>There are a few ways to implement this, though the current kernel
>only supports one solution, which is a hard-coded RMR (reserved
>memory region).
>
>The solution sets up an RMR region in the ACPI IORT, which maps the
>stage-1 linearly, i.e. gIOVA=gPA.
>
>The gPA=>hPA mappings in the stage-2 are set up by the kernel, based on
>the IOMMU_RESV_SW_MSI region defined in the kernel driver.
>
>It's not the ideal solution, but it's the simplest to implement.
>
>There are other ways to support this, like a true 2-stage mapping,
>but they are still on the way.
>
>For more details, please refer to this:
>https://lore.kernel.org/all/cover.1740014950.git.nicol...@nvidia.com/

Thanks for the link, it helps a lot in understanding the ARM SMMU architecture.
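
Just to confirm my mental model of the RMR flow, a rough sketch (the
addresses and helper names below are made up for illustration, not taken
from the actual QEMU/kernel code):

    /*
     * Stage-1 is described to the guest as an IORT RMR, so it maps the
     * MSI doorbell linearly (gIOVA == gPA); stage-2 then maps that gPA
     * to the host doorbell hPA picked from the IOMMU_RESV_SW_MSI window.
     */
    #include <stdint.h>

    #define SW_MSI_GPA_BASE  0x08000000ULL  /* assumed guest doorbell gPA */
    #define SW_MSI_HPA_BASE  0x12345000ULL  /* assumed host doorbell hPA  */

    /* Stage 1 (guest-owned, via the hard-coded RMR): identity mapping. */
    static uint64_t stage1_translate(uint64_t giova)
    {
        return giova;                        /* gIOVA == gPA */
    }

    /* Stage 2 (host-owned): offset into the reserved SW_MSI window. */
    static uint64_t stage2_translate(uint64_t gpa)
    {
        return SW_MSI_HPA_BASE + (gpa - SW_MSI_GPA_BASE);
    }

    static uint64_t msi_doorbell_translate(uint64_t giova)
    {
        return stage2_translate(stage1_translate(giova));
    }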

>
>> >+static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
>> >+{
>> >+
>> >+    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
>> >+        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
>> >+        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
>> >+        return true;
>> >+    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
>> >+        object_property_find(OBJECT(pdev), "iommufd"))) {
>>
>> Will this always return true?
>
>It won't if a vfio-pci device doesn't have the "iommufd" property?

IIUC, the iommufd property is always there; its value is just not filled in
for the legacy container case.
What about checking VFIOPCIDevice.vbasedev.iommufd instead?
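
Something along these lines is what I had in mind (untested sketch; the
VFIO_PCI() cast and the vbasedev.iommufd field are from my reading of
hw/vfio/pci.h and may need adjusting):

    static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
    {
        if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
            object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
            object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
            return true;
        } else if (object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI)) {
            VFIOPCIDevice *vdev = VFIO_PCI(pdev);

            /* iommufd is only set when the iommufd backend is in use;
             * it stays NULL for the legacy container case. */
            if (vdev->vbasedev.iommufd) {
                *vfio_pci = true;
                return true;
            }
        }
        return false;
    }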

>
>> >+        *vfio_pci = true;
>> >+        return true;
>> >+    }
>> >+    return false;
>
>Then, it returns "false" here.
>
>> > static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
>> >                                               int devfn)
>> > {
>> >+    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>> >     SMMUState *bs = opaque;
>> >+    bool vfio_pci = false;
>> >     SMMUPciBus *sbus;
>> >     SMMUv3AccelDevice *accel_dev;
>> >     SMMUDevice *sdev;
>> >
>> >+    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
>> >+        error_report("Device(%s) not allowed. Only PCIe root complex devices "
>> >+                     "or PCI bridge devices or vfio-pci endpoint devices with "
>> >+                     "iommufd as backend is allowed with arm-smmuv3,accel=on",
>> >+                     pdev->name);
>> >+        exit(1);
>>
>> Seems aggressive for a hotplug; could we fail the hotplug instead of killing QEMU?
>
>Hotplug is unlikely to be supported well, as it would introduce
>too much complication.
>
>With iommufd, a vIOMMU object is allocated per (vfio) device. If the
>device fd (cdev) is not yet given to QEMU, it isn't able to allocate a
>vIOMMU object when creating a VM.
>
>A vIOMMU object can be allocated at a later stage once the device is
>hotplugged, but things like IORT mappings can't be refreshed since the
>OS is likely already booted. Even an IOMMU capability sync via the
>hw_info ioctl will be difficult to do at runtime, after the guest iommu
>driver's initialization.
>
>I am not 100% sure, but I think the Intel model could have a similar
>problem if the guest boots with zero cold-plugged devices and then
>hot-plugs a PASID-capable device at runtime, when the guest-level
>IOMMU driver is already initialized?

For vtd we define a property for each capability we care about. When
hotplugging a device, we get hw_info through an ioctl and compare the
host's capability with the virtual vtd's property setting; if they are
incompatible, we fail the hotplug.

In the old implementation we synced the host iommu caps into the virtual
vtd's caps, but that was NAKed by the maintainer. The suggested way is to
define a property for each capability we care about and do a
compatibility check.

There is a "pasid" property in virtual vtd, only when it's true, the 
PASID-capable
device can work with pasid.
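
Roughly this pattern (a made-up sketch for illustration only, not the
actual intel-iommu code; the type and field names are invented, and it
assumes QEMU's Error/error_setg API):

    typedef struct VIOMMUCaps {
        bool pasid;              /* "pasid" property set by the user */
    } VIOMMUCaps;

    typedef struct HostIOMMUCaps {
        bool pasid;              /* filled in from the hw_info ioctl */
    } HostIOMMUCaps;

    static bool viommu_hotplug_compat(const VIOMMUCaps *viommu,
                                      const HostIOMMUCaps *host,
                                      Error **errp)
    {
        if (viommu->pasid && !host->pasid) {
            error_setg(errp, "host IOMMU lacks PASID support required "
                       "by the virtual IOMMU's pasid=on property");
            return false;        /* fail the hotplug, keep QEMU running */
        }
        return true;
    }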

Zhenzhong

>
>FWIW, Shameer's cover-letter has the following line:
> "At least one vfio-pci device must currently be cold-plugged to
>  a PCIe root complex associated with arm-smmuv3,accel=on."
>
>Perhaps there should be a similar highlight in this smmuv3-accel
>file as well (@Shameer).
>
>Nicolin
