On Mon, Sep 29, 2025 at 02:36:22PM +0100, Shameer Kolothum wrote:
> Accelerated SMMUv3 is only useful when the device can take advantage of
> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> only allow this feature for vfio-pci endpoint devices that use the iommufd
> backend. We also allow non-endpoint emulated devices like PCI bridges and
> root ports, so that users can plug in these vfio-pci devices. We can only
> enforce this if devices are cold plugged. For hotplug cases, give appropriate
> warnings.
>
> Another reason for this limit is to avoid problems with IOTLB
> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> SID, making it difficult to trace the originating device. If we allowed
> emulated endpoint devices, QEMU would have to invalidate both its own
> software IOTLB and the host's hardware IOTLB, which could slow things
> down.
>
> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> translation (S1+S2), their get_address_space() callback must return the
> system address space so that the VFIO core can set up correct S2 mappings
> for guest RAM.
>
> So in short:
> - vfio-pci devices (with iommufd as backend) return the system address
>   space.
> - bridges and root ports return the IOMMU address space.
>
> Signed-off-by: Shameer Kolothum <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>

With some nits:

> +    /*
> +     * We return the system address for vfio-pci devices(with iommufd as
> +     * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> +     * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> +     * the host SMMUv3 runs in nested (S1 + S2) mode where the guest
> +     * manages its own S1 page tables while the host manages S2.
> +     *
> +     * We are using the global &address_space_memory here, as this will ensure
> +     * same system address space pointer for all devices behind the accelerated
> +     * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
> +     * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
> +     * within the VM instead of duplicating them for every SMMUv3 instance.
> +     */
> +    if (vfio_pci) {
> +        return &address_space_memory;

How about:

/*
 * In the accelerated case, a vfio-pci device passed through via the iommufd
 * backend must stay in the system address space, as it is always translated
 * by its physical SMMU (using a stage-2-only STE or a nested STE), in which
 * case the stage-2 nesting parent page table is allocated by the vfio core,
 * backing up the system address space.
 *
 * So, return the system address space via the global address_space_memory.
 * The shared address_space_memory also allows devices under different vSMMU
 * instances in a VM to reuse a single nesting parent HWPT in the vfio core.
 */

?

And I think this would be clearer by having get_viommu_flags() in this patch.

> diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
> index 1bcceddbc4..a8eb2d2426 100644
> --- a/hw/pci-bridge/pci_expander_bridge.c
> +++ b/hw/pci-bridge/pci_expander_bridge.c
> @@ -48,7 +48,6 @@ struct PXBBus {
>      char bus_path[8];
>  };
>
> -#define TYPE_PXB_PCIE_DEV "pxb-pcie"
>  OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
>
>  static GList *pxb_dev_list;
> diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
> index a055fd8d32..b61360b900 100644
> --- a/include/hw/pci/pci_bridge.h
> +++ b/include/hw/pci/pci_bridge.h
> @@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
>
>  #define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
>  #define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
> +#define TYPE_PXB_PCIE_DEV "pxb-pcie"
>  #define TYPE_PXB_DEV "pxb"
>  OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)

Maybe this can be a patch itself.

Nicolin
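For reference, a rough sketch (not the actual patch code) of the
get_address_space() selection that the commit message and the comment above
describe. smmu_dev_is_accel_vfio_pci() is a hypothetical stand-in for however
the patch detects a vfio-pci endpoint on the iommufd backend;
smmu_find_add_as() is the existing hw/arm/smmu-common.c helper that returns
the per-device emulated IOMMU address space.

/*
 * Sketch only: pick the address space for a device behind an accelerated
 * vSMMUv3, assuming the usual QEMU SMMUv3/PCI headers are in scope.
 */
static AddressSpace *smmu_accel_get_address_space(PCIBus *bus, void *opaque,
                                                  int devfn)
{
    /* Hypothetical helper: true for vfio-pci endpoints using iommufd. */
    if (smmu_dev_is_accel_vfio_pci(bus, devfn)) {
        /*
         * vfio-pci with the iommufd backend: return the global system
         * address space so the VFIO core sets up Stage-2 mappings for
         * guest RAM, while the guest owns Stage-1 via the host SMMUv3's
         * nested mode. Returning the same &address_space_memory pointer
         * for every such device lets VFIO/iommufd reuse a single
         * IOAS/nesting-parent HWPT across vSMMU instances.
         */
        return &address_space_memory;
    }

    /* Bridges, root ports, etc. keep the emulated IOMMU address space. */
    return smmu_find_add_as(bus, opaque, devfn);
}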
