On Mon, Jun 01, 2026 at 12:42:20PM +0100, Shameer Kolothum wrote:
> + * Lifecycle (driven by guest events)
> + * ----------------------------------
> + * 1. First vfio-pci device attach (.set_iommu_device) triggers:
> + *    - tegra241_cmdqv_probe(): IOMMU_GET_HW_INFO confirms host CMDQV support.
> + *    - IOMMU_VIOMMU_ALLOC: the kernel allocates a VINTF for this VM,
> + *      configures the VM's VMID (from its stage-2 HWPT) in VINTF_CONFIG,
> + *      forces HYP_OWN=0, and returns the mmap offset/length for VINTF Page 0.
> + *
> + * 2. Guest writes VINTF_CONFIG.ENABLE = 1:
> + *    QEMU mmap()s the offset from step 1 into its address space and reports
> + *    STATUS.ENABLE_OK = 1. The host VINTF was already enabled by
> + *    IOMMU_VIOMMU_ALLOC; QEMU only acks back.

I wonder whether this mmap should be done after CMDQV_CONFIG.CMDQV_EN=1
rather than VINTF_CONFIG.ENABLE=1.

The spec says:
"Program the CMDQV_CMDQ_ALLOC_MAP_X register(s) to map the Virtual
 CMDQ(s) to the logical CMDQs on Virtual Interface following the CMDQ
 allocation rules. This again is an optional step and the CMDQ
 allocation can be done after the Virtual Interface is initialized."

This means the LVCMDQ mapping can happen before VINTF_CONFIG.ENABLE=1.
And the mmap info is returned with IOMMU_VIOMMU_ALLOC. So, this is
doable?

FWIW, the kernel finalizes an LVCMDQ mapping when handling HW_QUEUE_ALLOC.

> + * 3. Guest completes vCMDQ setup (BASE, CMDQ_ALLOC_MAP.ALLOC, CMDQV_EN,
> + *    VINTF.ENABLE, in any order; each precondition write retries the
> + *    allocation):
> + *    IOMMU_HW_QUEUE_ALLOC binds the guest BASE GPA (translated through
> + *    stage-2 and pinned by the kernel) to a host vCMDQ in this VM's VINTF.
> + *
> + * 4. After the first successful HW_QUEUE_ALLOC, the mmap'd VINTF Page 0 is
> + *    installed into guest MMIO as a RAM-device subregion. Guest VINTF Page 0
> + *    accesses (CMDQ_EN, PROD/CONS_INDX, STATUS, GERROR/GERRORN) thereafter
> + *    go straight to host hardware, bypassing QEMU.
> + *
> + * 5. Guest SMMU driver programs a Stream Table Entry for a passthrough
> + *    device: IOMMU_VDEVICE_ALLOC programs SID_MATCH/SID_REPLACE in this

Ideally, the VDEVICE_ALLOC should happen right after the device gets
plugged *and* its vSID is available; similarly, the vdevice should only
be freed when the device gets unplugged. So, its lifecycle should stay
with the plugged state.

We do VDEVICE_ALLOC when the guest programs the STE because that's the
only place we found in the existing functions. We could (or arguably
maybe we should) find a better place to allocate the VDEVICE.

E.g., vfio code can call qemu_add_vm_change_state_handler_prio() to
register a callback function, where the vSID (vBDF) should be available
once the VM starts to be "running".

I think your narrative fits the current design. But we might need to be
more accurate here, since this is about "lifecycle".

> + *    VM's VINTF so the device's guest vSID translates to its host pSID.

Nit: so that the HW translates the guest vSID into host pSID.

> + * Limits exposed to the guest
> + * ---------------------------
> + * One VINTF per emulated SMMUv3 and two vCMDQs per VINTF. Maximum vCMDQ
> + * size is 8MiB. The queue must be physically contiguous (the HW reads it
> + * via host PA), so QEMU caps it to the host memory-backend page size. Use
> + * hugepage backing large enough to keep CMDQS at the HW maximum.

The "physically contiguous" limit is handled by QEMU, not exposed to the
guest at all; guest is unaware of this.

Nicolin
