> -----Original Message-----
> From: Nicolin Chen <[email protected]>
> Sent: 02 June 2026 00:47
> To: Shameer Kolothum Thodi <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Nathan Chen <[email protected]>; Matt Ochs
> <[email protected]>; Jiandi An <[email protected]>; Jason Gunthorpe
> <[email protected]>; [email protected]; Krishnakant Jaju
> <[email protected]>; [email protected]
> Subject: Re: [PATCH v6 30/31] hw/arm/tegra241-cmdqv: Document the
> CMDQV design and lifecycle
>
> On Mon, Jun 01, 2026 at 12:42:20PM +0100, Shameer Kolothum wrote:
> > + * Lifecycle (driven by guest events)
> > + * ----------------------------------
> > + * 1. First vfio-pci device attach (.set_iommu_device) triggers:
> > + * - tegra241_cmdqv_probe(): IOMMU_GET_HW_INFO confirms host CMDQV support.
> > + * - IOMMU_VIOMMU_ALLOC: the kernel allocates a VINTF for this VM,
> > + * configures the VM's VMID (from its stage-2 HWPT) in VINTF_CONFIG,
> > + * forces HYP_OWN=0, and returns the mmap offset/length for VINTF Page 0.
> > + *
> > + * 2. Guest writes VINTF_CONFIG.ENABLE = 1:
> > + * QEMU mmap()s the offset from step 1 into its address space and reports
> > + * STATUS.ENABLE_OK = 1. The host VINTF was already enabled by
> > + * IOMMU_VIOMMU_ALLOC; QEMU only acks back.
>
> I wonder this mmap should be done after the CMDQV_CONFIG.CMDQV_EN=1
> rather than VINTF_CONFIG.ENABLE=1.
>
> Spec says that:
> "Program the CMDQV_CMDQ_ALLOC_MAP_X register(s) to map the Virtual
> CMDQ(s) to the logical CMDQs on Virtual Interface following the
> CMDQ allocation rules. This again is an optional step and the CMDQ
> allocation can be done after the Virtual Interface is initialized."
>
> This means the LVCMDQ mapping can happen before VINTF_CONFIG.ENABLE=1.
>
> And the mmap info is returned with IOMMU_VIOMMU_ALLOC. So, this is doable?
I think it is. In theory, we could also do it right after IOMMU_VIOMMU_ALLOC.
What do you think?
>
> FWIW, kernel finalizes a LVCMDQ mapping when handling HW_QUEUE_ALLOC.
>
> > + * 3. Guest completes vCMDQ setup (BASE, CMDQ_ALLOC_MAP.ALLOC, CMDQV_EN,
> > + * VINTF.ENABLE, in any order; each precondition write retries the
> > + * allocation):
> > + * IOMMU_HW_QUEUE_ALLOC binds the guest BASE GPA (translated through
> > + * stage-2 and pinned by the kernel) to a host vCMDQ in this VM's VINTF.
> > + *
> > + * 4. After the first successful HW_QUEUE_ALLOC, the mmap'd VINTF Page 0 is
> > + * installed into guest MMIO as a RAM-device subregion. Guest VINTF Page 0
> > + * accesses (CMDQ_EN, PROD/CONS_INDX, STATUS, GERROR/GERRORN) thereafter
> > + * go straight to host hardware, bypassing QEMU.
> > + *
> > + * 5. Guest SMMU driver programs a Stream Table Entry for a passthrough
> > + * device: IOMMU_VDEVICE_ALLOC programs SID_MATCH/SID_REPLACE in this
>
> Ideally, the VDEVICE_ALLOC should happen right after the device gets
> plugged *and* its vSID is available; similarly the vdevice should be
> only freed when the device gets unplugged. So, its lifecycle should
> stay with the plugged state.
>
> We do VDEVICE_ALLOC when guest programming STE is because that's the
> only place we found in the existing functions. We could (or arguably
> maybe we should) find a better place to allocate VDEVICE. E.g., vfio
> code can call qemu_add_vm_change_state_handler_prio() to register a
> callback function, where vSID (vBDF) should be available once the VM
> start to be "running".
>
> I think your narrative fits the current design. But we might need to
> be more accurate here, since this is about "lifecycle".
Yes, the above is based on the current design. I will add a caveat to
cover that.
I remember discussions about use cases (CCA realm?) requiring an early,
mandatory VDEVICE_ALLOC before guest boot. Regarding the suggestion
to use qemu_add_vm_change_state_handler_prio(), it might help,
but I'm not sure it covers all cases, such as hotplug and guest PCI
re-enumeration during boot. (Having _DSM #5 might help here,
but per the PCI Firmware Spec it only "indicates to an operating
system that it should preserve assignments of PCI bus numbers",
so guests are not strictly required to honour it.) Anyway, I'll take
another look as a follow-up.
> > + * VM's VINTF so the device's guest vSID translates to its host pSID.
>
> Nit: so that the HW translates the guest vSID into host pSID.
>
> > + * Limits exposed to the guest
> > + * ---------------------------
> > + * One VINTF per emulated SMMUv3 and two vCMDQs per VINTF. Maximum vCMDQ
> > + * size is 8MiB. The queue must be physically contiguous (the HW reads it
> > + * via host PA), so QEMU caps it to the host memory-backend page size. Use
> > + * hugepage backing large enough to keep CMDQS at the HW maximum.
>
> The "physically contiguous" limit is handled by QEMU, not exposed to
> the guest at all; guest is unaware of this.
Ok. The intent was to say that the exposed queue size may be limited because
of the "physically contiguous" requirement. I will make that clear.
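
To illustrate what I mean by the exposed size being limited, here is a rough sketch of the capping arithmetic. It is not the code in the patch: the helper name and the macros are hypothetical, and it assumes the cap is simply "the whole queue must fit in one backing page", with 16-byte SMMUv3 commands and the 8MiB maximum mentioned above.

```c
#include <stdint.h>

#define CMDQ_ENTRY_SHIFT   4   /* SMMUv3 commands are 16 bytes each */
#define VCMDQ_MAX_LOG2SIZE 19  /* 2^19 entries * 16 bytes = 8MiB */

/*
 * Hypothetical helper (name and policy are assumptions): return the largest
 * log2(number of entries) for a queue that fits entirely in one backing
 * page, clamped to the 8MiB maximum. A non-power-of-two page size is
 * rounded down to the nearest power of two.
 */
static unsigned vcmdq_cap_log2size(uint64_t backend_page_size)
{
    unsigned page_shift = 0;
    unsigned log2size;

    /* floor(log2(backend_page_size)) */
    while ((backend_page_size >> (page_shift + 1)) != 0) {
        page_shift++;
    }

    /* Page too small to hold more than a single command */
    if (page_shift <= CMDQ_ENTRY_SHIFT) {
        return 0;
    }

    log2size = page_shift - CMDQ_ENTRY_SHIFT;
    return log2size < VCMDQ_MAX_LOG2SIZE ? log2size : VCMDQ_MAX_LOG2SIZE;
}
```

Under these assumptions, a 4KiB backing page limits the queue to 2^8 entries, a 2MiB hugepage to 2^17, and only a 1GiB hugepage reaches the 2^19 maximum. That is why the comment recommends hugepage backing.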
Thanks,
Shameer