RE: [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()

2025-08-12 Thread Duan, Zhenzhong


>-Original Message-
>From: Jim Shu 
>Subject: Re: [PATCH v4 02/20] hw/pci: Introduce
>pci_device_get_viommu_cap()
>
>Hi Zhenzhong,
>
>On Tue, Jul 29, 2025 at 5:21 PM Zhenzhong Duan
> wrote:
>>
>> Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
>> allows to retrieve capabilities exposed by a vIOMMU. The first planned
>> capability is VIOMMU_CAP_HW_NESTED that advertises the support of HW
>> nested stage translation scheme. pci_device_get_viommu_cap is a wrapper
>> that can be called on a PCI device potentially protected by a vIOMMU.
>>
>> get_viommu_cap() is designed to return 64bit bitmap of purely emulated
>> capabilities which are only derermined by user's configuration, no host
>> capabilities involved. Reasons are:
>>
>> 1. there can be more than one host IOMMUs with different capabilities
>> 2. there can also be more than one vIOMMUs with different user
>>configuration, e.g., arm smmuv3.
>> 3. This is migration friendly, return value is consistent between source
>>and target.
>> 4. It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
>>because we need get_viommu_cap() to determine if creating nested
>parent
>>hwpt or not at attaching stage, meanwhile hiod realize needs iommufd,
>>devid and hwpt_id which are ready after attach_device().
>>See below sequence:
>>
>>  attach_device()
>>get_viommu_cap()
>>create hwpt
>>  ...
>>  create hiod
>>  set_iommu_device(hiod)
>>
>> Suggested-by: Yi Liu 
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>  MAINTAINERS  |  1 +
>>  include/hw/iommu.h   | 17 +
>>  include/hw/pci/pci.h | 25 +
>>  hw/pci/pci.c | 11 +++
>>  4 files changed, 54 insertions(+)
>>  create mode 100644 include/hw/iommu.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 37879ab64e..840cb1e604 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2304,6 +2304,7 @@ F: include/system/iommufd.h
>>  F: backends/host_iommu_device.c
>>  F: include/system/host_iommu_device.h
>>  F: include/qemu/chardev_open.h
>> +F: include/hw/iommu.h
>>  F: util/chardev_open.c
>>  F: docs/devel/vfio-iommufd.rst
>>
>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> new file mode 100644
>> index 00..021db50db5
>> --- /dev/null
>> +++ b/include/hw/iommu.h
>> @@ -0,0 +1,17 @@
>> +/*
>> + * General vIOMMU capabilities, flags, etc
>> + *
>> + * Copyright (C) 2025 Intel Corporation.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#ifndef HW_IOMMU_H
>> +#define HW_IOMMU_H
>> +
>
>Could we include header "qemu/bitops.h" here to use `BIT_ULL()` in this file?
>Or it will get compile error when including it in the other IOMMU.

Oh, thanks for point out, fixed at 
https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_nesting.v5.wip/

BRs,
Zhenzhong


Re: [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()

2025-08-12 Thread Jim Shu
Hi Zhenzhong,

On Tue, Jul 29, 2025 at 5:21 PM Zhenzhong Duan  wrote:
>
> Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
> allows to retrieve capabilities exposed by a vIOMMU. The first planned
> capability is VIOMMU_CAP_HW_NESTED that advertises the support of HW
> nested stage translation scheme. pci_device_get_viommu_cap is a wrapper
> that can be called on a PCI device potentially protected by a vIOMMU.
>
> get_viommu_cap() is designed to return 64bit bitmap of purely emulated
> capabilities which are only derermined by user's configuration, no host
> capabilities involved. Reasons are:
>
> 1. there can be more than one host IOMMUs with different capabilities
> 2. there can also be more than one vIOMMUs with different user
>configuration, e.g., arm smmuv3.
> 3. This is migration friendly, return value is consistent between source
>and target.
> 4. It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
>because we need get_viommu_cap() to determine if creating nested parent
>hwpt or not at attaching stage, meanwhile hiod realize needs iommufd,
>devid and hwpt_id which are ready after attach_device().
>See below sequence:
>
>  attach_device()
>get_viommu_cap()
>create hwpt
>  ...
>  create hiod
>  set_iommu_device(hiod)
>
> Suggested-by: Yi Liu 
> Signed-off-by: Zhenzhong Duan 
> ---
>  MAINTAINERS  |  1 +
>  include/hw/iommu.h   | 17 +
>  include/hw/pci/pci.h | 25 +
>  hw/pci/pci.c | 11 +++
>  4 files changed, 54 insertions(+)
>  create mode 100644 include/hw/iommu.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 37879ab64e..840cb1e604 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2304,6 +2304,7 @@ F: include/system/iommufd.h
>  F: backends/host_iommu_device.c
>  F: include/system/host_iommu_device.h
>  F: include/qemu/chardev_open.h
> +F: include/hw/iommu.h
>  F: util/chardev_open.c
>  F: docs/devel/vfio-iommufd.rst
>
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> new file mode 100644
> index 00..021db50db5
> --- /dev/null
> +++ b/include/hw/iommu.h
> @@ -0,0 +1,17 @@
> +/*
> + * General vIOMMU capabilities, flags, etc
> + *
> + * Copyright (C) 2025 Intel Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_IOMMU_H
> +#define HW_IOMMU_H
> +

Could we include header "qemu/bitops.h" here to use `BIT_ULL()` in this file?
Or it will get compile error when including it in the other IOMMU.

> +enum {
> +/* hardware nested stage-1 page table support */
> +VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
> +};
> +
> +#endif /* HW_IOMMU_H */
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 6b7d3ac8a3..d89aefc030 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -462,6 +462,21 @@ typedef struct PCIIOMMUOps {
>   * @devfn: device and function number of the PCI device.
>   */
>  void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
> +/**
> + * @get_viommu_cap: get vIOMMU capabilities
> + *
> + * Optional callback, if not implemented, then vIOMMU doesn't
> + * support exposing capabilities to other subsystem, e.g., VFIO.
> + * vIOMMU can choose which capabilities to expose.
> + *
> + * @opaque: the data passed to pci_setup_iommu().
> + *
> + * Returns: 64bit bitmap with each bit represents a capability emulated 
> by
> + * VIOMMU_CAP_* in include/hw/iommu.h, these capabilities are theoretical
> + * which are only determined by user's configuration and independent on 
> the
> + * actual host capabilities they may depend on.
> + */
> +uint64_t (*get_viommu_cap)(void *opaque);
>  /**
>   * @get_iotlb_info: get properties required to initialize a device IOTLB.
>   *
> @@ -642,6 +657,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, 
> HostIOMMUDevice *hiod,
>   Error **errp);
>  void pci_device_unset_iommu_device(PCIDevice *dev);
>
> +/**
> + * pci_device_get_viommu_cap: get vIOMMU capabilities.
> + *
> + * Returns a 64bit bitmap with each bit represents a vIOMMU exposed
> + * capability, 0 if vIOMMU doesn't support esposing capabilities.
> + *
> + * @dev: PCI device pointer.
> + */
> +uint64_t pci_device_get_viommu_cap(PCIDevice *dev);
> +
>  /**
>   * pci_iommu_get_iotlb_info: get properties required to initialize a
>   * device IOTLB.
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index c70b5ceeba..df1fb615a8 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2992,6 +2992,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
>  }
>  }
>
> +uint64_t pci_device_get_viommu_cap(PCIDevice *dev)
> +{
> +PCIBus *iommu_bus;
> +
> +pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
> +if (iommu_bus && iommu_bus->iommu_ops->get_viommu_cap) {
> +return iommu_bus->iommu_ops->get_viommu_cap(iommu_bus->i

RE: [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()

2025-08-04 Thread Duan, Zhenzhong



>-Original Message-
>From: Nicolin Chen 
>Subject: Re: [PATCH v4 02/20] hw/pci: Introduce
>pci_device_get_viommu_cap()
>
>On Wed, Jul 30, 2025 at 10:51:13AM +, Duan, Zhenzhong wrote:
>> >> 2. there can also be more than one vIOMMUs with different user
>> >> configuration, e.g., arm smmuv3.
>
>That's correct. But would you please elaborate how different user
>configurations would benefit from this new op? I don't see a good
>reasoning behind that.

It is the reason to have get_viommu_cap(), get_viommu_cap()
returns capabilities of vIOMMU which underlying VFIO device attaches to,
but indeed, it's not reason for leaving out host capabilities, I'll drop this 
line.

>
>> >> 4. It's too late for VFIO to call get_viommu_cap() after
>set_iommu_device()
>> >> because we need get_viommu_cap() to determine if creating
>nested
>> >parent
>> >> hwpt or not at attaching stage, meanwhile hiod realize needs
>> >iommufd,
>> >
>> >hiod -> "host IOMMU device"
>>
>> Will do.
>>
>> >
>> >> devid and hwpt_id which are ready after attach_device().
>> >
>> >I find the above sentence difficult to understand.
>>
>> This is trying to explain the reason of order between attach_device(),
>get_viommu_cap() and hiod realizing.
>> What about:
>>
>> 4. host IOMMU capabilities are passed to vIOMMU through
>set_iommu_device()
>>interface which have to be after attach_device(), when
>get_viommu_cap()
>>is called in attach_device(), there is no way for vIOMMU to get host
>>IOMMU capabilities yet, so only emulated capabilities can be returned.
>>See below sequence:
>>
>>  attach_device()
>>get_viommu_cap()
>>create hwpt
>>  ...
>>  vfio_device_hiod_create_and_realize()
>>  set_iommu_device(hiod)
>
>I think it should be:
>
>vfio_device_attach():
>iommufd_cdev_attach():
>pci_device_get_viommu_cap() for HW nesting cap
>create a nesting parent hwpt
>attach device to the hwpt
>vfio_device_hiod_create_and_realize() creating hiod
>...
>pci_device_set_iommu_device(hiod)

Good catch, done, 
https://github.com/yiliu1765/qemu/commit/34b02bdc0d615c60d1b07bfee842aabf1f7e2b28

Thanks
Zhenzhong



Re: [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()

2025-08-04 Thread Nicolin Chen
On Wed, Jul 30, 2025 at 10:51:13AM +, Duan, Zhenzhong wrote:
> >> 2. there can also be more than one vIOMMUs with different user
> >> configuration, e.g., arm smmuv3.

That's correct. But would you please elaborate how different user
configurations would benefit from this new op? I don't see a good
reasoning behind that.

> >> 4. It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
> >> because we need get_viommu_cap() to determine if creating nested
> >parent
> >> hwpt or not at attaching stage, meanwhile hiod realize needs
> >iommufd,
> >
> >hiod -> "host IOMMU device"
> 
> Will do.
> 
> >
> >> devid and hwpt_id which are ready after attach_device().
> >
> >I find the above sentence difficult to understand.
> 
> This is trying to explain the reason of order between attach_device(), 
> get_viommu_cap() and hiod realizing.
> What about:
> 
> 4. host IOMMU capabilities are passed to vIOMMU through set_iommu_device()
>interface which have to be after attach_device(), when get_viommu_cap()
>is called in attach_device(), there is no way for vIOMMU to get host
>IOMMU capabilities yet, so only emulated capabilities can be returned.
>See below sequence:
> 
>  attach_device()
>get_viommu_cap()
>create hwpt
>  ...
>  vfio_device_hiod_create_and_realize()
>  set_iommu_device(hiod)

I think it should be:

vfio_device_attach():
iommufd_cdev_attach():
pci_device_get_viommu_cap() for HW nesting cap
create a nesting parent hwpt
attach device to the hwpt
vfio_device_hiod_create_and_realize() creating hiod
...
pci_device_set_iommu_device(hiod)

?

Thanks
Nicolin



RE: [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()

2025-07-30 Thread Duan, Zhenzhong
Hi Cédric,

>-Original Message-
>From: Cédric Le Goater 
>Subject: Re: [PATCH v4 02/20] hw/pci: Introduce
>pci_device_get_viommu_cap()
>
>Hello Zhenzhong
>
>On 7/29/25 11:20, Zhenzhong Duan wrote:
>> Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
>> allows to retrieve capabilities exposed by a vIOMMU. The first planned
>vIOMMU device
>
>"device" is implied, but I find it easier to understand if it is stated
>openly.

Will do.

>
>> capability is VIOMMU_CAP_HW_NESTED that advertises the support of HW
>> nested stage translation scheme. pci_device_get_viommu_cap is a wrapper
>> that can be called on a PCI device potentially protected by a vIOMMU.
>>
>> get_viommu_cap() is designed to return 64bit bitmap of purely emulated
>> capabilities which are only derermined by user's configuration, no host
>
>typo: determined

Will fix.

>
>> capabilities involved. Reasons are:
>>
>> 1. there can be more than one host IOMMUs with different capabilities
>
>"host IOMMU devices"

Will do.

>
>Such as iommufd and VFIO IOMMU Type1 ?

Not VFIO iommu backend, I mean host iommu hardware units.

>
>> 2. there can also be more than one vIOMMUs with different user
>> configuration, e.g., arm smmuv3.
>> 3. This is migration friendly, return value is consistent between source
>> and target.
>> 4. It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
>> because we need get_viommu_cap() to determine if creating nested
>parent
>> hwpt or not at attaching stage, meanwhile hiod realize needs
>iommufd,
>
>hiod -> "host IOMMU device"

Will do.

>
>> devid and hwpt_id which are ready after attach_device().
>
>I find the above sentence difficult to understand.

This is trying to explain the reason of order between attach_device(), 
get_viommu_cap() and hiod realizing.
What about:

4. host IOMMU capabilities are passed to vIOMMU through set_iommu_device()
   interface which have to be after attach_device(), when get_viommu_cap()
   is called in attach_device(), there is no way for vIOMMU to get host
   IOMMU capabilities yet, so only emulated capabilities can be returned.
   See below sequence:

 attach_device()
   get_viommu_cap()
   create hwpt
 ...
 vfio_device_hiod_create_and_realize()
 set_iommu_device(hiod)


>
>
>> See below sequence:
>>
>>   attach_device()
>> get_viommu_cap()
>> create hwpt
>>   ...
>>   create hiod
>>   set_iommu_device(hiod)
>>
>> Suggested-by: Yi Liu 
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>   MAINTAINERS  |  1 +
>>   include/hw/iommu.h   | 17 +
>>   include/hw/pci/pci.h | 25 +
>>   hw/pci/pci.c | 11 +++
>>   4 files changed, 54 insertions(+)
>>   create mode 100644 include/hw/iommu.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 37879ab64e..840cb1e604 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2304,6 +2304,7 @@ F: include/system/iommufd.h
>>   F: backends/host_iommu_device.c
>>   F: include/system/host_iommu_device.h
>>   F: include/qemu/chardev_open.h
>> +F: include/hw/iommu.h
>>   F: util/chardev_open.c
>>   F: docs/devel/vfio-iommufd.rst
>>
>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> new file mode 100644
>> index 00..021db50db5
>> --- /dev/null
>> +++ b/include/hw/iommu.h
>> @@ -0,0 +1,17 @@
>> +/*
>> + * General vIOMMU capabilities, flags, etc
>> + *
>> + * Copyright (C) 2025 Intel Corporation.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#ifndef HW_IOMMU_H
>> +#define HW_IOMMU_H
>> +
>> +enum {
>> +/* hardware nested stage-1 page table support */
>> +VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
>> +};
>> +
>> +#endif /* HW_IOMMU_H */
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index 6b7d3ac8a3..d89aefc030 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -462,6 +462,21 @@ typedef struct PCIIOMMUOps {
>>* @devfn: device and function number of the PCI device.
>>*/
>>   void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
>> +/**
>> + * @get_viommu_cap: get vIOMMU capabilities
>> + *
>> + * Optional callback, if not implemented, then vIOMMU doesn't
>> + * support ex

Re: [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()

2025-07-29 Thread Cédric Le Goater

Hello Zhenzhong

On 7/29/25 11:20, Zhenzhong Duan wrote:

Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
allows to retrieve capabilities exposed by a vIOMMU. The first planned

   vIOMMU device

"device" is implied, but I find it easier to understand if it is stated
openly.


capability is VIOMMU_CAP_HW_NESTED that advertises the support of HW
nested stage translation scheme. pci_device_get_viommu_cap is a wrapper
that can be called on a PCI device potentially protected by a vIOMMU.

get_viommu_cap() is designed to return 64bit bitmap of purely emulated
capabilities which are only derermined by user's configuration, no host


typo: determined


capabilities involved. Reasons are:

1. there can be more than one host IOMMUs with different capabilities


"host IOMMU devices"

Such as iommufd and VFIO IOMMU Type1 ?


2. there can also be more than one vIOMMUs with different user
configuration, e.g., arm smmuv3.
3. This is migration friendly, return value is consistent between source
and target.
4. It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
because we need get_viommu_cap() to determine if creating nested parent
hwpt or not at attaching stage, meanwhile hiod realize needs iommufd,


hiod -> "host IOMMU device"


devid and hwpt_id which are ready after attach_device().


I find the above sentence difficult to understand.



See below sequence:

  attach_device()
get_viommu_cap()
create hwpt
  ...
  create hiod
  set_iommu_device(hiod)

Suggested-by: Yi Liu 
Signed-off-by: Zhenzhong Duan 
---
  MAINTAINERS  |  1 +
  include/hw/iommu.h   | 17 +
  include/hw/pci/pci.h | 25 +
  hw/pci/pci.c | 11 +++
  4 files changed, 54 insertions(+)
  create mode 100644 include/hw/iommu.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 37879ab64e..840cb1e604 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2304,6 +2304,7 @@ F: include/system/iommufd.h
  F: backends/host_iommu_device.c
  F: include/system/host_iommu_device.h
  F: include/qemu/chardev_open.h
+F: include/hw/iommu.h
  F: util/chardev_open.c
  F: docs/devel/vfio-iommufd.rst
  
diff --git a/include/hw/iommu.h b/include/hw/iommu.h

new file mode 100644
index 00..021db50db5
--- /dev/null
+++ b/include/hw/iommu.h
@@ -0,0 +1,17 @@
+/*
+ * General vIOMMU capabilities, flags, etc
+ *
+ * Copyright (C) 2025 Intel Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_IOMMU_H
+#define HW_IOMMU_H
+
+enum {
+/* hardware nested stage-1 page table support */
+VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
+};
+
+#endif /* HW_IOMMU_H */
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6b7d3ac8a3..d89aefc030 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -462,6 +462,21 @@ typedef struct PCIIOMMUOps {
   * @devfn: device and function number of the PCI device.
   */
  void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
+/**
+ * @get_viommu_cap: get vIOMMU capabilities
+ *
+ * Optional callback, if not implemented, then vIOMMU doesn't
+ * support exposing capabilities to other subsystem, e.g., VFIO.
+ * vIOMMU can choose which capabilities to expose.
+ *
+ * @opaque: the data passed to pci_setup_iommu().
+ *
+ * Returns: 64bit bitmap with each bit represents a capability emulated by
+ * VIOMMU_CAP_* in include/hw/iommu.h, these capabilities are theoretical
+ * which are only determined by user's configuration and independent on the


What is meant by "user's configuration" ? the vIOMMU device properties ?


+ * actual host capabilities they may depend on.
+ */
+uint64_t (*get_viommu_cap)(void *opaque);
  /**
   * @get_iotlb_info: get properties required to initialize a device IOTLB.
   *
@@ -642,6 +657,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, 
HostIOMMUDevice *hiod,
   Error **errp);
  void pci_device_unset_iommu_device(PCIDevice *dev);
  
+/**

+ * pci_device_get_viommu_cap: get vIOMMU capabilities.
+ *
+ * Returns a 64bit bitmap with each bit represents a vIOMMU exposed
+ * capability, 0 if vIOMMU doesn't support esposing capabilities.


exposing

Thanks,

C.




+ *
+ * @dev: PCI device pointer.
+ */
+uint64_t pci_device_get_viommu_cap(PCIDevice *dev);
+
  /**
   * pci_iommu_get_iotlb_info: get properties required to initialize a
   * device IOTLB.
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index c70b5ceeba..df1fb615a8 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2992,6 +2992,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
  }
  }
  
+uint64_t pci_device_get_viommu_cap(PCIDevice *dev)

+{
+PCIBus *iommu_bus;
+
+pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
+if (iommu_bus && iommu_bus->iommu_ops->get_viommu_cap) {
+return io

[PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()

2025-07-29 Thread Zhenzhong Duan
Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
allows to retrieve capabilities exposed by a vIOMMU. The first planned
capability is VIOMMU_CAP_HW_NESTED that advertises the support of HW
nested stage translation scheme. pci_device_get_viommu_cap is a wrapper
that can be called on a PCI device potentially protected by a vIOMMU.

get_viommu_cap() is designed to return 64bit bitmap of purely emulated
capabilities which are only derermined by user's configuration, no host
capabilities involved. Reasons are:

1. there can be more than one host IOMMUs with different capabilities
2. there can also be more than one vIOMMUs with different user
   configuration, e.g., arm smmuv3.
3. This is migration friendly, return value is consistent between source
   and target.
4. It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
   because we need get_viommu_cap() to determine if creating nested parent
   hwpt or not at attaching stage, meanwhile hiod realize needs iommufd,
   devid and hwpt_id which are ready after attach_device().
   See below sequence:

 attach_device()
   get_viommu_cap()
   create hwpt
 ...
 create hiod
 set_iommu_device(hiod)

Suggested-by: Yi Liu 
Signed-off-by: Zhenzhong Duan 
---
 MAINTAINERS  |  1 +
 include/hw/iommu.h   | 17 +
 include/hw/pci/pci.h | 25 +
 hw/pci/pci.c | 11 +++
 4 files changed, 54 insertions(+)
 create mode 100644 include/hw/iommu.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 37879ab64e..840cb1e604 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2304,6 +2304,7 @@ F: include/system/iommufd.h
 F: backends/host_iommu_device.c
 F: include/system/host_iommu_device.h
 F: include/qemu/chardev_open.h
+F: include/hw/iommu.h
 F: util/chardev_open.c
 F: docs/devel/vfio-iommufd.rst
 
diff --git a/include/hw/iommu.h b/include/hw/iommu.h
new file mode 100644
index 00..021db50db5
--- /dev/null
+++ b/include/hw/iommu.h
@@ -0,0 +1,17 @@
+/*
+ * General vIOMMU capabilities, flags, etc
+ *
+ * Copyright (C) 2025 Intel Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_IOMMU_H
+#define HW_IOMMU_H
+
+enum {
+/* hardware nested stage-1 page table support */
+VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
+};
+
+#endif /* HW_IOMMU_H */
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6b7d3ac8a3..d89aefc030 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -462,6 +462,21 @@ typedef struct PCIIOMMUOps {
  * @devfn: device and function number of the PCI device.
  */
 void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
+/**
+ * @get_viommu_cap: get vIOMMU capabilities
+ *
+ * Optional callback, if not implemented, then vIOMMU doesn't
+ * support exposing capabilities to other subsystem, e.g., VFIO.
+ * vIOMMU can choose which capabilities to expose.
+ *
+ * @opaque: the data passed to pci_setup_iommu().
+ *
+ * Returns: 64bit bitmap with each bit represents a capability emulated by
+ * VIOMMU_CAP_* in include/hw/iommu.h, these capabilities are theoretical
+ * which are only determined by user's configuration and independent on the
+ * actual host capabilities they may depend on.
+ */
+uint64_t (*get_viommu_cap)(void *opaque);
 /**
  * @get_iotlb_info: get properties required to initialize a device IOTLB.
  *
@@ -642,6 +657,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, 
HostIOMMUDevice *hiod,
  Error **errp);
 void pci_device_unset_iommu_device(PCIDevice *dev);
 
+/**
+ * pci_device_get_viommu_cap: get vIOMMU capabilities.
+ *
+ * Returns a 64bit bitmap with each bit represents a vIOMMU exposed
+ * capability, 0 if vIOMMU doesn't support esposing capabilities.
+ *
+ * @dev: PCI device pointer.
+ */
+uint64_t pci_device_get_viommu_cap(PCIDevice *dev);
+
 /**
  * pci_iommu_get_iotlb_info: get properties required to initialize a
  * device IOTLB.
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index c70b5ceeba..df1fb615a8 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2992,6 +2992,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
 }
 }
 
+uint64_t pci_device_get_viommu_cap(PCIDevice *dev)
+{
+PCIBus *iommu_bus;
+
+pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
+if (iommu_bus && iommu_bus->iommu_ops->get_viommu_cap) {
+return iommu_bus->iommu_ops->get_viommu_cap(iommu_bus->iommu_opaque);
+}
+return 0;
+}
+
 int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
  bool exec_req, hwaddr addr, bool lpig,
  uint16_t prgi, bool is_read, bool is_write)
-- 
2.47.1