>-----Original Message-----
>From: Joao Martins <joao.m.mart...@oracle.com>
>Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>
>On 17/07/2024 11:05, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Joao Martins <joao.m.mart...@oracle.com>
>>> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>>> creation
>>>
>>> On 17/07/2024 03:18, Duan, Zhenzhong wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Joao Martins <joao.m.mart...@oracle.com>
>>>>> Subject: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>>> creation
>>>>>
>>>>> There's generally two modes of operation for IOMMUFD:
>>>>>
>>>>> * The simple user API which intends to perform relatively simple things
>>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to
>VFIO
>>>>> and mainly performs IOAS_MAP and UNMAP.
>>>>>
>>>>> * The native IOMMUFD API where you have fine grained control of the
>>>>> IOMMU domain and model it accordingly. This is where most new
>feature
>>>>> are being steered to.
>>>>>
>>>>> For dirty tracking 2) is required, as it needs to ensure that
>>>>> the stage-2/parent IOMMU domain will only attach devices
>>>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>>>> useful guarantee to VMMs that will refuse incompatible device
>>>>> attachments for IOMMU domains.
>>>>>
>>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>>>> responsible for creating an IOMMU domain. This is contrast to the
>>>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>>>>> automatically
>>>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>>>> the needed handling for mdevs.
>>>>>
>>>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>>>> similar logic, where IOMMU domains are created and devices attached
>to
>>>>> compatible domains. Essentially mimmicing kernel
>>>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>>> IOMMU
>>>>> domain
>>>>> it falls back to IOAS attach.
>>>>>
>>>>> The auto domain logic allows different IOMMU domains to be created
>>> when
>>>>> DMA dirty tracking is not desired (and VF can provide it), and others
>>> where
>>>>> it is. Here is not used in this way here given how VFIODevice migration
>>>>> state is initialized after the device attachment. But such mixed mode of
>>>>> IOMMU dirty tracking + device dirty tracking is an improvement that
>can
>>>>> be added on. Keep the 'all of nothing' of type1 approach that we have
>>>>> been using so far between container vs device dirty tracking.
>>>>>
>>>>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
>>>>> ---
>>>>> include/hw/vfio/vfio-common.h |  9 ++++
>>>>> include/sysemu/iommufd.h      |  5 +++
>>>>> backends/iommufd.c            | 30 +++++++++++++
>>>>> hw/vfio/iommufd.c             | 82
>>>>> +++++++++++++++++++++++++++++++++++
>>>>> backends/trace-events         |  1 +
>>>>> 5 files changed, 127 insertions(+)
>>>>>
>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>>>> common.h
>>>>> index 7419466bca92..2dd468ce3c02 100644
>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>>>
>>>>> typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>>>
>>>>> +typedef struct VFIOIOASHwpt {
>>>>> +    uint32_t hwpt_id;
>>>>> +    QLIST_HEAD(, VFIODevice) device_list;
>>>>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>>>>> +} VFIOIOASHwpt;
>>>>> +
>>>>> typedef struct VFIOIOMMUFDContainer {
>>>>>     VFIOContainerBase bcontainer;
>>>>>     IOMMUFDBackend *be;
>>>>>     uint32_t ioas_id;
>>>>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>> } VFIOIOMMUFDContainer;
>>>>>
>>>>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>>>>> VFIO_IOMMU_IOMMUFD);
>>>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>>>     HostIOMMUDevice *hiod;
>>>>>     int devid;
>>>>>     IOMMUFDBackend *iommufd;
>>>>> +    VFIOIOASHwpt *hwpt;
>>>>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>>>>> } VFIODevice;
>>>>>
>>>>> struct VFIODeviceOps {
>>>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>>>> index 57d502a1c79a..e917e7591d05 100644
>>>>> --- a/include/sysemu/iommufd.h
>>>>> +++ b/include/sysemu/iommufd.h
>>>>> @@ -50,6 +50,11 @@ int
>>>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>>> ioas_id,
>>>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>>> uint32_t
>>>>> devid,
>>>>>                                      uint32_t *type, void *data, uint32_t 
>>>>> len,
>>>>>                                      uint64_t *caps, Error **errp);
>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be,
>uint32_t
>>>>> dev_id,
>>>>> +                                uint32_t pt_id, uint32_t flags,
>>>>> +                                uint32_t data_type, uint32_t data_len,
>>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>>> +                                Error **errp);
>>>>>
>>>>> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>>>>> TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>>>> #endif
>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>> index 2b3d51af26d2..5d3dfa917415 100644
>>>>> --- a/backends/iommufd.c
>>>>> +++ b/backends/iommufd.c
>>>>> @@ -208,6 +208,36 @@ int
>>>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>>> ioas_id,
>>>>>     return ret;
>>>>> }
>>>>>
>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be,
>uint32_t
>>>>> dev_id,
>>>>> +                                uint32_t pt_id, uint32_t flags,
>>>>> +                                uint32_t data_type, uint32_t data_len,
>>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>>> +                                Error **errp)
>>>>> +{
>>>>> +    int ret, fd = be->fd;
>>>>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>>>>> +        .size = sizeof(struct iommu_hwpt_alloc),
>>>>> +        .flags = flags,
>>>>> +        .dev_id = dev_id,
>>>>> +        .pt_id = pt_id,
>>>>> +        .data_type = data_type,
>>>>> +        .data_len = data_len,
>>>>> +        .data_uptr = (uint64_t)data_ptr,
>>>>> +    };
>>>>> +
>>>>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>>>>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags,
>>> data_type,
>>>>> +                                     data_len, (uint64_t)data_ptr,
>>>>> +                                     alloc_hwpt.out_hwpt_id, ret);
>>>>> +    if (ret) {
>>>>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>>>>> +    return true;
>>>>> +}
>>>>> +
>>>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>>> uint32_t
>>>>> devid,
>>>>>                                      uint32_t *type, void *data, uint32_t 
>>>>> len,
>>>>>                                      uint64_t *caps, Error **errp)
>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>> index 077dea8f1b64..325c7598d5a1 100644
>>>>> --- a/hw/vfio/iommufd.c
>>>>> +++ b/hw/vfio/iommufd.c
>>>>> @@ -212,10 +212,86 @@ static bool
>>>>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error
>**errp)
>>>>>     return true;
>>>>> }
>>>>>
>>>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>>> +                                         VFIOIOMMUFDContainer *container,
>>>>> +                                         Error **errp)
>>>>> +{
>>>>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>>>> +    uint32_t flags = 0;
>>>>> +    VFIOIOASHwpt *hwpt;
>>>>> +    uint32_t hwpt_id;
>>>>> +    int ret;
>>>>> +
>>>>> +    /* Try to find a domain */
>>>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>>>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt-
>>hwpt_id,
>>>>> errp);
>>>>
>>>> If there is already an hwpt that supports dirty tracking.
>>>> Another device that doesn't support dirty tracking attaches to this hwpt,
>>> will it succeed?
>>>>
>>>
>>> It returns -EINVAL, and we handle that right after this statement. Which
>>> means
>>> another HWPT is created.
>>
>> Looked into kernel code, I didn't see the check about dirty tracking
>between device and hwpt, do you know which func does that?
>>
>
>A device is associated with a group (aka IOMMU instance) and those checks
>happens when the device in a group is firstly being attached the first time or
>belongs to some *other* group and gets attach to this domain with dirty
>tracking
>enforced. If the device belongs to the same group that had a device attached
>already there's just a bump in the refcount and device is added to the /same
>group/ device list. Otherwise the device belongs to a different group and it's
>being attached to a domain and the various checks get triggered (dirty
>tracking
>being one of them). These attachment validation checks are part of the
>iommu
>driver, not core (the core just sees a .attach_dev() failure).
>
>Usually follows this codepath when the group attachment checks are firstly
>being
>done:
>
>vfio_iommufd_physical_attach_ioas()
> iommufd_device_attach()
>  iommufd_device_do_attach()
>   iommufd_hw_pagetable_attach()
>    iommu_attach_group()
>    ...
>     __iommu_attach_device()
>
>Then each iommu driver defines the compatibility checks and if the domain
>has
>dirty_ops set (that comes from this ALLOC_DIRTY_TRACKING flag) and the
>IOMMU
>backing the device doesn't have dirty tracking the driver returns -EINVAL
>e.g. on Intel IOMMU:
>
>intel_iommu_attach_device()
>  prepare_domain_attach_device():
>    domain->dirty_ops && !ssads_supported(iommu)
>       return -EINVAL;

Understood, thanks.

>
>
>>>
>>>> If existing hwpt doesn't support dirty tracking.
>>>> Another device supporting dirty tracking attaches to that hwpt, what
>will
>>> happen?
>>>>
>>>
>>> Hmm, It succeeds as there's no incompatbility. At the very least I plan on
>>> blocking migration if the device neither has VF dirty tracking, nor IOMMU
>>> dirty
>>> tracking (and patch 11 needs to be adjusted to check hwpt_flags instead
>of
>>> container).
>>
>> When bcontainer->dirty_pages_supported is true, I think that container
>should only contains hwpt list that support dirty tracking. All hwpt not
>supporting dirty tracking should be in other container.
>>
>Well but we are adopting this auto domains scheme and works for any
>device,
>dirty tracking or not. We already track hwpt flags so we know which ones
>support
>dirty tracking. This differentiation would (IMHO) complicate more and I am
>not
>sure the gain

OK, I was trying to make bcontainer->dirty_pages_supported  accurate because it 
is used in many functions such as vfio_get_dirty_bitmap() which require an 
accurate value. If there is mix of hwpt in that container, that's impossible.

But as you say you want to address the mix issue in a follow-up and presume all 
are homogeneous hw for now, then OK, there is no conflict.

>
>> If device supports dirty tracking, it should bypass attaching container that
>doesn't support dirty tracking. Vise versa.
>> This way we can support the mixing environment.
>>
>
>It's not that easy as the whole flow doesn't handle this mixed mode (even
>excluding this series). We would to have device-dirty-tracking start all
>non-disabled device trackers first [and stop them as well], and then we
>would
>always iterate those first (if device dirty trackers are active), and then 
>defer
>to IOMMU tracker for those who don't.

Why is device-dirty-tracking preferred over IOMMU dirty tracking?
Imagine if many devices attached to same domain.

>
>But given this mixed mode might be prone to regressions plus with me being
>dangerously close to softfreeze too, I was deeming it follow-up. And hence
>hoping I improve detection when the IOMMU doesn't provide the lowest
>common
>denominator for the 'all or nothing' mode then it would block migration. I
>can
>turn that if statement in {start,query}_dirty_tracking into an assert if that
>improves things.

OK

>
>>
>>>
>>> Qemu right now doesn't handle heteregenous environment, it's all of
>>> nothing
>>> approach even before this patchset. Additionally, I am not sure server
>>> environments are applicable here. So essentially I kept the status quo --
>>> more
>>> follow-up is needed to support a mix and match of IOMMU + VF dirty
>>> tracking too.
>>> The challenge is having the migration state of VFIO device initialized early
>>> enough that we can make all sort of decisions whether IOMMU dirty
>tracking
>>> is
>>> desired on a per-device basis.

OK.

Thanks
Zhenzhong

Reply via email to