On 17/07/2024 11:05, Duan, Zhenzhong wrote: >> -----Original Message----- >> From: Joao Martins <joao.m.mart...@oracle.com> >> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain >> creation >> >> On 17/07/2024 03:18, Duan, Zhenzhong wrote: >>> >>> >>>> -----Original Message----- >>>> From: Joao Martins <joao.m.mart...@oracle.com> >>>> Subject: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain >> creation >>>> >>>> There's generally two modes of operation for IOMMUFD: >>>> >>>> * The simple user API which intends to perform relatively simple things >>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO >>>> and mainly performs IOAS_MAP and UNMAP. >>>> >>>> * The native IOMMUFD API where you have fine grained control of the >>>> IOMMU domain and model it accordingly. This is where most new feature >>>> are being steered to. >>>> >>>> For dirty tracking 2) is required, as it needs to ensure that >>>> the stage-2/parent IOMMU domain will only attach devices >>>> that support dirty tracking (so far it is all homogeneous in x86, likely >>>> not the case for smmuv3). Such invariant on dirty tracking provides a >>>> useful guarantee to VMMs that will refuse incompatible device >>>> attachments for IOMMU domains. >>>> >>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is >>>> responsible for creating an IOMMU domain. This is contrast to the >>>> 'simple API' where the IOMMU domain is created by IOMMUFD >>>> automatically >>>> when it attaches to VFIO (usually referred as autodomains) but it has >>>> the needed handling for mdevs. >>>> >>>> To support dirty tracking with the advanced IOMMUFD API, it needs >>>> similar logic, where IOMMU domains are created and devices attached to >>>> compatible domains. Essentially mimmicing kernel >>>> iommufd_device_auto_get_domain(). With mdevs given there's no >> IOMMU >>>> domain >>>> it falls back to IOAS attach. >>>> >>>> The auto domain logic allows different IOMMU domains to be created >> when >>>> DMA dirty tracking is not desired (and VF can provide it), and others >> where >>>> it is. Here is not used in this way here given how VFIODevice migration >>>> state is initialized after the device attachment. But such mixed mode of >>>> IOMMU dirty tracking + device dirty tracking is an improvement that can >>>> be added on. Keep the 'all of nothing' of type1 approach that we have >>>> been using so far between container vs device dirty tracking. >>>> >>>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com> >>>> --- >>>> include/hw/vfio/vfio-common.h | 9 ++++ >>>> include/sysemu/iommufd.h | 5 +++ >>>> backends/iommufd.c | 30 +++++++++++++ >>>> hw/vfio/iommufd.c | 82 >>>> +++++++++++++++++++++++++++++++++++ >>>> backends/trace-events | 1 + >>>> 5 files changed, 127 insertions(+) >>>> >>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio- >>>> common.h >>>> index 7419466bca92..2dd468ce3c02 100644 >>>> --- a/include/hw/vfio/vfio-common.h >>>> +++ b/include/hw/vfio/vfio-common.h >>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow { >>>> >>>> typedef struct IOMMUFDBackend IOMMUFDBackend; >>>> >>>> +typedef struct VFIOIOASHwpt { >>>> + uint32_t hwpt_id; >>>> + QLIST_HEAD(, VFIODevice) device_list; >>>> + QLIST_ENTRY(VFIOIOASHwpt) next; >>>> +} VFIOIOASHwpt; >>>> + >>>> typedef struct VFIOIOMMUFDContainer { >>>> VFIOContainerBase bcontainer; >>>> IOMMUFDBackend *be; >>>> uint32_t ioas_id; >>>> + QLIST_HEAD(, VFIOIOASHwpt) hwpt_list; >>>> } VFIOIOMMUFDContainer; >>>> >>>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, >>>> VFIO_IOMMU_IOMMUFD); >>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice { >>>> HostIOMMUDevice *hiod; >>>> int devid; >>>> IOMMUFDBackend *iommufd; >>>> + VFIOIOASHwpt *hwpt; >>>> + QLIST_ENTRY(VFIODevice) hwpt_next; >>>> } VFIODevice; >>>> >>>> struct VFIODeviceOps { >>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h >>>> index 57d502a1c79a..e917e7591d05 100644 >>>> --- a/include/sysemu/iommufd.h >>>> +++ b/include/sysemu/iommufd.h >>>> @@ -50,6 +50,11 @@ int >>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t >> ioas_id, >>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, >> uint32_t >>>> devid, >>>> uint32_t *type, void *data, uint32_t >>>> len, >>>> uint64_t *caps, Error **errp); >>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t >>>> dev_id, >>>> + uint32_t pt_id, uint32_t flags, >>>> + uint32_t data_type, uint32_t data_len, >>>> + void *data_ptr, uint32_t *out_hwpt, >>>> + Error **errp); >>>> >>>> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD >>>> TYPE_HOST_IOMMU_DEVICE "-iommufd" >>>> #endif >>>> diff --git a/backends/iommufd.c b/backends/iommufd.c >>>> index 2b3d51af26d2..5d3dfa917415 100644 >>>> --- a/backends/iommufd.c >>>> +++ b/backends/iommufd.c >>>> @@ -208,6 +208,36 @@ int >>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t >> ioas_id, >>>> return ret; >>>> } >>>> >>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t >>>> dev_id, >>>> + uint32_t pt_id, uint32_t flags, >>>> + uint32_t data_type, uint32_t data_len, >>>> + void *data_ptr, uint32_t *out_hwpt, >>>> + Error **errp) >>>> +{ >>>> + int ret, fd = be->fd; >>>> + struct iommu_hwpt_alloc alloc_hwpt = { >>>> + .size = sizeof(struct iommu_hwpt_alloc), >>>> + .flags = flags, >>>> + .dev_id = dev_id, >>>> + .pt_id = pt_id, >>>> + .data_type = data_type, >>>> + .data_len = data_len, >>>> + .data_uptr = (uint64_t)data_ptr, >>>> + }; >>>> + >>>> + ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt); >>>> + trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, >> data_type, >>>> + data_len, (uint64_t)data_ptr, >>>> + alloc_hwpt.out_hwpt_id, ret); >>>> + if (ret) { >>>> + error_setg_errno(errp, errno, "Failed to allocate hwpt"); >>>> + return false; >>>> + } >>>> + >>>> + *out_hwpt = alloc_hwpt.out_hwpt_id; >>>> + return true; >>>> +} >>>> + >>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, >> uint32_t >>>> devid, >>>> uint32_t *type, void *data, uint32_t >>>> len, >>>> uint64_t *caps, Error **errp) >>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c >>>> index 077dea8f1b64..325c7598d5a1 100644 >>>> --- a/hw/vfio/iommufd.c >>>> +++ b/hw/vfio/iommufd.c >>>> @@ -212,10 +212,86 @@ static bool >>>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp) >>>> return true; >>>> } >>>> >>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev, >>>> + VFIOIOMMUFDContainer *container, >>>> + Error **errp) >>>> +{ >>>> + IOMMUFDBackend *iommufd = vbasedev->iommufd; >>>> + uint32_t flags = 0; >>>> + VFIOIOASHwpt *hwpt; >>>> + uint32_t hwpt_id; >>>> + int ret; >>>> + >>>> + /* Try to find a domain */ >>>> + QLIST_FOREACH(hwpt, &container->hwpt_list, next) { >>>> + ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, >>>> errp); >>> >>> If there is already an hwpt that supports dirty tracking. >>> Another device that doesn't support dirty tracking attaches to this hwpt, >> will it succeed? >>> >> >> It returns -EINVAL, and we handle that right after this statement. Which >> means >> another HWPT is created. > > Looked into kernel code, I didn't see the check about dirty tracking between > device and hwpt, do you know which func does that? >
A device is associated with a group (aka IOMMU instance) and those checks happens when the device in a group is firstly being attached the first time or belongs to some *other* group and gets attach to this domain with dirty tracking enforced. If the device belongs to the same group that had a device attached already there's just a bump in the refcount and device is added to the /same group/ device list. Otherwise the device belongs to a different group and it's being attached to a domain and the various checks get triggered (dirty tracking being one of them). These attachment validation checks are part of the iommu driver, not core (the core just sees a .attach_dev() failure). Usually follows this codepath when the group attachment checks are firstly being done: vfio_iommufd_physical_attach_ioas() iommufd_device_attach() iommufd_device_do_attach() iommufd_hw_pagetable_attach() iommu_attach_group() ... __iommu_attach_device() Then each iommu driver defines the compatibility checks and if the domain has dirty_ops set (that comes from this ALLOC_DIRTY_TRACKING flag) and the IOMMU backing the device doesn't have dirty tracking the driver returns -EINVAL e.g. on Intel IOMMU: intel_iommu_attach_device() prepare_domain_attach_device(): domain->dirty_ops && !ssads_supported(iommu) return -EINVAL; >> >>> If existing hwpt doesn't support dirty tracking. >>> Another device supporting dirty tracking attaches to that hwpt, what will >> happen? >>> >> >> Hmm, It succeeds as there's no incompatbility. At the very least I plan on >> blocking migration if the device neither has VF dirty tracking, nor IOMMU >> dirty >> tracking (and patch 11 needs to be adjusted to check hwpt_flags instead of >> container). > > When bcontainer->dirty_pages_supported is true, I think that container should > only contains hwpt list that support dirty tracking. All hwpt not supporting > dirty tracking should be in other container. > Well but we are adopting this auto domains scheme and works for any device, dirty tracking or not. We already track hwpt flags so we know which ones support dirty tracking. This differentiation would (IMHO) complicate more and I am not sure the gain > If device supports dirty tracking, it should bypass attaching container that > doesn't support dirty tracking. Vise versa. > This way we can support the mixing environment. > It's not that easy as the whole flow doesn't handle this mixed mode (even excluding this series). We would to have device-dirty-tracking start all non-disabled device trackers first [and stop them as well], and then we would always iterate those first (if device dirty trackers are active), and then defer to IOMMU tracker for those who don't. But given this mixed mode might be prone to regressions plus with me being dangerously close to softfreeze too, I was deeming it follow-up. And hence hoping I improve detection when the IOMMU doesn't provide the lowest common denominator for the 'all or nothing' mode then it would block migration. I can turn that if statement in {start,query}_dirty_tracking into an assert if that improves things. > >> >> Qemu right now doesn't handle heteregenous environment, it's all of >> nothing >> approach even before this patchset. Additionally, I am not sure server >> environments are applicable here. So essentially I kept the status quo -- >> more >> follow-up is needed to support a mix and match of IOMMU + VF dirty >> tracking too. >> The challenge is having the migration state of VFIO device initialized early >> enough that we can make all sort of decisions whether IOMMU dirty tracking >> is >> desired on a per-device basis.