On 17/07/2024 10:48, Duan, Zhenzhong wrote: > > >> -----Original Message----- >> From: Joao Martins <joao.m.mart...@oracle.com> >> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain >> creation >> >> On 17/07/2024 03:52, Duan, Zhenzhong wrote: >>> >>> >>>> -----Original Message----- >>>> From: Joao Martins <joao.m.mart...@oracle.com> >>>> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain >>>> creation >>>> >>>> On 16/07/2024 17:44, Joao Martins wrote: >>>>> On 16/07/2024 17:04, Eric Auger wrote: >>>>>> Hi Joao, >>>>>> >>>>>> On 7/12/24 13:46, Joao Martins wrote: >>>>>>> There's generally two modes of operation for IOMMUFD: >>>>>>> >>>>>>> * The simple user API which intends to perform relatively simple >> things >>>>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to >> VFIO >>>>>> >>>>>> It generally creates? can you explicit what is "it" >>>>>> >>>>> 'It' here refers to the process/API-user >>>>> >>>>>> I am confused by this automatic terminology again (not your fault). the >>>> doc says: >>>>>> " >>>>>> >>>>>> * >>>>>> >>>>>> Automatic domain - refers to an iommu domain created >> automatically >>>>>> when attaching a device to an IOAS object. This is compatible to the >>>>>> semantics of VFIO type1. >>>>>> >>>>>> * >>>>>> >>>>>> Manual domain - refers to an iommu domain designated by the user >> as >>>>>> the target pagetable to be attached to by a device. Though currently >>>>>> there are no uAPIs to directly create such domain, the datastructure >>>>>> and algorithms are ready for handling that use case. >>>>>> >>>>>> " >>>>>> >>>>>> >>>>>> in 1) the device is attached to the ioas id (using the auto domain if I >>>>>> am >>>> not wrong) >>>>>> Here you attach to an hwpt id. Isn't it a manual domain? >>>>>> >>>>> >>>>> Correct. >>>>> >>>>> The 'auto domains' generally refers to the kernel-equivalent own >>>> automatic >>>>> attaching to a new pagetable. >>>>> >>>>> Here I call 'auto domains' in the userspace version too because we are >>>> doing the >>>>> exact same but from userspace, using the manual API in IOMMUFD. >>>>> >>>>>>> and mainly performs IOAS_MAP and UNMAP. >>>>>>> >>>>>>> * The native IOMMUFD API where you have fine grained control of >> the >>>>>>> IOMMU domain and model it accordingly. This is where most new >>>> feature >>>>>>> are being steered to. >>>>>>> >>>>>>> For dirty tracking 2) is required, as it needs to ensure that >>>>>>> the stage-2/parent IOMMU domain will only attach devices >>>>>>> that support dirty tracking (so far it is all homogeneous in x86, likely >>>>>>> not the case for smmuv3). Such invariant on dirty tracking provides a >>>>>>> useful guarantee to VMMs that will refuse incompatible device >>>>>>> attachments for IOMMU domains. >>>>>>> >>>>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is >>>>>>> responsible for creating an IOMMU domain. This is contrast to the >>>>>>> 'simple API' where the IOMMU domain is created by IOMMUFD >>>> automatically >>>>>>> when it attaches to VFIO (usually referred as autodomains) but it has >>>>>>> the needed handling for mdevs. >>>>>>> >>>>>>> To support dirty tracking with the advanced IOMMUFD API, it needs >>>>>>> similar logic, where IOMMU domains are created and devices >> attached >>>> to >>>>>>> compatible domains. Essentially mimmicing kernel >>>>>>> iommufd_device_auto_get_domain(). With mdevs given there's no >>>> IOMMU domain >>>>>>> it falls back to IOAS attach. >>>>>>> >>>>>>> The auto domain logic allows different IOMMU domains to be created >>>> when >>>>>>> DMA dirty tracking is not desired (and VF can provide it), and others >>>> where >>>>>>> it is. Here is not used in this way here given how VFIODevice >> migration >>>>>> >>>>>> Here is not used in this way here ? >>>>>> >>>>> >>>>> I meant, 'Here it is not used in this way given (...)' >>>>> >>>>>>> state is initialized after the device attachment. But such mixed mode >> of >>>>>>> IOMMU dirty tracking + device dirty tracking is an improvement that >>>> can >>>>>>> be added on. Keep the 'all of nothing' of type1 approach that we have >>>>>>> been using so far between container vs device dirty tracking. >>>>>>> >>>>>>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com> >>>>>>> --- >>>>>>> include/hw/vfio/vfio-common.h | 9 ++++ >>>>>>> include/sysemu/iommufd.h | 5 +++ >>>>>>> backends/iommufd.c | 30 +++++++++++++ >>>>>>> hw/vfio/iommufd.c | 82 >>>> +++++++++++++++++++++++++++++++++++ >>>>>>> backends/trace-events | 1 + >>>>>>> 5 files changed, 127 insertions(+) >>>>>>> >>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio- >>>> common.h >>>>>>> index 7419466bca92..2dd468ce3c02 100644 >>>>>>> --- a/include/hw/vfio/vfio-common.h >>>>>>> +++ b/include/hw/vfio/vfio-common.h >>>>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow { >>>>>>> >>>>>>> typedef struct IOMMUFDBackend IOMMUFDBackend; >>>>>>> >>>>>>> +typedef struct VFIOIOASHwpt { >>>>>>> + uint32_t hwpt_id; >>>>>>> + QLIST_HEAD(, VFIODevice) device_list; >>>>>>> + QLIST_ENTRY(VFIOIOASHwpt) next; >>>>>>> +} VFIOIOASHwpt; >>>>>>> + >>>>>>> typedef struct VFIOIOMMUFDContainer { >>>>>>> VFIOContainerBase bcontainer; >>>>>>> IOMMUFDBackend *be; >>>>>>> uint32_t ioas_id; >>>>>>> + QLIST_HEAD(, VFIOIOASHwpt) hwpt_list; >>>>>>> } VFIOIOMMUFDContainer; >>>>>>> >>>>>>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, >>>> VFIO_IOMMU_IOMMUFD); >>>>>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice { >>>>>>> HostIOMMUDevice *hiod; >>>>>>> int devid; >>>>>>> IOMMUFDBackend *iommufd; >>>>>>> + VFIOIOASHwpt *hwpt; >>>>>>> + QLIST_ENTRY(VFIODevice) hwpt_next; >>>>>>> } VFIODevice; >>>>>>> >>>>>>> struct VFIODeviceOps { >>>>>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h >>>>>>> index 57d502a1c79a..e917e7591d05 100644 >>>>>>> --- a/include/sysemu/iommufd.h >>>>>>> +++ b/include/sysemu/iommufd.h >>>>>>> @@ -50,6 +50,11 @@ int >>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t >> ioas_id, >>>>>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, >>>> uint32_t devid, >>>>>>> uint32_t *type, void *data, >>>>>>> uint32_t len, >>>>>>> uint64_t *caps, Error **errp); >>>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, >> uint32_t >>>> dev_id, >>>>>>> + uint32_t pt_id, uint32_t flags, >>>>>>> + uint32_t data_type, uint32_t data_len, >>>>>>> + void *data_ptr, uint32_t *out_hwpt, >>>>>>> + Error **errp); >>>>>>> >>>>>>> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD >>>> TYPE_HOST_IOMMU_DEVICE "-iommufd" >>>>>>> #endif >>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c >>>>>>> index 2b3d51af26d2..5d3dfa917415 100644 >>>>>>> --- a/backends/iommufd.c >>>>>>> +++ b/backends/iommufd.c >>>>>>> @@ -208,6 +208,36 @@ int >>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t >> ioas_id, >>>>>>> return ret; >>>>>>> } >>>>>>> >>>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, >> uint32_t >>>> dev_id, >>>>>>> + uint32_t pt_id, uint32_t flags, >>>>>>> + uint32_t data_type, uint32_t data_len, >>>>>>> + void *data_ptr, uint32_t *out_hwpt, >>>>>>> + Error **errp) >>>>>>> +{ >>>>>>> + int ret, fd = be->fd; >>>>>>> + struct iommu_hwpt_alloc alloc_hwpt = { >>>>>>> + .size = sizeof(struct iommu_hwpt_alloc), >>>>>>> + .flags = flags, >>>>>>> + .dev_id = dev_id, >>>>>>> + .pt_id = pt_id, >>>>>>> + .data_type = data_type, >>>>>>> + .data_len = data_len, >>>>>>> + .data_uptr = (uint64_t)data_ptr, >>>>>>> + }; >>>>>>> + >>>>>>> + ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt); >>>>>>> + trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, >>>> data_type, >>>>>>> + data_len, (uint64_t)data_ptr, >>>>>>> + alloc_hwpt.out_hwpt_id, ret); >>>>>>> + if (ret) { >>>>>>> + error_setg_errno(errp, errno, "Failed to allocate hwpt"); >>>>>>> + return false; >>>>>>> + } >>>>>>> + >>>>>>> + *out_hwpt = alloc_hwpt.out_hwpt_id; >>>>>>> + return true; >>>>>>> +} >>>>>>> + >>>>>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, >>>> uint32_t devid, >>>>>>> uint32_t *type, void *data, >>>>>>> uint32_t len, >>>>>>> uint64_t *caps, Error **errp) >>>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c >>>>>>> index 077dea8f1b64..325c7598d5a1 100644 >>>>>>> --- a/hw/vfio/iommufd.c >>>>>>> +++ b/hw/vfio/iommufd.c >>>>>>> @@ -212,10 +212,86 @@ static bool >>>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp) >>>>>>> return true; >>>>>>> } >>>>>>> >>>>>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev, >>>>>>> + VFIOIOMMUFDContainer >>>>>>> *container, >>>>>>> + Error **errp) >>>>>>> +{ >>>>>>> + IOMMUFDBackend *iommufd = vbasedev->iommufd; >>>>>>> + uint32_t flags = 0; >>>>>>> + VFIOIOASHwpt *hwpt; >>>>>>> + uint32_t hwpt_id; >>>>>>> + int ret; >>>>>>> + >>>>>>> + /* Try to find a domain */ >>>>>>> + QLIST_FOREACH(hwpt, &container->hwpt_list, next) { >>>>>>> + ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt- >>> hwpt_id, >>>> errp); >>>>>>> + if (ret) { >>>>>>> + /* -EINVAL means the domain is incompatible with the >>>>>>> device. >>>> */ >>>>>>> + if (ret == -EINVAL) { >>>>>>> + /* >>>>>>> + * It is an expected failure and it just means we will >>>>>>> try >>>>>>> + * another domain, or create one if no existing >>>>>>> compatible >>>>>>> + * domain is found. Hence why the error is discarded >>>>>>> below. >>>>>>> + */ >>>>>>> + error_free(*errp); >>>>>>> + *errp = NULL; >>>>>>> + continue; >>>>>>> + } >>>>>>> + >>>>>>> + return false; >>>>>>> + } else { >>>>>>> + vbasedev->hwpt = hwpt; >>>>>>> + QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, >> hwpt_next); >>>>>>> + return true; >>>>>>> + } >>>>>>> + } >>>>>>> + >>>>>>> + if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid, >>>>>>> + container->ioas_id, flags, >>>>>>> + IOMMU_HWPT_DATA_NONE, 0, NULL, >>>>>>> + &hwpt_id, errp)) { >>>>>>> + return false; >>>>>>> + } >>>>>>> + >>>>>>> + hwpt = g_malloc0(sizeof(*hwpt)); >>>>>>> + hwpt->hwpt_id = hwpt_id; >>>>>>> + QLIST_INIT(&hwpt->device_list); >>>>>>> + >>>>>>> + ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt- >>> hwpt_id, >>>> errp); >>>>>>> + if (ret) { >>>>>>> + iommufd_backend_free_id(container->be, hwpt->hwpt_id); >>>>>>> + g_free(hwpt); >>>>>>> + return false; >>>>>>> + } >>>>>>> + >>>>>>> + vbasedev->hwpt = hwpt; >>>>>>> + QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next); >>>>>>> + QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next); >>>>>>> + return true; >>>>>>> +} >>>>>>> + >>>>>>> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev, >>>>>>> + VFIOIOMMUFDContainer >>>>>>> *container) >>>>>>> +{ >>>>>>> + VFIOIOASHwpt *hwpt = vbasedev->hwpt; >>>>>>> + >>>>>>> + QLIST_REMOVE(vbasedev, hwpt_next); >>>>>> don't you want to reset vbasedev->hwpt = NULL too? >>>>>> >>>>> Yeap, Thanks for catching that >>>>> >>>>>> >>>>>>> + if (QLIST_EMPTY(&hwpt->device_list)) { >>>>>>> + QLIST_REMOVE(hwpt, next); >>>>>>> + iommufd_backend_free_id(container->be, hwpt->hwpt_id); >>>>>>> + g_free(hwpt); >>>>>>> + } >>>>>>> +} >>>>>>> + >>>>>>> static bool iommufd_cdev_attach_container(VFIODevice *vbasedev, >>>>>>> VFIOIOMMUFDContainer >>>>>>> *container, >>>>>>> Error **errp) >>>>>>> { >>>>>>> + /* mdevs aren't physical devices and will fail with auto domains >> */ >>>>>>> + if (!vbasedev->mdev) { >>>>>>> + return iommufd_cdev_autodomains_get(vbasedev, container, >>>> errp); >>>>>>> + } >>>>>>> + >>>>>>> return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container- >>>>> ioas_id, errp); >>>>>>> } >>>>>>> >>>>>>> @@ -224,6 +300,11 @@ static void >>>> iommufd_cdev_detach_container(VFIODevice *vbasedev, >>>>>>> { >>>>>>> Error *err = NULL; >>>>>>> >>>>>>> + if (vbasedev->hwpt) { >>>>>>> + iommufd_cdev_autodomains_put(vbasedev, container); >>>>>>> + return; >>>>>> Where do we detach the device from the hwpt? >>>>>> >>>>> In iommufd_backend_free_id() for auto domains >>>>> >>>> >>>> to clarify here I meant *userspace* auto domains >>>> >>>> *kernel* auto domains (mdev) goes via DETACH_IOMMUFD_PT >>> >>> If the device is still attached to the hwpt, will iommufd_backend_free_id() >> succeed? >>> Have you tried the hot unplug? >>> >> >> I have but I didn't see any errors. But I will check again for v5 as it could >> also be my oversight. >> >> I was thinking about Eric's remark overnight and I think what I am doing is >> not >> correct regardless of the above. >> >> I should be calling DETACH_IOMMUFD_PT pairing with >> ATTACH_IOMMUFD_PT, and the >> iommufd_backend_free_id() is to drop the final reference pairing with >> alloc_hwpt() when the device list is empty i.e. when there's no more devices >> in >> that vdev::hwpt. >> >> DETACH_IOMMUFD_PT decrement the hwpt refcount and it doesn't >> differentiate >> between auto domains vs manual domains. > > Yes, missing DETACH_IOMMUFD_PT so ref count isn't decreased. > My understanding is freeing hwpt will fails become device is still attached, > such as return -EBUSY, > But may be I understand wrong as you didn't see that failure. >
I recall exercising *hotunplug/hotplug* that working it out, but the error likely could go silent as it doesn't does fail higher levels. So part of the reason that although it seemed to work it might be my oversight in seeing that the errno was returned from free id operation.