Re: [Intel-gfx] [PATCH v15 00/26] Add vfio_device cdev for iommufd support

2023-07-26 Thread Jason Gunthorpe
On Tue, Jul 25, 2023 at 12:00:09PM -0600, Alex Williamson wrote:
> On Mon, 24 Jul 2023 13:09:22 -0600
> Alex Williamson  wrote:
> 
> > On Tue, 18 Jul 2023 13:57:46 -0300
> > Jason Gunthorpe  wrote:
> > 
> > > On Tue, Jul 18, 2023 at 06:55:25AM -0700, Yi Liu wrote:  
> > > > Existing VFIO provides group-centric user APIs for userspace. Userspace
> > > > opens the /dev/vfio/$group_id first before getting device fd and hence
> > > > getting access to device. This is not the desired model for iommufd. Per
> > > > the conclusion of community discussion[1], iommufd provides 
> > > > device-centric
> > > > kAPIs and requires its consumer (like VFIO) to be device-centric user
> > > > APIs. Such user APIs are used to associate device with iommufd and also
> > > > the I/O address spaces managed by the iommufd.
> > > > 
> > > > This series first introduces a per device file structure to be prepared
> > > > for further enhancement and refactors the kvm-vfio code to be prepared
> > > > for accepting device file from userspace. After this, adds a mechanism 
> > > > for
> > > > blocking device access before iommufd bind. Then refactors the vfio to 
> > > > be
> > > > able to handle cdev paths (e.g. iommufd binding, no-iommufd, [de]attach 
> > > > ioas).
> > > > This refactor includes making the device_open exclusive between the 
> > > > group
> > > > and the cdev path, only allow single device open in cdev path; 
> > > > vfio-iommufd
> > > > code is also refactored to support cdev. e.g. split the 
> > > > vfio_iommufd_bind()
> > > > into two steps. Eventually, adds the cdev support for vfio device and 
> > > > the
> > > > new ioctls, then makes group infrastructure optional as it is not needed
> > > > when vfio device cdev is compiled.
> > > > 
> > > > This series is based on some preparation works done to vfio emulated 
> > > > devices[2]
> > > > and vfio pci hot reset enhancements[3]. Per discussion[4], this series 
> > > > does not
> > > > support cdev for physical devices that do not have IOMMU. Such devices 
> > > > only
> > > > have group-centric user APIs.
> > > > 
> > > > This series is a prerequisite for iommu nesting for vfio device[5] [6].
> > > > 
> > > > The complete code can be found in below branch, simple tests done to the
> > > > legacy group path and the cdev path. QEMU changes are in upstreaming[7]
> > > > and the complete code can be found at[8]
> > > > 
> > > > https://github.com/yiliu1765/iommufd/tree/vfio_device_cdev_v15
> > > > (config CONFIG_IOMMUFD=y CONFIG_VFIO_DEVICE_CDEV=y)
> > > 
> > > Alex, if you are still good with this lets make this into a shared
> > > branch, do you want to do it or would you like a PR from me?  
> > 
> > Sorry, was out much of last week.  Yes, my intent would be to put this
> > both in a shared branch and my next branch for v6.6.  Given this is
> > mostly vfio, it seems like it'd make sense for me to provide that
> > branch but I may not get to it until tomorrow.  Thanks,
> 
> Both series are applied to my next branch for v6.6 and I've also
> published them to the v6.6/vfio/cdev branch[1].  Thanks for all the
> work and collaboration on this effort!

Great, I pulled it and merged the next series

Thanks,
Jason


Re: [Intel-gfx] [PATCH v15 00/26] Add vfio_device cdev for iommufd support

2023-07-18 Thread Jason Gunthorpe
On Tue, Jul 18, 2023 at 06:55:25AM -0700, Yi Liu wrote:
> Existing VFIO provides group-centric user APIs for userspace. Userspace
> opens the /dev/vfio/$group_id first before getting device fd and hence
> getting access to device. This is not the desired model for iommufd. Per
> the conclusion of community discussion[1], iommufd provides device-centric
> kAPIs and requires its consumer (like VFIO) to be device-centric user
> APIs. Such user APIs are used to associate device with iommufd and also
> the I/O address spaces managed by the iommufd.
> 
> This series first introduces a per device file structure to be prepared
> for further enhancement and refactors the kvm-vfio code to be prepared
> for accepting device file from userspace. After this, adds a mechanism for
> blocking device access before iommufd bind. Then refactors the vfio to be
> able to handle cdev paths (e.g. iommufd binding, no-iommufd, [de]attach ioas).
> This refactor includes making the device_open exclusive between the group
> and the cdev path, only allow single device open in cdev path; vfio-iommufd
> code is also refactored to support cdev. e.g. split the vfio_iommufd_bind()
> into two steps. Eventually, adds the cdev support for vfio device and the
> new ioctls, then makes group infrastructure optional as it is not needed
> when vfio device cdev is compiled.
> 
> This series is based on some preparation works done to vfio emulated 
> devices[2]
> and vfio pci hot reset enhancements[3]. Per discussion[4], this series does 
> not
> support cdev for physical devices that do not have IOMMU. Such devices only
> have group-centric user APIs.
> 
> This series is a prerequisite for iommu nesting for vfio device[5] [6].
> 
> The complete code can be found in below branch, simple tests done to the
> legacy group path and the cdev path. QEMU changes are in upstreaming[7]
> and the complete code can be found at[8]
> 
> https://github.com/yiliu1765/iommufd/tree/vfio_device_cdev_v15
> (config CONFIG_IOMMUFD=y CONFIG_VFIO_DEVICE_CDEV=y)

Alex, if you are still good with this lets make this into a shared
branch, do you want to do it or would you like a PR from me?

Thanks,
Jason


Re: [Intel-gfx] [PATCH v15 22/26] vfio: Add VFIO_DEVICE_BIND_IOMMUFD

2023-07-18 Thread Jason Gunthorpe
On Tue, Jul 18, 2023 at 06:55:47AM -0700, Yi Liu wrote:
> This adds ioctl for userspace to bind device cdev fd to iommufd.
> 
> VFIO_DEVICE_BIND_IOMMUFD: bind device to an iommufd, hence gain DMA
> control provided by the iommufd. open_device
> op is called after bind_iommufd op.
> 
> Tested-by: Nicolin Chen 
> Tested-by: Matthew Rosato 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Tested-by: Terrence Xu 
> Tested-by: Zhenzhong Duan 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/device_cdev.c | 107 +
>  drivers/vfio/vfio.h|  13 +
>  drivers/vfio/vfio_main.c   |   5 ++
>  include/linux/vfio.h   |   5 +-
>  include/uapi/linux/vfio.h  |  27 ++
>  5 files changed, 155 insertions(+), 2 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v15 20/26] iommufd: Add iommufd_ctx_from_fd()

2023-07-18 Thread Jason Gunthorpe
On Tue, Jul 18, 2023 at 06:55:45AM -0700, Yi Liu wrote:
> It's common to get a reference to the iommufd context from a given file
> descriptor. So adds an API for it. Existing users of this API are compiled
> only when IOMMUFD is enabled, so no need to have a stub for the IOMMUFD
> disabled case.
> 
> Signed-off-by: Yi Liu 
> ---
>  drivers/iommu/iommufd/main.c | 24 
>  include/linux/iommufd.h  |  1 +
>  2 files changed, 25 insertions(+)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v10 09/10] vfio/pci: Copy hot-reset device info to userspace in the devices loop

2023-07-18 Thread Jason Gunthorpe
On Tue, Jul 18, 2023 at 03:55:41AM -0700, Yi Liu wrote:
> This copies the vfio_pci_dependent_device to userspace during looping each
> affected device for reporting vfio_pci_hot_reset_info. This avoids counting
> the affected devices and allocating a potential large buffer to store the
> vfio_pci_dependent_device of all the affected devices before copying them
> to userspace.
> 
> Suggested-by: Jason Gunthorpe 
> Tested-by: Yanting Jiang 
> Tested-by: Zhenzhong Duan 
> Signed-off-by: Jason Gunthorpe 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 93 
>  1 file changed, 33 insertions(+), 60 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v13 21/22] vfio: Compile vfio_group infrastructure optionally

2023-07-17 Thread Jason Gunthorpe
On Mon, Jul 17, 2023 at 06:36:19AM +, Liu, Yi L wrote:
> Hi Alex, Jason,
> 
> I found a minor nit on the kconfig. The below configuration is valid.
> But user cannot use vfio directly as there is no /dev/vfio/vfio. Although
> user can open /dev/iommu instead. This is not good.
> 
> CONFIG_IOMMUFD=y
> CONFIG_VFIO_DEVICE_CDEv=n
> CONFIG_VFIO_GROUP=y
> CONFIG_VFIO_CONTAINER=n
> CONFIG_IOMMUFD_VFIO_CONTAINER=n
> 
> So need to have the below change. I'll incorporate this change in
> this series after your ack.

This is fine, we decided to allow this in the original vfio series. It
is usable in that you have to use iommufd natively with the group
interface. It won't be backwards compatible with current userspace.

Jason


Re: [Intel-gfx] [PATCH v14 22/26] vfio: Add VFIO_DEVICE_BIND_IOMMUFD

2023-07-15 Thread Jason Gunthorpe
On Sat, Jul 15, 2023 at 04:16:52AM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Friday, July 14, 2023 10:42 PM
> > 
> > On Mon, Jul 10, 2023 at 07:59:24PM -0700, Yi Liu wrote:
> > 
> > > +static inline long vfio_df_ioctl_bind_iommufd(struct vfio_device_file 
> > > *df,
> > > +   struct vfio_device_bind_iommufd 
> > > __user
> > *arg)
> > > +{
> > > + return -EOPNOTSUPP;
> > > +}
> > 
> > This should be -ENOTTY
> 
> Okay. Since there are quite a few stub functions in the drivers/vfio/vfio.h.
> Let me check the rule. All the stub functions should return -ENOTTY in
> the !IS_ENABLED(CONFIG_XXX) case, if the function returns int., is
> it?

No, just ioctl returns ENOTTY, so really just this function.

Jason


Re: [Intel-gfx] [PATCH v14 26/26] docs: vfio: Add vfio device cdev description

2023-07-14 Thread Jason Gunthorpe
On Mon, Jul 10, 2023 at 07:59:28PM -0700, Yi Liu wrote:
> This gives notes for userspace applications on device cdev usage.
> 
> Reviewed-by: Kevin Tian 
> Signed-off-by: Yi Liu 
> ---
>  Documentation/driver-api/vfio.rst | 139 ++
>  1 file changed, 139 insertions(+)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v14 24/26] vfio: Move the IOMMU_CAP_CACHE_COHERENCY check in __vfio_register_dev()

2023-07-14 Thread Jason Gunthorpe
On Mon, Jul 10, 2023 at 07:59:26PM -0700, Yi Liu wrote:
> The IOMMU_CAP_CACHE_COHERENCY check only applies to the physical devices
> that are IOMMU-backed. But it is now in the group code. If want to compile
> vfio_group infrastructure out, this check needs to be moved out of the group
> code.
> 
> Another reason for this change is to fail the device registration for the
> physical devices that do not have IOMMU if the group code is not compiled
> as the cdev interface does not support such devices.
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/group.c | 10 --
>  drivers/vfio/vfio_main.c | 11 +++
>  2 files changed, 11 insertions(+), 10 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v14 23/26] vfio: Add VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT

2023-07-14 Thread Jason Gunthorpe
On Mon, Jul 10, 2023 at 07:59:25PM -0700, Yi Liu wrote:
> This adds ioctl for userspace to attach device cdev fd to and detach
> from IOAS/hw_pagetable managed by iommufd.
> 
> VFIO_DEVICE_ATTACH_IOMMUFD_PT: attach vfio device to IOAS or hw_pagetable
>  managed by iommufd. Attach can be undo
>  by VFIO_DEVICE_DETACH_IOMMUFD_PT or device
>  fd close.
> VFIO_DEVICE_DETACH_IOMMUFD_PT: detach vfio device from the current 
> attached
>  IOAS or hw_pagetable managed by iommufd.
> 
> Tested-by: Nicolin Chen 
> Tested-by: Matthew Rosato 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/device_cdev.c | 58 ++
>  drivers/vfio/vfio.h|  5 
>  drivers/vfio/vfio_main.c   | 15 +-
>  include/uapi/linux/vfio.h  | 44 +
>  4 files changed, 121 insertions(+), 1 deletion(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v14 22/26] vfio: Add VFIO_DEVICE_BIND_IOMMUFD

2023-07-14 Thread Jason Gunthorpe
On Mon, Jul 10, 2023 at 07:59:24PM -0700, Yi Liu wrote:

> +static inline long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
> +   struct vfio_device_bind_iommufd 
> __user *arg)
> +{
> + return -EOPNOTSUPP;
> +}

This should be -ENOTTY

> @@ -1149,6 +1151,9 @@ static long vfio_device_fops_unl_ioctl(struct file 
> *filep,
>   void __user *uptr = (void __user *)arg;
>   int ret;
>  
> + if (cmd == VFIO_DEVICE_BIND_IOMMUFD)
> + return vfio_df_ioctl_bind_iommufd(df, uptr);
> +

And this function has a mistake too:

default:
if (unlikely(!device->ops->ioctl))
ret = -EINVAL;

Should also be -ENOTTY

All the implementations of the ops already return -ENOTTY

However, I think this is all slightly not quite right, the proper
return code is supposed to be ENOIOCTLCMD which vfs_ioctl() then
translates into ENOTTY for some reason..

It looks Ok otherwise

Jason


Re: [Intel-gfx] [PATCH v14 21/26] vfio: Avoid repeated user pointer cast in vfio_device_fops_unl_ioctl()

2023-07-14 Thread Jason Gunthorpe
On Mon, Jul 10, 2023 at 07:59:23PM -0700, Yi Liu wrote:
> This adds a local variable to store the user pointer cast result from arg.
> It avoids the repeated casts in the code when more ioctls are added.
> 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/vfio_main.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v14 20/26] iommufd: Add iommufd_ctx_from_fd()

2023-07-14 Thread Jason Gunthorpe
On Mon, Jul 10, 2023 at 07:59:22PM -0700, Yi Liu wrote:
> It's common to get a reference to the iommufd context from a given file
> descriptor. So adds an API for it. Existing users of this API are compiled
> only when IOMMUFD is enabled, so no need to have a stub for the IOMMUFD
> disabled case.
> 
> Signed-off-by: Yi Liu 
> ---
>  drivers/iommu/iommufd/main.c | 23 +++
>  include/linux/iommufd.h  |  1 +
>  2 files changed, 24 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 32ce7befc8dd..e99a338d4fdf 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -377,6 +377,29 @@ struct iommufd_ctx *iommufd_ctx_from_file(struct file 
> *file)
>  }
>  EXPORT_SYMBOL_NS_GPL(iommufd_ctx_from_file, IOMMUFD);
>  
> +/**
> + * iommufd_ctx_from_fd - Acquires a reference to the iommufd context
> + * @fd: File descriptor to obtain the reference from
> + *
> + * Returns a pointer to the iommufd_ctx, otherwise ERR_PTR. On success
> + * the caller is responsible to call iommufd_ctx_put().
> + */
> +struct iommufd_ctx *iommufd_ctx_from_fd(int fd)
> +{
> + struct iommufd_ctx *iommufd;
> + struct fd f;
> +
> + f = fdget(fd);
> + if (!f.file)
> + return ERR_PTR(-EBADF);
> +
> + iommufd = iommufd_ctx_from_file(f.file);
> +
> + fdput(f);
> + return iommufd;
> +}
> +EXPORT_SYMBOL_NS_GPL(iommufd_ctx_from_fd, IOMMUFD);

This is a little wonky since iommufd_ctx_from_file() also obtains a
reference

Just needs to be like this:

struct iommufd_ctx *iommufd_ctx_from_fd(int fd)
{
struct file *file;

file = fget(fd);
if (!file)
return ERR_PTR(-EBADF);

if (file->f_op != _fops) {
fput(file);
return ERR_PTR(-EBADFD);
}
/* fget is the same as iommufd_ctx_get() */
return file->private_data;
}
EXPORT_SYMBOL_NS_GPL(iommufd_ctx_from_fd, IOMMUFD);

?

Jason


Re: [Intel-gfx] [PATCH v14 19/26] vfio: Test kvm pointer in _vfio_device_get_kvm_safe()

2023-07-14 Thread Jason Gunthorpe
On Mon, Jul 10, 2023 at 07:59:21PM -0700, Yi Liu wrote:
> This saves some lines when adding the kvm get logic for the vfio_device
> cdev path.
> 
> This also renames _vfio_device_get_kvm_safe() to be 
> vfio_device_get_kvm_safe().
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/group.c | 7 +--
>  drivers/vfio/vfio.h  | 6 +++---
>  drivers/vfio/vfio_main.c | 5 -
>  3 files changed, 8 insertions(+), 10 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v14 17/26] vfio: Move device_del() before waiting for the last vfio_device registration refcount

2023-07-14 Thread Jason Gunthorpe
On Mon, Jul 10, 2023 at 07:59:19PM -0700, Yi Liu wrote:
> device_del() destroys the vfio-dev/vfioX under the sysfs for vfio_device.
> There is no reason to keep it while the device is going to be unregistered.
> 
> This movement is also a preparation for adding vfio_device cdev. Kernel
> should remove the cdev node of the vfio_device to avoid new registration
> refcount increment while the device is going to be unregistered.
> 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/vfio_main.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v14 14/26] iommufd/device: Add iommufd_access_detach() API

2023-07-14 Thread Jason Gunthorpe
On Mon, Jul 10, 2023 at 07:59:16PM -0700, Yi Liu wrote:
> From: Nicolin Chen 
> 
> Previously, the detach routine is only done by the destroy(). And it was
> called by vfio_iommufd_emulated_unbind() when the device runs close(), so
> all the mappings in iopt were cleaned in that setup, when the call trace
> reaches this detach() routine.
> 
> Now, there's a need of a detach uAPI, meaning that it does not only need
> a new iommufd_access_detach() API, but also requires access->ops->unmap()
> call as a cleanup. So add one.
> 
> However, leaving that unprotected can introduce some potential of a race
> condition during the pin_/unpin_pages() call, where access->ioas->iopt is
> getting referenced. So, add an ioas_lock to protect the context of iopt
> referencings.
> 
> Also, to allow the iommufd_access_unpin_pages() callback to happen via
> this unmap() call, add an ioas_unpin pointer, so the unpin routine won't
> be affected by the "access->ioas = NULL" trick.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Tested-by: Nicolin Chen 
> Tested-by: Matthew Rosato 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Signed-off-by: Nicolin Chen 
> Signed-off-by: Yi Liu 
> ---
>  drivers/iommu/iommufd/device.c  | 74 +++--
>  drivers/iommu/iommufd/iommufd_private.h |  2 +
>  include/linux/iommufd.h |  1 +
>  3 files changed, 72 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v9 09/10] vfio/pci: Copy hot-reset device info to userspace in the devices loop

2023-07-14 Thread Jason Gunthorpe
On Mon, Jul 10, 2023 at 07:31:25PM -0700, Yi Liu wrote:

> @@ -1311,29 +1296,17 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
>   ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
>   , slot);
>   mutex_unlock(>vdev.dev_set->lock);
> + if (ret)
> + return ret;
>  
> - /*
> -  * If a device was removed between counting and filling, we may come up
> -  * short of fill.max.  If a device was added, we'll have a return of
> -  * -EAGAIN above.
> -  */
> - if (!ret) {
> - hdr.count = fill.cur;
> - hdr.flags = fill.flags;
> - }
> -
> -reset_info_exit:
> + hdr.count = fill.count;
> + hdr.flags = fill.flags;
>   if (copy_to_user(arg, , minsz))
> - ret = -EFAULT;
> -
> - if (!ret) {
> - if (copy_to_user(>devices, devices,
> -  hdr.count * sizeof(*devices)))
> - ret = -EFAULT;
> - }
> + return -EFAULT;
>  
> - kfree(devices);
> - return ret;
> + if (fill.count != fill.devices - arg->devices)
> + return -ENOSPC;

This should be > right? The previous code returned ENOSPC only if
their were more devices than requested, not less.

Jason


Re: [Intel-gfx] [PATCH v9 06/10] vfio: Mark cdev usage in vfio_device

2023-07-12 Thread Jason Gunthorpe
On Mon, Jul 10, 2023 at 07:31:22PM -0700, Yi Liu wrote:
> This can be used to differentiate whether to report group_id or devid in
> the revised VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl. At this moment, no
> cdev path yet, so the vfio_device_cdev_opened() helper always returns false.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  include/linux/vfio.h | 5 +
>  1 file changed, 5 insertions(+)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v12 18/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD

2023-06-28 Thread Jason Gunthorpe
On Mon, Jun 26, 2023 at 02:51:29PM +, Liu, Yi L wrote:
> > > >
> > > > Ah, any suggestion on the naming? How about 
> > > > vfio_device_get_kvm_safe_locked()?
> > >
> > > I thought you were using _df_ now for these functions?
> > >
> > 
> > I see. Your point is passing df to _vfio_device_get_kvm_safe(() and
> > test the df->kvm within it.  Hence rename it to be _df_. I think group
> > path should be ok with this change as well. Let me make it.
> 
> To pass vfio_device_file to _vfio_device_get_kvm_safe(), needs to make
> the below changes to the group path. If just wants to test null kvm in the
> _vfio_device_get_kvm_safe(), maybe simpler to keep the current parameters
> and just move the null kvm test within this function. Is it?

This does seem a bit nicer, yes

> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 8a9ebcc6980b..4e6ea2943d28 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -373,14 +373,22 @@ void vfio_unregister_group_dev(struct vfio_device 
> *device)
>  EXPORT_SYMBOL_GPL(vfio_unregister_group_dev);
>  
>  #ifdef CONFIG_HAVE_KVM
> -void _vfio_device_get_kvm_safe(struct vfio_device *device, struct kvm *kvm)
> +void _vfio_df_get_kvm_safe(struct vfio_device_file *df)

But still avoid the leading _ here

Jason


Re: [Intel-gfx] [PATCH v13 22/22] docs: vfio: Add vfio device cdev description

2023-06-28 Thread Jason Gunthorpe
On Wed, Jun 28, 2023 at 01:10:11AM +, Liu, Yi L wrote:
> > From: Alex Williamson 
> > Sent: Wednesday, June 28, 2023 1:35 AM
> 
> [The Cc list gets broken in the reply from Alex to Jason, here I reply to
> Alex's email with the Cc list fixed. @Alex, seems like the same symptom
> with last time, do you have any idea on it?]

It is weird...

> > Are you suggesting a post-merge-window pull request for v6.5 (also
> > frowned on) or are you suggesting that it simmers in both our next
> > branches until v6.6?  Thanks,
> 
> It appears to me the latter one. When 6.5-rc1 is released, we immediately
> apply the hot-reset and cdev series onto it and put it in a shared tree to
> assist the other iommufd feature development (e.g. nesting). Jason, is it?

Yes, no reason to try to bend the rules with Linus at this point

Jason


Re: [Intel-gfx] [PATCH v13 22/22] docs: vfio: Add vfio device cdev description

2023-06-28 Thread Jason Gunthorpe
On Wed, Jun 28, 2023 at 12:56:40AM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, June 28, 2023 12:12 AM
> > 
> > On Tue, Jun 27, 2023 at 08:54:33AM +, Liu, Yi L wrote:
> > > > From: Alex Williamson 
> > > > Sent: Thursday, June 22, 2023 5:54 AM
> > > >
> > > > On Fri, 16 Jun 2023 02:39:46 -0700
> > > > Yi Liu  wrote:
> > >
> > > > > +VFIO device cdev doesn't rely on VFIO group/container/iommu drivers.
> > > > > +Hence those modules can be fully compiled out in an environment
> > > > > +where no legacy VFIO application exists.
> > > > > +
> > > > > +So far SPAPR does not support IOMMUFD yet.  So it cannot support 
> > > > > device
> > > > > +cdev either.
> > > >
> > > > Why isn´t this enforced via Kconfig?  At the vfio level we could simply
> > > > add the following in patch 17/:
> > > >
> > > > config VFIO_DEVICE_CDEV
> > > > bool "Support for the VFIO cdev /dev/vfio/devices/vfioX"
> > > > depends on IOMMUFD && !SPAPR_TCE_IOMMU
> > > >^^^
> > > >
> 
> Proposal A.
> 
> > > > Or if Jason wants, IOMMUFD could depend on !SPAPR_TCE_IOMMU for now and
> > > > the existing Kconfig options would exclude it.  If we know it doesn't
> > > > work, let's not put the burden on the user to figure that out.  A
> > > > follow-up patch for this would be fine if there's no other reason to
> > > > respin the series.
> 
> Proposal B.
> 
> > >
> > > @Jason,
> > > How about your opinion? Seems reasonable to make IOMMUFD
> > > depend on !SPAPR_TCE_IOMMU. Is it?
> > 
> > The right kconfig would be to list all the iommu drivers that can
> > support iommufd and allow it to be selected if any of them are
> > enabled.
> > 
> > This seems too complex to bother with, so I like Alex's version above..
> 
> Sorry, I'm not quite clear. Alex has two proposals above (A and B). Which
> one do you mean? It looks like you prefer A. is it? :-)

A

Jason


Re: [Intel-gfx] [PATCH v13 22/22] docs: vfio: Add vfio device cdev description

2023-06-27 Thread Jason Gunthorpe
On Tue, Jun 27, 2023 at 08:54:33AM +, Liu, Yi L wrote:
> > From: Alex Williamson 
> > Sent: Thursday, June 22, 2023 5:54 AM
> > 
> > On Fri, 16 Jun 2023 02:39:46 -0700
> > Yi Liu  wrote:
> 
> > > +VFIO device cdev doesn't rely on VFIO group/container/iommu drivers.
> > > +Hence those modules can be fully compiled out in an environment
> > > +where no legacy VFIO application exists.
> > > +
> > > +So far SPAPR does not support IOMMUFD yet.  So it cannot support device
> > > +cdev either.
> > 
> > Why isn´t this enforced via Kconfig?  At the vfio level we could simply
> > add the following in patch 17/:
> > 
> > config VFIO_DEVICE_CDEV
> > bool "Support for the VFIO cdev /dev/vfio/devices/vfioX"
> > depends on IOMMUFD && !SPAPR_TCE_IOMMU
> >^^^
> > 
> > Or if Jason wants, IOMMUFD could depend on !SPAPR_TCE_IOMMU for now and
> > the existing Kconfig options would exclude it.  If we know it doesn't
> > work, let's not put the burden on the user to figure that out.  A
> > follow-up patch for this would be fine if there's no other reason to
> > respin the series.
> 
> @Jason,
> How about your opinion? Seems reasonable to make IOMMUFD
> depend on !SPAPR_TCE_IOMMU. Is it?

The right kconfig would be to list all the iommu drivers that can
support iommufd and allow it to be selected if any of them are
enabled.

This seems too complex to bother with, so I like Alex's version above..

> > Otherwise the series is looking pretty good to me.  It still requires
> > some reviews/acks in the iommufd space and it would be good to see more
> > reviews for the remainder given the amount of collaboration here.
> > 
> > I'm out for the rest of the week, but I'll leave open accepting this
> > and the hot-reset series next week for the merge window.  Thanks,
> 
> @Alex,
> Given Jason's remarks on cdev v12, I've already got a new version as below.
> I can post it once the above kconfig open is closed.

I think we don't need to bend the rules, Linus would not be happy to
see 30 major patches that never hit linux-next at all.

I'm happy if we put it on a branch at RC1 and merge it to the vfio &
iommufd trees, it is functionally the same outcome in the same time
frame.

Jason


Re: [Intel-gfx] [PATCH v12 18/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD

2023-06-26 Thread Jason Gunthorpe
On Mon, Jun 26, 2023 at 08:34:26AM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Saturday, June 24, 2023 12:15 AM
> 
> > >  }
> > >
> > > +static void vfio_device_get_kvm_safe(struct vfio_device_file *df)
> > > +{
> > > + spin_lock(>kvm_ref_lock);
> > > + if (df->kvm)
> > > + _vfio_device_get_kvm_safe(df->device, df->kvm);
> > > + spin_unlock(>kvm_ref_lock);
> > > +}
> > 
> > I'm surprised symbol_get() can be called from a spinlock, but it sure
> > looks like it can..
> > 
> > Also moving the if kvm is null test into _vfio_device_get_kvm_safe()
> > will save a few lines.
> > 
> > Also shouldn't be called _vfio_device...
> 
> Ah, any suggestion on the naming? How about vfio_device_get_kvm_safe_locked()?

I thought you were using _df_ now for these functions?

Jason


Re: [Intel-gfx] [PATCH v13 00/22] Add vfio_device cdev for iommufd support

2023-06-23 Thread Jason Gunthorpe
On Fri, Jun 16, 2023 at 02:39:24AM -0700, Yi Liu wrote:
> Existing VFIO provides group-centric user APIs for userspace. Userspace
> opens the /dev/vfio/$group_id first before getting device fd and hence
> getting access to device. This is not the desired model for iommufd. Per
> the conclusion of community discussion[1], iommufd provides device-centric
> kAPIs and requires its consumer (like VFIO) to be device-centric user
> APIs. Such user APIs are used to associate device with iommufd and also
> the I/O address spaces managed by the iommufd.
> 
> This series first introduces a per device file structure to be prepared
> for further enhancement and refactors the kvm-vfio code to be prepared
> for accepting device file from userspace. After this, adds a mechanism for
> blocking device access before iommufd bind. Then refactors the vfio to be
> able to handle cdev path (e.g. iommufd binding, no-iommufd, [de]attach ioas).
> This refactor includes making the device_open exclusive between the group
> and the cdev path, only allow single device open in cdev path; vfio-iommufd
> code is also refactored to support cdev. e.g. split the vfio_iommufd_bind()
> into two steps. Eventually, adds the cdev support for vfio device and the
> new ioctls, then makes group infrastructure optional as it is not needed
> when vfio device cdev is compiled.
> 
> This series is based on some preparation works done to vfio emulated 
> devices[2]
> and vfio pci hot reset enhancements[3].
> 
> This series is a prerequisite for iommu nesting for vfio device[4] [5].
> 
> The complete code can be found in below branch, simple tests done to the
> legacy group path and the cdev path. Draft QEMU branch can be found at[6]
> However, the noiommu mode test is only done with some hacks in kernel and
> qemu to check if qemu can boot with noiommu devices.
> 
> https://github.com/yiliu1765/iommufd/tree/vfio_device_cdev_v13
> (config CONFIG_IOMMUFD=y CONFIG_VFIO_DEVICE_CDEV=y)
> 
> base-commit: dcc9d48709e6bc6ec3da97626b8768582e138326
> 
> [1] 
> https://lore.kernel.org/kvm/bn9pr11mb5433b1e4ae5b0480369f97178c...@bn9pr11mb5433.namprd11.prod.outlook.com/
> [2] https://lore.kernel.org/kvm/20230327093351.44505-1-yi.l@intel.com/ - 
> merged
> [3] https://lore.kernel.org/kvm/20230616093042.65094-1-yi.l@intel.com/
> [4] 
> https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l@intel.com/
> [5] 
> https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l@intel.com/#t
> [6] https://github.com/yiliu1765/qemu/tree/iommufd_rfcv4.mig.reset.v4_var3
> 
> Change log:
> 
> v13:
>  - vfio_device_first_open() and vfio_device_last_close() to be 
> vfio_df_device_first_open()
>vfio_df_device_last_close() (Alex)
>  - Define struct vfio_device_file::access_granted as u8 and also place the 
> u32 devid to
>be behind this flag as this structure access is hot, so needs to avoid too 
> much hole
>in the structure (Alex)
>  - Use u8 instead bool in the struct vfio_device for the flags (Alex)
>  - Define BIND, ATTACH, DETACH ioctl behind VFIO_DEVICE_FEATURE whose offset 
> is 17 (Alex)
>  - Drop patch 20, 21, 22 of v12 (Alex)
>  - Per the patch drop, still needs to detect the physical devices that do not 
> have
>IOMMU in the cdev registration as cdev does not support such devices. Per 
> the
>suggestion from Jason, lift the IOMMU_CAP_CACHE_COHERENCY check to be in 
> vfio_main.c
>so that it can fail the registration of such devices if only cdev is 
> compiled. (Jason, Alex)
>  - Refine the vfio.rst doc, highlight that the cdev device access is stil 
> bound with
>iommu group. (Alex)
>  - Reaffirm t-b from below folks:
>Nicolin Chen - Test nesting branch which is based on cdev v12, the test is 
> done on ARM64 (SMMUv3)
>Matthew Rosato - vfio-pci, vfio-ap, vfio-ccw under container, compat and 
> cdev mode, and nesting
> test on SMMUv3 and Intel.
>Yanting Jiang - regression tests with NIC passthrough on Intel platform

I accendiently put my remarks on v12, but they all apply here, and I
don't have any new remarks for this version.

Thanks,
Jason


Re: [Intel-gfx] [PATCH v12 23/24] vfio: Compile vfio_group infrastructure optionally

2023-06-23 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:16:52AM -0700, Yi Liu wrote:
> vfio_group is not needed for vfio device cdev, so with vfio device cdev
> introduced, the vfio_group infrastructures can be compiled out if only
> cdev is needed.
> 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  drivers/iommu/iommufd/Kconfig |  4 +-
>  drivers/vfio/Kconfig  | 15 +++
>  drivers/vfio/Makefile |  2 +-
>  drivers/vfio/vfio.h   | 84 ---
>  include/linux/vfio.h  | 25 +--
>  5 files changed, 118 insertions(+), 12 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v12 19/24] vfio: Add VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT

2023-06-23 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:16:48AM -0700, Yi Liu wrote:
> This adds ioctl for userspace to attach device cdev fd to and detach
> from IOAS/hw_pagetable managed by iommufd.
> 
> VFIO_DEVICE_ATTACH_IOMMUFD_PT: attach vfio device to IOAS, hw_pagetable
>  managed by iommufd. Attach can be
>  undo by VFIO_DEVICE_DETACH_IOMMUFD_PT
>  or device fd close.
> VFIO_DEVICE_DETACH_IOMMUFD_PT: detach vfio device from the current 
> attached
>  IOAS or hw_pagetable managed by iommufd.
> 
> noiommu devices do not support [AT|DE]TACH, if user invokes the two ioctls
> on such devices, shall fail.

Stale comment

> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/device_cdev.c | 66 ++
>  drivers/vfio/vfio.h| 16 +
>  drivers/vfio/vfio_main.c   |  8 +
>  include/uapi/linux/vfio.h  | 42 
>  4 files changed, 132 insertions(+)
> 
> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
> index a4498ddbe774..6e1d499ee160 100644
> --- a/drivers/vfio/device_cdev.c
> +++ b/drivers/vfio/device_cdev.c
> @@ -167,6 +167,72 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file 
> *df,
>   return ret;
>  }
>  
> +int vfio_df_ioctl_attach_pt(struct vfio_device_file *df,
> + struct vfio_device_attach_iommufd_pt __user *arg)
> +{
> + struct vfio_device *device = df->device;
> + struct vfio_device_attach_iommufd_pt attach;
> + unsigned long minsz;
> + int ret;
> +
> + minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id);
> +
> + if (copy_from_user(, arg, minsz))
> + return -EFAULT;
> +
> + if (attach.argsz < minsz || attach.flags)
> + return -EINVAL;
> +
> + /* ATTACH only allowed for cdev fds */
> + if (df->group)
> + return -EINVAL;

I feel like vfio_device_fops_unl_ioctl() should do these group tests
for the whole lot

@@ -1187,19 +1187,24 @@ static long vfio_device_fops_unl_ioctl(struct file 
*filep,
if (ret)
return ret;
 
+   /* cdev only ioctls */
+   if (IS_ENABLED(CONFIG_VFIO_DEVICE_CDEV) && !df->group) {
+   switch (cmd) {
+   case VFIO_DEVICE_ATTACH_IOMMUFD_PT:
+   ret = vfio_df_ioctl_attach_pt(df, (void __user *)arg);
+   goto out;
+
+   case VFIO_DEVICE_DETACH_IOMMUFD_PT:
+   ret = vfio_df_ioctl_detach_pt(df, (void __user *)arg);
+   goto out;
+   }
+   }
+
switch (cmd) {
case VFIO_DEVICE_FEATURE:
ret = vfio_ioctl_device_feature(device, (void __user *)arg);
break;
 
-   case VFIO_DEVICE_ATTACH_IOMMUFD_PT:
-   ret = vfio_df_ioctl_attach_pt(df, (void __user *)arg);
-   break;
-
-   case VFIO_DEVICE_DETACH_IOMMUFD_PT:
-   ret = vfio_df_ioctl_detach_pt(df, (void __user *)arg);
-   break;
-
default:
if (unlikely(!device->ops->ioctl))
ret = -EINVAL;

And also make a local var for void __user * to avoid the repeated
casts.

Also this construction avoids the stub static inlines since the
IS_ENABLED will compile out the call. Just use a normal function
prototype outside any ifdef.

> +
> + mutex_lock(>dev_set->lock);
> + ret = device->ops->attach_ioas(device, _id);
> + if (ret)
> + goto out_unlock;
> +
> + ret = copy_to_user(>pt_id, _id,
> +sizeof(attach.pt_id)) ? -EFAULT : 0;
> + if (ret)
> + goto out_detach;

Don't use the ?:

if (copy_to_user()..) {
ret = -EFAULT;
goto out_detach;
}

Jason


Re: [Intel-gfx] [PATCH v12 18/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD

2023-06-23 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:16:47AM -0700, Yi Liu wrote:
> This adds ioctl for userspace to bind device cdev fd to iommufd.
> 
> VFIO_DEVICE_BIND_IOMMUFD: bind device to an iommufd, hence gain DMA
> control provided by the iommufd. open_device
> op is called after bind_iommufd op.
> 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/device_cdev.c | 123 +
>  drivers/vfio/vfio.h|  13 
>  drivers/vfio/vfio_main.c   |   5 ++
>  include/linux/vfio.h   |   3 +-
>  include/uapi/linux/vfio.h  |  27 
>  5 files changed, 170 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
> index 1c640016a824..a4498ddbe774 100644
> --- a/drivers/vfio/device_cdev.c
> +++ b/drivers/vfio/device_cdev.c
> @@ -3,6 +3,7 @@
>   * Copyright (c) 2023 Intel Corporation.
>   */
>  #include 
> +#include 
>  
>  #include "vfio.h"
>  
> @@ -44,6 +45,128 @@ int vfio_device_fops_cdev_open(struct inode *inode, 
> struct file *filep)
>   return ret;
>  }
>  
> +static void vfio_device_get_kvm_safe(struct vfio_device_file *df)
> +{
> + spin_lock(>kvm_ref_lock);
> + if (df->kvm)
> + _vfio_device_get_kvm_safe(df->device, df->kvm);
> + spin_unlock(>kvm_ref_lock);
> +}

I'm surprised symbol_get() can be called from a spinlock, but it sure
looks like it can..

Also moving the if kvm is null test into _vfio_device_get_kvm_safe()
will save a few lines.

Also shouldn't be called _vfio_device...

> +void vfio_df_cdev_close(struct vfio_device_file *df)
> +{
> + struct vfio_device *device = df->device;
> +
> + /*
> +  * In the time of close, there is no contention with another one
> +  * changing this flag.  So read df->access_granted without lock
> +  * and no smp_load_acquire() is ok.
> +  */
> + if (!df->access_granted)
> + return;
> +
> + mutex_lock(>dev_set->lock);
> + vfio_df_close(df);
> + vfio_device_put_kvm(device);
> + iommufd_ctx_put(df->iommufd);
> + device->cdev_opened = false;
> + mutex_unlock(>dev_set->lock);
> + vfio_device_unblock_group(device);
> +}

Lets call this vfio_df_unbind_iommufd() and put it near
vfio_df_ioctl_bind_iommufd()

> static struct iommufd_ctx *vfio_get_iommufd_from_fd(int fd)

This can probably be an iommufd function:
  iommufd_ctx_from_fd(int fd)

> +long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
> + struct vfio_device_bind_iommufd __user *arg)
> +{
> + ret = copy_to_user(>out_devid, >devid,
> +sizeof(df->devid)) ? -EFAULT : 0;
> + if (ret)
> + goto out_close_device;
> +
> + /*
> +  * Paired with smp_load_acquire() in vfio_device_fops::ioctl/
> +  * read/write/mmap
> +  */
> + smp_store_release(>access_granted, true);
> + device->cdev_opened = true;

Move the cdev_opened up above the release just for consistency.

Jason


Re: [Intel-gfx] [PATCH v12 17/24] vfio: Add cdev for vfio_device

2023-06-23 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:16:46AM -0700, Yi Liu wrote:
> This allows user to directly open a vfio device w/o using the legacy
> container/group interface, as a prerequisite for supporting new iommu
> features like nested translation.
> 
> The device fd opened in this manner doesn't have the capability to access
> the device as the fops open() doesn't open the device until the successful
> BIND_IOMMUFD which be added in next patch.
> 
> With this patch, devices registered to vfio core have both group and device
> interface created.
> 
> - group interface : /dev/vfio/$groupID
> - device interface: /dev/vfio/devices/vfioX - normal device
>   ("X" is the minor number and is unique across devices)
> 
> Given a vfio device the user can identify the matching vfioX by checking
> the sysfs path of the device. Take PCI device (:6a:01.0) for example,
> /sys/bus/pci/devices/\:6a\:01.0/vfio-dev/vfio0/dev contains the
> major:minor of the matching vfioX.
> 
> Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> that the major:minor matches.
> 
> The vfio_device cdev logic in this patch:
> *) __vfio_register_dev() path ends up doing cdev_device_add() for each
>vfio_device if VFIO_DEVICE_CDEV configured.
> *) vfio_unregister_group_dev() path does cdev_device_del();
> 
> device interface does not support noiommu devices, noiommu users should
> use the legacy group interface.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Tested-by: Nicolin Chen 
> Tested-by: Matthew Rosato 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/Kconfig   | 12 
>  drivers/vfio/Makefile  |  1 +
>  drivers/vfio/device_cdev.c | 62 ++
>  drivers/vfio/vfio.h| 54 +
>  drivers/vfio/vfio_main.c   | 23 +++---
>  include/linux/vfio.h   |  4 +++
>  6 files changed, 151 insertions(+), 5 deletions(-)
>  create mode 100644 drivers/vfio/device_cdev.c

Reviewed-by: Jason Gunthorpe 

> +/*
> + * device access via the fd opened by this function is blocked until
> + * .open_device() is called successfully during BIND_IOMMUFD.
> + */
> +int vfio_device_fops_cdev_open(struct inode *inode, struct file *filep)
> +{
> + struct vfio_device *device = container_of(inode->i_cdev,
> +   struct vfio_device, cdev);
> + struct vfio_device_file *df;
> + int ret;
> +

Add the comment

 /* Paired with the put in vfio_device_fops_release() */
> + if (!vfio_device_try_get_registration(device))
> + return -ENODEV;


> @@ -338,6 +338,12 @@ void vfio_unregister_group_dev(struct vfio_device 
> *device)
>*/
>   vfio_device_group_unregister(device);
>  
> + /*
> +  * Balances vfio_device_add() in register path, also prevents
> +  * new device opened by userspace in the cdev path.
> +  */
> + vfio_device_del(device);
> +
>   vfio_device_put_registration(device);
>   rc = try_wait_for_completion(>comp);
>   while (rc <= 0) {
> @@ -361,9 +367,6 @@ void vfio_unregister_group_dev(struct vfio_device *device)
>   }
>   }
>  
> - /* Balances device_add in register path */
> - device_del(>device);
> -

This looks OK from what I can tell, but it might deserve its own patch
like was done for other movement.

Jason


Re: [Intel-gfx] [PATCH v12 16/24] vfio: Move vfio_device_group_unregister() to be the first operation in unregister

2023-06-23 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:16:45AM -0700, Yi Liu wrote:
> This avoids endless vfio_device refcount increasement by userspace,
> which would keep blocking the vfio_unregister_group_dev().
> 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/vfio_main.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)

Reviewed-by: Jason Gunthorpe 

It looks OK, at least I couldn't find a reason why the group list
would need to continue to be valid while we are waiting for the
registration lock to release.

Jason


Re: [Intel-gfx] [PATCH v12 15/24] vfio-iommufd: Add detach_ioas support for emulated VFIO devices

2023-06-23 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:16:44AM -0700, Yi Liu wrote:
> This prepares for adding DETACH ioctl for emulated VFIO devices.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Tested-by: Nicolin Chen 
> Tested-by: Matthew Rosato 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Signed-off-by: Yi Liu 
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c  |  1 +
>  drivers/s390/cio/vfio_ccw_ops.c   |  1 +
>  drivers/s390/crypto/vfio_ap_ops.c |  1 +
>  drivers/vfio/iommufd.c| 13 +
>  include/linux/vfio.h  |  3 +++
>  samples/vfio-mdev/mbochs.c|  1 +
>  samples/vfio-mdev/mdpy.c  |  1 +
>  samples/vfio-mdev/mtty.c  |  1 +
>  8 files changed, 22 insertions(+)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v12 14/24] iommufd/device: Add iommufd_access_detach() API

2023-06-23 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:16:43AM -0700, Yi Liu wrote:
> From: Nicolin Chen 
> 
> Previously, the detach routine is only done by the destroy(). And it was
> called by vfio_iommufd_emulated_unbind() when the device runs close(), so
> all the mappings in iopt were cleaned in that setup, when the call trace
> reaches this detach() routine.
> 
> Now, there's a need of a detach uAPI, meaning that it does not only need
> a new iommufd_access_detach() API, but also requires access->ops->unmap()
> call as a cleanup. So add one.
> 
> However, leaving that unprotected can introduce some potential of a race
> condition during the pin_/unpin_pages() call, where access->ioas->iopt is
> getting referenced. So, add an ioas_lock to protect the context of iopt
> referencings.
> 
> Also, to allow the iommufd_access_unpin_pages() callback to happen via
> this unmap() call, add an ioas_unpin pointer, so the unpin routine won't
> be affected by the "access->ioas = NULL" trick.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Signed-off-by: Nicolin Chen 
> Signed-off-by: Yi Liu 
> ---
>  drivers/iommu/iommufd/device.c  | 76 +++--
>  drivers/iommu/iommufd/iommufd_private.h |  2 +
>  include/linux/iommufd.h |  1 +
>  3 files changed, 74 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 96d4281bfa7c..6b4ff635c15e 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -486,6 +486,7 @@ iommufd_access_create(struct iommufd_ctx *ictx,
>   iommufd_ctx_get(ictx);
>   iommufd_object_finalize(ictx, >obj);
>   *id = access->obj.id;
> + mutex_init(>ioas_lock);
>   return access;
>  }
>  EXPORT_SYMBOL_NS_GPL(iommufd_access_create, IOMMUFD);
> @@ -505,26 +506,66 @@ void iommufd_access_destroy(struct iommufd_access 
> *access)
>  }
>  EXPORT_SYMBOL_NS_GPL(iommufd_access_destroy, IOMMUFD);
>  
> +static void __iommufd_access_detach(struct iommufd_access *access)
> +{
> + struct iommufd_ioas *cur_ioas = access->ioas;
> +
> + lockdep_assert_held(>ioas_lock);
> + /*
> +  * Set ioas to NULL to block any further iommufd_access_pin_pages().
> +  * iommufd_access_unpin_pages() can continue using access->ioas_unpin.
> +  */
> + access->ioas = NULL;
> +
> + if (access->ops->unmap) {
> + mutex_unlock(>ioas_lock);
> + access->ops->unmap(access->data, 0, ULONG_MAX);
> + mutex_lock(>ioas_lock);
> + }
> + iopt_remove_access(_ioas->iopt, access);
> + refcount_dec(_ioas->obj.users);
> +}
> +
> +void iommufd_access_detach(struct iommufd_access *access)
> +{
> + mutex_lock(>ioas_lock);
> + if (WARN_ON(!access->ioas))
> + goto out;
> + __iommufd_access_detach(access);
> +out:
> + access->ioas_unpin = NULL;
> + mutex_unlock(>ioas_lock);
> +}
> +EXPORT_SYMBOL_NS_GPL(iommufd_access_detach, IOMMUFD);

There is not really any benefit to make this two functions

> int iommufd_access_attach(struct iommufd_access *access, u32 ioas_id)
> {
[..]
>   if (access->ioas) {

if (access->ioas || access->ioas_unpin) {

But I wonder if it should be a WARN_ON? Does VFIO protect against
the userspace racing detach and attach, or do we expect to do it here?

> @@ -579,8 +620,8 @@ void iommufd_access_notify_unmap(struct io_pagetable 
> *iopt, unsigned long iova,
>  void iommufd_access_unpin_pages(struct iommufd_access *access,
>   unsigned long iova, unsigned long length)
>  {
> - struct io_pagetable *iopt = >ioas->iopt;
>   struct iopt_area_contig_iter iter;
> + struct io_pagetable *iopt;
>   unsigned long last_iova;
>   struct iopt_area *area;
>  
> @@ -588,6 +629,13 @@ void iommufd_access_unpin_pages(struct iommufd_access 
> *access,
>   WARN_ON(check_add_overflow(iova, length - 1, _iova)))
>   return;
>  
> + mutex_lock(>ioas_lock);
> + if (!access->ioas_unpin) {

This should be WARN_ON(), the driver has done something wrong if we
call this after the access has been detached.

Jason


Re: [Intel-gfx] [PATCH v12 13/24] vfio-iommufd: Add detach_ioas support for physical VFIO devices

2023-06-23 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:16:42AM -0700, Yi Liu wrote:
> This prepares for adding DETACH ioctl for physical VFIO devices.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Tested-by: Nicolin Chen 
> Tested-by: Matthew Rosato 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Signed-off-by: Yi Liu 
> ---
>  Documentation/driver-api/vfio.rst |  8 +---
>  drivers/vfio/fsl-mc/vfio_fsl_mc.c |  1 +
>  drivers/vfio/iommufd.c| 20 +++
>  .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c|  2 ++
>  drivers/vfio/pci/mlx5/main.c  |  1 +
>  drivers/vfio/pci/vfio_pci.c   |  1 +
>  drivers/vfio/platform/vfio_amba.c |  1 +
>  drivers/vfio/platform/vfio_platform.c |  1 +
>  drivers/vfio/vfio_main.c  |  3 ++-
>  include/linux/vfio.h  |  8 +++-
>  10 files changed, 41 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v12 12/24] vfio: Record devid in vfio_device_file

2023-06-22 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:16:41AM -0700, Yi Liu wrote:
> .bind_iommufd() will generate an ID to represent this bond, which is
> needed by userspace for further usage. Store devid in vfio_device_file
> to avoid passing the pointer in multiple places.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Tested-by: Nicolin Chen 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/iommufd.c   | 12 +++-
>  drivers/vfio/vfio.h  | 10 +-
>  drivers/vfio/vfio_main.c |  6 +++---
>  3 files changed, 15 insertions(+), 13 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v12 11/24] vfio-iommufd: Split bind/attach into two steps

2023-06-22 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:16:40AM -0700, Yi Liu wrote:
> This aligns the bind/attach logic with the coming vfio device cdev support.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Tested-by: Nicolin Chen 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/group.c   | 17 +
>  drivers/vfio/iommufd.c | 35 +--
>  drivers/vfio/vfio.h|  9 +
>  3 files changed, 39 insertions(+), 22 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v12 10/24] vfio-iommufd: Move noiommu compat validation out of vfio_iommufd_bind()

2023-06-22 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:16:39AM -0700, Yi Liu wrote:
> This moves the noiommu compat validation logic into vfio_df_group_open().
> This is more consistent with what will be done in vfio device cdev path.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Tested-by: Nicolin Chen 
> Tested-by: Yanting Jiang 
> Tested-by: Shameer Kolothum 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/group.c   | 13 +
>  drivers/vfio/iommufd.c | 22 --
>  drivers/vfio/vfio.h|  9 +
>  3 files changed, 30 insertions(+), 14 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v8 08/10] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev

2023-06-22 Thread Jason Gunthorpe
On Fri, Jun 16, 2023 at 02:30:40AM -0700, Yi Liu wrote:
> This allows VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl use the iommufd_ctx
> of the cdev device to check the ownership of the other affected devices.
> 
> When VFIO_DEVICE_GET_PCI_HOT_RESET_INFO is called on an IOMMUFD managed
> device, the new flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID is reported to indicate
> the values returned are IOMMUFD devids rather than group IDs as used when
> accessing vfio devices through the conventional vfio group interface.
> Additionally the flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED will be reported
> in this mode if all of the devices affected by the hot-reset are owned by
> either virtue of being directly bound to the same iommufd context as the
> calling device, or implicitly owned via a shared IOMMU group.
> 
> Suggested-by: Jason Gunthorpe 
> Suggested-by: Alex Williamson 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/iommufd.c   | 44 ++
>  drivers/vfio/pci/vfio_pci_core.c | 54 +++-
>  include/linux/vfio.h | 14 +
>  include/uapi/linux/vfio.h| 50 -
>  4 files changed, 154 insertions(+), 8 deletions(-)

I would have put patch 9 before this one, but it is OK this way too

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v12 21/24] vfio: Determine noiommu device in __vfio_register_dev()

2023-06-14 Thread Jason Gunthorpe
On Wed, Jun 14, 2023 at 01:12:50PM +, Liu, Yi L wrote:
> diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
> index 41a09a2df690..c2e0128323a7 100644
> --- a/drivers/vfio/group.c
> +++ b/drivers/vfio/group.c
> @@ -687,16 +687,6 @@ static struct vfio_group 
> *vfio_group_find_or_alloc(struct device *dev)
>   if (!iommu_group)
>   return ERR_PTR(-EINVAL);
>  
> - /*
> -  * VFIO always sets IOMMU_CACHE because we offer no way for userspace to
> -  * restore cache coherency. It has to be checked here because it is only
> -  * valid for cases where we are using iommu groups.
> -  */
> - if (!device_iommu_capable(dev, IOMMU_CAP_CACHE_COHERENCY)) {
> - iommu_group_put(iommu_group);
> - return ERR_PTR(-EINVAL);
> - }
> -
>   mutex_lock(_lock);
>   group = vfio_group_find_from_iommu(iommu_group);
>   if (group) {
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 51c80eb32af6..ffb4585b7f0e 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -292,6 +292,17 @@ static int __vfio_register_dev(struct vfio_device 
> *device,
>   if (ret)
>   return ret;
>  
> + /*
> +  * VFIO always sets IOMMU_CACHE because we offer no way for userspace to
> +  * restore cache coherency. It has to be checked here because it is only
> +  * valid for cases where we are using iommu groups.
> +  */
> + if (type == VFIO_IOMMU && !vfio_device_is_noiommu(device) &&
> + !device_iommu_capable(device->dev, IOMMU_CAP_CACHE_COHERENCY)) {
> + ret = -EINVAL;
> + goto err_out;
> + }
> +
>   ret = vfio_device_add(device);
>   if (ret)
>   goto err_out;

Yes that looks right

> 
> > I prefer the idea that vfio_device_is_noiommu() works in all the
> > kconfig scenarios rather than adding #ifdefs.
> 
> But the vfio_group would be empty when CONFIG_VFIO_GROUP is unset.
> From what I got now, when CONFIG_VFIO_GROUP is unset, the stub
> function always returns false.

It seems fine, you could also put the ifdef inside the stub

Jason


Re: [Intel-gfx] [PATCH v7 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev

2023-06-14 Thread Jason Gunthorpe
On Wed, Jun 14, 2023 at 01:05:45PM +, Liu, Yi L wrote:
> > -EAGAIN basically means the kernel internally malfunctioned - eg it
> > allocated too little space for the actual size of devices. That is no
> > longer possible in this version so it should never return -EAGAIN.
> 
> I still have one doubt. Per my understanding, this is to handle newly
> plugged devices during the info reporting path. I don’t think holding
> dev_set lock can prevent it. but maybe -ENOSPC is enough. @Alex,
> what about your opinion?

If the device was plug instantly before we computed the size we returned
ENOSPC

If it was plugged instantly after we computed the size we returned
EAGAIN

Here we just resolve this race consistently to always return ENOSPC,
which always means we ran out of space in the user provided buffer.

> > -   kfree(devices);
> > -   return ret;
> > +   if (fill.count != fill.devices - arg->devices)
> 
> Should be "if (fill.count != (fill.devices - arg->devices) / 
> sizeof(arg->devices[0]))" 

devices is already a typed pointer so the compiler computes the
/sizeof() itself

Your version  above is needed if it was void *

Jason


Re: [Intel-gfx] [PATCH v12 21/24] vfio: Determine noiommu device in __vfio_register_dev()

2023-06-14 Thread Jason Gunthorpe
On Wed, Jun 14, 2023 at 06:20:10AM +, Tian, Kevin wrote:
> > From: Liu, Yi L 
> > Sent: Wednesday, June 14, 2023 2:14 PM
> > 
> > 
> > > With that I think Jason's suggestion is to lift that test into main.c:
> > >
> > > int vfio_register_group_dev(struct vfio_device *device)
> > > {
> > >   /*
> > >* VFIO always sets IOMMU_CACHE because we offer no way for
> > userspace to
> > >* restore cache coherency. It has to be checked here because it is
> > only
> > >* valid for cases where we are using iommu groups.
> > >*/
> > >   if (type == VFIO_IOMMU && !vfio_device_is_noiommu(device) &&
> > >   !device_iommu_capable(dev, IOMMU_CAP_CACHE_COHERENCY))
> > >   return ERR_PTR(-EINVAL);
> > 
> > vfio_device_is_noiommu() needs to be called after vfio_device_set_group().
> > Otherwise, it's always false. So still needs to call it in the
> > __vfio_register_dev().
> 
> yes

Right, but it needs to be in vfio_main.c, not in the group.c - so
another patch should be added to move it.

I prefer the idea that vfio_device_is_noiommu() works in all the
kconfig scenarios rather than adding #ifdefs.

Jason


Re: [Intel-gfx] [PATCH v7 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev

2023-06-14 Thread Jason Gunthorpe
On Wed, Jun 14, 2023 at 10:35:10AM +, Liu, Yi L wrote:

> > -   if (fill->cur == fill->max)
> > -   return -EAGAIN; /* Something changed, try again */
> > +   if (fill->devices_end >= fill->devices)
> > +   return -ENOSPC;
> 
> This should be fill->devices_end <= fill->devices. 

Yep

> Even it's corrected. The
> new code does not return -EAGAIN. 

Right, there is no EAGAIN. If the caller didn't provide enough space
the previous version returned ENOSPC:

> > -   if (hdr.argsz < sizeof(hdr) + (fill.max * sizeof(*devices))) {
> > -   ret = -ENOSPC;
> > -   hdr.count = fill.max;
> > -   goto reset_info_exit;
> > -   }

-EAGAIN basically means the kernel internally malfunctioned - eg it
allocated too little space for the actual size of devices. That is no
longer possible in this version so it should never return -EAGAIN.

> And if return -ENOSPC, the expected
> size should be returned. But I didn't see it. As the hunk below[1] is removed,
> seems no way to know how many memory it requires.

Yes, I missed that, it should keep counting

Like this then

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index b0eadafcbcf502..05c064896a7a94 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -775,19 +775,25 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void 
*data)
 }
 
 struct vfio_pci_fill_info {
-   int max;
-   int cur;
-   struct vfio_pci_dependent_device *devices;
+   struct vfio_pci_dependent_device __user *devices;
+   struct vfio_pci_dependent_device __user *devices_end;
struct vfio_device *vdev;
+   u32 count;
u32 flags;
 };
 
 static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
 {
+   struct vfio_pci_dependent_device info = {
+   .segment = pci_domain_nr(pdev->bus),
+   .bus = pdev->bus->number,
+   .devfn = pdev->devfn,
+   };
struct vfio_pci_fill_info *fill = data;
 
-   if (fill->cur == fill->max)
-   return -EAGAIN; /* Something changed, try again */
+   fill.count++;
+   if (fill->devices >= fill->devices_end)
+   return 0;
 
if (fill->flags & VFIO_PCI_HOT_RESET_FLAG_DEV_ID) {
struct iommufd_ctx *iommufd = 
vfio_iommufd_device_ictx(fill->vdev);
@@ -800,12 +806,12 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void 
*data)
 */
vdev = vfio_find_device_in_devset(dev_set, >dev);
if (!vdev)
-   fill->devices[fill->cur].devid = 
VFIO_PCI_DEVID_NOT_OWNED;
+   info.devid = VFIO_PCI_DEVID_NOT_OWNED;
else
-   fill->devices[fill->cur].devid =
-   vfio_iommufd_device_hot_reset_devid(vdev, 
iommufd);
+   info.devid = vfio_iommufd_device_hot_reset_devid(
+   vdev, iommufd);
/* If devid is VFIO_PCI_DEVID_NOT_OWNED, clear owned flag. */
-   if (fill->devices[fill->cur].devid == VFIO_PCI_DEVID_NOT_OWNED)
+   if (info.devid == VFIO_PCI_DEVID_NOT_OWNED)
fill->flags &= ~VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED;
} else {
struct iommu_group *iommu_group;
@@ -814,13 +820,13 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void 
*data)
if (!iommu_group)
return -EPERM; /* Cannot reset non-isolated devices */
 
-   fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+   info.group_id = iommu_group_id(iommu_group);
iommu_group_put(iommu_group);
}
-   fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
-   fill->devices[fill->cur].bus = pdev->bus->number;
-   fill->devices[fill->cur].devfn = pdev->devfn;
-   fill->cur++;
+
+   if (copy_to_user(fill->devices, , sizeof(info)))
+   return -EFAULT;
+   fill->devices++;
return 0;
 }
 
@@ -1212,8 +1218,7 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
unsigned long minsz =
offsetofend(struct vfio_pci_hot_reset_info, count);
struct vfio_pci_hot_reset_info hdr;
-   struct vfio_pci_fill_info fill = { 0 };
-   struct vfio_pci_dependent_device *devices = NULL;
+   struct vfio_pci_fill_info fill = {};
bool slot = false;
int ret = 0;
 
@@ -1231,29 +1236,9 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
else if (pci_probe_reset_bus(vdev->pdev->bus))
return -ENODEV;
 
-   /* How many devices are affected? */
-   ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
-   , slot);
-   if (ret)
-   return ret;
-
-   WARN_ON(!fill.max); /* Should always be at least one */
-
-   /*
-* If 

Re: [Intel-gfx] [PATCH v7 6/9] vfio: Mark cdev usage in vfio_device

2023-06-14 Thread Jason Gunthorpe
On Wed, Jun 14, 2023 at 05:56:08AM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, June 14, 2023 1:56 AM
> > 
> > On Fri, Jun 02, 2023 at 05:15:12AM -0700, Yi Liu wrote:
> > > This can be used to differentiate whether to report group_id or devid in
> > > the revised VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl. At this moment, no
> > > cdev path yet, so the vfio_device_cdev_opened() helper always returns 
> > > false.
> > >
> > > Reviewed-by: Kevin Tian 
> > > Tested-by: Terrence Xu 
> > > Signed-off-by: Yi Liu 
> > > ---
> > >  include/linux/vfio.h | 5 +
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > index 2c137ea94a3e..2a45853773a6 100644
> > > --- a/include/linux/vfio.h
> > > +++ b/include/linux/vfio.h
> > > @@ -139,6 +139,11 @@ int vfio_iommufd_emulated_attach_ioas(struct 
> > > vfio_device
> > *vdev, u32 *pt_id);
> > >   ((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
> > >  #endif
> > >
> > > +static inline bool vfio_device_cdev_opened(struct vfio_device *device)
> > > +{
> > > + return false;
> > > +}
> > 
> > This and the two hunks in the other two patches that use this function
> > should be folded into the cdev series, probably just flattened to one
> > patch
> 
> Hmmm. I have a doubt about the rule. I think the reason to have this
> sub-series is to avoid bumping the cdev series. So perhaps we can still
> put this and the patch 9 in this series? Otherwise, most of the series
> needs to be in the cdev series.

Well, then Alex should apply them at the same time..

Jason


Re: [Intel-gfx] [PATCH v7 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev

2023-06-13 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:15:14AM -0700, Yi Liu wrote:
> This allows VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl use the iommufd_ctx
> of the cdev device to check the ownership of the other affected devices.
> 
> When VFIO_DEVICE_GET_PCI_HOT_RESET_INFO is called on an IOMMUFD managed
> device, the new flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID is reported to indicate
> the values returned are IOMMUFD devids rather than group IDs as used when
> accessing vfio devices through the conventional vfio group interface.
> Additionally the flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED will be reported
> in this mode if all of the devices affected by the hot-reset are owned by
> either virtue of being directly bound to the same iommufd context as the
> calling device, or implicitly owned via a shared IOMMU group.
> 
> Suggested-by: Jason Gunthorpe 
> Suggested-by: Alex Williamson 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/iommufd.c   | 49 +++
>  drivers/vfio/pci/vfio_pci_core.c | 47 +-
>  include/linux/vfio.h | 16 ++
>  include/uapi/linux/vfio.h| 50 +++-
>  4 files changed, 154 insertions(+), 8 deletions(-)

This could use some more fiddling, like we could copy each
vfio_pci_dependent_device to user memory inside the loop instead of
allocating an array.

Add another patch with something like this in it:

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index b0eadafcbcf502..516e0fda74bec9 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -775,19 +775,23 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void 
*data)
 }
 
 struct vfio_pci_fill_info {
-   int max;
-   int cur;
-   struct vfio_pci_dependent_device *devices;
+   struct vfio_pci_dependent_device __user *devices;
+   struct vfio_pci_dependent_device __user *devices_end;
struct vfio_device *vdev;
u32 flags;
 };
 
 static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
 {
+   struct vfio_pci_dependent_device info = {
+   .segment = pci_domain_nr(pdev->bus),
+   .bus = pdev->bus->number,
+   .devfn = pdev->devfn,
+   };
struct vfio_pci_fill_info *fill = data;
 
-   if (fill->cur == fill->max)
-   return -EAGAIN; /* Something changed, try again */
+   if (fill->devices_end >= fill->devices)
+   return -ENOSPC;
 
if (fill->flags & VFIO_PCI_HOT_RESET_FLAG_DEV_ID) {
struct iommufd_ctx *iommufd = 
vfio_iommufd_device_ictx(fill->vdev);
@@ -800,12 +804,12 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void 
*data)
 */
vdev = vfio_find_device_in_devset(dev_set, >dev);
if (!vdev)
-   fill->devices[fill->cur].devid = 
VFIO_PCI_DEVID_NOT_OWNED;
+   info.devid = VFIO_PCI_DEVID_NOT_OWNED;
else
-   fill->devices[fill->cur].devid =
-   vfio_iommufd_device_hot_reset_devid(vdev, 
iommufd);
+   info.devid = vfio_iommufd_device_hot_reset_devid(
+   vdev, iommufd);
/* If devid is VFIO_PCI_DEVID_NOT_OWNED, clear owned flag. */
-   if (fill->devices[fill->cur].devid == VFIO_PCI_DEVID_NOT_OWNED)
+   if (info.devid == VFIO_PCI_DEVID_NOT_OWNED)
fill->flags &= ~VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED;
} else {
struct iommu_group *iommu_group;
@@ -814,13 +818,13 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void 
*data)
if (!iommu_group)
return -EPERM; /* Cannot reset non-isolated devices */
 
-   fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+   info.group_id = iommu_group_id(iommu_group);
iommu_group_put(iommu_group);
}
-   fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
-   fill->devices[fill->cur].bus = pdev->bus->number;
-   fill->devices[fill->cur].devfn = pdev->devfn;
-   fill->cur++;
+
+   if (copy_to_user(fill->devices, , sizeof(info)))
+   return -EFAULT;
+   fill->devices++;
return 0;
 }
 
@@ -1212,8 +1216,7 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
unsigned long minsz =
offsetofend(struct vfio_pci_hot_reset_info, count);
struct vfio_pci_hot_reset_info hdr;
-   struct vfio_pci_fill_info fill = { 0 };
-   struct vfio_pci_dependent_device *devices = NULL;
+   struct vfio_pci_fill_info fill = {};
bool slot = false;
int ret = 0;
 
@@ -1231,29 +123

Re: [Intel-gfx] [PATCH v7 9/9] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET

2023-06-13 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:15:15AM -0700, Yi Liu wrote:
> This is the way user to invoke hot-reset for the devices opened by cdev
> interface. User should check the flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED
> in the output of VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl before doing
> hot-reset for cdev devices.
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Jason Gunthorpe 
> Reviewed-by: Jason Gunthorpe 
> Tested-by: Yanting Jiang 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 61 ++--
>  include/uapi/linux/vfio.h| 14 
>  2 files changed, 64 insertions(+), 11 deletions(-)

This looks OK but it should be in the cdev series..

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v7 6/9] vfio: Mark cdev usage in vfio_device

2023-06-13 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:15:12AM -0700, Yi Liu wrote:
> This can be used to differentiate whether to report group_id or devid in
> the revised VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl. At this moment, no
> cdev path yet, so the vfio_device_cdev_opened() helper always returns false.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  include/linux/vfio.h | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 2c137ea94a3e..2a45853773a6 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -139,6 +139,11 @@ int vfio_iommufd_emulated_attach_ioas(struct vfio_device 
> *vdev, u32 *pt_id);
>   ((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
>  #endif
>  
> +static inline bool vfio_device_cdev_opened(struct vfio_device *device)
> +{
> + return false;
> +}

This and the two hunks in the other two patches that use this function
should be folded into the cdev series, probably just flattened to one
patch

Jason


Re: [Intel-gfx] [PATCH v7 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev

2023-06-13 Thread Jason Gunthorpe
On Tue, Jun 13, 2023 at 08:32:29AM -0600, Alex Williamson wrote:
> On Tue, 13 Jun 2023 12:50:43 +
> "Liu, Yi L"  wrote:
> 
> > > From: Jason Gunthorpe 
> > > Sent: Tuesday, June 13, 2023 7:47 PM
> > > 
> > > On Fri, Jun 02, 2023 at 05:15:14AM -0700, Yi Liu wrote:  
> > > > +/*
> > > > + * Return devid for a device which is affected by hot-reset.
> > > > + * - valid devid > 0 for the device that is bound to the input
> > > > + *   iommufd_ctx.
> > > > + * - devid == VFIO_PCI_DEVID_OWNED for the device that has not
> > > > + *   been bound to any iommufd_ctx but other device within its
> > > > + *   group has been bound to the input iommufd_ctx.
> > > > + * - devid == VFIO_PCI_DEVID_NOT_OWNED for others. e.g. device
> > > > + *   is bound to other iommufd_ctx etc.
> > > > + */
> > > > +int vfio_iommufd_device_hot_reset_devid(struct vfio_device *vdev,
> > > > +   struct iommufd_ctx *ictx)
> > > > +{
> > > > +   struct iommu_group *group;
> > > > +   int devid;
> > > > +
> > > > +   if (vfio_iommufd_device_ictx(vdev) == ictx)
> > > > +   return vfio_iommufd_device_id(vdev);
> > > > +
> > > > +   group = iommu_group_get(vdev->dev);
> > > > +   if (!group)
> > > > +   return VFIO_PCI_DEVID_NOT_OWNED;
> > > > +
> > > > +   if (iommufd_ctx_has_group(ictx, group))
> > > > +   devid = VFIO_PCI_DEVID_OWNED;
> > > > +   else
> > > > +   devid = VFIO_PCI_DEVID_NOT_OWNED;
> > > > +
> > > > +   iommu_group_put(group);
> > > > +
> > > > +   return devid;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_iommufd_device_hot_reset_devid);  
> > > 
> > > This function really should not be in the core iommufd.c file - it is
> > > a purely vfio-pci function - why did you have to place it here?  
> > 
> > Put it here can avoid calling iommufd_ctx_has_group() in vfio-pci,
> > which requires to import IOMMUFD_NS. If this reason is not so
> > strong I can move it back to vfio-pci code.
> 
> The PCI-isms here are the name of the function and the return value,
> otherwise this is simply a "give me the devid for this device in this
> context".  The function name is trivial to change and we can define the
> internal errno usage such that the vfio-pci-core code can interpret the
> correct uAPI value.  For example, -EXDEV ("Cross-device link") could
> maybe be interpreted as owned and any other errno is not-owned, -ENODEV
> maybe being the default.

Yeah, this approach seems logical

If the function is called

  vfio_iommufd_get_dev_id(struct vfio_device *vdev, struct iommufd_ctx *ictx)

Then maybe 
  ENOENT = device is owned but there is no Id
  ENODEV = device is not owned

EXDEV is good too, nice symmetry with ENODEV - it doesn't really
matter since there is only one caller and there is no embedded errno
propogation.

Jason


Re: [Intel-gfx] [PATCH v12 21/24] vfio: Determine noiommu device in __vfio_register_dev()

2023-06-13 Thread Jason Gunthorpe
On Tue, Jun 13, 2023 at 11:15:11AM -0600, Alex Williamson wrote:
> [Sorry for breaking threading, replying to my own message id with reply
>  content from Yi since the Cc list got broken]

Yikes it is really busted, I think I fixed it?

> If we renamed your function above to vfio_device_has_iommu_group(),
> couldn't we just wrap device_add like below instead to not have cdev
> setup for a noiommu device, generate an error for a physical device w/o
> IOMMU backing, and otherwise setup the cdev device?
> 
> static inline int vfio_device_add(struct vfio_device *device, enum 
> vfio_group_type type)
> {
> #if IS_ENABLED(CONFIG_VFIO_GROUP)
>   if (device->group->type == VFIO_NO_IOMMU)
>   return device_add(>device);

vfio_device_is_noiommu() embeds the IS_ENABLED

> #else
>   if (type == VFIO_IOMMU && !vfio_device_has_iommu_group(device))
>   return -EINVAL;
> #endif

The require test is this from the group code:

if (!device_iommu_capable(dev, IOMMU_CAP_CACHE_COHERENCY)) {

We could lift it out of the group code and call it from vfio_main.c like:

if (type == VFIO_IOMMU && !vfio_device_is_noiommu(vdev) && 
!device_iommu_capable(dev,
 IOMMU_CAP_CACHE_COHERENCY))
   FAIL

Jason


Re: [Intel-gfx] [PATCH v12 07/24] vfio: Block device access via device fd until device is opened

2023-06-13 Thread Jason Gunthorpe
On Tue, Jun 13, 2023 at 08:16:47AM -0600, Alex Williamson wrote:

> > Not quite get why bit field is going to be incompatible with smp
> > lockless operations. Could you elaborate a bit? And should I define
> > the access_granted as u8 or "u8:1"?
> 
> Perhaps FUD on my part, but load-acquire type operations have specific
> semantics and it's not clear to me that they interest with compiler
> generated bit operations.  Thanks,

They won't compile if you target bit ops, you can't take the address
of a bitfield.

Jason


Re: [Intel-gfx] [PATCH v7 7/9] vfio: Add helper to search vfio_device in a dev_set

2023-06-13 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:15:13AM -0700, Yi Liu wrote:
> There are drivers that need to search vfio_device within a given dev_set.
> e.g. vfio-pci. So add a helper.
> 
> vfio_pci_is_device_in_set() now returns -EBUSY in commit a882c16a2b7e
> ("vfio/pci: Change vfio_pci_try_bus_reset() to use the dev_set") where
> it was trying to preserve the return of vfio_pci_try_zap_and_vma_lock_cb().
> However, it makes more sense to return -ENODEV.
> 
> Suggested-by: Alex Williamson 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/pci/vfio_pci_core.c |  6 +-
>  drivers/vfio/vfio_main.c | 15 +++
>  include/linux/vfio.h |  3 +++
>  3 files changed, 19 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v7 5/9] iommufd: Add helper to retrieve iommufd_ctx and devid

2023-06-13 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:15:11AM -0700, Yi Liu wrote:
> This is needed by the vfio-pci driver to report affected devices in the
> hot-reset for a given device.
> 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  drivers/iommu/iommufd/device.c | 12 
>  include/linux/iommufd.h|  3 +++
>  2 files changed, 15 insertions(+)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v7 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev

2023-06-13 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:15:14AM -0700, Yi Liu wrote:
> +/*
> + * Return devid for a device which is affected by hot-reset.
> + * - valid devid > 0 for the device that is bound to the input
> + *   iommufd_ctx.
> + * - devid == VFIO_PCI_DEVID_OWNED for the device that has not
> + *   been bound to any iommufd_ctx but other device within its
> + *   group has been bound to the input iommufd_ctx.
> + * - devid == VFIO_PCI_DEVID_NOT_OWNED for others. e.g. device
> + *   is bound to other iommufd_ctx etc.
> + */
> +int vfio_iommufd_device_hot_reset_devid(struct vfio_device *vdev,
> + struct iommufd_ctx *ictx)
> +{
> + struct iommu_group *group;
> + int devid;
> +
> + if (vfio_iommufd_device_ictx(vdev) == ictx)
> + return vfio_iommufd_device_id(vdev);
> +
> + group = iommu_group_get(vdev->dev);
> + if (!group)
> + return VFIO_PCI_DEVID_NOT_OWNED;
> +
> + if (iommufd_ctx_has_group(ictx, group))
> + devid = VFIO_PCI_DEVID_OWNED;
> + else
> + devid = VFIO_PCI_DEVID_NOT_OWNED;
> +
> + iommu_group_put(group);
> +
> + return devid;
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommufd_device_hot_reset_devid);

This function really should not be in the core iommufd.c file - it is
a purely vfio-pci function - why did you have to place it here?

Jason


Re: [Intel-gfx] [PATCH v7 3/9] iommufd: Reserve all negative IDs in the iommufd xarray

2023-06-13 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:15:09AM -0700, Yi Liu wrote:
> With this reservation, IOMMUFD users can encode the negative IDs for
> specific purposes. e.g. VFIO needs two reserved values to tell userspace
> the ID returned is not valid but has other meaning.
> 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  drivers/iommu/iommufd/main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v7 4/9] iommufd: Add iommufd_ctx_has_group()

2023-06-13 Thread Jason Gunthorpe
On Fri, Jun 02, 2023 at 05:15:10AM -0700, Yi Liu wrote:
> This adds the helper to check if any device within the given iommu_group
> has been bound with the iommufd_ctx. This is helpful for the checking on
> device ownership for the devices which have not been bound but cannot be
> bound to any other iommufd_ctx as the iommu_group has been bound.
> 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  drivers/iommu/iommufd/device.c | 30 ++
>  include/linux/iommufd.h|  8 
>  2 files changed, 38 insertions(+)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v7 9/9] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET

2023-06-09 Thread Jason Gunthorpe
On Fri, Jun 09, 2023 at 12:13:58AM +, Liu, Yi L wrote:

> > Other than this and the couple other comments, the series looks ok to
> > me.  We still need acks from Jason for iommufd on 3-5.  Thanks,
> 
> Thanks, perhaps one more version after getting feedback from Jason.

Yes, perhaps today I can reach it

Jason


Re: [Intel-gfx] [PATCH v11 20/23] vfio: Add VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT

2023-06-06 Thread Jason Gunthorpe
On Thu, May 25, 2023 at 03:03:54AM +, Liu, Yi L wrote:

> vIOMMU may introduce some performance deduction if there
> are frequent map/unmap. 

DPDK doesn't do that.

And once you turn on the HW IOMMU you negate alot of the micro
performance wins of bypassing. Maybe there is still some argument
about giant/huge pages or something.

> without vIOMMU is supposed to be more robust. But I'm not
> sure if the noiommu userspace will adapt to cdev noiommu.
> Perhaps yes if group may be deprecated in future.

I think that is more a distro question, and we don't have to answer it
fully now.
 
Jason


Re: [Intel-gfx] [PATCH v11 20/23] vfio: Add VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT

2023-06-06 Thread Jason Gunthorpe
On Wed, May 24, 2023 at 09:31:42AM -0600, Alex Williamson wrote:

> If a user creates an ioas within an iommufd, attaches a device to that
> ioas and populates it with mappings, wouldn't the user expect the
> device to have access to and honor those mappings?  I think that's the
> path we're headed down if we report a successful attach of a noiommu
> device to an ioas.

I understand we are going to drop no-iommu from this series, so this
below is not relavent.

But to clarify my general design idea here again

The IOAS contains the mappings that userspace would like to use with
no-iommu. Userspace would use a new IOCTL to pin and return the DMA
addr's of those exact mappings.

So attaching a noiommu to an IOAS is a necessary operation that should
succeed. It doesn't make full API sense until we also get an ioctl to
return the dma_addr_t lists.

What is special about no-iommu is that the mapppings have to go
through the special ioctl API to pin and translate, the IOVA cannot be
used natively as a dma_addr. The IOAS is still used and still related
to the device, it just for pinning and dma_addr generation not HW
isolation.
 
> We need to keep in mind that noiommu was meant to be a minimally
> intrusive mechanism to provide a dummy vfio IOMMU backend and satisfy
> the group requirements, solely for the purpose of making use of the
> vfio device interface and without providing any DMA mapping services or
> expectations.  

Well, no-iommu turned into a total hack job as soon as it wrongly
relied on mlock() and /proc/ files to function. Even within its
defined limitations this is an incorrect way to use the mm and DMA
APIs. Memory under DMA must be locked using pin_user_pages(), mlock is
not a substitution.

I expect this is functionally broken these days, under some workloads,
on certain kernel configurations.

Even if we don't fully implement it, I prefer to imagine a design
where no-iommu is implemented correctly and orient things toward that.

> beyond the minimal code trickery of the legacy implementation.  I hate
> to ask, but could we reiterate our requirements for noiommu as a part of
> the native iommufd interface for vfio?  The nested userspace requirement
> is gone now that hypervisors have vIOMMU support, so my assumption is
> that this is only for bare metal systems without an IOMMU, which
> ideally are less and less prevalent.  

I understood there was some desire for DPDK users to do this for
higher performance on some systems.

> that are actually going to adopt the noiommu cdev interface?  What
> terrible things happen if noiommu only exists in the vfio group compat
> interface to iommufd and at some distant point in the future dies when
> that gets disabled?

I think it is fine, it is only for DPDK and if DPDK people really
really care about this then they can implement it properly someday.

I'm quite happy if we say we will not put no-iommu into the device
cdev until it is put in fully correctly without relying on mlock/etc.

Then the API construction would make alot more sense.

Jason


Re: [Intel-gfx] [PATCH v6 09/10] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev

2023-05-31 Thread Jason Gunthorpe
On Fri, May 26, 2023 at 10:04:27AM +0800, Baolu Lu wrote:
> On 5/25/23 9:02 PM, Liu, Yi L wrote:
> > >   It's possible that requirement
> > > might be relaxed in the new DMA ownership model, but as it is right
> > > now, the code enforces that requirement and any new discussion about
> > > what makes hot-reset available should note both the ownership and
> > > dev_set requirement.  Thanks,
> > I think your point is that if an iommufd_ctx has acquired DMA ownerhisp
> > of an iommu_group, it means the device is owned. And it should not
> > matter whether all the devices in the iommu_group is present in the
> > dev_set. It is allowed that some devices are bound to pci-stub or
> > pcieport driver. Is it?
> > 
> > Actually I have a doubt on it. IIUC, the above requirement on dev_set
> > is to ensure the reset to the devices are protected by the dev_set->lock.
> > So that either the reset issued by driver itself or a hot reset request
> > from user, there is no race. But if a device is not in the dev_set, then
> > hot reset request from user might race with the bound driver. DMA ownership
> > only guarantees the drivers won't handle DMA via DMA API which would have
> > conflict with DMA mappings from user. I'm not sure if it is able to
> > guarantee reset is exclusive as well. I see pci-stub and pcieport driver
> > are the only two drivers that set the driver_managed_dma flag besides the
> > vfio drivers. pci-stub may be fine. not sure about pcieport driver.
> 
> commit c7d469849747 ("PCI: portdrv: Set driver_managed_dma") described
> the criteria of adding driver_managed_dma to the pcieport driver.
> 
> "
> We achieve this by setting ".driver_managed_dma = true" in pci_driver
> structure. It is safe because the portdrv driver meets below criteria:
> 
> - This driver doesn't use DMA, as you can't find any related calls like
>   pci_set_master() or any kernel DMA API (dma_map_*() and etc.).
> - It doesn't use MMIO as you can't find ioremap() or similar calls. It's
>   tolerant to userspace possibly also touching the same MMIO registers
>   via P2P DMA access.
> "
> 
> pci_rest_device() definitely shouldn't be done by the kernel drivers
> that have driver_managed_dma set.

Right

The only time it is safe to reset is if you know there is no attached
driver or you know VFIO is the attached driver and the caller owns the
VFIO too.

We haven't done a no attached driver test due to races.

Jason


Re: [Intel-gfx] [PATCH v5 06/10] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device

2023-05-17 Thread Jason Gunthorpe
On Wed, May 17, 2023 at 12:40:32PM -0600, Alex Williamson wrote:
> On Wed, 17 May 2023 15:22:27 -0300
> Jason Gunthorpe  wrote:
> 
> > On Wed, May 17, 2023 at 12:15:17PM -0600, Alex Williamson wrote:
> > 
> > > > +int vfio_iommufd_physical_devid(struct vfio_device *vdev)
> > > > +{
> > > > +   if (vdev->iommufd_device)
> > > > +   return iommufd_device_to_id(vdev->iommufd_device);
> > > > +   if (vdev->noiommu_access)
> > > > +   return iommufd_access_to_id(vdev->noiommu_access);
> > > > +   return -EINVAL;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_iommufd_physical_devid);  
> > > 
> > > I think these exemplify that it would be better if both emulated and
> > > noiommu use the same iommufd_access pointer.  Thanks,  
> > 
> > Oh, I mis understood your other remark.. Yeah good question I have to
> > study this also
> 
> I guess I also missed that this wasn't iommufd_access vs
> noiommu_access, it's device vs access, but shouldn't any iommufd_access
> pointer provide the devid?  I need to go look why we need two different
> methods to get a devid...

At least this hunk above makes sense, access and device are two
different objects with different types, so having different ID
accessors is reasonably logical.

But it is a good point that this should return the dev id of the
normal access for a normal mdev too

Ideally I'd like to see that we always return a dev id to userspace
for all vfio device types. Then we can rely on it

Jason


Re: [Intel-gfx] [PATCH v5 06/10] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device

2023-05-17 Thread Jason Gunthorpe
On Wed, May 17, 2023 at 12:15:17PM -0600, Alex Williamson wrote:

> > +int vfio_iommufd_physical_devid(struct vfio_device *vdev)
> > +{
> > +   if (vdev->iommufd_device)
> > +   return iommufd_device_to_id(vdev->iommufd_device);
> > +   if (vdev->noiommu_access)
> > +   return iommufd_access_to_id(vdev->noiommu_access);
> > +   return -EINVAL;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_iommufd_physical_devid);
> 
> I think these exemplify that it would be better if both emulated and
> noiommu use the same iommufd_access pointer.  Thanks,

Oh, I mis understood your other remark.. Yeah good question I have to
study this also

Jason


Re: [Intel-gfx] [PATCH v5 01/10] vfio-iommufd: Create iommufd_access for noiommu devices

2023-05-17 Thread Jason Gunthorpe
On Wed, May 17, 2023 at 11:26:09AM -0600, Alex Williamson wrote:

> It's not clear to me why we need a separate iommufd_access for
> noiommu.

The point was to allocate an ID for the device so we can use that ID
with the other interfaces in all cases.

Otherwise it is a too weird special case that is probably going to
keep causing trouble down the road...

Jason


Re: [Intel-gfx] [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices

2023-05-03 Thread Jason Gunthorpe
On Wed, May 03, 2023 at 09:48:36AM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, May 3, 2023 2:12 AM
> > 
> > On Sat, Apr 29, 2023 at 12:07:24AM +0800, Yi Liu wrote:
> > > > The emulated stuff is for mdev only, it should not be confused with
> > > > no-iommu
> > >
> > > hmmm. I guess the confusion is due to the reuse of
> > > vfio_iommufd_emulated_bind().
> > 
> > This is probabl y not a good direction
> 
> I see. But if not reusing, then there may be a few code duplications.
> I'm fine to add separate _bind/unbind() functions for noiommu devices
> if Alex and you prefer it.

I think you will find there is minimal duplication

Jason


Re: [Intel-gfx] [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices

2023-05-03 Thread Jason Gunthorpe
> > My general idea to complete the no-iommu feature is to add a new IOCTL
> > to VFIO that is 'pin iova and return dma addr' that no-iommu userspace
> > would call instead of trying to abuse mlock and /proc/ to do it. That
> > ioctl would use the IOAS attached to the access just like a mdev would
> > do, so it has a real IOVA, but it is not a mdev.
> 
> This new ioctl may be IOMMUFD ioctl since its input is the IOAS and
> addr, nothing related to the device. Is it?

No, definately a VFIO special ioctl for VFIO no-iommu mode.

> Sure. That's also why I added a noiommu test to avoid calling
> unmap callback although it seems not possible to have unmap
> callback as mdev drivers would implement it.

Just have a special noiommu ops and use an empty unmap function and
pass that to the special noiommu access creation.

Jason


Re: [Intel-gfx] [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices

2023-05-02 Thread Jason Gunthorpe
On Sat, Apr 29, 2023 at 12:13:39AM +0800, Yi Liu wrote:

> > Whoa, noiommu is inherently unsafe an only meant to expose the vfio
> > device interface for userspace drivers that are going to do unsafe
> > things regardless.  Enabling noiommu to work with mdev, pin pages, or
> > anything else should not be on our agenda.  Userspaces relying on niommu
> > get the minimum viable interface and must impose a minuscule
> > incremental maintenance burden.  The only reason we're spending so much
> > effort on it here is to make iommufd noiommu support equivalent to
> > group/container noiommu support.  We should stop at that.  Thanks,
> 
> btw. I asked a question in [1] to check if we should allow attach/detach
> on noiommu devices. Jason has replied it. If in future noiommu userspace
> can pin page, then such userspace will need to attach/detach ioas. So I
> made cdev series[2] to allow attach ioas on noiommu devices. Supporting
> it from cdev day-1 may avoid probing if attach/detach is supported or
> not for specific devices when adding pin page for noiommu userspace.
> 
> But now, I think such a support will not in plan, is it? If so, will it
> be better to disallow attach/detach on noiommu devices in patch [2]?
> 
> [1] https://lore.kernel.org/kvm/zea+khh0tufst...@nvidia.com/
> [2] https://lore.kernel.org/kvm/20230426150321.454465-21-yi.l@intel.com/

If we block it then userspace has to act quite differently, I think we
should keep it.

My general idea to complete the no-iommu feature is to add a new IOCTL
to VFIO that is 'pin iova and return dma addr' that no-iommu userspace
would call instead of trying to abuse mlock and /proc/ to do it. That
ioctl would use the IOAS attached to the access just like a mdev would
do, so it has a real IOVA, but it is not a mdev.

unmap callback just does nothing, as Alex says it is all still totally
unsafe.

This just allows it use the mm a little more properly and safely (eg
mlock() doesn't set things like page_maybe_dma_pinned(), proc doesn't
reject things like DAX and it currently doesn't make an adjustment for
the PCI offset stuff..) So it would make DPDK a little more robust,
portable and make the whole VFIO no-iommu feature much easier to use.

To do that we need an iommufd access, an access ID and we need to link
the current IOAS to the special access, like mdev, but in any mdev
code paths.

That creating the access ID solves the reset problem as well is a nice
side effect and is the only part of this you should focus on for now..

Jason


Re: [Intel-gfx] [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices

2023-05-02 Thread Jason Gunthorpe
On Sat, Apr 29, 2023 at 12:07:24AM +0800, Yi Liu wrote:
> > The emulated stuff is for mdev only, it should not be confused with
> > no-iommu
> 
> hmmm. I guess the confusion is due to the reuse of
> vfio_iommufd_emulated_bind().

This is probabl y not a good direction

> > Eg if you had a no_iommu_access value to store the access it would be
> > fine and could serve as the 'this is no_iommu' flag
> 
> So this no_iommu_access shall be created per iommufd bind, and call the
> iommufd_access_create() with iommufd_access_ops. is it? If so, this is
> not 100% the same with no_iommu flag as this flag is static after device
> registration.

Something like that, yes

I don't think it is any real difference with the current flag, both
are determined at the first ioctl when the iommufd is presented and
both would state permanently until the fd close

Jason


Re: [Intel-gfx] [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices

2023-04-28 Thread Jason Gunthorpe
On Fri, Apr 28, 2023 at 02:21:26PM +0800, Yi Liu wrote:

> but this patch needs to use vfio_iommufd_emulated_bind() and
> vfio_iommufd_emulated_unbind() for the noiommu devices when binding
> to iommufd. So needs to check noiommu in the vfio_iommufd_bind()
> and vfio_iommu_unbind() as well.

I'm not sure this is strictly necessary, it just needs an access

The emulated stuff is for mdev only, it should not be confused with
no-iommu

Eg if you had a no_iommu_access value to store the access it would be
fine and could serve as the 'this is no_iommu' flag

The only purpose of the access at this moment is to get an iommufdctx
id to return to userspace.

Jason


Re: [Intel-gfx] [PATCH v4 6/9] iommufd: Reserved -1 in the iommufd xarray

2023-04-27 Thread Jason Gunthorpe
On Thu, Apr 27, 2023 at 07:09:38AM +, Liu, Yi L wrote:
> > > @@ -50,7 +53,7 @@ struct iommufd_object *_iommufd_object_alloc(struct
> > > iommufd_ctx *ictx,
> > >* before calling iommufd_object_finalize().
> > >*/
> > >   rc = xa_alloc(>objects, >id, XA_ZERO_ENTRY,
> > > -   xa_limit_32b, GFP_KERNEL_ACCOUNT);
> > > +   iommufd_xa_limit_32b, GFP_KERNEL_ACCOUNT);
> > 
> > Just direct use XA_LIMIT() here.
> 
> Ok.
> 
> > btw do we need a contract so vfio can learn 0 and -1 are reserved or
> > fine to have a fixed assumption in later patches?
> 
> I doubt how to do it. ☹ @Jason? What about your opinion?

It is probably fine to use xa_limit_31b and just say that -ve values
are reserved in a comment near the define for 0

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-24 Thread Jason Gunthorpe
On Sun, Apr 23, 2023 at 10:28:58AM +, Liu, Yi L wrote:

> This noiommu improvement shall allow user to attach ioas to noiommu devices.
> is it? This may be done by calling iommufd_access_attach(). So there is a
> quick question. In the cdev series, shall we allow the attachment
> for noiommu?

Yes, I think we need to undo the decision we talked about earlier
where no-iommu would be asked for with a -1 iommufd.

All vfio_devices should have an iommufd_ctx when container is compiled
out.

You don't need to do anything with the ctx for no-iommu beyond demand
that userspace provide it.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-21 Thread Jason Gunthorpe
On Thu, Apr 20, 2023 at 08:08:39AM -0600, Alex Williamson wrote:

> > Hide this device in the list looks fine to me. But the calling user should
> > not do any new device open before finishing hot-reset. Otherwise, user may
> > miss a device that needs to do pre/post reset. I think this requirement is
> > acceptable. Is it? 
> 
> I think Kevin and Jason are leaning towards reporting the entire
> dev-set.  The INFO ioctl has always been a point-in-time reading, no
> guarantees are made if the host or user configuration is changed.
> Nothing changes in that respect.

Yeah, I think your point about qemu community formus suggest we should
err toward having qemu provide some fully detailed debug report.
 
> > > Whereas dev-id < 0
> > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > device, device configured within a different iommufd_ctx, or device
> > > opened outside of the vfio cdev API."  Is that about right?  Thanks,  
> > 
> > Do you mean to have separate err-code for the three possibilities? As
> > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > have such err-code definition without reserving some ids in iommufd. 
> 
> Yes, if we're going to report the full dev-set, I think we need at
> least two unique error codes or else the user has no way to determine
> the subset of invalid dev-ids which block the reset.

If you think this is important to report we should report 0 and -1,
and adjust the iommufd xarray allocator to reserve -1

It depends what you want to show for the debugging.

eg if we have debugging where qemu dumps this table:

   BDF   In VM   iommu_group   Has VFIO driver   Has Kernel Driver

By also doing various sysfs probes based on the BDF, then the admin
action to remedy the situation is:

Make "Has VFIO driver = y" or "Has Kernel Driver = n" for every row in
the table to make the reset work.

And we don't need the distinction. Adding the 0/-1 lets you make a
useful table without doing any sysfs work.

> I think Jason is proposing the set of valid dev-ids are >0, a dev-id
> of zero indicates some form of non-blocking, while <0 (or maybe
> specifically -1) indicates a blocking device.

Yes, 0 and -1 would be fine with those definitions. The only use of
the data is to add a 'blocking use of reset' colum to the table
above..

Thanks,
Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2023 at 10:23:55AM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Monday, April 17, 2023 9:39 PM
> > 
> > On Fri, Apr 14, 2023 at 09:11:30AM +, Tian, Kevin wrote:
> > 
> > > The only corner case with this option is when a user mixes group
> > > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > > In that case the kernel doesn't have sufficient knowledge to judge
> > > 'resettable' as it doesn't know which groups are opened by this user.
> > 
> > IMHO we don't need to support this combination.
> 
> Do you mean we don't support hot-reset for this combination or we don't
> support user using this combination. I guess the prior one. Right?

Yes

> Ditto. We just fail hot-reset for the multiple iommufds case. Is it?

Yes

> > I suppose we should have done that from the beginning - no-iommu is an
> > IOMMUFD access, it just uses a crazy /proc based way to learn the
> > PFNs. Making it a proper access and making a real VFIO ioctl that
> > calls iommufd_access_pin_pages() and returns the DMA mapped addresses
> > to userspace would go a long way to making no-iommu work in a logical,
> > usable, way.
> 
> This seems to be an improvement for noiommu mode. It can be done later.
> For now, generating access_id and binding noiommu devices with iommufdctx
> is enough for supporting noiommu hot-reset.

Yes, I'm not sure there is much value in improving no-iommu unless
someone also wants to go in and update dpdk.

At some point we will need to revise dpdk to use iommufd, maybe that
would be a good time to fix this too.

The point is that using an access is actually a logical and sensible
thing to do, no a hack to make hot reset work better.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2023 at 05:02:44AM +, Tian, Kevin wrote:

> Yes I chatted with Yi about it.
> 
> If the calling device of the INFO ioctl is opened by group then behave
> as it does today.
> 
> If the calling device is opened via cdev then use dev_id scheme as
> discussed above.
> 
> in hot_reset ioctl the fd array only accepts group fd's.
> 
> cdev can be reset only via null fd array.

Agree
 
> It remains a small open that null fd array could potentially work for
> group-opened device too if vfio-compat is used. In that case devices
> are in same iommufd ctx with valid dev_id even though they are opened 
> via group. But probably it's not worthy blocking it?

IMHO not worth the complexity to block. Security is maintained if we
use an iommufd_ctx check.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-18 Thread Jason Gunthorpe
On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote:
> On Mon, 17 Apr 2023 16:31:56 -0300
> Jason Gunthorpe  wrote:
> 
> > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > this means that regardless of which device calls INFO, there's only one
> > > answer (assuming same set of devices opened, all cdev, all within same
> > > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > > and Jason agreed to, I think the output would be:
> > > 
> > > flags: NOT_RESETABLE | DEV_ID
> > > {
> > >   { valid devA-id,  devA-BDF },
> > >   { valid devC-id,  devC-BDF },
> > >   { valid devD-id,  devD-BDF },
> > >   { invalid dev-id, devE-BDF },
> > > }
> > > 
> > > Here devB gets dropped because the kernel understands that devB is
> > > unopened, affected, and owned.  It's therefore not a blocker for
> > > hot-reset.  
> > 
> > I don't think we want to drop anything because it makes the API
> > ill suited for the debugging purpose.
> > 
> > devb should be returned with an invalid dev_id if I understand your
> > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > make the debugging a bit better.
> > 
> > Userspace should look at only NOT_RESETTABLE to determine if it
> > proceeds or not, and it should use the valid dev_id list to iterate
> > over the devices it has open to do the config stuff.
> 
> If an affected device is owned, not opened, and not interfering with
> the reset, what is it adding to the API to report it for debugging
> purposes?

It lets it print the entire group of devices, this is the only way
something can learn the actual list of all BDFs affected.

dev_id can just return 0, we don't need a complex bitmap. Userspace
looks at the flag, if !NOT_RESETABLE then it ignores dev_id=0.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-17 Thread Jason Gunthorpe
On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> Yes, it's not trivial, but Jason is now proposing that we consider
> mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> this means that regardless of which device calls INFO, there's only one
> answer (assuming same set of devices opened, all cdev, all within same
> iommufd_ctx).  Based on what I explained about my understanding of INFO2
> and Jason agreed to, I think the output would be:
> 
> flags: NOT_RESETABLE | DEV_ID
> {
>   { valid devA-id,  devA-BDF },
>   { valid devC-id,  devC-BDF },
>   { valid devD-id,  devD-BDF },
>   { invalid dev-id, devE-BDF },
> }
> 
> Here devB gets dropped because the kernel understands that devB is
> unopened, affected, and owned.  It's therefore not a blocker for
> hot-reset.

I don't think we want to drop anything because it makes the API
ill suited for the debugging purpose.

devb should be returned with an invalid dev_id if I understand your
example. Maybe it should return with -1 as the dev_id instead of 0, to
make the debugging a bit better.

Userspace should look at only NOT_RESETTABLE to determine if it
proceeds or not, and it should use the valid dev_id list to iterate
over the devices it has open to do the config stuff.

> OTOH, devE is unopened, affected, and un-owned, and we
> previously agreed against the opportunistic un-opened/un-owned loophole.

NOT_RESETABLE should be returned in this case, yes.

If we want to enable userspace to use the loophole it should be an
additional flag. RESETABLE_FOR_NOW or something

> I think we're narrowing in on an interface that isn't as arbitrary.  If
> we assume the restrictions that Jason proposes, then cdev is exclusively
> a kernel determined reset availability model

Yes, I think this is probably best looking forward.

> where I'd agree that
> passing device-fds as a proof of ownership is pointless.  The group
> interface would therefore remain exclusively a proof-of-ownership
> model since we have no incentive to extend it to kernel-determined
> given the limited use case of all affected devices managed by the same
> vfio container.

Yes

> Moot, but there's actually enough information there to infer IOMMU
> groups for each device, but we probably can't prove that would always
> be the case.  If we adopt Jason's proposal though, I don't see that we
> need either a group-id or BDF capability, the BDF is only for debug
> reporting.  However, there is a new burden on the kernel to identify
> the affected, un-owned devices for that report.  

Yes and yes

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-17 Thread Jason Gunthorpe
On Thu, Apr 13, 2023 at 12:07:12PM -0600, Alex Williamson wrote:

> IIUC, the semantics we're proposing is that an INFO2 ioctl would return
> success or failure indicating whether the user has sufficient ownership
> of the affected devices, 

Or a flag, but yes

> and in the success case returns an array of
> affected dev-ids within the user's iommufd_ctx.  Unopened, affected
> devices, are not reported via INFO2, and unopened, affected devices
> outside the user's scope of ownership (ie. outside the owned IOMMU
> group) will generate a failure condition.

Yes

> As for the INFO ioctl, it's described as unchanged, which does raise
> the question of what is reported for IOMMU groups and how does the
> value there coherently relate to anything else in the cdev-exclusive
> vfio API...

For cdev mode the value of the group_id has no functional
purpose. INFO has no functional purpose beyond debugging. The cdev
enabled userspace should print the BDFs from the INFO in a debug
message and ignore the group_id.

Kernel will still fill the group_id using the iommu_get_group() stuff,
and set -1 for no-iommu.

> We had already iterated a proposal where the group-id is replaced with
> a dev-id in the existing ioctl and a flag indicates when the return
> value is a dev-id vs group-id.  This had a gap that userspace cannot
> determine if a reset is available given this information since un-owned
> devices report an invalid dev-id and userspace can't know if it has
> implicit ownership.

IIRC, yes.

> It seems cleaner to me though that we would could still re-use INFO in
> a similar way, simply defining a new flag bit which is valid only in
> the case of returning dev-ids and indicates if the reset is
> available.

Yes, it could be done like this as well. INFO2 is more a discussion
object, how we encode it in the uAPI matters a lot less. The point is
that INFO2, as an idea, returns information that no other existing API
returns: the "ownership passed flag" and "dev_id list"

Then as I said in the other mail we roll no-iommu into an iommufd_ctx
object and just follow the design that userspace must have a single
iommufd_ctx containing all the devices to use the hot reset feature.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-17 Thread Jason Gunthorpe
On Fri, Apr 14, 2023 at 09:11:30AM +, Tian, Kevin wrote:

> The only corner case with this option is when a user mixes group
> and cdev usages. iirc you mentioned it's a valid usage to be supported.
> In that case the kernel doesn't have sufficient knowledge to judge
> 'resettable' as it doesn't know which groups are opened by this user.

IMHO we don't need to support this combination.

We can say that to use the hot reset API the user must put all their
devices into the same iommufd_ctx and cover 100% of the known use
cases for this.

There are already other situations, like nesting, that do force users
to put everything into one iommufd_ctx.

No reason to make things harder and more complicated.

I'm coming to the feeling that we should put no-iommu devices in
iommufd_ctx's as well. They would be an iommufd_access like
mdevs. That would clean up the complications they cause here.

I suppose we should have done that from the beginning - no-iommu is an
IOMMUFD access, it just uses a crazy /proc based way to learn the
PFNs. Making it a proper access and making a real VFIO ioctl that
calls iommufd_access_pin_pages() and returns the DMA mapped addresses
to userspace would go a long way to making no-iommu work in a logical,
usable, way.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-13 Thread Jason Gunthorpe
On Thu, Apr 13, 2023 at 02:35:57PM +, Liu, Yi L wrote:

> Today, at least QEMU will not go to do hot-reset if _INFO fails. I think
> this check may need to be relaxed if want _INFO work when there is
> no VFIO_GROUP (also no fake iommu_group).

Current qemu does not work if there is no VFIO_GROUP, so it doesn't
matter.

In cdev mode qemu should work differently, we can make the kernel
return -1 for group_id and qemu can ignore group_id for the debug
print, or we can just make it fail.

Given qemu doesn't, and can't, support no-iommu this is pretty fringe
stuff.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-13 Thread Jason Gunthorpe
On Thu, Apr 13, 2023 at 08:25:52AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, April 13, 2023 4:07 AM
> > 
> > 
> > > in which case we need c) a way to
> > > report the overall set of affected devices regardless of ownership in
> > > support of 4), BDF?
> > 
> > Yes, continue to use INFO unmodified.
> > 
> > > Are we back to replacing group-ids with dev-ids in the INFO structure,
> > > where an invalid dev-id either indicates an affected device with
> > > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
> > > is meant to indicate the overall disposition based on the availability
> > > of reset?
> > 
> > As you explore in the following this gets ugly. I prefer to keep INFO
> > unchanged and add INFO2.
> > 
> 
> INFO needs a change when VFIO_GROUP is disabled. Now it assumes
> a valid iommu group always exists:
> 
> vfio_pci_fill_devs()
> {
>   ...
>   iommu_group = iommu_group_get(>dev);
>   if (!iommu_group)
>   return -EPERM; /* Cannot reset non-isolated devices */
>   ...
> }

This can still work in a ugly way. With a INFO2 the only purpose of
INFO would be debugging, so if someone uses no-iommu, with hotreset
and misconfigures it then the only downside is they don't get the
debugging print. But we know of nothing that uses this combination
anyhow..

> with that plus BDF cap, I'm curious what is the actual purpose of
> INFO2 or why cannot requirement#3 reuse the information collected
> via existing INFO?

It can - it is just more complicated for userspace to do it, it has to
extract and match the BDFs and then run some algorithm to determine if
the opened devices cover the right set of devices in the reset group,
and it has to have some special code for no-iommu.

VS info2 would return the dev_id's and a single yes/no if the right
set is present. Kernel runs the algorithm instead of userspace, it
seems more abstract this way.

Also, if we make iommufd return a 'ioas dev_id group' as well it
composes nicely that userspace just needs one translation from dev_id.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-12 Thread Jason Gunthorpe
On Wed, Apr 12, 2023 at 10:50:45AM -0600, Alex Williamson wrote:

> > You keep going back to this, but I maintain userspace doesn't
> > care. qemu is given a list of VFIO devices to use, all it wants to
> > know is if it is allowed to use reset or not. Why should it need to
> > know groups and group_ids to get that binary signal out of the kernel?
> 
> hw/vfio/pci.c:2320
> error_report("vfio: Cannot reset device %s, "
>  "depends on group %d which is not owned.",
>  vdev->vbasedev.name, devices[i].group_id);
> 
> That creates a feedback loop where a user can take corrective action
> with actual information in hand to resolve the issue.

Which is why I listed debugging as requirement #4, and solve
requirement #4 by using the existing INFO and printing the BDF list it
returns.

> > The kicker is we don't force the user to generate a de-duplicated list
> > of devices FDs, one per group, just because.
> 
> So on one hand you're asking for simplicity, but on the other you're
> criticizing a trivial simplification that we chose to allow the user to
> pass number of group fds equal to number of devices affected so that
> the user doesn't need to take that step to de-duplicate the list.  We
> can't win.

It is not a simplification because the kernel is wired to accept only
a list of exactly that group length, no more no less. It turns into a
pointless puzzle that userspace has to solve, and it can only solve it
by knowing about groups.

If we get rid of groups we have to do something about this so
userspace doesn't need to do the calculation. That is the point of
this change.

> > > You've said that hot-reset works if the iommufd_ctx has
> > > representation from each affected group, the INFO ioctl remains as
> > > it is, which suggests that it's reporting group ID and BDF, yet only
> > > sysfs tells the user the relation between a vfio cdev and a group
> > > and we're trying to enable a pass-by-fd model for cdev where the
> > > user has no reference to a sysfs node for the device.  Show me how
> > > these pieces fit together.  
> > 
> > I prefer the version where INFO2 returns the dev_id, but info can work
> > if we do the BDF cap like you suggested to Yi
> 
> As discussed ad nauseam, dev-id is useless if an affected device is not
> already within the iommufd ctx.  

The purpose of INFO2 is to satisfy requirement #3 - which is to report
the effected devices *that are already opened*. For this dev_id is
fine. There is nothing qemu can do with devices that are outside its
iommufdctx, so it is pointless to tell it about them. It will generate
the debug print of #4 using INFO. I don't think we don't need one API here.

> > I see this problem as a few basic requirements from a qemu-like
> > application:
> > 
> >  1) Does the configuration I was given support reset right now?
> >  2) Will the configuration I was given support reset for the duration
> > of my execution?
> >  3) What groups of the devices I already have open does the reset
> > effect?
> >  4) For debugging, report to the user the full list of devices in the
> > reset group, in a way that relates back to sysfs.
> >  5) Away to trigger a reset on a group of devices
> > 
> > #1/#2 is the API I suggested here. Ask the kernel if the current
> > configuration works, and ask it to keep it working.
> 
> That is super sketchy because you're also advocating for
> opportunistically supporting reset if the instantaneous conditions
> allow is (ex. unopened devices), and going back and forth whether "ask
> it to keep working" suggests that a user is able to extend their
> granted ownership themselves.  I think both needs to be based on some
> form of granted, not requested, ownership and not opportunism.

Ok, lets give up on ownership then

> > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> 
> Where dev-id is useful for... ?  I think there's a misuse of "groups"
> in 3) above, userspace needs to know specific devices affected, thus
> BDF.

I did not mean "group of devices" to mean iommu_group, I mean "the set of
devices affected by the reset"

> > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).
> 
> We can't assume that all the affected devices are bound to vfio,
> therefore we cannot assume a vfio_device IDR exists.

So BDF is better for the debugging print.

> > #5 is adjusting the FD list in existing RESET ioctl. Remove the need
> > for userspace to specify a minimal exact list of FDs means userspace
> > doesn't need the information to figure out what that list actually
> > is. Pass a 0 length list and use iommufdctx.
> 
> "...doesn't need the information to figure out what the list actually
> is."  That's false, userspace needs the information whether it uses it
> to make a list or not,

#3 is the need of affected devices, it is already covered.

I mean that #5 should not need this, #5 is only about triggering the

Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-12 Thread Jason Gunthorpe
On Wed, Apr 12, 2023 at 07:27:43AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, April 12, 2023 8:01 AM
> > 
> > I see this problem as a few basic requirements from a qemu-like
> > application:
> > 
> >  1) Does the configuration I was given support reset right now?
> >  2) Will the configuration I was given support reset for the duration
> > of my execution?
> >  3) What groups of the devices I already have open does the reset
> > effect?
> >  4) For debugging, report to the user the full list of devices in the
> > reset group, in a way that relates back to sysfs.
> >  5) Away to trigger a reset on a group of devices
> > 
> > #1/#2 is the API I suggested here. Ask the kernel if the current
> > configuration works, and ask it to keep it working.
> > 
> > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> > 
> > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).
> 
> mdev doesn't have BDF. Of course it doesn't support hot_reset either.

It should support a reset.. Maybe idxd doesn't, but it should be part
of the SIOV model. Our SIOV devices would need it for instance.

> but it's presented to userspace as a pci device. Is it weird for a pci
> device which doesn't provide a BDF cap?

It is weird for a PCI device, but it is not weird for a VFIO
device. Leaking the physical labels out of the uAPI is not clean,
IMHO.

> from this point the vfio_device IDR# sounds more generic.

Yes, I was thinking about this for the SIOV model.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-11 Thread Jason Gunthorpe
On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote:

> > Management tools already need to understand dev_set if they want to
> > offer reliable reset support to the VMs. Same as today.
> 
> I don't think that's true. Our primary hot-reset use case is GPUs and
> subordinate functions, where the isolation and reset scope are often
> sufficiently similar to make hot-reset possible, regardless whether
> all the functions are assigned to a VM.  I don't think you'll find any
> management tools that takes reset scope into account otherwise.

When I think of "reliable reset support" I think of the management
tool offering a checkbox that says "ensure PCI function reset
availability" and if checked it will not launch the VM without a
working reset.

If the user configures a set of VFIO devices and then hopes they get
working reset, that is fine, but doesn't require any reporting of
reset groups, or iommu groups to the management layer to work.

> > > As I understand the proposal, QEMU now gets to attempt to
> > > claim ownership of the dev_set, so it opportunistically extends its
> > > ownership and may block other users from the affected devices.  
> > 
> > We can decide the policy for the kernel to accept a claim. I suggested
> > below "same as today" - it must hold all the groups within the
> > iommufd_ctx.
> 
> It must hold all the groups [that the user doesn't know about because
> it's not a formal part of the cdev API] within the iommufd_ctx?

You keep going back to this, but I maintain userspace doesn't
care. qemu is given a list of VFIO devices to use, all it wants to
know is if it is allowed to use reset or not. Why should it need to
know groups and group_ids to get that binary signal out of the kernel?

> > The simplest option for no-iommu is to require it to pass in every
> > device fd to the reset ioctl.
> 
> Which ironically is exactly how it ends up working today, each no-iommu
> device has a fake IOMMU group, so every affected device (group) needs
> to be provided.

Sure, that is probably the way forward for no-iommu. Not that anyone
uses it..

The kicker is we don't force the user to generate a de-duplicated list
of devices FDs, one per group, just because.

> > I want to re-focus on the basics of what cdev is supposed to be doing,
> > because several of the idea you suggested seem against this direction:
> > 
> >  - cdev does not have, and cannot rely on vfio_groups. We enforce this
> >by compiling all the vfio_group infrastructure out. iommu_groups
> >continue to exist.
> >
> >So converting a cdev to a vfio_group is not an allowed operation.
> 
> My only statements in this respect were towards the notion that IOMMU
> groups continue to exist.  I'm well aware of the desire to deprecate
> and remove vfio groups.

Yes

> >  - no-iommu should not have iommu_groups. We enforce this by compiling
> >out all the no-iommu vfio_group infrastructure.
> 
> This is not logically inferred from the above if IOMMU groups continue
> to exist and continue to be a basis for describing DMA ownership as
> well as "reset groups"

It is not ment to flow out of the above, it is a seperate statement. I
want the iommu_group mechanism to stop being abused outside the iommu
core code. The only thing that should be creating groups is an
attached iommu driver operating under ops->device_group().

VFIO needed this to support mdev and no-iommu. We already have mdev
free of iommu_groups, I would like no-iommu to also be free of it too,
we are very close.

That would leave POWER as the only abuser of the
iommu_group_add_device() API, and it is only doing it because it
hasn't got a proper iommu driver implementation yet. It turns out
their abuse is mislocked and maybe racy to boot :(

> >  - cdev APIs should ideally not require the user to know the group_id,
> >we should try hard to design APIs to avoid this.
> 
> This is a nuance, group_id vs group, where it's been previously
> discussed that users will need to continue to know the boundaries of a
> group for the purpose of DMA isolation and potentially IOAS
> independence should cdev/iommufd choose to tackle those topics.

Yes, group_id is a value we have no specific use for and would require
userspace to keep seperate track of. I'd prefer to rely on dev_id as
much as possible instead.

> What is the actual proposal here?

I don't know anymore, you don't seem to like this direction either...

> You've said that hot-reset works if the iommufd_ctx has
> representation from each affected group, the INFO ioctl remains as
> it is, which suggests that it's reporting group ID and BDF, yet only
> sysfs tells the user the relation between a vfio cdev and a group
> and we're trying to enable a pass-by-fd model for cdev where the
> user has no reference to a sysfs node for the device.  Show me how
> these pieces fit together.

I prefer the version where INFO2 returns the dev_id, but info can work
if we do the BDF cap like you suggested to Yi

> OTOH, if 

Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-11 Thread Jason Gunthorpe
On Tue, Apr 11, 2023 at 11:11:17AM -0600, Alex Williamson wrote:
> [Appears the list got dropped, replying to my previous message to re-add]

Wowo this got mesed up alot, mutt drops the cc when replying for some
reason. I think it is fixed up now

> > Our cdev model says that opening a cdev locks out other cdevs from
> > independent use, eg because of the group sharing. Extending this to
> > include the reset group as well seems consistent.
> 
> The DMA ownership model based on the IOMMU group is consistent with
> legacy vfio, but now you're proposing a new ownership model that
> optionally allows a user to extend their ownership, opportunistically
> lock out other users, and wreaking havoc for management utilities that
> also have no insight into dev_sets or userspace driver behavior.

I suggested below that the owership require enough open devices - so
it doesn't "extend ownership opportunistically", and there is no
havoc.

Management tools already need to understand dev_set if they want to
offer reliable reset support to the VMs. Same as today.
 
> > There is some security concern here, but that goes both ways, a 3rd
> > party should not be able to break an application that needs to use
> > this RESET and had sufficient privileges to assert an ownership.
> 
> There are clearly scenarios we have now that could break.  For example,
> today if QEMU doesn't own all the IOMMU groups for a mult-function
> device, it can't do a reset, the remaining functions are available for
> other users. 

Sure, and we can keep that with this approach.

> As I understand the proposal, QEMU now gets to attempt to
> claim ownership of the dev_set, so it opportunistically extends its
> ownership and may block other users from the affected devices.

We can decide the policy for the kernel to accept a claim. I suggested
below "same as today" - it must hold all the groups within the
iommufd_ctx.

The main point is to make this claiming operation qemu needs to do
clearer and more explicit. I view this as better than trying to guess
if it successfully made the claim by inspecting the _INFO output.

> > I'd say anyone should be able to assert RESET ownership if, like
> > today, the iommufd_ctx has all the groups of the dev_set inside
> > it. Once asserted it becomes safe against all forms of hotplug, and
> > continues to be safe even if some of the devices are closed. eg hot
> > unplugging from the VM doesn't change the availability of RESET.
> > 
> > This comes from your ask that qemu know clearly if RESET works, and it
> > doesn't change while qemu is running. This seems stronger and clearer
> > than the current implicit scheme. It also doesn't require usespace to
> > do any calculations with groups or BDFs to figure out of RESET is
> > available, kernel confirms it directly.
> 
> As above, clarity and predictability seem lacking in this proposal.
> With the current scheme, the ownership of the affected devices is
> implied if they exist within an owned group, but the strength of that
> ownership is clear.  

Same logic holds here

Ownership is claimed same as today by having all groups representated
in the iommufd_ctx. This seems just as clear as today.

> > > seems this proposal essentially extends the ownership model to the
> > > greater of the dev_set or iommu group, apparently neither of which
> > > are explicitly exposed to the user in the cdev API.  
> > 
> > IIRC the group id can be learned from sysfs before opening the cdev
> > file. Something like /sys/class/vfio/XX/../../iommu_group
> 
> And in the passed cdev fd model... ?

IMHO we should try to avoid needing to expose group_id specifically to
userspace. We are missing a way to learn the "same ioas" restriction
in iommufd, and it should provide that directly based on dev_ids.

Otherwise if we really really need group_id then iommufd should
provide an ioctl to get it. Let's find a good reason first

> > We should also have an iommufd ioctl to report the "same ioas"
> > groupings of dev_ids to make it easy on userspace. I haven't checked
> > to see what the current qemu patches are doing with this..
> 
> Seems we're ignoring that no-iommu doesn't have a valid iommufd.

no-iommu doesn't and shouldn't have iommu_groups either. It also
doesn't have an IOAS so querying for same-IOAS is not necessary.

The simplest option for no-iommu is to require it to pass in every
device fd to the reset ioctl.

> > > How does a user determine when devices cannot be used independently
> > > in the cdev API?   
> > 
> > We have this problem right now. The only way to learn the reset group
> > is to call the _INFO ioctl. We could add a sysfs "pci_reset_group"
> > under /sys/class/vfio/XX/ if something needs it earlier.
> 
> For all the complaints about complexity, now we're asking management
> tools to not only take into account IOMMU groups, but also reset
> groups, and some inferred knowledge about the application and devices
> to speculate whether reset group ownership is taken by a 

Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-11 Thread Jason Gunthorpe
On Sun, Apr 09, 2023 at 07:29:51AM -0600, Alex Williamson wrote:

> > struct vfio_device_info_cap_pci_bdf {
> >  struct vfio_info_cap_header header;
> >  __u32   group_id;
> >  __u16   segment;
> >  __u8bus;
> >  __u8devfn; /* Use PCI_SLOT/PCI_FUNC */
> > };
> > 
> 
> Group-id and bdf should be separate capabilities, all device should
> report a group-id capability and only PCI devices a bdf capability.

Group should be reported by iommufd using a generic ioctl, and not be
part of VFIO.

This should report BDF only and only work for PCI.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-11 Thread Jason Gunthorpe
On Fri, Apr 07, 2023 at 03:07:21PM -0600, Alex Williamson wrote:

> I think we need to revisit the question of why allocating an IOMMU
> group for a no-iommu device is exclusive to the vfio group support.

One of the points of this effort is to remove the co-mingling of iommu
and VFIO so much. We should not create the fake iommu groups for
no-iommu.

The _INFO API reporting the group is not a good reason to wreck this
clean separation.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-11 Thread Jason Gunthorpe
On Thu, Apr 06, 2023 at 11:53:47AM -0600, Alex Williamson wrote:

> Where whether a device is opened is subject to change outside of the
> user's control.  This essentially allows the user to perform hot-resets
> of devices outside of their ownership so long as the device is not
> used elsewhere, versus the current requirement that the user own all the
> affected groups, which implies device ownership.  It's not been
> justified why this feature needs to exist, imo.

The cdev API doesn't have the notion that owning a group means you
"own" some collection of devices. It still happens as a side effect,
but it isn't obviously part of the API. I'm really loath to
re-introduce that group-based concept just for this. We are trying
reduce the group API surface.

How about a different direction.

We add a new uAPI for cdev mode that is "take ownership of the reset
group". Maybe it can be a flag in during bind.

When requested vfio will ensure that every device in the reset group
is only bound to this iommufd_ctx or left closed. Now and in the
future. Since no-iommu has no iommufd_ctx this means we can open only
one device in the reset group.

With this flag RESET is guaranteed to always work by definition.

We continue with the zero-length FD, but we can just replace the
security checks with a check if we are in reset group ownership mode.

_INFO is unchanged.

We decide if we add a new IOCTL to return the BDF so the existing
_INFO can get back to the dev_id or a new IOCTL that returns the
dev_id list of the reset group.

Userspace is required to figure out the extent of the reset, but we
don't require that userspace prove to the kernel it did this when
requesting the reset.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-05 Thread Jason Gunthorpe
On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote:

> > > QEMU can make a policy decision today because the kernel provides a
> > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > hot-reset is all but guaranteed to work.
> > 
> > And we don't change that with cdev. If qemu wants to make the policy
> > decision it keeps using the exact same _INFO interface to make that
> > decision same it has always made.
> > 
> > We weaken the actual reset action to only consider the security side.
> > 
> > Applications that want this exclusive reset group policy simply must
> > check it on their own. It is a reasonable API design.
> 
> I disagree, as I've argued before, the info ioctl becomes so weak and
> effectively arbitrary from a user perspective at being able to predict
> whether the hot-reset ioctl works that it becomes useless, diminishing
> the entire hot-reset info/execute API.

reset should be strictly more permissive than INFO. If INFO predicts
reset is permitted then reset should succeed.

We don't change INFO so it cannot "becomes so weak"  ??

We don't care about the cases where INFO says it will not succeed but
reset does (temporarily) succeed.

I don't get what argument you are trying to make or what you think is
diminished..

Again, userspace calls INFO, if info says yes then reset *always
works*, exactly just like today.

Userspace will call reset with a 0 length FD list and it uses a
security only check that is strictly more permissive than what
get_info will return. So the new check is simple in the kernel and
always works in the cases we need it to work.

What is getting things into trouble is insisting that RESET have
additional restrictions beyond the minimum checks required for
security.

> > I don't view it as a loophole, it is flexability to use the API in a
> > way that is different from what qemu wants - eg an app like dpdk may
> > be willing to tolerate a reset group that becomes unavailable after
> > startup. Who knows, why should we force this in the kernel?
> 
> Because look at all the problems it's causing to try to introduce these
> loopholes without also introducing subtle bugs.

These problems are coming from tring to do this integrated version,
not from my approach!

AFAICT there was nothing wrong with my original plan of using the
empty fd list for reset. What Yi has here is some mashup of what you
and I both suggested.

> > Remember the reason we started doing this is because we don't
> > have easy access to the BDF anymore.
> 
> We don't need it, the info ioctl provides the groups, the group
> association can be learned from the DEVICE_GET_INFO ioctl, the
> hot-reset ioctl only requires a single representative fd per affected
> group.  dev-ids not required.

I'm not talking about triggering the ioctl.

I'm talking about whatever else qemu needs to do so that the VM is
aware of the reset groups device-by-device on it's side so nested VFIO
in the VM reflects the same data as the hypervisor. Maybe it doesn't
do this right now, but the kernel API should continue to provide the
data.

> > I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> > return the dev_ids.
> 
> I don't see any justification for this.  We could add another PCI
> specific DEVICE_GET_INFO capability to report the bdf if we really need
> it, but reporting the group seems sufficient for this use case.

What I imagine is a single new ioctl 'get reset group 2' or something.
It returns a list of dev_ids in the reset group. It has an output flag
if the reset is reliable. This is the only ioctl user space needs to
call.

The reliable test is done by simply calling the ioctl and throwing
away the dev ids. The mapping of the VM's reset groups is done by
processing the dev_ids to vRIDs and flowing that into the VM somehow.

We don't expose group_ids, and we don't expose BDF. It is much simpler
and cleaner to use.

A BDF DEVICE_GET_INFO and the existing reset INFO will encode the same
data too, it is just not as elegant and requires userspace to do a lot
more work to keep track of the 3 different identifiers.

> > This looks like a very complex uapi compared to the empty list option,
> > but it seems like it would work.
>
> It's the same API that we have now.  What's complex is trying to figure
> out all the subtle side-effects from the loopholes that are being
> proposed in this series.  Thanks,

I might agree with you if we weren't now going backwards - 
ideas didn't work out and Yi has to throw stuff away. :(

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-05 Thread Jason Gunthorpe
On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote:
> Usability needs to be a consideration as well.  An interface where the
> result is effectively arbitrary from a user perspective because the
> kernel is solely focused on whether the operation is allowed,
> evaluating constraints that the user is unaware of and cannot control,
> is unusable.

Considering this API is only invoked by qemu we might be overdoing
this usability and 'no shoot in foot' view.

> > This is a good point that qemu needs to make a policy decision if it
> > is happy about the VFIO configuration - but that is a policy decision
> > that should not become entangled with the kernel's security checks.
> > 
> > Today qemu can make this policy choice the same way it does right now
> > - call _INFO and check the group_ids. It gets the exact same outcome
> > as today. We already discussed that we need to expose the group ID
> > through an ioctl someplace.
> 
> QEMU can make a policy decision today because the kernel provides a
> sufficiently reliable interface, ie. based on the set of owned groups, a
> hot-reset is all but guaranteed to work.  

And we don't change that with cdev. If qemu wants to make the policy
decision it keeps using the exact same _INFO interface to make that
decision same it has always made.

We weaken the actual reset action to only consider the security side.

Applications that want this exclusive reset group policy simply must
check it on their own. It is a reasonable API design.

> > If this is too awkward we could add a query to the kernel if the cdev
> > is "reset exclusive" - eg the iommufd covers all the groups that span
> > the reset set.
> 
> That's essentially what we have if there are valid dev-ids for each
> affected device in the info ioctl.

If you have dev-ids for everything, yes. If you don't, then you can't
make the same policy choice using a dev-id interface.

> I don't think it helps the user experience to create loopholes where
> the hot-reset ioctl can still work in spite of those missing
> devices.

I disagree. The easy straightforward design is that the reset ioctl
works if the process has security permissions. Mixing a policy check
into the kernel on this path is creating complexity we don't really
need.

I don't view it as a loophole, it is flexability to use the API in a
way that is different from what qemu wants - eg an app like dpdk may
be willing to tolerate a reset group that becomes unavailable after
startup. Who knows, why should we force this in the kernel?

> For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> capability chains, we could add a capability that reports the group ID
> for the device.  

I was going to put that in an iommufd ioctl so it works with VDPA too,
but sure, lets assume we can get the group ID from a cdev fd.

> The hot-reset info ioctl remains as it is today, reporting group-ids
> and bdfs.

Sure, but userspace still needs to know how to map the reset sets into
dev-ids. Remember the reason we started doing this is because we don't
have easy access to the BDF anymore.

I like leaving this ioctl alone, lets go back to a dedicated ioctl to
return the dev_ids.

> The hot-reset ioctl itself is modified to transparently
> support either group fds or device fds.  The user can now map cdevs
> to group-ids and therefore follow the same rules as groups,
> providing at least one representative device fd for each group.

This looks like a very complex uapi compared to the empty list option,
but it seems like it would work.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-05 Thread Jason Gunthorpe
On Wed, Apr 05, 2023 at 10:52:15AM -0600, Alex Williamson wrote:
> On Wed, 5 Apr 2023 13:37:05 -0300
> Jason Gunthorpe  wrote:
> 
> > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:
> > 
> > > But that kind of brings to light the question of what does the user do
> > > when they encounter this situation.  
> > 
> > What does it do now when it encounters a group_id it doesn't
> > understand? Userspace already doesn't know if the foreign group is
> > open or not, right?
> 
> It's simple, there is currently no screwiness around opened devices.
> If the caller doesn't own all the groups mapping to the affected
> devices, hot-reset is not available.

That still has nasty edge cases. If the reset group spans beyond a
single iommu group you end up with qemu being unable to operate reset
at all, and it is unfixable from an API perspective as we can't pass
in groups that VFIO isn't going to use.

I think you are right, the fact we'd have to return -1 dev_ids to this
modified API is pretty damaging, it doesn't seem like a good
direction.

> This leads to scenarios where the info ioctl indicates a hot-reset is
> initially available, perhaps only because one of the affected devices
> was not opened at the time, and now it fails when QEMU actually tries
> to use it.

I would like it if the APIs toward the kernel were only about the
kernel's security apparatus. It is makes it easier to reason about the
kernel side and gives nice simple well defined APIs.

This is a good point that qemu needs to make a policy decision if it
is happy about the VFIO configuration - but that is a policy decision
that should not become entangled with the kernel's security checks.

Today qemu can make this policy choice the same way it does right now
- call _INFO and check the group_ids. It gets the exact same outcome
as today. We already discussed that we need to expose the group ID
through an ioctl someplace.

If this is too awkward we could add a query to the kernel if the cdev
is "reset exclusive" - eg the iommufd covers all the groups that span
the reset set.

Jason


Re: [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET

2023-04-05 Thread Jason Gunthorpe
On Wed, Apr 05, 2023 at 09:36:46AM -0600, Alex Williamson wrote:

> > If we don't support singletion dev_set hot-reset, noiommu devices in cdev
> > path shall fail the hot-reset if empty-fd array is provided. But we may just
> > document that empty-fd array does not work for noiommu. User should
> > use the device fd array.
> 
> I don't see any replies to my comment on 08/12 where I again question
> why we need an empty array option.

I was pressing we'd do empty-fd only and not do the device fd array at
all since it is such an ugly fit for the use cases we have.

But it is such a minor detail if you don't want it then take it out.

> This singleton dev-set notion seems equally unjustified.  Do we just
> need to deal with hot-reset being unsupported for no-iommu devices
> with iommufd?

It was to support no-iommu, if you want to de-support it then it can
go away too. AFAIK dpdk doesn't use this feature and it is the only
user we know of that has support for no-iommu so it is probably safe.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-05 Thread Jason Gunthorpe
On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:

> But that kind of brings to light the question of what does the user do
> when they encounter this situation.

What does it do now when it encounters a group_id it doesn't
understand? Userspace already doesn't know if the foreign group is
open or not, right?

> reset can complete.  If the device is opened by a different user, the
> reset is blocked.  The only logical conclusion is that the user should
> try the reset regardless of the result of the info ioctl, which the

IMHO my suggested version is still the overall saner uAPI.

An info that basically returns success/fail if reset is security
authorized and information about the reset groupings.

Actual reset follows the returned groupings automatically.

Easy for qemu. Call the info at startup to confirm reset can be
emulated, use the returned information to propogate the reset groups
to the guest. Trigger the reset with no fuss when the guest asks for
it.

Less weird corner cases.

Jason


Re: [Intel-gfx] [PATCH v9 16/25] iommufd/device: Add iommufd_access_detach() API

2023-04-05 Thread Jason Gunthorpe
On Wed, Apr 05, 2023 at 02:10:19PM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, April 5, 2023 7:56 PM
> > 
> > On Tue, Apr 04, 2023 at 04:45:12PM -0600, Alex Williamson wrote:
> > > On Sat,  1 Apr 2023 08:18:24 -0700
> > > Yi Liu  wrote:
> > >
> > > > From: Nicolin Chen 
> > > >
> > > > Previously, the detach routine is only done by the destroy(). And it was
> > > > called by vfio_iommufd_emulated_unbind() when the device runs close(), 
> > > > so
> > > > all the mappings in iopt were cleaned in that setup, when the call trace
> > > > reaches this detach() routine.
> > > >
> > > > Now, there's a need of a detach uAPI, meaning that it does not only need
> > > > a new iommufd_access_detach() API, but also requires 
> > > > access->ops->unmap()
> > > > call as a cleanup. So add one.
> > > >
> > > > However, leaving that unprotected can introduce some potential of a race
> > > > condition during the pin_/unpin_pages() call, where access->ioas->iopt 
> > > > is
> > > > getting referenced. So, add an ioas_lock to protect the context of iopt
> > > > referencings.
> > > >
> > > > Also, to allow the iommufd_access_unpin_pages() callback to happen via
> > > > this unmap() call, add an ioas_unpin pointer, so the unpin routine won't
> > > > be affected by the "access->ioas = NULL" trick.
> > > >
> > > > Reviewed-by: Kevin Tian 
> > > > Tested-by: Terrence Xu 
> > > > Tested-by: Yanting Jiang 
> > > > Signed-off-by: Nicolin Chen 
> > > > Signed-off-by: Yi Liu 
> > > > ---
> > > >  drivers/iommu/iommufd/device.c  | 76 +++--
> > > >  drivers/iommu/iommufd/iommufd_private.h |  2 +
> > > >  include/linux/iommufd.h |  1 +
> > > >  3 files changed, 74 insertions(+), 5 deletions(-)
> > >
> > > Does this need to go in via iommufd first?  There seems to be quite a
> > > bit of churn in iommufd/device.c vs the vfio_mdev_ops branch (ie. it
> > > doesn't apply). Thanks,
> > 
> > I think it is best to stay with this series, Yi has to rebase it
> 
> The rebased version is here. Shall I resend a version which is rebased on
> top of vfio_mdev_ops?
> 
> https://github.com/yiliu1765/iommufd/commit/d3d8f65c82fe2ca2a7b1a635f4b40b2a0971daa9

When you post the v10 it should be based on top of the vfio_mdev_ops
and the hot reset series.

Jason


Re: [Intel-gfx] [PATCH v9 16/25] iommufd/device: Add iommufd_access_detach() API

2023-04-05 Thread Jason Gunthorpe
On Tue, Apr 04, 2023 at 04:45:12PM -0600, Alex Williamson wrote:
> On Sat,  1 Apr 2023 08:18:24 -0700
> Yi Liu  wrote:
> 
> > From: Nicolin Chen 
> > 
> > Previously, the detach routine is only done by the destroy(). And it was
> > called by vfio_iommufd_emulated_unbind() when the device runs close(), so
> > all the mappings in iopt were cleaned in that setup, when the call trace
> > reaches this detach() routine.
> > 
> > Now, there's a need of a detach uAPI, meaning that it does not only need
> > a new iommufd_access_detach() API, but also requires access->ops->unmap()
> > call as a cleanup. So add one.
> > 
> > However, leaving that unprotected can introduce some potential of a race
> > condition during the pin_/unpin_pages() call, where access->ioas->iopt is
> > getting referenced. So, add an ioas_lock to protect the context of iopt
> > referencings.
> > 
> > Also, to allow the iommufd_access_unpin_pages() callback to happen via
> > this unmap() call, add an ioas_unpin pointer, so the unpin routine won't
> > be affected by the "access->ioas = NULL" trick.
> > 
> > Reviewed-by: Kevin Tian 
> > Tested-by: Terrence Xu 
> > Tested-by: Yanting Jiang 
> > Signed-off-by: Nicolin Chen 
> > Signed-off-by: Yi Liu 
> > ---
> >  drivers/iommu/iommufd/device.c  | 76 +++--
> >  drivers/iommu/iommufd/iommufd_private.h |  2 +
> >  include/linux/iommufd.h |  1 +
> >  3 files changed, 74 insertions(+), 5 deletions(-)
> 
> Does this need to go in via iommufd first?  There seems to be quite a
> bit of churn in iommufd/device.c vs the vfio_mdev_ops branch (ie. it
> doesn't apply). Thanks,

I think it is best to stay with this series, Yi has to rebase it

Jason


Re: [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset

2023-04-05 Thread Jason Gunthorpe
On Tue, Apr 04, 2023 at 05:59:01PM +0200, Eric Auger wrote:

> > but the hot reset shall fail as the group is not owned by the user.
> 
> sure it shall but I fail to understand if the reset fails or the device
> plug is somehow delayed until the reset completes.

It is just racy today - vfio_pci_dev_set_resettable() doesn't hold any
locks across the pci_walk_bus() check to prevent hot plug in while it is
working on the reset.

Jason


Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

2023-04-03 Thread Jason Gunthorpe
On Mon, Apr 03, 2023 at 09:32:18AM -0600, Alex Williamson wrote:
> > yes, this series is applied on [1]. I put the [1], this series and cdev 
> > series
> > in https://github.com/yiliu1765/iommufd/commits/vfio_device_cdev_v9.
> > 
> > Jason has taken [1] in the below branch. It is based on rc1. So I hesitated
> > to apply this series and cdev series on top of it. Maybe I should have done
> > it to make life easier. 
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next
> 
> Seems like it must be in the vfio_mdev_ops branch which has not been
> pushed aside from the merge back to for-next.  Jason?  Thanks,

Yeah, I didn't think we'd need it until we got to the cdev series, let
me do the steps..

Jason


Re: [Intel-gfx] [PATCH v3 2/6] iommufd: Create access in vfio_iommufd_emulated_bind()

2023-03-31 Thread Jason Gunthorpe
On Fri, Mar 31, 2023 at 08:16:16AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, March 30, 2023 4:00 AM
> > 
> > On Mon, Mar 27, 2023 at 02:33:47AM -0700, Yi Liu wrote:
> > > @@ -494,6 +479,30 @@ void iommufd_access_destroy(struct
> > iommufd_access *access)
> > >  }
> > >  EXPORT_SYMBOL_NS_GPL(iommufd_access_destroy, IOMMUFD);
> > >
> > > +int iommufd_access_attach(struct iommufd_access *access, u32 ioas_id)
> > > +{
> > > + struct iommufd_ioas *new_ioas;
> > > + int rc = 0;
> > > +
> > > + if (access->ioas != NULL && access->ioas->obj.id != ioas_id)
> > > + return -EINVAL;
> > 
> > This should just be
> > 
> >if (access->ioas)
> > return -EINVAL;
> 
> the physical path has the same check:
> 
>   if (idev->igroup->hwpt != NULL && idev->igroup->hwpt != hwpt) {
>   rc = -EINVAL;
>   goto err_unlock;
>   }
> 
> If we change here then that one should be changed too.

No, that one is checking if the another device attached to the same
group is a compatible hwpt so we succeed to attach the new device

Jason


Re: [Intel-gfx] [PATCH v8 11/24] vfio: Make vfio_device_first_open() to accept NULL iommufd for noiommu

2023-03-30 Thread Jason Gunthorpe
On Mon, Mar 27, 2023 at 02:40:34AM -0700, Yi Liu wrote:
> vfio_device_first_open() requires the caller to provide either a valid
> iommufd (the group path in iommufd compat mode) or a valid container
> (the group path in legacy container mode). As preparation for noiommu
> support in device cdev path it's extended to allow both being NULL. The
> caller is expected to verify noiommu permission before passing NULL
> to this function.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/group.c |  8 
>  drivers/vfio/vfio.h  |  1 +
>  drivers/vfio/vfio_main.c | 12 
>  3 files changed, 17 insertions(+), 4 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v8 10/24] vfio: Make vfio_device_open() single open for device cdev path

2023-03-30 Thread Jason Gunthorpe
On Mon, Mar 27, 2023 at 02:40:33AM -0700, Yi Liu wrote:
> VFIO group has historically allowed multi-open of the device FD. This
> was made secure because the "open" was executed via an ioctl to the
> group FD which is itself only single open.
> 
> However, no known use of multiple device FDs today. It is kind of a
> strange thing to do because new device FDs can naturally be created
> via dup().
> 
> When we implement the new device uAPI (only used in cdev path) there is
> no natural way to allow the device itself from being multi-opened in a
> secure manner. Without the group FD we cannot prove the security context
> of the opener.
> 
> Thus, when moving to the new uAPI we block the ability of opening
> a device multiple times. Given old group path still allows it we store
> a vfio_group pointer in struct vfio_device_file to differentiate.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/group.c | 2 ++
>  drivers/vfio/vfio.h  | 2 ++
>  drivers/vfio/vfio_main.c | 7 +++
>  3 files changed, 11 insertions(+)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v8 03/24] vfio: Remove vfio_file_is_group()

2023-03-30 Thread Jason Gunthorpe
On Mon, Mar 27, 2023 at 02:40:26AM -0700, Yi Liu wrote:
> since no user of vfio_file_is_group() now.
> 
> Reviewed-by: Kevin Tian 
> Tested-by: Terrence Xu 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/group.c | 10 --
>  include/linux/vfio.h |  1 -
>  2 files changed, 11 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v2 09/10] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl

2023-03-30 Thread Jason Gunthorpe
On Mon, Mar 27, 2023 at 02:34:57AM -0700, Yi Liu wrote:
> Now user can also provide an array of device fds as a 3rd method to verify
> the reset ownership. It's not useful at this point when the device fds are
> acquired via group fds. But it's necessary when moving to device cdev which
> allows the user to directly acquire device fds by skipping group. In that
> case this method can be used as a last resort when the preferred iommufd
> verification doesn't work, e.g. in noiommu usages.
> 
> Clarify it in uAPI.
> 
> Reviewed-by: Kevin Tian 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 9 +
>  include/uapi/linux/vfio.h| 3 ++-
>  2 files changed, 7 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v2 07/10] vfio: Accpet device file from vfio PCI hot reset path

2023-03-30 Thread Jason Gunthorpe
On Mon, Mar 27, 2023 at 02:34:55AM -0700, Yi Liu wrote:
> This extends both vfio_file_is_valid() and vfio_file_has_dev() to accept
> device file from the vfio PCI hot reset.
> 
> Reviewed-by: Kevin Tian 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/vfio_main.c | 23 +++
>  1 file changed, 19 insertions(+), 4 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v2 06/10] vfio: Refine vfio file kAPIs for vfio PCI hot reset

2023-03-30 Thread Jason Gunthorpe
On Mon, Mar 27, 2023 at 02:34:54AM -0700, Yi Liu wrote:
> This prepares vfio core to accept vfio device file from the vfio PCI
> hot reset path. vfio_file_is_group() is still kept for KVM usage.
> 
> Reviewed-by: Kevin Tian 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/group.c | 32 ++--
>  drivers/vfio/pci/vfio_pci_core.c |  4 ++--
>  drivers/vfio/vfio.h  |  2 ++
>  drivers/vfio/vfio_main.c | 29 +
>  include/linux/vfio.h |  1 +
>  5 files changed, 48 insertions(+), 20 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v2 05/10] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET

2023-03-30 Thread Jason Gunthorpe
On Mon, Mar 27, 2023 at 02:34:53AM -0700, Yi Liu wrote:
> as an alternative method for ownership check when iommufd is used. In
> this case all opened devices in the affected dev_set are verified to
> be bound to a same valid iommufd value to allow reset. It's simpler
> and faster as user does not need to pass a set of fds and kernel no
> need to search the device within the given fds.
> 
> a device in noiommu mode doesn't have a valid iommufd, so this method
> should not be used in a dev_set which contains multiple devices and one
> of them is in noiommu. The only allowed noiommu scenario is that the
> calling device is noiommu and it's in a singleton dev_set.
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Jason Gunthorpe 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 42 +++-
>  include/uapi/linux/vfio.h|  9 ++-
>  2 files changed, 44 insertions(+), 7 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Intel-gfx] [PATCH v2 04/10] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device

2023-03-30 Thread Jason Gunthorpe
On Mon, Mar 27, 2023 at 02:34:52AM -0700, Yi Liu wrote:
> This is needed by the vfio pci driver to report affected devices in the
> hot reset for a given device.
> 
> Signed-off-by: Yi Liu 
> ---
>  drivers/iommu/iommufd/device.c | 12 
>  drivers/vfio/iommufd.c | 16 
>  include/linux/iommufd.h|  3 +++
>  include/linux/vfio.h   | 13 +
>  4 files changed, 44 insertions(+)

Reviewed-by: Jason Gunthorpe 

Jason


  1   2   3   4   5   >