On Fri, Dec 26, 2025 at 7:39 PM Eugenio Perez Martin
<[email protected]> wrote:
>
> On Thu, Dec 25, 2025 at 3:23 AM Jason Wang <[email protected]> wrote:
> >
> > On Wed, Dec 24, 2025 at 3:39 PM Eugenio Perez Martin
> > <[email protected]> wrote:
> > >
> > > On Wed, Dec 24, 2025 at 1:20 AM Jason Wang <[email protected]> wrote:
> > > >
> > > > On Tue, Dec 23, 2025 at 9:16 PM Eugenio Perez Martin
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Tue, Dec 23, 2025 at 2:11 AM Jason Wang <[email protected]>
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Dec 18, 2025 at 9:11 PM Eugenio Perez Martin
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Dec 18, 2025 at 7:45 AM Jason Wang <[email protected]>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez
> > > > > > > > <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Add support for assigning Address Space Identifiers (ASIDs)
> > > > > > > > > to each VQ group. This enables mapping each group into a
> > > > > > > > > distinct memory space.
> > > > > > > > >
> > > > > > > > > The vq group to ASID association is now protected by an
> > > > > > > > > rwlock. But the mutex domain_lock keeps protecting the
> > > > > > > > > domains of all ASIDs, as some operations, like the ones
> > > > > > > > > related to the bounce buffer size, still require locking
> > > > > > > > > all the ASIDs.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Eugenio Pérez <[email protected]>
> > > > > > > > >
> > > > > > > > > ---
> > > > > > > > > Future improvements on top could include performance
> > > > > > > > > optimizations like moving to per-ASID locks, or hardening
> > > > > > > > > by tracking the ASID or ASID hashes in unused bits of the
> > > > > > > > > DMA address.
> > > > > > > > >
> > > > > > > > > Tested virtio_vdpa by manually adding two threads in
> > > > > > > > > vduse_set_status: one of them modifies the vq group 0 ASID
> > > > > > > > > and the other one maps and unmaps memory continuously.
> > > > > > > > > After a while, the two threads stop and the usual work
> > > > > > > > > continues.
> > > > > > > > >
> > > > > > > > > Tested with vhost_vdpa by migrating a VM while running ping
> > > > > > > > > on OVS+VDUSE. A few workarounds were needed in some parts:
> > > > > > > > > * Do not enable CVQ before data vqs in QEMU, as VDUSE does
> > > > > > > > >   not forward the enable message to the userland device.
> > > > > > > > >   This will be solved in the future.
> > > > > > > > > * Share the suspended state between all vhost devices in
> > > > > > > > >   QEMU:
> > > > > > > > >   https://lists.nongnu.org/archive/html/qemu-devel/2025-11/msg02947.html
> > > > > > > > > * Implement a fake VDUSE suspend vdpa operation callback
> > > > > > > > >   that always returns true in the kernel. DPDK suspends the
> > > > > > > > >   device at the first GET_VRING_BASE.
> > > > > > > > > * Remove the CVQ blocker in ASID.
> > > > > > > > >
> > > > > > > > > ---
> > > > > > > > > v10:
> > > > > > > > > * Back to the rwlock version so stronger locks are used.
> > > > > > > > > * Take allocations out of the rwlock.
> > > > > > > > > * Forbid changing the ASID of a vq group after DRIVER_OK
> > > > > > > > >   (Jason)
> > > > > > > > > * Remove the redundant re-fetch of the domain variable in
> > > > > > > > >   vduse_dev_max_mapping_size (Yongji).
> > > > > > > > > * Remove the unused vdev definition in the vdpa map_ops
> > > > > > > > >   callbacks (kernel test robot).
> > > > > > > > >
> > > > > > > > > v9:
> > > > > > > > > * Replace mutex with rwlock, as the vdpa map_ops can run
> > > > > > > > >   from atomic context.
> > > > > > > > >
> > > > > > > > > v8:
> > > > > > > > > * Revert the mutex to rwlock change, it needs proper
> > > > > > > > >   profiling to justify it.
> > > > > > > > >
> > > > > > > > > v7:
> > > > > > > > > * Take the write lock in the error path (Jason).
> > > > > > > > >
> > > > > > > > > v6:
> > > > > > > > > * Make vdpa_dev_add use gotos for error handling (MST).
> > > > > > > > > * s/(dev->api_version < 1) ?/(dev->api_version <
> > > > > > > > >   VDUSE_API_VERSION_1) ?/ (MST).
> > > > > > > > > * Fix struct name not matching in the doc.
> > > > > > > > >
> > > > > > > > > v5:
> > > > > > > > > * Properly return errno if copy_to_user returns >0 in the
> > > > > > > > >   VDUSE_IOTLB_GET_FD ioctl (Jason).
> > > > > > > > > * Properly set the domain bounce size to divide equally
> > > > > > > > >   between nas (Jason).
> > > > > > > > > * Exclude the "padding" member from the members that are
> > > > > > > > >   only valid >V1 in vduse_dev_request.
> > > > > > > > >
> > > > > > > > > v4:
> > > > > > > > > * Divide each domain bounce size between the device bounce
> > > > > > > > >   size (Jason).
> > > > > > > > > * Revert the unneeded addr = NULL assignment (Jason)
> > > > > > > > > * Change if (x && (y || z)) return to if (x) { if (y)
> > > > > > > > >   return; if (z) return; } (Jason)
> > > > > > > > > * Change a bad multiline comment, using the @ character
> > > > > > > > >   instead of * (Jason).
> > > > > > > > > * Consider config->nas == 0 as a failure (Jason).
> > > > > > > > >
> > > > > > > > > v3:
> > > > > > > > > * Get the vduse domain through the vduse_as in the map
> > > > > > > > >   functions (Jason).
> > > > > > > > > * Squash with the patch creating the vduse_as struct
> > > > > > > > >   (Jason).
> > > > > > > > > * Create VDUSE_DEV_MAX_AS instead of comparing against a
> > > > > > > > >   magic number (Jason)
> > > > > > > > >
> > > > > > > > > v2:
> > > > > > > > > * Convert the use of mutex to rwlock.
> > > > > > > > >
> > > > > > > > > RFC v3:
> > > > > > > > > * Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason). It was
> > > > > > > > >   set to a lower value to reduce memory consumption, but
> > > > > > > > >   vqs are already limited to that value and userspace VDUSE
> > > > > > > > >   is able to allocate that many vqs.
> > > > > > > > > * Remove the TODO about merging the VDUSE_IOTLB_GET_FD
> > > > > > > > >   ioctl with VDUSE_IOTLB_GET_INFO.
> > > > > > > > > * Use array_index_nospec in the VDUSE device ioctls.
> > > > > > > > > * Embed vduse_iotlb_entry into vduse_iotlb_entry_v2.
> > > > > > > > > * Move the umem mutex to the asid struct so there is no
> > > > > > > > >   contention between ASIDs.
> > > > > > > > >
> > > > > > > > > RFC v2:
> > > > > > > > > * Make the iotlb entry the last one of vduse_iotlb_entry_v2
> > > > > > > > >   so the first part of the struct is the same.
> > > > > > > > > ---
> > > > > > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 366 +++++++++++++++++++----------
> > > > > > > > >  include/uapi/linux/vduse.h         |  53 ++++-
> > > > > > > > >  2 files changed, 295 insertions(+), 124 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > > index 767abcb7e375..786ab2378825 100644
> > > > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > > @@ -41,6 +41,7 @@
> > > > > > > > >
> > > > > > > > >  #define VDUSE_DEV_MAX (1U << MINORBITS)
> > > > > > > > >  #define VDUSE_DEV_MAX_GROUPS 0xffff
> > > > > > > > > +#define VDUSE_DEV_MAX_AS 0xffff
> > > > > > > > >  #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
> > > > > > > > >  #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
> > > > > > > > >  #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> > > > > > > > > @@ -86,7 +87,15 @@ struct vduse_umem {
> > > > > > > > >         struct mm_struct *mm;
> > > > > > > > >  };
> > > > > > > > >
> > > > > > > > > +struct vduse_as {
> > > > > > > > > +       struct vduse_iova_domain *domain;
> > > > > > > > > +       struct vduse_umem *umem;
> > > > > > > > > +       struct mutex mem_lock;
> > > > > > > >
> > > > > > > > Not related to this patch, but if I am not wrong we have a
> > > > > > > > 1:1 mapping between domain and as. If this is true, can we
> > > > > > > > use bounce_lock instead of a new mem_lock? I see mem_lock is
> > > > > > > > only used for synchronizing umem reg/dereg, which is already
> > > > > > > > synchronized with the domain rwlock.
> > > > > > >
> > > > > > > I think you're right, but they work at different levels at the
> > > > > > > moment. The mem_lock is at vduse_dev and also protects the umem
> > > > > > > pointer, and bounce_lock is at iova_domain.c. Maybe the right
> > > > > > > thing to do is to move umem into iova_domain. Yongji Xie, what
> > > > > > > do you think?
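If we end up going that way, here is a rough, untested sketch of the
direction, just to make it concrete. It assumes bounce_lock stays the
rwlock it is today in iova_domain.c, and vduse_domain_set_umem() is a
made-up name:

/*
 * Sketch only, not part of this series: let the domain own the umem
 * pointer so the existing bounce_lock serializes umem registration
 * against the bounce-map operations it already protects. The page
 * pinning must stay outside the lock since it may sleep.
 */
static int vduse_domain_set_umem(struct vduse_iova_domain *domain,
                                 struct vduse_umem *umem)
{
        int ret = 0;

        write_lock(&domain->bounce_lock);
        if (domain->umem)
                ret = -EEXIST;
        else
                domain->umem = umem;
        write_unlock(&domain->bounce_lock);

        return ret;
}

The caller would still pin the pages before taking the lock, the same
way the registration path does today.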
> > > > > > > > > +};
> > > > > > > > > +
> > > > > > > > >  struct vduse_vq_group {
> > > > > > > > > +       rwlock_t as_lock;
> > > > > > > > > +       struct vduse_as *as; /* Protected by as_lock */
> > > > > > > > >         struct vduse_dev *dev;
> > > > > > > > >  };
> > > > > > > > >
> > > > > > > > > @@ -94,7 +103,7 @@ struct vduse_dev {
> > > > > > > > >         struct vduse_vdpa *vdev;
> > > > > > > > >         struct device *dev;
> > > > > > > > >         struct vduse_virtqueue **vqs;
> > > > > > > > > -       struct vduse_iova_domain *domain;
> > > > > > > > > +       struct vduse_as *as;
> > > > > > > > >         char *name;
> > > > > > > > >         struct mutex lock;
> > > > > > > > >         spinlock_t msg_lock;
> > > > > > > > > @@ -122,9 +131,8 @@ struct vduse_dev {
> > > > > > > > >         u32 vq_num;
> > > > > > > > >         u32 vq_align;
> > > > > > > > >         u32 ngroups;
> > > > > > > > > -       struct vduse_umem *umem;
> > > > > > > > > +       u32 nas;
> > > > > > > > >         struct vduse_vq_group *groups;
> > > > > > > > > -       struct mutex mem_lock;
> > > > > > > > >         unsigned int bounce_size;
> > > > > > > > >         struct mutex domain_lock;
> > > > > > > > >  };
> > > > > > > > > @@ -314,7 +322,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > > > > > > > >         return vduse_dev_msg_sync(dev, &msg);
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > -static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > > > > > +static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
> > > > > > > > >                                   u64 start, u64 last)
> > > > > > > > >  {
> > > > > > > > >         struct vduse_dev_msg msg = { 0 };
> > > > > > > > > @@ -323,8 +331,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > > > > >                 return -EINVAL;
> > > > > > > > >
> > > > > > > > >         msg.req.type = VDUSE_UPDATE_IOTLB;
> > > > > > > > > -       msg.req.iova.start = start;
> > > > > > > > > -       msg.req.iova.last = last;
> > > > > > > > > +       if (dev->api_version < VDUSE_API_VERSION_1) {
> > > > > > > > > +               msg.req.iova.start = start;
> > > > > > > > > +               msg.req.iova.last = last;
> > > > > > > > > +       } else {
> > > > > > > > > +               msg.req.iova_v2.start = start;
> > > > > > > > > +               msg.req.iova_v2.last = last;
> > > > > > > > > +               msg.req.iova_v2.asid = asid;
> > > > > > > > > +       }
> > > > > > > > >
> > > > > > > > >         return vduse_dev_msg_sync(dev, &msg);
> > > > > > > > >  }
> > > > > > > > > @@ -436,14 +450,29 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > > > > > > > >         return mask;
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > > > > > > > > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > > > > > > > > +                                      unsigned int group, unsigned int asid)
> > > > > > > > > +{
> > > > > > > > > +       write_lock(&dev->groups[group].as_lock);
> > > > > > > > > +       dev->groups[group].as = &dev->as[asid];
> > > > > > > > > +       write_unlock(&dev->groups[group].as_lock);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > >  static void vduse_dev_reset(struct vduse_dev *dev)
> > > > > > > > >  {
> > > > > > > > >         int i;
> > > > > > > > > -       struct vduse_iova_domain *domain = dev->domain;
> > > > > > > > >
> > > > > > > > >         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > > > > > > > -       if (domain && domain->bounce_map)
> > > > > > > > > -               vduse_domain_reset_bounce_map(domain);
> > > > > > > > > +       for (i = 0; i < dev->nas; i++) {
> > > > > > > > > +               struct vduse_iova_domain *domain = dev->as[i].domain;
> > > > > > > > > +
> > > > > > > > > +               if (domain && domain->bounce_map)
> > > > > > > > > +                       vduse_domain_reset_bounce_map(domain);
> > > > > > > >
> > > > > > > > Btw, I see this:
> > > > > > > >
> > > > > > > > void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
> > > > > > > > {
> > > > > > > >         if (!domain->bounce_map)
> > > > > > > >                 return;
> > > > > > > >
> > > > > > > >         spin_lock(&domain->iotlb_lock);
> > > > > > > >         if (!domain->bounce_map)
> > > > > > > >                 goto unlock;
> > > > > > > >
> > > > > > > > The bounce_map is checked twice, let's fix that.
> > > > > > >
> > > > > > > Double-checked locking to avoid taking the lock?
> > > > > >
> > > > > > I don't know, but I think we don't care too much about the
> > > > > > performance of vduse_domain_reset_bounce_map().
> > > > > >
> > > > > > > I don't think it is worth it to keep it, as it is not in the
> > > > > > > hot path anyway. But that would also be another patch,
> > > > > > > independent of this series, wouldn't it?
> > > > > >
> > > > > > Yes, it's another independent issue I just found when reviewing
> > > > > > this patch.
> > > > > >
> > > > > > > > > +       }
> > > > > > > > > +
> > > > > > > > > +       for (i = 0; i < dev->ngroups; i++)
> > > > > > > > > +               vduse_set_group_asid_nomsg(dev, i, 0);
> > > > > > > >
> > > > > > > > Note that this function still does:
> > > > > > > >
> > > > > > > > vq->vq_group = 0;
> > > > > > > >
> > > > > > > > Which is wrong.
> > > > > > >
> > > > > > > Right, removing it for the next version. Thanks for the catch!
> > > > > > >
> > > > > > > > >         down_write(&dev->rwsem);
> > > > > > > > >
> > > > > > > > > @@ -623,6 +652,30 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
> > > > > > > > >         return ret;
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > > > > > > > +                               unsigned int asid)
> > > > > > > > > +{
> > > > > > > > > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > > > > > > +       struct vduse_dev_msg msg = { 0 };
> > > > > > > > > +       int r;
> > > > > > > > > +
> > > > > > > > > +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > > > > > > > +           group >= dev->ngroups || asid >= dev->nas ||
> > > > > > > > > +           dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
> > > > > > > > > +               return -EINVAL;
> > > > > > > >
> > > > > > > > If we forbid setting the group asid after DRIVER_OK, why do
> > > > > > > > we still need a rwlock?
> > > > > > >
> > > > > > > virtio_map_ops->alloc is still called before DRIVER_OK to
> > > > > > > allocate the vrings in the bounce buffer, for example. If
> > > > > > > you're ok with that I'm ok with removing the lock, as all the
> > > > > > > calls are issued by the driver setup process anyway. Or just
> > > > > > > keep it for alloc?
> > > > > >
> > > > > > I see, then I think we need to keep that. The reason is that
> > > > > > there's no guarantee that alloc() must be called before
> > > > > > DRIVER_OK.
> > > > > >
> > > > > > > Anyway, I think I misunderstood your comment from [1] then.
> > > > > > >
> > > > > > > > All we need to do is to synchronize set_group_asid() with
> > > > > > > > set_status()/reset()?
> > > > > > >
> > > > > > > That's also a good one. There is no synchronization if one
> > > > > > > thread calls reset and then the device is set up from another
> > > > > > > thread. As this situation is still hypothetical, because
> > > > > > > virtio_vdpa does not support set_group_asid,
> > > > > >
> > > > > > Right.
> > > > > >
> > > > > > > and the vhost one is already protected by the vhost lock, do
> > > > > > > we need it?
> > > > > >
> > > > > > Let's add a TODO in the code.
> > > > > >
> > > > >
> > > > > Can you expand on this? I meant that there will be no
> > > > > synchronization if we remove the rwlock (or similar). If we keep
> > > > > the rwlock we don't need to add any TODO, or am I missing
> > > > > something?
> > > >
> > > > I meant that even with the rwlock we don't synchronize
> > > > set_group_asid() and set_status(). Since you said vhost has been
> > > > synchronized, I think we should either
> > > >
> > > > 1) document the synchronization that needs to be done in the upper
> > > > layer or
> > > > 2) add a TODO to synchronize set_status() and set_group_asid()
> > > >
> > >
> > > With the VDUSE messages they are synchronized by dev->msg_lock, so
> > > the vduse module always has a coherent status from both sides: the
> > > VDUSE userland device and virtio_vdpa. If you want vduse not to
> > > trust vdpa we can go the extra mile and make dev->status atomic for
> > > both device features and group ASID, would that work?
> >
> > Yes, something like this. For example, msg_lock is only used to
> > synchronize the IPC with the userspace; it doesn't synchronize
> > set_status() with set_group_asid().
> >
> > > > > > > > Or if you want to synchronize map ops with set_status(),
> > > > > > > > that looks like an independent thing (hardening).
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > > > > > > > +       msg.req.vq_group_asid.group = group;
> > > > > > > > > +       msg.req.vq_group_asid.asid = asid;
> > > > > > > > > +
> > > > > > > > > +       r = vduse_dev_msg_sync(dev, &msg);
> > > > > > > > > +       if (r < 0)
> > > > > > > > > +               return r;
> > > > > > > > > +
> > > > > > > > > +       vduse_set_group_asid_nomsg(dev, group, asid);
> > > > > > > >
> > > > > > > > I'm not sure this has been discussed before, but I think it
> > > > > > > > would be better to introduce a new ioctl to get the
> > > > > > > > group -> as mapping. This helps to avoid
> > > > > > > > vduse_dev_msg_sync() as much as possible. And it doesn't
> > > > > > > > require the userspace to poll the vduse fd before DRIVER_OK.
> > > > >
> > > > > The userspace VDUSE device must poll the vduse fd to get the
> > > > > DRIVER_OK actually, so it cannot avoid polling the vduse device.
> > > > > What are the reasons to avoid vduse_dev_msg_sync()?
> > > >
> > > > One less synchronization point with userspace.
> > > >
> > >
> > > But the ioctl alternative is more synchronization from my POV, not
> > > less. Instead of receiving notifications that we could even batch,
> > > the VDUSE userland device needs to issue many synchronous ioctls.
> >
> > A synchronous ioctl seems to be better than tricks like letting the
> > kernel wait for the userspace to respond (with a timeout).
> >
>
> While I don't love the timeout part, I don't see it that way.
> Adding a
> "waiting for userspace device ioctl" state to the vduse device
> complicates the code in many different places, and it is hard to
> return an error from that point. But it is interesting to have on the
> table, for sure.
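For reference, the request/reply cycle we keep reusing is roughly the
following. This is a condensed paraphrase of vduse_dev_msg_sync() in
vduse_dev.c, with the error handling and the no-timeout case dropped:

/*
 * Condensed sketch of the existing pattern: queue the request, wake
 * the userspace reader of the vduse fd, then sleep until the reply
 * arrives or msg_timeout expires. This is the "kernel waits for
 * userspace (and timeout)" part under discussion.
 */
static int vduse_dev_msg_sync(struct vduse_dev *dev,
                              struct vduse_dev_msg *msg)
{
        init_waitqueue_head(&msg->waitq);
        spin_lock(&dev->msg_lock);
        msg->req.request_id = dev->msg_unique++;
        vduse_enqueue_msg(&dev->send_list, msg);
        wake_up(&dev->waitq);
        spin_unlock(&dev->msg_lock);

        wait_event_killable_timeout(msg->waitq, msg->completed,
                                    (long)dev->msg_timeout * HZ);

        return msg->resp.result == VDUSE_REQ_RESULT_OK ? 0 : -EIO;
}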
Ok, considering we've already used vduse_dev_msg_sync() in many places,
I'm fine if you stick to that.

> > > > > > >
> > > > > > > I'm fine with that, but how do we communicate that they have
> > > > > > > changed?
> > > > > >
> > > > > > Since we forbid changing the group->as mapping after DRIVER_OK,
> > > > > > userspace just needs to use that ioctl once after DRIVER_OK.
> > > > > >
> > > > >
> > > > > But the userland VDUSE device needs to know when the ASIDs are
> > > > > set by the driver and will not change anymore. In this series it
> > > > > is solved by the order of the messages, but now we would need a
> > > > > way to know that moment. One idea is to issue this new ioctl when
> > > > > it receives the VDUSE_SET_STATUS msg.
> > > >
> > > > Yes.
> > > >
> > > > > If we do it that way there is still a window where the
> > > > > hypothetical (malicious) virtio_vdpa driver can read and write vq
> > > > > groups from different threads.
> > > >
> > > > See above, we need to add synchronization in either vdpa or
> > > > virtio_vdpa.
> > > >
> > > > > It could issue a set_group_asid after the VDUSE device returns
> > > > > from the ioctl but before dev->status has been updated.
> > > >
> > > > In the case of VDUSE I think either side should not trust the
> > > > other side. So userspace needs to be prepared for this, and so does
> > > > the driver. Since the memory accesses are initiated from the
> > > > userspace, it should not perform any memory access before the ioctl
> > > > to fetch the group->asid mapping.
> > > >
> > >
> > > I don't follow this. In the case of virtio_vdpa the ASID groups and
> > > features are handled by the kernel entirely. And in the case of
> > > vhost_vdpa they are handled by the vhost_dev->mutex lock. So can we
> > > trust them?
> >
> > The problem is that there's no way for VDUSE to know whether or not
> > the upper layer is vhost.
> >
>
> My point is not that VDUSE can trust vhost_vdpa but not virtio_vdpa or
> the reverse, as if we could set a conditional flag in VDUSE and then
> modify the behavior based on that.
>
> My point is that the code needs to be consistent in how the VDUSE
> kernel module trusts the vdpa driver. It does not make sense to trust
> that it will never issue a state change (like a reset) concurrently
> with a DMA operation [1] but then not trust it about changing the
> feature flag.
>
> At this moment VDUSE trusts that the vDPA driver does not launch these
> operations concurrently,

I think we should at least document this somewhere, since right now it
can only be deduced from code review.

> and each vDPA driver has its own way to
> synchronize them. In the case of vhost it is the vhost_dev mutex, and
> virtio_vdpa has other ways. I'll add the hardening here, but this can
> of worms can be extended through the whole chain. Why does virtio_vdpa
> trust that virtio_net will never call vdpa_set_features after
> DRIVER_OK? And virtio_net with virtio_ring? Should we harden all of
> that?

That's why I think we can leave this for future investigation by
leaving a TODO.

> virtio_net does protect against this, as it is the one interacting
> with userland. Same with vhost_vdpa.
>
> > > > >
> > > > > I don't think we should protect that, but if we want to do it we
> > > > > should protect that part either by acquiring the rwlock and
> > > > > trusting the vduse_dev_msg_sync timeout, or proceed with atomics,
> > > > > smp_store_release / smp_load_acquire, read and write barriers...
> > > > >
> > > > > Note that I still think this is overthinking. We have the same
> > > > > problems with driver features, where changing bits like the
> > > > > packed vq one changes the behavior of the vDPA callbacks and
> > > > > could desynchronize the vduse kernel module and the userland
> > > > > device.
> > > >
> > > > Probably, but for driver features, it's too late to do the change.
> > > >
> > >
> > > Why is it late? We just need to add the same synchronization for
> > > both driver features and ASID groups. The change is not visible to
> > > virtio_vdpa, vhost_vdpa, or the VDUSE userland device at all.
> >
> > Ok, I think I get you. I thought we were talking about reducing the
> > vduse_dev_msg_sync() when doing SET_FEATURES. But you meant the
> > synchronization with set_status(). Yes, we can do that.
> >
> > > > > But since the virtio_vdpa and vduse kernel modules run in the
> > > > > same kernel, they should be able to trust each other.
> > > >
> > > > Better not, usually the lower layer (vdpa/vduse) should not trust
> > > > the upper layer (virtio-vdpa) in this case.
> > > >
> > >
> > > Ok, adding the hardening in the next version!
> >
> > > > But if you stick to the method with vduse_dev_msg_sync, I'm also
> > > > fine with that.
> > > >
> > > > > > >
> > > > > > > Or how to communicate to the driver that the device does not
> > > > > > > accept the assignment of the ASID to the group?
> > > > > >
> > > > > > See above.
> > > > > >
> > > > >
> > > > > I didn't find the answer :(.
> > > > >
> > > > > Let me put an example with QEMU and vhost_vdpa:
> > > > > - QEMU calls VHOST_VDPA_SET_GROUP_ASID with a valid vq group and
> > > > >   asid.
> > > > > - vduse cannot send this information to the device as it must
> > > > >   wait for the ioctl.
> > > >
> > > > Which ioctl did you mean in this case?
> > > >
> > >
> > > The new ioctl from the userland VDUSE device that you're proposing
> > > to accept or reject the set ASID group. Let's call it
> > > VDUSE_GET_GROUP_ASID, so let me rewrite the flow from that moment:
> > >
> > > - QEMU calls VHOST_VDPA_SET_GROUP_ASID with a valid vq group (in
> > >   vdpa->ngroups range) and a valid asid (in vdpa->nas range)
> > > - The vduse kernel module cannot send the information to the VDUSE
> > >   userland device; it must wait until the VDUSE userland device
> > >   calls the new VDUSE_GET_GROUP_ASID. That will not happen until
> > >   QEMU calls VHOST_VDPA_SET_STATUS with DRIVER_OK, but QEMU will not
> > >   call VHOST_VDPA_SET_STATUS until the kernel returns from
> > >   VHOST_VDPA_SET_GROUP_ASID. The vduse kernel module needs to return
> > >   success, without knowing if the device will accept it in the
> > >   future or not.
> > > - Now QEMU sends DRIVER_OK, so the vduse kernel module forwards it
> > >   to the VDUSE userland device. The VDUSE userland device then calls
> > >   VDUSE_GET_GROUP_ASID, and the flow continues.
> > > - The vduse userland device doesn't accept the vq group ASID map for
> > >   whatever reason, so it returns an error through another ioctl
> >
> > If the asid is less than the total number of address spaces, any
> > reason for the userspace to reject such a configuration?
> >
>
> That's a good question that I can only answer with "allowing userspace
> to communicate back an error here will save us having to expand the
> communication protocol when we find the reason".
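Just to make the hypothetical concrete, what I had in mind was
something like the following. The names, the ioctl number, and the
struct layout are all invented here; nothing in this series or in the
current uapi defines them:

/*
 * Hypothetical uapi sketch, NOT part of this series: instead of
 * receiving a VDUSE_SET_VQ_GROUP_ASID message, the userland device
 * would fetch the current binding once it sees DRIVER_OK.
 */
struct vduse_vq_group_asid_query {
        __u32 group;    /* in: vq group to query */
        __u32 asid;     /* out: address space bound to the group */
};

#define VDUSE_GET_GROUP_ASID \
        _IOWR(VDUSE_BASE, 0x40, struct vduse_vq_group_asid_query)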
>
> If we continue the logic that userspace (or vDPA devices) cannot fail
> as long as the AS is in range and the status is !DRIVER_OK, should we
> just move all the checks about the valid AS range and valid state to
> the vdpa core and make the .set_group_asid operation return void?
> That would be a good change indeed.

I'd try to limit the changeset of this series. I think I'm fine with
keeping the use of vduse_dev_msg_sync().

> What should VDUSE do if the device replies VDUSE_REQ_RESULT_FAILED?
> (or, in your case, if the device never knows the assigned ASID as it
> never calls the ioctl).
>
> (Actually, vhost_vdpa already checks for as < vdpa->nas so that part
> is duplicated. It will be simpler in the next version, thanks!)
>
> [1]
> https://patchew.org/linux/[email protected]/[email protected]/#cacgkmes6-j8h7x8navzgc0wraan4wslo3nmfx++fe0uaexw...@mail.gmail.com
>
> Thanks
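And if we ever do that follow-up of moving the validation into the
core, I'd expect it to be roughly this small. Untested sketch with a
hypothetical helper name, against the ngroups/nas fields that struct
vdpa_device already has:

/*
 * Sketch only: centralize the range checks so parent drivers would
 * not need to duplicate them and .set_group_asid could eventually
 * become void. Not something this series needs to do.
 */
static int vdpa_set_group_asid(struct vdpa_device *vdpa,
                               unsigned int group, unsigned int asid)
{
        const struct vdpa_config_ops *ops = vdpa->config;

        if (group >= vdpa->ngroups || asid >= vdpa->nas)
                return -EINVAL;

        return ops->set_group_asid(vdpa, group, asid);
}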

