Re: [RFC] vdpa/mlx5: preserve CVQ vringh index
Steve, I think this is a loose end that I myself am not sure is worth fixing; copying Eugenio for his awareness. The reason is that when the CVQ is in place it always has to cope with device state saving and restoration using the shadowed virtqueue for a lot of cases, not just migration, and that's why SVQ is always enabled for the CVQ in the latest QEMU. But I agree this is nice to have; there could be value in supporting vDPA VM instances without depending solely on SVQ, e.g. for use cases like memory-encrypted VMs. Thanks for posting the fix, and let's see what other people think about it.

-Siwei

On 10/26/2023 1:13 PM, Steven Sistare wrote:

On 10/26/2023 4:11 PM, Steve Sistare wrote:

mlx5_vdpa does not preserve userland's view of vring base for the control queue in the following sequence:

ioctl VHOST_SET_VRING_BASE
ioctl VHOST_VDPA_SET_STATUS VIRTIO_CONFIG_S_DRIVER_OK
  mlx5_vdpa_set_status()
    setup_cvq_vring()
      vringh_init_iotlb()
        vringh_init_kern()
          vrh->last_avail_idx = 0;
ioctl VHOST_GET_VRING_BASE

To fix, restore the value of cvq->vring.last_avail_idx after calling vringh_init_iotlb.

Signed-off-by: Steve Sistare

This is a resend; I forgot to cc myself the first time.

I don't know if we expect vring_base to be preserved after reset, because the uapi comments say nothing about it. mlx5 *does* preserve the base across reset for the data vq's, but perhaps that is an accident of the implementation. I posted this patch to perhaps avoid future problems.

The bug(?) bit me while developing with an older version of qemu, and I can work around it in qemu code. Further, the latest version of qemu always enables svq for the cvq and is not affected by this behavior AFAICT.

- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] vhost-vdpa: fix use-after-free in _compat_vdpa_reset
On 10/25/2023 11:55 PM, Si-Wei Liu wrote:

On 10/25/2023 10:26 PM, Michael S. Tsirkin wrote:

On Wed, Oct 25, 2023 at 04:13:14PM -0700, Si-Wei Liu wrote:

When the vhost-vdpa device is being closed, vhost_vdpa_cleanup() doesn't clean up the vqs pointer after freeing it. This could lead to a use-after-free when _compat_vdpa_reset() tries to access the vqs that were already freed. The fix is to set the vqs pointer to NULL at the end of vhost_vdpa_cleanup() after it is freed, which is guarded by the atomic opened state.

BUG: unable to handle page fault for address: 0001005b4af4
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 16a80a067 P4D 0
Oops: [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 40387 Comm: qemu-kvm Not tainted 6.6.0-rc7+ #3
Hardware name: Dell Inc. PowerEdge R750/0PJ80M, BIOS 1.8.2 09/14/2022
RIP: 0010:_compat_vdpa_reset.isra.0+0x27/0xb0 [vhost_vdpa]
Code: 90 90 90 0f 1f 44 00 00 41 55 4c 8d ae 08 03 00 00 41 54 55 48 89 f5 53 4c 8b a6 00 03 00 00 48 85 ff 74 49 48 8b 07 4c 89 ef <48> 8b 80 88 45 00 00 48 c1 e8 08 48 83 f0 01 89 c3 e8 73 5e 9b dc
RSP: 0018:ff73a85762073ba0 EFLAGS: 00010286
RAX: 0001005b056c RBX: ff32b13ca6994c68 RCX: 0002
RDX: 0001 RSI: ff32b13c07559000 RDI: ff32b13c07559308
RBP: ff32b13c07559000 R08: R09: ff32b12ca497c0f0
R10: ff73a85762073c58 R11: 000c106f9de3 R12: ff32b12c95b1d050
R13: ff32b13c07559308 R14: ff32b12d0ddc5100 R15: 8002
FS: 7fec5b8cbf80() GS:ff32b13bbfc8() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 0001005b4af4 CR3: 00015644a003 CR4: 00773ee0
PKRU: 5554
Call Trace:
 ? __die+0x20/0x70
 ? page_fault_oops+0x76/0x170
 ? exc_page_fault+0x65/0x150
 ? asm_exc_page_fault+0x22/0x30
 ? _compat_vdpa_reset.isra.0+0x27/0xb0 [vhost_vdpa]
 vhost_vdpa_open+0x57/0x280 [vhost_vdpa]
 ? __pfx_chrdev_open+0x10/0x10
 chrdev_open+0xc6/0x260
 ? __pfx_chrdev_open+0x10/0x10
 do_dentry_open+0x16e/0x530
 do_open+0x21c/0x400
 path_openat+0x111/0x290
 do_filp_open+0xb2/0x160
 ? __check_object_size.part.0+0x5e/0x140
 do_sys_openat2+0x96/0xd0
 __x64_sys_openat+0x53/0xa0
 do_syscall_64+0x59/0x90
 ? syscall_exit_to_user_mode+0x22/0x40
 ? do_syscall_64+0x69/0x90
 ? syscall_exit_to_user_mode+0x22/0x40
 ? do_syscall_64+0x69/0x90
 ? do_syscall_64+0x69/0x90
 ? syscall_exit_to_user_mode+0x22/0x40
 ? do_syscall_64+0x69/0x90
 ? exc_page_fault+0x65/0x150
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Fixes: 10cbf8dfaf93 ("vhost-vdpa: clean iotlb map during reset for older userspace")
Fixes: ac7e98c73c05 ("vhost-vdpa: fix NULL pointer deref in _compat_vdpa_reset")

So these two are all in next, correct? I really do not like how 10cbf8dfaf936e3ef1f5d7fdc6e9dada268ba6bb introduced a regression and then apparently we keep fixing things up?

Sorry, my bad. The latest one should be all of it.

Can I squash these 3 commits?

Sure. Or if you want me to send a v5 with all 3 commits squashed in, I can do that for sure.

Saw you squashed it with the 2 fixups in place, thank you! Sent a v5 anyway, just in case you need a fresh series.

Thanks,
-Siwei

Reported-by: Lei Yang
Closes: https://lore.kernel.org/all/CAPpAL=yhdqn1aztecn3mps8o4m+bl_hvy02fdpihn7dwd91...@mail.gmail.com/
Signed-off-by: Si-Wei Liu
---
 drivers/vhost/vdpa.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 9a2343c45df0..30df5c58db73 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1355,6 +1355,7 @@ static void vhost_vdpa_cleanup(struct vhost_vdpa *v)
 	vhost_vdpa_free_domain(v);
 	vhost_dev_cleanup(&v->vdev);
 	kfree(v->vdev.vqs);
+	v->vdev.vqs = NULL;
 }

 static int vhost_vdpa_open(struct inode *inode, struct file *filep)
-- 
2.39.3
[PATCH v5 7/7] vdpa_sim: implement .reset_map support
In order to reduce excessive memory mapping cost in live migration and VM reboot, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the iotlb on the given ASID and recreate the 1:1 passthrough/identity mapping. To be consistent, the mapping on device creation is initialized to passthrough/identity with PA 1:1 mapped as IOVA. With this the device .reset op doesn't have to maintain and clean up memory mappings by itself.

Additionally, implement .compat_reset to cater for older userspace, which may wish to see the mapping cleared during reset.

Signed-off-by: Si-Wei Liu
Tested-by: Stefano Garzarella
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 52 ++--
 1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 76d41058add9..be2925d0d283 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -139,7 +139,7 @@ static void vdpasim_vq_reset(struct vdpasim *vdpasim,
 	vq->vring.notify = NULL;
 }

-static void vdpasim_do_reset(struct vdpasim *vdpasim)
+static void vdpasim_do_reset(struct vdpasim *vdpasim, u32 flags)
 {
 	int i;

@@ -151,11 +151,13 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim)
 			 &vdpasim->iommu_lock);
 	}

-	for (i = 0; i < vdpasim->dev_attr.nas; i++) {
-		vhost_iotlb_reset(&vdpasim->iommu[i]);
-		vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX,
-				      0, VHOST_MAP_RW);
-		vdpasim->iommu_pt[i] = true;
+	if (flags & VDPA_RESET_F_CLEAN_MAP) {
+		for (i = 0; i < vdpasim->dev_attr.nas; i++) {
+			vhost_iotlb_reset(&vdpasim->iommu[i]);
+			vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX,
+					      0, VHOST_MAP_RW);
+			vdpasim->iommu_pt[i] = true;
+		}
 	}

 	vdpasim->running = true;
@@ -259,8 +261,12 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr,
 	if (!vdpasim->iommu_pt)
 		goto err_iommu;

-	for (i = 0; i < vdpasim->dev_attr.nas; i++)
+	for (i = 0; i < vdpasim->dev_attr.nas; i++) {
 		vhost_iotlb_init(&vdpasim->iommu[i], max_iotlb_entries, 0);
+		vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, 0,
+				      VHOST_MAP_RW);
+		vdpasim->iommu_pt[i] = true;
+	}

 	for (i = 0; i < dev_attr->nvqs; i++)
 		vringh_set_iotlb(&vdpasim->vqs[i].vring, &vdpasim->iommu[0],
@@ -480,18 +486,23 @@ static void vdpasim_set_status(struct vdpa_device *vdpa, u8 status)
 	mutex_unlock(&vdpasim->mutex);
 }

-static int vdpasim_reset(struct vdpa_device *vdpa)
+static int vdpasim_compat_reset(struct vdpa_device *vdpa, u32 flags)
 {
 	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);

 	mutex_lock(&vdpasim->mutex);
 	vdpasim->status = 0;
-	vdpasim_do_reset(vdpasim);
+	vdpasim_do_reset(vdpasim, flags);
 	mutex_unlock(&vdpasim->mutex);

 	return 0;
 }

+static int vdpasim_reset(struct vdpa_device *vdpa)
+{
+	return vdpasim_compat_reset(vdpa, 0);
+}
+
 static int vdpasim_suspend(struct vdpa_device *vdpa)
 {
 	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
@@ -637,6 +648,25 @@ static int vdpasim_set_map(struct vdpa_device *vdpa, unsigned int asid,
 	return ret;
 }

+static int vdpasim_reset_map(struct vdpa_device *vdpa, unsigned int asid)
+{
+	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+
+	if (asid >= vdpasim->dev_attr.nas)
+		return -EINVAL;
+
+	spin_lock(&vdpasim->iommu_lock);
+	if (vdpasim->iommu_pt[asid])
+		goto out;
+	vhost_iotlb_reset(&vdpasim->iommu[asid]);
+	vhost_iotlb_add_range(&vdpasim->iommu[asid], 0, ULONG_MAX,
+			      0, VHOST_MAP_RW);
+	vdpasim->iommu_pt[asid] = true;
+out:
+	spin_unlock(&vdpasim->iommu_lock);
+	return 0;
+}
+
 static int vdpasim_bind_mm(struct vdpa_device *vdpa, struct mm_struct *mm)
 {
 	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
@@ -749,6 +779,7 @@ static const struct vdpa_config_ops vdpasim_config_ops = {
 	.get_status             = vdpasim_get_status,
 	.set_status             = vdpasim_set_status,
 	.reset                  = vdpasim_reset,
+	.compat_reset           = vdpasim_compat_reset,
 	.suspend                = vdpasim_suspend,
 	.resume                 = vdpasim_resume,
 	.get_config_size        = vdpasim_get_config_size,
@@ -759,6 +790,7 @@ static const struct vdpa_config_ops vdpasim_config_ops = {
 	.set_group_asid         = vdpasim
[PATCH v5 5/7] vhost-vdpa: clean iotlb map during reset for older userspace
Using the .compat_reset op from the previous patch, the buggy .reset behaviour can be kept as-is for older userspace apps which don't ack the IOTLB_PERSIST backend feature. As this compatibility quirk is limited to those drivers that used to be buggy in the past, it won't change the behaviour or affect the ABI on setups with an API-compliant driver. The separation of .compat_reset from the regular .reset allows vhost-vdpa to know which drivers had broken behaviour before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver.

Signed-off-by: Si-Wei Liu
Tested-by: Dragos Tatulea
Tested-by: Lei Yang
---
 drivers/vhost/vdpa.c         | 20 
 drivers/virtio/virtio_vdpa.c |  2 +-
 include/linux/vdpa.h         |  7 +--
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index acc7c74ba7d6..30df5c58db73 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -227,13 +227,24 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid)
 	irq_bypass_unregister_producer(&vq->call_ctx.producer);
 }

-static int vhost_vdpa_reset(struct vhost_vdpa *v)
+static int _compat_vdpa_reset(struct vhost_vdpa *v)
 {
 	struct vdpa_device *vdpa = v->vdpa;
+	u32 flags = 0;

-	v->in_batch = 0;
+	if (v->vdev.vqs) {
+		flags |= !vhost_backend_has_feature(v->vdev.vqs[0],
+						    VHOST_BACKEND_F_IOTLB_PERSIST) ?
+			 VDPA_RESET_F_CLEAN_MAP : 0;
+	}
+
+	return vdpa_reset(vdpa, flags);
+}

-	return vdpa_reset(vdpa);
+static int vhost_vdpa_reset(struct vhost_vdpa *v)
+{
+	v->in_batch = 0;
+	return _compat_vdpa_reset(v);
 }

 static long vhost_vdpa_bind_mm(struct vhost_vdpa *v)
@@ -312,7 +323,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
 			vhost_vdpa_unsetup_vq_irq(v, i);

 	if (status == 0) {
-		ret = vdpa_reset(vdpa);
+		ret = _compat_vdpa_reset(v);
 		if (ret)
 			return ret;
 	} else
@@ -1344,6 +1355,7 @@ static void vhost_vdpa_cleanup(struct vhost_vdpa *v)
 	vhost_vdpa_free_domain(v);
 	vhost_dev_cleanup(&v->vdev);
 	kfree(v->vdev.vqs);
+	v->vdev.vqs = NULL;
 }

 static int vhost_vdpa_open(struct inode *inode, struct file *filep)
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 06ce6d8c2e00..8d63e5923d24 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -100,7 +100,7 @@ static void virtio_vdpa_reset(struct virtio_device *vdev)
 {
 	struct vdpa_device *vdpa = vd_get_vdpa(vdev);

-	vdpa_reset(vdpa);
+	vdpa_reset(vdpa, 0);
 }

 static bool virtio_vdpa_notify(struct virtqueue *vq)
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 6b8cbf75712d..db15ac07f8a6 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -519,14 +519,17 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev)
 	return vdev->dma_dev;
 }

-static inline int vdpa_reset(struct vdpa_device *vdev)
+static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags)
 {
 	const struct vdpa_config_ops *ops = vdev->config;
 	int ret;

 	down_write(&vdev->cf_lock);
 	vdev->features_valid = false;
-	ret = ops->reset(vdev);
+	if (ops->compat_reset && flags)
+		ret = ops->compat_reset(vdev, flags);
+	else
+		ret = ops->reset(vdev);
 	up_write(&vdev->cf_lock);
 	return ret;
 }
-- 
2.39.3
[PATCH v5 3/7] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to distinguish whether the vhost-vdpa iotlb in the kernel can be trusted to persist IOTLB mappings across vDPA reset. Without it, userspace has no way to tell whether it's running on an older kernel, which could silently drop all iotlb mappings across vDPA reset, especially with a broken parent driver implementation for the .reset driver op. A broken driver may incorrectly drop all mappings of its own as part of .reset, which inadvertently ends up with corrupted mapping state between vhost-vdpa userspace and the kernel. As a workaround, to make the mapping behaviour predictable across reset, userspace has to proactively remove all mappings before vDPA reset, and then restore all the mappings afterwards. This workaround is done unconditionally on top of all parent drivers today, due to the parent driver implementation issue and no means to differentiate. This workaround has been utilized in QEMU since day one, when the corresponding vhost-vdpa userspace backend came to the world.

There are 3 cases where the backend may claim this feature bit:

- parent device that has to work with the platform IOMMU
- parent device with on-chip IOMMU that has the expected .reset_map support in the driver
- parent device with a vendor specific IOMMU implementation that already has persistent IOTLB mapping, which has to specifically declare this backend feature

The reason why .reset_map is one of the preconditions for persistent iotlb is that without it, vhost-vdpa can't switch the iotlb back to the initial state later on, especially for the on-chip IOMMU case, which starts with an identity mapping at device creation. virtio-vdpa requires the on-chip IOMMU to perform 1:1 passthrough translation from PA to IOVA as-is to begin with, and .reset_map is the only means to turn the iotlb back to identity mapping mode after vhost-vdpa is gone.
The difference in behavior did not matter, as QEMU unmaps all the memory when unregistering the memory listener at vhost_vdpa_dev_start(started = false), but the backend acknowledging this feature flag allows QEMU to make sure it is safe to skip this unmap & map in the case of a vhost stop & start cycle. In that sense, this feature flag is actually a signal for userspace to know that the driver bug has been solved. Not offering it indicates that userspace cannot trust the kernel to retain the maps.

Signed-off-by: Si-Wei Liu
Acked-by: Eugenio Pérez
---
 drivers/vhost/vdpa.c             | 15 +++
 include/uapi/linux/vhost_types.h |  2 ++
 2 files changed, 17 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index c6bfe9bdde42..acc7c74ba7d6 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -439,6 +439,15 @@ static u64 vhost_vdpa_get_backend_features(const struct vhost_vdpa *v)
 	return ops->get_backend_features(vdpa);
 }

+static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v)
+{
+	struct vdpa_device *vdpa = v->vdpa;
+	const struct vdpa_config_ops *ops = vdpa->config;
+
+	return (!ops->set_map && !ops->dma_map) || ops->reset_map ||
+	       vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST);
+}
+
 static long vhost_vdpa_set_features(struct vhost_vdpa *v, u64 __user *featurep)
 {
 	struct vdpa_device *vdpa = v->vdpa;
@@ -726,6 +735,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 			return -EFAULT;
 		if (features & ~(VHOST_VDPA_BACKEND_FEATURES |
 				 BIT_ULL(VHOST_BACKEND_F_DESC_ASID) |
+				 BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) |
 				 BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
 				 BIT_ULL(VHOST_BACKEND_F_RESUME) |
 				 BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK)))
@@ -742,6 +752,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 		if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) &&
 		    !vhost_vdpa_has_desc_group(v))
 			return -EOPNOTSUPP;
+		if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) &&
+		    !vhost_vdpa_has_persistent_map(v))
+			return -EOPNOTSUPP;
 		vhost_set_backend_features(&v->vdev, features);
 		return 0;
 	}
@@ -797,6 +810,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 			features |= BIT_ULL(VHOST_BACKEND_F_RESUME);
 		if (vhost_vdpa_has_desc_group(v))
 			features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID);
+		if (vhost_vdpa_has_persistent_map(v))
+			features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST);
 		features |= vhost_vdpa_get_backend_features(v);
 		if (copy_to_user(featurep, &features, sizeof(features
[PATCH v5 6/7] vdpa/mlx5: implement .reset_map driver op
Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with a preallocated 1:1 DMA MR at device creation time. This 1:1 DMA MR is implicitly destroyed when the first .set_map call is invoked, at which point callers like vhost-vdpa start to set up custom mappings. When the .reset callback is invoked, the custom mappings are cleared and the 1:1 DMA MR is re-created.

In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR (including the cvq mapping) on the given ASID and recreate the initial DMA mapping. That way, the device .reset op runs free from having to maintain and clean up memory mappings by itself.

Additionally, implement .compat_reset to cater for older userspace, which may wish to see the mapping cleared during reset.

Co-developed-by: Dragos Tatulea
Signed-off-by: Dragos Tatulea
Signed-off-by: Si-Wei Liu
---
 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
 drivers/vdpa/mlx5/core/mr.c        | 17 +
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 27 ---
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index db988ced5a5d..84547d998bcf 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev,
 			       struct vhost_iotlb *iotlb,
 			       unsigned int asid);
 int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev);
+int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid);

 #define mlx5_vdpa_warn(__dev, format, ...)                                        \
 	dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \
diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
index 66530e28f327..2197c46e563a 100644
--- a/drivers/vdpa/mlx5/core/mr.c
+++ b/drivers/vdpa/mlx5/core/mr.c
@@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev)

 	return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0);
 }
+
+int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid)
+{
+	if (asid >= MLX5_VDPA_NUM_AS)
+		return -EINVAL;
+
+	mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]);
+
+	if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
+		if (mlx5_vdpa_create_dma_mr(mvdev))
+			mlx5_vdpa_warn(mvdev, "create DMA MR failed\n");
+	} else {
+		mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid);
+	}
+
+	return 0;
+}
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index f4516a2d5bb0..12ac3397f39b 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -2876,7 +2876,7 @@ static void init_group_to_asid_map(struct mlx5_vdpa_dev *mvdev)
 		mvdev->group2asid[i] = 0;
 }

-static int mlx5_vdpa_reset(struct vdpa_device *vdev)
+static int mlx5_vdpa_compat_reset(struct vdpa_device *vdev, u32 flags)
 {
 	struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
 	struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
@@ -2888,7 +2888,8 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
 	unregister_link_notifier(ndev);
 	teardown_driver(ndev);
 	clear_vqs_ready(ndev);
-	mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
+	if (flags & VDPA_RESET_F_CLEAN_MAP)
+		mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
 	ndev->mvdev.status = 0;
 	ndev->mvdev.suspended = false;
 	ndev->cur_num_vqs = 0;
@@ -2899,7 +2900,8 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
 	init_group_to_asid_map(mvdev);
 	++mvdev->generation;

-	if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
+	if ((flags & VDPA_RESET_F_CLEAN_MAP) &&
+	    MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
 		if (mlx5_vdpa_create_dma_mr(mvdev))
 			mlx5_vdpa_warn(mvdev, "create MR failed\n");
 	}
@@ -2908,6 +2910,11 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
 	return 0;
 }

+static int mlx5_vdpa_reset(struct vdpa_device *vdev)
+{
+	return mlx5_vdpa_compat_reset(vdev, 0);
+}
+
 static size_t mlx5_vdpa_get_config_size(struct vdpa_device *vdev)
 {
 	return sizeof(struct virtio_net_config);
@@ -2987,6 +2994,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid,
 	return err;
 }

+static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsi
[PATCH v5 4/7] vdpa: introduce .compat_reset operation callback
Some device specific IOMMU parent drivers have long standing bogus behaviour that mistakenly cleans up the maps during .reset. By definition, this is a violation of the on-chip IOMMU ops (i.e. .set_map, or .dma_map & .dma_unmap) in those offending drivers, as the removal of internal maps is completely agnostic to the upper layer, causing an inconsistent view between userspace and the kernel. Some userspace apps like QEMU work around this brokenness by proactively removing and adding back all the maps around vdpa device reset, but such a workaround actually penalizes other well-behaved driver setups, where vdpa reset always comes with the associated mapping cost, especially for kernel vDPA devices (use_va=false) that have a high cost on pinning. It's imperative to rectify this behaviour and remove the problematic code from all those non-compliant parent drivers.

However, we cannot unconditionally remove the bogus map-cleaning code from the buggy .reset implementations, as there might exist userspace apps that already rely on the behaviour on some setups. Introduce a .compat_reset driver op to keep compatibility with older userspace. New and well behaved parent drivers should not bother to implement such an op; only those drivers that are doing, or used to do, non-compliant map-cleaning reset will have to.

Signed-off-by: Si-Wei Liu
---
 include/linux/vdpa.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 26ae6ae1eac3..6b8cbf75712d 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -252,6 +252,17 @@ struct vdpa_map_file {
  * @reset:			Reset device
  *				@vdev: vdpa device
  *				Returns integer: success (0) or error (< 0)
+ * @compat_reset:		Reset device with compatibility quirks to
+ *				accommodate older userspace. Only needed by
+ *				parent driver which used to have bogus reset
+ *				behaviour, and has to maintain such behaviour
+ *				for compatibility with older userspace.
+ *				Historically compliant driver only has to
+ *				implement .reset, historically non-compliant
+ *				driver should implement both.
+ *				@vdev: vdpa device
+ *				@flags: compatibility quirks for reset
+ *				Returns integer: success (0) or error (< 0)
  * @suspend:			Suspend the device (optional)
  *				@vdev: vdpa device
  *				Returns integer: success (0) or error (< 0)
@@ -393,6 +404,8 @@ struct vdpa_config_ops {
 	u8 (*get_status)(struct vdpa_device *vdev);
 	void (*set_status)(struct vdpa_device *vdev, u8 status);
 	int (*reset)(struct vdpa_device *vdev);
+	int (*compat_reset)(struct vdpa_device *vdev, u32 flags);
+#define VDPA_RESET_F_CLEAN_MAP 1
 	int (*suspend)(struct vdpa_device *vdev);
 	int (*resume)(struct vdpa_device *vdev);
 	size_t (*get_config_size)(struct vdpa_device *vdev);
-- 
2.39.3
[PATCH v5 1/7] vdpa: introduce .reset_map operation callback
Some device specific IOMMU parent drivers have long standing bogus behavior that mistakenly cleans up the maps during .reset. By definition, this is a violation of the on-chip IOMMU ops (i.e. .set_map, or .dma_map & .dma_unmap) in those offending drivers, as the removal of internal maps is completely agnostic to the upper layer, causing an inconsistent view between userspace and the kernel. Some userspace apps like QEMU work around this brokenness by proactively removing and adding back all the maps around vdpa device reset, but such a workaround actually penalizes other well-behaved driver setups, where vdpa reset always comes with the associated mapping cost, especially for kernel vDPA devices (use_va=false) that have a high cost on pinning. It's imperative to rectify this behavior and remove the problematic code from all those non-compliant parent drivers.

The reason why a separate .reset_map op is introduced is that this allows a simple on-chip IOMMU model without exposing too much device implementation detail to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exist other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read: virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpected way, these on-chip IOMMU devices can start with 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch the iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model.

The .reset_map is not a MUST for every parent that implements the .dma_map or .set_map API, because a device may work with DMA ops directly by implementing its own way to manipulate system memory mappings, and so doesn't have to use .reset_map to achieve a simple IOMMU device model for 1:1 passthrough mapping.

Signed-off-by: Si-Wei Liu
Acked-by: Eugenio Pérez
Acked-by: Jason Wang
---
 include/linux/vdpa.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index d376309b99cf..26ae6ae1eac3 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -327,6 +327,15 @@ struct vdpa_map_file {
  *				@iova: iova to be unmapped
  *				@size: size of the area
  *				Returns integer: success (0) or error (< 0)
+ * @reset_map:			Reset device memory mapping to the default
+ *				state (optional)
+ *				Needed for devices that are using device
+ *				specific DMA translation and prefer mapping
+ *				to be decoupled from the virtio life cycle,
+ *				i.e. device .reset op does not reset mapping
+ *				@vdev: vdpa device
+ *				@asid: address space identifier
+ *				Returns integer: success (0) or error (< 0)
  * @get_vq_dma_dev:		Get the dma device for a specific
  *				virtqueue (optional)
  *				@vdev: vdpa device
@@ -405,6 +414,7 @@ struct vdpa_config_ops {
 			   u64 iova, u64 size, u64 pa, u32 perm, void *opaque);
 	int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
 			 u64 iova, u64 size);
+	int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
 	int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
 			      unsigned int asid);
 	struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx);
-- 
2.39.3
[PATCH v5 2/7] vhost-vdpa: reset vendor specific mapping to initial state in .release
Devices with an on-chip IOMMU or a vendor specific IOTLB implementation may need to restore the iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to not work with DMA ops and to maintain a simple IOMMU model with .reset_map. In particular, device reset should not cause the mapping to go away on such an IOTLB model, so persistent mapping is implied across reset. Before the userspace process using vhost-vdpa is gone, give it a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup().

Signed-off-by: Si-Wei Liu
Acked-by: Eugenio Pérez
---
 drivers/vhost/vdpa.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 851535f57b95..c6bfe9bdde42 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
 	return vhost_vdpa_alloc_as(v, asid);
 }

+static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
+{
+	struct vdpa_device *vdpa = v->vdpa;
+	const struct vdpa_config_ops *ops = vdpa->config;
+
+	if (ops->reset_map)
+		ops->reset_map(vdpa, asid);
+}
+
 static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
 {
 	struct vhost_vdpa_as *as = asid_to_as(v, asid);
@@ -140,6 +149,14 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)

 	hlist_del(&as->hash_link);
 	vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
+	/*
+	 * Devices with vendor specific IOMMU may need to restore
+	 * iotlb to the initial or default state, which cannot be
+	 * cleaned up in the all range unmap call above. Give them
+	 * a chance to clean up or reset the map to the desired
+	 * state.
+	 */
+	vhost_vdpa_reset_map(v, asid);
 	kfree(as);

 	return 0;
-- 
2.39.3
[PATCH v5 0/7] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce the needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore the 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For context: those on-chip IOMMU parent devices create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked.

This patchset is rebased on top of the latest vhost tree.

[1] Reducing vdpa migration downtime because of memory pin / maps
https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html

---
v5:
- Squashed two fixups to the clean map patch

v4:
- Rework compatibility using new .compat_reset driver op

v3:
- add .reset_map support to vdpa_sim
- introduce module parameter to provide bug-for-bug compatibility with older userspace

v2:
- improved commit message to clarify the intended scope of .reset_map API
- improved commit messages to clarify no breakage on older userspace

v1:
- rewrote commit messages to include more detailed description and background
- reword to vendor specific IOMMU implementation from on-chip IOMMU
- include parent device backend features to persistent iotlb precondition
- reimplement mlx5_vdpa patch on top of descriptor group series

RFC v3:
- fix missing return due to merge error in patch #4

RFC v2:
- rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series:
  https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/

---
Si-Wei Liu (7):
  vdpa: introduce .reset_map operation callback
  vhost-vdpa: reset vendor specific mapping to initial state in .release
  vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
  vdpa: introduce .compat_reset operation callback
  vhost-vdpa: clean iotlb map during reset for older userspace
  vdpa/mlx5: implement .reset_map driver op
  vdpa_sim: implement .reset_map support

 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
 drivers/vdpa/mlx5/core/mr.c        | 17 ++
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 27 ++--
 drivers/vdpa/vdpa_sim/vdpa_sim.c   | 52 --
 drivers/vhost/vdpa.c               | 52 +++---
 drivers/virtio/virtio_vdpa.c       |  2 +-
 include/linux/vdpa.h               | 30 +++--
 include/uapi/linux/vhost_types.h   |  2 ++
 8 files changed, 164 insertions(+), 19 deletions(-)
-- 
2.39.3
Re: [PATCH] vhost-vdpa: fix use-after-free in _compat_vdpa_reset
On 10/25/2023 10:26 PM, Michael S. Tsirkin wrote: On Wed, Oct 25, 2023 at 04:13:14PM -0700, Si-Wei Liu wrote: When the vhost-vdpa device is being closed, vhost_vdpa_cleanup() doesn't clean up the vqs pointer after free. This could lead to use-after-free when _compat_vdpa_reset() tries to access the vqs that are freed already. Fix is to set the vqs pointer to NULL at the end of vhost_vdpa_cleanup() after it is freed, which is guarded by atomic opened state. BUG: unable to handle page fault for address: 0001005b4af4 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 16a80a067 P4D 0 Oops: [#1] PREEMPT SMP NOPTI CPU: 4 PID: 40387 Comm: qemu-kvm Not tainted 6.6.0-rc7+ #3 Hardware name: Dell Inc. PowerEdge R750/0PJ80M, BIOS 1.8.2 09/14/2022 RIP: 0010:_compat_vdpa_reset.isra.0+0x27/0xb0 [vhost_vdpa] Code: 90 90 90 0f 1f 44 00 00 41 55 4c 8d ae 08 03 00 00 41 54 55 48 89 f5 53 4c 8b a6 00 03 00 00 48 85 ff 74 49 48 8b 07 4c 89 ef <48> 8b 80 88 45 00 00 48 c1 e8 08 48 83 f0 01 89 c3 e8 73 5e 9b dc RSP: 0018:ff73a85762073ba0 EFLAGS: 00010286 RAX: 0001005b056c RBX: ff32b13ca6994c68 RCX: 0002 RDX: 0001 RSI: ff32b13c07559000 RDI: ff32b13c07559308 RBP: ff32b13c07559000 R08: R09: ff32b12ca497c0f0 R10: ff73a85762073c58 R11: 000c106f9de3 R12: ff32b12c95b1d050 R13: ff32b13c07559308 R14: ff32b12d0ddc5100 R15: 8002 FS: 7fec5b8cbf80() GS:ff32b13bbfc8() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0001005b4af4 CR3: 00015644a003 CR4: 00773ee0 PKRU: 5554 Call Trace: ? __die+0x20/0x70 ? page_fault_oops+0x76/0x170 ? exc_page_fault+0x65/0x150 ? asm_exc_page_fault+0x22/0x30 ? _compat_vdpa_reset.isra.0+0x27/0xb0 [vhost_vdpa] vhost_vdpa_open+0x57/0x280 [vhost_vdpa] ? __pfx_chrdev_open+0x10/0x10 chrdev_open+0xc6/0x260 ? __pfx_chrdev_open+0x10/0x10 do_dentry_open+0x16e/0x530 do_open+0x21c/0x400 path_openat+0x111/0x290 do_filp_open+0xb2/0x160 ? __check_object_size.part.0+0x5e/0x140 do_sys_openat2+0x96/0xd0 __x64_sys_openat+0x53/0xa0 do_syscall_64+0x59/0x90 ? 
syscall_exit_to_user_mode+0x22/0x40 ? do_syscall_64+0x69/0x90 ? syscall_exit_to_user_mode+0x22/0x40 ? do_syscall_64+0x69/0x90 ? do_syscall_64+0x69/0x90 ? syscall_exit_to_user_mode+0x22/0x40 ? do_syscall_64+0x69/0x90 ? exc_page_fault+0x65/0x150 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Fixes: 10cbf8dfaf93 ("vhost-vdpa: clean iotlb map during reset for older userspace") Fixes: ac7e98c73c05 ("vhost-vdpa: fix NULL pointer deref in _compat_vdpa_reset") So these two are all in next correct? I really do not like it how 10cbf8dfaf936e3ef1f5d7fdc6e9dada268ba6bb introduced a regression and then apparently we keep fixing things up? Sorry my bad. The latest one should be all of it. Can I squash these 3 commits? Sure. Or if you want me to send a v5 with all 3 commits squashed in, I can do for sure. Thanks, -Siwei Reported-by: Lei Yang Closes: https://lore.kernel.org/all/CAPpAL=yhdqn1aztecn3mps8o4m+bl_hvy02fdpihn7dwd91...@mail.gmail.com/ Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 9a2343c45df0..30df5c58db73 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -1355,6 +1355,7 @@ static void vhost_vdpa_cleanup(struct vhost_vdpa *v) vhost_vdpa_free_domain(v); vhost_dev_cleanup(>vdev); kfree(v->vdev.vqs); + v->vdev.vqs = NULL; } static int vhost_vdpa_open(struct inode *inode, struct file *filep) -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 0/7] vdpa: decouple reset of iotlb mapping from device reset
Hi Yang Lei, Thanks for testing my patches and reporting! As for the issue, could you please try what I posted in: https://lore.kernel.org/virtualization/1698275594-19204-1-git-send-email-si-wei@oracle.com/ and let me know how it goes? Thank you very much! Thanks, -Siwei On 10/25/2023 2:41 AM, Lei Yang wrote: On Wed, Oct 25, 2023 at 1:27 AM Si-Wei Liu wrote: Hello Si-Wei Thanks a lot for testing! Please be aware that there's a follow-up fix for a potential oops in this v4 series: First, when I did not apply patch [1], I also hit the problem mentioned in this patch. After applying it, the problem no longer reproduces. But I hit another issue; for the error messages, please review the attached file. [1] https://lore.kernel.org/virtualization/1698102863-21122-1-git-send-email-si-wei@oracle.com/ My test steps: git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git cd linux/ b4 am 1697880319-4937-1-git-send-email-si-wei@oracle.com b4 am 20231018171456.1624030-2-dtatu...@nvidia.com b4 am 1698102863-21122-1-git-send-email-si-wei@oracle.com git am ./v4_20231018_dtatulea_vdpa_add_support_for_vq_descriptor_mappings.mbx git am ./v4_20231021_si_wei_liu_vdpa_decouple_reset_of_iotlb_mapping_from_device_reset.mbx git am ./20231023_si_wei_liu_vhost_vdpa_fix_null_pointer_deref_in__compat_vdpa_reset.mbx cp /boot/config-5.14.0-377.el9.x86_64 .config make -j 32 make modules_install make install Thanks Lei https://lore.kernel.org/virtualization/1698102863-21122-1-git-send-email-si-wei@oracle.com/ Would be nice to have it applied for any tests. Thanks, -Siwei On 10/23/2023 11:51 PM, Lei Yang wrote: QE tested this series v4 with regression testing on real nic, there is no new regression bug. 
Tested-by: Lei Yang On Tue, Oct 24, 2023 at 6:02 AM Si-Wei Liu wrote: On 10/22/2023 8:51 PM, Jason Wang wrote: Hi Si-Wei: On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu wrote: In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is rebased on top of the latest vhost tree. [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html --- v4: - Rework compatibility using new .compat_reset driver op I still think having a set_backend_feature() This will overload backend features with the role of carrying over compatibility quirks, which I tried to avoid from. While I think the .compat_reset from the v4 code just works with the backend features acknowledgement (and maybe others as well) to determine, but not directly tie it to backend features itself. These two have different implications in terms of requirement, scope and maintaining/deprecation, better to cope with compat quirks in explicit and driver visible way. or reset_map(clean=true) might be better. 
An explicit op might be marginally better in driver writer's point of view. Compliant driver doesn't have to bother asserting clean_map never be true so their code would never bother dealing with this case, as explained in the commit log for patch 5 "vhost-vdpa: clean iotlb map during reset for older userspace": " The separation of .compat_reset from the regular .reset allows vhost-vdpa able to know which driver had broken behavior before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. " As it tries hard to not introduce new stuff on the bus. Honestly I don't see substantial difference between these other than the color. There's no single best solution that stands out among the 3. And I assume you already noticed it from all the above 3 approaches will have to go with backend features negotiation, that the 1st vdpa reset before backend feature negotiation will use the compliant version of .reset that doesn't clean up the map. While I don
[PATCH] vhost-vdpa: fix use-after-free in _compat_vdpa_reset
When the vhost-vdpa device is being closed, vhost_vdpa_cleanup() doesn't clean up the vqs pointer after free. This could lead to use-after-free when _compat_vdpa_reset() tries to access the vqs that are freed already. Fix is to set the vqs pointer to NULL at the end of vhost_vdpa_cleanup() after it is freed, which is guarded by atomic opened state. BUG: unable to handle page fault for address: 0001005b4af4 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 16a80a067 P4D 0 Oops: [#1] PREEMPT SMP NOPTI CPU: 4 PID: 40387 Comm: qemu-kvm Not tainted 6.6.0-rc7+ #3 Hardware name: Dell Inc. PowerEdge R750/0PJ80M, BIOS 1.8.2 09/14/2022 RIP: 0010:_compat_vdpa_reset.isra.0+0x27/0xb0 [vhost_vdpa] Code: 90 90 90 0f 1f 44 00 00 41 55 4c 8d ae 08 03 00 00 41 54 55 48 89 f5 53 4c 8b a6 00 03 00 00 48 85 ff 74 49 48 8b 07 4c 89 ef <48> 8b 80 88 45 00 00 48 c1 e8 08 48 83 f0 01 89 c3 e8 73 5e 9b dc RSP: 0018:ff73a85762073ba0 EFLAGS: 00010286 RAX: 0001005b056c RBX: ff32b13ca6994c68 RCX: 0002 RDX: 0001 RSI: ff32b13c07559000 RDI: ff32b13c07559308 RBP: ff32b13c07559000 R08: R09: ff32b12ca497c0f0 R10: ff73a85762073c58 R11: 000c106f9de3 R12: ff32b12c95b1d050 R13: ff32b13c07559308 R14: ff32b12d0ddc5100 R15: 8002 FS: 7fec5b8cbf80() GS:ff32b13bbfc8() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0001005b4af4 CR3: 00015644a003 CR4: 00773ee0 PKRU: 5554 Call Trace: ? __die+0x20/0x70 ? page_fault_oops+0x76/0x170 ? exc_page_fault+0x65/0x150 ? asm_exc_page_fault+0x22/0x30 ? _compat_vdpa_reset.isra.0+0x27/0xb0 [vhost_vdpa] vhost_vdpa_open+0x57/0x280 [vhost_vdpa] ? __pfx_chrdev_open+0x10/0x10 chrdev_open+0xc6/0x260 ? __pfx_chrdev_open+0x10/0x10 do_dentry_open+0x16e/0x530 do_open+0x21c/0x400 path_openat+0x111/0x290 do_filp_open+0xb2/0x160 ? __check_object_size.part.0+0x5e/0x140 do_sys_openat2+0x96/0xd0 __x64_sys_openat+0x53/0xa0 do_syscall_64+0x59/0x90 ? syscall_exit_to_user_mode+0x22/0x40 ? do_syscall_64+0x69/0x90 ? syscall_exit_to_user_mode+0x22/0x40 ? 
do_syscall_64+0x69/0x90 ? do_syscall_64+0x69/0x90 ? syscall_exit_to_user_mode+0x22/0x40 ? do_syscall_64+0x69/0x90 ? exc_page_fault+0x65/0x150 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Fixes: 10cbf8dfaf93 ("vhost-vdpa: clean iotlb map during reset for older userspace") Fixes: ac7e98c73c05 ("vhost-vdpa: fix NULL pointer deref in _compat_vdpa_reset") Reported-by: Lei Yang Closes: https://lore.kernel.org/all/CAPpAL=yhdqn1aztecn3mps8o4m+bl_hvy02fdpihn7dwd91...@mail.gmail.com/ Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 9a2343c45df0..30df5c58db73 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -1355,6 +1355,7 @@ static void vhost_vdpa_cleanup(struct vhost_vdpa *v) vhost_vdpa_free_domain(v); vhost_dev_cleanup(>vdev); kfree(v->vdev.vqs); + v->vdev.vqs = NULL; } static int vhost_vdpa_open(struct inode *inode, struct file *filep) -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
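The one-line fix above follows a standard defensive idiom: a pointer freed in a teardown path that can be re-entered later (here vhost_vdpa_cleanup(), followed by a subsequent open of the same chardev) is reset to NULL, so later code can test for "not allocated" instead of dereferencing freed memory. A minimal userspace sketch of the idiom, with made-up `toy_*` names standing in for the kernel structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_dev {
	int   *vqs;  /* stands in for v->vdev.vqs */
	size_t nvqs;
};

static void toy_cleanup(struct toy_dev *d)
{
	free(d->vqs);
	d->vqs = NULL; /* the one-line fix: leave no dangling pointer */
	d->nvqs = 0;
}

/* A reset path that runs after cleanup can now detect the torn-down
 * state instead of reading freed memory (the reported oops). */
static int toy_reset(struct toy_dev *d)
{
	if (!d->vqs)
		return -1; /* device not set up, nothing to reset */
	return d->vqs[0];
}
```

Without the `d->vqs = NULL` line, `toy_reset()` after `toy_cleanup()` would read freed memory, exactly the use-after-free the patch fixes.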
Re: [PATCH v4 0/7] vdpa: decouple reset of iotlb mapping from device reset
Thanks a lot for testing! Please be aware that there's a follow-up fix for a potential oops in this v4 series: https://lore.kernel.org/virtualization/1698102863-21122-1-git-send-email-si-wei@oracle.com/ Would be nice to have it applied for any tests. Thanks, -Siwei On 10/23/2023 11:51 PM, Lei Yang wrote: QE tested this series v4 with regression testing on real nic, there is no new regression bug. Tested-by: Lei Yang On Tue, Oct 24, 2023 at 6:02 AM Si-Wei Liu wrote: On 10/22/2023 8:51 PM, Jason Wang wrote: Hi Si-Wei: On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu wrote: In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is rebased on top of the latest vhost tree. [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html --- v4: - Rework compatibility using new .compat_reset driver op I still think having a set_backend_feature() This will overload backend features with the role of carrying over compatibility quirks, which I tried to avoid from. 
While I think the .compat_reset from the v4 code just works with the backend features acknowledgement (and maybe others as well) to determine, but not directly tie it to backend features itself. These two have different implications in terms of requirement, scope and maintaining/deprecation, better to cope with compat quirks in explicit and driver visible way. or reset_map(clean=true) might be better. An explicit op might be marginally better in driver writer's point of view. Compliant driver doesn't have to bother asserting clean_map never be true so their code would never bother dealing with this case, as explained in the commit log for patch 5 "vhost-vdpa: clean iotlb map during reset for older userspace": " The separation of .compat_reset from the regular .reset allows vhost-vdpa able to know which driver had broken behavior before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. " As it tries hard to not introduce new stuff on the bus. Honestly I don't see substantial difference between these other than the color. There's no single best solution that stands out among the 3. And I assume you already noticed it from all the above 3 approaches will have to go with backend features negotiation, that the 1st vdpa reset before backend feature negotiation will use the compliant version of .reset that doesn't clean up the map. While I don't think this nuance matters much to existing older userspace apps, as the maps should already get cleaned by previous process in vhost_vdpa_cleanup(), but if bug-for-bug behavioral compatibility is what you want, module parameter will be the single best answer. Regards, -Siwei But we can listen to others for sure. 
Thanks
Re: [PATCH v4 5/7] vhost-vdpa: clean iotlb map during reset for older userspace
On 10/24/2023 9:21 AM, Si-Wei Liu wrote: On 10/23/2023 10:45 PM, Jason Wang wrote: On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu wrote: Using .compat_reset op from the previous patch, the buggy .reset behaviour can be kept as-is on older userspace apps, which don't ack the IOTLB_PERSIST backend feature. As this compatibility quirk is limited to those drivers that used to be buggy in the past, it won't change the behaviour or affect ABI on the setups with API compliant driver. The separation of .compat_reset from the regular .reset allows vhost-vdpa to know which driver had broken behaviour before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + drivers/virtio/virtio_vdpa.c | 2 +- include/linux/vdpa.h | 7 +-- 3 files changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index acc7c74ba7d6..9ce40003793b 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -227,13 +227,22 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid) irq_bypass_unregister_producer(>call_ctx.producer); } -static int vhost_vdpa_reset(struct vhost_vdpa *v) +static int _compat_vdpa_reset(struct vhost_vdpa *v) { struct vdpa_device *vdpa = v->vdpa; + u32 flags = 0; - v->in_batch = 0; + flags |= !vhost_backend_has_feature(v->vdev.vqs[0], + VHOST_BACKEND_F_IOTLB_PERSIST) ? 
+ VDPA_RESET_F_CLEAN_MAP : 0; + + return vdpa_reset(vdpa, flags); +} - return vdpa_reset(vdpa); +static int vhost_vdpa_reset(struct vhost_vdpa *v) +{ + v->in_batch = 0; + return _compat_vdpa_reset(v); } static long vhost_vdpa_bind_mm(struct vhost_vdpa *v) @@ -312,7 +321,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) vhost_vdpa_unsetup_vq_irq(v, i); if (status == 0) { - ret = vdpa_reset(vdpa); + ret = _compat_vdpa_reset(v); if (ret) return ret; } else diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c index 06ce6d8c2e00..8d63e5923d24 100644 --- a/drivers/virtio/virtio_vdpa.c +++ b/drivers/virtio/virtio_vdpa.c @@ -100,7 +100,7 @@ static void virtio_vdpa_reset(struct virtio_device *vdev) { struct vdpa_device *vdpa = vd_get_vdpa(vdev); - vdpa_reset(vdpa); + vdpa_reset(vdpa, 0); } static bool virtio_vdpa_notify(struct virtqueue *vq) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 6b8cbf75712d..db15ac07f8a6 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -519,14 +519,17 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) return vdev->dma_dev; } -static inline int vdpa_reset(struct vdpa_device *vdev) +static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags) { const struct vdpa_config_ops *ops = vdev->config; int ret; down_write(>cf_lock); vdev->features_valid = false; - ret = ops->reset(vdev); + if (ops->compat_reset && flags) + ret = ops->compat_reset(vdev, flags); + else + ret = ops->reset(vdev); Instead of inventing a new API that carries the flags. Tweak the existing one seems to be simpler and better? Well, as indicated in the commit message, this allows vhost-vdpa be able to know which driver had broken behavior before, so it can apply the corresponding compatibility quirk to the individual driver when it's really necessary. 
If sending all flags unconditionally down to every driver, it's hard for driver writers to distinguish which are compatibility quirks that they can safely ignore and which are feature flags that are encouraged to implement. In that sense, gating features from being polluted by compatibility quirks with an implicit op s/implicit/explicit/ would be better. Regards, -Siwei As compat_reset(vdev, 0) == reset(vdev) Then you don't need the switch in the parent as well +static int vdpasim_reset(struct vdpa_device *vdpa) +{ + return vdpasim_compat_reset(vdpa, 0); +} Thanks up_write(>cf_lock); return ret; } -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
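The dispatch being debated in this thread, taking the optional `.compat_reset` op only when a quirk flag is set, and falling back to plain `.reset` otherwise, can be modeled in a few lines of userspace C. This is a sketch of the quoted `vdpa_reset(vdev, flags)` logic only; the `toy_*` names and the observable `map_cleaned` field are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define VDPA_RESET_F_CLEAN_MAP (1U << 0)

struct toy_vdpa;

struct toy_ops {
	int (*reset)(struct toy_vdpa *d);
	/* Optional: only drivers with historically broken .reset
	 * behaviour implement this; compliant drivers leave it NULL
	 * and never have to reason about the quirk. */
	int (*compat_reset)(struct toy_vdpa *d, unsigned int flags);
};

struct toy_vdpa {
	const struct toy_ops *ops;
	int map_cleaned; /* observable side effect for the sketch */
};

/* Mirrors the vdpa_reset() hunk quoted above: the compat path is taken
 * only if the driver implements it AND a quirk flag is actually set. */
static int toy_vdpa_reset(struct toy_vdpa *d, unsigned int flags)
{
	if (d->ops->compat_reset && flags)
		return d->ops->compat_reset(d, flags);
	return d->ops->reset(d);
}

static int plain_reset(struct toy_vdpa *d) { (void)d; return 0; }

static int legacy_compat_reset(struct toy_vdpa *d, unsigned int flags)
{
	if (flags & VDPA_RESET_F_CLEAN_MAP)
		d->map_cleaned = 1; /* bug-for-bug compatible behaviour */
	return 0;
}

static const struct toy_ops legacy_ops    = { plain_reset, legacy_compat_reset };
static const struct toy_ops compliant_ops = { plain_reset, NULL };
```

Note how a compliant driver (NULL `.compat_reset`) is never handed the flag at all, which is the "no extra burden" argument from the commit message.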
Re: [PATCH v4 5/7] vhost-vdpa: clean iotlb map during reset for older userspace
On 10/23/2023 10:45 PM, Jason Wang wrote: On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu wrote: Using .compat_reset op from the previous patch, the buggy .reset behaviour can be kept as-is on older userspace apps, which don't ack the IOTLB_PERSIST backend feature. As this compatibility quirk is limited to those drivers that used to be buggy in the past, it won't affect change the behaviour or affect ABI on the setups with API compliant driver. The separation of .compat_reset from the regular .reset allows vhost-vdpa able to know which driver had broken behaviour before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + drivers/virtio/virtio_vdpa.c | 2 +- include/linux/vdpa.h | 7 +-- 3 files changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index acc7c74ba7d6..9ce40003793b 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -227,13 +227,22 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid) irq_bypass_unregister_producer(>call_ctx.producer); } -static int vhost_vdpa_reset(struct vhost_vdpa *v) +static int _compat_vdpa_reset(struct vhost_vdpa *v) { struct vdpa_device *vdpa = v->vdpa; + u32 flags = 0; - v->in_batch = 0; + flags |= !vhost_backend_has_feature(v->vdev.vqs[0], + VHOST_BACKEND_F_IOTLB_PERSIST) ? 
+VDPA_RESET_F_CLEAN_MAP : 0; + + return vdpa_reset(vdpa, flags); +} - return vdpa_reset(vdpa); +static int vhost_vdpa_reset(struct vhost_vdpa *v) +{ + v->in_batch = 0; + return _compat_vdpa_reset(v); } static long vhost_vdpa_bind_mm(struct vhost_vdpa *v) @@ -312,7 +321,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) vhost_vdpa_unsetup_vq_irq(v, i); if (status == 0) { - ret = vdpa_reset(vdpa); + ret = _compat_vdpa_reset(v); if (ret) return ret; } else diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c index 06ce6d8c2e00..8d63e5923d24 100644 --- a/drivers/virtio/virtio_vdpa.c +++ b/drivers/virtio/virtio_vdpa.c @@ -100,7 +100,7 @@ static void virtio_vdpa_reset(struct virtio_device *vdev) { struct vdpa_device *vdpa = vd_get_vdpa(vdev); - vdpa_reset(vdpa); + vdpa_reset(vdpa, 0); } static bool virtio_vdpa_notify(struct virtqueue *vq) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 6b8cbf75712d..db15ac07f8a6 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -519,14 +519,17 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) return vdev->dma_dev; } -static inline int vdpa_reset(struct vdpa_device *vdev) +static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags) { const struct vdpa_config_ops *ops = vdev->config; int ret; down_write(>cf_lock); vdev->features_valid = false; - ret = ops->reset(vdev); + if (ops->compat_reset && flags) + ret = ops->compat_reset(vdev, flags); + else + ret = ops->reset(vdev); Instead of inventing a new API that carries the flags. Tweak the existing one seems to be simpler and better? Well, as indicated in the commit message, this allows vhost-vdpa be able to know which driver had broken behavior before, so it can apply the corresponding compatibility quirk to the individual driver when it's really necessary. 
If sending all flags unconditionally down to every driver, it's hard for driver writers to distinguish which are compatibility quirks that they can safely ignore and which are feature flags that are encouraged to implement. In that sense, gating features from being polluted by compatibility quirks with an implicit op would be better. Regards, -Siwei As compat_reset(vdev, 0) == reset(vdev) Then you don't need the switch in the parent as well +static int vdpasim_reset(struct vdpa_device *vdpa) +{ + return vdpasim_compat_reset(vdpa, 0); +} Thanks up_write(>cf_lock); return ret; } -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] vhost-vdpa: fix NULL pointer deref in _compat_vdpa_reset
As subject. There's a vhost_vdpa_reset() done earlier before vhost_dev is initialized via vhost_dev_init(), ending up with NULL pointer dereference. Fix is to check if vqs is initialized before checking backend features and resetting the device. BUG: kernel NULL pointer dereference, address: #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 P4D 0 Oops: [#1] SMP CPU: 3 PID: 1727 Comm: qemu-system-x86 Not tainted 6.6.0-rc6+ #2 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel- a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:_compat_vdpa_reset+0x47/0xc0 [vhost_vdpa] Code: c7 c7 fb 12 56 a0 4c 8d a5 b8 02 00 00 48 89 ea e8 7e b8 c4 48 89 ee 48 c7 c7 19 13 56 a0 4c 8b ad b0 02 00 00 <48> 8b 00 49 00 48 8b 80 88 45 00 00 48 c1 e8 08 48 RSP: 0018:8881063c3c38 EFLAGS: 00010246 RAX: RBX: 8881074eb800 RCX: RDX: RSI: 888103ab4000 RDI: a0561319 RBP: 888103ab4000 R08: dfff R09: 0001 R10: 0003 R11: 7fecbac0 R12: 888103ab42b8 R13: 888106dbe850 R14: 0003 R15: 8881074ebc18 FS: 7f02fba6ef00() GS:5f8c() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 0001325e5003 CR4: 00372ea0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: ? __die+0x1f/0x60 ? page_fault_oops+0x14c/0x3b0 ? exc_page_fault+0x74/0x140 ? asm_exc_page_fault+0x22/0x30 ? _compat_vdpa_reset+0x47/0xc0 [vhost_vdpa] ? _compat_vdpa_reset+0x32/0xc0 [vhost_vdpa] vhost_vdpa_open+0x55/0x270 [vhost_vdpa] ? sb_init_dio_done_wq+0x50/0x50 chrdev_open+0xc0/0x210 ? __unregister_chrdev+0x50/0x50 do_dentry_open+0x1fc/0x4f0 path_openat+0xc2d/0xf20 do_filp_open+0xb4/0x160 ? 
kmem_cache_alloc+0x3c/0x490 do_sys_openat2+0x8d/0xc0 __x64_sys_openat+0x6a/0xa0 do_syscall_64+0x3c/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: 10cbf8dfaf93 ("vhost-vdpa: clean iotlb map during reset for older userspace") Reported-by: Dragos Tatulea Closes: https://lore.kernel.org/all/b4913f84-8b52-4d28-af51-8573dc361...@oracle.com/ Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 9ce40003793b..9a2343c45df0 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -232,9 +232,11 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v) struct vdpa_device *vdpa = v->vdpa; u32 flags = 0; - flags |= !vhost_backend_has_feature(v->vdev.vqs[0], - VHOST_BACKEND_F_IOTLB_PERSIST) ? -VDPA_RESET_F_CLEAN_MAP : 0; + if (v->vdev.vqs) { + flags |= !vhost_backend_has_feature(v->vdev.vqs[0], + VHOST_BACKEND_F_IOTLB_PERSIST) ? +VDPA_RESET_F_CLEAN_MAP : 0; + } return vdpa_reset(vdpa, flags); } -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
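The fix guards the backend-feature lookup because vhost_vdpa_open() performs an early reset before vhost_dev_init() has allocated the vqs array, so `v->vdev.vqs[0]` is a NULL dereference at that point. A minimal userspace model of the guarded flag computation (made-up `toy_*` names; only the shape of the check matches the patch):

```c
#include <assert.h>
#include <stddef.h>

#define VHOST_BACKEND_F_IOTLB_PERSIST (1ULL << 0)
#define VDPA_RESET_F_CLEAN_MAP        (1U << 0)

struct toy_vq {
	unsigned long long acked_backend_features;
};

struct toy_vhost_vdpa {
	struct toy_vq **vqs; /* NULL until vhost_dev_init() has run */
};

/* Mirrors the fixed _compat_vdpa_reset(): consult vqs[0] only when the
 * array exists; on the early pre-init reset, no quirk flag is raised. */
static unsigned int toy_compat_reset_flags(struct toy_vhost_vdpa *v)
{
	unsigned int flags = 0;

	if (v->vqs &&
	    !(v->vqs[0]->acked_backend_features & VHOST_BACKEND_F_IOTLB_PERSIST))
		flags |= VDPA_RESET_F_CLEAN_MAP;
	return flags;
}
```

The early reset therefore behaves like the compliant path (no map cleaning), which is harmless because nothing has been mapped by a just-opened device yet, matching the reasoning in the thread.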
Re: [PATCH v4 5/7] vhost-vdpa: clean iotlb map during reset for older userspace
(+ linux-next) Hi Michael, Dragos reported below oops for which I have a fix at hand (having it fully tested), ready to be posted to linux-next. Please let me know if you want me to respin the original patch series, or you would think it'd be fine to fix it on top. On 10/23/2023 11:59 AM, Dragos Tatulea wrote: On Sat, 2023-10-21 at 02:25 -0700, Si-Wei Liu wrote: Using .compat_reset op from the previous patch, the buggy .reset behaviour can be kept as-is on older userspace apps, which don't ack the IOTLB_PERSIST backend feature. As this compatibility quirk is limited to those drivers that used to be buggy in the past, it won't affect change the behaviour or affect ABI on the setups with API compliant driver. The separation of .compat_reset from the regular .reset allows vhost-vdpa able to know which driver had broken behaviour before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + drivers/virtio/virtio_vdpa.c | 2 +- include/linux/vdpa.h | 7 +-- 3 files changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index acc7c74ba7d6..9ce40003793b 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -227,13 +227,22 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid) irq_bypass_unregister_producer(>call_ctx.producer); } -static int vhost_vdpa_reset(struct vhost_vdpa *v) +static int _compat_vdpa_reset(struct vhost_vdpa *v) { struct vdpa_device *vdpa = v->vdpa; + u32 flags = 0; - v->in_batch = 0; + flags |= !vhost_backend_has_feature(v->vdev.vqs[0], + VHOST_BACKEND_F_IOTLB_PERSIST) ? + VDPA_RESET_F_CLEAN_MAP : 0; Hi Si-Wei, I am getting a Oops due to the vqs not being initialized here. 
Here's how it it looks like: [ 37.817075] BUG: kernel NULL pointer dereference, address: [ 37.817674] #PF: supervisor read access in kernel mode [ 37.818150] #PF: error_code(0x) - not-present page [ 37.818615] PGD 0 P4D 0 [ 37.818893] Oops: [#1] SMP [ 37.819223] CPU: 3 PID: 1727 Comm: qemu-system-x86 Not tainted 6.6.0-rc6+ #2 [ 37.819829] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel- 1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 37.820791] RIP: 0010:_compat_vdpa_reset+0x47/0xc0 [vhost_vdpa] [ 37.821316] Code: c7 c7 fb 12 56 a0 4c 8d a5 b8 02 00 00 48 89 ea e8 7e b8 c4 e0 48 8b 43 28 48 89 ee 48 c7 c7 19 13 56 a0 4c 8b ad b0 02 00 00 <48> 8b 00 49 8b 95 d8 00 00 00 48 8b 80 88 45 00 00 48 c1 e8 08 48 [ 37.822811] RSP: 0018:8881063c3c38 EFLAGS: 00010246 [ 37.823285] RAX: RBX: 8881074eb800 RCX: [ 37.823893] RDX: RSI: 888103ab4000 RDI: a0561319 [ 37.824506] RBP: 888103ab4000 R08: dfff R09: 0001 [ 37.825116] R10: 0003 R11: 7fecbac0 R12: 888103ab42b8 [ 37.825721] R13: 888106dbe850 R14: 0003 R15: 8881074ebc18 [ 37.826326] FS: 7f02fba6ef00() GS:5f8c() knlGS: [ 37.827035] CS: 0010 DS: ES: CR0: 80050033 [ 37.827552] CR2: CR3: 0001325e5003 CR4: 00372ea0 [ 37.828162] DR0: DR1: DR2: [ 37.828772] DR3: DR6: fffe0ff0 DR7: 0400 [ 37.829381] Call Trace: [ 37.829660] [ 37.829911] ? __die+0x1f/0x60 [ 37.830234] ? page_fault_oops+0x14c/0x3b0 [ 37.830623] ? exc_page_fault+0x74/0x140 [ 37.830999] ? asm_exc_page_fault+0x22/0x30 [ 37.831402] ? _compat_vdpa_reset+0x47/0xc0 [vhost_vdpa] [ 37.831888] ? _compat_vdpa_reset+0x32/0xc0 [vhost_vdpa] [ 37.832366] vhost_vdpa_open+0x55/0x270 [vhost_vdpa] [ 37.832821] ? sb_init_dio_done_wq+0x50/0x50 [ 37.833225] chrdev_open+0xc0/0x210 [ 37.833582] ? __unregister_chrdev+0x50/0x50 [ 37.833990] do_dentry_open+0x1fc/0x4f0 [ 37.834363] path_openat+0xc2d/0xf20 [ 37.834721] do_filp_open+0xb4/0x160 [ 37.835082] ? 
kmem_cache_alloc+0x3c/0x490 [ 37.835474] do_sys_openat2+0x8d/0xc0 [ 37.835834] __x64_sys_openat+0x6a/0xa0 [ 37.836208] do_syscall_64+0x3c/0x80 [ 37.836564] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 37.837021] RIP: 0033:0x7f02fcc2c085 [ 37.837378] Code: 8b 55 d0 48 89 45 b0 75 a0 44 89 55 9c e8 63 7d f8 ff 44 8b 55 9c 89 da 4c 89 e6 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 89 45 9c e8 b8 7d f8 ff 8b 45 9c [ 37.838891] RSP: 002b:7ffdea3c8cc0 EFLAGS: 0293 ORIG_RAX: 0101 [ 37.839571] R
Re: [PATCH v4 0/7] vdpa: decouple reset of iotlb mapping from device reset
On 10/22/2023 8:51 PM, Jason Wang wrote: Hi Si-Wei: On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu wrote: In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is rebased on top of the latest vhost tree. [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html --- v4: - Rework compatibility using new .compat_reset driver op I still think having a set_backend_feature() This will overload backend features with the role of carrying over compatibility quirks, which I tried to avoid from. While I think the .compat_reset from the v4 code just works with the backend features acknowledgement (and maybe others as well) to determine, but not directly tie it to backend features itself. These two have different implications in terms of requirement, scope and maintaining/deprecation, better to cope with compat quirks in explicit and driver visible way. or reset_map(clean=true) might be better. An explicit op might be marginally better in driver writer's point of view. 
A compliant driver doesn't have to bother asserting that clean_map can never be true, so its code never has to deal with this case, as explained in the commit log for patch 5 "vhost-vdpa: clean iotlb map during reset for older userspace": " The separation of .compat_reset from the regular .reset allows vhost-vdpa to know which driver had broken behavior before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. " As it tries hard to not introduce new stuff on the bus. Honestly I don't see substantial difference between these other than the color. There's no single best solution that stands out among the 3. And I assume you already noticed that all of the above 3 approaches will have to go with backend features negotiation: the 1st vdpa reset before backend feature negotiation will use the compliant version of .reset that doesn't clean up the map. I don't think this nuance matters much to existing older userspace apps, as the maps should already get cleaned up by the previous process in vhost_vdpa_cleanup(), but if bug-for-bug behavioral compatibility is what you want, a module parameter will be the single best answer. Regards, -Siwei But we can listen to others for sure. Thanks ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 5/7] vhost-vdpa: clean iotlb map during reset for older userspace
Using the .compat_reset op from the previous patch, the buggy .reset behaviour can be kept as-is for older userspace apps, which don't ack the IOTLB_PERSIST backend feature. As this compatibility quirk is limited to those drivers that used to be buggy in the past, it won't change the behaviour or affect the ABI on setups with an API-compliant driver. The separation of .compat_reset from the regular .reset allows vhost-vdpa to know which driver had broken behaviour before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + drivers/virtio/virtio_vdpa.c | 2 +- include/linux/vdpa.h | 7 +-- 3 files changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index acc7c74ba7d6..9ce40003793b 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -227,13 +227,22 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid) irq_bypass_unregister_producer(&vq->call_ctx.producer); } -static int vhost_vdpa_reset(struct vhost_vdpa *v) +static int _compat_vdpa_reset(struct vhost_vdpa *v) { struct vdpa_device *vdpa = v->vdpa; + u32 flags = 0; - v->in_batch = 0; + flags |= !vhost_backend_has_feature(v->vdev.vqs[0], + VHOST_BACKEND_F_IOTLB_PERSIST) ? 
+VDPA_RESET_F_CLEAN_MAP : 0; + + return vdpa_reset(vdpa, flags); +} - return vdpa_reset(vdpa); +static int vhost_vdpa_reset(struct vhost_vdpa *v) +{ + v->in_batch = 0; + return _compat_vdpa_reset(v); } static long vhost_vdpa_bind_mm(struct vhost_vdpa *v) @@ -312,7 +321,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) vhost_vdpa_unsetup_vq_irq(v, i); if (status == 0) { - ret = vdpa_reset(vdpa); + ret = _compat_vdpa_reset(v); if (ret) return ret; } else diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c index 06ce6d8c2e00..8d63e5923d24 100644 --- a/drivers/virtio/virtio_vdpa.c +++ b/drivers/virtio/virtio_vdpa.c @@ -100,7 +100,7 @@ static void virtio_vdpa_reset(struct virtio_device *vdev) { struct vdpa_device *vdpa = vd_get_vdpa(vdev); - vdpa_reset(vdpa); + vdpa_reset(vdpa, 0); } static bool virtio_vdpa_notify(struct virtqueue *vq) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 6b8cbf75712d..db15ac07f8a6 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -519,14 +519,17 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) return vdev->dma_dev; } -static inline int vdpa_reset(struct vdpa_device *vdev) +static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags) { const struct vdpa_config_ops *ops = vdev->config; int ret; down_write(&vdev->cf_lock); vdev->features_valid = false; - ret = ops->reset(vdev); + if (ops->compat_reset && flags) + ret = ops->compat_reset(vdev, flags); + else + ret = ops->reset(vdev); up_write(&vdev->cf_lock); return ret; } -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
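[Editor's note] The dispatch at the heart of this patch — vdpa_reset() choosing .compat_reset only when the driver implements it and a compat quirk flag is set — can be modeled as a small self-contained userspace sketch. All toy_* names and the demo function are made up for illustration; this is not kernel code.

```c
#include <stddef.h>

/* Toy userspace model of the vdpa_reset() dispatch introduced in this
 * patch; names mirror the kernel API, but this is not kernel code. */
#define VDPA_RESET_F_CLEAN_MAP 1

struct toy_config_ops {
	int (*reset)(void);
	int (*compat_reset)(unsigned int flags);
};

static int plain_calls, compat_calls;
static int toy_reset(void) { plain_calls++; return 0; }
static int toy_compat_reset(unsigned int flags) { (void)flags; compat_calls++; return 0; }

/* Mirrors the patched vdpa_reset(): use ->compat_reset only when the
 * driver provides it AND a compatibility flag is set. */
int toy_vdpa_reset(const struct toy_config_ops *ops, unsigned int flags)
{
	if (ops->compat_reset && flags)
		return ops->compat_reset(flags);
	return ops->reset();
}

/* Runs three resets and returns (plain_calls << 4) | compat_calls. */
int reset_dispatch_demo(void)
{
	const struct toy_config_ops compliant = { .reset = toy_reset };
	const struct toy_config_ops legacy = { .reset = toy_reset,
					       .compat_reset = toy_compat_reset };

	/* A compliant driver never sees the flag, whatever userspace acked. */
	toy_vdpa_reset(&compliant, VDPA_RESET_F_CLEAN_MAP);
	/* A legacy driver gets the quirk only for old userspace
	 * (i.e. IOTLB_PERSIST was not acked, so CLEAN_MAP is set). */
	toy_vdpa_reset(&legacy, VDPA_RESET_F_CLEAN_MAP);
	toy_vdpa_reset(&legacy, 0);
	return (plain_calls << 4) | compat_calls;
}
```

This illustrates the commit-log claim that compliant drivers carry no extra burden: their ops table simply never gains a .compat_reset entry, so the flag can never reach them.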
[PATCH v4 6/7] vdpa/mlx5: implement .reset_map driver op
Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with a preallocated 1:1 DMA MR at device creation time. This 1:1 DMA MR will be implicitly destroyed when the first .set_map call is invoked, in which case callers like vhost-vdpa will start to set up custom mappings. When the .reset callback is invoked, the custom mappings will be cleared and the 1:1 DMA MR will be re-created. In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept around intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR (including cvq mapping) on the given ASID and recreate the initial DMA mapping. That way, the device .reset op runs free from having to maintain and clean up memory mappings by itself. Additionally, implement .compat_reset to cater for older userspace, which may expect mappings to be cleared during reset. Co-developed-by: Dragos Tatulea Signed-off-by: Dragos Tatulea Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 + drivers/vdpa/mlx5/net/mlx5_vnet.c | 27 --- 3 files changed, 42 insertions(+), 3 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index db988ced5a5d..84547d998bcf 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 66530e28f327..2197c46e563a 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev) return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0); } + +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +{ + if (asid >= MLX5_VDPA_NUM_AS) + return -EINVAL; + + mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]); + + if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if (mlx5_vdpa_create_dma_mr(mvdev)) + mlx5_vdpa_warn(mvdev, "create DMA MR failed\n"); + } else { + mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid); + } + + return 0; +} diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index f4516a2d5bb0..12ac3397f39b 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -2876,7 +2876,7 @@ static void init_group_to_asid_map(struct mlx5_vdpa_dev *mvdev) mvdev->group2asid[i] = 0; } -static int mlx5_vdpa_reset(struct vdpa_device *vdev) +static int mlx5_vdpa_compat_reset(struct vdpa_device *vdev, u32 flags) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); @@ -2888,7 +2888,8 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) unregister_link_notifier(ndev); teardown_driver(ndev); clear_vqs_ready(ndev); - mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); + if (flags & VDPA_RESET_F_CLEAN_MAP) + mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); ndev->mvdev.status = 0; ndev->mvdev.suspended = false; ndev->cur_num_vqs = 0; @@ -2899,7 +2900,8 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) init_group_to_asid_map(mvdev); ++mvdev->generation; - if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if ((flags & VDPA_RESET_F_CLEAN_MAP) && + MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { if 
(mlx5_vdpa_create_dma_mr(mvdev)) mlx5_vdpa_warn(mvdev, "create MR failed\n"); } @@ -2908,6 +2910,11 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) return 0; } +static int mlx5_vdpa_reset(struct vdpa_device *vdev) +{ + return mlx5_vdpa_compat_reset(vdev, 0); +} + static size_t mlx5_vdpa_get_config_size(struct vdpa_device *vdev) { return sizeof(struct virtio_net_config); @@ -2987,6 +2994,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid, return err; } +static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsi
[PATCH v4 7/7] vdpa_sim: implement .reset_map support
In order to reduce excessive memory mapping cost in live migration and VM reboot, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the iotlb on the given ASID and recreate the 1:1 passthrough/identity mapping. To be consistent, the mapping on device creation is initialized to passthrough/identity with PA 1:1 mapped as IOVA. With this the device .reset op doesn't have to maintain and clean up memory mappings by itself. Additionally, implement .compat_reset to cater for older userspace, which may expect mappings to be cleared during reset. Signed-off-by: Si-Wei Liu Tested-by: Stefano Garzarella --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 52 ++-- 1 file changed, 43 insertions(+), 9 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 76d41058add9..be2925d0d283 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -139,7 +139,7 @@ static void vdpasim_vq_reset(struct vdpasim *vdpasim, vq->vring.notify = NULL; } -static void vdpasim_do_reset(struct vdpasim *vdpasim) +static void vdpasim_do_reset(struct vdpasim *vdpasim, u32 flags) { int i; @@ -151,11 +151,13 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim) &vdpasim->iommu_lock); } - for (i = 0; i < vdpasim->dev_attr.nas; i++) { - vhost_iotlb_reset(&vdpasim->iommu[i]); - vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, - 0, VHOST_MAP_RW); - vdpasim->iommu_pt[i] = true; + if (flags & VDPA_RESET_F_CLEAN_MAP) { + for (i = 0; i < vdpasim->dev_attr.nas; i++) { + vhost_iotlb_reset(&vdpasim->iommu[i]); + vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, + 0, VHOST_MAP_RW); + vdpasim->iommu_pt[i] = true; + } } vdpasim->running = true; @@ -259,8 +261,12 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr, if (!vdpasim->iommu_pt) goto err_iommu; - for (i = 0; i < vdpasim->dev_attr.nas; i++) + 
for (i = 0; i < vdpasim->dev_attr.nas; i++) { vhost_iotlb_init(&vdpasim->iommu[i], max_iotlb_entries, 0); + vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, 0, + VHOST_MAP_RW); + vdpasim->iommu_pt[i] = true; + } for (i = 0; i < dev_attr->nvqs; i++) vringh_set_iotlb(&vdpasim->vqs[i].vring, &vdpasim->iommu[0], @@ -480,18 +486,23 @@ static void vdpasim_set_status(struct vdpa_device *vdpa, u8 status) mutex_unlock(&vdpasim->mutex); } -static int vdpasim_reset(struct vdpa_device *vdpa) +static int vdpasim_compat_reset(struct vdpa_device *vdpa, u32 flags) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); mutex_lock(&vdpasim->mutex); vdpasim->status = 0; - vdpasim_do_reset(vdpasim); + vdpasim_do_reset(vdpasim, flags); mutex_unlock(&vdpasim->mutex); return 0; } +static int vdpasim_reset(struct vdpa_device *vdpa) +{ + return vdpasim_compat_reset(vdpa, 0); +} + static int vdpasim_suspend(struct vdpa_device *vdpa) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); @@ -637,6 +648,25 @@ static int vdpasim_set_map(struct vdpa_device *vdpa, unsigned int asid, return ret; } +static int vdpasim_reset_map(struct vdpa_device *vdpa, unsigned int asid) +{ + struct vdpasim *vdpasim = vdpa_to_sim(vdpa); + + if (asid >= vdpasim->dev_attr.nas) + return -EINVAL; + + spin_lock(&vdpasim->iommu_lock); + if (vdpasim->iommu_pt[asid]) + goto out; + vhost_iotlb_reset(&vdpasim->iommu[asid]); + vhost_iotlb_add_range(&vdpasim->iommu[asid], 0, ULONG_MAX, + 0, VHOST_MAP_RW); + vdpasim->iommu_pt[asid] = true; +out: + spin_unlock(&vdpasim->iommu_lock); + return 0; +} + static int vdpasim_bind_mm(struct vdpa_device *vdpa, struct mm_struct *mm) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); @@ -749,6 +779,7 @@ static const struct vdpa_config_ops vdpasim_config_ops = { .get_status = vdpasim_get_status, .set_status = vdpasim_set_status, .reset = vdpasim_reset, + .compat_reset = vdpasim_compat_reset, .suspend= vdpasim_suspend, .resume = vdpasim_resume, .get_config_size= vdpasim_get_config_size, @@ -759,6 +790,7 @@ static const struct vdpa_config_ops vdpasim_config_ops = { .set_group_asid = vdpasim_set
[PATCH v4 3/7] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to tell whether the vhost-vdpa iotlb in the kernel can be trusted to persist IOTLB mapping across vDPA reset. Without it, userspace has no way to tell whether it's running on an older kernel, which could silently drop all iotlb mapping across vDPA reset, especially with broken parent driver implementation for the .reset driver op. The broken driver may incorrectly drop all mappings of its own as part of .reset, which inadvertently ends up with corrupted mapping state between vhost-vdpa userspace and the kernel. As a workaround, to make the mapping behaviour predictable across reset, userspace has to proactively remove all mappings before vDPA reset, and then restore all the mappings afterwards. This workaround is done unconditionally on top of all parent drivers today, due to the parent driver implementation issue and no means to differentiate. This workaround has been used in QEMU since day one, when the corresponding vhost-vdpa userspace backend first appeared. There are 3 cases where the backend may claim this feature bit: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver - parent device with vendor specific IOMMU implementation with persistent IOTLB mapping already that has to specifically declare this backend feature The reason .reset_map is one of the preconditions for persistent iotlb is that without it, vhost-vdpa can't switch the iotlb back to the initial state later on, especially for the on-chip IOMMU case which starts with identity mapping at device creation. virtio-vdpa requires on-chip IOMMU to perform 1:1 passthrough translation from PA to IOVA as-is to begin with, and .reset_map is the only means to turn the iotlb back to the identity mapping mode after vhost-vdpa is gone. 
The difference in behavior did not matter as QEMU unmaps all the memory unregistering the memory listener at vhost_vdpa_dev_start( started = false), but the backend acknowledging this feature flag allows QEMU to make sure it is safe to skip this unmap & map in the case of vhost stop & start cycle. In that sense, this feature flag is actually a signal for userspace to know that the driver bug has been solved. Not offering it indicates that userspace cannot trust the kernel will retain the maps. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 15 +++ include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index c6bfe9bdde42..acc7c74ba7d6 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -439,6 +439,15 @@ static u64 vhost_vdpa_get_backend_features(const struct vhost_vdpa *v) return ops->get_backend_features(vdpa); } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map || + vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); +} + static long vhost_vdpa_set_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -726,6 +735,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME) | BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK))) @@ -742,6 +752,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + 
return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -797,6 +810,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); features |= vhost_vdpa_get_backend_features(v); if (copy_to_user(featurep, &features, sizeof(features
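[Editor's note] The gating predicate added by this patch — vhost_vdpa_has_persistent_map() covering the three cases listed in the commit message — can be restated as a small userspace sketch. The TOY_F_IOTLB_PERSIST bit value and all toy_* names are made up for illustration; this is not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy restatement of the vhost_vdpa_has_persistent_map() predicate
 * from this patch, in plain userspace C (not kernel code). The bit
 * value of TOY_F_IOTLB_PERSIST is illustrative only. */
#define TOY_F_IOTLB_PERSIST (1ULL << 7)

struct toy_parent {
	bool has_set_map;       /* parent implements .set_map */
	bool has_dma_map;       /* parent implements .dma_map */
	bool has_reset_map;     /* parent implements .reset_map */
	uint64_t backend_features;
};

/* Persistent maps can be promised when: the parent works with the
 * platform IOMMU (no .set_map/.dma_map), or it can restore the
 * initial mapping via .reset_map, or it declares IOTLB_PERSIST
 * by itself (vendor specific IOMMU with persistent iotlb). */
bool toy_has_persistent_map(const struct toy_parent *p)
{
	return (!p->has_set_map && !p->has_dma_map) ||
	       p->has_reset_map ||
	       (p->backend_features & TOY_F_IOTLB_PERSIST) != 0;
}

/* Encode the three commit-message cases as bits of the return value. */
int persistent_map_demo(void)
{
	const struct toy_parent platform_iommu = { 0 };
	const struct toy_parent onchip_no_reset_map = { .has_set_map = true };
	const struct toy_parent onchip_with_reset_map = { .has_set_map = true,
							  .has_reset_map = true };

	return (toy_has_persistent_map(&platform_iommu) << 2) |
	       (toy_has_persistent_map(&onchip_no_reset_map) << 1) |
	        toy_has_persistent_map(&onchip_with_reset_map);
}
```

Note how an on-chip IOMMU parent without .reset_map (and without declaring the bit itself) is the only combination that cannot offer IOTLB_PERSIST, matching the -EOPNOTSUPP path above.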
[PATCH v4 2/7] vhost-vdpa: reset vendor specific mapping to initial state in .release
Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to not work with DMA ops and maintain a simple IOMMU model with .reset_map. In particular, device reset should not cause mapping to go away on such IOTLB model, so persistent mapping is implied across reset. Before the userspace process using vhost-vdpa is gone, give it a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f57b95..c6bfe9bdde42 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,14 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state, which cannot be +* cleaned up in the all range unmap call above. Give them +* a chance to clean up or reset the map to the desired +* state. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 1/7] vdpa: introduce .reset_map operation callback
Some device specific IOMMU parent drivers have a long-standing bogus behavior of mistakenly cleaning up the maps during .reset. By definition, this is a violation of the on-chip IOMMU ops (i.e. .set_map, or .dma_map & .dma_unmap) in those offending drivers, as the removal of internal maps is completely agnostic to the upper layer, causing an inconsistent view between the userspace and the kernel. Some userspace apps like QEMU get around this brokenness by proactively removing and adding back all the maps around vdpa device reset, but such a workaround actually penalizes other well-behaved driver setups, where vdpa reset always comes with the associated mapping cost, especially for kernel vDPA devices (use_va=false) that have high cost on pinning. It's imperative to rectify this behavior and remove the problematic code from all those non-compliant parent drivers. The reason why a separate .reset_map op is introduced is because this allows a simple on-chip IOMMU model without exposing too much device implementation detail to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exist other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpected way, these on-chip IOMMU devices can start with 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch the iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model. 
The .reset_map op is not a MUST for every parent that implements the .dma_map or .set_map API, because devices may work with DMA ops directly by implementing their own way to manipulate system memory mappings, so they don't have to use .reset_map to achieve a simple IOMMU device model for 1:1 passthrough mapping. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez Acked-by: Jason Wang --- include/linux/vdpa.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index d376309b99cf..26ae6ae1eac3 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -327,6 +327,15 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping to the default + * state (optional) + * Needed for devices that are using device + * specific DMA translation and prefer mapping + * to be decoupled from the virtio life cycle, + * i.e. device .reset op does not reset mapping + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -405,6 +414,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
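[Editor's note] The identity-mapping life cycle described in this commit message — 1:1 passthrough at creation, implicitly abandoned by the first .dma_map/.set_map, restored by .reset_map — can be sketched as a toy userspace model. All toy_* names and the demo function are made up for illustration; none of this is kernel code.

```c
#include <stdbool.h>

/* Toy model of the identity-mapping life cycle that .reset_map is
 * meant to restore. Not kernel code; names are illustrative. */
#define TOY_MAX_RANGES 8

struct toy_iotlb {
	unsigned long start[TOY_MAX_RANGES], last[TOY_MAX_RANGES];
	int nranges;
	bool identity; /* still in the initial 1:1 passthrough state? */
};

static void toy_iotlb_reset(struct toy_iotlb *tlb) { tlb->nranges = 0; }

static void toy_iotlb_add_range(struct toy_iotlb *tlb,
				unsigned long start, unsigned long last)
{
	tlb->start[tlb->nranges] = start;
	tlb->last[tlb->nranges] = last;
	tlb->nranges++;
}

/* Like .dma_map/.set_map: installing a custom map leaves identity mode. */
void toy_dma_map(struct toy_iotlb *tlb, unsigned long iova, unsigned long size)
{
	if (tlb->identity) {
		toy_iotlb_reset(tlb);
		tlb->identity = false;
	}
	toy_iotlb_add_range(tlb, iova, iova + size - 1);
}

/* Like .reset_map: back to a single 1:1 passthrough range. */
void toy_reset_map(struct toy_iotlb *tlb)
{
	if (tlb->identity)
		return; /* already in the initial state */
	toy_iotlb_reset(tlb);
	toy_iotlb_add_range(tlb, 0, ~0UL);
	tlb->identity = true;
}

/* Map two custom ranges, then reset; returns the final range count. */
int reset_map_demo(void)
{
	struct toy_iotlb tlb = { .identity = true, .nranges = 1 };
	tlb.start[0] = 0; tlb.last[0] = ~0UL;

	toy_dma_map(&tlb, 0x1000, 0x2000);
	toy_dma_map(&tlb, 0x8000, 0x1000);
	if (tlb.nranges != 2 || tlb.identity)
		return -1;
	toy_reset_map(&tlb);
	return (tlb.identity && tlb.last[0] == ~0UL) ? tlb.nranges : -1;
}
```

The point of the sketch: .reset_map gives the upper layer a one-call way back to the initial state, so virtio-vdpa can reattach and find 1:1 passthrough without the parent abusing DMA ops.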
[PATCH v4 4/7] vdpa: introduce .compat_reset operation callback
Some device specific IOMMU parent drivers have a long-standing bogus behaviour of mistakenly cleaning up the maps during .reset. By definition, this is a violation of the on-chip IOMMU ops (i.e. .set_map, or .dma_map & .dma_unmap) in those offending drivers, as the removal of internal maps is completely agnostic to the upper layer, causing an inconsistent view between the userspace and the kernel. Some userspace apps like QEMU get around this brokenness by proactively removing and adding back all the maps around vdpa device reset, but such a workaround actually penalizes other well-behaved driver setups, where vdpa reset always comes with the associated mapping cost, especially for kernel vDPA devices (use_va=false) that have high cost on pinning. It's imperative to rectify this behaviour and remove the problematic code from all those non-compliant parent drivers. However, we cannot unconditionally remove the bogus map-cleaning code from the buggy .reset implementation, as there might exist userspace apps that already rely on the behaviour on some setup. Introduce a .compat_reset driver op to keep compatibility with older userspace. New and well-behaved parent drivers should not bother to implement such an op, but only those drivers that are doing or used to do non-compliant map-cleaning reset will have to. Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 13 + 1 file changed, 13 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 26ae6ae1eac3..6b8cbf75712d 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -252,6 +252,17 @@ struct vdpa_map_file { * @reset: Reset device * @vdev: vdpa device * Returns integer: success (0) or error (< 0) + * @compat_reset: Reset device with compatibility quirks to + * accommodate older userspace. Only needed by + * parent driver which used to have bogus reset + * behaviour, and has to maintain such behaviour + * for compatibility with older userspace. 
+ * A historically compliant driver only has to + * implement .reset; a historically non-compliant + * driver should implement both. + * @vdev: vdpa device + * @flags: compatibility quirks for reset + * Returns integer: success (0) or error (< 0) * @suspend: Suspend the device (optional) * @vdev: vdpa device * Returns integer: success (0) or error (< 0) @@ -393,6 +404,8 @@ struct vdpa_config_ops { u8 (*get_status)(struct vdpa_device *vdev); void (*set_status)(struct vdpa_device *vdev, u8 status); int (*reset)(struct vdpa_device *vdev); + int (*compat_reset)(struct vdpa_device *vdev, u32 flags); +#define VDPA_RESET_F_CLEAN_MAP 1 int (*suspend)(struct vdpa_device *vdev); int (*resume)(struct vdpa_device *vdev); size_t (*get_config_size)(struct vdpa_device *vdev); -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 0/7] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is rebased on top of the latest vhost tree. 
[1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html --- v4: - Rework compatibility using new .compat_reset driver op v3: - add .reset_map support to vdpa_sim - introduce module parameter to provide bug-for-bug compatibility with older userspace v2: - improved commit message to clarify the intended scope of .reset_map API - improved commit messages to clarify no breakage on older userspace v1: - rewrote commit messages to include more detailed description and background - reword to vendor specific IOMMU implementation from on-chip IOMMU - include parent device backend features to persistent iotlb precondition - reimplement mlx5_vdpa patch on top of descriptor group series RFC v3: - fix missing return due to merge error in patch #4 RFC v2: - rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/ --- Si-Wei Liu (7): vdpa: introduce .reset_map operation callback vhost-vdpa: reset vendor specific mapping to initial state in .release vhost-vdpa: introduce IOTLB_PERSIST backend feature bit vdpa: introduce .compat_reset operation callback vhost-vdpa: clean iotlb map during reset for older userspace vdpa/mlx5: implement .reset_map driver op vdpa_sim: implement .reset_map support drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 ++ drivers/vdpa/mlx5/net/mlx5_vnet.c | 27 ++-- drivers/vdpa/vdpa_sim/vdpa_sim.c | 52 -- drivers/vhost/vdpa.c | 49 +--- drivers/virtio/virtio_vdpa.c | 2 +- include/linux/vdpa.h | 30 +++-- include/uapi/linux/vhost_types.h | 2 ++ 8 files changed, 161 insertions(+), 19 deletions(-) -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/19/2023 9:11 PM, Jason Wang wrote: On Fri, Oct 20, 2023 at 6:28 AM Si-Wei Liu wrote: On 10/19/2023 7:39 AM, Eugenio Perez Martin wrote: On Thu, Oct 19, 2023 at 10:27 AM Jason Wang wrote: On Thu, Oct 19, 2023 at 2:47 PM Si-Wei Liu wrote: On 10/18/2023 7:53 PM, Jason Wang wrote: On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu wrote: On 10/18/2023 12:00 AM, Jason Wang wrote: Unfortunately, it's a must to stick to ABI. I agree it's a mess but we don't have a better choice. Or we can fail the probe if userspace doesn't ack this feature. Antoher idea we can just do the following in vhost_vdpa reset? config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Then we don't have the burden to maintain them in the parent? Thanks Please see my earlier response in the other email, thanks. %<%< First, the ideal fix would be to leave this reset_vendor_mappings() emulation code on the individual driver itself, which already has the broken behavior. So the point is, not about whether the existing behavior is "broken" or not. Hold on, I thought earlier we all agreed upon that the existing behavior of vendor driver self-clearing maps during .reset violates the vhost iotlb abstraction and also breaks the .set_map/.dma_map API. This is 100% buggy driver implementation itself that we should discourage or eliminate as much as possible (that's part of the goal for this series), I'm not saying it's not an issue, what I'm saying is, if the fix breaks another userspace, it's a new bug in the kernel. See what Linus said in [1] "If a change results in user programs breaking, it's a bug in the kernel." but here you seem to go existentialism and suggests the very opposite that every .set_map/.dma_map driver implementation, regardless being the current or the new/upcoming, should unconditionally try to emulate the broken reset behavior for the sake of not breaking older userspace. Such "emulation" is not done at the parent level. New parents just need to implement reset_map() or not. 
everything could be done inside vhost-vDPA as pseudo code that is shown above. Set aside the criteria and definition for how userspace can be broken, can we step back to the original question why we think it's broken, and what we can do to promote good driver implementation instead of discuss the implementation details? I'm not sure I get the point of this question. I'm not saying we don't need to fix, what I am saying is that such a fix must be done in a negotiable way. And it's better if parents won't get any burden. It can just decide to implement reset_map() or not. Reading the below response I found my major points are not heard even if written for quite a few times. I try my best to not ignore any important things, but I can't promise I will not miss any. I hope the above clarifies my points. It's not that I don't understand the importance of not breaking old userspace, I appreciate your questions and extra patience, however I do feel the "broken" part is very relevant to our discussion here. If it's broken (in the sense of vhost IOTLB API) that you agree, I think we should at least allow good driver implementations; and when you think about the possibility of those valid good driver cases (.set_map/.dma_map implementations that do not clear maps in .reset), you might be able to see why it's coded the way as it is now. It's about whether we could stick to the old behaviour without too much cost. And I believe we could. And just to clarify here, reset_vendor_mappings() = config->reset_map() But today there's no backend feature negotiation between vhost-vdpa and the parent driver. Do we want to send down the acked_backend_features to parent drivers? There's no need to do that with the above code, or anything I missed here? 
config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Implementation issue: this implies reset_map() has to be there for every .set_map implementation, but vendor driver implementation for custom IOMMU could well implement DMA ops by itself instead of .reset_map. This won't work for every set_map driver (think about the vduse case). Well let me do it once again, reset_map() is not mandated: config->reset() if (IOTLB_PERSIST is not set) { if (config->reset_map) config->reset_map() To avoid new parent drivers I am afraid it's not just new parent drivers, but any well behaved driver today may well break userspace if we go with this forced emulation code, if they have to implement reset_map for some reason (e.g. restoring to 1:1 passthrough mapping or other default state in mapping). For new userspace and user driver we can guard against it using the IOTLB_PERSIST flag, but the above code would get a big chance to break setups with good drivers and older userspace in practice. And .reset_map implementation doesn't necessarily need to clear maps.
Re: [RFC v2 PATCH] vdpa_sim: implement .reset_map support
On 10/19/2023 2:29 AM, Stefano Garzarella wrote: On Wed, Oct 18, 2023 at 04:47:48PM -0700, Si-Wei Liu wrote: On 10/18/2023 1:05 AM, Stefano Garzarella wrote: On Tue, Oct 17, 2023 at 10:11:33PM -0700, Si-Wei Liu wrote: RFC only. Not tested on vdpa-sim-blk with user virtual address. Works fine with vdpa-sim-net which uses physical address to map. This patch is based on top of [1]. [1] https://lore.kernel.org/virtualization/1696928580-7520-1-git-send-email-si-wei@oracle.com/ Signed-off-by: Si-Wei Liu --- RFC v2: - initialize iotlb to passthrough mode in device add I tested this version and I didn't see any issue ;-) Great, thank you so much for your help on testing my patch, Stefano! You're welcome :-) Just for my own interest/curiosity, currently there's no vhost-vdpa backend client implemented for vdpa-sim-blk Yep, we developed libblkio [1]. libblkio exposes a common API to access block devices in userspace. It supports several drivers. The one useful for this use case is `virtio-blk-vhost-vdpa`. Here [2] are some examples on how to use the libblkio test suite with the vdpa-sim-blk. Since QEMU 7.2, it supports libblkio drivers, so you can use the following options to attach a vdpa-blk device to a VM: -blockdev node-name=drive_src1,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \ -device virtio-blk-pci,id=src1,bootindex=2,drive=drive_src1 \ For now only what we called slow-path [3][4] is supported, since the VQs are not directly exposed to the guest, but QEMU allocates other VQs (similar to shadow VQs for net) to support live-migration and QEMU storage features. Fast-path is on the agenda, but on pause for now. or any vdpa block device in userspace as yet, correct? Do you mean with VDUSE? In this case, yes, qemu-storage-daemon supports it, and can implement a virtio-blk in user space, exposing a disk image through VDUSE. There is an example in libblkio as well [5] on how to start it.
So there was no test specific to vhost-vdpa that needs to be exercised, right? I hope I answered above :-) Definitely! This is exactly what I needed, it's really useful! Much appreciated for the detailed information! I hadn't been aware of the latest status on libblkio drivers and qemu support since I last checked it (it was at some point right after KVM 2022, sorry my knowledge too outdated). I followed your links below and checked a few things, looks my change shouldn't affect anything. Good to see all the desired pieces landed to QEMU and libblkio already as planned, great job done! Cheers, -Siwei This reminded me that I need to write a blog post with all this information, I hope to do that soon! Stefano [1] https://gitlab.com/libblkio/libblkio [2] https://gitlab.com/libblkio/libblkio/-/blob/main/tests/meson.build?ref_type=heads#L42 [3] https://kvmforum2022.sched.com/event/15jK5/qemu-storage-daemon-and-libblkio-exploring-new-shores-for-the-qemu-block-layer-kevin-wolf-stefano-garzarella-red-hat [4] https://kvmforum2021.sched.com/event/ke3a/vdpa-blk-unified-hardware-and-software-offload-for-virtio-blk-stefano-garzarella-red-hat [5] https://gitlab.com/libblkio/libblkio/-/blob/main/tests/meson.build?ref_type=heads#L58 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH vhost v4 00/16] vdpa: Add support for vq descriptor mappings
For patches 05-16: Reviewed-by: Si-Wei Liu Tested-by: Si-Wei Liu Thanks for the fixes! On 10/18/2023 10:14 AM, Dragos Tatulea wrote: This patch series adds support for vq descriptor table mappings which are used to improve vdpa live migration downtime. The improvement comes from using smaller mappings which take less time to create and destroy in hw. The first part adds the vdpa core changes from Si-Wei [0]. The second part adds support in mlx5_vdpa: - Refactor the mr code to be able to cleanly add descriptor mappings. - Add hardware descriptor mr support. - Properly update iotlb for cvq during ASID switch. Changes in v4: - Improved the handling of empty iotlbs. See mlx5_vdpa_change_map section in patch "12/16 vdpa/mlx5: Improve mr update flow". - Fixed an invalid usage of the desc_group_mkey hw vq field when the capability is not there. See patch "15/16 vdpa/mlx5: Enable hw support for vq descriptor map". Changes in v3: - dup_iotlb now checks for the src == dst case and returns an error. - Renamed the iotlb parameter in dup_iotlb to dst. - Removed a redundant check of the asid value. - Fixed a commit message. - The mlx5_ifc.h patch has been applied to the mlx5-vhost tree. When applying this series please pull from that tree first. Changes in v2: - The "vdpa/mlx5: Enable hw support for vq descriptor mapping" change was split off into two patches to avoid merge conflicts with Linus' tree. The first patch contains only changes for mlx5_ifc.h. This must be applied to the mlx5-vdpa tree [1] first. Once this patch is applied on mlx5-vdpa, the change has to be pulled from mlx5-vdpa into the vhost tree and only then can the remaining patches be applied.
[0] https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com
[1] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/log/?h=mlx5-vhost

Dragos Tatulea (13):
  vdpa/mlx5: Expose descriptor group mkey hw capability
  vdpa/mlx5: Create helper function for dma mappings
  vdpa/mlx5: Decouple cvq iotlb handling from hw mapping code
  vdpa/mlx5: Take cvq iotlb lock during refresh
  vdpa/mlx5: Collapse "dvq" mr add/delete functions
  vdpa/mlx5: Rename mr destroy functions
  vdpa/mlx5: Allow creation/deletion of any given mr struct
  vdpa/mlx5: Move mr mutex out of mr struct
  vdpa/mlx5: Improve mr update flow
  vdpa/mlx5: Introduce mr for vq descriptor
  vdpa/mlx5: Enable hw support for vq descriptor mapping
  vdpa/mlx5: Make iotlb helper functions more generic
  vdpa/mlx5: Update cvq iotlb mapping on ASID change

Si-Wei Liu (3):
  vdpa: introduce dedicated descriptor group for virtqueue
  vhost-vdpa: introduce descriptor group backend feature
  vhost-vdpa: uAPI to get dedicated descriptor group id

 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  31 +++--
 drivers/vdpa/mlx5/core/mr.c        | 194 -
 drivers/vdpa/mlx5/core/resources.c |   6 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 105 +++-
 drivers/vhost/vdpa.c               |  27 
 include/linux/mlx5/mlx5_ifc.h      |   8 +-
 include/linux/mlx5/mlx5_ifc_vdpa.h |   7 +-
 include/linux/vdpa.h               |  11 ++
 include/uapi/linux/vhost.h         |   8 ++
 include/uapi/linux/vhost_types.h   |   5 +
 10 files changed, 272 insertions(+), 130 deletions(-)
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/17/2023 10:27 PM, Jason Wang wrote: If we do this without a negotiation, IOTLB will not be cleared but QEMU will try to re-program the IOTLB after reset. Which will break? 1) stick to the exact old behaviour with just one line of check It's not just one line of check here, the old behavior emulation has to be done as Eugenio illustrated in the other email. For vhost-vDPA it's just if (IOTLB_PERSIST is acked by userspace) reset_map() ... and this reset_map in vhost_vdpa_cleanup can't be negotiable depending on IOTLB_PERSIST. Consider the case where the user switches to virtio-vdpa after an older userspace using vhost-vdpa finished running. Even with buggy_virtio_reset_map in place there's no guarantee the vendor IOMMU can get back to the default state, e.g. ending with 1:1 passthrough mapping. If this isn't done unconditionally, there's a big chance it will break userspace. -Siwei For the parent, it's somehow similar: during .reset() if (IOTLB_PERSIST is not acked by userspace) reset_vendor_mappings() Anything I missed here?
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/19/2023 7:39 AM, Eugenio Perez Martin wrote: On Thu, Oct 19, 2023 at 10:27 AM Jason Wang wrote: On Thu, Oct 19, 2023 at 2:47 PM Si-Wei Liu wrote: On 10/18/2023 7:53 PM, Jason Wang wrote: On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu wrote: On 10/18/2023 12:00 AM, Jason Wang wrote: Unfortunately, it's a must to stick to ABI. I agree it's a mess but we don't have a better choice. Or we can fail the probe if userspace doesn't ack this feature. Another idea: we can just do the following in vhost_vdpa reset? config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Then we don't have the burden to maintain them in the parent? Thanks Please see my earlier response in the other email, thanks. %<%< First, the ideal fix would be to leave this reset_vendor_mappings() emulation code on the individual driver itself, which already has the broken behavior. So the point is, not about whether the existing behavior is "broken" or not. Hold on, I thought earlier we all agreed upon that the existing behavior of vendor driver self-clearing maps during .reset violates the vhost iotlb abstraction and also breaks the .set_map/.dma_map API. This is 100% buggy driver implementation itself that we should discourage or eliminate as much as possible (that's part of the goal for this series), I'm not saying it's not an issue, what I'm saying is, if the fix breaks another userspace, it's a new bug in the kernel. See what Linus said in [1]: "If a change results in user programs breaking, it's a bug in the kernel." but here you seem to go existentialist and suggest the very opposite: that every .set_map/.dma_map driver implementation, regardless of being the current or the new/upcoming, should unconditionally try to emulate the broken reset behavior for the sake of not breaking older userspace. Such "emulation" is not done at the parent level. New parents just need to implement reset_map() or not. Everything could be done inside vhost-vDPA as pseudo code that is shown above.
Set aside the criteria and definition for how userspace can be broken, can we step back to the original question why we think it's broken, and what we can do to promote good driver implementation instead of discuss the implementation details? I'm not sure I get the point of this question. I'm not saying we don't need to fix, what I am saying is that such a fix must be done in a negotiable way. And it's better if parents won't get any burden. It can just decide to implement reset_map() or not. Reading the below response I found my major points are not heard even if written for quite a few times. I try my best to not ignore any important things, but I can't promise I will not miss any. I hope the above clarifies my points. It's not that I don't understand the importance of not breaking old userspace, I appreciate your questions and extra patience, however I do feel the "broken" part is very relevant to our discussion here. If it's broken (in the sense of vhost IOTLB API) that you agree, I think we should at least allow good driver implementations; and when you think about the possibility of those valid good driver cases (.set_map/.dma_map implementations that do not clear maps in .reset), you might be able to see why it's coded the way as it is now. It's about whether we could stick to the old behaviour without too much cost. And I believe we could. And just to clarify here, reset_vendor_mappings() = config->reset_map() But today there's no backend feature negotiation between vhost-vdpa and the parent driver. Do we want to send down the acked_backend_features to parent drivers? There's no need to do that with the above code, or anything I missed here? config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Implementation issue: this implies reset_map() has to be there for every .set_map implementations, but vendor driver implementation for custom IOMMU could well implement DMA ops by itself instead of .reset_map. 
This won't work for every set_map driver (think about the vduse case). Well let me do it once again, reset_map() is not mandated: config->reset() if (IOTLB_PERSIST is not set) { if (config->reset_map) config->reset_map() To avoid new parent drivers I am afraid it's not just new parent drivers, but any well behaved driver today may well break userspace if go with this forced emulation code, if they have to implement reset_map for some reason (e.g. restored to 1:1 passthrough mapping or other default state in mapping). For new userspace and user driver we can guard against it using the IOTLB_PERSIST flag, but the above code would get a big chance to break setup with good driver and older userspace in practice. And .reset_map implementation doesn't necessarily need to clear maps. For e.g. IOMMU API compliant driver that only needs
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/18/2023 7:53 PM, Jason Wang wrote: On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu wrote: On 10/18/2023 12:00 AM, Jason Wang wrote: Unfortunately, it's a must to stick to ABI. I agree it's a mess but we don't have a better choice. Or we can fail the probe if userspace doesn't ack this feature. Another idea: we can just do the following in vhost_vdpa reset? config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Then we don't have the burden to maintain them in the parent? Thanks Please see my earlier response in the other email, thanks. %<%< First, the ideal fix would be to leave this reset_vendor_mappings() emulation code on the individual driver itself, which already has the broken behavior. So the point is, not about whether the existing behavior is "broken" or not. Hold on, I thought earlier we all agreed upon that the existing behavior of vendor driver self-clearing maps during .reset violates the vhost iotlb abstraction and also breaks the .set_map/.dma_map API. This is 100% buggy driver implementation itself that we should discourage or eliminate as much as possible (that's part of the goal for this series), but here you seem to go existentialist and suggest the very opposite: that every .set_map/.dma_map driver implementation, regardless of being the current or the new/upcoming, should unconditionally try to emulate the broken reset behavior for the sake of not breaking older userspace. Setting aside the criteria and definition for how userspace can be broken, can we step back to the original question of why we think it's broken, and what we can do to promote good driver implementation, instead of discussing the implementation details? Reading the below response I found my major points are not heard even if written for quite a few times. It's not that I don't understand the importance of not breaking old userspace, I appreciate your questions and extra patience, however I do feel the "broken" part is very relevant to our discussion here.
If it's broken (in the sense of the vhost IOTLB API), which you agree, I think we should at least allow good driver implementations; and when you think about the possibility of those valid good driver cases (.set_map/.dma_map implementations that do not clear maps in .reset), you might be able to see why it's coded the way it is now. It's about whether we could stick to the old behaviour without too much cost. And I believe we could. And just to clarify here, reset_vendor_mappings() = config->reset_map() But today there's no backend feature negotiation between vhost-vdpa and the parent driver. Do we want to send down the acked_backend_features to parent drivers? There's no need to do that with the above code, or anything I missed here? config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Implementation issue: this implies reset_map() has to be there for every .set_map implementation, but vendor driver implementation for custom IOMMU could well implement DMA ops by itself instead of .reset_map. This won't work for every set_map driver (think about the vduse case). But this is not the point I was making. I think if you agree this is purely a buggy driver implementation of its own, we should try to isolate this buggy behavior to the individual driver rather than overload vhost-vdpa or vdpa core's role to help implement the emulation of broken driver behavior. I don't get why .reset is special here; the abuse of .reset to manipulate mapping could also happen in other IOMMU-unrelated driver entries like in .suspend, or in queue_reset. If someday userspace is found to be coded around a similar buggy driver implementation in other driver ops, do we want to follow and duplicate the same emulation in vdpa core, as the precedent is already set here around .reset?
The buggy driver can fail in a lot of other ways indefinitely during reset; if there's a buggy driver that's already broken the way it is and happens to survive with all userspace apps, we just don't care and let it be. There's no way we can enumerate all those buggy behaviors in .reset_map itself, it's overloading that driver API too much. Second, IOTLB_PERSIST is needed but not sufficient. Due to the lack of backend feature negotiation in the parent driver, if vhost-vdpa has to provide the old-behaviour emulation for compatibility on the driver's behalf, it needs to be done on a per-driver basis. There could be good on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in .reset, and vendor specific IOMMU doesn't have to provide .reset_map, Then we just don't offer IOTLB_PERSIST, isn't this by design? Think about the vduse case, it can work with DMA ops directly so doesn't have to implement .reset_map, unless for some specific good reason. Because it's a conforming and valid/good driver implementation, we may still allow it to ad
[PATCH v3 0/4] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is rebased on top of the latest vhost tree. 
[1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html
---
v3:
- add .reset_map support to vdpa_sim
- introduce module parameter to provide bug-for-bug compatibility with older userspace
v2:
- improved commit message to clarify the intended scope of .reset_map API
- improved commit messages to clarify no breakage on older userspace
v1:
- rewrote commit messages to include more detailed description and background
- reword to vendor specific IOMMU implementation from on-chip IOMMU
- include parent device backend features to persistent iotlb precondition
- reimplement mlx5_vdpa patch on top of descriptor group series
RFC v3:
- fix missing return due to merge error in patch #4
RFC v2:
- rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/
---

Si-Wei Liu (5):
  vdpa: introduce .reset_map operation callback
  vhost-vdpa: reset vendor specific mapping to initial state in .release
  vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
  vdpa/mlx5: implement .reset_map driver op
  vdpa_sim: implement .reset_map support

 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
 drivers/vdpa/mlx5/core/mr.c        | 17 +
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 26 --
 drivers/vdpa/vdpa_sim/vdpa_sim.c   | 58 --
 drivers/vhost/vdpa.c               | 31 
 include/linux/vdpa.h               | 10 ++
 include/uapi/linux/vhost_types.h   |  2 ++
 7 files changed, 132 insertions(+), 13 deletions(-)

-- 
2.39.3
[PATCH v3 5/5] vdpa_sim: implement .reset_map support
In order to reduce excessive memory mapping cost in live migration and VM reboot, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the iotlb on the given ASID and recreate the 1:1 passthrough/identity mapping. To be consistent, the mapping on device creation is initialized to passthrough/identity with PA 1:1 mapped as IOVA. With this the device .reset op doesn't have to maintain and clean up memory mappings by itself.

Add a module parameter, iotlb_persist, to cater for older userspace which may wish to see mappings cleared during reset.

Signed-off-by: Si-Wei Liu
Tested-by: Stefano Garzarella
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 58 ++--
 1 file changed, 47 insertions(+), 11 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 76d41058add9..74506636375f 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -40,6 +40,10 @@ static bool use_va = true;
 module_param(use_va, bool, 0444);
 MODULE_PARM_DESC(use_va, "Enable/disable the device's ability to use VA");

+static bool iotlb_persist = true;
+module_param(iotlb_persist, bool, 0444);
+MODULE_PARM_DESC(iotlb_persist, "Enable/disable persistent iotlb across reset: 1 to keep maps, 0 to clear");
+
 #define VDPASIM_QUEUE_ALIGN PAGE_SIZE
 #define VDPASIM_QUEUE_MAX 256
 #define VDPASIM_VENDOR_ID 0
@@ -151,11 +155,13 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim)
 		spin_unlock(&vdpasim->iommu_lock);
 	}

-	for (i = 0; i < vdpasim->dev_attr.nas; i++) {
-		vhost_iotlb_reset(&vdpasim->iommu[i]);
-		vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX,
-				      0, VHOST_MAP_RW);
-		vdpasim->iommu_pt[i] = true;
+	if (unlikely(!iotlb_persist)) {
+		for (i = 0; i < vdpasim->dev_attr.nas; i++) {
+			vhost_iotlb_reset(&vdpasim->iommu[i]);
+			vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX,
+					      0, VHOST_MAP_RW);
+			vdpasim->iommu_pt[i] = true;
+		}
 	}

 	vdpasim->running = true;
@@ -166,8 +172,8 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim)
 	++vdpasim->generation;
 }

-static const struct vdpa_config_ops vdpasim_config_ops;
-static const struct vdpa_config_ops vdpasim_batch_config_ops;
+static struct vdpa_config_ops vdpasim_config_ops;
+static struct vdpa_config_ops vdpasim_batch_config_ops;

 static void vdpasim_work_fn(struct kthread_work *work)
 {
@@ -191,7 +197,7 @@ static void vdpasim_work_fn(struct kthread_work *work)
 struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr,
 			       const struct vdpa_dev_set_config *config)
 {
-	const struct vdpa_config_ops *ops;
+	struct vdpa_config_ops *ops;
 	struct vdpa_device *vdpa;
 	struct vdpasim *vdpasim;
 	struct device *dev;
@@ -213,6 +219,9 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr,
 	else
 		ops = &vdpasim_config_ops;

+	if (unlikely(!iotlb_persist))
+		ops->reset_map = NULL;
+
 	vdpa = __vdpa_alloc_device(NULL, ops,
 				   dev_attr->ngroups, dev_attr->nas,
 				   dev_attr->alloc_size,
@@ -259,8 +268,14 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr,
 	if (!vdpasim->iommu_pt)
 		goto err_iommu;

-	for (i = 0; i < vdpasim->dev_attr.nas; i++)
+	for (i = 0; i < vdpasim->dev_attr.nas; i++) {
 		vhost_iotlb_init(&vdpasim->iommu[i], max_iotlb_entries, 0);
+		if (likely(iotlb_persist)) {
+			vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, 0,
+					      VHOST_MAP_RW);
+			vdpasim->iommu_pt[i] = true;
+		}
+	}

 	for (i = 0; i < dev_attr->nvqs; i++)
 		vringh_set_iotlb(&vdpasim->vqs[i].vring, &vdpasim->iommu[0],
@@ -637,6 +652,25 @@ static int vdpasim_set_map(struct vdpa_device *vdpa, unsigned int asid,
 	return ret;
 }

+static int vdpasim_reset_map(struct vdpa_device *vdpa, unsigned int asid)
+{
+	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+
+	if (asid >= vdpasim->dev_attr.nas)
+		return -EINVAL;
+
+	spin_lock(&vdpasim->iommu_lock);
+	if (vdpasim->iommu_pt[asid])
+		goto out;
+	vhost_iotlb_reset(&vdpasim->iommu[asid]);
+	vhost_iotlb_add_range(&vdpasim->iommu[asid], 0, ULONG_MAX,
+			      0, VHOST_MAP_RW);
+	vdpasim->iommu_pt[asid] = true;
+out:
[PATCH v3 1/5] vdpa: introduce .reset_map operation callback
A device specific IOMMU parent driver that wishes to see mappings decoupled from the virtio or vdpa device life cycle (device reset) can use it to restore the memory mapping in the device IOMMU to the initial or default state. The reset of mapping is done on a per address space basis. The reason why a separate .reset_map op is introduced is that this allows a simple on-chip IOMMU model without exposing too much device implementation detail to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exist other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpected way, these on-chip IOMMU devices can start with 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch the iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model. The .reset_map is not a MUST for every parent that implements the .dma_map or .set_map API, because there could be software vDPA devices (which have use_va=true) that don't have to pin kernel memory so they don't care much about high mapping cost during device reset. And those software devices may have also implemented their own DMA ops, so they don't have to use .reset_map to achieve a simple IOMMU device model for 1:1 passthrough mapping, either.
Signed-off-by: Si-Wei Liu
Acked-by: Eugenio Pérez
Acked-by: Jason Wang
---
 include/linux/vdpa.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index d376309b99cf..26ae6ae1eac3 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -327,6 +327,15 @@ struct vdpa_map_file {
  *				@iova: iova to be unmapped
  *				@size: size of the area
  *				Returns integer: success (0) or error (< 0)
+ * @reset_map:			Reset device memory mapping to the default
+ *				state (optional)
+ *				Needed for devices that are using device
+ *				specific DMA translation and prefer mapping
+ *				to be decoupled from the virtio life cycle,
+ *				i.e. device .reset op does not reset mapping
+ *				@vdev: vdpa device
+ *				@asid: address space identifier
+ *				Returns integer: success (0) or error (< 0)
  * @get_vq_dma_dev:		Get the dma device for a specific
  *				virtqueue (optional)
  *				@vdev: vdpa device
@@ -405,6 +414,7 @@ struct vdpa_config_ops {
 			   u64 iova, u64 size, u64 pa, u32 perm, void *opaque);
 	int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
 			 u64 iova, u64 size);
+	int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
 	int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
 			      unsigned int asid);
 	struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx);
-- 
2.39.3
[PATCH v3 3/5] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to distinguish whether the vhost-vdpa iotlb in the kernel can be trusted to persist IOTLB mappings across vDPA reset. Without it, userspace has no way to tell whether it's running on an older kernel, which could silently drop all iotlb mappings across vDPA reset, especially with a broken parent driver implementation for the .reset driver op. The broken driver may incorrectly drop all mappings of its own as part of .reset, which inadvertently ends up with corrupted mapping state between vhost-vdpa userspace and the kernel. As a workaround, to make the mapping behaviour predictable across reset, userspace has to pro-actively remove all mappings before vDPA reset, and then restore all the mappings afterwards. This workaround is done unconditionally on top of all parent drivers today, due to the parent driver implementation issue and no means to differentiate. This workaround has been utilized in QEMU since day one when the corresponding vhost-vdpa userspace backend came to the world.

There are three cases where the backend may claim this feature bit:
- parent device that has to work with the platform IOMMU
- parent device with on-chip IOMMU that has the expected .reset_map support in the driver
- parent device with a vendor specific IOMMU implementation that already has persistent IOTLB mapping and has to specifically declare this backend feature

The reason why .reset_map is one of the preconditions for persistent iotlb is that without it, vhost-vdpa can't switch the iotlb back to the initial state later on, especially for the on-chip IOMMU case which starts with identity mapping at device creation. virtio-vdpa requires the on-chip IOMMU to perform 1:1 passthrough translation from PA to IOVA as-is to begin with, and .reset_map is the only means to turn the iotlb back to the identity mapping mode after vhost-vdpa is gone.
The difference in behavior did not matter as QEMU unmaps all the memory unregistering the memory listener at vhost_vdpa_dev_start( started = false), but the backend acknowledging this feature flag allows QEMU to make sure it is safe to skip this unmap & map in the case of vhost stop & start cycle. In that sense, this feature flag is actually a signal for userspace to know that the driver bug has been solved. Not offering it indicates that userspace cannot trust the kernel will retain the maps. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 15 +++ include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index a3f8160c9807..9202986a7d81 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -438,6 +438,15 @@ static u64 vhost_vdpa_get_backend_features(const struct vhost_vdpa *v) return ops->get_backend_features(vdpa); } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map || + vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); +} + static long vhost_vdpa_set_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -725,6 +734,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME) | BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK))) @@ -741,6 +751,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + 
return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -796,6 +809,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); features |= vhost_vdpa_get_backend_features(v); if (copy_to_user(featurep, &features, sizeof(features
[PATCH v3 2/5] vhost-vdpa: reset vendor specific mapping to initial state in .release
Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause mappings to go away on such an IOTLB model, which needs persistent mapping. Before vhost-vdpa goes away, give them a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f57b95..a3f8160c9807 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 4/5] vdpa/mlx5: implement .reset_map driver op
Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with a preallocated 1:1 DMA MR at device creation time. This 1:1 DMA MR will be implicitly destroyed while the first .set_map call is invoked, in which case callers like vhost-vdpa will start to set up custom mappings. When the .reset callback is invoked, the custom mappings will be cleared and the 1:1 DMA MR will be re-created. In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept around intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR (including cvq mapping) on the given ASID and recreate the initial DMA mapping. That way, the device .reset op runs free from having to maintain and clean up memory mappings by itself. Add a module parameter, persist_mapping, to cater for older userspace which may wish to see mappings cleared during reset. Co-developed-by: Dragos Tatulea Signed-off-by: Dragos Tatulea Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 + drivers/vdpa/mlx5/net/mlx5_vnet.c | 26 -- 3 files changed, 42 insertions(+), 2 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index db988ced5a5d..84547d998bcf 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...)
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 66530e28f327..2197c46e563a 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev) return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0); } + +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +{ + if (asid >= MLX5_VDPA_NUM_AS) + return -EINVAL; + + mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]); + + if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if (mlx5_vdpa_create_dma_mr(mvdev)) + mlx5_vdpa_warn(mvdev, "create DMA MR failed\n"); + } else { + mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid); + } + + return 0; +} diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index f4516a2d5bb0..e809ccec6048 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -25,6 +25,11 @@ MODULE_AUTHOR("Eli Cohen "); MODULE_DESCRIPTION("Mellanox VDPA driver"); MODULE_LICENSE("Dual BSD/GPL"); +static bool persist_mapping = true; +module_param(persist_mapping, bool, 0444); +MODULE_PARM_DESC(persist_mapping, +"Enable/disable persistent mapping across reset: 1 to keep, 0 to clear"); + #define VALID_FEATURES_MASK \ (BIT_ULL(VIRTIO_NET_F_CSUM) | BIT_ULL(VIRTIO_NET_F_GUEST_CSUM) | \ BIT_ULL(VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) | BIT_ULL(VIRTIO_NET_F_MTU) | BIT_ULL(VIRTIO_NET_F_MAC) | \ @@ -2888,7 +2893,8 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) unregister_link_notifier(ndev); teardown_driver(ndev); clear_vqs_ready(ndev); - mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); + if (unlikely(!persist_mapping)) + mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); ndev->mvdev.status = 0; ndev->mvdev.suspended = false; ndev->cur_num_vqs = 0; @@ -2899,7 +2905,7 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) init_group_to_asid_map(mvdev);
++mvdev->generation; - if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if (unlikely(!persist_mapping) && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { if (mlx5_vdpa_create_dma_mr(mvdev)) mlx5_vdpa_warn(mvdev, "create MR failed\n"); } @@ -2987,6 +2993,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid, return err; } +static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid)
Re: [RFC v2 PATCH] vdpa_sim: implement .reset_map support
On 10/18/2023 1:05 AM, Stefano Garzarella wrote: On Tue, Oct 17, 2023 at 10:11:33PM -0700, Si-Wei Liu wrote: RFC only. Not tested on vdpa-sim-blk with user virtual address. Works fine with vdpa-sim-net which uses physical address to map. This patch is based on top of [1]. [1] https://lore.kernel.org/virtualization/1696928580-7520-1-git-send-email-si-wei@oracle.com/ Signed-off-by: Si-Wei Liu --- RFC v2: - initialize iotlb to passthrough mode in device add I tested this version and I didn't see any issue ;-) Great, thank you so much for your help on testing my patch, Stefano! Just for my own interest/curiosity, currently there's no vhost-vdpa backend client implemented for vdpa-sim-blk or any vdpa block device in userspace as yet, correct? So there was no test specific to vhost-vdpa that needs to be exercised, right? Thanks, -Siwei Tested-by: Stefano Garzarella --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 34 1 file changed, 26 insertions(+), 8 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 76d41058add9..2a0a6042d61d 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -151,13 +151,6 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim) >iommu_lock); } - for (i = 0; i < vdpasim->dev_attr.nas; i++) { - vhost_iotlb_reset(>iommu[i]); - vhost_iotlb_add_range(>iommu[i], 0, ULONG_MAX, - 0, VHOST_MAP_RW); - vdpasim->iommu_pt[i] = true; - } - vdpasim->running = true; spin_unlock(>iommu_lock); @@ -259,8 +252,12 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr, if (!vdpasim->iommu_pt) goto err_iommu; - for (i = 0; i < vdpasim->dev_attr.nas; i++) + for (i = 0; i < vdpasim->dev_attr.nas; i++) { vhost_iotlb_init(>iommu[i], max_iotlb_entries, 0); + vhost_iotlb_add_range(>iommu[i], 0, ULONG_MAX, 0, + VHOST_MAP_RW); + vdpasim->iommu_pt[i] = true; + } for (i = 0; i < dev_attr->nvqs; i++) vringh_set_iotlb(>vqs[i].vring, >iommu[0], @@ -637,6 +634,25 @@ static int 
vdpasim_set_map(struct vdpa_device *vdpa, unsigned int asid, return ret; } +static int vdpasim_reset_map(struct vdpa_device *vdpa, unsigned int asid) +{ + struct vdpasim *vdpasim = vdpa_to_sim(vdpa); + + if (asid >= vdpasim->dev_attr.nas) + return -EINVAL; + + spin_lock(>iommu_lock); + if (vdpasim->iommu_pt[asid]) + goto out; + vhost_iotlb_reset(>iommu[asid]); + vhost_iotlb_add_range(>iommu[asid], 0, ULONG_MAX, + 0, VHOST_MAP_RW); + vdpasim->iommu_pt[asid] = true; +out: + spin_unlock(>iommu_lock); + return 0; +} + static int vdpasim_bind_mm(struct vdpa_device *vdpa, struct mm_struct *mm) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); @@ -759,6 +775,7 @@ static const struct vdpa_config_ops vdpasim_config_ops = { .set_group_asid = vdpasim_set_group_asid, .dma_map = vdpasim_dma_map, .dma_unmap = vdpasim_dma_unmap, + .reset_map = vdpasim_reset_map, .bind_mm = vdpasim_bind_mm, .unbind_mm = vdpasim_unbind_mm, .free = vdpasim_free, @@ -796,6 +813,7 @@ static const struct vdpa_config_ops vdpasim_batch_config_ops = { .get_iova_range = vdpasim_get_iova_range, .set_group_asid = vdpasim_set_group_asid, .set_map = vdpasim_set_map, + .reset_map = vdpasim_reset_map, .bind_mm = vdpasim_bind_mm, .unbind_mm = vdpasim_unbind_mm, .free = vdpasim_free, -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/18/2023 4:14 AM, Eugenio Perez Martin wrote: On Wed, Oct 18, 2023 at 10:44 AM Si-Wei Liu wrote: On 10/17/2023 10:27 PM, Jason Wang wrote: On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu wrote: On 10/16/2023 7:35 PM, Jason Wang wrote: On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu wrote: On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote: On Mon, Oct 16, 2023 at 8:33 AM Jason Wang wrote: On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings by its own, independent of virtio device state. For instance, device reset does not cause mapping go away on such IOTLB model in need of persistent mapping. Before vhost-vdpa is going away, give them a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). 
Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(>hash_link); vhost_vdpa_iotlb_unmap(v, >iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ Should we do this according to whether IOTLB_PRESIST is set? Well, in theory this seems like so but it's unnecessary code change actually, as that is the way how vDPA parent behind platform IOMMU works today, and userspace doesn't break as of today. :) Well, this is one question I've ever asked before. You have explained that one of the reason that we don't break userspace is that they may couple IOTLB reset with vDPA reset as well. One example is the Qemu. As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today is already tolerating enough with either good or bad IOMMU. I'm confused, how to define tolerating here? 
Tolerating defined as QEMU has to proactively unmap before reset just to workaround the driver bug (on-chip maps out of sync), unconditionally for platform or on-chip. While we all know it doesn't have to do so for platform IOMMU, though userspace has no means to distinguish. That said, userspace is sacrificing reset time performance on platform IOMMU setup just for working around buggy implementation in the other setup. Ok, so what you actually mean is that userspace can tolerate the "bug" with the performance penalty. Right. For example, if it has tolerance, why bother? I'm not sure I get the question. But I think userspace is compromising because of buggy implementation in a few drivers doesn't mean we should uniformly enforce such behavior for all set_map/dma_map implementations. This is not my point. I meant, we can fix we need a negotiation in order to let some "buggy" old user space to survive from the changes. Userspace is no buggy today, how to define "buggy"? Userspace with tolerance could survive just fine no matter if this negotiation or buggy driver behavior emulation is around or not. If any userspace doesn't tolerate, it can work still fine on good on-chip IOMMU or platform IOMMU, no matter if the negotiation is around or not. This code of not checking IOTLB_PERSIST being set is intentional, there's no point to emulate bad IOMMU behavior even for older userspace (with improper emulation to be done it would result in even worse performance). I can easily imagine a case: The old Qemu that works only with a setup like mlx5_vdpa. Noted, seems to me there's no such case of a userspace implementation that only works with mlx5_vdpa or its friends, but doesn't work with the others e.g. platform IOMMU, or well behaving on-chip IOMMU implementations. It's not hard t
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/18/2023 12:00 AM, Jason Wang wrote: Unfortunately, it's a must to stick to ABI. I agree it's a mess but we don't have a better choice. Or we can fail the probe if userspace doesn't ack this feature. Another idea: we can just do the following in vhost_vdpa reset? config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Then we don't have the burden to maintain them in the parent? Thanks Please see my earlier response in the other email, thanks. %<%< First, the ideal fix would be to leave this reset_vendor_mappings() emulation code on the individual driver itself, which already has the broken behavior. But today there's no backend feature negotiation between vhost-vdpa and the parent driver. Do we want to send down the acked_backend_features to parent drivers? Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of backend feature negotiation in the parent driver, if vhost-vdpa has to provide the old-behaviour emulation for compatibility on the driver's behalf, it needs to be done on a per-driver basis. There could be good on-chip or vendor IOMMU implementations which don't clear the IOTLB in .reset, and a vendor specific IOMMU doesn't have to provide .reset_map; we should allow these good driver implementations rather than unconditionally stick to some specific problematic behavior for every other good driver. Then we need a set of device flags (backend_features bit again?) to indicate that the specific driver needs the upper layer's help on old-behaviour emulation. Last but not least, I'm not sure how to properly emulate reset_vendor_mappings() from the vhost-vdpa layer. If a vendor driver has no .reset_map op implemented, or if .reset_map has a slightly different implementation than what it used to reset the iotlb in the .reset op, then this either becomes effectively dead code if no one ends up using it, or the vhost-vdpa emulation is helpless and limited in scope, unable to cover all the cases.
%<%<
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/17/2023 10:27 PM, Jason Wang wrote: On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu wrote: On 10/16/2023 7:35 PM, Jason Wang wrote: On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu wrote: On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote: On Mon, Oct 16, 2023 at 8:33 AM Jason Wang wrote: On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings by its own, independent of virtio device state. For instance, device reset does not cause mapping go away on such IOTLB model in need of persistent mapping. Before vhost-vdpa is going away, give them a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(>hash_link); vhost_vdpa_iotlb_unmap(v, >iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled 
from the virtio device life cycle. +*/ Should we do this according to whether IOTLB_PRESIST is set? Well, in theory this seems like so but it's unnecessary code change actually, as that is the way how vDPA parent behind platform IOMMU works today, and userspace doesn't break as of today. :) Well, this is one question I've ever asked before. You have explained that one of the reason that we don't break userspace is that they may couple IOTLB reset with vDPA reset as well. One example is the Qemu. As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today is already tolerating enough with either good or bad IOMMU. I'm confused, how to define tolerating here? Tolerating defined as QEMU has to proactively unmap before reset just to workaround the driver bug (on-chip maps out of sync), unconditionally for platform or on-chip. While we all know it doesn't have to do so for platform IOMMU, though userspace has no means to distinguish. That said, userspace is sacrificing reset time performance on platform IOMMU setup just for working around buggy implementation in the other setup. Ok, so what you actually mean is that userspace can tolerate the "bug" with the performance penalty. Right. For example, if it has tolerance, why bother? I'm not sure I get the question. But I think userspace is compromising because of buggy implementation in a few drivers doesn't mean we should uniformly enforce such behavior for all set_map/dma_map implementations. This is not my point. I meant, we can fix we need a negotiation in order to let some "buggy" old user space to survive from the changes. Userspace is no buggy today, how to define "buggy"? Userspace with tolerance could survive just fine no matter if this negotiation or buggy driver behavior emulation is around or not. 
If any userspace doesn't tolerate, it can work still fine on good on-chip IOMMU or platform IOMMU, no matter if the negotiation is around or not. This code of not checking IOTLB_PERSIST being set is intentional, there's no point to emulate bad IOMMU behavior even for older userspace (with improper emulation to be done it would result in even worse performance). I can easily imagine a case: The old Qemu that works only with a setup like mlx5_vdpa. Noted, seems to me there's no such case of a userspace implementation that only works with mlx5_vdpa or its friends, but doesn't work with the others e.g. platform IOMMU, or well behaving on-chip IOMMU implementations. It's not hard to think of a case where: 1) the environment has mlx5_vdpa only 2) kernel doc can't have endless details, so when
Re: [RFC PATCH] vdpa_sim: implement .reset_map support
Hi Stefano, On 10/17/2023 6:44 AM, Stefano Garzarella wrote: On Fri, Oct 13, 2023 at 10:29:26AM -0700, Si-Wei Liu wrote: Hi Stefano, On 10/13/2023 2:22 AM, Stefano Garzarella wrote: Hi Si-Wei, On Fri, Oct 13, 2023 at 01:23:40AM -0700, Si-Wei Liu wrote: RFC only. Not tested on vdpa-sim-blk with user virtual address. I can test it, but what I should stress? Great, thank you! As you see, my patch moved vhost_iotlb_reset out of vdpasim_reset for the sake of decoupling mapping from vdpa device reset. For hardware devices this decoupling makes sense as platform IOMMU already did it. But I'm not sure if there's something in the software device (esp. with vdpa-blk and the userspace library stack) that may have to rely on the current .reset behavior that clears the vhost_iotlb. So perhaps you can try to exercise every possible case involving blk device reset, and see if anything (related to mapping) breaks? I just tried these steps without using a VM and the host kernel hangs after adding the device: [root@f38-vm-build ~]# modprobe virtio-vdpa [root@f38-vm-build ~]# modprobe vdpa-sim-blk [root@f38-vm-build ~]# vdpa dev add mgmtdev vdpasim_blk name blk0 [ 35.284575][ T563] virtio_blk virtio6: 1/0/0 default/read/poll queues [ 35.286372][ T563] virtio_blk virtio6: [vdb] 262144 512-byte logical blocks (134 MB/128 MiB) [ 35.295271][ T564] vringh: Reverting this patch (so building "vdpa/mlx5: implement .reset_map driver op") worked here. I'm sorry, the previous RFC patch was incomplete - please see the v2 I just posted. Tested both use_va and !use_va on vdpa-sim-blk, and raw disk copy to the vdpa block simulator using dd seems fine. Just let me know how it goes on your side this time. Thanks, -Siwei Works fine with vdpa-sim-net which uses physical address to map. Can you share your tests? so I'll try to do the same with blk. Basically everything involving virtio device reset in the guest, e.g. 
reboot the VM, remove/unbind then reprobe/bind the virtio-net module/driver, then see if device I/O (which needs mapping properly) is still flowing as expected. And then everything else that could trigger QEMU's vhost_dev_start/stop paths ending up as a passive vhost-vdpa backend reset, e.g. link status change, suspend/hibernate, SVQ switch and live migration. I am not sure if vdpa-blk supports live migration through SVQ or not; if not, you don't need to worry about it. This patch is based on top of [1]. [1] https://lore.kernel.org/virtualization/1696928580-7520-1-git-send-email-si-wei@oracle.com/ The series does not apply well on master or vhost tree. Where should I apply it? Sent the link through another email offline. Received, thanks! Stefano
[RFC v2 PATCH] vdpa_sim: implement .reset_map support
RFC only. Not tested on vdpa-sim-blk with user virtual address. Works fine with vdpa-sim-net which uses physical address to map. This patch is based on top of [1]. [1] https://lore.kernel.org/virtualization/1696928580-7520-1-git-send-email-si-wei@oracle.com/ Signed-off-by: Si-Wei Liu --- RFC v2: - initialize iotlb to passthrough mode in device add --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 34 1 file changed, 26 insertions(+), 8 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 76d41058add9..2a0a6042d61d 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -151,13 +151,6 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim) &vdpasim->iommu_lock); } - for (i = 0; i < vdpasim->dev_attr.nas; i++) { - vhost_iotlb_reset(&vdpasim->iommu[i]); - vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, - 0, VHOST_MAP_RW); - vdpasim->iommu_pt[i] = true; - } - vdpasim->running = true; spin_unlock(&vdpasim->iommu_lock); @@ -259,8 +252,12 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr, if (!vdpasim->iommu_pt) goto err_iommu; - for (i = 0; i < vdpasim->dev_attr.nas; i++) + for (i = 0; i < vdpasim->dev_attr.nas; i++) { vhost_iotlb_init(&vdpasim->iommu[i], max_iotlb_entries, 0); + vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, 0, + VHOST_MAP_RW); + vdpasim->iommu_pt[i] = true; + } for (i = 0; i < dev_attr->nvqs; i++) vringh_set_iotlb(&vdpasim->vqs[i].vring, &vdpasim->iommu[0], @@ -637,6 +634,25 @@ static int vdpasim_set_map(struct vdpa_device *vdpa, unsigned int asid, return ret; } +static int vdpasim_reset_map(struct vdpa_device *vdpa, unsigned int asid) +{ + struct vdpasim *vdpasim = vdpa_to_sim(vdpa); + + if (asid >= vdpasim->dev_attr.nas) + return -EINVAL; + + spin_lock(&vdpasim->iommu_lock); + if (vdpasim->iommu_pt[asid]) + goto out; + vhost_iotlb_reset(&vdpasim->iommu[asid]); + vhost_iotlb_add_range(&vdpasim->iommu[asid], 0, ULONG_MAX, + 0, VHOST_MAP_RW); + vdpasim->iommu_pt[asid] = true; +out: + spin_unlock(&vdpasim->iommu_lock); + return 0; +} + static int
vdpasim_bind_mm(struct vdpa_device *vdpa, struct mm_struct *mm) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); @@ -759,6 +775,7 @@ static const struct vdpa_config_ops vdpasim_config_ops = { .set_group_asid = vdpasim_set_group_asid, .dma_map= vdpasim_dma_map, .dma_unmap = vdpasim_dma_unmap, + .reset_map = vdpasim_reset_map, .bind_mm= vdpasim_bind_mm, .unbind_mm = vdpasim_unbind_mm, .free = vdpasim_free, @@ -796,6 +813,7 @@ static const struct vdpa_config_ops vdpasim_batch_config_ops = { .get_iova_range = vdpasim_get_iova_range, .set_group_asid = vdpasim_set_group_asid, .set_map= vdpasim_set_map, + .reset_map = vdpasim_reset_map, .bind_mm= vdpasim_bind_mm, .unbind_mm = vdpasim_unbind_mm, .free = vdpasim_free, -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/16/2023 7:35 PM, Jason Wang wrote: On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu wrote: On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote: On Mon, Oct 16, 2023 at 8:33 AM Jason Wang wrote: On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings by its own, independent of virtio device state. For instance, device reset does not cause mapping go away on such IOTLB model in need of persistent mapping. Before vhost-vdpa is going away, give them a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(>hash_link); vhost_vdpa_iotlb_unmap(v, >iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. 
+*/ Should we do this according to whether IOTLB_PRESIST is set? Well, in theory this seems like so but it's unnecessary code change actually, as that is the way how vDPA parent behind platform IOMMU works today, and userspace doesn't break as of today. :) Well, this is one question I've ever asked before. You have explained that one of the reason that we don't break userspace is that they may couple IOTLB reset with vDPA reset as well. One example is the Qemu. As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today is already tolerating enough with either good or bad IOMMU. I'm confused, how to define tolerating here? Tolerating defined as QEMU has to proactively unmap before reset just to workaround the driver bug (on-chip maps out of sync), unconditionally for platform or on-chip. While we all know it doesn't have to do so for platform IOMMU, though userspace has no means to distinguish. That said, userspace is sacrificing reset time performance on platform IOMMU setup just for working around buggy implementation in the other setup. For example, if it has tolerance, why bother? I'm not sure I get the question. But I think userspace is compromising because of buggy implementation in a few drivers doesn't mean we should uniformly enforce such behavior for all set_map/dma_map implementations. This code of not checking IOTLB_PERSIST being set is intentional, there's no point to emulate bad IOMMU behavior even for older userspace (with improper emulation to be done it would result in even worse performance). I can easily imagine a case: The old Qemu that works only with a setup like mlx5_vdpa. Noted, seems to me there's no such case of a userspace implementation that only works with mlx5_vdpa or its friends, but doesn't work with the others e.g. platform IOMMU, or well behaving on-chip IOMMU implementations. 
The Unmap+remap trick around vdpa reset works totally fine for platform IOMMU, except with sub-optimal performance. Other than this trick, I cannot easily think of other means or iotlb message sequence for userspace to recover the bogus state and make iotlb back to work again after reset. Are we talking about hypnosis that has no real basis to exist in the real world? If we do this without a negotiation, IOTLB will not be clear but the Qemu will try to re-program the IOTLB after reset. Which will break? 1) stick the exact old behaviour with just one line of check It's not just one line of check here, the old behavior emulation has to be done as Eugenio illustrated in the other email. In addition, the emulation has to limit to those buggy drivers as I don't feel this emulation should apply uniformly to all futu
[PATCH v2 4/4] vdpa/mlx5: implement .reset_map driver op
Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with a preallocated 1:1 DMA MR at device creation time. This 1:1 DMA MR will be implicitly destroyed when the first .set_map call is invoked, in which case callers like vhost-vdpa will start to set up custom mappings. When the .reset callback is invoked, the custom mappings will be cleared and the 1:1 DMA MR will be re-created. In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept around intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR on the given ASID and recreate the initial DMA mapping. That way, the device .reset op can run free from having to maintain and clean up memory mappings by itself. The cvq mapping also needs to be cleared if it is in the given ASID. Co-developed-by: Dragos Tatulea Signed-off-by: Dragos Tatulea Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 + drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +- 3 files changed, 31 insertions(+), 5 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index db988ced5a5d..84547d998bcf 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 66530e28f327..2197c46e563a 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev) return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0); } + +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +{ + if (asid >= MLX5_VDPA_NUM_AS) + return -EINVAL; + + mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]); + + if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if (mlx5_vdpa_create_dma_mr(mvdev)) + mlx5_vdpa_warn(mvdev, "create DMA MR failed\n"); + } else { + mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid); + } + + return 0; +} diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 6abe02310f2b..928e71bc5571 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -2838,7 +2838,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) unregister_link_notifier(ndev); teardown_driver(ndev); clear_vqs_ready(ndev); - mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); ndev->mvdev.status = 0; ndev->mvdev.suspended = false; ndev->cur_num_vqs = 0; @@ -2849,10 +2848,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) init_group_to_asid_map(mvdev); ++mvdev->generation; - if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { - if (mlx5_vdpa_create_dma_mr(mvdev)) - mlx5_vdpa_warn(mvdev, "create MR failed\n"); - } up_write(&ndev->reslock); return 0; @@ -2932,6 +2927,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid, return err; } +static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid) +{ + struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); + struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); + int err; + + down_write(&ndev->reslock); + err = mlx5_vdpa_reset_mr(mvdev, asid); + up_write(&ndev->reslock); + return err; +} + 
static struct device *mlx5_get_vq_dma_dev(struct vdpa_device *vdev, u16 idx) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); @@ -3199,6 +3206,7 @@ static const struct vdpa_config_ops mlx5_vdpa_ops = { .set_config = mlx5_vdpa_set_config, .get_generation = mlx5_vdpa_get_generation, .set_map = mlx5_vdpa_set_map, + .reset_map = mlx5_vdpa_reset_map, .set_group_asid = mlx5_set_group_asid, .get_vq_dma_dev = mlx5_get_vq_dma_dev, .free = mlx5_vdpa_free, -- 2.39.3
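As a sanity check of the branch structure in mlx5_vdpa_reset_mr() from the patch above, here is a toy user-space C model. All names (toy_reset_mr, toy_mvdev, etc.) are invented for illustration; it mirrors the shape of the patch — destroy the regular MR for the ASID, then recreate the initial 1:1 DMA MR only for ASID 0 on parents with the umem_uid_0 capability — not the real driver.

```c
#include <stdbool.h>

#define TOY_NUM_AS 2

struct toy_mvdev {
	bool mr[TOY_NUM_AS];	/* regular (custom) MR present per ASID */
	bool dma_mr;		/* initial 1:1 DMA MR present */
	bool cap_umem_uid_0;	/* models MLX5_CAP_GEN(mdev, umem_uid_0) */
};

int toy_reset_mr(struct toy_mvdev *m, unsigned int asid)
{
	if (asid >= TOY_NUM_AS)
		return -1;	/* -EINVAL in the real driver */

	m->mr[asid] = false;	/* destroy the regular MR */

	if (asid == 0 && m->cap_umem_uid_0)
		m->dma_mr = true;	/* recreate the initial DMA mapping */
	/* else: only the CVQ iotlb for this ASID is refreshed (not modeled) */

	return 0;
}
```

The point the model makes is that after .reset_map, ASID 0 is back in the same 1:1 state it had at device creation, without the device .reset op ever touching mappings.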
[PATCH v2 0/4] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce the needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For context, those on-chip IOMMU parent devices create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is based on the descriptor group v3 series from Dragos. 
[2] [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html [2] [PATCH vhost v3 00/16] vdpa: Add support for vq descriptor mappings https://lore.kernel.org/lkml/20231009112401.1060447-1-dtatu...@nvidia.com/ --- v2: - improved commit message to clarify the intended scope of .reset_map API - improved commit messages to clarify no breakage on older userspace v1: - rewrote commit messages to include more detailed description and background - reword on-chip IOMMU to vendor specific IOMMU implementation - include parent device backend features to persistent iotlb precondition - reimplement mlx5_vdpa patch on top of descriptor group series RFC v3: - fix missing return due to merge error in patch #4 RFC v2: - rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/ --- Si-Wei Liu (4): vdpa: introduce .reset_map operation callback vhost-vdpa: reset vendor specific mapping to initial state in .release vhost-vdpa: introduce IOTLB_PERSIST backend feature bit vdpa/mlx5: implement .reset_map driver op drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 - drivers/vhost/vdpa.c | 31 ++ include/linux/vdpa.h | 10 ++ include/uapi/linux/vhost_types.h | 2 ++ 6 files changed, 74 insertions(+), 5 deletions(-) -- 2.39.3
[PATCH v2 1/4] vdpa: introduce .reset_map operation callback
A device specific IOMMU parent driver that wishes to see mapping decoupled from the virtio or vdpa device life cycle (device reset) can use it to restore memory mapping in the device IOMMU to the initial or default state. The reset of mapping is done on a per address space basis. The reason why a separate .reset_map op is introduced is that it allows a simple on-chip IOMMU model without exposing too many device implementation details to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exist other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpected way, these on-chip IOMMU devices can start with 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch the iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model. The .reset_map is not a MUST for every parent that implements the .dma_map or .set_map API, because there could be software vDPA devices (which have use_va=true) that don't have to pin kernel memory, so they don't care much about high mapping cost during device reset. And those software devices may have also implemented their own DMA ops, so they don't have to use .reset_map to achieve a simple IOMMU device model for 1:1 passthrough mapping, either. 
Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez Acked-by: Jason Wang --- include/linux/vdpa.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index d376309b99cf..26ae6ae1eac3 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -327,6 +327,15 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping to the default + * state (optional) + * Needed for devices that are using device + * specific DMA translation and prefer mapping + * to be decoupled from the virtio life cycle, + * i.e. device .reset op does not reset mapping + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -405,6 +414,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 2.39.3
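The life cycle the .reset_map op enables can be sketched with a toy user-space C model (illustrative only; all names are invented): an address space starts in 1:1 passthrough, the first set_map switches it to custom mappings, device reset no longer touches mappings at all, and reset_map alone returns the address space to its initial state.

```c
#include <stdbool.h>

enum map_state { MAP_PASSTHROUGH, MAP_CUSTOM };

struct toy_as {
	enum map_state state;
	int n_entries;		/* number of custom iotlb entries */
};

void toy_as_create(struct toy_as *as)
{
	as->state = MAP_PASSTHROUGH;	/* 1:1 DMA mapping at creation */
	as->n_entries = 0;
}

void toy_set_map(struct toy_as *as, int n)
{
	as->state = MAP_CUSTOM;		/* implicitly destroys the 1:1 mapping */
	as->n_entries = n;
}

/* Mirrors the intent of .reset_map: back to the default state. */
void toy_reset_map(struct toy_as *as)
{
	toy_as_create(as);
}

/* With mappings decoupled, virtio device reset is a no-op on the iotlb. */
void toy_device_reset(struct toy_as *as)
{
	(void)as;
}
```

This separation is exactly what lets vhost-vdpa call reset_map only when the device is being released back to the bus, while live-migration resets leave the costly mappings in place.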
[PATCH v2 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause the mapping to go away on such an IOTLB model in need of persistent mapping. Before vhost-vdpa goes away, give them a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f57b95..a3f8160c9807 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 2.39.3
[PATCH v2 3/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to tell whether the vhost-vdpa iotlb in the kernel can be trusted to persist IOTLB mapping across vDPA reset. Without it, userspace has no way to tell if it's running on an older kernel, which could silently drop all iotlb mapping across vDPA reset, especially with a broken parent driver implementation for the .reset driver op. The broken driver may incorrectly drop all mappings of its own as part of .reset, which inadvertently ends up with corrupted mapping state between vhost-vdpa userspace and the kernel. As a workaround, to make the mapping behaviour predictable across reset, userspace has to proactively remove all mappings before vDPA reset, and then restore all the mappings afterwards. This workaround is done unconditionally on top of all parent drivers today, due to the parent driver implementation issue and no means to differentiate. This workaround had been utilized in QEMU since day one when the corresponding vhost-vdpa userspace backend came to the world. There are 3 cases where the backend may claim this feature bit: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver - parent device with vendor specific IOMMU implementation with persistent IOTLB mapping already that has to specifically declare this backend feature The reason why .reset_map is one of the pre-conditions for persistent iotlb is that without it, vhost-vdpa can't switch the iotlb back to the initial state later on, especially for the on-chip IOMMU case which starts with identity mapping at device creation. virtio-vdpa requires on-chip IOMMU to perform 1:1 passthrough translation from PA to IOVA as-is to begin with, and .reset_map is the only means to turn the iotlb back to the identity mapping mode after vhost-vdpa is gone. 
The difference in behavior did not matter as QEMU unmaps all the memory unregistering the memory listener at vhost_vdpa_dev_start(..., started = false), but the backend acknowledging this feature flag allows QEMU to make sure it is safe to skip this unmap & map in the case of a vhost stop & start cycle. In that sense, this feature flag is actually a signal for userspace to know that the driver bug has been solved. Not offering it indicates that userspace cannot trust the kernel will retain the maps. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 15 +++ include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index a3f8160c9807..9202986a7d81 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -438,6 +438,15 @@ static u64 vhost_vdpa_get_backend_features(const struct vhost_vdpa *v) return ops->get_backend_features(vdpa); } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map || + vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); +} + static long vhost_vdpa_set_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -725,6 +734,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME) | BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK))) 
return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -796,6 +809,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); features |= vhost_vdpa_get_backend_features(v); if (copy_to_user(featurep, &features, sizeof(features
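The negotiation predicate in the patch above can be exercised standalone with a small illustrative C re-implementation. The struct and the bit position here are assumptions made for this sketch; the real check is vhost_vdpa_has_persistent_map() and the authoritative bit value lives in include/uapi/linux/vhost_types.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define TOY_F_IOTLB_PERSIST_BIT 8

struct toy_parent {
	bool has_set_map;	/* parent implements .set_map */
	bool has_dma_map;	/* parent implements .dma_map/.dma_unmap */
	bool has_reset_map;	/* parent implements .reset_map */
	uint64_t backend_features;
};

/* True iff the parent falls into one of the 3 cases from the commit
 * message: platform IOMMU (neither set_map nor dma_map), .reset_map
 * support, or an explicit claim via the parent's backend features. */
bool toy_has_persistent_map(const struct toy_parent *p)
{
	return (!p->has_set_map && !p->has_dma_map) ||
	       p->has_reset_map ||
	       (p->backend_features & (1ULL << TOY_F_IOTLB_PERSIST_BIT));
}
```

Note how a parent that implements .set_map but offers neither .reset_map nor the explicit claim is exactly the "broken driver" case: the kernel then refuses to advertise IOTLB_PERSIST, and userspace keeps its unmap-before-reset workaround.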
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote: On Mon, Oct 16, 2023 at 8:33 AM Jason Wang wrote: On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause the mapping to go away on such an IOTLB model in need of persistent mapping. Before vhost-vdpa goes away, give them a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ Should we do this according to whether IOTLB_PERSIST is set? 
Well, in theory this seems like so, but it's actually an unnecessary code change, as that is how a vDPA parent behind a platform IOMMU works today, and userspace doesn't break as of today. :) Well, this is a question I've asked before. You have explained that one of the reasons we don't break userspace is that they may couple IOTLB reset with vDPA reset as well. One example is the Qemu. As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today is already tolerating enough with either good or bad IOMMU. This code not checking whether IOTLB_PERSIST is set is intentional; there's no point emulating bad IOMMU behavior even for older userspace (with improper emulation to be done it would result in even worse performance). For two reasons: 1) backend features need to be acked by userspace, this is by design 2) keeping the odd behaviour seems safer as we can't audit every userspace program The old behavior (without flag ack) cannot be trusted already, as: * Devices using platform IOMMU (in other words, implementing neither .set_map nor .dma_map) do not unmap memory at virtio reset. * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb called from mlx5_vdpa_reset). With the vdpa_sim patch removing the reset, now all backends work the same as far as I know, which was (and is) the way devices using the platform IOMMU work. The difference in behavior did not matter as QEMU unmaps all the memory unregistering the memory listener at vhost_vdpa_dev_start(..., started = false), Exactly. It's not just QEMU, but any (older) userspace that manipulates mappings through the vhost-vdpa iotlb interface has to unmap all mappings to work around the vdpa parent driver bug. 
If they don't do the explicit unmap, it would cause state inconsistency between vhost-vdpa and the parent driver, then old mappings can't be restored, and new mappings can be added to the iotlb after vDPA reset. There's no point in preserving this broken and inconsistent behavior between vhost-vdpa and the parent driver, as userspace doesn't care at all! but the backend acknowledging this feature flag allows QEMU to make sure it is safe to skip this unmap & map in the case of a vhost stop & start cycle. In that sense, this feature flag is actually a signal for userspace to know that the bug has been solved. Right, I couldn't say it better than you do, thanks! The feature flag is more of an unusual means of indicating that the kernel bug has been fixed, rather than introducing a new feature or new kernel behavior that ends up changing userspace's expectation. Not offering it indicates that userspace cannot trust the kernel will retain the maps. Si-Wei or
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/15/2023 11:32 PM, Jason Wang wrote: On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause the mapping to go away on such an IOTLB model in need of persistent mapping. Before vhost-vdpa goes away, give them a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ Should we do this according to whether IOTLB_PERSIST is set? 
Well, in theory this seems like so, but it's actually an unnecessary code change, as that is how a vDPA parent behind a platform IOMMU works today, and userspace doesn't break as of today. :) Well, this is a question I've asked before. You have explained that one of the reasons we don't break userspace is that they may couple IOTLB reset with vDPA reset as well. One example is the Qemu. Nope, it was the opposite. Maybe it was not clear enough, let me try once more - userspace CANNOT decouple IOTLB reset from vDPA reset today. This is because the bug/discrepancy in mlx5_vdpa and vdpa_sim already breaks userspace's expectation, rendering the vhost-vdpa mapping interface broken and inconsistent, no longer behaving as it promised and should have done. Only with the IOTLB_PERSIST flag seen can userspace trust the vhost-vdpa kernel interface to *reliably* decouple IOTLB reset from vDPA reset. Without seeing this flag, no matter how the code in QEMU was written, today's older userspace was never able to assume the mappings will *definitely* be cleared by vDPA reset. If any userspace implementation wants to get consistent behavior for all vDPA parent devices, it still has to *explicitly* clear all existing mappings on its own by sending a bunch of unmap (iotlb invalidate) requests to the vhost-vdpa kernel before resetting the vDPA backend. In brief, userspace is already broken by the kernel implementation today, and new userspace needs some device flag to know for sure if the kernel bug has already been fixed; older userspace doesn't care about preserving the broken kernel behavior at all, regardless of whether or not it wants to decouple IOTLB from vDPA reset. As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today is already tolerating enough with either good or bad IOMMU. 
This code not checking whether IOTLB_PERSIST is set is intentional; there's no point emulating bad IOMMU behavior even for older userspace (with improper emulation to be done it would result in even worse performance). For two reasons: 1) backend features need to be acked by userspace, this is by design There's no breakage on this part. Backend feature IOTLB_PERSIST won't be set if userspace doesn't ack. 2) keeping the odd behaviour seems safer as we can't audit every userspace program Definitely don't have to audit every userspace program, but I cannot think of a case where a sane userspace program can be broken. Can you elaborate one or two potential userspace usages that may break because of this? As said, platform IOMMU already did it this way. Regards, -Siwei Thanks I think the purpose of the IOTLB_PERSIST flag is just to give userspace 100% certainty of persistent iotlb mapping not getting lost across vdpa reset. Thanks, -Siwei [1] https://lore.kernel.
Re: [RFC PATCH] vdpa_sim: implement .reset_map support
Hi Stefano, On 10/13/2023 2:22 AM, Stefano Garzarella wrote: Hi Si-Wei, On Fri, Oct 13, 2023 at 01:23:40AM -0700, Si-Wei Liu wrote: RFC only. Not tested on vdpa-sim-blk with user virtual address. I can test it, but what should I stress? Great, thank you! As you see, my patch moved vhost_iotlb_reset out of vdpasim_reset for the sake of decoupling mapping from vdpa device reset. For hardware devices this decoupling makes sense as the platform IOMMU already did it. But I'm not sure if there's something in the software device (esp. with vdpa-blk and the userspace library stack) that may have to rely on the current .reset behavior that clears the vhost_iotlb. So perhaps you can try to exercise every possible case involving blk device reset, and see if anything (related to mapping) breaks? Works fine with vdpa-sim-net which uses physical address to map. Can you share your tests? so I'll try to do the same with blk. Basically everything involving virtio device reset in the guest, e.g. reboot the VM, remove/unbind then reprobe/bind the virtio-net module/driver, then see if device I/O (which needs mapping properly) is still flowing as expected. And then everything else that could trigger QEMU's vhost_dev_start/stop paths ending up as passive vhost-vdpa backend reset, e.g. link status change, suspend/hibernate, SVQ switch and live migration. I am not sure if vdpa-blk supports live migration through SVQ or not; if not, you don't need to worry about it. This patch is based on top of [1]. [1] https://lore.kernel.org/virtualization/1696928580-7520-1-git-send-email-si-wei@oracle.com/ The series does not apply well on master or vhost tree. Where should I apply it? Sent the link through another email offline. Thanks, -Siwei If you have a tree with all of them applied, it will be easy for me ;-) Thanks, Stefano
[RFC PATCH] vdpa_sim: implement .reset_map support
RFC only. Not tested on vdpa-sim-blk with user virtual address. Works fine with vdpa-sim-net which uses physical address to map. This patch is based on top of [1]. [1] https://lore.kernel.org/virtualization/1696928580-7520-1-git-send-email-si-wei@oracle.com/ Signed-off-by: Si-Wei Liu --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 28 +--- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 76d4105..a7455f2 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -151,13 +151,6 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim) &vdpasim->iommu_lock); } - for (i = 0; i < vdpasim->dev_attr.nas; i++) { - vhost_iotlb_reset(&vdpasim->iommu[i]); - vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, - 0, VHOST_MAP_RW); - vdpasim->iommu_pt[i] = true; - } - vdpasim->running = true; spin_unlock(&vdpasim->iommu_lock); @@ -637,6 +630,25 @@ static int vdpasim_set_map(struct vdpa_device *vdpa, unsigned int asid, return ret; } +static int vdpasim_reset_map(struct vdpa_device *vdpa, unsigned int asid) +{ + struct vdpasim *vdpasim = vdpa_to_sim(vdpa); + + if (asid >= vdpasim->dev_attr.nas) + return -EINVAL; + + spin_lock(&vdpasim->iommu_lock); + if (vdpasim->iommu_pt[asid]) + goto out; + vhost_iotlb_reset(&vdpasim->iommu[asid]); + vhost_iotlb_add_range(&vdpasim->iommu[asid], 0, ULONG_MAX, + 0, VHOST_MAP_RW); + vdpasim->iommu_pt[asid] = true; +out: + spin_unlock(&vdpasim->iommu_lock); + return 0; +} + static int vdpasim_bind_mm(struct vdpa_device *vdpa, struct mm_struct *mm) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); @@ -759,6 +771,7 @@ static void vdpasim_free(struct vdpa_device *vdpa) .set_group_asid = vdpasim_set_group_asid, .dma_map= vdpasim_dma_map, .dma_unmap = vdpasim_dma_unmap, + .reset_map = vdpasim_reset_map, .bind_mm= vdpasim_bind_mm, .unbind_mm = vdpasim_unbind_mm, .free = vdpasim_free, @@ -796,6 +809,7 @@ static void vdpasim_free(struct vdpa_device *vdpa) .get_iova_range = vdpasim_get_iova_range, .set_group_asid = 
vdpasim_set_group_asid, .set_map= vdpasim_set_map, + .reset_map = vdpasim_reset_map, .bind_mm= vdpasim_bind_mm, .unbind_mm = vdpasim_unbind_mm, .free = vdpasim_free, -- 1.8.3.1
Re: [PATCH 4/4] vdpa/mlx5: implement .reset_map driver op
On 10/12/2023 8:04 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with a preallocated 1:1 DMA MR at device creation time. This 1:1 DMA MR will be implicitly destroyed when the first .set_map call is invoked, in which case callers like vhost-vdpa will start to set up custom mappings. When the .reset callback is invoked, the custom mappings will be cleared and the 1:1 DMA MR will be re-created. In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept around intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR on the given ASID and recreate the initial DMA mapping. That way, the device .reset op can run free from having to maintain and clean up memory mappings by itself. The cvq mapping also needs to be cleared if it is in the given ASID. Co-developed-by: Dragos Tatulea Signed-off-by: Dragos Tatulea Signed-off-by: Si-Wei Liu I wonder if the simulator suffers from the exact same issue. For the vdpa-sim !use_va (map using PA and with pinning) case, yes. But I'm not sure about the situation of the vdpa-sim(-blk) use_va case, e.g. I haven't checked if there's a dependency on today's reset behavior (coupled), and if the QEMU vhost-vdpa backend driver is the only userspace consumer. Maybe Stefano knows? I can give the simulator fix a try, but don't count on me for the vdpa-sim(-blk) use_va part. Regards, -Siwei If yes, let's fix the simulator as well? Thanks
Re: [PATCH 1/4] vdpa: introduce .reset_map operation callback
On 10/12/2023 7:49 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: A device-specific IOMMU parent driver that wishes to see mappings decoupled from the virtio or vdpa device life cycle (device reset) can use it to restore the memory mapping in the device IOMMU to the initial or default state. The reset of mappings is done on a per-address-space basis. The reason why a separate .reset_map op is introduced is that it allows a simple on-chip IOMMU model without exposing too many device implementation details to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exist other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpected way, these on-chip IOMMU devices can start in 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch the iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model. Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index d376309..26ae6ae 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -327,6 +327,15 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping to the default + * state (optional) I think we need to mention that this is a must for parents that use set_map()? It's not a must IMO; some .set_map() users, e.g. VDUSE or vdpa-sim-blk, can deliberately choose whether to implement .reset_map() depending on their own needs.
Those user_va software devices mostly don't care about mapping cost during reset, as they don't have to pin kernel memory in general. It's just whether or not they care about mapping being decoupled from device reset at all. And the exact implementation requirement of this interface has been documented right below. -Siwei Other than this: Acked-by: Jason Wang Thanks + * Needed for devices that are using device + * specific DMA translation and prefer mapping + * to be decoupled from the virtio life cycle, + * i.e. device .reset op does not reset mapping + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -405,6 +414,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with an on-chip IOMMU or vendor specific IOTLB implementation may need to restore the iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause mappings to go away on such an IOTLB model that needs persistent mapping. Before vhost-vdpa goes away, give them a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ Should we do this according to whether IOTLB_PERSIST is set? Well, in theory it seems so, but that would actually be an unnecessary code change, as this is how a vDPA parent behind a platform IOMMU works today, and userspace doesn't break as of today.
:) As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today already has to tolerate either a good or a bad IOMMU. Not checking whether IOTLB_PERSIST is set in this code is intentional; there's no point in emulating bad IOMMU behavior even for older userspace (improper emulation would result in even worse performance). I think the purpose of the IOTLB_PERSIST flag is just to give userspace 100% certainty that persistent iotlb mappings won't get lost across vdpa reset. Thanks, -Siwei [1] https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e4...@oracle.com/ [2] https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1...@oracle.com/ Otherwise we may break old userspace. Thanks + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 1.8.3.1
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/11/2023 4:21 AM, Eugenio Perez Martin wrote: On Tue, Oct 10, 2023 at 11:05 AM Si-Wei Liu wrote: Devices with an on-chip IOMMU or vendor specific IOTLB implementation may need to restore the iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause mappings to go away on such an IOTLB model that needs persistent mapping. Before vhost-vdpa goes away, give them a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); Now I'm wondering, does this call to vhost_vdpa_iotlb_unmap set a different map (via .set_map) per element of the vhost_iotlb_itree? Yes and no; effectively this vhost_vdpa_iotlb_unmap call will pass an empty iotlb with zero map entries down to the driver via .set_map, so for the .set_map interface it's always a different map no matter what.
As for this special case, the internal implementation of mlx5_vdpa .set_map may choose to either destroy the MR and recreate a new one, or remove all mappings on the existing MR (currently it uses destroy+recreate for simplicity, without having to special-case). But .reset_map is different - the 1:1 DMA MR has to be recreated explicitly after destroying the regular MR, so this is driver/device implementation specific. Not a big deal since we're in the cleanup path, but it could be a nice optimization on top, as we're going to reset the map of the asid anyway. You mean wrap up what's done in vhost_vdpa_iotlb_unmap and vhost_vdpa_reset_map into a new call, say vhost_vdpa_iotlb_reset? Yes, this is possible, but note that vhost_vdpa_iotlb_unmap also takes charge of pinning accounting besides mapping, and it has to keep its own vhost_iotlb copy in sync. There's not much code that can be consolidated or generalized at this point, as vhost_vdpa_reset_map() is very specific to some device implementations, and I don't see a common need to optimize this further up in the map/unmap hot path rather than this cleanup slow path, just as you alluded to. Regards, -Siwei + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 1.8.3.1
[PATCH 4/4] vdpa/mlx5: implement .reset_map driver op
Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with a preallocated 1:1 DMA MR at device creation time. This 1:1 DMA MR will be implicitly destroyed when the first .set_map call is invoked, in which case callers like vhost-vdpa will start to set up custom mappings. When the .reset callback is invoked, the custom mappings will be cleared and the 1:1 DMA MR will be re-created. In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept around intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR on the given ASID and recreate the initial DMA mapping. That way, the device .reset op can run free from having to maintain and clean up memory mappings by itself. The cvq mapping also needs to be cleared if it is in the given ASID. Co-developed-by: Dragos Tatulea Signed-off-by: Dragos Tatulea Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 + drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +- 3 files changed, 31 insertions(+), 5 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index db988ce..84547d9 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...)
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 66530e28..2197c46 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev) return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0); } + +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +{ + if (asid >= MLX5_VDPA_NUM_AS) + return -EINVAL; + + mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]); + + if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if (mlx5_vdpa_create_dma_mr(mvdev)) + mlx5_vdpa_warn(mvdev, "create DMA MR failed\n"); + } else { + mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid); + } + + return 0; +} diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 6abe023..928e71b 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -2838,7 +2838,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) unregister_link_notifier(ndev); teardown_driver(ndev); clear_vqs_ready(ndev); - mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); ndev->mvdev.status = 0; ndev->mvdev.suspended = false; ndev->cur_num_vqs = 0; @@ -2849,10 +2848,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) init_group_to_asid_map(mvdev); ++mvdev->generation; - if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { - if (mlx5_vdpa_create_dma_mr(mvdev)) - mlx5_vdpa_warn(mvdev, "create MR failed\n"); - } up_write(&ndev->reslock); return 0; @@ -2932,6 +2927,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid, return err; } +static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid) +{ + struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); + struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); + int err; + + down_write(&ndev->reslock); + err = mlx5_vdpa_reset_mr(mvdev, asid); + up_write(&ndev->reslock); + return err; +} + static struct device
*mlx5_get_vq_dma_dev(struct vdpa_device *vdev, u16 idx) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); @@ -3199,6 +3206,7 @@ static int mlx5_set_group_asid(struct vdpa_device *vdev, u32 group, .set_config = mlx5_vdpa_set_config, .get_generation = mlx5_vdpa_get_generation, .set_map = mlx5_vdpa_set_map, + .reset_map = mlx5_vdpa_reset_map, .set_group_asid = mlx5_set_group_asid, .get_vq_dma_dev = mlx5_get_vq_dma_dev, .free = mlx5_vdpa_free, -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 3/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to tell whether the vhost-vdpa iotlb in the kernel supports persistent IOTLB mapping across device reset. Without it, userspace has no way to know whether it's running on an older kernel, which could silently drop all iotlb mappings across vDPA reset. There are 3 cases where the backend may claim this feature bit: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver - parent device with vendor specific IOMMU implementation that explicitly declares the specific backend feature The reason why .reset_map is one of the preconditions for persistent iotlb is that without it, vhost-vdpa can't switch the iotlb back to the initial state later on, especially for the on-chip IOMMU case which starts with identity mapping at device creation. virtio-vdpa requires the on-chip IOMMU to perform 1:1 passthrough translation from PA to IOVA as-is to begin with, and .reset_map is the only means to turn the iotlb back to identity mapping mode after vhost-vdpa is gone.
Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 15 +++ include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index a3f8160..c92794f 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -413,6 +413,15 @@ static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) return ops->get_vq_desc_group; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map || + vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -725,6 +734,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME) | BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK))) @@ -741,6 +751,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -796,6 +809,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); features |= vhost_vdpa_get_backend_features(v); if (copy_to_user(featurep, &features, sizeof(features))) r = -EFAULT; diff --git a/include/uapi/linux/vhost_types.h
b/include/uapi/linux/vhost_types.h index 18ad6ae..d765690 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -190,5 +190,7 @@ struct vhost_vdpa_iova_range { * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID. */ #define VHOST_BACKEND_F_DESC_ASID 0x7 +/* IOTLB mappings are not flushed across device reset */ +#define VHOST_BACKEND_F_IOTLB_PERSIST 0x8 #endif -- 1.8.3.1
[PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce the needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For this to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore the 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For context, those on-chip IOMMU parent devices create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is based on the descriptor group v3 series from Dragos.
[2] [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html [2] [PATCH vhost v3 00/16] vdpa: Add support for vq descriptor mappings https://lore.kernel.org/lkml/20231009112401.1060447-1-dtatu...@nvidia.com/ --- v1: - rewrote commit messages to include more detailed description and background - reworded "on-chip IOMMU" to "vendor specific IOMMU implementation" - included parent device backend features in the persistent iotlb precondition - reimplemented the mlx5_vdpa patch on top of the descriptor group series RFC v3: - fix missing return due to merge error in patch #4 RFC v2: - rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/ --- Si-Wei Liu (4): vdpa: introduce .reset_map operation callback vhost-vdpa: reset vendor specific mapping to initial state in .release vhost-vdpa: introduce IOTLB_PERSIST backend feature bit vdpa/mlx5: implement .reset_map driver op drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 + drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +- drivers/vhost/vdpa.c | 31 +++ include/linux/vdpa.h | 10 ++ include/uapi/linux/vhost_types.h | 2 ++ 6 files changed, 74 insertions(+), 5 deletions(-) -- 1.8.3.1
[PATCH 1/4] vdpa: introduce .reset_map operation callback
Device-specific IOMMU parent drivers that wish to see mappings decoupled from the virtio or vdpa device life cycle (device reset) can use it to restore the memory mapping in the device IOMMU to the initial or default state. The reset of mappings is done on a per-address-space basis. The reason why a separate .reset_map op is introduced is that it allows a simple on-chip IOMMU model without exposing too many device implementation details to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exist other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpected way, these on-chip IOMMU devices can start in 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch the iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model. Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index d376309..26ae6ae 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -327,6 +327,15 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping to the default + * state (optional) + * Needed for devices that are using device + * specific DMA translation and prefer mapping + * to be decoupled from the virtio life cycle, + * i.e.
device .reset op does not reset mapping + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -405,6 +414,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
Devices with an on-chip IOMMU or vendor specific IOTLB implementation may need to restore the iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause mappings to go away on such an IOTLB model that needs persistent mapping. Before vhost-vdpa goes away, give them a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 1.8.3.1
Re: [PATCH 00/16] vdpa: Add support for vq descriptor mappings
On 9/25/2023 12:59 AM, Dragos Tatulea wrote: On Tue, 2023-09-12 at 16:01 +0300, Dragos Tatulea wrote: This patch series adds support for vq descriptor table mappings which are used to improve vdpa live migration downtime. The improvement comes from using smaller mappings which take less time to create and destroy in hw. Gentle ping. Note that I will have to send a v2. The changes in mlx5_ifc.h will need to be merged first separately into the mlx5-next branch [0] and then pulled from there when the series is applied. This separation is unnecessary, as historically the virtio emulation portion of the update to mlx5_ifc.h often had to go through the vhost tree. See commits 1892a3d425bf and e13cd45d352d. Especially the additions from this series (mainly desc group mkey) have nothing to do with any networking or NIC driver feature. -Siwei [0] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/log/?h=mlx5-next Thanks, Dragos The first part adds the vdpa core changes from Si-Wei [0]. The second part adds support in mlx5_vdpa: - Refactor the mr code to be able to cleanly add descriptor mappings. - Add hardware descriptor mr support. - Properly update iotlb for cvq during ASID switch. 
[0] https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com Dragos Tatulea (13): vdpa/mlx5: Create helper function for dma mappings vdpa/mlx5: Decouple cvq iotlb handling from hw mapping code vdpa/mlx5: Take cvq iotlb lock during refresh vdpa/mlx5: Collapse "dvq" mr add/delete functions vdpa/mlx5: Rename mr destroy functions vdpa/mlx5: Allow creation/deletion of any given mr struct vdpa/mlx5: Move mr mutex out of mr struct vdpa/mlx5: Improve mr update flow vdpa/mlx5: Introduce mr for vq descriptor vdpa/mlx5: Enable hw support for vq descriptor mapping vdpa/mlx5: Make iotlb helper functions more generic vdpa/mlx5: Update cvq iotlb mapping on ASID change Cover letter: vdpa/mlx5: Add support for vq descriptor mappings Si-Wei Liu (3): vdpa: introduce dedicated descriptor group for virtqueue vhost-vdpa: introduce descriptor group backend feature vhost-vdpa: uAPI to get dedicated descriptor group id drivers/vdpa/mlx5/core/mlx5_vdpa.h | 31 +++-- drivers/vdpa/mlx5/core/mr.c | 191 - drivers/vdpa/mlx5/core/resources.c | 6 +- drivers/vdpa/mlx5/net/mlx5_vnet.c | 100 ++- drivers/vhost/vdpa.c | 27 include/linux/mlx5/mlx5_ifc.h | 8 +- include/linux/mlx5/mlx5_ifc_vdpa.h | 7 +- include/linux/vdpa.h | 11 ++ include/uapi/linux/vhost.h | 8 ++ include/uapi/linux/vhost_types.h | 5 + 10 files changed, 264 insertions(+), 130 deletions(-) ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 00/16] vdpa: Add support for vq descriptor mappings
On 9/13/2023 9:08 AM, Eugenio Perez Martin wrote: On Wed, Sep 13, 2023 at 3:03 AM Lei Yang wrote: Hi Dragos, Eugenio and Si-Wei, I'm Lei Yang, a software Quality Engineer from Red Hat, and I always pay attention to live migration downtime issues, because other QEs have asked about this problem when I shared live migration status recently. Therefore I would like to test it in my environment. Before testing, I want to know if there is an expected downtime range based on this series of patches. In addition, QE can also help with a regression test based on this series of patches if there is a requirement. Hi Lei, Thanks for offering the testing bandwidth! I think we can only do regression tests here, as the userland part has not been sent to qemu yet. Right, regression only for now. Even once QEMU has it, exercising the relevant feature would need supporting firmware that is not yet available. Just stay tuned. Thanks for your patience, -Siwei Regards and thanks Lei On Tue, Sep 12, 2023 at 9:04 PM Dragos Tatulea wrote: This patch series adds support for vq descriptor table mappings which are used to improve vdpa live migration downtime. The improvement comes from using smaller mappings which take less time to create and destroy in hw. The first part adds the vdpa core changes from Si-Wei [0]. The second part adds support in mlx5_vdpa: - Refactor the mr code to be able to cleanly add descriptor mappings. - Add hardware descriptor mr support. - Properly update iotlb for cvq during ASID switch.
[0] https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com Dragos Tatulea (13): vdpa/mlx5: Create helper function for dma mappings vdpa/mlx5: Decouple cvq iotlb handling from hw mapping code vdpa/mlx5: Take cvq iotlb lock during refresh vdpa/mlx5: Collapse "dvq" mr add/delete functions vdpa/mlx5: Rename mr destroy functions vdpa/mlx5: Allow creation/deletion of any given mr struct vdpa/mlx5: Move mr mutex out of mr struct vdpa/mlx5: Improve mr update flow vdpa/mlx5: Introduce mr for vq descriptor vdpa/mlx5: Enable hw support for vq descriptor mapping vdpa/mlx5: Make iotlb helper functions more generic vdpa/mlx5: Update cvq iotlb mapping on ASID change Cover letter: vdpa/mlx5: Add support for vq descriptor mappings Si-Wei Liu (3): vdpa: introduce dedicated descriptor group for virtqueue vhost-vdpa: introduce descriptor group backend feature vhost-vdpa: uAPI to get dedicated descriptor group id drivers/vdpa/mlx5/core/mlx5_vdpa.h | 31 +++-- drivers/vdpa/mlx5/core/mr.c| 191 - drivers/vdpa/mlx5/core/resources.c | 6 +- drivers/vdpa/mlx5/net/mlx5_vnet.c | 100 ++- drivers/vhost/vdpa.c | 27 include/linux/mlx5/mlx5_ifc.h | 8 +- include/linux/mlx5/mlx5_ifc_vdpa.h | 7 +- include/linux/vdpa.h | 11 ++ include/uapi/linux/vhost.h | 8 ++ include/uapi/linux/vhost_types.h | 5 + 10 files changed, 264 insertions(+), 130 deletions(-) -- 2.41.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC v3 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
On 9/12/2023 12:01 AM, Jason Wang wrote: On Tue, Sep 12, 2023 at 8:28 AM Si-Wei Liu wrote: On 9/10/2023 8:52 PM, Jason Wang wrote: On Sat, Sep 9, 2023 at 9:46 PM Si-Wei Liu wrote: Userspace needs this feature flag to distinguish if vhost-vdpa iotlb in the kernel supports persistent IOTLB mapping across device reset. As discussed, the IOTLB persists for devices with platform IOMMU at least. You've mentioned that this behaviour is covered by Qemu since it reset IOTLB on reset. I wonder what happens if we simply decouple the IOTLB reset from vDPA reset in Qemu. Could we meet bugs there? Not sure I understand. Simple decouple meaning to remove memory_listener registration/unregistration calls *unconditionally* from the vDPA dev start/stop path in QEMU, or make it conditional around the existence of PERSIST_IOTLB? If my memory is correct, currently the listeners were registered/unregistered during start/stop. I mean what if we register/unregister during init/free? Yes, the move to init/cleanup was assumed in below response. If unconditional, it breaks older host kernel, as the older kernel still silently drops all mapping across vDPA reset (VM reboot), Ok, right. ending up with network loss afterwards; if make the QEMU change conditional on PERSIST_IOTLB without the .reset_map API, it can't cover the virtio-vdpa 1:1 identity mapping case. Ok, I see. Let's add the above and explain why it can't cover the 1:1 mapping somewhere (probably the commit log) in the next version. OK. Will do. So I think we can probably introduce reset_map() but not say it's for on-chip IOMMU. We can probably say, it's for resetting the vendor specific mapping into initialization state? For sure. That's exactly the intent, though I didn't specifically tie on-chip to two-dimension or entity mapping. Yes I can reword to "vendor specific" in the next rev to avoid confusions and ambiguity. Thanks, -Siwei Btw, is there a Qemu patch for reference for this new feature? 
There's a WIP version, not ready yet for review: https://github.com/siwliu-kernel/qemu branch: vdpa-svq-asid-poc Will need to clean up code and split to smaller patches before I can post it, if the kernel part can be settled. Ok. Thanks Thanks, -Siwei Thanks There are two cases that backend may claim this feature bit on: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver Signed-off-by: Si-Wei Liu --- RFC v2 -> v3: - fix missing return due to merge error --- drivers/vhost/vdpa.c | 16 +++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 71fbd559..b404504 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -414,6 +414,14 @@ static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) return ops->get_vq_desc_group; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -716,7 +724,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | -BIT_ULL(VHOST_BACKEND_F_RESUME))) +BIT_ULL(VHOST_BACKEND_F_RESUME) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST))) return -EOPNOTSUPP; if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) && !vhost_vdpa_can_suspend(v)) @@ -730,6 +739,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + return -EOPNOTSUPP; 
vhost_set_backend_features(&v->vdev, features); return 0; } @@ -785,6 +797,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persist
Re: [PATCH RFC v3 2/4] vdpa/mlx5: implement .reset_map driver op
On 9/11/2023 11:53 PM, Jason Wang wrote: On Tue, Sep 12, 2023 at 8:02 AM Si-Wei Liu wrote: On 9/10/2023 8:48 PM, Jason Wang wrote: On Sat, Sep 9, 2023 at 9:46 PM Si-Wei Liu wrote: Today, mlx5_vdpa gets started by preallocate 1:1 DMA mapping at device creation time, while this 1:1 mapping will be implicitly destroyed when the first .set_map call is invoked. Everytime when the .reset callback is invoked, any mapping left behind will be dropped then reset back to the initial 1:1 DMA mapping. In order to reduce excessive memory mapping cost during live migration, it is desirable to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. mappings should be left intact across virtio device reset. Leverage the .reset_map callback to reset memory mapping, then the device .reset routine can run free from having to clean up memory mappings. It's not clear the direct relationship between the persistent mapping and reset_map. Consider .reset_map as a simplified abstraction for on-chip IOMMU model, decoupling memory mapping mode switching from the current vdpa_reset hack. Slightly different than platform IOMMU iommu_domain_alloc/free, but works the best with existing .dma_map/.set_map APIs. Note that iommu_domain_alloc/free doesn't imply any mappings (even the identity mapping). Forget about this part, it just exposes the multi-dimension aspect of iommu domain unnecessarily, and I think we both don't like to. Although this was intended to make virtio-vdpa work seamlessly when it is used over mlx5-vdpa, similar to the DMA device deviation introduced to the vDPA driver API. Thanks, -Siwei As said in the other email, the distinction cannot be hidden, as there are bus drivers with varied mapping needs. On the other hand, I can live with the iommu_domain_alloc/free flavor strictly following the platform IOMMU model, but not sure if worth the complexity. I'm not sure I get this, maybe you can post some RFC or pseudo code? Could we do it step by step? 
For example, remove the mlx5_vdpa_destroy_mr() in mlx5_vdpa_reset() when PERSIST_IOTLB exists? I think today there's no way for the parent driver to negotiate backend features with userspace, for e.g. parent won't be able to perform mlx5_vdpa_destroy_mr for the virtio-vdpa case when PERSIST_IOTLB doesn't exist. And this backend features stuff is a vhost specific thing, not specifically tied to vdpa itself. How do we get it extended and propagated up to the vdpa bus layer? Just to make sure we are on the same page, I just want to know what happens if we simply remove mlx5_vdpa_destroy_mr() in mlx5_vdpa_reset()? And then we can deal with the resetting and others on top, For this proposed fix, dealing with vdpa_reset from vhost-vdpa is not specifically an issue, but how to get the mapping reverted back to 1:1 identity/passthrough when users want to switch from vhost-vdpa to virtio-vdpa is. or it needs some explanation for why reset_map() must be done first. Yep, I can add more to the commit log. Thanks Thanks, -Siwei Thanks Signed-off-by: Si-Wei Liu --- RFC v1 -> v2: - fix error path when both CVQ and DVQ fall in same asid --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 70 +++--- drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- 3 files changed, 56 insertions(+), 33 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index b53420e..5c9a25a 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -123,6 +123,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev); void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 5a1971fc..ec2c7b4e1 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -489,21 +489,15 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_mr *mr } } -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return; - prune_iotlb(mvdev); } -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev) { struct mlx5_vdpa_mr *mr = &
Re: [PATCH RFC v2 1/4] vdpa: introduce .reset_map operation callback
On 9/11/2023 11:23 PM, Jason Wang wrote: On Tue, Sep 12, 2023 at 7:31 AM Si-Wei Liu wrote: Hi Jason, On 9/10/2023 8:42 PM, Jason Wang wrote: Hi Si-Wei: On Sat, Sep 9, 2023 at 9:34 PM Si-Wei Liu wrote: On-chip IOMMU parent driver could use it to restore memory mapping to the initial state. As discussed before. On-chip IOMMU is the hardware details that need to be hidden by the vDPA bus. I guess today this is exposed to the bus driver layer already, for e.g. vhost_vdpa_map() can call into the .dma_map, or .set_map, or iommu_map() flavors depending on the specific hardware IOMMU implementation underneath? Specifically, "struct iommu_domain *domain" is now part of "struct vhost_vdpa" at an individual bus driver (vhost-vdpa), rather than being wrapped around under the vdpa core "struct vdpa_device" as vdpa device level object. Do we know for what reason the hardware details could be exposed to bus callers like vhost_vdpa_map and vhost_vdpa_general_unmap, while it's prohibited for other similar cases on the other hand? Or is there a boundary in between I was not aware of? Let me try to explain: set_map(), dma_map(), dma_unmap() is used for parent specific mappings. It means the parents want to do vendor specific setup for the mapping. The abstraction of translation is still one dimension (thought the actual implementation in the parent could be two dimensions). So it's not necessarily the on-chip stuff (see the example of the VDUSE). That means we never expose two dimension mappings like (on-chip) beyond the bus. So it's not one dimension vs two dimensions but the platform specific mappings vs vendor specific mappings. OK, I think I saw on-chip was used interchangeably for vendor specific means of mapping even for VDUSE. While I think we both agreed it's too complex to expose the details of two-dimensions and we should try to avoid that (I thought on-chip doesn't imply two-dimension but just the vendor specific part). 
That's the reason why I hide this special detail under a simple .reset_map interface such that we could easily decouple mapping from virtio life cycle (device reset). I think a more fundamental question I don't quite understand, is adding an extra API to on-chip IOMMU itself an issue, or just that you don't like the way how the IOMMU model gets exposed via this specific API of .reset_map? extra API to on-chip IOMMU, since the on-chip logics should be hidden by the bus unless we want to introduce the two dimensions abstraction (which seems to be an overkill). Thanks for clarifications of your concern. I will rephrase on-chip to "vendor specific" and try to avoid mentioning the two-dimension aspect of the API. For the platform IOMMU case, internally there exists distinction between the 1:1 identity (passthrough) mode and DMA page mapping mode, and this distinction is somehow getting exposed and propagated through the IOMMU API - for e.g. iommu_domain_alloc() and iommu_attach_device() are being called explicitly from vhost_vdpa_alloc_domain() by vhost-vdpa (and the opposite from within vhost_vdpa_free_domain), while for virtio-vdpa it doesn't call any IOMMU API at all on the other hand It's the way the kernel manages DMA mappings. For a userspace driver via vhost-vDPA, it needs to call IOMMU APIs. And for a kernel driver via virtio-vDPA, DMA API is used (via the dma_dev exposed through virtio_vdpa). DMA API may decide to call IOMMU API if IOMMU is enabled but not in passthrough mode. Right, I think what I meant is, distinction of mapping requirement exists between two bus drivers, vhost-vdpa and virtio-vdpa. It's impossible to hide every details (identity, swiotlb, dmar) under the cover of DMA API simply using the IOMMU API abstraction. 
Same applies to how one dimension oriented vendor specific API ( .dma_map/.set_map I mean) can't cover all cases of potentially multi-dimensional mapping requirements from virtio-vdpa (which is using a feature rich DMA API instead of simple and lower level page mapping based IOMMU API). I now get that you may want to understand why .reset_map is required and which part of the userspace functionality won't work without it, on the other hand. - which is to inherit what default IOMMU domain has. Yes, but it's not a 1:1 (identity) mapping, it really depends on the configuration. (And there could even be a swiotlb layer in the middle). Yes, so I said inherit the configuration of the default domain, which could vary versus one-dimension. Ideally for on-chip IOMMU we can and should do pretty much the same, but I don't think there's a clean way without introducing any driver API to make vhost-vdpa case distinguish from the virtio-vdpa case. I'm afraid to say that it was just a hack to hide the necessary distinction needed by vdpa bus users for e.g. in the deep of vdpa_reset(), if not introducing any new driver API is the goal here... So reset_map() is fine if it is not defined just f
Re: [PATCH] vdpa: consume device_features parameter
Thanks David, for clarifications. Now I see the patch just got posted by Shannon (thanks!) with the correct iproute2 label in the subject line. We may expect to see this land on iproute2 repo soon? Thanks! -Siwei On 9/9/2023 1:36 PM, David Ahern wrote: On 9/8/23 12:37 PM, Si-Wei Liu wrote: Just out of my own curiosity, the patch is not applicable simply because the iproute2 label was missing from the subject, or the code base somehow got changed that isn't aligned with the patch any more? Most likely missing the iproute2 label in the Subject line. Thanks, -Siwei ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC v3 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
On 9/10/2023 8:52 PM, Jason Wang wrote: On Sat, Sep 9, 2023 at 9:46 PM Si-Wei Liu wrote: Userspace needs this feature flag to distinguish if vhost-vdpa iotlb in the kernel supports persistent IOTLB mapping across device reset. As discussed, the IOTLB persists for devices with platform IOMMU at least. You've mentioned that this behaviour is covered by Qemu since it reset IOTLB on reset. I wonder what happens if we simply decouple the IOTLB reset from vDPA reset in Qemu. Could we meet bugs there? Not sure I understand. Simple decouple meaning to remove memory_listener registration/unregistration calls *unconditionally* from the vDPA dev start/stop path in QEMU, or make it conditional around the existence of PERSIST_IOTLB? If unconditional, it breaks older host kernel, as the older kernel still silently drops all mapping across vDPA reset (VM reboot), ending up with network loss afterwards; if make the QEMU change conditional on PERSIST_IOTLB without the .reset_map API, it can't cover the virtio-vdpa 1:1 identity mapping case. Btw, is there a Qemu patch for reference for this new feature? There's a WIP version, not ready yet for review: https://github.com/siwliu-kernel/qemu branch: vdpa-svq-asid-poc Will need to clean up code and split to smaller patches before I can post it, if the kernel part can be settled. 
Thanks, -Siwei Thanks There are two cases that backend may claim this feature bit on: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver Signed-off-by: Si-Wei Liu --- RFC v2 -> v3: - fix missing return due to merge error --- drivers/vhost/vdpa.c | 16 +++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 71fbd559..b404504 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -414,6 +414,14 @@ static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) return ops->get_vq_desc_group; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -716,7 +724,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | -BIT_ULL(VHOST_BACKEND_F_RESUME))) +BIT_ULL(VHOST_BACKEND_F_RESUME) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST))) return -EOPNOTSUPP; if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) && !vhost_vdpa_can_suspend(v)) @@ -730,6 +739,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -785,6 +797,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= 
BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); if (copy_to_user(featurep, &features, sizeof(features))) r = -EFAULT; break; diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h index 6acc604..0fdb6f0 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -186,5 +186,7 @@ struct vhost_vdpa_iova_range { * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID. */ #define VHOST_BACKEND_F_DESC_ASID0x6 +/* IOTLB don't flush memory mapping across device reset */ +#define VHOST_BACKEND_F_IOTLB_PERSIST 0x7 #endif -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC v3 2/4] vdpa/mlx5: implement .reset_map driver op
On 9/10/2023 8:48 PM, Jason Wang wrote: On Sat, Sep 9, 2023 at 9:46 PM Si-Wei Liu wrote: Today, mlx5_vdpa gets started by preallocate 1:1 DMA mapping at device creation time, while this 1:1 mapping will be implicitly destroyed when the first .set_map call is invoked. Everytime when the .reset callback is invoked, any mapping left behind will be dropped then reset back to the initial 1:1 DMA mapping. In order to reduce excessive memory mapping cost during live migration, it is desirable to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. mappings should be left intact across virtio device reset. Leverage the .reset_map callback to reset memory mapping, then the device .reset routine can run free from having to clean up memory mappings. It's not clear the direct relationship between the persistent mapping and reset_map. Consider .reset_map as a simplified abstraction for on-chip IOMMU model, decoupling memory mapping mode switching from the current vdpa_reset hack. Slightly different than platform IOMMU iommu_domain_alloc/free, but works the best with existing .dma_map/.set_map APIs. As said in the other email, the distinction cannot be hidden, as there are bus drivers with varied mapping needs. On the other hand, I can live with the iommu_domain_alloc/free flavor strictly following the platform IOMMU model, but not sure if worth the complexity. Could we do it step by step? For example, remove the mlx5_vdpa_destroy_mr() in mlx5_vdpa_reset() when PERSIST_IOTLB exists? I think today there's no way for the parent driver to negotiate backend features with userspace, for e.g. parent won't be able to perform mlx5_vdpa_destroy_mr for the virtio-vdpa case when PERSIST_IOTLB doesn't exist. And this backend features stuff is a vhost specific thing, not specifically tied to vdpa itself. How do we get it extended and propagated up to the vdpa bus layer? 
And then we can deal with the resetting and others on top, For this proposed fix, dealing with vdpa_reset from vhost-vdpa is not specifically an issue, but how to get the mapping reverted back to 1:1 identity/passthrough when users want to switch from vhost-vdpa to virtio-vdpa is. or it needs some explanation for why reset_map() must be done first. Yep, I can add more to the commit log. Thanks, -Siwei Thanks Signed-off-by: Si-Wei Liu --- RFC v1 -> v2: - fix error path when both CVQ and DVQ fall in same asid --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 70 +++--- drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- 3 files changed, 56 insertions(+), 33 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index b53420e..5c9a25a 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -123,6 +123,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev); void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 5a1971fc..ec2c7b4e1 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -489,21 +489,15 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_mr *mr } } -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return; - prune_iotlb(mvdev); } -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev) { struct mlx5_vdpa_mr *mr = &mvdev->mr; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return; - if (!mr->initialized) return; @@ -521,8 +515,10 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid) mutex_lock(&mvdev->mkey_mtx); - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid); - _mlx5_vdpa_destroy_cvq_mr(mvdev, asid); + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) + _mlx5_vdpa_destroy_dvq_mr(mvdev); + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) + _mlx5_vdpa_destroy_cvq_mr(mvdev); mutex_unlock(&mvdev->mkey_mtx); } @@ -534,25 +530,17 @@ void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev)
Re: [PATCH RFC v2 1/4] vdpa: introduce .reset_map operation callback
Hi Jason, On 9/10/2023 8:42 PM, Jason Wang wrote: Hi Si-Wei: On Sat, Sep 9, 2023 at 9:34 PM Si-Wei Liu wrote: On-chip IOMMU parent driver could use it to restore memory mapping to the initial state. As discussed before. On-chip IOMMU is the hardware details that need to be hidden by the vDPA bus. I guess today this is exposed to the bus driver layer already, for e.g. vhost_vdpa_map() can call into the .dma_map, or .set_map, or iommu_map() flavors depending on the specific hardware IOMMU implementation underneath? Specifically, "struct iommu_domain *domain" is now part of "struct vhost_vdpa" at an individual bus driver (vhost-vdpa), rather than being wrapped around under the vdpa core "struct vdpa_device" as vdpa device level object. Do we know for what reason the hardware details could be exposed to bus callers like vhost_vdpa_map and vhost_vdpa_general_unmap, while it's prohibited for other similar cases on the other hand? Or is there a boundary in between I was not aware of? I think a more fundamental question I don't quite understand, is adding an extra API to on-chip IOMMU itself an issue, or just that you don't like the way how the IOMMU model gets exposed via this specific API of .reset_map? For the platform IOMMU case, internally there exists distinction between the 1:1 identity (passthrough) mode and DMA page mapping mode, and this distinction is somehow getting exposed and propagated through the IOMMU API - for e.g. iommu_domain_alloc() and iommu_attach_device() are being called explicitly from vhost_vdpa_alloc_domain() by vhost-vdpa (and the opposite from within vhost_vdpa_free_domain), while for virtio-vdpa it doesn't call any IOMMU API at all on the other hand - which is to inherit what default IOMMU domain has. Ideally for on-chip IOMMU we can and should do pretty much the same, but I don't think there's a clean way without introducing any driver API to make vhost-vdpa case distinguish from the virtio-vdpa case. 
I'm afraid to say that it was just a hack to hide the necessary distinction needed by vdpa bus users for e.g. in the deep of vdpa_reset(), if not introducing any new driver API is the goal here... Exposing this will complicate the implementation of bus drivers. As said above, this distinction is needed by bus drivers, and it's already done by platform IOMMU via IOMMU API. I can drop the .reset_map API while add another set of similar driver API to mimic iommu_domain_alloc/iommu_domain_free, but doing this will complicate the parent driver's implementation on the other hand. While .reset_map is what I can think of to be the simplest for parent, I can do the other way if you're fine with it. Let me know how it sounds. Thanks, -Siwei Thanks Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 17a4efa..daecf55 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -324,6 +324,12 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping (optional) + * Needed for device that using device + * specific DMA translation (on-chip IOMMU) + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -401,6 +407,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org 
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v3 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to distinguish if vhost-vdpa iotlb in the kernel supports persistent IOTLB mapping across device reset. There are two cases that backend may claim this feature bit on: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver Signed-off-by: Si-Wei Liu --- RFC v2 -> v3: - fix missing return due to merge error --- drivers/vhost/vdpa.c | 16 +++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 71fbd559..b404504 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -414,6 +414,14 @@ static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) return ops->get_vq_desc_group; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -716,7 +724,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | -BIT_ULL(VHOST_BACKEND_F_RESUME))) +BIT_ULL(VHOST_BACKEND_F_RESUME) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST))) return -EOPNOTSUPP; if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) && !vhost_vdpa_can_suspend(v)) @@ -730,6 +739,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -785,6 +797,8 @@ static long vhost_vdpa_unlocked_ioctl(struct 
file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); if (copy_to_user(featurep, &features, sizeof(features))) r = -EFAULT; break; diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h index 6acc604..0fdb6f0 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -186,5 +186,7 @@ struct vhost_vdpa_iova_range { * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID. */ #define VHOST_BACKEND_F_DESC_ASID0x6 +/* IOTLB don't flush memory mapping across device reset */ +#define VHOST_BACKEND_F_IOTLB_PERSIST 0x7 #endif -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v3 2/4] vdpa/mlx5: implement .reset_map driver op
Today, mlx5_vdpa gets started by preallocate 1:1 DMA mapping at device creation time, while this 1:1 mapping will be implicitly destroyed when the first .set_map call is invoked. Everytime when the .reset callback is invoked, any mapping left behind will be dropped then reset back to the initial 1:1 DMA mapping. In order to reduce excessive memory mapping cost during live migration, it is desirable to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. mappings should be left intact across virtio device reset. Leverage the .reset_map callback to reset memory mapping, then the device .reset routine can run free from having to clean up memory mappings. Signed-off-by: Si-Wei Liu --- RFC v1 -> v2: - fix error path when both CVQ and DVQ fall in same asid --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 70 +++--- drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- 3 files changed, 56 insertions(+), 33 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index b53420e..5c9a25a 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -123,6 +123,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev); void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 5a1971fc..ec2c7b4e1 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -489,21 +489,15 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_mr *mr } } -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return; - prune_iotlb(mvdev); } -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev) { struct mlx5_vdpa_mr *mr = &mvdev->mr; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return; - if (!mr->initialized) return; @@ -521,8 +515,10 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid) mutex_lock(&mvdev->mkey_mtx); - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid); - _mlx5_vdpa_destroy_cvq_mr(mvdev, asid); + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) + _mlx5_vdpa_destroy_dvq_mr(mvdev); + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) + _mlx5_vdpa_destroy_cvq_mr(mvdev); mutex_unlock(&mvdev->mkey_mtx); } @@ -534,25 +530,17 @@ void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev) } static int _mlx5_vdpa_create_cvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return 0; - return dup_iotlb(mvdev, iotlb); } static int _mlx5_vdpa_create_dvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { struct mlx5_vdpa_mr *mr = &mvdev->mr; int err; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return 0; - if (mr->initialized) return 0; @@ -574,18 +562,22 @@ static int _mlx5_vdpa_create_mr(struct 
mlx5_vdpa_dev *mvdev, { int err; - err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid); - if (err) - return err; - - err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb, asid); - if (err) - goto out_err; + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb); + if (err) + return err; + } + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb); + if (err) + goto out_err; + } return 0; out_err: - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid);
[PATCH RFC v3 3/4] vhost-vdpa: should restore 1:1 dma mapping before detaching driver
Devices with on-chip IOMMU may need to restore iotlb to 1:1 identity mapping from IOVA to PA. Before vhost-vdpa goes away, give them a chance to clean up and reset iotlb back to 1:1 identity mapping mode. This is done so that any vdpa bus driver may start with 1:1 identity mapping by default. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index eabac06..71fbd559 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,14 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with on-chip IOMMU need to restore iotlb +* to 1:1 identity mapping before vhost-vdpa is going +* to be removed and detached from the device. Give +* them a chance to do so, as this cannot be done +* efficiently via the whole-range unmap call above. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v3 1/4] vdpa: introduce .reset_map operation callback
On-chip IOMMU parent driver could use it to restore memory mapping to the initial state. Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 17a4efa..daecf55 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -324,6 +324,12 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping (optional) + * Needed for device that using device + * specific DMA translation (on-chip IOMMU) + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -401,6 +407,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v3 0/4] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device should implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, which is mainly used to reset virtio specific device state. This new .reset_map() callback will be invoked only when the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device add, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html --- RFC v3: - fix missing return due to merge error in patch #4 RFC v2: - rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/ --- Si-Wei Liu (4): vdpa: introduce .reset_map operation callback vdpa/mlx5: implement .reset_map driver op vhost-vdpa: should restore 1:1 dma mapping before detaching driver vhost-vdpa: introduce IOTLB_PERSIST backend feature bit drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 70 +++--- drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- drivers/vhost/vdpa.c | 32 - include/linux/vdpa.h | 7 include/uapi/linux/vhost_types.h | 2 ++ 6 files changed, 96 insertions(+), 34 deletions(-) -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to distinguish if vhost-vdpa iotlb in the kernel supports persistent IOTLB mapping across device reset. There are two cases that backend may claim this feature bit on: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 15 ++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 71fbd559..bbb1092 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -414,6 +414,14 @@ static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) return ops->get_vq_desc_group; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -716,7 +724,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | -BIT_ULL(VHOST_BACKEND_F_RESUME))) +BIT_ULL(VHOST_BACKEND_F_RESUME) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST))) return -EOPNOTSUPP; if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) && !vhost_vdpa_can_suspend(v)) @@ -729,6 +738,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, return -EINVAL; if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) return -EOPNOTSUPP; vhost_set_backend_features(>vdev, features); return 0; @@ -785,6 +796,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if 
(vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); if (copy_to_user(featurep, , sizeof(features))) r = -EFAULT; break; diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h index 6acc604..0fdb6f0 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -186,5 +186,7 @@ struct vhost_vdpa_iova_range { * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID. */ #define VHOST_BACKEND_F_DESC_ASID0x6 +/* IOTLB don't flush memory mapping across device reset */ +#define VHOST_BACKEND_F_IOTLB_PERSIST 0x7 #endif -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 1/4] vdpa: introduce .reset_map operation callback
On-chip IOMMU parent driver could use it to restore memory mapping to the initial state. Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 17a4efa..daecf55 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -324,6 +324,12 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping (optional) + * Needed for device that using device + * specific DMA translation (on-chip IOMMU) + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -401,6 +407,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 0/4] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device should implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, which is mainly used to reset virtio specific device state. This new .reset_map() callback will be invoked only when the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device add, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html --- RFC v2: - rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/ --- Si-Wei Liu (4): vdpa: introduce .reset_map operation callback vdpa/mlx5: implement .reset_map driver op vhost-vdpa: should restore 1:1 dma mapping before detaching driver vhost-vdpa: introduce IOTLB_PERSIST backend feature bit drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 70 +++--- drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- drivers/vhost/vdpa.c | 32 - include/linux/vdpa.h | 7 include/uapi/linux/vhost_types.h | 2 ++ 6 files changed, 96 insertions(+), 34 deletions(-) -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 2/4] vdpa/mlx5: implement .reset_map driver op
Today, mlx5_vdpa starts with a preallocated 1:1 DMA mapping at device creation time; this 1:1 mapping is implicitly destroyed when the first .set_map call is invoked. Every time the .reset callback is invoked, any mappings left behind are dropped, then reset back to the initial 1:1 DMA mapping. In order to reduce excessive memory mapping cost during live migration, it is desirable to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. mappings should be left intact across virtio device reset. Leverage the .reset_map callback to reset memory mapping, so that the device .reset routine can run free from having to clean up memory mappings. Signed-off-by: Si-Wei Liu --- RFC v1 -> v2: - fix error path when both CVQ and DVQ fall in same asid --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 70 +++--- drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- 3 files changed, 56 insertions(+), 33 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index b53420e..5c9a25a 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -123,6 +123,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev); void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 5a1971fc..ec2c7b4e1 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -489,21 +489,15 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_mr *mr } } -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return; - prune_iotlb(mvdev); } -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev) { struct mlx5_vdpa_mr *mr = >mr; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return; - if (!mr->initialized) return; @@ -521,8 +515,10 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid) mutex_lock(>mkey_mtx); - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid); - _mlx5_vdpa_destroy_cvq_mr(mvdev, asid); + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) + _mlx5_vdpa_destroy_dvq_mr(mvdev); + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) + _mlx5_vdpa_destroy_cvq_mr(mvdev); mutex_unlock(>mkey_mtx); } @@ -534,25 +530,17 @@ void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev) } static int _mlx5_vdpa_create_cvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return 0; - return dup_iotlb(mvdev, iotlb); } static int _mlx5_vdpa_create_dvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { struct mlx5_vdpa_mr *mr = >mr; int err; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return 0; - if (mr->initialized) return 0; @@ -574,18 +562,22 @@ static int _mlx5_vdpa_create_mr(struct 
mlx5_vdpa_dev *mvdev, { int err; - err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid); - if (err) - return err; - - err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb, asid); - if (err) - goto out_err; + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb); + if (err) + return err; + } + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb); + if (err) + goto out_err; + } return 0; out_err: - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid);
[PATCH RFC v2 3/4] vhost-vdpa: should restore 1:1 dma mapping before detaching driver
Devices with on-chip IOMMU may need to restore iotlb to 1:1 identity mapping from IOVA to PA. Before vhost-vdpa goes away, give them a chance to clean up and reset iotlb back to 1:1 identity mapping mode. This is done so that any vdpa bus driver may start with 1:1 identity mapping by default. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index eabac06..71fbd559 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,14 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with on-chip IOMMU need to restore iotlb +* to 1:1 identity mapping before vhost-vdpa is going +* to be removed and detached from the device. Give +* them a chance to do so, as this cannot be done +* efficiently via the whole-range unmap call above. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 3/3] vhost-vdpa: uAPI to get dedicated descriptor group id
With _F_DESC_ASID backend feature, the device can now support the VHOST_VDPA_GET_VRING_DESC_GROUP ioctl, and it may expose the descriptor table (including avail and used ring) in a different group than the buffers it contains. This new uAPI will fetch the group ID of the descriptor table. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 10 ++ include/uapi/linux/vhost.h | 8 2 files changed, 18 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index f2e5dce..eabac06 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -602,6 +602,16 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd, else if (copy_to_user(argp, , sizeof(s))) return -EFAULT; return 0; + case VHOST_VDPA_GET_VRING_DESC_GROUP: + if (!vhost_vdpa_has_desc_group(v)) + return -EOPNOTSUPP; + s.index = idx; + s.num = ops->get_vq_desc_group(vdpa, idx); + if (s.num >= vdpa->ngroups) + return -EIO; + else if (copy_to_user(argp, , sizeof(s))) + return -EFAULT; + return 0; case VHOST_VDPA_SET_GROUP_ASID: if (copy_from_user(, argp, sizeof(s))) return -EFAULT; diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index f5c48b6..649560c 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -219,4 +219,12 @@ */ #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) +/* Get the group for the descriptor table including driver & device areas + * of a virtqueue: read index, write group in num. + * The virtqueue index is stored in the index field of vhost_vring_state. + * The group ID of the descriptor table for this specific virtqueue + * is returned via num field of vhost_vring_state. + */ +#define VHOST_VDPA_GET_VRING_DESC_GROUP_IOWR(VHOST_VIRTIO, 0x7F, \ + struct vhost_vring_state) #endif -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group
The following patchset introduces a dedicated group for the descriptor table to reduce live migration downtime when a passthrough VQ is being switched to shadow VQ. This RFC v2 is sent to incorporate the early feedback from reviewers on the uAPI and driver API part of changes; the associated driver patch set consuming this API will come around soon along with formal submission of this series. Some initial performance data will be gathered using the real hardware device with mlx5_vdpa. The target goal of this series is to reduce the SVQ switching overhead to less than 300ms on a ~100GB guest with 2 non-mq vhost-vdpa devices. The reduction in the downtime is thanks to avoiding the full remap in the switching. The plan of the intended driver implementation is to use a dedicated group (specifically, 2 in the table below) to host the descriptor tables for data vqs, different from where buffer addresses are contained (in group 0 as below). cvq does not have to allocate a dedicated group for its descriptor table, so its buffers and descriptor table would always belong to the same group (1 in the table below).

              | data vq | ctrl vq
==============+=========+========
vq_group      |    0    |    1
vq_desc_group |    2    |    1

--- Si-Wei Liu (3): vdpa: introduce dedicated descriptor group for virtqueue vhost-vdpa: introduce descriptor group backend feature vhost-vdpa: uAPI to get dedicated descriptor group id drivers/vhost/vdpa.c | 27 +++ include/linux/vdpa.h | 11 +++ include/uapi/linux/vhost.h | 8 include/uapi/linux/vhost_types.h | 5 + 4 files changed, 51 insertions(+) -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 2/3] vhost-vdpa: introduce descriptor group backend feature
Userspace knows if the device has dedicated descriptor group or not by checking this feature bit. It's only exposed if the vdpa driver backend implements the .get_vq_desc_group() operation callback. Userspace trying to negotiate this feature when it or the dependent _F_IOTLB_ASID feature hasn't been exposed will result in an error. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- RFC v1 -> v2: - add clarifications for what areas F_DESC_ASID should cover --- drivers/vhost/vdpa.c | 17 + include/uapi/linux/vhost_types.h | 5 + 2 files changed, 22 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index b43e868..f2e5dce 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -389,6 +389,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v) return ops->resume; } +static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return ops->get_vq_desc_group; +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -679,6 +687,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if (copy_from_user(, featurep, sizeof(features))) return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | +BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME))) return -EOPNOTSUPP; @@ -688,6 +697,12 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_RESUME)) && !vhost_vdpa_can_resume(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && + !(features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID))) + return -EINVAL; + if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && +!vhost_vdpa_has_desc_group(v)) + return -EOPNOTSUPP; vhost_set_backend_features(>vdev, features); return 0; } @@ -741,6 +756,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file 
*filep, features |= BIT_ULL(VHOST_BACKEND_F_SUSPEND); if (vhost_vdpa_can_resume(v)) features |= BIT_ULL(VHOST_BACKEND_F_RESUME); + if (vhost_vdpa_has_desc_group(v)) + features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); if (copy_to_user(featurep, , sizeof(features))) r = -EFAULT; break; diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h index d3aad12a..6acc604 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -181,5 +181,10 @@ struct vhost_vdpa_iova_range { #define VHOST_BACKEND_F_SUSPEND 0x4 /* Device can be resumed */ #define VHOST_BACKEND_F_RESUME 0x5 +/* Device may expose the virtqueue's descriptor area, driver area and + * device area to a different group for ASID binding than where its + * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID. + */ +#define VHOST_BACKEND_F_DESC_ASID0x6 #endif -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 1/3] vdpa: introduce dedicated descriptor group for virtqueue
In some cases, the access to the virtqueue's descriptor area, device and driver areas (precluding indirect descriptor table in guest memory) may have to be confined to a different address space than where its buffers reside. Without loss of simplicity and generality with already established terminology, let's fold up these 3 areas and call them as a whole as descriptor table group, or descriptor group for short. Specifically, in case of split virtqueues, descriptor group consists of regions for Descriptor Table, Available Ring and Used Ring; for packed virtqueues layout, descriptor group contains Descriptor Ring, Driver and Device Event Suppression structures. The group ID for a dedicated descriptor group can be obtained through a new .get_vq_desc_group() op. If driver implements this op, it means that the descriptor, device and driver areas of the virtqueue may reside in a dedicated group than where its buffers reside, a.k.a the default virtqueue group through the .get_vq_group() op. In principle, the descriptor group may or may not have same group ID as the default group. Even if the descriptor group has a different ID, meaning the vq's descriptor group areas can optionally move to a separate address space than where guest memory resides, the descriptor group may still start from a default address space, same as where its buffers reside. To move the descriptor group to a different address space, .set_group_asid() has to be called to change the ASID binding for the group, which is no different than what needs to be done on any other virtqueue group. On the other hand, the .reset() semantics also applies on descriptor table group, meaning the device reset will clear all ASID bindings and move all virtqueue groups including descriptor group back to the default address space, i.e. in ASID 0. 
QEMU's shadow virtqueue is going to utilize dedicated descriptor group to speed up map and unmap operations, yielding tremendous downtime reduction by avoiding the full and slow remap cycle in SVQ switching. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- RFC v1 -> v2: - expand commit log to mention downtime reduction in switching - add clarifications for what "descriptor group" covers and whatnot --- include/linux/vdpa.h | 11 +++ 1 file changed, 11 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index db1b0ea..17a4efa 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -204,6 +204,16 @@ struct vdpa_map_file { * @vdev: vdpa device * @idx: virtqueue index * Returns u32: group id for this virtqueue + * @get_vq_desc_group: Get the group id for the descriptor table of + * a specific virtqueue (optional) + * @vdev: vdpa device + * @idx: virtqueue index + * Returns u32: group id for the descriptor table + * portion of this virtqueue. Could be different + * than the one from @get_vq_group, in which case + * the access to the descriptor table can be + * confined to a separate asid, isolating from + * the virtqueue's buffer address access. * @get_device_features: Get virtio features supported by the device * @vdev: vdpa device * Returns the virtio features support by the @@ -357,6 +367,7 @@ struct vdpa_config_ops { /* Device ops */ u32 (*get_vq_align)(struct vdpa_device *vdev); u32 (*get_vq_group)(struct vdpa_device *vdev, u16 idx); + u32 (*get_vq_desc_group)(struct vdpa_device *vdev, u16 idx); u64 (*get_device_features)(struct vdpa_device *vdev); int (*set_driver_features)(struct vdpa_device *vdev, u64 features); u64 (*get_driver_features)(struct vdpa_device *vdev); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] vdpa: consume device_features parameter
On 9/7/2023 5:07 PM, David Ahern wrote: On 9/7/23 2:41 PM, Si-Wei Liu wrote: Hi David, Why this patch doesn't get picked in the last 4 months? Maybe the subject is not clear, but this is an iproute2 patch. Would it be possible to merge at your earliest convenience? PS, adding my R-b to the patch. It got marked "Not applicable": https://patchwork.kernel.org/project/netdevbpf/patch/29db10bca7e5ef6b1137282292660fc337a4323a.1683907102.git.allen.hu...@amd.com/ Resend the patch with any reviewed by tags and be sure to cc me. Just out of my own curiosity, the patch is not applicable simply because the iproute2 was missing from the subject, or the code base somehow got changed that isn't aligned with the patch any more? Thanks, -Siwei ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] vdpa: consume device_features parameter
Hi David, Why this patch doesn't get picked in the last 4 months? Maybe the subject is not clear, but this is an iproute2 patch. Would it be possible to merge at your earliest convenience? PS, adding my R-b to the patch. Thanks, -Siwei On Sat, May 13, 2023 at 12:42 AM Shannon Nelson wrote: > > From: Allen Hubbe > > Consume the parameter to device_features when parsing command line > options. Otherwise the parameter may be used again as an option name. > > # vdpa dev add ... device_features 0xdeadbeef mac 00:11:22:33:44:55 > Unknown option "0xdeadbeef" > > Fixes: a4442ce58ebb ("vdpa: allow provisioning device features") > Signed-off-by: Allen Hubbe > Reviewed-by: Shannon Nelson Reviewed-by: Si-Wei Liu > --- > vdpa/vdpa.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/vdpa/vdpa.c b/vdpa/vdpa.c > index 27647d73d498..8a2fca8647b6 100644 > --- a/vdpa/vdpa.c > +++ b/vdpa/vdpa.c > @@ -353,6 +353,8 @@ static int vdpa_argv_parse(struct vdpa *vdpa, int argc, char **argv, > >device_features); > if (err) > return err; > + > + NEXT_ARG_FWD(); > o_found |= VDPA_OPT_VDEV_FEATURES; > } else { > fprintf(stderr, "Unknown option \"%s\"\n", *argv); > -- > 2.17.1 > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
On 8/22/2023 1:54 AM, Jason Wang wrote: On Thu, Aug 17, 2023 at 7:44 AM Si-Wei Liu wrote: On 8/15/2023 6:48 PM, Jason Wang wrote: On Wed, Aug 16, 2023 at 6:31 AM Si-Wei Liu wrote: On 8/14/2023 7:25 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:45 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 +++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 62b0a01..75092a7 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -406,6 +406,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v) return ops->resume; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; So this means the IOTLB/IOMMU mappings have already been decoupled from the vdpa reset. Not in the sense of API; it's been coupled since day one in the implementations of every on-chip IOMMU parent driver, namely mlx5_vdpa and vdpa_sim. Because of that, later on the (improper) support for virtio-vdpa, from commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa") and 6c3d329e6486 ("vdpa_sim: get rid of DMA ops") misused the .reset() op to realize 1:1 mapping, rendering strong coupling between device reset and reset of iotlb mappings. This series tries to rectify that implementation deficiency, while keeping userspace working with the older kernel behavior. So it should have been noticed by the userspace. Yes, userspace had noticed this on-chip IOMMU discrepancy since day one I suppose. Unfortunately there's already code in userspace with this assumption in mind that proactively tears down and sets up iotlb mapping around vdpa device reset... I guess we can just fix the simulator and mlx5 then we are fine? 
Only IF we don't care about running new QEMU on older kernels with flawed on-chip iommu behavior around reset. But that's a big IF... So what I meant is: Userspace doesn't know whether the vendor specific mappings (set_map) are required or not. And in the implementation of vhost_vdpa, if platform IOMMU is used, the mappings are decoupled from the reset. So if the Qemu works with parents with platform IOMMU it means Qemu can work if we just decouple vendor specific mappings from the parents that uses set_map. I was aware of this, and if you may notice I don't even offer a way backward to retain/emulate the flawed vhost-iotlb reset behavior for older userspace - I consider it more of a bug in .set_map driver implementation of its own rather than what the vhost-vdpa iotlb abstraction wishes to expose to userspace in the first place. That's my understanding as well. If you ever look into QEMU's vhost_vdpa_reset_status() function, you may see memory_listener_unregister() will be called to evict all of the existing iotlb mappings right after vhost_vdpa_reset_device() across device reset, and later on at vhost_vdpa_dev_start(), memory_listener_register() will set up all iotlb mappings again. In an ideal world without this on-chip iommu deficiency QEMU should not have to behave this way - this is what I mentioned earlier that userspace had already noticed the discrepancy and it has to "proactively tear down and set up iotlb mapping around vdpa device reset". Apparently from functionality perspective this trick works completely fine with platform IOMMU, however, it's sub-optimal in the performance perspective. Right. We can't simply fix QEMU by moving this memory_listener_unregister() call out of the reset path unconditionally, as we don't want to break the already-functioning older kernel even though it's suboptimal in performance. I'm not sure how things can be broken in this case? 
Things won't be broken if we don't care about performance; for example, rebooting a large-memory VM (translated to a device reset internally) will freeze the guest and introduce extra reboot delay unnecessarily. If we want to fix the performance by removing memory_listener_unregister() unconditionally and we don't have such a flag to distinguish, we will break network connectivity entirely after reset - as all mappings are purged during reset on older parent drivers. Or why it is specific to parents with set_map. As if without the .reset_map op and a corresponding driver implementation (done the correct way), there's no appropriate means for an on-chip iommu parent driver to persist iotlb mappings across reset, isn't it? If the driver deliberately removes it from .reset, they don't support 1:1 DMA mapping for virtio-vdpa on the other hand, for instance. Instead, to keep new QEMU continuing to work on top of the existing or older kernels, QEMU has to check this IOTLB_PERSIST
Re: [PATCH RFC 1/4] vdpa: introduce .reset_map operation callback
On 8/17/2023 8:28 AM, Eugenio Perez Martin wrote: On Thu, Aug 17, 2023 at 2:05 AM Si-Wei Liu wrote: On 8/15/2023 6:55 PM, Jason Wang wrote: On Wed, Aug 16, 2023 at 3:49 AM Si-Wei Liu wrote: On 8/14/2023 7:21 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:46 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index db1b0ea..3a3878d 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -314,6 +314,12 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping (optional) + * Needed for device that using device + * specific DMA translation (on-chip IOMMU) This exposes the device internal to the upper layer which is not optimal. Not sure what does it mean by "device internal", but this op callback just follows existing convention to describe what vdpa parent this API targets. I meant the bus tries to hide the differences among vendors. So it needs to hide on-chip IOMMU stuff to the upper layer. We can expose two dimensional IO mappings models but it looks like over engineering for this issue. More below. * @set_map:Set device memory mapping (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) : : * @dma_map:Map an area of PA to IOVA (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) * and preferring incremental map. : : * @dma_unmap: Unmap an area of IOVA (optional but * must be implemented with dma_map) * Needed for device that using device * specific DMA translation (on-chip IOMMU) * and preferring incremental unmap. Btw, what's the difference between this and a simple set_map(NULL)? I don't think parent drivers support this today - they can accept non-NULL iotlb containing empty map entry, but not a NULL iotlb. 
The behavior is undefined or it even causes a panic when a NULL iotlb is passed in. We can do this simple change if it can work. If we go with setting up 1:1 DMA mapping at virtio-vdpa .probe() and tearing it down at .release(), perhaps set_map(NULL) is not sufficient. Further, this doesn't work with .dma_map parent drivers. Probably, but I'd remove dma_map as it doesn't have any real users except for the simulator. OK, at one point there was a suggestion to get this incremental API extended to support batching to be on par with or even replace .set_map, not sure if it's too soon to conclude. But I'm okay with the removal if need be. Yes, I think the right move in the long run is to delegate the batching to the parent driver. This allows drivers like mlx to add memory (like hotplugged memory) without the need of tearing down all the old maps. Nods. Having said that, maybe we can work on top if we need to remove .dma_map for now. I guess for that sake I would keep .dma_map unless there's strong objection against it. Thanks, -Siwei The reason why a new op is needed or better is because it allows userspace to tell apart different reset behavior from the older kernel (via the F_IOTLB_PERSIST feature bit in patch 4), while this behavior could vary between parent drivers. I'm ok with a new feature flag, but we need to first seek a way to reuse the existing API. A feature flag is needed anyway. I'm fine with reusing but guess I'd want to converge on the direction first. 
Thanks, -Siwei Thanks Regards, -Siwei Thanks + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -390,6 +396,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16
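The trade-off discussed above - an incremental .dma_map/.dma_unmap API versus a whole-table .set_map replacement - can be illustrated with a small user-space model. This is a sketch only; the names (dma_map, set_map, ranges) are toy stand-ins, not the kernel signatures, and the point is simply that the incremental style lets hotplugged memory be added without disturbing existing entries:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RANGES 8

/* Toy mapping table for one ASID. */
static struct { unsigned long start, last; } ranges[MAX_RANGES];
static size_t nranges;

/* Incremental .dma_map style: a new region (e.g. hotplugged memory)
 * is appended as one extra range; existing ranges are untouched. */
static void dma_map(unsigned long start, unsigned long last)
{
	ranges[nranges].start = start;
	ranges[nranges].last = last;
	nranges++;
}

/* .set_map style: the whole table is replaced in one call, so the
 * parent must tear down and rebuild everything even for a small delta,
 * unless it diffs the old and new tables internally (the "delegate the
 * batching to the parent driver" idea from the thread). */
static void set_map(const unsigned long (*table)[2], size_t n)
{
	nranges = 0;
	for (size_t i = 0; i < n; i++)
		dma_map(table[i][0], table[i][1]);
}

/* Scenario: boot memory mapped, then one region hotplugged. */
static bool incremental_keeps_old(void)
{
	nranges = 0;
	dma_map(0x0, 0xffff);        /* boot memory */
	dma_map(0x100000, 0x1fffff); /* hotplugged region */
	return nranges == 2 && ranges[0].start == 0x0;
}
```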
Re: [PATCH RFC 1/4] vdpa: introduce .reset_map operation callback
On 8/15/2023 6:55 PM, Jason Wang wrote: On Wed, Aug 16, 2023 at 3:49 AM Si-Wei Liu wrote: On 8/14/2023 7:21 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:46 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index db1b0ea..3a3878d 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -314,6 +314,12 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping (optional) + * Needed for device that using device + * specific DMA translation (on-chip IOMMU) This exposes the device internal to the upper layer which is not optimal. Not sure what does it mean by "device internal", but this op callback just follows existing convention to describe what vdpa parent this API targets. I meant the bus tries to hide the differences among vendors. So it needs to hide on-chip IOMMU stuff to the upper layer. We can expose two dimensional IO mappings models but it looks like over engineering for this issue. More below. * @set_map:Set device memory mapping (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) : : * @dma_map:Map an area of PA to IOVA (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) * and preferring incremental map. : : * @dma_unmap: Unmap an area of IOVA (optional but * must be implemented with dma_map) * Needed for device that using device * specific DMA translation (on-chip IOMMU) * and preferring incremental unmap. Btw, what's the difference between this and a simple set_map(NULL)? I don't think parent drivers support this today - they can accept non-NULL iotlb containing empty map entry, but not a NULL iotlb. The behavior is undefined or it even causes panic when a NULL iotlb is passed in. We can do this simple change if it can work. 
If we go with setting up 1:1 DMA mapping at virtio-vdpa .probe() and tearing it down at .release(), perhaps set_map(NULL) is not sufficient. Further this doesn't work with .dma_map parent drivers. Probably, but I'd remove dma_map as it doesn't have any real users except for the simulator. OK, at a point there was suggestion to get this incremental API extended to support batching to be in par with or even replace .set_map, not sure if it's too soon to conclude. But I'm okay with the removal if need be. The reason why a new op is needed or better is because it allows userspace to tell apart different reset behavior from the older kernel (via the F_IOTLB_PERSIST feature bit in patch 4), while this behavior could vary between parent drivers. I'm ok with a new feature flag, but we need to first seek a way to reuse the existing API. A feature flag is needed anyway. I'm fine with reusing but guess I'd want to converge on the direction first. Thanks, -Siwei Thanks Regards, -Siwei Thanks + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -390,6 +396,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
On 8/15/2023 6:48 PM, Jason Wang wrote: On Wed, Aug 16, 2023 at 6:31 AM Si-Wei Liu wrote: On 8/14/2023 7:25 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:45 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 +++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 62b0a01..75092a7 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -406,6 +406,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v) return ops->resume; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; So this means the IOTLB/IOMMU mappings have already been decoupled from the vdpa reset. Not in the sense of API, it's been coupled since day one in the implementations of every on-chip IOMMU parent driver, namely mlx5_vdpa and vdpa_sim. Because of that, later on the (improper) support for virtio-vdpa, from commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa") and 6c3d329e6486 ("vdpa_sim: get rid of DMA ops"), misused the .reset() op to realize 1:1 mapping, rendering strong coupling between device reset and reset of iotlb mappings. This series tries to rectify that implementation deficiency, while keeping userspace continuing to work with the older kernel behavior. So it should have been noticed by the userspace. Yes, userspace had noticed this on-chip IOMMU discrepancy since day one I suppose. Unfortunately there's already code in userspace with this assumption in mind that proactively tears down and sets up iotlb mapping around vdpa device reset... I guess we can just fix the simulator and mlx5 then we are fine? Only IF we don't care about running new QEMU on older kernels with flawed on-chip iommu behavior around reset. But that's a big IF... 
So what I meant is: Userspace doesn't know whether the vendor specific mappings (set_map) are required or not. And in the implementation of vhost_vdpa, if platform IOMMU is used, the mappings are decoupled from the reset. So if the Qemu works with parents with platform IOMMU it means Qemu can work if we just decouple vendor specific mappings from the parents that uses set_map. I was aware of this, and if you may notice I don't even offer a way backward to retain/emulate the flawed vhost-iotlb reset behavior for older userspace - I consider it more of a bug in .set_map driver implementation of its own rather than what the vhost-vdpa iotlb abstraction wishes to expose to userspace in the first place. If you ever look into QEMU's vhost_vdpa_reset_status() function, you may see memory_listener_unregister() will be called to evict all of the existing iotlb mappings right after vhost_vdpa_reset_device() across device reset, and later on at vhost_vdpa_dev_start(), memory_listener_register() will set up all iotlb mappings again. In an ideal world without this on-chip iommu deficiency QEMU should not have to behave this way - this is what I mentioned earlier that userspace had already noticed the discrepancy and it has to "proactively tear down and set up iotlb mapping around vdpa device reset". Apparently from functionality perspective this trick works completely fine with platform IOMMU, however, it's sub-optimal in the performance perspective. We can't simply fix QEMU by moving this memory_listener_unregister() call out of the reset path unconditionally, as we don't want to break the already-functioning older kernel even though it's suboptimal in performance. Instead, to keep new QEMU continuing to work on top of the existing or older kernels, QEMU has to check this IOTLB_PERSIST feature flag to decide whether it is safe not to bother flushing and setting up iotlb across reset. 
For the platform IOMMU case, vdpa parent driver won't implement either the .set_map or .dma_map op, so it should be covered in the vhost_vdpa_has_persistent_map() check I suppose. Thanks, -Siwei
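The capability check discussed in this thread can be modeled outside the kernel. Below is a sketch only - struct toy_config_ops and stub_op are illustrative stand-ins, not the real struct vdpa_config_ops - showing the rule from the quoted patch: mappings persist across reset either because the parent uses the platform IOMMU (neither .set_map nor .dma_map implemented) or because it implements .reset_map:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct vdpa_config_ops: only the three ops that
 * matter for the persistence check. */
struct toy_config_ops {
	int (*set_map)(void);
	int (*dma_map)(void);
	int (*reset_map)(void);
};

static int stub_op(void) { return 0; }

/* Mirrors the check in the quoted patch: platform-IOMMU parents
 * (no vendor mapping ops) persist by construction; on-chip IOMMU
 * parents persist only if they provide .reset_map. */
static bool has_persistent_map(const struct toy_config_ops *ops)
{
	return (!ops->set_map && !ops->dma_map) || ops->reset_map != NULL;
}

/* Platform-IOMMU parent: mappings handled via the vhost iotlb. */
static const struct toy_config_ops platform_parent = { NULL, NULL, NULL };
/* On-chip IOMMU parent without .reset_map (pre-series behavior). */
static const struct toy_config_ops onchip_legacy = { stub_op, NULL, NULL };
/* On-chip IOMMU parent implementing .reset_map (this series). */
static const struct toy_config_ops onchip_new = { stub_op, NULL, stub_op };
```

A userspace like QEMU would of course consult the advertised feature bit rather than the ops; the ops-based check lives inside vhost-vdpa.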
Re: [PATCH RFC 0/3] vdpa: dedicated descriptor table group
On 8/9/2023 8:50 PM, Jason Wang wrote: On Wed, Aug 9, 2023 at 8:56 PM Si-Wei Liu wrote: The following patchset introduces a dedicated group for the descriptor table to reduce live migration downtime when a passthrough VQ is being switched to a shadow VQ. As this RFC set is to seek early feedback on the uAPI and driver API part, for now there's no associated driver patch consuming the API. As soon as the support is in place on both the hardware device and driver, performance data will be shown using a real hardware device. The target goal of this series is to reduce the SVQ switching overhead to less than 300ms on a ~100GB guest with 2 non-mq vhost-vdpa devices. The plan of the intended driver implementation is to use a dedicated group (specifically, 2 in the table below) to host the descriptor table for all data vqs, different from where buffer addresses are contained (in group 0 as below). cvq does not have to allocate a dedicated group for its descriptor table, so its buffers and descriptor table would always belong to the same group (1). I'm fine with this, but I think we need an implementation in the driver (e.g. the simulator). Yes. FWIW for the sake of saving time and getting this series accepted promptly in the upcoming v6.6 merge window, the driver we're going to support along with this series will be mlx5_vdpa in the formal submission, and simulator support may come later if I get spare cycles. Do you foresee any issue without the simulator change? We will have the mlx5_vdpa driver consuming the API for sure; that's the target of this work and it has to be proven working on a real device first. 
Thanks, -Siwei

Thanks

              | data vq | ctrl vq
==============+=========+========
vq_group      |    0    |    1
vq_desc_group |    2    |    1

---
Si-Wei Liu (3):
  vdpa: introduce dedicated descriptor group for virtqueue
  vhost-vdpa: introduce descriptor group backend feature
  vhost-vdpa: uAPI to get dedicated descriptor group id

 drivers/vhost/vdpa.c             | 27 +++
 include/linux/vdpa.h             | 11 +++
 include/uapi/linux/vhost.h       |  8
 include/uapi/linux/vhost_types.h |  5 +
 4 files changed, 51 insertions(+)

--
1.8.3.1
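The group layout in the cover letter's table can be sketched as a toy model of why the dedicated descriptor group helps. The names and the switch_data_vqs_to_svq() helper below are illustrative, not QEMU or kernel API; the idea is that only the descriptor-table group (2) is rebound to a private ASID when shadow vqs take over, while guest buffer mappings in group 0 stay put - avoiding the full remap that dominates the switching downtime:

```c
#include <assert.h>
#include <stdbool.h>

/* Groups from the cover letter's table: data-vq buffers in group 0,
 * cvq (buffers and descriptors) in group 1, data-vq descriptor
 * tables in the dedicated group 2. */
enum { DATA_BUF_GROUP, CVQ_GROUP, DATA_DESC_GROUP, NGROUPS };

static unsigned int group2asid[NGROUPS]; /* all groups start in ASID 0 */

static void set_group_asid(unsigned int group, unsigned int asid)
{
	group2asid[group] = asid;
}

/* When the data vqs are switched to shadow vqs, only the
 * descriptor-table group moves to a private ASID so SVQ descriptor
 * translations can be installed; the guest buffer mappings in
 * group 0 are never torn down. */
static void switch_data_vqs_to_svq(void)
{
	set_group_asid(DATA_DESC_GROUP, 1);
}

static bool buffer_mappings_untouched(void)
{
	switch_data_vqs_to_svq();
	return group2asid[DATA_BUF_GROUP] == 0 &&
	       group2asid[CVQ_GROUP] == 0 &&
	       group2asid[DATA_DESC_GROUP] == 1;
}
```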
Re: [PATCH RFC 2/3] vhost-vdpa: introduce descriptor group backend feature
On 8/9/2023 8:49 PM, Jason Wang wrote: On Wed, Aug 9, 2023 at 8:56 PM Si-Wei Liu wrote: Userspace knows if the device has dedicated descriptor group or not by checking this feature bit. It's only exposed if the vdpa driver backend implements the .get_vq_desc_group() operation callback. Userspace trying to negotiate this feature when it or the dependent _F_IOTLB_ASID feature hasn't been exposed will result in an error. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + include/uapi/linux/vhost_types.h | 5 + 2 files changed, 22 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index b43e868..f2e5dce 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -389,6 +389,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v) return ops->resume; } +static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return ops->get_vq_desc_group; +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -679,6 +687,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if (copy_from_user(, featurep, sizeof(features))) return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | +BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME))) return -EOPNOTSUPP; @@ -688,6 +697,12 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_RESUME)) && !vhost_vdpa_can_resume(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && + !(features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID))) + return -EINVAL; + if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && +!vhost_vdpa_has_desc_group(v)) + return -EOPNOTSUPP; vhost_set_backend_features(>vdev, features); return 0; } @@ -741,6 +756,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= 
BIT_ULL(VHOST_BACKEND_F_SUSPEND); if (vhost_vdpa_can_resume(v)) features |= BIT_ULL(VHOST_BACKEND_F_RESUME); + if (vhost_vdpa_has_desc_group(v)) + features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); if (copy_to_user(featurep, &features, sizeof(features))) r = -EFAULT; break; diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h index d3aad12a..0856f84 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -181,5 +181,10 @@ struct vhost_vdpa_iova_range { #define VHOST_BACKEND_F_SUSPEND 0x4 /* Device can be resumed */ #define VHOST_BACKEND_F_RESUME 0x5 +/* Device may expose the descriptor table, avail and used ring in a + * different group for ASID binding than the buffers it contains. Nit: s/a different group/different groups/? Yep, I will try to rephrase. Would the below work? "Device may expose virtqueue's descriptor table, avail and used ring in a different group for ASID binding than where buffers it contains reside." Btw, not a native speaker but I think "descriptor" might be confusing since, as you explained above, it contains more than just a descriptor table. Yep. I chose "descriptor" because a packed virtqueue doesn't have "physical" avail and used rings other than the descriptor table, but I am open to a better name. I once thought of "descriptor ring" but that might be too specific to packed virtqueues. Any suggestion? Thanks, -Siwei Thanks + * Requires VHOST_BACKEND_F_IOTLB_ASID. + */ +#define VHOST_BACKEND_F_DESC_ASID 0x6 #endif -- 1.8.3.1
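The two validation checks in the quoted SET_BACKEND_FEATURES hunk can be exercised standalone. A sketch only: check_desc_asid() is a toy function modeling the ordering of the checks (dependency first, -EINVAL; capability second, -EOPNOTSUPP); the DESC_ASID/SUSPEND/RESUME bit values come from the quoted hunks, and 0x3 for IOTLB_ASID is taken from the existing uapi header:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(n) (1ULL << (n))

/* Bit positions per include/uapi/linux/vhost_types.h. */
#define F_IOTLB_ASID 0x3
#define F_DESC_ASID  0x6

/* Models the quoted checks: negotiating DESC_ASID without IOTLB_ASID
 * is a usage error (-EINVAL); negotiating it against a parent that
 * lacks .get_vq_desc_group is unsupported (-EOPNOTSUPP). */
static int check_desc_asid(uint64_t features, bool parent_has_desc_group)
{
	if ((features & BIT_ULL(F_DESC_ASID)) &&
	    !(features & BIT_ULL(F_IOTLB_ASID)))
		return -EINVAL;
	if ((features & BIT_ULL(F_DESC_ASID)) && !parent_has_desc_group)
		return -EOPNOTSUPP;
	return 0;
}
```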
Re: [PATCH RFC 0/3] vdpa: dedicated descriptor table group
On 8/9/2023 7:49 AM, Eugenio Perez Martin wrote: On Wed, Aug 9, 2023 at 2:56 PM Si-Wei Liu wrote: Following patchset introduces dedicated group for descriptor table to reduce live migration downtime when passthrough VQ is being switched to shadow VQ. As this RFC set is to seek early feedback on the uAPI and driver API part, for now there's no associated driver patch consuming the API. As soon as the support is in place on both hardware device and driver, performance data will be shown using a real hardware device. The target goal of this series is to reduce the SVQ switching overhead to less than 300ms on a ~100GB guest with 2 non-mq vhost-vdpa devices. I would expand the cover letter with something in the line of: The reduction in the downtime is thanks to avoiding the full remap in the switching. Sure, will add in the next. The plan of the intended driver implementation is to use a dedicated group (specifically, 2 in the table below) to host the descriptor table for all data vqs, different from where buffer addresses are contained (in group 0 as below). cvq does not have to allocate a dedicated group for its descriptor table, so its buffers and descriptor table would always belong to the same group (1).

              | data vq | ctrl vq
==============+=========+========
vq_group      |    0    |    1
vq_desc_group |    2    |    1

Acked-by: Eugenio Pérez Thanks! -Siwei

---
Si-Wei Liu (3):
  vdpa: introduce dedicated descriptor group for virtqueue
  vhost-vdpa: introduce descriptor group backend feature
  vhost-vdpa: uAPI to get dedicated descriptor group id

 drivers/vhost/vdpa.c             | 27 +++
 include/linux/vdpa.h             | 11 +++
 include/uapi/linux/vhost.h       |  8
 include/uapi/linux/vhost_types.h |  5 +
 4 files changed, 51 insertions(+)

--
1.8.3.1
Re: [PATCH RFC 2/4] vdpa/mlx5: implement .reset_map driver op
On 8/15/2023 1:26 AM, Dragos Tatulea wrote: On Mon, 2023-08-14 at 18:43 -0700, Si-Wei Liu wrote: This patch is based on top of the "vdpa/mlx5: Fixes for ASID handling" series [1]. [1] vdpa/mlx5: Fixes for ASID handling https://lore.kernel.org/virtualization/20230802171231.11001-1-dtatu...@nvidia.com/ Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c | 72 + - drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- 3 files changed, 54 insertions(+), 37 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index b53420e..5c9a25a 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -123,6 +123,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev); void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 5a1971fc..c8d64fc 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -489,21 +489,15 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_mr *mr } } -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return; - prune_iotlb(mvdev); } -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev) { struct mlx5_vdpa_mr *mr = >mr; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return; - if (!mr->initialized) return; @@ -521,8 +515,10 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid) mutex_lock(>mkey_mtx); - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid); - _mlx5_vdpa_destroy_cvq_mr(mvdev, asid); + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) + _mlx5_vdpa_destroy_dvq_mr(mvdev); + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) + _mlx5_vdpa_destroy_cvq_mr(mvdev); mutex_unlock(>mkey_mtx); } @@ -534,25 +530,17 @@ void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev) } static int _mlx5_vdpa_create_cvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return 0; - return dup_iotlb(mvdev, iotlb); } static int _mlx5_vdpa_create_dvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { struct mlx5_vdpa_mr *mr = >mr; int err; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return 0; - if (mr->initialized) return 0; @@ -574,20 +562,18 @@ static int _mlx5_vdpa_create_mr(struct 
mlx5_vdpa_dev *mvdev, { int err; - err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid); - if (err) - return err; - - err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb, asid); - if (err) - goto out_err; + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid); + if (err) + return err; + } + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb); + if (err) + return err; I think you still need the goto here, when CVQ and DVQ fall in same asid and there's a CVQ mr creation error, you are left stuck with the DVQ mr. Yes, you are right, I will fix this in v2. Thank you for spotting this! -Siwei + } return 0; - -out_err: - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid); - - return err; } int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, @@ -601,6 +587,28 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev
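The bug Dragos spotted above - when DVQ and CVQ share an ASID and the CVQ mr creation fails, the already-created DVQ mr is leaked - can be shown with a small control-flow model. This is a sketch only: the create_mr()/destroy helpers below are toy stand-ins with boolean state, not the mlx5_vdpa functions, but the unwind structure matches the fix promised for v2 (restore the goto so the DVQ mr is destroyed on CVQ failure):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state standing in for the DVQ/CVQ memory regions. */
static bool dvq_mr_created;
static bool cvq_mr_created;

static int create_dvq_mr(void) { dvq_mr_created = true; return 0; }
static void destroy_dvq_mr(void) { dvq_mr_created = false; }
static int create_cvq_mr(bool fail)
{
	if (fail)
		return -1;
	cvq_mr_created = true;
	return 0;
}

/* Corrected flow per the review: on CVQ mr failure, unwind the DVQ mr
 * created under the same asid instead of returning directly. */
static int create_mr(bool dvq_in_asid, bool cvq_in_asid, bool cvq_fails)
{
	int err;

	/* Reset toy state so repeated calls are independent. */
	dvq_mr_created = cvq_mr_created = false;

	if (dvq_in_asid) {
		err = create_dvq_mr();
		if (err)
			return err;
	}
	if (cvq_in_asid) {
		err = create_cvq_mr(cvq_fails);
		if (err)
			goto out_err;
	}
	return 0;

out_err:
	if (dvq_in_asid)
		destroy_dvq_mr();
	return err;
}
```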
Re: [PATCH RFC 3/4] vhost-vdpa: should restore 1:1 dma mapping before detaching driver
On 8/14/2023 7:32 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:45 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index b43e868..62b0a01 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,14 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with on-chip IOMMU need to restore iotlb +* to 1:1 identity mapping before vhost-vdpa is going +* to be removed and detached from the device. Give +* them a chance to do so, as this cannot be done +* efficiently via the whole-range unmap call above. +*/ Same question as before, if 1:1 is restored and the userspace doesn't do any IOTLB updating. It looks like a security issue? (Assuming IOVA is PA) This is already flawed independently of this series. It was introduced by the two commits I referenced earlier in the other thread. Today userspace is already able to do so with a device reset without doing any IOTLB update. This series makes it neither worse nor better. FWIW as said earlier, to address this security issue properly we probably should set up the 1:1 DMA mapping in virtio_vdpa_probe() on demand, and tear it down at virtio_vdpa_release_dev(). 
Question is, was virtio-vdpa the only vdpa bus user that needs 1:1 DMA mapping, or is it the other way around, i.e. vhost-vdpa is the only exception among all vdpa bus drivers in not wanting to start with 1:1 by default? Knowing this would help a parent vdpa implementation decide what kind of mapping it should start with upon creation. Regards, -Siwei Thanks + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 1.8.3.1
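The intended semantics from this thread - user mappings persisting across device reset, but the 1:1 identity mapping being restored via .reset_map before vhost-vdpa detaches - can be captured in a toy single-ASID model. A sketch under stated assumptions: MAP_IDENTITY/MAP_USER and the helpers below are illustrative, not driver API, and the persistent-reset behavior models what the series proposes rather than what pre-series mlx5_vdpa/vdpa_sim do:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-ASID model of an on-chip IOMMU parent's mapping state. */
enum map_state { MAP_IDENTITY, MAP_USER };

static enum map_state tlb = MAP_IDENTITY; /* 1:1 by default at creation */

static void set_map_user(void) { tlb = MAP_USER; }

/* .reset_map as called from the quoted vhost_vdpa_remove_as() hunk:
 * restores the 1:1 identity mapping so a later virtio-vdpa (kernel)
 * consumer finds the DMA setup it expects once vhost-vdpa detaches. */
static void reset_map(void) { tlb = MAP_IDENTITY; }

/* Device reset with persistent-iotlb semantics: vq and status state
 * would be wiped here, but the mapping is deliberately left alone. */
static void device_reset(void) { }

static bool user_map_survives_reset(void)
{
	set_map_user();
	device_reset();
	return tlb == MAP_USER;
}

static bool identity_restored_on_detach(void)
{
	set_map_user();
	reset_map();
	return tlb == MAP_IDENTITY;
}
```

This also makes Jason's security question concrete: after reset_map(), the device translates 1:1 again, which is exactly why the thread debates whether that restore belongs at virtio-vdpa probe time instead.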
Re: [PATCH RFC 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
On 8/14/2023 7:25 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:45 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 +++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 62b0a01..75092a7 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -406,6 +406,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v) return ops->resume; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; So this means the IOTLB/IOMMU mappings have already been decoupled from the vdpa reset. Not in the sense of API, it's been coupled since day one in the implementations of every on-chip IOMMU parent driver, namely mlx5_vdpa and vdpa_sim. Because of that, later on the (improper) support for virtio-vdpa, from commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa") and 6c3d329e6486 ("vdpa_sim: get rid of DMA ops"), misused the .reset() op to realize 1:1 mapping, rendering strong coupling between device reset and reset of iotlb mappings. This series tries to rectify that implementation deficiency, while keeping userspace continuing to work with the older kernel behavior. So it should have been noticed by the userspace. Yes, userspace had noticed this on-chip IOMMU discrepancy since day one I suppose. Unfortunately there's already code in userspace with this assumption in mind that proactively tears down and sets up iotlb mapping around vdpa device reset... I guess we can just fix the simulator and mlx5 then we are fine? Only IF we don't care about running new QEMU on older kernels with flawed on-chip iommu behavior around reset. But that's a big IF... 
Regards, -Siwei Thanks
Re: [PATCH RFC 1/4] vdpa: introduce .reset_map operation callback
On 8/14/2023 7:21 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:46 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index db1b0ea..3a3878d 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -314,6 +314,12 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping (optional) + * Needed for device that using device + * specific DMA translation (on-chip IOMMU) This exposes the device internal to the upper layer which is not optimal. Not sure what does it mean by "device internal", but this op callback just follows existing convention to describe what vdpa parent this API targets. * @set_map: Set device memory mapping (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) : : * @dma_map: Map an area of PA to IOVA (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) * and preferring incremental map. : : * @dma_unmap: Unmap an area of IOVA (optional but * must be implemented with dma_map) * Needed for device that using device * specific DMA translation (on-chip IOMMU) * and preferring incremental unmap. Btw, what's the difference between this and a simple set_map(NULL)? I don't think parent drivers support this today - they can accept non-NULL iotlb containing empty map entry, but not a NULL iotlb. The behavior is undefined or it even causes panic when a NULL iotlb is passed in. Further this doesn't work with .dma_map parent drivers. The reason why a new op is needed or better is because it allows userspace to tell apart different reset behavior from the older kernel (via the F_IOTLB_PERSIST feature bit in patch 4), while this behavior could vary between parent drivers. 
Regards, -Siwei Thanks + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -390,6 +396,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1