Re: [RFC] vdpa/mlx5: preserve CVQ vringh index
Steve, I think this is a loose end that I myself am not sure is worth fixing; copying Eugenio for his awareness. The reason is that when the CVQ is in place it always has to cope with device state saving and restoration using the shadowed virtqueue for a lot of cases, not just migration, and that's why SVQ is always enabled for the CVQ in the latest QEMU. But I agree this is nice to have; there could be value in supporting vDPA VM instances without depending solely on SVQ, e.g. for use cases like memory-encrypted VMs. Thanks for posting the fix, and let's see what other people think about it.

-Siwei

On 10/26/2023 1:13 PM, Steven Sistare wrote:

On 10/26/2023 4:11 PM, Steve Sistare wrote:

mlx5_vdpa does not preserve userland's view of vring base for the control queue in the following sequence:

ioctl VHOST_SET_VRING_BASE
ioctl VHOST_VDPA_SET_STATUS VIRTIO_CONFIG_S_DRIVER_OK
  mlx5_vdpa_set_status()
    setup_cvq_vring()
      vringh_init_iotlb()
        vringh_init_kern()
          vrh->last_avail_idx = 0;
ioctl VHOST_GET_VRING_BASE

To fix, restore the value of cvq->vring.last_avail_idx after calling vringh_init_iotlb.

Signed-off-by: Steve Sistare

This is a resend; I forgot to cc myself the first time.

I don't know if we expect vring_base to be preserved after reset, because the uapi comments say nothing about it. mlx5 *does* preserve the base across reset for the data vq's, but perhaps that is an accident of the implementation. I posted this patch to perhaps avoid future problems.

The bug(?) bit me while developing with an older version of qemu, and I can work around it in qemu code. Further, the latest version of qemu always enables svq for the cvq and is not affected by this behavior AFAICT.

- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] vhost-vdpa: fix use-after-free in _compat_vdpa_reset
On 10/25/2023 11:55 PM, Si-Wei Liu wrote:

On 10/25/2023 10:26 PM, Michael S. Tsirkin wrote:

On Wed, Oct 25, 2023 at 04:13:14PM -0700, Si-Wei Liu wrote:

When the vhost-vdpa device is being closed, vhost_vdpa_cleanup() doesn't clean up the vqs pointer after freeing it. This could lead to a use-after-free when _compat_vdpa_reset() tries to access the vqs that were already freed. The fix is to set the vqs pointer to NULL at the end of vhost_vdpa_cleanup() after it is freed, which is guarded by the atomic opened state.

BUG: unable to handle page fault for address: 0001005b4af4
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 16a80a067 P4D 0
Oops: [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 40387 Comm: qemu-kvm Not tainted 6.6.0-rc7+ #3
Hardware name: Dell Inc. PowerEdge R750/0PJ80M, BIOS 1.8.2 09/14/2022
RIP: 0010:_compat_vdpa_reset.isra.0+0x27/0xb0 [vhost_vdpa]
Code: 90 90 90 0f 1f 44 00 00 41 55 4c 8d ae 08 03 00 00 41 54 55 48 89 f5 53 4c 8b a6 00 03 00 00 48 85 ff 74 49 48 8b 07 4c 89 ef <48> 8b 80 88 45 00 00 48 c1 e8 08 48 83 f0 01 89 c3 e8 73 5e 9b dc
RSP: 0018:ff73a85762073ba0 EFLAGS: 00010286
RAX: 0001005b056c RBX: ff32b13ca6994c68 RCX: 0002
RDX: 0001 RSI: ff32b13c07559000 RDI: ff32b13c07559308
RBP: ff32b13c07559000 R08: R09: ff32b12ca497c0f0
R10: ff73a85762073c58 R11: 000c106f9de3 R12: ff32b12c95b1d050
R13: ff32b13c07559308 R14: ff32b12d0ddc5100 R15: 8002
FS: 7fec5b8cbf80() GS:ff32b13bbfc8() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 0001005b4af4 CR3: 00015644a003 CR4: 00773ee0
PKRU: 5554
Call Trace:
 ? __die+0x20/0x70
 ? page_fault_oops+0x76/0x170
 ? exc_page_fault+0x65/0x150
 ? asm_exc_page_fault+0x22/0x30
 ? _compat_vdpa_reset.isra.0+0x27/0xb0 [vhost_vdpa]
 vhost_vdpa_open+0x57/0x280 [vhost_vdpa]
 ? __pfx_chrdev_open+0x10/0x10
 chrdev_open+0xc6/0x260
 ? __pfx_chrdev_open+0x10/0x10
 do_dentry_open+0x16e/0x530
 do_open+0x21c/0x400
 path_openat+0x111/0x290
 do_filp_open+0xb2/0x160
 ? __check_object_size.part.0+0x5e/0x140
 do_sys_openat2+0x96/0xd0
 __x64_sys_openat+0x53/0xa0
 do_syscall_64+0x59/0x90
 ? syscall_exit_to_user_mode+0x22/0x40
 ? do_syscall_64+0x69/0x90
 ? syscall_exit_to_user_mode+0x22/0x40
 ? do_syscall_64+0x69/0x90
 ? do_syscall_64+0x69/0x90
 ? syscall_exit_to_user_mode+0x22/0x40
 ? do_syscall_64+0x69/0x90
 ? exc_page_fault+0x65/0x150
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Fixes: 10cbf8dfaf93 ("vhost-vdpa: clean iotlb map during reset for older userspace")
Fixes: ac7e98c73c05 ("vhost-vdpa: fix NULL pointer deref in _compat_vdpa_reset")

So these two are all in next, correct? I really do not like how 10cbf8dfaf936e3ef1f5d7fdc6e9dada268ba6bb introduced a regression and then apparently we keep fixing things up?

Sorry, my bad. The latest one should be all of it.

Can I squash these 3 commits?

Sure. Or if you want me to send a v5 with all 3 commits squashed in, I can do that for sure.

Saw you squashed it with the 2 fixups in place, thank you! Sent a v5 anyway, just in case you need a fresh series.

Thanks,
-Siwei

Reported-by: Lei Yang
Closes: https://lore.kernel.org/all/CAPpAL=yhdqn1aztecn3mps8o4m+bl_hvy02fdpihn7dwd91...@mail.gmail.com/
Signed-off-by: Si-Wei Liu
---
 drivers/vhost/vdpa.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 9a2343c45df0..30df5c58db73 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1355,6 +1355,7 @@ static void vhost_vdpa_cleanup(struct vhost_vdpa *v)
 	vhost_vdpa_free_domain(v);
 	vhost_dev_cleanup(&v->vdev);
 	kfree(v->vdev.vqs);
+	v->vdev.vqs = NULL;
 }

 static int vhost_vdpa_open(struct inode *inode, struct file *filep)
-- 
2.39.3
[PATCH v5 7/7] vdpa_sim: implement .reset_map support
In order to reduce excessive memory mapping cost in live migration and VM reboot, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the iotlb on the given ASID and recreate the 1:1 passthrough/identity mapping. To be consistent, the mapping on device creation is initialized to passthrough/identity with PA 1:1 mapped as IOVA. With this the device .reset op doesn't have to maintain and clean up memory mappings by itself.

Additionally, implement .compat_reset to cater for older userspace, which may wish to see the mapping cleared during reset.

Signed-off-by: Si-Wei Liu
Tested-by: Stefano Garzarella
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 52 ++--
 1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 76d41058add9..be2925d0d283 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -139,7 +139,7 @@ static void vdpasim_vq_reset(struct vdpasim *vdpasim,
 	vq->vring.notify = NULL;
 }

-static void vdpasim_do_reset(struct vdpasim *vdpasim)
+static void vdpasim_do_reset(struct vdpasim *vdpasim, u32 flags)
 {
 	int i;

@@ -151,11 +151,13 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim)
 			 &vdpasim->iommu_lock);
 	}

-	for (i = 0; i < vdpasim->dev_attr.nas; i++) {
-		vhost_iotlb_reset(&vdpasim->iommu[i]);
-		vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX,
-				      0, VHOST_MAP_RW);
-		vdpasim->iommu_pt[i] = true;
+	if (flags & VDPA_RESET_F_CLEAN_MAP) {
+		for (i = 0; i < vdpasim->dev_attr.nas; i++) {
+			vhost_iotlb_reset(&vdpasim->iommu[i]);
+			vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX,
+					      0, VHOST_MAP_RW);
+			vdpasim->iommu_pt[i] = true;
+		}
 	}

 	vdpasim->running = true;
@@ -259,8 +261,12 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr,
 	if (!vdpasim->iommu_pt)
 		goto err_iommu;

-	for (i = 0; i < vdpasim->dev_attr.nas; i++)
+	for (i = 0; i < vdpasim->dev_attr.nas; i++) {
 		vhost_iotlb_init(&vdpasim->iommu[i], max_iotlb_entries, 0);
+		vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, 0,
+				      VHOST_MAP_RW);
+		vdpasim->iommu_pt[i] = true;
+	}

 	for (i = 0; i < dev_attr->nvqs; i++)
 		vringh_set_iotlb(&vdpasim->vqs[i].vring, &vdpasim->iommu[0],
@@ -480,18 +486,23 @@ static void vdpasim_set_status(struct vdpa_device *vdpa, u8 status)
 	mutex_unlock(&vdpasim->mutex);
 }

-static int vdpasim_reset(struct vdpa_device *vdpa)
+static int vdpasim_compat_reset(struct vdpa_device *vdpa, u32 flags)
 {
 	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);

 	mutex_lock(&vdpasim->mutex);
 	vdpasim->status = 0;
-	vdpasim_do_reset(vdpasim);
+	vdpasim_do_reset(vdpasim, flags);
 	mutex_unlock(&vdpasim->mutex);

 	return 0;
 }

+static int vdpasim_reset(struct vdpa_device *vdpa)
+{
+	return vdpasim_compat_reset(vdpa, 0);
+}
+
 static int vdpasim_suspend(struct vdpa_device *vdpa)
 {
 	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
@@ -637,6 +648,25 @@ static int vdpasim_set_map(struct vdpa_device *vdpa, unsigned int asid,
 	return ret;
 }

+static int vdpasim_reset_map(struct vdpa_device *vdpa, unsigned int asid)
+{
+	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+
+	if (asid >= vdpasim->dev_attr.nas)
+		return -EINVAL;
+
+	spin_lock(&vdpasim->iommu_lock);
+	if (vdpasim->iommu_pt[asid])
+		goto out;
+	vhost_iotlb_reset(&vdpasim->iommu[asid]);
+	vhost_iotlb_add_range(&vdpasim->iommu[asid], 0, ULONG_MAX,
+			      0, VHOST_MAP_RW);
+	vdpasim->iommu_pt[asid] = true;
+out:
+	spin_unlock(&vdpasim->iommu_lock);
+	return 0;
+}
+
 static int vdpasim_bind_mm(struct vdpa_device *vdpa, struct mm_struct *mm)
 {
 	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
@@ -749,6 +779,7 @@ static const struct vdpa_config_ops vdpasim_config_ops = {
 	.get_status             = vdpasim_get_status,
 	.set_status             = vdpasim_set_status,
 	.reset                  = vdpasim_reset,
+	.compat_reset           = vdpasim_compat_reset,
 	.suspend                = vdpasim_suspend,
 	.resume                 = vdpasim_resume,
 	.get_config_size        = vdpasim_get_config_size,
@@ -759,6 +790,7 @@ static const struct vdpa_config_ops vdpasim_config_ops = {
 	.set_group_asid         = vdpasim
[PATCH v5 5/7] vhost-vdpa: clean iotlb map during reset for older userspace
Using the .compat_reset op from the previous patch, the buggy .reset behaviour can be kept as-is for older userspace apps which don't ack the IOTLB_PERSIST backend feature. As this compatibility quirk is limited to those drivers that used to be buggy in the past, it won't change the behaviour or affect the ABI on setups with an API-compliant driver. The separation of .compat_reset from the regular .reset allows vhost-vdpa to know which drivers had broken behaviour before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver.

Signed-off-by: Si-Wei Liu
Tested-by: Dragos Tatulea
Tested-by: Lei Yang
---
 drivers/vhost/vdpa.c         | 20 
 drivers/virtio/virtio_vdpa.c |  2 +-
 include/linux/vdpa.h         |  7 +--
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index acc7c74ba7d6..30df5c58db73 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -227,13 +227,24 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid)
 	irq_bypass_unregister_producer(&vq->call_ctx.producer);
 }

-static int vhost_vdpa_reset(struct vhost_vdpa *v)
+static int _compat_vdpa_reset(struct vhost_vdpa *v)
 {
 	struct vdpa_device *vdpa = v->vdpa;
+	u32 flags = 0;

-	v->in_batch = 0;
+	if (v->vdev.vqs) {
+		flags |= !vhost_backend_has_feature(v->vdev.vqs[0],
+						    VHOST_BACKEND_F_IOTLB_PERSIST) ?
+			 VDPA_RESET_F_CLEAN_MAP : 0;
+	}
+
+	return vdpa_reset(vdpa, flags);
+}

-	return vdpa_reset(vdpa);
+static int vhost_vdpa_reset(struct vhost_vdpa *v)
+{
+	v->in_batch = 0;
+	return _compat_vdpa_reset(v);
 }

 static long vhost_vdpa_bind_mm(struct vhost_vdpa *v)
@@ -312,7 +323,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
 			vhost_vdpa_unsetup_vq_irq(v, i);

 	if (status == 0) {
-		ret = vdpa_reset(vdpa);
+		ret = _compat_vdpa_reset(v);
 		if (ret)
 			return ret;
 	} else
@@ -1344,6 +1355,7 @@ static void vhost_vdpa_cleanup(struct vhost_vdpa *v)
 	vhost_vdpa_free_domain(v);
 	vhost_dev_cleanup(&v->vdev);
 	kfree(v->vdev.vqs);
+	v->vdev.vqs = NULL;
 }

 static int vhost_vdpa_open(struct inode *inode, struct file *filep)
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 06ce6d8c2e00..8d63e5923d24 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -100,7 +100,7 @@ static void virtio_vdpa_reset(struct virtio_device *vdev)
 {
 	struct vdpa_device *vdpa = vd_get_vdpa(vdev);

-	vdpa_reset(vdpa);
+	vdpa_reset(vdpa, 0);
 }

 static bool virtio_vdpa_notify(struct virtqueue *vq)
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 6b8cbf75712d..db15ac07f8a6 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -519,14 +519,17 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev)
 	return vdev->dma_dev;
 }

-static inline int vdpa_reset(struct vdpa_device *vdev)
+static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags)
 {
 	const struct vdpa_config_ops *ops = vdev->config;
 	int ret;

 	down_write(&vdev->cf_lock);
 	vdev->features_valid = false;
-	ret = ops->reset(vdev);
+	if (ops->compat_reset && flags)
+		ret = ops->compat_reset(vdev, flags);
+	else
+		ret = ops->reset(vdev);
 	up_write(&vdev->cf_lock);
 	return ret;
 }
-- 
2.39.3
[PATCH v5 3/7] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to distinguish whether the vhost-vdpa iotlb in the kernel can be trusted to persist IOTLB mappings across vDPA reset. Without it, userspace has no way to tell whether it's running on an older kernel, which could silently drop all iotlb mappings across vDPA reset, especially with a broken parent driver implementation for the .reset driver op. A broken driver may incorrectly drop all mappings of its own as part of .reset, which inadvertently ends up with corrupted mapping state between vhost-vdpa userspace and the kernel. As a workaround, to make the mapping behaviour predictable across reset, userspace has to proactively remove all mappings before vDPA reset, and then restore all the mappings afterwards. This workaround is done unconditionally on top of all parent drivers today, due to the parent driver implementation issue and no means to differentiate. This workaround has been utilized in QEMU since day one, when the corresponding vhost-vdpa userspace backend came to the world.

There are 3 cases where the backend may claim this feature bit:

- parent device that has to work with the platform IOMMU
- parent device with on-chip IOMMU that has the expected .reset_map support in the driver
- parent device with a vendor specific IOMMU implementation that already has persistent IOTLB mapping, which has to specifically declare this backend feature

The reason why .reset_map is one of the preconditions for persistent iotlb is that without it, vhost-vdpa can't switch the iotlb back to the initial state later on, especially for the on-chip IOMMU case, which starts with an identity mapping at device creation. virtio-vdpa requires the on-chip IOMMU to perform 1:1 passthrough translation from PA to IOVA as-is to begin with, and .reset_map is the only means to turn the iotlb back to identity mapping mode after vhost-vdpa is gone.
The difference in behavior did not matter, as QEMU unmaps all the memory when unregistering the memory listener at vhost_vdpa_dev_start(started = false), but the backend acknowledging this feature flag allows QEMU to make sure it is safe to skip this unmap & map in the case of a vhost stop & start cycle. In that sense, this feature flag is actually a signal for userspace to know that the driver bug has been solved. Not offering it indicates that userspace cannot trust the kernel to retain the maps.

Signed-off-by: Si-Wei Liu
Acked-by: Eugenio Pérez
---
 drivers/vhost/vdpa.c             | 15 +++
 include/uapi/linux/vhost_types.h |  2 ++
 2 files changed, 17 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index c6bfe9bdde42..acc7c74ba7d6 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -439,6 +439,15 @@ static u64 vhost_vdpa_get_backend_features(const struct vhost_vdpa *v)
 	return ops->get_backend_features(vdpa);
 }

+static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v)
+{
+	struct vdpa_device *vdpa = v->vdpa;
+	const struct vdpa_config_ops *ops = vdpa->config;
+
+	return (!ops->set_map && !ops->dma_map) || ops->reset_map ||
+	       vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST);
+}
+
 static long vhost_vdpa_set_features(struct vhost_vdpa *v, u64 __user *featurep)
 {
 	struct vdpa_device *vdpa = v->vdpa;
@@ -726,6 +735,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 			return -EFAULT;
 		if (features & ~(VHOST_VDPA_BACKEND_FEATURES |
 				 BIT_ULL(VHOST_BACKEND_F_DESC_ASID) |
+				 BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) |
 				 BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
 				 BIT_ULL(VHOST_BACKEND_F_RESUME) |
 				 BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK)))
@@ -742,6 +752,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 		if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) &&
 		    !vhost_vdpa_has_desc_group(v))
 			return -EOPNOTSUPP;
+		if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) &&
+		    !vhost_vdpa_has_persistent_map(v))
+			return -EOPNOTSUPP;
 		vhost_set_backend_features(&v->vdev, features);
 		return 0;
 	}
@@ -797,6 +810,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 			features |= BIT_ULL(VHOST_BACKEND_F_RESUME);
 		if (vhost_vdpa_has_desc_group(v))
 			features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID);
+		if (vhost_vdpa_has_persistent_map(v))
+			features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST);
 		features |= vhost_vdpa_get_backend_features(v);
 		if (copy_to_user(featurep, &features, sizeof(features
[PATCH v5 6/7] vdpa/mlx5: implement .reset_map driver op
Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with a preallocated 1:1 DMA MR at device creation time. This 1:1 DMA MR is implicitly destroyed when the first .set_map call is invoked, at which point callers like vhost-vdpa start to set up custom mappings. When the .reset callback is invoked, the custom mappings are cleared and the 1:1 DMA MR is re-created.

In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR (including the cvq mapping) on the given ASID and recreate the initial DMA mapping. That way, the device .reset op runs free from having to maintain and clean up memory mappings by itself.

Additionally, implement .compat_reset to cater for older userspace, which may wish to see the mapping cleared during reset.

Co-developed-by: Dragos Tatulea
Signed-off-by: Dragos Tatulea
Signed-off-by: Si-Wei Liu
---
 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
 drivers/vdpa/mlx5/core/mr.c        | 17 +
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 27 ---
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index db988ced5a5d..84547d998bcf 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev,
 			       struct vhost_iotlb *iotlb,
 			       unsigned int asid);
 int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev);
+int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid);

 #define mlx5_vdpa_warn(__dev, format, ...)                                        \
 	dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \
diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
index 66530e28f327..2197c46e563a 100644
--- a/drivers/vdpa/mlx5/core/mr.c
+++ b/drivers/vdpa/mlx5/core/mr.c
@@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev)

 	return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0);
 }
+
+int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid)
+{
+	if (asid >= MLX5_VDPA_NUM_AS)
+		return -EINVAL;
+
+	mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]);
+
+	if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
+		if (mlx5_vdpa_create_dma_mr(mvdev))
+			mlx5_vdpa_warn(mvdev, "create DMA MR failed\n");
+	} else {
+		mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid);
+	}
+
+	return 0;
+}
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index f4516a2d5bb0..12ac3397f39b 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -2876,7 +2876,7 @@ static void init_group_to_asid_map(struct mlx5_vdpa_dev *mvdev)
 		mvdev->group2asid[i] = 0;
 }

-static int mlx5_vdpa_reset(struct vdpa_device *vdev)
+static int mlx5_vdpa_compat_reset(struct vdpa_device *vdev, u32 flags)
 {
 	struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
 	struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
@@ -2888,7 +2888,8 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
 	unregister_link_notifier(ndev);
 	teardown_driver(ndev);
 	clear_vqs_ready(ndev);
-	mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
+	if (flags & VDPA_RESET_F_CLEAN_MAP)
+		mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
 	ndev->mvdev.status = 0;
 	ndev->mvdev.suspended = false;
 	ndev->cur_num_vqs = 0;
@@ -2899,7 +2900,8 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
 	init_group_to_asid_map(mvdev);
 	++mvdev->generation;

-	if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
+	if ((flags & VDPA_RESET_F_CLEAN_MAP) &&
+	    MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
 		if (mlx5_vdpa_create_dma_mr(mvdev))
 			mlx5_vdpa_warn(mvdev, "create MR failed\n");
 	}
@@ -2908,6 +2910,11 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
 	return 0;
 }

+static int mlx5_vdpa_reset(struct vdpa_device *vdev)
+{
+	return mlx5_vdpa_compat_reset(vdev, 0);
+}
+
 static size_t mlx5_vdpa_get_config_size(struct vdpa_device *vdev)
 {
 	return sizeof(struct virtio_net_config);
@@ -2987,6 +2994,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid,
 	return err;
 }

+static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsi
[PATCH v5 4/7] vdpa: introduce .compat_reset operation callback
Some device specific IOMMU parent drivers have long standing bogus behaviour that mistakenly cleans up the maps during .reset. By definition, this is a violation of the on-chip IOMMU ops (i.e. .set_map, or .dma_map & .dma_unmap) in those offending drivers, as the removal of internal maps is completely agnostic to the upper layer, causing an inconsistent view between userspace and the kernel. Some userspace apps like QEMU work around this brokenness by proactively removing and adding back all the maps around vdpa device reset, but such a workaround actually penalizes other well-behaved driver setups, where vdpa reset always comes with the associated mapping cost, especially for kernel vDPA devices (use_va=false) that have a high cost on pinning. It's imperative to rectify this behaviour and remove the problematic code from all those non-compliant parent drivers.

However, we cannot unconditionally remove the bogus map-cleaning code from the buggy .reset implementations, as there might exist userspace apps that already rely on the behaviour on some setups. Introduce a .compat_reset driver op to keep compatibility with older userspace. New and well behaved parent drivers should not bother to implement such an op; only those drivers that are doing, or used to do, non-compliant map-cleaning reset will have to.

Signed-off-by: Si-Wei Liu
---
 include/linux/vdpa.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 26ae6ae1eac3..6b8cbf75712d 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -252,6 +252,17 @@ struct vdpa_map_file {
  * @reset:			Reset device
  *				@vdev: vdpa device
  *				Returns integer: success (0) or error (< 0)
+ * @compat_reset:		Reset device with compatibility quirks to
+ *				accommodate older userspace. Only needed by
+ *				parent driver which used to have bogus reset
+ *				behaviour, and has to maintain such behaviour
+ *				for compatibility with older userspace.
+ *				Historically compliant driver only has to
+ *				implement .reset, historically non-compliant
+ *				driver should implement both.
+ *				@vdev: vdpa device
+ *				@flags: compatibility quirks for reset
+ *				Returns integer: success (0) or error (< 0)
  * @suspend:			Suspend the device (optional)
  *				@vdev: vdpa device
  *				Returns integer: success (0) or error (< 0)
@@ -393,6 +404,8 @@ struct vdpa_config_ops {
 	u8 (*get_status)(struct vdpa_device *vdev);
 	void (*set_status)(struct vdpa_device *vdev, u8 status);
 	int (*reset)(struct vdpa_device *vdev);
+	int (*compat_reset)(struct vdpa_device *vdev, u32 flags);
+#define VDPA_RESET_F_CLEAN_MAP 1
 	int (*suspend)(struct vdpa_device *vdev);
 	int (*resume)(struct vdpa_device *vdev);
 	size_t (*get_config_size)(struct vdpa_device *vdev);
-- 
2.39.3
[PATCH v5 1/7] vdpa: introduce .reset_map operation callback
Some device specific IOMMU parent drivers have long standing bogus behavior that mistakenly cleans up the maps during .reset. By definition, this is a violation of the on-chip IOMMU ops (i.e. .set_map, or .dma_map & .dma_unmap) in those offending drivers, as the removal of internal maps is completely agnostic to the upper layer, causing an inconsistent view between userspace and the kernel. Some userspace apps like QEMU work around this brokenness by proactively removing and adding back all the maps around vdpa device reset, but such a workaround actually penalizes other well-behaved driver setups, where vdpa reset always comes with the associated mapping cost, especially for kernel vDPA devices (use_va=false) that have a high cost on pinning. It's imperative to rectify this behavior and remove the problematic code from all those non-compliant parent drivers.

The reason why a separate .reset_map op is introduced is that this allows a simple on-chip IOMMU model without exposing too much device implementation detail to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exist other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read: virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpected way, these on-chip IOMMU devices can start with 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch the iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model.

The .reset_map is not a MUST for every parent that implements the .dma_map or .set_map API, because a device may work with DMA ops directly by implementing its own way to manipulate system memory mappings, and so doesn't have to use .reset_map to achieve a simple IOMMU device model for 1:1 passthrough mapping.

Signed-off-by: Si-Wei Liu
Acked-by: Eugenio Pérez
Acked-by: Jason Wang
---
 include/linux/vdpa.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index d376309b99cf..26ae6ae1eac3 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -327,6 +327,15 @@ struct vdpa_map_file {
  *				@iova: iova to be unmapped
  *				@size: size of the area
  *				Returns integer: success (0) or error (< 0)
+ * @reset_map:			Reset device memory mapping to the default
+ *				state (optional)
+ *				Needed for devices that are using device
+ *				specific DMA translation and prefer mapping
+ *				to be decoupled from the virtio life cycle,
+ *				i.e. device .reset op does not reset mapping
+ *				@vdev: vdpa device
+ *				@asid: address space identifier
+ *				Returns integer: success (0) or error (< 0)
  * @get_vq_dma_dev:		Get the dma device for a specific
  *				virtqueue (optional)
  *				@vdev: vdpa device
@@ -405,6 +414,7 @@ struct vdpa_config_ops {
 			   u64 iova, u64 size, u64 pa, u32 perm, void *opaque);
 	int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
 			 u64 iova, u64 size);
+	int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
 	int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
 			      unsigned int asid);
 	struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx);
-- 
2.39.3
[PATCH v5 2/7] vhost-vdpa: reset vendor specific mapping to initial state in .release
Devices with an on-chip IOMMU or a vendor specific IOTLB implementation may need to restore the iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to not work with DMA ops and to maintain a simple IOMMU model with .reset_map. In particular, device reset should not cause the mapping to go away on such an IOTLB model, so persistent mapping is implied across reset. Before the userspace process using vhost-vdpa is gone, give it a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup().

Signed-off-by: Si-Wei Liu
Acked-by: Eugenio Pérez
---
 drivers/vhost/vdpa.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 851535f57b95..c6bfe9bdde42 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
 	return vhost_vdpa_alloc_as(v, asid);
 }

+static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
+{
+	struct vdpa_device *vdpa = v->vdpa;
+	const struct vdpa_config_ops *ops = vdpa->config;
+
+	if (ops->reset_map)
+		ops->reset_map(vdpa, asid);
+}
+
 static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
 {
 	struct vhost_vdpa_as *as = asid_to_as(v, asid);
@@ -140,6 +149,14 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)

 	hlist_del(&as->hash_link);
 	vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
+	/*
+	 * Devices with vendor specific IOMMU may need to restore
+	 * iotlb to the initial or default state, which cannot be
+	 * cleaned up in the all range unmap call above. Give them
+	 * a chance to clean up or reset the map to the desired
+	 * state.
+	 */
+	vhost_vdpa_reset_map(v, asid);
 	kfree(as);

 	return 0;
-- 
2.39.3
[PATCH v5 0/7] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce the needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore the 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For context: those on-chip IOMMU parent devices create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked.

This patchset is rebased on top of the latest vhost tree.

[1] Reducing vdpa migration downtime because of memory pin / maps
https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html

---
v5:
- Squashed two fixups to the clean map patch

v4:
- Rework compatibility using new .compat_reset driver op

v3:
- add .reset_map support to vdpa_sim
- introduce module parameter to provide bug-for-bug compatibility with older userspace

v2:
- improved commit message to clarify the intended scope of .reset_map API
- improved commit messages to clarify no breakage on older userspace

v1:
- rewrote commit messages to include more detailed description and background
- reword to vendor specific IOMMU implementation from on-chip IOMMU
- include parent device backend features to persistent iotlb precondition
- reimplement mlx5_vdpa patch on top of descriptor group series

RFC v3:
- fix missing return due to merge error in patch #4

RFC v2:
- rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series:
  https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/

---
Si-Wei Liu (7):
  vdpa: introduce .reset_map operation callback
  vhost-vdpa: reset vendor specific mapping to initial state in .release
  vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
  vdpa: introduce .compat_reset operation callback
  vhost-vdpa: clean iotlb map during reset for older userspace
  vdpa/mlx5: implement .reset_map driver op
  vdpa_sim: implement .reset_map support

 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
 drivers/vdpa/mlx5/core/mr.c        | 17 ++
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 27 ++--
 drivers/vdpa/vdpa_sim/vdpa_sim.c   | 52 --
 drivers/vhost/vdpa.c               | 52 +++---
 drivers/virtio/virtio_vdpa.c       |  2 +-
 include/linux/vdpa.h               | 30 +++--
 include/uapi/linux/vhost_types.h   |  2 ++
 8 files changed, 164 insertions(+), 19 deletions(-)
-- 
2.39.3
Re: [PATCH] vhost-vdpa: fix use-after-free in _compat_vdpa_reset
On 10/25/2023 10:26 PM, Michael S. Tsirkin wrote: On Wed, Oct 25, 2023 at 04:13:14PM -0700, Si-Wei Liu wrote: When the vhost-vdpa device is being closed, vhost_vdpa_cleanup() doesn't clean up the vqs pointer after free. This could lead to use-after-free when _compat_vdpa_reset() tries to access the vqs that are freed already. Fix is to set the vqs pointer to NULL at the end of vhost_vdpa_cleanup() after it is freed, which is guarded by atomic opened state. BUG: unable to handle page fault for address: 0001005b4af4 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 16a80a067 P4D 0 Oops: [#1] PREEMPT SMP NOPTI CPU: 4 PID: 40387 Comm: qemu-kvm Not tainted 6.6.0-rc7+ #3 Hardware name: Dell Inc. PowerEdge R750/0PJ80M, BIOS 1.8.2 09/14/2022 RIP: 0010:_compat_vdpa_reset.isra.0+0x27/0xb0 [vhost_vdpa] Code: 90 90 90 0f 1f 44 00 00 41 55 4c 8d ae 08 03 00 00 41 54 55 48 89 f5 53 4c 8b a6 00 03 00 00 48 85 ff 74 49 48 8b 07 4c 89 ef <48> 8b 80 88 45 00 00 48 c1 e8 08 48 83 f0 01 89 c3 e8 73 5e 9b dc RSP: 0018:ff73a85762073ba0 EFLAGS: 00010286 RAX: 0001005b056c RBX: ff32b13ca6994c68 RCX: 0002 RDX: 0001 RSI: ff32b13c07559000 RDI: ff32b13c07559308 RBP: ff32b13c07559000 R08: R09: ff32b12ca497c0f0 R10: ff73a85762073c58 R11: 000c106f9de3 R12: ff32b12c95b1d050 R13: ff32b13c07559308 R14: ff32b12d0ddc5100 R15: 8002 FS: 7fec5b8cbf80() GS:ff32b13bbfc8() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0001005b4af4 CR3: 00015644a003 CR4: 00773ee0 PKRU: 5554 Call Trace: ? __die+0x20/0x70 ? page_fault_oops+0x76/0x170 ? exc_page_fault+0x65/0x150 ? asm_exc_page_fault+0x22/0x30 ? _compat_vdpa_reset.isra.0+0x27/0xb0 [vhost_vdpa] vhost_vdpa_open+0x57/0x280 [vhost_vdpa] ? __pfx_chrdev_open+0x10/0x10 chrdev_open+0xc6/0x260 ? __pfx_chrdev_open+0x10/0x10 do_dentry_open+0x16e/0x530 do_open+0x21c/0x400 path_openat+0x111/0x290 do_filp_open+0xb2/0x160 ? __check_object_size.part.0+0x5e/0x140 do_sys_openat2+0x96/0xd0 __x64_sys_openat+0x53/0xa0 do_syscall_64+0x59/0x90 ? 
syscall_exit_to_user_mode+0x22/0x40 ? do_syscall_64+0x69/0x90 ? syscall_exit_to_user_mode+0x22/0x40 ? do_syscall_64+0x69/0x90 ? do_syscall_64+0x69/0x90 ? syscall_exit_to_user_mode+0x22/0x40 ? do_syscall_64+0x69/0x90 ? exc_page_fault+0x65/0x150 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Fixes: 10cbf8dfaf93 ("vhost-vdpa: clean iotlb map during reset for older userspace") Fixes: ac7e98c73c05 ("vhost-vdpa: fix NULL pointer deref in _compat_vdpa_reset") So these two are all in next correct? I really do not like it how 10cbf8dfaf936e3ef1f5d7fdc6e9dada268ba6bb introduced a regression and then apparently we keep fixing things up? Sorry my bad. The latest one should be all of it. Can I squash these 3 commits? Sure. Or if you want me to send a v5 with all 3 commits squashed in, I can do for sure. Thanks, -Siwei Reported-by: Lei Yang Closes: https://lore.kernel.org/all/CAPpAL=yhdqn1aztecn3mps8o4m+bl_hvy02fdpihn7dwd91...@mail.gmail.com/ Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 9a2343c45df0..30df5c58db73 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -1355,6 +1355,7 @@ static void vhost_vdpa_cleanup(struct vhost_vdpa *v) vhost_vdpa_free_domain(v); vhost_dev_cleanup(>vdev); kfree(v->vdev.vqs); + v->vdev.vqs = NULL; } static int vhost_vdpa_open(struct inode *inode, struct file *filep) -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 0/7] vdpa: decouple reset of iotlb mapping from device reset
Hi Yang Lei, Thanks for testing my patches and reporting! As for the issue, could you please try what I posted in: https://lore.kernel.org/virtualization/1698275594-19204-1-git-send-email-si-wei@oracle.com/ and let me know how it goes? Thank you very much! Thanks, -Siwei On 10/25/2023 2:41 AM, Lei Yang wrote: On Wed, Oct 25, 2023 at 1:27 AM Si-Wei Liu wrote: Hello Si-Wei Thanks a lot for testing! Please be aware that there's a follow-up fix for a potential oops in this v4 series: First, when I did not apply patch [1], I also hit the problem mentioned in this patch. After applying it, the problem no longer reproduces. But I hit another issue; for the error messages, please review the attached file. [1] https://lore.kernel.org/virtualization/1698102863-21122-1-git-send-email-si-wei@oracle.com/ My test steps: git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git cd linux/ b4 am 1697880319-4937-1-git-send-email-si-wei@oracle.com b4 am 20231018171456.1624030-2-dtatu...@nvidia.com b4 am 1698102863-21122-1-git-send-email-si-wei@oracle.com git am ./v4_20231018_dtatulea_vdpa_add_support_for_vq_descriptor_mappings.mbx git am ./v4_20231021_si_wei_liu_vdpa_decouple_reset_of_iotlb_mapping_from_device_reset.mbx git am ./20231023_si_wei_liu_vhost_vdpa_fix_null_pointer_deref_in__compat_vdpa_reset.mbx cp /boot/config-5.14.0-377.el9.x86_64 .config make -j 32 make modules_install make install Thanks Lei https://lore.kernel.org/virtualization/1698102863-21122-1-git-send-email-si-wei@oracle.com/ Would be nice to have it applied for any tests. Thanks, -Siwei On 10/23/2023 11:51 PM, Lei Yang wrote: QE tested this series v4 with regression testing on real nic, there is no new regression bug. 
Tested-by: Lei Yang On Tue, Oct 24, 2023 at 6:02 AM Si-Wei Liu wrote: On 10/22/2023 8:51 PM, Jason Wang wrote: Hi Si-Wei: On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu wrote: In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is rebased on top of the latest vhost tree. [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html --- v4: - Rework compatibility using new .compat_reset driver op I still think having a set_backend_feature() This will overload backend features with the role of carrying over compatibility quirks, which I tried to avoid from. While I think the .compat_reset from the v4 code just works with the backend features acknowledgement (and maybe others as well) to determine, but not directly tie it to backend features itself. These two have different implications in terms of requirement, scope and maintaining/deprecation, better to cope with compat quirks in explicit and driver visible way. or reset_map(clean=true) might be better. 
An explicit op might be marginally better in driver writer's point of view. Compliant driver doesn't have to bother asserting clean_map never be true so their code would never bother dealing with this case, as explained in the commit log for patch 5 "vhost-vdpa: clean iotlb map during reset for older userspace": " The separation of .compat_reset from the regular .reset allows vhost-vdpa able to know which driver had broken behavior before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. " As it tries hard to not introduce new stuff on the bus. Honestly I don't see substantial difference between these other than the color. There's no single best solution that stands out among the 3. And I assume you already noticed it from all the above 3 approaches will have to go with backend features negotiation, that the 1st vdpa reset before backend feature negotiation will use the compliant version of .reset that doesn't clean up the map. While I don
[PATCH] vhost-vdpa: fix use-after-free in _compat_vdpa_reset
When the vhost-vdpa device is being closed, vhost_vdpa_cleanup() doesn't clean up the vqs pointer after free. This could lead to use-after-free when _compat_vdpa_reset() tries to access the vqs that are freed already. Fix is to set the vqs pointer to NULL at the end of vhost_vdpa_cleanup() after it is freed, which is guarded by atomic opened state. BUG: unable to handle page fault for address: 0001005b4af4 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 16a80a067 P4D 0 Oops: [#1] PREEMPT SMP NOPTI CPU: 4 PID: 40387 Comm: qemu-kvm Not tainted 6.6.0-rc7+ #3 Hardware name: Dell Inc. PowerEdge R750/0PJ80M, BIOS 1.8.2 09/14/2022 RIP: 0010:_compat_vdpa_reset.isra.0+0x27/0xb0 [vhost_vdpa] Code: 90 90 90 0f 1f 44 00 00 41 55 4c 8d ae 08 03 00 00 41 54 55 48 89 f5 53 4c 8b a6 00 03 00 00 48 85 ff 74 49 48 8b 07 4c 89 ef <48> 8b 80 88 45 00 00 48 c1 e8 08 48 83 f0 01 89 c3 e8 73 5e 9b dc RSP: 0018:ff73a85762073ba0 EFLAGS: 00010286 RAX: 0001005b056c RBX: ff32b13ca6994c68 RCX: 0002 RDX: 0001 RSI: ff32b13c07559000 RDI: ff32b13c07559308 RBP: ff32b13c07559000 R08: R09: ff32b12ca497c0f0 R10: ff73a85762073c58 R11: 000c106f9de3 R12: ff32b12c95b1d050 R13: ff32b13c07559308 R14: ff32b12d0ddc5100 R15: 8002 FS: 7fec5b8cbf80() GS:ff32b13bbfc8() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0001005b4af4 CR3: 00015644a003 CR4: 00773ee0 PKRU: 5554 Call Trace: ? __die+0x20/0x70 ? page_fault_oops+0x76/0x170 ? exc_page_fault+0x65/0x150 ? asm_exc_page_fault+0x22/0x30 ? _compat_vdpa_reset.isra.0+0x27/0xb0 [vhost_vdpa] vhost_vdpa_open+0x57/0x280 [vhost_vdpa] ? __pfx_chrdev_open+0x10/0x10 chrdev_open+0xc6/0x260 ? __pfx_chrdev_open+0x10/0x10 do_dentry_open+0x16e/0x530 do_open+0x21c/0x400 path_openat+0x111/0x290 do_filp_open+0xb2/0x160 ? __check_object_size.part.0+0x5e/0x140 do_sys_openat2+0x96/0xd0 __x64_sys_openat+0x53/0xa0 do_syscall_64+0x59/0x90 ? syscall_exit_to_user_mode+0x22/0x40 ? do_syscall_64+0x69/0x90 ? syscall_exit_to_user_mode+0x22/0x40 ? 
do_syscall_64+0x69/0x90 ? do_syscall_64+0x69/0x90 ? syscall_exit_to_user_mode+0x22/0x40 ? do_syscall_64+0x69/0x90 ? exc_page_fault+0x65/0x150 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Fixes: 10cbf8dfaf93 ("vhost-vdpa: clean iotlb map during reset for older userspace") Fixes: ac7e98c73c05 ("vhost-vdpa: fix NULL pointer deref in _compat_vdpa_reset") Reported-by: Lei Yang Closes: https://lore.kernel.org/all/CAPpAL=yhdqn1aztecn3mps8o4m+bl_hvy02fdpihn7dwd91...@mail.gmail.com/ Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 9a2343c45df0..30df5c58db73 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -1355,6 +1355,7 @@ static void vhost_vdpa_cleanup(struct vhost_vdpa *v) vhost_vdpa_free_domain(v); vhost_dev_cleanup(>vdev); kfree(v->vdev.vqs); + v->vdev.vqs = NULL; } static int vhost_vdpa_open(struct inode *inode, struct file *filep) -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
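The one-line fix above follows a standard defensive idiom: a pointer freed in a teardown path that can be re-entered later (here vhost_vdpa_cleanup(), followed by a subsequent open of the same chardev) is reset to NULL, so later code can test for "not allocated" instead of dereferencing freed memory. A minimal userspace sketch of the idiom, with made-up `toy_*` names standing in for the kernel structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_dev {
	int   *vqs;  /* stands in for v->vdev.vqs */
	size_t nvqs;
};

static void toy_cleanup(struct toy_dev *d)
{
	free(d->vqs);
	d->vqs = NULL; /* the one-line fix: leave no dangling pointer */
	d->nvqs = 0;
}

/* A reset path that runs after cleanup can now detect the torn-down
 * state instead of reading freed memory (the reported oops). */
static int toy_reset(struct toy_dev *d)
{
	if (!d->vqs)
		return -1; /* device not set up, nothing to reset */
	return d->vqs[0];
}
```

Without the `d->vqs = NULL` line, `toy_reset()` after `toy_cleanup()` would read freed memory, exactly the use-after-free the patch fixes.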
Re: [PATCH v4 0/7] vdpa: decouple reset of iotlb mapping from device reset
Thanks a lot for testing! Please be aware that there's a follow-up fix for a potential oops in this v4 series: https://lore.kernel.org/virtualization/1698102863-21122-1-git-send-email-si-wei@oracle.com/ Would be nice to have it applied for any tests. Thanks, -Siwei On 10/23/2023 11:51 PM, Lei Yang wrote: QE tested this series v4 with regression testing on real nic, there is no new regression bug. Tested-by: Lei Yang On Tue, Oct 24, 2023 at 6:02 AM Si-Wei Liu wrote: On 10/22/2023 8:51 PM, Jason Wang wrote: Hi Si-Wei: On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu wrote: In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is rebased on top of the latest vhost tree. [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html --- v4: - Rework compatibility using new .compat_reset driver op I still think having a set_backend_feature() This will overload backend features with the role of carrying over compatibility quirks, which I tried to avoid from. 
While I think the .compat_reset from the v4 code just works with the backend features acknowledgement (and maybe others as well) to determine, but not directly tie it to backend features itself. These two have different implications in terms of requirement, scope and maintaining/deprecation, better to cope with compat quirks in explicit and driver visible way. or reset_map(clean=true) might be better. An explicit op might be marginally better in driver writer's point of view. Compliant driver doesn't have to bother asserting clean_map never be true so their code would never bother dealing with this case, as explained in the commit log for patch 5 "vhost-vdpa: clean iotlb map during reset for older userspace": " The separation of .compat_reset from the regular .reset allows vhost-vdpa able to know which driver had broken behavior before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. " As it tries hard to not introduce new stuff on the bus. Honestly I don't see substantial difference between these other than the color. There's no single best solution that stands out among the 3. And I assume you already noticed it from all the above 3 approaches will have to go with backend features negotiation, that the 1st vdpa reset before backend feature negotiation will use the compliant version of .reset that doesn't clean up the map. While I don't think this nuance matters much to existing older userspace apps, as the maps should already get cleaned by previous process in vhost_vdpa_cleanup(), but if bug-for-bug behavioral compatibility is what you want, module parameter will be the single best answer. Regards, -Siwei But we can listen to others for sure. 
Thanks
Re: [PATCH v4 5/7] vhost-vdpa: clean iotlb map during reset for older userspace
On 10/24/2023 9:21 AM, Si-Wei Liu wrote: On 10/23/2023 10:45 PM, Jason Wang wrote: On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu wrote: Using .compat_reset op from the previous patch, the buggy .reset behaviour can be kept as-is on older userspace apps, which don't ack the IOTLB_PERSIST backend feature. As this compatibility quirk is limited to those drivers that used to be buggy in the past, it won't change the behaviour or affect ABI on the setups with API compliant driver. The separation of .compat_reset from the regular .reset allows vhost-vdpa to know which driver had broken behaviour before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + drivers/virtio/virtio_vdpa.c | 2 +- include/linux/vdpa.h | 7 +-- 3 files changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index acc7c74ba7d6..9ce40003793b 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -227,13 +227,22 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid) irq_bypass_unregister_producer(>call_ctx.producer); } -static int vhost_vdpa_reset(struct vhost_vdpa *v) +static int _compat_vdpa_reset(struct vhost_vdpa *v) { struct vdpa_device *vdpa = v->vdpa; + u32 flags = 0; - v->in_batch = 0; + flags |= !vhost_backend_has_feature(v->vdev.vqs[0], + VHOST_BACKEND_F_IOTLB_PERSIST) ? 
+ VDPA_RESET_F_CLEAN_MAP : 0; + + return vdpa_reset(vdpa, flags); +} - return vdpa_reset(vdpa); +static int vhost_vdpa_reset(struct vhost_vdpa *v) +{ + v->in_batch = 0; + return _compat_vdpa_reset(v); } static long vhost_vdpa_bind_mm(struct vhost_vdpa *v) @@ -312,7 +321,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) vhost_vdpa_unsetup_vq_irq(v, i); if (status == 0) { - ret = vdpa_reset(vdpa); + ret = _compat_vdpa_reset(v); if (ret) return ret; } else diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c index 06ce6d8c2e00..8d63e5923d24 100644 --- a/drivers/virtio/virtio_vdpa.c +++ b/drivers/virtio/virtio_vdpa.c @@ -100,7 +100,7 @@ static void virtio_vdpa_reset(struct virtio_device *vdev) { struct vdpa_device *vdpa = vd_get_vdpa(vdev); - vdpa_reset(vdpa); + vdpa_reset(vdpa, 0); } static bool virtio_vdpa_notify(struct virtqueue *vq) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 6b8cbf75712d..db15ac07f8a6 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -519,14 +519,17 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) return vdev->dma_dev; } -static inline int vdpa_reset(struct vdpa_device *vdev) +static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags) { const struct vdpa_config_ops *ops = vdev->config; int ret; down_write(>cf_lock); vdev->features_valid = false; - ret = ops->reset(vdev); + if (ops->compat_reset && flags) + ret = ops->compat_reset(vdev, flags); + else + ret = ops->reset(vdev); Instead of inventing a new API that carries the flags. Tweak the existing one seems to be simpler and better? Well, as indicated in the commit message, this allows vhost-vdpa be able to know which driver had broken behavior before, so it can apply the corresponding compatibility quirk to the individual driver when it's really necessary. 
If sending all flags unconditionally down to every driver, it's hard for driver writers to distinguish which are compatibility quirks that they can safely ignore and which are feature flags that are encouraged to implement. In that sense, gating features from being polluted by compatibility quirks with an implicit op s/implicit/explicit/ would be better. Regards, -Siwei As compat_reset(vdev, 0) == reset(vdev) Then you don't need the switch in the parent as well +static int vdpasim_reset(struct vdpa_device *vdpa) +{ + return vdpasim_compat_reset(vdpa, 0); +} Thanks up_write(>cf_lock); return ret; } -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
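The dispatch being debated in this thread, taking the optional `.compat_reset` op only when a quirk flag is set, and falling back to plain `.reset` otherwise, can be modeled in a few lines of userspace C. This is a sketch of the quoted `vdpa_reset(vdev, flags)` logic only; the `toy_*` names and the observable `map_cleaned` field are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define VDPA_RESET_F_CLEAN_MAP (1U << 0)

struct toy_vdpa;

struct toy_ops {
	int (*reset)(struct toy_vdpa *d);
	/* Optional: only drivers with historically broken .reset
	 * behaviour implement this; compliant drivers leave it NULL
	 * and never have to reason about the quirk. */
	int (*compat_reset)(struct toy_vdpa *d, unsigned int flags);
};

struct toy_vdpa {
	const struct toy_ops *ops;
	int map_cleaned; /* observable side effect for the sketch */
};

/* Mirrors the vdpa_reset() hunk quoted above: the compat path is taken
 * only if the driver implements it AND a quirk flag is actually set. */
static int toy_vdpa_reset(struct toy_vdpa *d, unsigned int flags)
{
	if (d->ops->compat_reset && flags)
		return d->ops->compat_reset(d, flags);
	return d->ops->reset(d);
}

static int plain_reset(struct toy_vdpa *d) { (void)d; return 0; }

static int legacy_compat_reset(struct toy_vdpa *d, unsigned int flags)
{
	if (flags & VDPA_RESET_F_CLEAN_MAP)
		d->map_cleaned = 1; /* bug-for-bug compatible behaviour */
	return 0;
}

static const struct toy_ops legacy_ops    = { plain_reset, legacy_compat_reset };
static const struct toy_ops compliant_ops = { plain_reset, NULL };
```

Note how a compliant driver (NULL `.compat_reset`) is never handed the flag at all, which is the "no extra burden" argument from the commit message.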
Re: [PATCH v4 5/7] vhost-vdpa: clean iotlb map during reset for older userspace
On 10/23/2023 10:45 PM, Jason Wang wrote: On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu wrote: Using .compat_reset op from the previous patch, the buggy .reset behaviour can be kept as-is on older userspace apps, which don't ack the IOTLB_PERSIST backend feature. As this compatibility quirk is limited to those drivers that used to be buggy in the past, it won't affect change the behaviour or affect ABI on the setups with API compliant driver. The separation of .compat_reset from the regular .reset allows vhost-vdpa able to know which driver had broken behaviour before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + drivers/virtio/virtio_vdpa.c | 2 +- include/linux/vdpa.h | 7 +-- 3 files changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index acc7c74ba7d6..9ce40003793b 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -227,13 +227,22 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid) irq_bypass_unregister_producer(>call_ctx.producer); } -static int vhost_vdpa_reset(struct vhost_vdpa *v) +static int _compat_vdpa_reset(struct vhost_vdpa *v) { struct vdpa_device *vdpa = v->vdpa; + u32 flags = 0; - v->in_batch = 0; + flags |= !vhost_backend_has_feature(v->vdev.vqs[0], + VHOST_BACKEND_F_IOTLB_PERSIST) ? 
+VDPA_RESET_F_CLEAN_MAP : 0; + + return vdpa_reset(vdpa, flags); +} - return vdpa_reset(vdpa); +static int vhost_vdpa_reset(struct vhost_vdpa *v) +{ + v->in_batch = 0; + return _compat_vdpa_reset(v); } static long vhost_vdpa_bind_mm(struct vhost_vdpa *v) @@ -312,7 +321,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) vhost_vdpa_unsetup_vq_irq(v, i); if (status == 0) { - ret = vdpa_reset(vdpa); + ret = _compat_vdpa_reset(v); if (ret) return ret; } else diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c index 06ce6d8c2e00..8d63e5923d24 100644 --- a/drivers/virtio/virtio_vdpa.c +++ b/drivers/virtio/virtio_vdpa.c @@ -100,7 +100,7 @@ static void virtio_vdpa_reset(struct virtio_device *vdev) { struct vdpa_device *vdpa = vd_get_vdpa(vdev); - vdpa_reset(vdpa); + vdpa_reset(vdpa, 0); } static bool virtio_vdpa_notify(struct virtqueue *vq) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 6b8cbf75712d..db15ac07f8a6 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -519,14 +519,17 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) return vdev->dma_dev; } -static inline int vdpa_reset(struct vdpa_device *vdev) +static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags) { const struct vdpa_config_ops *ops = vdev->config; int ret; down_write(>cf_lock); vdev->features_valid = false; - ret = ops->reset(vdev); + if (ops->compat_reset && flags) + ret = ops->compat_reset(vdev, flags); + else + ret = ops->reset(vdev); Instead of inventing a new API that carries the flags. Tweak the existing one seems to be simpler and better? Well, as indicated in the commit message, this allows vhost-vdpa be able to know which driver had broken behavior before, so it can apply the corresponding compatibility quirk to the individual driver when it's really necessary. 
If sending all flags unconditionally down to every driver, it's hard for driver writers to distinguish which are compatibility quirks that they can safely ignore and which are feature flags that are encouraged to implement. In that sense, gating features from being polluted by compatibility quirks with an implicit op would be better. Regards, -Siwei As compat_reset(vdev, 0) == reset(vdev) Then you don't need the switch in the parent as well +static int vdpasim_reset(struct vdpa_device *vdpa) +{ + return vdpasim_compat_reset(vdpa, 0); +} Thanks up_write(>cf_lock); return ret; } -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] vhost-vdpa: fix NULL pointer deref in _compat_vdpa_reset
As subject. There's a vhost_vdpa_reset() done earlier before vhost_dev is initialized via vhost_dev_init(), ending up with NULL pointer dereference. Fix is to check if vqs is initialized before checking backend features and resetting the device. BUG: kernel NULL pointer dereference, address: #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 P4D 0 Oops: [#1] SMP CPU: 3 PID: 1727 Comm: qemu-system-x86 Not tainted 6.6.0-rc6+ #2 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel- a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:_compat_vdpa_reset+0x47/0xc0 [vhost_vdpa] Code: c7 c7 fb 12 56 a0 4c 8d a5 b8 02 00 00 48 89 ea e8 7e b8 c4 48 89 ee 48 c7 c7 19 13 56 a0 4c 8b ad b0 02 00 00 <48> 8b 00 49 00 48 8b 80 88 45 00 00 48 c1 e8 08 48 RSP: 0018:8881063c3c38 EFLAGS: 00010246 RAX: RBX: 8881074eb800 RCX: RDX: RSI: 888103ab4000 RDI: a0561319 RBP: 888103ab4000 R08: dfff R09: 0001 R10: 0003 R11: 7fecbac0 R12: 888103ab42b8 R13: 888106dbe850 R14: 0003 R15: 8881074ebc18 FS: 7f02fba6ef00() GS:5f8c() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 0001325e5003 CR4: 00372ea0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: ? __die+0x1f/0x60 ? page_fault_oops+0x14c/0x3b0 ? exc_page_fault+0x74/0x140 ? asm_exc_page_fault+0x22/0x30 ? _compat_vdpa_reset+0x47/0xc0 [vhost_vdpa] ? _compat_vdpa_reset+0x32/0xc0 [vhost_vdpa] vhost_vdpa_open+0x55/0x270 [vhost_vdpa] ? sb_init_dio_done_wq+0x50/0x50 chrdev_open+0xc0/0x210 ? __unregister_chrdev+0x50/0x50 do_dentry_open+0x1fc/0x4f0 path_openat+0xc2d/0xf20 do_filp_open+0xb4/0x160 ? 
kmem_cache_alloc+0x3c/0x490 do_sys_openat2+0x8d/0xc0 __x64_sys_openat+0x6a/0xa0 do_syscall_64+0x3c/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: 10cbf8dfaf93 ("vhost-vdpa: clean iotlb map during reset for older userspace") Reported-by: Dragos Tatulea Closes: https://lore.kernel.org/all/b4913f84-8b52-4d28-af51-8573dc361...@oracle.com/ Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 9ce40003793b..9a2343c45df0 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -232,9 +232,11 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v) struct vdpa_device *vdpa = v->vdpa; u32 flags = 0; - flags |= !vhost_backend_has_feature(v->vdev.vqs[0], - VHOST_BACKEND_F_IOTLB_PERSIST) ? -VDPA_RESET_F_CLEAN_MAP : 0; + if (v->vdev.vqs) { + flags |= !vhost_backend_has_feature(v->vdev.vqs[0], + VHOST_BACKEND_F_IOTLB_PERSIST) ? +VDPA_RESET_F_CLEAN_MAP : 0; + } return vdpa_reset(vdpa, flags); } -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
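The fix guards the backend-feature lookup because vhost_vdpa_open() performs an early reset before vhost_dev_init() has allocated the vqs array, so `v->vdev.vqs[0]` is a NULL dereference at that point. A minimal userspace model of the guarded flag computation (made-up `toy_*` names; only the shape of the check matches the patch):

```c
#include <assert.h>
#include <stddef.h>

#define VHOST_BACKEND_F_IOTLB_PERSIST (1ULL << 0)
#define VDPA_RESET_F_CLEAN_MAP        (1U << 0)

struct toy_vq {
	unsigned long long acked_backend_features;
};

struct toy_vhost_vdpa {
	struct toy_vq **vqs; /* NULL until vhost_dev_init() has run */
};

/* Mirrors the fixed _compat_vdpa_reset(): consult vqs[0] only when the
 * array exists; on the early pre-init reset, no quirk flag is raised. */
static unsigned int toy_compat_reset_flags(struct toy_vhost_vdpa *v)
{
	unsigned int flags = 0;

	if (v->vqs &&
	    !(v->vqs[0]->acked_backend_features & VHOST_BACKEND_F_IOTLB_PERSIST))
		flags |= VDPA_RESET_F_CLEAN_MAP;
	return flags;
}
```

The early reset therefore behaves like the compliant path (no map cleaning), which is harmless because nothing has been mapped by a just-opened device yet, matching the reasoning in the thread.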
Re: [PATCH v4 5/7] vhost-vdpa: clean iotlb map during reset for older userspace
(+ linux-next) Hi Michael, Dragos reported below oops for which I have a fix at hand (having it fully tested), ready to be posted to linux-next. Please let me know if you want me to respin the original patch series, or you would think it'd be fine to fix it on top. On 10/23/2023 11:59 AM, Dragos Tatulea wrote: On Sat, 2023-10-21 at 02:25 -0700, Si-Wei Liu wrote: Using .compat_reset op from the previous patch, the buggy .reset behaviour can be kept as-is on older userspace apps, which don't ack the IOTLB_PERSIST backend feature. As this compatibility quirk is limited to those drivers that used to be buggy in the past, it won't affect change the behaviour or affect ABI on the setups with API compliant driver. The separation of .compat_reset from the regular .reset allows vhost-vdpa able to know which driver had broken behaviour before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + drivers/virtio/virtio_vdpa.c | 2 +- include/linux/vdpa.h | 7 +-- 3 files changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index acc7c74ba7d6..9ce40003793b 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -227,13 +227,22 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid) irq_bypass_unregister_producer(>call_ctx.producer); } -static int vhost_vdpa_reset(struct vhost_vdpa *v) +static int _compat_vdpa_reset(struct vhost_vdpa *v) { struct vdpa_device *vdpa = v->vdpa; + u32 flags = 0; - v->in_batch = 0; + flags |= !vhost_backend_has_feature(v->vdev.vqs[0], + VHOST_BACKEND_F_IOTLB_PERSIST) ? + VDPA_RESET_F_CLEAN_MAP : 0; Hi Si-Wei, I am getting a Oops due to the vqs not being initialized here. 
Here's how it it looks like: [ 37.817075] BUG: kernel NULL pointer dereference, address: [ 37.817674] #PF: supervisor read access in kernel mode [ 37.818150] #PF: error_code(0x) - not-present page [ 37.818615] PGD 0 P4D 0 [ 37.818893] Oops: [#1] SMP [ 37.819223] CPU: 3 PID: 1727 Comm: qemu-system-x86 Not tainted 6.6.0-rc6+ #2 [ 37.819829] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel- 1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 37.820791] RIP: 0010:_compat_vdpa_reset+0x47/0xc0 [vhost_vdpa] [ 37.821316] Code: c7 c7 fb 12 56 a0 4c 8d a5 b8 02 00 00 48 89 ea e8 7e b8 c4 e0 48 8b 43 28 48 89 ee 48 c7 c7 19 13 56 a0 4c 8b ad b0 02 00 00 <48> 8b 00 49 8b 95 d8 00 00 00 48 8b 80 88 45 00 00 48 c1 e8 08 48 [ 37.822811] RSP: 0018:8881063c3c38 EFLAGS: 00010246 [ 37.823285] RAX: RBX: 8881074eb800 RCX: [ 37.823893] RDX: RSI: 888103ab4000 RDI: a0561319 [ 37.824506] RBP: 888103ab4000 R08: dfff R09: 0001 [ 37.825116] R10: 0003 R11: 7fecbac0 R12: 888103ab42b8 [ 37.825721] R13: 888106dbe850 R14: 0003 R15: 8881074ebc18 [ 37.826326] FS: 7f02fba6ef00() GS:5f8c() knlGS: [ 37.827035] CS: 0010 DS: ES: CR0: 80050033 [ 37.827552] CR2: CR3: 0001325e5003 CR4: 00372ea0 [ 37.828162] DR0: DR1: DR2: [ 37.828772] DR3: DR6: fffe0ff0 DR7: 0400 [ 37.829381] Call Trace: [ 37.829660] [ 37.829911] ? __die+0x1f/0x60 [ 37.830234] ? page_fault_oops+0x14c/0x3b0 [ 37.830623] ? exc_page_fault+0x74/0x140 [ 37.830999] ? asm_exc_page_fault+0x22/0x30 [ 37.831402] ? _compat_vdpa_reset+0x47/0xc0 [vhost_vdpa] [ 37.831888] ? _compat_vdpa_reset+0x32/0xc0 [vhost_vdpa] [ 37.832366] vhost_vdpa_open+0x55/0x270 [vhost_vdpa] [ 37.832821] ? sb_init_dio_done_wq+0x50/0x50 [ 37.833225] chrdev_open+0xc0/0x210 [ 37.833582] ? __unregister_chrdev+0x50/0x50 [ 37.833990] do_dentry_open+0x1fc/0x4f0 [ 37.834363] path_openat+0xc2d/0xf20 [ 37.834721] do_filp_open+0xb4/0x160 [ 37.835082] ? 
kmem_cache_alloc+0x3c/0x490 [ 37.835474] do_sys_openat2+0x8d/0xc0 [ 37.835834] __x64_sys_openat+0x6a/0xa0 [ 37.836208] do_syscall_64+0x3c/0x80 [ 37.836564] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 37.837021] RIP: 0033:0x7f02fcc2c085 [ 37.837378] Code: 8b 55 d0 48 89 45 b0 75 a0 44 89 55 9c e8 63 7d f8 ff 44 8b 55 9c 89 da 4c 89 e6 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 89 45 9c e8 b8 7d f8 ff 8b 45 9c [ 37.838891] RSP: 002b:7ffdea3c8cc0 EFLAGS: 0293 ORIG_RAX: 0101 [ 37.839571] R
Re: [PATCH v4 0/7] vdpa: decouple reset of iotlb mapping from device reset
On 10/22/2023 8:51 PM, Jason Wang wrote: Hi Si-Wei: On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu wrote: In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is rebased on top of the latest vhost tree. [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html --- v4: - Rework compatibility using new .compat_reset driver op I still think having a set_backend_feature() This will overload backend features with the role of carrying over compatibility quirks, which I tried to avoid from. While I think the .compat_reset from the v4 code just works with the backend features acknowledgement (and maybe others as well) to determine, but not directly tie it to backend features itself. These two have different implications in terms of requirement, scope and maintaining/deprecation, better to cope with compat quirks in explicit and driver visible way. or reset_map(clean=true) might be better. An explicit op might be marginally better in driver writer's point of view. 
A compliant driver doesn't have to bother asserting that clean_map can never be true, so its code never has to deal with this case, as explained in the commit log for patch 5 "vhost-vdpa: clean iotlb map during reset for older userspace": " The separation of .compat_reset from the regular .reset allows vhost-vdpa to know which driver had broken behavior before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. " As it tries hard to not introduce new stuff on the bus. Honestly I don't see substantial difference between these other than the color. There's no single best solution that stands out among the 3. And I assume you already noticed that all of the above 3 approaches will have to go with backend features negotiation: the 1st vdpa reset before backend feature negotiation will use the compliant version of .reset that doesn't clean up the map. I don't think this nuance matters much to existing older userspace apps, as the maps should already get cleaned up by the previous process in vhost_vdpa_cleanup(), but if bug-for-bug behavioral compatibility is what you want, a module parameter will be the single best answer. Regards, -Siwei But we can listen to others for sure. Thanks ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 5/7] vhost-vdpa: clean iotlb map during reset for older userspace
Using the .compat_reset op from the previous patch, the buggy .reset behaviour can be kept as-is for older userspace apps, which don't ack the IOTLB_PERSIST backend feature. As this compatibility quirk is limited to those drivers that used to be buggy in the past, it won't change the behaviour or affect the ABI on setups with an API-compliant driver. The separation of .compat_reset from the regular .reset allows vhost-vdpa to know which driver had broken behaviour before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + drivers/virtio/virtio_vdpa.c | 2 +- include/linux/vdpa.h | 7 +-- 3 files changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index acc7c74ba7d6..9ce40003793b 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -227,13 +227,22 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid) irq_bypass_unregister_producer(&vq->call_ctx.producer); } -static int vhost_vdpa_reset(struct vhost_vdpa *v) +static int _compat_vdpa_reset(struct vhost_vdpa *v) { struct vdpa_device *vdpa = v->vdpa; + u32 flags = 0; - v->in_batch = 0; + flags |= !vhost_backend_has_feature(v->vdev.vqs[0], + VHOST_BACKEND_F_IOTLB_PERSIST) ? 
+VDPA_RESET_F_CLEAN_MAP : 0; + + return vdpa_reset(vdpa, flags); +} - return vdpa_reset(vdpa); +static int vhost_vdpa_reset(struct vhost_vdpa *v) +{ + v->in_batch = 0; + return _compat_vdpa_reset(v); } static long vhost_vdpa_bind_mm(struct vhost_vdpa *v) @@ -312,7 +321,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) vhost_vdpa_unsetup_vq_irq(v, i); if (status == 0) { - ret = vdpa_reset(vdpa); + ret = _compat_vdpa_reset(v); if (ret) return ret; } else diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c index 06ce6d8c2e00..8d63e5923d24 100644 --- a/drivers/virtio/virtio_vdpa.c +++ b/drivers/virtio/virtio_vdpa.c @@ -100,7 +100,7 @@ static void virtio_vdpa_reset(struct virtio_device *vdev) { struct vdpa_device *vdpa = vd_get_vdpa(vdev); - vdpa_reset(vdpa); + vdpa_reset(vdpa, 0); } static bool virtio_vdpa_notify(struct virtqueue *vq) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 6b8cbf75712d..db15ac07f8a6 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -519,14 +519,17 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) return vdev->dma_dev; } -static inline int vdpa_reset(struct vdpa_device *vdev) +static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags) { const struct vdpa_config_ops *ops = vdev->config; int ret; down_write(&vdev->cf_lock); vdev->features_valid = false; - ret = ops->reset(vdev); + if (ops->compat_reset && flags) + ret = ops->compat_reset(vdev, flags); + else + ret = ops->reset(vdev); up_write(&vdev->cf_lock); return ret; } -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
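[Editor's note] The dispatch at the heart of this patch — vdpa_reset() choosing .compat_reset only when the driver implements it and a compat quirk flag is set — can be modeled as a small self-contained userspace sketch. All toy_* names and the demo function are made up for illustration; this is not kernel code.

```c
#include <stddef.h>

/* Toy userspace model of the vdpa_reset() dispatch introduced in this
 * patch; names mirror the kernel API, but this is not kernel code. */
#define VDPA_RESET_F_CLEAN_MAP 1

struct toy_config_ops {
	int (*reset)(void);
	int (*compat_reset)(unsigned int flags);
};

static int plain_calls, compat_calls;
static int toy_reset(void) { plain_calls++; return 0; }
static int toy_compat_reset(unsigned int flags) { (void)flags; compat_calls++; return 0; }

/* Mirrors the patched vdpa_reset(): use ->compat_reset only when the
 * driver provides it AND a compatibility flag is set. */
int toy_vdpa_reset(const struct toy_config_ops *ops, unsigned int flags)
{
	if (ops->compat_reset && flags)
		return ops->compat_reset(flags);
	return ops->reset();
}

/* Runs three resets and returns (plain_calls << 4) | compat_calls. */
int reset_dispatch_demo(void)
{
	const struct toy_config_ops compliant = { .reset = toy_reset };
	const struct toy_config_ops legacy = { .reset = toy_reset,
					       .compat_reset = toy_compat_reset };

	/* A compliant driver never sees the flag, whatever userspace acked. */
	toy_vdpa_reset(&compliant, VDPA_RESET_F_CLEAN_MAP);
	/* A legacy driver gets the quirk only for old userspace
	 * (i.e. IOTLB_PERSIST was not acked, so CLEAN_MAP is set). */
	toy_vdpa_reset(&legacy, VDPA_RESET_F_CLEAN_MAP);
	toy_vdpa_reset(&legacy, 0);
	return (plain_calls << 4) | compat_calls;
}
```

This illustrates the commit-log claim that compliant drivers carry no extra burden: their ops table simply never gains a .compat_reset entry, so the flag can never reach them.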
[PATCH v4 6/7] vdpa/mlx5: implement .reset_map driver op
Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with a preallocated 1:1 DMA MR at device creation time. This 1:1 DMA MR will be implicitly destroyed when the first .set_map call is invoked, in which case callers like vhost-vdpa will start to set up custom mappings. When the .reset callback is invoked, the custom mappings will be cleared and the 1:1 DMA MR will be re-created. In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept around intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR (including cvq mapping) on the given ASID and recreate the initial DMA mapping. That way, the device .reset op runs free from having to maintain and clean up memory mappings by itself. Additionally, implement .compat_reset to cater for older userspace, which may expect mappings to be cleared during reset. Co-developed-by: Dragos Tatulea Signed-off-by: Dragos Tatulea Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 + drivers/vdpa/mlx5/net/mlx5_vnet.c | 27 --- 3 files changed, 42 insertions(+), 3 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index db988ced5a5d..84547d998bcf 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 66530e28f327..2197c46e563a 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev) return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0); } + +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +{ + if (asid >= MLX5_VDPA_NUM_AS) + return -EINVAL; + + mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]); + + if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if (mlx5_vdpa_create_dma_mr(mvdev)) + mlx5_vdpa_warn(mvdev, "create DMA MR failed\n"); + } else { + mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid); + } + + return 0; +} diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index f4516a2d5bb0..12ac3397f39b 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -2876,7 +2876,7 @@ static void init_group_to_asid_map(struct mlx5_vdpa_dev *mvdev) mvdev->group2asid[i] = 0; } -static int mlx5_vdpa_reset(struct vdpa_device *vdev) +static int mlx5_vdpa_compat_reset(struct vdpa_device *vdev, u32 flags) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); @@ -2888,7 +2888,8 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) unregister_link_notifier(ndev); teardown_driver(ndev); clear_vqs_ready(ndev); - mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); + if (flags & VDPA_RESET_F_CLEAN_MAP) + mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); ndev->mvdev.status = 0; ndev->mvdev.suspended = false; ndev->cur_num_vqs = 0; @@ -2899,7 +2900,8 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) init_group_to_asid_map(mvdev); ++mvdev->generation; - if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if ((flags & VDPA_RESET_F_CLEAN_MAP) && + MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { if 
(mlx5_vdpa_create_dma_mr(mvdev)) mlx5_vdpa_warn(mvdev, "create MR failed\n"); } @@ -2908,6 +2910,11 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) return 0; } +static int mlx5_vdpa_reset(struct vdpa_device *vdev) +{ + return mlx5_vdpa_compat_reset(vdev, 0); +} + static size_t mlx5_vdpa_get_config_size(struct vdpa_device *vdev) { return sizeof(struct virtio_net_config); @@ -2987,6 +2994,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid, return err; } +static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsi
[PATCH v4 7/7] vdpa_sim: implement .reset_map support
In order to reduce excessive memory mapping cost in live migration and VM reboot, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the iotlb on the given ASID and recreate the 1:1 passthrough/identity mapping. To be consistent, the mapping on device creation is initialized to passthrough/identity with PA 1:1 mapped as IOVA. With this the device .reset op doesn't have to maintain and clean up memory mappings by itself. Additionally, implement .compat_reset to cater for older userspace, which may expect mappings to be cleared during reset. Signed-off-by: Si-Wei Liu Tested-by: Stefano Garzarella --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 52 ++-- 1 file changed, 43 insertions(+), 9 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 76d41058add9..be2925d0d283 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -139,7 +139,7 @@ static void vdpasim_vq_reset(struct vdpasim *vdpasim, vq->vring.notify = NULL; } -static void vdpasim_do_reset(struct vdpasim *vdpasim) +static void vdpasim_do_reset(struct vdpasim *vdpasim, u32 flags) { int i; @@ -151,11 +151,13 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim) &vdpasim->iommu_lock); } - for (i = 0; i < vdpasim->dev_attr.nas; i++) { - vhost_iotlb_reset(&vdpasim->iommu[i]); - vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, - 0, VHOST_MAP_RW); - vdpasim->iommu_pt[i] = true; + if (flags & VDPA_RESET_F_CLEAN_MAP) { + for (i = 0; i < vdpasim->dev_attr.nas; i++) { + vhost_iotlb_reset(&vdpasim->iommu[i]); + vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, + 0, VHOST_MAP_RW); + vdpasim->iommu_pt[i] = true; + } } vdpasim->running = true; @@ -259,8 +261,12 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr, if (!vdpasim->iommu_pt) goto err_iommu; - for (i = 0; i < vdpasim->dev_attr.nas; i++) + 
for (i = 0; i < vdpasim->dev_attr.nas; i++) { vhost_iotlb_init(&vdpasim->iommu[i], max_iotlb_entries, 0); + vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, 0, + VHOST_MAP_RW); + vdpasim->iommu_pt[i] = true; + } for (i = 0; i < dev_attr->nvqs; i++) vringh_set_iotlb(&vdpasim->vqs[i].vring, &vdpasim->iommu[0], @@ -480,18 +486,23 @@ static void vdpasim_set_status(struct vdpa_device *vdpa, u8 status) mutex_unlock(&vdpasim->mutex); } -static int vdpasim_reset(struct vdpa_device *vdpa) +static int vdpasim_compat_reset(struct vdpa_device *vdpa, u32 flags) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); mutex_lock(&vdpasim->mutex); vdpasim->status = 0; - vdpasim_do_reset(vdpasim); + vdpasim_do_reset(vdpasim, flags); mutex_unlock(&vdpasim->mutex); return 0; } +static int vdpasim_reset(struct vdpa_device *vdpa) +{ + return vdpasim_compat_reset(vdpa, 0); +} + static int vdpasim_suspend(struct vdpa_device *vdpa) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); @@ -637,6 +648,25 @@ static int vdpasim_set_map(struct vdpa_device *vdpa, unsigned int asid, return ret; } +static int vdpasim_reset_map(struct vdpa_device *vdpa, unsigned int asid) +{ + struct vdpasim *vdpasim = vdpa_to_sim(vdpa); + + if (asid >= vdpasim->dev_attr.nas) + return -EINVAL; + + spin_lock(&vdpasim->iommu_lock); + if (vdpasim->iommu_pt[asid]) + goto out; + vhost_iotlb_reset(&vdpasim->iommu[asid]); + vhost_iotlb_add_range(&vdpasim->iommu[asid], 0, ULONG_MAX, + 0, VHOST_MAP_RW); + vdpasim->iommu_pt[asid] = true; +out: + spin_unlock(&vdpasim->iommu_lock); + return 0; +} + static int vdpasim_bind_mm(struct vdpa_device *vdpa, struct mm_struct *mm) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); @@ -749,6 +779,7 @@ static const struct vdpa_config_ops vdpasim_config_ops = { .get_status = vdpasim_get_status, .set_status = vdpasim_set_status, .reset = vdpasim_reset, + .compat_reset = vdpasim_compat_reset, .suspend= vdpasim_suspend, .resume = vdpasim_resume, .get_config_size= vdpasim_get_config_size, @@ -759,6 +790,7 @@ static const struct vdpa_config_ops vdpasim_config_ops = { .set_group_asid = vdpasim_set
[PATCH v4 3/7] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to tell whether the vhost-vdpa iotlb in the kernel can be trusted to persist IOTLB mapping across vDPA reset. Without it, userspace has no way to tell whether it's running on an older kernel, which could silently drop all iotlb mapping across vDPA reset, especially with broken parent driver implementation for the .reset driver op. The broken driver may incorrectly drop all mappings of its own as part of .reset, which inadvertently ends up with corrupted mapping state between vhost-vdpa userspace and the kernel. As a workaround, to make the mapping behaviour predictable across reset, userspace has to proactively remove all mappings before vDPA reset, and then restore all the mappings afterwards. This workaround is done unconditionally on top of all parent drivers today, due to the parent driver implementation issue and no means to differentiate. This workaround has been used in QEMU since day one, when the corresponding vhost-vdpa userspace backend first appeared. There are 3 cases where the backend may claim this feature bit: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver - parent device with vendor specific IOMMU implementation with persistent IOTLB mapping already that has to specifically declare this backend feature The reason .reset_map is one of the preconditions for persistent iotlb is that without it, vhost-vdpa can't switch the iotlb back to the initial state later on, especially for the on-chip IOMMU case which starts with identity mapping at device creation. virtio-vdpa requires on-chip IOMMU to perform 1:1 passthrough translation from PA to IOVA as-is to begin with, and .reset_map is the only means to turn the iotlb back to the identity mapping mode after vhost-vdpa is gone. 
The difference in behavior did not matter as QEMU unmaps all the memory unregistering the memory listener at vhost_vdpa_dev_start( started = false), but the backend acknowledging this feature flag allows QEMU to make sure it is safe to skip this unmap & map in the case of vhost stop & start cycle. In that sense, this feature flag is actually a signal for userspace to know that the driver bug has been solved. Not offering it indicates that userspace cannot trust the kernel will retain the maps. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 15 +++ include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index c6bfe9bdde42..acc7c74ba7d6 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -439,6 +439,15 @@ static u64 vhost_vdpa_get_backend_features(const struct vhost_vdpa *v) return ops->get_backend_features(vdpa); } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map || + vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); +} + static long vhost_vdpa_set_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -726,6 +735,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME) | BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK))) @@ -742,6 +752,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + 
return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -797,6 +810,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); features |= vhost_vdpa_get_backend_features(v); if (copy_to_user(featurep, &features, sizeof(features
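[Editor's note] The gating predicate added by this patch — vhost_vdpa_has_persistent_map() covering the three cases listed in the commit message — can be restated as a small userspace sketch. The TOY_F_IOTLB_PERSIST bit value and all toy_* names are made up for illustration; this is not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy restatement of the vhost_vdpa_has_persistent_map() predicate
 * from this patch, in plain userspace C (not kernel code). The bit
 * value of TOY_F_IOTLB_PERSIST is illustrative only. */
#define TOY_F_IOTLB_PERSIST (1ULL << 7)

struct toy_parent {
	bool has_set_map;       /* parent implements .set_map */
	bool has_dma_map;       /* parent implements .dma_map */
	bool has_reset_map;     /* parent implements .reset_map */
	uint64_t backend_features;
};

/* Persistent maps can be promised when: the parent works with the
 * platform IOMMU (no .set_map/.dma_map), or it can restore the
 * initial mapping via .reset_map, or it declares IOTLB_PERSIST
 * by itself (vendor specific IOMMU with persistent iotlb). */
bool toy_has_persistent_map(const struct toy_parent *p)
{
	return (!p->has_set_map && !p->has_dma_map) ||
	       p->has_reset_map ||
	       (p->backend_features & TOY_F_IOTLB_PERSIST) != 0;
}

/* Encode the three commit-message cases as bits of the return value. */
int persistent_map_demo(void)
{
	const struct toy_parent platform_iommu = { 0 };
	const struct toy_parent onchip_no_reset_map = { .has_set_map = true };
	const struct toy_parent onchip_with_reset_map = { .has_set_map = true,
							  .has_reset_map = true };

	return (toy_has_persistent_map(&platform_iommu) << 2) |
	       (toy_has_persistent_map(&onchip_no_reset_map) << 1) |
	        toy_has_persistent_map(&onchip_with_reset_map);
}
```

Note how an on-chip IOMMU parent without .reset_map (and without declaring the bit itself) is the only combination that cannot offer IOTLB_PERSIST, matching the -EOPNOTSUPP path above.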
[PATCH v4 2/7] vhost-vdpa: reset vendor specific mapping to initial state in .release
Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to not work with DMA ops and maintain a simple IOMMU model with .reset_map. In particular, device reset should not cause mapping to go away on such IOTLB model, so persistent mapping is implied across reset. Before the userspace process using vhost-vdpa is gone, give it a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f57b95..c6bfe9bdde42 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,14 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state, which cannot be +* cleaned up in the all range unmap call above. Give them +* a chance to clean up or reset the map to the desired +* state. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 1/7] vdpa: introduce .reset_map operation callback
Some device specific IOMMU parent drivers have a long-standing bogus behavior of mistakenly cleaning up the maps during .reset. By definition, this is a violation of the on-chip IOMMU ops (i.e. .set_map, or .dma_map & .dma_unmap) in those offending drivers, as the removal of internal maps is completely agnostic to the upper layer, causing an inconsistent view between the userspace and the kernel. Some userspace apps like QEMU get around this brokenness by proactively removing and adding back all the maps around vdpa device reset, but such a workaround actually penalizes other well-behaved driver setups, where vdpa reset always comes with the associated mapping cost, especially for kernel vDPA devices (use_va=false) that have high cost on pinning. It's imperative to rectify this behavior and remove the problematic code from all those non-compliant parent drivers. The reason why a separate .reset_map op is introduced is because this allows a simple on-chip IOMMU model without exposing too much device implementation detail to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exist other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpected way, these on-chip IOMMU devices can start with 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch the iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model. 
The .reset_map op is not a MUST for every parent that implements the .dma_map or .set_map API, because devices may work with DMA ops directly by implementing their own way to manipulate system memory mappings, so they don't have to use .reset_map to achieve a simple IOMMU device model for 1:1 passthrough mapping. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez Acked-by: Jason Wang --- include/linux/vdpa.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index d376309b99cf..26ae6ae1eac3 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -327,6 +327,15 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping to the default + * state (optional) + * Needed for devices that are using device + * specific DMA translation and prefer mapping + * to be decoupled from the virtio life cycle, + * i.e. device .reset op does not reset mapping + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -405,6 +414,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
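[Editor's note] The identity-mapping life cycle described in this commit message — 1:1 passthrough at creation, implicitly abandoned by the first .dma_map/.set_map, restored by .reset_map — can be sketched as a toy userspace model. All toy_* names and the demo function are made up for illustration; none of this is kernel code.

```c
#include <stdbool.h>

/* Toy model of the identity-mapping life cycle that .reset_map is
 * meant to restore. Not kernel code; names are illustrative. */
#define TOY_MAX_RANGES 8

struct toy_iotlb {
	unsigned long start[TOY_MAX_RANGES], last[TOY_MAX_RANGES];
	int nranges;
	bool identity; /* still in the initial 1:1 passthrough state? */
};

static void toy_iotlb_reset(struct toy_iotlb *tlb) { tlb->nranges = 0; }

static void toy_iotlb_add_range(struct toy_iotlb *tlb,
				unsigned long start, unsigned long last)
{
	tlb->start[tlb->nranges] = start;
	tlb->last[tlb->nranges] = last;
	tlb->nranges++;
}

/* Like .dma_map/.set_map: installing a custom map leaves identity mode. */
void toy_dma_map(struct toy_iotlb *tlb, unsigned long iova, unsigned long size)
{
	if (tlb->identity) {
		toy_iotlb_reset(tlb);
		tlb->identity = false;
	}
	toy_iotlb_add_range(tlb, iova, iova + size - 1);
}

/* Like .reset_map: back to a single 1:1 passthrough range. */
void toy_reset_map(struct toy_iotlb *tlb)
{
	if (tlb->identity)
		return; /* already in the initial state */
	toy_iotlb_reset(tlb);
	toy_iotlb_add_range(tlb, 0, ~0UL);
	tlb->identity = true;
}

/* Map two custom ranges, then reset; returns the final range count. */
int reset_map_demo(void)
{
	struct toy_iotlb tlb = { .identity = true, .nranges = 1 };
	tlb.start[0] = 0; tlb.last[0] = ~0UL;

	toy_dma_map(&tlb, 0x1000, 0x2000);
	toy_dma_map(&tlb, 0x8000, 0x1000);
	if (tlb.nranges != 2 || tlb.identity)
		return -1;
	toy_reset_map(&tlb);
	return (tlb.identity && tlb.last[0] == ~0UL) ? tlb.nranges : -1;
}
```

The point of the sketch: .reset_map gives the upper layer a one-call way back to the initial state, so virtio-vdpa can reattach and find 1:1 passthrough without the parent abusing DMA ops.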
[PATCH v4 4/7] vdpa: introduce .compat_reset operation callback
Some device specific IOMMU parent drivers have a long-standing bogus behaviour of mistakenly cleaning up the maps during .reset. By definition, this is a violation of the on-chip IOMMU ops (i.e. .set_map, or .dma_map & .dma_unmap) in those offending drivers, as the removal of internal maps is completely agnostic to the upper layer, causing an inconsistent view between the userspace and the kernel. Some userspace apps like QEMU get around this brokenness by proactively removing and adding back all the maps around vdpa device reset, but such a workaround actually penalizes other well-behaved driver setups, where vdpa reset always comes with the associated mapping cost, especially for kernel vDPA devices (use_va=false) that have high cost on pinning. It's imperative to rectify this behaviour and remove the problematic code from all those non-compliant parent drivers. However, we cannot unconditionally remove the bogus map-cleaning code from the buggy .reset implementation, as there might exist userspace apps that already rely on the behaviour on some setup. Introduce a .compat_reset driver op to keep compatibility with older userspace. New and well-behaved parent drivers should not bother to implement such an op, but only those drivers that are doing or used to do non-compliant map-cleaning reset will have to. Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 13 + 1 file changed, 13 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 26ae6ae1eac3..6b8cbf75712d 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -252,6 +252,17 @@ struct vdpa_map_file { * @reset: Reset device * @vdev: vdpa device * Returns integer: success (0) or error (< 0) + * @compat_reset: Reset device with compatibility quirks to + * accommodate older userspace. Only needed by + * parent driver which used to have bogus reset + * behaviour, and has to maintain such behaviour + * for compatibility with older userspace. 
+ * A historically compliant driver only has to + * implement .reset; a historically non-compliant + * driver should implement both. + * @vdev: vdpa device + * @flags: compatibility quirks for reset + * Returns integer: success (0) or error (< 0) * @suspend: Suspend the device (optional) * @vdev: vdpa device * Returns integer: success (0) or error (< 0) @@ -393,6 +404,8 @@ struct vdpa_config_ops { u8 (*get_status)(struct vdpa_device *vdev); void (*set_status)(struct vdpa_device *vdev, u8 status); int (*reset)(struct vdpa_device *vdev); + int (*compat_reset)(struct vdpa_device *vdev, u32 flags); +#define VDPA_RESET_F_CLEAN_MAP 1 int (*suspend)(struct vdpa_device *vdev); int (*resume)(struct vdpa_device *vdev); size_t (*get_config_size)(struct vdpa_device *vdev); -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 0/7] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is rebased on top of the latest vhost tree. 
[1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html --- v4: - Rework compatibility using new .compat_reset driver op v3: - add .reset_map support to vdpa_sim - introduce module parameter to provide bug-for-bug compatibility with older userspace v2: - improved commit message to clarify the intended scope of .reset_map API - improved commit messages to clarify no breakage on older userspace v1: - rewrote commit messages to include more detailed description and background - reword to vendor specific IOMMU implementation from on-chip IOMMU - include parent device backend features to persistent iotlb precondition - reimplement mlx5_vdpa patch on top of descriptor group series RFC v3: - fix missing return due to merge error in patch #4 RFC v2: - rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/ --- Si-Wei Liu (7): vdpa: introduce .reset_map operation callback vhost-vdpa: reset vendor specific mapping to initial state in .release vhost-vdpa: introduce IOTLB_PERSIST backend feature bit vdpa: introduce .compat_reset operation callback vhost-vdpa: clean iotlb map during reset for older userspace vdpa/mlx5: implement .reset_map driver op vdpa_sim: implement .reset_map support drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 ++ drivers/vdpa/mlx5/net/mlx5_vnet.c | 27 ++-- drivers/vdpa/vdpa_sim/vdpa_sim.c | 52 -- drivers/vhost/vdpa.c | 49 +--- drivers/virtio/virtio_vdpa.c | 2 +- include/linux/vdpa.h | 30 +++-- include/uapi/linux/vhost_types.h | 2 ++ 8 files changed, 161 insertions(+), 19 deletions(-) -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/19/2023 9:11 PM, Jason Wang wrote: On Fri, Oct 20, 2023 at 6:28 AM Si-Wei Liu wrote: On 10/19/2023 7:39 AM, Eugenio Perez Martin wrote: On Thu, Oct 19, 2023 at 10:27 AM Jason Wang wrote: On Thu, Oct 19, 2023 at 2:47 PM Si-Wei Liu wrote: On 10/18/2023 7:53 PM, Jason Wang wrote: On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu wrote: On 10/18/2023 12:00 AM, Jason Wang wrote: Unfortunately, it's a must to stick to ABI. I agree it's a mess but we don't have a better choice. Or we can fail the probe if userspace doesn't ack this feature. Antoher idea we can just do the following in vhost_vdpa reset? config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Then we don't have the burden to maintain them in the parent? Thanks Please see my earlier response in the other email, thanks. %<%< First, the ideal fix would be to leave this reset_vendor_mappings() emulation code on the individual driver itself, which already has the broken behavior. So the point is, not about whether the existing behavior is "broken" or not. Hold on, I thought earlier we all agreed upon that the existing behavior of vendor driver self-clearing maps during .reset violates the vhost iotlb abstraction and also breaks the .set_map/.dma_map API. This is 100% buggy driver implementation itself that we should discourage or eliminate as much as possible (that's part of the goal for this series), I'm not saying it's not an issue, what I'm saying is, if the fix breaks another userspace, it's a new bug in the kernel. See what Linus said in [1] "If a change results in user programs breaking, it's a bug in the kernel." but here you seem to go existentialism and suggests the very opposite that every .set_map/.dma_map driver implementation, regardless being the current or the new/upcoming, should unconditionally try to emulate the broken reset behavior for the sake of not breaking older userspace. Such "emulation" is not done at the parent level. New parents just need to implement reset_map() or not. 
everything could be done inside vhost-vDPA as pseudo code that is shown above. Set aside the criteria and definition for how userspace can be broken, can we step back to the original question why we think it's broken, and what we can do to promote good driver implementation instead of discuss the implementation details? I'm not sure I get the point of this question. I'm not saying we don't need to fix, what I am saying is that such a fix must be done in a negotiable way. And it's better if parents won't get any burden. It can just decide to implement reset_map() or not. Reading the below response I found my major points are not heard even if written for quite a few times. I try my best to not ignore any important things, but I can't promise I will not miss any. I hope the above clarifies my points. It's not that I don't understand the importance of not breaking old userspace, I appreciate your questions and extra patience, however I do feel the "broken" part is very relevant to our discussion here. If it's broken (in the sense of vhost IOTLB API) that you agree, I think we should at least allow good driver implementations; and when you think about the possibility of those valid good driver cases (.set_map/.dma_map implementations that do not clear maps in .reset), you might be able to see why it's coded the way as it is now. It's about whether we could stick to the old behaviour without too much cost. And I believe we could. And just to clarify here, reset_vendor_mappings() = config->reset_map() But today there's no backend feature negotiation between vhost-vdpa and the parent driver. Do we want to send down the acked_backend_features to parent drivers? There's no need to do that with the above code, or anything I missed here? 
config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Implementation issue: this implies reset_map() has to be there for every .set_map implementation, but vendor driver implementation for custom IOMMU could well implement DMA ops by itself instead of .reset_map. This won't work for every set_map driver (think about the vduse case). Well let me do it once again, reset_map() is not mandated: config->reset() if (IOTLB_PERSIST is not set) { if (config->reset_map) config->reset_map() To avoid new parent drivers I am afraid it's not just new parent drivers, but any well behaved driver today may well break userspace if we go with this forced emulation code, if they have to implement reset_map for some reason (e.g. restoring to 1:1 passthrough mapping or other default state in mapping). For new userspace and user driver we can guard against it using the IOTLB_PERSIST flag, but the above code would get a big chance to break setups with good drivers and older userspace in practice. And .reset_map implementation doesn't necessarily need to clear maps.
Re: [RFC v2 PATCH] vdpa_sim: implement .reset_map support
On 10/19/2023 2:29 AM, Stefano Garzarella wrote: On Wed, Oct 18, 2023 at 04:47:48PM -0700, Si-Wei Liu wrote: On 10/18/2023 1:05 AM, Stefano Garzarella wrote: On Tue, Oct 17, 2023 at 10:11:33PM -0700, Si-Wei Liu wrote: RFC only. Not tested on vdpa-sim-blk with user virtual address. Works fine with vdpa-sim-net which uses physical address to map. This patch is based on top of [1]. [1] https://lore.kernel.org/virtualization/1696928580-7520-1-git-send-email-si-wei@oracle.com/ Signed-off-by: Si-Wei Liu --- RFC v2: - initialize iotlb to passthrough mode in device add I tested this version and I didn't see any issue ;-) Great, thank you so much for your help on testing my patch, Stefano! You're welcome :-) Just for my own interest/curiosity, currently there's no vhost-vdpa backend client implemented for vdpa-sim-blk Yep, we developed libblkio [1]. libblkio exposes a common API to access block devices in userspace. It supports several drivers. The one useful for this use case is `virtio-blk-vhost-vdpa`. Here [2] are some examples on how to use the libblkio test suite with the vdpa-sim-blk. Since QEMU 7.2, it supports libblkio drivers, so you can use the following options to attach a vdpa-blk device to a VM: -blockdev node-name=drive_src1,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \ -device virtio-blk-pci,id=src1,bootindex=2,drive=drive_src1 \ For now only what we called slow-path [3][4] is supported, since the VQs are not directly exposed to the guest, but QEMU allocates other VQs (similar to shadow VQs for net) to support live-migration and QEMU storage features. Fast-path is on the agenda, but on pause for now. or any vdpa block device in userspace as yet, correct? Do you mean with VDUSE? In this case, yes, qemu-storage-daemon supports it, and can implement a virtio-blk in user space, exposing a disk image through VDUSE. There is an example in libblkio as well [5] on how to start it.
So there was no test specific to vhost-vdpa that needs to be exercised, right? I hope I answered above :-) Definitely! This is exactly what I needed, it's really useful! Much appreciated for the detailed information! I hadn't been aware of the latest status on libblkio drivers and qemu support since I last checked it (it was at some point right after KVM 2022, sorry my knowledge too outdated). I followed your links below and checked a few things, looks my change shouldn't affect anything. Good to see all the desired pieces landed to QEMU and libblkio already as planned, great job done! Cheers, -Siwei This reminded me that I need to write a blog post with all this information, I hope to do that soon! Stefano [1] https://gitlab.com/libblkio/libblkio [2] https://gitlab.com/libblkio/libblkio/-/blob/main/tests/meson.build?ref_type=heads#L42 [3] https://kvmforum2022.sched.com/event/15jK5/qemu-storage-daemon-and-libblkio-exploring-new-shores-for-the-qemu-block-layer-kevin-wolf-stefano-garzarella-red-hat [4] https://kvmforum2021.sched.com/event/ke3a/vdpa-blk-unified-hardware-and-software-offload-for-virtio-blk-stefano-garzarella-red-hat [5] https://gitlab.com/libblkio/libblkio/-/blob/main/tests/meson.build?ref_type=heads#L58 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH vhost v4 00/16] vdpa: Add support for vq descriptor mappings
For patches 05-16: Reviewed-by: Si-Wei Liu Tested-by: Si-Wei Liu Thanks for the fixes! On 10/18/2023 10:14 AM, Dragos Tatulea wrote: This patch series adds support for vq descriptor table mappings which are used to improve vdpa live migration downtime. The improvement comes from using smaller mappings which take less time to create and destroy in hw. The first part adds the vdpa core changes from Si-Wei [0]. The second part adds support in mlx5_vdpa: - Refactor the mr code to be able to cleanly add descriptor mappings. - Add hardware descriptor mr support. - Properly update iotlb for cvq during ASID switch. Changes in v4: - Improved the handling of empty iotlbs. See mlx5_vdpa_change_map section in patch "12/16 vdpa/mlx5: Improve mr update flow". - Fixed an invalid usage of the desc_group_mkey hw vq field when the capability is not there. See patch "15/16 vdpa/mlx5: Enable hw support for vq descriptor map". Changes in v3: - dup_iotlb now checks for the src == dst case and returns an error. - Renamed the iotlb parameter in dup_iotlb to dst. - Removed a redundant check of the asid value. - Fixed a commit message. - The mlx5_ifc.h patch has been applied to the mlx5-vhost tree. When applying this series please pull from that tree first. Changes in v2: - The "vdpa/mlx5: Enable hw support for vq descriptor mapping" change was split off into two patches to avoid merge conflicts with Linus' tree. The first patch contains only changes for mlx5_ifc.h. This must be applied to the mlx5-vdpa tree [1] first. Once this patch is applied on mlx5-vdpa, the change has to be pulled from mlx5-vdpa into the vhost tree and only then can the remaining patches be applied.
[0] https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com
[1] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/log/?h=mlx5-vhost

Dragos Tatulea (13):
  vdpa/mlx5: Expose descriptor group mkey hw capability
  vdpa/mlx5: Create helper function for dma mappings
  vdpa/mlx5: Decouple cvq iotlb handling from hw mapping code
  vdpa/mlx5: Take cvq iotlb lock during refresh
  vdpa/mlx5: Collapse "dvq" mr add/delete functions
  vdpa/mlx5: Rename mr destroy functions
  vdpa/mlx5: Allow creation/deletion of any given mr struct
  vdpa/mlx5: Move mr mutex out of mr struct
  vdpa/mlx5: Improve mr update flow
  vdpa/mlx5: Introduce mr for vq descriptor
  vdpa/mlx5: Enable hw support for vq descriptor mapping
  vdpa/mlx5: Make iotlb helper functions more generic
  vdpa/mlx5: Update cvq iotlb mapping on ASID change

Si-Wei Liu (3):
  vdpa: introduce dedicated descriptor group for virtqueue
  vhost-vdpa: introduce descriptor group backend feature
  vhost-vdpa: uAPI to get dedicated descriptor group id

 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  31 +++--
 drivers/vdpa/mlx5/core/mr.c        | 194 -
 drivers/vdpa/mlx5/core/resources.c |   6 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 105 +++-
 drivers/vhost/vdpa.c               |  27 
 include/linux/mlx5/mlx5_ifc.h      |   8 +-
 include/linux/mlx5/mlx5_ifc_vdpa.h |   7 +-
 include/linux/vdpa.h               |  11 ++
 include/uapi/linux/vhost.h         |   8 ++
 include/uapi/linux/vhost_types.h   |   5 +
 10 files changed, 272 insertions(+), 130 deletions(-)
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/17/2023 10:27 PM, Jason Wang wrote: If we do this without a negotiation, IOTLB will not be cleared but QEMU will try to re-program the IOTLB after reset. Which will break? 1) stick to the exact old behaviour with just one line of check It's not just one line of check here, the old behavior emulation has to be done as Eugenio illustrated in the other email. For vhost-vDPA it's just if (IOTLB_PERSIST is acked by userspace) reset_map() ... and this reset_map in vhost_vdpa_cleanup can't be negotiable depending on IOTLB_PERSIST. Consider the case where the user switches to virtio-vdpa after an older userspace using vhost-vdpa finished running. Even with buggy_virtio_reset_map in place there's no guarantee the vendor IOMMU can get back to the default state, e.g. ending with 1:1 passthrough mapping. If this isn't done unconditionally, there's a big chance it will break userspace. -Siwei For the parent, it's somehow similar: during .reset() if (IOTLB_PERSIST is not acked by userspace) reset_vendor_mappings() Anything I missed here?
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/19/2023 7:39 AM, Eugenio Perez Martin wrote: On Thu, Oct 19, 2023 at 10:27 AM Jason Wang wrote: On Thu, Oct 19, 2023 at 2:47 PM Si-Wei Liu wrote: On 10/18/2023 7:53 PM, Jason Wang wrote: On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu wrote: On 10/18/2023 12:00 AM, Jason Wang wrote: Unfortunately, it's a must to stick to ABI. I agree it's a mess but we don't have a better choice. Or we can fail the probe if userspace doesn't ack this feature. Another idea: we can just do the following in vhost_vdpa reset? config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Then we don't have the burden to maintain them in the parent? Thanks Please see my earlier response in the other email, thanks. %<%< First, the ideal fix would be to leave this reset_vendor_mappings() emulation code on the individual driver itself, which already has the broken behavior. So the point is, not about whether the existing behavior is "broken" or not. Hold on, I thought earlier we all agreed upon that the existing behavior of vendor driver self-clearing maps during .reset violates the vhost iotlb abstraction and also breaks the .set_map/.dma_map API. This is 100% buggy driver implementation itself that we should discourage or eliminate as much as possible (that's part of the goal for this series), I'm not saying it's not an issue, what I'm saying is, if the fix breaks another userspace, it's a new bug in the kernel. See what Linus said in [1]: "If a change results in user programs breaking, it's a bug in the kernel." but here you seem to go existentialist and suggest the very opposite: that every .set_map/.dma_map driver implementation, regardless of being the current or the new/upcoming, should unconditionally try to emulate the broken reset behavior for the sake of not breaking older userspace. Such "emulation" is not done at the parent level. New parents just need to implement reset_map() or not. Everything could be done inside vhost-vDPA as pseudo code that is shown above.
Set aside the criteria and definition for how userspace can be broken, can we step back to the original question why we think it's broken, and what we can do to promote good driver implementation instead of discuss the implementation details? I'm not sure I get the point of this question. I'm not saying we don't need to fix, what I am saying is that such a fix must be done in a negotiable way. And it's better if parents won't get any burden. It can just decide to implement reset_map() or not. Reading the below response I found my major points are not heard even if written for quite a few times. I try my best to not ignore any important things, but I can't promise I will not miss any. I hope the above clarifies my points. It's not that I don't understand the importance of not breaking old userspace, I appreciate your questions and extra patience, however I do feel the "broken" part is very relevant to our discussion here. If it's broken (in the sense of vhost IOTLB API) that you agree, I think we should at least allow good driver implementations; and when you think about the possibility of those valid good driver cases (.set_map/.dma_map implementations that do not clear maps in .reset), you might be able to see why it's coded the way as it is now. It's about whether we could stick to the old behaviour without too much cost. And I believe we could. And just to clarify here, reset_vendor_mappings() = config->reset_map() But today there's no backend feature negotiation between vhost-vdpa and the parent driver. Do we want to send down the acked_backend_features to parent drivers? There's no need to do that with the above code, or anything I missed here? config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Implementation issue: this implies reset_map() has to be there for every .set_map implementations, but vendor driver implementation for custom IOMMU could well implement DMA ops by itself instead of .reset_map. 
This won't work for every set_map driver (think about the vduse case). Well let me do it once again, reset_map() is not mandated: config->reset() if (IOTLB_PERSIST is not set) { if (config->reset_map) config->reset_map() To avoid new parent drivers I am afraid it's not just new parent drivers, but any well behaved driver today may well break userspace if go with this forced emulation code, if they have to implement reset_map for some reason (e.g. restored to 1:1 passthrough mapping or other default state in mapping). For new userspace and user driver we can guard against it using the IOTLB_PERSIST flag, but the above code would get a big chance to break setup with good driver and older userspace in practice. And .reset_map implementation doesn't necessarily need to clear maps. For e.g. IOMMU API compliant driver that only needs
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/18/2023 7:53 PM, Jason Wang wrote: On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu wrote: On 10/18/2023 12:00 AM, Jason Wang wrote: Unfortunately, it's a must to stick to ABI. I agree it's a mess but we don't have a better choice. Or we can fail the probe if userspace doesn't ack this feature. Another idea: we can just do the following in vhost_vdpa reset? config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Then we don't have the burden to maintain them in the parent? Thanks Please see my earlier response in the other email, thanks. %<%< First, the ideal fix would be to leave this reset_vendor_mappings() emulation code on the individual driver itself, which already has the broken behavior. So the point is, not about whether the existing behavior is "broken" or not. Hold on, I thought earlier we all agreed upon that the existing behavior of vendor driver self-clearing maps during .reset violates the vhost iotlb abstraction and also breaks the .set_map/.dma_map API. This is 100% buggy driver implementation itself that we should discourage or eliminate as much as possible (that's part of the goal for this series), but here you seem to go existentialist and suggest the very opposite: that every .set_map/.dma_map driver implementation, regardless of being the current or the new/upcoming, should unconditionally try to emulate the broken reset behavior for the sake of not breaking older userspace. Setting aside the criteria and definition for how userspace can be broken, can we step back to the original question of why we think it's broken, and what we can do to promote good driver implementation, instead of discussing the implementation details? Reading the below response I found my major points are not heard even if written for quite a few times. It's not that I don't understand the importance of not breaking old userspace, I appreciate your questions and extra patience, however I do feel the "broken" part is very relevant to our discussion here.
If it's broken (in the sense of the vhost IOTLB API), which you agree, I think we should at least allow good driver implementations; and when you think about the possibility of those valid good driver cases (.set_map/.dma_map implementations that do not clear maps in .reset), you might be able to see why it's coded the way it is now. It's about whether we could stick to the old behaviour without too much cost. And I believe we could. And just to clarify here, reset_vendor_mappings() = config->reset_map() But today there's no backend feature negotiation between vhost-vdpa and the parent driver. Do we want to send down the acked_backend_features to parent drivers? There's no need to do that with the above code, or anything I missed here? config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Implementation issue: this implies reset_map() has to be there for every .set_map implementation, but vendor driver implementation for custom IOMMU could well implement DMA ops by itself instead of .reset_map. This won't work for every set_map driver (think about the vduse case). But this is not the point I was making. I think if you agree this is purely a buggy driver implementation of its own, we should try to isolate this buggy behavior to the individual driver rather than overload vhost-vdpa or vdpa core's role to help implement the emulation of broken driver behavior. I don't get why .reset is special here; the abuse of .reset to manipulate mapping could also happen in other IOMMU-unrelated driver entries like in .suspend, or in queue_reset. If someday userspace is found to be coded around a similar buggy driver implementation in other driver ops, do we want to follow and duplicate the same emulation in vdpa core, as the precedent is already set here around .reset?
The buggy driver can fail in a lot of other ways indefinitely during reset; if there's a buggy driver that's already broken the way it is and happens to survive with all userspace apps, we just don't care and let it be. There's no way we can enumerate all those buggy behaviors in .reset_map itself, it's overloading that driver API too much. Second, IOTLB_PERSIST is needed but not sufficient. Due to the lack of backend feature negotiation in the parent driver, if vhost-vdpa has to provide the old-behaviour emulation for compatibility on the driver's behalf, it needs to be done on a per-driver basis. There could be good on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in .reset, and vendor specific IOMMU doesn't have to provide .reset_map, Then we just don't offer IOTLB_PERSIST, isn't this by design? Think about the vduse case, it can work with DMA ops directly so doesn't have to implement .reset_map, unless for some specific good reason. Because it's a conforming and valid/good driver implementation, we may still allow it to ad
[PATCH v3 0/4] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is rebased on top of the latest vhost tree. 
[1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html
---
v3:
- add .reset_map support to vdpa_sim
- introduce module parameter to provide bug-for-bug compatibility with older userspace
v2:
- improved commit message to clarify the intended scope of .reset_map API
- improved commit messages to clarify no breakage on older userspace
v1:
- rewrote commit messages to include more detailed description and background
- reword to vendor specific IOMMU implementation from on-chip IOMMU
- include parent device backend features to persistent iotlb precondition
- reimplement mlx5_vdpa patch on top of descriptor group series
RFC v3:
- fix missing return due to merge error in patch #4
RFC v2:
- rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/
---

Si-Wei Liu (5):
  vdpa: introduce .reset_map operation callback
  vhost-vdpa: reset vendor specific mapping to initial state in .release
  vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
  vdpa/mlx5: implement .reset_map driver op
  vdpa_sim: implement .reset_map support

 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
 drivers/vdpa/mlx5/core/mr.c        | 17 +
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 26 --
 drivers/vdpa/vdpa_sim/vdpa_sim.c   | 58 --
 drivers/vhost/vdpa.c               | 31 
 include/linux/vdpa.h               | 10 ++
 include/uapi/linux/vhost_types.h   |  2 ++
 7 files changed, 132 insertions(+), 13 deletions(-)

-- 
2.39.3
[PATCH v3 5/5] vdpa_sim: implement .reset_map support
In order to reduce excessive memory mapping cost in live migration and VM reboot, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the iotlb on the given ASID and recreate the 1:1 passthrough/identity mapping. To be consistent, the mapping on device creation is initialized to passthrough/identity with PA 1:1 mapped as IOVA. With this the device .reset op doesn't have to maintain and clean up memory mappings by itself.

Add a module parameter, iotlb_persist, to cater for older userspace which may wish to see mappings cleared during reset.

Signed-off-by: Si-Wei Liu
Tested-by: Stefano Garzarella
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 58 ++--
 1 file changed, 47 insertions(+), 11 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 76d41058add9..74506636375f 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -40,6 +40,10 @@ static bool use_va = true;
 module_param(use_va, bool, 0444);
 MODULE_PARM_DESC(use_va, "Enable/disable the device's ability to use VA");

+static bool iotlb_persist = true;
+module_param(iotlb_persist, bool, 0444);
+MODULE_PARM_DESC(iotlb_persist, "Enable/disable persistent iotlb across reset: 1 to keep maps, 0 to clear");
+
 #define VDPASIM_QUEUE_ALIGN PAGE_SIZE
 #define VDPASIM_QUEUE_MAX 256
 #define VDPASIM_VENDOR_ID 0
@@ -151,11 +155,13 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim)
 		spin_unlock(&vdpasim->iommu_lock);
 	}

-	for (i = 0; i < vdpasim->dev_attr.nas; i++) {
-		vhost_iotlb_reset(&vdpasim->iommu[i]);
-		vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX,
-				      0, VHOST_MAP_RW);
-		vdpasim->iommu_pt[i] = true;
+	if (unlikely(!iotlb_persist)) {
+		for (i = 0; i < vdpasim->dev_attr.nas; i++) {
+			vhost_iotlb_reset(&vdpasim->iommu[i]);
+			vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX,
+					      0, VHOST_MAP_RW);
+			vdpasim->iommu_pt[i] = true;
+		}
 	}

 	vdpasim->running = true;
@@ -166,8 +172,8 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim)
 	++vdpasim->generation;
 }

-static const struct vdpa_config_ops vdpasim_config_ops;
-static const struct vdpa_config_ops vdpasim_batch_config_ops;
+static struct vdpa_config_ops vdpasim_config_ops;
+static struct vdpa_config_ops vdpasim_batch_config_ops;

 static void vdpasim_work_fn(struct kthread_work *work)
 {
@@ -191,7 +197,7 @@ static void vdpasim_work_fn(struct kthread_work *work)
 struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr,
 			       const struct vdpa_dev_set_config *config)
 {
-	const struct vdpa_config_ops *ops;
+	struct vdpa_config_ops *ops;
 	struct vdpa_device *vdpa;
 	struct vdpasim *vdpasim;
 	struct device *dev;
@@ -213,6 +219,9 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr,
 	else
 		ops = &vdpasim_config_ops;

+	if (unlikely(!iotlb_persist))
+		ops->reset_map = NULL;
+
 	vdpa = __vdpa_alloc_device(NULL, ops,
 				   dev_attr->ngroups, dev_attr->nas,
 				   dev_attr->alloc_size,
@@ -259,8 +268,14 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr,
 	if (!vdpasim->iommu_pt)
 		goto err_iommu;

-	for (i = 0; i < vdpasim->dev_attr.nas; i++)
+	for (i = 0; i < vdpasim->dev_attr.nas; i++) {
 		vhost_iotlb_init(&vdpasim->iommu[i], max_iotlb_entries, 0);
+		if (likely(iotlb_persist)) {
+			vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, 0,
+					      VHOST_MAP_RW);
+			vdpasim->iommu_pt[i] = true;
+		}
+	}

 	for (i = 0; i < dev_attr->nvqs; i++)
 		vringh_set_iotlb(&vdpasim->vqs[i].vring, &vdpasim->iommu[0],
@@ -637,6 +652,25 @@ static int vdpasim_set_map(struct vdpa_device *vdpa, unsigned int asid,
 	return ret;
 }

+static int vdpasim_reset_map(struct vdpa_device *vdpa, unsigned int asid)
+{
+	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+
+	if (asid >= vdpasim->dev_attr.nas)
+		return -EINVAL;
+
+	spin_lock(&vdpasim->iommu_lock);
+	if (vdpasim->iommu_pt[asid])
+		goto out;
+	vhost_iotlb_reset(&vdpasim->iommu[asid]);
+	vhost_iotlb_add_range(&vdpasim->iommu[asid], 0, ULONG_MAX,
+			      0, VHOST_MAP_RW);
+	vdpasim->iommu_pt[asid] = true;
+out:
[PATCH v3 1/5] vdpa: introduce .reset_map operation callback
A device specific IOMMU parent driver that wishes to see mappings decoupled from the virtio or vdpa device life cycle (device reset) can use it to restore the memory mapping in the device IOMMU to the initial or default state. The reset of mapping is done on a per address space basis. The reason why a separate .reset_map op is introduced is that this allows a simple on-chip IOMMU model without exposing too much device implementation detail to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exist other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpected way, these on-chip IOMMU devices can start with 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch the iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model. The .reset_map is not a MUST for every parent that implements the .dma_map or .set_map API, because there could be software vDPA devices (which have use_va=true) that don't have to pin kernel memory so they don't care much about high mapping cost during device reset. And those software devices may have also implemented their own DMA ops, so they don't have to use .reset_map to achieve a simple IOMMU device model for 1:1 passthrough mapping, either.
Signed-off-by: Si-Wei Liu
Acked-by: Eugenio Pérez
Acked-by: Jason Wang
---
 include/linux/vdpa.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index d376309b99cf..26ae6ae1eac3 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -327,6 +327,15 @@ struct vdpa_map_file {
  *				@iova: iova to be unmapped
  *				@size: size of the area
  *				Returns integer: success (0) or error (< 0)
+ * @reset_map:			Reset device memory mapping to the default
+ *				state (optional)
+ *				Needed for devices that are using device
+ *				specific DMA translation and prefer mapping
+ *				to be decoupled from the virtio life cycle,
+ *				i.e. device .reset op does not reset mapping
+ *				@vdev: vdpa device
+ *				@asid: address space identifier
+ *				Returns integer: success (0) or error (< 0)
  * @get_vq_dma_dev:		Get the dma device for a specific
  *				virtqueue (optional)
  *				@vdev: vdpa device
@@ -405,6 +414,7 @@ struct vdpa_config_ops {
 			   u64 iova, u64 size, u64 pa, u32 perm, void *opaque);
 	int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
 			 u64 iova, u64 size);
+	int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
 	int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
 			      unsigned int asid);
 	struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx);
-- 
2.39.3
[PATCH v3 3/5] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to distinguish whether the vhost-vdpa iotlb in the kernel can be trusted to persist IOTLB mappings across vDPA reset. Without it, userspace has no way to tell whether it's running on an older kernel, which could silently drop all iotlb mappings across vDPA reset, especially with a broken parent driver implementation for the .reset driver op. The broken driver may incorrectly drop all mappings of its own as part of .reset, which inadvertently ends up with corrupted mapping state between vhost-vdpa userspace and the kernel. As a workaround, to make the mapping behaviour predictable across reset, userspace has to pro-actively remove all mappings before vDPA reset, and then restore all the mappings afterwards. This workaround is done unconditionally on top of all parent drivers today, due to the parent driver implementation issue and no means to differentiate. This workaround has been utilized in QEMU since day one when the corresponding vhost-vdpa userspace backend came to the world.

There are three cases where the backend may claim this feature bit:
- parent device that has to work with the platform IOMMU
- parent device with on-chip IOMMU that has the expected .reset_map support in the driver
- parent device with a vendor specific IOMMU implementation that already has persistent IOTLB mapping and has to specifically declare this backend feature

The reason why .reset_map is one of the preconditions for persistent iotlb is that without it, vhost-vdpa can't switch the iotlb back to the initial state later on, especially for the on-chip IOMMU case which starts with identity mapping at device creation. virtio-vdpa requires the on-chip IOMMU to perform 1:1 passthrough translation from PA to IOVA as-is to begin with, and .reset_map is the only means to turn the iotlb back to the identity mapping mode after vhost-vdpa is gone.
The difference in behavior did not matter as QEMU unmaps all the memory unregistering the memory listener at vhost_vdpa_dev_start( started = false), but the backend acknowledging this feature flag allows QEMU to make sure it is safe to skip this unmap & map in the case of vhost stop & start cycle. In that sense, this feature flag is actually a signal for userspace to know that the driver bug has been solved. Not offering it indicates that userspace cannot trust the kernel will retain the maps. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 15 +++ include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index a3f8160c9807..9202986a7d81 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -438,6 +438,15 @@ static u64 vhost_vdpa_get_backend_features(const struct vhost_vdpa *v) return ops->get_backend_features(vdpa); } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map || + vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); +} + static long vhost_vdpa_set_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -725,6 +734,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME) | BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK))) @@ -741,6 +751,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + 
return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -796,6 +809,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); features |= vhost_vdpa_get_backend_features(v); if (copy_to_user(featurep, &features, sizeof(features
[PATCH v3 2/5] vhost-vdpa: reset vendor specific mapping to initial state in .release
Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause mappings to go away on such an IOTLB model, which needs persistent mapping. Before vhost-vdpa goes away, give them a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f57b95..a3f8160c9807 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 4/5] vdpa/mlx5: implement .reset_map driver op
Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with a preallocated 1:1 DMA MR at device creation time. This 1:1 DMA MR will be implicitly destroyed while the first .set_map call is invoked, in which case callers like vhost-vdpa will start to set up custom mappings. When the .reset callback is invoked, the custom mappings will be cleared and the 1:1 DMA MR will be re-created. In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept around intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR (including cvq mapping) on the given ASID and recreate the initial DMA mapping. That way, the device .reset op runs free from having to maintain and clean up memory mappings by itself. Add a module parameter, persist_mapping, to cater for older userspace which may wish to see mappings cleared during reset. Co-developed-by: Dragos Tatulea Signed-off-by: Dragos Tatulea Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 + drivers/vdpa/mlx5/net/mlx5_vnet.c | 26 -- 3 files changed, 42 insertions(+), 2 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index db988ced5a5d..84547d998bcf 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...)
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 66530e28f327..2197c46e563a 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev) return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0); } + +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +{ + if (asid >= MLX5_VDPA_NUM_AS) + return -EINVAL; + + mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]); + + if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if (mlx5_vdpa_create_dma_mr(mvdev)) + mlx5_vdpa_warn(mvdev, "create DMA MR failed\n"); + } else { + mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid); + } + + return 0; +} diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index f4516a2d5bb0..e809ccec6048 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -25,6 +25,11 @@ MODULE_AUTHOR("Eli Cohen "); MODULE_DESCRIPTION("Mellanox VDPA driver"); MODULE_LICENSE("Dual BSD/GPL"); +static bool persist_mapping = true; +module_param(persist_mapping, bool, 0444); +MODULE_PARM_DESC(persist_mapping, +"Enable/disable persistent mapping across reset: 1 to keep, 0 to clear"); + #define VALID_FEATURES_MASK \ (BIT_ULL(VIRTIO_NET_F_CSUM) | BIT_ULL(VIRTIO_NET_F_GUEST_CSUM) | \ BIT_ULL(VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) | BIT_ULL(VIRTIO_NET_F_MTU) | BIT_ULL(VIRTIO_NET_F_MAC) | \ @@ -2888,7 +2893,8 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) unregister_link_notifier(ndev); teardown_driver(ndev); clear_vqs_ready(ndev); - mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); + if (unlikely(!persist_mapping)) + mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); ndev->mvdev.status = 0; ndev->mvdev.suspended = false; ndev->cur_num_vqs = 0; @@ -2899,7 +2905,7 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) init_group_to_asid_map(mvdev);
++mvdev->generation; - if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if (unlikely(!persist_mapping) && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { if (mlx5_vdpa_create_dma_mr(mvdev)) mlx5_vdpa_warn(mvdev, "create MR failed\n"); } @@ -2987,6 +2993,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid, return err; } +static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid)
Re: [RFC v2 PATCH] vdpa_sim: implement .reset_map support
On 10/18/2023 1:05 AM, Stefano Garzarella wrote: On Tue, Oct 17, 2023 at 10:11:33PM -0700, Si-Wei Liu wrote: RFC only. Not tested on vdpa-sim-blk with user virtual address. Works fine with vdpa-sim-net which uses physical address to map. This patch is based on top of [1]. [1] https://lore.kernel.org/virtualization/1696928580-7520-1-git-send-email-si-wei@oracle.com/ Signed-off-by: Si-Wei Liu --- RFC v2: - initialize iotlb to passthrough mode in device add I tested this version and I didn't see any issue ;-) Great, thank you so much for your help on testing my patch, Stefano! Just for my own interest/curiosity, currently there's no vhost-vdpa backend client implemented for vdpa-sim-blk or any vdpa block device in userspace as yet, correct? So there was no test specific to vhost-vdpa that needs to be exercised, right? Thanks, -Siwei Tested-by: Stefano Garzarella --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 34 1 file changed, 26 insertions(+), 8 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 76d41058add9..2a0a6042d61d 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -151,13 +151,6 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim) >iommu_lock); } - for (i = 0; i < vdpasim->dev_attr.nas; i++) { - vhost_iotlb_reset(>iommu[i]); - vhost_iotlb_add_range(>iommu[i], 0, ULONG_MAX, - 0, VHOST_MAP_RW); - vdpasim->iommu_pt[i] = true; - } - vdpasim->running = true; spin_unlock(>iommu_lock); @@ -259,8 +252,12 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr, if (!vdpasim->iommu_pt) goto err_iommu; - for (i = 0; i < vdpasim->dev_attr.nas; i++) + for (i = 0; i < vdpasim->dev_attr.nas; i++) { vhost_iotlb_init(>iommu[i], max_iotlb_entries, 0); + vhost_iotlb_add_range(>iommu[i], 0, ULONG_MAX, 0, + VHOST_MAP_RW); + vdpasim->iommu_pt[i] = true; + } for (i = 0; i < dev_attr->nvqs; i++) vringh_set_iotlb(>vqs[i].vring, >iommu[0], @@ -637,6 +634,25 @@ static int 
vdpasim_set_map(struct vdpa_device *vdpa, unsigned int asid, return ret; } +static int vdpasim_reset_map(struct vdpa_device *vdpa, unsigned int asid) +{ + struct vdpasim *vdpasim = vdpa_to_sim(vdpa); + + if (asid >= vdpasim->dev_attr.nas) + return -EINVAL; + + spin_lock(>iommu_lock); + if (vdpasim->iommu_pt[asid]) + goto out; + vhost_iotlb_reset(>iommu[asid]); + vhost_iotlb_add_range(>iommu[asid], 0, ULONG_MAX, + 0, VHOST_MAP_RW); + vdpasim->iommu_pt[asid] = true; +out: + spin_unlock(>iommu_lock); + return 0; +} + static int vdpasim_bind_mm(struct vdpa_device *vdpa, struct mm_struct *mm) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); @@ -759,6 +775,7 @@ static const struct vdpa_config_ops vdpasim_config_ops = { .set_group_asid = vdpasim_set_group_asid, .dma_map = vdpasim_dma_map, .dma_unmap = vdpasim_dma_unmap, + .reset_map = vdpasim_reset_map, .bind_mm = vdpasim_bind_mm, .unbind_mm = vdpasim_unbind_mm, .free = vdpasim_free, @@ -796,6 +813,7 @@ static const struct vdpa_config_ops vdpasim_batch_config_ops = { .get_iova_range = vdpasim_get_iova_range, .set_group_asid = vdpasim_set_group_asid, .set_map = vdpasim_set_map, + .reset_map = vdpasim_reset_map, .bind_mm = vdpasim_bind_mm, .unbind_mm = vdpasim_unbind_mm, .free = vdpasim_free, -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/18/2023 4:14 AM, Eugenio Perez Martin wrote: On Wed, Oct 18, 2023 at 10:44 AM Si-Wei Liu wrote: On 10/17/2023 10:27 PM, Jason Wang wrote: On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu wrote: On 10/16/2023 7:35 PM, Jason Wang wrote: On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu wrote: On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote: On Mon, Oct 16, 2023 at 8:33 AM Jason Wang wrote: On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings by its own, independent of virtio device state. For instance, device reset does not cause mapping go away on such IOTLB model in need of persistent mapping. Before vhost-vdpa is going away, give them a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). 
Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(>hash_link); vhost_vdpa_iotlb_unmap(v, >iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ Should we do this according to whether IOTLB_PRESIST is set? Well, in theory this seems like so but it's unnecessary code change actually, as that is the way how vDPA parent behind platform IOMMU works today, and userspace doesn't break as of today. :) Well, this is one question I've ever asked before. You have explained that one of the reason that we don't break userspace is that they may couple IOTLB reset with vDPA reset as well. One example is the Qemu. As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today is already tolerating enough with either good or bad IOMMU. I'm confused, how to define tolerating here? 
Tolerating defined as QEMU has to proactively unmap before reset just to workaround the driver bug (on-chip maps out of sync), unconditionally for platform or on-chip. While we all know it doesn't have to do so for platform IOMMU, though userspace has no means to distinguish. That said, userspace is sacrificing reset time performance on platform IOMMU setup just for working around buggy implementation in the other setup. Ok, so what you actually mean is that userspace can tolerate the "bug" with the performance penalty. Right. For example, if it has tolerance, why bother? I'm not sure I get the question. But I think userspace is compromising because of buggy implementation in a few drivers doesn't mean we should uniformly enforce such behavior for all set_map/dma_map implementations. This is not my point. I meant, we can fix we need a negotiation in order to let some "buggy" old user space to survive from the changes. Userspace is no buggy today, how to define "buggy"? Userspace with tolerance could survive just fine no matter if this negotiation or buggy driver behavior emulation is around or not. If any userspace doesn't tolerate, it can work still fine on good on-chip IOMMU or platform IOMMU, no matter if the negotiation is around or not. This code of not checking IOTLB_PERSIST being set is intentional, there's no point to emulate bad IOMMU behavior even for older userspace (with improper emulation to be done it would result in even worse performance). I can easily imagine a case: The old Qemu that works only with a setup like mlx5_vdpa. Noted, seems to me there's no such case of a userspace implementation that only works with mlx5_vdpa or its friends, but doesn't work with the others e.g. platform IOMMU, or well behaving on-chip IOMMU implementations. It's not hard t
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/18/2023 12:00 AM, Jason Wang wrote: Unfortunately, it's a must to stick to ABI. I agree it's a mess but we don't have a better choice. Or we can fail the probe if userspace doesn't ack this feature. Another idea: we can just do the following in vhost_vdpa reset? config->reset() if (IOTLB_PERSIST is not set) { config->reset_map() } Then we don't have the burden to maintain them in the parent? Thanks Please see my earlier response in the other email, thanks. %<%< First, the ideal fix would be to leave this reset_vendor_mappings() emulation code on the individual driver itself, which already has the broken behavior. But today there's no backend feature negotiation between vhost-vdpa and the parent driver. Do we want to send down the acked_backend_features to parent drivers? Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of backend feature negotiation in the parent driver, if vhost-vdpa has to provide the old-behaviour emulation for compatibility on the driver's behalf, it needs to be done on a per-driver basis. There could be good on-chip or vendor IOMMU implementations which don't clear the IOTLB in .reset, and a vendor specific IOMMU doesn't have to provide .reset_map; we should allow these good driver implementations rather than unconditionally stick to some specific problematic behavior for every other good driver. Then we need a set of device flags (backend_features bit again?) to indicate that the specific driver needs the upper layer's help on old-behaviour emulation. Last but not least, I'm not sure how to properly emulate reset_vendor_mappings() from the vhost-vdpa layer. If a vendor driver has no .reset_map op implemented, or if .reset_map has a slightly different implementation than what it used to reset the iotlb in the .reset op, then this either becomes effectively dead code if no one ends up using it, or the vhost-vdpa emulation is helpless and limited in scope, unable to cover all the cases.
%<%<
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/17/2023 10:27 PM, Jason Wang wrote: On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu wrote: On 10/16/2023 7:35 PM, Jason Wang wrote: On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu wrote: On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote: On Mon, Oct 16, 2023 at 8:33 AM Jason Wang wrote: On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings by its own, independent of virtio device state. For instance, device reset does not cause mapping go away on such IOTLB model in need of persistent mapping. Before vhost-vdpa is going away, give them a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(>hash_link); vhost_vdpa_iotlb_unmap(v, >iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled 
from the virtio device life cycle. +*/ Should we do this according to whether IOTLB_PRESIST is set? Well, in theory this seems like so but it's unnecessary code change actually, as that is the way how vDPA parent behind platform IOMMU works today, and userspace doesn't break as of today. :) Well, this is one question I've ever asked before. You have explained that one of the reason that we don't break userspace is that they may couple IOTLB reset with vDPA reset as well. One example is the Qemu. As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today is already tolerating enough with either good or bad IOMMU. I'm confused, how to define tolerating here? Tolerating defined as QEMU has to proactively unmap before reset just to workaround the driver bug (on-chip maps out of sync), unconditionally for platform or on-chip. While we all know it doesn't have to do so for platform IOMMU, though userspace has no means to distinguish. That said, userspace is sacrificing reset time performance on platform IOMMU setup just for working around buggy implementation in the other setup. Ok, so what you actually mean is that userspace can tolerate the "bug" with the performance penalty. Right. For example, if it has tolerance, why bother? I'm not sure I get the question. But I think userspace is compromising because of buggy implementation in a few drivers doesn't mean we should uniformly enforce such behavior for all set_map/dma_map implementations. This is not my point. I meant, we can fix we need a negotiation in order to let some "buggy" old user space to survive from the changes. Userspace is no buggy today, how to define "buggy"? Userspace with tolerance could survive just fine no matter if this negotiation or buggy driver behavior emulation is around or not. 
If any userspace doesn't tolerate, it can work still fine on good on-chip IOMMU or platform IOMMU, no matter if the negotiation is around or not. This code of not checking IOTLB_PERSIST being set is intentional, there's no point to emulate bad IOMMU behavior even for older userspace (with improper emulation to be done it would result in even worse performance). I can easily imagine a case: The old Qemu that works only with a setup like mlx5_vdpa. Noted, seems to me there's no such case of a userspace implementation that only works with mlx5_vdpa or its friends, but doesn't work with the others e.g. platform IOMMU, or well behaving on-chip IOMMU implementations. It's not hard to think of a case where: 1) the environment has mlx5_vdpa only 2) kernel doc can't have endless details, so when
Re: [RFC PATCH] vdpa_sim: implement .reset_map support
Hi Stefano, On 10/17/2023 6:44 AM, Stefano Garzarella wrote: On Fri, Oct 13, 2023 at 10:29:26AM -0700, Si-Wei Liu wrote: Hi Stefano, On 10/13/2023 2:22 AM, Stefano Garzarella wrote: Hi Si-Wei, On Fri, Oct 13, 2023 at 01:23:40AM -0700, Si-Wei Liu wrote: RFC only. Not tested on vdpa-sim-blk with user virtual address. I can test it, but what I should stress? Great, thank you! As you see, my patch moved vhost_iotlb_reset out of vdpasim_reset for the sake of decoupling mapping from vdpa device reset. For hardware devices this decoupling makes sense as platform IOMMU already did it. But I'm not sure if there's something in the software device (esp. with vdpa-blk and the userspace library stack) that may have to rely on the current .reset behavior that clears the vhost_iotlb. So perhaps you can try to exercise every possible case involving blk device reset, and see if anything (related to mapping) breaks? I just tried these steps without using a VM and the host kernel hangs after adding the device: [root@f38-vm-build ~]# modprobe virtio-vdpa [root@f38-vm-build ~]# modprobe vdpa-sim-blk [root@f38-vm-build ~]# vdpa dev add mgmtdev vdpasim_blk name blk0 [ 35.284575][ T563] virtio_blk virtio6: 1/0/0 default/read/poll queues [ 35.286372][ T563] virtio_blk virtio6: [vdb] 262144 512-byte logical blocks (134 MB/128 MiB) [ 35.295271][ T564] vringh: Reverting this patch (so building "vdpa/mlx5: implement .reset_map driver op") worked here. I'm sorry, the previous RFC patch was incomplete - please see the v2 I just posted. Tested both use_va and !use_va on vdpa-sim-blk, and raw disk copy to the vdpa block simulator using dd seems fine. Just let me know how it goes on your side this time. Thanks, -Siwei Works fine with vdpa-sim-net which uses physical address to map. Can you share your tests? so I'll try to do the same with blk. Basically everything involving virtio device reset in the guest, e.g. 
reboot the VM, remove/unbind then reprobe/bind the virtio-net module/driver, then see if device I/O (which needs mapping properly) is still flowing as expected. And then everything else that could trigger QEMU's vhost_dev_start/stop paths ending up as a passive vhost-vdpa backend reset, e.g. link status change, suspend/hibernate, SVQ switch and live migration. I am not sure if vdpa-blk supports live migration through SVQ or not; if not, you don't need to worry about it. This patch is based on top of [1]. [1] https://lore.kernel.org/virtualization/1696928580-7520-1-git-send-email-si-wei@oracle.com/ The series does not apply well on master or vhost tree. Where should I apply it? Sent the link through another email offline. Received, thanks! Stefano
[RFC v2 PATCH] vdpa_sim: implement .reset_map support
RFC only. Not tested on vdpa-sim-blk with user virtual address. Works fine with vdpa-sim-net which uses physical address to map. This patch is based on top of [1]. [1] https://lore.kernel.org/virtualization/1696928580-7520-1-git-send-email-si-wei@oracle.com/ Signed-off-by: Si-Wei Liu --- RFC v2: - initialize iotlb to passthrough mode in device add --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 34 1 file changed, 26 insertions(+), 8 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 76d41058add9..2a0a6042d61d 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -151,13 +151,6 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim) &vdpasim->iommu_lock); } - for (i = 0; i < vdpasim->dev_attr.nas; i++) { - vhost_iotlb_reset(&vdpasim->iommu[i]); - vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, - 0, VHOST_MAP_RW); - vdpasim->iommu_pt[i] = true; - } - vdpasim->running = true; spin_unlock(&vdpasim->iommu_lock); @@ -259,8 +252,12 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr, if (!vdpasim->iommu_pt) goto err_iommu; - for (i = 0; i < vdpasim->dev_attr.nas; i++) + for (i = 0; i < vdpasim->dev_attr.nas; i++) { vhost_iotlb_init(&vdpasim->iommu[i], max_iotlb_entries, 0); + vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, 0, + VHOST_MAP_RW); + vdpasim->iommu_pt[i] = true; + } for (i = 0; i < dev_attr->nvqs; i++) vringh_set_iotlb(&vdpasim->vqs[i].vring, &vdpasim->iommu[0], @@ -637,6 +634,25 @@ static int vdpasim_set_map(struct vdpa_device *vdpa, unsigned int asid, return ret; } +static int vdpasim_reset_map(struct vdpa_device *vdpa, unsigned int asid) +{ + struct vdpasim *vdpasim = vdpa_to_sim(vdpa); + + if (asid >= vdpasim->dev_attr.nas) + return -EINVAL; + + spin_lock(&vdpasim->iommu_lock); + if (vdpasim->iommu_pt[asid]) + goto out; + vhost_iotlb_reset(&vdpasim->iommu[asid]); + vhost_iotlb_add_range(&vdpasim->iommu[asid], 0, ULONG_MAX, + 0, VHOST_MAP_RW); + vdpasim->iommu_pt[asid] = true; +out: + spin_unlock(&vdpasim->iommu_lock); + return 0; +} + static int
vdpasim_bind_mm(struct vdpa_device *vdpa, struct mm_struct *mm) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); @@ -759,6 +775,7 @@ static const struct vdpa_config_ops vdpasim_config_ops = { .set_group_asid = vdpasim_set_group_asid, .dma_map= vdpasim_dma_map, .dma_unmap = vdpasim_dma_unmap, + .reset_map = vdpasim_reset_map, .bind_mm= vdpasim_bind_mm, .unbind_mm = vdpasim_unbind_mm, .free = vdpasim_free, @@ -796,6 +813,7 @@ static const struct vdpa_config_ops vdpasim_batch_config_ops = { .get_iova_range = vdpasim_get_iova_range, .set_group_asid = vdpasim_set_group_asid, .set_map= vdpasim_set_map, + .reset_map = vdpasim_reset_map, .bind_mm= vdpasim_bind_mm, .unbind_mm = vdpasim_unbind_mm, .free = vdpasim_free, -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/16/2023 7:35 PM, Jason Wang wrote: On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu wrote: On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote: On Mon, Oct 16, 2023 at 8:33 AM Jason Wang wrote: On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings by its own, independent of virtio device state. For instance, device reset does not cause mapping go away on such IOTLB model in need of persistent mapping. Before vhost-vdpa is going away, give them a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(>hash_link); vhost_vdpa_iotlb_unmap(v, >iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. 
+*/ Should we do this according to whether IOTLB_PRESIST is set? Well, in theory this seems like so but it's unnecessary code change actually, as that is the way how vDPA parent behind platform IOMMU works today, and userspace doesn't break as of today. :) Well, this is one question I've ever asked before. You have explained that one of the reason that we don't break userspace is that they may couple IOTLB reset with vDPA reset as well. One example is the Qemu. As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today is already tolerating enough with either good or bad IOMMU. I'm confused, how to define tolerating here? Tolerating defined as QEMU has to proactively unmap before reset just to workaround the driver bug (on-chip maps out of sync), unconditionally for platform or on-chip. While we all know it doesn't have to do so for platform IOMMU, though userspace has no means to distinguish. That said, userspace is sacrificing reset time performance on platform IOMMU setup just for working around buggy implementation in the other setup. For example, if it has tolerance, why bother? I'm not sure I get the question. But I think userspace is compromising because of buggy implementation in a few drivers doesn't mean we should uniformly enforce such behavior for all set_map/dma_map implementations. This code of not checking IOTLB_PERSIST being set is intentional, there's no point to emulate bad IOMMU behavior even for older userspace (with improper emulation to be done it would result in even worse performance). I can easily imagine a case: The old Qemu that works only with a setup like mlx5_vdpa. Noted, seems to me there's no such case of a userspace implementation that only works with mlx5_vdpa or its friends, but doesn't work with the others e.g. platform IOMMU, or well behaving on-chip IOMMU implementations. 
The Unmap+remap trick around vdpa reset works totally fine for platform IOMMU, except with sub-optimal performance. Other than this trick, I cannot easily think of other means or iotlb message sequence for userspace to recover the bogus state and make iotlb back to work again after reset. Are we talking about hypnosis that has no real basis to exist in the real world? If we do this without a negotiation, IOTLB will not be clear but the Qemu will try to re-program the IOTLB after reset. Which will break? 1) stick the exact old behaviour with just one line of check It's not just one line of check here, the old behavior emulation has to be done as Eugenio illustrated in the other email. In addition, the emulation has to limit to those buggy drivers as I don't feel this emulation should apply uniformly to all futu
[PATCH v2 4/4] vdpa/mlx5: implement .reset_map driver op
Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with a preallocated 1:1 DMA MR at device creation time. This 1:1 DMA MR will be implicitly destroyed when the first .set_map call is invoked, in which case callers like vhost-vdpa will start to set up custom mappings. When the .reset callback is invoked, the custom mappings will be cleared and the 1:1 DMA MR will be re-created. In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept around intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR on the given ASID and recreate the initial DMA mapping. That way, the device .reset op can run free from having to maintain and clean up memory mappings by itself. The cvq mapping also needs to be cleared if it is in the given ASID. Co-developed-by: Dragos Tatulea Signed-off-by: Dragos Tatulea Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 + drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +- 3 files changed, 31 insertions(+), 5 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index db988ced5a5d..84547d998bcf 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 66530e28f327..2197c46e563a 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev) return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0); } + +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +{ + if (asid >= MLX5_VDPA_NUM_AS) + return -EINVAL; + + mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]); + + if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if (mlx5_vdpa_create_dma_mr(mvdev)) + mlx5_vdpa_warn(mvdev, "create DMA MR failed\n"); + } else { + mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid); + } + + return 0; +} diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 6abe02310f2b..928e71bc5571 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -2838,7 +2838,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) unregister_link_notifier(ndev); teardown_driver(ndev); clear_vqs_ready(ndev); - mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); ndev->mvdev.status = 0; ndev->mvdev.suspended = false; ndev->cur_num_vqs = 0; @@ -2849,10 +2848,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) init_group_to_asid_map(mvdev); ++mvdev->generation; - if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { - if (mlx5_vdpa_create_dma_mr(mvdev)) - mlx5_vdpa_warn(mvdev, "create MR failed\n"); - } up_write(&ndev->reslock); return 0; @@ -2932,6 +2927,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid, return err; } +static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid) +{ + struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); + struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); + int err; + + down_write(&ndev->reslock); + err = mlx5_vdpa_reset_mr(mvdev, asid); + up_write(&ndev->reslock); + return err; +} + 
static struct device *mlx5_get_vq_dma_dev(struct vdpa_device *vdev, u16 idx) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); @@ -3199,6 +3206,7 @@ static const struct vdpa_config_ops mlx5_vdpa_ops = { .set_config = mlx5_vdpa_set_config, .get_generation = mlx5_vdpa_get_generation, .set_map = mlx5_vdpa_set_map, + .reset_map = mlx5_vdpa_reset_map, .set_group_asid = mlx5_set_group_asid, .get_vq_dma_dev = mlx5_get_vq_dma_dev, .free = mlx5_vdpa_free, -- 2.39.3
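As a sanity check of the branch structure in mlx5_vdpa_reset_mr() from the patch above, here is a toy user-space C model. All names (toy_reset_mr, toy_mvdev, etc.) are invented for illustration; it mirrors the shape of the patch — destroy the regular MR for the ASID, then recreate the initial 1:1 DMA MR only for ASID 0 on parents with the umem_uid_0 capability — not the real driver.

```c
#include <stdbool.h>

#define TOY_NUM_AS 2

struct toy_mvdev {
	bool mr[TOY_NUM_AS];	/* regular (custom) MR present per ASID */
	bool dma_mr;		/* initial 1:1 DMA MR present */
	bool cap_umem_uid_0;	/* models MLX5_CAP_GEN(mdev, umem_uid_0) */
};

int toy_reset_mr(struct toy_mvdev *m, unsigned int asid)
{
	if (asid >= TOY_NUM_AS)
		return -1;	/* -EINVAL in the real driver */

	m->mr[asid] = false;	/* destroy the regular MR */

	if (asid == 0 && m->cap_umem_uid_0)
		m->dma_mr = true;	/* recreate the initial DMA mapping */
	/* else: only the CVQ iotlb for this ASID is refreshed (not modeled) */

	return 0;
}
```

The point the model makes is that after .reset_map, ASID 0 is back in the same 1:1 state it had at device creation, without the device .reset op ever touching mappings.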
[PATCH v2 0/4] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce the needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For context, those on-chip IOMMU parent devices create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is based on the descriptor group v3 series from Dragos. 
[2] [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html [2] [PATCH vhost v3 00/16] vdpa: Add support for vq descriptor mappings https://lore.kernel.org/lkml/20231009112401.1060447-1-dtatu...@nvidia.com/ --- v2: - improved commit message to clarify the intended scope of .reset_map API - improved commit messages to clarify no breakage on older userspace v1: - rewrote commit messages to include more detailed description and background - reword on-chip IOMMU to vendor specific IOMMU implementation - include parent device backend features to persistent iotlb precondition - reimplement mlx5_vdpa patch on top of descriptor group series RFC v3: - fix missing return due to merge error in patch #4 RFC v2: - rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/ --- Si-Wei Liu (4): vdpa: introduce .reset_map operation callback vhost-vdpa: reset vendor specific mapping to initial state in .release vhost-vdpa: introduce IOTLB_PERSIST backend feature bit vdpa/mlx5: implement .reset_map driver op drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 - drivers/vhost/vdpa.c | 31 ++ include/linux/vdpa.h | 10 ++ include/uapi/linux/vhost_types.h | 2 ++ 6 files changed, 74 insertions(+), 5 deletions(-) -- 2.39.3
[PATCH v2 1/4] vdpa: introduce .reset_map operation callback
A device specific IOMMU parent driver that wishes to see mapping decoupled from the virtio or vdpa device life cycle (device reset) can use it to restore memory mapping in the device IOMMU to the initial or default state. The reset of mapping is done on a per address space basis. The reason why a separate .reset_map op is introduced is that it allows a simple on-chip IOMMU model without exposing too many device implementation details to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exist other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpected way, these on-chip IOMMU devices can start with 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch the iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model. The .reset_map is not a MUST for every parent that implements the .dma_map or .set_map API, because there could be software vDPA devices (which have use_va=true) that don't have to pin kernel memory, so they don't care much about high mapping cost during device reset. And those software devices may have also implemented their own DMA ops, so they don't have to use .reset_map to achieve a simple IOMMU device model for 1:1 passthrough mapping, either. 
Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez Acked-by: Jason Wang --- include/linux/vdpa.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index d376309b99cf..26ae6ae1eac3 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -327,6 +327,15 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping to the default + * state (optional) + * Needed for devices that are using device + * specific DMA translation and prefer mapping + * to be decoupled from the virtio life cycle, + * i.e. device .reset op does not reset mapping + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -405,6 +414,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 2.39.3
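The life cycle the .reset_map op enables can be sketched with a toy user-space C model (illustrative only; all names are invented): an address space starts in 1:1 passthrough, the first set_map switches it to custom mappings, device reset no longer touches mappings at all, and reset_map alone returns the address space to its initial state.

```c
#include <stdbool.h>

enum map_state { MAP_PASSTHROUGH, MAP_CUSTOM };

struct toy_as {
	enum map_state state;
	int n_entries;		/* number of custom iotlb entries */
};

void toy_as_create(struct toy_as *as)
{
	as->state = MAP_PASSTHROUGH;	/* 1:1 DMA mapping at creation */
	as->n_entries = 0;
}

void toy_set_map(struct toy_as *as, int n)
{
	as->state = MAP_CUSTOM;		/* implicitly destroys the 1:1 mapping */
	as->n_entries = n;
}

/* Mirrors the intent of .reset_map: back to the default state. */
void toy_reset_map(struct toy_as *as)
{
	toy_as_create(as);
}

/* With mappings decoupled, virtio device reset is a no-op on the iotlb. */
void toy_device_reset(struct toy_as *as)
{
	(void)as;
}
```

This separation is exactly what lets vhost-vdpa call reset_map only when the device is being released back to the bus, while live-migration resets leave the costly mappings in place.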
[PATCH v2 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause the mapping to go away on such an IOTLB model in need of persistent mapping. Before vhost-vdpa goes away, give them a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f57b95..a3f8160c9807 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 2.39.3
[PATCH v2 3/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to tell whether the vhost-vdpa iotlb in the kernel can be trusted to persist IOTLB mapping across vDPA reset. Without it, userspace has no way to tell if it's running on an older kernel, which could silently drop all iotlb mapping across vDPA reset, especially with a broken parent driver implementation for the .reset driver op. The broken driver may incorrectly drop all mappings of its own as part of .reset, which inadvertently ends up with corrupted mapping state between vhost-vdpa userspace and the kernel. As a workaround, to make the mapping behaviour predictable across reset, userspace has to proactively remove all mappings before vDPA reset, and then restore all the mappings afterwards. This workaround is done unconditionally on top of all parent drivers today, due to the parent driver implementation issue and no means to differentiate. This workaround had been utilized in QEMU since day one when the corresponding vhost-vdpa userspace backend came to the world. There are 3 cases where the backend may claim this feature bit: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver - parent device with vendor specific IOMMU implementation with persistent IOTLB mapping already that has to specifically declare this backend feature The reason why .reset_map is one of the pre-conditions for persistent iotlb is that without it, vhost-vdpa can't switch the iotlb back to the initial state later on, especially for the on-chip IOMMU case which starts with identity mapping at device creation. virtio-vdpa requires on-chip IOMMU to perform 1:1 passthrough translation from PA to IOVA as-is to begin with, and .reset_map is the only means to turn the iotlb back to the identity mapping mode after vhost-vdpa is gone. 
The difference in behavior did not matter as QEMU unmaps all the memory unregistering the memory listener at vhost_vdpa_dev_start(..., started = false), but the backend acknowledging this feature flag allows QEMU to make sure it is safe to skip this unmap & map in the case of a vhost stop & start cycle. In that sense, this feature flag is actually a signal for userspace to know that the driver bug has been solved. Not offering it indicates that userspace cannot trust the kernel will retain the maps. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 15 +++ include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index a3f8160c9807..9202986a7d81 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -438,6 +438,15 @@ static u64 vhost_vdpa_get_backend_features(const struct vhost_vdpa *v) return ops->get_backend_features(vdpa); } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map || + vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); +} + static long vhost_vdpa_set_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -725,6 +734,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME) | BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK))) 
return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -796,6 +809,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); features |= vhost_vdpa_get_backend_features(v); if (copy_to_user(featurep, &features, sizeof(features
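The negotiation predicate in the patch above can be exercised standalone with a small illustrative C re-implementation. The struct and the bit position here are assumptions made for this sketch; the real check is vhost_vdpa_has_persistent_map() and the authoritative bit value lives in include/uapi/linux/vhost_types.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define TOY_F_IOTLB_PERSIST_BIT 8

struct toy_parent {
	bool has_set_map;	/* parent implements .set_map */
	bool has_dma_map;	/* parent implements .dma_map/.dma_unmap */
	bool has_reset_map;	/* parent implements .reset_map */
	uint64_t backend_features;
};

/* True iff the parent falls into one of the 3 cases from the commit
 * message: platform IOMMU (neither set_map nor dma_map), .reset_map
 * support, or an explicit claim via the parent's backend features. */
bool toy_has_persistent_map(const struct toy_parent *p)
{
	return (!p->has_set_map && !p->has_dma_map) ||
	       p->has_reset_map ||
	       (p->backend_features & (1ULL << TOY_F_IOTLB_PERSIST_BIT));
}
```

Note how a parent that implements .set_map but offers neither .reset_map nor the explicit claim is exactly the "broken driver" case: the kernel then refuses to advertise IOTLB_PERSIST, and userspace keeps its unmap-before-reset workaround.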
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote: On Mon, Oct 16, 2023 at 8:33 AM Jason Wang wrote: On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause the mapping to go away on such an IOTLB model in need of persistent mapping. Before vhost-vdpa goes away, give them a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ Should we do this according to whether IOTLB_PERSIST is set? 
Well, in theory this seems like so, but it's actually an unnecessary code change, as that is how a vDPA parent behind a platform IOMMU works today, and userspace doesn't break as of today. :) Well, this is a question I've asked before. You have explained that one of the reasons we don't break userspace is that they may couple IOTLB reset with vDPA reset as well. One example is the Qemu. As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today is already tolerating enough with either good or bad IOMMU. This code not checking whether IOTLB_PERSIST is set is intentional; there's no point emulating bad IOMMU behavior even for older userspace (with improper emulation to be done it would result in even worse performance). For two reasons: 1) backend features need to be acked by userspace, this is by design 2) keeping the odd behaviour seems safer as we can't audit every userspace program The old behavior (without flag ack) cannot be trusted already, as: * Devices using platform IOMMU (in other words, implementing neither .set_map nor .dma_map) do not unmap memory at virtio reset. * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb called from mlx5_vdpa_reset). With the vdpa_sim patch removing the reset, now all backends work the same as far as I know, which was (and is) the way devices using the platform IOMMU work. The difference in behavior did not matter as QEMU unmaps all the memory unregistering the memory listener at vhost_vdpa_dev_start(..., started = false), Exactly. It's not just QEMU, but any (older) userspace that manipulates mappings through the vhost-vdpa iotlb interface has to unmap all mappings to work around the vdpa parent driver bug. 
If they don't do the explicit unmap, it would cause state inconsistency between vhost-vdpa and the parent driver, then old mappings can't be restored, and new mappings can be added to the iotlb after vDPA reset. There's no point in preserving this broken and inconsistent behavior between vhost-vdpa and the parent driver, as userspace doesn't care at all! but the backend acknowledging this feature flag allows QEMU to make sure it is safe to skip this unmap & map in the case of a vhost stop & start cycle. In that sense, this feature flag is actually a signal for userspace to know that the bug has been solved. Right, I couldn't say it better than you do, thanks! The feature flag is more of an unusual means of indicating that the kernel bug has been fixed, rather than introducing a new feature or new kernel behavior that ends up changing userspace's expectation. Not offering it indicates that userspace cannot trust the kernel will retain the maps. Si-Wei or
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/15/2023 11:32 PM, Jason Wang wrote: On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause the mapping to go away on such an IOTLB model in need of persistent mapping. Before vhost-vdpa goes away, give them a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ Should we do this according to whether IOTLB_PERSIST is set? 
Well, in theory this seems like so, but it's actually an unnecessary code change, as that is how a vDPA parent behind a platform IOMMU works today, and userspace doesn't break as of today. :) Well, this is a question I've asked before. You have explained that one of the reasons we don't break userspace is that they may couple IOTLB reset with vDPA reset as well. One example is the Qemu. Nope, it was the opposite. Maybe it was not clear enough, let me try once more - userspace CANNOT decouple IOTLB reset from vDPA reset today. This is because the bug/discrepancy in mlx5_vdpa and vdpa_sim already breaks userspace's expectation, rendering the vhost-vdpa mapping interface broken and inconsistent, no longer behaving as it promised and should have done. Only with the IOTLB_PERSIST flag seen can userspace trust the vhost-vdpa kernel interface to *reliably* decouple IOTLB reset from vDPA reset. Without seeing this flag, no matter how the code in QEMU was written, today's older userspace was never able to assume the mappings will *definitely* be cleared by vDPA reset. If any userspace implementation wants to get consistent behavior for all vDPA parent devices, it still has to *explicitly* clear all existing mappings on its own by sending a bunch of unmap (iotlb invalidate) requests to the vhost-vdpa kernel before resetting the vDPA backend. In brief, userspace is already broken by the kernel implementation today, and new userspace needs some device flag to know for sure if the kernel bug has already been fixed; older userspace doesn't care about preserving the broken kernel behavior at all, regardless of whether or not it wants to decouple IOTLB from vDPA reset. As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today is already tolerating enough with either good or bad IOMMU. 
This code not checking whether IOTLB_PERSIST is set is intentional; there's no point emulating bad IOMMU behavior even for older userspace (with improper emulation to be done it would result in even worse performance). For two reasons: 1) backend features need to be acked by userspace, this is by design There's no breakage on this part. Backend feature IOTLB_PERSIST won't be set if userspace doesn't ack. 2) keeping the odd behaviour seems safer as we can't audit every userspace program Definitely don't have to audit every userspace program, but I cannot think of a case where a sane userspace program can be broken. Can you elaborate one or two potential userspace usages that may break because of this? As said, platform IOMMU already did it this way. Regards, -Siwei Thanks I think the purpose of the IOTLB_PERSIST flag is just to give userspace 100% certainty of persistent iotlb mapping not getting lost across vdpa reset. Thanks, -Siwei [1] https://lore.kernel.
Re: [RFC PATCH] vdpa_sim: implement .reset_map support
Hi Stefano, On 10/13/2023 2:22 AM, Stefano Garzarella wrote: Hi Si-Wei, On Fri, Oct 13, 2023 at 01:23:40AM -0700, Si-Wei Liu wrote: RFC only. Not tested on vdpa-sim-blk with user virtual address. I can test it, but what should I stress? Great, thank you! As you see, my patch moved vhost_iotlb_reset out of vdpasim_reset for the sake of decoupling mapping from vdpa device reset. For hardware devices this decoupling makes sense as the platform IOMMU already did it. But I'm not sure if there's something in the software device (esp. with vdpa-blk and the userspace library stack) that may have to rely on the current .reset behavior that clears the vhost_iotlb. So perhaps you can try to exercise every possible case involving blk device reset, and see if anything (related to mapping) breaks? Works fine with vdpa-sim-net which uses physical address to map. Can you share your tests? so I'll try to do the same with blk. Basically everything involving virtio device reset in the guest, e.g. reboot the VM, remove/unbind then reprobe/bind the virtio-net module/driver, then see if device I/O (which needs mapping properly) is still flowing as expected. And then everything else that could trigger QEMU's vhost_dev_start/stop paths ending up as passive vhost-vdpa backend reset, e.g. link status change, suspend/hibernate, SVQ switch and live migration. I am not sure if vdpa-blk supports live migration through SVQ or not; if not, you don't need to worry about it. This patch is based on top of [1]. [1] https://lore.kernel.org/virtualization/1696928580-7520-1-git-send-email-si-wei@oracle.com/ The series does not apply well on master or vhost tree. Where should I apply it? Sent the link through another email offline. Thanks, -Siwei If you have a tree with all of them applied, it will be easy for me ;-) Thanks, Stefano
[RFC PATCH] vdpa_sim: implement .reset_map support
RFC only. Not tested on vdpa-sim-blk with user virtual address. Works fine with vdpa-sim-net which uses physical address to map. This patch is based on top of [1]. [1] https://lore.kernel.org/virtualization/1696928580-7520-1-git-send-email-si-wei@oracle.com/ Signed-off-by: Si-Wei Liu --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 28 +--- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 76d4105..a7455f2 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -151,13 +151,6 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim) &vdpasim->iommu_lock); } - for (i = 0; i < vdpasim->dev_attr.nas; i++) { - vhost_iotlb_reset(&vdpasim->iommu[i]); - vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, - 0, VHOST_MAP_RW); - vdpasim->iommu_pt[i] = true; - } - vdpasim->running = true; spin_unlock(&vdpasim->iommu_lock); @@ -637,6 +630,25 @@ static int vdpasim_set_map(struct vdpa_device *vdpa, unsigned int asid, return ret; } +static int vdpasim_reset_map(struct vdpa_device *vdpa, unsigned int asid) +{ + struct vdpasim *vdpasim = vdpa_to_sim(vdpa); + + if (asid >= vdpasim->dev_attr.nas) + return -EINVAL; + + spin_lock(&vdpasim->iommu_lock); + if (vdpasim->iommu_pt[asid]) + goto out; + vhost_iotlb_reset(&vdpasim->iommu[asid]); + vhost_iotlb_add_range(&vdpasim->iommu[asid], 0, ULONG_MAX, + 0, VHOST_MAP_RW); + vdpasim->iommu_pt[asid] = true; +out: + spin_unlock(&vdpasim->iommu_lock); + return 0; +} + static int vdpasim_bind_mm(struct vdpa_device *vdpa, struct mm_struct *mm) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); @@ -759,6 +771,7 @@ static void vdpasim_free(struct vdpa_device *vdpa) .set_group_asid = vdpasim_set_group_asid, .dma_map= vdpasim_dma_map, .dma_unmap = vdpasim_dma_unmap, + .reset_map = vdpasim_reset_map, .bind_mm= vdpasim_bind_mm, .unbind_mm = vdpasim_unbind_mm, .free = vdpasim_free, @@ -796,6 +809,7 @@ static void vdpasim_free(struct vdpa_device *vdpa) .get_iova_range = vdpasim_get_iova_range, .set_group_asid = 
vdpasim_set_group_asid, .set_map= vdpasim_set_map, + .reset_map = vdpasim_reset_map, .bind_mm= vdpasim_bind_mm, .unbind_mm = vdpasim_unbind_mm, .free = vdpasim_free, -- 1.8.3.1
Re: [PATCH 4/4] vdpa/mlx5: implement .reset_map driver op
On 10/12/2023 8:04 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with a preallocated 1:1 DMA MR at device creation time. This 1:1 DMA MR will be implicitly destroyed when the first .set_map call is invoked, in which case callers like vhost-vdpa will start to set up custom mappings. When the .reset callback is invoked, the custom mappings will be cleared and the 1:1 DMA MR will be re-created. In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept around intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR on the given ASID and recreate the initial DMA mapping. That way, the device .reset op can run free from having to maintain and clean up memory mappings by itself. The cvq mapping also needs to be cleared if it is in the given ASID. Co-developed-by: Dragos Tatulea Signed-off-by: Dragos Tatulea Signed-off-by: Si-Wei Liu I wonder if the simulator suffers from the exact same issue. For the vdpa-sim !use_va (map using PA and with pinning) case, yes. But I'm not sure about the situation of the vdpa-sim(-blk) use_va case, e.g. I haven't checked if there's a dependency on today's reset behavior (coupled), and if the QEMU vhost-vdpa backend driver is the only userspace consumer. Maybe Stefano knows? I can give the simulator fix a try, but don't count on me for the vdpa-sim(-blk) use_va part. Regards, -Siwei If yes, let's fix the simulator as well? Thanks
Re: [PATCH 1/4] vdpa: introduce .reset_map operation callback
On 10/12/2023 7:49 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: A device-specific IOMMU parent driver that wishes to see mappings decoupled from the virtio or vdpa device life cycle (device reset) can use it to restore the memory mapping in the device IOMMU to the initial or default state. The reset of mappings is done on a per-address-space basis. The reason why a separate .reset_map op is introduced is that it allows a simple on-chip IOMMU model without exposing too many device implementation details to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exist other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpected way, these on-chip IOMMU devices can start in 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch the iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model. Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index d376309..26ae6ae 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -327,6 +327,15 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping to the default + * state (optional) I think we need to mention that this is a must for parents that use set_map()? It's not a must IMO; some .set_map() users, e.g. VDUSE or vdpa-sim-blk, can deliberately choose whether to implement .reset_map() depending on their own needs.
Those user_va software devices mostly don't care about mapping cost during reset, as they don't have to pin kernel memory in general. It's just whether or not they care about mapping being decoupled from device reset at all. And the exact implementation requirement of this interface has been documented right below. -Siwei Other than this: Acked-by: Jason Wang Thanks + * Needed for devices that are using device + * specific DMA translation and prefer mapping + * to be decoupled from the virtio life cycle, + * i.e. device .reset op does not reset mapping + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -405,6 +414,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with an on-chip IOMMU or vendor specific IOTLB implementation may need to restore the iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause mappings to go away on such an IOTLB model that needs persistent mapping. Before vhost-vdpa goes away, give them a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ Should we do this according to whether IOTLB_PERSIST is set? Well, in theory it seems so, but that would actually be an unnecessary code change, as this is how a vDPA parent behind a platform IOMMU works today, and userspace doesn't break as of today.
:) As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today already has to tolerate either a good or a bad IOMMU. Not checking whether IOTLB_PERSIST is set in this code is intentional; there's no point in emulating bad IOMMU behavior even for older userspace (improper emulation would result in even worse performance). I think the purpose of the IOTLB_PERSIST flag is just to give userspace 100% certainty that persistent iotlb mappings won't get lost across vdpa reset. Thanks, -Siwei [1] https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e4...@oracle.com/ [2] https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1...@oracle.com/ Otherwise we may break old userspace. Thanks + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 1.8.3.1
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/11/2023 4:21 AM, Eugenio Perez Martin wrote: On Tue, Oct 10, 2023 at 11:05 AM Si-Wei Liu wrote: Devices with an on-chip IOMMU or vendor specific IOTLB implementation may need to restore the iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause mappings to go away on such an IOTLB model that needs persistent mapping. Before vhost-vdpa goes away, give them a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); Now I'm wondering, does this call to vhost_vdpa_iotlb_unmap set a different map (via .set_map) per element of the vhost_iotlb_itree? Yes and no; effectively this vhost_vdpa_iotlb_unmap call will pass an empty iotlb with zero map entries down to the driver via .set_map, so for the .set_map interface it's always a different map no matter what.
As for this special case, the internal implementation of mlx5_vdpa .set_map may choose to either destroy the MR and recreate a new one, or remove all mappings on the existing MR (currently it uses destroy+recreate for simplicity, without having to special-case). But .reset_map is different - the 1:1 DMA MR has to be recreated explicitly after destroying the regular MR, so this is driver/device implementation specific. Not a big deal since we're in the cleanup path, but it could be a nice optimization on top, as we're going to reset the map of the asid anyway. You mean wrap up what's done in vhost_vdpa_iotlb_unmap and vhost_vdpa_reset_map into a new call, say vhost_vdpa_iotlb_reset? Yes, this is possible, but note that vhost_vdpa_iotlb_unmap also takes charge of pinning accounting besides mapping, and it has to keep its own vhost_iotlb copy in sync. There's not much code that can be consolidated or generalized at this point, as vhost_vdpa_reset_map() is very specific to some device implementations, and I don't see a common need to optimize this further up in the map/unmap hot path rather than this cleanup slow path, just as you alluded to. Regards, -Siwei + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 1.8.3.1
[PATCH 4/4] vdpa/mlx5: implement .reset_map driver op
Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with a preallocated 1:1 DMA MR at device creation time. This 1:1 DMA MR will be implicitly destroyed when the first .set_map call is invoked, in which case callers like vhost-vdpa will start to set up custom mappings. When the .reset callback is invoked, the custom mappings will be cleared and the 1:1 DMA MR will be re-created. In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept around intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR on the given ASID and recreate the initial DMA mapping. That way, the device .reset op can run free from having to maintain and clean up memory mappings by itself. The cvq mapping also needs to be cleared if it is in the given ASID. Co-developed-by: Dragos Tatulea Signed-off-by: Dragos Tatulea Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 + drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +- 3 files changed, 31 insertions(+), 5 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index db988ce..84547d9 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...)
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 66530e28..2197c46 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev) return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0); } + +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +{ + if (asid >= MLX5_VDPA_NUM_AS) + return -EINVAL; + + mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]); + + if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if (mlx5_vdpa_create_dma_mr(mvdev)) + mlx5_vdpa_warn(mvdev, "create DMA MR failed\n"); + } else { + mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid); + } + + return 0; +} diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 6abe023..928e71b 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -2838,7 +2838,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) unregister_link_notifier(ndev); teardown_driver(ndev); clear_vqs_ready(ndev); - mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); ndev->mvdev.status = 0; ndev->mvdev.suspended = false; ndev->cur_num_vqs = 0; @@ -2849,10 +2848,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) init_group_to_asid_map(mvdev); ++mvdev->generation; - if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { - if (mlx5_vdpa_create_dma_mr(mvdev)) - mlx5_vdpa_warn(mvdev, "create MR failed\n"); - } up_write(&ndev->reslock); return 0; @@ -2932,6 +2927,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid, return err; } +static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid) +{ + struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); + struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); + int err; + + down_write(&ndev->reslock); + err = mlx5_vdpa_reset_mr(mvdev, asid); + up_write(&ndev->reslock); + return err; +} + static struct device
*mlx5_get_vq_dma_dev(struct vdpa_device *vdev, u16 idx) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); @@ -3199,6 +3206,7 @@ static int mlx5_set_group_asid(struct vdpa_device *vdev, u32 group, .set_config = mlx5_vdpa_set_config, .get_generation = mlx5_vdpa_get_generation, .set_map = mlx5_vdpa_set_map, + .reset_map = mlx5_vdpa_reset_map, .set_group_asid = mlx5_set_group_asid, .get_vq_dma_dev = mlx5_get_vq_dma_dev, .free = mlx5_vdpa_free, -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 3/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to tell whether the vhost-vdpa iotlb in the kernel supports persistent IOTLB mapping across device reset. Without it, userspace has no way to know whether it's running on an older kernel, which could silently drop all iotlb mappings across vDPA reset. There are 3 cases where the backend may claim this feature bit: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver - parent device with vendor specific IOMMU implementation that explicitly declares the specific backend feature The reason why .reset_map is one of the preconditions for persistent iotlb is that without it, vhost-vdpa can't switch the iotlb back to the initial state later on, especially for the on-chip IOMMU case which starts with identity mapping at device creation. virtio-vdpa requires the on-chip IOMMU to perform 1:1 passthrough translation from PA to IOVA as-is to begin with, and .reset_map is the only means to turn the iotlb back to identity mapping mode after vhost-vdpa is gone.
Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 15 +++ include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index a3f8160..c92794f 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -413,6 +413,15 @@ static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) return ops->get_vq_desc_group; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map || + vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -725,6 +734,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME) | BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK))) @@ -741,6 +751,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -796,6 +809,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); features |= vhost_vdpa_get_backend_features(v); if (copy_to_user(featurep, &features, sizeof(features))) r = -EFAULT; diff --git a/include/uapi/linux/vhost_types.h
b/include/uapi/linux/vhost_types.h index 18ad6ae..d765690 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -190,5 +190,7 @@ struct vhost_vdpa_iova_range { * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID. */ #define VHOST_BACKEND_F_DESC_ASID 0x7 +/* IOTLB mappings are not flushed across device reset */ +#define VHOST_BACKEND_F_IOTLB_PERSIST 0x8 #endif -- 1.8.3.1
[PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce the needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For this to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore the 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For context, those on-chip IOMMU parent devices create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is based on the descriptor group v3 series from Dragos.
[2] [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html [2] [PATCH vhost v3 00/16] vdpa: Add support for vq descriptor mappings https://lore.kernel.org/lkml/20231009112401.1060447-1-dtatu...@nvidia.com/ --- v1: - rewrote commit messages to include more detailed description and background - reworded "on-chip IOMMU" to "vendor specific IOMMU implementation" - included parent device backend features in the persistent iotlb precondition - reimplemented the mlx5_vdpa patch on top of the descriptor group series RFC v3: - fix missing return due to merge error in patch #4 RFC v2: - rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/ --- Si-Wei Liu (4): vdpa: introduce .reset_map operation callback vhost-vdpa: reset vendor specific mapping to initial state in .release vhost-vdpa: introduce IOTLB_PERSIST backend feature bit vdpa/mlx5: implement .reset_map driver op drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 + drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +- drivers/vhost/vdpa.c | 31 +++ include/linux/vdpa.h | 10 ++ include/uapi/linux/vhost_types.h | 2 ++ 6 files changed, 74 insertions(+), 5 deletions(-) -- 1.8.3.1
[PATCH 1/4] vdpa: introduce .reset_map operation callback
Device-specific IOMMU parent drivers that wish to see mappings decoupled from the virtio or vdpa device life cycle (device reset) can use it to restore the memory mapping in the device IOMMU to the initial or default state. The reset of mappings is done on a per-address-space basis. The reason why a separate .reset_map op is introduced is that it allows a simple on-chip IOMMU model without exposing too many device implementation details to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exist other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpected way, these on-chip IOMMU devices can start in 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch the iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model. Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index d376309..26ae6ae 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -327,6 +327,15 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping to the default + * state (optional) + * Needed for devices that are using device + * specific DMA translation and prefer mapping + * to be decoupled from the virtio life cycle, + * i.e.
device .reset op does not reset mapping + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -405,6 +414,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
Devices with an on-chip IOMMU or vendor specific IOTLB implementation may need to restore the iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings on their own, independent of virtio device state. For instance, device reset does not cause mappings to go away on such an IOTLB model that needs persistent mapping. Before vhost-vdpa goes away, give them a chance to reset the iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 1.8.3.1
Re: [PATCH 00/16] vdpa: Add support for vq descriptor mappings
On 9/25/2023 12:59 AM, Dragos Tatulea wrote: On Tue, 2023-09-12 at 16:01 +0300, Dragos Tatulea wrote: This patch series adds support for vq descriptor table mappings which are used to improve vdpa live migration downtime. The improvement comes from using smaller mappings which take less time to create and destroy in hw. Gentle ping. Note that I will have to send a v2. The changes in mlx5_ifc.h will need to be merged first separately into the mlx5-next branch [0] and then pulled from there when the series is applied. This separation is unnecessary, as historically the virtio emulation portion of the update to mlx5_ifc.h often had to go through the vhost tree. See commits 1892a3d425bf and e13cd45d352d. Especially the additions from this series (mainly desc group mkey) have nothing to do with any networking or NIC driver feature. -Siwei [0] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/log/?h=mlx5-next Thanks, Dragos The first part adds the vdpa core changes from Si-Wei [0]. The second part adds support in mlx5_vdpa: - Refactor the mr code to be able to cleanly add descriptor mappings. - Add hardware descriptor mr support. - Properly update iotlb for cvq during ASID switch. 
[0] https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com Dragos Tatulea (13): vdpa/mlx5: Create helper function for dma mappings vdpa/mlx5: Decouple cvq iotlb handling from hw mapping code vdpa/mlx5: Take cvq iotlb lock during refresh vdpa/mlx5: Collapse "dvq" mr add/delete functions vdpa/mlx5: Rename mr destroy functions vdpa/mlx5: Allow creation/deletion of any given mr struct vdpa/mlx5: Move mr mutex out of mr struct vdpa/mlx5: Improve mr update flow vdpa/mlx5: Introduce mr for vq descriptor vdpa/mlx5: Enable hw support for vq descriptor mapping vdpa/mlx5: Make iotlb helper functions more generic vdpa/mlx5: Update cvq iotlb mapping on ASID change Cover letter: vdpa/mlx5: Add support for vq descriptor mappings Si-Wei Liu (3): vdpa: introduce dedicated descriptor group for virtqueue vhost-vdpa: introduce descriptor group backend feature vhost-vdpa: uAPI to get dedicated descriptor group id drivers/vdpa/mlx5/core/mlx5_vdpa.h | 31 +++-- drivers/vdpa/mlx5/core/mr.c | 191 - drivers/vdpa/mlx5/core/resources.c | 6 +- drivers/vdpa/mlx5/net/mlx5_vnet.c | 100 ++- drivers/vhost/vdpa.c | 27 include/linux/mlx5/mlx5_ifc.h | 8 +- include/linux/mlx5/mlx5_ifc_vdpa.h | 7 +- include/linux/vdpa.h | 11 ++ include/uapi/linux/vhost.h | 8 ++ include/uapi/linux/vhost_types.h | 5 + 10 files changed, 264 insertions(+), 130 deletions(-) ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 00/16] vdpa: Add support for vq descriptor mappings
On 9/13/2023 9:08 AM, Eugenio Perez Martin wrote: On Wed, Sep 13, 2023 at 3:03 AM Lei Yang wrote: Hi Dragos, Eugenio and Si-Wei, I'm Lei Yang, a software Quality Engineer from Red Hat, and I always pay attention to live migration downtime issues, because other QEs have asked about this problem when I shared live migration status recently. Therefore I would like to test it in my environment. Before testing, I want to know if there is an expected downtime range based on this series of patches. In addition, QE can also help with a regression test based on this series of patches if there is a requirement. Hi Lei, Thanks for offering the testing bandwidth! I think we can only do regression tests here, as the userland part has not been sent to qemu yet. Right, regression only for now. Even once QEMU has it, exercising the relevant feature would need supporting firmware that is not yet available. Just stay tuned. Thanks for your patience, -Siwei Regards and thanks Lei On Tue, Sep 12, 2023 at 9:04 PM Dragos Tatulea wrote: This patch series adds support for vq descriptor table mappings which are used to improve vdpa live migration downtime. The improvement comes from using smaller mappings which take less time to create and destroy in hw. The first part adds the vdpa core changes from Si-Wei [0]. The second part adds support in mlx5_vdpa: - Refactor the mr code to be able to cleanly add descriptor mappings. - Add hardware descriptor mr support. - Properly update iotlb for cvq during ASID switch.
[0] https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com Dragos Tatulea (13): vdpa/mlx5: Create helper function for dma mappings vdpa/mlx5: Decouple cvq iotlb handling from hw mapping code vdpa/mlx5: Take cvq iotlb lock during refresh vdpa/mlx5: Collapse "dvq" mr add/delete functions vdpa/mlx5: Rename mr destroy functions vdpa/mlx5: Allow creation/deletion of any given mr struct vdpa/mlx5: Move mr mutex out of mr struct vdpa/mlx5: Improve mr update flow vdpa/mlx5: Introduce mr for vq descriptor vdpa/mlx5: Enable hw support for vq descriptor mapping vdpa/mlx5: Make iotlb helper functions more generic vdpa/mlx5: Update cvq iotlb mapping on ASID change Cover letter: vdpa/mlx5: Add support for vq descriptor mappings Si-Wei Liu (3): vdpa: introduce dedicated descriptor group for virtqueue vhost-vdpa: introduce descriptor group backend feature vhost-vdpa: uAPI to get dedicated descriptor group id drivers/vdpa/mlx5/core/mlx5_vdpa.h | 31 +++-- drivers/vdpa/mlx5/core/mr.c| 191 - drivers/vdpa/mlx5/core/resources.c | 6 +- drivers/vdpa/mlx5/net/mlx5_vnet.c | 100 ++- drivers/vhost/vdpa.c | 27 include/linux/mlx5/mlx5_ifc.h | 8 +- include/linux/mlx5/mlx5_ifc_vdpa.h | 7 +- include/linux/vdpa.h | 11 ++ include/uapi/linux/vhost.h | 8 ++ include/uapi/linux/vhost_types.h | 5 + 10 files changed, 264 insertions(+), 130 deletions(-) -- 2.41.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC v3 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
On 9/12/2023 12:01 AM, Jason Wang wrote: On Tue, Sep 12, 2023 at 8:28 AM Si-Wei Liu wrote: On 9/10/2023 8:52 PM, Jason Wang wrote: On Sat, Sep 9, 2023 at 9:46 PM Si-Wei Liu wrote: Userspace needs this feature flag to distinguish if vhost-vdpa iotlb in the kernel supports persistent IOTLB mapping across device reset. As discussed, the IOTLB persists for devices with platform IOMMU at least. You've mentioned that this behaviour is covered by Qemu since it reset IOTLB on reset. I wonder what happens if we simply decouple the IOTLB reset from vDPA reset in Qemu. Could we meet bugs there? Not sure I understand. Simple decouple meaning to remove memory_listener registration/unregistration calls *unconditionally* from the vDPA dev start/stop path in QEMU, or make it conditional around the existence of PERSIST_IOTLB? If my memory is correct, currently the listeners were registered/unregistered during start/stop. I mean what if we register/unregister during init/free? Yes, the move to init/cleanup was assumed in below response. If unconditional, it breaks older host kernel, as the older kernel still silently drops all mapping across vDPA reset (VM reboot), Ok, right. ending up with network loss afterwards; if make the QEMU change conditional on PERSIST_IOTLB without the .reset_map API, it can't cover the virtio-vdpa 1:1 identity mapping case. Ok, I see. Let's add the above and explain why it can't cover the 1:1 mapping somewhere (probably the commit log) in the next version. OK. Will do. So I think we can probably introduce reset_map() but not say it's for on-chip IOMMU. We can probably say, it's for resetting the vendor specific mapping into initialization state? For sure. That's exactly the intent, though I didn't specifically tie on-chip to two-dimension or entity mapping. Yes I can reword to "vendor specific" in the next rev to avoid confusions and ambiguity. Thanks, -Siwei Btw, is there a Qemu patch for reference for this new feature? 
There's a WIP version, not ready yet for review: https://github.com/siwliu-kernel/qemu branch: vdpa-svq-asid-poc Will need to clean up code and split to smaller patches before I can post it, if the kernel part can be settled. Ok. Thanks Thanks, -Siwei Thanks There are two cases that backend may claim this feature bit on: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver Signed-off-by: Si-Wei Liu --- RFC v2 -> v3: - fix missing return due to merge error --- drivers/vhost/vdpa.c | 16 +++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 71fbd559..b404504 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -414,6 +414,14 @@ static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) return ops->get_vq_desc_group; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -716,7 +724,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | -BIT_ULL(VHOST_BACKEND_F_RESUME))) +BIT_ULL(VHOST_BACKEND_F_RESUME) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST))) return -EOPNOTSUPP; if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) && !vhost_vdpa_can_suspend(v)) @@ -730,6 +739,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + return -EOPNOTSUPP; 
vhost_set_backend_features(&v->vdev, features); return 0; } @@ -785,6 +797,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persist
Re: [PATCH RFC v3 2/4] vdpa/mlx5: implement .reset_map driver op
On 9/11/2023 11:53 PM, Jason Wang wrote: On Tue, Sep 12, 2023 at 8:02 AM Si-Wei Liu wrote: On 9/10/2023 8:48 PM, Jason Wang wrote: On Sat, Sep 9, 2023 at 9:46 PM Si-Wei Liu wrote: Today, mlx5_vdpa gets started by preallocate 1:1 DMA mapping at device creation time, while this 1:1 mapping will be implicitly destroyed when the first .set_map call is invoked. Everytime when the .reset callback is invoked, any mapping left behind will be dropped then reset back to the initial 1:1 DMA mapping. In order to reduce excessive memory mapping cost during live migration, it is desirable to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. mappings should be left intact across virtio device reset. Leverage the .reset_map callback to reset memory mapping, then the device .reset routine can run free from having to clean up memory mappings. It's not clear the direct relationship between the persistent mapping and reset_map. Consider .reset_map as a simplified abstraction for on-chip IOMMU model, decoupling memory mapping mode switching from the current vdpa_reset hack. Slightly different than platform IOMMU iommu_domain_alloc/free, but works the best with existing .dma_map/.set_map APIs. Note that iommu_domain_alloc/free doesn't imply any mappings (even the identity mapping). Forget about this part, it just exposes the multi-dimension aspect of iommu domain unnecessarily, and I think we both don't like to. Although this was intended to make virtio-vdpa work seamlessly when it is used over mlx5-vdpa, similar to the DMA device deviation introduced to the vDPA driver API. Thanks, -Siwei As said in the other email, the distinction cannot be hidden, as there are bus drivers with varied mapping needs. On the other hand, I can live with the iommu_domain_alloc/free flavor strictly following the platform IOMMU model, but not sure if worth the complexity. I'm not sure I get this, maybe you can post some RFC or pseudo code? Could we do it step by step? 
For example, remove the mlx5_vdpa_destroy_mr() in mlx5_vdpa_reset() when PERSIST_IOTLB exists? I think today there's no way for the parent driver to negotiate backend features with userspace, for e.g. parent won't be able to perform mlx5_vdpa_destroy_mr for the virtio-vdpa case when PERSIST_IOTLB doesn't exist. And this backend features stuff is a vhost specific thing, not specifically tied to vdpa itself. How do we get it extended and propagated up to the vdpa bus layer? Just to make sure we are on the same page, I just want to know what happens if we simply remove mlx5_vdpa_destroy_mr() in mlx5_vdpa_reset()? And then we can deal with the resetting and others on top, For this proposed fix, dealing with vdpa_reset from vhost-vdpa is not specifically an issue, but how to get the mapping reverted back to 1:1 identity/passthrough when users want to switch from vhost-vdpa to virtio-vdpa is. or it needs some explanation for why reset_map() must be done first. Yep, I can add more to the commit log. Thanks Thanks, -Siwei Thanks Signed-off-by: Si-Wei Liu --- RFC v1 -> v2: - fix error path when both CVQ and DVQ fall in same asid --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 70 +++--- drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- 3 files changed, 56 insertions(+), 33 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index b53420e..5c9a25a 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -123,6 +123,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev); void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 5a1971fc..ec2c7b4e1 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -489,21 +489,15 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_mr *mr } } -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return; - prune_iotlb(mvdev); } -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev) { struct mlx5_vdpa_mr *mr = &
Re: [PATCH RFC v2 1/4] vdpa: introduce .reset_map operation callback
On 9/11/2023 11:23 PM, Jason Wang wrote: On Tue, Sep 12, 2023 at 7:31 AM Si-Wei Liu wrote: Hi Jason, On 9/10/2023 8:42 PM, Jason Wang wrote: Hi Si-Wei: On Sat, Sep 9, 2023 at 9:34 PM Si-Wei Liu wrote: On-chip IOMMU parent driver could use it to restore memory mapping to the initial state. As discussed before. On-chip IOMMU is the hardware details that need to be hidden by the vDPA bus. I guess today this is exposed to the bus driver layer already, for e.g. vhost_vdpa_map() can call into the .dma_map, or .set_map, or iommu_map() flavors depending on the specific hardware IOMMU implementation underneath? Specifically, "struct iommu_domain *domain" is now part of "struct vhost_vdpa" at an individual bus driver (vhost-vdpa), rather than being wrapped around under the vdpa core "struct vdpa_device" as vdpa device level object. Do we know for what reason the hardware details could be exposed to bus callers like vhost_vdpa_map and vhost_vdpa_general_unmap, while it's prohibited for other similar cases on the other hand? Or is there a boundary in between I was not aware of? Let me try to explain: set_map(), dma_map(), dma_unmap() is used for parent specific mappings. It means the parents want to do vendor specific setup for the mapping. The abstraction of translation is still one dimension (thought the actual implementation in the parent could be two dimensions). So it's not necessarily the on-chip stuff (see the example of the VDUSE). That means we never expose two dimension mappings like (on-chip) beyond the bus. So it's not one dimension vs two dimensions but the platform specific mappings vs vendor specific mappings. OK, I think I saw on-chip was used interchangeably for vendor specific means of mapping even for VDUSE. While I think we both agreed it's too complex to expose the details of two-dimensions and we should try to avoid that (I thought on-chip doesn't imply two-dimension but just the vendor specific part). 
That's the reason why I hide this special detail under a simple .reset_map interface such that we could easily decouple mapping from virtio life cycle (device reset). I think a more fundamental question I don't quite understand, is adding an extra API to on-chip IOMMU itself an issue, or just that you don't like the way how the IOMMU model gets exposed via this specific API of .reset_map? extra API to on-chip IOMMU, since the on-chip logics should be hidden by the bus unless we want to introduce the two dimensions abstraction (which seems to be an overkill). Thanks for clarifications of your concern. I will rephrase on-chip to "vendor specific" and try to avoid mentioning the two-dimension aspect of the API. For the platform IOMMU case, internally there exists distinction between the 1:1 identity (passthrough) mode and DMA page mapping mode, and this distinction is somehow getting exposed and propagated through the IOMMU API - for e.g. iommu_domain_alloc() and iommu_attach_device() are being called explicitly from vhost_vdpa_alloc_domain() by vhost-vdpa (and the opposite from within vhost_vdpa_free_domain), while for virtio-vdpa it doesn't call any IOMMU API at all on the other hand It's the way the kernel manages DMA mappings. For a userspace driver via vhost-vDPA, it needs to call IOMMU APIs. And for a kernel driver via virtio-vDPA, DMA API is used (via the dma_dev exposed through virtio_vdpa). DMA API may decide to call IOMMU API if IOMMU is enabled but not in passthrough mode. Right, I think what I meant is, distinction of mapping requirement exists between two bus drivers, vhost-vdpa and virtio-vdpa. It's impossible to hide every details (identity, swiotlb, dmar) under the cover of DMA API simply using the IOMMU API abstraction. 
Same applies to how one dimension oriented vendor specific API ( .dma_map/.set_map I mean) can't cover all cases of potentially multi-dimensional mapping requirements from virtio-vdpa (which is using a feature rich DMA API instead of simple and lower level page mapping based IOMMU API). I now get that you may want to understand why .reset_map is required and which part of the userspace functionality won't work without it, on the other hand. - which is to inherit what default IOMMU domain has. Yes, but it's not a 1:1 (identity) mapping, it really depends on the configuration. (And there could even be a swiotlb layer in the middle). Yes, so I said inherit the configuration of the default domain, which could vary versus one-dimension. Ideally for on-chip IOMMU we can and should do pretty much the same, but I don't think there's a clean way without introducing any driver API to make vhost-vdpa case distinguish from the virtio-vdpa case. I'm afraid to say that it was just a hack to hide the necessary distinction needed by vdpa bus users for e.g. in the deep of vdpa_reset(), if not introducing any new driver API is the goal here... So reset_map() is fine if it is not defined just f
Re: [PATCH] vdpa: consume device_features parameter
Thanks David, for clarifications. Now I see the patch just got posted by Shannon (thanks!) with the correct iproute2 label in the subject line. We may expect to see this land on iproute2 repo soon? Thanks! -Siwei On 9/9/2023 1:36 PM, David Ahern wrote: On 9/8/23 12:37 PM, Si-Wei Liu wrote: Just out of my own curiosity, the patch is not applicable simply because the iproute2 label was missing from the subject, or the code base somehow got changed that isn't aligned with the patch any more? Most likely missing the iproute2 label in the Subject line. Thanks, -Siwei ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC v3 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
On 9/10/2023 8:52 PM, Jason Wang wrote: On Sat, Sep 9, 2023 at 9:46 PM Si-Wei Liu wrote: Userspace needs this feature flag to distinguish if vhost-vdpa iotlb in the kernel supports persistent IOTLB mapping across device reset. As discussed, the IOTLB persists for devices with platform IOMMU at least. You've mentioned that this behaviour is covered by Qemu since it reset IOTLB on reset. I wonder what happens if we simply decouple the IOTLB reset from vDPA reset in Qemu. Could we meet bugs there? Not sure I understand. Simple decouple meaning to remove memory_listener registration/unregistration calls *unconditionally* from the vDPA dev start/stop path in QEMU, or make it conditional around the existence of PERSIST_IOTLB? If unconditional, it breaks older host kernel, as the older kernel still silently drops all mapping across vDPA reset (VM reboot), ending up with network loss afterwards; if make the QEMU change conditional on PERSIST_IOTLB without the .reset_map API, it can't cover the virtio-vdpa 1:1 identity mapping case. Btw, is there a Qemu patch for reference for this new feature? There's a WIP version, not ready yet for review: https://github.com/siwliu-kernel/qemu branch: vdpa-svq-asid-poc Will need to clean up code and split to smaller patches before I can post it, if the kernel part can be settled. 
Thanks, -Siwei Thanks There are two cases that backend may claim this feature bit on: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver Signed-off-by: Si-Wei Liu --- RFC v2 -> v3: - fix missing return due to merge error --- drivers/vhost/vdpa.c | 16 +++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 71fbd559..b404504 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -414,6 +414,14 @@ static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) return ops->get_vq_desc_group; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -716,7 +724,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | -BIT_ULL(VHOST_BACKEND_F_RESUME))) +BIT_ULL(VHOST_BACKEND_F_RESUME) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST))) return -EOPNOTSUPP; if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) && !vhost_vdpa_can_suspend(v)) @@ -730,6 +739,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -785,6 +797,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= 
BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); if (copy_to_user(featurep, &features, sizeof(features))) r = -EFAULT; break; diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h index 6acc604..0fdb6f0 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -186,5 +186,7 @@ struct vhost_vdpa_iova_range { * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID. */ #define VHOST_BACKEND_F_DESC_ASID0x6 +/* IOTLB don't flush memory mapping across device reset */ +#define VHOST_BACKEND_F_IOTLB_PERSIST 0x7 #endif -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC v3 2/4] vdpa/mlx5: implement .reset_map driver op
On 9/10/2023 8:48 PM, Jason Wang wrote: On Sat, Sep 9, 2023 at 9:46 PM Si-Wei Liu wrote: Today, mlx5_vdpa gets started by preallocate 1:1 DMA mapping at device creation time, while this 1:1 mapping will be implicitly destroyed when the first .set_map call is invoked. Everytime when the .reset callback is invoked, any mapping left behind will be dropped then reset back to the initial 1:1 DMA mapping. In order to reduce excessive memory mapping cost during live migration, it is desirable to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. mappings should be left intact across virtio device reset. Leverage the .reset_map callback to reset memory mapping, then the device .reset routine can run free from having to clean up memory mappings. It's not clear the direct relationship between the persistent mapping and reset_map. Consider .reset_map as a simplified abstraction for on-chip IOMMU model, decoupling memory mapping mode switching from the current vdpa_reset hack. Slightly different than platform IOMMU iommu_domain_alloc/free, but works the best with existing .dma_map/.set_map APIs. As said in the other email, the distinction cannot be hidden, as there are bus drivers with varied mapping needs. On the other hand, I can live with the iommu_domain_alloc/free flavor strictly following the platform IOMMU model, but not sure if worth the complexity. Could we do it step by step? For example, remove the mlx5_vdpa_destroy_mr() in mlx5_vdpa_reset() when PERSIST_IOTLB exists? I think today there's no way for the parent driver to negotiate backend features with userspace, for e.g. parent won't be able to perform mlx5_vdpa_destroy_mr for the virtio-vdpa case when PERSIST_IOTLB doesn't exist. And this backend features stuff is a vhost specific thing, not specifically tied to vdpa itself. How do we get it extended and propagated up to the vdpa bus layer? 
And then we can deal with the resetting and others on top, For this proposed fix, dealing with vdpa_reset from vhost-vdpa is not specifically an issue, but how to get the mapping reverted back to 1:1 identity/passthrough when users want to switch from vhost-vdpa to virtio-vdpa is. or it needs some explanation for why reset_map() must be done first. Yep, I can add more to the commit log. Thanks, -Siwei Thanks Signed-off-by: Si-Wei Liu --- RFC v1 -> v2: - fix error path when both CVQ and DVQ fall in same asid --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 70 +++--- drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- 3 files changed, 56 insertions(+), 33 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index b53420e..5c9a25a 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -123,6 +123,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev); void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 5a1971fc..ec2c7b4e1 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -489,21 +489,15 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_mr *mr } } -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return; - prune_iotlb(mvdev); } -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev) { struct mlx5_vdpa_mr *mr = &mvdev->mr; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return; - if (!mr->initialized) return; @@ -521,8 +515,10 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid) mutex_lock(&mvdev->mkey_mtx); - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid); - _mlx5_vdpa_destroy_cvq_mr(mvdev, asid); + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) + _mlx5_vdpa_destroy_dvq_mr(mvdev); + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) + _mlx5_vdpa_destroy_cvq_mr(mvdev); mutex_unlock(&mvdev->mkey_mtx); } @@ -534,25 +530,17 @@ void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev)
Re: [PATCH RFC v2 1/4] vdpa: introduce .reset_map operation callback
Hi Jason, On 9/10/2023 8:42 PM, Jason Wang wrote: Hi Si-Wei: On Sat, Sep 9, 2023 at 9:34 PM Si-Wei Liu wrote: On-chip IOMMU parent driver could use it to restore memory mapping to the initial state. As discussed before. On-chip IOMMU is the hardware details that need to be hidden by the vDPA bus. I guess today this is exposed to the bus driver layer already, for e.g. vhost_vdpa_map() can call into the .dma_map, or .set_map, or iommu_map() flavors depending on the specific hardware IOMMU implementation underneath? Specifically, "struct iommu_domain *domain" is now part of "struct vhost_vdpa" at an individual bus driver (vhost-vdpa), rather than being wrapped around under the vdpa core "struct vdpa_device" as vdpa device level object. Do we know for what reason the hardware details could be exposed to bus callers like vhost_vdpa_map and vhost_vdpa_general_unmap, while it's prohibited for other similar cases on the other hand? Or is there a boundary in between I was not aware of? I think a more fundamental question I don't quite understand, is adding an extra API to on-chip IOMMU itself an issue, or just that you don't like the way how the IOMMU model gets exposed via this specific API of .reset_map? For the platform IOMMU case, internally there exists distinction between the 1:1 identity (passthrough) mode and DMA page mapping mode, and this distinction is somehow getting exposed and propagated through the IOMMU API - for e.g. iommu_domain_alloc() and iommu_attach_device() are being called explicitly from vhost_vdpa_alloc_domain() by vhost-vdpa (and the opposite from within vhost_vdpa_free_domain), while for virtio-vdpa it doesn't call any IOMMU API at all on the other hand - which is to inherit what default IOMMU domain has. Ideally for on-chip IOMMU we can and should do pretty much the same, but I don't think there's a clean way without introducing any driver API to make vhost-vdpa case distinguish from the virtio-vdpa case. 
I'm afraid to say that it was just a hack to hide the necessary distinction needed by vdpa bus users for e.g. in the deep of vdpa_reset(), if not introducing any new driver API is the goal here... Exposing this will complicate the implementation of bus drivers. As said above, this distinction is needed by bus drivers, and it's already done by platform IOMMU via IOMMU API. I can drop the .reset_map API while add another set of similar driver API to mimic iommu_domain_alloc/iommu_domain_free, but doing this will complicate the parent driver's implementation on the other hand. While .reset_map is what I can think of to be the simplest for parent, I can do the other way if you're fine with it. Let me know how it sounds. Thanks, -Siwei Thanks Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 17a4efa..daecf55 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -324,6 +324,12 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping (optional) + * Needed for device that using device + * specific DMA translation (on-chip IOMMU) + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -401,6 +407,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org 
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v3 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to distinguish if vhost-vdpa iotlb in the kernel supports persistent IOTLB mapping across device reset. There are two cases that backend may claim this feature bit on: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver Signed-off-by: Si-Wei Liu --- RFC v2 -> v3: - fix missing return due to merge error --- drivers/vhost/vdpa.c | 16 +++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 71fbd559..b404504 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -414,6 +414,14 @@ static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) return ops->get_vq_desc_group; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -716,7 +724,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | -BIT_ULL(VHOST_BACKEND_F_RESUME))) +BIT_ULL(VHOST_BACKEND_F_RESUME) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST))) return -EOPNOTSUPP; if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) && !vhost_vdpa_can_suspend(v)) @@ -730,6 +739,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -785,6 +797,8 @@ static long vhost_vdpa_unlocked_ioctl(struct 
file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); if (copy_to_user(featurep, &features, sizeof(features))) r = -EFAULT; break; diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h index 6acc604..0fdb6f0 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -186,5 +186,7 @@ struct vhost_vdpa_iova_range { * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID. */ #define VHOST_BACKEND_F_DESC_ASID0x6 +/* IOTLB don't flush memory mapping across device reset */ +#define VHOST_BACKEND_F_IOTLB_PERSIST 0x7 #endif -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v3 2/4] vdpa/mlx5: implement .reset_map driver op
Today, mlx5_vdpa gets started by preallocate 1:1 DMA mapping at device creation time, while this 1:1 mapping will be implicitly destroyed when the first .set_map call is invoked. Everytime when the .reset callback is invoked, any mapping left behind will be dropped then reset back to the initial 1:1 DMA mapping. In order to reduce excessive memory mapping cost during live migration, it is desirable to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. mappings should be left intact across virtio device reset. Leverage the .reset_map callback to reset memory mapping, then the device .reset routine can run free from having to clean up memory mappings. Signed-off-by: Si-Wei Liu --- RFC v1 -> v2: - fix error path when both CVQ and DVQ fall in same asid --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 70 +++--- drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- 3 files changed, 56 insertions(+), 33 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index b53420e..5c9a25a 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -123,6 +123,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev); void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 5a1971fc..ec2c7b4e1 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -489,21 +489,15 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_mr *mr } } -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return; - prune_iotlb(mvdev); } -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev) { struct mlx5_vdpa_mr *mr = &mvdev->mr; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return; - if (!mr->initialized) return; @@ -521,8 +515,10 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid) mutex_lock(&mvdev->mkey_mtx); - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid); - _mlx5_vdpa_destroy_cvq_mr(mvdev, asid); + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) + _mlx5_vdpa_destroy_dvq_mr(mvdev); + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) + _mlx5_vdpa_destroy_cvq_mr(mvdev); mutex_unlock(&mvdev->mkey_mtx); } @@ -534,25 +530,17 @@ void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev) } static int _mlx5_vdpa_create_cvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return 0; - return dup_iotlb(mvdev, iotlb); } static int _mlx5_vdpa_create_dvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { struct mlx5_vdpa_mr *mr = &mvdev->mr; int err; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return 0; - if (mr->initialized) return 0; @@ -574,18 +562,22 @@ static int _mlx5_vdpa_create_mr(struct 
mlx5_vdpa_dev *mvdev, { int err; - err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid); - if (err) - return err; - - err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb, asid); - if (err) - goto out_err; + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb); + if (err) + return err; + } + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb); + if (err) + goto out_err; + } return 0; out_err: - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid);
[PATCH RFC v3 3/4] vhost-vdpa: should restore 1:1 dma mapping before detaching driver
Devices with on-chip IOMMU may need to restore iotlb to 1:1 identity mapping from IOVA to PA. Before vhost-vdpa goes away, give them a chance to clean up and reset iotlb back to 1:1 identity mapping mode. This is done so that any vdpa bus driver may start with 1:1 identity mapping by default. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index eabac06..71fbd559 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,14 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with on-chip IOMMU need to restore iotlb +* to 1:1 identity mapping before vhost-vdpa is going +* to be removed and detached from the device. Give +* them a chance to do so, as this cannot be done +* efficiently via the whole-range unmap call above. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v3 1/4] vdpa: introduce .reset_map operation callback
On-chip IOMMU parent driver could use it to restore memory mapping to the initial state. Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 17a4efa..daecf55 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -324,6 +324,12 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping (optional) + * Needed for device that using device + * specific DMA translation (on-chip IOMMU) + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -401,6 +407,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v3 0/4] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device should implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, which is mainly used to reset virtio specific device state. This new .reset_map() callback will be invoked only when the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device add, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html --- RFC v3: - fix missing return due to merge error in patch #4 RFC v2: - rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/ --- Si-Wei Liu (4): vdpa: introduce .reset_map operation callback vdpa/mlx5: implement .reset_map driver op vhost-vdpa: should restore 1:1 dma mapping before detaching driver vhost-vdpa: introduce IOTLB_PERSIST backend feature bit drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 70 +++--- drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- drivers/vhost/vdpa.c | 32 - include/linux/vdpa.h | 7 include/uapi/linux/vhost_types.h | 2 ++ 6 files changed, 96 insertions(+), 34 deletions(-) -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to distinguish if vhost-vdpa iotlb in the kernel supports persistent IOTLB mapping across device reset. There are two cases that backend may claim this feature bit on: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 15 ++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 71fbd559..bbb1092 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -414,6 +414,14 @@ static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) return ops->get_vq_desc_group; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -716,7 +724,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | -BIT_ULL(VHOST_BACKEND_F_RESUME))) +BIT_ULL(VHOST_BACKEND_F_RESUME) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST))) return -EOPNOTSUPP; if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) && !vhost_vdpa_can_suspend(v)) @@ -729,6 +738,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, return -EINVAL; if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) return -EOPNOTSUPP; vhost_set_backend_features(>vdev, features); return 0; @@ -785,6 +796,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if 
(vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); if (copy_to_user(featurep, , sizeof(features))) r = -EFAULT; break; diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h index 6acc604..0fdb6f0 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -186,5 +186,7 @@ struct vhost_vdpa_iova_range { * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID. */ #define VHOST_BACKEND_F_DESC_ASID0x6 +/* IOTLB don't flush memory mapping across device reset */ +#define VHOST_BACKEND_F_IOTLB_PERSIST 0x7 #endif -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 1/4] vdpa: introduce .reset_map operation callback
On-chip IOMMU parent driver could use it to restore memory mapping to the initial state. Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 17a4efa..daecf55 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -324,6 +324,12 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping (optional) + * Needed for device that using device + * specific DMA translation (on-chip IOMMU) + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -401,6 +407,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 0/4] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device should implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, which is mainly used to reset virtio specific device state. This new .reset_map() callback will be invoked only when the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device add, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html --- RFC v2: - rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/ --- Si-Wei Liu (4): vdpa: introduce .reset_map operation callback vdpa/mlx5: implement .reset_map driver op vhost-vdpa: should restore 1:1 dma mapping before detaching driver vhost-vdpa: introduce IOTLB_PERSIST backend feature bit drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 70 +++--- drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- drivers/vhost/vdpa.c | 32 - include/linux/vdpa.h | 7 include/uapi/linux/vhost_types.h | 2 ++ 6 files changed, 96 insertions(+), 34 deletions(-) -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 2/4] vdpa/mlx5: implement .reset_map driver op
Today, mlx5_vdpa starts with a preallocated 1:1 DMA mapping at device creation time; this 1:1 mapping is implicitly destroyed when the first .set_map call is invoked. Every time the .reset callback is invoked, any mappings left behind are dropped, then reset back to the initial 1:1 DMA mapping. In order to reduce excessive memory mapping cost during live migration, it is desirable to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. mappings should be left intact across virtio device reset. Leverage the .reset_map callback to reset memory mapping, so that the device .reset routine can run free from having to clean up memory mappings. Signed-off-by: Si-Wei Liu --- RFC v1 -> v2: - fix error path when both CVQ and DVQ fall in same asid --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 70 +++--- drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- 3 files changed, 56 insertions(+), 33 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index b53420e..5c9a25a 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -123,6 +123,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev); void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 5a1971fc..ec2c7b4e1 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -489,21 +489,15 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_mr *mr } } -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return; - prune_iotlb(mvdev); } -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev) { struct mlx5_vdpa_mr *mr = >mr; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return; - if (!mr->initialized) return; @@ -521,8 +515,10 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid) mutex_lock(>mkey_mtx); - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid); - _mlx5_vdpa_destroy_cvq_mr(mvdev, asid); + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) + _mlx5_vdpa_destroy_dvq_mr(mvdev); + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) + _mlx5_vdpa_destroy_cvq_mr(mvdev); mutex_unlock(>mkey_mtx); } @@ -534,25 +530,17 @@ void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev) } static int _mlx5_vdpa_create_cvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return 0; - return dup_iotlb(mvdev, iotlb); } static int _mlx5_vdpa_create_dvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { struct mlx5_vdpa_mr *mr = >mr; int err; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return 0; - if (mr->initialized) return 0; @@ -574,18 +562,22 @@ static int _mlx5_vdpa_create_mr(struct 
mlx5_vdpa_dev *mvdev, { int err; - err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid); - if (err) - return err; - - err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb, asid); - if (err) - goto out_err; + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb); + if (err) + return err; + } + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb); + if (err) + goto out_err; + } return 0; out_err: - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid);
[PATCH RFC v2 3/4] vhost-vdpa: should restore 1:1 dma mapping before detaching driver
Devices with on-chip IOMMU may need to restore iotlb to 1:1 identity mapping from IOVA to PA. Before vhost-vdpa goes away, give them a chance to clean up and reset iotlb back to 1:1 identity mapping mode. This is done so that any vdpa bus driver may start with 1:1 identity mapping by default. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index eabac06..71fbd559 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,14 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with on-chip IOMMU need to restore iotlb +* to 1:1 identity mapping before vhost-vdpa is going +* to be removed and detached from the device. Give +* them a chance to do so, as this cannot be done +* efficiently via the whole-range unmap call above. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 3/3] vhost-vdpa: uAPI to get dedicated descriptor group id
With _F_DESC_ASID backend feature, the device can now support the VHOST_VDPA_GET_VRING_DESC_GROUP ioctl, and it may expose the descriptor table (including avail and used ring) in a different group than the buffers it contains. This new uAPI will fetch the group ID of the descriptor table. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 10 ++ include/uapi/linux/vhost.h | 8 2 files changed, 18 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index f2e5dce..eabac06 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -602,6 +602,16 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd, else if (copy_to_user(argp, , sizeof(s))) return -EFAULT; return 0; + case VHOST_VDPA_GET_VRING_DESC_GROUP: + if (!vhost_vdpa_has_desc_group(v)) + return -EOPNOTSUPP; + s.index = idx; + s.num = ops->get_vq_desc_group(vdpa, idx); + if (s.num >= vdpa->ngroups) + return -EIO; + else if (copy_to_user(argp, , sizeof(s))) + return -EFAULT; + return 0; case VHOST_VDPA_SET_GROUP_ASID: if (copy_from_user(, argp, sizeof(s))) return -EFAULT; diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index f5c48b6..649560c 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -219,4 +219,12 @@ */ #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) +/* Get the group for the descriptor table including driver & device areas + * of a virtqueue: read index, write group in num. + * The virtqueue index is stored in the index field of vhost_vring_state. + * The group ID of the descriptor table for this specific virtqueue + * is returned via num field of vhost_vring_state. + */ +#define VHOST_VDPA_GET_VRING_DESC_GROUP_IOWR(VHOST_VIRTIO, 0x7F, \ + struct vhost_vring_state) #endif -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group
The following patchset introduces a dedicated group for the descriptor table to reduce live migration downtime when a passthrough VQ is being switched to shadow VQ. This RFC v2 is sent to incorporate the early feedback from reviewers on the uAPI and driver API part of changes; the associated driver patch set consuming this API will come around soon along with formal submission of this series. Some initial performance data will be gathered using the real hardware device with mlx5_vdpa. The target goal of this series is to reduce the SVQ switching overhead to less than 300ms on a ~100GB guest with 2 non-mq vhost-vdpa devices. The reduction in the downtime is thanks to avoiding the full remap in the switching. The plan of the intended driver implementation is to use a dedicated group (specifically, 2 in the table below) to host the descriptor tables for data vqs, different from where buffer addresses are contained (in group 0 as below). cvq does not have to allocate a dedicated group for its descriptor table, so its buffers and descriptor table would always belong to the same group (1 in the table below).

              | data vq | ctrl vq
==============+=========+========
vq_group      |    0    |    1
vq_desc_group |    2    |    1

--- Si-Wei Liu (3): vdpa: introduce dedicated descriptor group for virtqueue vhost-vdpa: introduce descriptor group backend feature vhost-vdpa: uAPI to get dedicated descriptor group id drivers/vhost/vdpa.c | 27 +++ include/linux/vdpa.h | 11 +++ include/uapi/linux/vhost.h | 8 include/uapi/linux/vhost_types.h | 5 + 4 files changed, 51 insertions(+) -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 2/3] vhost-vdpa: introduce descriptor group backend feature
Userspace knows if the device has dedicated descriptor group or not by checking this feature bit. It's only exposed if the vdpa driver backend implements the .get_vq_desc_group() operation callback. Userspace trying to negotiate this feature when it or the dependent _F_IOTLB_ASID feature hasn't been exposed will result in an error. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- RFC v1 -> v2: - add clarifications for what areas F_DESC_ASID should cover --- drivers/vhost/vdpa.c | 17 + include/uapi/linux/vhost_types.h | 5 + 2 files changed, 22 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index b43e868..f2e5dce 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -389,6 +389,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v) return ops->resume; } +static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return ops->get_vq_desc_group; +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -679,6 +687,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if (copy_from_user(, featurep, sizeof(features))) return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | +BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME))) return -EOPNOTSUPP; @@ -688,6 +697,12 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_RESUME)) && !vhost_vdpa_can_resume(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && + !(features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID))) + return -EINVAL; + if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && +!vhost_vdpa_has_desc_group(v)) + return -EOPNOTSUPP; vhost_set_backend_features(>vdev, features); return 0; } @@ -741,6 +756,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file 
*filep, features |= BIT_ULL(VHOST_BACKEND_F_SUSPEND); if (vhost_vdpa_can_resume(v)) features |= BIT_ULL(VHOST_BACKEND_F_RESUME); + if (vhost_vdpa_has_desc_group(v)) + features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); if (copy_to_user(featurep, , sizeof(features))) r = -EFAULT; break; diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h index d3aad12a..6acc604 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -181,5 +181,10 @@ struct vhost_vdpa_iova_range { #define VHOST_BACKEND_F_SUSPEND 0x4 /* Device can be resumed */ #define VHOST_BACKEND_F_RESUME 0x5 +/* Device may expose the virtqueue's descriptor area, driver area and + * device area to a different group for ASID binding than where its + * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID. + */ +#define VHOST_BACKEND_F_DESC_ASID0x6 #endif -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RFC v2 1/3] vdpa: introduce dedicated descriptor group for virtqueue
In some cases, the access to the virtqueue's descriptor area, device and driver areas (precluding indirect descriptor table in guest memory) may have to be confined to a different address space than where its buffers reside. Without loss of simplicity and generality with already established terminology, let's fold up these 3 areas and call them as a whole as descriptor table group, or descriptor group for short. Specifically, in case of split virtqueues, descriptor group consists of regions for Descriptor Table, Available Ring and Used Ring; for packed virtqueues layout, descriptor group contains Descriptor Ring, Driver and Device Event Suppression structures. The group ID for a dedicated descriptor group can be obtained through a new .get_vq_desc_group() op. If driver implements this op, it means that the descriptor, device and driver areas of the virtqueue may reside in a dedicated group than where its buffers reside, a.k.a the default virtqueue group through the .get_vq_group() op. In principle, the descriptor group may or may not have same group ID as the default group. Even if the descriptor group has a different ID, meaning the vq's descriptor group areas can optionally move to a separate address space than where guest memory resides, the descriptor group may still start from a default address space, same as where its buffers reside. To move the descriptor group to a different address space, .set_group_asid() has to be called to change the ASID binding for the group, which is no different than what needs to be done on any other virtqueue group. On the other hand, the .reset() semantics also applies on descriptor table group, meaning the device reset will clear all ASID bindings and move all virtqueue groups including descriptor group back to the default address space, i.e. in ASID 0. 
QEMU's shadow virtqueue is going to utilize dedicated descriptor group to speed up map and unmap operations, yielding tremendous downtime reduction by avoiding the full and slow remap cycle in SVQ switching. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- RFC v1 -> v2: - expand commit log to mention downtime reduction in switching - add clarifications for what "descriptor group" covers and whatnot --- include/linux/vdpa.h | 11 +++ 1 file changed, 11 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index db1b0ea..17a4efa 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -204,6 +204,16 @@ struct vdpa_map_file { * @vdev: vdpa device * @idx: virtqueue index * Returns u32: group id for this virtqueue + * @get_vq_desc_group: Get the group id for the descriptor table of + * a specific virtqueue (optional) + * @vdev: vdpa device + * @idx: virtqueue index + * Returns u32: group id for the descriptor table + * portion of this virtqueue. Could be different + * than the one from @get_vq_group, in which case + * the access to the descriptor table can be + * confined to a separate asid, isolating from + * the virtqueue's buffer address access. * @get_device_features: Get virtio features supported by the device * @vdev: vdpa device * Returns the virtio features support by the @@ -357,6 +367,7 @@ struct vdpa_config_ops { /* Device ops */ u32 (*get_vq_align)(struct vdpa_device *vdev); u32 (*get_vq_group)(struct vdpa_device *vdev, u16 idx); + u32 (*get_vq_desc_group)(struct vdpa_device *vdev, u16 idx); u64 (*get_device_features)(struct vdpa_device *vdev); int (*set_driver_features)(struct vdpa_device *vdev, u64 features); u64 (*get_driver_features)(struct vdpa_device *vdev); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] vdpa: consume device_features parameter
On 9/7/2023 5:07 PM, David Ahern wrote: On 9/7/23 2:41 PM, Si-Wei Liu wrote: Hi David, Why this patch doesn't get picked in the last 4 months? Maybe the subject is not clear, but this is an iproute2 patch. Would it be possible to merge at your earliest convenience? PS, adding my R-b to the patch. It got marked "Not applicable": https://patchwork.kernel.org/project/netdevbpf/patch/29db10bca7e5ef6b1137282292660fc337a4323a.1683907102.git.allen.hu...@amd.com/ Resend the patch with any reviewed by tags and be sure to cc me. Just out of my own curiosity, the patch is not applicable simply because the iproute2 was missing from the subject, or the code base somehow got changed that isn't aligned with the patch any more? Thanks, -Siwei ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] vdpa: consume device_features parameter
Hi David, Why this patch doesn't get picked in the last 4 months? Maybe the subject is not clear, but this is an iproute2 patch. Would it be possible to merge at your earliest convenience? PS, adding my R-b to the patch. Thanks, -Siwei On Sat, May 13, 2023 at 12:42 AM Shannon Nelson wrote: > > From: Allen Hubbe > > Consume the parameter to device_features when parsing command line > options. Otherwise the parameter may be used again as an option name. > > # vdpa dev add ... device_features 0xdeadbeef mac 00:11:22:33:44:55 > Unknown option "0xdeadbeef" > > Fixes: a4442ce58ebb ("vdpa: allow provisioning device features") > Signed-off-by: Allen Hubbe > Reviewed-by: Shannon Nelson Reviewed-by: Si-Wei Liu > --- > vdpa/vdpa.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/vdpa/vdpa.c b/vdpa/vdpa.c > index 27647d73d498..8a2fca8647b6 100644 > --- a/vdpa/vdpa.c > +++ b/vdpa/vdpa.c > @@ -353,6 +353,8 @@ static int vdpa_argv_parse(struct vdpa *vdpa, int argc, char **argv, > >device_features); > if (err) > return err; > + > + NEXT_ARG_FWD(); > o_found |= VDPA_OPT_VDEV_FEATURES; > } else { > fprintf(stderr, "Unknown option \"%s\"\n", *argv); > -- > 2.17.1 > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
On 8/22/2023 1:54 AM, Jason Wang wrote: On Thu, Aug 17, 2023 at 7:44 AM Si-Wei Liu wrote: On 8/15/2023 6:48 PM, Jason Wang wrote: On Wed, Aug 16, 2023 at 6:31 AM Si-Wei Liu wrote: On 8/14/2023 7:25 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:45 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 +++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 62b0a01..75092a7 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -406,6 +406,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v) return ops->resume; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; So this means the IOTLB/IOMMU mappings have already been decoupled from the vdpa reset. Not in the sense of API; it's been coupled since day one in the implementations of every on-chip IOMMU parent driver, namely mlx5_vdpa and vdpa_sim. Because of that, later on the (improper) support for virtio-vdpa, from commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa") and 6c3d329e6486 ("vdpa_sim: get rid of DMA ops") misused the .reset() op to realize 1:1 mapping, rendering strong coupling between device reset and reset of iotlb mappings. This series tries to rectify that implementation deficiency, while keeping userspace working with the older kernel behavior. So it should have been noticed by the userspace. Yes, userspace had noticed this on-chip IOMMU discrepancy since day one I suppose. Unfortunately there's already code in userspace with this assumption in mind that proactively tears down and sets up iotlb mapping around vdpa device reset... I guess we can just fix the simulator and mlx5 then we are fine? 
Only IF we don't care about running new QEMU on older kernels with flawed on-chip iommu behavior around reset. But that's a big IF... So what I meant is: Userspace doesn't know whether the vendor specific mappings (set_map) are required or not. And in the implementation of vhost_vdpa, if platform IOMMU is used, the mappings are decoupled from the reset. So if the Qemu works with parents with platform IOMMU it means Qemu can work if we just decouple vendor specific mappings from the parents that uses set_map. I was aware of this, and if you may notice I don't even offer a way backward to retain/emulate the flawed vhost-iotlb reset behavior for older userspace - I consider it more of a bug in .set_map driver implementation of its own rather than what the vhost-vdpa iotlb abstraction wishes to expose to userspace in the first place. That's my understanding as well. If you ever look into QEMU's vhost_vdpa_reset_status() function, you may see memory_listener_unregister() will be called to evict all of the existing iotlb mappings right after vhost_vdpa_reset_device() across device reset, and later on at vhost_vdpa_dev_start(), memory_listener_register() will set up all iotlb mappings again. In an ideal world without this on-chip iommu deficiency QEMU should not have to behave this way - this is what I mentioned earlier that userspace had already noticed the discrepancy and it has to "proactively tear down and set up iotlb mapping around vdpa device reset". Apparently from functionality perspective this trick works completely fine with platform IOMMU, however, it's sub-optimal in the performance perspective. Right. We can't simply fix QEMU by moving this memory_listener_unregister() call out of the reset path unconditionally, as we don't want to break the already-functioning older kernel even though it's suboptimal in performance. I'm not sure how things can be broken in this case? 
Things won't be broken if we don't care about performance; for example, rebooting a large-memory VM (translated to a device reset internally) will freeze the guest and introduce extra reboot delay unnecessarily. If we want to fix the performance by removing memory_listener_unregister() unconditionally and we don't have such a flag to distinguish, we will break network connectivity entirely after reset - as all mappings are purged during reset on older parent drivers. Or why it is specific to parents with set_map. As if without the .reset_map op and a corresponding driver implementation (done the correct way), there's no appropriate means for an on-chip iommu parent driver to persist iotlb mappings across reset, isn't it? If the driver deliberately removes it from .reset, they don't support 1:1 DMA mapping for virtio-vdpa on the other hand, for instance. Instead, to keep new QEMU continuing to work on top of the existing or older kernels, QEMU has to check this IOTLB_PERSIST
Re: [PATCH RFC 1/4] vdpa: introduce .reset_map operation callback
On 8/17/2023 8:28 AM, Eugenio Perez Martin wrote: On Thu, Aug 17, 2023 at 2:05 AM Si-Wei Liu wrote: On 8/15/2023 6:55 PM, Jason Wang wrote: On Wed, Aug 16, 2023 at 3:49 AM Si-Wei Liu wrote: On 8/14/2023 7:21 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:46 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index db1b0ea..3a3878d 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -314,6 +314,12 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping (optional) + * Needed for device that using device + * specific DMA translation (on-chip IOMMU) This exposes the device internal to the upper layer which is not optimal. Not sure what does it mean by "device internal", but this op callback just follows existing convention to describe what vdpa parent this API targets. I meant the bus tries to hide the differences among vendors. So it needs to hide on-chip IOMMU stuff to the upper layer. We can expose two dimensional IO mappings models but it looks like over engineering for this issue. More below. * @set_map:Set device memory mapping (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) : : * @dma_map:Map an area of PA to IOVA (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) * and preferring incremental map. : : * @dma_unmap: Unmap an area of IOVA (optional but * must be implemented with dma_map) * Needed for device that using device * specific DMA translation (on-chip IOMMU) * and preferring incremental unmap. Btw, what's the difference between this and a simple set_map(NULL)? I don't think parent drivers support this today - they can accept non-NULL iotlb containing empty map entry, but not a NULL iotlb. 
The behavior is undefined or it even causes a panic when a NULL iotlb is passed in. We can do this simple change if it can work. If we go with setting up 1:1 DMA mapping at virtio-vdpa .probe() and tearing it down at .release(), perhaps set_map(NULL) is not sufficient. Further, this doesn't work with .dma_map parent drivers. Probably, but I'd remove dma_map as it doesn't have any real users except for the simulator. OK, at one point there was a suggestion to get this incremental API extended to support batching to be on par with or even replace .set_map, not sure if it's too soon to conclude. But I'm okay with the removal if need be. Yes, I think the right move in the long run is to delegate the batching to the parent driver. This allows drivers like mlx to add memory (like hotplugged memory) without the need of tearing down all the old maps. Nods. Having said that, maybe we can work on top if we need to remove .dma_map for now. I guess for that sake I would keep .dma_map unless there's strong objection against it. Thanks, -Siwei The reason why a new op is needed or better is because it allows userspace to tell apart different reset behavior from the older kernel (via the F_IOTLB_PERSIST feature bit in patch 4), while this behavior could vary between parent drivers. I'm ok with a new feature flag, but we need to first seek a way to reuse the existing API. A feature flag is needed anyway. I'm fine with reusing but guess I'd want to converge on the direction first. 
Thanks, -Siwei Thanks Regards, -Siwei Thanks + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -390,6 +396,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16
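The trade-off discussed above - an incremental .dma_map/.dma_unmap API versus a whole-table .set_map replacement - can be illustrated with a small user-space model. This is a sketch only; the names (dma_map, set_map, ranges) are toy stand-ins, not the kernel signatures, and the point is simply that the incremental style lets hotplugged memory be added without disturbing existing entries:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RANGES 8

/* Toy mapping table for one ASID. */
static struct { unsigned long start, last; } ranges[MAX_RANGES];
static size_t nranges;

/* Incremental .dma_map style: a new region (e.g. hotplugged memory)
 * is appended as one extra range; existing ranges are untouched. */
static void dma_map(unsigned long start, unsigned long last)
{
	ranges[nranges].start = start;
	ranges[nranges].last = last;
	nranges++;
}

/* .set_map style: the whole table is replaced in one call, so the
 * parent must tear down and rebuild everything even for a small delta,
 * unless it diffs the old and new tables internally (the "delegate the
 * batching to the parent driver" idea from the thread). */
static void set_map(const unsigned long (*table)[2], size_t n)
{
	nranges = 0;
	for (size_t i = 0; i < n; i++)
		dma_map(table[i][0], table[i][1]);
}

/* Scenario: boot memory mapped, then one region hotplugged. */
static bool incremental_keeps_old(void)
{
	nranges = 0;
	dma_map(0x0, 0xffff);        /* boot memory */
	dma_map(0x100000, 0x1fffff); /* hotplugged region */
	return nranges == 2 && ranges[0].start == 0x0;
}
```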
Re: [PATCH RFC 1/4] vdpa: introduce .reset_map operation callback
On 8/15/2023 6:55 PM, Jason Wang wrote: On Wed, Aug 16, 2023 at 3:49 AM Si-Wei Liu wrote: On 8/14/2023 7:21 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:46 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index db1b0ea..3a3878d 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -314,6 +314,12 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping (optional) + * Needed for device that using device + * specific DMA translation (on-chip IOMMU) This exposes the device internal to the upper layer which is not optimal. Not sure what does it mean by "device internal", but this op callback just follows existing convention to describe what vdpa parent this API targets. I meant the bus tries to hide the differences among vendors. So it needs to hide on-chip IOMMU stuff to the upper layer. We can expose two dimensional IO mappings models but it looks like over engineering for this issue. More below. * @set_map:Set device memory mapping (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) : : * @dma_map:Map an area of PA to IOVA (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) * and preferring incremental map. : : * @dma_unmap: Unmap an area of IOVA (optional but * must be implemented with dma_map) * Needed for device that using device * specific DMA translation (on-chip IOMMU) * and preferring incremental unmap. Btw, what's the difference between this and a simple set_map(NULL)? I don't think parent drivers support this today - they can accept non-NULL iotlb containing empty map entry, but not a NULL iotlb. The behavior is undefined or it even causes panic when a NULL iotlb is passed in. We can do this simple change if it can work. 
If we go with setting up 1:1 DMA mapping at virtio-vdpa .probe() and tearing it down at .release(), perhaps set_map(NULL) is not sufficient. Further this doesn't work with .dma_map parent drivers. Probably, but I'd remove dma_map as it doesn't have any real users except for the simulator. OK, at a point there was suggestion to get this incremental API extended to support batching to be in par with or even replace .set_map, not sure if it's too soon to conclude. But I'm okay with the removal if need be. The reason why a new op is needed or better is because it allows userspace to tell apart different reset behavior from the older kernel (via the F_IOTLB_PERSIST feature bit in patch 4), while this behavior could vary between parent drivers. I'm ok with a new feature flag, but we need to first seek a way to reuse the existing API. A feature flag is needed anyway. I'm fine with reusing but guess I'd want to converge on the direction first. Thanks, -Siwei Thanks Regards, -Siwei Thanks + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -390,6 +396,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
On 8/15/2023 6:48 PM, Jason Wang wrote: On Wed, Aug 16, 2023 at 6:31 AM Si-Wei Liu wrote: On 8/14/2023 7:25 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:45 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 +++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 62b0a01..75092a7 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -406,6 +406,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v) return ops->resume; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; So this means the IOTLB/IOMMU mappings have already been decoupled from the vdpa reset. Not in the sense of API, it's been coupled since day one in the implementations of every on-chip IOMMU parent driver, namely mlx5_vdpa and vdpa_sim. Because of that, later on the (improper) support for virtio-vdpa, from commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa") and 6c3d329e6486 ("vdpa_sim: get rid of DMA ops"), misused the .reset() op to realize 1:1 mapping, rendering strong coupling between device reset and reset of iotlb mappings. This series tries to rectify that implementation deficiency, while keeping userspace continuing to work with the older kernel behavior. So it should have been noticed by the userspace. Yes, userspace had noticed this on-chip IOMMU discrepancy since day one I suppose. Unfortunately there's already code in userspace with this assumption in mind that proactively tears down and sets up iotlb mapping around vdpa device reset... I guess we can just fix the simulator and mlx5 then we are fine? Only IF we don't care about running new QEMU on older kernels with flawed on-chip iommu behavior around reset. But that's a big IF... 
So what I meant is: Userspace doesn't know whether the vendor specific mappings (set_map) are required or not. And in the implementation of vhost_vdpa, if platform IOMMU is used, the mappings are decoupled from the reset. So if the Qemu works with parents with platform IOMMU it means Qemu can work if we just decouple vendor specific mappings from the parents that uses set_map. I was aware of this, and if you may notice I don't even offer a way backward to retain/emulate the flawed vhost-iotlb reset behavior for older userspace - I consider it more of a bug in .set_map driver implementation of its own rather than what the vhost-vdpa iotlb abstraction wishes to expose to userspace in the first place. If you ever look into QEMU's vhost_vdpa_reset_status() function, you may see memory_listener_unregister() will be called to evict all of the existing iotlb mappings right after vhost_vdpa_reset_device() across device reset, and later on at vhost_vdpa_dev_start(), memory_listener_register() will set up all iotlb mappings again. In an ideal world without this on-chip iommu deficiency QEMU should not have to behave this way - this is what I mentioned earlier that userspace had already noticed the discrepancy and it has to "proactively tear down and set up iotlb mapping around vdpa device reset". Apparently from functionality perspective this trick works completely fine with platform IOMMU, however, it's sub-optimal in the performance perspective. We can't simply fix QEMU by moving this memory_listener_unregister() call out of the reset path unconditionally, as we don't want to break the already-functioning older kernel even though it's suboptimal in performance. Instead, to keep new QEMU continuing to work on top of the existing or older kernels, QEMU has to check this IOTLB_PERSIST feature flag to decide whether it is safe not to bother flushing and setting up iotlb across reset. 
For the platform IOMMU case, vdpa parent driver won't implement either the .set_map or .dma_map op, so it should be covered in the vhost_vdpa_has_persistent_map() check I suppose. Thanks, -Siwei
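The capability check discussed in this thread can be modeled outside the kernel. Below is a sketch only - struct toy_config_ops and stub_op are illustrative stand-ins, not the real struct vdpa_config_ops - showing the rule from the quoted patch: mappings persist across reset either because the parent uses the platform IOMMU (neither .set_map nor .dma_map implemented) or because it implements .reset_map:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct vdpa_config_ops: only the three ops that
 * matter for the persistence check. */
struct toy_config_ops {
	int (*set_map)(void);
	int (*dma_map)(void);
	int (*reset_map)(void);
};

static int stub_op(void) { return 0; }

/* Mirrors the check in the quoted patch: platform-IOMMU parents
 * (no vendor mapping ops) persist by construction; on-chip IOMMU
 * parents persist only if they provide .reset_map. */
static bool has_persistent_map(const struct toy_config_ops *ops)
{
	return (!ops->set_map && !ops->dma_map) || ops->reset_map != NULL;
}

/* Platform-IOMMU parent: mappings handled via the vhost iotlb. */
static const struct toy_config_ops platform_parent = { NULL, NULL, NULL };
/* On-chip IOMMU parent without .reset_map (pre-series behavior). */
static const struct toy_config_ops onchip_legacy = { stub_op, NULL, NULL };
/* On-chip IOMMU parent implementing .reset_map (this series). */
static const struct toy_config_ops onchip_new = { stub_op, NULL, stub_op };
```

A userspace like QEMU would of course consult the advertised feature bit rather than the ops; the ops-based check lives inside vhost-vdpa.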
Re: [PATCH RFC 0/3] vdpa: dedicated descriptor table group
On 8/9/2023 8:50 PM, Jason Wang wrote: On Wed, Aug 9, 2023 at 8:56 PM Si-Wei Liu wrote: The following patchset introduces a dedicated group for the descriptor table to reduce live migration downtime when a passthrough VQ is being switched to a shadow VQ. As this RFC set is to seek early feedback on the uAPI and driver API part, for now there's no associated driver patch consuming the API. As soon as the support is in place on both the hardware device and driver, performance data will be shown using a real hardware device. The target goal of this series is to reduce the SVQ switching overhead to less than 300ms on a ~100GB guest with 2 non-mq vhost-vdpa devices. The plan of the intended driver implementation is to use a dedicated group (specifically, 2 in the table below) to host the descriptor table for all data vqs, different from where buffer addresses are contained (in group 0 as below). cvq does not have to allocate a dedicated group for its descriptor table, so its buffers and descriptor table would always belong to the same group (1). I'm fine with this, but I think we need an implementation in the driver (e.g. the simulator). Yes. FWIW for the sake of saving time and getting this series accepted promptly in the upcoming v6.6 merge window, the driver we're going to support along with this series will be mlx5_vdpa in the formal submission, and simulator support may come later if I get spare cycles. Do you foresee any issue without the simulator change? We will have the mlx5_vdpa driver consuming the API for sure; that's the target of this work and it has to be proven working on a real device first. 
Thanks, -Siwei

Thanks

              | data vq | ctrl vq
==============+=========+========
vq_group      |    0    |    1
vq_desc_group |    2    |    1

---
Si-Wei Liu (3):
  vdpa: introduce dedicated descriptor group for virtqueue
  vhost-vdpa: introduce descriptor group backend feature
  vhost-vdpa: uAPI to get dedicated descriptor group id

 drivers/vhost/vdpa.c             | 27 +++
 include/linux/vdpa.h             | 11 +++
 include/uapi/linux/vhost.h       |  8
 include/uapi/linux/vhost_types.h |  5 +
 4 files changed, 51 insertions(+)

--
1.8.3.1
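The group layout in the cover letter's table can be sketched as a toy model of why the dedicated descriptor group helps. The names and the switch_data_vqs_to_svq() helper below are illustrative, not QEMU or kernel API; the idea is that only the descriptor-table group (2) is rebound to a private ASID when shadow vqs take over, while guest buffer mappings in group 0 stay put - avoiding the full remap that dominates the switching downtime:

```c
#include <assert.h>
#include <stdbool.h>

/* Groups from the cover letter's table: data-vq buffers in group 0,
 * cvq (buffers and descriptors) in group 1, data-vq descriptor
 * tables in the dedicated group 2. */
enum { DATA_BUF_GROUP, CVQ_GROUP, DATA_DESC_GROUP, NGROUPS };

static unsigned int group2asid[NGROUPS]; /* all groups start in ASID 0 */

static void set_group_asid(unsigned int group, unsigned int asid)
{
	group2asid[group] = asid;
}

/* When the data vqs are switched to shadow vqs, only the
 * descriptor-table group moves to a private ASID so SVQ descriptor
 * translations can be installed; the guest buffer mappings in
 * group 0 are never torn down. */
static void switch_data_vqs_to_svq(void)
{
	set_group_asid(DATA_DESC_GROUP, 1);
}

static bool buffer_mappings_untouched(void)
{
	switch_data_vqs_to_svq();
	return group2asid[DATA_BUF_GROUP] == 0 &&
	       group2asid[CVQ_GROUP] == 0 &&
	       group2asid[DATA_DESC_GROUP] == 1;
}
```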
Re: [PATCH RFC 2/3] vhost-vdpa: introduce descriptor group backend feature
On 8/9/2023 8:49 PM, Jason Wang wrote: On Wed, Aug 9, 2023 at 8:56 PM Si-Wei Liu wrote: Userspace knows if the device has dedicated descriptor group or not by checking this feature bit. It's only exposed if the vdpa driver backend implements the .get_vq_desc_group() operation callback. Userspace trying to negotiate this feature when it or the dependent _F_IOTLB_ASID feature hasn't been exposed will result in an error. Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + include/uapi/linux/vhost_types.h | 5 + 2 files changed, 22 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index b43e868..f2e5dce 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -389,6 +389,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v) return ops->resume; } +static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return ops->get_vq_desc_group; +} + static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -679,6 +687,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if (copy_from_user(, featurep, sizeof(features))) return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | +BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME))) return -EOPNOTSUPP; @@ -688,6 +697,12 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_RESUME)) && !vhost_vdpa_can_resume(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && + !(features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID))) + return -EINVAL; + if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && +!vhost_vdpa_has_desc_group(v)) + return -EOPNOTSUPP; vhost_set_backend_features(>vdev, features); return 0; } @@ -741,6 +756,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= 
BIT_ULL(VHOST_BACKEND_F_SUSPEND); if (vhost_vdpa_can_resume(v)) features |= BIT_ULL(VHOST_BACKEND_F_RESUME); + if (vhost_vdpa_has_desc_group(v)) + features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); if (copy_to_user(featurep, &features, sizeof(features))) r = -EFAULT; break; diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h index d3aad12a..0856f84 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -181,5 +181,10 @@ struct vhost_vdpa_iova_range { #define VHOST_BACKEND_F_SUSPEND 0x4 /* Device can be resumed */ #define VHOST_BACKEND_F_RESUME 0x5 +/* Device may expose the descriptor table, avail and used ring in a + * different group for ASID binding than the buffers it contains. Nit: s/a different group/different groups/? Yep, I will try to rephrase. Would the below work? "Device may expose virtqueue's descriptor table, avail and used ring in a different group for ASID binding than where buffers it contains reside." Btw, not a native speaker but I think "descriptor" might be confusing since, as you explained above, it contains more than just a descriptor table. Yep. I chose "descriptor" because a packed virtqueue doesn't have "physical" avail and used rings other than the descriptor table, but I am open to a better name. I once thought of "descriptor ring" but that might be too specific to packed virtqueues. Any suggestion? Thanks, -Siwei Thanks + * Requires VHOST_BACKEND_F_IOTLB_ASID. + */ +#define VHOST_BACKEND_F_DESC_ASID 0x6 #endif -- 1.8.3.1
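The two validation checks in the quoted SET_BACKEND_FEATURES hunk can be exercised standalone. A sketch only: check_desc_asid() is a toy function modeling the ordering of the checks (dependency first, -EINVAL; capability second, -EOPNOTSUPP); the DESC_ASID/SUSPEND/RESUME bit values come from the quoted hunks, and 0x3 for IOTLB_ASID is taken from the existing uapi header:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(n) (1ULL << (n))

/* Bit positions per include/uapi/linux/vhost_types.h. */
#define F_IOTLB_ASID 0x3
#define F_DESC_ASID  0x6

/* Models the quoted checks: negotiating DESC_ASID without IOTLB_ASID
 * is a usage error (-EINVAL); negotiating it against a parent that
 * lacks .get_vq_desc_group is unsupported (-EOPNOTSUPP). */
static int check_desc_asid(uint64_t features, bool parent_has_desc_group)
{
	if ((features & BIT_ULL(F_DESC_ASID)) &&
	    !(features & BIT_ULL(F_IOTLB_ASID)))
		return -EINVAL;
	if ((features & BIT_ULL(F_DESC_ASID)) && !parent_has_desc_group)
		return -EOPNOTSUPP;
	return 0;
}
```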
Re: [PATCH RFC 0/3] vdpa: dedicated descriptor table group
On 8/9/2023 7:49 AM, Eugenio Perez Martin wrote: On Wed, Aug 9, 2023 at 2:56 PM Si-Wei Liu wrote: Following patchset introduces dedicated group for descriptor table to reduce live migration downtime when passthrough VQ is being switched to shadow VQ. As this RFC set is to seek early feedback on the uAPI and driver API part, for now there's no associated driver patch consuming the API. As soon as the support is in place on both hardware device and driver, performance data will be shown using a real hardware device. The target goal of this series is to reduce the SVQ switching overhead to less than 300ms on a ~100GB guest with 2 non-mq vhost-vdpa devices. I would expand the cover letter with something in the line of: The reduction in the downtime is thanks to avoiding the full remap in the switching. Sure, will add in the next. The plan of the intended driver implementation is to use a dedicated group (specifically, 2 in the table below) to host the descriptor table for all data vqs, different from where buffer addresses are contained (in group 0 as below). cvq does not have to allocate a dedicated group for its descriptor table, so its buffers and descriptor table would always belong to the same group (1).

              | data vq | ctrl vq
==============+=========+========
vq_group      |    0    |    1
vq_desc_group |    2    |    1

Acked-by: Eugenio Pérez Thanks! -Siwei

---
Si-Wei Liu (3):
  vdpa: introduce dedicated descriptor group for virtqueue
  vhost-vdpa: introduce descriptor group backend feature
  vhost-vdpa: uAPI to get dedicated descriptor group id

 drivers/vhost/vdpa.c             | 27 +++
 include/linux/vdpa.h             | 11 +++
 include/uapi/linux/vhost.h       |  8
 include/uapi/linux/vhost_types.h |  5 +
 4 files changed, 51 insertions(+)

--
1.8.3.1
Re: [PATCH RFC 2/4] vdpa/mlx5: implement .reset_map driver op
On 8/15/2023 1:26 AM, Dragos Tatulea wrote: On Mon, 2023-08-14 at 18:43 -0700, Si-Wei Liu wrote: This patch is based on top of the "vdpa/mlx5: Fixes for ASID handling" series [1]. [1] vdpa/mlx5: Fixes for ASID handling https://lore.kernel.org/virtualization/20230802171231.11001-1-dtatu...@nvidia.com/ Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c | 72 + - drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- 3 files changed, 54 insertions(+), 37 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index b53420e..5c9a25a 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -123,6 +123,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev); void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 5a1971fc..c8d64fc 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -489,21 +489,15 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_mr *mr } } -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return; - prune_iotlb(mvdev); } -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev) { struct mlx5_vdpa_mr *mr = >mr; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return; - if (!mr->initialized) return; @@ -521,8 +515,10 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid) mutex_lock(>mkey_mtx); - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid); - _mlx5_vdpa_destroy_cvq_mr(mvdev, asid); + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) + _mlx5_vdpa_destroy_dvq_mr(mvdev); + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) + _mlx5_vdpa_destroy_cvq_mr(mvdev); mutex_unlock(>mkey_mtx); } @@ -534,25 +530,17 @@ void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev) } static int _mlx5_vdpa_create_cvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return 0; - return dup_iotlb(mvdev, iotlb); } static int _mlx5_vdpa_create_dvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { struct mlx5_vdpa_mr *mr = >mr; int err; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return 0; - if (mr->initialized) return 0; @@ -574,20 +562,18 @@ static int _mlx5_vdpa_create_mr(struct 
mlx5_vdpa_dev *mvdev, { int err; - err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid); - if (err) - return err; - - err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb, asid); - if (err) - goto out_err; + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid); + if (err) + return err; + } + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb); + if (err) + return err; I think you still need the goto here, when CVQ and DVQ fall in same asid and there's a CVQ mr creation error, you are left stuck with the DVQ mr. Yes, you are right, I will fix this in v2. Thank you for spotting this! -Siwei + } return 0; - -out_err: - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid); - - return err; } int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, @@ -601,6 +587,28 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev
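The bug Dragos spotted above - when DVQ and CVQ share an ASID and the CVQ mr creation fails, the already-created DVQ mr is leaked - can be shown with a small control-flow model. This is a sketch only: the create_mr()/destroy helpers below are toy stand-ins with boolean state, not the mlx5_vdpa functions, but the unwind structure matches the fix promised for v2 (restore the goto so the DVQ mr is destroyed on CVQ failure):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state standing in for the DVQ/CVQ memory regions. */
static bool dvq_mr_created;
static bool cvq_mr_created;

static int create_dvq_mr(void) { dvq_mr_created = true; return 0; }
static void destroy_dvq_mr(void) { dvq_mr_created = false; }
static int create_cvq_mr(bool fail)
{
	if (fail)
		return -1;
	cvq_mr_created = true;
	return 0;
}

/* Corrected flow per the review: on CVQ mr failure, unwind the DVQ mr
 * created under the same asid instead of returning directly. */
static int create_mr(bool dvq_in_asid, bool cvq_in_asid, bool cvq_fails)
{
	int err;

	/* Reset toy state so repeated calls are independent. */
	dvq_mr_created = cvq_mr_created = false;

	if (dvq_in_asid) {
		err = create_dvq_mr();
		if (err)
			return err;
	}
	if (cvq_in_asid) {
		err = create_cvq_mr(cvq_fails);
		if (err)
			goto out_err;
	}
	return 0;

out_err:
	if (dvq_in_asid)
		destroy_dvq_mr();
	return err;
}
```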
Re: [PATCH RFC 3/4] vhost-vdpa: should restore 1:1 dma mapping before detaching driver
On 8/14/2023 7:32 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:45 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index b43e868..62b0a01 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,14 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with on-chip IOMMU need to restore iotlb +* to 1:1 identity mapping before vhost-vdpa is going +* to be removed and detached from the device. Give +* them a chance to do so, as this cannot be done +* efficiently via the whole-range unmap call above. +*/ Same question as before, if 1:1 is restored and the userspace doesn't do any IOTLB updating. It looks like a security issue? (Assuming IOVA is PA) This is already flawed independently of this series. It was introduced by the two commits I referenced earlier in the other thread. Today userspace is already able to do so with a device reset without doing any IOTLB update. This series makes it neither worse nor better. FWIW as said earlier, to address this security issue properly we probably should set up the 1:1 DMA mapping in virtio_vdpa_probe() on demand, and tear it down at virtio_vdpa_release_dev(). 
Question is, was virtio-vdpa the only vdpa bus user that needs 1:1 DMA mapping, or is it the other way around, i.e. vhost-vdpa is the only exception among all vdpa bus drivers in not wanting to start with 1:1 by default? Knowing this would help a parent vdpa implementation decide what kind of mapping it should start with upon creation. Regards, -Siwei Thanks + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 1.8.3.1
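The intended semantics from this thread - user mappings persisting across device reset, but the 1:1 identity mapping being restored via .reset_map before vhost-vdpa detaches - can be captured in a toy single-ASID model. A sketch under stated assumptions: MAP_IDENTITY/MAP_USER and the helpers below are illustrative, not driver API, and the persistent-reset behavior models what the series proposes rather than what pre-series mlx5_vdpa/vdpa_sim do:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-ASID model of an on-chip IOMMU parent's mapping state. */
enum map_state { MAP_IDENTITY, MAP_USER };

static enum map_state tlb = MAP_IDENTITY; /* 1:1 by default at creation */

static void set_map_user(void) { tlb = MAP_USER; }

/* .reset_map as called from the quoted vhost_vdpa_remove_as() hunk:
 * restores the 1:1 identity mapping so a later virtio-vdpa (kernel)
 * consumer finds the DMA setup it expects once vhost-vdpa detaches. */
static void reset_map(void) { tlb = MAP_IDENTITY; }

/* Device reset with persistent-iotlb semantics: vq and status state
 * would be wiped here, but the mapping is deliberately left alone. */
static void device_reset(void) { }

static bool user_map_survives_reset(void)
{
	set_map_user();
	device_reset();
	return tlb == MAP_USER;
}

static bool identity_restored_on_detach(void)
{
	set_map_user();
	reset_map();
	return tlb == MAP_IDENTITY;
}
```

This also makes Jason's security question concrete: after reset_map(), the device translates 1:1 again, which is exactly why the thread debates whether that restore belongs at virtio-vdpa probe time instead.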
Re: [PATCH RFC 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
On 8/14/2023 7:25 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:45 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 +++- include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 62b0a01..75092a7 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -406,6 +406,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v) return ops->resume; } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map; So this means the IOTLB/IOMMU mappings have already been decoupled from the vdpa reset. Not in the sense of API, it's been coupled since day one in the implementations of every on-chip IOMMU parent driver, namely mlx5_vdpa and vdpa_sim. Because of that, later on the (improper) support for virtio-vdpa, from commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa") and 6c3d329e6486 ("vdpa_sim: get rid of DMA ops"), misused the .reset() op to realize 1:1 mapping, rendering strong coupling between device reset and reset of iotlb mappings. This series tries to rectify that implementation deficiency, while keeping userspace continuing to work with the older kernel behavior. So it should have been noticed by the userspace. Yes, userspace had noticed this on-chip IOMMU discrepancy since day one I suppose. Unfortunately there's already code in userspace with this assumption in mind that proactively tears down and sets up iotlb mapping around vdpa device reset... I guess we can just fix the simulator and mlx5 then we are fine? Only IF we don't care about running new QEMU on older kernels with flawed on-chip iommu behavior around reset. But that's a big IF... 
Regards, -Siwei Thanks
Re: [PATCH RFC 1/4] vdpa: introduce .reset_map operation callback
On 8/14/2023 7:21 PM, Jason Wang wrote: On Tue, Aug 15, 2023 at 9:46 AM Si-Wei Liu wrote: Signed-off-by: Si-Wei Liu --- include/linux/vdpa.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index db1b0ea..3a3878d 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -314,6 +314,12 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping (optional) + * Needed for device that using device + * specific DMA translation (on-chip IOMMU) This exposes the device internal to the upper layer which is not optimal. Not sure what does it mean by "device internal", but this op callback just follows existing convention to describe what vdpa parent this API targets. * @set_map: Set device memory mapping (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) : : * @dma_map: Map an area of PA to IOVA (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) * and preferring incremental map. : : * @dma_unmap: Unmap an area of IOVA (optional but * must be implemented with dma_map) * Needed for device that using device * specific DMA translation (on-chip IOMMU) * and preferring incremental unmap. Btw, what's the difference between this and a simple set_map(NULL)? I don't think parent drivers support this today - they can accept non-NULL iotlb containing empty map entry, but not a NULL iotlb. The behavior is undefined or it even causes panic when a NULL iotlb is passed in. Further this doesn't work with .dma_map parent drivers. The reason why a new op is needed or better is because it allows userspace to tell apart different reset behavior from the older kernel (via the F_IOTLB_PERSIST feature bit in patch 4), while this behavior could vary between parent drivers. 
Regards, -Siwei Thanks + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -390,6 +396,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 1.8.3.1