Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails
On 2020/12/24 12:37 PM, wangyunjian wrote:
> -----Original Message-----
> From: Jason Wang [mailto:jasow...@redhat.com]
> Sent: Thursday, December 24, 2020 11:10 AM
> To: wangyunjian; net...@vger.kernel.org; m...@redhat.com;
> willemdebruijn.ker...@gmail.com
> Cc: virtualization@lists.linux-foundation.org; Lilijun (Jerry);
> chenchanghu; xudingke; huangbin (J)
> Subject: Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails
>
> On 2020/12/24 10:25 AM, wangyunjian wrote:
>> From: Yunjian Wang
>>
>> Currently the driver doesn't drop a packet which can't be sent by tun
>> (e.g. a bad packet). In this case, the driver will always process the
>> same packet, leaving the tx queue stuck.
>>
>> To fix this issue:
>> 1. in the case of persistent failure (e.g. a bad packet), the driver
>>    can skip this descriptor by ignoring the error.
>> 2. in the case of transient failure (e.g. -EAGAIN and -ENOMEM), the
>>    driver schedules the worker to try again.
>
> I might be wrong, but looking at alloc_skb_with_frags(), it actually
> returns -ENOBUFS:
>
>         *errcode = -ENOBUFS;
>         skb = alloc_skb(header_len, gfp_mask);
>         if (!skb)
>                 return NULL;
>
> Yes, but sock_alloc_send_pskb() returns -EAGAIN when there is no sndbuf
> space. So there is a need to check whether the return value is -EAGAIN,
> -ENOMEM or -ENOBUFS:
>
> struct sk_buff *sock_alloc_send_pskb()
> {
>         ...
>         for (;;) {
>                 ...
>                 sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
>                 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>                 err = -EAGAIN;
>                 if (!timeo)
>                         goto failure;
>                 ...
>         }
>
>         skb = alloc_skb_with_frags(header_len, data_len, max_page_order,
>                                    errcode, sk->sk_allocation);
>         if (skb)
>                 skb_set_owner_w(skb, sk);
>         return skb;
>         ...
>         *errcode = err;
>         return NULL;
> }

-EAGAIN and -ENOBUFS are fine. But I don't see how -ENOMEM can be returned.
Thanks

> Thanks
>
>> Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")
>> Signed-off-by: Yunjian Wang
>> ---
>>  drivers/vhost/net.c | 16 ++++++++--------
>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index c8784dfafdd7..e76245daa7f6 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -827,14 +827,13 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>>                         msg.msg_flags &= ~MSG_MORE;
>>                 }
>>
>> -               /* TODO: Check specific error and bomb out unless ENOBUFS? */
>>                 err = sock->ops->sendmsg(sock, &msg, len);
>> -               if (unlikely(err < 0)) {
>> +               if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
>>                         vhost_discard_vq_desc(vq, 1);
>>                         vhost_net_enable_vq(net, vq);
>>                         break;
>>                 }
>> -               if (err != len)
>> +               if (err >= 0 && err != len)
>>                         pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
>>  done:
>> @@ -922,7 +921,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>>                         msg.msg_flags &= ~MSG_MORE;
>>                 }
>>
>> -               /* TODO: Check specific error and bomb out unless ENOBUFS? */
>>                 err = sock->ops->sendmsg(sock, &msg, len);
>>                 if (unlikely(err < 0)) {
>>                         if (zcopy_used) {
>> @@ -931,11 +929,13 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>>                                 nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
>>                                         % UIO_MAXIOV;
>>                         }
>> -                       vhost_discard_vq_desc(vq, 1);
>> -                       vhost_net_enable_vq(net, vq);
>> -                       break;
>> +                       if (err == -EAGAIN || err == -ENOMEM) {
>> +                               vhost_discard_vq_desc(vq, 1);
>> +                               vhost_net_enable_vq(net, vq);
>> +                               break;
>> +                       }
>>                 }
>> -               if (err != len)
>> +               if (err >= 0 && err != len)
>>                         pr_debug("Truncated TX packet: "
>>                                  " len %d != %zd\n", err, len);
>>                 if (!zcopy_used)

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails
On 2020/12/24 10:25 AM, wangyunjian wrote:
> From: Yunjian Wang
>
> Currently the driver doesn't drop a packet which can't be sent by tun
> (e.g. a bad packet). In this case, the driver will always process the
> same packet, leaving the tx queue stuck.
>
> To fix this issue:
> 1. in the case of persistent failure (e.g. a bad packet), the driver
>    can skip this descriptor by ignoring the error.
> 2. in the case of transient failure (e.g. -EAGAIN and -ENOMEM), the
>    driver schedules the worker to try again.

I might be wrong, but looking at alloc_skb_with_frags(), it actually
returns -ENOBUFS:

        *errcode = -ENOBUFS;
        skb = alloc_skb(header_len, gfp_mask);
        if (!skb)
                return NULL;

Thanks

> Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")
> Signed-off-by: Yunjian Wang
> ---
>  drivers/vhost/net.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index c8784dfafdd7..e76245daa7f6 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -827,14 +827,13 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>                         msg.msg_flags &= ~MSG_MORE;
>                 }
>
> -               /* TODO: Check specific error and bomb out unless ENOBUFS? */
>                 err = sock->ops->sendmsg(sock, &msg, len);
> -               if (unlikely(err < 0)) {
> +               if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
>                         vhost_discard_vq_desc(vq, 1);
>                         vhost_net_enable_vq(net, vq);
>                         break;
>                 }
> -               if (err != len)
> +               if (err >= 0 && err != len)
>                         pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
>  done:
> @@ -922,7 +921,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>                         msg.msg_flags &= ~MSG_MORE;
>                 }
>
> -               /* TODO: Check specific error and bomb out unless ENOBUFS? */
>                 err = sock->ops->sendmsg(sock, &msg, len);
>                 if (unlikely(err < 0)) {
>                         if (zcopy_used) {
> @@ -931,11 +929,13 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>                                 nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
>                                         % UIO_MAXIOV;
>                         }
> -                       vhost_discard_vq_desc(vq, 1);
> -                       vhost_net_enable_vq(net, vq);
> -                       break;
> +                       if (err == -EAGAIN || err == -ENOMEM) {
> +                               vhost_discard_vq_desc(vq, 1);
> +                               vhost_net_enable_vq(net, vq);
> +                               break;
> +                       }
>                 }
> -               if (err != len)
> +               if (err >= 0 && err != len)
>                         pr_debug("Truncated TX packet: "
>                                  " len %d != %zd\n", err, len);
>                 if (!zcopy_used)
Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2020/12/23 10:17 PM, Yongji Xie wrote:
> On Wed, Dec 23, 2020 at 4:08 PM Jason Wang wrote:
>> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> Both the control path and the data path of vDPA devices will be able
>>> to be handled in userspace.
>>>
>>> In the control path, the VDUSE driver will make use of a message
>>> mechanism to forward the config operations from the vdpa bus driver
>>> to userspace. Userspace can use read()/write() to receive/reply to
>>> those control messages.
>>>
>>> In the data path, the VDUSE driver implements an MMU-based on-chip
>>> IOMMU driver which supports mapping the kernel dma buffer to a
>>> userspace iova region dynamically. Userspace can access those iova
>>> regions via mmap(). Besides, the eventfd mechanism is used to trigger
>>> interrupt callbacks and receive virtqueue kicks in userspace.
>>>
>>> Now we only support the virtio-vdpa bus driver with this patch applied.
>>>
>>> Signed-off-by: Xie Yongji
>>> ---
>>>  Documentation/driver-api/vduse.rst                 |   74 ++
>>>  Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>  drivers/vdpa/Kconfig                               |    8 +
>>>  drivers/vdpa/Makefile                              |    1 +
>>>  drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>  drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
>>>  drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
>>>  drivers/vdpa/vdpa_user/iova_domain.c               |  442 ++++++++
>>>  drivers/vdpa/vdpa_user/iova_domain.h               |   93 ++
>>>  drivers/vdpa/vdpa_user/vduse.h                     |   59 ++
>>>  drivers/vdpa/vdpa_user/vduse_dev.c                 | 1121 ++++++++++++++++
>>>  include/uapi/linux/vdpa.h                          |    1 +
>>>  include/uapi/linux/vduse.h                         |   99 ++
>>>  13 files changed, 2173 insertions(+)
>>>  create mode 100644 Documentation/driver-api/vduse.rst
>>>  create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>  create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>>>  create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>>>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>>>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>>>  create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>>>  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>  create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
>>> new file mode 100644
>>> index 000000000000..da9b3040f20a
>>> --- /dev/null
>>> +++ b/Documentation/driver-api/vduse.rst
>>> @@ -0,0 +1,74 @@
>>> +==================================
>>> +VDUSE - "vDPA Device in Userspace"
>>> +==================================
>>> +
>>> +vDPA (virtio data path acceleration) device is a device that uses a
>>> +datapath which complies with the virtio specifications with vendor
>>> +specific control path. vDPA devices can be both physically located on
>>> +the hardware or emulated by software. VDUSE is a framework that makes it
>>> +possible to implement software-emulated vDPA devices in userspace.
>>> +
>>> +How VDUSE works
>>> +---------------
>>> +
>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
>>> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
>>> +to the new resources will be returned, which can be used to implement the
>>> +userspace vDPA device's control path and data path.
>>> +
>>> +To implement control path, the read/write operations to the file descriptor
>>> +will be used to receive/reply the control messages from/to VDUSE driver.
>>> +Those control messages are based on the vdpa_config_ops which defines a
>>> +unified interface to control different types of vDPA device.
>>> +
>>> +The following types of messages are provided by the VDUSE framework now:
>>> +
>>> +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
>>> +
>>> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
>>> +
>>> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
>>> +
>>> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
>>> +
>>> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
>>> +
>>> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
>>> +
>>> +- VDUSE_SET_STATUS: Set the device status
>>> +
>>> +- VDUSE_GET_STATUS: Get the device status
>>> +
>>> +- VDUSE_SET_CONFIG: Write to device specific configuration space
>>> +
>>> +- VDUSE_GET_CONFIG: Read from device specific configuration space
>>> +
>>> +Please see include/linux/vdpa.h for details.
>>> +
>>> +In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
>>> +driver which supports mapping the kernel dma buffer to a userspace iova
>>> +region dynamically. The userspace iova region can be created by passing
>>> +the userspace vDPA device fd to mmap(2).
>>> +
>>> +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
>>> +receive virtqueue kicks in userspace. The following ioctls on the userspace
>>> +vDPA device fd are provided to support that:
>>> +
>>> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
>>> +  by VDUSE driver to notify userspace to consume the vring.
>>> +
>>> +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is u
Re: [External] Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/23 8:14 PM, Yongji Xie wrote:
> On Wed, Dec 23, 2020 at 5:05 PM Jason Wang wrote:
>> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>> To support the vhost-vdpa bus driver, we need a way to share the
>>> vhost-vdpa backend process's memory with the userspace VDUSE process.
>>>
>>> This patch tries to make use of the vhost iotlb message to achieve
>>> that. We will get the shm file from the iotlb message and pass it to
>>> the userspace VDUSE process.
>>>
>>> Signed-off-by: Xie Yongji
>>> ---
>>>  Documentation/driver-api/vduse.rst |  15 +++-
>>>  drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++-
>>>  include/uapi/linux/vduse.h         |  11 +++
>>>  3 files changed, 171 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
>>> index 623f7b040ccf..48e4b1ba353f 100644
>>> --- a/Documentation/driver-api/vduse.rst
>>> +++ b/Documentation/driver-api/vduse.rst
>>> @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
>>>
>>>  - VDUSE_GET_CONFIG: Read from device specific configuration space
>>>
>>> +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
>>> +
>>> +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
>>> +
>>>  Please see include/linux/vdpa.h for details.
>>>
>>> -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
>>> +The data path of userspace vDPA device is implemented in different ways
>>> +depending on the vdpa bus to which it is attached.
>>> +
>>> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
>>>  driver which supports mapping the kernel dma buffer to a userspace iova
>>>  region dynamically. The userspace iova region can be created by passing
>>>  the userspace vDPA device fd to mmap(2).
>>>
>>> +In vhost-vdpa case, the dma buffer resides in a userspace memory region
>>> +which will be shared to the VDUSE userspace process via the file
>>> +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address
>>> +mapping (IOVA of dma buffer <-> VA of the memory region) is also included
>>> +in this message.
>>> +
>>>  Besides, the eventfd mechanism is used to trigger interrupt callbacks and
>>>  receive virtqueue kicks in userspace. The following ioctls on the userspace
>>>  vDPA device fd are provided to support that:
>>>
>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> index b974333ed4e9..d24aaacb6008 100644
>>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> @@ -34,6 +34,7 @@
>>>  struct vduse_dev_msg {
>>>         struct vduse_dev_request req;
>>> +       struct file *iotlb_file;
>>>         struct vduse_dev_response resp;
>>>         struct list_head list;
>>>         wait_queue_head_t waitq;
>>> @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
>>>         return ret;
>>>  }
>>>
>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
>>> +                                 u64 offset, u64 iova, u64 size, u8 perm)
>>> +{
>>> +       struct vduse_dev_msg *msg;
>>> +       int ret;
>>> +
>>> +       if (!size)
>>> +               return -EINVAL;
>>> +
>>> +       msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
>>> +       msg->req.size = sizeof(struct vduse_iotlb);
>>> +       msg->req.iotlb.offset = offset;
>>> +       msg->req.iotlb.iova = iova;
>>> +       msg->req.iotlb.size = size;
>>> +       msg->req.iotlb.perm = perm;
>>> +       msg->req.iotlb.fd = -1;
>>> +       msg->iotlb_file = get_file(file);
>>> +
>>> +       ret = vduse_dev_msg_sync(dev, msg);
>>
>> My feeling is that we should provide a consistent API for the
>> userspace device to use. E.g. we'd better carry the IOTLB message for
>> both virtio/vhost drivers.
>>
>> It looks to me that for virtio drivers we can still use the
>> UPDATE_IOTLB message by using the VDUSE file as msg->iotlb_file here.
>
> It's OK for me. One problem is when to transfer the UPDATE_IOTLB
> message in virtio cases.

Instead of generating IOTLB messages for userspace, how about recording
the mappings (which is a common case for devices that have an on-chip
IOMMU, e.g. mlx5e and the vdpa simulator), and then introducing an ioctl
for userspace to query?

Thanks

>>> +       vduse_dev_msg_put(msg);
>>> +       fput(file);
>>> +
>>> +       return ret;
>>> +}
>>> +
>>> +static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev,
>>> +                                     u64 iova, u64 size)
>>> +{
>>> +       struct vduse_dev_msg *msg;
>>> +       int ret;
>>> +
>>> +       if (!size)
>>> +               return -EINVAL;
>>> +
>>> +       msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB);
>>> +       msg->req.size = sizeof(struct vduse_iotlb);
>>> +       msg->req.iotlb.iova = iova;
>>> +       msg->req.iotlb.size = size;
>>> +
>>> +       ret = vduse_dev_msg_sync(dev, msg);
>>> +       vduse_dev_msg_put(msg);
>>> +
>>> +       return ret;
>>> +}
>>> +
>>> +static unsigned int perm_to_file_flags(u8 perm)
>>> +{
>>> +       unsigned int flags = 0;
>>> +
>>> +       switch (perm) {
>>> +       case VHOST_ACCESS_WO:
>>> +               flags |= O_WRONLY;
>>> +               break;
>>> +       case VHOST_ACCESS_RO:
>>> +               flags |= O_RDONLY;
>>> +               break;
>>> +       case VHOST_ACCESS_RW:
>>> +               flags |= O_RDWR;
>>> +               break;
>>> +       default:
>>> +               WARN(1, "invalidate vhost IOTLB permission\n");
>>> +
Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails
On Wed, Dec 23, 2020 at 9:25 PM wangyunjian wrote:
>
> From: Yunjian Wang
>
> Currently the driver doesn't drop a packet which can't be sent by tun
> (e.g. a bad packet). In this case, the driver will always process the
> same packet, leaving the tx queue stuck.
>
> To fix this issue:
> 1. in the case of persistent failure (e.g. a bad packet), the driver
>    can skip this descriptor by ignoring the error.
> 2. in the case of transient failure (e.g. -EAGAIN and -ENOMEM), the
>    driver schedules the worker to try again.
>
> Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")
> Signed-off-by: Yunjian Wang

Acked-by: Willem de Bruijn

Thanks.
Re: [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
On 2020/12/23 7:06 PM, Yongji Xie wrote:
> On Wed, Dec 23, 2020 at 4:37 PM Jason Wang wrote:
>> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>> This patch introduces a new method in the vdpa_config_ops to support
>>> processing the raw vhost memory mapping message in the vDPA device
>>> driver.
>>>
>>> Signed-off-by: Xie Yongji
>>> ---
>>>  drivers/vhost/vdpa.c | 5 ++++-
>>>  include/linux/vdpa.h | 7 +++++++
>>>  2 files changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>> index 448be7875b6d..ccbb391e38be 100644
>>> --- a/drivers/vhost/vdpa.c
>>> +++ b/drivers/vhost/vdpa.c
>>> @@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>>>         if (r)
>>>                 return r;
>>>
>>> +       if (ops->process_iotlb_msg)
>>> +               return ops->process_iotlb_msg(vdpa, msg);
>>> +
>>>         switch (msg->type) {
>>>         case VHOST_IOTLB_UPDATE:
>>>                 r = vhost_vdpa_process_iotlb_update(v, msg);
>>> @@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
>>>         int ret;
>>>
>>>         /* Device want to do DMA by itself */
>>> -       if (ops->set_map || ops->dma_map)
>>> +       if (ops->set_map || ops->dma_map || ops->process_iotlb_msg)
>>>                 return 0;
>>>
>>>         bus = dma_dev->bus;
>>>
>>> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
>>> index 656fe264234e..7bccedf22f4b 100644
>>> --- a/include/linux/vdpa.h
>>> +++ b/include/linux/vdpa.h
>>> @@ -5,6 +5,7 @@
>>>  #include
>>>  #include
>>>  #include
>>> +#include
>>>  #include
>>>  #include
>>> @@ -172,6 +173,10 @@ struct vdpa_iova_range {
>>>   * @vdev: vdpa device
>>>   *                     Returns the iova range supported by
>>>   *                     the device.
>>> + * @process_iotlb_msg: Process vhost memory mapping message (optional)
>>> + *                     Only used for VDUSE device now
>>> + *                     @vdev: vdpa device
>>> + *                     @msg: vhost memory mapping message
>>>   * @set_map:           Set device memory mapping (optional)
>>>   *                     Needed for device that using device
>>>   *                     specific DMA translation (on-chip IOMMU)
>>> @@ -240,6 +245,8 @@ struct vdpa_config_ops {
>>>         struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);
>>>
>>>         /* DMA ops */
>>> +       int (*process_iotlb_msg)(struct vdpa_device *vdev,
>>> +                                struct vhost_iotlb_msg *msg);
>>>         int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
>>>         int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
>>>                        u64 pa, u32 perm);
>>
>> Is there any reason that it can't be done via dma_map/dma_unmap or
>> set_map?
>
> To get the shmfd, we need the vma rather than the physical address. And
> it's not necessary to pin the user pages in the VDUSE case.

Right, actually, vhost-vDPA is planning to support shared virtual
address space. So let's try to reuse the existing config ops.

How about just introducing an attribute to the vdpa device that tells
the bus it can do shared virtual memory? Then, when the device is probed
by vhost-vDPA, user pages won't be pinned and we will do VA->VA mapping
as the IOVA->PA mapping in the vhost IOTLB and the config ops. The vhost
IOTLB needs to be extended to accept an opaque pointer to store the
file. And the file is passed via the config ops as well.

Thanks

> Thanks,
> Yongji
Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace
On 2020/12/23 6:59 PM, Yongji Xie wrote:
> On Wed, Dec 23, 2020 at 2:38 PM Jason Wang wrote:
>> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>> This series introduces a framework which can be used to implement
>>> vDPA devices in a userspace program. The work consists of two parts:
>>> control path forwarding and data path offloading.
>>>
>>> In the control path, the VDUSE driver will make use of a message
>>> mechanism to forward the config operations from the vdpa bus driver
>>> to userspace. Userspace can use read()/write() to receive/reply to
>>> those control messages.
>>>
>>> In the data path, the core is mapping the dma buffer into the VDUSE
>>> daemon's address space, which can be implemented in different ways
>>> depending on the vdpa bus to which the vDPA device is attached. In
>>> the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver
>>> with a bounce-buffering mechanism to achieve that.
>>
>> Rethinking the bounce buffer stuff: I wonder, instead of using kernel
>> pages with mmap(), how about just using userspace pages like what
>> vhost did? It means we need a worker to do the bouncing, but we don't
>> need to care about annoying stuff like page reclaiming.
>
> Now the I/O bouncing is done in the streaming DMA mapping routines,
> which can be called from interrupt context. If we put this into a
> kworker, that means we need to synchronize with a kworker in an
> interrupt context. I think it can't work.

We just need to make sure the buffer is ready before the user tries to
access it. But I admit it would be tricky (requiring a shadow virtqueue
etc.), which is probably not a good idea.

Thanks

> Thanks,
> Yongji
Re: [PATCH v2] vhost/vsock: add IOTLB API support
On 2020/12/23 10:36 PM, Stefano Garzarella wrote:
> This patch enables the IOTLB API support for vhost-vsock devices,
> allowing the userspace to emulate an IOMMU for the guest.
>
> These changes were made following vhost-net; in detail, this patch:
> - exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb device
>   if the feature is acked
> - implements VHOST_GET_BACKEND_FEATURES and VHOST_SET_BACKEND_FEATURES
>   ioctls
> - calls vq_meta_prefetch() before vq processing to prefetch vq metadata
>   address in IOTLB
> - provides .read_iter, .write_iter, and .poll callbacks for the
>   chardev; they are used by the userspace to exchange IOTLB messages
>
> This patch was tested specifying "intel_iommu=strict" in the guest
> kernel command line. I used QEMU with a patch applied [1] to fix a
> simple issue (that patch was merged in QEMU v5.2.0):
>
>     $ qemu -M q35,accel=kvm,kernel-irqchip=split \
>            -drive file=fedora.qcow2,format=qcow2,if=virtio \
>            -device intel-iommu,intremap=on,device-iotlb=on \
>            -device vhost-vsock-pci,guest-cid=3,iommu_platform=on,ats=on
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg09077.html
>
> Reviewed-by: Stefan Hajnoczi
> Signed-off-by: Stefano Garzarella

Acked-by: Jason Wang

> ---
> The patch is the same as v1, but I re-tested it with:
> - QEMU v5.2.0-551-ga05f8ecd88
> - Linux 5.9.15 (host)
> - Linux 5.9.15 and 5.10.0 (guest)
>
> Now, enabling 'ats' it works well; there are just a few simple changes.
>
> v1: https://www.spinics.net/lists/kernel/msg3716022.html
> v2:
> - updated commit message about QEMU version and string used to test
> - rebased on mst/vhost branch
>
> Thanks,
> Stefano
> ---
>  drivers/vhost/vsock.c | 68 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 65 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index a483cec31d5c..5e78fb719602 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -30,7 +30,12 @@
>  #define VHOST_VSOCK_PKT_WEIGHT 256
>
>  enum {
> -       VHOST_VSOCK_FEATURES = VHOST_FEATURES,
> +       VHOST_VSOCK_FEATURES = VHOST_FEATURES |
> +                              (1ULL << VIRTIO_F_ACCESS_PLATFORM)
> +};
> +
> +enum {
> +       VHOST_VSOCK_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2)
>  };
>
>  /* Used to track all the vhost_vsock instances on the system. */
> @@ -94,6 +99,9 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>         if (!vhost_vq_get_backend(vq))
>                 goto out;
>
> +       if (!vq_meta_prefetch(vq))
> +               goto out;
> +
>         /* Avoid further vmexits, we're already processing the virtqueue */
>         vhost_disable_notify(&vsock->dev, vq);
>
> @@ -449,6 +457,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
>         if (!vhost_vq_get_backend(vq))
>                 goto out;
>
> +       if (!vq_meta_prefetch(vq))
> +               goto out;
> +
>         vhost_disable_notify(&vsock->dev, vq);
>         do {
>                 u32 len;
> @@ -766,8 +777,12 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
>         mutex_lock(&vsock->dev.mutex);
>         if ((features & (1 << VHOST_F_LOG_ALL)) &&
>             !vhost_log_access_ok(&vsock->dev)) {
> -               mutex_unlock(&vsock->dev.mutex);
> -               return -EFAULT;
> +               goto err;
> +       }
> +
> +       if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) {
> +               if (vhost_init_device_iotlb(&vsock->dev, true))
> +                       goto err;
>         }
>
>         for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
> @@ -778,6 +793,10 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
>         }
>         mutex_unlock(&vsock->dev.mutex);
>         return 0;
> +
> +err:
> +       mutex_unlock(&vsock->dev.mutex);
> +       return -EFAULT;
>  }
>
>  static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
> @@ -811,6 +830,18 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
>                 if (copy_from_user(&features, argp, sizeof(features)))
>                         return -EFAULT;
>                 return vhost_vsock_set_features(vsock, features);
> +       case VHOST_GET_BACKEND_FEATURES:
> +               features = VHOST_VSOCK_BACKEND_FEATURES;
> +               if (copy_to_user(argp, &features, sizeof(features)))
> +                       return -EFAULT;
> +               return 0;
> +       case VHOST_SET_BACKEND_FEATURES:
> +               if (copy_from_user(&features, argp, sizeof(features)))
> +                       return -EFAULT;
> +               if (features & ~VHOST_VSOCK_BACKEND_FEATURES)
> +                       return -EOPNOTSUPP;
> +               vhost_set_backend_features(&vsock->dev, features);
> +               return 0;
>         default:
>                 mutex_lock(&vsock->dev.mutex);
>                 r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
> @@ -823,6 +854,34 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
>         }
>  }
>
> +static ssize_t vhost_vsock_chr_read_iter(struct kiocb *i
Re: [PATCH net v3 2/2] vhost_net: fix tx queue stuck when sendmsg fails
On Wed, Dec 23, 2020 at 9:47 AM wangyunjian wrote:
>
> From: Yunjian Wang
>
> Currently the driver don't drop a packet which can't be send by tun
> (e.g bad packet). In this case, the driver will always process the
> same packet lead to the tx queue stuck.
>
> To fix this issue:
> 1. in the case of persistent failure (e.g bad packet), the driver
>    can skip this descriptior by ignoring the error.
> 2. in the case of transient failure (e.g -EAGAIN and -ENOMEM), the
>    driver schedules the worker to try again.
>
> Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")

Since I have a few other comments, a few minor typo corrections too:
don't -> doesn't, send -> sent, descriptior -> descriptor.

> Signed-off-by: Yunjian Wang
> ---
>  drivers/vhost/net.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index c8784dfafdd7..e49dd64d086a 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -827,9 +827,8 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>                         msg.msg_flags &= ~MSG_MORE;
>                 }
>
> -               /* TODO: Check specific error and bomb out unless ENOBUFS? */
>                 err = sock->ops->sendmsg(sock, &msg, len);
> -               if (unlikely(err < 0)) {
> +               if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
>                         vhost_discard_vq_desc(vq, 1);
>                         vhost_net_enable_vq(net, vq);
>                         break;
> @@ -922,7 +921,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>                         msg.msg_flags &= ~MSG_MORE;
>                 }
>
> -               /* TODO: Check specific error and bomb out unless ENOBUFS? */
>                 err = sock->ops->sendmsg(sock, &msg, len);
>                 if (unlikely(err < 0)) {
>                         if (zcopy_used) {
> @@ -931,9 +929,11 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>                                 nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
>                                         % UIO_MAXIOV;
>                         }
> -                       vhost_discard_vq_desc(vq, 1);
> -                       vhost_net_enable_vq(net, vq);
> -                       break;
> +                       if (err == -EAGAIN || err == -ENOMEM) {
> +                               vhost_discard_vq_desc(vq, 1);
> +                               vhost_net_enable_vq(net, vq);
> +                               break;
> +                       }
>                 }
>                 if (err != len)
>                         pr_debug("Truncated TX packet: "

Probably my bad for feedback in patch 2/2, but now vhost will
incorrectly log bad packets as truncated packets. This will need to be
if (err >= 0 && err != len).

It would be nice if we could notify the guest in the transmit descriptor
when a packet was dropped due to failing integrity checks (bad packet).
But I don't think we easily can, so out of scope for this fix.
Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting
On Tue, Dec 22, 2020 at 7:39 PM Liang Li wrote:
>
> > > +hugepage_reporting_cycle(struct page_reporting_dev_info *prdev,
> > > +                        struct hstate *h, unsigned int nid,
> > > +                        struct scatterlist *sgl, unsigned int *offset)
> > > +{
> > > +       struct list_head *list = &h->hugepage_freelists[nid];
> > > +       unsigned int page_len = PAGE_SIZE << h->order;
> > > +       struct page *page, *next;
> > > +       long budget;
> > > +       int ret = 0, scan_cnt = 0;
> > > +
> > > +       /*
> > > +        * Perform early check, if free area is empty there is
> > > +        * nothing to process so we can skip this free_list.
> > > +        */
> > > +       if (list_empty(list))
> > > +               return ret;
> > > +
> > > +       spin_lock_irq(&hugetlb_lock);
> > > +
> > > +       if (huge_page_order(h) > MAX_ORDER)
> > > +               budget = HUGEPAGE_REPORTING_CAPACITY;
> > > +       else
> > > +               budget = HUGEPAGE_REPORTING_CAPACITY * 32;
> >
> > Wouldn't huge_page_order always be more than MAX_ORDER? Seems like we
> > don't even really need budget since this should probably be pulling
> > out no more than one hugepage at a time.
>
> I want to distinguish a 2M page and a 1GB page here. The order of a
> 1GB page is greater than MAX_ORDER while a 2M page's order is less
> than MAX_ORDER.

The budget here is broken. When I put the budget in page reporting it
was so that we wouldn't try to report all of the memory in a given
region. It is meant to hold us to no more than one pass through 1/16 of
the free memory. So essentially we will be slowly processing all of
memory and it will take 16 calls (32 seconds) for us to process a system
that is sitting completely idle. It is meant to pace us so we don't
spend a ton of time doing work that will be undone, not to prevent us
from burying a CPU, which is what seems to be implied here.

Using HUGEPAGE_REPORTING_CAPACITY makes no sense here. I was using it in
the original definition because it was how many pages we could scoop out
at a time, and then I was aiming for a 16th of that. Here you are
arbitrarily squaring HUGEPAGE_REPORTING_CAPACITY in terms of the amount
of work you will do, since you are using it as a multiple instead of a
divisor.

> > > +       /* loop through free list adding unreported pages to sg list */
> > > +       list_for_each_entry_safe(page, next, list, lru) {
> > > +               /* We are going to skip over the reported pages. */
> > > +               if (PageReported(page)) {
> > > +                       if (++scan_cnt >= MAX_SCAN_NUM) {
> > > +                               ret = scan_cnt;
> > > +                               break;
> > > +                       }
> > > +                       continue;
> > > +               }
> > > +
> >
> > It would probably have been better to place this set before your new
> > set. I don't see your new set necessarily being the best use for page
> > reporting.
>
> I haven't really latched on to what you mean, could you explain it
> again?

It would be better for you to spend time understanding how this patch
set works before you go about expanding it to do other things. Mistakes
like the budget one above kind of point out the fact that you don't
understand how this code was supposed to work and just kind of
shoehorned your page zeroing code onto it.

It would be better to look at trying to understand this code first
before you extend it to support your zeroing use case. So adding huge
pages first might make more sense than trying to zero and push the order
down. The fact is the page reporting extension should be minimal for
huge pages since they are just passed as a scatterlist, so you should
only need to add a small bit to page_reporting.c to extend it to support
this use case.

> > > +               /*
> > > +                * If we fully consumed our budget then update our
> > > +                * state to indicate that we are requesting additional
> > > +                * processing and exit this list.
> > > +                */
> > > +               if (budget < 0) {
> > > +                       atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
> > > +                       next = page;
> > > +                       break;
> > > +               }
> > > +
> >
> > If budget is only ever going to be 1 then we probably could just look
> > at making this the default case for any time we find a non-reported
> > page.
>
> and here again.

It comes down to the fact that the changes you made have a significant
impact on how this is supposed to function. Reducing the scatterlist to
a size of one makes the whole point of doing batching kind of pointless.
Basically the code should be rewritten with the assumption that if you
find a page you report it.

The old code would batch things up because there is significant overhead
to be addressed when going to the hypervisor to report said memory. Your
code doesn't seem to really take anything like
[PATCH v2] vhost/vsock: add IOTLB API support
This patch enables the IOTLB API support for vhost-vsock devices, allowing the userspace to emulate an IOMMU for the guest. These changes were made following vhost-net; in detail, this patch: - exposes the VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb device if the feature is acked - implements VHOST_GET_BACKEND_FEATURES and VHOST_SET_BACKEND_FEATURES ioctls - calls vq_meta_prefetch() before vq processing to prefetch vq metadata address in IOTLB - provides .read_iter, .write_iter, and .poll callbacks for the chardev; they are used by the userspace to exchange IOTLB messages This patch was tested specifying "intel_iommu=strict" in the guest kernel command line. I used QEMU with a patch applied [1] to fix a simple issue (that patch was merged in QEMU v5.2.0): $ qemu -M q35,accel=kvm,kernel-irqchip=split \ -drive file=fedora.qcow2,format=qcow2,if=virtio \ -device intel-iommu,intremap=on,device-iotlb=on \ -device vhost-vsock-pci,guest-cid=3,iommu_platform=on,ats=on [1] https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg09077.html Reviewed-by: Stefan Hajnoczi Signed-off-by: Stefano Garzarella --- The patch is the same as v1, but I re-tested it with: - QEMU v5.2.0-551-ga05f8ecd88 - Linux 5.9.15 (host) - Linux 5.9.15 and 5.10.0 (guest) Now, enabling 'ats', it works well; there are just a few simple changes. 
v1: https://www.spinics.net/lists/kernel/msg3716022.html v2: - updated commit message about QEMU version and string used to test - rebased on mst/vhost branch Thanks, Stefano --- drivers/vhost/vsock.c | 68 +-- 1 file changed, 65 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c index a483cec31d5c..5e78fb719602 100644 --- a/drivers/vhost/vsock.c +++ b/drivers/vhost/vsock.c @@ -30,7 +30,12 @@ #define VHOST_VSOCK_PKT_WEIGHT 256 enum { - VHOST_VSOCK_FEATURES = VHOST_FEATURES, + VHOST_VSOCK_FEATURES = VHOST_FEATURES | + (1ULL << VIRTIO_F_ACCESS_PLATFORM) +}; + +enum { + VHOST_VSOCK_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2) }; /* Used to track all the vhost_vsock instances on the system. */ @@ -94,6 +99,9 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock, if (!vhost_vq_get_backend(vq)) goto out; + if (!vq_meta_prefetch(vq)) + goto out; + /* Avoid further vmexits, we're already processing the virtqueue */ vhost_disable_notify(&vsock->dev, vq); @@ -449,6 +457,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work) if (!vhost_vq_get_backend(vq)) goto out; + if (!vq_meta_prefetch(vq)) + goto out; + vhost_disable_notify(&vsock->dev, vq); do { u32 len; @@ -766,8 +777,12 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features) mutex_lock(&vsock->dev.mutex); if ((features & (1 << VHOST_F_LOG_ALL)) && !vhost_log_access_ok(&vsock->dev)) { - mutex_unlock(&vsock->dev.mutex); - return -EFAULT; + goto err; + } + + if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) { + if (vhost_init_device_iotlb(&vsock->dev, true)) + goto err; } for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) { @@ -778,6 +793,10 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features) } mutex_unlock(&vsock->dev.mutex); return 0; + +err: + mutex_unlock(&vsock->dev.mutex); + return -EFAULT; } static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl, @@ -811,6 +830,18 @@ static long 
vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl, if (copy_from_user(&features, argp, sizeof(features))) return -EFAULT; return vhost_vsock_set_features(vsock, features); + case VHOST_GET_BACKEND_FEATURES: + features = VHOST_VSOCK_BACKEND_FEATURES; + if (copy_to_user(argp, &features, sizeof(features))) + return -EFAULT; + return 0; + case VHOST_SET_BACKEND_FEATURES: + if (copy_from_user(&features, argp, sizeof(features))) + return -EFAULT; + if (features & ~VHOST_VSOCK_BACKEND_FEATURES) + return -EOPNOTSUPP; + vhost_set_backend_features(&vsock->dev, features); + return 0; default: mutex_lock(&vsock->dev.mutex); r = vhost_dev_ioctl(&vsock->dev, ioctl, argp); @@ -823,6 +854,34 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl, } } +static ssize_t vhost_vsock_chr_read_iter(struct kiocb *iocb, struct iov_iter *to) +{ + struct file *file = iocb->ki_filp; + struct vhost_
Re: [PATCH net v2 2/2] vhost_net: fix high cpu load when sendmsg fails
On Wed, Dec 23, 2020 at 8:21 AM wangyunjian wrote: > > > -Original Message- > > From: Jason Wang [mailto:jasow...@redhat.com] > > Sent: Wednesday, December 23, 2020 10:54 AM > > To: Willem de Bruijn > > Cc: wangyunjian ; Network Development > > ; Michael S. Tsirkin ; > > virtualization@lists.linux-foundation.org; Lilijun (Jerry) > > ; chenchanghu ; > > xudingke ; huangbin (J) > > > > Subject: Re: [PATCH net v2 2/2] vhost_net: fix high cpu load when sendmsg > > fails > > > > > > On 2020/12/22 下午10:24, Willem de Bruijn wrote: > > > On Mon, Dec 21, 2020 at 11:41 PM Jason Wang > > wrote: > > >> > > >> On 2020/12/22 上午7:07, Willem de Bruijn wrote: > > >>> On Wed, Dec 16, 2020 at 3:20 AM wangyunjian > > wrote: > > From: Yunjian Wang > > > > Currently we break the loop and wake up the vhost_worker when > > sendmsg fails. When the worker wakes up again, we'll meet the same > > error. > > >>> The patch is based on the assumption that such error cases always > > >>> return EAGAIN. Can it not also be ENOMEM, such as from tun_build_skb? > > >>> > > This will cause high CPU load. To fix this issue, we can skip this > > description by ignoring the error. When we exceeds sndbuf, the > > return value of sendmsg is -EAGAIN. In the case we don't skip the > > description and don't drop packet. > > >>> the -> that > > >>> > > >>> here and above: description -> descriptor > > >>> > > >>> Perhaps slightly revise to more explicitly state that > > >>> > > >>> 1. in the case of persistent failure (i.e., bad packet), the driver > > >>> drops the packet 2. in the case of transient failure (e.g,. memory > > >>> pressure) the driver schedules the worker to try again later > > >> > > >> If we want to go with this way, we need a better time to wakeup the > > >> worker. Otherwise it just produces more stress on the cpu that is > > >> what this patch tries to avoid. 
> > > Perhaps I misunderstood the purpose of the patch: is it to drop > > > everything, regardless of transient or persistent failure, until the > > > ring runs out of descriptors? > > > > > > My understanding is that the main motivation is to avoid high cpu > > utilization > > when sendmsg() fails due to a guest reason (e.g bad packet). > > > > My main motivation is to avoid the tx queue getting stuck. > > Should I describe it like this: > Currently the driver doesn't drop a packet which can't be sent by tun > (e.g bad packet). In this case, the driver will always process the > same packet, leading to the tx queue getting stuck. > > To fix this issue: > 1. in the case of persistent failure (e.g bad packet), the driver can skip > this descriptor by ignoring the error. > 2. in the case of transient failure (e.g -EAGAIN and -ENOMEM), the driver > schedules the worker to try again. That sounds good to me, thanks. > Thanks > > > > > > > > I can understand both a blocking and drop strategy during memory > > > pressure. But a partial drop strategy until exceeding ring capacity > > > seems like a peculiar hybrid? > > > > > > Yes. So I wonder if we want to do better when we are under memory > > pressure. E.g can we let the socket wake us up instead of rescheduling the > > workers here? At least in this case we know some memory might be freed? I don't know whether a blocking or drop strategy is the better choice. Either way, it probably deserves to be handled separately. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[GIT PULL] virtio,vdpa: features, cleanups, fixes
The following changes since commit 2c85ebc57b3e1817b6ce1a6b703928e113a90442: Linux 5.10 (2020-12-13 14:41:30 -0800) are available in the Git repository at: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus for you to fetch changes up to 418eddef050d5f6393c303a94e3173847ab85466: vdpa: Use simpler version of ida allocation (2020-12-18 16:14:31 -0500) virtio,vdpa: features, cleanups, fixes vdpa sim refactoring virtio mem Big Block Mode support misc cleanus, fixes Signed-off-by: Michael S. Tsirkin Christophe JAILLET (1): vdpa: ifcvf: Use dma_set_mask_and_coherent to simplify code Dan Carpenter (3): virtio_ring: Cut and paste bugs in vring_create_virtqueue_packed() virtio_net: Fix error code in probe() virtio_ring: Fix two use after free bugs David Hildenbrand (29): virtio-mem: determine nid only once using memory_add_physaddr_to_nid() virtio-mem: more precise calculation in virtio_mem_mb_state_prepare_next_mb() virtio-mem: simplify MAX_ORDER - 1 / pageblock_order handling virtio-mem: drop rc2 in virtio_mem_mb_plug_and_add() virtio-mem: use "unsigned long" for nr_pages when fake onlining/offlining virtio-mem: factor out calculation of the bit number within the subblock bitmap virtio-mem: print debug messages from virtio_mem_send_*_request() virtio-mem: factor out fake-offlining into virtio_mem_fake_offline() virtio-mem: factor out handling of fake-offline pages in memory notifier virtio-mem: retry fake-offlining via alloc_contig_range() on ZONE_MOVABLE virtio-mem: generalize check for added memory virtio-mem: generalize virtio_mem_owned_mb() virtio-mem: generalize virtio_mem_overlaps_range() virtio-mem: drop last_mb_id virtio-mem: don't always trigger the workqueue when offlining memory virtio-mem: generalize handling when memory is getting onlined deferred virito-mem: document Sub Block Mode (SBM) virtio-mem: memory block states are specific to Sub Block Mode (SBM) virito-mem: subblock states are specific to Sub Block Mode (SBM) virtio-mem: 
nb_sb_per_mb and subblock_size are specific to Sub Block Mode (SBM) virtio-mem: memory block ids are specific to Sub Block Mode (SBM) virito-mem: existing (un)plug functions are specific to Sub Block Mode (SBM) virtio-mem: memory notifier callbacks are specific to Sub Block Mode (SBM) virtio-mem: factor out adding/removing memory from Linux virtio-mem: Big Block Mode (BBM) memory hotplug virtio-mem: allow to force Big Block Mode (BBM) and set the big block size mm/memory_hotplug: extend offline_and_remove_memory() to handle more than one memory block virtio-mem: Big Block Mode (BBM) - basic memory hotunplug virtio-mem: Big Block Mode (BBM) - safe memory hotunplug Eli Cohen (1): vdpa/mlx5: Use write memory barrier after updating CQ index Enrico Weigelt, metux IT consult (2): uapi: virtio_ids.h: consistent indentions uapi: virtio_ids: add missing device type IDs from OASIS spec Max Gurtovoy (2): vdpa_sim: remove hard-coded virtq count vdpa: split vdpasim to core and net modules Parav Pandit (2): vdpa: Add missing comment for virtqueue count vdpa: Use simpler version of ida allocation Peng Fan (3): tools/virtio: include asm/bug.h tools/virtio: add krealloc_array tools/virtio: add barrier for aarch64 Stefano Garzarella (16): vdpa: remove unnecessary 'default n' in Kconfig entries vdpa_sim: remove unnecessary headers inclusion vdpa_sim: make IOTLB entries limit configurable vdpa_sim: rename vdpasim_config_ops variables vdpa_sim: add struct vdpasim_dev_attr for device attributes vdpa_sim: add device id field in vdpasim_dev_attr vdpa_sim: add supported_features field in vdpasim_dev_attr vdpa_sim: add work_fn in vdpasim_dev_attr vdpa_sim: store parsed MAC address in a buffer vdpa_sim: make 'config' generic and usable for any device type vdpa_sim: add get_config callback in vdpasim_dev_attr vdpa_sim: add set_config callback in vdpasim_dev_attr vdpa_sim: set vringh notify callback vdpa_sim: use kvmalloc to allocate vdpasim->buffer vdpa_sim: make vdpasim->buffer size 
configurable vdpa_sim: split vdpasim_virtqueue's iov field in out_iov and in_iov Tian Tao (1): vhost_vdpa: switch to vmemdup_user() Zhang Changzhong (1): vhost scsi: fix error return code in vhost_scsi_set_endpoint() drivers/net/virtio_net.c |1 + drivers/vdpa/Kconfig | 18 +- drivers/vdpa/ifcvf/ifcvf_main.c | 11 +- drivers/vdpa/mlx5/net/mlx5_vnet.c|5 + drivers/vdpa/vdpa.c |2 +- drivers/vdpa/vdpa_sim/Mak
[PATCH v2] vdpa_sim: use iova module to allocate IOVA addresses
The identical mapping used until now created issues when mapping different virtual pages with the same physical address. To solve this issue, we can use the iova module, to handle the IOVA allocation. For simplicity we use an IOVA allocator with byte granularity. We add two new functions, vdpasim_map_range() and vdpasim_unmap_range(), to handle the IOVA allocation and the registration into the IOMMU/IOTLB. These functions are used by dma_map_ops callbacks. Acked-by: Jason Wang Signed-off-by: Stefano Garzarella --- v2: - used ULONG_MAX instead of ~0UL [Jason] - fixed typos in comment and patch description [Jason] --- drivers/vdpa/vdpa_sim/vdpa_sim.h | 2 + drivers/vdpa/vdpa_sim/vdpa_sim.c | 108 +++ drivers/vdpa/Kconfig | 1 + 3 files changed, 69 insertions(+), 42 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.h b/drivers/vdpa/vdpa_sim/vdpa_sim.h index b02142293d5b..6efe205e583e 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.h +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.h @@ -6,6 +6,7 @@ #ifndef _VDPA_SIM_H #define _VDPA_SIM_H +#include #include #include #include @@ -55,6 +56,7 @@ struct vdpasim { /* virtio config according to device type */ void *config; struct vhost_iotlb *iommu; + struct iova_domain iova; void *buffer; u32 status; u32 generation; diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index b3fcc67bfdf0..edc930719fb8 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -17,6 +17,7 @@ #include #include #include +#include #include "vdpa_sim.h" @@ -128,30 +129,57 @@ static int dir_to_perm(enum dma_data_direction dir) return perm; } +static dma_addr_t vdpasim_map_range(struct vdpasim *vdpasim, phys_addr_t paddr, + size_t size, unsigned int perm) +{ + struct iova *iova; + dma_addr_t dma_addr; + int ret; + + /* We set the limit_pfn to the maximum (ULONG_MAX - 1) */ + iova = alloc_iova(&vdpasim->iova, size, ULONG_MAX - 1, true); + if (!iova) + return DMA_MAPPING_ERROR; + + dma_addr = 
iova_dma_addr(&vdpasim->iova, iova); + + spin_lock(&vdpasim->iommu_lock); + ret = vhost_iotlb_add_range(vdpasim->iommu, (u64)dma_addr, + (u64)dma_addr + size - 1, (u64)paddr, perm); + spin_unlock(&vdpasim->iommu_lock); + + if (ret) { + __free_iova(&vdpasim->iova, iova); + return DMA_MAPPING_ERROR; + } + + return dma_addr; +} + +static void vdpasim_unmap_range(struct vdpasim *vdpasim, dma_addr_t dma_addr, + size_t size) +{ + spin_lock(&vdpasim->iommu_lock); + vhost_iotlb_del_range(vdpasim->iommu, (u64)dma_addr, + (u64)dma_addr + size - 1); + spin_unlock(&vdpasim->iommu_lock); + + free_iova(&vdpasim->iova, iova_pfn(&vdpasim->iova, dma_addr)); +} + static dma_addr_t vdpasim_map_page(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir, unsigned long attrs) { struct vdpasim *vdpasim = dev_to_sim(dev); - struct vhost_iotlb *iommu = vdpasim->iommu; - u64 pa = (page_to_pfn(page) << PAGE_SHIFT) + offset; - int ret, perm = dir_to_perm(dir); + phys_addr_t paddr = page_to_phys(page) + offset; + int perm = dir_to_perm(dir); if (perm < 0) return DMA_MAPPING_ERROR; - /* For simplicity, use identical mapping to avoid e.g iova -* allocator. 
-*/ - spin_lock(&vdpasim->iommu_lock); - ret = vhost_iotlb_add_range(iommu, pa, pa + size - 1, - pa, dir_to_perm(dir)); - spin_unlock(&vdpasim->iommu_lock); - if (ret) - return DMA_MAPPING_ERROR; - - return (dma_addr_t)(pa); + return vdpasim_map_range(vdpasim, paddr, size, perm); } static void vdpasim_unmap_page(struct device *dev, dma_addr_t dma_addr, @@ -159,12 +187,8 @@ static void vdpasim_unmap_page(struct device *dev, dma_addr_t dma_addr, unsigned long attrs) { struct vdpasim *vdpasim = dev_to_sim(dev); - struct vhost_iotlb *iommu = vdpasim->iommu; - spin_lock(&vdpasim->iommu_lock); - vhost_iotlb_del_range(iommu, (u64)dma_addr, - (u64)dma_addr + size - 1); - spin_unlock(&vdpasim->iommu_lock); + vdpasim_unmap_range(vdpasim, dma_addr, size); } static void *vdpasim_alloc_coherent(struct device *dev, size_t size, @@ -172,27 +196,22 @@ static void *vdpasim_alloc_coherent(struct device *dev, size_t size, unsigned long attrs) { struct vdpasim *vdpasim = dev_to_sim
Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/22 10:52 PM, Xie Yongji wrote: To support the vhost-vdpa bus driver, we need a way to share the vhost-vdpa backend process's memory with the userspace VDUSE process. This patch tries to make use of the vhost iotlb message to achieve that. We will get the shm file from the iotlb message and pass it to the userspace VDUSE process. Signed-off-by: Xie Yongji --- Documentation/driver-api/vduse.rst | 15 +++- drivers/vdpa/vdpa_user/vduse_dev.c | 147 - include/uapi/linux/vduse.h | 11 +++ 3 files changed, 171 insertions(+), 2 deletions(-) diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst index 623f7b040ccf..48e4b1ba353f 100644 --- a/Documentation/driver-api/vduse.rst +++ b/Documentation/driver-api/vduse.rst @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now: - VDUSE_GET_CONFIG: Read from device specific configuration space +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB + +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB + Please see include/linux/vdpa.h for details. -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU +The data path of userspace vDPA device is implemented in different ways +depending on the vdpa bus to which it is attached. + +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU driver which supports mapping the kernel dma buffer to a userspace iova region dynamically. The userspace iova region can be created by passing the userspace vDPA device fd to mmap(2). +In vhost-vdpa case, the dma buffer resides in a userspace memory region +which will be shared with the VDUSE userspace process via the file +descriptor in the VDUSE_UPDATE_IOTLB message. And the corresponding address +mapping (IOVA of dma buffer <-> VA of the memory region) is also included +in this message. + Besides, the eventfd mechanism is used to trigger interrupt callbacks and receive virtqueue kicks in userspace. 
The following ioctls on the userspace vDPA device fd are provided to support that: diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index b974333ed4e9..d24aaacb6008 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -34,6 +34,7 @@ struct vduse_dev_msg { struct vduse_dev_request req; + struct file *iotlb_file; struct vduse_dev_response resp; struct list_head list; wait_queue_head_t waitq; @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev, return ret; } +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file, + u64 offset, u64 iova, u64 size, u8 perm) +{ + struct vduse_dev_msg *msg; + int ret; + + if (!size) + return -EINVAL; + + msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB); + msg->req.size = sizeof(struct vduse_iotlb); + msg->req.iotlb.offset = offset; + msg->req.iotlb.iova = iova; + msg->req.iotlb.size = size; + msg->req.iotlb.perm = perm; + msg->req.iotlb.fd = -1; + msg->iotlb_file = get_file(file); + + ret = vduse_dev_msg_sync(dev, msg); My feeling is that we should provide a consistent API for the userspace device to use. E.g we'd better carry the IOTLB message for both virtio/vhost drivers. It looks to me that for virtio drivers we can still use the VDUSE_UPDATE_IOTLB message by using the VDUSE file as msg->iotlb_file here. 
+ vduse_dev_msg_put(msg); + fput(file); + + return ret; +} + +static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev, + u64 iova, u64 size) +{ + struct vduse_dev_msg *msg; + int ret; + + if (!size) + return -EINVAL; + + msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB); + msg->req.size = sizeof(struct vduse_iotlb); + msg->req.iotlb.iova = iova; + msg->req.iotlb.size = size; + + ret = vduse_dev_msg_sync(dev, msg); + vduse_dev_msg_put(msg); + + return ret; +} + +static unsigned int perm_to_file_flags(u8 perm) +{ + unsigned int flags = 0; + + switch (perm) { + case VHOST_ACCESS_WO: + flags |= O_WRONLY; + break; + case VHOST_ACCESS_RO: + flags |= O_RDONLY; + break; + case VHOST_ACCESS_RW: + flags |= O_RDWR; + break; + default: + WARN(1, "invalidate vhost IOTLB permission\n"); + break; + } + + return flags; +} + static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct file *file = iocb->ki_filp; struct vduse_dev *dev = file->private_data; struct vduse_dev_msg *msg; - int size = sizeof(struct vduse_dev_request); +
Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
[...] >> I was rather saying that for security it's of little use IMHO. >> Application/VM start up time might be improved by using huge pages (and >> pre-zeroing these). Free page reporting might be improved by using >> MADV_FREE instead of MADV_DONTNEED in the hypervisor. >> >>> this feature, above all of them, which one is likely to become the >>> most strong one? From the implementation, you will find it is >>> configurable, users who don't want to use it can turn it off. This is not >>> an option? >> >> Well, we have to maintain the feature and sacrifice a page flag. For >> example, do we expect someone explicitly enabling the feature just to >> speed up startup time of an app that consumes a lot of memory? I highly >> doubt it. > > In our production environment, there are three main applications that have > such a requirement: one is QEMU [creating a VM with SR-IOV passthrough device], > the other two are DPDK related applications, DPDK OVS and SPDK vhost; > for best performance, they populate memory when starting up. For SPDK vhost, > we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for > vhost 'live' upgrade, which is done by killing the old process and > starting a new > one with the new binary. In this case, we want the new process started as > quickly > as possible to shorten the service downtime. We really enable this feature > to speed up startup time for them :) Thanks for info on the use case! All of these use cases either already use, or could use, huge pages IMHO. It's not your ordinary proprietary gaming app :) This is where pre-zeroing of huge pages could already help. Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ... creating a file and pre-zeroing it from another process, or am I missing something important? At least for QEMU this should work AFAIK, where you can just pass the file to be used using memory-backend-file. > >> I'd love to hear opinions of other people. 
(a lot of people are offline >> until beginning of January, including, well, actually me :) ) > > OK. I will wait some time for others' feedback. Happy holidays! To you too, cheers! -- Thanks, David / dhildenb
Re: [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
On 2020/12/22 下午10:52, Xie Yongji wrote: This patch introduces a new method in the vdpa_config_ops to support processing the raw vhost memory mapping message in the vDPA device driver. Signed-off-by: Xie Yongji --- drivers/vhost/vdpa.c | 5 - include/linux/vdpa.h | 7 +++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 448be7875b6d..ccbb391e38be 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, if (r) return r; + if (ops->process_iotlb_msg) + return ops->process_iotlb_msg(vdpa, msg); + switch (msg->type) { case VHOST_IOTLB_UPDATE: r = vhost_vdpa_process_iotlb_update(v, msg); @@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v) int ret; /* Device want to do DMA by itself */ - if (ops->set_map || ops->dma_map) + if (ops->set_map || ops->dma_map || ops->process_iotlb_msg) return 0; bus = dma_dev->bus; diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 656fe264234e..7bccedf22f4b 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -5,6 +5,7 @@ #include #include #include +#include #include #include @@ -172,6 +173,10 @@ struct vdpa_iova_range { *@vdev: vdpa device *Returns the iova range supported by *the device. 
+ * @process_iotlb_msg: Process vhost memory mapping message (optional) + * Only used for VDUSE device now + * @vdev: vdpa device + * @msg: vhost memory mapping message * @set_map: Set device memory mapping (optional) *Needed for device that using device *specific DMA translation (on-chip IOMMU) @@ -240,6 +245,8 @@ struct vdpa_config_ops { struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev); /* DMA ops */ + int (*process_iotlb_msg)(struct vdpa_device *vdev, +struct vhost_iotlb_msg *msg); int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb); int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size, u64 pa, u32 perm); Is there any reason that it can't be done via dma_map/dma_unmap or set_map? Thanks
Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace
On 2020/12/23 2:38 PM, Jason Wang wrote: V1 to V2: - Add vhost-vdpa support I may miss something but I don't see any code to support that. E.g neither set_map nor dma_map/unmap is implemented in the config ops. Thanks Spoke too fast :( I saw that a new config op was introduced. Let me dive into that. Thanks
Re: [RFC v2 01/13] mm: export zap_page_range() for driver use
On Wed, Dec 23, 2020 at 02:32:07PM +0800, Yongji Xie wrote: > Now I want to map/unmap some pages in a userland vma dynamically. The > vm_insert_page() is being used for mapping. In the unmapping case, it > looks like zap_page_range() does what I want. So I export it. > Otherwise, we need some way to notify userspace to trigger it with > madvise(MADV_DONTNEED), which might not be able to meet all our needs. > For example, unmapping some pages in a memory shrinker function. > > So I'd like to know what's the limitation on using zap_page_range() in a > module. And if we can't use it in a module, is there any acceptable > way to achieve that? I think the answer is: don't play funny games with unmapping outside of munmap. Especially as synchronization is very hard to get right.
Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2020/12/22 下午10:52, Xie Yongji wrote: This VDUSE driver enables implementing vDPA devices in userspace. Both control path and data path of vDPA devices will be able to be handled in userspace. In the control path, the VDUSE driver will make use of message mechnism to forward the config operation from vdpa bus driver to userspace. Userspace can use read()/write() to receive/reply those control messages. In the data path, the VDUSE driver implements a MMU-based on-chip IOMMU driver which supports mapping the kernel dma buffer to a userspace iova region dynamically. Userspace can access those iova region via mmap(). Besides, the eventfd mechanism is used to trigger interrupt callbacks and receive virtqueue kicks in userspace Now we only support virtio-vdpa bus driver with this patch applied. Signed-off-by: Xie Yongji --- Documentation/driver-api/vduse.rst | 74 ++ Documentation/userspace-api/ioctl/ioctl-number.rst |1 + drivers/vdpa/Kconfig |8 + drivers/vdpa/Makefile |1 + drivers/vdpa/vdpa_user/Makefile|5 + drivers/vdpa/vdpa_user/eventfd.c | 221 drivers/vdpa/vdpa_user/eventfd.h | 48 + drivers/vdpa/vdpa_user/iova_domain.c | 442 drivers/vdpa/vdpa_user/iova_domain.h | 93 ++ drivers/vdpa/vdpa_user/vduse.h | 59 ++ drivers/vdpa/vdpa_user/vduse_dev.c | 1121 include/uapi/linux/vdpa.h |1 + include/uapi/linux/vduse.h | 99 ++ 13 files changed, 2173 insertions(+) create mode 100644 Documentation/driver-api/vduse.rst create mode 100644 drivers/vdpa/vdpa_user/Makefile create mode 100644 drivers/vdpa/vdpa_user/eventfd.c create mode 100644 drivers/vdpa/vdpa_user/eventfd.h create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h create mode 100644 drivers/vdpa/vdpa_user/vduse.h create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c create mode 100644 include/uapi/linux/vduse.h diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst new file mode 100644 index ..da9b3040f20a --- /dev/null +++ 
b/Documentation/driver-api/vduse.rst @@ -0,0 +1,74 @@ +== +VDUSE - "vDPA Device in Userspace" +== + +vDPA (virtio data path acceleration) device is a device that uses a +datapath which complies with the virtio specifications with vendor +specific control path. vDPA devices can be both physically located on +the hardware or emulated by software. VDUSE is a framework that makes it +possible to implement software-emulated vDPA devices in userspace. + +How VDUSE works + +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on +the VDUSE character device (/dev/vduse). Then a file descriptor pointing +to the new resources will be returned, which can be used to implement the +userspace vDPA device's control path and data path. + +To implement control path, the read/write operations to the file descriptor +will be used to receive/reply the control messages from/to VDUSE driver. +Those control messages are based on the vdpa_config_ops which defines a +unified interface to control different types of vDPA device. + +The following types of messages are provided by the VDUSE framework now: + +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue. + +- VDUSE_SET_VQ_NUM: Set the size of virtqueue + +- VDUSE_SET_VQ_READY: Set ready status of virtqueue + +- VDUSE_GET_VQ_READY: Get ready status of virtqueue + +- VDUSE_SET_FEATURES: Set virtio features supported by the driver + +- VDUSE_GET_FEATURES: Get virtio features supported by the device + +- VDUSE_SET_STATUS: Set the device status + +- VDUSE_GET_STATUS: Get the device status + +- VDUSE_SET_CONFIG: Write to device specific configuration space + +- VDUSE_GET_CONFIG: Read from device specific configuration space + +Please see include/linux/vdpa.h for details. + +In the data path, VDUSE framework implements a MMU-based on-chip IOMMU +driver which supports mapping the kernel dma buffer to a userspace iova +region dynamically. 
The userspace iova region can be created by passing +the userspace vDPA device fd to mmap(2). + +Besides, the eventfd mechanism is used to trigger interrupt callbacks and +receive virtqueue kicks in userspace. The following ioctls on the userspace +vDPA device fd are provided to support that: + +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used + by VDUSE driver to notify userspace to consume the vring. + +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used + by userspace to notify VDUSE driver to trigger interrupt callbacks. + +MMU-based IOMMU Driver +-