Re: [External] Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/23 8:14 PM, Yongji Xie wrote: On Wed, Dec 23, 2020 at 5:05 PM Jason Wang wrote: On 2020/12/22 10:52 PM, Xie Yongji wrote: To support vhost-vdpa bus driver, we need a way to share the vhost-vdpa backend process's memory with the userspace VDUSE process. This patch tries to make use of the vhost iotlb message to achieve that. We will get the shm file from the iotlb message and pass it to the userspace VDUSE process. Signed-off-by: Xie Yongji --- Documentation/driver-api/vduse.rst | 15 +++- drivers/vdpa/vdpa_user/vduse_dev.c | 147 - include/uapi/linux/vduse.h | 11 +++ 3 files changed, 171 insertions(+), 2 deletions(-) diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst index 623f7b040ccf..48e4b1ba353f 100644 --- a/Documentation/driver-api/vduse.rst +++ b/Documentation/driver-api/vduse.rst @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now: - VDUSE_GET_CONFIG: Read from device specific configuration space +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB + +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB + Please see include/linux/vdpa.h for details. -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU +The data path of userspace vDPA device is implemented in different ways +depending on the vdpa bus to which it is attached. + +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU driver which supports mapping the kernel dma buffer to a userspace iova region dynamically. The userspace iova region can be created by passing the userspace vDPA device fd to mmap(2). +In vhost-vdpa case, the dma buffer resides in a userspace memory region +which will be shared to the VDUSE userspace process via the file +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address +mapping (IOVA of dma buffer <-> VA of the memory region) is also included +in this message. + Besides, the eventfd mechanism is used to trigger interrupt callbacks and receive virtqueue kicks in userspace. The following ioctls on the userspace vDPA device fd are provided to support that: diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index b974333ed4e9..d24aaacb6008 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -34,6 +34,7 @@ struct vduse_dev_msg { struct vduse_dev_request req; + struct file *iotlb_file; struct vduse_dev_response resp; struct list_head list; wait_queue_head_t waitq; @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev, return ret; } +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file, + u64 offset, u64 iova, u64 size, u8 perm) +{ + struct vduse_dev_msg *msg; + int ret; + + if (!size) + return -EINVAL; + + msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB); + msg->req.size = sizeof(struct vduse_iotlb); + msg->req.iotlb.offset = offset; + msg->req.iotlb.iova = iova; + msg->req.iotlb.size = size; + msg->req.iotlb.perm = perm; + msg->req.iotlb.fd = -1; + msg->iotlb_file = get_file(file); + + ret = vduse_dev_msg_sync(dev, msg); My feeling is that we should provide a consistent API for the userspace device to use. E.g. we'd better carry the IOTLB message for both virtio/vhost drivers. It looks to me for virtio drivers we can still use the UPDATE_IOTLB message by using the VDUSE file as msg->iotlb_file here. It's OK for me. One problem is when to transfer the UPDATE_IOTLB message in virtio cases.
Instead of generating IOTLB messages for userspace, how about recording the mappings (which is a common case for devices that have an on-chip IOMMU, e.g. mlx5e and the vdpa simulator)? Then we can introduce an ioctl for userspace to query. Thanks + vduse_dev_msg_put(msg); + fput(file); + + return ret; +} + +static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev, + u64 iova, u64 size) +{ + struct vduse_dev_msg *msg; + int ret; + + if (!size) + return -EINVAL; + + msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB); + msg->req.size = sizeof(struct vduse_iotlb); + msg->req.iotlb.iova = iova; + msg->req.iotlb.size = size; + + ret = vduse_dev_msg_sync(dev, msg); + vduse_dev_msg_put(msg); + + return ret; +} + +static unsigned int perm_to_file_flags(u8 perm) +{ + unsigned int flags = 0; + + switch (perm) { + case VHOST_ACCESS_WO: + flags |= O_WRONLY; + break; + case VHOST_ACCESS_RO: + flags |= O_RDONLY; + break; + case VHOST_ACCESS_RW: + flags |= O_RDWR; + break; + default: +
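A sketch of the userspace side of this exchange may help: on VDUSE_UPDATE_IOTLB, the VDUSE process mmap()s the fd carried in the message and remembers the IOVA<->VA translation. This is a minimal sketch only, assuming the kernel installs a valid fd into the process before read() returns; field names follow the structs in this patch, error handling omitted.

    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct iotlb_mapping {
            uint64_t iova;
            uint64_t size;
            void *va;
    };

    static int handle_update_iotlb(const struct vduse_dev_request *req,
                                   struct iotlb_mapping *m)
    {
            int prot = 0;

            switch (req->iotlb.perm) {
            case VHOST_ACCESS_RO: prot = PROT_READ; break;
            case VHOST_ACCESS_WO: prot = PROT_WRITE; break;
            case VHOST_ACCESS_RW: prot = PROT_READ | PROT_WRITE; break;
            }

            m->iova = req->iotlb.iova;
            m->size = req->iotlb.size;
            /* map the shared region at the offset carried in the message */
            m->va = mmap(NULL, req->iotlb.size, prot, MAP_SHARED,
                         req->iotlb.fd, req->iotlb.offset);
            close(req->iotlb.fd); /* the mapping holds its own reference */
            return m->va == MAP_FAILED ? -1 : 0;
    }

    /* translating a guest IOVA is then just: m->va + (iova - m->iova) */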
Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2020/12/23 10:17 PM, Yongji Xie wrote: On Wed, Dec 23, 2020 at 4:08 PM Jason Wang wrote: On 2020/12/22 10:52 PM, Xie Yongji wrote: This VDUSE driver enables implementing vDPA devices in userspace. Both control path and data path of vDPA devices will be able to be handled in userspace. In the control path, the VDUSE driver will make use of a message mechanism to forward the config operation from the vdpa bus driver to userspace. Userspace can use read()/write() to receive/reply to those control messages. In the data path, the VDUSE driver implements a MMU-based on-chip IOMMU driver which supports mapping the kernel dma buffer to a userspace iova region dynamically. Userspace can access those iova regions via mmap(). Besides, the eventfd mechanism is used to trigger interrupt callbacks and receive virtqueue kicks in userspace. Now we only support virtio-vdpa bus driver with this patch applied. Signed-off-by: Xie Yongji --- Documentation/driver-api/vduse.rst | 74 ++ Documentation/userspace-api/ioctl/ioctl-number.rst |1 + drivers/vdpa/Kconfig |8 + drivers/vdpa/Makefile |1 + drivers/vdpa/vdpa_user/Makefile|5 + drivers/vdpa/vdpa_user/eventfd.c | 221 drivers/vdpa/vdpa_user/eventfd.h | 48 + drivers/vdpa/vdpa_user/iova_domain.c | 442 drivers/vdpa/vdpa_user/iova_domain.h | 93 ++ drivers/vdpa/vdpa_user/vduse.h | 59 ++ drivers/vdpa/vdpa_user/vduse_dev.c | 1121 include/uapi/linux/vdpa.h |1 + include/uapi/linux/vduse.h | 99 ++ 13 files changed, 2173 insertions(+) create mode 100644 Documentation/driver-api/vduse.rst create mode 100644 drivers/vdpa/vdpa_user/Makefile create mode 100644 drivers/vdpa/vdpa_user/eventfd.c create mode 100644 drivers/vdpa/vdpa_user/eventfd.h create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h create mode 100644 drivers/vdpa/vdpa_user/vduse.h create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c create mode 100644 include/uapi/linux/vduse.h diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst new file mode 100644 index ..da9b3040f20a --- /dev/null +++ b/Documentation/driver-api/vduse.rst @@ -0,0 +1,74 @@ +== +VDUSE - "vDPA Device in Userspace" +== + +vDPA (virtio data path acceleration) device is a device that uses a +datapath which complies with the virtio specifications with vendor +specific control path. vDPA devices can be both physically located on +the hardware or emulated by software. VDUSE is a framework that makes it +possible to implement software-emulated vDPA devices in userspace. + +How VDUSE works + +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on +the VDUSE character device (/dev/vduse). Then a file descriptor pointing +to the new resources will be returned, which can be used to implement the +userspace vDPA device's control path and data path. + +To implement control path, the read/write operations to the file descriptor +will be used to receive/reply the control messages from/to VDUSE driver. +Those control messages are based on the vdpa_config_ops which defines a +unified interface to control different types of vDPA device. + +The following types of messages are provided by the VDUSE framework now: + +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
+ +- VDUSE_SET_VQ_NUM: Set the size of virtqueue + +- VDUSE_SET_VQ_READY: Set ready status of virtqueue + +- VDUSE_GET_VQ_READY: Get ready status of virtqueue + +- VDUSE_SET_FEATURES: Set virtio features supported by the driver + +- VDUSE_GET_FEATURES: Get virtio features supported by the device + +- VDUSE_SET_STATUS: Set the device status + +- VDUSE_GET_STATUS: Get the device status + +- VDUSE_SET_CONFIG: Write to device specific configuration space + +- VDUSE_GET_CONFIG: Read from device specific configuration space + +Please see include/linux/vdpa.h for details. + +In the data path, VDUSE framework implements a MMU-based on-chip IOMMU +driver which supports mapping the kernel dma buffer to a userspace iova +region dynamically. The userspace iova region can be created by passing +the userspace vDPA device fd to mmap(2). + +Besides, the eventfd mechanism is used to trigger interrupt callbacks and +receive virtqueue kicks in userspace. The following ioctls on the userspace +vDPA device fd are provided to support that: + +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used + by VDUSE driver to notify userspace to consume the vring. + +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used + by userspace to notify VDUSE driver to trigger interrupt callbacks.
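To make the control path above concrete, a minimal sketch of the userspace side: read() a request from the device fd, dispatch on its type, write() back a response, and register a kick eventfd. Struct and ioctl names come from this patch; the request-id field name and the vq eventfd struct layout are assumptions, and the per-message handlers are placeholders.

    #include <string.h>
    #include <sys/eventfd.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    static void vduse_control_loop(int dev_fd)
    {
            struct vduse_dev_request req;
            struct vduse_dev_response resp;

            while (read(dev_fd, &req, sizeof(req)) == sizeof(req)) {
                    memset(&resp, 0, sizeof(resp));
                    resp.request_id = req.request_id; /* field name assumed */
                    switch (req.type) {
                    case VDUSE_SET_VQ_NUM:
                            /* record the virtqueue size ... */
                            break;
                    case VDUSE_GET_FEATURES:
                            /* fill resp with the device features ... */
                            break;
                    /* ... the other message types listed above ... */
                    }
                    write(dev_fd, &resp, sizeof(resp));
            }
    }

    static int setup_kickfd(int dev_fd, int vq_index)
    {
            /* struct name/layout assumed */
            struct vduse_vq_eventfd cfg = {
                    .index = vq_index,
                    .fd = eventfd(0, EFD_NONBLOCK),
            };

            return ioctl(dev_fd, VDUSE_VQ_SETUP_KICKFD, &cfg);
    }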
Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails
On 2020/12/24 10:25 AM, wangyunjian wrote: From: Yunjian Wang Currently the driver doesn't drop a packet which can't be sent by tun (e.g bad packet). In this case, the driver will always process the same packet, leading to the tx queue getting stuck. To fix this issue: 1. in the case of persistent failure (e.g bad packet), the driver can skip this descriptor by ignoring the error. 2. in the case of transient failure (e.g -EAGAIN and -ENOMEM), the driver schedules the worker to try again. I might be wrong but looking at alloc_skb_with_frags(), it returns -ENOBUFS actually: *errcode = -ENOBUFS; skb = alloc_skb(header_len, gfp_mask); if (!skb) return NULL; Thanks Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server") Signed-off-by: Yunjian Wang --- drivers/vhost/net.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index c8784dfafdd7..e76245daa7f6 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -827,14 +827,13 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock) msg.msg_flags &= ~MSG_MORE; } - /* TODO: Check specific error and bomb out unless ENOBUFS? */ err = sock->ops->sendmsg(sock, &msg, len); - if (unlikely(err < 0)) { + if (unlikely(err == -EAGAIN || err == -ENOMEM)) { vhost_discard_vq_desc(vq, 1); vhost_net_enable_vq(net, vq); break; } - if (err != len) + if (err >= 0 && err != len) pr_debug("Truncated TX packet: len %d != %zd\n", err, len); done: @@ -922,7 +921,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) msg.msg_flags &= ~MSG_MORE; } - /* TODO: Check specific error and bomb out unless ENOBUFS? */ err = sock->ops->sendmsg(sock, &msg, len); if (unlikely(err < 0)) { if (zcopy_used) { @@ -931,11 +929,13 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) nvq->upend_idx = ((unsigned)nvq->upend_idx - 1) % UIO_MAXIOV; } - vhost_discard_vq_desc(vq, 1); - vhost_net_enable_vq(net, vq); - break; + if (err == -EAGAIN || err == -ENOMEM) { + vhost_discard_vq_desc(vq, 1); + vhost_net_enable_vq(net, vq); + break; + } } - if (err != len) + if (err >= 0 && err != len) pr_debug("Truncated TX packet: " " len %d != %zd\n", err, len); if (!zcopy_used)
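The rule the patch encodes, restated as a standalone helper (a sketch of the policy only; the thread below goes on to refine exactly which errno values count as transient):

    /* transient failures: requeue the descriptor and let the worker retry;
     * anything else: drop the packet so the tx queue keeps making progress */
    static bool tx_sendmsg_error_is_transient(int err)
    {
            return err == -EAGAIN || err == -ENOMEM;
    }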
Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails
On 2020/12/24 12:37 PM, wangyunjian wrote: [...] Yes, but the sock_alloc_send_pskb() returns -EAGAIN when no sndbuf space. So there is a need to check the return value, which is -EAGAIN or -ENOMEM or -ENOBUFS? struct sk_buff *sock_alloc_send_pskb() { ... for (;;) { ... sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk); set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); err = -EAGAIN; if (!timeo) goto failure; ... } skb = alloc_skb_with_frags(header_len, data_len, max_page_order, errcode, sk->sk_allocation); if (skb) skb_set_owner_w(skb, sk); return skb; ... *errcode = err; return NULL; } -EAGAIN and -ENOBUFS are fine. But I don't see how -ENOMEM can be returned. Thanks
Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails
On 2020/12/24 5:09 PM, wangyunjian wrote: -EAGAIN and -ENOBUFS are fine. But I don't see how -ENOMEM can be returned. The tun_build_skb() and tun_napi_alloc_frags() return -ENOMEM when memory allocation fails. Thanks You're right. So I think we need to check them all. In the future, we need to think of a better way. I feel such a check is kind of fragile since people may modify the core sock code without looking at what vhost does. Thanks
Re: [PATCH net v2] tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
On 2020/12/25 10:52 AM, wangyunjian wrote: From: Yunjian Wang Currently the tun_napi_alloc_frags() function returns -ENOMEM when the number of iovs exceeds MAX_SKB_FRAGS + 1. However this is inappropriate, we should use -EMSGSIZE instead of -ENOMEM. The following distinctions matter: 1. the caller needs to drop the bad packet when -EMSGSIZE is returned, which means meeting a persistent failure. 2. the caller can try again when -ENOMEM is returned, which means meeting a transient failure. Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Yunjian Wang Acked-by: Willem de Bruijn --- v2: * update commit log suggested by Willem de Bruijn --- drivers/net/tun.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 2dc1988a8973..15c6dd7fb04c 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1365,7 +1365,7 @@ static struct sk_buff *tun_napi_alloc_frags(struct tun_file *tfile, int i; if (it->nr_segs > MAX_SKB_FRAGS + 1) - return ERR_PTR(-ENOMEM); + return ERR_PTR(-EMSGSIZE); local_bh_disable(); skb = napi_get_frags(&tfile->napi); Acked-by: Jason Wang
Re: [PATCH net v4 1/2] vhost_net: fix ubuf refcount incorrectly when sendmsg fails
On 2020/12/24 10:25 AM, wangyunjian wrote: From: Yunjian Wang Currently the vhost_zerocopy_callback() may be called to decrease the refcount when sendmsg fails in tun. The error handling in vhost handle_tx_zerocopy() will try to decrease the same refcount again. This is wrong. To fix this issue, we only call vhost_net_ubuf_put() when vq->heads[nvq->desc].len == VHOST_DMA_IN_PROGRESS. Fixes: 0690899b4d45 ("tun: experimental zero copy tx support") Signed-off-by: Yunjian Wang Acked-by: Willem de Bruijn Acked-by: Michael S. Tsirkin Acked-by: Jason Wang --- drivers/vhost/net.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 531a00d703cd..c8784dfafdd7 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -863,6 +863,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) size_t len, total_len = 0; int err; struct vhost_net_ubuf_ref *ubufs; + struct ubuf_info *ubuf; bool zcopy_used; int sent_pkts = 0; @@ -895,9 +896,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) /* use msg_control to pass vhost zerocopy ubuf info to skb */ if (zcopy_used) { - struct ubuf_info *ubuf; ubuf = nvq->ubuf_info + nvq->upend_idx; - vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head); vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS; ubuf->callback = vhost_zerocopy_callback; @@ -927,7 +926,8 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) err = sock->ops->sendmsg(sock, &msg, len); if (unlikely(err < 0)) { if (zcopy_used) { - vhost_net_ubuf_put(ubufs); + if (vq->heads[ubuf->desc].len == VHOST_DMA_IN_PROGRESS) + vhost_net_ubuf_put(ubufs); nvq->upend_idx = ((unsigned)nvq->upend_idx - 1) % UIO_MAXIOV; }
Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/24 3:37 PM, Yongji Xie wrote: On Thu, Dec 24, 2020 at 10:41 AM Jason Wang wrote: On 2020/12/23 8:14 PM, Yongji Xie wrote: On Wed, Dec 23, 2020 at 5:05 PM Jason Wang wrote: On 2020/12/22 10:52 PM, Xie Yongji wrote: [...] My feeling is that we should provide a consistent API for the userspace device to use. E.g. we'd better carry the IOTLB message for both virtio/vhost drivers. It looks to me for virtio drivers we can still use the UPDATE_IOTLB message by using the VDUSE file as msg->iotlb_file here. It's OK for me. One problem is when to transfer the UPDATE_IOTLB message in virtio cases. Instead of generating IOTLB messages for userspace, how about recording the mappings (which is a common case for devices that have an on-chip IOMMU, e.g. mlx5e and the vdpa simulator)? Then we can introduce an ioctl for userspace to query. If so, the IOTLB UPDATE is actually triggered by ioctl, but IOTLB_INVALIDATE is triggered by the message. Is it a little odd? Good point. Some questions here: 1) Userspace thinks the IOTLB was flushed after the IOTLB_INVALIDATE syscall is returned. But this patch doesn't have this guarantee. Can this lead to some issues? 2) I think after VDUSE userspace receives IOTLB_INVALIDATE, it needs to issue a munmap(). What if it doesn't do that? Or how about triggering it when userspace calls mmap() on the device fd? One possible issue is that the IOTLB_UPDATE/INVALIDATE might come after mmap(): 1) When vIOMMU is enabled 2) Guest memory topology has been changed (memory hotplug). Thanks Thanks, Yongji
Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2020/12/24 4:34 PM, Yongji Xie wrote: Yes, the disadvantage is the performance. But it should be simpler (I guess) and we know it can succeed. Yes, another advantage is that we can support the VM using anonymous memory. Exactly. I think I can try this in v3. And the MMU-based IOMMU implementation can be a future optimization in the virtio-vdpa case. What's your opinion? Maybe I was wrong, but I think we can try what has been proposed here first and use the shadow virtqueue as a backup plan if we fail. OK, I will continue to work on this proposal. Thanks
Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/25 10:37 AM, Yongji Xie wrote: On Thu, Dec 24, 2020 at 3:37 PM Yongji Xie wrote: On Thu, Dec 24, 2020 at 10:41 AM Jason Wang wrote: [...] My feeling is that we should provide a consistent API for the userspace device to use. E.g. we'd better carry the IOTLB message for both virtio/vhost drivers. It looks to me for virtio drivers we can still use the UPDATE_IOTLB message by using the VDUSE file as msg->iotlb_file here. It's OK for me. One problem is when to transfer the UPDATE_IOTLB message in virtio cases. Instead of generating IOTLB messages for userspace, how about recording the mappings (which is a common case for devices that have an on-chip IOMMU, e.g. mlx5e and the vdpa simulator)? Then we can introduce an ioctl for userspace to query. If so, the IOTLB UPDATE is actually triggered by ioctl, but IOTLB_INVALIDATE is triggered by the message. Is it a little odd? Or how about triggering it when userspace calls mmap() on the device fd? Oh sorry, looks like mmap() needs to be called in the IOTLB UPDATE message handler. Is it possible for the vdpa device to know which vdpa bus it is attached to? We'd better not. It's kind of a layering violation. Thanks So that we can generate this message during probing. Otherwise, we don't know whether the iova domain of the MMU-based IOMMU is needed. Thanks, Yongji
Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/25 6:31 PM, Yongji Xie wrote: On Fri, Dec 25, 2020 at 2:58 PM Jason Wang wrote: On 2020/12/24 3:37 PM, Yongji Xie wrote: [...] It looks to me for virtio drivers we can still use the UPDATE_IOTLB message by using the VDUSE file as msg->iotlb_file here. It's OK for me. One problem is when to transfer the UPDATE_IOTLB message in virtio cases. Instead of generating IOTLB messages for userspace, how about recording the mappings (which is a common case for devices that have an on-chip IOMMU, e.g. mlx5e and the vdpa simulator)? Then we can introduce an ioctl for userspace to query. If so, the IOTLB UPDATE is actually triggered by ioctl, but IOTLB_INVALIDATE is triggered by the message. Is it a little odd? Good point. Some questions here: 1) Userspace thinks the IOTLB was flushed after the IOTLB_INVALIDATE syscall is returned. But this patch doesn't have this guarantee. Can this lead to some issues? I'm not sure. But should it be guaranteed in VDUSE userspace? Just like what a vhost-user backend process does. I think so. This is because the userspace device needs a way to synchronize invalidation with its datapath. Otherwise, the guest may think the buffer is freed to be used by other parts but it actually can still be used by the VDUSE device. That may cause security issues. 2) I think after VDUSE userspace receives IOTLB_INVALIDATE, it needs to issue a munmap(). What if it doesn't do that?
Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/28 4:14 PM, Yongji Xie wrote: I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE is expected to be synchronous. This needs to be solved by tweaking the current VDUSE API or we can re-visit to go with descriptors relaying first. Actually all vdpa related operations are synchronous in the current implementation. The ops.set_map/dma_map/dma_unmap should not return until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied by userspace. Could it solve this problem? I was thinking whether or not we need to generate the IOTLB_INVALIDATE message to VDUSE during dma_unmap (vduse_dev_unmap_page). If we don't, we're probably fine. Thanks Thanks, Yongji
Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
- Original Message - > On Mon, Dec 28, 2020 at 4:43 PM Jason Wang wrote: > > > > > > On 2020/12/28 4:14 PM, Yongji Xie wrote: > > >> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE > > >> is expected to be synchronous. This needs to be solved by tweaking the > > >> current VDUSE API or we can re-visit to go with descriptors relaying > > >> first. > > >> > > > Actually all vdpa related operations are synchronous in the current > > > implementation. The ops.set_map/dma_map/dma_unmap should not return > > > until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied > > > by userspace. Could it solve this problem? > > > > > > I was thinking whether or not we need to generate the IOTLB_INVALIDATE > > message to VDUSE during dma_unmap (vduse_dev_unmap_page). > > > > If we don't, we're probably fine. > > > > It seems not feasible. This message will be also used in the > virtio-vdpa case to notify userspace to unmap some pages during > consistent dma unmapping. Maybe we can document it to make sure the > users can handle the message correctly. Just to make sure I understand your point. Do you mean you plan to notify the unmap of 1) streaming DMA or 2) coherent DMA? For 1) you probably need a workqueue to do that since dma unmap can be done in irq or bh context. And if userspace doesn't do the unmap, it can still access the bounce buffer (if you don't zap pte)? Thanks > > Thanks, > Yongji
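A sketch of the workqueue indirection suggested above, so that an unmap notification triggered from irq/bh context is relayed to userspace from process context, where it is safe to sleep waiting for the reply. Names other than vduse_dev_invalidate_iotlb() (which this RFC defines) are illustrative.

    struct vduse_unmap_work {
            struct work_struct work;
            struct vduse_dev *dev;
            u64 iova;
            u64 size;
    };

    static void vduse_unmap_workfn(struct work_struct *work)
    {
            struct vduse_unmap_work *uw =
                    container_of(work, struct vduse_unmap_work, work);

            /* process context: safe to block on the userspace reply */
            vduse_dev_invalidate_iotlb(uw->dev, uw->iova, uw->size);
            kfree(uw);
    }

    /* called from dma_unmap, possibly in irq or bh context */
    static void vduse_queue_unmap(struct vduse_dev *dev, u64 iova, u64 size)
    {
            struct vduse_unmap_work *uw = kzalloc(sizeof(*uw), GFP_ATOMIC);

            if (!uw)
                    return;
            uw->dev = dev;
            uw->iova = iova;
            uw->size = size;
            INIT_WORK(&uw->work, vduse_unmap_workfn);
            schedule_work(&uw->work);
    }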
Re: [PATCH rfc 2/3] virtio-net: support receive timestamp
- Original Message - > On Mon, Dec 28, 2020 at 7:55 PM Michael S. Tsirkin wrote: > > > > On Mon, Dec 28, 2020 at 02:30:31PM -0500, Willem de Bruijn wrote: > > > On Mon, Dec 28, 2020 at 12:29 PM Michael S. Tsirkin > > > wrote: > > > > > > > > On Mon, Dec 28, 2020 at 11:22:32AM -0500, Willem de Bruijn wrote: > > > > > From: Willem de Bruijn > > > > > > > > > > Add optional PTP hardware timestamp offload for virtio-net. > > > > > > > > > > Accurate RTT measurement requires timestamps close to the wire. > > > > > Introduce virtio feature VIRTIO_NET_F_RX_TSTAMP. If negotiated, the > > > > > virtio-net header is expanded with room for a timestamp. A host may > > > > > pass receive timestamps for all or some packets. A timestamp is valid > > > > > if non-zero. > > > > > > > > > > The timestamp straddles (virtual) hardware domains. Like PTP, use > > > > > international atomic time (CLOCK_TAI) as global clock base. It is > > > > > guest responsibility to sync with host, e.g., through kvm-clock. > > > > > > > > > > Signed-off-by: Willem de Bruijn > > > > > diff --git a/include/uapi/linux/virtio_net.h > > > > > b/include/uapi/linux/virtio_net.h > > > > > index f6881b5b77ee..0ffe2eeebd4a 100644 > > > > > --- a/include/uapi/linux/virtio_net.h > > > > > +++ b/include/uapi/linux/virtio_net.h > > > > > @@ -57,6 +57,7 @@ > > > > >* Steering */ > > > > > #define VIRTIO_NET_F_CTRL_MAC_ADDR 23/* Set MAC address */ > > > > > > > > > > +#define VIRTIO_NET_F_RX_TSTAMP 55/* Host sends TAI receive time */ > > > > > #define VIRTIO_NET_F_TX_HASH 56/* Guest sends hash report */ > > > > > #define VIRTIO_NET_F_HASH_REPORT 57 /* Supports hash report */ > > > > > #define VIRTIO_NET_F_RSS 60/* Supports RSS RX steering */ > > > > > @@ -182,6 +183,17 @@ struct virtio_net_hdr_v1_hash { > > > > > }; > > > > > }; > > > > > > > > > > +struct virtio_net_hdr_v12 { > > > > > + struct virtio_net_hdr_v1 hdr; > > > > > + struct { > > > > > + __le32 value; > > > > > + __le16 report; > > > > > + __le16 flow_state; > > > > > + } hash; > > > > > + __virtio32 reserved; > > > > > + __virtio64 tstamp; > > > > > +}; > > > > > + > > > > > #ifndef VIRTIO_NET_NO_LEGACY > > > > > /* This header comes first in the scatter-gather list. > > > > > * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated, it > > > > > must > > > > > > > > > > > > So it looks like VIRTIO_NET_F_RX_TSTAMP should depend on both > > > > VIRTIO_NET_F_RX_TSTAMP and VIRTIO_NET_F_HASH_REPORT then? > > > > > > Do you mean VIRTIO_NET_F_TX_TSTAMP depends on VIRTIO_NET_F_RX_TSTAMP? > > > > > > I think if either is enabled we need to enable the extended layout. > > > Regardless of whether TX_HASH or HASH_REPORT are enabled. If they are > > > not, then those fields are ignored. > > > > I mean do we waste the 8 bytes for hash if TSTAMP is set but HASH is not? > > If TSTAMP depends on HASH then point is moot. > > True, but the two features really are independent. > > I did consider using configurable metadata layout depending on > features negotiated. If there are tons of optional extensions, that > makes sense. But it is more complex and parsing error prone. With a > handful of options each of a few bytes, that did not seem worth the > cost to me at this point. Consider NFV workloads (64-byte packets). Most fields of the current vnet header are even a burden. It might be time to introduce a configurable or self-descriptive vnet header. > > And importantly, such a mode can always be added later as a separate > VIRTIO_NET_F_PACKED_HEADER option. What would the feature provide?
Thanks > > If anything, perhaps if we increase the virtio_net_hdr_* allocation, > we should allocate some additional reserved space now? As each new > size introduces quite a bit of boilerplate. Also, e.g., in qemu just > to pass the sizes between virtio-net driver and vhost-net device.
Re: [PATCH 07/21] vdpa: multiple address spaces support
On 2020/12/29 3:28 PM, Eli Cohen wrote: @@ -43,6 +43,8 @@ struct vdpa_vq_state { * @index: device index * @features_valid: were features initialized? for legacy guests * @nvqs: the number of virtqueues + * @ngroups: the number of virtqueue groups + * @nas: the number of address spaces I am not sure these can be categorised as part of the state of the VQ. It's more of a property so maybe we can have a callback to get the properties of the VQ? Moreover, I think they should be handled in the hardware drivers to return 1 for both ngroups and nas. We can, but those are static attributes that are not expected to be changed by the driver. Introducing callbacks for such static attributes does not give us any advantage. For the group id and notification area, since we don't have an abstraction of vdpa_virtqueue, the only choice is to introduce callbacks for them. Maybe it's time to introduce vdpa_virtqueue. Thanks
Re: [PATCH 07/21] vdpa: multiple address spaces support
On 2020/12/29 3:28 PM, Eli Cohen wrote: @@ -43,6 +43,8 @@ struct vdpa_vq_state { * @index: device index * @features_valid: were features initialized? for legacy guests * @nvqs: the number of virtqueues + * @ngroups: the number of virtqueue groups + * @nas: the number of address spaces I am not sure these can be categorised as part of the state of the VQ. It's more of a property so maybe we can have a callback to get the properties of the VQ? Or maybe there's a misunderstanding of the patch. Those two attributes belong to the vdpa_device instead of vdpa_vq_state actually. Thanks
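For clarity, the two counters sit on struct vdpa_device itself, next to nvqs; the kernel-doc hunk above just happens to land under the vdpa_vq_state context line in the diff. An abridged sketch (field types assumed from the uAPI's use of unsigned int):

    struct vdpa_device {
            /* ... */
            unsigned int index;
            bool features_valid;
            int nvqs;
            u32 ngroups; /* number of virtqueue groups, type assumed */
            u32 nas;     /* number of address spaces, type assumed */
            /* ... */
    };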
Re: [PATCH 10/21] vhost: support ASID in IOTLB API
On 2020/12/29 6:20 PM, Eli Cohen wrote: -static int vhost_process_iotlb_msg(struct vhost_dev *dev, +static int vhost_process_iotlb_msg(struct vhost_dev *dev, u16 asid, struct vhost_iotlb_msg *msg) { int ret = 0; + if (asid != 0) + return -EINVAL; + mutex_lock(&dev->mutex); vhost_dev_lock_vqs(dev); switch (msg->type) { @@ -1135,6 +1138,7 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, struct vhost_iotlb_msg msg; size_t offset; int type, ret; + u16 asid = 0; You assume asid occupies just 16 bits. So maybe you should reserve the other 16 bits for future extension: struct vhost_msg_v2 { __u32 type; - __u32 reserved; + __u16 asid; + __u16 reserved; union { Moreover, maybe this should be reflected in previous patches that use the asid: -static int mlx5_vdpa_set_map(struct vdpa_device *vdev, struct vhost_iotlb *iotlb) +static int mlx5_vdpa_set_map(struct vdpa_device *vdev, u16 asid, +struct vhost_iotlb *iotlb) -static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, +static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u16 asid, struct vhost_iotlb_msg *msg) etc. Good catch. This is actually a bug in the code, since I want to stick to 32 bits, which is large enough for e.g. a PASID. Will fix. Thanks
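What the fix might look like given the preference stated above for a full 32-bit asid (large enough for, e.g., a PASID): take the whole reserved field instead of splitting it, and pass u32 asid through the callbacks accordingly.

    struct vhost_msg_v2 {
            __u32 type;
            __u32 asid; /* was: __u32 reserved */
            union {
                    struct vhost_iotlb_msg iotlb;
                    __u8 padding[64];
            };
    };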
Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/29 6:26 PM, Yongji Xie wrote: On Tue, Dec 29, 2020 at 5:11 PM Jason Wang wrote: [...] I was thinking whether or not we need to generate the IOTLB_INVALIDATE message to VDUSE during dma_unmap (vduse_dev_unmap_page). If we don't, we're probably fine. It seems not feasible. This message will be also used in the virtio-vdpa case to notify userspace to unmap some pages during consistent dma unmapping. Maybe we can document it to make sure the users can handle the message correctly. Just to make sure I understand your point. Do you mean you plan to notify the unmap of 1) streaming DMA or 2) coherent DMA? For 1) you probably need a workqueue to do that since dma unmap can be done in irq or bh context. And if userspace doesn't do the unmap, it can still access the bounce buffer (if you don't zap pte)? I plan to do it in the coherent DMA case. Any reason for treating coherent DMA differently? It's true that userspace can access the dma buffer if userspace doesn't do the unmap. But the dma pages would not be freed and reused unless user space called munmap() for them. I wonder whether or not we could recycle the IOVA in this case to avoid the IOTLB_UNMAP message. Thanks Thanks, Yongji
Re: [PATCH 11/21] vhost-vdpa: introduce asid based IOTLB
On 2020/12/29 7:41 PM, Eli Cohen wrote: On Wed, Dec 16, 2020 at 02:48:08PM +0800, Jason Wang wrote: This patch converts the vhost-vDPA device to support multiple IOTLBs tagged via ASID via hlist. This will be used for supporting multiple address spaces in the following patches. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 106 --- 1 file changed, 80 insertions(+), 26 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index feb6a58df22d..060d5b5b7e64 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -33,13 +33,21 @@ enum { #define VHOST_VDPA_DEV_MAX (1U << MINORBITS) +#define VHOST_VDPA_IOTLB_BUCKETS 16 + +struct vhost_vdpa_as { + struct hlist_node hash_link; + struct vhost_iotlb iotlb; + u32 id; +}; + struct vhost_vdpa { struct vhost_dev vdev; struct iommu_domain *domain; struct vhost_virtqueue *vqs; struct completion completion; struct vdpa_device *vdpa; - struct vhost_iotlb *iotlb; + struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS]; struct device dev; struct cdev cdev; atomic_t opened; @@ -49,12 +57,64 @@ struct vhost_vdpa { struct eventfd_ctx *config_ctx; int in_batch; struct vdpa_iova_range range; + int used_as; This is not really used, not in this patch, and it is later removed. Right, will remove this. Thanks
Re: [PATCH 11/21] vhost-vdpa: introduce asid based IOTLB
On 2020/12/29 8:05 PM, Eli Cohen wrote: + +static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) The return value is never interpreted. I think it should either be made void or its return value checked. Right, will make it void. +{ + struct vhost_vdpa_as *as = asid_to_as(v, asid); + + /* Remove default address space is not allowed */ + if (asid == 0) + return -EINVAL; Can you explain why? I think you have a memory leak due to this, as no one will ever free the as with id 0. Looks like a bug. Will remove this. Thanks
Re: [PATCH 11/21] vhost-vdpa: introduce asid based IOTLB
On 2020/12/29 7:53 PM, Eli Cohen wrote: + +static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 asid) +{ + struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; + struct vhost_vdpa_as *as; + + if (asid_to_as(v, asid)) + return NULL; + + as = kmalloc(sizeof(*as), GFP_KERNEL); kzalloc()? See comment below. + if (!as) + return NULL; + + vhost_iotlb_init(&as->iotlb, 0, 0); + as->id = asid; + hlist_add_head(&as->hash_link, head); + ++v->used_as; Although you eventually ended up removing used_as, this is a bug since you're incrementing a random value. Maybe it's better to be on the safe side and use kzalloc() for as above. Yes, and used_as needs to be removed. Thanks
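For reference, the allocation with both review comments applied: zeroed memory and no used_as counter. A sketch against the code quoted above:

    static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v,
                                                     u32 asid)
    {
            struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS];
            struct vhost_vdpa_as *as;

            if (asid_to_as(v, asid))
                    return NULL;

            /* kzalloc: no uninitialized fields to trip over later */
            as = kzalloc(sizeof(*as), GFP_KERNEL);
            if (!as)
                    return NULL;

            vhost_iotlb_init(&as->iotlb, 0, 0);
            as->id = asid;
            hlist_add_head(&as->hash_link, head);

            return as;
    }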
Re: [PATCH 12/21] vhost-vdpa: introduce uAPI to get the number of virtqueue groups
On 2020/12/29 8:24 PM, Eli Cohen wrote: On Wed, Dec 16, 2020 at 02:48:09PM +0800, Jason Wang wrote: Following the vDPA support for multiple address spaces, this patch introduces a uAPI for userspace to know the number of virtqueue groups supported by the vDPA device. Can you explain what exactly you mean by userspace? It's the userspace that uses the uAPI introduced in this patch. Is it just qemu or is it destined to the virtio_net driver run by the qemu process? It could be Qemu, DPDK or another userspace program. The guest virtio-net driver will not use this but talks to the virtio device emulated by Qemu. Also can you say for what purpose? This can be used to facilitate checking whether the control vq can be supported. E.g. if the device supports fewer than 2 groups, qemu won't advertise the control vq. Yes, #groups could be inferred from GET_VRING_GROUP. But it's not as straightforward as this. Thanks Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 4 include/uapi/linux/vhost.h | 3 +++ 2 files changed, 7 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 060d5b5b7e64..1ba5901b28e7 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -536,6 +536,10 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, case VHOST_VDPA_GET_VRING_NUM: r = vhost_vdpa_get_vring_num(v, argp); break; + case VHOST_VDPA_GET_GROUP_NUM: + r = copy_to_user(argp, &v->vdpa->ngroups, +sizeof(v->vdpa->ngroups)); + break; case VHOST_SET_LOG_BASE: case VHOST_SET_LOG_FD: r = -ENOIOCTLCMD; diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index 59c6c0fbaba1..8a4e6e426bbf 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -145,4 +145,7 @@ /* Get the valid iova range */ #define VHOST_VDPA_GET_IOVA_RANGE _IOR(VHOST_VIRTIO, 0x78, \ struct vhost_vdpa_iova_range) +/* Get the number of virtqueue groups. */ +#define VHOST_VDPA_GET_GROUP_NUM _IOR(VHOST_VIRTIO, 0x79, unsigned int) + #endif -- 2.25.1
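From the backend side (qemu, dpdk, or a test program), the control-vq check described above could look like this; the device node path is an assumption.

    #include <fcntl.h>
    #include <stdbool.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    static bool vdpa_can_isolate_cvq(const char *path)
    {
            unsigned int ngroups = 0;
            int fd = open(path, O_RDWR); /* e.g. "/dev/vhost-vdpa-0", path assumed */

            if (fd < 0)
                    return false;
            if (ioctl(fd, VHOST_VDPA_GET_GROUP_NUM, &ngroups) < 0)
                    ngroups = 1; /* older kernel: assume a single group */
            close(fd);
            /* isolating a control vq in its own address space needs >= 2 groups */
            return ngroups >= 2;
    }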
Re: [PATCH rfc 2/3] virtio-net: support receive timestamp
On 2020/12/29 10:20 PM, Willem de Bruijn wrote: On Tue, Dec 29, 2020 at 4:23 AM Jason Wang wrote: [...] I did consider using configurable metadata layout depending on features negotiated. If there are tons of optional extensions, that makes sense. But it is more complex and parsing error prone. With a handful of options each of a few bytes, that did not seem worth the cost to me at this point. Consider NFV workloads (64-byte packets). Most fields of the current vnet header are even a burden. Such workloads will not negotiate these features, I imagine. I think this could only become an issue if a small packet workload would want to negotiate one optional feature, without the others. Yes. Even then, the actual cost of a few untouched bytes is negligible, as long as the headers don't span cache-lines. Right. It looks to me that with hash report support the current header already exceeds 32 bytes, which is the cacheline size for some archs. And it can be imagined that after two or more features it could be larger than 64 bytes. I expect it to be cheaper than having to parse a dynamic layout. The only overhead is probably analyzing "type" which should be cheap I guess.
It might be time to introduce a configurable or self-descriptive vnet header. I did briefly toy with a type-val encoding. Not type-val-len, as all options have fixed length (so far), so length can be left implicit. But as each feature is negotiated beforehand, the type can be left implicit too. Leaving just the data elements tightly packed. As long as order is defined. Right, so in this case there should be even no overhead. And importantly, such a mode can always be added later as a separate VIRTIO_NET_F_PACKED_HEADER option. What would the feature provide? The tightly packed data elements. Strip out the elements for non negotiated features. So if either tstamp option is negotiated, but hash is not, exclude the 64b of hash fields. Note that I'm not actually arguing for this (at this point). I second this. Thanks
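A toy illustration of the tightly packed idea: with a fixed field order, each optional element's offset follows from the negotiated feature bits alone. Feature and struct names are taken from this thread; this is a sketch, not a proposed ABI.

    #include <stdint.h>

    static size_t rx_tstamp_offset(uint64_t features)
    {
            size_t off = sizeof(struct virtio_net_hdr_v1);

            /* hash block: 32-bit value + 16-bit report + 16-bit flow_state */
            if (features & (1ULL << VIRTIO_NET_F_HASH_REPORT))
                    off += 8;
            return off;
    }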
Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/30 3:09 PM, Yongji Xie wrote: On Wed, Dec 30, 2020 at 2:11 PM Jason Wang wrote: [...] I plan to do it in the coherent DMA case. Any reason for treating coherent DMA differently? Now the memory of the bounce buffer is allocated page by page in the page fault handler. So it can't be used in the coherent DMA mapping case which needs some memory with contiguous virtual addresses. I can use vmalloc() to do the allocation for the bounce buffer instead. But it might cause some memory waste. Any suggestion? I may miss something. But I don't see a relationship between the IOTLB_UNMAP and vmalloc(). It's true that userspace can access the dma buffer if userspace doesn't do the unmap. But the dma pages would not be freed and reused unless user space called munmap() for them. I wonder whether or not we could recycle the IOVA in this case to avoid the IOTLB_UNMAP message. We can achieve that if we use vmalloc() to do the allocation for the bounce buffer which can be used in the coherent DMA mapping case. But it looks like we still have no way to avoid the IOTLB_UNMAP message in the vhost-vdpa case. I think that's fine. For virtio-vdpa, from the VDUSE userspace perspective, it works like a driver that is using SWIOTLB in this case. Thanks Thanks, Yongji
Re: [PATCH rfc 2/3] virtio-net: support receive timestamp
On 2020/12/29 8:57 AM, Willem de Bruijn wrote: On Mon, Dec 28, 2020 at 5:59 PM Jakub Kicinski wrote: On Mon, 28 Dec 2020 11:22:32 -0500 Willem de Bruijn wrote: [...] Would this not be confusing to some user space SW to have a NIC with no PHC deliver HW stamps? I'd CC Richard on this, unless you already discussed with him offline. Thanks, good point. I should have included Richard. There is a well understood method for synchronizing guest and host clocks in KVM using ptp_kvm. For virtual environments without NIC hardware offload, when the host timestamps in software, this suffices. Syncing host with NIC is assumed if the host advertises the feature and implements it using real hardware timestamps. Or it could be useful for virtio hardware when there's no KVM that provides PTP. Thanks
Re: [PATCH 12/21] vhost-vdpa: introduce uAPI to get the number of virtqueue groups
On 2020/12/30 下午6:05, Eli Cohen wrote: On Wed, Dec 16, 2020 at 02:48:09PM +0800, Jason Wang wrote: Following the vDPA support for multiple address spaces, this patch introduces a uAPI for userspace to know the number of virtqueue groups supported by the vDPA device. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 4 ++++ include/uapi/linux/vhost.h | 3 +++ 2 files changed, 7 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 060d5b5b7e64..1ba5901b28e7 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -536,6 +536,10 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, case VHOST_VDPA_GET_VRING_NUM: r = vhost_vdpa_get_vring_num(v, argp); break; + case VHOST_VDPA_GET_GROUP_NUM: + r = copy_to_user(argp, &v->vdpa->ngroups, +sizeof(v->vdpa->ngroups)); + break; Are this and the other ioctls already supported in qemu? Not yet, the prototype is under development. I test the series with a small and dedicated userspace program. Thanks
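A minimal sketch of what such a userspace test could look like under the uAPI proposed here, assuming VHOST_VDPA_GET_GROUP_NUM simply copies out a 32-bit count as in the hunk above (device node name is an example):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

int main(void)
{
	__u32 ngroups = 0;
	int fd = open("/dev/vhost-vdpa-0", O_RDWR);	/* node name may differ */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, VHOST_VDPA_GET_GROUP_NUM, &ngroups))
		perror("VHOST_VDPA_GET_GROUP_NUM");
	else
		printf("virtqueue groups: %u\n", ngroups);
	close(fd);
	return 0;
}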
Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/30 下午6:12, Yongji Xie wrote: On Wed, Dec 30, 2020 at 4:41 PM Jason Wang wrote: I may miss something. But I don't see a relationship between the IOTLB_UNMAP and vmalloc(). In the vmalloc() case, the coherent DMA page will be taken from the memory allocated by vmalloc(). So IOTLB_UNMAP is not needed anymore during coherent DMA unmapping, because the vmalloc'ed memory, which has been mapped into the userspace address space during initialization, can be reused. And userspace should not unmap the region until we destroy the device. Just to make sure I understand. My understanding is that IOTLB_UNMAP is only needed when there's a change in the mapping from IOVA to page. So if we stick to the mapping, e.g. during dma_unmap, we just put the IOVA on a free list to be used by the next IOVA allocation. IOTLB_UNMAP could be avoided. So we are not limited by how the pages are actually allocated? Thanks I think that's fine. For virtio-vdpa, from the VDUSE userspace perspective, it works like a driver that is using SWIOTLB in this case. OK, will do it in v3. Thanks, Yongji
Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/31 下午1:15, Yongji Xie wrote: On Thu, Dec 31, 2020 at 10:49 AM Jason Wang wrote: Just to make sure I understand. My understanding is that IOTLB_UNMAP is only needed when there's a change in the mapping from IOVA to page. Yes, that's true. So if we stick to the mapping, e.g. during dma_unmap, we just put the IOVA on a free list to be used by the next IOVA allocation. IOTLB_UNMAP could be avoided. So we are not limited by how the pages are actually allocated? In coherent DMA cases, we need to return some memory with contiguous kernel virtual addresses. That is the reason why we need vmalloc() here. If we allocate the memory page by page, the corresponding kernel virtual addresses in a contiguous IOVA range might not be contiguous. Yes, but we can do that as what has been done in the series (alloc_pages_exact()). Or do you mean it would be a little bit hard to recycle IOVA/pages here? Thanks And in streaming DMA cases, there is no limit. So another choice is using vmalloc'ed memory only for coherent DMA cases. Not sure if this is clear for you. Thanks, Yongji
Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/31 下午2:52, Yongji Xie wrote: On Thu, Dec 31, 2020 at 1:50 PM Jason Wang wrote: Yes, but we can do that as what has been done in the series (alloc_pages_exact()). Or do you mean it would be a little bit hard to recycle IOVA/pages here? Yes, it might be hard to reuse the memory. For example, we first allocate 1 IOVA/page during dma_map, then the IOVA is freed during dma_unmap.
Actually we can't reuse this single page if we need a two-page area in the next IOVA allocation. So the best way is using IOTLB_UNMAP to free this single page during dma_unmap too. Thanks, Yongji I get you now. Then I agree that let's go with IOTLB_UNMAP. Thanks
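To make the fragmentation argument concrete, here is a toy first-fit IOVA allocator (not VDUSE code) showing why a recycled single page cannot serve a later two-page contiguous request:

#include <stdbool.h>
#include <stddef.h>

#define NPAGES 3
static bool used[NPAGES];	/* one bit per IOVA page */

/* First-fit allocation of npages contiguous IOVA pages. */
static long iova_alloc(size_t npages)
{
	size_t run = 0;

	for (size_t i = 0; i < NPAGES; i++) {
		run = used[i] ? 0 : run + 1;
		if (run == npages) {
			for (size_t j = i + 1 - npages; j <= i; j++)
				used[j] = true;
			return (long)(i + 1 - npages);
		}
	}
	return -1;	/* no contiguous range left */
}

/* iova_alloc(1) -> page 0; iova_alloc(2) -> pages 1-2. Freeing page 0
 * (used[0] = false) recycles one page, but a following iova_alloc(2)
 * returns -1: the single recycled page cannot satisfy a two-page
 * request, so the page must really be unmapped (IOTLB_UNMAP) rather
 * than merely kept on a free list. */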
Re: [PATCH linux-next v2 1/7] vdpa_sim_net: Make mac address array static
On 2021/1/4 上午11:31, Parav Pandit wrote: MAC address array is used only in vdpa_sim_net.c. Hence, keep it static. Signed-off-by: Parav Pandit --- Changelog: v1->v2: - new patch --- drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim_net.c b/drivers/vdpa/vdpa_sim/vdpa_sim_net.c index c10b6981fdab..f0482427186b 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim_net.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim_net.c @@ -33,7 +33,7 @@ static char *macaddr; module_param(macaddr, charp, 0); MODULE_PARM_DESC(macaddr, "Ethernet MAC address"); -u8 macaddr_buf[ETH_ALEN]; +static u8 macaddr_buf[ETH_ALEN]; static struct vdpasim *vdpasim_net_dev; Acked-by: Jason Wang
Re: [PATCH linux-next v2 4/7] vdpa: Define vdpa mgmt device, ops and a netlink interface
On 2021/1/4 上午11:31, Parav Pandit wrote: To add one or more VDPA devices, define a management device which allows adding or removing a vdpa device. A management device defines a set of callbacks to manage vdpa devices. To begin with, it defines add and remove callbacks through which a user-defined vdpa device can be added or removed. A unique management device is identified by its unique handle, formed by the management device name and optionally the bus name. Hence, introduce a routine through which a driver can register a management device and its callback operations for adding and removing a vdpa device. Introduce a vdpa netlink socket family so that users can query management devices and their attributes. Example of showing a vdpa management device which allows creating a vdpa device of networking class (device id = 0x1) of virtio specification 1.1 section 5.1.1. $ vdpa mgmtdev show vdpasim_net: supported_classes: net Example of showing a vdpa management device in JSON format. $ vdpa mgmtdev show -jp { "show": { "vdpasim_net": { "supported_classes": [ "net" ] } } } Signed-off-by: Parav Pandit Reviewed-by: Eli Cohen Reviewed-by: Jason Wang --- Changelog: v1->v2: - rebased - updated commit log example for management device name from "vdpasim" to "vdpasim_net" - removed device_id as net and block management devices are separated So I wonder whether there could be a type of management device that can deal with multiple types of virtio devices. If yes, we probably need to add the device id back. Thanks
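For readers following along, a rough sketch of the driver-side registration, assuming the structure and op names introduced by this series (vdpa_mgmt_dev, vdpa_mgmtdev_ops, vdpa_mgmtdev_register()); the exact signatures are defined by the patches themselves:

#include <linux/vdpa.h>
#include <linux/virtio_ids.h>
#include <linux/mod_devicetable.h>

static int my_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
{
	/* allocate a vdpa device named @name and register it */
	return 0;
}

static void my_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
{
	vdpa_unregister_device(dev);
}

static const struct vdpa_mgmtdev_ops my_mgmtdev_ops = {
	.dev_add = my_dev_add,
	.dev_del = my_dev_del,
};

static struct virtio_device_id my_id_table[] = {
	{ VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID },	/* supported_classes: net */
	{ 0 },
};

static struct vdpa_mgmt_dev my_mgmt_dev = {
	.ops = &my_mgmtdev_ops,
	.id_table = my_id_table,
};

/* in module init/exit:
 *	vdpa_mgmtdev_register(&my_mgmt_dev);
 *	vdpa_mgmtdev_unregister(&my_mgmt_dev);
 */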
Re: [PATCH linux-next v2 7/7] vdpa_sim_net: Add support for user supported devices
On 2021/1/4 上午11:31, Parav Pandit wrote: static int __init vdpasim_net_init(void) { int ret = 0; @@ -176,6 +264,8 @@ static int __init vdpasim_net_init(void) if (default_device) ret = vdpasim_net_default_dev_register(); + else + ret = vdpasim_net_mgmtdev_init(); return ret; } @@ -183,6 +273,8 @@ static void __exit vdpasim_net_exit(void) { if (default_device) vdpasim_net_default_dev_unregister(); + else + vdpasim_net_mgmtdev_cleanup(); } module_init(vdpasim_net_init); -- 2.26.2 I wonder what's the value of keeping the default device that is out of the control of the management API. Thanks
Re: [PATCH linux-next v2 7/7] vdpa_sim_net: Add support for user supported devices
On 2021/1/4 下午3:21, Parav Pandit wrote: From: Jason Wang Sent: Monday, January 4, 2021 12:35 PM I wonder what's the value of keeping the default device that is out of the control of the management API. I think we can remove it like how I did in the v1 version. And actual vendor drivers like mlx5_vdpa will likely do only user-created devices. I added it only for backward compatibility purposes, but we can remove the default simulated vdpa net device. What do you recommend? I think we'd better mandate this management API. This can avoid vendor-specific configuration that may complicate the management layer. Thanks
Re: [PATCH linux-next v2 4/7] vdpa: Define vdpa mgmt device, ops and a netlink interface
On 2021/1/4 下午3:24, Parav Pandit wrote: From: Jason Wang Sent: Monday, January 4, 2021 12:33 PM So I wonder whether there could be a type of management device that can deal with multiple types of virtio devices. If yes, we probably need to add the device id back. At this point mlx5 plans to support only net. It is useful to see what type of vdpa device is supported by a management device. In the future, if a mgmt dev supports multiple types, the user needs to choose the desired type. I guess we can defer this optional type to the future, when such a mgmt device will/may be available. I worry that if we remove device_id, it may give a hint that multiple mgmt devices need to be registered if it supports multiple types. So if possible I would like to keep the device_id here. Thanks
Re: [PATCH 06/21] vdpa: introduce virtqueue groups
On 2021/1/4 下午6:04, Stefan Hajnoczi wrote: On Wed, Dec 16, 2020 at 02:48:03PM +0800, Jason Wang wrote: This patch introduces virtqueue groups to vDPA devices. The virtqueue group is the minimal set of virtqueues that must share an address space. And the address space identifier can only be attached to a specific virtqueue group. A new mandatory bus operation is introduced to get the virtqueue group ID for a specific virtqueue. All the vDPA device drivers were converted to simply support a single virtqueue group. Signed-off-by: Jason Wang --- drivers/vdpa/ifcvf/ifcvf_main.c | 9 - drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 +++- drivers/vdpa/vdpa.c | 4 +++- drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 ++- include/linux/vdpa.h | 12 +--- 5 files changed, 37 insertions(+), 7 deletions(-) Maybe consider calling it iotlb_group or iommu_group so the purpose of the group is clear? I'm not sure. The reason that I chose virtqueue group is because: 1) Virtqueue is the only entity that tries to issue DMA 2) For IOMMU group, it may cause confusion with the existing IOMMU group, which groups devices 3) IOTLB is the concept in vhost; we don't have such a definition in the virtio spec Thanks
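The single-group conversion mentioned above is trivial for existing drivers. A sketch, assuming the op name and signature this series adds to vdpa_config_ops:

/* Every virtqueue belongs to virtqueue group 0, so the whole device
 * keeps sharing one address space. */
static u32 vdpasim_get_vq_group(struct vdpa_device *vdpa, u16 idx)
{
	return 0;
}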
Re: [PATCH linux-next v2 4/7] vdpa: Define vdpa mgmt device, ops and a netlink interface
On 2021/1/5 下午2:33, Parav Pandit wrote: From: Jason Wang Sent: Tuesday, January 5, 2021 9:40 AM I worry that if we remove device_id, it may give a hint that multiple mgmt devices need to be registered if it supports multiple types. No, it shouldn't, because we do expose multiple supported types in mgmtdev attributes. Right. So if possible I would like to keep the device_id here. It's possible to keep it. But with current drivers, mainly mlx5 and vdpa_sim, it is redundant. Not sure of ifc's plan. We have been splitting modules to handle net and block differently in mlx5 as well as vdpa_sim. So it looks to me that both may be separate management drivers (and management devices), such as vdpasim_net and vdpasim_block. mlx5 doesn't have a plan for block yet. Ok. Then it's fine. Thanks
Re: [PATCH netdev 0/5] virtio-net support xdp socket zero copy xmit
On 2021/1/5 下午5:11, Xuan Zhuo wrote: The first patch made some adjustments to xsk. Thanks a lot for the work. It's rather interesting. The second patch itself can be used as an independent patch to solve the problem that XDP may fail to load when the number of queues is insufficient. It would be better to send this as a separate patch. Several people asked for this before. The third to last patch implements support for xsk in virtio-net. A practical problem with virtio is that tx interrupts are not very reliable. There will always be some missing or delayed tx interrupts. So I specially added a timer to solve this problem. Of course, considering performance issues, the timer only triggers when the ring of the network card is full. This is sub-optimal. We need to figure out the root cause. We didn't meet such an issue before. Several questions: - is tx interrupt enabled? - can you still see the issue if you disable event index? - what backend did you use? qemu or vhost(user)? Regarding the issue of virtio-net supporting xsk's zero copy rx, I am also developing it, but I found that the modification may be relatively large, so I consider this patch set to be separated from the code related to xsk zero copy rx. That's fine, but a question here. How is multiqueue being handled here? I'm asking since there's no programmable filters/directors support in the virtio spec now. Thanks Xuan Zhuo (5): xsk: support get page for drv virtio-net: support XDP_TX when not more queues virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx xsk, virtio-net: prepare for support xsk virtio-net, xsk: virtio-net support xsk zero copy tx drivers/net/virtio_net.c | 643 +++- include/linux/netdevice.h | 1 + include/net/xdp_sock_drv.h | 10 + include/net/xsk_buff_pool.h | 1 + net/xdp/xsk_buff_pool.c | 10 +- 5 files changed, 597 insertions(+), 68 deletions(-) -- 1.8.3.1
Re: [PATCH netdev 0/5] virtio-net support xdp socket zero copy xmit
On 2021/1/5 下午8:42, Xuan Zhuo wrote: On Tue, 5 Jan 2021 17:32:19 +0800, Jason Wang wrote: This is sub-optimal. We need to figure out the root cause. We didn't meet such an issue before. Several questions: - is tx interrupt enabled? - can you still see the issue if you disable event index? - what backend did you use? qemu or vhost(user)? Sorry, it may just be a problem with the backend I used here. I just tested the latest qemu and it did not have this problem. I think I should delete the timer-related code? Yes, please. That's fine, but a question here. How is multiqueue being handled here? I'm asking since there's no programmable filters/directors support in the virtio spec now. I don't really understand what you mean. In the case of multiple queues, there is no problem. So consider we bind xsk to queue 4, how can you make sure the traffic is directed to queue 4? One possible solution is to use filters as suggested in af_xdp.rst: ethtool -N p3p2 rx-flow-hash udp4 fn ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ action 16 ... But virtio-net doesn't have any filters that could be programmed from the driver. Anything I missed here? Thanks
Re: [PATCH netdev 0/5] virtio-net support xdp socket zero copy xmit
On 2021/1/6 上午10:55, Xuan Zhuo wrote: On Wed, 6 Jan 2021 10:46:43 +0800, Jason Wang wrote: So consider we bind xsk to queue 4, how can you make sure the traffic is directed to queue 4? But virtio-net doesn't have any filters that could be programmed from the driver. Anything I missed here? I understand what you mean; this problem does exist, and I encountered it when I tested qemu. First of all, the problem only affects recv. This patch is for xmit. Of course, our normal business must also have recv scenarios. My solution in developing the upper-level application is to bond all the queues to ensure that we can receive the packets we want. I'm not sure I get you here. Note that one advantage of AF_XDP is that it allows XSK to be bound to a specific queue while the rest could still be used by the kernel. And I think in actual use, even if the network card supports filters, we should also bond all the queues, because we don't know which queue the traffic we care about will arrive from. With the help of filters the card can select a specific queue based on hash or n-tuple, so it should work? Regarding the problem of virtio-net, I think our core question is whether we need to deal with this problem in the virtio-net driver. I personally think that we should extend the virtio specification to define this scenario. Yes, so do you want to do that? It would make virtio-net more user friendly to AF_XDP. (Or if you wish I can post a patch to extend the spec.)
When I tested it, I found that some cloud vendors' implementations guarantee this queue selection algorithm. Right, though the spec suggests an automatic steering algorithm, it's not mandatory. Vendors can implement their own. But hash or ntuple filters should still be useful. Thanks Thanks!!
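On the application side, binding an XSK to one queue looks roughly like the sketch below, using libbpf's pre-1.0 xsk helpers (a sketch with error handling elided; umem is assumed to have been created with xsk_umem__create() beforehand). This is what makes the steering question above matter: the socket sees only its queue.

#include <bpf/xsk.h>

static int bind_xsk_to_queue(struct xsk_umem *umem)
{
	struct xsk_socket *xsk;
	struct xsk_ring_cons rx;
	struct xsk_ring_prod tx;
	const struct xsk_socket_config cfg = {
		.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
	};

	/* Bind to queue 4 only. Without driver-programmable filters,
	 * virtio-net gives no guarantee which flows land on queue 4. */
	return xsk_socket__create(&xsk, "eth0", 4, umem, &rx, &tx, &cfg);
}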
Re: [PATCH] vdpa/mlx5: Fix memory key MTT population
On 2021/1/6 下午5:05, Eli Cohen wrote: map_direct_mr() assumed that the number of scatter/gather entries returned by dma_map_sg_attrs() was equal to the number of segments in the sgl list. This led to wrong population of the mkey object. Fix this by properly referring to the returned value. In addition, get rid of fill_sg() whjich effect is overwritten bu populate_mtts(). Typo. Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code") Signed-off-by: Eli Cohen --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 28 2 files changed, 13 insertions(+), 16 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index 5c92a576edae..08f742fd2409 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -15,6 +15,7 @@ struct mlx5_vdpa_direct_mr { struct sg_table sg_head; int log_size; int nsg; + int nent; struct list_head list; u64 offset; }; diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 4b6195666c58..d300f799efcd 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -25,17 +25,6 @@ static int get_octo_len(u64 len, int page_shift) return (npages + 1) / 2; } -static void fill_sg(struct mlx5_vdpa_direct_mr *mr, void *in) -{ - struct scatterlist *sg; - __be64 *pas; - int i; - - pas = MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt); - for_each_sg(mr->sg_head.sgl, sg, mr->nsg, i) - (*pas) = cpu_to_be64(sg_dma_address(sg)); -} - static void mlx5_set_access_mode(void *mkc, int mode) { MLX5_SET(mkc, mkc, access_mode_1_0, mode & 0x3); @@ -45,10 +34,18 @@ static void mlx5_set_access_mode(void *mkc, int mode) static void populate_mtts(struct mlx5_vdpa_direct_mr *mr, __be64 *mtt) { struct scatterlist *sg; + int nsg = mr->nsg; + u64 dma_addr; + u64 dma_len; + int j = 0; int i; - for_each_sg(mr->sg_head.sgl, sg, mr->nsg, i) - mtt[i] = cpu_to_be64(sg_dma_address(sg)); + for_each_sg(mr->sg_head.sgl, sg, mr->nent, i) { + for (dma_addr = sg_dma_address(sg), dma_len = sg_dma_len(sg); +nsg && dma_len; +nsg--, dma_addr += BIT(mr->log_size), dma_len -= BIT(mr->log_size)) + mtt[j++] = cpu_to_be64(dma_addr); It looks to me the mtt entry is also limited by log_size. It's better to explain this a little bit in the commit log. Thanks + } } static int create_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr *mr) @@ -64,7 +61,6 @@ static int create_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct return -ENOMEM; MLX5_SET(create_mkey_in, in, uid, mvdev->res.uid); - fill_sg(mr, in); mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry); MLX5_SET(mkc, mkc, lw, !!(mr->perm & VHOST_MAP_WO)); MLX5_SET(mkc, mkc, lr, !!(mr->perm & VHOST_MAP_RO)); @@ -276,8 +272,8 @@ static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr done: mr->log_size = log_entity_size; mr->nsg = nsg; - err = dma_map_sg_attrs(dma, mr->sg_head.sgl, mr->nsg, DMA_BIDIRECTIONAL, 0); - if (!err) + mr->nent = dma_map_sg_attrs(dma, mr->sg_head.sgl, mr->nsg, DMA_BIDIRECTIONAL, 0); + if (!mr->nent) goto err_map; err = create_direct_mr(mvdev, mr);
Re: [PATCH v1] vdpa/mlx5: Fix memory key MTT population
On 2021/1/7 下午3:18, Eli Cohen wrote: map_direct_mr() assumed that the number of scatter/gather entries returned by dma_map_sg_attrs() was equal to the number of segments in the sgl list. This led to wrong population of the mkey object. Fix this by properly referring to the returned value. The hardware expects each MTT entry to contain the DMA address of a contiguous block of memory of size (1 << mr->log_size) bytes. dma_map_sg_attrs() can coalesce several sg entries into a single scatter/gather entry of contiguous DMA range so we need to scan the list and refer to the size of each s/g entry. In addition, get rid of fill_sg() which effect is overwritten by populate_mtts(). Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code") Signed-off-by: Eli Cohen --- V0->V1: 1. Fix typos 2. Improve changelog Acked-by: Jason Wang drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 28 2 files changed, 13 insertions(+), 16 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index 5c92a576edae..08f742fd2409 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -15,6 +15,7 @@ struct mlx5_vdpa_direct_mr { struct sg_table sg_head; int log_size; int nsg; + int nent; struct list_head list; u64 offset; }; diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 4b6195666c58..d300f799efcd 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -25,17 +25,6 @@ static int get_octo_len(u64 len, int page_shift) return (npages + 1) / 2; } -static void fill_sg(struct mlx5_vdpa_direct_mr *mr, void *in) -{ - struct scatterlist *sg; - __be64 *pas; - int i; - - pas = MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt); - for_each_sg(mr->sg_head.sgl, sg, mr->nsg, i) - (*pas) = cpu_to_be64(sg_dma_address(sg)); -} - static void mlx5_set_access_mode(void *mkc, int mode) { MLX5_SET(mkc, mkc, access_mode_1_0, mode & 0x3); @@ -45,10 +34,18 @@ static void mlx5_set_access_mode(void *mkc, int mode) static void populate_mtts(struct mlx5_vdpa_direct_mr *mr, __be64 *mtt) { struct scatterlist *sg; + int nsg = mr->nsg; + u64 dma_addr; + u64 dma_len; + int j = 0; int i; - for_each_sg(mr->sg_head.sgl, sg, mr->nsg, i) - mtt[i] = cpu_to_be64(sg_dma_address(sg)); + for_each_sg(mr->sg_head.sgl, sg, mr->nent, i) { + for (dma_addr = sg_dma_address(sg), dma_len = sg_dma_len(sg); +nsg && dma_len; +nsg--, dma_addr += BIT(mr->log_size), dma_len -= BIT(mr->log_size)) + mtt[j++] = cpu_to_be64(dma_addr); + } } static int create_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr *mr) @@ -64,7 +61,6 @@ static int create_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct return -ENOMEM; MLX5_SET(create_mkey_in, in, uid, mvdev->res.uid); - fill_sg(mr, in); mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry); MLX5_SET(mkc, mkc, lw, !!(mr->perm & VHOST_MAP_WO)); MLX5_SET(mkc, mkc, lr, !!(mr->perm & VHOST_MAP_RO)); @@ -276,8 +272,8 @@ static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr done: mr->log_size = log_entity_size; mr->nsg = nsg; - err = dma_map_sg_attrs(dma, mr->sg_head.sgl, mr->nsg, DMA_BIDIRECTIONAL, 0); - if (!err) + mr->nent = dma_map_sg_attrs(dma, mr->sg_head.sgl, mr->nsg, DMA_BIDIRECTIONAL, 0); + if (!mr->nent) goto err_map; err = create_direct_mr(mvdev, mr);
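A worked example of the populate_mtts() loop above, with illustrative numbers:

/* With mr->log_size = 12 (4 KiB MTT blocks), suppose
 * dma_map_sg_attrs() coalesced four pages into a single s/g entry
 * with dma_addr 0x10000 and dma_len 0x4000. The fixed loop now emits
 * four MTT entries, one per contiguous 4 KiB block:
 *
 *   mtt[0] = 0x10000, mtt[1] = 0x11000, mtt[2] = 0x12000, mtt[3] = 0x13000
 *
 * The old code wrote one entry per s/g element (here only 0x10000)
 * and iterated mr->nsg times, walking past the end of the shorter
 * coalesced list - the wrong mkey population the commit log describes.
 */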
Re: [PATCH net-next 0/2] Introduce XDP_FLAGS_NO_TX flag
On 2021/1/9 上午10:49, Charlie Somerville wrote: This patch series introduces a new flag XDP_FLAGS_NO_TX which prevents the allocation of additional send queues for XDP programs. This part I don't understand. Is such a flag a must? I think the answer is probably not. Why not simply do: 1) if we have sufficient TX queues, use dedicated TX queues for XDP_TX 2) if we don't, simply synchronize through a spin_lock [1] Thanks [1] https://www.spinics.net/lists/bpf/msg32587.html Included in this patch series is an implementation of XDP_FLAGS_NO_TX for the virtio_net driver. This flag is intended to be advisory - not all drivers must implement support for it. Many virtualised environments only provide enough virtio_net send queues for the number of processors allocated to the VM: # nproc 8 # ethtool --show-channels ens3 Channel parameters for ens3: Pre-set maximums: RX: 0 TX: 0 Other: 0 Combined: 8 In this configuration XDP is unusable because the virtio_net driver always tries to allocate an extra send queue for each processor - even if the XDP program never uses the XDP_TX functionality. While XDP_TX is still unavailable in these environments, this new flag expands the set of XDP programs that can be used. This is my first contribution to the kernel, so apologies if I've sent this to the wrong list. I have tried to cc relevant maintainers but it's possible I may have missed some people. I'm looking forward to receiving feedback on this change. Charlie Somerville (2): xdp: Add XDP_FLAGS_NO_TX flag virtio_net: Implement XDP_FLAGS_NO_TX support drivers/net/virtio_net.c | 17 + include/uapi/linux/if_link.h | 5 - 2 files changed, 17 insertions(+), 5 deletions(-)
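A sketch of option 2) above, with made-up names (not actual virtio_net code), just to show the shape of the locking when queues are oversubscribed:

#include <linux/spinlock.h>
#include <linux/smp.h>
#include <net/xdp.h>

struct xdp_txq {
	spinlock_t lock;	/* protects the ring when queues are shared */
	/* ... ring state ... */
};

/* With fewer TX queues than cpus, several cpus may map to the same
 * queue, so XDP_TX must serialize on it instead of requiring a
 * dedicated per-cpu queue. */
static void xdp_tx_frame(struct xdp_txq *txqs, unsigned int nqueues,
			 struct xdp_frame *frame)
{
	struct xdp_txq *txq = &txqs[smp_processor_id() % nqueues];

	spin_lock(&txq->lock);
	/* ... enqueue frame on txq's ring ... */
	spin_unlock(&txq->lock);
}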
Re: [PATCH 21/21] vdpasim: control virtqueue support
On 2021/1/11 下午8:26, Eli Cohen wrote: On Wed, Dec 16, 2020 at 02:48:18PM +0800, Jason Wang wrote: This patch introduces the control virtqueue support for the vDPA simulator. This is a requirement for supporting advanced features like multiqueue. A requirement for the control virtqueue is to isolate its memory access from the rx/tx virtqueues. This is because when using a vDPA device for a VM, the control virtqueue is not directly assigned to the VM. Userspace (Qemu) will present a shadow control virtqueue in order to record the device states. The isolation is done via the virtqueue groups and ASID support in vDPA through vhost-vdpa. The simulator is extended to have: 1) three virtqueues: RXVQ, TXVQ and CVQ (control virtqueue) 2) two virtqueue groups: group 0 contains RXVQ and TXVQ; group 1 contains CVQ 3) two address spaces, and the simulator simply implements the address spaces by mapping them 1:1 to the IOTLB. For the VM use cases, userspace (Qemu) may set AS 0 to group 0 and AS 1 to group 1. So we have: 1) The IOTLB for virtqueue group 0 contains the mappings of the guest, so RX and TX can be assigned to the guest directly. 2) The IOTLB for virtqueue group 1 contains the mappings of the CVQ, which are the buffers allocated and managed by the VMM only. So the CVQ of vhost-vdpa is visible to the VMM only, and the guest cannot access it. For the other use cases, since AS 0 is associated with all virtqueue groups by default, all virtqueues share the same mapping by default. To demonstrate the function, VIRTIO_NET_F_CTRL_MACADDR is implemented in the simulator for the driver to set the mac address. Hi Jason, is there any version of qemu/libvirt available with which I can see the control virtqueue working in action? Not yet, the qemu part depends on the shadow virtqueue work of Eugenio. But it will work as: 1) qemu will use a separate address space for the control virtqueue (shadow) exposed through vhost-vDPA 2) the commands sent through the control virtqueue by the guest driver will be intercepted by qemu 3) Qemu will send those commands to the shadow control virtqueue Eugenio, any ETA for the new version of shadow virtqueue support in Qemu? Thanks
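To make the VM flow above concrete, the VMM side could look roughly like this, assuming the uAPI names of this series (VHOST_VDPA_GET_VRING_GROUP, VHOST_VDPA_SET_GROUP_ASID) and with error handling elided:

#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Find which group the control virtqueue belongs to, then bind that
 * group to address space 1, leaving the rx/tx group in address space
 * 0 (the guest's). In the simulator above, the CVQ is vq index 2. */
static int isolate_cvq(int vdpa_fd, unsigned int cvq_idx)
{
	struct vhost_vring_state s = { .index = cvq_idx };

	if (ioctl(vdpa_fd, VHOST_VDPA_GET_VRING_GROUP, &s))
		return -1;
	s.index = s.num;	/* the group that contains the CVQ */
	s.num = 1;		/* ASID for the shadow CVQ mappings */
	return ioctl(vdpa_fd, VHOST_VDPA_SET_GROUP_ASID, &s);
}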
Re: [PATCH v1] vhost_vdpa: fix the problem in vhost_vdpa_set_config_call
On 2021/1/12 上午10:46, Cindy Lu wrote: in vhost_vdpa_set_config_call, the cb.private should be vhost_vdpa. Should be "In" this cb.private will finally use in vhost_vdpa_config_cb as vhost_vdpa.Fix this issue A whitespace is needed before Fix and a full stop after "issue" Fixes: 776f395004d82 ("vhost_vdpa: Support config interrupt in vdpa") Acked-by: Jason Wang Please post a V2 with the above fixed and cc stable. Thanks Signed-off-by: Cindy Lu --- drivers/vhost/vdpa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index ef688c8c0e0e..3fbb9c1f49da 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -319,7 +319,7 @@ static long vhost_vdpa_set_config_call(struct vhost_vdpa *v, u32 __user *argp) struct eventfd_ctx *ctx; cb.callback = vhost_vdpa_config_cb; - cb.private = v->vdpa; + cb.private = v; if (copy_from_user(&fd, argp, sizeof(fd))) return -EFAULT;
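For context, a sketch of the consumer side this fix targets (based on the config interrupt callback in drivers/vhost/vdpa.c): the token stored in cb.private comes back as the callback's argument, so it has to be the vhost_vdpa the handler expects.

#include <linux/interrupt.h>
#include <linux/eventfd.h>

static irqreturn_t vhost_vdpa_config_cb(void *private)
{
	struct vhost_vdpa *v = private;	/* hence cb.private must be v, not v->vdpa */
	struct eventfd_ctx *config_ctx = v->config_ctx;

	if (config_ctx)
		eventfd_signal(config_ctx, 1);

	return IRQ_HANDLED;
}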
Re: [PATCH net] vdpa/mlx5: Fix miss to set VIRTIO_NET_S_LINK_UP for virtio_net_config
On 2020/10/19 下午5:07, we...@ucloud.cn wrote: From: wenxu Qemu gets virtio_net_config from the vdpa driver, so the vdpa driver should set the VIRTIO_NET_S_LINK_UP flag in virtio_net_config like vdpa_sim does. Otherwise the link of the virtio-net NIC in the virtual machine will never come up. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: wenxu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 74264e59..af6c74c 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1537,6 +1537,8 @@ static int mlx5_vdpa_set_features(struct vdpa_device *vdev, u64 features) ndev->mvdev.actual_features = features & ndev->mvdev.mlx_features; ndev->config.mtu = __cpu_to_virtio16(mlx5_vdpa_is_little_endian(mvdev), ndev->mtu); + ndev->config.status = __cpu_to_virtio16(mlx5_vdpa_is_little_endian(mvdev), + VIRTIO_NET_S_LINK_UP); return err; } Other than the small issue pointed out by Jakub. Acked-by: Jason Wang
Re: [PATCH net v2] Revert "virtio-net: ethtool configurable RXCSUM"
On 2020/10/20 上午1:32, Michael S. Tsirkin wrote: This reverts commit 3618ad2a7c0e78e4258386394d5d5f92a3dbccf8. When the device does not have a control vq (e.g. when using a version of QEMU based on upstream v0.10 or older, or when specifying ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_rx_extra=off,ctrl_mac_addr=off for the device on the QEMU command line), that commit causes a crash: [ 72.229171] kernel BUG at drivers/net/virtio_net.c:1667! [ 72.230266] invalid opcode: [#1] PREEMPT SMP [ 72.231172] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc8-02934-g3618ad2a7c0e7 #1 [ 72.231172] EIP: virtnet_send_command+0x120/0x140 [ 72.231172] Code: 00 0f 94 c0 8b 7d f0 65 33 3d 14 00 00 00 75 1c 8d 65 f4 5b 5e 5f 5d c3 66 90 be 01 00 00 00 e9 6e ff ff ff 8d b6 00 +00 00 00 <0f> 0b e8 d9 bb 82 00 eb 17 8d b4 26 00 00 00 00 8d b4 26 00 00 00 [ 72.231172] EAX: 000d EBX: f72895c0 ECX: 0017 EDX: 0011 [ 72.231172] ESI: f7197800 EDI: ed69bd00 EBP: ed69bcf4 ESP: ed69bc98 [ 72.231172] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246 [ 72.231172] CR0: 80050033 CR2: CR3: 02c84000 CR4: 000406f0 [ 72.231172] Call Trace: [ 72.231172] ? __virt_addr_valid+0x45/0x60 [ 72.231172] ? ___cache_free+0x51f/0x760 [ 72.231172] ? kobject_uevent_env+0xf4/0x560 [ 72.231172] virtnet_set_guest_offloads+0x4d/0x80 [ 72.231172] virtnet_set_features+0x85/0x120 [ 72.231172] ? virtnet_set_guest_offloads+0x80/0x80 [ 72.231172] __netdev_update_features+0x27a/0x8e0 [ 72.231172] ? kobject_uevent+0xa/0x20 [ 72.231172] ? netdev_register_kobject+0x12c/0x160 [ 72.231172] register_netdevice+0x4fe/0x740 [ 72.231172] register_netdev+0x1c/0x40 [ 72.231172] virtnet_probe+0x728/0xb60 [ 72.231172] ? _raw_spin_unlock+0x1d/0x40 [ 72.231172] ? virtio_vdpa_get_status+0x1c/0x20 [ 72.231172] virtio_dev_probe+0x1c6/0x271 [ 72.231172] really_probe+0x195/0x2e0 [ 72.231172] driver_probe_device+0x26/0x60 [ 72.231172] device_driver_attach+0x49/0x60 [ 72.231172] __driver_attach+0x46/0xc0 [ 72.231172] ? device_driver_attach+0x60/0x60 [ 72.231172] bus_add_driver+0x197/0x1c0 [ 72.231172] driver_register+0x66/0xc0 [ 72.231172] register_virtio_driver+0x1b/0x40 [ 72.231172] virtio_net_driver_init+0x61/0x86 [ 72.231172] ? veth_init+0x14/0x14 [ 72.231172] do_one_initcall+0x76/0x2e4 [ 72.231172] ? rdinit_setup+0x2a/0x2a [ 72.231172] do_initcalls+0xb2/0xd5 [ 72.231172] kernel_init_freeable+0x14f/0x179 [ 72.231172] ? rest_init+0x100/0x100 [ 72.231172] kernel_init+0xd/0xe0 [ 72.231172] ret_from_fork+0x1c/0x30 [ 72.231172] Modules linked in: [ 72.269563] ---[ end trace a6ebc4afea0e6cb1 ]--- The reason is that virtnet_set_features now calls virtnet_set_guest_offloads unconditionally, it used to only call it when there is something to configure. If device does not have a control vq, everything breaks. Looking at this some more, I noticed that it's not really checking the hardware too much. E.g. if ((dev->features ^ features) & NETIF_F_LRO) { if (features & NETIF_F_LRO) offloads |= GUEST_OFFLOAD_LRO_MASK & vi->guest_offloads_capable; else offloads &= ~GUEST_OFFLOAD_LRO_MASK; } and (1ULL << VIRTIO_NET_F_GUEST_TSO6) | \ (1ULL << VIRTIO_NET_F_GUEST_ECN) | \ (1ULL << VIRTIO_NET_F_GUEST_UFO)) But there's no guarantee that e.g. VIRTIO_NET_F_GUEST_TSO6 is set. If it isn't command should not send it. Further static int virtnet_set_features(struct net_device *dev, netdev_features_t features) { struct virtnet_info *vi = netdev_priv(dev); u64 offloads = vi->guest_offloads; seems wrong since guest_offloads is zero initialized, I'm not sure I get here. 
Did you mean vi->guest_offloads? We initialize it during probe for (i = 0; i < ARRAY_SIZE(guest_offloads); i++) if (virtio_has_feature(vi->vdev, guest_offloads[i])) set_bit(guest_offloads[i], &vi->guest_offloads); it does not reflect the state after reset which comes from the features. Revert the original commit for now. Cc: Tonghao Zhang Cc: Willem de Bruijn Fixes: 3618ad2a7c0e7 ("virtio-net: ethtool configurable RXCSUM") Reported-by: kernel test robot Signed-off-by: Michael S. Tsirkin --- changes from v1: - clarify how to reproduce the bug in the log drivers/net/virtio_net.c | 50 +++- 1 file changed, 13 insertions(+), 37 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index d2d2c4a53cf2..21b71148c532 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -68,8 +68,6 @@ static const unsigned long guest_offloads[] = { (1ULL << VIRTI
Re: [PATCH 1/2] KVM: not register an IRQ bypass producer if unsupported or disabled
On 2020/10/19 下午5:06, Zhenzhong Duan wrote: If posted interrupts are disabled due to a hardware limit or forcibly disabled by the "intremap=nopost" parameter, return -EINVAL so that the legacy mode IRQ isn't registered as an IRQ bypass producer. Is there any side effect if it was still registered? With this change, the below message is printed: "vfio-pci :db:00.0: irq bypass producer (token 60c8cda5) registration fails: -22" I may miss something, but the patch only touches vhost-vDPA instead of VFIO? Thanks ..which also hints at whether a vfio or vdpa device works in PI mode or legacy remapping mode. Add a print to the vdpa code just like what vfio_msi_set_vector_signal() does. Signed-off-by: Zhenzhong Duan --- arch/x86/kvm/svm/avic.c | 3 +-- arch/x86/kvm/vmx/vmx.c | 5 ++--- drivers/vhost/vdpa.c| 5 + 3 files changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index ac830cd..316142a 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -814,7 +814,7 @@ int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq, if (!kvm_arch_has_assigned_device(kvm) || !irq_remapping_cap(IRQ_POSTING_CAP)) - return 0; + return ret; pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n", __func__, host_irq, guest_irq, set); @@ -899,7 +899,6 @@ int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq, } } - ret = 0; out: srcu_read_unlock(&kvm->irq_srcu, idx); return ret; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index f0a9954..1fed6d6 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7716,12 +7716,12 @@ static int vmx_update_pi_irte(struct kvm *kvm, unsigned int host_irq, struct kvm_lapic_irq irq; struct kvm_vcpu *vcpu; struct vcpu_data vcpu_info; - int idx, ret = 0; + int idx, ret = -EINVAL; if (!kvm_arch_has_assigned_device(kvm) || !irq_remapping_cap(IRQ_POSTING_CAP) || !kvm_vcpu_apicv_active(kvm->vcpus[0])) - return 0; + return ret; idx = srcu_read_lock(&kvm->irq_srcu); irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu); @@ -7787,7 +7787,6 @@ static int vmx_update_pi_irte(struct kvm *kvm, unsigned int host_irq, } } - ret = 0; out: srcu_read_unlock(&kvm->irq_srcu, idx); return ret; diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 62a9bb0..b20060a 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -107,6 +107,11 @@ static void vhost_vdpa_setup_vq_irq(struct vhost_vdpa *v, u16 qid) vq->call_ctx.producer.token = vq->call_ctx.ctx; vq->call_ctx.producer.irq = irq; ret = irq_bypass_register_producer(&vq->call_ctx.producer); + if (unlikely(ret)) + dev_info(&vdpa->dev, + "irq bypass producer (token %p) registration fails: %d\n", + vq->call_ctx.producer.token, ret); + spin_unlock(&vq->call_ctx.ctx_lock); }
Re: [PATCH 2/2] KVM: not link irqfd with a fake IRQ bypass producer
On 2020/10/19 下午5:06, Zhenzhong Duan wrote: In case of failure to set up a posted interrupt for an IRQ, it makes no sense to assign irqfd->producer to the producer. This change makes the code more robust. It's better to describe what issue we will get without this patch. Thanks Signed-off-by: Zhenzhong Duan --- arch/x86/kvm/x86.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ce856e0..277e961 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10683,13 +10683,14 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons, container_of(cons, struct kvm_kernel_irqfd, consumer); int ret; - irqfd->producer = prod; kvm_arch_start_assignment(irqfd->kvm); ret = kvm_x86_ops.update_pi_irte(irqfd->kvm, prod->irq, irqfd->gsi, 1); if (ret) kvm_arch_end_assignment(irqfd->kvm); + else + irqfd->producer = prod; return ret; }
Re: [PATCH net] vdpa/mlx5: Fix miss to set VIRTIO_NET_S_LINK_UP for virtio_net_config
On 2020/10/20 下午3:44, Eli Cohen wrote: On Tue, Oct 20, 2020 at 10:03:00AM +0800, Jason Wang wrote: Other than the small issue pointed out by Jakub. Acked-by: Jason Wang I already posted a fix for this a while ago and Michael should merge it. https://lkml.org/lkml/2020/9/17/543 Aha, I just forgot this. It's queued here: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/commit/?h=linux-next And with my ack actually. Thanks
Re: [PATCH v4] Revert "virtio-net: ethtool configurable RXCSUM"
On 2020/10/21 下午10:30, Michael S. Tsirkin wrote: This reverts commit 3618ad2a7c0e78e4258386394d5d5f92a3dbccf8. When control vq is not negotiated, that commit causes a crash: [ 72.229171] kernel BUG at drivers/net/virtio_net.c:1667! [ 72.230266] invalid opcode: [#1] PREEMPT SMP [ 72.231172] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc8-02934-g3618ad2a7c0e7 #1 [ 72.231172] EIP: virtnet_send_command+0x120/0x140 [ 72.231172] Code: 00 0f 94 c0 8b 7d f0 65 33 3d 14 00 00 00 75 1c 8d 65 f4 5b 5e 5f 5d c3 66 90 be 01 00 00 00 e9 6e ff ff ff 8d b6 00 +00 00 00 <0f> 0b e8 d9 bb 82 00 eb 17 8d b4 26 00 00 00 00 8d b4 26 00 00 00 [ 72.231172] EAX: 000d EBX: f72895c0 ECX: 0017 EDX: 0011 [ 72.231172] ESI: f7197800 EDI: ed69bd00 EBP: ed69bcf4 ESP: ed69bc98 [ 72.231172] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246 [ 72.231172] CR0: 80050033 CR2: CR3: 02c84000 CR4: 000406f0 [ 72.231172] Call Trace: [ 72.231172] ? __virt_addr_valid+0x45/0x60 [ 72.231172] ? ___cache_free+0x51f/0x760 [ 72.231172] ? kobject_uevent_env+0xf4/0x560 [ 72.231172] virtnet_set_guest_offloads+0x4d/0x80 [ 72.231172] virtnet_set_features+0x85/0x120 [ 72.231172] ? virtnet_set_guest_offloads+0x80/0x80 [ 72.231172] __netdev_update_features+0x27a/0x8e0 [ 72.231172] ? kobject_uevent+0xa/0x20 [ 72.231172] ? netdev_register_kobject+0x12c/0x160 [ 72.231172] register_netdevice+0x4fe/0x740 [ 72.231172] register_netdev+0x1c/0x40 [ 72.231172] virtnet_probe+0x728/0xb60 [ 72.231172] ? _raw_spin_unlock+0x1d/0x40 [ 72.231172] ? virtio_vdpa_get_status+0x1c/0x20 [ 72.231172] virtio_dev_probe+0x1c6/0x271 [ 72.231172] really_probe+0x195/0x2e0 [ 72.231172] driver_probe_device+0x26/0x60 [ 72.231172] device_driver_attach+0x49/0x60 [ 72.231172] __driver_attach+0x46/0xc0 [ 72.231172] ? device_driver_attach+0x60/0x60 [ 72.231172] bus_add_driver+0x197/0x1c0 [ 72.231172] driver_register+0x66/0xc0 [ 72.231172] register_virtio_driver+0x1b/0x40 [ 72.231172] virtio_net_driver_init+0x61/0x86 [ 72.231172] ? veth_init+0x14/0x14 [ 72.231172] do_one_initcall+0x76/0x2e4 [ 72.231172] ? rdinit_setup+0x2a/0x2a [ 72.231172] do_initcalls+0xb2/0xd5 [ 72.231172] kernel_init_freeable+0x14f/0x179 [ 72.231172] ? rest_init+0x100/0x100 [ 72.231172] kernel_init+0xd/0xe0 [ 72.231172] ret_from_fork+0x1c/0x30 [ 72.231172] Modules linked in: [ 72.269563] ---[ end trace a6ebc4afea0e6cb1 ]--- The reason is that virtnet_set_features now calls virtnet_set_guest_offloads unconditionally, it used to only call it when there is something to configure. If device does not have a control vq, everything breaks. Revert the original commit for now. Cc: Tonghao Zhang Cc: Willem de Bruijn Fixes: 3618ad2a7c0e7 ("virtio-net: ethtool configurable RXCSUM") Reported-by: kernel test robot Signed-off-by: Michael S. Tsirkin --- Acked-by: Jason Wang Same patch as all of v1-v3, just tweaking the commit log. 
drivers/net/virtio_net.c | 50 +++- 1 file changed, 13 insertions(+), 37 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index d2d2c4a53cf2..21b71148c532 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -68,8 +68,6 @@ static const unsigned long guest_offloads[] = { (1ULL << VIRTIO_NET_F_GUEST_ECN) | \ (1ULL << VIRTIO_NET_F_GUEST_UFO)) -#define GUEST_OFFLOAD_CSUM_MASK (1ULL << VIRTIO_NET_F_GUEST_CSUM) - struct virtnet_stat_desc { char desc[ETH_GSTRING_LEN]; size_t offset; @@ -2524,48 +2522,29 @@ static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, return 0; } -static netdev_features_t virtnet_fix_features(struct net_device *netdev, - netdev_features_t features) -{ - /* If Rx checksum is disabled, LRO should also be disabled. */ - if (!(features & NETIF_F_RXCSUM)) - features &= ~NETIF_F_LRO; - - return features; -} - static int virtnet_set_features(struct net_device *dev, netdev_features_t features) { struct virtnet_info *vi = netdev_priv(dev); - u64 offloads = vi->guest_offloads; + u64 offloads; int err; - /* Don't allow configuration while XDP is active. */ - if (vi->xdp_queue_pairs) - return -EBUSY; - if ((dev->features ^ features) & NETIF_F_LRO) { + if (vi->xdp_queue_pairs) + return -EBUSY; + if (features & NETIF_F_LRO) - offloads |= GUEST_OFFLOAD_LRO_MASK & - vi->guest_offloads_capable; + offloads = vi->guest_offloads_capable;
Re: [PATCH] vdpa: handle irq bypass register failure case
On 2020/10/23 下午6:40, Zhu Lingshan wrote: LKP considered the variable 'ret' in vhost_vdpa_setup_vq_irq() an unused variable and suggested we remove it. Actually it stores the return value of irq_bypass_register_producer(), but we did not check it; we should handle the failure case. This commit prints a message if registering the irq bypass producer fails; in this case, the vqs remain functional. Signed-off-by: Zhu Lingshan Reported-by: kernel test robot --- drivers/vhost/vdpa.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 62a9bb0efc55..d6b2c3bd1b01 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -107,6 +107,9 @@ static void vhost_vdpa_setup_vq_irq(struct vhost_vdpa *v, u16 qid) vq->call_ctx.producer.token = vq->call_ctx.ctx; vq->call_ctx.producer.irq = irq; ret = irq_bypass_register_producer(&vq->call_ctx.producer); + if (unlikely(ret)) + dev_info(&v->dev, "vq %u, irq bypass producer (token %p) registration fails, ret = %d\n", +qid, vq->call_ctx.producer.token, ret); spin_unlock(&vq->call_ctx.ctx_lock); } Acked-by: Jason Wang
Re: [PATCH net] vhost_vdpa: Return -EFAULT if copy_from_user() fails
On 2020/10/23 下午11:34, Michael S. Tsirkin wrote: On Fri, Oct 23, 2020 at 03:08:53PM +0300, Dan Carpenter wrote: The copy_to/from_user() functions return the number of bytes which we weren't able to copy but the ioctl should return -EFAULT if they fail. Fixes: a127c5bbb6a8 ("vhost-vdpa: fix backend feature ioctls") Signed-off-by: Dan Carpenter Acked-by: Michael S. Tsirkin Needed for stable I guess. Agree. Acked-by: Jason Wang --- drivers/vhost/vdpa.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 62a9bb0efc55..c94a97b6bd6d 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -428,12 +428,11 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, void __user *argp = (void __user *)arg; u64 __user *featurep = argp; u64 features; - long r; + long r = 0; if (cmd == VHOST_SET_BACKEND_FEATURES) { - r = copy_from_user(&features, featurep, sizeof(features)); - if (r) - return r; + if (copy_from_user(&features, featurep, sizeof(features))) + return -EFAULT; if (features & ~VHOST_VDPA_BACKEND_FEATURES) return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); @@ -476,7 +475,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, break; case VHOST_GET_BACKEND_FEATURES: features = VHOST_VDPA_BACKEND_FEATURES; - r = copy_to_user(featurep, &features, sizeof(features)); + if (copy_to_user(featurep, &features, sizeof(features))) + r = -EFAULT; break; default: r = vhost_dev_ioctl(&v->vdev, cmd, argp);
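For reference, a minimal sketch of the idiom this patch applies (the function name is illustrative, not from the driver): copy_to_user()/copy_from_user() return the number of bytes left uncopied, which is not an errno, so any non-zero return must be translated:

    static long example_ioctl(u64 __user *featurep)
    {
        u64 features;

        /* never pass the byte count back as the ioctl return value */
        if (copy_from_user(&features, featurep, sizeof(features)))
            return -EFAULT;

        return 0;
    }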
Re: [PATCH] vhost: Use mutex to protect vq_irq setup
On 2020/10/28 下午10:20, Eli Cohen wrote: Both irq_bypass_register_producer() and irq_bypass_unregister_producer() require process context to run. Change the call context lock from spinlock to mutex to protect the setup process to avoid deadlocks. Fixes: 265a0ad8731d ("vhost: introduce vhost_vring_call") Signed-off-by: Eli Cohen Hi Eli: During review we spot that the spinlock is not necessary. And it was already protected by vq mutex. So it was removed in this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=86e182fe12ee5869022614457037097c70fe2ed1 Thanks --- drivers/vhost/vdpa.c | 10 +- drivers/vhost/vhost.c | 6 +++--- drivers/vhost/vhost.h | 3 ++- 3 files changed, 10 insertions(+), 9 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index be783592fe58..0a744f2b6e76 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -98,26 +98,26 @@ static void vhost_vdpa_setup_vq_irq(struct vhost_vdpa *v, u16 qid) return; irq = ops->get_vq_irq(vdpa, qid); - spin_lock(&vq->call_ctx.ctx_lock); + mutex_lock(&vq->call_ctx.ctx_lock); irq_bypass_unregister_producer(&vq->call_ctx.producer); if (!vq->call_ctx.ctx || irq < 0) { - spin_unlock(&vq->call_ctx.ctx_lock); + mutex_unlock(&vq->call_ctx.ctx_lock); return; } vq->call_ctx.producer.token = vq->call_ctx.ctx; vq->call_ctx.producer.irq = irq; ret = irq_bypass_register_producer(&vq->call_ctx.producer); - spin_unlock(&vq->call_ctx.ctx_lock); + mutex_unlock(&vq->call_ctx.ctx_lock); } static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid) { struct vhost_virtqueue *vq = &v->vqs[qid]; - spin_lock(&vq->call_ctx.ctx_lock); + mutex_lock(&vq->call_ctx.ctx_lock); irq_bypass_unregister_producer(&vq->call_ctx.producer); - spin_unlock(&vq->call_ctx.ctx_lock); + mutex_unlock(&vq->call_ctx.ctx_lock); } static void vhost_vdpa_reset(struct vhost_vdpa *v) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 9ad45e1d27f0..938239e11455 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -302,7 +302,7 @@ static void vhost_vring_call_reset(struct vhost_vring_call *call_ctx) { call_ctx->ctx = NULL; memset(&call_ctx->producer, 0x0, sizeof(struct irq_bypass_producer)); - spin_lock_init(&call_ctx->ctx_lock); + mutex_init(&call_ctx->ctx_lock); } static void vhost_vq_reset(struct vhost_dev *dev, @@ -1650,9 +1650,9 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg break; } - spin_lock(&vq->call_ctx.ctx_lock); + mutex_lock(&vq->call_ctx.ctx_lock); swap(ctx, vq->call_ctx.ctx); - spin_unlock(&vq->call_ctx.ctx_lock); + mutex_unlock(&vq->call_ctx.ctx_lock); break; case VHOST_SET_VRING_ERR: if (copy_from_user(&f, argp, sizeof f)) { diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 9032d3c2a9f4..e8855ea04205 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -64,7 +64,8 @@ enum vhost_uaddr_type { struct vhost_vring_call { struct eventfd_ctx *ctx; struct irq_bypass_producer producer; - spinlock_t ctx_lock; + /* protect vq irq setup */ + struct mutex ctx_lock; }; /* The virtqueue structure describes a queue attached to a device. */
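The bug pattern the patch describes, as a three-line sketch (per the commit message, the irq bypass registration helpers require process context, i.e. they may sleep):

    spin_lock(&vq->call_ctx.ctx_lock);                    /* atomic context begins */
    irq_bypass_register_producer(&vq->call_ctx.producer); /* may sleep: invalid here */
    spin_unlock(&vq->call_ctx.ctx_lock);

Either a mutex or, as commit 86e182fe12ee ultimately did, dropping the inner lock and relying on the existing vq mutex keeps the caller in sleepable context.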
Re: [PATCH] vhost: Use mutex to protect vq_irq setup
On 2020/10/29 下午3:37, Eli Cohen wrote: On Thu, Oct 29, 2020 at 03:03:24PM +0800, Jason Wang wrote: On 2020/10/28 下午10:20, Eli Cohen wrote: Both irq_bypass_register_producer() and irq_bypass_unregister_producer() require process context to run. Change the call context lock from spinlock to mutex to protect the setup process to avoid deadlocks. Fixes: 265a0ad8731d ("vhost: introduce vhost_vring_call") Signed-off-by: Eli Cohen Hi Eli: During review we spot that the spinlock is not necessary. And it was already protected by vq mutex. So it was removed in this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=86e182fe12ee5869022614457037097c70fe2ed1 Thanks I see, thanks. BTW, while testing irq bypassing, I noticed that qemu started crashing and I fail to boot the VM? Is that a known issue. I checked using updated master branch of qemu updated yesterday. Not known yet. Any ideas how to check this further? I would be helpful if you can paste the calltrace here. Did anyone actually check that irq bypassing works? Yes, Ling Shan tested it via IFCVF driver. Thanks --- drivers/vhost/vdpa.c | 10 +- drivers/vhost/vhost.c | 6 +++--- drivers/vhost/vhost.h | 3 ++- 3 files changed, 10 insertions(+), 9 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index be783592fe58..0a744f2b6e76 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -98,26 +98,26 @@ static void vhost_vdpa_setup_vq_irq(struct vhost_vdpa *v, u16 qid) return; irq = ops->get_vq_irq(vdpa, qid); - spin_lock(&vq->call_ctx.ctx_lock); + mutex_lock(&vq->call_ctx.ctx_lock); irq_bypass_unregister_producer(&vq->call_ctx.producer); if (!vq->call_ctx.ctx || irq < 0) { - spin_unlock(&vq->call_ctx.ctx_lock); + mutex_unlock(&vq->call_ctx.ctx_lock); return; } vq->call_ctx.producer.token = vq->call_ctx.ctx; vq->call_ctx.producer.irq = irq; ret = irq_bypass_register_producer(&vq->call_ctx.producer); - spin_unlock(&vq->call_ctx.ctx_lock); + mutex_unlock(&vq->call_ctx.ctx_lock); } static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid) { struct vhost_virtqueue *vq = &v->vqs[qid]; - spin_lock(&vq->call_ctx.ctx_lock); + mutex_lock(&vq->call_ctx.ctx_lock); irq_bypass_unregister_producer(&vq->call_ctx.producer); - spin_unlock(&vq->call_ctx.ctx_lock); + mutex_unlock(&vq->call_ctx.ctx_lock); } static void vhost_vdpa_reset(struct vhost_vdpa *v) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 9ad45e1d27f0..938239e11455 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -302,7 +302,7 @@ static void vhost_vring_call_reset(struct vhost_vring_call *call_ctx) { call_ctx->ctx = NULL; memset(&call_ctx->producer, 0x0, sizeof(struct irq_bypass_producer)); - spin_lock_init(&call_ctx->ctx_lock); + mutex_init(&call_ctx->ctx_lock); } static void vhost_vq_reset(struct vhost_dev *dev, @@ -1650,9 +1650,9 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg break; } - spin_lock(&vq->call_ctx.ctx_lock); + mutex_lock(&vq->call_ctx.ctx_lock); swap(ctx, vq->call_ctx.ctx); - spin_unlock(&vq->call_ctx.ctx_lock); + mutex_unlock(&vq->call_ctx.ctx_lock); break; case VHOST_SET_VRING_ERR: if (copy_from_user(&f, argp, sizeof f)) { diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 9032d3c2a9f4..e8855ea04205 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -64,7 +64,8 @@ enum vhost_uaddr_type { struct vhost_vring_call { struct eventfd_ctx *ctx; struct irq_bypass_producer producer; - spinlock_t ctx_lock; + /* protect vq irq 
setup */ + struct mutex ctx_lock; }; /* The virtqueue structure describes a queue attached to a device. */
Re: [PATCH] vhost: Use mutex to protect vq_irq setup
On 2020/10/29 下午3:50, Eli Cohen wrote: On Thu, Oct 29, 2020 at 03:39:24PM +0800, Jason Wang wrote: On 2020/10/29 下午3:37, Eli Cohen wrote: On Thu, Oct 29, 2020 at 03:03:24PM +0800, Jason Wang wrote: On 2020/10/28 下午10:20, Eli Cohen wrote: Both irq_bypass_register_producer() and irq_bypass_unregister_producer() require process context to run. Change the call context lock from spinlock to mutex to protect the setup process to avoid deadlocks. Fixes: 265a0ad8731d ("vhost: introduce vhost_vring_call") Signed-off-by: Eli Cohen Hi Eli: During review we spotted that the spinlock is not necessary, as the path was already protected by the vq mutex. So it was removed in this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=86e182fe12ee5869022614457037097c70fe2ed1 Thanks I see, thanks. BTW, while testing irq bypassing, I noticed that qemu started crashing and I failed to boot the VM. Is that a known issue? I checked using an updated master branch of qemu from yesterday. Not known yet. Any ideas how to check this further? It would be helpful if you could paste the calltrace here. I am not too familiar with qemu. Assuming I am using virsh start to boot the VM, how can I get the call trace? You probably need to configure qemu with --enable-debug. Then, after the VM is launched, you can use gdb to attach to the qemu process; gdb may then report a calltrace if qemu crashes. Thanks
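A hedged sketch of that debugging flow (the binary name and the signal handling are assumptions about a typical KVM guest):

    $ gdb -p $(pidof qemu-system-x86_64)
    (gdb) handle SIGUSR1 noprint nostop   # qemu uses SIGUSR1 for vcpu kicks
    (gdb) continue
    ... reproduce the crash ...
    (gdb) thread apply all bt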
Re: [PATCH] vhost/vsock: add IOTLB API support
On 2020/10/30 上午1:43, Stefano Garzarella wrote: This patch enables the IOTLB API support for vhost-vsock devices, allowing the userspace to emulate an IOMMU for the guest. These changes were made following vhost-net, in details this patch: - exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb device if the feature is acked - implements VHOST_GET_BACKEND_FEATURES and VHOST_SET_BACKEND_FEATURES ioctls - calls vq_meta_prefetch() before vq processing to prefetch vq metadata address in IOTLB - provides .read_iter, .write_iter, and .poll callbacks for the chardev; they are used by the userspace to exchange IOTLB messages This patch was tested with QEMU and a patch applied [1] to fix a simple issue: $ qemu -M q35,accel=kvm,kernel-irqchip=split \ -drive file=fedora.qcow2,format=qcow2,if=virtio \ -device intel-iommu,intremap=on \ -device vhost-vsock-pci,guest-cid=3,iommu_platform=on Patch looks good, but a question: It looks to me you don't enable ATS which means vhost won't get any invalidation request or did I miss anything? Thanks [1] https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg09077.html Signed-off-by: Stefano Garzarella --- drivers/vhost/vsock.c | 68 +-- 1 file changed, 65 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c index a483cec31d5c..5e78fb719602 100644 --- a/drivers/vhost/vsock.c +++ b/drivers/vhost/vsock.c @@ -30,7 +30,12 @@ #define VHOST_VSOCK_PKT_WEIGHT 256 enum { - VHOST_VSOCK_FEATURES = VHOST_FEATURES, + VHOST_VSOCK_FEATURES = VHOST_FEATURES | + (1ULL << VIRTIO_F_ACCESS_PLATFORM) +}; + +enum { + VHOST_VSOCK_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2) }; /* Used to track all the vhost_vsock instances on the system. */ @@ -94,6 +99,9 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock, if (!vhost_vq_get_backend(vq)) goto out; + if (!vq_meta_prefetch(vq)) + goto out; + /* Avoid further vmexits, we're already processing the virtqueue */ vhost_disable_notify(&vsock->dev, vq); @@ -449,6 +457,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work) if (!vhost_vq_get_backend(vq)) goto out; + if (!vq_meta_prefetch(vq)) + goto out; + vhost_disable_notify(&vsock->dev, vq); do { u32 len; @@ -766,8 +777,12 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features) mutex_lock(&vsock->dev.mutex); if ((features & (1 << VHOST_F_LOG_ALL)) && !vhost_log_access_ok(&vsock->dev)) { - mutex_unlock(&vsock->dev.mutex); - return -EFAULT; + goto err; + } + + if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) { + if (vhost_init_device_iotlb(&vsock->dev, true)) + goto err; } for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) { @@ -778,6 +793,10 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features) } mutex_unlock(&vsock->dev.mutex); return 0; + +err: + mutex_unlock(&vsock->dev.mutex); + return -EFAULT; } static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl, @@ -811,6 +830,18 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl, if (copy_from_user(&features, argp, sizeof(features))) return -EFAULT; return vhost_vsock_set_features(vsock, features); + case VHOST_GET_BACKEND_FEATURES: + features = VHOST_VSOCK_BACKEND_FEATURES; + if (copy_to_user(argp, &features, sizeof(features))) + return -EFAULT; + return 0; + case VHOST_SET_BACKEND_FEATURES: + if (copy_from_user(&features, argp, sizeof(features))) + return -EFAULT; + if (features & ~VHOST_VSOCK_BACKEND_FEATURES) + return -EOPNOTSUPP; + vhost_set_backend_features(&vsock->dev, features); + 
return 0; default: mutex_lock(&vsock->dev.mutex); r = vhost_dev_ioctl(&vsock->dev, ioctl, argp); @@ -823,6 +854,34 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl, } } +static ssize_t vhost_vsock_chr_read_iter(struct kiocb *iocb, struct iov_iter *to) +{ + struct file *file = iocb->ki_filp; + struct vhost_vsock *vsock = file->private_data; + struct vhost_dev *dev = &vsock->dev; + int noblock = file->f_flags & O_NONBLOCK; + + return vhost_chr_read_iter(dev, to, noblock); +} + +static ssize_t vhost_vsock_chr_write_iter(struct kiocb *iocb, + struct iov_iter *from) +{ +
Re: [PATCH] vhost/vsock: add IOTLB API support
On 2020/10/30 下午6:54, Stefano Garzarella wrote: On Fri, Oct 30, 2020 at 06:02:18PM +0800, Jason Wang wrote: On 2020/10/30 上午1:43, Stefano Garzarella wrote: This patch enables the IOTLB API support for vhost-vsock devices, allowing the userspace to emulate an IOMMU for the guest. These changes were made following vhost-net, in details this patch: - exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb device if the feature is acked - implements VHOST_GET_BACKEND_FEATURES and VHOST_SET_BACKEND_FEATURES ioctls - calls vq_meta_prefetch() before vq processing to prefetch vq metadata address in IOTLB - provides .read_iter, .write_iter, and .poll callbacks for the chardev; they are used by the userspace to exchange IOTLB messages This patch was tested with QEMU and a patch applied [1] to fix a simple issue: $ qemu -M q35,accel=kvm,kernel-irqchip=split \ -drive file=fedora.qcow2,format=qcow2,if=virtio \ -device intel-iommu,intremap=on \ -device vhost-vsock-pci,guest-cid=3,iommu_platform=on Patch looks good, but a question: It looks to me you don't enable ATS which means vhost won't get any invalidation request or did I miss anything? You're right, I didn't see invalidation requests, only miss and updates. Now I have tried to enable 'ats' and 'device-iotlb' but I still don't see any invalidation. How can I test it? (Sorry but I don't have much experience yet with vIOMMU) I guess it's because the batched unmap. Maybe you can try to use "intel_iommu=strict" in guest kernel command line to see if it works. Btw, make sure the qemu contains the patch [1]. Otherwise ATS won't be enabled for recent Linux Kernel in the guest. Thanks [1] https://patchew.org/QEMU/20200909081731.24688-1-jasow...@redhat.com/ Thanks, Stefano
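For reference, the suggested guest kernel command line would be (a sketch; the "on," part is only needed where the IOMMU driver is not enabled by default):

    intel_iommu=on,strict

'strict' makes every unmap trigger a synchronous IOTLB flush instead of being batched, so each guest unmap should surface as an invalidation message on the vhost side.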
Re: [PATCH] vhost/vsock: add IOTLB API support
On 2020/11/3 上午1:11, Stefano Garzarella wrote: On Fri, Oct 30, 2020 at 07:44:43PM +0800, Jason Wang wrote: On 2020/10/30 下午6:54, Stefano Garzarella wrote: On Fri, Oct 30, 2020 at 06:02:18PM +0800, Jason Wang wrote: On 2020/10/30 上午1:43, Stefano Garzarella wrote: This patch enables the IOTLB API support for vhost-vsock devices, allowing the userspace to emulate an IOMMU for the guest. These changes were made following vhost-net, in details this patch: - exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb device if the feature is acked - implements VHOST_GET_BACKEND_FEATURES and VHOST_SET_BACKEND_FEATURES ioctls - calls vq_meta_prefetch() before vq processing to prefetch vq metadata address in IOTLB - provides .read_iter, .write_iter, and .poll callbacks for the chardev; they are used by the userspace to exchange IOTLB messages This patch was tested with QEMU and a patch applied [1] to fix a simple issue: $ qemu -M q35,accel=kvm,kernel-irqchip=split \ -drive file=fedora.qcow2,format=qcow2,if=virtio \ -device intel-iommu,intremap=on \ -device vhost-vsock-pci,guest-cid=3,iommu_platform=on Patch looks good, but a question: It looks to me you don't enable ATS which means vhost won't get any invalidation request or did I miss anything? You're right, I didn't see invalidation requests, only miss and updates. Now I have tried to enable 'ats' and 'device-iotlb' but I still don't see any invalidation. How can I test it? (Sorry but I don't have much experience yet with vIOMMU) I guess it's because the batched unmap. Maybe you can try to use "intel_iommu=strict" in guest kernel command line to see if it works. Btw, make sure the qemu contains the patch [1]. Otherwise ATS won't be enabled for recent Linux Kernel in the guest. The problem was my kernel, it was built with a tiny configuration. Using fedora stock kernel I can see the 'invalidate' requests, but I also had the following issues. Do they make you ring any bells? $ ./qemu -m 4G -smp 4 -M q35,accel=kvm,kernel-irqchip=split \ -drive file=fedora.qcow2,format=qcow2,if=virtio \ -device intel-iommu,intremap=on,device-iotlb=on \ -device vhost-vsock-pci,guest-cid=6,iommu_platform=on,ats=on,id=v1 qemu-system-x86_64: vtd_iova_to_slpte: detected IOVA overflow (iova=0x1d4030c0) It's a hint that IOVA exceeds the AW. It might be worth to check whether the missed IOVA reported from IOTLB is legal. Thanks qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=00:03:00, iova=0x1d4030c0) qemu-system-x86_64: New fault is not recorded due to compression of faults Guest kernel messages: [ 44.940872] DMAR: DRHD: handling fault status reg 2 [ 44.941989] DMAR: [DMA Read] Request device [00:03.0] PASID fault addr 88W [ 49.785884] DMAR: DRHD: handling fault status reg 2 [ 49.788874] DMAR: [DMA Read] Request device [00:03.0] PASID fault addr 88W QEMU: b149dea55c Merge remote-tracking branch 'remotes/cschoenebeck/tags/pull-9p-20201102' into staging Linux guest: 5.8.16-200.fc32.x86_64 Thanks, Stefano
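To make "legal" concrete: the qemu warning means the IOVA failed the address-width check in vtd_iova_to_slpte(); roughly (a conceptual sketch, not qemu's exact code):

    /* aw_bits is the vIOMMU address width, 39 bits by default */
    if (iova >= (1ULL << s->aw_bits))
        /* -> "detected IOVA overflow" */

So the thing to verify is whether the vhost-side IOTLB miss really reported an address the guest driver mapped, and whether that address lies inside the vIOMMU's addressable range.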
Re: [PATCH net-next v4 09/10] virtio-net: xsk zero copy xmit implement wakeup and xmit
在 2021/4/16 下午2:29, Xuan Zhuo 写道: On Fri, 16 Apr 2021 13:35:33 +0800, Jason Wang wrote: 在 2021/4/15 下午6:27, Xuan Zhuo 写道: On Wed, 14 Apr 2021 13:46:45 +0800, Jason Wang wrote: 在 2021/4/13 上午11:15, Xuan Zhuo 写道: This patch implements the core part of xsk zerocopy xmit. When the user calls sendto to consume the data in the xsk tx queue, virtnet_xsk_wakeup() will be called. In wakeup, it will try to send a part of the data directly. There are two purposes for this realization: 1. Send part of the data quickly to reduce the transmission delay of the first packet. 2. Trigger tx interrupt, start napi to consume xsk tx data. All sent xsk packets share the virtio-net header of xsk_hdr. If xsk needs to support csum and other functions later, consider assigning xsk hdr separately for each sent packet. There are now three situations in free_old_xmit(): skb, xdp frame, xsk desc. Based on the last two bit of ptr returned by virtqueue_get_buf(): 00 is skb by default. 01 represents the packet sent by xdp 10 is the packet sent by xsk If the xmit work of xsk has not been completed, but the ring is full, napi must first exit and wait for the ring to be available, so need_wakeup() is set. If free_old_xmit() is called first by start_xmit(), we can quickly wake up napi to execute xsk xmit task. When recycling, we need to count the number of bytes sent, so put xsk desc->len into the ptr pointer. Because ptr does not point to meaningful objects in xsk. Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li --- drivers/net/virtio_net.c | 296 ++- 1 file changed, 292 insertions(+), 4 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 8242a9e9f17d..c441d6bf1510 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -46,6 +46,11 @@ module_param(napi_tx, bool, 0644); #define VIRTIO_XDP_REDIRBIT(1) #define VIRTIO_XDP_FLAG BIT(0) +#define VIRTIO_XSK_FLAGBIT(1) + +#define VIRTIO_XSK_PTR_SHIFT 4 + +static struct virtio_net_hdr_mrg_rxbuf xsk_hdr; /* RX packet size EWMA. The average packet size is used to determine the packet * buffer size when refilling RX rings. As the entire RX ring may be refilled @@ -138,6 +143,12 @@ struct send_queue { struct { /* xsk pool */ struct xsk_buff_pool __rcu *pool; + + /* save the desc for next xmit, when xmit fail. */ + struct xdp_desc last_desc; As replied in the pervious version this looks tricky. I think we need to make sure to reserve some slots as skb path did. This looks exactly like what stmmac did which alos shares XDP and skb for the same ring. 
+ + /* xsk wait for tx inter or softirq */ + bool need_wakeup; } xsk; }; @@ -255,6 +266,15 @@ struct padded_vnet_hdr { char padding[4]; }; +static int virtnet_xsk_run(struct send_queue *sq, struct xsk_buff_pool *pool, + int budget, bool in_napi); +static void virtnet_xsk_complete(struct send_queue *sq, u32 num); + +static bool is_skb_ptr(void *ptr) +{ + return !((unsigned long)ptr & (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG)); +} + static bool is_xdp_frame(void *ptr) { return (unsigned long)ptr & VIRTIO_XDP_FLAG; @@ -265,6 +285,19 @@ static void *xdp_to_ptr(struct xdp_frame *ptr) return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG); } +static void *xsk_to_ptr(struct xdp_desc *desc) +{ + /* save the desc len to ptr */ + u64 p = desc->len << VIRTIO_XSK_PTR_SHIFT; + + return (void *)((unsigned long)p | VIRTIO_XSK_FLAG); +} + +static void ptr_to_xsk(void *ptr, struct xdp_desc *desc) +{ + desc->len = ((unsigned long)ptr) >> VIRTIO_XSK_PTR_SHIFT; +} + static struct xdp_frame *ptr_to_xdp(void *ptr) { return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); @@ -273,25 +306,35 @@ static struct xdp_frame *ptr_to_xdp(void *ptr) static void __free_old_xmit(struct send_queue *sq, bool in_napi, struct virtnet_sq_stats *stats) { + unsigned int xsknum = 0; unsigned int len; void *ptr; while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { - if (likely(!is_xdp_frame(ptr))) { + if (is_skb_ptr(ptr)) { struct sk_buff *skb = ptr; pr_debug("Sent skb %p\n", skb); stats->bytes += skb->len; napi_consume_skb(skb, in_napi); - } else { + } else if (is_xdp_frame(ptr)) { struct xdp_frame *frame = ptr_to_xdp(ptr); stats->bytes += frame->len; xdp_return_frame(frame); + } else { + struct
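A toy round-trip of the tagging scheme described in the commit log, using the helpers from the patch (illustration only):

    struct xdp_desc desc = { .len = 1500 };
    void *ptr = xsk_to_ptr(&desc);  /* encodes (len << 4) | VIRTIO_XSK_FLAG */

    if (!is_skb_ptr(ptr) && !is_xdp_frame(ptr)) {   /* low bits are 10 */
        struct xdp_desc out;

        ptr_to_xsk(ptr, &out);      /* recovers out.len == 1500 */
    }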
Re: [PATCH net-next v3] virtio-net: page_to_skb() use build_skb when there's sufficient tailroom
在 2021/4/16 下午5:16, Xuan Zhuo 写道: In page_to_skb(), if we have enough tailroom to save skb_shared_info, we can use build_skb to create skb directly. No need to alloc for additional space. And it can save a 'frags slot', which is very friendly to GRO. Here, if the payload of the received package is too small (less than GOOD_COPY_LEN), we still choose to copy it directly to the space got by napi_alloc_skb. So we can reuse these pages. Testing Machine: The four queues of the network card are bound to the cpu1. Test command: for ((i=0;i<5;++i)); do sockperf tp --ip 192.168.122.64 -m 1000 -t 150& done The size of the udp package is 1000, so in the case of this patch, there will always be enough tailroom to use build_skb. The sent udp packet will be discarded because there is no port to receive it. The irqsoftd of the machine is 100%, we observe the received quantity displayed by sar -n DEV 1: no build_skb: 956864.00 rxpck/s build_skb:1158465.00 rxpck/s I suggess to test the case of XDP_PASS in this case as well. Signed-off-by: Xuan Zhuo Suggested-by: Jason Wang --- v3: fix the truesize when headroom > 0 v2: conflict resolution drivers/net/virtio_net.c | 69 1 file changed, 48 insertions(+), 21 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 101659cd4b87..8cd76037c724 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -379,21 +379,17 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi, struct receive_queue *rq, struct page *page, unsigned int offset, unsigned int len, unsigned int truesize, - bool hdr_valid, unsigned int metasize) + bool hdr_valid, unsigned int metasize, + unsigned int headroom) { struct sk_buff *skb; struct virtio_net_hdr_mrg_rxbuf *hdr; unsigned int copy, hdr_len, hdr_padded_len; - char *p; + int tailroom, shinfo_size; + char *p, *hdr_p; p = page_address(page) + offset; - - /* copy small packet so we can reuse these pages for small data */ - skb = napi_alloc_skb(&rq->napi, GOOD_COPY_LEN); - if (unlikely(!skb)) - return NULL; - - hdr = skb_vnet_hdr(skb); + hdr_p = p; hdr_len = vi->hdr_len; if (vi->mergeable_rx_bufs) @@ -401,14 +397,38 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi, else hdr_padded_len = sizeof(struct padded_vnet_hdr); - /* hdr_valid means no XDP, so we can copy the vnet header */ - if (hdr_valid) - memcpy(hdr, p, hdr_len); + /* If headroom is not 0, there is an offset between the beginning of the +* data and the allocated space, otherwise the data and the allocated +* space are aligned. +*/ + if (headroom) { + /* The actual allocated space size is PAGE_SIZE. */ I think the comment in receive_mergeable() is better: /* Buffers with headroom use PAGE_SIZE as alloc size, * see add_recvbuf_mergeable() + get_mergeable_buf_len() */ + truesize = PAGE_SIZE; + tailroom = truesize - len - offset; + } else { + tailroom = truesize - len; + } len -= hdr_len; offset += hdr_padded_len; p += hdr_padded_len; + shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + + if (len > GOOD_COPY_LEN && tailroom >= shinfo_size) { Any reason that you don't use build_skb() for len < GOOD_COPY_LEN? Other looks good. Thanks + skb = build_skb(p, truesize); + if (unlikely(!skb)) + return NULL; + + skb_put(skb, len); + goto ok; + } + + /* copy small packet so we can reuse these pages for small data */ + skb = napi_alloc_skb(&rq->napi, GOOD_COPY_LEN); + if (unlikely(!skb)) + return NULL; + /* Copy all frame if it fits skb->head, otherwise * we let virtio_net_hdr_to_skb() and GRO pull headers as needed. 
*/ @@ -418,11 +438,6 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi, copy = ETH_HLEN + metasize; skb_put_data(skb, p, copy); - if (metasize) { - __skb_pull(skb, metasize); - skb_metadata_set(skb, metasize); - } - len -= copy; offset += copy; @@ -431,7 +446,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi, skb_add_rx_frag(skb, 0, page, offset, len, truesize); else put_page(page); - ret
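The decision this patch adds, condensed from the diff above into a sketch: only when the skb_shared_info footer fits in the page's tailroom can the page become the skb head with no copy:

    shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

    if (len > GOOD_COPY_LEN && tailroom >= shinfo_size) {
        /* zero copy: the receive page itself becomes the skb head */
        skb = build_skb(p, truesize);
        if (skb)
            skb_put(skb, len);
    } else {
        /* small packet: copy it out so the page can be reused */
        skb = napi_alloc_skb(&rq->napi, GOOD_COPY_LEN);
    }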
Re: [PATCH V3 1/3] vDPA/ifcvf: deduce VIRTIO device ID when probe
在 2021/4/16 下午3:16, Zhu Lingshan 写道: This commit deduces VIRTIO device ID as device type when probe, then ifcvf_vdpa_get_device_id() can simply return the ID. ifcvf_vdpa_get_features() and ifcvf_vdpa_get_config_size() can work properly based on the device ID. Signed-off-by: Zhu Lingshan --- drivers/vdpa/ifcvf/ifcvf_base.h | 1 + drivers/vdpa/ifcvf/ifcvf_main.c | 30 ++ 2 files changed, 19 insertions(+), 12 deletions(-) diff --git a/drivers/vdpa/ifcvf/ifcvf_base.h b/drivers/vdpa/ifcvf/ifcvf_base.h index b2eeb16b9c2c..1c04cd256fa7 100644 --- a/drivers/vdpa/ifcvf/ifcvf_base.h +++ b/drivers/vdpa/ifcvf/ifcvf_base.h @@ -84,6 +84,7 @@ struct ifcvf_hw { u32 notify_off_multiplier; u64 req_features; u64 hw_features; + u32 dev_type; struct virtio_pci_common_cfg __iomem *common_cfg; void __iomem *net_cfg; struct vring_info vring[IFCVF_MAX_QUEUE_PAIRS * 2]; diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c index 44d7586019da..469a9b5737b7 100644 --- a/drivers/vdpa/ifcvf/ifcvf_main.c +++ b/drivers/vdpa/ifcvf/ifcvf_main.c @@ -323,19 +323,9 @@ static u32 ifcvf_vdpa_get_generation(struct vdpa_device *vdpa_dev) static u32 ifcvf_vdpa_get_device_id(struct vdpa_device *vdpa_dev) { - struct ifcvf_adapter *adapter = vdpa_to_adapter(vdpa_dev); - struct pci_dev *pdev = adapter->pdev; - u32 ret = -ENODEV; - - if (pdev->device < 0x1000 || pdev->device > 0x107f) - return ret; - - if (pdev->device < 0x1040) - ret = pdev->subsystem_device; - else - ret = pdev->device - 0x1040; + struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev); - return ret; + return vf->dev_type; } static u32 ifcvf_vdpa_get_vendor_id(struct vdpa_device *vdpa_dev) @@ -466,6 +456,22 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id) pci_set_drvdata(pdev, adapter); vf = &adapter->vf; + + /* This drirver drives both modern virtio devices and transitional +* devices in modern mode. +* vDPA requires feature bit VIRTIO_F_ACCESS_PLATFORM, +* so legacy devices and transitional devices in legacy +* mode will not work for vDPA, this driver will not +* drive devices with legacy interface. +*/ + if (pdev->device < 0x1000 || pdev->device > 0x107f) + return -EOPNOTSUPP; So this seems useless, id_table has already did that for us: The driver supports: #define IFCVF_DEVICE_ID 0x1041 and #define C5000X_PL_BLK_DEVICE_ID 0x1001 I think we can never reach the condition above. Thanks + + if (pdev->device < 0x1040) + vf->dev_type = pdev->subsystem_device; + else + vf->dev_type = pdev->device - 0x1040; + vf->base = pcim_iomap_table(pdev); adapter->pdev = pdev;
Re: [PATCH V3 2/3] vDPA/ifcvf: enable Intel C5000X-PL virtio-block for vDPA
在 2021/4/16 下午3:16, Zhu Lingshan 写道: This commit enabled Intel FPGA SmartNIC C5000X-PL virtio-block for vDPA. Signed-off-by: Zhu Lingshan Acked-by: Jason Wang --- drivers/vdpa/ifcvf/ifcvf_base.h | 8 +++- drivers/vdpa/ifcvf/ifcvf_main.c | 19 ++- 2 files changed, 25 insertions(+), 2 deletions(-) diff --git a/drivers/vdpa/ifcvf/ifcvf_base.h b/drivers/vdpa/ifcvf/ifcvf_base.h index 1c04cd256fa7..0111bfdeb342 100644 --- a/drivers/vdpa/ifcvf/ifcvf_base.h +++ b/drivers/vdpa/ifcvf/ifcvf_base.h @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -28,7 +29,12 @@ #define C5000X_PL_SUBSYS_VENDOR_ID0x8086 #define C5000X_PL_SUBSYS_DEVICE_ID0x0001 -#define IFCVF_SUPPORTED_FEATURES \ +#define C5000X_PL_BLK_VENDOR_ID0x1AF4 +#define C5000X_PL_BLK_DEVICE_ID0x1001 +#define C5000X_PL_BLK_SUBSYS_VENDOR_ID 0x8086 +#define C5000X_PL_BLK_SUBSYS_DEVICE_ID 0x0002 + +#define IFCVF_NET_SUPPORTED_FEATURES \ ((1ULL << VIRTIO_NET_F_MAC) | \ (1ULL << VIRTIO_F_ANY_LAYOUT)| \ (1ULL << VIRTIO_F_VERSION_1) | \ diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c index 469a9b5737b7..376b2014916a 100644 --- a/drivers/vdpa/ifcvf/ifcvf_main.c +++ b/drivers/vdpa/ifcvf/ifcvf_main.c @@ -168,10 +168,23 @@ static struct ifcvf_hw *vdpa_to_vf(struct vdpa_device *vdpa_dev) static u64 ifcvf_vdpa_get_features(struct vdpa_device *vdpa_dev) { + struct ifcvf_adapter *adapter = vdpa_to_adapter(vdpa_dev); struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev); + struct pci_dev *pdev = adapter->pdev; + u64 features; - features = ifcvf_get_features(vf) & IFCVF_SUPPORTED_FEATURES; + switch (vf->dev_type) { + case VIRTIO_ID_NET: + features = ifcvf_get_features(vf) & IFCVF_NET_SUPPORTED_FEATURES; + break; + case VIRTIO_ID_BLOCK: + features = ifcvf_get_features(vf); + break; + default: + features = 0; + IFCVF_ERR(pdev, "VIRTIO ID %u not supported\n", vf->dev_type); + } return features; } @@ -517,6 +530,10 @@ static struct pci_device_id ifcvf_pci_ids[] = { C5000X_PL_DEVICE_ID, C5000X_PL_SUBSYS_VENDOR_ID, C5000X_PL_SUBSYS_DEVICE_ID) }, + { PCI_DEVICE_SUB(C5000X_PL_BLK_VENDOR_ID, +C5000X_PL_BLK_DEVICE_ID, +C5000X_PL_BLK_SUBSYS_VENDOR_ID, +C5000X_PL_BLK_SUBSYS_DEVICE_ID) }, { 0 }, };
Re: [PATCH V3 3/3] vDPA/ifcvf: get_config_size should return dev specific config size
在 2021/4/16 下午3:16, Zhu Lingshan 写道: get_config_size() should return the size based on the detected device type. Signed-off-by: Zhu Lingshan Acked-by: Jason Wang --- drivers/vdpa/ifcvf/ifcvf_main.c | 19 ++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c index 376b2014916a..3b6f7862dbb8 100644 --- a/drivers/vdpa/ifcvf/ifcvf_main.c +++ b/drivers/vdpa/ifcvf/ifcvf_main.c @@ -356,7 +356,24 @@ static u32 ifcvf_vdpa_get_vq_align(struct vdpa_device *vdpa_dev) static size_t ifcvf_vdpa_get_config_size(struct vdpa_device *vdpa_dev) { - return sizeof(struct virtio_net_config); + struct ifcvf_adapter *adapter = vdpa_to_adapter(vdpa_dev); + struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev); + struct pci_dev *pdev = adapter->pdev; + size_t size; + + switch (vf->dev_type) { + case VIRTIO_ID_NET: + size = sizeof(struct virtio_net_config); + break; + case VIRTIO_ID_BLOCK: + size = sizeof(struct virtio_blk_config); + break; + default: + size = 0; + IFCVF_ERR(pdev, "VIRTIO ID %u not supported\n", vf->dev_type); + } + + return size; } static void ifcvf_vdpa_get_config(struct vdpa_device *vdpa_dev,
Re: [PATCH net-next v3] virtio-net: page_to_skb() use build_skb when there's sufficient tailroom
在 2021/4/20 上午7:29, Mat Martineau 写道: On Fri, 16 Apr 2021, Xuan Zhuo wrote: In page_to_skb(), if we have enough tailroom to save skb_shared_info, we can use build_skb to create skb directly. No need to alloc for additional space. And it can save a 'frags slot', which is very friendly to GRO. Here, if the payload of the received package is too small (less than GOOD_COPY_LEN), we still choose to copy it directly to the space got by napi_alloc_skb. So we can reuse these pages. Testing Machine: The four queues of the network card are bound to the cpu1. Test command: for ((i=0;i<5;++i)); do sockperf tp --ip 192.168.122.64 -m 1000 -t 150& done The size of the udp package is 1000, so in the case of this patch, there will always be enough tailroom to use build_skb. The sent udp packet will be discarded because there is no port to receive it. The irqsoftd of the machine is 100%, we observe the received quantity displayed by sar -n DEV 1: no build_skb: 956864.00 rxpck/s build_skb: 1158465.00 rxpck/s Signed-off-by: Xuan Zhuo Suggested-by: Jason Wang --- v3: fix the truesize when headroom > 0 v2: conflict resolution drivers/net/virtio_net.c | 69 1 file changed, 48 insertions(+), 21 deletions(-) Xuan, I realize this has been merged to net-next already, but I'm getting a use-after-free with KASAN in page_to_skb() with this patch. Reverting this change fixes the UAF. I've included the KASAN dump below, and a couple of comments inline. diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 101659cd4b87..8cd76037c724 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -379,21 +379,17 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi, struct receive_queue *rq, struct page *page, unsigned int offset, unsigned int len, unsigned int truesize, - bool hdr_valid, unsigned int metasize) + bool hdr_valid, unsigned int metasize, + unsigned int headroom) { struct sk_buff *skb; struct virtio_net_hdr_mrg_rxbuf *hdr; unsigned int copy, hdr_len, hdr_padded_len; - char *p; + int tailroom, shinfo_size; + char *p, *hdr_p; p = page_address(page) + offset; - - /* copy small packet so we can reuse these pages for small data */ - skb = napi_alloc_skb(&rq->napi, GOOD_COPY_LEN); - if (unlikely(!skb)) - return NULL; - - hdr = skb_vnet_hdr(skb); + hdr_p = p; hdr_p is assigned here, pointer to an address in the provided page... hdr_len = vi->hdr_len; if (vi->mergeable_rx_bufs) (snip) @@ -431,7 +446,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi, skb_add_rx_frag(skb, 0, page, offset, len, truesize); else put_page(page); page is potentially released here... - return skb; + goto ok; } /* @@ -458,6 +473,18 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi, if (page) give_pages(rq, page); +ok: + /* hdr_valid means no XDP, so we can copy the vnet header */ + if (hdr_valid) { + hdr = skb_vnet_hdr(skb); + memcpy(hdr, hdr_p, hdr_len); and hdr_p is dereferenced here. Right, I tend to recover the way to copy hdr and set meta just after alloc/build_skb(). 
Thanks I'm seeing this KASAN UAF at boot time in a kvm VM (Fedora 33 host and guest, if that helps): [ 61.202483] == [ 61.204005] BUG: KASAN: use-after-free in page_to_skb+0x32a/0x4b0 [ 61.205387] Read of size 12 at addr 888105bdf800 by task NetworkManager/579 [ 61.207035] [ 61.207408] CPU: 0 PID: 579 Comm: NetworkManager Not tainted 5.12.0-rc7+ #2 [ 61.208715] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 [ 61.210257] Call Trace: [ 61.210730] [ 61.211209] dump_stack+0x93/0xc2 [ 61.211996] print_address_description.constprop.0+0x18/0x130 [ 61.213310] ? page_to_skb+0x32a/0x4b0 [ 61.214318] ? page_to_skb+0x32a/0x4b0 [ 61.215085] kasan_report.cold+0x7f/0x111 [ 61.215966] ? trace_hardirqs_off+0x10/0xe0 [ 61.216823] ? page_to_skb+0x32a/0x4b0 [ 61.217809] kasan_check_range+0xf9/0x1e0 [ 61.217834] memcpy+0x20/0x60 [ 61.217848] page_to_skb+0x32a/0x4b0 [ 61.217888] receive_buf+0x1434/0x2690 [ 61.217926] ? page_to_skb+0x4b0/0x4b0 [ 61.217947] ? find_held_lock+0x85/0xa0 [ 61.217964] ? lock_release+0x1d0/0x400 [ 61.217974] ? virtnet_poll+0x1d8/0x6b0 [ 61.217983] ? detach_buf_split+0x254/0x290 [ 61.218008] ? virtqueue_get_buf_ctx_split+0x145/0x1f0 [ 61.218032] virtnet_poll+0x2a8/0x6b0 [ 61.218057] ? receive_buf+0x2690/0x2690 [ 61.218067] ? lock_release+0x400/0x400 [ 61.218119] __napi_poll+0x57/0x2f0 [ 61.229624] net_rx_action+0x4dd/0x590
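A sketch of the fix direction from the reply above: perform the header copy immediately after the skb is created, while the page reference is certainly still held, instead of at the shared tail that can run after put_page():

    skb = napi_alloc_skb(&rq->napi, GOOD_COPY_LEN); /* or build_skb(p, ...) */
    if (unlikely(!skb))
        return NULL;

    if (hdr_valid)
        memcpy(skb_vnet_hdr(skb), hdr_p, hdr_len);  /* page is still live here */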
Re: [PATCH net-next v3] virtio-net: page_to_skb() use build_skb when there's sufficient tailroom
在 2021/4/20 上午10:38, Jason Wang 写道: + /* hdr_valid means no XDP, so we can copy the vnet header */ + if (hdr_valid) { + hdr = skb_vnet_hdr(skb); + memcpy(hdr, hdr_p, hdr_len); and hdr_p is dereferenced here. Right, I intend to restore copying the hdr and setting the meta just after alloc/build_skb(). Thanks Btw, since the patch modifies a critical path of virtio-net, I suggest testing the following cases (sample commands after this list): 1) netperf TCP stream 2) netperf UDP with packet size from 64 to PAGE_SIZE 3) XDP_PASS with 1) 4) XDP_PASS with 2) 5) XDP metadata with 1) 6) XDP metadata with 2) Thanks
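Hedged examples of those runs (the peer address and sizes are placeholders):

    netperf -H 192.168.122.1 -t TCP_STREAM                  # case 1
    for s in 64 512 1500 4096; do                           # case 2
        netperf -H 192.168.122.1 -t UDP_STREAM -- -m $s
    done
    # cases 3-6: repeat both with an XDP program returning XDP_PASS
    # attached to the virtio-net device, then with one that also sets
    # metadata via bpf_xdp_adjust_meta().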
Re: [PATCH net-next v3] virtio-net: page_to_skb() use build_skb when there's sufficient tailroom
在 2021/4/20 上午12:48, David Ahern 写道: On 4/16/21 2:16 AM, Xuan Zhuo wrote: In page_to_skb(), if we have enough tailroom to save skb_shared_info, we can use build_skb to create skb directly. No need to alloc for additional space. And it can save a 'frags slot', which is very friendly to GRO. Here, if the payload of the received package is too small (less than GOOD_COPY_LEN), we still choose to copy it directly to the space got by napi_alloc_skb. So we can reuse these pages. Testing Machine: The four queues of the network card are bound to the cpu1. Test command: for ((i=0;i<5;++i)); do sockperf tp --ip 192.168.122.64 -m 1000 -t 150& done The size of the udp package is 1000, so in the case of this patch, there will always be enough tailroom to use build_skb. The sent udp packet will be discarded because there is no port to receive it. The irqsoftd of the machine is 100%, we observe the received quantity displayed by sar -n DEV 1: no build_skb: 956864.00 rxpck/s build_skb:1158465.00 rxpck/s virtio_net is using napi_consume_skb, so napi_build_skb should show a small increase from build_skb. Yes and we probably need to do this in receive_small(). Thanks
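For reference, the suggested change is a one-liner (sketch): since the driver already frees tx skbs with napi_consume_skb(), rx heads can come from the same per-cpu NAPI skb cache:

    skb = napi_build_skb(p, truesize);  /* instead of build_skb(p, truesize) */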
Re: [PATCH V4 1/3] vDPA/ifcvf: deduce VIRTIO device ID when probe
在 2021/4/19 下午2:33, Zhu Lingshan 写道: This commit deduces VIRTIO device ID as device type when probe, then ifcvf_vdpa_get_device_id() can simply return the ID. ifcvf_vdpa_get_features() and ifcvf_vdpa_get_config_size() can work properly based on the device ID. Signed-off-by: Zhu Lingshan Acked-by: Jason Wang --- drivers/vdpa/ifcvf/ifcvf_base.h | 1 + drivers/vdpa/ifcvf/ifcvf_main.c | 27 +++ 2 files changed, 16 insertions(+), 12 deletions(-) diff --git a/drivers/vdpa/ifcvf/ifcvf_base.h b/drivers/vdpa/ifcvf/ifcvf_base.h index b2eeb16b9c2c..1c04cd256fa7 100644 --- a/drivers/vdpa/ifcvf/ifcvf_base.h +++ b/drivers/vdpa/ifcvf/ifcvf_base.h @@ -84,6 +84,7 @@ struct ifcvf_hw { u32 notify_off_multiplier; u64 req_features; u64 hw_features; + u32 dev_type; struct virtio_pci_common_cfg __iomem *common_cfg; void __iomem *net_cfg; struct vring_info vring[IFCVF_MAX_QUEUE_PAIRS * 2]; diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c index 44d7586019da..66927ec81fa5 100644 --- a/drivers/vdpa/ifcvf/ifcvf_main.c +++ b/drivers/vdpa/ifcvf/ifcvf_main.c @@ -323,19 +323,9 @@ static u32 ifcvf_vdpa_get_generation(struct vdpa_device *vdpa_dev) static u32 ifcvf_vdpa_get_device_id(struct vdpa_device *vdpa_dev) { - struct ifcvf_adapter *adapter = vdpa_to_adapter(vdpa_dev); - struct pci_dev *pdev = adapter->pdev; - u32 ret = -ENODEV; - - if (pdev->device < 0x1000 || pdev->device > 0x107f) - return ret; - - if (pdev->device < 0x1040) - ret = pdev->subsystem_device; - else - ret = pdev->device - 0x1040; + struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev); - return ret; + return vf->dev_type; } static u32 ifcvf_vdpa_get_vendor_id(struct vdpa_device *vdpa_dev) @@ -466,6 +456,19 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id) pci_set_drvdata(pdev, adapter); vf = &adapter->vf; + + /* This drirver drives both modern virtio devices and transitional +* devices in modern mode. +* vDPA requires feature bit VIRTIO_F_ACCESS_PLATFORM, +* so legacy devices and transitional devices in legacy +* mode will not work for vDPA, this driver will not +* drive devices with legacy interface. +*/ + if (pdev->device < 0x1040) + vf->dev_type = pdev->subsystem_device; + else + vf->dev_type = pdev->device - 0x1040; + vf->base = pcim_iomap_table(pdev); adapter->pdev = pdev;
Re: [PATCH net-next v2 2/5] virtio-net: separate rx/tx coalescing moderation cmds
On Thu, Nov 2, 2023 at 9:10 PM Heng Qi wrote: > > This patch separates the rx and tx global coalescing moderation > commands to support netdim switches in subsequent patches. > > Signed-off-by: Heng Qi Acked-by: Jason Wang Thanks > --- > v1->v2: > - Add 'i' definition. > > drivers/net/virtio_net.c | 31 --- > 1 file changed, 28 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 0ad2894e6a5e..0285301caf78 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -3266,10 +3266,10 @@ static int virtnet_get_link_ksettings(struct > net_device *dev, > return 0; > } > > -static int virtnet_send_notf_coal_cmds(struct virtnet_info *vi, > - struct ethtool_coalesce *ec) > +static int virtnet_send_tx_notf_coal_cmds(struct virtnet_info *vi, > + struct ethtool_coalesce *ec) > { > - struct scatterlist sgs_tx, sgs_rx; > + struct scatterlist sgs_tx; > int i; > > vi->ctrl->coal_tx.tx_usecs = cpu_to_le32(ec->tx_coalesce_usecs); > @@ -3289,6 +3289,15 @@ static int virtnet_send_notf_coal_cmds(struct > virtnet_info *vi, > vi->sq[i].intr_coal.max_packets = ec->tx_max_coalesced_frames; > } > > + return 0; > +} > + > +static int virtnet_send_rx_notf_coal_cmds(struct virtnet_info *vi, > + struct ethtool_coalesce *ec) > +{ > + struct scatterlist sgs_rx; > + int i; > + > vi->ctrl->coal_rx.rx_usecs = cpu_to_le32(ec->rx_coalesce_usecs); > vi->ctrl->coal_rx.rx_max_packets = > cpu_to_le32(ec->rx_max_coalesced_frames); > sg_init_one(&sgs_rx, &vi->ctrl->coal_rx, sizeof(vi->ctrl->coal_rx)); > @@ -3309,6 +3318,22 @@ static int virtnet_send_notf_coal_cmds(struct > virtnet_info *vi, > return 0; > } > > +static int virtnet_send_notf_coal_cmds(struct virtnet_info *vi, > + struct ethtool_coalesce *ec) > +{ > + int err; > + > + err = virtnet_send_tx_notf_coal_cmds(vi, ec); > + if (err) > + return err; > + > + err = virtnet_send_rx_notf_coal_cmds(vi, ec); > + if (err) > + return err; > + > + return 0; > +} > + > static int virtnet_send_ctrl_coal_vq_cmd(struct virtnet_info *vi, > u16 vqn, u32 max_usecs, u32 > max_packets) > { > -- > 2.19.1.6.gb485710b > >
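After the split, an rx-only update (what netdim will issue) no longer resends the tx parameters; from userspace the knobs are unchanged, e.g. (hedged examples):

    ethtool -C eth0 rx-usecs 8 rx-frames 64     # rx command only
    ethtool -C eth0 tx-usecs 32 tx-frames 128   # tx command only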
Re: [PATCH net-next v2 4/5] virtio-net: support rx netdim
On Thu, Nov 2, 2023 at 9:10 PM Heng Qi wrote: > > By comparing the traffic information in the complete napi processes, > let the virtio-net driver automatically adjust the coalescing > moderation parameters of each receive queue. > > Signed-off-by: Heng Qi > --- > v1->v2: > - Improved the judgment of dim switch conditions. > - Fix the safe problem of work thread. Nit: it's better to be more concrete here, e.g what kind of "safe problem" it is. > > drivers/net/virtio_net.c | 187 ++- > 1 file changed, 165 insertions(+), 22 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 69fe09e99b3c..5473aa1ee5cd 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -19,6 +19,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -172,6 +173,17 @@ struct receive_queue { > > struct virtnet_rq_stats stats; > > + /* The number of rx notifications */ > + u16 calls; > + > + /* Is dynamic interrupt moderation enabled? */ > + bool dim_enabled; > + > + /* Dynamic Interrupt Moderation */ > + struct dim dim; > + > + u32 packets_in_napi; > + > struct virtnet_interrupt_coalesce intr_coal; > > /* Chain pages by the private ptr. */ > @@ -305,6 +317,9 @@ struct virtnet_info { > u8 duplex; > u32 speed; > > + /* Is rx dynamic interrupt moderation enabled? */ > + bool rx_dim_enabled; > + > /* Interrupt coalescing settings */ > struct virtnet_interrupt_coalesce intr_coal_tx; > struct virtnet_interrupt_coalesce intr_coal_rx; > @@ -2001,6 +2016,7 @@ static void skb_recv_done(struct virtqueue *rvq) > struct virtnet_info *vi = rvq->vdev->priv; > struct receive_queue *rq = &vi->rq[vq2rxq(rvq)]; > > + rq->calls++; > virtqueue_napi_schedule(&rq->napi, rvq); > } > > @@ -2141,6 +2157,26 @@ static void virtnet_poll_cleantx(struct receive_queue > *rq) > } > } > > +static void virtnet_rx_dim_work(struct work_struct *work); > + > +static void virtnet_rx_dim_update(struct virtnet_info *vi, struct > receive_queue *rq) > +{ > + struct dim_sample cur_sample = {}; > + > + if (!rq->packets_in_napi) > + return; > + > + u64_stats_update_begin(&rq->stats.syncp); > + dim_update_sample(rq->calls, > + u64_stats_read(&rq->stats.packets), > + u64_stats_read(&rq->stats.bytes), > + &cur_sample); > + u64_stats_update_end(&rq->stats.syncp); > + > + net_dim(&rq->dim, cur_sample); > + rq->packets_in_napi = 0; > +} > + > static int virtnet_poll(struct napi_struct *napi, int budget) > { > struct receive_queue *rq = > @@ -2149,17 +2185,22 @@ static int virtnet_poll(struct napi_struct *napi, int > budget) > struct send_queue *sq; > unsigned int received; > unsigned int xdp_xmit = 0; > + bool napi_complete; > > virtnet_poll_cleantx(rq); > > received = virtnet_receive(rq, budget, &xdp_xmit); > + rq->packets_in_napi += received; > > if (xdp_xmit & VIRTIO_XDP_REDIR) > xdp_do_flush(); > > /* Out of packets? 
*/ > - if (received < budget) > - virtqueue_napi_complete(napi, rq->vq, received); > + if (received < budget) { > + napi_complete = virtqueue_napi_complete(napi, rq->vq, > received); > + if (napi_complete && rq->dim_enabled) > + virtnet_rx_dim_update(vi, rq); > + } > > if (xdp_xmit & VIRTIO_XDP_TX) { > sq = virtnet_xdp_get_sq(vi); > @@ -2179,6 +2220,7 @@ static void virtnet_disable_queue_pair(struct > virtnet_info *vi, int qp_index) > virtnet_napi_tx_disable(&vi->sq[qp_index].napi); > napi_disable(&vi->rq[qp_index].napi); > xdp_rxq_info_unreg(&vi->rq[qp_index].xdp_rxq); > + cancel_work_sync(&vi->rq[qp_index].dim.work); > } > > static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index) > @@ -2196,6 +2238,9 @@ static int virtnet_enable_queue_pair(struct > virtnet_info *vi, int qp_index) > if (err < 0) > goto err_xdp_reg_mem_model; > > + INIT_WORK(&vi->rq[qp_index].dim.work, virtnet_rx_dim_work); Any reason we need to do the INIT_WORK here but not probe? > + vi->rq[qp_index].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE; > + > virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi); > virtnet_napi_tx_enable(vi, vi->sq[qp_index].vq, > &vi->sq[qp_index].napi); > > @@ -2393,8 +2438,10 @@ static int virtnet_rx_resize(struct virtnet_info *vi, > > qindex = rq - vi->rq; > > - if (running) > + if (running) { > napi_disable(&rq->napi); > + cancel_work_sync(&rq->dim.work); > + } > > err
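For context, virtnet_rx_dim_work() is declared in the patch but its body is not in the quoted hunk; a hedged sketch of the usual net_dim consumer shape (the ctrl-vq send is elided):

    static void virtnet_rx_dim_work(struct work_struct *work)
    {
        struct dim *dim = container_of(work, struct dim, work);
        struct dim_cq_moder moder;

        /* profile_ix was chosen by net_dim() from the napi samples */
        moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);

        /* ...push moder.usec / moder.pkts for this rxq over the ctrl
         * virtqueue, e.g. via virtnet_send_ctrl_coal_vq_cmd()... */

        dim->state = DIM_START_MEASURE;
    }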
Re: [PATCH net-next v2 5/5] virtio-net: return -EOPNOTSUPP for adaptive-tx
On Thu, Nov 2, 2023 at 9:10 PM Heng Qi wrote: > > We do not currently support tx dim, so respond to -EOPNOTSUPP. > > Signed-off-by: Heng Qi > --- > v1->v2: > - Use -EOPNOTSUPP instead of specific implementation. > > drivers/net/virtio_net.c | 29 ++--- > 1 file changed, 26 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 5473aa1ee5cd..03edeadd0725 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -3364,9 +3364,15 @@ static int virtnet_get_link_ksettings(struct > net_device *dev, > static int virtnet_send_tx_notf_coal_cmds(struct virtnet_info *vi, > struct ethtool_coalesce *ec) > { > + bool tx_ctrl_dim_on = !!ec->use_adaptive_tx_coalesce; > struct scatterlist sgs_tx; > int i; > > + if (tx_ctrl_dim_on) { > + pr_debug("Failed to enable adaptive-tx, which is not > supported\n"); > + return -EOPNOTSUPP; > + } When can we hit this? Thanks > + > vi->ctrl->coal_tx.tx_usecs = cpu_to_le32(ec->tx_coalesce_usecs); > vi->ctrl->coal_tx.tx_max_packets = > cpu_to_le32(ec->tx_max_coalesced_frames); > sg_init_one(&sgs_tx, &vi->ctrl->coal_tx, sizeof(vi->ctrl->coal_tx)); > @@ -3497,6 +3503,25 @@ static int virtnet_send_rx_notf_coal_vq_cmds(struct > virtnet_info *vi, > return 0; > } > > +static int virtnet_send_tx_notf_coal_vq_cmds(struct virtnet_info *vi, > +struct ethtool_coalesce *ec, > +u16 queue) > +{ > + bool tx_ctrl_dim_on = !!ec->use_adaptive_tx_coalesce; > + int err; > + > + if (tx_ctrl_dim_on) { > + pr_debug("Enabling adaptive-tx for txq%d is not supported\n", > queue); > + return -EOPNOTSUPP; > + } > + > + err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, queue, > + ec->tx_coalesce_usecs, > + ec->tx_max_coalesced_frames); > + > + return err; > +} > + > static int virtnet_send_notf_coal_vq_cmds(struct virtnet_info *vi, > struct ethtool_coalesce *ec, > u16 queue) > @@ -3507,9 +3532,7 @@ static int virtnet_send_notf_coal_vq_cmds(struct > virtnet_info *vi, > if (err) > return err; > > - err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, queue, > - ec->tx_coalesce_usecs, > - ec->tx_max_coalesced_frames); > + err = virtnet_send_tx_notf_coal_vq_cmds(vi, ec, queue); > if (err) > return err; > > -- > 2.19.1.6.gb485710b >
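Presumably the reachable path is ethtool's adaptive coalescing knob; a hedged example of what a user would see:

    $ ethtool -C eth0 adaptive-tx on
    netlink error: Operation not supported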
Re: [PATCH net-next v2 04/21] virtio_net: move core structures to virtio_net.h
On Tue, Nov 7, 2023 at 11:13 AM Xuan Zhuo wrote: > > Move some core structures (send_queue, receive_queue, virtnet_info) > definitions and the relative structures definitions into the > virtio_net.h file. > > That will be used by the other c code files. > > Signed-off-by: Xuan Zhuo > --- Acked-by: Jason Wang Thanks
Re: [PATCH net-next v2 05/21] virtio_net: add prefix virtnet to all struct inside virtio_net.h
On Tue, Nov 7, 2023 at 11:13 AM Xuan Zhuo wrote: > > We move some structures to the header file, but these structures do not > prefixed with virtnet. This patch adds virtnet for these. > > Signed-off-by: Xuan Zhuo > --- Acked-by: Jason Wang Thanks
Re: [PATCH net-next v2 08/21] virtio_net: sq support premapped mode
On Tue, Nov 7, 2023 at 11:12 AM Xuan Zhuo wrote: > > If the xsk is enabling, the xsk tx will share the send queue. > But the xsk requires that the send queue use the premapped mode. > So the send queue must support premapped mode. > > Signed-off-by: Xuan Zhuo > --- > drivers/net/virtio/main.c | 163 > drivers/net/virtio/virtio_net.h | 16 > 2 files changed, 163 insertions(+), 16 deletions(-) > > diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c > index 16e75c08639e..f052db459156 100644 > --- a/drivers/net/virtio/main.c > +++ b/drivers/net/virtio/main.c > @@ -46,6 +46,7 @@ module_param(napi_tx, bool, 0644); > #define VIRTIO_XDP_REDIR BIT(1) > > #define VIRTIO_XDP_FLAGBIT(0) > +#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG) > > #define VIRTNET_DRIVER_VERSION "1.0.0" > > @@ -167,6 +168,29 @@ static struct xdp_frame *ptr_to_xdp(void *ptr) > return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); > } > > +static inline void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data) > +{ > + struct virtnet_sq_dma *next, *head; > + > + head = (void *)((unsigned long)data & ~VIRTIO_XMIT_DATA_MASK); > + > + data = head->data; > + > + while (head) { > + virtqueue_dma_unmap_single_attrs(sq->vq, head->addr, > head->len, > +DMA_TO_DEVICE, 0); > + > + next = head->next; > + > + head->next = sq->dmainfo.free; > + sq->dmainfo.free = head; > + > + head = next; > + } > + > + return data; > +} > + > static void __free_old_xmit(struct virtnet_sq *sq, bool in_napi, > u64 *bytes, u64 *packets) > { > @@ -175,14 +199,24 @@ static void __free_old_xmit(struct virtnet_sq *sq, bool > in_napi, > > while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { > if (!is_xdp_frame(ptr)) { > - struct sk_buff *skb = ptr; > + struct sk_buff *skb; > + > + if (sq->do_dma) > + ptr = virtnet_sq_unmap(sq, ptr); > + > + skb = ptr; > > pr_debug("Sent skb %p\n", skb); > > *bytes += skb->len; > napi_consume_skb(skb, in_napi); > } else { > - struct xdp_frame *frame = ptr_to_xdp(ptr); > + struct xdp_frame *frame; > + > + if (sq->do_dma) > + ptr = virtnet_sq_unmap(sq, ptr); > + > + frame = ptr_to_xdp(ptr); > > *bytes += xdp_get_frame_len(frame); > xdp_return_frame(frame); > @@ -567,22 +601,104 @@ static void *virtnet_rq_alloc(struct virtnet_rq *rq, > u32 size, gfp_t gfp) > return buf; > } > > -static void virtnet_rq_set_premapped(struct virtnet_info *vi) > +static int virtnet_sq_set_premapped(struct virtnet_sq *sq) > { > - int i; > + struct virtnet_sq_dma *d; > + int err, size, i; > > - /* disable for big mode */ > - if (!vi->mergeable_rx_bufs && vi->big_packets) > - return; > + size = virtqueue_get_vring_size(sq->vq); > + > + size += MAX_SKB_FRAGS + 2; Btw, the dmainfo seems per sg? If I'm correct, how can vq_size + MAX_SKB_FRAGS + 2 work? > + > + sq->dmainfo.head = kcalloc(size, sizeof(*sq->dmainfo.head), > GFP_KERNEL); > + if (!sq->dmainfo.head) > + return -ENOMEM; > + > + err = virtqueue_set_dma_premapped(sq->vq); > + if (err) { > + kfree(sq->dmainfo.head); > + return err; > + } Allocating after set_dma_premapped() seems easier. Btw, is there a benchmark of TX PPS just for this patch to demonstrate the impact of the performance? 
> + > + sq->dmainfo.free = NULL; > + > + sq->do_dma = true; > + > + for (i = 0; i < size; ++i) { > + d = &sq->dmainfo.head[i]; > + > + d->next = sq->dmainfo.free; > + sq->dmainfo.free = d; > + } > + > + return 0; > +} > + > +static void virtnet_set_premapped(struct virtnet_info *vi) > +{ > + int i; > > for (i = 0; i < vi->max_queue_pairs; i++) { > - if (virtqueue_set_dma_premapped(vi->rq[i].vq)) > - continue; > + virtnet_sq_set_premapped(&vi->sq[i]); > > - vi->rq[i].do_dma = true; > + /* disable for big mode */ > + if (vi->mergeable_rx_bufs || !vi->big_packets) { > + if (!virtqueue_set_dma_premapped(vi->rq[i].vq)) > + vi->rq[i].do_dma = true; How about sticking a virtnet_rq_set_premapped() and calling it here? It seems more clean. Btw, the big mode support for pre mapping is still worthwhil
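The suggested helper might look like this (a sketch mirroring the sq variant; only names already in the patch are used):

    static void virtnet_rq_set_premapped(struct virtnet_rq *rq)
    {
        if (!virtqueue_set_dma_premapped(rq->vq))
            rq->do_dma = true;
    }

so virtnet_set_premapped() becomes two symmetric per-queue-pair calls.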
Re: [PATCH net-next v2 08/21] virtio_net: sq support premapped mode
On Thu, Nov 9, 2023 at 7:06 PM Xuan Zhuo wrote: > > On Thu, 9 Nov 2023 14:37:38 +0800, Jason Wang wrote: > > On Tue, Nov 7, 2023 at 11:12 AM Xuan Zhuo > > wrote: > > > > > > If the xsk is enabling, the xsk tx will share the send queue. > > > But the xsk requires that the send queue use the premapped mode. > > > So the send queue must support premapped mode. > > > > > > Signed-off-by: Xuan Zhuo > > > --- > > > drivers/net/virtio/main.c | 163 > > > drivers/net/virtio/virtio_net.h | 16 > > > 2 files changed, 163 insertions(+), 16 deletions(-) > > > > > > diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c > > > index 16e75c08639e..f052db459156 100644 > > > --- a/drivers/net/virtio/main.c > > > +++ b/drivers/net/virtio/main.c > > > @@ -46,6 +46,7 @@ module_param(napi_tx, bool, 0644); > > > #define VIRTIO_XDP_REDIR BIT(1) > > > > > > #define VIRTIO_XDP_FLAGBIT(0) > > > +#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG) > > > > > > #define VIRTNET_DRIVER_VERSION "1.0.0" > > > > > > @@ -167,6 +168,29 @@ static struct xdp_frame *ptr_to_xdp(void *ptr) > > > return (struct xdp_frame *)((unsigned long)ptr & > > > ~VIRTIO_XDP_FLAG); > > > } > > > > > > +static inline void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data) > > > +{ > > > + struct virtnet_sq_dma *next, *head; > > > + > > > + head = (void *)((unsigned long)data & ~VIRTIO_XMIT_DATA_MASK); > > > + > > > + data = head->data; > > > + > > > + while (head) { > > > + virtqueue_dma_unmap_single_attrs(sq->vq, head->addr, > > > head->len, > > > +DMA_TO_DEVICE, 0); > > > + > > > + next = head->next; > > > + > > > + head->next = sq->dmainfo.free; > > > + sq->dmainfo.free = head; > > > + > > > + head = next; > > > + } > > > + > > > + return data; > > > +} > > > + > > > static void __free_old_xmit(struct virtnet_sq *sq, bool in_napi, > > > u64 *bytes, u64 *packets) > > > { > > > @@ -175,14 +199,24 @@ static void __free_old_xmit(struct virtnet_sq *sq, > > > bool in_napi, > > > > > > while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { > > > if (!is_xdp_frame(ptr)) { > > > - struct sk_buff *skb = ptr; > > > + struct sk_buff *skb; > > > + > > > + if (sq->do_dma) > > > + ptr = virtnet_sq_unmap(sq, ptr); > > > + > > > + skb = ptr; > > > > > > pr_debug("Sent skb %p\n", skb); > > > > > > *bytes += skb->len; > > > napi_consume_skb(skb, in_napi); > > > } else { > > > - struct xdp_frame *frame = ptr_to_xdp(ptr); > > > + struct xdp_frame *frame; > > > + > > > + if (sq->do_dma) > > > + ptr = virtnet_sq_unmap(sq, ptr); > > > + > > > + frame = ptr_to_xdp(ptr); > > > > > > *bytes += xdp_get_frame_len(frame); > > > xdp_return_frame(frame); > > > @@ -567,22 +601,104 @@ static void *virtnet_rq_alloc(struct virtnet_rq > > > *rq, u32 size, gfp_t gfp) > > > return buf; > > > } > > > > > > -static void virtnet_rq_set_premapped(struct virtnet_info *vi) > > > +static int virtnet_sq_set_premapped(struct virtnet_sq *sq) > > > { > > > - int i; > > > + struct virtnet_sq_dma *d; > > > + int err, size, i; > > > > > > - /* disable for big mode */ > > > - if (!vi->mergeable_rx_bufs && vi->big_packets) > > > - return; > > > + size = virtqueue_get_vring_size(sq->vq); > > > + > > > + size += MAX_SKB_FRAGS + 2; > > > > Btw, the dma
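One way to read the sizing question above: dmainfo entries are consumed per scatterlist element, so with indirect descriptors (where a single ring entry can carry a whole multi-sg packet) the worst case is multiplicative rather than additive. A back-of-the-envelope bound, under that assumption:

    /* Every one of the vq_size ring entries may hold an skb with up to
     * MAX_SKB_FRAGS + 2 sg elements, each needing its own dmainfo slot:
     *
     *         vq_size * (MAX_SKB_FRAGS + 2)
     *
     * entries in the worst case. vq_size + MAX_SKB_FRAGS + 2 is only
     * enough when each descriptor maps a single sg element.
     */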
Re: [PATCH net-next v2 08/21] virtio_net: sq support premapped mode
On Tue, Nov 14, 2023 at 11:42 AM Xuan Zhuo wrote: > > On Tue, 14 Nov 2023 11:26:42 +0800, Jason Wang wrote: > > On Thu, Nov 9, 2023 at 7:06 PM Xuan Zhuo wrote: > > > > > > On Thu, 9 Nov 2023 14:37:38 +0800, Jason Wang wrote: > > > > On Tue, Nov 7, 2023 at 11:12 AM Xuan Zhuo > > > > wrote: > > > > > > > > > > If the xsk is enabling, the xsk tx will share the send queue. > > > > > But the xsk requires that the send queue use the premapped mode. > > > > > So the send queue must support premapped mode. > > > > > > > > > > Signed-off-by: Xuan Zhuo > > > > > --- > > > > > drivers/net/virtio/main.c | 163 > > > > > > > > > > drivers/net/virtio/virtio_net.h | 16 > > > > > 2 files changed, 163 insertions(+), 16 deletions(-) > > > > > [...] > > > > > > > > I think we need to seek a way to reuse what has been stored by virtio > > > > core. It should be much more efficient. > > > > > > > > > Yes. > > > > > > But that is for net-next branch. > > > > > > Can we do that as a fix after that is merged to 6.8? > > > > We still have time. I would like to do it from the start. > > > I want to finish the job including new AF_XDP ZC feature. > Because that this must wait the merge window. > Base on that, the optimizing work can be done everytime. > > If we work from the new virtio prepare, that can be merged to 6.8. > And the AF_XDP zc must wait 6.9. right? It can be part of this series. Or anything I missed? My understanding is that, since the information is handy, it just requires new helpers. So I don't expect it needs a large series. Thanks > > Thanks >
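The kind of helper discussed here is what the v3 series further below introduces as virtqueue_get_buf_ctx_dma(); the sketch below uses hypothetical field and parameter names, since only the helper's name, not its final signature, appears in this thread:

    /* Hypothetical shape: the core hands back the addr/len pairs it
     * already recorded for the buffer, so the driver needs no parallel
     * virtnet_sq_dma chain of its own.
     */
    struct vring_dma_info {
            dma_addr_t addr;
            u32 len;
    };

    buf = virtqueue_get_buf_ctx_dma(sq->vq, &len, &ctx, infos, &num);
    for (i = 0; i < num; i++)
            virtqueue_dma_unmap_single_attrs(sq->vq, infos[i].addr,
                                             infos[i].len,
                                             DMA_TO_DEVICE, 0);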
Re: [PATCH net-next v2 08/21] virtio_net: sq support premapped mode
On Tue, Nov 14, 2023 at 11:59 AM Xuan Zhuo wrote: > > On Tue, 14 Nov 2023 11:55:52 +0800, Jason Wang wrote: > > On Tue, Nov 14, 2023 at 11:42 AM Xuan Zhuo > > wrote: > > > > > > On Tue, 14 Nov 2023 11:26:42 +0800, Jason Wang > > > wrote: > > > > On Thu, Nov 9, 2023 at 7:06 PM Xuan Zhuo > > > > wrote: > > > > > > > > > > On Thu, 9 Nov 2023 14:37:38 +0800, Jason Wang > > > > > wrote: > > > > > > On Tue, Nov 7, 2023 at 11:12 AM Xuan Zhuo > > > > > > wrote: > > > > > > > > > > > > > > If the xsk is enabling, the xsk tx will share the send queue. > > > > > > > But the xsk requires that the send queue use the premapped mode. > > > > > > > So the send queue must support premapped mode. > > > > > > > > > > > > > > Signed-off-by: Xuan Zhuo > > > > > > > --- > > > > > > > drivers/net/virtio/main.c | 163 > > > > > > > > > > > > > > drivers/net/virtio/virtio_net.h | 16 > > > > > > > 2 files changed, 163 insertions(+), 16 deletions(-) > > > > > > > > > > > [...] > > > > > > > > > > > > > > I think we need to seek a way to reuse what has been stored by > > > > > > virtio > > > > > > core. It should be much more efficient. > > > > > > > > > > > > > > > Yes. > > > > > > > > > > But that is for net-next branch. > > > > > > > > > > Can we do that as a fix after that is merged to 6.8? > > > > > > > > We still have time. I would like to do it from the start. > > > > > > > > > I want to finish the job including new AF_XDP ZC feature. > > > Because that this must wait the merge window. > > > Base on that, the optimizing work can be done everytime. > > > > > > If we work from the new virtio prepare, that can be merged to 6.8. > > > And the AF_XDP zc must wait 6.9. right? > > > > It can be part of this series. Or anything I missed? > > > > My understanding is that, since the information is handy, it just > > requires new helpers. So I don't expect it needs a large series. > > Now, this is pushing to net-next. > > If we add an new virtio-core helper. That must be pushed to virtio branch. > And this patch set must wait that. I don't think so if it's just a matter of new helpers. The acknowledgement from the virtio maintainer should be sufficient. Let's just try and see? THanks > > Thanks. > > > > > > Thanks > > > > > > > > Thanks > > > > > >
Re: [PATCH net-next v4 4/4] virtio-net: support rx netdim
On Mon, Nov 20, 2023 at 8:38 PM Heng Qi wrote: > > By comparing the traffic information in the complete napi processes, > let the virtio-net driver automatically adjust the coalescing > moderation parameters of each receive queue. > > Signed-off-by: Heng Qi > --- > v2->v3: > - Some minor modifications. > > v1->v2: > - Improved the judgment of dim switch conditions. > - Cancel the work when vq reset. > > drivers/net/virtio_net.c | 191 ++- > 1 file changed, 169 insertions(+), 22 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 69fe09e99b3c..bc32d5aae005 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -19,6 +19,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -172,6 +173,17 @@ struct receive_queue { > > struct virtnet_rq_stats stats; > > + /* The number of rx notifications */ > + u16 calls; > + > + /* Is dynamic interrupt moderation enabled? */ > + bool dim_enabled; > + > + /* Dynamic Interrupt Moderation */ > + struct dim dim; > + > + u32 packets_in_napi; > + > struct virtnet_interrupt_coalesce intr_coal; > > /* Chain pages by the private ptr. */ > @@ -305,6 +317,9 @@ struct virtnet_info { > u8 duplex; > u32 speed; > > + /* Is rx dynamic interrupt moderation enabled? */ > + bool rx_dim_enabled; > + > /* Interrupt coalescing settings */ > struct virtnet_interrupt_coalesce intr_coal_tx; > struct virtnet_interrupt_coalesce intr_coal_rx; > @@ -2001,6 +2016,7 @@ static void skb_recv_done(struct virtqueue *rvq) > struct virtnet_info *vi = rvq->vdev->priv; > struct receive_queue *rq = &vi->rq[vq2rxq(rvq)]; > > + rq->calls++; > virtqueue_napi_schedule(&rq->napi, rvq); > } > > @@ -2141,6 +2157,26 @@ static void virtnet_poll_cleantx(struct receive_queue > *rq) > } > } > > +static void virtnet_rx_dim_work(struct work_struct *work); > + > +static void virtnet_rx_dim_update(struct virtnet_info *vi, struct > receive_queue *rq) > +{ > + struct dim_sample cur_sample = {}; > + > + if (!rq->packets_in_napi) > + return; > + > + u64_stats_update_begin(&rq->stats.syncp); > + dim_update_sample(rq->calls, > + u64_stats_read(&rq->stats.packets), > + u64_stats_read(&rq->stats.bytes), > + &cur_sample); > + u64_stats_update_end(&rq->stats.syncp); > + > + net_dim(&rq->dim, cur_sample); > + rq->packets_in_napi = 0; > +} > + > static int virtnet_poll(struct napi_struct *napi, int budget) > { > struct receive_queue *rq = > @@ -2149,17 +2185,22 @@ static int virtnet_poll(struct napi_struct *napi, int > budget) > struct send_queue *sq; > unsigned int received; > unsigned int xdp_xmit = 0; > + bool napi_complete; > > virtnet_poll_cleantx(rq); > > received = virtnet_receive(rq, budget, &xdp_xmit); > + rq->packets_in_napi += received; > > if (xdp_xmit & VIRTIO_XDP_REDIR) > xdp_do_flush(); > > /* Out of packets? 
*/ > - if (received < budget) > - virtqueue_napi_complete(napi, rq->vq, received); > + if (received < budget) { > + napi_complete = virtqueue_napi_complete(napi, rq->vq, > received); > + if (napi_complete && rq->dim_enabled) > + virtnet_rx_dim_update(vi, rq); > + } > > if (xdp_xmit & VIRTIO_XDP_TX) { > sq = virtnet_xdp_get_sq(vi); > @@ -2179,6 +2220,7 @@ static void virtnet_disable_queue_pair(struct > virtnet_info *vi, int qp_index) > virtnet_napi_tx_disable(&vi->sq[qp_index].napi); > napi_disable(&vi->rq[qp_index].napi); > xdp_rxq_info_unreg(&vi->rq[qp_index].xdp_rxq); > + cancel_work_sync(&vi->rq[qp_index].dim.work); > } > > static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index) > @@ -2196,6 +2238,9 @@ static int virtnet_enable_queue_pair(struct > virtnet_info *vi, int qp_index) > if (err < 0) > goto err_xdp_reg_mem_model; > > + INIT_WORK(&vi->rq[qp_index].dim.work, virtnet_rx_dim_work); > + vi->rq[qp_index].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE; So in V2, you explained it can be done here but I want to know why it must be done here. For example, the refill_work is initialized in alloc_queues(). > + > virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi); > virtnet_napi_tx_enable(vi, vi->sq[qp_index].vq, > &vi->sq[qp_index].napi); > > @@ -2393,8 +2438,10 @@ static int virtnet_rx_resize(struct virtnet_info *vi, > > qindex = rq - vi->rq; > > - if (running) > + if (running) { > napi_disable(&rq->napi); > + cancel_work_syn
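The alternative hinted at above, as a sketch: move the one-time DIM initialization next to where refill_work is already set up in virtnet_alloc_queues(), leaving virtnet_enable_queue_pair() to only enable NAPI:

    /* In virtnet_alloc_queues(), beside the existing
     * INIT_DELAYED_WORK(&vi->refill, refill_work):
     */
    for (i = 0; i < vi->max_queue_pairs; i++) {
            INIT_WORK(&vi->rq[i].dim.work, virtnet_rx_dim_work);
            vi->rq[i].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
    }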
Re: [PATCH net-next v4 4/4] virtio-net: support rx netdim
On Wed, Nov 22, 2023 at 1:52 PM Jason Wang wrote: > > On Mon, Nov 20, 2023 at 8:38 PM Heng Qi wrote: > > > > By comparing the traffic information in the complete napi processes, > > let the virtio-net driver automatically adjust the coalescing > > moderation parameters of each receive queue. > > > > Signed-off-by: Heng Qi > > --- > > v2->v3: > > - Some minor modifications. > > > > v1->v2: > > - Improved the judgment of dim switch conditions. > > - Cancel the work when vq reset. > > > > drivers/net/virtio_net.c | 191 ++- > > 1 file changed, 169 insertions(+), 22 deletions(-) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index 69fe09e99b3c..bc32d5aae005 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -19,6 +19,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -172,6 +173,17 @@ struct receive_queue { > > > > struct virtnet_rq_stats stats; > > > > + /* The number of rx notifications */ > > + u16 calls; > > + > > + /* Is dynamic interrupt moderation enabled? */ > > + bool dim_enabled; > > + > > + /* Dynamic Interrupt Moderation */ > > + struct dim dim; > > + > > + u32 packets_in_napi; > > + > > struct virtnet_interrupt_coalesce intr_coal; > > > > /* Chain pages by the private ptr. */ > > @@ -305,6 +317,9 @@ struct virtnet_info { > > u8 duplex; > > u32 speed; > > > > + /* Is rx dynamic interrupt moderation enabled? */ > > + bool rx_dim_enabled; > > + > > /* Interrupt coalescing settings */ > > struct virtnet_interrupt_coalesce intr_coal_tx; > > struct virtnet_interrupt_coalesce intr_coal_rx; > > @@ -2001,6 +2016,7 @@ static void skb_recv_done(struct virtqueue *rvq) > > struct virtnet_info *vi = rvq->vdev->priv; > > struct receive_queue *rq = &vi->rq[vq2rxq(rvq)]; > > > > + rq->calls++; > > virtqueue_napi_schedule(&rq->napi, rvq); > > } > > > > @@ -2141,6 +2157,26 @@ static void virtnet_poll_cleantx(struct > > receive_queue *rq) > > } > > } > > > > +static void virtnet_rx_dim_work(struct work_struct *work); > > + > > +static void virtnet_rx_dim_update(struct virtnet_info *vi, struct > > receive_queue *rq) > > +{ > > + struct dim_sample cur_sample = {}; > > + > > + if (!rq->packets_in_napi) > > + return; > > + > > + u64_stats_update_begin(&rq->stats.syncp); > > + dim_update_sample(rq->calls, > > + u64_stats_read(&rq->stats.packets), > > + u64_stats_read(&rq->stats.bytes), > > + &cur_sample); > > + u64_stats_update_end(&rq->stats.syncp); > > + > > + net_dim(&rq->dim, cur_sample); > > + rq->packets_in_napi = 0; > > +} > > + > > static int virtnet_poll(struct napi_struct *napi, int budget) > > { > > struct receive_queue *rq = > > @@ -2149,17 +2185,22 @@ static int virtnet_poll(struct napi_struct *napi, > > int budget) > > struct send_queue *sq; > > unsigned int received; > > unsigned int xdp_xmit = 0; > > + bool napi_complete; > > > > virtnet_poll_cleantx(rq); > > > > received = virtnet_receive(rq, budget, &xdp_xmit); > > + rq->packets_in_napi += received; > > > > if (xdp_xmit & VIRTIO_XDP_REDIR) > > xdp_do_flush(); > > > > /* Out of packets? 
*/ > > - if (received < budget) > > - virtqueue_napi_complete(napi, rq->vq, received); > > + if (received < budget) { > > + napi_complete = virtqueue_napi_complete(napi, rq->vq, > > received); > > + if (napi_complete && rq->dim_enabled) > > + virtnet_rx_dim_update(vi, rq); > > + } > > > > if (xdp_xmit & VIRTIO_XDP_TX) { > > sq = virtnet_xdp_get_sq(vi); > > @@ -2179,6 +2220,7 @@ static void virtnet_disable_queue_pair(struct > > virtnet_info *vi, int qp_index) > > virt
Re: [PATCH net-next v6 4/5] virtio-net: add spin lock for ctrl cmd access
On Tue, Dec 5, 2023 at 4:02 PM Heng Qi wrote: > > Currently access to ctrl cmd is globally protected via rtnl_lock and works > fine. But if dim work's access to ctrl cmd also holds rtnl_lock, deadlock > may occur due to cancel_work_sync for dim work. Can you explain why? > Therefore, treating > ctrl cmd as a separate protection object of the lock is the solution and > the basis for the next patch. Let's not do that. Reasons are: 1) virtnet_send_command() may wait for cvq commands for an indefinite time 2) holding locks may complicate the future hardening work around cvq Thanks
Re: [PATCH net-next v6 4/5] virtio-net: add spin lock for ctrl cmd access
On Tue, Dec 5, 2023 at 7:06 PM Heng Qi wrote: > > > > 在 2023/12/5 下午4:35, Jason Wang 写道: > > On Tue, Dec 5, 2023 at 4:02 PM Heng Qi wrote: > >> Currently access to ctrl cmd is globally protected via rtnl_lock and works > >> fine. But if dim work's access to ctrl cmd also holds rtnl_lock, deadlock > >> may occur due to cancel_work_sync for dim work. > > Can you explain why? > > For example, during the bus unbind operation, the following call stack > occurs: > virtnet_remove -> unregister_netdev -> rtnl_lock[1] -> virtnet_close -> > cancel_work_sync -> virtnet_rx_dim_work -> rtnl_lock[2] (deadlock occurs). Can we use rtnl_trylock() and reschedule the work? > > >> Therefore, treating > >> ctrl cmd as a separate protection object of the lock is the solution and > >> the basis for the next patch. > > Let's don't do that. Reasons are: > > > > 1) virtnet_send_command() may wait for cvq commands for an indefinite time > > Yes, I took that into consideration. But ndo_set_rx_mode's need for an > atomic > environment rules out the mutex lock. It is a "bug" that we need to fix. > > > 2) hold locks may complicate the future hardening works around cvq > > Agree, but I don't seem to have thought of a better way besides passing > the lock. > Do you have any other better ideas or suggestions? So far no, you can refer to the past discussions, it might require the collaboration from the uAPI and stack. Thanks > > Thanks! > > > > > Thanks >
Re: [PATCH net-next v6 4/5] virtio-net: add spin lock for ctrl cmd access
On Wed, Dec 6, 2023 at 9:03 PM Heng Qi wrote: > > > > 在 2023/12/6 下午8:27, Paolo Abeni 写道: > > On Tue, 2023-12-05 at 19:05 +0800, Heng Qi wrote: > >> 在 2023/12/5 下午4:35, Jason Wang 写道: > >>> On Tue, Dec 5, 2023 at 4:02 PM Heng Qi wrote: > >>>> Currently access to ctrl cmd is globally protected via rtnl_lock and > >>>> works > >>>> fine. But if dim work's access to ctrl cmd also holds rtnl_lock, deadlock > >>>> may occur due to cancel_work_sync for dim work. > >>> Can you explain why? > >> For example, during the bus unbind operation, the following call stack > >> occurs: > >> virtnet_remove -> unregister_netdev -> rtnl_lock[1] -> virtnet_close -> > >> cancel_work_sync -> virtnet_rx_dim_work -> rtnl_lock[2] (deadlock occurs). > >> > >>>> Therefore, treating > >>>> ctrl cmd as a separate protection object of the lock is the solution and > >>>> the basis for the next patch. > >>> Let's don't do that. Reasons are: > >>> > >>> 1) virtnet_send_command() may wait for cvq commands for an indefinite time > >> Yes, I took that into consideration. But ndo_set_rx_mode's need for an > >> atomic > >> environment rules out the mutex lock. > >> > >>> 2) hold locks may complicate the future hardening works around cvq > >> Agree, but I don't seem to have thought of a better way besides passing > >> the lock. > >> Do you have any other better ideas or suggestions? > > What about: > > > > - using the rtnl lock only > > - virtionet_close() invokes cancel_work(), without flushing the work > > - virtnet_remove() calls flush_work() after unregister_netdev(), > > outside the rtnl lock > > > > Should prevent both the deadlock and the UaF. > > > Hi, Paolo and Jason! > > Thank you very much for your effective suggestions, but I found another > solution[1], > based on the ideas of rtnl_trylock and refill_work, which works very well: > > [1] > +static void virtnet_rx_dim_work(struct work_struct *work) > +{ > +struct dim *dim = container_of(work, struct dim, work); > +struct receive_queue *rq = container_of(dim, > +struct receive_queue, dim); > +struct virtnet_info *vi = rq->vq->vdev->priv; > +struct net_device *dev = vi->dev; > +struct dim_cq_moder update_moder; > +int i, qnum, err; > + > +if (!rtnl_trylock()) > +return; Don't we need to reschedule here? like if (rq->dim_enabled) sechedule_work() ? 
Thanks > + > +for (i = 0; i < vi->curr_queue_pairs; i++) { > +rq = &vi->rq[i]; > +dim = &rq->dim; > +qnum = rq - vi->rq; > + > +if (!rq->dim_enabled) > +continue; > + > +update_moder = net_dim_get_rx_moderation(dim->mode, > dim->profile_ix); > +if (update_moder.usec != rq->intr_coal.max_usecs || > +update_moder.pkts != rq->intr_coal.max_packets) { > +err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum, > + update_moder.usec, > + update_moder.pkts); > +if (err) > +pr_debug("%s: Failed to send dim parameters on rxq%d\n", > + dev->name, qnum); > +dim->state = DIM_START_MEASURE; > +} > +} > + > +rtnl_unlock(); > +} > > > In addition, other optimizations[2] have been tried, but it may be due > to the sparsely > scheduled work that the retry condition is always satisfied, affecting > performance, > so [1] is the final solution: > > [2] > > +static void virtnet_rx_dim_work(struct work_struct *work) > +{ > +struct dim *dim = container_of(work, struct dim, work); > +struct receive_queue *rq = container_of(dim, > +struct receive_queue, dim); > +struct virtnet_info *vi = rq->vq->vdev->priv; > +struct net_device *dev = vi->dev; > +struct dim_cq_moder update_moder; > +int i, qnum, err, count; > + > +if (!rtnl_trylock()) > +return; > +retry: > +count = vi->curr_queue_pairs; > +for (i = 0; i < vi->curr_queue_pairs; i++) { > +rq = &vi->rq[i]; > +dim = &rq->dim; > +qnum = rq - vi->rq; > +update_moder = net_dim_get_rx_moderation(dim->mode, > dim->profile_ix); > +if (update_moder.usec != rq->intr_coal.max_usecs || > +u
Re: [PATCH net-next v6 4/5] virtio-net: add spin lock for ctrl cmd access
On Thu, Dec 7, 2023 at 12:47 PM Heng Qi wrote: > > > > 在 2023/12/7 下午12:19, Jason Wang 写道: > > On Wed, Dec 6, 2023 at 9:03 PM Heng Qi wrote: > >> > >> > >> 在 2023/12/6 下午8:27, Paolo Abeni 写道: > >>> On Tue, 2023-12-05 at 19:05 +0800, Heng Qi wrote: > >>>> 在 2023/12/5 下午4:35, Jason Wang 写道: > >>>>> On Tue, Dec 5, 2023 at 4:02 PM Heng Qi wrote: > >>>>>> Currently access to ctrl cmd is globally protected via rtnl_lock and > >>>>>> works > >>>>>> fine. But if dim work's access to ctrl cmd also holds rtnl_lock, > >>>>>> deadlock > >>>>>> may occur due to cancel_work_sync for dim work. > >>>>> Can you explain why? > >>>> For example, during the bus unbind operation, the following call stack > >>>> occurs: > >>>> virtnet_remove -> unregister_netdev -> rtnl_lock[1] -> virtnet_close -> > >>>> cancel_work_sync -> virtnet_rx_dim_work -> rtnl_lock[2] (deadlock > >>>> occurs). > >>>> > >>>>>> Therefore, treating > >>>>>> ctrl cmd as a separate protection object of the lock is the solution > >>>>>> and > >>>>>> the basis for the next patch. > >>>>> Let's don't do that. Reasons are: > >>>>> > >>>>> 1) virtnet_send_command() may wait for cvq commands for an indefinite > >>>>> time > >>>> Yes, I took that into consideration. But ndo_set_rx_mode's need for an > >>>> atomic > >>>> environment rules out the mutex lock. > >>>> > >>>>> 2) hold locks may complicate the future hardening works around cvq > >>>> Agree, but I don't seem to have thought of a better way besides passing > >>>> the lock. > >>>> Do you have any other better ideas or suggestions? > >>> What about: > >>> > >>> - using the rtnl lock only > >>> - virtionet_close() invokes cancel_work(), without flushing the work > >>> - virtnet_remove() calls flush_work() after unregister_netdev(), > >>> outside the rtnl lock > >>> > >>> Should prevent both the deadlock and the UaF. > >> > >> Hi, Paolo and Jason! > >> > >> Thank you very much for your effective suggestions, but I found another > >> solution[1], > >> based on the ideas of rtnl_trylock and refill_work, which works very well: > >> > >> [1] > >> +static void virtnet_rx_dim_work(struct work_struct *work) > >> +{ > >> +struct dim *dim = container_of(work, struct dim, work); > >> +struct receive_queue *rq = container_of(dim, > >> +struct receive_queue, dim); > >> +struct virtnet_info *vi = rq->vq->vdev->priv; > >> +struct net_device *dev = vi->dev; > >> +struct dim_cq_moder update_moder; > >> +int i, qnum, err; > >> + > >> +if (!rtnl_trylock()) > >> +return; > > Don't we need to reschedule here? > > > > like > > > > if (rq->dim_enabled) > > sechedule_work() > > > > ? > > I think no, we don't need this. > > The work of each queue will be called by "net_dim()->schedule_work()" > when napi traffic changes (before schedule_work(), the dim->profile_ix > of the corresponding rxq has been updated). > So we only need to traverse and update the profiles of all rxqs in the > work which is obtaining the rtnl_lock. Ok, let's have a comment here then. Thanks > > Thanks! > > > > > Thanks > > > >> + > >> +for (i = 0; i < vi->curr_queue_pairs; i++) { > >> +rq = &vi->rq[i]; > >> +dim = &rq->dim; > >> +qnum = rq - vi->rq; > >> + > >> +if (!rq->dim_enabled) > >> +continue; > >> + > >> +update_moder = net_dim_get_rx_moderation(dim->mode, > >> dim->profile_ix); > >> +if (update_moder.usec != rq->intr_coal.max_usecs || > >> +update_moder.pkts != rq->intr_coal.max_packets) { > >> +err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum, > >> + update_moder.usec, > >> + update_moder.pkts); > >> +if (
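Folding Heng Qi's explanation into the comment Jason asks for, the bail-out could read something like this (a sketch, not the posted follow-up):

    if (!rtnl_trylock()) {
            /* No need to reschedule: when NAPI traffic changes again,
             * net_dim() will update dim->profile_ix and call
             * schedule_work() for us, and the loop below refreshes
             * every rxq while rtnl is held.
             */
            return;
    }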
Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic
On Thu, Dec 21, 2023 at 11:06 PM Willem de Bruijn wrote: > > Heng Qi wrote: > > > > > > 在 2023/12/20 下午10:45, Willem de Bruijn 写道: > > > Heng Qi wrote: > > >> virtio-net has two ways to switch napi_tx: one is through the > > >> module parameter, and the other is through coalescing parameter > > >> settings (provided that the nic status is down). > > >> > > >> Sometimes we face performance regression caused by napi_tx, > > >> then we need to switch napi_tx when debugging. However, the > > >> existing methods are a bit troublesome, such as needing to > > >> reload the driver or turn off the network card. Why is this troublesome? We don't need to turn off the card, it's just a toggling of the interface. This ends up with pretty simple code. > So try to make > > >> this update. > > >> > > >> Signed-off-by: Heng Qi > > >> Reviewed-by: Xuan Zhuo > > > The commit does not explain why it is safe to do so. > > > > virtnet_napi_tx_disable ensures that already scheduled tx napi ends and > > no new tx napi will be scheduled. > > > > Afterwards, if the __netif_tx_lock_bh lock is held, the stack cannot > > send the packet. > > > > Then we can safely toggle the weight to indicate where to clear the buffers. > > > > > > > > The tx-napi weights are not really weights: it is a boolean whether > > > napi is used for transmit cleaning, or whether packets are cleaned > > > in ndo_start_xmit. > > > > Right. > > > > > > > > There certainly are some subtle issues with regard to pausing/waking > > > queues when switching between modes. > > > > What are "subtle issues" and if there are any, we find them. > > A single runtime test is not sufficient to exercise all edge cases. > > Please don't leave it to reviewers to establish the correctness of a > patch. +1 And instead of trying to do this, it would be much better to optimize the NAPI performance. Then we can drop the orphan mode. > > The napi_tx and non-napi code paths differ in how they handle at least > the following structures: > > 1. skb: non-napi orphans these in ndo_start_xmit. Without napi this is > needed as delay until the next ndo_start_xmit and thus completion is > unbounded. > > When switching to napi mode, orphaned skbs may now be cleaned by the > napi handler. This is indeed safe. > > When switching from napi to non-napi, the unbound latency resurfaces. > It is a small edge case, and I think a potentially acceptable risk, if > the user of this knob is aware of the risk. > > 2. virtqueue callback ("interrupt" masking). The non-napi path enables > the interrupt (disables the mask) when available descriptors falls > beneath a low watermark, and reenables when it recovers above a high > watermark. Napi disables when napi is scheduled, and reenables on > napi complete. > > 3. dev_queue->state (QUEUE_STATE_DRV_XOFF). if the ring falls below > a low watermark, the driver stops the stack for queuing more packets. > In napi mode, it schedules napi to clean packets. See the calls to > netif_xmit_stopped, netif_stop_subqueue, netif_start_subqueue and > netif_tx_wake_queue. > > Some if this can be assumed safe by looking at existing analogous > code, such as the queue stop/start in virtnet_tx_resize. > > But that all virtqueue callback and dev_queue->state transitions are > correct when switching between modes at runtime is not trivial to > establish, deserves some thought and explanation in the commit > message. Thanks
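For point 1, the existing orphan branch in virtio-net's start_xmit looks roughly like the following; this is the bound on socket memory that NAPI mode removes by cleaning skbs from the tx interrupt instead:

    /* Don't wait up for transmitted skbs to be freed: without tx NAPI,
     * the next chance to free them is the next ndo_start_xmit call,
     * which may never come.
     */
    if (!use_napi) {
            skb_orphan(skb);
            nf_reset_ct(skb);
    }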
Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic
On Fri, Dec 22, 2023 at 4:14 PM Michael S. Tsirkin wrote: > > On Fri, Dec 22, 2023 at 10:35:07AM +0800, Jason Wang wrote: > > On Thu, Dec 21, 2023 at 11:06 PM Willem de Bruijn > > wrote: > > > > > > Heng Qi wrote: > > > > > > > > > > > > 在 2023/12/20 下午10:45, Willem de Bruijn 写道: > > > > > Heng Qi wrote: > > > > >> virtio-net has two ways to switch napi_tx: one is through the > > > > >> module parameter, and the other is through coalescing parameter > > > > >> settings (provided that the nic status is down). > > > > >> > > > > >> Sometimes we face performance regression caused by napi_tx, > > > > >> then we need to switch napi_tx when debugging. However, the > > > > >> existing methods are a bit troublesome, such as needing to > > > > >> reload the driver or turn off the network card. > > > > Why is this troublesome? We don't need to turn off the card, it's just > > a toggling of the interface. > > > > This ends up with pretty simple code. > > > > > So try to make > > > > >> this update. > > > > >> > > > > >> Signed-off-by: Heng Qi > > > > >> Reviewed-by: Xuan Zhuo > > > > > The commit does not explain why it is safe to do so. > > > > > > > > virtnet_napi_tx_disable ensures that already scheduled tx napi ends and > > > > no new tx napi will be scheduled. > > > > > > > > Afterwards, if the __netif_tx_lock_bh lock is held, the stack cannot > > > > send the packet. > > > > > > > > Then we can safely toggle the weight to indicate where to clear the > > > > buffers. > > > > > > > > > > > > > > The tx-napi weights are not really weights: it is a boolean whether > > > > > napi is used for transmit cleaning, or whether packets are cleaned > > > > > in ndo_start_xmit. > > > > > > > > Right. > > > > > > > > > > > > > > There certainly are some subtle issues with regard to pausing/waking > > > > > queues when switching between modes. > > > > > > > > What are "subtle issues" and if there are any, we find them. > > > > > > A single runtime test is not sufficient to exercise all edge cases. > > > > > > Please don't leave it to reviewers to establish the correctness of a > > > patch. > > > > +1 > > > > And instead of trying to do this, it would be much better to optimize > > the NAPI performance. Then we can drop the orphan mode. > > "To address your problem, optimize our code to the level which we > couldn't achieve in more than 10 years". Last time QE didn't report any issue for TCP. For others, the code might just need some optimization if it really matters, it's just because nobody has worked on this part in the past years. The ethtool trick is just for debugging purposes, I can hardly believe it is used by any management layer software. > That's not a reasonable > requirement. Not getting an interrupt will always be better for some > workloads. So NAPI has been enabled by default for many years. And most of the NIC doesn't do orphans. Orphan has known issues like pktgen and others. Keeping two modes may result in tricky code and complicate the features like BQL [1]. We would have interrupt coalescing and dim soon. Admin can enable the heuristic or tweak it according to the workload which is much more user friendly than orphaning. I don't see a strong point to keep the orphan mode any more considering the advantages of NAPI. Thanks [1] https://lore.kernel.org/lkml/20181205225323.12555-2-...@redhat.com/ > > > > > > > > The napi_tx and non-napi code paths differ in how they handle at least > > > the following structures: > > > > > > 1. skb: non-napi orphans these in ndo_start_xmit. 
Without napi this is > > > needed as delay until the next ndo_start_xmit and thus completion is > > > unbounded. > > > > > > When switching to napi mode, orphaned skbs may now be cleaned by the > > > napi handler. This is indeed safe. > > > > > > When switching from napi to non-napi, the unbound latency resurfaces. > > > It is a small edge case, and I think a potentially acceptable risk, if > > > the user of this knob is aware of the risk. > > > > > > 2. virtqueue callback ("interr
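The BQL hooks referenced in [1], sketched generically; the wrinkle for a runtime napi/orphan switch is that every netdev_tx_sent_queue() must eventually be paired with a completion report from whichever path ends up freeing the skb:

    /* In ndo_start_xmit, after the skb is queued to the vring: */
    netdev_tx_sent_queue(netdev_get_tx_queue(dev, qnum), skb->len);

    /* In the completion path, after reclaiming buffers: */
    netdev_tx_completed_queue(netdev_get_tx_queue(dev, qnum),
                              packets, bytes);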
Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic
On Mon, Dec 25, 2023 at 10:25 AM Jason Xing wrote: > > Hello Jason, > On Fri, Dec 22, 2023 at 10:36 AM Jason Wang wrote: > > > > On Thu, Dec 21, 2023 at 11:06 PM Willem de Bruijn > > wrote: > > > > > > Heng Qi wrote: > > > > > > > > > > > > 在 2023/12/20 下午10:45, Willem de Bruijn 写道: > > > > > Heng Qi wrote: > > > > >> virtio-net has two ways to switch napi_tx: one is through the > > > > >> module parameter, and the other is through coalescing parameter > > > > >> settings (provided that the nic status is down). > > > > >> > > > > >> Sometimes we face performance regression caused by napi_tx, > > > > >> then we need to switch napi_tx when debugging. However, the > > > > >> existing methods are a bit troublesome, such as needing to > > > > >> reload the driver or turn off the network card. > > > > Why is this troublesome? We don't need to turn off the card, it's just > > a toggling of the interface. > > > > This ends up with pretty simple code. > > > > > So try to make > > > > >> this update. > > > > >> > > > > >> Signed-off-by: Heng Qi > > > > >> Reviewed-by: Xuan Zhuo > > > > > The commit does not explain why it is safe to do so. > > > > > > > > virtnet_napi_tx_disable ensures that already scheduled tx napi ends and > > > > no new tx napi will be scheduled. > > > > > > > > Afterwards, if the __netif_tx_lock_bh lock is held, the stack cannot > > > > send the packet. > > > > > > > > Then we can safely toggle the weight to indicate where to clear the > > > > buffers. > > > > > > > > > > > > > > The tx-napi weights are not really weights: it is a boolean whether > > > > > napi is used for transmit cleaning, or whether packets are cleaned > > > > > in ndo_start_xmit. > > > > > > > > Right. > > > > > > > > > > > > > > There certainly are some subtle issues with regard to pausing/waking > > > > > queues when switching between modes. > > > > > > > > What are "subtle issues" and if there are any, we find them. > > > > > > A single runtime test is not sufficient to exercise all edge cases. > > > > > > Please don't leave it to reviewers to establish the correctness of a > > > patch. > > > > +1 > > > [...] > > And instead of trying to do this, it would be much better to optimize > > the NAPI performance. Then we can drop the orphan mode. > > Do you mean when to call skb_orphan()? If yes, I just want to provide > more information that we also have some performance issues where the > flow control takes a bad effect, especially under some small > throughput in our production environment. I think you need to describe it in detail. > What strikes me as odd is if I restart the network, then the issue > will go with the wind. I cannot reproduce it in my testing machine. > One more thing: if I force skb_orphan() the current skb in every > start_xmit(), it could also solve the issue but not in a proper way. > After all, it drops the flow control... :S Yes, that's the known issue. Thanks > > Thanks, > Jason > > > > > > > > The napi_tx and non-napi code paths differ in how they handle at least > > > the following structures: > > > > > > 1. skb: non-napi orphans these in ndo_start_xmit. Without napi this is > > > needed as delay until the next ndo_start_xmit and thus completion is > > > unbounded. > > > > > > When switching to napi mode, orphaned skbs may now be cleaned by the > > > napi handler. This is indeed safe. > > > > > > When switching from napi to non-napi, the unbound latency resurfaces. 
> > > It is a small edge case, and I think a potentially acceptable risk, if > > > the user of this knob is aware of the risk. > > > > > > 2. virtqueue callback ("interrupt" masking). The non-napi path enables > > > the interrupt (disables the mask) when available descriptors falls > > > beneath a low watermark, and reenables when it recovers above a high > > > watermark. Napi disables when napi is scheduled, and reenables on > > > napi complete. > > > > > > 3. dev_queue->state (QUEUE_STATE_DRV_XOFF). if the ring falls below > > > a low watermark, the driver stops the stack for queuing more packets. > > > In napi mode, it schedules napi to clean packets. See the calls to > > > netif_xmit_stopped, netif_stop_subqueue, netif_start_subqueue and > > > netif_tx_wake_queue. > > > > > > Some if this can be assumed safe by looking at existing analogous > > > code, such as the queue stop/start in virtnet_tx_resize. > > > > > > But that all virtqueue callback and dev_queue->state transitions are > > > correct when switching between modes at runtime is not trivial to > > > establish, deserves some thought and explanation in the commit > > > message. > > > > Thanks > > > > >
Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic
On Mon, Dec 25, 2023 at 2:34 PM Jason Xing wrote: > > On Mon, Dec 25, 2023 at 12:14 PM Jason Wang wrote: > > > > On Mon, Dec 25, 2023 at 10:25 AM Jason Xing > > wrote: > > > > > > Hello Jason, > > > On Fri, Dec 22, 2023 at 10:36 AM Jason Wang wrote: > > > > > > > > On Thu, Dec 21, 2023 at 11:06 PM Willem de Bruijn > > > > wrote: > > > > > > > > > > Heng Qi wrote: > > > > > > > > > > > > > > > > > > 在 2023/12/20 下午10:45, Willem de Bruijn 写道: > > > > > > > Heng Qi wrote: > > > > > > >> virtio-net has two ways to switch napi_tx: one is through the > > > > > > >> module parameter, and the other is through coalescing parameter > > > > > > >> settings (provided that the nic status is down). > > > > > > >> > > > > > > >> Sometimes we face performance regression caused by napi_tx, > > > > > > >> then we need to switch napi_tx when debugging. However, the > > > > > > >> existing methods are a bit troublesome, such as needing to > > > > > > >> reload the driver or turn off the network card. > > > > > > > > Why is this troublesome? We don't need to turn off the card, it's just > > > > a toggling of the interface. > > > > > > > > This ends up with pretty simple code. > > > > > > > > > So try to make > > > > > > >> this update. > > > > > > >> > > > > > > >> Signed-off-by: Heng Qi > > > > > > >> Reviewed-by: Xuan Zhuo > > > > > > > The commit does not explain why it is safe to do so. > > > > > > > > > > > > virtnet_napi_tx_disable ensures that already scheduled tx napi ends > > > > > > and > > > > > > no new tx napi will be scheduled. > > > > > > > > > > > > Afterwards, if the __netif_tx_lock_bh lock is held, the stack cannot > > > > > > send the packet. > > > > > > > > > > > > Then we can safely toggle the weight to indicate where to clear the > > > > > > buffers. > > > > > > > > > > > > > > > > > > > > The tx-napi weights are not really weights: it is a boolean > > > > > > > whether > > > > > > > napi is used for transmit cleaning, or whether packets are cleaned > > > > > > > in ndo_start_xmit. > > > > > > > > > > > > Right. > > > > > > > > > > > > > > > > > > > > There certainly are some subtle issues with regard to > > > > > > > pausing/waking > > > > > > > queues when switching between modes. > > > > > > > > > > > > What are "subtle issues" and if there are any, we find them. > > > > > > > > > > A single runtime test is not sufficient to exercise all edge cases. > > > > > > > > > > Please don't leave it to reviewers to establish the correctness of a > > > > > patch. > > > > > > > > +1 > > > > > > > [...] > > > > And instead of trying to do this, it would be much better to optimize > > > > the NAPI performance. Then we can drop the orphan mode. > > > > [...] > > > Do you mean when to call skb_orphan()? If yes, I just want to provide > > > more information that we also have some performance issues where the > > > flow control takes a bad effect, especially under some small > > > throughput in our production environment. > > > > I think you need to describe it in detail. > > Some of the details were described below in the last email. The > decreased performance happened because of flow control: the delay of > skb free means the delay What do you mean by delay here? Is it an interrupt delay? If yes, Does it work better if you simply remove virtqueue_enable_cb_delayed() with virtqueue_enable_cb()? As the former may delay the interrupt more or less depend on the traffic. 
> of decreasing sk_wmem_alloc; it will then > hit the limit of the TSQ mechanism, finally causing slow transmission in > the TCP layer. TSQ might work better with BQL, which virtio-net doesn't have right now. > > > > >
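The TSQ interaction described above, heavily simplified from tcp_small_queue_check() in net/ipv4/tcp_output.c (the real limit computation has more terms and clamps):

    /* sk_wmem_alloc only drops when the driver frees or orphans the
     * skb; delayed tx completions keep it high and throttle the flow.
     */
    limit = max(2 * skb->truesize,
                sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift));
    if (refcount_read(&sk->sk_wmem_alloc) > limit)
            return true; /* defer until completions release some skbs */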
Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic
On Mon, Dec 25, 2023 at 2:44 PM Jason Wang wrote: > > On Mon, Dec 25, 2023 at 2:34 PM Jason Xing wrote: > > > > On Mon, Dec 25, 2023 at 12:14 PM Jason Wang wrote: > > > > > > On Mon, Dec 25, 2023 at 10:25 AM Jason Xing > > > wrote: > > > > > > > > Hello Jason, > > > > On Fri, Dec 22, 2023 at 10:36 AM Jason Wang wrote: > > > > > > > > > > On Thu, Dec 21, 2023 at 11:06 PM Willem de Bruijn > > > > > wrote: > > > > > > > > > > > > Heng Qi wrote: > > > > > > > > > > > > > > > > > > > > > 在 2023/12/20 下午10:45, Willem de Bruijn 写道: > > > > > > > > Heng Qi wrote: > > > > > > > >> virtio-net has two ways to switch napi_tx: one is through the > > > > > > > >> module parameter, and the other is through coalescing parameter > > > > > > > >> settings (provided that the nic status is down). > > > > > > > >> > > > > > > > >> Sometimes we face performance regression caused by napi_tx, > > > > > > > >> then we need to switch napi_tx when debugging. However, the > > > > > > > >> existing methods are a bit troublesome, such as needing to > > > > > > > >> reload the driver or turn off the network card. > > > > > > > > > > Why is this troublesome? We don't need to turn off the card, it's just > > > > > a toggling of the interface. > > > > > > > > > > This ends up with pretty simple code. > > > > > > > > > > > So try to make > > > > > > > >> this update. > > > > > > > >> > > > > > > > >> Signed-off-by: Heng Qi > > > > > > > >> Reviewed-by: Xuan Zhuo > > > > > > > > The commit does not explain why it is safe to do so. > > > > > > > > > > > > > > virtnet_napi_tx_disable ensures that already scheduled tx napi > > > > > > > ends and > > > > > > > no new tx napi will be scheduled. > > > > > > > > > > > > > > Afterwards, if the __netif_tx_lock_bh lock is held, the stack > > > > > > > cannot > > > > > > > send the packet. > > > > > > > > > > > > > > Then we can safely toggle the weight to indicate where to clear > > > > > > > the buffers. > > > > > > > > > > > > > > > > > > > > > > > The tx-napi weights are not really weights: it is a boolean > > > > > > > > whether > > > > > > > > napi is used for transmit cleaning, or whether packets are > > > > > > > > cleaned > > > > > > > > in ndo_start_xmit. > > > > > > > > > > > > > > Right. > > > > > > > > > > > > > > > > > > > > > > > There certainly are some subtle issues with regard to > > > > > > > > pausing/waking > > > > > > > > queues when switching between modes. > > > > > > > > > > > > > > What are "subtle issues" and if there are any, we find them. > > > > > > > > > > > > A single runtime test is not sufficient to exercise all edge cases. > > > > > > > > > > > > Please don't leave it to reviewers to establish the correctness of a > > > > > > patch. > > > > > > > > > > +1 > > > > > > > > > [...] > > > > > And instead of trying to do this, it would be much better to optimize > > > > > the NAPI performance. Then we can drop the orphan mode. > > > > > > [...] > > > > Do you mean when to call skb_orphan()? If yes, I just want to provide > > > > more information that we also have some performance issues where the > > > > flow control takes a bad effect, especially under some small > > > > throughput in our production environment. > > > > > > I think you need to describe it in detail. > > > > Some of the details were described below in the last email. The > > decreased performance happened
Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic
On Mon, Dec 25, 2023 at 4:03 PM Michael S. Tsirkin wrote: > > On Mon, Dec 25, 2023 at 12:12:48PM +0800, Jason Wang wrote: > > On Fri, Dec 22, 2023 at 4:14 PM Michael S. Tsirkin wrote: > > > > > > On Fri, Dec 22, 2023 at 10:35:07AM +0800, Jason Wang wrote: > > > > On Thu, Dec 21, 2023 at 11:06 PM Willem de Bruijn > > > > wrote: > > > > > > > > > > Heng Qi wrote: > > > > > > > > > > > > > > > > > > 在 2023/12/20 下午10:45, Willem de Bruijn 写道: > > > > > > > Heng Qi wrote: > > > > > > >> virtio-net has two ways to switch napi_tx: one is through the > > > > > > >> module parameter, and the other is through coalescing parameter > > > > > > >> settings (provided that the nic status is down). > > > > > > >> > > > > > > >> Sometimes we face performance regression caused by napi_tx, > > > > > > >> then we need to switch napi_tx when debugging. However, the > > > > > > >> existing methods are a bit troublesome, such as needing to > > > > > > >> reload the driver or turn off the network card. > > > > > > > > Why is this troublesome? We don't need to turn off the card, it's just > > > > a toggling of the interface. > > > > > > > > This ends up with pretty simple code. > > > > > > > > > So try to make > > > > > > >> this update. > > > > > > >> > > > > > > >> Signed-off-by: Heng Qi > > > > > > >> Reviewed-by: Xuan Zhuo > > > > > > > The commit does not explain why it is safe to do so. > > > > > > > > > > > > virtnet_napi_tx_disable ensures that already scheduled tx napi ends > > > > > > and > > > > > > no new tx napi will be scheduled. > > > > > > > > > > > > Afterwards, if the __netif_tx_lock_bh lock is held, the stack cannot > > > > > > send the packet. > > > > > > > > > > > > Then we can safely toggle the weight to indicate where to clear the > > > > > > buffers. > > > > > > > > > > > > > > > > > > > > The tx-napi weights are not really weights: it is a boolean > > > > > > > whether > > > > > > > napi is used for transmit cleaning, or whether packets are cleaned > > > > > > > in ndo_start_xmit. > > > > > > > > > > > > Right. > > > > > > > > > > > > > > > > > > > > There certainly are some subtle issues with regard to > > > > > > > pausing/waking > > > > > > > queues when switching between modes. > > > > > > > > > > > > What are "subtle issues" and if there are any, we find them. > > > > > > > > > > A single runtime test is not sufficient to exercise all edge cases. > > > > > > > > > > Please don't leave it to reviewers to establish the correctness of a > > > > > patch. > > > > > > > > +1 > > > > > > > > And instead of trying to do this, it would be much better to optimize > > > > the NAPI performance. Then we can drop the orphan mode. > > > > > > "To address your problem, optimize our code to the level which we > > > couldn't achieve in more than 10 years". > > > > Last time QE didn't report any issue for TCP. For others, the code > > might just need some optimization if it really matters, it's just > > because nobody has worked on this part in the past years. > > You think nobody worked on performance of virtio net because nobody > could bother? No, I just describe what I've seen from the list. No patches were posted for performance optimization in recent years. > I think it's just micro optimized to a level where > progress is difficult. > > Maybe. Let me clarify what I meant. If people can help to optimize the NAPI to be as fast as orphans or something like 80%-90% of orphans. It should be sufficient to remove the orphan path completely. This allows a lot of good features to be built easily on top. 
To me, it seems fine to remove it even now: orphan mode breaks socket accounting, which is actually a bug, and NAPI has been enabled for years. But if you don't wish to do that, that's fine. We can wait for coalescing and dim to be used for a while and revisit this. Thanks
Re: [PATCH v2] virtio_net: fix missing dma unmap for resize
On Thu, Jan 4, 2024 at 6:18 AM Michael S. Tsirkin wrote: > > On Wed, Jan 03, 2024 at 01:58:03PM -0800, Jakub Kicinski wrote: > > On Tue, 26 Dec 2023 17:43:33 +0800 Xuan Zhuo wrote: > > > For rq, we have three cases of getting buffers back from the virtio core: > > > > > > 1. virtqueue_get_buf{,_ctx} > > > 2. virtqueue_detach_unused_buf > > > 3. callback for virtqueue_resize > > > > > > But in commit 295525e29a5b ("virtio_net: merge dma operations when > > > filling mergeable buffers"), I missed the dma unmap for the #3 case. > > > > > > That will leak some memory, because I did not release the pages referred > > > to by the unused buffers. > > > > > > Running a script like the following will make the system OOM: > > > > > > while true > > > do > > > ethtool -G ens4 rx 128 > > > ethtool -G ens4 rx 256 > > > free -m > > > done > > > > > > Fixes: 295525e29a5b ("virtio_net: merge dma operations when filling > > > mergeable buffers") > > > Signed-off-by: Xuan Zhuo > > > > Michael, Jason, looks good? Worth pushing it to v6.7? > > I'd say yes. > > Acked-by: Michael S. Tsirkin Acked-by: Jason Wang Thanks >
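For context, the shape of the fix: the callback handed to virtqueue_resize() (case #3) must unmap premapped buffers before freeing them, mirroring what cases #1 and #2 already do. A sketch along those lines, with helper names approximated from the thread:

    static void virtnet_rq_unmap_free_buf(struct virtqueue *vq, void *buf)
    {
            struct virtnet_info *vi = vq->vdev->priv;
            struct receive_queue *rq = &vi->rq[vq2rxq(vq)];

            /* Unmap first, so the pages referred to by the unused
             * buffers are actually released instead of leaking.
             */
            if (rq->do_dma)
                    virtnet_rq_unmap(rq, buf, 0);

            virtnet_rq_free_buf(vi, rq, buf);
    }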
Re: [PATCH net-next v3 00/27] virtio-net: support AF_XDP zero copy
On Fri, Dec 29, 2023 at 3:31 PM Xuan Zhuo wrote: > > ## AF_XDP > > XDP socket(AF_XDP) is an excellent bypass kernel network framework. The zero > copy feature of xsk (XDP socket) needs to be supported by the driver. The > performance of zero copy is very good. mlx5 and intel ixgbe already support > this feature, This patch set allows virtio-net to support xsk's zerocopy xmit > feature. > > At present, we have completed some preparation: > > 1. vq-reset (virtio spec and kernel code) > 2. virtio-core premapped dma > 3. virtio-net xdp refactor > > So it is time for Virtio-Net to complete the support for the XDP Socket > Zerocopy. > > Virtio-net can not increase the queue num at will, so xsk shares the queue > with > kernel. > > On the other hand, Virtio-Net does not support generate interrupt from driver > manually, so when we wakeup tx xmit, we used some tips. If the CPU run by TX > NAPI last time is other CPUs, use IPI to wake up NAPI on the remote CPU. If it > is also the local CPU, then we wake up napi directly. > > This patch set includes some refactor to the virtio-net to let that to support > AF_XDP. > > ## performance > > ENV: Qemu with vhost-user(polling mode). > Host CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz > > ### virtio PMD in guest with testpmd > > testpmd> show port stats all > > NIC statistics for port 0 > RX-packets: 19531092064 RX-missed: 0 RX-bytes: 1093741155584 > RX-errors: 0 > RX-nombuf: 0 > TX-packets: 595992 TX-errors: 0 TX-bytes: 371030645664 > > > Throughput (since last show) > Rx-pps: 8861574 Rx-bps: 3969985208 > Tx-pps: 8861493 Tx-bps: 3969962736 > > > ### AF_XDP PMD in guest with testpmd > > testpmd> show port stats all > > NIC statistics for port 0 > RX-packets: 68152727 RX-missed: 0 RX-bytes: 3816552712 > RX-errors: 0 > RX-nombuf: 0 > TX-packets: 68114967 TX-errors: 33216 TX-bytes: 3814438152 > > Throughput (since last show) > Rx-pps: 6333196 Rx-bps: 2837272088 > Tx-pps: 6333227 Tx-bps: 2837285936 > > > But AF_XDP consumes more CPU for tx and rx napi(100% and 86%). > > ## maintain > > I am currently a reviewer for virtio-net. I commit to maintain AF_XDP support > in > virtio-net. > > Please review. > > Thanks. > > v3 > 1. virtio introduces helpers for virtio-net sq using premapped dma > 2. xsk has more complete support for merge mode > 3. fix some problems > > v2 > 1. wakeup uses the way of GVE. No send ipi to wakeup napi on remote cpu. > 2. remove rcu. Because we synchronize all operat, so the rcu is not > needed. > 3. split the commit "move to virtio_net.h" in last patch set. Just move > the >struct/api to header when we use them. > 4. add comments for some code > > v1: > 1. remove two virtio commits. Push this patchset to net-next > 2. squash "virtio_net: virtnet_poll_tx support rescheduled" to xsk: > support tx > 3. 
fix some warnings > > > > Xuan Zhuo (27): > virtio_net: rename free_old_xmit_skbs to free_old_xmit > virtio_net: unify the code for recycling the xmit ptr > virtio_net: independent directory > virtio_net: move core structures to virtio_net.h > virtio_net: add prefix virtnet to all struct inside virtio_net.h > virtio_ring: introduce virtqueue_get_buf_ctx_dma() > virtio_ring: virtqueue_disable_and_recycle let the callback detach > bufs > virtio_ring: introduce virtqueue_detach_unused_buf_dma() > virtio_ring: introduce virtqueue_get_dma_premapped() > virtio_net: sq support premapped mode > virtio_net: separate virtnet_rx_resize() > virtio_net: separate virtnet_tx_resize() > virtio_net: xsk: bind/unbind xsk > virtio_net: xsk: prevent disable tx napi > virtio_net: move some api to header > virtio_net: xsk: tx: support xmit xsk buffer > virtio_net: xsk: tx: support wakeup > virtio_net: xsk: tx: handle the transmitted xsk buffer > virtio_net: xsk: tx: free the unused xsk buffer > virtio_net: separate receive_mergeable > virtio_net: separate receive_buf > virtio_net: xsk: rx: support fill with xsk buffer > virtio_net: xsk: rx: support recv merge mode > virtio_net: xsk: rx: support recv small mode > virtio_net: xsk: rx: free the unused xsk buffer > virtio_net: update tx timeout record > virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY Hi Xuan: This series seems too huge to be reviewed easily. I'd suggest to split it to be multiple series (as suggested by https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#tl-dr) Thanks > > MAINTAINERS | 2 +- > drivers/net/Kconfig | 8 +- > drivers/net/Makefile
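The tx wakeup trick from the cover letter (GVE-style since v2, so no IPI to remote CPUs), sketched with hypothetical virtnet names; napi_if_scheduled_mark_missed() is the existing kernel helper:

    static int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
    {
            struct virtnet_info *vi = netdev_priv(dev);
            struct virtnet_sq *sq = &vi->sq[qid];

            /* If tx NAPI is already running, just mark it missed so it
             * re-runs; otherwise schedule it on the local CPU.
             */
            if (napi_if_scheduled_mark_missed(&sq->napi))
                    return 0;

            local_bh_disable();
            virtqueue_napi_schedule(&sq->napi, sq->vq);
            local_bh_enable();

            return 0;
    }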
Re: [PATCH net-next v3 06/27] virtio_ring: introduce virtqueue_get_buf_ctx_dma()
On Fri, Dec 29, 2023 at 3:31 PM Xuan Zhuo wrote: > > Introduce virtqueue_get_buf_ctx_dma() to collect the dma info when > getting a buf from the virtio core in premapped mode. > > If the virtio queue is in premapped mode, the virtio-net send buf may > have many descs, and every desc's dma address needs to be unmapped. So here we > introduce a new helper to collect the dma addresses of the buffer from > the virtio core. > > Because BAD_RING() may be called (which can set vq->broken), > the "const" qualifier of vq is removed. > > Signed-off-by: Xuan Zhuo > --- > drivers/virtio/virtio_ring.c | 174 +-- > include/linux/virtio.h | 16 > 2 files changed, 142 insertions(+), 48 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index 51d8f3299c10..1374b3fd447c 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -362,6 +362,45 @@ static struct device *vring_dma_dev(const struct > vring_virtqueue *vq) > return vq->dma_dev; > } > > +/* > + * use_dma_api premapped -> do_unmap > + * 1. false false false > + * 2. true false true > + * 3. true true false > + * > + * Only for #3 should we return the DMA info to the driver. Btw, I guess you meant "#3 is false" here? And could we reduce the size of these 3 x 3 matrices? It's usually a hint that the code is not optimized. Thanks
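Reading the table, the three rows collapse into a single relation, which is one way to shrink the matrix as asked:

    /* The core unmaps only what it mapped itself: */
    vq->do_unmap = vq->use_dma_api && !vq->premapped;

    /* Conversely, DMA info must be returned to the driver exactly when
     * the driver owns the mapping: use_dma_api && premapped (row 3).
     */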
Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang
On Mon, Jan 15, 2024 at 9:35 AM Zhu Yanjun wrote:
>
> From: Zhu Yanjun
>
> Some devices emulate virtio_net hardware. When the virtio_net driver
> sends commands to the emulated hardware, the hardware normally needs
> time to respond, and sometimes that time is very long. When that
> happens, the soft lockup below appears and the whole system hangs.
> Similar problems also occur with Intel and Mellanox NICs, so a similar
> solution is borrowed from them: a timeout value is added, set as large
> as possible, to give the driver the best chance of getting a response
> from the hardware.
>
> "
> [  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! [(udev-worker):3157]
> [  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry pmt_class intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus i2c_smbus i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter pfr_telemetry pfr_update fuse loop zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit wmi pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
> [  213.796194] irq event stamp: 67740
> [  213.796195] hardirqs last enabled at (67739): [] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  213.796203] hardirqs last disabled at (67740): [] sysvec_apic_timer_interrupt+0xe/0x90
> [  213.796208] softirqs last enabled at (67686): [] __irq_exit_rcu+0xbe/0xe0
> [  213.796214] softirqs last disabled at (67681): [] __irq_exit_rcu+0xbe/0xe0
> [  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not tainted 6.7.0+ #9
> [  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, BIOS SE5C741.86B.01.01.0001.2211140926 11/14/2022
> [  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
> [  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
> [  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 0246
> [  213.796233] RAX: RBX: ff2f15095896f000 RCX: 0001
> [  213.796235] RDX: RSI: ff4bbb362306f9cc RDI: ff2f15095896f000
> [  213.796236] RBP: R08: R09:
> [  213.796237] R10: 0003 R11: ff2f15095893cc40 R12: 0002
> [  213.796239] R13: 0004 R14: R15: ff2f1509534f3000
> [  213.796240] FS: 7f775847d0c0() GS:ff2f1528bac0() knlGS:
> [  213.796242] CS: 0010 DS: ES: CR0: 80050033
> [  213.796243] CR2: 557f987b6e70 CR3: 002098602006 CR4: 00f71ef0
> [  213.796245] DR0: DR1: DR2:
> [  213.796246] DR3: DR6: fffe07f0 DR7: 0400
> [  213.796247] PKRU: 5554
> [  213.796249] Call Trace:
> [  213.796250]  <IRQ>
> [  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
> [  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
> [  213.796269]  ? hrtimer_interrupt+0xf8/0x230
> [  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
> [  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
> [  213.796282]  </IRQ>
> [  213.796284]  <TASK>
> [  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
> [  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
> [  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
> [  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
> [  213.796328]  virtio_dev_probe+0x195/0x230
> [  213.796333]  really_probe+0x19f/0x400
> [  213.796338]  ? __pfx___driver_attach+0x10/0x10
> [  213.796340]  __driver_probe_device+0x78/0x160
> [  213.796343]  driver_probe_device+0x1f/0x90
> [  213.796346]  __driver_attach+0xd6/0x1d0
> [  213.796349]  bus_for_each_dev+0x8c/0xe0
> [  213.796355]  bus_add_driver+0x119/0x220
> [  213.796359]  driver_register+0x59/0x100
> [  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
> [  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
> [  213.796375]  do_one_initcall+0x6f/0x380
> [  213.796384]  do_init_module+0x60/0x240
> [  213.796388]  init_module_from_file+0x86/0xc0
> [  213.796396]  idempotent_init_module+0x129/0x2c0
> [  213.79640
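The diff itself is not visible here (the message is truncated), so for
context here is a minimal sketch of the loop-counter approach the patch
takes in virtnet_send_command(), reconstructed from the snippet Andrew
and Paolo quote later in the thread. The initial value of 200 comes from
Paolo's comment below, not from the patch itself:

    /* Sketch only: stop polling the cvq after a fixed number of
     * iterations so an unresponsive device cannot soft-lock the CPU.
     * As the thread notes, this is really a loop counter rather than
     * a timeout, and the exact patch may differ.
     */
    unsigned int timeout = 200;     /* assumed initial value */

    while (!virtqueue_get_buf(vi->cvq, &tmp) &&
           !virtqueue_is_broken(vi->cvq)) {
            if (!timeout--)
                    break;          /* give up instead of spinning forever */
            cpu_relax();
    }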
Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang
On Mon, Jan 15, 2024 at 6:22 PM Zhu Yanjun wrote:
>
> On 2024/1/15 10:20, Jason Wang wrote:
> > On Mon, Jan 15, 2024 at 9:35 AM Zhu Yanjun wrote:
> > > From: Zhu Yanjun
> > >
> > > [commit message and soft lockup log quoted in full upthread; snipped]
Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang
On Fri, Jan 19, 2024 at 10:27 PM Heng Qi wrote:
>
> On 2024/1/18 8:01 PM, Zhu Yanjun wrote:
> >
> > On 2024/1/16 20:04, Paolo Abeni wrote:
> >> On Mon, 2024-01-15 at 09:29 +0800, Zhu Yanjun wrote:
> >>> From: Zhu Yanjun
> >>>
> >>> [commit message and soft lockup log quoted in full upthread; snipped]
Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang
On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun wrote:
>
> On 2024/1/20 1:29, Andrew Lunn wrote:
> >>>>>         while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> >>>>> -              !virtqueue_is_broken(vi->cvq))
> >>>>> +              !virtqueue_is_broken(vi->cvq)) {
> >>>>> +               if (timeout)
> >>>>> +                       timeout--;
> >>>> This is not really a timeout, just a loop counter. 200 iterations could
> >>>> be a very short time on reasonable H/W. I guess this avoids the soft
> >>>> lockup, but possibly (likely?) breaks the functionality when we need to
> >>>> loop for some non-negligible time.
> >>>>
> >>>> I fear we need a more complex solution, as mentioned by Michael in the
> >>>> thread you quoted.
> >>> Got it. I also look forward to the more complex solution to this problem.
> >> Can we add a device capability (new feature bit) such as ctrq_wait_timeout
> >> to get a reasonable timeout?
> > The usual solution to this is include/linux/iopoll.h. If you can sleep,
> > read_poll_timeout(); otherwise read_poll_timeout_atomic().
>
> I read the functions read_poll_timeout() and read_poll_timeout_atomic()
> carefully. The timeout is set by the caller of these two functions.

FYI, in order to avoid a switch between the atomic and non-atomic
variants, we need to convert rx mode setting to a workqueue first:

https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html

> As such, can we add a module parameter to customize this timeout value
> by the user?

Who is the "user" here, and how would the "user" know the value?

> Or is this timeout value stored in a device register, which the
> virtio_net driver will read at initialization?

See the other thread. The design needs to be general, or you can post an RFC.

On another thought, we already have a tx watchdog; maybe we can have
something similar for the cvq and use timeout + reset in that case.

Thanks

> Zhu Yanjun
>
> > Andrew
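For reference, a hedged sketch of what the iopoll-based wait Andrew
suggests could look like. The helper name, the 100 us poll interval and
the 10 s budget are all assumptions for illustration, not values from
any posted patch; read_poll_timeout() needs sleeping context, which is
why the workqueue conversion above matters:

    #include <linux/iopoll.h>

    /* Sketch: bound the cvq busy-wait instead of looping forever.
     * Returns 0 on completion, -ETIMEDOUT if the device never answers,
     * -EIO if the vq broke while we waited.
     */
    static int virtnet_cvq_wait(struct virtnet_info *vi)
    {
            unsigned int tmp;
            void *buf;
            int err;

            err = read_poll_timeout(virtqueue_get_buf, buf,
                                    buf || virtqueue_is_broken(vi->cvq),
                                    100, 10 * USEC_PER_SEC, false,
                                    vi->cvq, &tmp);
            if (err)
                    return err;
            return virtqueue_is_broken(vi->cvq) ? -EIO : 0;
    }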
Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang
On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo wrote:
>
> On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang wrote:
> > On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun wrote:
> > >
> > > [earlier discussion of the loop counter, read_poll_timeout()/
> > > read_poll_timeout_atomic() and the module-parameter idea quoted in
> > > full upthread; snipped]
> >
> > See the other thread. The design needs to be general, or you can post an RFC.
> >
> > On another thought, we already have a tx watchdog; maybe we can have
> > something similar for the cvq and use timeout + reset in that case.
>
> But we may be blocked by the reset ^_^ if the device is broken?

I mean vq reset here.

It looks like we have multiple goals here:

1) avoid lockups; using workqueue + cond_resched() seems to be
sufficient, it has issues but nothing new

2) recover from an unresponsive device; the issue with a timeout is
that it needs to deal with false positives

Thanks
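To make goal 1) concrete, here is a rough sketch of the workqueue
direction referenced in the linked mail-archive thread. The
rx_mode_work name and field are assumptions for illustration, not the
merged patch:

    /* Sketch: defer rx-mode CVQ commands to process context so the
     * command wait may sleep (ndo_set_rx_mode itself runs under
     * netif_addr_lock_bh() and must not block).
     */
    static void virtnet_rx_mode_work(struct work_struct *work)
    {
            struct virtnet_info *vi =
                    container_of(work, struct virtnet_info, rx_mode_work);

            rtnl_lock();
            /* Issue VIRTIO_NET_CTRL_RX commands here; sleeping waits
             * and cond_resched() are fine in this context.
             */
            rtnl_unlock();
    }

    static void virtnet_set_rx_mode(struct net_device *dev)
    {
            struct virtnet_info *vi = netdev_priv(dev);

            schedule_work(&vi->rx_mode_work);
    }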
Re: [PATCH net-next v3 06/27] virtio_ring: introduce virtqueue_get_buf_ctx_dma()
On Tue, Jan 16, 2024 at 3:47 PM Xuan Zhuo wrote:
>
> On Thu, 11 Jan 2024 16:34:09 +0800, Jason Wang wrote:
> > On Fri, Dec 29, 2023 at 3:31 PM Xuan Zhuo wrote:
> > >
> > > [patch description, diffstat and the (use_dma_api, premapped,
> > > do_unmap) matrix quoted in full upthread; snipped]
> >
> > Btw, I guess you meant "#3 is false" here?
> >
> > And could we reduce the size of these 3 * 3 matrices? It's usually a
> > hint that the code is not optimized.
>
> In the process of doing dma map, we check (use_dma_api, premapped):
>
>     if premapped:
>         virtio core skips dma map
>     else:
>         if use_dma_api:
>             do dma map
>         else:
>             work with the physical address
>
> Here we check (premapped, do_unmap).
>
> do_unmap is an optimization. We just check it to know whether we
> should do dma unmap or not.
>
> Now we have introduced a new case: when the virtio core skips dma
> unmap, we may need to return the dma info to the driver. That occurs
> only when (premapped, do_unmap) is (true, false); note that
> (premapped, do_unmap) may also be (false, false).
>
> As for the matrix, I just wanted to show where do_unmap comes from. It
> is an optimization used in many places, so that we do not check
> (use_dma_api, premapped) in the unmap path. And only in case #3 should
> we return the dma info to drivers.

Ok, it tries to ease the life of the readers.

I wonder if something like bool virtqueue_needs_unmap() can help; it
can judge based on the values of use_dma_api and premapped.

Thanks
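A sketch of the predicate being suggested here; the function name is
from the mail, but the body is only my reading of the matrix, not
committed code:

    /* Derive "should the core unmap?" from the two mode flags instead
     * of caching a third do_unmap flag.
     */
    static bool virtqueue_needs_unmap(const struct vring_virtqueue *vq)
    {
            /* Only case #2 of the matrix: the core unmaps only when it
             * also did the mapping itself. With premapped (case #3) the
             * driver mapped, so the driver unmaps.
             */
            return vq->use_dma_api && !vq->premapped;
    }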
Re: [PATCH net-next 00/17] virtio-net: support AF_XDP zero copy (3/3)
On Wed, Jan 17, 2024 at 1:58 PM Xuan Zhuo wrote:
>
> On Tue, 16 Jan 2024 07:07:05 -0800, Jakub Kicinski wrote:
> > On Tue, 16 Jan 2024 13:37:30 +0100 Paolo Abeni wrote:
> > > For future submissions it would be better if you split this series into
> > > smaller chunks: the maximum size allowed is 15 patches.
> >
> > Which does not mean you can split it up and post them all at the same
> > time, FWIW.
>
> I hope someone has time to review the other parts.

Will review those this week.

Thanks

> In the future, I will post one after the last one is merged.
>
> Thanks.
Re: [PATCH net-next v3 06/27] virtio_ring: introduce virtqueue_get_buf_ctx_dma()
On Mon, Jan 22, 2024 at 2:12 PM Xuan Zhuo wrote:
>
> On Mon, 22 Jan 2024 12:18:51 +0800, Jason Wang wrote:
> > On Tue, Jan 16, 2024 at 3:47 PM Xuan Zhuo wrote:
> > >
> > > [patch description and the (use_dma_api, premapped, do_unmap)
> > > discussion quoted in full upthread; snipped]
> >
> > Ok, it tries to ease the life of the readers.
> >
> > I wonder if something like bool virtqueue_needs_unmap() can help; it
> > can judge based on the values of use_dma_api and premapped.
>
> I think it would not help too much, because do_unmap is for this:
>
> +static bool vring_need_unmap(struct vring_virtqueue *vq,
> +                             struct virtio_dma_head *dma,
> +                             dma_addr_t addr, unsigned int length)
> +{
> +       if (vq->do_unmap)
> +               return true;
>
> Before this check, we judge whether we should do unmap or not.
> After it, we judge whether we should return the dma info to the driver
> or not.
>
> If you want to simplify this function, I will say no.
>
> If you want to replace "do_unmap" with virtqueue_needs_unmap(), I will
> say ok.

That's my point.

> But I think we do not need to do that.

Just a suggestion, and you can move the comment above there.

Thanks

> +
> +       if (!vq->premapped)
> +               return false;
> +
> +       if (!dma)
> +               return false;
> +
> +       if (unlikely(dma->next >= dma->num)) {
> +               BAD_RING(vq, "premapped vq: collect dma overflow: %pad %u\n",
> +                        &addr, length);
> +               return false;
> +       }
> +
> +       dma->items[dma->next].addr = addr;
> +       dma->items[dma->next].length = length;
> +
> +       ++dma->next;
> +
> +       return false;
> +}
>
> Thanks.
Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang
On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo wrote:
>
> On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang wrote:
> > On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo wrote:
> > >
> > > [earlier discussion quoted in full upthread; snipped]
> > >
> > > But we may be blocked by the reset ^_^ if the device is broken?
> >
> > I mean vq reset here.
>
> I see.
>
> I mean that when the device is broken, the vq reset may also be blocked:
>
> void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev,
>                                u16 index)
> {
>         struct virtio_pci_modern_common_cfg __iomem *cfg;
>
>         cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
>
>         vp_iowrite16(index, &cfg->cfg.queue_select);
>         vp_iowrite16(1, &cfg->queue_reset);
>
>         while (vp_ioread16(&cfg->queue_reset))
>                 msleep(1);
>
>         while (vp_ioread16(&cfg->cfg.queue_enable))
>                 msleep(1);
> }
> EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);
>
> In this function, for a broken device, we cannot expect anything.

Yes, it's best effort; there's no guarantee then. But it doesn't harm to try.

Thanks

> > It looks like we have multiple goals here:
> >
> > 1) avoid lockups; using workqueue + cond_resched() seems to be
> > sufficient, it has issues but nothing new
> >
> > 2) recover from an unresponsive device; the issue with a timeout is
> > that it needs to deal with false positives
>
> I agree.
>
> But I want to add a new goal: cvq async. In netdim, we will send many
> requests via the cvq, so cvq async would be nice.
>
> Thanks.
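To illustrate the "best effort" point, a sketch of how the queue_reset
poll above could be bounded so a dead device cannot wedge the caller
forever. The function name and the ~100 ms budget are assumptions for
illustration, not from any posted patch:

    /* Bounded variant of the queue_reset wait. */
    static int vp_modern_wait_queue_reset(struct virtio_pci_modern_device *mdev)
    {
            struct virtio_pci_modern_common_cfg __iomem *cfg =
                    (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
            int budget = 100;       /* ~100 ms at msleep(1) per iteration */

            while (vp_ioread16(&cfg->queue_reset)) {
                    if (!budget--)
                            return -ETIMEDOUT;  /* best effort: give up */
                    msleep(1);
            }
            return 0;
    }

A caller that gets -ETIMEDOUT still has to decide what recovery means
(mark the device broken, full device reset, etc.), which is exactly the
false-positive problem raised above.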