Re: [External] Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-23 Thread Jason Wang



On 2020/12/23 8:14 PM, Yongji Xie wrote:

On Wed, Dec 23, 2020 at 5:05 PM Jason Wang  wrote:


On 2020/12/22 10:52 PM, Xie Yongji wrote:

To support vhost-vdpa bus driver, we need a way to share the
vhost-vdpa backend process's memory with the userspace VDUSE process.

This patch tries to make use of the vhost iotlb message to achieve
that. We will get the shm file from the iotlb message and pass it
to the userspace VDUSE process.

Signed-off-by: Xie Yongji 
---
   Documentation/driver-api/vduse.rst |  15 +++-
   drivers/vdpa/vdpa_user/vduse_dev.c | 147 -
   include/uapi/linux/vduse.h |  11 +++
   3 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
index 623f7b040ccf..48e4b1ba353f 100644
--- a/Documentation/driver-api/vduse.rst
+++ b/Documentation/driver-api/vduse.rst
@@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:

   - VDUSE_GET_CONFIG: Read from device specific configuration space

+- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
+
+- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
+
   Please see include/linux/vdpa.h for details.

-In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
+The data path of userspace vDPA device is implemented in different ways
+depending on the vdpa bus to which it is attached.
+
+In the virtio-vdpa case, the VDUSE framework implements an MMU-based on-chip IOMMU
   driver which supports mapping the kernel dma buffer to a userspace iova
   region dynamically. The userspace iova region can be created by passing
   the userspace vDPA device fd to mmap(2).

+In the vhost-vdpa case, the dma buffer resides in a userspace memory region
+which will be shared with the VDUSE userspace process via the file
+descriptor in the VDUSE_UPDATE_IOTLB message. The corresponding address
+mapping (IOVA of the dma buffer <-> VA of the memory region) is also
+included in this message.
+
   Besides, the eventfd mechanism is used to trigger interrupt callbacks and
   receive virtqueue kicks in userspace. The following ioctls on the userspace
   vDPA device fd are provided to support that:
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index b974333ed4e9..d24aaacb6008 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -34,6 +34,7 @@

   struct vduse_dev_msg {
   struct vduse_dev_request req;
+ struct file *iotlb_file;
   struct vduse_dev_response resp;
   struct list_head list;
   wait_queue_head_t waitq;
@@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
   return ret;
   }

+static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
+ u64 offset, u64 iova, u64 size, u8 perm)
+{
+ struct vduse_dev_msg *msg;
+ int ret;
+
+ if (!size)
+ return -EINVAL;
+
+ msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
+ msg->req.size = sizeof(struct vduse_iotlb);
+ msg->req.iotlb.offset = offset;
+ msg->req.iotlb.iova = iova;
+ msg->req.iotlb.size = size;
+ msg->req.iotlb.perm = perm;
+ msg->req.iotlb.fd = -1;
+ msg->iotlb_file = get_file(file);
+
+ ret = vduse_dev_msg_sync(dev, msg);


My feeling is that we should provide a consistent API for the userspace
device to use.

E.g. we'd better carry the IOTLB message for both virtio/vhost drivers.

It looks to me that for virtio drivers we can still use the UPDATE_IOTLB
message by using the VDUSE file as msg->iotlb_file here.


It's OK for me. One problem is when to transfer the UPDATE_IOTLB
message in virtio cases.



Instead of generating IOTLB messages for userspace, how about recording
the mappings (which is a common case for devices with an on-chip IOMMU,
e.g. mlx5e and the vdpa simulator)? Then we can introduce an ioctl for
userspace to query them.


Thanks





+ vduse_dev_msg_put(msg);
+ fput(file);
+
+ return ret;
+}
+
+static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev,
+ u64 iova, u64 size)
+{
+ struct vduse_dev_msg *msg;
+ int ret;
+
+ if (!size)
+ return -EINVAL;
+
+ msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB);
+ msg->req.size = sizeof(struct vduse_iotlb);
+ msg->req.iotlb.iova = iova;
+ msg->req.iotlb.size = size;
+
+ ret = vduse_dev_msg_sync(dev, msg);
+ vduse_dev_msg_put(msg);
+
+ return ret;
+}
+
+static unsigned int perm_to_file_flags(u8 perm)
+{
+ unsigned int flags = 0;
+
+ switch (perm) {
+ case VHOST_ACCESS_WO:
+ flags |= O_WRONLY;
+ break;
+ case VHOST_ACCESS_RO:
+ flags |= O_RDONLY;
+ break;
+ case VHOST_ACCESS_RW:
+ flags |= O_RDWR;
+ break;
+ default:
+   
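To make the intended flow concrete, the userspace side of this patch can be
sketched as follows: on receiving VDUSE_UPDATE_IOTLB, the VDUSE process
mmap()s the file descriptor carried by the message so the advertised IOVA
range can be translated to local virtual addresses. This is a minimal sketch
only, assuming the message layout of the quoted patch (with a populated
iotlb.fd) and a hypothetical iotlb_entry bookkeeping structure:

#include <stdint.h>
#include <sys/mman.h>

struct iotlb_entry {
        uint64_t iova, size, offset;
        void *va;
};

/* perm uses the VHOST_ACCESS_* bits: RO = 0x1, WO = 0x2, RW = RO | WO */
static int handle_update_iotlb(const struct vduse_dev_request *req,
                               struct iotlb_entry *e)
{
        int prot = 0;

        if (req->iotlb.perm & VHOST_ACCESS_RO)
                prot |= PROT_READ;
        if (req->iotlb.perm & VHOST_ACCESS_WO)
                prot |= PROT_WRITE;

        e->va = mmap(NULL, req->iotlb.size, prot, MAP_SHARED,
                     req->iotlb.fd, req->iotlb.offset);
        if (e->va == MAP_FAILED)
                return -1;

        e->iova = req->iotlb.iova;
        e->size = req->iotlb.size;
        e->offset = req->iotlb.offset;
        return 0;
}

A translation from IOVA to VA is then e->va + (iova - e->iova) for any IOVA
that falls inside the entry.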

Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace

2020-12-23 Thread Jason Wang



On 2020/12/23 10:17 PM, Yongji Xie wrote:

On Wed, Dec 23, 2020 at 4:08 PM Jason Wang  wrote:


On 2020/12/22 10:52 PM, Xie Yongji wrote:

This VDUSE driver enables implementing vDPA devices in userspace.
Both control path and data path of vDPA devices will be able to
be handled in userspace.

In the control path, the VDUSE driver will make use of a message
mechanism to forward the config operations from the vdpa bus driver
to userspace. Userspace can use read()/write() to receive/reply to
those control messages.

In the data path, the VDUSE driver implements an MMU-based on-chip
IOMMU driver which supports mapping the kernel dma buffer to a
userspace iova region dynamically. Userspace can access those
iova regions via mmap(). Besides, the eventfd mechanism is used to
trigger interrupt callbacks and receive virtqueue kicks in userspace.

For now, only the virtio-vdpa bus driver is supported with this patch applied.

Signed-off-by: Xie Yongji 
---
   Documentation/driver-api/vduse.rst |   74 ++
   Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
   drivers/vdpa/Kconfig   |8 +
   drivers/vdpa/Makefile  |1 +
   drivers/vdpa/vdpa_user/Makefile|5 +
   drivers/vdpa/vdpa_user/eventfd.c   |  221 
   drivers/vdpa/vdpa_user/eventfd.h   |   48 +
   drivers/vdpa/vdpa_user/iova_domain.c   |  442 
   drivers/vdpa/vdpa_user/iova_domain.h   |   93 ++
   drivers/vdpa/vdpa_user/vduse.h |   59 ++
   drivers/vdpa/vdpa_user/vduse_dev.c | 1121 

   include/uapi/linux/vdpa.h  |1 +
   include/uapi/linux/vduse.h |   99 ++
   13 files changed, 2173 insertions(+)
   create mode 100644 Documentation/driver-api/vduse.rst
   create mode 100644 drivers/vdpa/vdpa_user/Makefile
   create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
   create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
   create mode 100644 drivers/vdpa/vdpa_user/vduse.h
   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
   create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
new file mode 100644
index ..da9b3040f20a
--- /dev/null
+++ b/Documentation/driver-api/vduse.rst
@@ -0,0 +1,74 @@
+==================================
+VDUSE - "vDPA Device in Userspace"
+==================================
+
+A vDPA (virtio data path acceleration) device is a device whose
+datapath complies with the virtio specification but whose control
+path is vendor specific. vDPA devices can either be physically
+located on hardware or be emulated by software. VDUSE is a framework
+that makes it possible to implement software-emulated vDPA devices in
+userspace.
+
+How VDUSE works
+---------------
+Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
+the VDUSE character device (/dev/vduse). Then a file descriptor pointing
+to the new resources will be returned, which can be used to implement the
+userspace vDPA device's control path and data path.
+
+To implement the control path, read/write operations on the file descriptor
+are used to receive/reply to the control messages from/to the VDUSE driver.
+Those control messages are based on the vdpa_config_ops, which defines a
+unified interface to control different types of vDPA devices.
+
+The following types of messages are provided by the VDUSE framework now:
+
+- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
+
+- VDUSE_SET_VQ_NUM: Set the size of virtqueue
+
+- VDUSE_SET_VQ_READY: Set ready status of virtqueue
+
+- VDUSE_GET_VQ_READY: Get ready status of virtqueue
+
+- VDUSE_SET_FEATURES: Set virtio features supported by the driver
+
+- VDUSE_GET_FEATURES: Get virtio features supported by the device
+
+- VDUSE_SET_STATUS: Set the device status
+
+- VDUSE_GET_STATUS: Get the device status
+
+- VDUSE_SET_CONFIG: Write to device specific configuration space
+
+- VDUSE_GET_CONFIG: Read from device specific configuration space
+
+Please see include/linux/vdpa.h for details.
+
+In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
+driver which supports mapping the kernel dma buffer to a userspace iova
+region dynamically. The userspace iova region can be created by passing
+the userspace vDPA device fd to mmap(2).
+
+Besides, the eventfd mechanism is used to trigger interrupt callbacks and
+receive virtqueue kicks in userspace. The following ioctls on the userspace
+vDPA device fd are provided to support that:
+
+- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
+  by VDUSE driver to notify userspace to consume the vring.
+
+- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
+  by userspace to notify the VDUSE driver to trigger interrupt callbacks.
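Putting the read()/write() control path and the eventfd-based data path
together, a skeleton of the userspace device loop might look like the sketch
below. The message names come from the list above; the request/response
field names (e.g. "unique") are assumptions:

/* Sketch of a VDUSE userspace skeleton; dev_fd is the device fd. */
for (;;) {
        struct vduse_dev_request req;
        struct vduse_dev_response resp = { 0 };

        if (read(dev_fd, &req, sizeof(req)) != sizeof(req))
                break;

        resp.unique = req.unique;       /* match reply to request (assumed) */
        switch (req.type) {
        case VDUSE_SET_VQ_NUM:
                /* record the virtqueue size from the request */
                break;
        case VDUSE_GET_FEATURES:
                /* fill resp with the features the device supports */
                break;
        /* ... handle the other message types listed above ... */
        }
        write(dev_fd, &resp, sizeof(resp));
}

The data path then runs in other threads: poll() each kickfd installed via
VDUSE_VQ_SETUP_KICKFD, consume the vring when it fires, and write to the
corresponding irqfd installed via VDUSE_VQ_SETUP_IRQFD to trigger the
interrupt callback.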

Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails

2020-12-23 Thread Jason Wang



On 2020/12/24 10:25 AM, wangyunjian wrote:

From: Yunjian Wang 

Currently the driver doesn't drop a packet which can't be sent by tun
(e.g. a bad packet). In this case, the driver will always process the
same packet, leading to a stuck tx queue.

To fix this issue:
1. in the case of persistent failure (e.g. a bad packet), the driver
can skip this descriptor by ignoring the error.
2. in the case of transient failure (e.g. -EAGAIN and -ENOMEM), the
driver schedules the worker to try again.



I might be wrong but looking at alloc_skb_with_frags(), it returns 
-ENOBUFS actually:


    *errcode = -ENOBUFS;
    skb = alloc_skb(header_len, gfp_mask);
    if (!skb)
        return NULL;

Thanks




Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")
Signed-off-by: Yunjian Wang 
---
  drivers/vhost/net.c | 16 
  1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c8784dfafdd7..e76245daa7f6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -827,14 +827,13 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
msg.msg_flags &= ~MSG_MORE;
}
  
-		/* TODO: Check specific error and bomb out unless ENOBUFS? */

err = sock->ops->sendmsg(sock, &msg, len);
-   if (unlikely(err < 0)) {
+   if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
vhost_discard_vq_desc(vq, 1);
vhost_net_enable_vq(net, vq);
break;
}
-   if (err != len)
+   if (err >= 0 && err != len)
pr_debug("Truncated TX packet: len %d != %zd\n",
 err, len);
  done:
@@ -922,7 +921,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
msg.msg_flags &= ~MSG_MORE;
}
  
-		/* TODO: Check specific error and bomb out unless ENOBUFS? */

err = sock->ops->sendmsg(sock, &msg, len);
if (unlikely(err < 0)) {
if (zcopy_used) {
@@ -931,11 +929,13 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
% UIO_MAXIOV;
}
-   vhost_discard_vq_desc(vq, 1);
-   vhost_net_enable_vq(net, vq);
-   break;
+   if (err == -EAGAIN || err == -ENOMEM) {
+   vhost_discard_vq_desc(vq, 1);
+   vhost_net_enable_vq(net, vq);
+   break;
+   }
}
-   if (err != len)
+   if (err >= 0 && err != len)
pr_debug("Truncated TX packet: "
 " len %d != %zd\n", err, len);
if (!zcopy_used)




Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails

2020-12-23 Thread Jason Wang



On 2020/12/24 12:37 PM, wangyunjian wrote:

-Original Message-
From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Thursday, December 24, 2020 11:10 AM
To: wangyunjian ; netdev@vger.kernel.org;
m...@redhat.com; willemdebruijn.ker...@gmail.com
Cc: virtualizat...@lists.linux-foundation.org; Lilijun (Jerry)
; chenchanghu ;
xudingke ; huangbin (J)

Subject: Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg
fails


On 2020/12/24 10:25 AM, wangyunjian wrote:

From: Yunjian Wang 

Currently the driver doesn't drop a packet which can't be sent by tun
(e.g. a bad packet). In this case, the driver will always process the
same packet, leading to a stuck tx queue.

To fix this issue:
1. in the case of persistent failure (e.g. a bad packet), the driver can
skip this descriptor by ignoring the error.
2. in the case of transient failure (e.g. -EAGAIN and -ENOMEM), the
driver schedules the worker to try again.


I might be wrong but looking at alloc_skb_with_frags(), it returns -ENOBUFS
actually:

      *errcode = -ENOBUFS;
      skb = alloc_skb(header_len, gfp_mask);
      if (!skb)
          return NULL;

Yes, but sock_alloc_send_pskb() returns -EAGAIN when there is no sndbuf
space. So we need to check whether the return value is -EAGAIN, -ENOMEM,
or -ENOBUFS?

struct sk_buff *sock_alloc_send_pskb()
{
...
for (;;) {
...
sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
err = -EAGAIN;
if (!timeo)
goto failure;
...
}
skb = alloc_skb_with_frags(header_len, data_len, max_page_order,
   errcode, sk->sk_allocation);
if (skb)
skb_set_owner_w(skb, sk);
return skb;
...
*errcode = err;
return NULL;
}



-EAGAIN and -ENOBUFS are fine. But I don't see how -ENOMEM can be returned.

Thanks



Thanks



Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")
Signed-off-by: Yunjian Wang 
---
   drivers/vhost/net.c | 16 
   1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c8784dfafdd7..e76245daa7f6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -827,14 +827,13 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)

msg.msg_flags &= ~MSG_MORE;
}

-   /* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(sock, &msg, len);
-   if (unlikely(err < 0)) {
+   if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
vhost_discard_vq_desc(vq, 1);
vhost_net_enable_vq(net, vq);
break;
}
-   if (err != len)
+   if (err >= 0 && err != len)
pr_debug("Truncated TX packet: len %d != %zd\n",
 err, len);
   done:
@@ -922,7 +921,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)

msg.msg_flags &= ~MSG_MORE;
}

-   /* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(sock, &msg, len);
if (unlikely(err < 0)) {
if (zcopy_used) {
@@ -931,11 +929,13 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)

nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
% UIO_MAXIOV;
}
-   vhost_discard_vq_desc(vq, 1);
-   vhost_net_enable_vq(net, vq);
-   break;
+   if (err == -EAGAIN || err == -ENOMEM) {
+   vhost_discard_vq_desc(vq, 1);
+   vhost_net_enable_vq(net, vq);
+   break;
+   }
}
-   if (err != len)
+   if (err >= 0 && err != len)
pr_debug("Truncated TX packet: "
 " len %d != %zd\n", err, len);
if (!zcopy_used)




Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails

2020-12-24 Thread Jason Wang



On 2020/12/24 5:09 PM, wangyunjian wrote:

-EAGAIN and -ENOBUFS are fine. But I don't see how -ENOMEM can be returned.

The tun_build_skb() and tun_napi_alloc_frags() return -ENOMEM when memory
allocation fails.

Thanks



You're right. So I think we need to check them all.

In the future, we need to think of a better way. I feel such a check is
kind of fragile, since people may modify the core socket code without
having a look at what vhost does.


Thanks
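Putting the thread's conclusion into code, the transmit path would treat all
three errnos as transient. A sketch of the adjusted check follows; the
surrounding code is from the quoted patch, with the -ENOBUFS case being the
addition discussed here:

		err = sock->ops->sendmsg(sock, &msg, len);
		if (unlikely(err == -EAGAIN || err == -ENOMEM ||
			     err == -ENOBUFS)) {
			/* transient failure: keep the descriptor, retry later */
			vhost_discard_vq_desc(vq, 1);
			vhost_net_enable_vq(net, vq);
			break;
		}
		/* any other negative error is treated as persistent: the
		 * descriptor is consumed and the bad packet is dropped */
		if (err >= 0 && err != len)
			pr_debug("Truncated TX packet: len %d != %zd\n",
				 err, len);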



Re: [PATCH net v2] tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS

2020-12-24 Thread Jason Wang



On 2020/12/25 10:52 AM, wangyunjian wrote:

From: Yunjian Wang 

Currently the tun_napi_alloc_frags() function returns -ENOMEM when the
number of iovs exceeds MAX_SKB_FRAGS + 1. However this is inappropriate;
we should use -EMSGSIZE instead of -ENOMEM.

The following distinctions matter:
1. the caller needs to drop the bad packet when -EMSGSIZE is returned,
which indicates a persistent failure.
2. the caller can try again when -ENOMEM is returned, which indicates
a transient failure.

Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Yunjian Wang 
Acked-by: Willem de Bruijn 
---
v2:
   * update commit log suggested by Willem de Bruijn
---
  drivers/net/tun.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2dc1988a8973..15c6dd7fb04c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1365,7 +1365,7 @@ static struct sk_buff *tun_napi_alloc_frags(struct tun_file *tfile,
int i;
  
  	if (it->nr_segs > MAX_SKB_FRAGS + 1)

-   return ERR_PTR(-ENOMEM);
+   return ERR_PTR(-EMSGSIZE);
  
  	local_bh_disable();

skb = napi_get_frags(&tfile->napi);



Acked-by: Jason Wang 




Re: [PATCH net v4 1/2] vhost_net: fix ubuf refcount incorrectly when sendmsg fails

2020-12-24 Thread Jason Wang



On 2020/12/24 10:25 AM, wangyunjian wrote:

From: Yunjian Wang 

Currently vhost_zerocopy_callback() may be called to decrease
the refcount when sendmsg fails in tun. The error handling in vhost's
handle_tx_zerocopy() will then try to decrease the same refcount again.
This is wrong. To fix this issue, we only call vhost_net_ubuf_put()
when vq->heads[nvq->desc].len == VHOST_DMA_IN_PROGRESS.

Fixes: 0690899b4d45 ("tun: experimental zero copy tx support")
Signed-off-by: Yunjian Wang 
Acked-by: Willem de Bruijn 
Acked-by: Michael S. Tsirkin 



Acked-by: Jason Wang 



---
  drivers/vhost/net.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 531a00d703cd..c8784dfafdd7 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -863,6 +863,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
size_t len, total_len = 0;
int err;
struct vhost_net_ubuf_ref *ubufs;
+   struct ubuf_info *ubuf;
bool zcopy_used;
int sent_pkts = 0;
  
@@ -895,9 +896,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
  
  		/* use msg_control to pass vhost zerocopy ubuf info to skb */

if (zcopy_used) {
-   struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
-
vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
ubuf->callback = vhost_zerocopy_callback;
@@ -927,7 +926,8 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
err = sock->ops->sendmsg(sock, &msg, len);
if (unlikely(err < 0)) {
if (zcopy_used) {
-   vhost_net_ubuf_put(ubufs);
+   if (vq->heads[ubuf->desc].len == VHOST_DMA_IN_PROGRESS)
+   vhost_net_ubuf_put(ubufs);
nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
% UIO_MAXIOV;
}




Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-24 Thread Jason Wang



On 2020/12/24 3:37 PM, Yongji Xie wrote:

On Thu, Dec 24, 2020 at 10:41 AM Jason Wang  wrote:


On 2020/12/23 8:14 PM, Yongji Xie wrote:

On Wed, Dec 23, 2020 at 5:05 PM Jason Wang  wrote:

On 2020/12/22 10:52 PM, Xie Yongji wrote:

To support vhost-vdpa bus driver, we need a way to share the
vhost-vdpa backend process's memory with the userspace VDUSE process.

This patch tries to make use of the vhost iotlb message to achieve
that. We will get the shm file from the iotlb message and pass it
to the userspace VDUSE process.

Signed-off-by: Xie Yongji 
---
Documentation/driver-api/vduse.rst |  15 +++-
drivers/vdpa/vdpa_user/vduse_dev.c | 147 -
include/uapi/linux/vduse.h |  11 +++
3 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
index 623f7b040ccf..48e4b1ba353f 100644
--- a/Documentation/driver-api/vduse.rst
+++ b/Documentation/driver-api/vduse.rst
@@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:

- VDUSE_GET_CONFIG: Read from device specific configuration space

+- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
+
+- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
+
Please see include/linux/vdpa.h for details.

-In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
+The data path of userspace vDPA device is implemented in different ways
+depending on the vdpa bus to which it is attached.
+
+In the virtio-vdpa case, the VDUSE framework implements an MMU-based on-chip IOMMU
driver which supports mapping the kernel dma buffer to a userspace iova
region dynamically. The userspace iova region can be created by passing
the userspace vDPA device fd to mmap(2).

+In the vhost-vdpa case, the dma buffer resides in a userspace memory region
+which will be shared with the VDUSE userspace process via the file
+descriptor in the VDUSE_UPDATE_IOTLB message. The corresponding address
+mapping (IOVA of the dma buffer <-> VA of the memory region) is also
+included in this message.
+
Besides, the eventfd mechanism is used to trigger interrupt callbacks and
receive virtqueue kicks in userspace. The following ioctls on the userspace
vDPA device fd are provided to support that:
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index b974333ed4e9..d24aaacb6008 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -34,6 +34,7 @@

struct vduse_dev_msg {
struct vduse_dev_request req;
+ struct file *iotlb_file;
struct vduse_dev_response resp;
struct list_head list;
wait_queue_head_t waitq;
@@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
return ret;
}

+static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
+ u64 offset, u64 iova, u64 size, u8 perm)
+{
+ struct vduse_dev_msg *msg;
+ int ret;
+
+ if (!size)
+ return -EINVAL;
+
+ msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
+ msg->req.size = sizeof(struct vduse_iotlb);
+ msg->req.iotlb.offset = offset;
+ msg->req.iotlb.iova = iova;
+ msg->req.iotlb.size = size;
+ msg->req.iotlb.perm = perm;
+ msg->req.iotlb.fd = -1;
+ msg->iotlb_file = get_file(file);
+
+ ret = vduse_dev_msg_sync(dev, msg);

My feeling is that we should provide a consistent API for the userspace
device to use.

E.g. we'd better carry the IOTLB message for both virtio/vhost drivers.

It looks to me that for virtio drivers we can still use the UPDATE_IOTLB
message by using the VDUSE file as msg->iotlb_file here.


It's OK for me. One problem is when to transfer the UPDATE_IOTLB
message in virtio cases.


Instead of generating IOTLB messages for userspace, how about recording
the mappings (which is a common case for devices with an on-chip IOMMU,
e.g. mlx5e and the vdpa simulator)? Then we can introduce an ioctl for
userspace to query them.


If so, the IOTLB UPDATE is actually triggered by ioctl, but
IOTLB_INVALIDATE is triggered by the message. Is it a little odd?



Good point.

Some questions here:

1) Userspace thinks the IOTLB was flushed once the IOTLB_INVALIDATE
syscall has returned. But this patch doesn't provide this guarantee. Can
this lead to some issues?
2) I think after the VDUSE userspace receives IOTLB_INVALIDATE, it needs
to issue a munmap(). What if it doesn't do that?




Or how about triggering it when userspace calls mmap() on the device fd?



One possible issue is that the IOTLB_UPDATE/INVALIDATE might come after 
mmap():


1) When vIOMMU is enabled
2) Guest memory topology has been changed (memory hotplug).

Thanks




Thanks,
Yongji





Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace

2020-12-24 Thread Jason Wang



On 2020/12/24 4:34 PM, Yongji Xie wrote:

Yes, the disadvantage is the performance. But it should be simpler (I
guess) and we know it can succeed.


Yes, another advantage is that we can support the VM using anonymous memory.



Exactly.





I think I can try this in v3. And the
MMU-based IOMMU implementation can be a future optimization in the
virtio-vdpa case. What's your opinion?

Maybe I was wrong, but I think we can try what has been proposed here
first and use the shadow virtqueue as a backup plan if we fail.


OK, I will continue to work on this proposal.



Thanks







Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-24 Thread Jason Wang



On 2020/12/25 10:37 AM, Yongji Xie wrote:

On Thu, Dec 24, 2020 at 3:37 PM Yongji Xie  wrote:

On Thu, Dec 24, 2020 at 10:41 AM Jason Wang  wrote:


On 2020/12/23 8:14 PM, Yongji Xie wrote:

On Wed, Dec 23, 2020 at 5:05 PM Jason Wang  wrote:

On 2020/12/22 10:52 PM, Xie Yongji wrote:

To support vhost-vdpa bus driver, we need a way to share the
vhost-vdpa backend process's memory with the userspace VDUSE process.

This patch tries to make use of the vhost iotlb message to achieve
that. We will get the shm file from the iotlb message and pass it
to the userspace VDUSE process.

Signed-off-by: Xie Yongji 
---
Documentation/driver-api/vduse.rst |  15 +++-
drivers/vdpa/vdpa_user/vduse_dev.c | 147 -
include/uapi/linux/vduse.h |  11 +++
3 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
index 623f7b040ccf..48e4b1ba353f 100644
--- a/Documentation/driver-api/vduse.rst
+++ b/Documentation/driver-api/vduse.rst
@@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:

- VDUSE_GET_CONFIG: Read from device specific configuration space

+- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
+
+- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
+
Please see include/linux/vdpa.h for details.

-In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
+The data path of userspace vDPA device is implemented in different ways
+depending on the vdpa bus to which it is attached.
+
+In the virtio-vdpa case, the VDUSE framework implements an MMU-based on-chip IOMMU
driver which supports mapping the kernel dma buffer to a userspace iova
region dynamically. The userspace iova region can be created by passing
the userspace vDPA device fd to mmap(2).

+In the vhost-vdpa case, the dma buffer resides in a userspace memory region
+which will be shared with the VDUSE userspace process via the file
+descriptor in the VDUSE_UPDATE_IOTLB message. The corresponding address
+mapping (IOVA of the dma buffer <-> VA of the memory region) is also
+included in this message.
+
Besides, the eventfd mechanism is used to trigger interrupt callbacks and
receive virtqueue kicks in userspace. The following ioctls on the userspace
vDPA device fd are provided to support that:
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index b974333ed4e9..d24aaacb6008 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -34,6 +34,7 @@

struct vduse_dev_msg {
struct vduse_dev_request req;
+ struct file *iotlb_file;
struct vduse_dev_response resp;
struct list_head list;
wait_queue_head_t waitq;
@@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
return ret;
}

+static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
+ u64 offset, u64 iova, u64 size, u8 perm)
+{
+ struct vduse_dev_msg *msg;
+ int ret;
+
+ if (!size)
+ return -EINVAL;
+
+ msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
+ msg->req.size = sizeof(struct vduse_iotlb);
+ msg->req.iotlb.offset = offset;
+ msg->req.iotlb.iova = iova;
+ msg->req.iotlb.size = size;
+ msg->req.iotlb.perm = perm;
+ msg->req.iotlb.fd = -1;
+ msg->iotlb_file = get_file(file);
+
+ ret = vduse_dev_msg_sync(dev, msg);

My feeling is that we should provide a consistent API for the userspace
device to use.

E.g. we'd better carry the IOTLB message for both virtio/vhost drivers.

It looks to me that for virtio drivers we can still use the UPDATE_IOTLB
message by using the VDUSE file as msg->iotlb_file here.


It's OK for me. One problem is when to transfer the UPDATE_IOTLB
message in virtio cases.


Instead of generating IOTLB messages for userspace, how about recording
the mappings (which is a common case for devices with an on-chip IOMMU,
e.g. mlx5e and the vdpa simulator)? Then we can introduce an ioctl for
userspace to query them.


If so, the IOTLB_UPDATE is actually triggered by an ioctl, but
IOTLB_INVALIDATE is triggered by the message. Is it a little odd? Or
how about triggering it when userspace calls mmap() on the device fd?


Oh sorry, looks like mmap() needs to be called in IOTLB UPDATE message
handler. Is it possible for the vdpa device to know which vdpa bus it
is attached to?



We'd better not. It's a kind of layering violation.

Thanks



  So that we can generate this message during probing.
Otherwise, we don't know whether the iova domain of the MMU-based IOMMU
is needed.

Thanks,
Yongji





Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-27 Thread Jason Wang



On 2020/12/25 6:31 PM, Yongji Xie wrote:

On Fri, Dec 25, 2020 at 2:58 PM Jason Wang  wrote:


On 2020/12/24 3:37 PM, Yongji Xie wrote:

On Thu, Dec 24, 2020 at 10:41 AM Jason Wang  wrote:

On 2020/12/23 8:14 PM, Yongji Xie wrote:

On Wed, Dec 23, 2020 at 5:05 PM Jason Wang  wrote:

On 2020/12/22 10:52 PM, Xie Yongji wrote:

To support vhost-vdpa bus driver, we need a way to share the
vhost-vdpa backend process's memory with the userspace VDUSE process.

This patch tries to make use of the vhost iotlb message to achieve
that. We will get the shm file from the iotlb message and pass it
to the userspace VDUSE process.

Signed-off-by: Xie Yongji 
---
 Documentation/driver-api/vduse.rst |  15 +++-
 drivers/vdpa/vdpa_user/vduse_dev.c | 147 -
 include/uapi/linux/vduse.h |  11 +++
 3 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
index 623f7b040ccf..48e4b1ba353f 100644
--- a/Documentation/driver-api/vduse.rst
+++ b/Documentation/driver-api/vduse.rst
@@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:

 - VDUSE_GET_CONFIG: Read from device specific configuration space

+- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
+
+- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
+
 Please see include/linux/vdpa.h for details.

-In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
+The data path of userspace vDPA device is implemented in different ways
+depending on the vdpa bus to which it is attached.
+
+In the virtio-vdpa case, the VDUSE framework implements an MMU-based on-chip IOMMU
 driver which supports mapping the kernel dma buffer to a userspace iova
 region dynamically. The userspace iova region can be created by passing
 the userspace vDPA device fd to mmap(2).

+In the vhost-vdpa case, the dma buffer resides in a userspace memory region
+which will be shared with the VDUSE userspace process via the file
+descriptor in the VDUSE_UPDATE_IOTLB message. The corresponding address
+mapping (IOVA of the dma buffer <-> VA of the memory region) is also
+included in this message.
+
 Besides, the eventfd mechanism is used to trigger interrupt callbacks and
 receive virtqueue kicks in userspace. The following ioctls on the userspace
 vDPA device fd are provided to support that:
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index b974333ed4e9..d24aaacb6008 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -34,6 +34,7 @@

 struct vduse_dev_msg {
 struct vduse_dev_request req;
+ struct file *iotlb_file;
 struct vduse_dev_response resp;
 struct list_head list;
 wait_queue_head_t waitq;
@@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
 return ret;
 }

+static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
+ u64 offset, u64 iova, u64 size, u8 perm)
+{
+ struct vduse_dev_msg *msg;
+ int ret;
+
+ if (!size)
+ return -EINVAL;
+
+ msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
+ msg->req.size = sizeof(struct vduse_iotlb);
+ msg->req.iotlb.offset = offset;
+ msg->req.iotlb.iova = iova;
+ msg->req.iotlb.size = size;
+ msg->req.iotlb.perm = perm;
+ msg->req.iotlb.fd = -1;
+ msg->iotlb_file = get_file(file);
+
+ ret = vduse_dev_msg_sync(dev, msg);

My feeling is that we should provide a consistent API for the userspace
device to use.

E.g. we'd better carry the IOTLB message for both virtio/vhost drivers.

It looks to me that for virtio drivers we can still use the UPDATE_IOTLB
message by using the VDUSE file as msg->iotlb_file here.


It's OK for me. One problem is when to transfer the UPDATE_IOTLB
message in virtio cases.

Instead of generating IOTLB messages for userspace, how about recording
the mappings (which is a common case for devices with an on-chip IOMMU,
e.g. mlx5e and the vdpa simulator)? Then we can introduce an ioctl for
userspace to query them.


If so, the IOTLB UPDATE is actually triggered by ioctl, but
IOTLB_INVALIDATE is triggered by the message. Is it a little odd?


Good point.

Some questions here:

1) Userspace thinks the IOTLB was flushed once the IOTLB_INVALIDATE
syscall has returned. But this patch doesn't provide this guarantee. Can
this lead to some issues?

I'm not sure. But should it be guaranteed in VDUSE userspace? Just
like what vhost-user backend process does.



I think so. This is because the userspace device needs a way to
synchronize invalidation with its datapath. Otherwise, the guest may
think the buffer is freed to be used by other parts while it can
actually still be used by the VDUSE device. That may cause security
issues.
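One way the VDUSE process could provide that synchronization is sketched
below, under the assumption that writing the response is what unblocks the
kernel-side invalidate call; quiesce_datapath() and unmap_range() are
hypothetical helpers:

static void handle_invalidate_iotlb(int dev_fd,
                                    const struct vduse_dev_request *req)
{
        struct vduse_dev_response resp = { 0 };

        /* stop the datapath from touching [iova, iova + size) first */
        quiesce_datapath(req->iotlb.iova, req->iotlb.size);
        /* then drop the mapping; munmap() happens inside */
        unmap_range(req->iotlb.iova, req->iotlb.size);

        /* only now reply: this is what lets the kernel invalidate return */
        write(dev_fd, &resp, sizeof(resp));
}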






2) I think after VDUSE userspace receives IOTLB_INVALIDATE, it needs to
issue a munmap(). What if it doesn't do that?

Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-28 Thread Jason Wang



On 2020/12/28 4:14 PM, Yongji Xie wrote:

I see. So both of the above questions are because VHOST_IOTLB_INVALIDATE
is expected to be synchronous. This needs to be solved by tweaking the
current VDUSE API, or we can re-visit going with descriptor relaying first.


Actually all vdpa related operations are synchronous in the current
implementation. The ops.set_map/dma_map/dma_unmap should not return
until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
to by userspace. Could that solve this problem?
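In other words, vduse_dev_msg_sync() presumably parks on the message's
waitqueue until the reply arrives through write() on the VDUSE fd. A sketch
consistent with the struct vduse_dev_msg quoted above; the lock, list, and
completed/result names are assumptions:

static int vduse_dev_msg_sync(struct vduse_dev *dev,
                              struct vduse_dev_msg *msg)
{
        init_waitqueue_head(&msg->waitq);

        spin_lock(&dev->msg_lock);              /* lock name assumed */
        list_add_tail(&msg->list, &dev->send_list);
        spin_unlock(&dev->msg_lock);
        wake_up(&dev->waitq);                   /* wake a blocked read() */

        /* block until the write() path fills msg->resp and marks it done */
        wait_event(msg->waitq, msg->completed);

        return msg->resp.result;                /* field name assumed */
}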



I was thinking about whether or not we need to generate an
IOTLB_INVALIDATE message to VDUSE during dma_unmap
(vduse_dev_unmap_page).


If we don't, we're probably fine.

Thanks




Thanks,
Yongji





Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-29 Thread Jason Wang



- Original Message -
> On Mon, Dec 28, 2020 at 4:43 PM Jason Wang  wrote:
> >
> >
> > On 2020/12/28 4:14 PM, Yongji Xie wrote:
> > >> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
> > >> is expected to be synchronous. This needs to be solved by tweaking the
> > >> current VDUSE API or we can re-visit to go with descriptors relaying
> > >> first.
> > >>
> > > Actually all vdpa related operations are synchronous in current
> > > implementation. The ops.set_map/dma_map/dma_unmap should not return
> > > until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
> > > by userspace. Could it solve this problem?
> >
> >
> >   I was thinking whether or not we need to generate IOTLB_INVALIDATE
> > message to VDUSE during dma_unmap (vduse_dev_unmap_page).
> >
> > If we don't, we're probably fine.
> >
> 
> It seems not feasible. This message will also be used in the
> virtio-vdpa case to notify userspace to unmap some pages during
> consistent dma unmapping. Maybe we can document it to make sure the
> users can handle the message correctly.

Just to make sure I understand your point.

Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
coherent DMA?

For 1) you probably need a workqueue to do that since dma unmap can
be done in irq or bh context. And if userspace doesn't do the unmap, it
can still access the bounce buffer (if you don't zap the pte)?

Thanks
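A minimal sketch of that workqueue idea, reusing vduse_dev_invalidate_iotlb()
from the patch quoted earlier; the deferral structure itself is an
assumption, and note that until the work item runs and userspace replies,
the buffer indeed stays accessible:

#include <linux/workqueue.h>

struct vduse_unmap_work {
        struct work_struct work;
        struct vduse_dev *dev;
        u64 iova;
        u64 size;
};

static void vduse_unmap_work_fn(struct work_struct *work)
{
        struct vduse_unmap_work *w =
                container_of(work, struct vduse_unmap_work, work);

        /* safe to sleep here, unlike in the irq/bh unmap context */
        vduse_dev_invalidate_iotlb(w->dev, w->iova, w->size);
        kfree(w);
}

static void vduse_queue_invalidate(struct vduse_dev *dev, u64 iova, u64 size)
{
        struct vduse_unmap_work *w = kzalloc(sizeof(*w), GFP_ATOMIC);

        if (!w)
                return;         /* a real implementation needs a fallback */

        INIT_WORK(&w->work, vduse_unmap_work_fn);
        w->dev = dev;
        w->iova = iova;
        w->size = size;
        schedule_work(&w->work);
}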


> 
> Thanks,
> Yongji
> 
> 



Re: [PATCH rfc 2/3] virtio-net: support receive timestamp

2020-12-29 Thread Jason Wang



- Original Message -
> On Mon, Dec 28, 2020 at 7:55 PM Michael S. Tsirkin  wrote:
> >
> > On Mon, Dec 28, 2020 at 02:30:31PM -0500, Willem de Bruijn wrote:
> > > On Mon, Dec 28, 2020 at 12:29 PM Michael S. Tsirkin 
> > > wrote:
> > > >
> > > > On Mon, Dec 28, 2020 at 11:22:32AM -0500, Willem de Bruijn wrote:
> > > > > From: Willem de Bruijn 
> > > > >
> > > > > Add optional PTP hardware timestamp offload for virtio-net.
> > > > >
> > > > > Accurate RTT measurement requires timestamps close to the wire.
> > > > > Introduce virtio feature VIRTIO_NET_F_RX_TSTAMP. If negotiated, the
> > > > > virtio-net header is expanded with room for a timestamp. A host may
> > > > > pass receive timestamps for all or some packets. A timestamp is valid
> > > > > if non-zero.
> > > > >
> > > > > The timestamp straddles (virtual) hardware domains. Like PTP, use
> > > > > international atomic time (CLOCK_TAI) as global clock base. It is
> > > > > guest responsibility to sync with host, e.g., through kvm-clock.
> > > > >
> > > > > Signed-off-by: Willem de Bruijn 
> > > > > diff --git a/include/uapi/linux/virtio_net.h
> > > > > b/include/uapi/linux/virtio_net.h
> > > > > index f6881b5b77ee..0ffe2eeebd4a 100644
> > > > > --- a/include/uapi/linux/virtio_net.h
> > > > > +++ b/include/uapi/linux/virtio_net.h
> > > > > @@ -57,6 +57,7 @@
> > > > >* Steering */
> > > > >  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23/* Set MAC address */
> > > > >
> > > > > +#define VIRTIO_NET_F_RX_TSTAMP 55/* Host sends TAI
> > > > > receive time */
> > > > >  #define VIRTIO_NET_F_TX_HASH   56/* Guest sends hash report */
> > > > >  #define VIRTIO_NET_F_HASH_REPORT  57 /* Supports hash report */
> > > > >  #define VIRTIO_NET_F_RSS   60/* Supports RSS RX steering */
> > > > > @@ -182,6 +183,17 @@ struct virtio_net_hdr_v1_hash {
> > > > >   };
> > > > >  };
> > > > >
> > > > > +struct virtio_net_hdr_v12 {
> > > > > + struct virtio_net_hdr_v1 hdr;
> > > > > + struct {
> > > > > + __le32 value;
> > > > > + __le16 report;
> > > > > + __le16 flow_state;
> > > > > + } hash;
> > > > > + __virtio32 reserved;
> > > > > + __virtio64 tstamp;
> > > > > +};
> > > > > +
> > > > >  #ifndef VIRTIO_NET_NO_LEGACY
> > > > >  /* This header comes first in the scatter-gather list.
> > > > >   * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated, it
> > > > >   must
> > > >
> > > >
> > > > So it looks like VIRTIO_NET_F_RX_TSTAMP should depend on both
> > > > VIRTIO_NET_F_RX_TSTAMP and VIRTIO_NET_F_HASH_REPORT then?
> > >
> > > Do you mean VIRTIO_NET_F_TX_TSTAMP depends on VIRTIO_NET_F_RX_TSTAMP?
> > >
> > > I think if either is enabled we need to enable the extended layout.
> > > Regardless of whether TX_HASH or HASH_REPORT are enabled. If they are
> > > not, then those fields are ignored.
> >
> > I mean do we waste the 8 bytes for hash if TSTAMP is set but HASH is not?
> > If TSTAMP depends on HASH then point is moot.
> 
> True, but the two features really are independent.
> 
> I did consider using configurable metadata layout depending on
> features negotiated. If there are tons of optional extensions, that
> makes sense. But it is more complex and parsing error prone. With a
> handful of options each of a few bytes, that did not seem worth the
> cost to me at this point.

Consider NFV workloads (64B packets). Most fields of the current vnet
header are even a burden. It might be time to introduce a configurable
or self-descriptive vnet header.


> 
> And importantly, such a mode can always be added later as a separate
> VIRTIO_NET_F_PACKED_HEADER option.

What will that feature provide?

Thanks

> 
> If anything, perhaps if we increase the virtio_net_hdr_* allocation,
> we should allocate some additional reserved space now? As each new
> size introduces quite a bit of boilerplate. Also, e.g., in qemu just
> to pass the sizes between virtio-net driver and vhost-net device.
> 
> 



Re: [PATCH 07/21] vdpa: multiple address spaces support

2020-12-29 Thread Jason Wang



On 2020/12/29 3:28 PM, Eli Cohen wrote:

@@ -43,6 +43,8 @@ struct vdpa_vq_state {
   * @index: device index
   * @features_valid: were features initialized? for legacy guests
   * @nvqs: the number of virtqueues
+ * @ngroups: the number of virtqueue groups
+ * @nas: the number of address spaces

I am not sure these can be categorised as part of the state of the VQ.
It's more of a property so maybe we can have a callback to get the
properties of the VQ?

Moreover, I think they should be handled in the hardware drivers to
return 1 for both ngroups and nas.



We can, but those are static attributes that are not expected to be
changed by the driver.

Introducing callbacks for those static attributes does not give us any
advantage.

For the group id and notification area, since we don't have an
abstraction of vdpa_virtqueue, the only choice is to introduce callbacks
for them.


Maybe it's time to introduce vdpa_virtqueue.

Thanks
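For reference, the shape under discussion is roughly the following; the
exact field placement is illustrative, but per patch 07/21 these are
static, driver-initialized properties of the device rather than per-VQ
state:

struct vdpa_device {
        struct device dev;
        /* ... */
        u32 ngroups;    /* number of virtqueue groups, fixed at registration */
        u32 nas;        /* number of address spaces, fixed at registration */
};

A future struct vdpa_virtqueue could then carry the genuinely per-VQ
properties, such as the group id and the notification area.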



Re: [PATCH 07/21] vdpa: multiple address spaces support

2020-12-29 Thread Jason Wang



On 2020/12/29 3:28 PM, Eli Cohen wrote:

@@ -43,6 +43,8 @@ struct vdpa_vq_state {
   * @index: device index
   * @features_valid: were features initialized? for legacy guests
   * @nvqs: the number of virtqueues
+ * @ngroups: the number of virtqueue groups
+ * @nas: the number of address spaces

I am not sure these can be categorised as part of the state of the VQ.
It's more of a property so maybe we can have a callback to get the
properties of the VQ?



Or maybe there's a misunderstanding of the patch.

Those two attributes belong to vdpa_device instead of vdpa_vq_state,
actually.


Thanks



Re: [PATCH 10/21] vhost: support ASID in IOTLB API

2020-12-29 Thread Jason Wang



On 2020/12/29 6:20 PM, Eli Cohen wrote:

-static int vhost_process_iotlb_msg(struct vhost_dev *dev,
+static int vhost_process_iotlb_msg(struct vhost_dev *dev, u16 asid,
   struct vhost_iotlb_msg *msg)
  {
int ret = 0;
  
+	if (asid != 0)

+   return -EINVAL;
+
mutex_lock(&dev->mutex);
vhost_dev_lock_vqs(dev);
switch (msg->type) {
@@ -1135,6 +1138,7 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
struct vhost_iotlb_msg msg;
size_t offset;
int type, ret;
+   u16 asid = 0;

You assume asid occupies just 16 bits. So maybe you should reserve the
other 16 bits for future extension:

struct vhost_msg_v2 {
 __u32 type;
-   __u32 reserved;
+   __u16 asid;
+   __u16 reserved;
 union {

Moreover, maybe this should be reflected in previous patches that use
the asid:

-static int mlx5_vdpa_set_map(struct vdpa_device *vdev, struct vhost_iotlb 
*iotlb)
+static int mlx5_vdpa_set_map(struct vdpa_device *vdev, u16 asid,
+struct vhost_iotlb *iotlb)

-static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
+static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u16 asid,
 struct vhost_iotlb_msg *msg)

etc.



Good catch.

This is actually a bug in the code, since I want to stick to 32 bits,
which is large enough for e.g. a PASID.


Will fix.

Thanks
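The intended fix, sketched: keep asid at the full 32 bits (large enough for
a PASID) by renaming the existing reserved field rather than splitting it:

struct vhost_msg_v2 {
        __u32 type;
        __u32 asid;             /* was "reserved" */
        union {
                struct vhost_iotlb_msg iotlb;
                __u8 padding[64];
        };
};

Since the old reserved field was required to be zero, asid 0 (the default
address space) keeps existing userspace binary-compatible.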








Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-29 Thread Jason Wang



On 2020/12/29 6:26 PM, Yongji Xie wrote:

On Tue, Dec 29, 2020 at 5:11 PM Jason Wang  wrote:



- Original Message -

On Mon, Dec 28, 2020 at 4:43 PM Jason Wang  wrote:


On 2020/12/28 4:14 PM, Yongji Xie wrote:

I see. So both of the above questions are because VHOST_IOTLB_INVALIDATE
is expected to be synchronous. This needs to be solved by tweaking the
current VDUSE API, or we can re-visit going with descriptor relaying
first.


Actually all vdpa related operations are synchronous in the current
implementation. The ops.set_map/dma_map/dma_unmap should not return
until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
to by userspace. Could that solve this problem?


   I was thinking whether or not we need to generate IOTLB_INVALIDATE
message to VDUSE during dma_unmap (vduse_dev_unmap_page).

If we don't, we're probably fine.


It seems not feasible. This message will also be used in the
virtio-vdpa case to notify userspace to unmap some pages during
consistent dma unmapping. Maybe we can document it to make sure the
users can handle the message correctly.

Just to make sure I understand your point.

Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
coherent DMA?

For 1) you probably need a workqueue to do that since dma unmap can
be done in irq or bh context. And if userspace doesn't do the unmap, it
can still access the bounce buffer (if you don't zap pte)?


I plan to do it in the coherent DMA case.



Any reason for treating coherent DMA differently?



It's true that userspace can
access the dma buffer if userspace doesn't do the unmap. But the dma
pages would not be freed and reused unless user space called munmap()
for them.



I wonder whether or not we could recycle IOVA in this case to avoid the 
IOTLB_UNMAP message.


Thanks




Thanks,
Yongji





Re: [PATCH 11/21] vhost-vdpa: introduce asid based IOTLB

2020-12-29 Thread Jason Wang



On 2020/12/29 7:41 PM, Eli Cohen wrote:

On Wed, Dec 16, 2020 at 02:48:08PM +0800, Jason Wang wrote:

This patch converts the vhost-vDPA device to support multiple IOTLBs
tagged via ASID via hlist. This will be used for supporting multiple
address spaces in the following patches.

Signed-off-by: Jason Wang 
---
  drivers/vhost/vdpa.c | 106 ---
  1 file changed, 80 insertions(+), 26 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index feb6a58df22d..060d5b5b7e64 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -33,13 +33,21 @@ enum {
  
  #define VHOST_VDPA_DEV_MAX (1U << MINORBITS)
  
+#define VHOST_VDPA_IOTLB_BUCKETS 16

+
+struct vhost_vdpa_as {
+   struct hlist_node hash_link;
+   struct vhost_iotlb iotlb;
+   u32 id;
+};
+
  struct vhost_vdpa {
struct vhost_dev vdev;
struct iommu_domain *domain;
struct vhost_virtqueue *vqs;
struct completion completion;
struct vdpa_device *vdpa;
-   struct vhost_iotlb *iotlb;
+   struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
struct device dev;
struct cdev cdev;
atomic_t opened;
@@ -49,12 +57,64 @@ struct vhost_vdpa {
struct eventfd_ctx *config_ctx;
int in_batch;
struct vdpa_iova_range range;
+   int used_as;

This is not really used. Not in this patch and later removed.



Right, will remove this.

Thanks




Re: [PATCH 11/21] vhost-vdpa: introduce asid based IOTLB

2020-12-29 Thread Jason Wang



On 2020/12/29 8:05 PM, Eli Cohen wrote:

+
+static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)

The return value is never interpreted. I think it should either be made
void or return values checked.



Right, will make it void.





+{
+   struct vhost_vdpa_as *as = asid_to_as(v, asid);
+
+   /* Remove default address space is not allowed */
+   if (asid == 0)
+   return -EINVAL;

Can you explain why? I think you have a memory leak due to this, as no
one will ever free the address space with id 0.



Looks like a bug. Will remove this.

Thanks




Re: [PATCH 11/21] vhost-vdpa: introduce asid based IOTLB

2020-12-29 Thread Jason Wang



On 2020/12/29 7:53 PM, Eli Cohen wrote:

+
+static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 
asid)
+{
+   struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS];
+   struct vhost_vdpa_as *as;
+
+   if (asid_to_as(v, asid))
+   return NULL;
+
+   as = kmalloc(sizeof(*as), GFP_KERNEL);

kzalloc()? See comment below.


+   if (!as)
+   return NULL;
+
+   vhost_iotlb_init(&as->iotlb, 0, 0);
+   as->id = asid;
+   hlist_add_head(&as->hash_link, head);
+   ++v->used_as;

Although you eventually ended up removing used_as, this is a bug since
you're incrementing a random value. Maybe it's better to be on the safe
side and use kzalloc() for as above.



Yes, and used_as needs to be removed.

Thanks
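Folding both review comments in, the helper would end up looking roughly
like this sketch: used_as dropped entirely, and kzalloc() used so no field
is left uninitialized:

static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v,
                                                 u32 asid)
{
        struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS];
        struct vhost_vdpa_as *as;

        if (asid_to_as(v, asid))
                return NULL;

        /* kzalloc: every field starts zeroed, nothing random to increment */
        as = kzalloc(sizeof(*as), GFP_KERNEL);
        if (!as)
                return NULL;

        vhost_iotlb_init(&as->iotlb, 0, 0);
        as->id = asid;
        hlist_add_head(&as->hash_link, head);

        return as;
}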









Re: [PATCH 12/21] vhost-vdpa: introduce uAPI to get the number of virtqueue groups

2020-12-29 Thread Jason Wang



On 2020/12/29 8:24 PM, Eli Cohen wrote:

On Wed, Dec 16, 2020 at 02:48:09PM +0800, Jason Wang wrote:

Follows the vDPA support for multiple address spaces, this patch
introduce uAPI for the userspace to know the number of virtqueue
groups supported by the vDPA device.

Can you explain what exactly you mean by userspace?



It's the userspace that uses the uAPI introduced in this patch.



Is it just qemu, or
is it destined for the virtio_net driver run by the qemu process?



It could be Qemu, DPDK, or another userspace program.

The guest virtio-net driver will not use this but talks to the virtio 
device emulated by Qemu.




Also can you say for what purpose?



This can be used to facilitate checking whether the control vq can be
supported.


E.g. if the device supports fewer than 2 groups, qemu won't advertise
the control vq.


Yes, #groups could be inferred from GET_VRING_GROUP. But that's not as
straightforward as this.


Thanks





Signed-off-by: Jason Wang 
---
  drivers/vhost/vdpa.c   | 4 
  include/uapi/linux/vhost.h | 3 +++
  2 files changed, 7 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 060d5b5b7e64..1ba5901b28e7 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -536,6 +536,10 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
case VHOST_VDPA_GET_VRING_NUM:
r = vhost_vdpa_get_vring_num(v, argp);
break;
+   case VHOST_VDPA_GET_GROUP_NUM:
+   r = copy_to_user(argp, &v->vdpa->ngroups,
+sizeof(v->vdpa->ngroups));
+   break;
case VHOST_SET_LOG_BASE:
case VHOST_SET_LOG_FD:
r = -ENOIOCTLCMD;
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 59c6c0fbaba1..8a4e6e426bbf 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -145,4 +145,7 @@
  /* Get the valid iova range */
  #define VHOST_VDPA_GET_IOVA_RANGE _IOR(VHOST_VIRTIO, 0x78, \
 struct vhost_vdpa_iova_range)
+/* Get the number of virtqueue groups. */
+#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x79, unsigned int)
+
  #endif
--
2.25.1
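A hypothetical userspace check built on this ioctl might look as follows;
enable_ctrl_vq() stands in for whatever the VMM does to advertise the
control virtqueue:

unsigned int ngroups = 0;

if (ioctl(vdpa_fd, VHOST_VDPA_GET_GROUP_NUM, &ngroups) < 0)
        ngroups = 1;            /* older kernel: assume a single group */

/* the control vq needs its own group so it can be placed in its own
 * address space, separate from the data vqs */
if (ngroups >= 2)
        enable_ctrl_vq();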





Re: [PATCH rfc 2/3] virtio-net: support receive timestamp

2020-12-30 Thread Jason Wang



On 2020/12/29 10:20 PM, Willem de Bruijn wrote:

On Tue, Dec 29, 2020 at 4:23 AM Jason Wang  wrote:



- Original Message -

On Mon, Dec 28, 2020 at 7:55 PM Michael S. Tsirkin  wrote:

On Mon, Dec 28, 2020 at 02:30:31PM -0500, Willem de Bruijn wrote:

On Mon, Dec 28, 2020 at 12:29 PM Michael S. Tsirkin 
wrote:

On Mon, Dec 28, 2020 at 11:22:32AM -0500, Willem de Bruijn wrote:

From: Willem de Bruijn 

Add optional PTP hardware timestamp offload for virtio-net.

Accurate RTT measurement requires timestamps close to the wire.
Introduce virtio feature VIRTIO_NET_F_RX_TSTAMP. If negotiated, the
virtio-net header is expanded with room for a timestamp. A host may
pass receive timestamps for all or some packets. A timestamp is valid
if non-zero.

The timestamp straddles (virtual) hardware domains. Like PTP, use
international atomic time (CLOCK_TAI) as global clock base. It is
guest responsibility to sync with host, e.g., through kvm-clock.

Signed-off-by: Willem de Bruijn 
diff --git a/include/uapi/linux/virtio_net.h
b/include/uapi/linux/virtio_net.h
index f6881b5b77ee..0ffe2eeebd4a 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,7 @@
* Steering */
  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23/* Set MAC address */

+#define VIRTIO_NET_F_RX_TSTAMP 55/* Host sends TAI
receive time */
  #define VIRTIO_NET_F_TX_HASH   56/* Guest sends hash report */
  #define VIRTIO_NET_F_HASH_REPORT  57 /* Supports hash report */
  #define VIRTIO_NET_F_RSS   60/* Supports RSS RX steering */
@@ -182,6 +183,17 @@ struct virtio_net_hdr_v1_hash {
   };
  };

+struct virtio_net_hdr_v12 {
+ struct virtio_net_hdr_v1 hdr;
+ struct {
+ __le32 value;
+ __le16 report;
+ __le16 flow_state;
+ } hash;
+ __virtio32 reserved;
+ __virtio64 tstamp;
+};
+
  #ifndef VIRTIO_NET_NO_LEGACY
  /* This header comes first in the scatter-gather list.
   * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated, it
   must


So it looks like VIRTIO_NET_F_RX_TSTAMP should depend on both
VIRTIO_NET_F_RX_TSTAMP and VIRTIO_NET_F_HASH_REPORT then?

Do you mean VIRTIO_NET_F_TX_TSTAMP depends on VIRTIO_NET_F_RX_TSTAMP?

I think if either is enabled we need to enable the extended layout.
Regardless of whether TX_HASH or HASH_REPORT are enabled. If they are
not, then those fields are ignored.

I mean do we waste the 8 bytes for hash if TSTAMP is set but HASH is not?
If TSTAMP depends on HASH then point is moot.

True, but the two features really are independent.

I did consider using configurable metadata layout depending on
features negotiated. If there are tons of optional extensions, that
makes sense. But it is more complex and parsing error prone. With a
handful of options each of a few bytes, that did not seem worth the
cost to me at this point.

Consider NFV workloads (64B packets). Most fields of the current vnet
header are even a burden.

Such workloads will not negotiate these features, I imagine.

I think this could only become an issue if a small packet workload
would want to negotiate one optional feature, without the others.



Yes.



Even then, the actual cost of a few untouched bytes is negligible, as
long as the headers don't span cache-lines.



Right. It looks to me that with hash report support the current header
already exceeds 32 bytes, which is the cacheline size for some archs.
And it is easy to imagine that after two or more features it could grow
larger than 64 bytes.




  I expect it to be cheaper
than having to parse a dynamic layout.



The only overhead is probably analyzing "type" which should be cheap I 
guess.






It might be time to introduce a configurable or
self-descriptive vnet header.

I did briefly toy with a type-val encoding. Not type-val-len, as all
options have fixed length (so far), so length can be left implicit.
But as each feature is negotiated beforehand, the type can be left
implicit too. Leaving just the data elements tightly packed. As long
as order is defined.



Right, so in this case there should be even no overhead.





And importantly, such a mode can always be added later as a separate
VIRTIO_NET_F_PACKED_HEADER option.

What will that feature provide?

The tightly packed data elements. Strip out the elements for
non-negotiated features. So if either tstamp option is negotiated, but
hash is not, exclude the 64b of hash fields. Note that I'm not
actually arguing for this (at this point).



I second this.

Thanks
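To make the packed-header idea concrete: the header length becomes a pure
function of the negotiated feature bits, with each feature contributing a
fixed-size field in a defined order. A sketch, using the constants from the
quoted patch; the helper itself is illustrative:

static size_t vnet_hdr_len(u64 features)
{
        size_t len = sizeof(struct virtio_net_hdr_v1);

        if (features & ((1ULL << VIRTIO_NET_F_HASH_REPORT) |
                        (1ULL << VIRTIO_NET_F_TX_HASH)))
                len += 8;       /* hash value + report + flow_state */
        if (features & (1ULL << VIRTIO_NET_F_RX_TSTAMP))
                len += 8;       /* TAI receive timestamp */

        return len;
}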



Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-30 Thread Jason Wang



On 2020/12/30 3:09 PM, Yongji Xie wrote:

On Wed, Dec 30, 2020 at 2:11 PM Jason Wang  wrote:


On 2020/12/29 6:26 PM, Yongji Xie wrote:

On Tue, Dec 29, 2020 at 5:11 PM Jason Wang  wrote:


- Original Message -

On Mon, Dec 28, 2020 at 4:43 PM Jason Wang  wrote:

On 2020/12/28 4:14 PM, Yongji Xie wrote:

I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
is expected to be synchronous. This need to be solved by tweaking the
current VDUSE API or we can re-visit to go with descriptors relaying
first.


Actually all vdpa related operations are synchronous in current
implementation. The ops.set_map/dma_map/dma_unmap should not return
until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
by userspace. Could it solve this problem?

I was thinking whether or not we need to generate IOTLB_INVALIDATE
message to VDUSE during dma_unmap (vduse_dev_unmap_page).

If we don't, we're probably fine.


It seems not feasible. This message will also be used in the
virtio-vdpa case to notify userspace to unmap some pages during
consistent dma unmapping. Maybe we can document it to make sure the
users can handle the message correctly.

Just to make sure I understand your point.

Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
coherent DMA?

For 1) you probably need a workqueue to do that since dma unmap can
be done in irq or bh context. And if userspace doesn't do the unmap, it
can still access the bounce buffer (if you don't zap pte)?


I plan to do it in the coherent DMA case.


Any reason for treating coherent DMA differently?


Now the memory of the bounce buffer is allocated page by page in the
page fault handler, so it can't be used in the coherent DMA mapping
case, which needs memory with contiguous virtual addresses. I can use
vmalloc() to do the allocation for the bounce buffer instead, but it
might cause some memory waste. Any suggestions?



I may be missing something, but I don't see a relationship between
IOTLB_UNMAP and vmalloc().






It's true that userspace can
access the dma buffer if userspace doesn't do the unmap. But the dma
pages would not be freed and reused unless userspace called munmap()
on them.


I wonder whether or not we could recycle IOVAs in this case to avoid the
IOTLB_UNMAP message.


We can achieve that if we use vmalloc() to do the allocation for the
bounce buffer, which can then be used in the coherent DMA mapping case. But
it looks like we still have no way to avoid the IOTLB_UNMAP message in
the vhost-vdpa case.



I think that's fine. For virtio-vdpa, from the VDUSE userspace perspective, 
it works like a driver that is using SWIOTLB in this case.


Thanks




Thanks,
Yongji





Re: [PATCH rfc 2/3] virtio-net: support receive timestamp

2020-12-30 Thread Jason Wang



On 2020/12/29 上午8:57, Willem de Bruijn wrote:

On Mon, Dec 28, 2020 at 5:59 PM Jakub Kicinski  wrote:

On Mon, 28 Dec 2020 11:22:32 -0500 Willem de Bruijn wrote:

From: Willem de Bruijn 

Add optional PTP hardware timestamp offload for virtio-net.

Accurate RTT measurement requires timestamps close to the wire.
Introduce virtio feature VIRTIO_NET_F_RX_TSTAMP. If negotiated, the
virtio-net header is expanded with room for a timestamp. A host may
pass receive timestamps for all or some packets. A timestamp is valid
if non-zero.

The timestamp straddles (virtual) hardware domains. Like PTP, use
international atomic time (CLOCK_TAI) as the global clock base. It is the
guest's responsibility to sync with the host, e.g., through kvm-clock.
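
As a concrete illustration (a sketch only, not part of the patch): a guest driver receiving such a header could hand the value to the stack roughly like this, where the field name and its encoding as CLOCK_TAI nanoseconds are assumptions based on this proposal:

/* Sketch: record the proposed receive timestamp on the skb. */
static void virtnet_record_rx_tstamp(struct sk_buff *skb, __le64 tstamp)
{
	u64 ns = le64_to_cpu(tstamp);

	if (ns)	/* zero means the host supplied no timestamp */
		skb_hwtstamps(skb)->hwtstamp = ns_to_ktime(ns);
}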

Would this not be confusing to some user space SW to have a NIC with
no PHC deliver HW stamps?

I'd CC Richard on this, unless you already discussed with him offline.

Thanks, good point. I should have included Richard.

There is a well-understood method for synchronizing guest and host
clocks in KVM using ptp_kvm. For virtual environments without NIC
hardware offload, when the host timestamps in software, this suffices.

Syncing the host with the NIC is assumed if the host advertises the feature
and implements it using real hardware timestamps.



Or it could be useful for virtio hardware when there's no KVM that 
provides PTP.


Thanks








Re: [PATCH 12/21] vhost-vdpa: introduce uAPI to get the number of virtqueue groups

2020-12-30 Thread Jason Wang



On 2020/12/30 下午6:05, Eli Cohen wrote:

On Wed, Dec 16, 2020 at 02:48:09PM +0800, Jason Wang wrote:

Following the vDPA support for multiple address spaces, this patch
introduces a uAPI for userspace to know the number of virtqueue
groups supported by the vDPA device.

Signed-off-by: Jason Wang
---
  drivers/vhost/vdpa.c   | 4 
  include/uapi/linux/vhost.h | 3 +++
  2 files changed, 7 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 060d5b5b7e64..1ba5901b28e7 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -536,6 +536,10 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
case VHOST_VDPA_GET_VRING_NUM:
r = vhost_vdpa_get_vring_num(v, argp);
break;
+   case VHOST_VDPA_GET_GROUP_NUM:
+   r = copy_to_user(argp, &v->vdpa->ngroups,
+sizeof(v->vdpa->ngroups));
+   break;

Are this and the other ioctls already supported in qemu?



Not yet, the prototype is under development.

I tested the series with a small, dedicated userspace program.

Thanks
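
For reference, the kind of minimal test program mentioned above could look like this; the device node path is an assumption, and VHOST_VDPA_GET_GROUP_NUM is the ioctl this series introduces:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/vhost.h>

int main(void)
{
	__u32 ngroups = 0;
	int fd = open("/dev/vhost-vdpa-0", O_RDWR);	/* path is an assumption */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, VHOST_VDPA_GET_GROUP_NUM, &ngroups) < 0) {
		perror("VHOST_VDPA_GET_GROUP_NUM");
		return 1;
	}
	printf("virtqueue groups: %u\n", ngroups);
	return 0;
}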








Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-30 Thread Jason Wang



On 2020/12/30 下午6:12, Yongji Xie wrote:

On Wed, Dec 30, 2020 at 4:41 PM Jason Wang  wrote:


On 2020/12/30 下午3:09, Yongji Xie wrote:

On Wed, Dec 30, 2020 at 2:11 PM Jason Wang  wrote:

On 2020/12/29 下午6:26, Yongji Xie wrote:

On Tue, Dec 29, 2020 at 5:11 PM Jason Wang  wrote:

- Original Message -

On Mon, Dec 28, 2020 at 4:43 PM Jason Wang  wrote:

On 2020/12/28 下午4:14, Yongji Xie wrote:

I see. So both of the above questions arise because VHOST_IOTLB_INVALIDATE
is expected to be synchronous. This needs to be solved by tweaking the
current VDUSE API, or we can revisit going with descriptor relaying
first.


Actually, all vdpa-related operations are synchronous in the current
implementation. The ops.set_map/dma_map/dma_unmap should not return
until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
to by userspace. Could that solve this problem?

I was thinking about whether or not we need to generate an IOTLB_INVALIDATE
message to VDUSE during dma_unmap (vduse_dev_unmap_page).

If we don't, we're probably fine.


That doesn't seem feasible. This message will also be used in the
virtio-vdpa case to notify userspace to unmap some pages during
coherent DMA unmapping. Maybe we can document it to make sure
users can handle the message correctly.

Just to make sure I understand your point.

Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
coherent DMA?

For 1) you probably need a workqueue to do that, since dma unmap can
be done in irq or bh context. And if userspace doesn't do the unmap, it
can still access the bounce buffer (if you don't zap the PTEs)?


I plan to do it in the coherent DMA case.

Any reason for treating coherent DMA differently?


Right now the memory of the bounce buffer is allocated page by page in the
page fault handler, so it can't be used in the coherent DMA mapping case,
which needs memory with contiguous virtual addresses. I could use
vmalloc() to do the allocation for the bounce buffer instead, but that might
cause some memory waste. Any suggestions?


I may be missing something, but I don't see a relationship between
IOTLB_UNMAP and vmalloc().


In the vmalloc() case, the coherent DMA pages will be taken from the
memory allocated by vmalloc(), so IOTLB_UNMAP is not needed anymore
during coherent DMA unmapping: the vmalloc'ed memory, which has been
mapped into the userspace address space during initialization, can
be reused. And userspace should not unmap the region until we destroy
the device.



Just to make sure I understand. My understanding is that IOTLB_UNMAP is 
only needed when there's a change in the mapping from IOVA to page.


So if we stick to the mapping, e.g. during dma_unmap we just put the IOVA 
on a free list to be used by the next IOVA allocation, IOTLB_UNMAP could be 
avoided.


So we are not limited by how the pages are actually allocated?

Thanks





It's true that userspace can
access the dma buffer if userspace doesn't do the unmap. But the dma
pages would not be freed and reused unless userspace called munmap()
on them.

I wonder whether or not we could recycle IOVAs in this case to avoid the
IOTLB_UNMAP message.


We can achieve that if we use vmalloc() to do the allocation for the
bounce buffer, which can then be used in the coherent DMA mapping case. But
it looks like we still have no way to avoid the IOTLB_UNMAP message in
the vhost-vdpa case.


I think that's fine. For virtio-vdpa, from the VDUSE userspace perspective,
it works like a driver that is using SWIOTLB in this case.


OK, will do it in v3.

Thanks,
Yongji





Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-30 Thread Jason Wang



On 2020/12/31 下午1:15, Yongji Xie wrote:

On Thu, Dec 31, 2020 at 10:49 AM Jason Wang  wrote:


On 2020/12/30 下午6:12, Yongji Xie wrote:

On Wed, Dec 30, 2020 at 4:41 PM Jason Wang  wrote:

On 2020/12/30 下午3:09, Yongji Xie wrote:

On Wed, Dec 30, 2020 at 2:11 PM Jason Wang  wrote:

On 2020/12/29 下午6:26, Yongji Xie wrote:

On Tue, Dec 29, 2020 at 5:11 PM Jason Wang  wrote:

- Original Message -

On Mon, Dec 28, 2020 at 4:43 PM Jason Wang  wrote:

On 2020/12/28 下午4:14, Yongji Xie wrote:

I see. So both of the above questions arise because VHOST_IOTLB_INVALIDATE
is expected to be synchronous. This needs to be solved by tweaking the
current VDUSE API, or we can revisit going with descriptor relaying
first.


Actually, all vdpa-related operations are synchronous in the current
implementation. The ops.set_map/dma_map/dma_unmap should not return
until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
to by userspace. Could that solve this problem?

I was thinking about whether or not we need to generate an IOTLB_INVALIDATE
message to VDUSE during dma_unmap (vduse_dev_unmap_page).

If we don't, we're probably fine.


That doesn't seem feasible. This message will also be used in the
virtio-vdpa case to notify userspace to unmap some pages during
coherent DMA unmapping. Maybe we can document it to make sure
users can handle the message correctly.

Just to make sure I understand your point.

Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
coherent DMA?

For 1) you probably need a workqueue to do that, since dma unmap can
be done in irq or bh context. And if userspace doesn't do the unmap, it
can still access the bounce buffer (if you don't zap the PTEs)?


I plan to do it in the coherent DMA case.

Any reason for treating coherent DMA differently?


Right now the memory of the bounce buffer is allocated page by page in the
page fault handler, so it can't be used in the coherent DMA mapping case,
which needs memory with contiguous virtual addresses. I could use
vmalloc() to do the allocation for the bounce buffer instead, but that might
cause some memory waste. Any suggestions?

I may be missing something, but I don't see a relationship between
IOTLB_UNMAP and vmalloc().


In the vmalloc() case, the coherent DMA pages will be taken from the
memory allocated by vmalloc(), so IOTLB_UNMAP is not needed anymore
during coherent DMA unmapping: the vmalloc'ed memory, which has been
mapped into the userspace address space during initialization, can
be reused. And userspace should not unmap the region until we destroy
the device.


Just to make sure I understand. My understanding is that IOTLB_UNMAP is
only needed when there's a change in the mapping from IOVA to page.


Yes, that's true.


So if we stick to the mapping, e.g. during dma_unmap we just put the IOVA
on a free list to be used by the next IOVA allocation, IOTLB_UNMAP could be
avoided.

So we are not limited by how the pages are actually allocated?


In coherent DMA cases, we need to return some memory with contiguous
kernel virtual addresses. That is the reason why we need vmalloc()
here. If we allocate the memory page by page, the corresponding kernel
virtual addresses in a contiguous IOVA range might not be contiguous.
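
A small sketch of the constraint being described (illustration only; bounce_alloc_coherent() is a hypothetical stand-in, not code from the series):

#include <linux/vmalloc.h>

static void *bounce_alloc_coherent(size_t size)
{
	/*
	 * Allocating page by page yields pages whose kernel virtual
	 * addresses are unrelated even when their IOVAs are contiguous,
	 * e.g. page0 -> 0xffff888001234000, page1 -> 0xffff8880abcd0000.
	 * Coherent DMA callers expect one contiguous VA range, so the
	 * buffer must come from a VA-contiguous source.
	 */
	return vmalloc(size);	/* VA-contiguous; backing pages may be scattered */

	/*
	 * Alternatively, alloc_pages_exact(size, GFP_KERNEL) returns
	 * memory that is both VA- and physically contiguous, which is
	 * what the series does.
	 */
}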



Yes, but we can do that as has been done in the series 
(alloc_pages_exact()). Or do you mean it would be a little bit hard to 
recycle IOVAs/pages here?


Thanks



And in streaming DMA cases, there is no such limit. So another choice is
to use vmalloc'ed memory only for coherent DMA cases.

Not sure if this is clear to you.

Thanks,
Yongji





Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-30 Thread Jason Wang



On 2020/12/31 下午2:52, Yongji Xie wrote:

On Thu, Dec 31, 2020 at 1:50 PM Jason Wang  wrote:


On 2020/12/31 下午1:15, Yongji Xie wrote:

On Thu, Dec 31, 2020 at 10:49 AM Jason Wang  wrote:

On 2020/12/30 下午6:12, Yongji Xie wrote:

On Wed, Dec 30, 2020 at 4:41 PM Jason Wang  wrote:

On 2020/12/30 下午3:09, Yongji Xie wrote:

On Wed, Dec 30, 2020 at 2:11 PM Jason Wang  wrote:

On 2020/12/29 下午6:26, Yongji Xie wrote:

On Tue, Dec 29, 2020 at 5:11 PM Jason Wang  wrote:

- Original Message -

On Mon, Dec 28, 2020 at 4:43 PM Jason Wang  wrote:

On 2020/12/28 下午4:14, Yongji Xie wrote:

I see. So both of the above questions arise because VHOST_IOTLB_INVALIDATE
is expected to be synchronous. This needs to be solved by tweaking the
current VDUSE API, or we can revisit going with descriptor relaying
first.


Actually, all vdpa-related operations are synchronous in the current
implementation. The ops.set_map/dma_map/dma_unmap should not return
until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
to by userspace. Could that solve this problem?

I was thinking about whether or not we need to generate an IOTLB_INVALIDATE
message to VDUSE during dma_unmap (vduse_dev_unmap_page).

If we don't, we're probably fine.


That doesn't seem feasible. This message will also be used in the
virtio-vdpa case to notify userspace to unmap some pages during
coherent DMA unmapping. Maybe we can document it to make sure
users can handle the message correctly.

Just to make sure I understand your point.

Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
coherent DMA?

For 1) you probably need a workqueue to do that, since dma unmap can
be done in irq or bh context. And if userspace doesn't do the unmap, it
can still access the bounce buffer (if you don't zap the PTEs)?


I plan to do it in the coherent DMA case.

Any reason for treating coherent DMA differently?


Right now the memory of the bounce buffer is allocated page by page in the
page fault handler, so it can't be used in the coherent DMA mapping case,
which needs memory with contiguous virtual addresses. I could use
vmalloc() to do the allocation for the bounce buffer instead, but that might
cause some memory waste. Any suggestions?

I may be missing something, but I don't see a relationship between
IOTLB_UNMAP and vmalloc().


In the vmalloc() case, the coherent DMA pages will be taken from the
memory allocated by vmalloc(), so IOTLB_UNMAP is not needed anymore
during coherent DMA unmapping: the vmalloc'ed memory, which has been
mapped into the userspace address space during initialization, can
be reused. And userspace should not unmap the region until we destroy
the device.

Just to make sure I understand. My understanding is that IOTLB_UNMAP is
only needed when there's a change in the mapping from IOVA to page.


Yes, that's true.


So if we stick to the mapping, e.g. during dma_unmap we just put the IOVA
on a free list to be used by the next IOVA allocation, IOTLB_UNMAP could be
avoided.

So we are not limited by how the pages are actually allocated?


In coherent DMA cases, we need to return some memory with contiguous
kernel virtual addresses. That is the reason why we need vmalloc()
here. If we allocate the memory page by page, the corresponding kernel
virtual addresses in a contiguous IOVA range might not be contiguous.


Yes, but we can do that as has been done in the series
(alloc_pages_exact()). Or do you mean it would be a little bit hard to
recycle IOVAs/pages here?


Yes, it might be hard to reuse the memory. For example, we first
allocate one IOVA/page during dma_map, then the IOVA is freed during
dma_unmap. We can't actually reuse this single page if we need a
two-page area in the next IOVA allocation. So the best way is to use
IOTLB_UNMAP to free this single page during dma_unmap too.

Thanks,
Yongji



I get you now. Then I agree: let's go with IOTLB_UNMAP.

Thanks








Re: [PATCH linux-next v2 1/7] vdpa_sim_net: Make mac address array static

2021-01-03 Thread Jason Wang



On 2021/1/4 上午11:31, Parav Pandit wrote:

MAC address array is used only in vdpa_sim_net.c.
Hence, keep it static.

Signed-off-by: Parav Pandit 
---
Changelog:
v1->v2:
  - new patch
---
  drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim_net.c 
b/drivers/vdpa/vdpa_sim/vdpa_sim_net.c
index c10b6981fdab..f0482427186b 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim_net.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim_net.c
@@ -33,7 +33,7 @@ static char *macaddr;
  module_param(macaddr, charp, 0);
  MODULE_PARM_DESC(macaddr, "Ethernet MAC address");
  
-u8 macaddr_buf[ETH_ALEN];

+static u8 macaddr_buf[ETH_ALEN];
  
  static struct vdpasim *vdpasim_net_dev;
  



Acked-by: Jason Wang 




Re: [PATCH linux-next v2 4/7] vdpa: Define vdpa mgmt device, ops and a netlink interface

2021-01-03 Thread Jason Wang



On 2021/1/4 上午11:31, Parav Pandit wrote:

To add one or more VDPA devices, define a management device which
allows adding or removing vdpa devices. A management device defines a
set of callbacks to manage vdpa devices.

To begin with, it defines add and remove callbacks through which a user
defined vdpa device can be added or removed.

A management device is identified by a unique handle composed of the
management device name and, optionally, the bus name.

Hence, introduce a routine through which a driver can register a
management device and its callback operations for adding and removing
a vdpa device.

Introduce a vdpa netlink socket family so that the user can query management
devices and their attributes.

Example of showing a vdpa management device which allows creating vdpa
devices of the networking class (device id = 0x1) of virtio specification 1.1,
section 5.1.1.

$ vdpa mgmtdev show
vdpasim_net:
   supported_classes:
 net

Example of showing vdpa management device in JSON format.

$ vdpa mgmtdev show -jp
{
 "show": {
 "vdpasim_net": {
 "supported_classes": [ "net" ]
 }
 }
}

Signed-off-by: Parav Pandit
Reviewed-by: Eli Cohen
Reviewed-by: Jason Wang
---
Changelog:
v1->v2:
  - rebased
  - updated commit log example for management device name from
"vdpasim" to "vdpasim_net"
  - removed device_id as net and block management devices are separated



So I wonder whether there could be a type of management device that can 
deal with multiple types of virtio devices. If yes, we probably need to 
add the device id back.


Thanks
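
For orientation, a sketch of the registration flow the commit log describes, using the names proposed in this series (treat the exact signatures as illustrative, and vdpasim_net_mgmtdev/id_table as assumed surrounding definitions):

static int vdpasim_net_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
{
	/* create and register a simulated vdpa net device named @name */
	return 0;
}

static void vdpasim_net_dev_del(struct vdpa_mgmt_dev *mdev,
				struct vdpa_device *dev)
{
	/* unregister and free the device */
}

static const struct vdpa_mgmtdev_ops vdpasim_net_mgmtdev_ops = {
	.dev_add = vdpasim_net_dev_add,
	.dev_del = vdpasim_net_dev_del,
};

static struct vdpa_mgmt_dev mgmt_dev = {
	.device = &vdpasim_net_mgmtdev,	/* a plain struct device */
	.id_table = id_table,		/* supported classes, e.g. net */
	.ops = &vdpasim_net_mgmtdev_ops,
};

static int vdpasim_net_mgmtdev_init(void)
{
	return vdpa_mgmtdev_register(&mgmt_dev);
}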




Re: [PATCH linux-next v2 7/7] vdpa_sim_net: Add support for user supported devices

2021-01-03 Thread Jason Wang



On 2021/1/4 上午11:31, Parav Pandit wrote:

  static int __init vdpasim_net_init(void)
  {
int ret = 0;
@@ -176,6 +264,8 @@ static int __init vdpasim_net_init(void)
  
  	if (default_device)

ret = vdpasim_net_default_dev_register();
+   else
+   ret = vdpasim_net_mgmtdev_init();
return ret;
  }
  
@@ -183,6 +273,8 @@ static void __exit vdpasim_net_exit(void)

  {
if (default_device)
vdpasim_net_default_dev_unregister();
+   else
+   vdpasim_net_mgmtdev_cleanup();
  }
  
  module_init(vdpasim_net_init);

-- 2.26.2



I wonder what the value is of keeping the default device that is out of 
the control of the management API.


Thanks



Re: [PATCH linux-next v2 7/7] vdpa_sim_net: Add support for user supported devices

2021-01-04 Thread Jason Wang



On 2021/1/4 下午3:21, Parav Pandit wrote:



From: Jason Wang 
Sent: Monday, January 4, 2021 12:35 PM

On 2021/1/4 上午11:31, Parav Pandit wrote:

   static int __init vdpasim_net_init(void)
   {
int ret = 0;
@@ -176,6 +264,8 @@ static int __init vdpasim_net_init(void)

if (default_device)
ret = vdpasim_net_default_dev_register();
+   else
+   ret = vdpasim_net_mgmtdev_init();
return ret;
   }

@@ -183,6 +273,8 @@ static void __exit vdpasim_net_exit(void)
   {
if (default_device)
vdpasim_net_default_dev_unregister();
+   else
+   vdpasim_net_mgmtdev_cleanup();
   }

   module_init(vdpasim_net_init);
-- 2.26.2


I wonder what the value is of keeping the default device that is out of the
control of the management API.

I think we can remove it as I did in the v1 version. And actual vendor 
drivers like mlx5_vdpa should likely only do user-created devices.
I added it only for backward compatibility purposes, but we can remove the 
default simulated vdpa net device.
What do you recommend?



I think we'd better mandate this management API. This can avoid 
vendor-specific configuration that may complicate the management layer.


Thanks




Re: [PATCH linux-next v2 4/7] vdpa: Define vdpa mgmt device, ops and a netlink interface

2021-01-04 Thread Jason Wang



On 2021/1/4 下午3:24, Parav Pandit wrote:



From: Jason Wang 
Sent: Monday, January 4, 2021 12:33 PM

On 2021/1/4 上午11:31, Parav Pandit wrote:

To add one or more VDPA devices, define a management device which
allows adding or removing vdpa devices. A management device defines a
set of callbacks to manage vdpa devices.

To begin with, it defines add and remove callbacks through which a
user defined vdpa device can be added or removed.

A management device is identified by a unique handle composed of the
management device name and, optionally, the bus name.

Hence, introduce a routine through which a driver can register a
management device and its callback operations for adding and removing
a vdpa device.

Introduce a vdpa netlink socket family so that the user can query management
devices and their attributes.

Example of showing a vdpa management device which allows creating vdpa
devices of the networking class (device id = 0x1) of virtio specification 1.1,
section 5.1.1.

$ vdpa mgmtdev show
vdpasim_net:
supported_classes:
  net

Example of showing vdpa management device in JSON format.

$ vdpa mgmtdev show -jp
{
  "show": {
  "vdpasim_net": {
  "supported_classes": [ "net" ]
  }
  }
}

Signed-off-by: Parav Pandit
Reviewed-by: Eli Cohen
Reviewed-by: Jason Wang
---
Changelog:
v1->v2:
   - rebased
   - updated commit log example for management device name from
 "vdpasim" to "vdpasim_net"
   - removed device_id as net and block management devices are
separated


So I wonder whether there could be a type of management device that can
deal with multiple types of virtio devices. If yes, we probably need to add
the device id back.

At this point mlx5 plans to support only net.
It is useful to see what type of vdpa device is supported by a management 
device.

In the future, if a mgmt dev supports multiple types, the user needs to choose 
the desired type.
I guess we can defer this optional type to the future, when such a mgmt device 
will/may be available.



I worry that if we remove device_id, it may give a hint that multiple mgmt 
devices need to be registered if a driver supports multiple types.


So if possible I would like to keep the device_id here.

Thanks




Re: [PATCH 06/21] vdpa: introduce virtqueue groups

2021-01-04 Thread Jason Wang



On 2021/1/4 下午6:04, Stefan Hajnoczi wrote:

On Wed, Dec 16, 2020 at 02:48:03PM +0800, Jason Wang wrote:

This patch introduces virtqueue groups to the vDPA device. A virtqueue
group is the minimal set of virtqueues that must share an address
space. And an address space identifier can only be attached to
a specific virtqueue group.

A new mandated bus operation is introduced to get the virtqueue group
ID for a specific virtqueue.

All the vDPA device drivers were converted to simply support a single
virtqueue group.

Signed-off-by: Jason Wang 
---
  drivers/vdpa/ifcvf/ifcvf_main.c   |  9 -
  drivers/vdpa/mlx5/net/mlx5_vnet.c |  8 +++-
  drivers/vdpa/vdpa.c   |  4 +++-
  drivers/vdpa/vdpa_sim/vdpa_sim.c  | 11 ++-
  include/linux/vdpa.h  | 12 +---
  5 files changed, 37 insertions(+), 7 deletions(-)

Maybe consider calling it iotlb_group or iommu_group so the purpose of
the group is clear?



I'm not sure. The reason I chose "virtqueue group" is:

1) the virtqueue is the only entity that issues DMA
2) "IOMMU group" may cause confusion with the existing IOMMU groups, 
which group devices
3) IOTLB is a vhost concept; we don't have such a definition in the 
virtio spec


Thanks
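
As a reference point, the mandated op is trivial for a converted single-group driver; a sketch following the naming in this series:

static u32 vdpasim_get_vq_group(struct vdpa_device *vdpa, u16 idx)
{
	/* A single virtqueue group: every virtqueue uses address space 0. */
	return 0;
}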




Re: [PATCH linux-next v2 4/7] vdpa: Define vdpa mgmt device, ops and a netlink interface

2021-01-05 Thread Jason Wang



On 2021/1/5 下午2:33, Parav Pandit wrote:



From: Jason Wang 
Sent: Tuesday, January 5, 2021 9:40 AM

On 2021/1/4 下午3:24, Parav Pandit wrote:

From: Jason Wang 
Sent: Monday, January 4, 2021 12:33 PM

On 2021/1/4 上午11:31, Parav Pandit wrote:

To add one or more VDPA devices, define a management device which
allows adding or removing vdpa devices. A management device defines a
set of callbacks to manage vdpa devices.

To begin with, it defines add and remove callbacks through which a
user defined vdpa device can be added or removed.

A management device is identified by a unique handle composed of the
management device name and, optionally, the bus name.

Hence, introduce a routine through which a driver can register a
management device and its callback operations for adding and removing
a vdpa device.

Introduce a vdpa netlink socket family so that the user can query
management devices and their attributes.

Example of showing a vdpa management device which allows creating vdpa
devices of the networking class (device id = 0x1) of virtio specification 1.1,
section 5.1.1.

$ vdpa mgmtdev show
vdpasim_net:
 supported_classes:
   net

Example of showing vdpa management device in JSON format.

$ vdpa mgmtdev show -jp
{
   "show": {
   "vdpasim_net": {
   "supported_classes": [ "net" ]
   }
   }
}

Signed-off-by: Parav Pandit
Reviewed-by: Eli Cohen
Reviewed-by: Jason Wang
---
Changelog:
v1->v2:
- rebased
- updated commit log example for management device name from
  "vdpasim" to "vdpasim_net"
- removed device_id as net and block management devices are
separated

So I wonder whether there could be a type of management device that
can deal with multiple types of virtio devices. If yes, we probably
need to add the device id back.

At this point mlx5 plans to support only net.
It is useful to see what type of vdpa device is supported by a management
device.

In the future, if a mgmt dev supports multiple types, the user needs to
choose the desired type.

I guess we can defer this optional type to the future, when such a mgmt
device will/may be available.


I worry that if we remove device_id, it may give a hint that multiple mgmt
devices need to be registered if a driver supports multiple types.


No, it shouldn't, because we do expose the multiple supported types in the 
mgmtdev attributes.



Right.





So if possible I would like to keep the device_id here.


It's possible to keep it. But with the current drivers, mainly mlx5 and 
vdpa_sim, it is redundant.
Not sure of IFC's plan.
We have been splitting modules to handle net and block differently in mlx5 as 
well as vdpa_sim, so it looks to me that both may end up as separate management 
drivers (and management devices), such as vdpasim_net and vdpasim_block.
mlx5 doesn't have a plan for block yet.



Ok. Then it's fine.

Thanks




Re: [PATCH netdev 0/5] virtio-net support xdp socket zero copy xmit

2021-01-05 Thread Jason Wang



On 2021/1/5 下午5:11, Xuan Zhuo wrote:

The first patch made some adjustments to xsk.



Thanks a lot for the work. It's rather interesting.




The second patch itself can be used as an independent patch to solve the problem
that XDP may fail to load when the number of queues is insufficient.



It would be better to send this as a separate patch. Several people 
have asked for this before.





The third to last patch implements support for xsk in virtio-net.

A practical problem with virtio is that tx interrupts are not very reliable.
There will always be some missing or delayed tx interrupts. So I added
a timer to solve this problem. Of course, considering performance,
the timer only triggers when the ring of the network card is full.



This is sub-optimal. We need to figure out the root cause. We haven't 
seen such an issue before.


Several questions:

- is tx interrupt enabled?
- can you still see the issue if you disable event index?
- which backend did you use? qemu or vhost(user)?




Regarding virtio-net support for xsk's zero-copy rx, I am also
developing it, but I found that the modification may be relatively large, so I
kept this patch set separate from the code related to xsk zero-copy
rx.



That's fine, but a question here.

How is multiqueue being handled here? I'm asking since there's no 
programmable filter/director support in the virtio spec now.


Thanks




Xuan Zhuo (5):
   xsk: support get page for drv
   virtio-net: support XDP_TX when not more queues
   virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx
   xsk, virtio-net: prepare for support xsk
   virtio-net, xsk: virtio-net support xsk zero copy tx

  drivers/net/virtio_net.c| 643 +++-
  include/linux/netdevice.h   |   1 +
  include/net/xdp_sock_drv.h  |  10 +
  include/net/xsk_buff_pool.h |   1 +
  net/xdp/xsk_buff_pool.c |  10 +-
  5 files changed, 597 insertions(+), 68 deletions(-)

--
1.8.3.1





Re: [PATCH netdev 0/5] virtio-net support xdp socket zero copy xmit

2021-01-05 Thread Jason Wang



On 2021/1/5 下午8:42, Xuan Zhuo wrote:

On Tue, 5 Jan 2021 17:32:19 +0800, Jason Wang  wrote:

On 2021/1/5 下午5:11, Xuan Zhuo wrote:

The first patch made some adjustments to xsk.


Thanks a lot for the work. It's rather interesting.



The second patch itself can be used as an independent patch to solve the problem
that XDP may fail to load when the number of queues is insufficient.


It would be better to send this as a separate patch. Several people
have asked for this before.



The third to last patch implements support for xsk in virtio-net.

A practical problem with virtio is that tx interrupts are not very reliable.
There will always be some missing or delayed tx interrupts. So I added
a timer to solve this problem. Of course, considering performance,
the timer only triggers when the ring of the network card is full.


This is sub-optimal. We need to figure out the root cause. We haven't
seen such an issue before.

Several questions:

- is tx interrupt enabled?
- can you still see the issue if you disable event index?
- which backend did you use? qemu or vhost(user)?

Sorry, it may just have been a problem with the backend I used here. I just tested the
latest qemu and it did not have this problem. I suppose I should delete the
timer-related code?



Yes, please.







Regarding virtio-net support for xsk's zero-copy rx, I am also
developing it, but I found that the modification may be relatively large, so I
kept this patch set separate from the code related to xsk zero-copy
rx.


That's fine, but a question here.

How is multiqueue being handled here? I'm asking since there's no
programmable filter/director support in the virtio spec now.

Thanks

I don't really understand what you mean. In the case of multiple queues,
there is no problem.



So suppose we bind an xsk to queue 4: how can you make sure the traffic 
is directed to queue 4? One possible solution is to use filters as 
suggested in af_xdp.rst:


  ethtool -N p3p2 rx-flow-hash udp4 fn
  ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
  action 16
...

But virtio-net doesn't have any filters that could be programmed from 
the driver.


Anything I missed here?

Thanks







Xuan Zhuo (5):
xsk: support get page for drv
virtio-net: support XDP_TX when not more queues
virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx
xsk, virtio-net: prepare for support xsk
virtio-net, xsk: virtio-net support xsk zero copy tx

   drivers/net/virtio_net.c| 643 
+++-
   include/linux/netdevice.h   |   1 +
   include/net/xdp_sock_drv.h  |  10 +
   include/net/xsk_buff_pool.h |   1 +
   net/xdp/xsk_buff_pool.c |  10 +-
   5 files changed, 597 insertions(+), 68 deletions(-)

--
1.8.3.1





Re: [PATCH netdev 0/5] virtio-net support xdp socket zero copy xmit

2021-01-05 Thread Jason Wang



On 2021/1/6 上午10:55, Xuan Zhuo wrote:

On Wed, 6 Jan 2021 10:46:43 +0800, Jason Wang  wrote:

On 2021/1/5 下午8:42, Xuan Zhuo wrote:

On Tue, 5 Jan 2021 17:32:19 +0800, Jason Wang  wrote:

On 2021/1/5 下午5:11, Xuan Zhuo wrote:

The first patch made some adjustments to xsk.

Thanks a lot for the work. It's rather interesting.



The second patch itself can be used as an independent patch to solve the problem
that XDP may fail to load when the number of queues is insufficient.

It would be better to send this as a separate patch. Several people
have asked for this before.



The third to last patch implements support for xsk in virtio-net.

A practical problem with virtio is that tx interrupts are not very reliable.
There will always be some missing or delayed tx interrupts. So I added
a timer to solve this problem. Of course, considering performance,
the timer only triggers when the ring of the network card is full.

This is sub-optimal. We need to figure out the root cause. We haven't
seen such an issue before.

Several questions:

- is tx interrupt enabled?
- can you still see the issue if you disable event index?
- which backend did you use? qemu or vhost(user)?

Sorry, it may just have been a problem with the backend I used here. I just tested the
latest qemu and it did not have this problem. I suppose I should delete the
timer-related code?


Yes, please.



Regarding virtio-net support for xsk's zero-copy rx, I am also
developing it, but I found that the modification may be relatively large, so I
kept this patch set separate from the code related to xsk zero-copy
rx.

That's fine, but a question here.

How is multiqueue being handled here? I'm asking since there's no
programmable filter/director support in the virtio spec now.

Thanks

I don't really understand what you mean. In the case of multiple queues,
there is no problem.


So suppose we bind an xsk to queue 4: how can you make sure the traffic
is directed to queue 4? One possible solution is to use filters as
suggested in af_xdp.rst:

    ethtool -N p3p2 rx-flow-hash udp4 fn
    ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
    action 16
...

But virtio-net doesn't have any filters that could be programmed from
the driver.

Anything I missed here?

Thanks

I understand what you mean, this problem does exist, and I encountered it when I
tested qemu.

First of all, the problem only affects recv; this patch is for
xmit. Of course, our real workloads also have recv scenarios.

My solution when developing the upper-level application is to bind to all the
queues to ensure that we can receive the packets we want.



I'm not sure I get you here. Note that one advantage of AF_XDP is that 
it allows an XSK to be bound to a specific queue while the rest can still 
be used by the kernel.




And I think in practice, even if the network card supports filters, we would
still bind to all the queues, because we don't know which queue the traffic we
care about will arrive on.



With the help of filters, the card can select a specific queue based on a 
hash or n-tuple, so it should work?





Regarding virtio-net, I think the core question is whether we
need to deal with this problem in the virtio-net driver. I personally think
we should extend the virtio specification to define this scenario.



Yes, so do you want to do that? It would make virtio-net more 
friendly to AF_XDP. (Or, if you wish, I can post a patch to extend the spec.)





When I tested it, I found that some cloud vendors' implementations guarantee
this queue selection algorithm.



Right, though the spec suggests an automatic steering algorithm, it's not 
mandatory. Vendors can implement their own.


But hash or n-tuple filters should still be useful.

Thanks




Thanks!!




Xuan Zhuo (5):
 xsk: support get page for drv
 virtio-net: support XDP_TX when not more queues
 virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx
 xsk, virtio-net: prepare for support xsk
 virtio-net, xsk: virtio-net support xsk zero copy tx

drivers/net/virtio_net.c| 643 
+++-
include/linux/netdevice.h   |   1 +
include/net/xdp_sock_drv.h  |  10 +
include/net/xsk_buff_pool.h |   1 +
net/xdp/xsk_buff_pool.c |  10 +-
5 files changed, 597 insertions(+), 68 deletions(-)

--
1.8.3.1





Re: [PATCH] vdpa/mlx5: Fix memory key MTT population

2021-01-06 Thread Jason Wang



On 2021/1/6 下午5:05, Eli Cohen wrote:

map_direct_mr() assumed that the number of scatter/gather entries
returned by dma_map_sg_attrs() was equal to the number of segments in
the sgl list. This led to wrong population of the mkey object. Fix this
by properly referring to the returned value.

In addition, get rid of fill_sg() whjich effect is overwritten bu
populate_mtts().



Typo.




Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code")
Signed-off-by: Eli Cohen 
---
  drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
  drivers/vdpa/mlx5/core/mr.c| 28 
  2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index 5c92a576edae..08f742fd2409 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -15,6 +15,7 @@ struct mlx5_vdpa_direct_mr {
struct sg_table sg_head;
int log_size;
int nsg;
+   int nent;
struct list_head list;
u64 offset;
  };
diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
index 4b6195666c58..d300f799efcd 100644
--- a/drivers/vdpa/mlx5/core/mr.c
+++ b/drivers/vdpa/mlx5/core/mr.c
@@ -25,17 +25,6 @@ static int get_octo_len(u64 len, int page_shift)
return (npages + 1) / 2;
  }
  
-static void fill_sg(struct mlx5_vdpa_direct_mr *mr, void *in)

-{
-   struct scatterlist *sg;
-   __be64 *pas;
-   int i;
-
-   pas = MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
-   for_each_sg(mr->sg_head.sgl, sg, mr->nsg, i)
-   (*pas) = cpu_to_be64(sg_dma_address(sg));
-}
-
  static void mlx5_set_access_mode(void *mkc, int mode)
  {
MLX5_SET(mkc, mkc, access_mode_1_0, mode & 0x3);
@@ -45,10 +34,18 @@ static void mlx5_set_access_mode(void *mkc, int mode)
  static void populate_mtts(struct mlx5_vdpa_direct_mr *mr, __be64 *mtt)
  {
struct scatterlist *sg;
+   int nsg = mr->nsg;
+   u64 dma_addr;
+   u64 dma_len;
+   int j = 0;
int i;
  
-	for_each_sg(mr->sg_head.sgl, sg, mr->nsg, i)

-   mtt[i] = cpu_to_be64(sg_dma_address(sg));
+   for_each_sg(mr->sg_head.sgl, sg, mr->nent, i) {
+   for (dma_addr = sg_dma_address(sg), dma_len = sg_dma_len(sg);
+nsg && dma_len;
+nsg--, dma_addr += BIT(mr->log_size), dma_len -= 
BIT(mr->log_size))
+   mtt[j++] = cpu_to_be64(dma_addr);



It looks to me that the mtt entries are also limited by log_size. It would be 
better to explain this a little in the commit log.


Thanks



+   }
  }
  
  static int create_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr *mr)

@@ -64,7 +61,6 @@ static int create_direct_mr(struct mlx5_vdpa_dev *mvdev, 
struct mlx5_vdpa_direct
return -ENOMEM;
  
  	MLX5_SET(create_mkey_in, in, uid, mvdev->res.uid);

-   fill_sg(mr, in);
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
MLX5_SET(mkc, mkc, lw, !!(mr->perm & VHOST_MAP_WO));
MLX5_SET(mkc, mkc, lr, !!(mr->perm & VHOST_MAP_RO));
@@ -276,8 +272,8 @@ static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, 
struct mlx5_vdpa_direct_mr
  done:
mr->log_size = log_entity_size;
mr->nsg = nsg;
-   err = dma_map_sg_attrs(dma, mr->sg_head.sgl, mr->nsg, 
DMA_BIDIRECTIONAL, 0);
-   if (!err)
+   mr->nent = dma_map_sg_attrs(dma, mr->sg_head.sgl, mr->nsg, 
DMA_BIDIRECTIONAL, 0);
+   if (!mr->nent)
goto err_map;
  
  	err = create_direct_mr(mvdev, mr);




Re: [PATCH v1] vdpa/mlx5: Fix memory key MTT population

2021-01-08 Thread Jason Wang



On 2021/1/7 下午3:18, Eli Cohen wrote:

map_direct_mr() assumed that the number of scatter/gather entries
returned by dma_map_sg_attrs() was equal to the number of segments in
the sgl list. This led to wrong population of the mkey object. Fix this
by properly referring to the returned value.

The hardware expects each MTT entry to contain the DMA address of a
contiguous block of memory of size (1 << mr->log_size) bytes.
dma_map_sg_attrs() can coalesce several sg entries into a single
scatter/gather entry of contiguous DMA range so we need to scan the list
and refer to the size of each s/g entry.

In addition, get rid of fill_sg() which effect is overwritten by
populate_mtts().

Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code")
Signed-off-by: Eli Cohen 
---
V0->V1:
1. Fix typos
2. Improve changelog



Acked-by: Jason Wang 




  drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
  drivers/vdpa/mlx5/core/mr.c| 28 
  2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index 5c92a576edae..08f742fd2409 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -15,6 +15,7 @@ struct mlx5_vdpa_direct_mr {
struct sg_table sg_head;
int log_size;
int nsg;
+   int nent;
struct list_head list;
u64 offset;
  };
diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
index 4b6195666c58..d300f799efcd 100644
--- a/drivers/vdpa/mlx5/core/mr.c
+++ b/drivers/vdpa/mlx5/core/mr.c
@@ -25,17 +25,6 @@ static int get_octo_len(u64 len, int page_shift)
return (npages + 1) / 2;
  }
  
-static void fill_sg(struct mlx5_vdpa_direct_mr *mr, void *in)

-{
-   struct scatterlist *sg;
-   __be64 *pas;
-   int i;
-
-   pas = MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
-   for_each_sg(mr->sg_head.sgl, sg, mr->nsg, i)
-   (*pas) = cpu_to_be64(sg_dma_address(sg));
-}
-
  static void mlx5_set_access_mode(void *mkc, int mode)
  {
MLX5_SET(mkc, mkc, access_mode_1_0, mode & 0x3);
@@ -45,10 +34,18 @@ static void mlx5_set_access_mode(void *mkc, int mode)
  static void populate_mtts(struct mlx5_vdpa_direct_mr *mr, __be64 *mtt)
  {
struct scatterlist *sg;
+   int nsg = mr->nsg;
+   u64 dma_addr;
+   u64 dma_len;
+   int j = 0;
int i;
  
-	for_each_sg(mr->sg_head.sgl, sg, mr->nsg, i)

-   mtt[i] = cpu_to_be64(sg_dma_address(sg));
+   for_each_sg(mr->sg_head.sgl, sg, mr->nent, i) {
+   for (dma_addr = sg_dma_address(sg), dma_len = sg_dma_len(sg);
+nsg && dma_len;
+nsg--, dma_addr += BIT(mr->log_size), dma_len -= 
BIT(mr->log_size))
+   mtt[j++] = cpu_to_be64(dma_addr);
+   }
  }
  
  static int create_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr *mr)

@@ -64,7 +61,6 @@ static int create_direct_mr(struct mlx5_vdpa_dev *mvdev, 
struct mlx5_vdpa_direct
return -ENOMEM;
  
  	MLX5_SET(create_mkey_in, in, uid, mvdev->res.uid);

-   fill_sg(mr, in);
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
MLX5_SET(mkc, mkc, lw, !!(mr->perm & VHOST_MAP_WO));
MLX5_SET(mkc, mkc, lr, !!(mr->perm & VHOST_MAP_RO));
@@ -276,8 +272,8 @@ static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, 
struct mlx5_vdpa_direct_mr
  done:
mr->log_size = log_entity_size;
mr->nsg = nsg;
-   err = dma_map_sg_attrs(dma, mr->sg_head.sgl, mr->nsg, 
DMA_BIDIRECTIONAL, 0);
-   if (!err)
+   mr->nent = dma_map_sg_attrs(dma, mr->sg_head.sgl, mr->nsg, 
DMA_BIDIRECTIONAL, 0);
+   if (!mr->nent)
goto err_map;
  
  	err = create_direct_mr(mvdev, mr);




Re: [PATCH net-next 0/2] Introduce XDP_FLAGS_NO_TX flag

2021-01-11 Thread Jason Wang



On 2021/1/9 上午10:49, Charlie Somerville wrote:

This patch series introduces a new flag XDP_FLAGS_NO_TX which prevents
the allocation of additional send queues for XDP programs.



This part I don't understand. Is such a flag a must? I think the answer is 
probably not.


Why not simply do:

1) if we have sufficient TX queues, use dedicated TX queues for XDP_TX
2) if we don't, simply synchronize through a spin_lock [1]

Thanks

[1] https://www.spinics.net/lists/bpf/msg32587.html
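
A rough sketch of option 2), for discussion only (field names follow virtio_net.c; the selection policy and locking discipline here are assumptions, and the caller would release a shared queue with __netif_tx_unlock()):

static struct send_queue *virtnet_xdp_get_sq(struct virtnet_info *vi)
{
	unsigned int qp;

	if (vi->curr_queue_pairs > nr_cpu_ids) {
		/* Enough queues: each CPU owns a dedicated XDP TX queue. */
		qp = vi->curr_queue_pairs - vi->xdp_queue_pairs +
		     smp_processor_id();
		return &vi->sq[qp];
	}

	/* Otherwise share a queue with the normal TX path under its lock. */
	qp = smp_processor_id() % vi->curr_queue_pairs;
	__netif_tx_lock(netdev_get_tx_queue(vi->dev, qp), smp_processor_id());
	return &vi->sq[qp];
}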




Included in this patch series is an implementation of XDP_FLAGS_NO_TX
for the virtio_net driver. This flag is intended to be advisory - not
all drivers must implement support for it.

Many virtualised environments only provide enough virtio_net send queues
for the number of processors allocated to the VM:

# nproc
8
# ethtool --show-channels ens3
Channel parameters for ens3:
Pre-set maximums:
RX: 0
TX: 0
Other:  0
Combined:   8

In this configuration XDP is unusable because the virtio_net driver
always tries to allocate an extra send queue for each processor - even
if the XDP program never uses the XDP_TX functionality.

While XDP_TX is still unavailable in these environments, this new flag
expands the set of XDP programs that can be used.

This is my first contribution to the kernel, so apologies if I've sent
this to the wrong list. I have tried to cc relevant maintainers but
it's possible I may have missed some people. I'm looking forward to
receiving feedback on this change.

Charlie Somerville (2):
   xdp: Add XDP_FLAGS_NO_TX flag
   virtio_net: Implement XDP_FLAGS_NO_TX support

  drivers/net/virtio_net.c | 17 +
  include/uapi/linux/if_link.h |  5 -
  2 files changed, 17 insertions(+), 5 deletions(-)





Re: [PATCH 21/21] vdpasim: control virtqueue support

2021-01-11 Thread Jason Wang



On 2021/1/11 下午8:26, Eli Cohen wrote:

On Wed, Dec 16, 2020 at 02:48:18PM +0800, Jason Wang wrote:

This patch introduces control virtqueue support to the vDPA
simulator. This is a requirement for supporting advanced features like
multiqueue.

A requirement for the control virtqueue is to isolate its memory access
from the rx/tx virtqueues. This is because when using a vDPA device
for a VM, the control virtqueue is not directly assigned to the VM. Userspace
(Qemu) will present a shadow control virtqueue to the guest in order to
record the device state.

The isolation is done via the virtqueue groups and ASID support in
vDPA through vhost-vdpa. The simulator is extended to have:

1) three virtqueues: RXVQ, TXVQ and CVQ (control virtqueue)
2) two virtqueue groups: group 0 contains RXVQ and TXVQ; group 1
contains CVQ
3) two address spaces and the simulator simply implements the address
spaces by mapping it 1:1 to IOTLB.

For the VM use cases, userspace(Qemu) may set AS 0 to group 0 and AS 1
to group 1. So we have:

1) The IOTLB for virtqueue group 0 contains the mappings of guest, so
RX and TX can be assigned to guest directly.
2) The IOTLB for virtqueue group 1 contains the mappings of CVQ which
is the buffers that allocated and managed by VMM only. So CVQ of
vhost-vdpa is visible to VMM only. And Guest can not access the CVQ
of vhost-vdpa.

For the other use cases, since AS 0 is associated to all virtqueue
groups by default. All virtqueues share the same mapping by default.

To demonstrate the function, VIRTIO_NET_F_CTRL_MAC_ADDR is
implemented in the simulator so that the driver can set the mac address.


Hi Jason,

is there any version of qemu/libvirt available with which I can see the
control virtqueue working in action?



Not yet; the qemu part depends on the shadow virtqueue work from Eugenio. 
But it will work as follows:


1) qemu will use a separate address space for the (shadow) control 
virtqueue exposed through vhost-vDPA
2) the commands sent through the control virtqueue by the guest driver will 
be intercepted by qemu

3) Qemu will send those commands to the shadow control virtqueue

Eugenio, any ETA for the new version of shadow virtqueue support in Qemu?

Thanks








Re: [PATCH v1] vhost_vdpa: fix the problem in vhost_vdpa_set_config_call

2021-01-11 Thread Jason Wang



On 2021/1/12 上午10:46, Cindy Lu wrote:

in vhost_vdpa_set_config_call, the cb.private should be vhost_vdpa.



Should be "In"



this cb.private will finally use in vhost_vdpa_config_cb as
vhost_vdpa.Fix this issue



An whitespace is needed before Fix and a full stop after "issue"

Fixes: 776f395004d82 ("vhost_vdpa: Support config interrupt in vdpa")

Acked-by: Jason Wang 

Please post a V2 with the above fixed and cc stable.

Thanks




Signed-off-by: Cindy Lu 
---
  drivers/vhost/vdpa.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ef688c8c0e0e..3fbb9c1f49da 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -319,7 +319,7 @@ static long vhost_vdpa_set_config_call(struct vhost_vdpa 
*v, u32 __user *argp)
struct eventfd_ctx *ctx;
  
  	cb.callback = vhost_vdpa_config_cb;

-   cb.private = v->vdpa;
+   cb.private = v;
if (copy_from_user(&fd, argp, sizeof(fd)))
return  -EFAULT;
  




Re: [PATCH net] vdpa/mlx5: Fix miss to set VIRTIO_NET_S_LINK_UP for virtio_net_config

2020-10-19 Thread Jason Wang



On 2020/10/19 下午5:07, we...@ucloud.cn wrote:

From: wenxu 

Qemu gets virtio_net_config from the vdpa driver, so the vdpa driver
should set the VIRTIO_NET_S_LINK_UP flag in virtio_net_config like
vdpa_sim does. Otherwise the link of the virtio-net NIC in the virtual
machine will never come up.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: wenxu 
---
  drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 74264e59..af6c74c 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1537,6 +1537,8 @@ static int mlx5_vdpa_set_features(struct vdpa_device 
*vdev, u64 features)
ndev->mvdev.actual_features = features & ndev->mvdev.mlx_features;
ndev->config.mtu = __cpu_to_virtio16(mlx5_vdpa_is_little_endian(mvdev),
 ndev->mtu);
+   ndev->config.status = 
__cpu_to_virtio16(mlx5_vdpa_is_little_endian(mvdev),
+  VIRTIO_NET_S_LINK_UP);
return err;
  }
  



Other than the small issue pointed out by Jakub.

Acked-by: Jason Wang 




Re: [PATCH net v2] Revert "virtio-net: ethtool configurable RXCSUM"

2020-10-19 Thread Jason Wang



On 2020/10/20 上午1:32, Michael S. Tsirkin wrote:

This reverts commit 3618ad2a7c0e78e4258386394d5d5f92a3dbccf8.

When the device does not have a control vq (e.g. when using a
version of QEMU based on upstream v0.10 or older, or when specifying
ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_rx_extra=off,ctrl_mac_addr=off
for the device on the QEMU command line), that commit causes a crash:

[   72.229171] kernel BUG at drivers/net/virtio_net.c:1667!
[   72.230266] invalid opcode:  [#1] PREEMPT SMP
[   72.231172] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
5.9.0-rc8-02934-g3618ad2a7c0e7 #1
[   72.231172] EIP: virtnet_send_command+0x120/0x140
[   72.231172] Code: 00 0f 94 c0 8b 7d f0 65 33 3d 14 00 00 00 75 1c 8d 65 f4 
5b 5e 5f 5d c3 66 90 be 01 00 00 00 e9 6e ff ff ff 8d b6 00
+00 00 00 <0f> 0b e8 d9 bb 82 00 eb 17 8d b4 26 00 00 00 00 8d b4 26 00 00 00
[   72.231172] EAX: 000d EBX: f72895c0 ECX: 0017 EDX: 0011
[   72.231172] ESI: f7197800 EDI: ed69bd00 EBP: ed69bcf4 ESP: ed69bc98
[   72.231172] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
[   72.231172] CR0: 80050033 CR2:  CR3: 02c84000 CR4: 000406f0
[   72.231172] Call Trace:
[   72.231172]  ? __virt_addr_valid+0x45/0x60
[   72.231172]  ? ___cache_free+0x51f/0x760
[   72.231172]  ? kobject_uevent_env+0xf4/0x560
[   72.231172]  virtnet_set_guest_offloads+0x4d/0x80
[   72.231172]  virtnet_set_features+0x85/0x120
[   72.231172]  ? virtnet_set_guest_offloads+0x80/0x80
[   72.231172]  __netdev_update_features+0x27a/0x8e0
[   72.231172]  ? kobject_uevent+0xa/0x20
[   72.231172]  ? netdev_register_kobject+0x12c/0x160
[   72.231172]  register_netdevice+0x4fe/0x740
[   72.231172]  register_netdev+0x1c/0x40
[   72.231172]  virtnet_probe+0x728/0xb60
[   72.231172]  ? _raw_spin_unlock+0x1d/0x40
[   72.231172]  ? virtio_vdpa_get_status+0x1c/0x20
[   72.231172]  virtio_dev_probe+0x1c6/0x271
[   72.231172]  really_probe+0x195/0x2e0
[   72.231172]  driver_probe_device+0x26/0x60
[   72.231172]  device_driver_attach+0x49/0x60
[   72.231172]  __driver_attach+0x46/0xc0
[   72.231172]  ? device_driver_attach+0x60/0x60
[   72.231172]  bus_add_driver+0x197/0x1c0
[   72.231172]  driver_register+0x66/0xc0
[   72.231172]  register_virtio_driver+0x1b/0x40
[   72.231172]  virtio_net_driver_init+0x61/0x86
[   72.231172]  ? veth_init+0x14/0x14
[   72.231172]  do_one_initcall+0x76/0x2e4
[   72.231172]  ? rdinit_setup+0x2a/0x2a
[   72.231172]  do_initcalls+0xb2/0xd5
[   72.231172]  kernel_init_freeable+0x14f/0x179
[   72.231172]  ? rest_init+0x100/0x100
[   72.231172]  kernel_init+0xd/0xe0
[   72.231172]  ret_from_fork+0x1c/0x30
[   72.231172] Modules linked in:
[   72.269563] ---[ end trace a6ebc4afea0e6cb1 ]---

The reason is that virtnet_set_features now calls virtnet_set_guest_offloads
unconditionally; it used to call it only when there was something
to configure.

If the device does not have a control vq, everything breaks.
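
For context, a sketch of what a non-revert fix could look like: touch guest offloads only when there is a change and the device has a feature-backed control path. This is not the committed fix (which is the plain revert below); GUEST_OFFLOAD_LRO_MASK comes from the reverted commit:

static int virtnet_set_features(struct net_device *dev,
				netdev_features_t features)
{
	struct virtnet_info *vi = netdev_priv(dev);
	u64 offloads = vi->guest_offloads;

	if (!((dev->features ^ features) & NETIF_F_LRO))
		return 0;	/* nothing to reconfigure */

	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS))
		return -EOPNOTSUPP;	/* no control vq to program it */

	if (features & NETIF_F_LRO)
		offloads |= GUEST_OFFLOAD_LRO_MASK & vi->guest_offloads_capable;
	else
		offloads &= ~GUEST_OFFLOAD_LRO_MASK;

	return virtnet_set_guest_offloads(vi, offloads);
}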

Looking at this some more, I noticed that it's not really checking the
hardware too much. E.g.

 if ((dev->features ^ features) & NETIF_F_LRO) {
 if (features & NETIF_F_LRO)
 offloads |= GUEST_OFFLOAD_LRO_MASK &
 vi->guest_offloads_capable;
 else
 offloads &= ~GUEST_OFFLOAD_LRO_MASK;
 }

and

 (1ULL << VIRTIO_NET_F_GUEST_TSO6) | \
 (1ULL << VIRTIO_NET_F_GUEST_ECN)  | \
 (1ULL << VIRTIO_NET_F_GUEST_UFO))

But there's no guarantee that e.g. VIRTIO_NET_F_GUEST_TSO6 is set.

If it isn't, the command should not send it.

Further

static int virtnet_set_features(struct net_device *dev,
 netdev_features_t features)
{
 struct virtnet_info *vi = netdev_priv(dev);
 u64 offloads = vi->guest_offloads;

seems wrong since guest_offloads is zero initialized,



I'm not sure I follow here.

Did you mean vi->guest_offloads?

We initialize it during probe

    for (i = 0; i < ARRAY_SIZE(guest_offloads); i++)
        if (virtio_has_feature(vi->vdev, guest_offloads[i]))
            set_bit(guest_offloads[i], &vi->guest_offloads);



it does not reflect the state after reset which comes from
the features.

Revert the original commit for now.

Cc: Tonghao Zhang 
Cc: Willem de Bruijn 
Fixes: 3618ad2a7c0e7 ("virtio-net: ethtool configurable RXCSUM")
Reported-by: kernel test robot 
Signed-off-by: Michael S. Tsirkin 
---

changes from v1:
- clarify how to reproduce the bug in the log


  drivers/net/virtio_net.c | 50 +++-
  1 file changed, 13 insertions(+), 37 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d2d2c4a53cf2..21b71148c532 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -68,8 +68,6 @@ static const unsigned long guest_offloads[] = {
(1ULL << VIRTI

Re: [PATCH 1/2] KVM: not register a IRQ bypass producer if unsupported or disabled

2020-10-19 Thread Jason Wang



On 2020/10/19 下午5:06, Zhenzhong Duan wrote:

If posted interrupts are disabled due to a hardware limit or forcibly disabled
by the "intremap=nopost" parameter, return -EINVAL so that the legacy-mode IRQ
isn't registered as an IRQ bypass producer.



Is there any side effect if it is still registered?




With this change, the following message is printed:
"vfio-pci :db:00.0: irq bypass producer (token 60c8cda5) registration 
fails: -22"



I may be missing something, but the patch only touches vhost-vDPA instead of VFIO?

Thanks




...which also hints at whether a vfio or vdpa device works in PI mode or legacy
remapping mode.

Add a print to the vdpa code just like vfio_msi_set_vector_signal() does.

Signed-off-by: Zhenzhong Duan 
---
  arch/x86/kvm/svm/avic.c | 3 +--
  arch/x86/kvm/vmx/vmx.c  | 5 ++---
  drivers/vhost/vdpa.c| 5 +
  3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index ac830cd..316142a 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -814,7 +814,7 @@ int svm_update_pi_irte(struct kvm *kvm, unsigned int 
host_irq,
  
  	if (!kvm_arch_has_assigned_device(kvm) ||

!irq_remapping_cap(IRQ_POSTING_CAP))
-   return 0;
+   return ret;
  
  	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",

 __func__, host_irq, guest_irq, set);
@@ -899,7 +899,6 @@ int svm_update_pi_irte(struct kvm *kvm, unsigned int 
host_irq,
}
}
  
-	ret = 0;

  out:
srcu_read_unlock(&kvm->irq_srcu, idx);
return ret;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f0a9954..1fed6d6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7716,12 +7716,12 @@ static int vmx_update_pi_irte(struct kvm *kvm, unsigned 
int host_irq,
struct kvm_lapic_irq irq;
struct kvm_vcpu *vcpu;
struct vcpu_data vcpu_info;
-   int idx, ret = 0;
+   int idx, ret = -EINVAL;
  
  	if (!kvm_arch_has_assigned_device(kvm) ||

!irq_remapping_cap(IRQ_POSTING_CAP) ||
!kvm_vcpu_apicv_active(kvm->vcpus[0]))
-   return 0;
+   return ret;
  
  	idx = srcu_read_lock(&kvm->irq_srcu);

irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
@@ -7787,7 +7787,6 @@ static int vmx_update_pi_irte(struct kvm *kvm, unsigned 
int host_irq,
}
}
  
-	ret = 0;

  out:
srcu_read_unlock(&kvm->irq_srcu, idx);
return ret;
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 62a9bb0..b20060a 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -107,6 +107,11 @@ static void vhost_vdpa_setup_vq_irq(struct vhost_vdpa *v, 
u16 qid)
vq->call_ctx.producer.token = vq->call_ctx.ctx;
vq->call_ctx.producer.irq = irq;
ret = irq_bypass_register_producer(&vq->call_ctx.producer);
+   if (unlikely(ret))
+   dev_info(&vdpa->dev,
+   "irq bypass producer (token %p) registration fails: %d\n",
+   vq->call_ctx.producer.token, ret);
+
spin_unlock(&vq->call_ctx.ctx_lock);
  }
  




Re: [PATCH 2/2] KVM: not link irqfd with a fake IRQ bypass producer

2020-10-19 Thread Jason Wang



On 2020/10/19 下午5:06, Zhenzhong Duan wrote:

In case of failure to set up a posted interrupt for an IRQ, it makes no sense
to assign irqfd->producer to the producer.

This change makes the code more robust.



It would be better to describe what issue we would get without this patch.

Thanks




Signed-off-by: Zhenzhong Duan 
---
  arch/x86/kvm/x86.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ce856e0..277e961 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10683,13 +10683,14 @@ int kvm_arch_irq_bypass_add_producer(struct 
irq_bypass_consumer *cons,
container_of(cons, struct kvm_kernel_irqfd, consumer);
int ret;
  
-	irqfd->producer = prod;

kvm_arch_start_assignment(irqfd->kvm);
ret = kvm_x86_ops.update_pi_irte(irqfd->kvm,
 prod->irq, irqfd->gsi, 1);
  
  	if (ret)

kvm_arch_end_assignment(irqfd->kvm);
+   else
+   irqfd->producer = prod;
  
  	return ret;

  }




Re: [PATCH net] vdpa/mlx5: Fix miss to set VIRTIO_NET_S_LINK_UP for virtio_net_config

2020-10-20 Thread Jason Wang



On 2020/10/20 下午3:44, Eli Cohen wrote:

On Tue, Oct 20, 2020 at 10:03:00AM +0800, Jason Wang wrote:

On 2020/10/19 5:07 PM, we...@ucloud.cn wrote:

From: wenxu 

QEMU gets virtio_net_config from the vdpa driver, so the vdpa driver
should set the VIRTIO_NET_S_LINK_UP flag in virtio_net_config like
vdpa_sim does. Otherwise the link of the virtio-net NIC in the virtual
machine will never come up.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: wenxu 
---
   drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 ++
   1 file changed, 2 insertions(+)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 74264e59..af6c74c 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1537,6 +1537,8 @@ static int mlx5_vdpa_set_features(struct vdpa_device 
*vdev, u64 features)
ndev->mvdev.actual_features = features & ndev->mvdev.mlx_features;
ndev->config.mtu = __cpu_to_virtio16(mlx5_vdpa_is_little_endian(mvdev),
 ndev->mtu);
+   ndev->config.status = 
__cpu_to_virtio16(mlx5_vdpa_is_little_endian(mvdev),
+  VIRTIO_NET_S_LINK_UP);
return err;
   }


Other than the small issue pointed out by Jakub.

Acked-by: Jason Wang 



I already posted a fix for this a while ago and Michael should merge it.

https://lkml.org/lkml/2020/9/17/543



Aha, I just forget this.

It's queued here: 
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/commit/?h=linux-next


And with my ack actually.

Thanks




Re: [PATCH v4] Revert "virtio-net: ethtool configurable RXCSUM"

2020-10-21 Thread Jason Wang



On 2020/10/21 10:30 PM, Michael S. Tsirkin wrote:

This reverts commit 3618ad2a7c0e78e4258386394d5d5f92a3dbccf8.

When control vq is not negotiated, that commit causes a crash:

[   72.229171] kernel BUG at drivers/net/virtio_net.c:1667!
[   72.230266] invalid opcode:  [#1] PREEMPT SMP
[   72.231172] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
5.9.0-rc8-02934-g3618ad2a7c0e7 #1
[   72.231172] EIP: virtnet_send_command+0x120/0x140
[   72.231172] Code: 00 0f 94 c0 8b 7d f0 65 33 3d 14 00 00 00 75 1c 8d 65 f4 5b 5e 5f 5d c3 66 90 be 01 00 00 00 e9 6e ff ff ff 8d b6 00 00 00 00 <0f> 0b e8 d9 bb 82 00 eb 17 8d b4 26 00 00 00 00 8d b4 26 00 00 00
[   72.231172] EAX: 000d EBX: f72895c0 ECX: 0017 EDX: 0011
[   72.231172] ESI: f7197800 EDI: ed69bd00 EBP: ed69bcf4 ESP: ed69bc98
[   72.231172] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
[   72.231172] CR0: 80050033 CR2:  CR3: 02c84000 CR4: 000406f0
[   72.231172] Call Trace:
[   72.231172]  ? __virt_addr_valid+0x45/0x60
[   72.231172]  ? ___cache_free+0x51f/0x760
[   72.231172]  ? kobject_uevent_env+0xf4/0x560
[   72.231172]  virtnet_set_guest_offloads+0x4d/0x80
[   72.231172]  virtnet_set_features+0x85/0x120
[   72.231172]  ? virtnet_set_guest_offloads+0x80/0x80
[   72.231172]  __netdev_update_features+0x27a/0x8e0
[   72.231172]  ? kobject_uevent+0xa/0x20
[   72.231172]  ? netdev_register_kobject+0x12c/0x160
[   72.231172]  register_netdevice+0x4fe/0x740
[   72.231172]  register_netdev+0x1c/0x40
[   72.231172]  virtnet_probe+0x728/0xb60
[   72.231172]  ? _raw_spin_unlock+0x1d/0x40
[   72.231172]  ? virtio_vdpa_get_status+0x1c/0x20
[   72.231172]  virtio_dev_probe+0x1c6/0x271
[   72.231172]  really_probe+0x195/0x2e0
[   72.231172]  driver_probe_device+0x26/0x60
[   72.231172]  device_driver_attach+0x49/0x60
[   72.231172]  __driver_attach+0x46/0xc0
[   72.231172]  ? device_driver_attach+0x60/0x60
[   72.231172]  bus_add_driver+0x197/0x1c0
[   72.231172]  driver_register+0x66/0xc0
[   72.231172]  register_virtio_driver+0x1b/0x40
[   72.231172]  virtio_net_driver_init+0x61/0x86
[   72.231172]  ? veth_init+0x14/0x14
[   72.231172]  do_one_initcall+0x76/0x2e4
[   72.231172]  ? rdinit_setup+0x2a/0x2a
[   72.231172]  do_initcalls+0xb2/0xd5
[   72.231172]  kernel_init_freeable+0x14f/0x179
[   72.231172]  ? rest_init+0x100/0x100
[   72.231172]  kernel_init+0xd/0xe0
[   72.231172]  ret_from_fork+0x1c/0x30
[   72.231172] Modules linked in:
[   72.269563] ---[ end trace a6ebc4afea0e6cb1 ]---

The reason is that virtnet_set_features now calls virtnet_set_guest_offloads
unconditionally; it used to call it only when there was something
to configure.

If the device does not have a control vq, everything breaks.

Revert the original commit for now.
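
For reference, the BUG at drivers/net/virtio_net.c:1667 is most likely the
guard at the top of virtnet_send_command(), an assumption based on upstream
code of that era rather than something spelled out in this thread:

    /* Caller should know better */
    BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));

Any path that issues a control command without first checking for a control
virtqueue trips it.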

Cc: Tonghao Zhang 
Cc: Willem de Bruijn 
Fixes: 3618ad2a7c0e7 ("virtio-net: ethtool configurable RXCSUM")
Reported-by: kernel test robot 
Signed-off-by: Michael S. Tsirkin 
---



Acked-by: Jason Wang 





Same patch as all of v1-v3, just tweaking the commit log.

  drivers/net/virtio_net.c | 50 +++-
  1 file changed, 13 insertions(+), 37 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d2d2c4a53cf2..21b71148c532 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -68,8 +68,6 @@ static const unsigned long guest_offloads[] = {
(1ULL << VIRTIO_NET_F_GUEST_ECN)  | \
(1ULL << VIRTIO_NET_F_GUEST_UFO))
  
-#define GUEST_OFFLOAD_CSUM_MASK (1ULL << VIRTIO_NET_F_GUEST_CSUM)

-
  struct virtnet_stat_desc {
char desc[ETH_GSTRING_LEN];
size_t offset;
@@ -2524,48 +2522,29 @@ static int virtnet_get_phys_port_name(struct net_device 
*dev, char *buf,
return 0;
  }
  
-static netdev_features_t virtnet_fix_features(struct net_device *netdev,

- netdev_features_t features)
-{
-   /* If Rx checksum is disabled, LRO should also be disabled. */
-   if (!(features & NETIF_F_RXCSUM))
-   features &= ~NETIF_F_LRO;
-
-   return features;
-}
-
  static int virtnet_set_features(struct net_device *dev,
netdev_features_t features)
  {
struct virtnet_info *vi = netdev_priv(dev);
-   u64 offloads = vi->guest_offloads;
+   u64 offloads;
int err;
  
-	/* Don't allow configuration while XDP is active. */

-   if (vi->xdp_queue_pairs)
-   return -EBUSY;
-
if ((dev->features ^ features) & NETIF_F_LRO) {
+   if (vi->xdp_queue_pairs)
+   return -EBUSY;
+
if (features & NETIF_F_LRO)
-   offloads |= GUEST_OFFLOAD_LRO_MASK &
-   vi->guest_offloads_capable;
+   offloads = vi->guest_offloads_capable;
   

Re: [PATCH] vdpa: handle irq bypass register failure case

2020-10-25 Thread Jason Wang



On 2020/10/23 6:40 PM, Zhu Lingshan wrote:

LKP considered the variable 'ret' in vhost_vdpa_setup_vq_irq() to be
an unused variable and suggested we remove it. Actually it stores the
return value of irq_bypass_register_producer(), but we did not check
it; we should handle the failure case.

This commit prints a message if registering the irq bypass producer
fails; in this case, the vqs remain functional.

Signed-off-by: Zhu Lingshan 
Reported-by: kernel test robot 
---
  drivers/vhost/vdpa.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 62a9bb0efc55..d6b2c3bd1b01 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -107,6 +107,9 @@ static void vhost_vdpa_setup_vq_irq(struct vhost_vdpa *v, 
u16 qid)
vq->call_ctx.producer.token = vq->call_ctx.ctx;
vq->call_ctx.producer.irq = irq;
ret = irq_bypass_register_producer(&vq->call_ctx.producer);
+   if (unlikely(ret))
+   dev_info(&v->dev, "vq %u, irq bypass producer (token %p) 
registration fails, ret =  %d\n",
+qid, vq->call_ctx.producer.token, ret);
spin_unlock(&vq->call_ctx.ctx_lock);
  }
  



Acked-by: Jason Wang 




Re: [PATCH net] vhost_vdpa: Return -EFUALT if copy_from_user() fails

2020-10-25 Thread Jason Wang



On 2020/10/23 11:34 PM, Michael S. Tsirkin wrote:

On Fri, Oct 23, 2020 at 03:08:53PM +0300, Dan Carpenter wrote:

The copy_to/from_user() functions return the number of bytes which we
weren't able to copy, but the ioctl should return -EFAULT if they fail.

Fixes: a127c5bbb6a8 ("vhost-vdpa: fix backend feature ioctls")
Signed-off-by: Dan Carpenter 

Acked-by: Michael S. Tsirkin 
Needed for stable I guess.



Agree.

Acked-by: Jason Wang 



---
  drivers/vhost/vdpa.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 62a9bb0efc55..c94a97b6bd6d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -428,12 +428,11 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
void __user *argp = (void __user *)arg;
u64 __user *featurep = argp;
u64 features;
-   long r;
+   long r = 0;
  
  	if (cmd == VHOST_SET_BACKEND_FEATURES) {

-   r = copy_from_user(&features, featurep, sizeof(features));
-   if (r)
-   return r;
+   if (copy_from_user(&features, featurep, sizeof(features)))
+   return -EFAULT;
if (features & ~VHOST_VDPA_BACKEND_FEATURES)
return -EOPNOTSUPP;
vhost_set_backend_features(&v->vdev, features);
@@ -476,7 +475,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
break;
case VHOST_GET_BACKEND_FEATURES:
features = VHOST_VDPA_BACKEND_FEATURES;
-   r = copy_to_user(featurep, &features, sizeof(features));
+   if (copy_to_user(featurep, &features, sizeof(features)))
+   r = -EFAULT;
break;
default:
r = vhost_dev_ioctl(&v->vdev, cmd, argp);




Re: [PATCH] vhost: Use mutex to protect vq_irq setup

2020-10-29 Thread Jason Wang



On 2020/10/28 10:20 PM, Eli Cohen wrote:

Both irq_bypass_register_producer() and irq_bypass_unregister_producer()
require process context to run. Change the call context lock from
spinlock to mutex to protect the setup process to avoid deadlocks.
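
(Why process context: in upstream virt/lib/irqbypass.c both helpers serialize
on a global mutex and thus may sleep. A sketch of the relevant shape, assumed
from upstream rather than quoted in this thread:

    int irq_bypass_register_producer(struct irq_bypass_producer *producer)
    {
            ...
            mutex_lock(&lock);      /* may sleep; must not be called under a spinlock */
            ...
    }
)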

Fixes: 265a0ad8731d ("vhost: introduce vhost_vring_call")
Signed-off-by: Eli Cohen 



Hi Eli:

During review we spot that the spinlock is not necessary. And it was 
already protected by vq mutex. So it was removed in this commit:


https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=86e182fe12ee5869022614457037097c70fe2ed1

Thanks



---
  drivers/vhost/vdpa.c  | 10 +-
  drivers/vhost/vhost.c |  6 +++---
  drivers/vhost/vhost.h |  3 ++-
  3 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index be783592fe58..0a744f2b6e76 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -98,26 +98,26 @@ static void vhost_vdpa_setup_vq_irq(struct vhost_vdpa *v, 
u16 qid)
return;
  
  	irq = ops->get_vq_irq(vdpa, qid);

-   spin_lock(&vq->call_ctx.ctx_lock);
+   mutex_lock(&vq->call_ctx.ctx_lock);
irq_bypass_unregister_producer(&vq->call_ctx.producer);
if (!vq->call_ctx.ctx || irq < 0) {
-   spin_unlock(&vq->call_ctx.ctx_lock);
+   mutex_unlock(&vq->call_ctx.ctx_lock);
return;
}
  
  	vq->call_ctx.producer.token = vq->call_ctx.ctx;

vq->call_ctx.producer.irq = irq;
ret = irq_bypass_register_producer(&vq->call_ctx.producer);
-   spin_unlock(&vq->call_ctx.ctx_lock);
+   mutex_unlock(&vq->call_ctx.ctx_lock);
  }
  
  static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid)

  {
struct vhost_virtqueue *vq = &v->vqs[qid];
  
-	spin_lock(&vq->call_ctx.ctx_lock);

+   mutex_lock(&vq->call_ctx.ctx_lock);
irq_bypass_unregister_producer(&vq->call_ctx.producer);
-   spin_unlock(&vq->call_ctx.ctx_lock);
+   mutex_unlock(&vq->call_ctx.ctx_lock);
  }
  
  static void vhost_vdpa_reset(struct vhost_vdpa *v)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 9ad45e1d27f0..938239e11455 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -302,7 +302,7 @@ static void vhost_vring_call_reset(struct vhost_vring_call 
*call_ctx)
  {
call_ctx->ctx = NULL;
memset(&call_ctx->producer, 0x0, sizeof(struct irq_bypass_producer));
-   spin_lock_init(&call_ctx->ctx_lock);
+   mutex_init(&call_ctx->ctx_lock);
  }
  
  static void vhost_vq_reset(struct vhost_dev *dev,

@@ -1650,9 +1650,9 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int 
ioctl, void __user *arg
break;
}
  
-		spin_lock(&vq->call_ctx.ctx_lock);

+   mutex_lock(&vq->call_ctx.ctx_lock);
swap(ctx, vq->call_ctx.ctx);
-   spin_unlock(&vq->call_ctx.ctx_lock);
+   mutex_unlock(&vq->call_ctx.ctx_lock);
break;
case VHOST_SET_VRING_ERR:
if (copy_from_user(&f, argp, sizeof f)) {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 9032d3c2a9f4..e8855ea04205 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -64,7 +64,8 @@ enum vhost_uaddr_type {
  struct vhost_vring_call {
struct eventfd_ctx *ctx;
struct irq_bypass_producer producer;
-   spinlock_t ctx_lock;
+   /* protect vq irq setup */
+   struct mutex ctx_lock;
  };
  
  /* The virtqueue structure describes a queue attached to a device. */




Re: [PATCH] vhost: Use mutex to protect vq_irq setup

2020-10-29 Thread Jason Wang



On 2020/10/29 3:37 PM, Eli Cohen wrote:

On Thu, Oct 29, 2020 at 03:03:24PM +0800, Jason Wang wrote:

On 2020/10/28 10:20 PM, Eli Cohen wrote:

Both irq_bypass_register_producer() and irq_bypass_unregister_producer()
require process context to run. Change the call context lock from
spinlock to mutex to protect the setup process to avoid deadlocks.

Fixes: 265a0ad8731d ("vhost: introduce vhost_vring_call")
Signed-off-by: Eli Cohen 


Hi Eli:

During review we spot that the spinlock is not necessary. And it was already
protected by vq mutex. So it was removed in this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=86e182fe12ee5869022614457037097c70fe2ed1

Thanks


I see, thanks.

BTW, while testing irq bypassing, I noticed that qemu started crashing
and I failed to boot the VM. Is that a known issue? I checked using an
updated master branch of qemu from yesterday.



Not known yet.




Any ideas how to check this further?



It would be helpful if you could paste the calltrace here.



Did anyone actually check that irq bypassing works?



Yes, Ling Shan tested it via the IFCVF driver.

Thanks





---
   drivers/vhost/vdpa.c  | 10 +-
   drivers/vhost/vhost.c |  6 +++---
   drivers/vhost/vhost.h |  3 ++-
   3 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index be783592fe58..0a744f2b6e76 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -98,26 +98,26 @@ static void vhost_vdpa_setup_vq_irq(struct vhost_vdpa *v, 
u16 qid)
return;
irq = ops->get_vq_irq(vdpa, qid);
-   spin_lock(&vq->call_ctx.ctx_lock);
+   mutex_lock(&vq->call_ctx.ctx_lock);
irq_bypass_unregister_producer(&vq->call_ctx.producer);
if (!vq->call_ctx.ctx || irq < 0) {
-   spin_unlock(&vq->call_ctx.ctx_lock);
+   mutex_unlock(&vq->call_ctx.ctx_lock);
return;
}
vq->call_ctx.producer.token = vq->call_ctx.ctx;
vq->call_ctx.producer.irq = irq;
ret = irq_bypass_register_producer(&vq->call_ctx.producer);
-   spin_unlock(&vq->call_ctx.ctx_lock);
+   mutex_unlock(&vq->call_ctx.ctx_lock);
   }
   static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid)
   {
struct vhost_virtqueue *vq = &v->vqs[qid];
-   spin_lock(&vq->call_ctx.ctx_lock);
+   mutex_lock(&vq->call_ctx.ctx_lock);
irq_bypass_unregister_producer(&vq->call_ctx.producer);
-   spin_unlock(&vq->call_ctx.ctx_lock);
+   mutex_unlock(&vq->call_ctx.ctx_lock);
   }
   static void vhost_vdpa_reset(struct vhost_vdpa *v)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 9ad45e1d27f0..938239e11455 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -302,7 +302,7 @@ static void vhost_vring_call_reset(struct vhost_vring_call 
*call_ctx)
   {
call_ctx->ctx = NULL;
memset(&call_ctx->producer, 0x0, sizeof(struct irq_bypass_producer));
-   spin_lock_init(&call_ctx->ctx_lock);
+   mutex_init(&call_ctx->ctx_lock);
   }
   static void vhost_vq_reset(struct vhost_dev *dev,
@@ -1650,9 +1650,9 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int 
ioctl, void __user *arg
break;
}
-   spin_lock(&vq->call_ctx.ctx_lock);
+   mutex_lock(&vq->call_ctx.ctx_lock);
swap(ctx, vq->call_ctx.ctx);
-   spin_unlock(&vq->call_ctx.ctx_lock);
+   mutex_unlock(&vq->call_ctx.ctx_lock);
break;
case VHOST_SET_VRING_ERR:
if (copy_from_user(&f, argp, sizeof f)) {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 9032d3c2a9f4..e8855ea04205 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -64,7 +64,8 @@ enum vhost_uaddr_type {
   struct vhost_vring_call {
struct eventfd_ctx *ctx;
struct irq_bypass_producer producer;
-   spinlock_t ctx_lock;
+   /* protect vq irq setup */
+   struct mutex ctx_lock;
   };
   /* The virtqueue structure describes a queue attached to a device. */




Re: [PATCH] vhost: Use mutex to protect vq_irq setup

2020-10-29 Thread Jason Wang



On 2020/10/29 3:50 PM, Eli Cohen wrote:

On Thu, Oct 29, 2020 at 03:39:24PM +0800, Jason Wang wrote:

On 2020/10/29 3:37 PM, Eli Cohen wrote:

On Thu, Oct 29, 2020 at 03:03:24PM +0800, Jason Wang wrote:

On 2020/10/28 10:20 PM, Eli Cohen wrote:

Both irq_bypass_register_producer() and irq_bypass_unregister_producer()
require process context to run. Change the call context lock from
spinlock to mutex to protect the setup process to avoid deadlocks.

Fixes: 265a0ad8731d ("vhost: introduce vhost_vring_call")
Signed-off-by: Eli Cohen

Hi Eli:

During review we spot that the spinlock is not necessary. And it was already
protected by vq mutex. So it was removed in this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=86e182fe12ee5869022614457037097c70fe2ed1

Thanks


I see, thanks.

BTW, while testing irq bypassing, I noticed that qemu started crashing
and I failed to boot the VM. Is that a known issue? I checked using an
updated master branch of qemu from yesterday.

Not known yet.



Any ideas how to check this further?

It would be helpful if you could paste the calltrace here.


I am not too familiar with qemu. Assuming I am using virsh start to boot
the VM, how can I get the call trace?



You probably need to configure qemu with --enable-debug. Then after the VM
is launched, you can use gdb to attach to the qemu process; gdb may then
report a calltrace if qemu crashes.
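
A minimal example of that flow (assuming the binary is qemu-system-x86_64 and
only one instance is running; adjust to your setup):

    $ gdb -p $(pidof qemu-system-x86_64)
    (gdb) continue
    ... reproduce the crash ...
    (gdb) bt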


Thanks




Re: [PATCH] vhost/vsock: add IOTLB API support

2020-10-30 Thread Jason Wang



On 2020/10/30 1:43 AM, Stefano Garzarella wrote:

This patch enables the IOTLB API support for vhost-vsock devices,
allowing the userspace to emulate an IOMMU for the guest.

These changes were made following vhost-net, in details this patch:
- exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb
   device if the feature is acked
- implements VHOST_GET_BACKEND_FEATURES and
   VHOST_SET_BACKEND_FEATURES ioctls
- calls vq_meta_prefetch() before vq processing to prefetch vq
   metadata address in IOTLB
- provides .read_iter, .write_iter, and .poll callbacks for the
   chardev; they are used by the userspace to exchange IOTLB messages

This patch was tested with QEMU and a patch applied [1] to fix a
simple issue:
 $ qemu -M q35,accel=kvm,kernel-irqchip=split \
-drive file=fedora.qcow2,format=qcow2,if=virtio \
-device intel-iommu,intremap=on \
-device vhost-vsock-pci,guest-cid=3,iommu_platform=on



Patch looks good, but a question:

It looks to me that you don't enable ATS, which means vhost won't get any
invalidation requests, or did I miss anything?


Thanks




[1] https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg09077.html

Signed-off-by: Stefano Garzarella 
---
  drivers/vhost/vsock.c | 68 +--
  1 file changed, 65 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index a483cec31d5c..5e78fb719602 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -30,7 +30,12 @@
  #define VHOST_VSOCK_PKT_WEIGHT 256
  
  enum {

-   VHOST_VSOCK_FEATURES = VHOST_FEATURES,
+   VHOST_VSOCK_FEATURES = VHOST_FEATURES |
+  (1ULL << VIRTIO_F_ACCESS_PLATFORM)
+};
+
+enum {
+   VHOST_VSOCK_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2)
  };
  
  /* Used to track all the vhost_vsock instances on the system. */

@@ -94,6 +99,9 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
if (!vhost_vq_get_backend(vq))
goto out;
  
+	if (!vq_meta_prefetch(vq))

+   goto out;
+
/* Avoid further vmexits, we're already processing the virtqueue */
vhost_disable_notify(&vsock->dev, vq);
  
@@ -449,6 +457,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)

if (!vhost_vq_get_backend(vq))
goto out;
  
+	if (!vq_meta_prefetch(vq))

+   goto out;
+
vhost_disable_notify(&vsock->dev, vq);
do {
u32 len;
@@ -766,8 +777,12 @@ static int vhost_vsock_set_features(struct vhost_vsock 
*vsock, u64 features)
mutex_lock(&vsock->dev.mutex);
if ((features & (1 << VHOST_F_LOG_ALL)) &&
!vhost_log_access_ok(&vsock->dev)) {
-   mutex_unlock(&vsock->dev.mutex);
-   return -EFAULT;
+   goto err;
+   }
+
+   if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) {
+   if (vhost_init_device_iotlb(&vsock->dev, true))
+   goto err;
}
  
  	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {

@@ -778,6 +793,10 @@ static int vhost_vsock_set_features(struct vhost_vsock 
*vsock, u64 features)
}
mutex_unlock(&vsock->dev.mutex);
return 0;
+
+err:
+   mutex_unlock(&vsock->dev.mutex);
+   return -EFAULT;
  }
  
  static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,

@@ -811,6 +830,18 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned 
int ioctl,
if (copy_from_user(&features, argp, sizeof(features)))
return -EFAULT;
return vhost_vsock_set_features(vsock, features);
+   case VHOST_GET_BACKEND_FEATURES:
+   features = VHOST_VSOCK_BACKEND_FEATURES;
+   if (copy_to_user(argp, &features, sizeof(features)))
+   return -EFAULT;
+   return 0;
+   case VHOST_SET_BACKEND_FEATURES:
+   if (copy_from_user(&features, argp, sizeof(features)))
+   return -EFAULT;
+   if (features & ~VHOST_VSOCK_BACKEND_FEATURES)
+   return -EOPNOTSUPP;
+   vhost_set_backend_features(&vsock->dev, features);
+   return 0;
default:
mutex_lock(&vsock->dev.mutex);
r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
@@ -823,6 +854,34 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned 
int ioctl,
}
  }
  
+static ssize_t vhost_vsock_chr_read_iter(struct kiocb *iocb, struct iov_iter *to)

+{
+   struct file *file = iocb->ki_filp;
+   struct vhost_vsock *vsock = file->private_data;
+   struct vhost_dev *dev = &vsock->dev;
+   int noblock = file->f_flags & O_NONBLOCK;
+
+   return vhost_chr_read_iter(dev, to, noblock);
+}
+
+static ssize_t vhost_vsock_chr_write_iter(struct kiocb *iocb,
+   struct iov_iter *from)
+{
+

Re: [PATCH] vhost/vsock: add IOTLB API support

2020-10-30 Thread Jason Wang



On 2020/10/30 6:54 PM, Stefano Garzarella wrote:

On Fri, Oct 30, 2020 at 06:02:18PM +0800, Jason Wang wrote:


On 2020/10/30 1:43 AM, Stefano Garzarella wrote:

This patch enables the IOTLB API support for vhost-vsock devices,
allowing the userspace to emulate an IOMMU for the guest.

These changes were made following vhost-net, in details this patch:
- exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb
  device if the feature is acked
- implements VHOST_GET_BACKEND_FEATURES and
  VHOST_SET_BACKEND_FEATURES ioctls
- calls vq_meta_prefetch() before vq processing to prefetch vq
  metadata address in IOTLB
- provides .read_iter, .write_iter, and .poll callbacks for the
  chardev; they are used by the userspace to exchange IOTLB messages

This patch was tested with QEMU and a patch applied [1] to fix a
simple issue:
    $ qemu -M q35,accel=kvm,kernel-irqchip=split \
   -drive file=fedora.qcow2,format=qcow2,if=virtio \
   -device intel-iommu,intremap=on \
   -device vhost-vsock-pci,guest-cid=3,iommu_platform=on



Patch looks good, but a question:

It looks to me that you don't enable ATS, which means vhost won't get any
invalidation requests, or did I miss anything?




You're right, I didn't see invalidation requests, only misses and updates.
Now I have tried to enable 'ats' and 'device-iotlb' but I still don't
see any invalidations.


How can I test it? (Sorry but I don't have much experience yet with 
vIOMMU)



I guess it's because of the batched unmap. Maybe you can try to use
"intel_iommu=strict" on the guest kernel command line to see if it works.


Btw, make sure the qemu contains the patch [1]. Otherwise ATS won't be 
enabled for recent Linux Kernel in the guest.


Thanks

[1] https://patchew.org/QEMU/20200909081731.24688-1-jasow...@redhat.com/



Thanks,
Stefano





Re: [PATCH] vhost/vsock: add IOTLB API support

2020-11-03 Thread Jason Wang



On 2020/11/3 1:11 AM, Stefano Garzarella wrote:

On Fri, Oct 30, 2020 at 07:44:43PM +0800, Jason Wang wrote:


On 2020/10/30 6:54 PM, Stefano Garzarella wrote:

On Fri, Oct 30, 2020 at 06:02:18PM +0800, Jason Wang wrote:


On 2020/10/30 1:43 AM, Stefano Garzarella wrote:

This patch enables the IOTLB API support for vhost-vsock devices,
allowing the userspace to emulate an IOMMU for the guest.

These changes were made following vhost-net, in details this patch:
- exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb
  device if the feature is acked
- implements VHOST_GET_BACKEND_FEATURES and
  VHOST_SET_BACKEND_FEATURES ioctls
- calls vq_meta_prefetch() before vq processing to prefetch vq
  metadata address in IOTLB
- provides .read_iter, .write_iter, and .poll callbacks for the
  chardev; they are used by the userspace to exchange IOTLB messages

This patch was tested with QEMU and a patch applied [1] to fix a
simple issue:
    $ qemu -M q35,accel=kvm,kernel-irqchip=split \
   -drive file=fedora.qcow2,format=qcow2,if=virtio \
   -device intel-iommu,intremap=on \
   -device vhost-vsock-pci,guest-cid=3,iommu_platform=on



Patch looks good, but a question:

It looks to me that you don't enable ATS, which means vhost won't get any
invalidation requests, or did I miss anything?




You're right, I didn't see invalidation requests, only misses and
updates.
Now I have tried to enable 'ats' and 'device-iotlb' but I still
don't see any invalidations.


How can I test it? (Sorry but I don't have much experience yet with 
vIOMMU)



I guess it's because of the batched unmap. Maybe you can try to use
"intel_iommu=strict" on the guest kernel command line to see if it works.


Btw, make sure the qemu contains the patch [1]. Otherwise ATS won't 
be enabled for recent Linux Kernel in the guest.


The problem was my kernel; it was built with a tiny configuration.
Using the Fedora stock kernel I can see the 'invalidate' requests, but I
also had the following issues.


Do they ring any bells?

$ ./qemu -m 4G -smp 4 -M q35,accel=kvm,kernel-irqchip=split \
    -drive file=fedora.qcow2,format=qcow2,if=virtio \
    -device intel-iommu,intremap=on,device-iotlb=on \
    -device vhost-vsock-pci,guest-cid=6,iommu_platform=on,ats=on,id=v1

    qemu-system-x86_64: vtd_iova_to_slpte: detected IOVA overflow (iova=0x1d4030c0)



It's a hint that the IOVA exceeds the AW (address width). It might be worth
checking whether the missed IOVA reported from the IOTLB is legal.
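
As a minimal illustration of what such an overflow check means (a hypothetical
sketch, not QEMU's actual code): an IOVA is only valid if it fits within the
address width the vIOMMU advertises:

    /* aw is the vIOMMU address width, e.g. 39 or 48 bits */
    static bool iova_in_range(u64 iova, unsigned int aw)
    {
            return (iova >> aw) == 0;
    }

The warning suggests the reported IOVA failed a range check of this kind.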


Thanks


qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=00:03:00, iova=0x1d4030c0)
    qemu-system-x86_64: New fault is not recorded due to compression of faults


Guest kernel messages:
    [   44.940872] DMAR: DRHD: handling fault status reg 2
    [   44.941989] DMAR: [DMA Read] Request device [00:03.0] PASID fault addr 88W

    [   49.785884] DMAR: DRHD: handling fault status reg 2
    [   49.788874] DMAR: [DMA Read] Request device [00:03.0] PASID fault addr 88W



QEMU: b149dea55c Merge remote-tracking branch 'remotes/cschoenebeck/tags/pull-9p-20201102' into staging


Linux guest: 5.8.16-200.fc32.x86_64


Thanks,
Stefano





Re: [PATCH net-next v4 09/10] virtio-net: xsk zero copy xmit implement wakeup and xmit

2021-04-18 Thread Jason Wang



On 2021/4/16 2:29 PM, Xuan Zhuo wrote:

On Fri, 16 Apr 2021 13:35:33 +0800, Jason Wang  wrote:

On 2021/4/15 6:27 PM, Xuan Zhuo wrote:

On Wed, 14 Apr 2021 13:46:45 +0800, Jason Wang  wrote:

On 2021/4/13 11:15 AM, Xuan Zhuo wrote:

This patch implements the core part of xsk zerocopy xmit.

When the user calls sendto to consume the data in the xsk tx queue,
virtnet_xsk_wakeup() will be called.

In wakeup, it will try to send a part of the data directly. There are
two purposes for this realization:

1. Send part of the data quickly to reduce the transmission delay of the
  first packet.
2. Trigger tx interrupt, start napi to consume xsk tx data.

All sent xsk packets share the virtio-net header of xsk_hdr. If xsk
needs to support csum and other functions later, consider assigning xsk
hdr separately for each sent packet.

There are now three situations in free_old_xmit(): skb, xdp frame, xsk
desc.  Based on the last two bit of ptr returned by virtqueue_get_buf():
   00 is skb by default.
   01 represents the packet sent by xdp
   10 is the packet sent by xsk

If the xmit work of xsk has not been completed, but the ring is full,
napi must first exit and wait for the ring to be available, so
need_wakeup() is set. If free_old_xmit() is called first by start_xmit(),
we can quickly wake up napi to execute xsk xmit task.

When recycling, we need to count the number of bytes sent, so put xsk
desc->len into the ptr pointer. Because ptr does not point to meaningful
objects in xsk.

Signed-off-by: Xuan Zhuo 
Reviewed-by: Dust Li 
---
drivers/net/virtio_net.c | 296 ++-
1 file changed, 292 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 8242a9e9f17d..c441d6bf1510 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -46,6 +46,11 @@ module_param(napi_tx, bool, 0644);
#define VIRTIO_XDP_REDIRBIT(1)

#define VIRTIO_XDP_FLAG BIT(0)
+#define VIRTIO_XSK_FLAGBIT(1)
+
+#define VIRTIO_XSK_PTR_SHIFT   4
+
+static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;

/* RX packet size EWMA. The average packet size is used to determine the 
packet
 * buffer size when refilling RX rings. As the entire RX ring may be 
refilled
@@ -138,6 +143,12 @@ struct send_queue {
struct {
/* xsk pool */
struct xsk_buff_pool __rcu *pool;
+
+   /* save the desc for next xmit, when xmit fail. */
+   struct xdp_desc last_desc;

As replied in the previous version, this looks tricky. I think we need to
make sure to reserve some slots as the skb path does.

This looks exactly like what stmmac did, which also shares XDP and skb
for the same ring.



+
+   /* xsk wait for tx inter or softirq */
+   bool need_wakeup;
} xsk;
};

@@ -255,6 +266,15 @@ struct padded_vnet_hdr {
char padding[4];
};

+static int virtnet_xsk_run(struct send_queue *sq, struct xsk_buff_pool *pool,
+  int budget, bool in_napi);
+static void virtnet_xsk_complete(struct send_queue *sq, u32 num);
+
+static bool is_skb_ptr(void *ptr)
+{
+   return !((unsigned long)ptr & (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG));
+}
+
static bool is_xdp_frame(void *ptr)
{
return (unsigned long)ptr & VIRTIO_XDP_FLAG;
@@ -265,6 +285,19 @@ static void *xdp_to_ptr(struct xdp_frame *ptr)
return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
}

+static void *xsk_to_ptr(struct xdp_desc *desc)
+{
+   /* save the desc len to ptr */
+   u64 p = desc->len << VIRTIO_XSK_PTR_SHIFT;
+
+   return (void *)((unsigned long)p | VIRTIO_XSK_FLAG);
+}
+
+static void ptr_to_xsk(void *ptr, struct xdp_desc *desc)
+{
+   desc->len = ((unsigned long)ptr) >> VIRTIO_XSK_PTR_SHIFT;
+}
+
static struct xdp_frame *ptr_to_xdp(void *ptr)
{
return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
@@ -273,25 +306,35 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
static void __free_old_xmit(struct send_queue *sq, bool in_napi,
struct virtnet_sq_stats *stats)
{
+   unsigned int xsknum = 0;
unsigned int len;
void *ptr;

while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-   if (likely(!is_xdp_frame(ptr))) {
+   if (is_skb_ptr(ptr)) {
struct sk_buff *skb = ptr;

pr_debug("Sent skb %p\n", skb);

stats->bytes += skb->len;
napi_consume_skb(skb, in_napi);
-   } else {
+   } else if (is_xdp_frame(ptr)) {
struct xdp_frame *frame = ptr_to_xdp(ptr);

stats->bytes += frame->len;
xdp_return_frame(frame);
+   } else {
+   struct 

Re: [PATCH net-next v3] virtio-net: page_to_skb() use build_skb when there's sufficient tailroom

2021-04-18 Thread Jason Wang



On 2021/4/16 5:16 PM, Xuan Zhuo wrote:

In page_to_skb(), if we have enough tailroom to save skb_shared_info, we
can use build_skb to create skb directly. No need to alloc for
additional space. And it can save a 'frags slot', which is very friendly
to GRO.

Here, if the payload of the received packet is too small (less than
GOOD_COPY_LEN), we still choose to copy it directly to the space obtained
from napi_alloc_skb. So we can reuse these pages.

Testing Machine:
 The four queues of the network card are bound to the cpu1.

Test command:
 for ((i=0;i<5;++i)); do sockperf tp --ip 192.168.122.64 -m 1000 -t 150& 
done

The size of the udp packet is 1000, so in the case of this patch, there
will always be enough tailroom to use build_skb. The sent udp packets
will be discarded because there is no port to receive them. The ksoftirqd
of the machine is at 100%; we observe the receive rate displayed by
sar -n DEV 1:

no build_skb:  956864.00 rxpck/s
build_skb:1158465.00 rxpck/s



I suggest testing the XDP_PASS case here as well.




Signed-off-by: Xuan Zhuo 
Suggested-by: Jason Wang 
---

v3: fix the truesize when headroom > 0

v2: conflict resolution

  drivers/net/virtio_net.c | 69 
  1 file changed, 48 insertions(+), 21 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 101659cd4b87..8cd76037c724 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -379,21 +379,17 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
*vi,
   struct receive_queue *rq,
   struct page *page, unsigned int offset,
   unsigned int len, unsigned int truesize,
-  bool hdr_valid, unsigned int metasize)
+  bool hdr_valid, unsigned int metasize,
+  unsigned int headroom)
  {
struct sk_buff *skb;
struct virtio_net_hdr_mrg_rxbuf *hdr;
unsigned int copy, hdr_len, hdr_padded_len;
-   char *p;
+   int tailroom, shinfo_size;
+   char *p, *hdr_p;

p = page_address(page) + offset;
-
-   /* copy small packet so we can reuse these pages for small data */
-   skb = napi_alloc_skb(&rq->napi, GOOD_COPY_LEN);
-   if (unlikely(!skb))
-   return NULL;
-
-   hdr = skb_vnet_hdr(skb);
+   hdr_p = p;

hdr_len = vi->hdr_len;
if (vi->mergeable_rx_bufs)
@@ -401,14 +397,38 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
*vi,
else
hdr_padded_len = sizeof(struct padded_vnet_hdr);

-   /* hdr_valid means no XDP, so we can copy the vnet header */
-   if (hdr_valid)
-   memcpy(hdr, p, hdr_len);
+   /* If headroom is not 0, there is an offset between the beginning of the
+* data and the allocated space, otherwise the data and the allocated
+* space are aligned.
+*/
+   if (headroom) {
+   /* The actual allocated space size is PAGE_SIZE. */



I think the comment in receive_mergeable() is better:

    /* Buffers with headroom use PAGE_SIZE as alloc size,
 * see add_recvbuf_mergeable() + get_mergeable_buf_len()
 */



+   truesize = PAGE_SIZE;
+   tailroom = truesize - len - offset;
+   } else {
+   tailroom = truesize - len;
+   }

len -= hdr_len;
offset += hdr_padded_len;
p += hdr_padded_len;

+   shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+   if (len > GOOD_COPY_LEN && tailroom >= shinfo_size) {



Any reason that you don't use build_skb() for len < GOOD_COPY_LEN?

Other looks good.

Thanks



+   skb = build_skb(p, truesize);
+   if (unlikely(!skb))
+   return NULL;
+
+   skb_put(skb, len);
+   goto ok;
+   }
+
+   /* copy small packet so we can reuse these pages for small data */
+   skb = napi_alloc_skb(&rq->napi, GOOD_COPY_LEN);
+   if (unlikely(!skb))
+   return NULL;
+
/* Copy all frame if it fits skb->head, otherwise
 * we let virtio_net_hdr_to_skb() and GRO pull headers as needed.
 */
@@ -418,11 +438,6 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
copy = ETH_HLEN + metasize;
skb_put_data(skb, p, copy);

-   if (metasize) {
-   __skb_pull(skb, metasize);
-   skb_metadata_set(skb, metasize);
-   }
-
len -= copy;
offset += copy;

@@ -431,7 +446,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
skb_add_rx_frag(skb, 0, page, offset, len, truesize);
else
put_page(page);
-   ret

Re: [PATCH V3 1/3] vDPA/ifcvf: deduce VIRTIO device ID when probe

2021-04-18 Thread Jason Wang



On 2021/4/16 3:16 PM, Zhu Lingshan wrote:

This commit deduces the VIRTIO device ID as the device type at probe
time, so ifcvf_vdpa_get_device_id() can simply return the ID.
ifcvf_vdpa_get_features() and ifcvf_vdpa_get_config_size()
can then work properly based on the device ID.

Signed-off-by: Zhu Lingshan 
---
  drivers/vdpa/ifcvf/ifcvf_base.h |  1 +
  drivers/vdpa/ifcvf/ifcvf_main.c | 30 ++
  2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_base.h b/drivers/vdpa/ifcvf/ifcvf_base.h
index b2eeb16b9c2c..1c04cd256fa7 100644
--- a/drivers/vdpa/ifcvf/ifcvf_base.h
+++ b/drivers/vdpa/ifcvf/ifcvf_base.h
@@ -84,6 +84,7 @@ struct ifcvf_hw {
u32 notify_off_multiplier;
u64 req_features;
u64 hw_features;
+   u32 dev_type;
struct virtio_pci_common_cfg __iomem *common_cfg;
void __iomem *net_cfg;
struct vring_info vring[IFCVF_MAX_QUEUE_PAIRS * 2];
diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index 44d7586019da..469a9b5737b7 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -323,19 +323,9 @@ static u32 ifcvf_vdpa_get_generation(struct vdpa_device 
*vdpa_dev)
  
  static u32 ifcvf_vdpa_get_device_id(struct vdpa_device *vdpa_dev)

  {
-   struct ifcvf_adapter *adapter = vdpa_to_adapter(vdpa_dev);
-   struct pci_dev *pdev = adapter->pdev;
-   u32 ret = -ENODEV;
-
-   if (pdev->device < 0x1000 || pdev->device > 0x107f)
-   return ret;
-
-   if (pdev->device < 0x1040)
-   ret =  pdev->subsystem_device;
-   else
-   ret =  pdev->device - 0x1040;
+   struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev);
  
-	return ret;

+   return vf->dev_type;
  }
  
  static u32 ifcvf_vdpa_get_vendor_id(struct vdpa_device *vdpa_dev)

@@ -466,6 +456,22 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct 
pci_device_id *id)
pci_set_drvdata(pdev, adapter);
  
  	vf = &adapter->vf;

+
+   /* This driver drives both modern virtio devices and transitional
+* devices in modern mode.
+* vDPA requires feature bit VIRTIO_F_ACCESS_PLATFORM,
+* so legacy devices and transitional devices in legacy
+* mode will not work for vDPA, this driver will not
+* drive devices with legacy interface.
+*/
+   if (pdev->device < 0x1000 || pdev->device > 0x107f)
+   return -EOPNOTSUPP;



So this seems useless; id_table has already done that for us:

The driver supports:

#define IFCVF_DEVICE_ID 0x1041

and

#define C5000X_PL_BLK_DEVICE_ID    0x1001

I think we can never reach the condition above.

Thanks



+
+   if (pdev->device < 0x1040)
+   vf->dev_type =  pdev->subsystem_device;
+   else
+   vf->dev_type =  pdev->device - 0x1040;
+
vf->base = pcim_iomap_table(pdev);
  
  	adapter->pdev = pdev;




Re: [PATCH V3 2/3] vDPA/ifcvf: enable Intel C5000X-PL virtio-block for vDPA

2021-04-18 Thread Jason Wang



On 2021/4/16 3:16 PM, Zhu Lingshan wrote:

This commit enables Intel FPGA SmartNIC C5000X-PL virtio-block
for vDPA.

Signed-off-by: Zhu Lingshan 



Acked-by: Jason Wang 



---
  drivers/vdpa/ifcvf/ifcvf_base.h |  8 +++-
  drivers/vdpa/ifcvf/ifcvf_main.c | 19 ++-
  2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_base.h b/drivers/vdpa/ifcvf/ifcvf_base.h
index 1c04cd256fa7..0111bfdeb342 100644
--- a/drivers/vdpa/ifcvf/ifcvf_base.h
+++ b/drivers/vdpa/ifcvf/ifcvf_base.h
@@ -15,6 +15,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -28,7 +29,12 @@

  #define C5000X_PL_SUBSYS_VENDOR_ID0x8086
  #define C5000X_PL_SUBSYS_DEVICE_ID0x0001
  
-#define IFCVF_SUPPORTED_FEATURES \

+#define C5000X_PL_BLK_VENDOR_ID0x1AF4
+#define C5000X_PL_BLK_DEVICE_ID0x1001
+#define C5000X_PL_BLK_SUBSYS_VENDOR_ID 0x8086
+#define C5000X_PL_BLK_SUBSYS_DEVICE_ID 0x0002
+
+#define IFCVF_NET_SUPPORTED_FEATURES \
((1ULL << VIRTIO_NET_F_MAC)   | \
 (1ULL << VIRTIO_F_ANY_LAYOUT)| \
 (1ULL << VIRTIO_F_VERSION_1) | \
diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index 469a9b5737b7..376b2014916a 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -168,10 +168,23 @@ static struct ifcvf_hw *vdpa_to_vf(struct vdpa_device 
*vdpa_dev)
  
  static u64 ifcvf_vdpa_get_features(struct vdpa_device *vdpa_dev)

  {
+   struct ifcvf_adapter *adapter = vdpa_to_adapter(vdpa_dev);
struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev);
+   struct pci_dev *pdev = adapter->pdev;
+
u64 features;
  
-	features = ifcvf_get_features(vf) & IFCVF_SUPPORTED_FEATURES;

+   switch (vf->dev_type) {
+   case VIRTIO_ID_NET:
+   features = ifcvf_get_features(vf) & 
IFCVF_NET_SUPPORTED_FEATURES;
+   break;
+   case VIRTIO_ID_BLOCK:
+   features = ifcvf_get_features(vf);
+   break;
+   default:
+   features = 0;
+   IFCVF_ERR(pdev, "VIRTIO ID %u not supported\n", vf->dev_type);
+   }
  
  	return features;

  }
@@ -517,6 +530,10 @@ static struct pci_device_id ifcvf_pci_ids[] = {
 C5000X_PL_DEVICE_ID,
 C5000X_PL_SUBSYS_VENDOR_ID,
 C5000X_PL_SUBSYS_DEVICE_ID) },
+   { PCI_DEVICE_SUB(C5000X_PL_BLK_VENDOR_ID,
+C5000X_PL_BLK_DEVICE_ID,
+C5000X_PL_BLK_SUBSYS_VENDOR_ID,
+C5000X_PL_BLK_SUBSYS_DEVICE_ID) },
  
  	{ 0 },

  };




Re: [PATCH V3 3/3] vDPA/ifcvf: get_config_size should return dev specific config size

2021-04-18 Thread Jason Wang



On 2021/4/16 3:16 PM, Zhu Lingshan wrote:

get_config_size() should return the size based on the detected
device type.

Signed-off-by: Zhu Lingshan 



Acked-by: Jason Wang 



---
  drivers/vdpa/ifcvf/ifcvf_main.c | 19 ++-
  1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index 376b2014916a..3b6f7862dbb8 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -356,7 +356,24 @@ static u32 ifcvf_vdpa_get_vq_align(struct vdpa_device 
*vdpa_dev)
  
  static size_t ifcvf_vdpa_get_config_size(struct vdpa_device *vdpa_dev)

  {
-   return sizeof(struct virtio_net_config);
+   struct ifcvf_adapter *adapter = vdpa_to_adapter(vdpa_dev);
+   struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev);
+   struct pci_dev *pdev = adapter->pdev;
+   size_t size;
+
+   switch (vf->dev_type) {
+   case VIRTIO_ID_NET:
+   size = sizeof(struct virtio_net_config);
+   break;
+   case VIRTIO_ID_BLOCK:
+   size = sizeof(struct virtio_blk_config);
+   break;
+   default:
+   size = 0;
+   IFCVF_ERR(pdev, "VIRTIO ID %u not supported\n", vf->dev_type);
+   }
+
+   return size;
  }
  
  static void ifcvf_vdpa_get_config(struct vdpa_device *vdpa_dev,




Re: [PATCH net-next v3] virtio-net: page_to_skb() use build_skb when there's sufficient tailroom

2021-04-19 Thread Jason Wang



On 2021/4/20 7:29 AM, Mat Martineau wrote:


On Fri, 16 Apr 2021, Xuan Zhuo wrote:


In page_to_skb(), if we have enough tailroom to save skb_shared_info, we
can use build_skb to create skb directly. No need to alloc for
additional space. And it can save a 'frags slot', which is very friendly
to GRO.

Here, if the payload of the received packet is too small (less than
GOOD_COPY_LEN), we still choose to copy it directly to the space obtained
from napi_alloc_skb. So we can reuse these pages.

Testing Machine:
   The four queues of the network card are bound to the cpu1.

Test command:
   for ((i=0;i<5;++i)); do sockperf tp --ip 192.168.122.64 -m 1000 -t 
150& done


The size of the udp packet is 1000, so in the case of this patch, there
will always be enough tailroom to use build_skb. The sent udp packets
will be discarded because there is no port to receive them. The ksoftirqd
of the machine is at 100%; we observe the receive rate displayed by
sar -n DEV 1:

no build_skb:  956864.00 rxpck/s
build_skb:    1158465.00 rxpck/s

Signed-off-by: Xuan Zhuo 
Suggested-by: Jason Wang 
---

v3: fix the truesize when headroom > 0

v2: conflict resolution

drivers/net/virtio_net.c | 69 
1 file changed, 48 insertions(+), 21 deletions(-)


Xuan,

I realize this has been merged to net-next already, but I'm getting a 
use-after-free with KASAN in page_to_skb() with this patch. Reverting 
this change fixes the UAF. I've included the KASAN dump below, and a 
couple of comments inline.





diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 101659cd4b87..8cd76037c724 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -379,21 +379,17 @@ static struct sk_buff *page_to_skb(struct 
virtnet_info *vi,

   struct receive_queue *rq,
   struct page *page, unsigned int offset,
   unsigned int len, unsigned int truesize,
-   bool hdr_valid, unsigned int metasize)
+   bool hdr_valid, unsigned int metasize,
+   unsigned int headroom)
{
struct sk_buff *skb;
struct virtio_net_hdr_mrg_rxbuf *hdr;
unsigned int copy, hdr_len, hdr_padded_len;
-    char *p;
+    int tailroom, shinfo_size;
+    char *p, *hdr_p;

p = page_address(page) + offset;
-
-    /* copy small packet so we can reuse these pages for small data */
-    skb = napi_alloc_skb(&rq->napi, GOOD_COPY_LEN);
-    if (unlikely(!skb))
-    return NULL;
-
-    hdr = skb_vnet_hdr(skb);
+    hdr_p = p;


hdr_p is assigned here, a pointer to an address in the provided page...



hdr_len = vi->hdr_len;
if (vi->mergeable_rx_bufs)


(snip)

@@ -431,7 +446,7 @@ static struct sk_buff *page_to_skb(struct 
virtnet_info *vi,

    skb_add_rx_frag(skb, 0, page, offset, len, truesize);
    else
    put_page(page);


page is potentially released here...


-    return skb;
+    goto ok;
}

/*
@@ -458,6 +473,18 @@ static struct sk_buff *page_to_skb(struct 
virtnet_info *vi,

if (page)
    give_pages(rq, page);

+ok:
+    /* hdr_valid means no XDP, so we can copy the vnet header */
+    if (hdr_valid) {
+    hdr = skb_vnet_hdr(skb);
+    memcpy(hdr, hdr_p, hdr_len);


and hdr_p is dereferenced here.



Right, I intend to restore copying the hdr and setting the meta just after
alloc/build_skb().
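
A minimal sketch of that idea (assumed shape, not the final fix): copy the
vnet header into skb->cb via skb_vnet_hdr() right after the skb exists,
before any put_page()/give_pages() can release the page backing hdr_p:

    skb = build_skb(p, truesize);
    if (unlikely(!skb))
            return NULL;
    skb_put(skb, len);

    /* copy the vnet header now, while the page (and thus hdr_p) is valid */
    if (hdr_valid)
            memcpy(skb_vnet_hdr(skb), hdr_p, hdr_len);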


Thanks




I'm seeing this KASAN UAF at boot time in a kvm VM (Fedora 33 host and 
guest, if that helps):


[   61.202483] ==================================================================
[   61.204005] BUG: KASAN: use-after-free in page_to_skb+0x32a/0x4b0
[   61.205387] Read of size 12 at addr 888105bdf800 by task NetworkManager/579
[   61.207035]
[   61.207408] CPU: 0 PID: 579 Comm: NetworkManager Not tainted 5.12.0-rc7+ #2
[   61.208715] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014

[   61.210257] Call Trace:
[   61.210730]  
[   61.211209]  dump_stack+0x93/0xc2
[   61.211996]  print_address_description.constprop.0+0x18/0x130
[   61.213310]  ? page_to_skb+0x32a/0x4b0
[   61.214318]  ? page_to_skb+0x32a/0x4b0
[   61.215085]  kasan_report.cold+0x7f/0x111
[   61.215966]  ? trace_hardirqs_off+0x10/0xe0
[   61.216823]  ? page_to_skb+0x32a/0x4b0
[   61.217809]  kasan_check_range+0xf9/0x1e0
[   61.217834]  memcpy+0x20/0x60
[   61.217848]  page_to_skb+0x32a/0x4b0
[   61.217888]  receive_buf+0x1434/0x2690
[   61.217926]  ? page_to_skb+0x4b0/0x4b0
[   61.217947]  ? find_held_lock+0x85/0xa0
[   61.217964]  ? lock_release+0x1d0/0x400
[   61.217974]  ? virtnet_poll+0x1d8/0x6b0
[   61.217983]  ? detach_buf_split+0x254/0x290
[   61.218008]  ? virtqueue_get_buf_ctx_split+0x145/0x1f0
[   61.218032]  virtnet_poll+0x2a8/0x6b0
[   61.218057]  ? receive_buf+0x2690/0x2690
[   61.218067]  ? lock_release+0x400/0x400
[   61.218119]  __napi_poll+0x57/0x2f0
[   61.229624]  net_rx_action+0x4dd/0x590

Re: [PATCH net-next v3] virtio-net: page_to_skb() use build_skb when there's sufficient tailroom

2021-04-19 Thread Jason Wang



On 2021/4/20 10:38 AM, Jason Wang wrote:

+    /* hdr_valid means no XDP, so we can copy the vnet header */
+    if (hdr_valid) {
+    hdr = skb_vnet_hdr(skb);
+    memcpy(hdr, hdr_p, hdr_len);


and hdr_p is dereferenced here.



Right, I tend to recover the way to copy hdr and set meta just after 
alloc/build_skb().


Thanks 



Btw, since the patch modifies a critical path of virtio-net, I suggest
testing the following cases:


1) netperf TCP stream
2) netperf UDP with packet size from 64 to PAGE_SIZE
3) XDP_PASS with 1)
4) XDP_PASS with 2)
5) XDP metadata with 1)
6) XDP metadata with 2)
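
For example, the first two could be driven with something like this
(illustrative commands; the peer address and message sizes are assumptions):

    $ netperf -H 192.168.122.64 -t TCP_STREAM -l 60
    $ for s in 64 512 1500 4096; do netperf -H 192.168.122.64 -t UDP_STREAM -l 30 -- -m $s; done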

Thanks



Re: [PATCH net-next v3] virtio-net: page_to_skb() use build_skb when there's sufficient tailroom

2021-04-19 Thread Jason Wang



On 2021/4/20 12:48 AM, David Ahern wrote:

On 4/16/21 2:16 AM, Xuan Zhuo wrote:

In page_to_skb(), if we have enough tailroom to save skb_shared_info, we
can use build_skb to create skb directly. No need to alloc for
additional space. And it can save a 'frags slot', which is very friendly
to GRO.

Here, if the payload of the received packet is too small (less than
GOOD_COPY_LEN), we still choose to copy it directly to the space obtained
from napi_alloc_skb. So we can reuse these pages.

Testing Machine:
 The four queues of the network card are bound to the cpu1.

Test command:
 for ((i=0;i<5;++i)); do sockperf tp --ip 192.168.122.64 -m 1000 -t 150& 
done

The size of the udp packet is 1000, so in the case of this patch, there
will always be enough tailroom to use build_skb. The sent udp packets
will be discarded because there is no port to receive them. The ksoftirqd
of the machine is at 100%; we observe the receive rate displayed by
sar -n DEV 1:

no build_skb:  956864.00 rxpck/s
build_skb:1158465.00 rxpck/s


virtio_net is using napi_consume_skb, so napi_build_skb should show a
small increase from build_skb.



Yes and we probably need to do this in receive_small().
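
A minimal sketch of that substitution (assuming napi_build_skb() takes the
same arguments as build_skb() and recycles buffers freed through
napi_consume_skb() from the per-CPU NAPI cache):

    skb = napi_build_skb(p, truesize);      /* instead of build_skb(p, truesize) */
    if (unlikely(!skb))
            return NULL;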

Thanks



Re: [PATCH V4 1/3] vDPA/ifcvf: deduce VIRTIO device ID when probe

2021-04-19 Thread Jason Wang



On 2021/4/19 2:33 PM, Zhu Lingshan wrote:

This commit deduces the VIRTIO device ID as the device type at probe
time, so ifcvf_vdpa_get_device_id() can simply return the ID.
ifcvf_vdpa_get_features() and ifcvf_vdpa_get_config_size()
can then work properly based on the device ID.

Signed-off-by: Zhu Lingshan 



Acked-by: Jason Wang 



---
  drivers/vdpa/ifcvf/ifcvf_base.h |  1 +
  drivers/vdpa/ifcvf/ifcvf_main.c | 27 +++
  2 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_base.h b/drivers/vdpa/ifcvf/ifcvf_base.h
index b2eeb16b9c2c..1c04cd256fa7 100644
--- a/drivers/vdpa/ifcvf/ifcvf_base.h
+++ b/drivers/vdpa/ifcvf/ifcvf_base.h
@@ -84,6 +84,7 @@ struct ifcvf_hw {
u32 notify_off_multiplier;
u64 req_features;
u64 hw_features;
+   u32 dev_type;
struct virtio_pci_common_cfg __iomem *common_cfg;
void __iomem *net_cfg;
struct vring_info vring[IFCVF_MAX_QUEUE_PAIRS * 2];
diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index 44d7586019da..66927ec81fa5 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -323,19 +323,9 @@ static u32 ifcvf_vdpa_get_generation(struct vdpa_device 
*vdpa_dev)
  
  static u32 ifcvf_vdpa_get_device_id(struct vdpa_device *vdpa_dev)

  {
-   struct ifcvf_adapter *adapter = vdpa_to_adapter(vdpa_dev);
-   struct pci_dev *pdev = adapter->pdev;
-   u32 ret = -ENODEV;
-
-   if (pdev->device < 0x1000 || pdev->device > 0x107f)
-   return ret;
-
-   if (pdev->device < 0x1040)
-   ret =  pdev->subsystem_device;
-   else
-   ret =  pdev->device - 0x1040;
+   struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev);
  
-	return ret;

+   return vf->dev_type;
  }
  
  static u32 ifcvf_vdpa_get_vendor_id(struct vdpa_device *vdpa_dev)

@@ -466,6 +456,19 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct 
pci_device_id *id)
pci_set_drvdata(pdev, adapter);
  
  	vf = &adapter->vf;

+
+   /* This driver drives both modern virtio devices and transitional
+* devices in modern mode.
+* vDPA requires feature bit VIRTIO_F_ACCESS_PLATFORM,
+* so legacy devices and transitional devices in legacy
+* mode will not work for vDPA, this driver will not
+* drive devices with legacy interface.
+*/
+   if (pdev->device < 0x1040)
+   vf->dev_type =  pdev->subsystem_device;
+   else
+   vf->dev_type =  pdev->device - 0x1040;
+
vf->base = pcim_iomap_table(pdev);
  
  	adapter->pdev = pdev;




Re: [PATCH net-next v2 2/5] virtio-net: separate rx/tx coalescing moderation cmds

2023-11-08 Thread Jason Wang
On Thu, Nov 2, 2023 at 9:10 PM Heng Qi  wrote:
>
> This patch separates the rx and tx global coalescing moderation
> commands to support netdim switches in subsequent patches.
>
> Signed-off-by: Heng Qi 

Acked-by: Jason Wang 

Thanks

> ---
> v1->v2:
> - Add 'i' definition.
>
>  drivers/net/virtio_net.c | 31 ---
>  1 file changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 0ad2894e6a5e..0285301caf78 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3266,10 +3266,10 @@ static int virtnet_get_link_ksettings(struct 
> net_device *dev,
> return 0;
>  }
>
> -static int virtnet_send_notf_coal_cmds(struct virtnet_info *vi,
> -  struct ethtool_coalesce *ec)
> +static int virtnet_send_tx_notf_coal_cmds(struct virtnet_info *vi,
> + struct ethtool_coalesce *ec)
>  {
> -   struct scatterlist sgs_tx, sgs_rx;
> +   struct scatterlist sgs_tx;
> int i;
>
> vi->ctrl->coal_tx.tx_usecs = cpu_to_le32(ec->tx_coalesce_usecs);
> @@ -3289,6 +3289,15 @@ static int virtnet_send_notf_coal_cmds(struct 
> virtnet_info *vi,
> vi->sq[i].intr_coal.max_packets = ec->tx_max_coalesced_frames;
> }
>
> +   return 0;
> +}
> +
> +static int virtnet_send_rx_notf_coal_cmds(struct virtnet_info *vi,
> + struct ethtool_coalesce *ec)
> +{
> +   struct scatterlist sgs_rx;
> +   int i;
> +
> vi->ctrl->coal_rx.rx_usecs = cpu_to_le32(ec->rx_coalesce_usecs);
> vi->ctrl->coal_rx.rx_max_packets = 
> cpu_to_le32(ec->rx_max_coalesced_frames);
> sg_init_one(&sgs_rx, &vi->ctrl->coal_rx, sizeof(vi->ctrl->coal_rx));
> @@ -3309,6 +3318,22 @@ static int virtnet_send_notf_coal_cmds(struct 
> virtnet_info *vi,
> return 0;
>  }
>
> +static int virtnet_send_notf_coal_cmds(struct virtnet_info *vi,
> +  struct ethtool_coalesce *ec)
> +{
> +   int err;
> +
> +   err = virtnet_send_tx_notf_coal_cmds(vi, ec);
> +   if (err)
> +   return err;
> +
> +   err = virtnet_send_rx_notf_coal_cmds(vi, ec);
> +   if (err)
> +   return err;
> +
> +   return 0;
> +}
> +
>  static int virtnet_send_ctrl_coal_vq_cmd(struct virtnet_info *vi,
>  u16 vqn, u32 max_usecs, u32 
> max_packets)
>  {
> --
> 2.19.1.6.gb485710b
>
>




Re: [PATCH net-next v2 4/5] virtio-net: support rx netdim

2023-11-08 Thread Jason Wang
On Thu, Nov 2, 2023 at 9:10 PM Heng Qi  wrote:
>
> By comparing the traffic information in the complete napi processes,
> let the virtio-net driver automatically adjust the coalescing
> moderation parameters of each receive queue.
>
> Signed-off-by: Heng Qi 
> ---
> v1->v2:
> - Improved the judgment of dim switch conditions.
> - Fix the safe problem of work thread.

Nit: it's better to be more concrete here, e.g. what kind of "safe
problem" it is.


>
>  drivers/net/virtio_net.c | 187 ++-
>  1 file changed, 165 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 69fe09e99b3c..5473aa1ee5cd 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -172,6 +173,17 @@ struct receive_queue {
>
> struct virtnet_rq_stats stats;
>
> +   /* The number of rx notifications */
> +   u16 calls;
> +
> +   /* Is dynamic interrupt moderation enabled? */
> +   bool dim_enabled;
> +
> +   /* Dynamic Interrupt Moderation */
> +   struct dim dim;
> +
> +   u32 packets_in_napi;
> +
> struct virtnet_interrupt_coalesce intr_coal;
>
> /* Chain pages by the private ptr. */
> @@ -305,6 +317,9 @@ struct virtnet_info {
> u8 duplex;
> u32 speed;
>
> +   /* Is rx dynamic interrupt moderation enabled? */
> +   bool rx_dim_enabled;
> +
> /* Interrupt coalescing settings */
> struct virtnet_interrupt_coalesce intr_coal_tx;
> struct virtnet_interrupt_coalesce intr_coal_rx;
> @@ -2001,6 +2016,7 @@ static void skb_recv_done(struct virtqueue *rvq)
> struct virtnet_info *vi = rvq->vdev->priv;
> struct receive_queue *rq = &vi->rq[vq2rxq(rvq)];
>
> +   rq->calls++;
> virtqueue_napi_schedule(&rq->napi, rvq);
>  }
>
> @@ -2141,6 +2157,26 @@ static void virtnet_poll_cleantx(struct receive_queue 
> *rq)
> }
>  }
>
> +static void virtnet_rx_dim_work(struct work_struct *work);
> +
> +static void virtnet_rx_dim_update(struct virtnet_info *vi, struct 
> receive_queue *rq)
> +{
> +   struct dim_sample cur_sample = {};
> +
> +   if (!rq->packets_in_napi)
> +   return;
> +
> +   u64_stats_update_begin(&rq->stats.syncp);
> +   dim_update_sample(rq->calls,
> + u64_stats_read(&rq->stats.packets),
> + u64_stats_read(&rq->stats.bytes),
> + &cur_sample);
> +   u64_stats_update_end(&rq->stats.syncp);
> +
> +   net_dim(&rq->dim, cur_sample);
> +   rq->packets_in_napi = 0;
> +}
> +
>  static int virtnet_poll(struct napi_struct *napi, int budget)
>  {
> struct receive_queue *rq =
> @@ -2149,17 +2185,22 @@ static int virtnet_poll(struct napi_struct *napi, int 
> budget)
> struct send_queue *sq;
> unsigned int received;
> unsigned int xdp_xmit = 0;
> +   bool napi_complete;
>
> virtnet_poll_cleantx(rq);
>
> received = virtnet_receive(rq, budget, &xdp_xmit);
> +   rq->packets_in_napi += received;
>
> if (xdp_xmit & VIRTIO_XDP_REDIR)
> xdp_do_flush();
>
> /* Out of packets? */
> -   if (received < budget)
> -   virtqueue_napi_complete(napi, rq->vq, received);
> +   if (received < budget) {
> +   napi_complete = virtqueue_napi_complete(napi, rq->vq, 
> received);
> +   if (napi_complete && rq->dim_enabled)
> +   virtnet_rx_dim_update(vi, rq);
> +   }
>
> if (xdp_xmit & VIRTIO_XDP_TX) {
> sq = virtnet_xdp_get_sq(vi);
> @@ -2179,6 +2220,7 @@ static void virtnet_disable_queue_pair(struct 
> virtnet_info *vi, int qp_index)
> virtnet_napi_tx_disable(&vi->sq[qp_index].napi);
> napi_disable(&vi->rq[qp_index].napi);
> xdp_rxq_info_unreg(&vi->rq[qp_index].xdp_rxq);
> +   cancel_work_sync(&vi->rq[qp_index].dim.work);
>  }
>
>  static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
> @@ -2196,6 +2238,9 @@ static int virtnet_enable_queue_pair(struct 
> virtnet_info *vi, int qp_index)
> if (err < 0)
> goto err_xdp_reg_mem_model;
>
> +   INIT_WORK(&vi->rq[qp_index].dim.work, virtnet_rx_dim_work);

Any reason we need to do the INIT_WORK here but not in probe?

> +   vi->rq[qp_index].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> +
> virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi);
> virtnet_napi_tx_enable(vi, vi->sq[qp_index].vq, 
> &vi->sq[qp_index].napi);
>
> @@ -2393,8 +2438,10 @@ static int virtnet_rx_resize(struct virtnet_info *vi,
>
> qindex = rq - vi->rq;
>
> -   if (running)
> +   if (running) {
> napi_disable(&rq->napi);
> +   cancel_work_sync(&rq->dim.work);
> +   }
>
> err 

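For readers unfamiliar with the dim API this series builds on, here is a
minimal sketch of the sampling contract, reusing the patch's receive_queue
fields (the helper name is invented for illustration and locking is
omitted):

#include <linux/dim.h>

/* Each completed napi poll feeds one sample; net_dim() compares it
 * with the previous sample and, when the moderation profile should
 * change, schedules dim->work -- virtnet_rx_dim_work in this series --
 * which then applies the new parameters via the control virtqueue. */
static void rx_dim_sample(struct receive_queue *rq)
{
	struct dim_sample sample = {};

	/* calls counts rx notifications; packets/bytes are running totals */
	dim_update_sample(rq->calls,
			  u64_stats_read(&rq->stats.packets),
			  u64_stats_read(&rq->stats.bytes),
			  &sample);

	net_dim(&rq->dim, sample);	/* may schedule rq->dim.work */
}
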
Re: [PATCH net-next v2 5/5] virtio-net: return -EOPNOTSUPP for adaptive-tx

2023-11-08 Thread Jason Wang
On Thu, Nov 2, 2023 at 9:10 PM Heng Qi  wrote:
>
> We do not currently support tx dim, so respond to -EOPNOTSUPP.
>
> Signed-off-by: Heng Qi 
> ---
> v1->v2:
> - Use -EOPNOTSUPP instead of specific implementation.
>
>  drivers/net/virtio_net.c | 29 ++---
>  1 file changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 5473aa1ee5cd..03edeadd0725 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3364,9 +3364,15 @@ static int virtnet_get_link_ksettings(struct 
> net_device *dev,
>  static int virtnet_send_tx_notf_coal_cmds(struct virtnet_info *vi,
>   struct ethtool_coalesce *ec)
>  {
> +   bool tx_ctrl_dim_on = !!ec->use_adaptive_tx_coalesce;
> struct scatterlist sgs_tx;
> int i;
>
> +   if (tx_ctrl_dim_on) {
> +   pr_debug("Failed to enable adaptive-tx, which is not 
> supported\n");
> +   return -EOPNOTSUPP;
> +   }

When can we hit this?

Thanks

> +
> vi->ctrl->coal_tx.tx_usecs = cpu_to_le32(ec->tx_coalesce_usecs);
> vi->ctrl->coal_tx.tx_max_packets = 
> cpu_to_le32(ec->tx_max_coalesced_frames);
> sg_init_one(&sgs_tx, &vi->ctrl->coal_tx, sizeof(vi->ctrl->coal_tx));
> @@ -3497,6 +3503,25 @@ static int virtnet_send_rx_notf_coal_vq_cmds(struct 
> virtnet_info *vi,
> return 0;
>  }
>
> +static int virtnet_send_tx_notf_coal_vq_cmds(struct virtnet_info *vi,
> +struct ethtool_coalesce *ec,
> +u16 queue)
> +{
> +   bool tx_ctrl_dim_on = !!ec->use_adaptive_tx_coalesce;
> +   int err;
> +
> +   if (tx_ctrl_dim_on) {
> +   pr_debug("Enabling adaptive-tx for txq%d is not supported\n", 
> queue);
> +   return -EOPNOTSUPP;
> +   }
> +
> +   err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, queue,
> +  ec->tx_coalesce_usecs,
> +  ec->tx_max_coalesced_frames);
> +
> +   return err;
> +}
> +
>  static int virtnet_send_notf_coal_vq_cmds(struct virtnet_info *vi,
>   struct ethtool_coalesce *ec,
>   u16 queue)
> @@ -3507,9 +3532,7 @@ static int virtnet_send_notf_coal_vq_cmds(struct 
> virtnet_info *vi,
> if (err)
> return err;
>
> -   err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, queue,
> -  ec->tx_coalesce_usecs,
> -  ec->tx_max_coalesced_frames);
> +   err = virtnet_send_tx_notf_coal_vq_cmds(vi, ec, queue);
> if (err)
> return err;
>
> --
> 2.19.1.6.gb485710b
>




Re: [PATCH net-next v2 04/21] virtio_net: move core structures to virtio_net.h

2023-11-08 Thread Jason Wang
On Tue, Nov 7, 2023 at 11:13 AM Xuan Zhuo  wrote:
>
> Move the definitions of some core structures (send_queue,
> receive_queue, virtnet_info) and the related structure definitions
> into the virtio_net.h file.
>
> They will be used by the other C files.
>
> Signed-off-by: Xuan Zhuo 
> ---

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next v2 05/21] virtio_net: add prefix virtnet to all struct inside virtio_net.h

2023-11-08 Thread Jason Wang
On Tue, Nov 7, 2023 at 11:13 AM Xuan Zhuo  wrote:
>
> We moved some structures to the header file, but these structures are not
> prefixed with virtnet. This patch adds the virtnet prefix to them.
>
> Signed-off-by: Xuan Zhuo 
> ---

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next v2 08/21] virtio_net: sq support premapped mode

2023-11-08 Thread Jason Wang
On Tue, Nov 7, 2023 at 11:12 AM Xuan Zhuo  wrote:
>
> If the xsk is enabling, the xsk tx will share the send queue.
> But the xsk requires that the send queue use the premapped mode.
> So the send queue must support premapped mode.
>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/net/virtio/main.c   | 163 
>  drivers/net/virtio/virtio_net.h |  16 
>  2 files changed, 163 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> index 16e75c08639e..f052db459156 100644
> --- a/drivers/net/virtio/main.c
> +++ b/drivers/net/virtio/main.c
> @@ -46,6 +46,7 @@ module_param(napi_tx, bool, 0644);
>  #define VIRTIO_XDP_REDIR   BIT(1)
>
>  #define VIRTIO_XDP_FLAGBIT(0)
> +#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG)
>
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>
> @@ -167,6 +168,29 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
> return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
>  }
>
> +static inline void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data)
> +{
> +   struct virtnet_sq_dma *next, *head;
> +
> +   head = (void *)((unsigned long)data & ~VIRTIO_XMIT_DATA_MASK);
> +
> +   data = head->data;
> +
> +   while (head) {
> +   virtqueue_dma_unmap_single_attrs(sq->vq, head->addr, 
> head->len,
> +DMA_TO_DEVICE, 0);
> +
> +   next = head->next;
> +
> +   head->next = sq->dmainfo.free;
> +   sq->dmainfo.free = head;
> +
> +   head = next;
> +   }
> +
> +   return data;
> +}
> +
>  static void __free_old_xmit(struct virtnet_sq *sq, bool in_napi,
> u64 *bytes, u64 *packets)
>  {
> @@ -175,14 +199,24 @@ static void __free_old_xmit(struct virtnet_sq *sq, bool 
> in_napi,
>
> while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> if (!is_xdp_frame(ptr)) {
> -   struct sk_buff *skb = ptr;
> +   struct sk_buff *skb;
> +
> +   if (sq->do_dma)
> +   ptr = virtnet_sq_unmap(sq, ptr);
> +
> +   skb = ptr;
>
> pr_debug("Sent skb %p\n", skb);
>
> *bytes += skb->len;
> napi_consume_skb(skb, in_napi);
> } else {
> -   struct xdp_frame *frame = ptr_to_xdp(ptr);
> +   struct xdp_frame *frame;
> +
> +   if (sq->do_dma)
> +   ptr = virtnet_sq_unmap(sq, ptr);
> +
> +   frame = ptr_to_xdp(ptr);
>
> *bytes += xdp_get_frame_len(frame);
> xdp_return_frame(frame);
> @@ -567,22 +601,104 @@ static void *virtnet_rq_alloc(struct virtnet_rq *rq, 
> u32 size, gfp_t gfp)
> return buf;
>  }
>
> -static void virtnet_rq_set_premapped(struct virtnet_info *vi)
> +static int virtnet_sq_set_premapped(struct virtnet_sq *sq)
>  {
> -   int i;
> +   struct virtnet_sq_dma *d;
> +   int err, size, i;
>
> -   /* disable for big mode */
> -   if (!vi->mergeable_rx_bufs && vi->big_packets)
> -   return;
> +   size = virtqueue_get_vring_size(sq->vq);
> +
> +   size += MAX_SKB_FRAGS + 2;

Btw, the dmainfo seems per sg? If I'm correct, how can vq_size +
MAX_SKB_FRAGS + 2 work?

> +
> +   sq->dmainfo.head = kcalloc(size, sizeof(*sq->dmainfo.head), 
> GFP_KERNEL);
> +   if (!sq->dmainfo.head)
> +   return -ENOMEM;
> +
> +   err = virtqueue_set_dma_premapped(sq->vq);
> +   if (err) {
> +   kfree(sq->dmainfo.head);
> +   return err;
> +   }

Allocating after set_dma_premapped() seems easier.

Btw, is there a benchmark of TX PPS just for this patch to demonstrate
the performance impact?

> +
> +   sq->dmainfo.free = NULL;
> +
> +   sq->do_dma = true;
> +
> +   for (i = 0; i < size; ++i) {
> +   d = &sq->dmainfo.head[i];
> +
> +   d->next = sq->dmainfo.free;
> +   sq->dmainfo.free = d;
> +   }
> +
> +   return 0;
> +}
> +
> +static void virtnet_set_premapped(struct virtnet_info *vi)
> +{
> +   int i;
>
> for (i = 0; i < vi->max_queue_pairs; i++) {
> -   if (virtqueue_set_dma_premapped(vi->rq[i].vq))
> -   continue;
> +   virtnet_sq_set_premapped(&vi->sq[i]);
>
> -   vi->rq[i].do_dma = true;
> +   /* disable for big mode */
> +   if (vi->mergeable_rx_bufs || !vi->big_packets) {
> +   if (!virtqueue_set_dma_premapped(vi->rq[i].vq))
> +   vi->rq[i].do_dma = true;

How about keeping a virtnet_rq_set_premapped() and calling it here?

It seems cleaner.

Btw, the big mode support for pre-mapping is still worthwhile.

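As background for the premapped mode under review, a minimal sketch of the
virtqueue DMA API flow for a single tx buffer (simplified; the series has
to track one mapping per sg entry, which is what the dmainfo free list is
for, and the helper name here is invented):

/* Sketch of a premapped tx submit. virtqueue_set_dma_premapped() must
 * have been called once on the queue beforehand; after that the driver
 * owns all mapping/unmapping and the sg carries a ready-made address. */
static int sq_premapped_xmit(struct virtqueue *vq, void *buf, size_t len)
{
	struct scatterlist sg;
	dma_addr_t addr;
	int err;

	addr = virtqueue_dma_map_single_attrs(vq, buf, len,
					      DMA_TO_DEVICE, 0);
	if (virtqueue_dma_mapping_error(vq, addr))
		return -ENOMEM;

	sg_init_table(&sg, 1);
	sg.dma_address = addr;	/* premapped: core uses this directly */
	sg.length = len;

	err = virtqueue_add_outbuf(vq, &sg, 1, buf, GFP_ATOMIC);
	if (err)
		virtqueue_dma_unmap_single_attrs(vq, addr, len,
						 DMA_TO_DEVICE, 0);
	return err;
}
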
Re: [PATCH net-next v2 08/21] virtio_net: sq support premapped mode

2023-11-13 Thread Jason Wang
On Thu, Nov 9, 2023 at 7:06 PM Xuan Zhuo  wrote:
>
> On Thu, 9 Nov 2023 14:37:38 +0800, Jason Wang  wrote:
> > On Tue, Nov 7, 2023 at 11:12 AM Xuan Zhuo  
> > wrote:
> > >
> > > If the xsk is enabling, the xsk tx will share the send queue.
> > > But the xsk requires that the send queue use the premapped mode.
> > > So the send queue must support premapped mode.
> > >
> > > Signed-off-by: Xuan Zhuo 
> > > ---
> > >  drivers/net/virtio/main.c   | 163 
> > >  drivers/net/virtio/virtio_net.h |  16 
> > >  2 files changed, 163 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> > > index 16e75c08639e..f052db459156 100644
> > > --- a/drivers/net/virtio/main.c
> > > +++ b/drivers/net/virtio/main.c
> > > @@ -46,6 +46,7 @@ module_param(napi_tx, bool, 0644);
> > >  #define VIRTIO_XDP_REDIR   BIT(1)
> > >
> > >  #define VIRTIO_XDP_FLAGBIT(0)
> > > +#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG)
> > >
> > >  #define VIRTNET_DRIVER_VERSION "1.0.0"
> > >
> > > @@ -167,6 +168,29 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
> > > return (struct xdp_frame *)((unsigned long)ptr & 
> > > ~VIRTIO_XDP_FLAG);
> > >  }
> > >
> > > +static inline void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data)
> > > +{
> > > +   struct virtnet_sq_dma *next, *head;
> > > +
> > > +   head = (void *)((unsigned long)data & ~VIRTIO_XMIT_DATA_MASK);
> > > +
> > > +   data = head->data;
> > > +
> > > +   while (head) {
> > > +   virtqueue_dma_unmap_single_attrs(sq->vq, head->addr, 
> > > head->len,
> > > +DMA_TO_DEVICE, 0);
> > > +
> > > +   next = head->next;
> > > +
> > > +   head->next = sq->dmainfo.free;
> > > +   sq->dmainfo.free = head;
> > > +
> > > +   head = next;
> > > +   }
> > > +
> > > +   return data;
> > > +}
> > > +
> > >  static void __free_old_xmit(struct virtnet_sq *sq, bool in_napi,
> > > u64 *bytes, u64 *packets)
> > >  {
> > > @@ -175,14 +199,24 @@ static void __free_old_xmit(struct virtnet_sq *sq, 
> > > bool in_napi,
> > >
> > > while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> > > if (!is_xdp_frame(ptr)) {
> > > -   struct sk_buff *skb = ptr;
> > > +   struct sk_buff *skb;
> > > +
> > > +   if (sq->do_dma)
> > > +   ptr = virtnet_sq_unmap(sq, ptr);
> > > +
> > > +   skb = ptr;
> > >
> > > pr_debug("Sent skb %p\n", skb);
> > >
> > > *bytes += skb->len;
> > > napi_consume_skb(skb, in_napi);
> > > } else {
> > > -   struct xdp_frame *frame = ptr_to_xdp(ptr);
> > > +   struct xdp_frame *frame;
> > > +
> > > +   if (sq->do_dma)
> > > +   ptr = virtnet_sq_unmap(sq, ptr);
> > > +
> > > +   frame = ptr_to_xdp(ptr);
> > >
> > > *bytes += xdp_get_frame_len(frame);
> > > xdp_return_frame(frame);
> > > @@ -567,22 +601,104 @@ static void *virtnet_rq_alloc(struct virtnet_rq 
> > > *rq, u32 size, gfp_t gfp)
> > > return buf;
> > >  }
> > >
> > > -static void virtnet_rq_set_premapped(struct virtnet_info *vi)
> > > +static int virtnet_sq_set_premapped(struct virtnet_sq *sq)
> > >  {
> > > -   int i;
> > > +   struct virtnet_sq_dma *d;
> > > +   int err, size, i;
> > >
> > > -   /* disable for big mode */
> > > -   if (!vi->mergeable_rx_bufs && vi->big_packets)
> > > -   return;
> > > +   size = virtqueue_get_vring_size(sq->vq);
> > > +
> > > +   size += MAX_SKB_FRAGS + 2;
> >
> > Btw, the dmainfo seems per sg? If I'm correct, how can vq_size +
> > MAX_SKB_FRAGS + 2 work?

Re: [PATCH net-next v2 08/21] virtio_net: sq support premapped mode

2023-11-13 Thread Jason Wang
On Tue, Nov 14, 2023 at 11:42 AM Xuan Zhuo  wrote:
>
> On Tue, 14 Nov 2023 11:26:42 +0800, Jason Wang  wrote:
> > On Thu, Nov 9, 2023 at 7:06 PM Xuan Zhuo  wrote:
> > >
> > > On Thu, 9 Nov 2023 14:37:38 +0800, Jason Wang  wrote:
> > > > On Tue, Nov 7, 2023 at 11:12 AM Xuan Zhuo  
> > > > wrote:
> > > > >
> > > > > If the xsk is enabling, the xsk tx will share the send queue.
> > > > > But the xsk requires that the send queue use the premapped mode.
> > > > > So the send queue must support premapped mode.
> > > > >
> > > > > Signed-off-by: Xuan Zhuo 
> > > > > ---
> > > > >  drivers/net/virtio/main.c   | 163 
> > > > > 
> > > > >  drivers/net/virtio/virtio_net.h |  16 
> > > > >  2 files changed, 163 insertions(+), 16 deletions(-)
> > > > >

[...]

> > > >
> > > > I think we need to seek a way to reuse what has been stored by virtio
> > > > core. It should be much more efficient.
> > >
> > >
> > > Yes.
> > >
> > > But that is for net-next branch.
> > >
> > > Can we do that as a fix after that is merged to 6.8?
> >
> > We still have time. I would like to do it from the start.
>
>
> I want to finish the job including the new AF_XDP ZC feature,
> because that must wait for the merge window.
> Based on that, the optimization work can be done at any time.
>
> If we start from the new virtio prep work, that can be merged into 6.8,
> and the AF_XDP zc must wait for 6.9, right?

It can be part of this series. Or anything I missed?

My understanding is that, since the information is handy, it just
requires new helpers. So I don't expect it needs a large series.

Thanks

>
> Thanks
>


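To make the "reuse what the virtio core has stored" idea concrete: the
split and packed rings already record a DMA address and length per
descriptor (struct vring_desc_extra), so the driver-side dmainfo array
duplicates that state. A purely hypothetical helper shape -- neither the
name nor the signature exists in the virtio core today -- could let
completion hand the mappings back:

/* HYPOTHETICAL sketch only: virtqueue_get_buf_dma() is invented here.
 * It illustrates how a core helper could expose the per-descriptor
 * addr/len already kept in vring_desc_extra, so the driver's unmap
 * loop would need no parallel bookkeeping of its own. */
struct vq_dma_info {
	dma_addr_t addr;
	u32 len;
};

bool virtqueue_get_buf_dma(struct virtqueue *vq, unsigned int i,
			   struct vq_dma_info *info);	/* invented */

static void unmap_completed_buf(struct virtqueue *vq)
{
	struct vq_dma_info d;
	unsigned int i = 0;

	while (virtqueue_get_buf_dma(vq, i++, &d))
		virtqueue_dma_unmap_single_attrs(vq, d.addr, d.len,
						 DMA_TO_DEVICE, 0);
}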


Re: [PATCH net-next v2 08/21] virtio_net: sq support premapped mode

2023-11-13 Thread Jason Wang
On Tue, Nov 14, 2023 at 11:59 AM Xuan Zhuo  wrote:
>
> On Tue, 14 Nov 2023 11:55:52 +0800, Jason Wang  wrote:
> > On Tue, Nov 14, 2023 at 11:42 AM Xuan Zhuo  
> > wrote:
> > >
> > > On Tue, 14 Nov 2023 11:26:42 +0800, Jason Wang  
> > > wrote:
> > > > On Thu, Nov 9, 2023 at 7:06 PM Xuan Zhuo  
> > > > wrote:
> > > > >
> > > > > On Thu, 9 Nov 2023 14:37:38 +0800, Jason Wang  
> > > > > wrote:
> > > > > > On Tue, Nov 7, 2023 at 11:12 AM Xuan Zhuo 
> > > > > >  wrote:
> > > > > > >
> > > > > > > If the xsk is enabling, the xsk tx will share the send queue.
> > > > > > > But the xsk requires that the send queue use the premapped mode.
> > > > > > > So the send queue must support premapped mode.
> > > > > > >
> > > > > > > Signed-off-by: Xuan Zhuo 
> > > > > > > ---
> > > > > > >  drivers/net/virtio/main.c   | 163 
> > > > > > > 
> > > > > > >  drivers/net/virtio/virtio_net.h |  16 
> > > > > > >  2 files changed, 163 insertions(+), 16 deletions(-)
> > > > > > >
> >
> > [...]
> >
> > > > > >
> > > > > > I think we need to seek a way to reuse what has been stored by 
> > > > > > virtio
> > > > > > core. It should be much more efficient.
> > > > >
> > > > >
> > > > > Yes.
> > > > >
> > > > > But that is for net-next branch.
> > > > >
> > > > > Can we do that as a fix after that is merged to 6.8?
> > > >
> > > > We still have time. I would like to do it from the start.
> > >
> > >
> > > I want to finish the job including the new AF_XDP ZC feature,
> > > because that must wait for the merge window.
> > > Based on that, the optimization work can be done at any time.
> > >
> > > If we start from the new virtio prep work, that can be merged into 6.8,
> > > and the AF_XDP zc must wait for 6.9, right?
> >
> > It can be part of this series. Or anything I missed?
> >
> > My understanding is that, since the information is handy, it just
> > requires new helpers. So I don't expect it needs a large series.
>
> Now, this is being pushed to net-next.
>
> If we add a new virtio-core helper, that must be pushed to the virtio
> branch, and this patch set must wait for that.

I don't think so if it's just a matter of new helpers. The
acknowledgement from the virtio maintainer should be sufficient.

Let's just try and see?

Thanks

>
> Thanks.
>
>
> >
> > Thanks
> >
> > >
> > > Thanks
> > >
> >
>




Re: [PATCH net-next v4 4/4] virtio-net: support rx netdim

2023-11-21 Thread Jason Wang
On Mon, Nov 20, 2023 at 8:38 PM Heng Qi  wrote:
>
> By comparing the traffic information in the complete napi processes,
> let the virtio-net driver automatically adjust the coalescing
> moderation parameters of each receive queue.
>
> Signed-off-by: Heng Qi 
> ---
> v2->v3:
> - Some minor modifications.
>
> v1->v2:
> - Improved the judgment of dim switch conditions.
> - Cancel the work when vq reset.
>
>  drivers/net/virtio_net.c | 191 ++-
>  1 file changed, 169 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 69fe09e99b3c..bc32d5aae005 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -172,6 +173,17 @@ struct receive_queue {
>
> struct virtnet_rq_stats stats;
>
> +   /* The number of rx notifications */
> +   u16 calls;
> +
> +   /* Is dynamic interrupt moderation enabled? */
> +   bool dim_enabled;
> +
> +   /* Dynamic Interrupt Moderation */
> +   struct dim dim;
> +
> +   u32 packets_in_napi;
> +
> struct virtnet_interrupt_coalesce intr_coal;
>
> /* Chain pages by the private ptr. */
> @@ -305,6 +317,9 @@ struct virtnet_info {
> u8 duplex;
> u32 speed;
>
> +   /* Is rx dynamic interrupt moderation enabled? */
> +   bool rx_dim_enabled;
> +
> /* Interrupt coalescing settings */
> struct virtnet_interrupt_coalesce intr_coal_tx;
> struct virtnet_interrupt_coalesce intr_coal_rx;
> @@ -2001,6 +2016,7 @@ static void skb_recv_done(struct virtqueue *rvq)
> struct virtnet_info *vi = rvq->vdev->priv;
> struct receive_queue *rq = &vi->rq[vq2rxq(rvq)];
>
> +   rq->calls++;
> virtqueue_napi_schedule(&rq->napi, rvq);
>  }
>
> @@ -2141,6 +2157,26 @@ static void virtnet_poll_cleantx(struct receive_queue 
> *rq)
> }
>  }
>
> +static void virtnet_rx_dim_work(struct work_struct *work);
> +
> +static void virtnet_rx_dim_update(struct virtnet_info *vi, struct 
> receive_queue *rq)
> +{
> +   struct dim_sample cur_sample = {};
> +
> +   if (!rq->packets_in_napi)
> +   return;
> +
> +   u64_stats_update_begin(&rq->stats.syncp);
> +   dim_update_sample(rq->calls,
> + u64_stats_read(&rq->stats.packets),
> + u64_stats_read(&rq->stats.bytes),
> + &cur_sample);
> +   u64_stats_update_end(&rq->stats.syncp);
> +
> +   net_dim(&rq->dim, cur_sample);
> +   rq->packets_in_napi = 0;
> +}
> +
>  static int virtnet_poll(struct napi_struct *napi, int budget)
>  {
> struct receive_queue *rq =
> @@ -2149,17 +2185,22 @@ static int virtnet_poll(struct napi_struct *napi, int 
> budget)
> struct send_queue *sq;
> unsigned int received;
> unsigned int xdp_xmit = 0;
> +   bool napi_complete;
>
> virtnet_poll_cleantx(rq);
>
> received = virtnet_receive(rq, budget, &xdp_xmit);
> +   rq->packets_in_napi += received;
>
> if (xdp_xmit & VIRTIO_XDP_REDIR)
> xdp_do_flush();
>
> /* Out of packets? */
> -   if (received < budget)
> -   virtqueue_napi_complete(napi, rq->vq, received);
> +   if (received < budget) {
> +   napi_complete = virtqueue_napi_complete(napi, rq->vq, 
> received);
> +   if (napi_complete && rq->dim_enabled)
> +   virtnet_rx_dim_update(vi, rq);
> +   }
>
> if (xdp_xmit & VIRTIO_XDP_TX) {
> sq = virtnet_xdp_get_sq(vi);
> @@ -2179,6 +2220,7 @@ static void virtnet_disable_queue_pair(struct 
> virtnet_info *vi, int qp_index)
> virtnet_napi_tx_disable(&vi->sq[qp_index].napi);
> napi_disable(&vi->rq[qp_index].napi);
> xdp_rxq_info_unreg(&vi->rq[qp_index].xdp_rxq);
> +   cancel_work_sync(&vi->rq[qp_index].dim.work);
>  }
>
>  static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
> @@ -2196,6 +2238,9 @@ static int virtnet_enable_queue_pair(struct 
> virtnet_info *vi, int qp_index)
> if (err < 0)
> goto err_xdp_reg_mem_model;
>
> +   INIT_WORK(&vi->rq[qp_index].dim.work, virtnet_rx_dim_work);
> +   vi->rq[qp_index].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;

So in V2, you explained it can be done here but I want to know why it
must be done here.

For example, the refill_work is initialized in alloc_queues().

> +
> virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi);
> virtnet_napi_tx_enable(vi, vi->sq[qp_index].vq, 
> &vi->sq[qp_index].napi);
>
> @@ -2393,8 +2438,10 @@ static int virtnet_rx_resize(struct virtnet_info *vi,
>
> qindex = rq - vi->rq;
>
> -   if (running)
> +   if (running) {
> napi_disable(&rq->napi);
> cancel_work_sync(&rq->dim.work);

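The precedent Jason points to: refill_work is set up once in
virtnet_alloc_queues(). A sketch of moving the dim setup there, assuming
the patch's fields (the helper name is invented):

/* One-time dim initialization at queue allocation, mirroring
 * INIT_DELAYED_WORK(&vi->refill, refill_work) in virtnet_alloc_queues(),
 * so enabling/disabling a queue pair never re-initializes the work. */
static void virtnet_init_dim(struct virtnet_info *vi)
{
	int i;

	for (i = 0; i < vi->max_queue_pairs; i++) {
		INIT_WORK(&vi->rq[i].dim.work, virtnet_rx_dim_work);
		vi->rq[i].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
	}
}
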
Re: [PATCH net-next v4 4/4] virtio-net: support rx netdim

2023-11-21 Thread Jason Wang
On Wed, Nov 22, 2023 at 1:52 PM Jason Wang  wrote:
>
> On Mon, Nov 20, 2023 at 8:38 PM Heng Qi  wrote:
> >
> > By comparing the traffic information in the complete napi processes,
> > let the virtio-net driver automatically adjust the coalescing
> > moderation parameters of each receive queue.
> >
> > Signed-off-by: Heng Qi 
> > ---
> > v2->v3:
> > - Some minor modifications.
> >
> > v1->v2:
> > - Improved the judgment of dim switch conditions.
> > - Cancel the work when vq reset.
> >
> >  drivers/net/virtio_net.c | 191 ++-
> >  1 file changed, 169 insertions(+), 22 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 69fe09e99b3c..bc32d5aae005 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -19,6 +19,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -172,6 +173,17 @@ struct receive_queue {
> >
> > struct virtnet_rq_stats stats;
> >
> > +   /* The number of rx notifications */
> > +   u16 calls;
> > +
> > +   /* Is dynamic interrupt moderation enabled? */
> > +   bool dim_enabled;
> > +
> > +   /* Dynamic Interrupt Moderation */
> > +   struct dim dim;
> > +
> > +   u32 packets_in_napi;
> > +
> > struct virtnet_interrupt_coalesce intr_coal;
> >
> > /* Chain pages by the private ptr. */
> > @@ -305,6 +317,9 @@ struct virtnet_info {
> > u8 duplex;
> > u32 speed;
> >
> > +   /* Is rx dynamic interrupt moderation enabled? */
> > +   bool rx_dim_enabled;
> > +
> > /* Interrupt coalescing settings */
> > struct virtnet_interrupt_coalesce intr_coal_tx;
> > struct virtnet_interrupt_coalesce intr_coal_rx;
> > @@ -2001,6 +2016,7 @@ static void skb_recv_done(struct virtqueue *rvq)
> > struct virtnet_info *vi = rvq->vdev->priv;
> > struct receive_queue *rq = &vi->rq[vq2rxq(rvq)];
> >
> > +   rq->calls++;
> > virtqueue_napi_schedule(&rq->napi, rvq);
> >  }
> >
> > @@ -2141,6 +2157,26 @@ static void virtnet_poll_cleantx(struct 
> > receive_queue *rq)
> > }
> >  }
> >
> > +static void virtnet_rx_dim_work(struct work_struct *work);
> > +
> > +static void virtnet_rx_dim_update(struct virtnet_info *vi, struct 
> > receive_queue *rq)
> > +{
> > +   struct dim_sample cur_sample = {};
> > +
> > +   if (!rq->packets_in_napi)
> > +   return;
> > +
> > +   u64_stats_update_begin(&rq->stats.syncp);
> > +   dim_update_sample(rq->calls,
> > + u64_stats_read(&rq->stats.packets),
> > + u64_stats_read(&rq->stats.bytes),
> > + &cur_sample);
> > +   u64_stats_update_end(&rq->stats.syncp);
> > +
> > +   net_dim(&rq->dim, cur_sample);
> > +   rq->packets_in_napi = 0;
> > +}
> > +
> >  static int virtnet_poll(struct napi_struct *napi, int budget)
> >  {
> > struct receive_queue *rq =
> > @@ -2149,17 +2185,22 @@ static int virtnet_poll(struct napi_struct *napi, 
> > int budget)
> > struct send_queue *sq;
> > unsigned int received;
> > unsigned int xdp_xmit = 0;
> > +   bool napi_complete;
> >
> > virtnet_poll_cleantx(rq);
> >
> > received = virtnet_receive(rq, budget, &xdp_xmit);
> > +   rq->packets_in_napi += received;
> >
> > if (xdp_xmit & VIRTIO_XDP_REDIR)
> > xdp_do_flush();
> >
> > /* Out of packets? */
> > -   if (received < budget)
> > -   virtqueue_napi_complete(napi, rq->vq, received);
> > +   if (received < budget) {
> > +   napi_complete = virtqueue_napi_complete(napi, rq->vq, 
> > received);
> > +   if (napi_complete && rq->dim_enabled)
> > +   virtnet_rx_dim_update(vi, rq);
> > +   }
> >
> > if (xdp_xmit & VIRTIO_XDP_TX) {
> > sq = virtnet_xdp_get_sq(vi);
> > @@ -2179,6 +2220,7 @@ static void virtnet_disable_queue_pair(struct 
> > virtnet_info *vi, int qp_index)
> > virtnet_napi_tx_disable(&vi->sq[qp_index].napi);

Re: [PATCH net-next v6 4/5] virtio-net: add spin lock for ctrl cmd access

2023-12-05 Thread Jason Wang
On Tue, Dec 5, 2023 at 4:02 PM Heng Qi  wrote:
>
> Currently access to ctrl cmd is globally protected via rtnl_lock and works
> fine. But if dim work's access to ctrl cmd also holds rtnl_lock, deadlock
> may occur due to cancel_work_sync for dim work.

Can you explain why?

> Therefore, treating
> ctrl cmd as a separate protection object of the lock is the solution and
> the basis for the next patch.

Let's don't do that. Reasons are:

1) virtnet_send_command() may wait for cvq commands for an indefinite time
2) hold locks may complicate the future hardening works around cvq

Thanks




Re: [PATCH net-next v6 4/5] virtio-net: add spin lock for ctrl cmd access

2023-12-06 Thread Jason Wang
On Tue, Dec 5, 2023 at 7:06 PM Heng Qi  wrote:
>
>
>
> On 2023/12/5 4:35 PM, Jason Wang wrote:
> > On Tue, Dec 5, 2023 at 4:02 PM Heng Qi  wrote:
> >> Currently access to ctrl cmd is globally protected via rtnl_lock and works
> >> fine. But if dim work's access to ctrl cmd also holds rtnl_lock, deadlock
> >> may occur due to cancel_work_sync for dim work.
> > Can you explain why?
>
> For example, during the bus unbind operation, the following call stack
> occurs:
> virtnet_remove -> unregister_netdev -> rtnl_lock[1] -> virtnet_close ->
> cancel_work_sync -> virtnet_rx_dim_work -> rtnl_lock[2] (deadlock occurs).

Can we use rtnl_trylock() and reschedule the work?

>
> >> Therefore, treating
> >> ctrl cmd as a separate protection object of the lock is the solution and
> >> the basis for the next patch.
> > Let's don't do that. Reasons are:
> >
> > 1) virtnet_send_command() may wait for cvq commands for an indefinite time
>
> Yes, I took that into consideration. But ndo_set_rx_mode's need for an
> atomic
> environment rules out the mutex lock.

It is a "bug" that we need to fix.

>
> > 2) hold locks may complicate the future hardening works around cvq
>
> Agree, but I don't seem to have thought of a better way besides passing
> the lock.
> Do you have any other better ideas or suggestions?

So far no, you can refer to the past discussions, it might require the
collaboration from the uAPI and stack.

Thanks

>
> Thanks!
>
> >
> > Thanks
>




Re: [PATCH net-next v6 4/5] virtio-net: add spin lock for ctrl cmd access

2023-12-06 Thread Jason Wang
On Wed, Dec 6, 2023 at 9:03 PM Heng Qi  wrote:
>
>
>
> On 2023/12/6 8:27 PM, Paolo Abeni wrote:
> > On Tue, 2023-12-05 at 19:05 +0800, Heng Qi wrote:
> >> On 2023/12/5 4:35 PM, Jason Wang wrote:
> >>> On Tue, Dec 5, 2023 at 4:02 PM Heng Qi  wrote:
> >>>> Currently access to ctrl cmd is globally protected via rtnl_lock and 
> >>>> works
> >>>> fine. But if dim work's access to ctrl cmd also holds rtnl_lock, deadlock
> >>>> may occur due to cancel_work_sync for dim work.
> >>> Can you explain why?
> >> For example, during the bus unbind operation, the following call stack
> >> occurs:
> >> virtnet_remove -> unregister_netdev -> rtnl_lock[1] -> virtnet_close ->
> >> cancel_work_sync -> virtnet_rx_dim_work -> rtnl_lock[2] (deadlock occurs).
> >>
> >>>> Therefore, treating
> >>>> ctrl cmd as a separate protection object of the lock is the solution and
> >>>> the basis for the next patch.
> >>> Let's don't do that. Reasons are:
> >>>
> >>> 1) virtnet_send_command() may wait for cvq commands for an indefinite time
> >> Yes, I took that into consideration. But ndo_set_rx_mode's need for an
> >> atomic
> >> environment rules out the mutex lock.
> >>
> >>> 2) hold locks may complicate the future hardening works around cvq
> >> Agree, but I don't seem to have thought of a better way besides passing
> >> the lock.
> >> Do you have any other better ideas or suggestions?
> > What about:
> >
> > - using the rtnl lock only
> > - virtionet_close() invokes cancel_work(), without flushing the work
> > - virtnet_remove() calls flush_work() after unregister_netdev(),
> > outside the rtnl lock
> >
> > Should prevent both the deadlock and the UaF.
>
>
> Hi, Paolo and Jason!
>
> Thank you very much for your effective suggestions, but I found another
> solution[1],
> based on the ideas of rtnl_trylock and refill_work, which works very well:
>
> [1]
> +static void virtnet_rx_dim_work(struct work_struct *work)
> +{
> +struct dim *dim = container_of(work, struct dim, work);
> +struct receive_queue *rq = container_of(dim,
> +struct receive_queue, dim);
> +struct virtnet_info *vi = rq->vq->vdev->priv;
> +struct net_device *dev = vi->dev;
> +struct dim_cq_moder update_moder;
> +int i, qnum, err;
> +
> +if (!rtnl_trylock())
> +return;

Don't we need to reschedule here?

like

if (rq->dim_enabled)
   schedule_work()

?

Thanks

> +
> +for (i = 0; i < vi->curr_queue_pairs; i++) {
> +rq = &vi->rq[i];
> +dim = &rq->dim;
> +qnum = rq - vi->rq;
> +
> +if (!rq->dim_enabled)
> +continue;
> +
> +update_moder = net_dim_get_rx_moderation(dim->mode,
> dim->profile_ix);
> +if (update_moder.usec != rq->intr_coal.max_usecs ||
> +update_moder.pkts != rq->intr_coal.max_packets) {
> +err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
> +   update_moder.usec,
> +   update_moder.pkts);
> +if (err)
> +pr_debug("%s: Failed to send dim parameters on rxq%d\n",
> + dev->name, qnum);
> +dim->state = DIM_START_MEASURE;
> +}
> +}
> +
> +rtnl_unlock();
> +}
>
>
> In addition, other optimizations [2] have been tried, but perhaps because
> the work is scheduled sparsely, the retry condition is always satisfied,
> hurting performance, so [1] is the final solution:
>
> [2]
>
> +static void virtnet_rx_dim_work(struct work_struct *work)
> +{
> +struct dim *dim = container_of(work, struct dim, work);
> +struct receive_queue *rq = container_of(dim,
> +struct receive_queue, dim);
> +struct virtnet_info *vi = rq->vq->vdev->priv;
> +struct net_device *dev = vi->dev;
> +struct dim_cq_moder update_moder;
> +int i, qnum, err, count;
> +
> +if (!rtnl_trylock())
> +return;
> +retry:
> +count = vi->curr_queue_pairs;
> +for (i = 0; i < vi->curr_queue_pairs; i++) {
> +rq = &vi->rq[i];
> +dim = &rq->dim;
> +qnum = rq - vi->rq;
> +update_moder = net_dim_get_rx_moderation(dim->mode,
> dim->profile_ix);
> +if (update_moder.usec != rq->intr_coal.max_usecs ||
> +update_moder.pkts != rq->intr_coal.max_packets) {

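Paolo's proposal from the quoted mail, sketched (placement and the _sketch
names are illustrative only): close() merely cancels, so it never waits
under rtnl for a work that itself takes rtnl, and remove() flushes after
unregister_netdev() has released rtnl, so the work cannot touch freed
memory:

static int virtnet_close_sketch(struct net_device *dev)
{
	struct virtnet_info *vi = netdev_priv(dev);
	int i;

	for (i = 0; i < vi->max_queue_pairs; i++)
		cancel_work(&vi->rq[i].dim.work); /* no _sync: no deadlock */

	/* ... existing close path ... */
	return 0;
}

static void virtnet_remove_sketch(struct virtio_device *vdev)
{
	struct virtnet_info *vi = vdev->priv;
	int i;

	unregister_netdev(vi->dev);	/* takes and releases rtnl */

	for (i = 0; i < vi->max_queue_pairs; i++)
		flush_work(&vi->rq[i].dim.work); /* outside rtnl: no UaF */

	/* ... rest of teardown ... */
}
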
Re: [PATCH net-next v6 4/5] virtio-net: add spin lock for ctrl cmd access

2023-12-06 Thread Jason Wang
On Thu, Dec 7, 2023 at 12:47 PM Heng Qi  wrote:
>
>
>
> On 2023/12/7 12:19 PM, Jason Wang wrote:
> > On Wed, Dec 6, 2023 at 9:03 PM Heng Qi  wrote:
> >>
> >>
> >> On 2023/12/6 8:27 PM, Paolo Abeni wrote:
> >>> On Tue, 2023-12-05 at 19:05 +0800, Heng Qi wrote:
> >>>> On 2023/12/5 4:35 PM, Jason Wang wrote:
> >>>>> On Tue, Dec 5, 2023 at 4:02 PM Heng Qi  wrote:
> >>>>>> Currently access to ctrl cmd is globally protected via rtnl_lock and 
> >>>>>> works
> >>>>>> fine. But if dim work's access to ctrl cmd also holds rtnl_lock, 
> >>>>>> deadlock
> >>>>>> may occur due to cancel_work_sync for dim work.
> >>>>> Can you explain why?
> >>>> For example, during the bus unbind operation, the following call stack
> >>>> occurs:
> >>>> virtnet_remove -> unregister_netdev -> rtnl_lock[1] -> virtnet_close ->
> >>>> cancel_work_sync -> virtnet_rx_dim_work -> rtnl_lock[2] (deadlock 
> >>>> occurs).
> >>>>
> >>>>>> Therefore, treating
> >>>>>> ctrl cmd as a separate protection object of the lock is the solution 
> >>>>>> and
> >>>>>> the basis for the next patch.
> >>>>> Let's don't do that. Reasons are:
> >>>>>
> >>>>> 1) virtnet_send_command() may wait for cvq commands for an indefinite 
> >>>>> time
> >>>> Yes, I took that into consideration. But ndo_set_rx_mode's need for an
> >>>> atomic
> >>>> environment rules out the mutex lock.
> >>>>
> >>>>> 2) hold locks may complicate the future hardening works around cvq
> >>>> Agree, but I don't seem to have thought of a better way besides passing
> >>>> the lock.
> >>>> Do you have any other better ideas or suggestions?
> >>> What about:
> >>>
> >>> - using the rtnl lock only
> >>> - virtionet_close() invokes cancel_work(), without flushing the work
> >>> - virtnet_remove() calls flush_work() after unregister_netdev(),
> >>> outside the rtnl lock
> >>>
> >>> Should prevent both the deadlock and the UaF.
> >>
> >> Hi, Paolo and Jason!
> >>
> >> Thank you very much for your effective suggestions, but I found another
> >> solution[1],
> >> based on the ideas of rtnl_trylock and refill_work, which works very well:
> >>
> >> [1]
> >> +static void virtnet_rx_dim_work(struct work_struct *work)
> >> +{
> >> +struct dim *dim = container_of(work, struct dim, work);
> >> +struct receive_queue *rq = container_of(dim,
> >> +struct receive_queue, dim);
> >> +struct virtnet_info *vi = rq->vq->vdev->priv;
> >> +struct net_device *dev = vi->dev;
> >> +struct dim_cq_moder update_moder;
> >> +int i, qnum, err;
> >> +
> >> +if (!rtnl_trylock())
> >> +return;
> > Don't we need to reschedule here?
> >
> > like
> >
> > if (rq->dim_enabled)
> > schedule_work()
> >
> > ?
>
> I think no, we don't need this.
>
> The work of each queue will be called by "net_dim()->schedule_work()"
> when napi traffic changes (before schedule_work(), the dim->profile_ix
> of the corresponding rxq has been updated).
> So we only need to traverse and update the profiles of all rxqs in the
> work that obtains the rtnl_lock.

Ok, let's have a comment here then.

Thanks

>
> Thanks!
>
> >
> > Thanks
> >
> >> +
> >> +for (i = 0; i < vi->curr_queue_pairs; i++) {
> >> +rq = &vi->rq[i];
> >> +dim = &rq->dim;
> >> +qnum = rq - vi->rq;
> >> +
> >> +if (!rq->dim_enabled)
> >> +continue;
> >> +
> >> +update_moder = net_dim_get_rx_moderation(dim->mode,
> >> dim->profile_ix);
> >> +if (update_moder.usec != rq->intr_coal.max_usecs ||
> >> +update_moder.pkts != rq->intr_coal.max_packets) {
> >> +err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
> >> +   update_moder.usec,
> >> +   update_moder.pkts);
> >> +if (err)

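The comment being requested could read roughly as follows (wording
assumed, condensing Heng Qi's explanation above):

if (!rtnl_trylock())
	/* No reschedule needed: net_dim() re-schedules each rxq's work
	 * whenever that queue's napi traffic changes, after updating
	 * dim->profile_ix, and whichever work instance wins the rtnl
	 * lock walks every rxq below, so this queue is not lost. */
	return;
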
Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic

2023-12-21 Thread Jason Wang
On Thu, Dec 21, 2023 at 11:06 PM Willem de Bruijn
 wrote:
>
> Heng Qi wrote:
> >
> >
> > On 2023/12/20 10:45 PM, Willem de Bruijn wrote:
> > > Heng Qi wrote:
> > >> virtio-net has two ways to switch napi_tx: one is through the
> > >> module parameter, and the other is through coalescing parameter
> > >> settings (provided that the nic status is down).
> > >>
> > >> Sometimes we face performance regression caused by napi_tx,
> > >> then we need to switch napi_tx when debugging. However, the
> > >> existing methods are a bit troublesome, such as needing to
> > >> reload the driver or turn off the network card.

Why is this troublesome? We don't need to turn off the card, it's just
a toggling of the interface.

This ends up with pretty simple code.

> So try to make
> > >> this update.
> > >>
> > >> Signed-off-by: Heng Qi 
> > >> Reviewed-by: Xuan Zhuo 
> > > The commit does not explain why it is safe to do so.
> >
> > virtnet_napi_tx_disable ensures that already scheduled tx napi ends and
> > no new tx napi will be scheduled.
> >
> > Afterwards, if the __netif_tx_lock_bh lock is held, the stack cannot
> > send the packet.
> >
> > Then we can safely toggle the weight to indicate where to clear the buffers.
> >
> > >
> > > The tx-napi weights are not really weights: it is a boolean whether
> > > napi is used for transmit cleaning, or whether packets are cleaned
> > > in ndo_start_xmit.
> >
> > Right.
> >
> > >
> > > There certainly are some subtle issues with regard to pausing/waking
> > > queues when switching between modes.
> >
> > What are "subtle issues" and if there are any, we find them.
>
> A single runtime test is not sufficient to exercise all edge cases.
>
> Please don't leave it to reviewers to establish the correctness of a
> patch.

+1

And instead of trying to do this, it would be much better to optimize
the NAPI performance. Then we can drop the orphan mode.

>
> The napi_tx and non-napi code paths differ in how they handle at least
> the following structures:
>
> 1. skb: non-napi orphans these in ndo_start_xmit. Without napi this is
> needed as delay until the next ndo_start_xmit and thus completion is
> unbounded.
>
> When switching to napi mode, orphaned skbs may now be cleaned by the
> napi handler. This is indeed safe.
>
> When switching from napi to non-napi, the unbound latency resurfaces.
> It is a small edge case, and I think a potentially acceptable risk, if
> the user of this knob is aware of the risk.
>
> 2. virtqueue callback ("interrupt" masking). The non-napi path enables
> the interrupt (disables the mask) when available descriptors falls
> beneath a low watermark, and reenables when it recovers above a high
> watermark. Napi disables when napi is scheduled, and reenables on
> napi complete.
>
> 3. dev_queue->state (QUEUE_STATE_DRV_XOFF). if the ring falls below
> a low watermark, the driver stops the stack for queuing more packets.
> In napi mode, it schedules napi to clean packets. See the calls to
> netif_xmit_stopped, netif_stop_subqueue, netif_start_subqueue and
> netif_tx_wake_queue.
>
> Some if this can be assumed safe by looking at existing analogous
> code, such as the queue stop/start in virtnet_tx_resize.
>
> But that all virtqueue callback and dev_queue->state transitions are
> correct when switching between modes at runtime is not trivial to
> establish, deserves some thought and explanation in the commit
> message.

Thanks


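For concreteness, the quiescing order Heng Qi describes, as a hypothetical
helper (not the posted patch; it deliberately shows only point 1 of
Willem's list -- the callback-masking and queue-state transitions of
points 2 and 3 are exactly what the review says still need an argument):

static void virtnet_set_tx_napi(struct virtnet_info *vi, int q, bool on)
{
	struct send_queue *sq = &vi->sq[q];
	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, q);

	virtnet_napi_tx_disable(&sq->napi); /* scheduled tx napi finishes */

	__netif_tx_lock_bh(txq);	/* stack cannot transmit here */
	sq->napi.weight = on ? NAPI_POLL_WEIGHT : 0; /* boolean, really */
	__netif_tx_unlock_bh(txq);

	if (on)
		virtnet_napi_tx_enable(vi, sq->vq, &sq->napi);
}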


Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic

2023-12-24 Thread Jason Wang
On Fri, Dec 22, 2023 at 4:14 PM Michael S. Tsirkin  wrote:
>
> On Fri, Dec 22, 2023 at 10:35:07AM +0800, Jason Wang wrote:
> > On Thu, Dec 21, 2023 at 11:06 PM Willem de Bruijn
> >  wrote:
> > >
> > > Heng Qi wrote:
> > > >
> > > >
> > > > On 2023/12/20 10:45 PM, Willem de Bruijn wrote:
> > > > > Heng Qi wrote:
> > > > >> virtio-net has two ways to switch napi_tx: one is through the
> > > > >> module parameter, and the other is through coalescing parameter
> > > > >> settings (provided that the nic status is down).
> > > > >>
> > > > >> Sometimes we face performance regression caused by napi_tx,
> > > > >> then we need to switch napi_tx when debugging. However, the
> > > > >> existing methods are a bit troublesome, such as needing to
> > > > >> reload the driver or turn off the network card.
> >
> > Why is this troublesome? We don't need to turn off the card, it's just
> > a toggling of the interface.
> >
> > This ends up with pretty simple code.
> >
> > > So try to make
> > > > >> this update.
> > > > >>
> > > > >> Signed-off-by: Heng Qi 
> > > > >> Reviewed-by: Xuan Zhuo 
> > > > > The commit does not explain why it is safe to do so.
> > > >
> > > > virtnet_napi_tx_disable ensures that already scheduled tx napi ends and
> > > > no new tx napi will be scheduled.
> > > >
> > > > Afterwards, if the __netif_tx_lock_bh lock is held, the stack cannot
> > > > send the packet.
> > > >
> > > > Then we can safely toggle the weight to indicate where to clear the 
> > > > buffers.
> > > >
> > > > >
> > > > > The tx-napi weights are not really weights: it is a boolean whether
> > > > > napi is used for transmit cleaning, or whether packets are cleaned
> > > > > in ndo_start_xmit.
> > > >
> > > > Right.
> > > >
> > > > >
> > > > > There certainly are some subtle issues with regard to pausing/waking
> > > > > queues when switching between modes.
> > > >
> > > > What are "subtle issues" and if there are any, we find them.
> > >
> > > A single runtime test is not sufficient to exercise all edge cases.
> > >
> > > Please don't leave it to reviewers to establish the correctness of a
> > > patch.
> >
> > +1
> >
> > And instead of trying to do this, it would be much better to optimize
> > the NAPI performance. Then we can drop the orphan mode.
>
> "To address your problem, optimize our code to the level which we
> couldn't achieve in more than 10 years".

Last time QE didn't report any issue for TCP. For others, the code
might just need some optimization if it really matters; it's just that
nobody has worked on this part in the past years.

The ethtool trick is just for debugging purposes; I can hardly believe
it is used by any management-layer software.

> That's not a reasonable
> requirement. Not getting an interrupt will always be better for some
> workloads.

So NAPI has been enabled by default for many years, and most NICs
don't do orphaning. Orphaning has known issues with pktgen and others.
Keeping two modes may result in tricky code and complicate features
like BQL [1]. We will have interrupt coalescing and dim soon; the
admin can enable the heuristic or tweak it according to the workload,
which is much more user friendly than orphaning.

I don't see a strong reason to keep the orphan mode anymore,
considering the advantages of NAPI.

Thanks

[1] https://lore.kernel.org/lkml/20181205225323.12555-2-...@redhat.com/



>
>
> > >
> > > The napi_tx and non-napi code paths differ in how they handle at least
> > > the following structures:
> > >
> > > 1. skb: non-napi orphans these in ndo_start_xmit. Without napi this is
> > > needed as delay until the next ndo_start_xmit and thus completion is
> > > unbounded.
> > >
> > > When switching to napi mode, orphaned skbs may now be cleaned by the
> > > napi handler. This is indeed safe.
> > >
> > > When switching from napi to non-napi, the unbound latency resurfaces.
> > > It is a small edge case, and I think a potentially acceptable risk, if
> > > the user of this knob is aware of the risk.
> > >
> > > 2. virtqueue callback ("interr

Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic

2023-12-24 Thread Jason Wang
On Mon, Dec 25, 2023 at 10:25 AM Jason Xing  wrote:
>
> Hello Jason,
> On Fri, Dec 22, 2023 at 10:36 AM Jason Wang  wrote:
> >
> > On Thu, Dec 21, 2023 at 11:06 PM Willem de Bruijn
> >  wrote:
> > >
> > > Heng Qi wrote:
> > > >
> > > >
> > > > On 2023/12/20 10:45 PM, Willem de Bruijn wrote:
> > > > > Heng Qi wrote:
> > > > >> virtio-net has two ways to switch napi_tx: one is through the
> > > > >> module parameter, and the other is through coalescing parameter
> > > > >> settings (provided that the nic status is down).
> > > > >>
> > > > >> Sometimes we face performance regression caused by napi_tx,
> > > > >> then we need to switch napi_tx when debugging. However, the
> > > > >> existing methods are a bit troublesome, such as needing to
> > > > >> reload the driver or turn off the network card.
> >
> > Why is this troublesome? We don't need to turn off the card, it's just
> > a toggling of the interface.
> >
> > This ends up with pretty simple code.
> >
> > > So try to make
> > > > >> this update.
> > > > >>
> > > > >> Signed-off-by: Heng Qi 
> > > > >> Reviewed-by: Xuan Zhuo 
> > > > > The commit does not explain why it is safe to do so.
> > > >
> > > > virtnet_napi_tx_disable ensures that already scheduled tx napi ends and
> > > > no new tx napi will be scheduled.
> > > >
> > > > Afterwards, if the __netif_tx_lock_bh lock is held, the stack cannot
> > > > send the packet.
> > > >
> > > > Then we can safely toggle the weight to indicate where to clear the 
> > > > buffers.
> > > >
> > > > >
> > > > > The tx-napi weights are not really weights: it is a boolean whether
> > > > > napi is used for transmit cleaning, or whether packets are cleaned
> > > > > in ndo_start_xmit.
> > > >
> > > > Right.
> > > >
> > > > >
> > > > > There certainly are some subtle issues with regard to pausing/waking
> > > > > queues when switching between modes.
> > > >
> > > > What are "subtle issues" and if there are any, we find them.
> > >
> > > A single runtime test is not sufficient to exercise all edge cases.
> > >
> > > Please don't leave it to reviewers to establish the correctness of a
> > > patch.
> >
> > +1
> >
> [...]
> > And instead of trying to do this, it would be much better to optimize
> > the NAPI performance. Then we can drop the orphan mode.
>
> Do you mean when to call skb_orphan()? If yes, I just want to provide
> more information that we also have some performance issues where the
> flow control takes a bad effect, especially under some small
> throughput in our production environment.

I think you need to describe it in detail.

> What strikes me as odd is if I restart the network, then the issue
> will go with the wind. I cannot reproduce it in my testing machine.
> One more thing: if I force skb_orphan() the current skb in every
> start_xmit(), it could also solve the issue but not in a proper way.
> After all, it drops the flow control... :S

Yes, that's the known issue.

Thanks

>
> Thanks,
> Jason
> >
> > >
> > > The napi_tx and non-napi code paths differ in how they handle at least
> > > the following structures:
> > >
> > > 1. skb: non-napi orphans these in ndo_start_xmit. Without napi this is
> > > needed as delay until the next ndo_start_xmit and thus completion is
> > > unbounded.
> > >
> > > When switching to napi mode, orphaned skbs may now be cleaned by the
> > > napi handler. This is indeed safe.
> > >
> > > When switching from napi to non-napi, the unbound latency resurfaces.
> > > It is a small edge case, and I think a potentially acceptable risk, if
> > > the user of this knob is aware of the risk.
> > >
> > > 2. virtqueue callback ("interrupt" masking). The non-napi path enables
> > > the interrupt (disables the mask) when available descriptors falls
> > > beneath a low watermark, and reenables when it recovers above a high
> > > watermark. Napi disables when napi is scheduled, and reenables on
> > > napi complete.
> > >
> > > 3. dev_queue->state (QUEUE_STATE_DRV_XOFF). if the ring falls below
> > > a low watermark, the driver stops the stack for queuing more packets.
> > > In napi mode, it schedules napi to clean packets. See the calls to
> > > netif_xmit_stopped, netif_stop_subqueue, netif_start_subqueue and
> > > netif_tx_wake_queue.
> > >
> > > Some if this can be assumed safe by looking at existing analogous
> > > code, such as the queue stop/start in virtnet_tx_resize.
> > >
> > > But that all virtqueue callback and dev_queue->state transitions are
> > > correct when switching between modes at runtime is not trivial to
> > > establish, deserves some thought and explanation in the commit
> > > message.
> >
> > Thanks
> >
> >
>




Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic

2023-12-24 Thread Jason Wang
On Mon, Dec 25, 2023 at 2:34 PM Jason Xing  wrote:
>
> On Mon, Dec 25, 2023 at 12:14 PM Jason Wang  wrote:
> >
> > On Mon, Dec 25, 2023 at 10:25 AM Jason Xing  
> > wrote:
> > >
> > > Hello Jason,
> > > On Fri, Dec 22, 2023 at 10:36 AM Jason Wang  wrote:
> > > >
> > > > On Thu, Dec 21, 2023 at 11:06 PM Willem de Bruijn
> > > >  wrote:
> > > > >
> > > > > Heng Qi wrote:
> > > > > >
> > > > > >
> > > > > > On 2023/12/20 10:45 PM, Willem de Bruijn wrote:
> > > > > > > Heng Qi wrote:
> > > > > > >> virtio-net has two ways to switch napi_tx: one is through the
> > > > > > >> module parameter, and the other is through coalescing parameter
> > > > > > >> settings (provided that the nic status is down).
> > > > > > >>
> > > > > > >> Sometimes we face performance regression caused by napi_tx,
> > > > > > >> then we need to switch napi_tx when debugging. However, the
> > > > > > >> existing methods are a bit troublesome, such as needing to
> > > > > > >> reload the driver or turn off the network card.
> > > >
> > > > Why is this troublesome? We don't need to turn off the card, it's just
> > > > a toggling of the interface.
> > > >
> > > > This ends up with pretty simple code.
> > > >
> > > > > So try to make
> > > > > > >> this update.
> > > > > > >>
> > > > > > >> Signed-off-by: Heng Qi 
> > > > > > >> Reviewed-by: Xuan Zhuo 
> > > > > > > The commit does not explain why it is safe to do so.
> > > > > >
> > > > > > virtnet_napi_tx_disable ensures that already scheduled tx napi ends 
> > > > > > and
> > > > > > no new tx napi will be scheduled.
> > > > > >
> > > > > > Afterwards, if the __netif_tx_lock_bh lock is held, the stack cannot
> > > > > > send the packet.
> > > > > >
> > > > > > Then we can safely toggle the weight to indicate where to clear the 
> > > > > > buffers.
> > > > > >
> > > > > > >
> > > > > > > The tx-napi weights are not really weights: it is a boolean 
> > > > > > > whether
> > > > > > > napi is used for transmit cleaning, or whether packets are cleaned
> > > > > > > in ndo_start_xmit.
> > > > > >
> > > > > > Right.
> > > > > >
> > > > > > >
> > > > > > > There certainly are some subtle issues with regard to 
> > > > > > > pausing/waking
> > > > > > > queues when switching between modes.
> > > > > >
> > > > > > What are "subtle issues" and if there are any, we find them.
> > > > >
> > > > > A single runtime test is not sufficient to exercise all edge cases.
> > > > >
> > > > > Please don't leave it to reviewers to establish the correctness of a
> > > > > patch.
> > > >
> > > > +1
> > > >
> > > [...]
> > > > And instead of trying to do this, it would be much better to optimize
> > > > the NAPI performance. Then we can drop the orphan mode.
> > >
> [...]
> > > Do you mean when to call skb_orphan()? If yes, I just want to provide
> > > more information that we also have some performance issues where the
> > > flow control takes a bad effect, especially under some small
> > > throughput in our production environment.
> >
> > I think you need to describe it in detail.
>
> Some of the details were described below in the last email. The
> decreased performance happened because of flow control: the delay of
> skb free means the delay

What do you mean by delay here? Is it an interrupt delay? If yes, does
it work better if you simply replace

virtqueue_enable_cb_delayed() with virtqueue_enable_cb()? The former
may delay the interrupt more or less depending on the traffic.

> of decreasing of sk_wmem_alloc, then it will
> hit the limit of TSQ mechanism, finally causing transmitting slowly in
> the TCP layer.

TSQ might work better with BQL, which virtio-net doesn't have right now.

>
> >
> > >

Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic

2023-12-24 Thread Jason Wang
On Mon, Dec 25, 2023 at 2:44 PM Jason Wang  wrote:
>
> On Mon, Dec 25, 2023 at 2:34 PM Jason Xing  wrote:
> >
> > On Mon, Dec 25, 2023 at 12:14 PM Jason Wang  wrote:
> > >
> > > On Mon, Dec 25, 2023 at 10:25 AM Jason Xing  
> > > wrote:
> > > >
> > > > Hello Jason,
> > > > On Fri, Dec 22, 2023 at 10:36 AM Jason Wang  wrote:
> > > > >
> > > > > On Thu, Dec 21, 2023 at 11:06 PM Willem de Bruijn
> > > > >  wrote:
> > > > > >
> > > > > > Heng Qi wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 2023/12/20 10:45 PM, Willem de Bruijn wrote:
> > > > > > > > Heng Qi wrote:
> > > > > > > >> virtio-net has two ways to switch napi_tx: one is through the
> > > > > > > >> module parameter, and the other is through coalescing parameter
> > > > > > > >> settings (provided that the nic status is down).
> > > > > > > >>
> > > > > > > >> Sometimes we face performance regression caused by napi_tx,
> > > > > > > >> then we need to switch napi_tx when debugging. However, the
> > > > > > > >> existing methods are a bit troublesome, such as needing to
> > > > > > > >> reload the driver or turn off the network card.
> > > > >
> > > > > Why is this troublesome? We don't need to turn off the card, it's just
> > > > > a toggling of the interface.
> > > > >
> > > > > This ends up with pretty simple code.
> > > > >
> > > > > > So try to make
> > > > > > > >> this update.
> > > > > > > >>
> > > > > > > >> Signed-off-by: Heng Qi 
> > > > > > > >> Reviewed-by: Xuan Zhuo 
> > > > > > > > The commit does not explain why it is safe to do so.
> > > > > > >
> > > > > > > virtnet_napi_tx_disable ensures that already scheduled tx napi 
> > > > > > > ends and
> > > > > > > no new tx napi will be scheduled.
> > > > > > >
> > > > > > > Afterwards, if the __netif_tx_lock_bh lock is held, the stack 
> > > > > > > cannot
> > > > > > > send the packet.
> > > > > > >
> > > > > > > Then we can safely toggle the weight to indicate where to clear 
> > > > > > > the buffers.
> > > > > > >
> > > > > > > >
> > > > > > > > The tx-napi weights are not really weights: it is a boolean 
> > > > > > > > whether
> > > > > > > > napi is used for transmit cleaning, or whether packets are 
> > > > > > > > cleaned
> > > > > > > > in ndo_start_xmit.
> > > > > > >
> > > > > > > Right.
> > > > > > >
> > > > > > > >
> > > > > > > > There certainly are some subtle issues with regard to 
> > > > > > > > pausing/waking
> > > > > > > > queues when switching between modes.
> > > > > > >
> > > > > > > What are "subtle issues" and if there are any, we find them.
> > > > > >
> > > > > > A single runtime test is not sufficient to exercise all edge cases.
> > > > > >
> > > > > > Please don't leave it to reviewers to establish the correctness of a
> > > > > > patch.
> > > > >
> > > > > +1
> > > > >
> > > > [...]
> > > > > And instead of trying to do this, it would be much better to optimize
> > > > > the NAPI performance. Then we can drop the orphan mode.
> > > >
> > [...]
> > > > Do you mean when to call skb_orphan()? If yes, I just want to provide
> > > > more information that we also have some performance issues where the
> > > > flow control takes a bad effect, especially under some small
> > > > throughput in our production environment.
> > >
> > > I think you need to describe it in detail.
> >
> > Some of the details were described below in the last email. The
> > decreased performance happened 

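Background on the TSQ interaction described above, simplified from
tcp_small_queue_check() in net/ipv4/tcp_output.c (sysctl clamping and
other details trimmed; the helper name here is invented):

/* If tx completions are delayed, skbs stay charged to the socket and
 * sk_wmem_alloc stays above the limit, so TCP stops queuing new
 * segments -- the slowdown Jason Xing describes. */
static bool tsq_throttled(struct sock *sk, const struct sk_buff *skb)
{
	unsigned long limit;

	limit = max_t(unsigned long, 2 * skb->truesize,
		      READ_ONCE(sk->sk_pacing_rate) >>
		      READ_ONCE(sk->sk_pacing_shift));

	if (refcount_read(&sk->sk_wmem_alloc) > limit) {
		set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
		return true;	/* hold off until skbs are freed */
	}
	return false;
}
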
Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic

2023-12-25 Thread Jason Wang
On Mon, Dec 25, 2023 at 4:03 PM Michael S. Tsirkin  wrote:
>
> On Mon, Dec 25, 2023 at 12:12:48PM +0800, Jason Wang wrote:
> > On Fri, Dec 22, 2023 at 4:14 PM Michael S. Tsirkin  wrote:
> > >
> > > On Fri, Dec 22, 2023 at 10:35:07AM +0800, Jason Wang wrote:
> > > > On Thu, Dec 21, 2023 at 11:06 PM Willem de Bruijn
> > > >  wrote:
> > > > >
> > > > > Heng Qi wrote:
> > > > > >
> > > > > >
> > > > > > On 2023/12/20 10:45 PM, Willem de Bruijn wrote:
> > > > > > > Heng Qi wrote:
> > > > > > >> virtio-net has two ways to switch napi_tx: one is through the
> > > > > > >> module parameter, and the other is through coalescing parameter
> > > > > > >> settings (provided that the nic status is down).
> > > > > > >>
> > > > > > >> Sometimes we face performance regression caused by napi_tx,
> > > > > > >> then we need to switch napi_tx when debugging. However, the
> > > > > > >> existing methods are a bit troublesome, such as needing to
> > > > > > >> reload the driver or turn off the network card.
> > > >
> > > > Why is this troublesome? We don't need to turn off the card, it's just
> > > > a toggling of the interface.
> > > >
> > > > This ends up with pretty simple code.
> > > >
> > > > > So try to make
> > > > > > >> this update.
> > > > > > >>
> > > > > > >> Signed-off-by: Heng Qi 
> > > > > > >> Reviewed-by: Xuan Zhuo 
> > > > > > > The commit does not explain why it is safe to do so.
> > > > > >
> > > > > > virtnet_napi_tx_disable ensures that already scheduled tx napi ends 
> > > > > > and
> > > > > > no new tx napi will be scheduled.
> > > > > >
> > > > > > Afterwards, if the __netif_tx_lock_bh lock is held, the stack cannot
> > > > > > send the packet.
> > > > > >
> > > > > > Then we can safely toggle the weight to indicate where to clear the 
> > > > > > buffers.
> > > > > >
> > > > > > >
> > > > > > > The tx-napi weights are not really weights: it is a boolean 
> > > > > > > whether
> > > > > > > napi is used for transmit cleaning, or whether packets are cleaned
> > > > > > > in ndo_start_xmit.
> > > > > >
> > > > > > Right.
> > > > > >
> > > > > > >
> > > > > > > There certainly are some subtle issues with regard to 
> > > > > > > pausing/waking
> > > > > > > queues when switching between modes.
> > > > > >
> > > > > > What are "subtle issues" and if there are any, we find them.
> > > > >
> > > > > A single runtime test is not sufficient to exercise all edge cases.
> > > > >
> > > > > Please don't leave it to reviewers to establish the correctness of a
> > > > > patch.
> > > >
> > > > +1
> > > >
> > > > And instead of trying to do this, it would be much better to optimize
> > > > the NAPI performance. Then we can drop the orphan mode.
> > >
> > > "To address your problem, optimize our code to the level which we
> > > couldn't achieve in more than 10 years".
> >
> > Last time QE didn't report any issue for TCP. For others, the code
> > might just need some optimization if it really matters, it's just
> > because nobody has worked on this part in the past years.
>
> You think nobody worked on performance of virtio net because nobody
> could bother?

No, I just described what I've seen on the list. No patches were
posted for performance optimization in recent years.

> I think it's just micro-optimized to a level where
> progress is difficult.
>
>

Maybe.

Let me clarify what I meant.

If people can help optimize NAPI to be as fast as orphan mode, or at
least something like 80%-90% of it, that should be sufficient to remove
the orphan path completely. This would allow a lot of good features to
be built easily on top.

To me, it seems fine to remove it even now, as orphan mode breaks socket
accounting, which is actually a bug, and NAPI has been enabled for
years. But if you don't wish to do that, that's fine. We can wait for
coalescing and dim to be used for a while and revisit this.

Thanks
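
For readers following the thread, here is a minimal sketch of the
switching sequence Heng Qi describes above, and of what "weight as a
boolean" means in practice. The function name and locking details are
simplifying assumptions, not the driver's exact code:

    /* Sketch: toggle virtio-net between orphan mode and tx NAPI mode.
     * sq->napi.weight acts as a boolean: 0 means orphan mode (skbs are
     * orphaned and cleaned in ndo_start_xmit), non-zero means NAPI mode
     * (skbs stay charged to their socket until the tx NAPI poll cleans
     * them, preserving socket accounting and flow control). */
    static void virtnet_switch_napi_tx(struct virtnet_info *vi,
                                       struct send_queue *sq,
                                       struct netdev_queue *txq,
                                       bool napi_tx)
    {
            /* Wait for any scheduled tx NAPI to finish and prevent
             * new scheduling. */
            virtnet_napi_tx_disable(&sq->napi);

            /* With the tx lock held, the stack cannot queue packets,
             * so the weight can be flipped safely. */
            __netif_tx_lock_bh(txq);
            sq->napi.weight = napi_tx ? NAPI_POLL_WEIGHT : 0;
            __netif_tx_unlock_bh(txq);

            if (napi_tx)
                    virtnet_napi_tx_enable(vi, sq->vq, &sq->napi);
    }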




Re: [PATCH v2] virtio_net: fix missing dma unmap for resize

2024-01-03 Thread Jason Wang
On Thu, Jan 4, 2024 at 6:18 AM Michael S. Tsirkin  wrote:
>
> On Wed, Jan 03, 2024 at 01:58:03PM -0800, Jakub Kicinski wrote:
> > On Tue, 26 Dec 2023 17:43:33 +0800 Xuan Zhuo wrote:
> > > For rq, we have three cases getting buffers from virtio core:
> > >
> > > 1. virtqueue_get_buf{,_ctx}
> > > 2. virtqueue_detach_unused_buf
> > > 3. callback for virtqueue_resize
> > >
> > > But in commit 295525e29a5b ("virtio_net: merge dma operations when
> > > filling mergeable buffers"), I missed the dma unmap for the #3 case.
> > >
> > > That will leak some memory, because I did not release the pages referred
> > > to by the unused buffers.
> > >
> > > If we run the following script, the system will OOM:
> > >
> > > while true
> > > do
> > > ethtool -G ens4 rx 128
> > > ethtool -G ens4 rx 256
> > > free -m
> > > done
> > >
> > > Fixes: 295525e29a5b ("virtio_net: merge dma operations when filling 
> > > mergeable buffers")
> > > Signed-off-by: Xuan Zhuo 
> >
> > Michael, Jason, looks good? Worth pushing it to v6.7?
>
> I'd say yes.
>
> Acked-by: Michael S. Tsirkin 

Acked-by: Jason Wang 

Thanks

>
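
For context, the missing piece was the recycle callback passed to
virtqueue_resize(): buffers handed back through it must be unmapped and
their pages released, just as in the get_buf and detach paths. A rough
sketch of what such a callback has to do; the helper names here are
assumptions, not the exact code of the fix:

    /* Sketch: recycle callback invoked by virtqueue_resize() for each
     * buffer still queued in the old ring. Without the unmap and
     * put_page, the pages referenced by these buffers leak, which is
     * the OOM the commit message reproduces. */
    static void virtnet_rq_recycle_buf(struct virtqueue *vq, void *buf)
    {
            struct receive_queue *rq = vq_to_rq(vq);   /* assumed helper */

            virtnet_rq_unmap(rq, buf, 0);       /* undo the premapped dma */
            put_page(virt_to_head_page(buf));   /* release the page */
    }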




Re: [PATCH net-next v3 00/27] virtio-net: support AF_XDP zero copy

2024-01-10 Thread Jason Wang
On Fri, Dec 29, 2023 at 3:31 PM Xuan Zhuo  wrote:
>
> ## AF_XDP
>
> XDP socket (AF_XDP) is an excellent kernel-bypass network framework. The
> zero copy feature of xsk (XDP socket) needs to be supported by the driver,
> and the performance of zero copy is very good. mlx5 and Intel ixgbe already
> support this feature. This patch set allows virtio-net to support xsk's
> zerocopy xmit feature.
>
> At present, we have completed some preparation:
>
> 1. vq-reset (virtio spec and kernel code)
> 2. virtio-core premapped dma
> 3. virtio-net xdp refactor
>
> So it is time for Virtio-Net to complete the support for the XDP Socket
> Zerocopy.
>
> Virtio-net cannot increase the queue num at will, so xsk shares the
> queues with the kernel.
>
> On the other hand, virtio-net does not support generating an interrupt
> from the driver manually, so we use some tricks to wake up tx xmit: if
> the CPU that last ran the TX NAPI is a different CPU, we use an IPI to
> wake up NAPI on the remote CPU; if it is the local CPU, we wake up NAPI
> directly.
>
> This patch set includes some refactoring of virtio-net to support
> AF_XDP.
>
> ## performance
>
> ENV: Qemu with vhost-user(polling mode).
> Host CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
>
> ### virtio PMD in guest with testpmd
>
> testpmd> show port stats all
>
>   ######################## NIC statistics for port 0  ########################
>   RX-packets: 19531092064  RX-missed: 0  RX-bytes: 1093741155584
>   RX-errors: 0
>   RX-nombuf: 0
>   TX-packets: 595992  TX-errors: 0  TX-bytes: 371030645664
>
>   Throughput (since last show)
>   Rx-pps:   8861574  Rx-bps:  3969985208
>   Tx-pps:   8861493  Tx-bps:  3969962736
>   ############################################################################
>
> ### AF_XDP PMD in guest with testpmd
>
> testpmd> show port stats all
>
>   ######################## NIC statistics for port 0  ########################
>   RX-packets: 68152727   RX-missed: 0  RX-bytes:  3816552712
>   RX-errors: 0
>   RX-nombuf:  0
>   TX-packets: 68114967   TX-errors: 33216  TX-bytes:  3814438152
>
>   Throughput (since last show)
>   Rx-pps:  6333196  Rx-bps:   2837272088
>   Tx-pps:  6333227  Tx-bps:   2837285936
>   ############################################################################
>
> But AF_XDP consumes more CPU for tx and rx napi (100% and 86%).
>
> ## maintain
>
> I am currently a reviewer for virtio-net. I commit to maintaining AF_XDP
> support in virtio-net.
>
> Please review.
>
> Thanks.
>
> v3
> 1. virtio introduces helpers for virtio-net sq using premapped dma
> 2. xsk has more complete support for merge mode
> 3. fix some problems
>
> v2
> 1. wakeup uses the GVE approach; no IPI is sent to wake up napi on a
> remote cpu.
> 2. remove rcu. Because we synchronize all operations, the rcu is not
> needed.
> 3. split the commit "move to virtio_net.h" from the last patch set; just
> move the structs/apis to the header when we use them.
> 4. add comments for some code
>
> v1:
> 1. remove two virtio commits. Push this patchset to net-next
> 2. squash "virtio_net: virtnet_poll_tx support rescheduled" to xsk: 
> support tx
> 3. fix some warnings
>
>
>
> Xuan Zhuo (27):
>   virtio_net: rename free_old_xmit_skbs to free_old_xmit
>   virtio_net: unify the code for recycling the xmit ptr
>   virtio_net: independent directory
>   virtio_net: move core structures to virtio_net.h
>   virtio_net: add prefix virtnet to all struct inside virtio_net.h
>   virtio_ring: introduce virtqueue_get_buf_ctx_dma()
>   virtio_ring: virtqueue_disable_and_recycle let the callback detach
> bufs
>   virtio_ring: introduce virtqueue_detach_unused_buf_dma()
>   virtio_ring: introduce virtqueue_get_dma_premapped()
>   virtio_net: sq support premapped mode
>   virtio_net: separate virtnet_rx_resize()
>   virtio_net: separate virtnet_tx_resize()
>   virtio_net: xsk: bind/unbind xsk
>   virtio_net: xsk: prevent disable tx napi
>   virtio_net: move some api to header
>   virtio_net: xsk: tx: support xmit xsk buffer
>   virtio_net: xsk: tx: support wakeup
>   virtio_net: xsk: tx: handle the transmitted xsk buffer
>   virtio_net: xsk: tx: free the unused xsk buffer
>   virtio_net: separate receive_mergeable
>   virtio_net: separate receive_buf
>   virtio_net: xsk: rx: support fill with xsk buffer
>   virtio_net: xsk: rx: support recv merge mode
>   virtio_net: xsk: rx: support recv small mode
>   virtio_net: xsk: rx: free the unused xsk buffer
>   virtio_net: update tx timeout record
>   virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY

Hi Xuan:

This series seems too huge to be reviewed easily.

I'd suggest splitting it into multiple series (as suggested by
https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#tl-dr)

Thanks

>
>  MAINTAINERS |   2 +-
>  drivers/net/Kconfig |   8 +-
>  drivers/net/Makefile  

Re: [PATCH net-next v3 06/27] virtio_ring: introduce virtqueue_get_buf_ctx_dma()

2024-01-11 Thread Jason Wang
On Fri, Dec 29, 2023 at 3:31 PM Xuan Zhuo  wrote:
>
> Introduce virtqueue_get_buf_ctx_dma() to collect the dma info when
> getting a buf from the virtio core in premapped mode.
>
> If the virtio queue is in premapped mode, the virtio-net send buf may
> have many descs, and every desc's dma address needs to be unmapped. So
> here we introduce a new helper to collect the dma addresses of the
> buffer from the virtio core.
>
> Because BAD_RING() is called (which may set vq->broken), the
> related "const" of vq is removed.
>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/virtio/virtio_ring.c | 174 +--
>  include/linux/virtio.h   |  16 
>  2 files changed, 142 insertions(+), 48 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 51d8f3299c10..1374b3fd447c 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -362,6 +362,45 @@ static struct device *vring_dma_dev(const struct 
> vring_virtqueue *vq)
> return vq->dma_dev;
>  }
>
> +/*
> + * use_dma_api  premapped -> do_unmap
> + *  1. false    false        false
> + *  2. true     false        true
> + *  3. true     true         false
> + *
> + * Only #3, we should return the DMA info to the driver.

Btw, I guess you meant "#3 is false" here?

And could we reduce the size of these 3 * 3 matrices? It's usually a
hint that the code is not optimized.

Thanks
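
To make the proposed API concrete, here is a rough sketch of how a
driver in premapped mode might consume it, based on the virtio_dma_head
fields (items, next, num) visible later in this thread; the exact
calling convention is an assumption:

    /* Sketch: reclaim completed buffers and unmap every desc address
     * the core collected. In premapped mode the core skips the dma
     * unmap, so the driver must do it with the returned addresses. */
    static void reclaim_sent_bufs(struct virtqueue *vq, struct device *dev,
                                  struct virtio_dma_head *dma)
    {
            unsigned int len, i;
            void *buf;

            dma->next = 0;
            while ((buf = virtqueue_get_buf_ctx_dma(vq, &len, dma, NULL))) {
                    for (i = 0; i < dma->next; i++)
                            dma_unmap_single(dev, dma->items[i].addr,
                                             dma->items[i].length,
                                             DMA_TO_DEVICE);
                    dma->next = 0;
                    /* ... free or recycle buf ... */
            }
    }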




Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

2024-01-14 Thread Jason Wang
On Mon, Jan 15, 2024 at 9:35 AM Zhu Yanjun  wrote:
>
> From: Zhu Yanjun 
>
> Some devices emulate the virtio_net hardware. When the virtio_net
> driver sends commands to the emulated hardware, the hardware
> normally needs time to respond, and sometimes that time is very
> long. Thus, the following will appear, and the whole system
> will hang.
> Similar problems also occur with Intel NICs and Mellanox NICs.
> As such, a similar solution is borrowed from them: a timeout
> value is added, set as large as possible, to ensure that the
> driver gets the maximum possible response from the hardware.
>
> "
> [  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! 
> [(udev-worker):3157]
> [  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr 
> rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency 
> intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm 
> x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt 
> dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry 
> pmt_class intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me 
> isst_if_mbox_pci isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec 
> idxd_bus i2c_smbus i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf 
> ipmi_msghandler acpi_pad acpi_power_meter pfr_telemetry pfr_update fuse loop 
> zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni 
> polyval_generic ghash_clmulni_intel sha512_ssse3 bnxt_en sha256_ssse3 
> sha1_ssse3 nvme ast nvme_core i2c_algo_bit wmi pinctrl_emmitsburg 
> scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
> [  213.796194] irq event stamp: 67740
> [  213.796195] hardirqs last  enabled at (67739): [] 
> asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  213.796203] hardirqs last disabled at (67740): [] 
> sysvec_apic_timer_interrupt+0xe/0x90
> [  213.796208] softirqs last  enabled at (67686): [] 
> __irq_exit_rcu+0xbe/0xe0
> [  213.796214] softirqs last disabled at (67681): [] 
> __irq_exit_rcu+0xbe/0xe0
> [  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not 
> tainted 6.7.0+ #9
> [  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, 
> BIOS SE5C741.86B.01.01.0001.2211140926 11/14/2022
> [  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
> [  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 
> 43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 
> e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
> [  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 0246
> [  213.796233] RAX:  RBX: ff2f15095896f000 RCX: 
> 0001
> [  213.796235] RDX:  RSI: ff4bbb362306f9cc RDI: 
> ff2f15095896f000
> [  213.796236] RBP:  R08:  R09: 
> 
> [  213.796237] R10: 0003 R11: ff2f15095893cc40 R12: 
> 0002
> [  213.796239] R13: 0004 R14:  R15: 
> ff2f1509534f3000
> [  213.796240] FS:  7f775847d0c0() GS:ff2f1528bac0() 
> knlGS:
> [  213.796242] CS:  0010 DS:  ES:  CR0: 80050033
> [  213.796243] CR2: 557f987b6e70 CR3: 002098602006 CR4: 
> 00f71ef0
> [  213.796245] DR0:  DR1:  DR2: 
> 
> [  213.796246] DR3:  DR6: fffe07f0 DR7: 
> 0400
> [  213.796247] PKRU: 5554
> [  213.796249] Call Trace:
> [  213.796250]  
> [  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
> [  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
> [  213.796269]  ? hrtimer_interrupt+0xf8/0x230
> [  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
> [  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
> [  213.796282]  
> [  213.796284]  
> [  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
> [  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
> [  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
> [  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
> [  213.796328]  virtio_dev_probe+0x195/0x230
> [  213.796333]  really_probe+0x19f/0x400
> [  213.796338]  ? __pfx___driver_attach+0x10/0x10
> [  213.796340]  __driver_probe_device+0x78/0x160
> [  213.796343]  driver_probe_device+0x1f/0x90
> [  213.796346]  __driver_attach+0xd6/0x1d0
> [  213.796349]  bus_for_each_dev+0x8c/0xe0
> [  213.796355]  bus_add_driver+0x119/0x220
> [  213.796359]  driver_register+0x59/0x100
> [  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
> [  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
> [  213.796375]  do_one_initcall+0x6f/0x380
> [  213.796384]  do_init_module+0x60/0x240
> [  213.796388]  init_module_from_file+0x86/0xc0
> [  213.796396]  idempotent_init_module+0x129/0x2c0
> [  213.79640

Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

2024-01-21 Thread Jason Wang
On Mon, Jan 15, 2024 at 6:22 PM Zhu Yanjun  wrote:
>
>
> On 2024/1/15 10:20, Jason Wang wrote:
>
> On Mon, Jan 15, 2024 at 9:35 AM Zhu Yanjun  wrote:
>
> From: Zhu Yanjun 
>
> Some devices emulate the virtio_net hardwares. When virtio_net
> driver sends commands to the emulated hardware, normally the
> hardware needs time to response. Sometimes the time is very
> long. Thus, the following will appear. Then the whole system
> will hang.
> The similar problems also occur in Intel NICs and Mellanox NICs.
> As such, the similar solution is borrowed from them. A timeout
> value is added and the timeout value as large as possible is set
> to ensure that the driver gets the maximum possible response from
> the hardware.
>
> "
> [  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! 
> [(udev-worker):3157]
> [  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr 
> rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency 
> intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm 
> x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt 
> dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry 
> pmt_class intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me 
> isst_if_mbox_pci isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec 
> idxd_bus i2c_smbus i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf 
> ipmi_msghandler acpi_pad acpi_power_meter pfr_telemetry pfr_update fuse loop 
> zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni 
> polyval_generic ghash_clmulni_intel sha512_ssse3 bnxt_en sha256_ssse3 
> sha1_ssse3 nvme ast nvme_core i2c_algo_bit wmi pinctrl_emmitsburg 
> scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
> [  213.796194] irq event stamp: 67740
> [  213.796195] hardirqs last  enabled at (67739): [] 
> asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  213.796203] hardirqs last disabled at (67740): [] 
> sysvec_apic_timer_interrupt+0xe/0x90
> [  213.796208] softirqs last  enabled at (67686): [] 
> __irq_exit_rcu+0xbe/0xe0
> [  213.796214] softirqs last disabled at (67681): [] 
> __irq_exit_rcu+0xbe/0xe0
> [  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not 
> tainted 6.7.0+ #9
> [  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, 
> BIOS SE5C741.86B.01.01.0001.2211140926 11/14/2022
> [  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
> [  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 
> 43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 
> e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
> [  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 0246
> [  213.796233] RAX:  RBX: ff2f15095896f000 RCX: 
> 0001
> [  213.796235] RDX:  RSI: ff4bbb362306f9cc RDI: 
> ff2f15095896f000
> [  213.796236] RBP:  R08:  R09: 
> 
> [  213.796237] R10: 0003 R11: ff2f15095893cc40 R12: 
> 0002
> [  213.796239] R13: 0004 R14:  R15: 
> ff2f1509534f3000
> [  213.796240] FS:  7f775847d0c0() GS:ff2f1528bac0() 
> knlGS:
> [  213.796242] CS:  0010 DS:  ES:  CR0: 80050033
> [  213.796243] CR2: 557f987b6e70 CR3: 002098602006 CR4: 
> 00f71ef0
> [  213.796245] DR0:  DR1:  DR2: 
> 
> [  213.796246] DR3:  DR6: fffe07f0 DR7: 
> 0400
> [  213.796247] PKRU: 5554
> [  213.796249] Call Trace:
> [  213.796250]  
> [  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
> [  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
> [  213.796269]  ? hrtimer_interrupt+0xf8/0x230
> [  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
> [  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
> [  213.796282]  
> [  213.796284]  
> [  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
> [  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
> [  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
> [  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
> [  213.796328]  virtio_dev_probe+0x195/0x230
> [  213.796333]  really_probe+0x19f/0x400
> [  213.796338]  ? __pfx___driver_attach+0x10/0x10
> [  213.796340]  __driver_probe_device+0x78/0x160
> [  213.796343]  driver_probe_device+0x1f/0x90
> [  213.796346]  __driver_attach+0xd6/0x1d0
> [  213.796349]  bus_for_each_dev+0x8c/0xe0
> [  213.796355]  bus_add_driver

Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

2024-01-21 Thread Jason Wang
On Fri, Jan 19, 2024 at 10:27 PM Heng Qi  wrote:
>
>
>
> > On 2024/1/18 8:01 PM, Zhu Yanjun wrote:
> >
> >> On 2024/1/16 20:04, Paolo Abeni wrote:
> >> On Mon, 2024-01-15 at 09:29 +0800, Zhu Yanjun wrote:
> >>> From: Zhu Yanjun 
> >>>
> >>> Some devices emulate the virtio_net hardwares. When virtio_net
> >>> driver sends commands to the emulated hardware, normally the
> >>> hardware needs time to response. Sometimes the time is very
> >>> long. Thus, the following will appear. Then the whole system
> >>> will hang.
> >>> The similar problems also occur in Intel NICs and Mellanox NICs.
> >>> As such, the similar solution is borrowed from them. A timeout
> >>> value is added and the timeout value as large as possible is set
> >>> to ensure that the driver gets the maximum possible response from
> >>> the hardware.
> >>>
> >>> "
> >>> [  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s!
> >>> [(udev-worker):3157]
> >>> [  213.796114] Modules linked in: virtio_net(+) net_failover
> >>> failover qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common
> >>> intel_uncore_frequency intel_uncore_frequency_common intel_ifs
> >>> i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp
> >>> coretemp iTCO_wdt rapl intel_pmc_bxt dax_hmem iTCO_vendor_support
> >>> vfat cxl_acpi intel_cstate pmt_telemetry pmt_class intel_sdsi joydev
> >>> intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci
> >>> isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus
> >>> i2c_smbus i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf
> >>> ipmi_msghandler acpi_pad acpi_power_meter pfr_telemetry pfr_update
> >>> fuse loop zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel
> >>> polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3
> >>> bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit wmi
> >>> pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
> >>> [  213.796194] irq event stamp: 67740
> >>> [  213.796195] hardirqs last  enabled at (67739):
> >>> [] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> >>> [  213.796203] hardirqs last disabled at (67740):
> >>> [] sysvec_apic_timer_interrupt+0xe/0x90
> >>> [  213.796208] softirqs last  enabled at (67686):
> >>> [] __irq_exit_rcu+0xbe/0xe0
> >>> [  213.796214] softirqs last disabled at (67681):
> >>> [] __irq_exit_rcu+0xbe/0xe0
> >>> [  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded
> >>> Not tainted 6.7.0+ #9
> >>> [  213.796220] Hardware name: Intel Corporation
> >>> M50FCP2SBSTD/M50FCP2SBSTD, BIOS SE5C741.86B.01.01.0001.2211140926
> >>> 11/14/2022
> >>> [  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
> >>> [  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89
> >>> 43 50 f6 43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87
> >>> 44 51 04 <48> 89 e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48
> >>> 89 e8 5b
> >>> [  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 0246
> >>> [  213.796233] RAX:  RBX: ff2f15095896f000 RCX:
> >>> 0001
> >>> [  213.796235] RDX:  RSI: ff4bbb362306f9cc RDI:
> >>> ff2f15095896f000
> >>> [  213.796236] RBP:  R08:  R09:
> >>> 
> >>> [  213.796237] R10: 0003 R11: ff2f15095893cc40 R12:
> >>> 0002
> >>> [  213.796239] R13: 0004 R14:  R15:
> >>> ff2f1509534f3000
> >>> [  213.796240] FS:  7f775847d0c0() GS:ff2f1528bac0()
> >>> knlGS:
> >>> [  213.796242] CS:  0010 DS:  ES:  CR0: 80050033
> >>> [  213.796243] CR2: 557f987b6e70 CR3: 002098602006 CR4:
> >>> 00f71ef0
> >>> [  213.796245] DR0:  DR1:  DR2:
> >>> 
> >>> [  213.796246] DR3:  DR6: fffe07f0 DR7:
> >>> 0400
> >>> [  213.796247] PKRU: 5554
> >>> [  213.796249] Call Trace:
> >>> [  213.796250]  
> >>> [  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
> >>> [  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
> >>> [  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
> >>> [  213.796269]  ? hrtimer_interrupt+0xf8/0x230
> >>> [  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
> >>> [  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
> >>> [  213.796282]  
> >>> [  213.796284]  
> >>> [  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> >>> [  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
> >>> [  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
> >>> [  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
> >>> [  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
> >>> [  213.796328]  virtio_dev_probe+0x195/0x230
> >>> [  213.796333]  really_probe+0x19f/0x400
> >>> [  213.796338]  ? __pfx___driver_attach+0x10/0x10
> >>> [  213.796340]  __driver_probe_device+0x78/0x160
> >>> [  213.796343]  driver_probe_device+0x1f/0x90
> >>> [  213.796346]  __driver_attach+0xd6/0x1d

Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

2024-01-21 Thread Jason Wang
On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun  wrote:
>
>
> On 2024/1/20 1:29, Andrew Lunn wrote:
> >        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > -             !virtqueue_is_broken(vi->cvq))
> > +             !virtqueue_is_broken(vi->cvq)) {
> > +              if (timeout)
> > +                      timeout--;
 This is not really a timeout, just a loop counter. 200 iterations could
 be a very short time on reasonable H/W. I guess this avoids the soft
 lockup, but possibly (likely?) breaks the functionality when we need to
 loop for some non-negligible time.

 I fear we need a more complex solution, as mentioned by Michael in the
 thread you quoted.
> >>> Got it. I also look forward to the more complex solution to this problem.
> >> Can we add a device capability (new feature bit) such as ctrq_wait_timeout
> >> to get a reasonable timeout?
> > The usual solution to this is include/linux/iopoll.h. If you can sleep
> > read_poll_timeout() otherwise read_poll_timeout_atomic().
>
> I carefully read the functions read_poll_timeout() and
> read_poll_timeout_atomic(). The timeout is set by the caller of these
> two functions.

FYI, in order to avoid a switch between atomic and non-atomic contexts,
we need to convert the rx mode setting to a workqueue first:

https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html

>
> As such, can we add a module parameter so that the user can customize
> this timeout value?

Who is the "user" here, or how can the "user" know the value?

>
> Or is this timeout value stored in a device register, which the
> virtio_net driver will read at initialization?

See another thread. The design needs to be general, or you can post an RFC.

On another thought, we already have a tx watchdog; maybe we can have
something similar for the cvq and use timeout + reset in that case.

Thanks

>
> Zhu Yanjun
>
> >
> >   Andrew
>
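
For illustration, a wait built on the iopoll helpers Andrew points to
might look roughly like this once the cvq path is allowed to sleep; the
wrapper name and the 100 ms budget are assumptions for the sketch, not
values from this thread:

    #include <linux/iopoll.h>

    /* Sketch: bounded, sleeping poll replacing the open-coded
     * while (!virtqueue_get_buf(...)) busy loop. Returns 0 once a
     * buffer arrives (or the vq breaks), -ETIMEDOUT otherwise. */
    static int virtnet_cvq_wait(struct virtnet_info *vi)
    {
            unsigned int tmp;
            void *buf;

            return read_poll_timeout(virtqueue_get_buf, buf,
                                     buf || virtqueue_is_broken(vi->cvq),
                                     1000,                 /* poll every 1 ms */
                                     100 * USEC_PER_MSEC,  /* 100 ms budget */
                                     false, vi->cvq, &tmp);
    }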




Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

2024-01-21 Thread Jason Wang
On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo  wrote:
>
> On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang  wrote:
> > On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun  wrote:
> > >
> > >
> > > On 2024/1/20 1:29, Andrew Lunn wrote:
> > > >>>>>        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > >>>>> -             !virtqueue_is_broken(vi->cvq))
> > > >>>>> +             !virtqueue_is_broken(vi->cvq)) {
> > > >>>>> +              if (timeout)
> > > >>>>> +                      timeout--;
> > > >>>> This is not really a timeout, just a loop counter. 200 iterations
> > > >>>> could be a very short time on reasonable H/W. I guess this avoids
> > > >>>> the soft lockup, but possibly (likely?) breaks the functionality
> > > >>>> when we need to loop for some non-negligible time.
> > > >>>>
> > > >>>> I fear we need a more complex solution, as mentioned by Michael in
> > > >>>> the thread you quoted.
> > > >>> Got it. I also look forward to the more complex solution to this 
> > > >>> problem.
> > > >> Can we add a device capability (new feature bit) such as 
> > > >> ctrq_wait_timeout
> > > >> to get a reasonable timeout?
> > > > The usual solution to this is include/linux/iopoll.h. If you can sleep
> > > > read_poll_timeout() otherwise read_poll_timeout_atomic().
> > >
> > > I read carefully the functions read_poll_timeout() and
> > > read_poll_timeout_atomic(). The timeout is set by the caller of the 2
> > > functions.
> >
> > FYI, in order to avoid a switch between atomic and non-atomic contexts,
> > we need to convert the rx mode setting to a workqueue first:
> >
> > https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html
> >
> > >
> > > As such, can we add a module parameter to customize this timeout value
> > > by the user?
> >
> > Who is the "user" here, or how can the "user" know the value?
> >
> > >
> > > Or this timeout value is stored in device register, virtio_net driver
> > > will read this timeout value at initialization?
> >
> > See another thread. The design needs to be general, or you can post a RFC.
> >
> > In another thought, we've already had a tx watchdog, maybe we can have
> > something similar to cvq and use timeout + reset in that case.
>
> But we may be blocked by the reset ^_^ if the device is broken?

I mean vq reset here.

It looks like we have multiple goals here:

1) avoid lockups: using workqueue + cond_resched() seems to be
sufficient; it has issues, but nothing new
2) recover from an unresponsive device: the issue for a timeout is that
it needs to deal with false positives

Thanks
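
As a rough illustration of goal 1), moving the control-queue work into
a workqueue lets the wait sleep and reschedule instead of spinning in
syscall context; the field and function names are assumptions for the
sketch:

    /* Sketch: rx-mode configuration deferred to process context, where
     * the command-completion wait can cond_resched() and therefore
     * cannot trip the soft-lockup watchdog. */
    static void virtnet_rx_mode_work(struct work_struct *work)
    {
            struct virtnet_info *vi =
                    container_of(work, struct virtnet_info, rx_mode_work);
            unsigned int tmp;

            rtnl_lock();
            /* ... queue the control command on vi->cvq and kick ... */
            while (!virtqueue_get_buf(vi->cvq, &tmp) &&
                   !virtqueue_is_broken(vi->cvq))
                    cond_resched();   /* sleepable: no watchdog splat */
            rtnl_unlock();
    }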

>
> Thanks.
>
>
> >
> > Thanks
> >
> > >
> > > Zhu Yanjun
> > >
> > > >
> > > >   Andrew
> > >
> >
>




Re: [PATCH net-next v3 06/27] virtio_ring: introduce virtqueue_get_buf_ctx_dma()

2024-01-21 Thread Jason Wang
On Tue, Jan 16, 2024 at 3:47 PM Xuan Zhuo  wrote:
>
> On Thu, 11 Jan 2024 16:34:09 +0800, Jason Wang  wrote:
> > On Fri, Dec 29, 2023 at 3:31 PM Xuan Zhuo  
> > wrote:
> > >
> > > introduce virtqueue_get_buf_ctx_dma() to collect the dma info when
> > > get buf from virtio core for premapped mode.
> > >
> > > If the virtio queue is premapped mode, the virtio-net send buf may
> > > have many desc. Every desc dma address need to be unmap. So here we
> > > introduce a new helper to collect the dma address of the buffer from
> > > the virtio core.
> > >
> > > Because the BAD_RING is called (that may set vq->broken), so
> > > the relative "const" of vq is removed.
> > >
> > > Signed-off-by: Xuan Zhuo 
> > > ---
> > >  drivers/virtio/virtio_ring.c | 174 +--
> > >  include/linux/virtio.h   |  16 
> > >  2 files changed, 142 insertions(+), 48 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 51d8f3299c10..1374b3fd447c 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -362,6 +362,45 @@ static struct device *vring_dma_dev(const struct 
> > > vring_virtqueue *vq)
> > > return vq->dma_dev;
> > >  }
> > >
> > > +/*
> > > + * use_dma_api  premapped -> do_unmap
> > > + *  1. false    false        false
> > > + *  2. true     false        true
> > > + *  3. true     true         false
> > > + *
> > > + * Only #3, we should return the DMA info to the driver.
> >
> > Btw, I guess you meant "#3 is false" here?
> >
> > And could we reduce the size of these 3 * 3 matrices? It's usually a
> > hint that the code is not optimized.
>
> In the process of doing the dma map, we act on (use_dma_api, premapped):
>
> if premapped:
>     virtio core skips the dma map
> else:
>     if use_dma_api:
>         do the dma map
>     else:
>         work with the physical address.
>
> From this we derive (premapped, do_unmap).
>
> do_unmap is an optimization. We just check it to know whether we should
> do the dma unmap or not.
>
> Now we have introduced a new case: when the virtio core skips the dma
> unmap, we may need to return the dma info to the driver. That occurs only
> when (premapped, do_unmap) is (true, false), because (premapped, do_unmap)
> may also be (false, false).
>
> With the matrix, I just want to show where do_unmap comes from. It is an
> optimization: we use it in many places rather than checking (use_dma_api,
> premapped) in the unmap path. And only for case #3 should we return the
> dma info to drivers.

Ok, it tries to ease the life of the readers.

I wonder if something like

bool virtqueue_needs_unmap()

can help; it can judge based on the values of use_dma_api and premapped.

Thanks
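
A minimal sketch of such a helper, assuming the vring_virtqueue struct
carries the use_dma_api and premapped flags the matrix above is built
from:

    /* Sketch: the core should unmap only when it did the mapping
     * itself, i.e. the DMA API is in use and the driver did not premap
     * the buffers (row #2 of the matrix). */
    static bool virtqueue_needs_unmap(const struct vring_virtqueue *vq)
    {
            return vq->use_dma_api && !vq->premapped;
    }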

>
> Thanks.
>
> >
> > Thanks
> >
> >
>




Re: [PATCH net-next 00/17] virtio-net: support AF_XDP zero copy (3/3)

2024-01-21 Thread Jason Wang
On Wed, Jan 17, 2024 at 1:58 PM Xuan Zhuo  wrote:
>
> On Tue, 16 Jan 2024 07:07:05 -0800, Jakub Kicinski  wrote:
> > On Tue, 16 Jan 2024 13:37:30 +0100 Paolo Abeni wrote:
> > > For future submission it would be better if you split this series in
> > > smaller chunks: the maximum size allowed is 15 patches.
> >
> > Which does not mean you can split it up and post them all at the same
> > time, FWIW.
>
>
> I hope someone has time to review the other parts.

Will review those this week.

Thanks

> In the future, I will post one after the last one is merged.
>
> Thanks.
>




Re: [PATCH net-next v3 06/27] virtio_ring: introduce virtqueue_get_buf_ctx_dma()

2024-01-21 Thread Jason Wang
On Mon, Jan 22, 2024 at 2:12 PM Xuan Zhuo  wrote:
>
> On Mon, 22 Jan 2024 12:18:51 +0800, Jason Wang  wrote:
> > On Tue, Jan 16, 2024 at 3:47 PM Xuan Zhuo  
> > wrote:
> > >
> > > On Thu, 11 Jan 2024 16:34:09 +0800, Jason Wang  
> > > wrote:
> > > > On Fri, Dec 29, 2023 at 3:31 PM Xuan Zhuo  
> > > > wrote:
> > > > >
> > > > > introduce virtqueue_get_buf_ctx_dma() to collect the dma info when
> > > > > get buf from virtio core for premapped mode.
> > > > >
> > > > > If the virtio queue is premapped mode, the virtio-net send buf may
> > > > > have many desc. Every desc dma address need to be unmap. So here we
> > > > > introduce a new helper to collect the dma address of the buffer from
> > > > > the virtio core.
> > > > >
> > > > > Because the BAD_RING is called (that may set vq->broken), so
> > > > > the relative "const" of vq is removed.
> > > > >
> > > > > Signed-off-by: Xuan Zhuo 
> > > > > ---
> > > > >  drivers/virtio/virtio_ring.c | 174 
> > > > > +--
> > > > >  include/linux/virtio.h   |  16 
> > > > >  2 files changed, 142 insertions(+), 48 deletions(-)
> > > > >
> > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > b/drivers/virtio/virtio_ring.c
> > > > > index 51d8f3299c10..1374b3fd447c 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -362,6 +362,45 @@ static struct device *vring_dma_dev(const struct 
> > > > > vring_virtqueue *vq)
> > > > > return vq->dma_dev;
> > > > >  }
> > > > >
> > > > > +/*
> > > > > + * use_dma_api  premapped -> do_unmap
> > > > > + *  1. false    false        false
> > > > > + *  2. true     false        true
> > > > > + *  3. true     true         false
> > > > > + *
> > > > > + * Only #3, we should return the DMA info to the driver.
> > > >
> > > > Btw, I guess you meant "#3 is false" here?
> > > >
> > > > And could we reduce the size of these 3 * 3 matrices? It's usually a
> > > > hint that the code is not optimized.
> > >
> > > On the process of doing dma map, we force the (use_dma_api, premapped).
> > >
> > > if premapped:
> > >     virtio core skip dma map
> > > else:
> > >     if use_dma_api:
> > >         do dma map
> > >     else:
> > >         work with the physical address.
> > >
> > > Here we force the (premapped, do_unmap).
> > >
> > > do_unmap is an optimization. We just check this to know should we do dma 
> > > unmap
> > > or not.
> > >
> > > Now, we introduced an new case, when the virtio core skip dma unmap,
> > > we may need to return the dma info to the driver. That just occur when
> > > the (premapped, do_unmap) is (true, false). Because that the (premapped,
> > > do_unmap) may be (false, false).
> > >
> > > For the matrices, I just want to show where the do_unmap comes from.
> > > That is a optimization, we use this many places, not to check 
> > > (use_dma_api,
> > > premapped) on the process of doing unmap. And only for the case #3, we 
> > > should
> > > return the dma info to drivers.
> >
> > Ok, it tries to ease the life of the readers.
> >
> > I wonder if something like
> >
> > bool virtqueue_needs_unmap() can help, it can judge based on the value
> > of use_dma_api and premapped.
>
>
> I don't think it helps too much.
>
> Because do_unmap is there for this.
>
>
>
> +static bool vring_need_unmap(struct vring_virtqueue *vq,
> +                             struct virtio_dma_head *dma,
> +                             dma_addr_t addr, unsigned int length)
> +{
> +       if (vq->do_unmap)
> +               return true;
>
> Before this check, we judge whether we should do the unmap or not.
> After it, we judge whether we should return the dma info to the driver or not.
>
> If you want to simplify this function, I will say no.
>
> If you want to replace "do_unmap" with virtqueue_needs_unmap(), I will say ok.

That's my point.

> But I think we do not need to do that.

Just a suggestion, and you can move the comment above there.

Thanks

>
> +
> +       if (!vq->premapped)
> +               return false;
> +
> +       if (!dma)
> +               return false;
> +
> +       if (unlikely(dma->next >= dma->num)) {
> +               BAD_RING(vq, "premapped vq: collect dma overflow: %pad %u\n",
> +                        &addr, length);
> +               return false;
> +       }
> +
> +       dma->items[dma->next].addr = addr;
> +       dma->items[dma->next].length = length;
> +
> +       ++dma->next;
> +
> +       return false;
> +}
>
>
> Thanks.
>
>
> >
> > Thanks
> >
> > >
> > > Thanks.
> > >
> > > >
> > > > Thanks
> > > >
> > > >
> > >
> >
>




Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

2024-01-21 Thread Jason Wang
On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo  wrote:
>
> On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang  wrote:
> > On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo  
> > wrote:
> > >
> > > On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang  
> > > wrote:
> > > > On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun  
> > > > wrote:
> > > > >
> > > > >
> > > > > On 2024/1/20 1:29, Andrew Lunn wrote:
> > > > > >>>>>        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > > > >>>>> -             !virtqueue_is_broken(vi->cvq))
> > > > > >>>>> +             !virtqueue_is_broken(vi->cvq)) {
> > > > > >>>>> +              if (timeout)
> > > > > >>>>> +                      timeout--;
> > > > > >>>> This is not really a timeout, just a loop counter. 200
> > > > > >>>> iterations could be a very short time on reasonable H/W. I
> > > > > >>>> guess this avoids the soft lockup, but possibly (likely?)
> > > > > >>>> breaks the functionality when we need to loop for some
> > > > > >>>> non-negligible time.
> > > > > >>>>
> > > > > >>>> I fear we need a more complex solution, as mentioned by
> > > > > >>>> Michael in the thread you quoted.
> > > > > >>> Got it. I also look forward to the more complex solution to this 
> > > > > >>> problem.
> > > > > >> Can we add a device capability (new feature bit) such as 
> > > > > >> ctrq_wait_timeout
> > > > > >> to get a reasonable timeout?
> > > > > > The usual solution to this is include/linux/iopoll.h. If you can 
> > > > > > sleep
> > > > > > read_poll_timeout() otherwise read_poll_timeout_atomic().
> > > > >
> > > > > I read carefully the functions read_poll_timeout() and
> > > > > read_poll_timeout_atomic(). The timeout is set by the caller of the 2
> > > > > functions.
> > > >
> > > > FYI, in order to avoid a switch between atomic and non-atomic contexts,
> > > > we need to convert the rx mode setting to a workqueue first:
> > > >
> > > > https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html
> > > >
> > > > >
> > > > > As such, can we add a module parameter to customize this timeout value
> > > > > by the user?
> > > >
> > > > Who is the "user" here, or how can the "user" know the value?
> > > >
> > > > >
> > > > > Or this timeout value is stored in device register, virtio_net driver
> > > > > will read this timeout value at initialization?
> > > >
> > > > See another thread. The design needs to be general, or you can post a 
> > > > RFC.
> > > >
> > > > In another thought, we've already had a tx watchdog, maybe we can have
> > > > something similar to cvq and use timeout + reset in that case.
> > >
> > > But we may block by the reset ^_^ if the device is broken?
> >
> > I mean vq reset here.
>
> I see.
>
> I mean when the device is broken, the vq reset may also be blocked.
>
> void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev,
>                                u16 index)
> {
>         struct virtio_pci_modern_common_cfg __iomem *cfg;
>
>         cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
>
>         vp_iowrite16(index, &cfg->cfg.queue_select);
>         vp_iowrite16(1, &cfg->queue_reset);
>
>         while (vp_ioread16(&cfg->queue_reset))
>                 msleep(1);
>
>         while (vp_ioread16(&cfg->cfg.queue_enable))
>                 msleep(1);
> }
> EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);
>
> In this function, for a broken device, we cannot expect the loops to ever terminate.

Yes, it's best effort; there's no guarantee then. But it does no harm to try.

Thanks
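
To make "timeout + reset" concrete, a bounded variant of the first wait
in vp_modern_set_queue_reset() above might look like this; the 100-try
budget and the error convention are assumptions for the sketch:

    /* Sketch: give up on the queue reset after roughly 100 ms instead
     * of spinning forever on a broken device. */
    static int vp_modern_wait_queue_reset(struct virtio_pci_modern_common_cfg
                                          __iomem *cfg)
    {
            int tries = 100;

            while (vp_ioread16(&cfg->queue_reset)) {
                    if (!--tries)
                            return -ETIMEDOUT;  /* hypothetical error path */
                    msleep(1);
            }
            return 0;
    }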

>
>
> >
> > It looks like we have multiple goals here
> >
> > 1) avoid lockups, using workqueue + cond_resched() seems to be
> > sufficient, it has issue but nothing new
> > 2) recover from the unresponsive device, the issue for timeout is that
> > it needs to deal with false positives
>
>
> I agree.
>
> But I want to add a new goal: an async cvq. With netdim, we will
> send many requests via the cvq, so an async cvq will be nice.
>
> Thanks.
>
>
> >
> > Thanks
> >
> > >
> > > Thanks.
> > >
> > >
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > Zhu Yanjun
> > > > >
> > > > > >
> > > > > >   Andrew
> > > > >
> > > >
> > >
> >
>



