Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails

2020-12-23 Thread Jason Wang


On 2020/12/24 12:37 PM, wangyunjian wrote:

-----Original Message-----
From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Thursday, December 24, 2020 11:10 AM
To: wangyunjian ; net...@vger.kernel.org;
m...@redhat.com; willemdebruijn.ker...@gmail.com
Cc: virtualization@lists.linux-foundation.org; Lilijun (Jerry)
; chenchanghu ;
xudingke ; huangbin (J)

Subject: Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg
fails


On 2020/12/24 10:25 AM, wangyunjian wrote:

From: Yunjian Wang 

Currently the driver doesn't drop a packet which can't be sent by tun
(e.g. a bad packet). In this case, the driver will always process the
same packet, leading to a stuck tx queue.

To fix this issue:
1. in the case of a persistent failure (e.g. a bad packet), the driver can
skip this descriptor by ignoring the error.
2. in the case of a transient failure (e.g. -EAGAIN and -ENOMEM), the
driver schedules the worker to try again.


I might be wrong but looking at alloc_skb_with_frags(), it returns -ENOBUFS
actually:

      *errcode = -ENOBUFS;
      skb = alloc_skb(header_len, gfp_mask);
      if (!skb)
          return NULL;

Yes, but sock_alloc_send_pskb() returns -EAGAIN when there is no sndbuf space.
So do we need to check whether the return value is -EAGAIN, -ENOMEM, or -ENOBUFS?

struct sk_buff *sock_alloc_send_pskb()
{
	...
	for (;;) {
		...
		sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
		err = -EAGAIN;
		if (!timeo)
			goto failure;
		...
	}
	skb = alloc_skb_with_frags(header_len, data_len, max_page_order,
				   errcode, sk->sk_allocation);
	if (skb)
		skb_set_owner_w(skb, sk);
	return skb;
	...
	*errcode = err;
	return NULL;
}



-EAGAIN and -ENOBUFS are fine. But I don't see how -ENOMEM can be returned.

Thanks



Thanks



Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")
Signed-off-by: Yunjian Wang 
---
   drivers/vhost/net.c | 16 ++++++++--------
   1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index
c8784dfafdd7..e76245daa7f6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -827,14 +827,13 @@ static void handle_tx_copy(struct vhost_net *net,

struct socket *sock)

msg.msg_flags &= ~MSG_MORE;
}

-   /* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(sock, &msg, len);
-   if (unlikely(err < 0)) {
+   if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
vhost_discard_vq_desc(vq, 1);
vhost_net_enable_vq(net, vq);
break;
}
-   if (err != len)
+   if (err >= 0 && err != len)
pr_debug("Truncated TX packet: len %d != %zd\n",
 err, len);
   done:
@@ -922,7 +921,6 @@ static void handle_tx_zerocopy(struct vhost_net

*net, struct socket *sock)

msg.msg_flags &= ~MSG_MORE;
}

-   /* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(sock, &msg, len);
if (unlikely(err < 0)) {
if (zcopy_used) {
@@ -931,11 +929,13 @@ static void handle_tx_zerocopy(struct vhost_net

*net, struct socket *sock)

nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
% UIO_MAXIOV;
}
-   vhost_discard_vq_desc(vq, 1);
-   vhost_net_enable_vq(net, vq);
-   break;
+   if (err == -EAGAIN || err == -ENOMEM) {
+   vhost_discard_vq_desc(vq, 1);
+   vhost_net_enable_vq(net, vq);
+   break;
+   }
}
-   if (err != len)
+   if (err >= 0 && err != len)
pr_debug("Truncated TX packet: "
 " len %d != %zd\n", err, len);
if (!zcopy_used)


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails

2020-12-23 Thread Jason Wang


On 2020/12/24 10:25 AM, wangyunjian wrote:

From: Yunjian Wang 

Currently the driver doesn't drop a packet which can't be sent by tun
(e.g. a bad packet). In this case, the driver will always process the
same packet, leading to a stuck tx queue.

To fix this issue:
1. in the case of a persistent failure (e.g. a bad packet), the driver
can skip this descriptor by ignoring the error.
2. in the case of a transient failure (e.g. -EAGAIN and -ENOMEM), the
driver schedules the worker to try again.



I might be wrong but looking at alloc_skb_with_frags(), it returns 
-ENOBUFS actually:


    *errcode = -ENOBUFS;
    skb = alloc_skb(header_len, gfp_mask);
    if (!skb)
        return NULL;

Thanks




Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")
Signed-off-by: Yunjian Wang 
---
  drivers/vhost/net.c | 16 ++++++++--------
  1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c8784dfafdd7..e76245daa7f6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -827,14 +827,13 @@ static void handle_tx_copy(struct vhost_net *net, struct 
socket *sock)
msg.msg_flags &= ~MSG_MORE;
}
  
-		/* TODO: Check specific error and bomb out unless ENOBUFS? */

err = sock->ops->sendmsg(sock, &msg, len);
-   if (unlikely(err < 0)) {
+   if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
vhost_discard_vq_desc(vq, 1);
vhost_net_enable_vq(net, vq);
break;
}
-   if (err != len)
+   if (err >= 0 && err != len)
pr_debug("Truncated TX packet: len %d != %zd\n",
 err, len);
  done:
@@ -922,7 +921,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, 
struct socket *sock)
msg.msg_flags &= ~MSG_MORE;
}
  
-		/* TODO: Check specific error and bomb out unless ENOBUFS? */

err = sock->ops->sendmsg(sock, &msg, len);
if (unlikely(err < 0)) {
if (zcopy_used) {
@@ -931,11 +929,13 @@ static void handle_tx_zerocopy(struct vhost_net *net, 
struct socket *sock)
nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
% UIO_MAXIOV;
}
-   vhost_discard_vq_desc(vq, 1);
-   vhost_net_enable_vq(net, vq);
-   break;
+   if (err == -EAGAIN || err == -ENOMEM) {
+   vhost_discard_vq_desc(vq, 1);
+   vhost_net_enable_vq(net, vq);
+   break;
+   }
}
-   if (err != len)
+   if (err >= 0 && err != len)
pr_debug("Truncated TX packet: "
 " len %d != %zd\n", err, len);
if (!zcopy_used)



Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace

2020-12-23 Thread Jason Wang


On 2020/12/23 10:17 PM, Yongji Xie wrote:

On Wed, Dec 23, 2020 at 4:08 PM Jason Wang  wrote:


On 2020/12/22 10:52 PM, Xie Yongji wrote:

This VDUSE driver enables implementing vDPA devices in userspace.
Both control path and data path of vDPA devices will be able to
be handled in userspace.

In the control path, the VDUSE driver will make use of a message
mechanism to forward the config operations from the vdpa bus driver
to userspace. Userspace can use read()/write() to receive/reply to
those control messages.

In the data path, the VDUSE driver implements an MMU-based on-chip
IOMMU driver which supports mapping the kernel dma buffer into a
userspace iova region dynamically. Userspace can access those
iova regions via mmap(). Besides, the eventfd mechanism is used to
trigger interrupt callbacks and receive virtqueue kicks in userspace.

For now, only the virtio-vdpa bus driver is supported with this patch applied.

Signed-off-by: Xie Yongji 
---
   Documentation/driver-api/vduse.rst |   74 ++
   Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
   drivers/vdpa/Kconfig   |8 +
   drivers/vdpa/Makefile  |1 +
   drivers/vdpa/vdpa_user/Makefile|5 +
   drivers/vdpa/vdpa_user/eventfd.c   |  221 
   drivers/vdpa/vdpa_user/eventfd.h   |   48 +
   drivers/vdpa/vdpa_user/iova_domain.c   |  442 
   drivers/vdpa/vdpa_user/iova_domain.h   |   93 ++
   drivers/vdpa/vdpa_user/vduse.h |   59 ++
   drivers/vdpa/vdpa_user/vduse_dev.c | 1121 

   include/uapi/linux/vdpa.h  |1 +
   include/uapi/linux/vduse.h |   99 ++
   13 files changed, 2173 insertions(+)
   create mode 100644 Documentation/driver-api/vduse.rst
   create mode 100644 drivers/vdpa/vdpa_user/Makefile
   create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
   create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
   create mode 100644 drivers/vdpa/vdpa_user/vduse.h
   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
   create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/driver-api/vduse.rst 
b/Documentation/driver-api/vduse.rst
new file mode 100644
index ..da9b3040f20a
--- /dev/null
+++ b/Documentation/driver-api/vduse.rst
@@ -0,0 +1,74 @@
+==================================
+VDUSE - "vDPA Device in Userspace"
+==================================
+
+A vDPA (virtio data path acceleration) device is a device whose
+datapath complies with the virtio specification, while its control
+path is vendor specific. vDPA devices can be either physically located
+on the hardware or emulated by software. VDUSE is a framework that makes
+it possible to implement software-emulated vDPA devices in userspace.
+
+How VDUSE works
+---------------
+Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
+the VDUSE character device (/dev/vduse). Then a file descriptor pointing
+to the new resources will be returned, which can be used to implement the
+userspace vDPA device's control path and data path.
+
+To implement the control path, read/write operations on the file descriptor
+are used to receive/reply to the control messages from/to the VDUSE driver.
+Those control messages are based on the vdpa_config_ops, which defines a
+unified interface to control different types of vDPA devices.
+
+The following types of messages are provided by the VDUSE framework now:
+
+- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
+
+- VDUSE_SET_VQ_NUM: Set the size of virtqueue
+
+- VDUSE_SET_VQ_READY: Set ready status of virtqueue
+
+- VDUSE_GET_VQ_READY: Get ready status of virtqueue
+
+- VDUSE_SET_FEATURES: Set virtio features supported by the driver
+
+- VDUSE_GET_FEATURES: Get virtio features supported by the device
+
+- VDUSE_SET_STATUS: Set the device status
+
+- VDUSE_GET_STATUS: Get the device status
+
+- VDUSE_SET_CONFIG: Write to device specific configuration space
+
+- VDUSE_GET_CONFIG: Read from device specific configuration space
+
+Please see include/linux/vdpa.h for details.
+
+In the data path, the VDUSE framework implements an MMU-based on-chip IOMMU
+driver which supports mapping the kernel dma buffer into a userspace iova
+region dynamically. The userspace iova region can be created by passing
+the userspace vDPA device fd to mmap(2).
+
+Besides, the eventfd mechanism is used to trigger interrupt callbacks and
+receive virtqueue kicks in userspace. The following ioctls on the userspace
+vDPA device fd are provided to support that:
+
+- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
+  by VDUSE driver to notify userspace to consume the vring.
+
+- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is u

Re: [External] Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-23 Thread Jason Wang


On 2020/12/23 8:14 PM, Yongji Xie wrote:

On Wed, Dec 23, 2020 at 5:05 PM Jason Wang  wrote:


On 2020/12/22 10:52 PM, Xie Yongji wrote:

To support vhost-vdpa bus driver, we need a way to share the
vhost-vdpa backend process's memory with the userspace VDUSE process.

This patch tries to make use of the vhost iotlb message to achieve
that. We will get the shm file from the iotlb message and pass it
to the userspace VDUSE process.

Signed-off-by: Xie Yongji 
---
   Documentation/driver-api/vduse.rst |  15 +++-
   drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++-
   include/uapi/linux/vduse.h |  11 +++
   3 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/Documentation/driver-api/vduse.rst 
b/Documentation/driver-api/vduse.rst
index 623f7b040ccf..48e4b1ba353f 100644
--- a/Documentation/driver-api/vduse.rst
+++ b/Documentation/driver-api/vduse.rst
@@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE 
framework now:

   - VDUSE_GET_CONFIG: Read from device specific configuration space

+- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
+
+- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
+
   Please see include/linux/vdpa.h for details.

-In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
+The data path of userspace vDPA device is implemented in different ways
+depending on the vdpa bus to which it is attached.
+
+In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
   driver which supports mapping the kernel dma buffer to a userspace iova
   region dynamically. The userspace iova region can be created by passing
   the userspace vDPA device fd to mmap(2).

+In the vhost-vdpa case, the dma buffer resides in a userspace memory region
+which will be shared with the VDUSE userspace process via the file
+descriptor in the VDUSE_UPDATE_IOTLB message. And the corresponding address
+mapping (IOVA of the dma buffer <-> VA of the memory region) is also included
+in this message.
+
   Besides, the eventfd mechanism is used to trigger interrupt callbacks and
   receive virtqueue kicks in userspace. The following ioctls on the userspace
   vDPA device fd are provided to support that:
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
index b974333ed4e9..d24aaacb6008 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -34,6 +34,7 @@

   struct vduse_dev_msg {
   struct vduse_dev_request req;
+ struct file *iotlb_file;
   struct vduse_dev_response resp;
   struct list_head list;
   wait_queue_head_t waitq;
@@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
   return ret;
   }

+static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
+ u64 offset, u64 iova, u64 size, u8 perm)
+{
+ struct vduse_dev_msg *msg;
+ int ret;
+
+ if (!size)
+ return -EINVAL;
+
+ msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
+ msg->req.size = sizeof(struct vduse_iotlb);
+ msg->req.iotlb.offset = offset;
+ msg->req.iotlb.iova = iova;
+ msg->req.iotlb.size = size;
+ msg->req.iotlb.perm = perm;
+ msg->req.iotlb.fd = -1;
+ msg->iotlb_file = get_file(file);
+
+ ret = vduse_dev_msg_sync(dev, msg);


My feeling is that we should provide a consistent API for the userspace
device to use.

E.g. we'd better carry the IOTLB message for both virtio/vhost drivers.

It looks to me that for virtio drivers we can still use the UPDATE_IOTLB
message by using the VDUSE file as msg->iotlb_file here.


It's OK for me. One problem is when to transfer the UPDATE_IOTLB
message in the virtio case.



Instead of generating IOTLB messages for userspace, how about recording
the mappings (which is a common case for devices that have an on-chip
IOMMU, e.g. mlx5e and the vdpa simulator)? Then we can introduce an ioctl
for userspace to query them.


Thanks





+ vduse_dev_msg_put(msg);
+ fput(file);
+
+ return ret;
+}
+
+static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev,
+ u64 iova, u64 size)
+{
+ struct vduse_dev_msg *msg;
+ int ret;
+
+ if (!size)
+ return -EINVAL;
+
+ msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB);
+ msg->req.size = sizeof(struct vduse_iotlb);
+ msg->req.iotlb.iova = iova;
+ msg->req.iotlb.size = size;
+
+ ret = vduse_dev_msg_sync(dev, msg);
+ vduse_dev_msg_put(msg);
+
+ return ret;
+}
+
+static unsigned int perm_to_file_flags(u8 perm)
+{
+ unsigned int flags = 0;
+
+ switch (perm) {
+ case VHOST_ACCESS_WO:
+ flags |= O_WRONLY;
+ break;
+ case VHOST_ACCESS_RO:
+ flags |= O_RDONLY;
+ break;
+ case VHOST_ACCESS_RW:
+ flags |= O_RDWR;
+ break;
+ default:
+ WARN(1, "invalidate vhost IOTLB permission\n");
+

Re: [PATCH net v4 2/2] vhost_net: fix tx queue stuck when sendmsg fails

2020-12-23 Thread Willem de Bruijn
On Wed, Dec 23, 2020 at 9:25 PM wangyunjian  wrote:
>
> From: Yunjian Wang 
>
> Currently the driver doesn't drop a packet which can't be sent by tun
> (e.g. a bad packet). In this case, the driver will always process the
> same packet, leading to a stuck tx queue.
>
> To fix this issue:
> 1. in the case of a persistent failure (e.g. a bad packet), the driver
> can skip this descriptor by ignoring the error.
> 2. in the case of a transient failure (e.g. -EAGAIN and -ENOMEM), the
> driver schedules the worker to try again.
>
> Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")
> Signed-off-by: Yunjian Wang 

Acked-by: Willem de Bruijn 

Thanks.


Re: [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops

2020-12-23 Thread Jason Wang


On 2020/12/23 7:06 PM, Yongji Xie wrote:

On Wed, Dec 23, 2020 at 4:37 PM Jason Wang  wrote:


On 2020/12/22 10:52 PM, Xie Yongji wrote:

This patch introduces a new method in the vdpa_config_ops to
support processing the raw vhost memory mapping message in the
vDPA device driver.

Signed-off-by: Xie Yongji 
---
   drivers/vhost/vdpa.c | 5 -
   include/linux/vdpa.h | 7 +++
   2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 448be7875b6d..ccbb391e38be 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev 
*dev,
   if (r)
   return r;

+ if (ops->process_iotlb_msg)
+ return ops->process_iotlb_msg(vdpa, msg);
+
   switch (msg->type) {
   case VHOST_IOTLB_UPDATE:
   r = vhost_vdpa_process_iotlb_update(v, msg);
@@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
   int ret;

   /* Device want to do DMA by itself */
- if (ops->set_map || ops->dma_map)
+ if (ops->set_map || ops->dma_map || ops->process_iotlb_msg)
   return 0;

   bus = dma_dev->bus;
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 656fe264234e..7bccedf22f4b 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -5,6 +5,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 

@@ -172,6 +173,10 @@ struct vdpa_iova_range {
*  @vdev: vdpa device
*  Returns the iova range supported by
*  the device.
+ * @process_iotlb_msg:   Process vhost memory mapping message 
(optional)
+ *   Only used for VDUSE device now
+ *   @vdev: vdpa device
+ *   @msg: vhost memory mapping message
* @set_map:Set device memory mapping (optional)
*  Needed for device that using device
*  specific DMA translation (on-chip IOMMU)
@@ -240,6 +245,8 @@ struct vdpa_config_ops {
   struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);

   /* DMA ops */
+ int (*process_iotlb_msg)(struct vdpa_device *vdev,
+  struct vhost_iotlb_msg *msg);
   int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
   int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
  u64 pa, u32 perm);


Is there any reason that it can't be done via dma_map/dma_unmap or set_map?


To get the shmfd, we need the vma rather than the physical address. And
it's not necessary to pin the user pages in the VDUSE case.



Right; actually, vhost-vDPA is planning to support shared virtual
address space.


So let's try to reuse the existing config ops. How about just introducing
an attribute on the vdpa device that tells the bus it can do shared
virtual memory? Then, when the device is probed by vhost-vDPA, user
pages won't be pinned and we will store VA->VA mappings (instead of
IOVA->PA) in the vhost IOTLB and the config ops. The vhost IOTLB needs
to be extended to accept an opaque pointer to store the file, and the
file would be passed via the config ops as well.


Thanks





Thanks,
Yongji




Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace

2020-12-23 Thread Jason Wang


On 2020/12/23 6:59 PM, Yongji Xie wrote:

On Wed, Dec 23, 2020 at 2:38 PM Jason Wang  wrote:


On 2020/12/22 10:52 PM, Xie Yongji wrote:

This series introduces a framework which can be used to implement
vDPA devices in a userspace program. The work consists of two parts:
control path forwarding and data path offloading.

In the control path, the VDUSE driver will make use of a message
mechanism to forward the config operations from the vdpa bus driver
to userspace. Userspace can use read()/write() to receive/reply to
those control messages.

In the data path, the core is mapping the dma buffer into the VDUSE
daemon's address space, which can be implemented in different ways
depending on the vdpa bus to which the vDPA device is attached.

In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver
with a bounce-buffering mechanism to achieve that.


Rethinking the bounce buffer design: I wonder, instead of using kernel
pages with mmap(), how about just using userspace pages like vhost does?

It means we need a worker to do the bouncing, but we don't need to care
about annoying things like page reclaiming.


Now the I/O bouncing is done in the streaming DMA mapping routines,
which can be called from interrupt context. If we put this into a
kworker, that means we would need to synchronize with the kworker
from interrupt context. I think that can't work.



We just need to make sure the buffer is ready before userspace tries
to access it.


But I admit it would be tricky (requiring a shadow virtqueue, etc.),
which is probably not a good idea.


Thanks




Thanks,
Yongji




Re: [PATCH v2] vhost/vsock: add IOTLB API support

2020-12-23 Thread Jason Wang


On 2020/12/23 10:36 PM, Stefano Garzarella wrote:

This patch enables the IOTLB API support for vhost-vsock devices,
allowing the userspace to emulate an IOMMU for the guest.

These changes were made following vhost-net; in detail, this patch:
- exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb
   device if the feature is acked
- implements VHOST_GET_BACKEND_FEATURES and
   VHOST_SET_BACKEND_FEATURES ioctls
- calls vq_meta_prefetch() before vq processing to prefetch vq
   metadata address in IOTLB
- provides .read_iter, .write_iter, and .poll callbacks for the
   chardev; they are used by the userspace to exchange IOTLB messages

This patch was tested specifying "intel_iommu=strict" in the guest
kernel command line. I used QEMU with a patch applied [1] to fix a
simple issue (that patch was merged in QEMU v5.2.0):
 $ qemu -M q35,accel=kvm,kernel-irqchip=split \
-drive file=fedora.qcow2,format=qcow2,if=virtio \
-device intel-iommu,intremap=on,device-iotlb=on \
-device vhost-vsock-pci,guest-cid=3,iommu_platform=on,ats=on

[1] https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg09077.html

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 



Acked-by: Jason Wang 



---

The patch is the same of v1, but I re-tested it with:
- QEMU v5.2.0-551-ga05f8ecd88
- Linux 5.9.15 (host)
- Linux 5.9.15 and 5.10.0 (guest)
Now, enabling 'ats' it works well, there are just a few simple changes.

v1: https://www.spinics.net/lists/kernel/msg3716022.html
v2:
- updated commit message about QEMU version and string used to test
- rebased on mst/vhost branch

Thanks,
Stefano
---
  drivers/vhost/vsock.c | 68 +--
  1 file changed, 65 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index a483cec31d5c..5e78fb719602 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -30,7 +30,12 @@
  #define VHOST_VSOCK_PKT_WEIGHT 256
  
  enum {

-   VHOST_VSOCK_FEATURES = VHOST_FEATURES,
+   VHOST_VSOCK_FEATURES = VHOST_FEATURES |
+  (1ULL << VIRTIO_F_ACCESS_PLATFORM)
+};
+
+enum {
+   VHOST_VSOCK_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2)
  };
  
  /* Used to track all the vhost_vsock instances on the system. */

@@ -94,6 +99,9 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
if (!vhost_vq_get_backend(vq))
goto out;
  
+	if (!vq_meta_prefetch(vq))
+		goto out;
+
/* Avoid further vmexits, we're already processing the virtqueue */
vhost_disable_notify(&vsock->dev, vq);
  
@@ -449,6 +457,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)

if (!vhost_vq_get_backend(vq))
goto out;
  
+	if (!vq_meta_prefetch(vq))
+		goto out;
+
vhost_disable_notify(&vsock->dev, vq);
do {
u32 len;
@@ -766,8 +777,12 @@ static int vhost_vsock_set_features(struct vhost_vsock 
*vsock, u64 features)
mutex_lock(&vsock->dev.mutex);
if ((features & (1 << VHOST_F_LOG_ALL)) &&
!vhost_log_access_ok(&vsock->dev)) {
-   mutex_unlock(&vsock->dev.mutex);
-   return -EFAULT;
+   goto err;
+   }
+
+   if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) {
+   if (vhost_init_device_iotlb(&vsock->dev, true))
+   goto err;
}
  
  	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {

@@ -778,6 +793,10 @@ static int vhost_vsock_set_features(struct vhost_vsock 
*vsock, u64 features)
}
mutex_unlock(&vsock->dev.mutex);
return 0;
+
+err:
+   mutex_unlock(&vsock->dev.mutex);
+   return -EFAULT;
  }
  
  static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,

@@ -811,6 +830,18 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned 
int ioctl,
if (copy_from_user(&features, argp, sizeof(features)))
return -EFAULT;
return vhost_vsock_set_features(vsock, features);
+   case VHOST_GET_BACKEND_FEATURES:
+   features = VHOST_VSOCK_BACKEND_FEATURES;
+   if (copy_to_user(argp, &features, sizeof(features)))
+   return -EFAULT;
+   return 0;
+   case VHOST_SET_BACKEND_FEATURES:
+   if (copy_from_user(&features, argp, sizeof(features)))
+   return -EFAULT;
+   if (features & ~VHOST_VSOCK_BACKEND_FEATURES)
+   return -EOPNOTSUPP;
+   vhost_set_backend_features(&vsock->dev, features);
+   return 0;
default:
mutex_lock(&vsock->dev.mutex);
r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
@@ -823,6 +854,34 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned 
int ioctl,
}
  }
  
+static ssize_t vhost_vsock_chr_read_iter(struct kiocb *i

Re: [PATCH net v3 2/2] vhost_net: fix tx queue stuck when sendmsg fails

2020-12-23 Thread Willem de Bruijn
On Wed, Dec 23, 2020 at 9:47 AM wangyunjian  wrote:
>
> From: Yunjian Wang 
>
> Currently the driver don't drop a packet which can't be send by tun
> (e.g bad packet). In this case, the driver will always process the
> same packet lead to the tx queue stuck.
>
> To fix this issue:
> 1. in the case of persistent failure (e.g bad packet), the driver
> can skip this descriptior by ignoring the error.
> 2. in the case of transient failure (e.g -EAGAIN and -ENOMEM), the
> driver schedules the worker to try again.
>

Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")

Since I have a few other comments, a few minor typo corrections too:
don't -> doesn't, send -> sent, descriptior -> descriptor.

> Signed-off-by: Yunjian Wang 
>
>  drivers/vhost/net.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index c8784dfafdd7..e49dd64d086a 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -827,9 +827,8 @@ static void handle_tx_copy(struct vhost_net *net, struct 
> socket *sock)
> msg.msg_flags &= ~MSG_MORE;
> }
>
> -   /* TODO: Check specific error and bomb out unless ENOBUFS? */
> err = sock->ops->sendmsg(sock, &msg, len);
> -   if (unlikely(err < 0)) {
> +   if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
> vhost_discard_vq_desc(vq, 1);
> vhost_net_enable_vq(net, vq);
> break;
> @@ -922,7 +921,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, 
> struct socket *sock)
> msg.msg_flags &= ~MSG_MORE;
> }
>
> -   /* TODO: Check specific error and bomb out unless ENOBUFS? */
> err = sock->ops->sendmsg(sock, &msg, len);
> if (unlikely(err < 0)) {
> if (zcopy_used) {
> @@ -931,9 +929,11 @@ static void handle_tx_zerocopy(struct vhost_net *net, 
> struct socket *sock)
> nvq->upend_idx = ((unsigned)nvq->upend_idx - 
> 1)
> % UIO_MAXIOV;
> }
> -   vhost_discard_vq_desc(vq, 1);
> -   vhost_net_enable_vq(net, vq);
> -   break;
> +   if (err == -EAGAIN || err == -ENOMEM) {
> +   vhost_discard_vq_desc(vq, 1);
> +   vhost_net_enable_vq(net, vq);
> +   break;
> +   }
> }
> if (err != len)
> pr_debug("Truncated TX packet: "

Probably my bad for feedback in patch 2/2, but now vhost will
incorrectly log bad packets as truncated packets.

This will need to be if (err >= 0 && err != len).

It would be nice if we could notify the guest in the transmit
descriptor when a packet was dropped due to failing integrity checks
(bad packet). But I don't think we easily can, so out of scope for
this fix.


Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-23 Thread Alexander Duyck
On Tue, Dec 22, 2020 at 7:39 PM Liang Li  wrote:
>
> > > +hugepage_reporting_cycle(struct page_reporting_dev_info *prdev,
> > > +struct hstate *h, unsigned int nid,
> > > +struct scatterlist *sgl, unsigned int *offset)
> > > +{
> > > +   struct list_head *list = &h->hugepage_freelists[nid];
> > > +   unsigned int page_len = PAGE_SIZE << h->order;
> > > +   struct page *page, *next;
> > > +   long budget;
> > > +   int ret = 0, scan_cnt = 0;
> > > +
> > > +   /*
> > > +* Perform early check, if free area is empty there is
> > > +* nothing to process so we can skip this free_list.
> > > +*/
> > > +   if (list_empty(list))
> > > +   return ret;
> > > +
> > > +   spin_lock_irq(&hugetlb_lock);
> > > +
> > > +   if (huge_page_order(h) > MAX_ORDER)
> > > +   budget = HUGEPAGE_REPORTING_CAPACITY;
> > > +   else
> > > +   budget = HUGEPAGE_REPORTING_CAPACITY * 32;
> >
> > Wouldn't huge_page_order always be more than MAX_ORDER? Seems like we
> > don't even really need budget since this should probably be pulling
> > out no more than one hugepage at a time.
>
I want to distinguish between a 2M page and a 1GB page here. The order of a
1GB page is greater than MAX_ORDER, while a 2M page's order is less than
MAX_ORDER.

The budget here is broken. When I put the budget in page reporting it
was so that we wouldn't try to report all of the memory in a given
region. It is meant to hold us to no more than one pass through 1/16
of the free memory. So essentially we will be slowly processing all of
memory and it will take 16 calls (32 seconds) for us to process a
system that is sitting completely idle. It is meant to pace us so we
don't spend a ton of time doing work that will be undone, not to
prevent us from burying a CPU which is what seems to be implied here.

Using HUGEPAGE_REPORTING_CAPACITY makes no sense here. I was using it
in the original definition because it was how many pages we could
scoop out at a time, and then I was aiming for a 16th of that. Here you
are arbitrarily squaring HUGEPAGE_REPORTING_CAPACITY in terms of the
amount of work you will do, since you are using it as a multiple
instead of a divisor.

> >
> > > +   /* loop through free list adding unreported pages to sg list */
> > > +   list_for_each_entry_safe(page, next, list, lru) {
> > > +   /* We are going to skip over the reported pages. */
> > > +   if (PageReported(page)) {
> > > +   if (++scan_cnt >= MAX_SCAN_NUM) {
> > > +   ret = scan_cnt;
> > > +   break;
> > > +   }
> > > +   continue;
> > > +   }
> > > +
> >
> > It would probably have been better to place this set before your new
> > set. I don't see your new set necessarily being the best use for page
> > reporting.
>
> I haven't really latched on to what you mean, could you explain it again?

It would be better for you to spend time understanding how this patch
set works before you go about expanding it to do other things.
Mistakes like the budget one above kind of point out the fact that you
don't understand how this code was supposed to work and just kind of
shoehorned your page zeroing code onto it.

It would be better to look at trying to understand this code first
before you extend it to support your zeroing use case. So adding huge
pages first might make more sense than trying to zero and push the
order down. The fact is the page reporting extension should be minimal
for huge pages since they are just passed as a scatterlist so you
should only need to add a small bit to page_reporting.c to extend it
to support this use case.

> >
> > > +   /*
> > > +* If we fully consumed our budget then update our
> > > +* state to indicate that we are requesting additional
> > > +* processing and exit this list.
> > > +*/
> > > +   if (budget < 0) {
> > > +   atomic_set(&prdev->state, 
> > > PAGE_REPORTING_REQUESTED);
> > > +   next = page;
> > > +   break;
> > > +   }
> > > +
> >
> > If budget is only ever going to be 1 then we probably could just look
> > at making this the default case for any time we find a non-reported
> > page.
>
> and here again.

It comes down to the fact that the changes you made have a significant
impact on how this is supposed to function. Reducing the scatterlist
to a size of one makes the whole point of doing batching kind of
pointless. Basically the code should be rewritten with the assumption
that if you find a page you report it.

The old code would batch things up because there is significant
overhead to be addressed when going to the hypervisor to report said
memory. Your code doesn't seem to really take anything like 

[PATCH v2] vhost/vsock: add IOTLB API support

2020-12-23 Thread Stefano Garzarella
This patch enables the IOTLB API support for vhost-vsock devices,
allowing the userspace to emulate an IOMMU for the guest.

These changes were made following vhost-net; in detail, this patch:
- exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb
  device if the feature is acked
- implements VHOST_GET_BACKEND_FEATURES and
  VHOST_SET_BACKEND_FEATURES ioctls
- calls vq_meta_prefetch() before vq processing to prefetch vq
  metadata address in IOTLB
- provides .read_iter, .write_iter, and .poll callbacks for the
  chardev; they are used by the userspace to exchange IOTLB messages

This patch was tested specifying "intel_iommu=strict" in the guest
kernel command line. I used QEMU with a patch applied [1] to fix a
simple issue (that patch was merged in QEMU v5.2.0):
$ qemu -M q35,accel=kvm,kernel-irqchip=split \
   -drive file=fedora.qcow2,format=qcow2,if=virtio \
   -device intel-iommu,intremap=on,device-iotlb=on \
   -device vhost-vsock-pci,guest-cid=3,iommu_platform=on,ats=on

[1] https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg09077.html

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 
---

The patch is the same as v1, but I re-tested it with:
- QEMU v5.2.0-551-ga05f8ecd88
- Linux 5.9.15 (host)
- Linux 5.9.15 and 5.10.0 (guest)
Now, enabling 'ats', it works well; there are just a few simple changes.

v1: https://www.spinics.net/lists/kernel/msg3716022.html
v2:
- updated commit message about QEMU version and string used to test
- rebased on mst/vhost branch

Thanks,
Stefano
---
 drivers/vhost/vsock.c | 68 +--
 1 file changed, 65 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index a483cec31d5c..5e78fb719602 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -30,7 +30,12 @@
 #define VHOST_VSOCK_PKT_WEIGHT 256
 
 enum {
-   VHOST_VSOCK_FEATURES = VHOST_FEATURES,
+   VHOST_VSOCK_FEATURES = VHOST_FEATURES |
+  (1ULL << VIRTIO_F_ACCESS_PLATFORM)
+};
+
+enum {
+   VHOST_VSOCK_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2)
 };
 
 /* Used to track all the vhost_vsock instances on the system. */
@@ -94,6 +99,9 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
if (!vhost_vq_get_backend(vq))
goto out;
 
+   if (!vq_meta_prefetch(vq))
+   goto out;
+
/* Avoid further vmexits, we're already processing the virtqueue */
vhost_disable_notify(&vsock->dev, vq);
 
@@ -449,6 +457,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work 
*work)
if (!vhost_vq_get_backend(vq))
goto out;
 
+   if (!vq_meta_prefetch(vq))
+   goto out;
+
vhost_disable_notify(&vsock->dev, vq);
do {
u32 len;
@@ -766,8 +777,12 @@ static int vhost_vsock_set_features(struct vhost_vsock 
*vsock, u64 features)
mutex_lock(&vsock->dev.mutex);
if ((features & (1 << VHOST_F_LOG_ALL)) &&
!vhost_log_access_ok(&vsock->dev)) {
-   mutex_unlock(&vsock->dev.mutex);
-   return -EFAULT;
+   goto err;
+   }
+
+   if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) {
+   if (vhost_init_device_iotlb(&vsock->dev, true))
+   goto err;
}
 
for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
@@ -778,6 +793,10 @@ static int vhost_vsock_set_features(struct vhost_vsock 
*vsock, u64 features)
}
mutex_unlock(&vsock->dev.mutex);
return 0;
+
+err:
+   mutex_unlock(&vsock->dev.mutex);
+   return -EFAULT;
 }
 
 static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
@@ -811,6 +830,18 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned 
int ioctl,
if (copy_from_user(&features, argp, sizeof(features)))
return -EFAULT;
return vhost_vsock_set_features(vsock, features);
+   case VHOST_GET_BACKEND_FEATURES:
+   features = VHOST_VSOCK_BACKEND_FEATURES;
+   if (copy_to_user(argp, &features, sizeof(features)))
+   return -EFAULT;
+   return 0;
+   case VHOST_SET_BACKEND_FEATURES:
+   if (copy_from_user(&features, argp, sizeof(features)))
+   return -EFAULT;
+   if (features & ~VHOST_VSOCK_BACKEND_FEATURES)
+   return -EOPNOTSUPP;
+   vhost_set_backend_features(&vsock->dev, features);
+   return 0;
default:
mutex_lock(&vsock->dev.mutex);
r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
@@ -823,6 +854,34 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned 
int ioctl,
}
 }
 
+static ssize_t vhost_vsock_chr_read_iter(struct kiocb *iocb, struct iov_iter 
*to)
+{
+   struct file *file = iocb->ki_filp;
+   struct vhost_

Re: [PATCH net v2 2/2] vhost_net: fix high cpu load when sendmsg fails

2020-12-23 Thread Willem de Bruijn
On Wed, Dec 23, 2020 at 8:21 AM wangyunjian  wrote:
>
> > -Original Message-
> > From: Jason Wang [mailto:jasow...@redhat.com]
> > Sent: Wednesday, December 23, 2020 10:54 AM
> > To: Willem de Bruijn 
> > Cc: wangyunjian ; Network Development
> > ; Michael S. Tsirkin ;
> > virtualization@lists.linux-foundation.org; Lilijun (Jerry)
> > ; chenchanghu ;
> > xudingke ; huangbin (J)
> > 
> > Subject: Re: [PATCH net v2 2/2] vhost_net: fix high cpu load when sendmsg 
> > fails
> >
> >
> > On 2020/12/22 下午10:24, Willem de Bruijn wrote:
> > > On Mon, Dec 21, 2020 at 11:41 PM Jason Wang 
> > wrote:
> > >>
> > >> On 2020/12/22 上午7:07, Willem de Bruijn wrote:
> > >>> On Wed, Dec 16, 2020 at 3:20 AM wangyunjian
> > wrote:
> >  From: Yunjian Wang
> > 
> >  Currently we break the loop and wake up the vhost_worker when
> >  sendmsg fails. When the worker wakes up again, we'll meet the same
> >  error.
> > >>> The patch is based on the assumption that such error cases always
> > >>> return EAGAIN. Can it not also be ENOMEM, such as from tun_build_skb?
> > >>>
> >  This will cause high CPU load. To fix this issue, we can skip this
> >  description by ignoring the error. When we exceeds sndbuf, the
> >  return value of sendmsg is -EAGAIN. In the case we don't skip the
> >  description and don't drop packet.
> > >>> the -> that
> > >>>
> > >>> here and above: description -> descriptor
> > >>>
> > >>> Perhaps slightly revise to more explicitly state that
> > >>>
> > >>> 1. in the case of persistent failure (i.e., bad packet), the driver
> > >>> drops the packet 2. in the case of transient failure (e.g,. memory
> > >>> pressure) the driver schedules the worker to try again later
> > >>
> > >> If we want to go with this way, we need a better time to wakeup the
> > >> worker. Otherwise it just produces more stress on the cpu that is
> > >> what this patch tries to avoid.
> > > Perhaps I misunderstood the purpose of the patch: is it to drop
> > > everything, regardless of transient or persistent failure, until the
> > > ring runs out of descriptors?
> >
> >
> > My understanding is that the main motivation is to avoid high cpu 
> > utilization
> > when sendmsg() fail due to guest reason (e.g bad packet).
> >
>
> My main motivation is to avoid the tx queue stuck.
>
> Should I describe it like this:
> Currently the driver doesn't drop a packet which can't be sent by tun
> (e.g bad packet). In this case, the driver will always process the
> same packet, leading to the tx queue getting stuck.
>
> To fix this issue:
> 1. in the case of persistent failure (e.g bad packet), the driver can skip
> this descriptor by ignoring the error.
> 2. in the case of transient failure (e.g -EAGAIN and -ENOMEM), the driver
> schedules the worker to try again.

That sounds good to me, thanks.

> Thanks
>
> >
> > >
> > > I can understand both a blocking and drop strategy during memory
> > > pressure. But partial drop strategy until exceeding ring capacity
> > > seems like a peculiar hybrid?
> >
> >
> > Yes. So I wonder if we want to be do better when we are in the memory
> > pressure. E.g can we let socket wake up us instead of rescheduling the
> > workers here? At least in this case we know some memory might be freed?

I don't know whether a blocking or drop strategy is the better choice.
Either way, it probably deserves to be handled separately.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[GIT PULL] virtio,vdpa: features, cleanups, fixes

2020-12-23 Thread Michael S. Tsirkin
The following changes since commit 2c85ebc57b3e1817b6ce1a6b703928e113a90442:

  Linux 5.10 (2020-12-13 14:41:30 -0800)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 418eddef050d5f6393c303a94e3173847ab85466:

  vdpa: Use simpler version of ida allocation (2020-12-18 16:14:31 -0500)


virtio,vdpa: features, cleanups, fixes

vdpa sim refactoring
virtio mem Big Block Mode support
misc cleanups, fixes

Signed-off-by: Michael S. Tsirkin 


Christophe JAILLET (1):
  vdpa: ifcvf: Use dma_set_mask_and_coherent to simplify code

Dan Carpenter (3):
  virtio_ring: Cut and paste bugs in vring_create_virtqueue_packed()
  virtio_net: Fix error code in probe()
  virtio_ring: Fix two use after free bugs

David Hildenbrand (29):
  virtio-mem: determine nid only once using memory_add_physaddr_to_nid()
  virtio-mem: more precise calculation in 
virtio_mem_mb_state_prepare_next_mb()
  virtio-mem: simplify MAX_ORDER - 1 / pageblock_order handling
  virtio-mem: drop rc2 in virtio_mem_mb_plug_and_add()
  virtio-mem: use "unsigned long" for nr_pages when fake onlining/offlining
  virtio-mem: factor out calculation of the bit number within the subblock 
bitmap
  virtio-mem: print debug messages from virtio_mem_send_*_request()
  virtio-mem: factor out fake-offlining into virtio_mem_fake_offline()
  virtio-mem: factor out handling of fake-offline pages in memory notifier
  virtio-mem: retry fake-offlining via alloc_contig_range() on ZONE_MOVABLE
  virtio-mem: generalize check for added memory
  virtio-mem: generalize virtio_mem_owned_mb()
  virtio-mem: generalize virtio_mem_overlaps_range()
  virtio-mem: drop last_mb_id
  virtio-mem: don't always trigger the workqueue when offlining memory
  virtio-mem: generalize handling when memory is getting onlined deferred
  virito-mem: document Sub Block Mode (SBM)
  virtio-mem: memory block states are specific to Sub Block Mode (SBM)
  virito-mem: subblock states are specific to Sub Block Mode (SBM)
  virtio-mem: nb_sb_per_mb and subblock_size are specific to Sub Block Mode 
(SBM)
  virtio-mem: memory block ids are specific to Sub Block Mode (SBM)
  virito-mem: existing (un)plug functions are specific to Sub Block Mode 
(SBM)
  virtio-mem: memory notifier callbacks are specific to Sub Block Mode (SBM)
  virtio-mem: factor out adding/removing memory from Linux
  virtio-mem: Big Block Mode (BBM) memory hotplug
  virtio-mem: allow to force Big Block Mode (BBM) and set the big block size
  mm/memory_hotplug: extend offline_and_remove_memory() to handle more than 
one memory block
  virtio-mem: Big Block Mode (BBM) - basic memory hotunplug
  virtio-mem: Big Block Mode (BBM) - safe memory hotunplug

Eli Cohen (1):
  vdpa/mlx5: Use write memory barrier after updating CQ index

Enrico Weigelt, metux IT consult (2):
  uapi: virtio_ids.h: consistent indentions
  uapi: virtio_ids: add missing device type IDs from OASIS spec

Max Gurtovoy (2):
  vdpa_sim: remove hard-coded virtq count
  vdpa: split vdpasim to core and net modules

Parav Pandit (2):
  vdpa: Add missing comment for virtqueue count
  vdpa: Use simpler version of ida allocation

Peng Fan (3):
  tools/virtio: include asm/bug.h
  tools/virtio: add krealloc_array
  tools/virtio: add barrier for aarch64

Stefano Garzarella (16):
  vdpa: remove unnecessary 'default n' in Kconfig entries
  vdpa_sim: remove unnecessary headers inclusion
  vdpa_sim: make IOTLB entries limit configurable
  vdpa_sim: rename vdpasim_config_ops variables
  vdpa_sim: add struct vdpasim_dev_attr for device attributes
  vdpa_sim: add device id field in vdpasim_dev_attr
  vdpa_sim: add supported_features field in vdpasim_dev_attr
  vdpa_sim: add work_fn in vdpasim_dev_attr
  vdpa_sim: store parsed MAC address in a buffer
  vdpa_sim: make 'config' generic and usable for any device type
  vdpa_sim: add get_config callback in vdpasim_dev_attr
  vdpa_sim: add set_config callback in vdpasim_dev_attr
  vdpa_sim: set vringh notify callback
  vdpa_sim: use kvmalloc to allocate vdpasim->buffer
  vdpa_sim: make vdpasim->buffer size configurable
  vdpa_sim: split vdpasim_virtqueue's iov field in out_iov and in_iov

Tian Tao (1):
  vhost_vdpa: switch to vmemdup_user()

Zhang Changzhong (1):
  vhost scsi: fix error return code in vhost_scsi_set_endpoint()

 drivers/net/virtio_net.c |1 +
 drivers/vdpa/Kconfig |   18 +-
 drivers/vdpa/ifcvf/ifcvf_main.c  |   11 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c|5 +
 drivers/vdpa/vdpa.c  |2 +-
 drivers/vdpa/vdpa_sim/Mak

[PATCH v2] vdpa_sim: use iova module to allocate IOVA addresses

2020-12-23 Thread Stefano Garzarella
The identical mapping used until now created issues when mapping
different virtual pages with the same physical address.
To solve this issue, we can use the iova module, to handle the IOVA
allocation.
For simplicity we use an IOVA allocator with byte granularity.

We add two new functions, vdpasim_map_range() and vdpasim_unmap_range(),
to handle the IOVA allocation and the registration into the IOMMU/IOTLB.
These functions are used by dma_map_ops callbacks.

Acked-by: Jason Wang 
Signed-off-by: Stefano Garzarella 
---
v2:
- used ULONG_MAX instead of ~0UL [Jason]
- fixed typos in comment and patch description [Jason]
---
 drivers/vdpa/vdpa_sim/vdpa_sim.h |   2 +
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 108 +++
 drivers/vdpa/Kconfig |   1 +
 3 files changed, 69 insertions(+), 42 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.h b/drivers/vdpa/vdpa_sim/vdpa_sim.h
index b02142293d5b..6efe205e583e 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.h
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.h
@@ -6,6 +6,7 @@
 #ifndef _VDPA_SIM_H
 #define _VDPA_SIM_H
 
+#include 
 #include 
 #include 
 #include 
@@ -55,6 +56,7 @@ struct vdpasim {
/* virtio config according to device type */
void *config;
struct vhost_iotlb *iommu;
+   struct iova_domain iova;
void *buffer;
u32 status;
u32 generation;
diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index b3fcc67bfdf0..edc930719fb8 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "vdpa_sim.h"
 
@@ -128,30 +129,57 @@ static int dir_to_perm(enum dma_data_direction dir)
return perm;
 }
 
+static dma_addr_t vdpasim_map_range(struct vdpasim *vdpasim, phys_addr_t paddr,
+   size_t size, unsigned int perm)
+{
+   struct iova *iova;
+   dma_addr_t dma_addr;
+   int ret;
+
+   /* We set the limit_pfn to the maximum (ULONG_MAX - 1) */
+   iova = alloc_iova(&vdpasim->iova, size, ULONG_MAX - 1, true);
+   if (!iova)
+   return DMA_MAPPING_ERROR;
+
+   dma_addr = iova_dma_addr(&vdpasim->iova, iova);
+
+   spin_lock(&vdpasim->iommu_lock);
+   ret = vhost_iotlb_add_range(vdpasim->iommu, (u64)dma_addr,
+   (u64)dma_addr + size - 1, (u64)paddr, perm);
+   spin_unlock(&vdpasim->iommu_lock);
+
+   if (ret) {
+   __free_iova(&vdpasim->iova, iova);
+   return DMA_MAPPING_ERROR;
+   }
+
+   return dma_addr;
+}
+
+static void vdpasim_unmap_range(struct vdpasim *vdpasim, dma_addr_t dma_addr,
+   size_t size)
+{
+   spin_lock(&vdpasim->iommu_lock);
+   vhost_iotlb_del_range(vdpasim->iommu, (u64)dma_addr,
+ (u64)dma_addr + size - 1);
+   spin_unlock(&vdpasim->iommu_lock);
+
+   free_iova(&vdpasim->iova, iova_pfn(&vdpasim->iova, dma_addr));
+}
+
 static dma_addr_t vdpasim_map_page(struct device *dev, struct page *page,
   unsigned long offset, size_t size,
   enum dma_data_direction dir,
   unsigned long attrs)
 {
struct vdpasim *vdpasim = dev_to_sim(dev);
-   struct vhost_iotlb *iommu = vdpasim->iommu;
-   u64 pa = (page_to_pfn(page) << PAGE_SHIFT) + offset;
-   int ret, perm = dir_to_perm(dir);
+   phys_addr_t paddr = page_to_phys(page) + offset;
+   int perm = dir_to_perm(dir);
 
if (perm < 0)
return DMA_MAPPING_ERROR;
 
-   /* For simplicity, use identical mapping to avoid e.g iova
-* allocator.
-*/
-   spin_lock(&vdpasim->iommu_lock);
-   ret = vhost_iotlb_add_range(iommu, pa, pa + size - 1,
-   pa, dir_to_perm(dir));
-   spin_unlock(&vdpasim->iommu_lock);
-   if (ret)
-   return DMA_MAPPING_ERROR;
-
-   return (dma_addr_t)(pa);
+   return vdpasim_map_range(vdpasim, paddr, size, perm);
 }
 
 static void vdpasim_unmap_page(struct device *dev, dma_addr_t dma_addr,
@@ -159,12 +187,8 @@ static void vdpasim_unmap_page(struct device *dev, 
dma_addr_t dma_addr,
   unsigned long attrs)
 {
struct vdpasim *vdpasim = dev_to_sim(dev);
-   struct vhost_iotlb *iommu = vdpasim->iommu;
 
-   spin_lock(&vdpasim->iommu_lock);
-   vhost_iotlb_del_range(iommu, (u64)dma_addr,
- (u64)dma_addr + size - 1);
-   spin_unlock(&vdpasim->iommu_lock);
+   vdpasim_unmap_range(vdpasim, dma_addr, size);
 }
 
 static void *vdpasim_alloc_coherent(struct device *dev, size_t size,
@@ -172,27 +196,22 @@ static void *vdpasim_alloc_coherent(struct device *dev, 
size_t size,
unsigned long attrs)
 {
struct vdpasim *vdpasim = dev_to_sim

Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-23 Thread Jason Wang


On 2020/12/22 下午10:52, Xie Yongji wrote:

To support vhost-vdpa bus driver, we need a way to share the
vhost-vdpa backend process's memory with the userspace VDUSE process.

This patch tries to make use of the vhost iotlb message to achieve
that. We will get the shm file from the iotlb message and pass it
to the userspace VDUSE process.

Signed-off-by: Xie Yongji 
---
  Documentation/driver-api/vduse.rst |  15 +++-
  drivers/vdpa/vdpa_user/vduse_dev.c | 147 -
  include/uapi/linux/vduse.h |  11 +++
  3 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/Documentation/driver-api/vduse.rst 
b/Documentation/driver-api/vduse.rst
index 623f7b040ccf..48e4b1ba353f 100644
--- a/Documentation/driver-api/vduse.rst
+++ b/Documentation/driver-api/vduse.rst
@@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE 
framework now:
  
  - VDUSE_GET_CONFIG: Read from device specific configuration space
  
+- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB

+
+- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
+
  Please see include/linux/vdpa.h for details.
  
-In the data path, VDUSE framework implements a MMU-based on-chip IOMMU

+The data path of userspace vDPA device is implemented in different ways
+depending on the vdpa bus to which it is attached.
+
+In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
  driver which supports mapping the kernel dma buffer to a userspace iova
  region dynamically. The userspace iova region can be created by passing
  the userspace vDPA device fd to mmap(2).
  
+In vhost-vdpa case, the dma buffer resides in a userspace memory region

+which will be shared with the VDUSE userspace process via the file
+descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address
+mapping (IOVA of dma buffer <-> VA of the memory region) is also included
+in this message.
+
  Besides, the eventfd mechanism is used to trigger interrupt callbacks and
  receive virtqueue kicks in userspace. The following ioctls on the userspace
  vDPA device fd are provided to support that:
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
index b974333ed4e9..d24aaacb6008 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -34,6 +34,7 @@
  
  struct vduse_dev_msg {

struct vduse_dev_request req;
+   struct file *iotlb_file;
struct vduse_dev_response resp;
struct list_head list;
wait_queue_head_t waitq;
@@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
return ret;
  }
  
+static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,

+   u64 offset, u64 iova, u64 size, u8 perm)
+{
+   struct vduse_dev_msg *msg;
+   int ret;
+
+   if (!size)
+   return -EINVAL;
+
+   msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
+   msg->req.size = sizeof(struct vduse_iotlb);
+   msg->req.iotlb.offset = offset;
+   msg->req.iotlb.iova = iova;
+   msg->req.iotlb.size = size;
+   msg->req.iotlb.perm = perm;
+   msg->req.iotlb.fd = -1;
+   msg->iotlb_file = get_file(file);
+
+   ret = vduse_dev_msg_sync(dev, msg);



My feeling is that we should provide a consistent API for the userspace 
device to use.


E.g we'd better carry the IOTLB message for both virtio/vhost drivers.

It looks to me that for virtio drivers we can still use the UPDATE_IOTLB 
message by using the VDUSE file as msg->iotlb_file here.




+   vduse_dev_msg_put(msg);
+   fput(file);
+
+   return ret;
+}
+
+static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev,
+   u64 iova, u64 size)
+{
+   struct vduse_dev_msg *msg;
+   int ret;
+
+   if (!size)
+   return -EINVAL;
+
+   msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB);
+   msg->req.size = sizeof(struct vduse_iotlb);
+   msg->req.iotlb.iova = iova;
+   msg->req.iotlb.size = size;
+
+   ret = vduse_dev_msg_sync(dev, msg);
+   vduse_dev_msg_put(msg);
+
+   return ret;
+}
+
+static unsigned int perm_to_file_flags(u8 perm)
+{
+   unsigned int flags = 0;
+
+   switch (perm) {
+   case VHOST_ACCESS_WO:
+   flags |= O_WRONLY;
+   break;
+   case VHOST_ACCESS_RO:
+   flags |= O_RDONLY;
+   break;
+   case VHOST_ACCESS_RW:
+   flags |= O_RDWR;
+   break;
+   default:
+   WARN(1, "invalidate vhost IOTLB permission\n");
+   break;
+   }
+
+   return flags;
+}
+
  static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
  {
struct file *file = iocb->ki_filp;
struct vduse_dev *dev = file->private_data;
struct vduse_dev_msg *msg;
-   int size = sizeof(struct vduse_dev_request);
+

Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2020-12-23 Thread David Hildenbrand
[...]

>> I was rather saying that for security it's of little use IMHO.
>> Application/VM start up time might be improved by using huge pages (and
>> pre-zeroing these). Free page reporting might be improved by using
>> MADV_FREE instead of MADV_DONTNEED in the hypervisor.
>>
>>> this feature, above all of them, which one is likely to become the
>>> most strong one?  From the implementation, you will find it is
>>> configurable, users don't want to use it can turn it off.  This is not
>>> an option?
>>
>> Well, we have to maintain the feature and sacrifice a page flag. For
>> example, do we expect someone explicitly enabling the feature just to
>> speed up startup time of an app that consumes a lot of memory? I highly
>> doubt it.
> 
> In our production environment, there are three main applications have such
> requirement, one is QEMU [creating a VM with SR-IOV passthrough device],
> anther other two are DPDK related applications, DPDK OVS and SPDK vhost,
> for best performance, they populate memory when starting up. For SPDK vhost,
> we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for
> vhost 'live' upgrade, which is done by killing the old process and
> starting a new
> one with the new binary. In this case, we want the new process started as 
> quick
> as possible to shorten the service downtime. We really enable this feature
> to speed up startup time for them  :)

Thanks for info on the use case!

All of these use cases either already use, or could use, huge pages
IMHO. It's not your ordinary proprietary gaming app :) This is where
pre-zeroing of huge pages could already help.

Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ...
creating a file and pre-zeroing it from another process, or am I missing
something important? At least for QEMU this should work AFAIK, where you
can just pass the file to be used via memory-backend-file.

> 
>> I'd love to hear opinions of other people. (a lot of people are offline
>> until beginning of January, including, well, actually me :) )
> 
> OK. I will wait some time for others' feedback. Happy holidays!

To you too, cheers!


-- 
Thanks,

David / dhildenb



Re: [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops

2020-12-23 Thread Jason Wang


On 2020/12/22 下午10:52, Xie Yongji wrote:

This patch introduces a new method in the vdpa_config_ops to
support processing the raw vhost memory mapping message in the
vDPA device driver.

Signed-off-by: Xie Yongji 
---
  drivers/vhost/vdpa.c | 5 -
  include/linux/vdpa.h | 7 +++
  2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 448be7875b6d..ccbb391e38be 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev 
*dev,
if (r)
return r;
  
+	if (ops->process_iotlb_msg)

+   return ops->process_iotlb_msg(vdpa, msg);
+
switch (msg->type) {
case VHOST_IOTLB_UPDATE:
r = vhost_vdpa_process_iotlb_update(v, msg);
@@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
int ret;
  
  	/* Device want to do DMA by itself */

-   if (ops->set_map || ops->dma_map)
+   if (ops->set_map || ops->dma_map || ops->process_iotlb_msg)
return 0;
  
  	bus = dma_dev->bus;

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 656fe264234e..7bccedf22f4b 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -5,6 +5,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -172,6 +173,10 @@ struct vdpa_iova_range {

   *@vdev: vdpa device
   *Returns the iova range supported by
   *the device.
+ * @process_iotlb_msg: Process vhost memory mapping message (optional)
+ * Only used for VDUSE device now
+ * @vdev: vdpa device
+ * @msg: vhost memory mapping message
   * @set_map:  Set device memory mapping (optional)
   *Needed for device that using device
   *specific DMA translation (on-chip IOMMU)
@@ -240,6 +245,8 @@ struct vdpa_config_ops {
struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);
  
  	/* DMA ops */

+   int (*process_iotlb_msg)(struct vdpa_device *vdev,
+struct vhost_iotlb_msg *msg);
int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
   u64 pa, u32 perm);



Is there any reason that it can't be done via dma_map/dma_unmap or set_map?

Thanks



Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace

2020-12-23 Thread Jason Wang


On 2020/12/23 下午2:38, Jason Wang wrote:


V1 to V2:
- Add vhost-vdpa support



I may miss something but I don't see any code to support that. E.g 
neither set_map nor dma_map/unmap is implemented in the config ops.


Thanks 



Spoke too fast :(

I saw a new config ops was introduced.

Let me dive into that.

Thanks


Re: [RFC v2 01/13] mm: export zap_page_range() for driver use

2020-12-23 Thread Christoph Hellwig
On Wed, Dec 23, 2020 at 02:32:07PM +0800, Yongji Xie wrote:
> Now I want to map/unmap some pages in an userland vma dynamically. The
> vm_insert_page() is being used for mapping. In the unmapping case, it
> looks like the zap_page_range() does what I want. So I export it.
> Otherwise, we need some ways to notify userspace to trigger it with
> madvise(MADV_DONTNEED), which might not be able to meet all our needs.
> For example, unmapping some pages in a memory shrinker function.
> 
> So I'd like to know what's the limitation to use zap_page_range() in a
> module. And if we can't use it in a module, is there any acceptable
> way to achieve that?

I think the answer is: don't play funny games with unmapping outside of
munmap.  Especially as synchronization is very hard to get right.


Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace

2020-12-23 Thread Jason Wang


On 2020/12/22 下午10:52, Xie Yongji wrote:

This VDUSE driver enables implementing vDPA devices in userspace.
Both control path and data path of vDPA devices will be able to
be handled in userspace.

In the control path, the VDUSE driver will make use of a message
mechanism to forward the config operations from the vdpa bus driver
to userspace. Userspace can use read()/write() to receive/reply to
those control messages.

In the data path, the VDUSE driver implements an MMU-based on-chip
IOMMU driver which supports mapping the kernel dma buffer to a
userspace iova region dynamically. Userspace can access that iova
region via mmap(). Besides, the eventfd mechanism is used to
trigger interrupt callbacks and receive virtqueue kicks in userspace.

Now we only support virtio-vdpa bus driver with this patch applied.

Signed-off-by: Xie Yongji 
---
  Documentation/driver-api/vduse.rst |   74 ++
  Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
  drivers/vdpa/Kconfig   |8 +
  drivers/vdpa/Makefile  |1 +
  drivers/vdpa/vdpa_user/Makefile|5 +
  drivers/vdpa/vdpa_user/eventfd.c   |  221 
  drivers/vdpa/vdpa_user/eventfd.h   |   48 +
  drivers/vdpa/vdpa_user/iova_domain.c   |  442 
  drivers/vdpa/vdpa_user/iova_domain.h   |   93 ++
  drivers/vdpa/vdpa_user/vduse.h |   59 ++
  drivers/vdpa/vdpa_user/vduse_dev.c | 1121 
  include/uapi/linux/vdpa.h  |1 +
  include/uapi/linux/vduse.h |   99 ++
  13 files changed, 2173 insertions(+)
  create mode 100644 Documentation/driver-api/vduse.rst
  create mode 100644 drivers/vdpa/vdpa_user/Makefile
  create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
  create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
  create mode 100644 drivers/vdpa/vdpa_user/vduse.h
  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
  create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/driver-api/vduse.rst 
b/Documentation/driver-api/vduse.rst
new file mode 100644
index ..da9b3040f20a
--- /dev/null
+++ b/Documentation/driver-api/vduse.rst
@@ -0,0 +1,74 @@
+==
+VDUSE - "vDPA Device in Userspace"
+==
+
+A vDPA (virtio data path acceleration) device is a device whose
+datapath complies with the virtio specification, but whose control
+path is vendor specific. vDPA devices can either be physically located
+on the hardware or emulated by software. VDUSE is a framework that
+makes it possible to implement software-emulated vDPA devices in
+userspace.
+
+How VDUSE works
+---------------
+Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
+the VDUSE character device (/dev/vduse). Then a file descriptor pointing
+to the new resources will be returned, which can be used to implement the
+userspace vDPA device's control path and data path.
+
+To implement the control path, read/write operations on the file
+descriptor are used to receive control messages from, and reply to, the
+VDUSE driver. Those control messages are based on vdpa_config_ops, which
+defines a unified interface for controlling different types of vDPA device.
+
+The following types of messages are currently provided by the VDUSE framework:
+
+- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of the virtqueue
+
+- VDUSE_SET_VQ_NUM: Set the size of the virtqueue
+
+- VDUSE_SET_VQ_READY: Set the ready status of the virtqueue
+
+- VDUSE_GET_VQ_READY: Get the ready status of the virtqueue
+
+- VDUSE_SET_FEATURES: Set virtio features supported by the driver
+
+- VDUSE_GET_FEATURES: Get virtio features supported by the device
+
+- VDUSE_SET_STATUS: Set the device status
+
+- VDUSE_GET_STATUS: Get the device status
+
+- VDUSE_SET_CONFIG: Write to device specific configuration space
+
+- VDUSE_GET_CONFIG: Read from device specific configuration space
+
+Please see include/linux/vdpa.h for details.
+
+In the data path, the VDUSE framework implements an MMU-based on-chip
+IOMMU driver which supports dynamically mapping kernel DMA buffers into
+a userspace iova region. The userspace iova region can be created by
+passing the userspace vDPA device fd to mmap(2).
+
+Besides, the eventfd mechanism is used to trigger interrupt callbacks and
+receive virtqueue kicks in userspace. The following ioctls on the userspace
+vDPA device fd are provided to support that:
+
+- VDUSE_VQ_SETUP_KICKFD: Set the kickfd for a virtqueue. This eventfd is
+  used by the VDUSE driver to notify userspace to consume the vring.
+
+- VDUSE_VQ_SETUP_IRQFD: Set the irqfd for a virtqueue. This eventfd is
+  used by userspace to notify the VDUSE driver to trigger interrupt
+  callbacks.
+
+MMU-based IOMMU Driver
+----------------------