Re: [PATCH virtio] pds_vdpa: protect Makefile from unconfigured debugfs

2023-07-06 Thread Randy Dunlap



On 7/6/23 16:17, Shannon Nelson wrote:
> debugfs.h protects itself from an undefined DEBUG_FS, so it is
> not necessary to check it in the driver code or the Makefile.
> The driver code had been updated for this, but the Makefile had
> missed the update.
> 
> Link: 
> https://lore.kernel.org/linux-next/fec68c3c-8249-7af4-5390-0495386a7...@infradead.org/
> Fixes: a16291b5bcbb ("pds_vdpa: Add new vDPA driver for AMD/Pensando DSC")
> Signed-off-by: Shannon Nelson 

Reviewed-by: Randy Dunlap 
Tested-by: Randy Dunlap  # build-tested

Thanks.

> ---
>  drivers/vdpa/pds/Makefile | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/vdpa/pds/Makefile b/drivers/vdpa/pds/Makefile
> index 2e22418e3ab3..c2d314d4614d 100644
> --- a/drivers/vdpa/pds/Makefile
> +++ b/drivers/vdpa/pds/Makefile
> @@ -5,6 +5,5 @@ obj-$(CONFIG_PDS_VDPA) := pds_vdpa.o
>  
>  pds_vdpa-y := aux_drv.o \
> cmds.o \
> +   debugfs.o \
> vdpa_dev.o
> -
> -pds_vdpa-$(CONFIG_DEBUG_FS) += debugfs.o

-- 
~Randy


[PATCH virtio] pds_vdpa: protect Makefile from unconfigured debugfs

2023-07-06 Thread Shannon Nelson via Virtualization
debugfs.h protects itself from an undefined DEBUG_FS, so it is
not necessary to check it in the driver code or the Makefile.
The driver code had been updated for this, but the Makefile had
missed the update.

Link: 
https://lore.kernel.org/linux-next/fec68c3c-8249-7af4-5390-0495386a7...@infradead.org/
Fixes: a16291b5bcbb ("pds_vdpa: Add new vDPA driver for AMD/Pensando DSC")
Signed-off-by: Shannon Nelson 
---
 drivers/vdpa/pds/Makefile | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/vdpa/pds/Makefile b/drivers/vdpa/pds/Makefile
index 2e22418e3ab3..c2d314d4614d 100644
--- a/drivers/vdpa/pds/Makefile
+++ b/drivers/vdpa/pds/Makefile
@@ -5,6 +5,5 @@ obj-$(CONFIG_PDS_VDPA) := pds_vdpa.o
 
 pds_vdpa-y := aux_drv.o \
  cmds.o \
+ debugfs.o \
  vdpa_dev.o
-
-pds_vdpa-$(CONFIG_DEBUG_FS) += debugfs.o
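
The reason the CONFIG_DEBUG_FS guard is unnecessary is that
<linux/debugfs.h> falls back to inline stubs when debugfs is not
configured, roughly along these lines (an abridged sketch of the header's
pattern, not the literal kernel code):

  #ifdef CONFIG_DEBUG_FS

  struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
  void debugfs_remove(struct dentry *dentry);

  #else /* !CONFIG_DEBUG_FS */

  static inline struct dentry *debugfs_create_dir(const char *name,
                                                  struct dentry *parent)
  {
          /* callers treat the error pointer as "no dentry" */
          return ERR_PTR(-ENODEV);
  }

  static inline void debugfs_remove(struct dentry *dentry)
  {
  }

  #endif

So debugfs.o can always be compiled and linked, and the driver's calls into
the debugfs core collapse to no-ops when CONFIG_DEBUG_FS is disabled.
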
-- 
2.17.1



Re: linux-next: Tree for Jul 6 [drivers/vdpa/pds/pds_vdpa.ko]

2023-07-06 Thread Shannon Nelson via Virtualization

On 7/6/23 2:07 PM, Randy Dunlap wrote:


On 7/5/23 18:57, Stephen Rothwell wrote:

Hi all,

Please do *not* add any v6.6 related stuff to your linux-next included
branches until after v6.5-rc1 has been released.

Changes since 20230705:



On loongarch, when DEBUG_FS is not set:

ERROR: modpost: "pds_vdpa_debugfs_del_vdpadev" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!
ERROR: modpost: "pds_vdpa_debugfs_add_vdpadev" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!
ERROR: modpost: "pds_vdpa_debugfs_reset_vdpadev" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!
ERROR: modpost: "pds_vdpa_debugfs_create" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!
ERROR: modpost: "pds_vdpa_debugfs_add_ident" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!
ERROR: modpost: "pds_vdpa_debugfs_destroy" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!
ERROR: modpost: "pds_vdpa_debugfs_add_pcidev" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!


Yikes - sorry, will follow up shortly.
sln






Full randconfig file is attached.

--
~Randy



Re: linux-next: Tree for Jul 6 [drivers/vdpa/pds/pds_vdpa.ko]

2023-07-06 Thread Randy Dunlap


On 7/5/23 18:57, Stephen Rothwell wrote:
> Hi all,
> 
> Please do *not* add any v6.6 related stuff to your linux-next included
> branches until after v6.5-rc1 has been released.
> 
> Changes since 20230705:
> 

On loongarch, when DEBUG_FS is not set:

ERROR: modpost: "pds_vdpa_debugfs_del_vdpadev" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!
ERROR: modpost: "pds_vdpa_debugfs_add_vdpadev" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!
ERROR: modpost: "pds_vdpa_debugfs_reset_vdpadev" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!
ERROR: modpost: "pds_vdpa_debugfs_create" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!
ERROR: modpost: "pds_vdpa_debugfs_add_ident" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!
ERROR: modpost: "pds_vdpa_debugfs_destroy" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!
ERROR: modpost: "pds_vdpa_debugfs_add_pcidev" [drivers/vdpa/pds/pds_vdpa.ko] 
undefined!



Full randconfig file is attached.

-- 
~Randy

config-r2589.gz
Description: application/gzip

Re: [RFC PATCH v5 00/17] vsock: MSG_ZEROCOPY flag support

2023-07-06 Thread Stefano Garzarella
On Sat, Jul 01, 2023 at 09:39:30AM +0300, Arseniy Krasnov wrote:
>Hello,
>
>   DESCRIPTION
>
>this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>current implementation for TCP as much as possible:
>
>1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
>   flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
>   flag will be ignored (e.g. without completion).
>
>2) The kernel uses completions from the socket's error queue. There is a
>   single completion per tx syscall (or several completions may be merged
>   into one). I reused the already implemented logic for MSG_ZEROCOPY
>   support: 'msg_zerocopy_realloc()' etc. (A minimal userspace sketch of
>   this flow follows below.)
>
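
The sketch referenced in 2) above: the flow mirrors the existing TCP
MSG_ZEROCOPY usage. Error handling and most of the completion parsing are
omitted, and the function and variable names are only illustrative:

  #include <string.h>
  #include <sys/socket.h>
  #include <linux/errqueue.h>

  static void zc_send_and_reap(int fd, const void *buf, size_t len)
  {
          int one = 1;
          char control[128];
          struct msghdr msg;

          /* 1) opt in per socket */
          setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

          /* 2) transmit with the flag; user pages stay pinned until the
           * completion arrives
           */
          send(fd, buf, len, MSG_ZEROCOPY);

          /* reap the completion from the socket's error queue */
          memset(&msg, 0, sizeof(msg));
          msg.msg_control = control;
          msg.msg_controllen = sizeof(control);

          if (recvmsg(fd, &msg, MSG_ERRQUEUE) != -1) {
                  struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
                  struct sock_extended_err *serr;

                  serr = (struct sock_extended_err *)CMSG_DATA(cm);
                  /* serr->ee_info..serr->ee_data identify the completed
                   * sends; serr->ee_code reports whether the kernel fell
                   * back to copying.
                   */
          }
  }
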
>The difference from the copy path is not significant. During packet
>allocation, a non-linear skb is created and filled with pinned user pages.
>There are also some updates for the vhost and guest parts of the transport -
>in both cases I've added handling of non-linear skbs for the virtio part.
>vhost copies data from such an skb to the guest's rx virtio buffers. In the
>guest, the virtio transport fills the tx virtio queue with pages from the skb.
>
>Head of this patchset is:
>https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
>
>
>This version has several limits/problems (all resolved at v5):
>
>1) As this feature totally depends on the transport, there is no way (or
>   it is difficult) to check whether the transport is able to handle it
>   during the SO_ZEROCOPY setting. It seems I need to call the AF_VSOCK
>   specific setsockopt callback from the setsockopt callback for
>   SOL_SOCKET, but this leads to a locking problem, because the AF_VSOCK
>   and SOL_SOCKET callbacks are not designed to be called from each
>   other. So in the current version SO_ZEROCOPY is set successfully on
>   any type (e.g. transport) of AF_VSOCK socket, but if the transport
>   does not support MSG_ZEROCOPY, the tx routine will fail with
>   EOPNOTSUPP.
>
>   ^^^ fixed in v5. Thanks to Bobby Eshleman.
>
>2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>   one completion. In each completion there is a flag which shows how the
>   tx was performed: zerocopy or copy. This means the whole message must
>   be sent either in zerocopy or in copy mode - we can't send part of a
>   message with copying and the rest of the message in zerocopy mode (or
>   vice versa). We also need to account for the vsock credit logic, i.e.
>   we can't send all of the data at once - only the allowed number of
>   bytes may be sent at any moment. In the copy case there is no problem,
>   as in the worst case we can send single bytes, but zerocopy is more
>   complex because the smallest transmission unit is a single page. So if
>   there is not enough space at the peer's side to send an integer number
>   of pages (at least one), we would wait, thus stalling the tx side. To
>   overcome this problem I've added a simple rule - zerocopy is possible
>   only when there is enough space at the other side for the whole message
>   (to check that the current 'msghdr' was already used in previous tx
>   iterations I use the 'iov_offset' field of its iov iter).
>
>   ^^^
>   Discussed as ok during v2. Link:
>   
> https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/
>
>3) The loopback transport is not supported, because it requires
>   implementing non-linear skb handling in the dequeue logic (as we
>   "send" a fragged skb and "receive" it from the same queue). I'm going
>   to implement it in the next versions.
>
>   ^^^ fixed in v2
>
>4) The current implementation sets the max length of a packet to 64KB.
>   IIUC this is due to the 'kmalloc()' allocated data buffers. I think,
>   in case of MSG_ZEROCOPY, this value could be increased, because
>   'kmalloc()' is not touched for data - user space pages are used as
>   buffers. Also this limit trims every message which is > 64KB, so such
>   messages will be sent in copy mode due to the 'iov_offset' check in 2).
>
>   ^^^ fixed in v2
>
> PATCHSET STRUCTURE
>
>Patchset has the following structure:
>1) Handle non-linear skbuff on receive in virtio/vhost.
>2) Handle non-linear skbuff on send in virtio/vhost.
>3) Updates for AF_VSOCK.
>4) Enable MSG_ZEROCOPY support on transports.
>5) Tests/tools/docs updates.
>
>PERFORMANCE
>
>Performance: it is a little bit tricky to compare performance between
>copy and zerocopy transmissions. In the zerocopy case we need to wait
>until the user buffers are released by the kernel, so it behaves like a
>synchronous path (wait until the device driver has processed it), while
>in the copy case we can feed data to the kernel as fast as we want,
>without caring about the device driver. So I compared only the time
>spent in the 'send()' syscall. If this value is then combined with the
>total number of transmitted bytes, we get a Gbit/s figure. Also, to avoid
>tx stalls due to insufficient credit, the receiver allocates the same
>amount of space as the sender needs.
>
>Sender:
>./vsock_perf --sender  --buf-size  --bytes 256M [--zc]
>

Re: [RFC PATCH v5 17/17] test/vsock: io_uring rx/tx tests

2023-07-06 Thread Stefano Garzarella
On Sat, Jul 01, 2023 at 09:39:47AM +0300, Arseniy Krasnov wrote:
>This adds a set of tests which use io_uring for rx/tx. This test suite is
>implemented as a separate utility like 'vsock_test' and has the same set of
>input arguments as 'vsock_test'. These tests only cover cases of data
>transmission (no connect/bind/accept etc).

Cool, thanks for adding this!

>
>Signed-off-by: Arseniy Krasnov 
>---
> tools/testing/vsock/Makefile   |   7 +-
> tools/testing/vsock/vsock_uring_test.c | 321 +
> 2 files changed, 327 insertions(+), 1 deletion(-)
> create mode 100644 tools/testing/vsock/vsock_uring_test.c
>
>diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
>index 0a78787d1d92..8621ae73051d 100644
>--- a/tools/testing/vsock/Makefile
>+++ b/tools/testing/vsock/Makefile
>@@ -1,12 +1,17 @@
> # SPDX-License-Identifier: GPL-2.0-only
>+ifeq ($(MAKECMDGOALS),vsock_uring_test)
>+LDFLAGS = -luring
>+endif
>+
> all: test vsock_perf
> test: vsock_test vsock_diag_test
> vsock_test: vsock_test.o vsock_test_zerocopy.o timeout.o control.o util.o
> vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
> vsock_perf: vsock_perf.o
>+vsock_uring_test: control.o util.o vsock_uring_test.o timeout.o $(LDFLAGS)

Why do we need `$(LDFLAGS)` in the dependencies?


>
> CFLAGS += -g -O2 -Werror -Wall -I. -I../../include
> -I../../../usr/include -Wno-pointer-sign -fno-strict-overflow
> -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -D_GNU_SOURCE
> .PHONY: all test clean
> clean:
>-  ${RM} *.o *.d vsock_test vsock_diag_test
>+  ${RM} *.o *.d vsock_test vsock_diag_test vsock_uring_test
> -include *.d
>diff --git a/tools/testing/vsock/vsock_uring_test.c 
>b/tools/testing/vsock/vsock_uring_test.c
>new file mode 100644
>index ..7637ff510490
>--- /dev/null
>+++ b/tools/testing/vsock/vsock_uring_test.c
>@@ -0,0 +1,321 @@
>+// SPDX-License-Identifier: GPL-2.0-only
>+/* io_uring tests for vsock
>+ *
>+ * Copyright (C) 2023 SberDevices.
>+ *
>+ * Author: Arseniy Krasnov 
>+ */
>+
>+#include 
>+#include 
>+#include 
>+#include 
>+#include 
>+#include 
>+#include 
>+#include 
>+#include 
>+
>+#include "util.h"
>+#include "control.h"
>+
>+#define PAGE_SIZE 4096
>+#define RING_ENTRIES_NUM  4
>+
>+static struct vsock_test_data test_data_array[] = {
>+  /* All elements have page aligned base and size. */
>+  {
>+  .vecs_cnt = 3,
>+  {
>+  { NULL, PAGE_SIZE },
>+  { NULL, 2 * PAGE_SIZE },
>+  { NULL, 3 * PAGE_SIZE },
>+  }
>+  },
>+  /* Middle element has both non-page aligned base and size. */
>+  {
>+  .vecs_cnt = 3,
>+  {
>+  { NULL, PAGE_SIZE },
>+  { (void *)1, 200  },
>+  { NULL, 3 * PAGE_SIZE },
>+  }
>+  }
>+};
>+
>+static void vsock_io_uring_client(const struct test_opts *opts,
>+const struct vsock_test_data *test_data,
>+bool msg_zerocopy)
>+{
>+  struct io_uring_sqe *sqe;
>+  struct io_uring_cqe *cqe;
>+  struct io_uring ring;
>+  struct iovec *iovec;
>+  struct msghdr msg;
>+  int fd;
>+
>+  fd = vsock_stream_connect(opts->peer_cid, 1234);
>+  if (fd < 0) {
>+  perror("connect");
>+  exit(EXIT_FAILURE);
>+  }
>+
>+  if (msg_zerocopy)
>+  enable_so_zerocopy(fd);
>+
>+  iovec = iovec_from_test_data(test_data);
>+
>+  if (io_uring_queue_init(RING_ENTRIES_NUM, &ring, 0))
>+  error(1, errno, "io_uring_queue_init");
>+
>+  if (io_uring_register_buffers(&ring, iovec, test_data->vecs_cnt))
>+  error(1, errno, "io_uring_register_buffers");
>+
>+  memset(&msg, 0, sizeof(msg));
>+  msg.msg_iov = iovec;
>+  msg.msg_iovlen = test_data->vecs_cnt;
>+  sqe = io_uring_get_sqe(&ring);
>+
>+  if (msg_zerocopy)
>+  io_uring_prep_sendmsg_zc(sqe, fd, &msg, 0);
>+  else
>+  io_uring_prep_sendmsg(sqe, fd, &msg, 0);
>+
>+  if (io_uring_submit(&ring) != 1)
>+  error(1, errno, "io_uring_submit");
>+
>+  if (io_uring_wait_cqe(&ring, &cqe))
>+  error(1, errno, "io_uring_wait_cqe");
>+
>+  io_uring_cqe_seen(&ring, cqe);
>+
>+  control_writeulong(iovec_hash_djb2(iovec, test_data->vecs_cnt));
>+
>+  control_writeln("DONE");
>+  io_uring_queue_exit(&ring);
>+  free_iovec_test_data(test_data, iovec);
>+  close(fd);
>+}
>+
>+static void vsock_io_uring_server(const struct test_opts *opts,
>+const struct vsock_test_data *test_data)
>+{
>+  unsigned long remote_hash;
>+  unsigned long local_hash;
>+  struct io_uring_sqe *sqe;
>+  struct io_uring_cqe *cqe;
>+  struct io_uring ring;
>+  struct iovec iovec;
>+  size_t data_len;
>+  void *data;
>+  int fd;
>+
>+  fd = 

Re: [RFC PATCH v5 14/17] docs: net: description of MSG_ZEROCOPY for AF_VSOCK

2023-07-06 Thread Stefano Garzarella
On Sat, Jul 01, 2023 at 09:39:44AM +0300, Arseniy Krasnov wrote:
>This adds description of MSG_ZEROCOPY flag support for AF_VSOCK type of
>socket.
>
>Signed-off-by: Arseniy Krasnov 
>---
> Documentation/networking/msg_zerocopy.rst | 12 ++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
>diff --git a/Documentation/networking/msg_zerocopy.rst 
>b/Documentation/networking/msg_zerocopy.rst
>index b3ea96af9b49..34bc7ff411ce 100644
>--- a/Documentation/networking/msg_zerocopy.rst
>+++ b/Documentation/networking/msg_zerocopy.rst
>@@ -7,7 +7,8 @@ Intro
> =
>
> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
>-The feature is currently implemented for TCP and UDP sockets.
>+The feature is currently implemented for TCP, UDP and VSOCK (with
>+virtio transport) sockets.
>
>
> Opportunity and Caveats
>@@ -174,7 +175,7 @@ read_notification() call in the previous snippet. A 
>notification
> is encoded in the standard error format, sock_extended_err.
>
> The level and type fields in the control data are protocol family
>-specific, IP_RECVERR or IPV6_RECVERR.
>+specific, IP_RECVERR or IPV6_RECVERR (for TCP or UDP socket).
>
> Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
> as explained before, to avoid blocking read and write system calls on
>@@ -201,6 +202,7 @@ undefined, bar for ee_code, as discussed below.
>
>   printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
>
>+For VSOCK socket, cmsg_level will be SOL_VSOCK and cmsg_type will be 0.

Maybe better to move up, just under the previous change.

By the way, should we define a valid type value for vsock
(e.g. VSOCK_RECVERR)?
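
For illustration, with the values in this patch the read_notification()
logic mentioned in the document has to match on the raw pair, roughly
(a sketch; VSOCK_RECVERR is only the hypothetical name being discussed,
not an existing define):

  struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

  /* the completion arrives as cmsg_level SOL_VSOCK, cmsg_type 0; a named
   * constant such as VSOCK_RECVERR would make this self-documenting
   */
  if (cm && cm->cmsg_level == SOL_VSOCK && cm->cmsg_type == 0) {
          struct sock_extended_err *serr =
                  (struct sock_extended_err *)CMSG_DATA(cm);

          printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
  }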

>
> Deferred copies
> ~~~
>@@ -235,12 +237,15 @@ Implementation
> Loopback
> 
>
>+For TCP and UDP:
> Data sent to local sockets can be queued indefinitely if the receive
> process does not read its socket. Unbound notification latency is not
> acceptable. For this reason all packets generated with MSG_ZEROCOPY
> that are looped to a local socket will incur a deferred copy. This
> includes looping onto packet sockets (e.g., tcpdump) and tun devices.
>
>+For VSOCK:
>+Data path sent to local sockets is the same as for non-local sockets.
>
> Testing
> ===
>@@ -254,3 +259,6 @@ instance when run with msg_zerocopy.sh between a veth pair 
>across
> namespaces, the test will not show any improvement. For testing, the
> loopback restriction can be temporarily relaxed by making
> skb_orphan_frags_rx identical to skb_orphan_frags.
>+
>+For VSOCK type of socket example can be found in  tools/testing/vsock/
>+vsock_test_zerocopy.c.

For VSOCK socket, example can be found in
tools/testing/vsock/vsock_test_zerocopy.c

(we should leave the entire path on the same line)

>--
>2.25.1
>



Re: [RFC PATCH v5 13/17] vsock: enable setting SO_ZEROCOPY

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:43AM +0300, Arseniy Krasnov wrote:

For AF_VSOCK, zerocopy tx mode depends on the transport, so this option
must be handled in the AF_VSOCK implementation, where the transport is
accessible. (If the transport is not yet set when SO_ZEROCOPY is set -
for example, the socket is not connected - then SO_ZEROCOPY will be
enabled, but once the transport is assigned, support for this type of
transmission will be checked.)

To handle SO_ZEROCOPY, the AF_VSOCK implementation uses the
SOCK_CUSTOM_SOCKOPT bit, thus handling SOL_SOCKET option operations, but
all of them except SO_ZEROCOPY will be forwarded to the generic handler
by calling 'sock_setsockopt()'.

Signed-off-by: Arseniy Krasnov 
---
Changelog:
v4 -> v5:
 * This patch is totally reworked. Previous version added check for
   PF_VSOCK directly to 'net/core/sock.c', thus allowing to set
   SO_ZEROCOPY for AF_VSOCK type of socket. This new version catches
   attempt to set SO_ZEROCOPY in 'af_vsock.c'. All other options
   except SO_ZEROCOPY are forwarded to generic handler. Only this
   option is processed in 'af_vsock.c'. Handling this option includes
   access to transport to check that MSG_ZEROCOPY transmission is
   supported by the current transport (if it is set, if not - transport
   will be checked during 'connect()').


Yeah, great, this is much better!
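
For reference, the reason setting SOCK_CUSTOM_SOCKOPT routes SOL_SOCKET
options into this function is the dispatch in net/socket.c, which looks
roughly like this (simplified sketch of the core code, not part of this
patch):

  /* __sys_setsockopt(): once SOCK_CUSTOM_SOCKOPT is set on the socket,
   * SOL_SOCKET options are no longer handled by the generic
   * sock_setsockopt() but fall through to the protocol's ->setsockopt(),
   * i.e. vsock_connectible_setsockopt() in this series.
   */
  if (level == SOL_SOCKET && !sock_use_custom_sol_socket(sock))
          err = sock_setsockopt(sock, level, optname, optval, optlen);
  else
          err = sock->ops->setsockopt(sock, level, optname, optval, optlen);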



net/vmw_vsock/af_vsock.c | 44 ++--
1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index da22ae0ef477..8acc77981d01 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1406,8 +1406,18 @@ static int vsock_connect(struct socket *sock, struct 
sockaddr *addr,
goto out;
}

-   if (vsock_msgzerocopy_allow(transport))
+   if (!vsock_msgzerocopy_allow(transport)) {


Can you leave `if (vsock_msgzerocopy_allow(transport))` and just add
the else branch with this new check?

if (vsock_msgzerocopy_allow(transport)) {
...
} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
...
}


+   /* If this option was set before 'connect()',
+* when transport was unknown, check that this
+* feature is supported here.
+*/
+   if (sock_flag(sk, SOCK_ZEROCOPY)) {
+   err = -EOPNOTSUPP;
+   goto out;
+   }
+   } else {
set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
+   }

err = vsock_auto_bind(vsk);
if (err)
@@ -1643,7 +1653,7 @@ static int vsock_connectible_setsockopt(struct socket 
*sock,
const struct vsock_transport *transport;
u64 val;

-   if (level != AF_VSOCK)
+   if (level != AF_VSOCK && level != SOL_SOCKET)
return -ENOPROTOOPT;

#define COPY_IN(_v)   \
@@ -1666,6 +1676,34 @@ static int vsock_connectible_setsockopt(struct socket 
*sock,

transport = vsk->transport;

+   if (level == SOL_SOCKET) {


We could reduce the indentation here:
if (optname != SO_ZEROCOPY) {
release_sock(sk);
return sock_setsockopt(sock, level, optname, optval, 
optlen);
}

Then remove the next indentation.


+   if (optname == SO_ZEROCOPY) {
+   int zc_val;


`zerocopy` is more readable.

+
+   /* Use 'int' type here, because variable to
+* set this option usually has this type.
+*/
+   COPY_IN(zc_val);
+
+   if (zc_val < 0 || zc_val > 1) {
+   err = -EINVAL;
+   goto exit;
+   }
+
+   if (transport && !vsock_msgzerocopy_allow(transport)) {
+   err = -EOPNOTSUPP;
+   goto exit;
+   }
+
+   sock_valbool_flag(sk, SOCK_ZEROCOPY,
+ zc_val ? true : false);


Why not using directly `zc_val`?
The 3rd param of sock_valbool_flag() is an int.


+   goto exit;
+   }
+
+   release_sock(sk);
+   return sock_setsockopt(sock, level, optname, optval, optlen);
+   }
+
switch (optname) {
case SO_VM_SOCKETS_BUFFER_SIZE:
COPY_IN(val);
@@ -2321,6 +2359,8 @@ static int vsock_create(struct net *net, struct socket 
*sock,
}
}

+   set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
+
vsock_insert_unbound(vsk);

return 0;
--
2.25.1




Re: [RFC PATCH v5 12/17] vsock/loopback: support MSG_ZEROCOPY for transport

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:42AM +0300, Arseniy Krasnov wrote:

Add 'msgzerocopy_allow()' callback for loopback transport.

Signed-off-by: Arseniy Krasnov 
---
Changelog:
v4 -> v5:
 * Move 'msgzerocopy_allow' right after seqpacket callbacks.
 * Don't use prototype for 'vsock_loopback_msgzerocopy_allow()'.

net/vmw_vsock/vsock_loopback.c | 6 ++
1 file changed, 6 insertions(+)


Reviewed-by: Stefano Garzarella 



diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index 5c6360df1f31..048640167411 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -47,6 +47,10 @@ static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
}

static bool vsock_loopback_seqpacket_allow(u32 remote_cid);
+static bool vsock_loopback_msgzerocopy_allow(void)
+{
+   return true;
+}

static struct virtio_transport loopback_transport = {
.transport = {
@@ -79,6 +83,8 @@ static struct virtio_transport loopback_transport = {
.seqpacket_allow  = vsock_loopback_seqpacket_allow,
.seqpacket_has_data   = virtio_transport_seqpacket_has_data,

+   .msgzerocopy_allow= vsock_loopback_msgzerocopy_allow,
+
.notify_poll_in   = virtio_transport_notify_poll_in,
.notify_poll_out  = virtio_transport_notify_poll_out,
.notify_recv_init = virtio_transport_notify_recv_init,
--
2.25.1





Re: [RFC PATCH v5 11/17] vsock/virtio: support MSG_ZEROCOPY for transport

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:41AM +0300, Arseniy Krasnov wrote:

Add 'msgzerocopy_allow()' callback for virtio transport.

Signed-off-by: Arseniy Krasnov 
---
Changelog:
v4 -> v5:
 * Move 'msgzerocopy_allow' right after seqpacket callbacks.

net/vmw_vsock/virtio_transport.c | 7 +++
1 file changed, 7 insertions(+)


Reviewed-by: Stefano Garzarella 



diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 6cbb45bb12d2..8d3e9f441fa1 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -441,6 +441,11 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
queue_work(virtio_vsock_workqueue, &vsock->rx_work);
}

+static bool virtio_transport_msgzerocopy_allow(void)
+{
+   return true;
+}
+
static bool virtio_transport_seqpacket_allow(u32 remote_cid);

static struct virtio_transport virtio_transport = {
@@ -474,6 +479,8 @@ static struct virtio_transport virtio_transport = {
.seqpacket_allow  = virtio_transport_seqpacket_allow,
.seqpacket_has_data   = virtio_transport_seqpacket_has_data,

+   .msgzerocopy_allow= virtio_transport_msgzerocopy_allow,
+
.notify_poll_in   = virtio_transport_notify_poll_in,
.notify_poll_out  = virtio_transport_notify_poll_out,
.notify_recv_init = virtio_transport_notify_recv_init,
--
2.25.1





Re: [RFC PATCH v5 09/17] vsock: enable SOCK_SUPPORT_ZC bit

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:39AM +0300, Arseniy Krasnov wrote:

This bit is used by io_uring in case of zerocopy tx mode. The io_uring code
checks that the socket has this feature. This patch sets it in two places:
1) For socket in 'connect()' call.
2) For new socket which is returned by 'accept()' call.

Signed-off-by: Arseniy Krasnov 
---
net/vmw_vsock/af_vsock.c | 6 ++
1 file changed, 6 insertions(+)


Reviewed-by: Stefano Garzarella 



diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 033006e1b5ad..da22ae0ef477 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1406,6 +1406,9 @@ static int vsock_connect(struct socket *sock, struct 
sockaddr *addr,
goto out;
}

+   if (vsock_msgzerocopy_allow(transport))
+   set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
+
err = vsock_auto_bind(vsk);
if (err)
goto out;
@@ -1560,6 +1563,9 @@ static int vsock_accept(struct socket *sock, struct 
socket *newsock, int flags,
} else {
newsock->state = SS_CONNECTED;
sock_graft(connected, newsock);
+   if (vsock_msgzerocopy_allow(vconnected->transport))
+   set_bit(SOCK_SUPPORT_ZC,
+   &connected->sk_socket->flags);
}

release_sock(connected);
--
2.25.1





Re: [RFC PATCH v5 10/17] vhost/vsock: support MSG_ZEROCOPY for transport

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:40AM +0300, Arseniy Krasnov wrote:

Add 'msgzerocopy_allow()' callback for vhost transport.

Signed-off-by: Arseniy Krasnov 
---
Changelog:
v4 -> v5:
 * Move 'msgzerocopy_allow' right after seqpacket callbacks.

drivers/vhost/vsock.c | 7 +++
1 file changed, 7 insertions(+)


Reviewed-by: Stefano Garzarella 



diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index cb00e0e059e4..3fd0ab0c0edc 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -398,6 +398,11 @@ static bool vhost_vsock_more_replies(struct vhost_vsock 
*vsock)
return val < vq->num;
}

+static bool vhost_transport_msgzerocopy_allow(void)
+{
+   return true;
+}
+
static bool vhost_transport_seqpacket_allow(u32 remote_cid);

static struct virtio_transport vhost_transport = {
@@ -431,6 +436,8 @@ static struct virtio_transport vhost_transport = {
.seqpacket_allow  = vhost_transport_seqpacket_allow,
.seqpacket_has_data   = virtio_transport_seqpacket_has_data,

+   .msgzerocopy_allow= vhost_transport_msgzerocopy_allow,
+
.notify_poll_in   = virtio_transport_notify_poll_in,
.notify_poll_out  = virtio_transport_notify_poll_out,
.notify_recv_init = virtio_transport_notify_recv_init,
--
2.25.1





Re: [RFC PATCH v5 08/17] vsock: check for MSG_ZEROCOPY support on send

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:38AM +0300, Arseniy Krasnov wrote:

This feature totally depends on transport, so if transport doesn't
support it, return error.

Signed-off-by: Arseniy Krasnov 
---
include/net/af_vsock.h   | 7 +++
net/vmw_vsock/af_vsock.c | 6 ++
2 files changed, 13 insertions(+)


Reviewed-by: Stefano Garzarella 



diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 0e7504a42925..ec09edc5f3a0 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -177,6 +177,9 @@ struct vsock_transport {

/* Read a single skb */
int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
+
+   /* Zero-copy. */
+   bool (*msgzerocopy_allow)(void);
};

/ CORE /
@@ -243,4 +246,8 @@ static inline void __init vsock_bpf_build_proto(void)
{}
#endif

+static inline bool vsock_msgzerocopy_allow(const struct vsock_transport *t)
+{
+   return t->msgzerocopy_allow && t->msgzerocopy_allow();
+}
#endif /* __AF_VSOCK_H__ */
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 07803d9fbf6d..033006e1b5ad 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1824,6 +1824,12 @@ static int vsock_connectible_sendmsg(struct socket 
*sock, struct msghdr *msg,
goto out;
}

+   if (msg->msg_flags & MSG_ZEROCOPY &&
+   !vsock_msgzerocopy_allow(transport)) {
+   err = -EOPNOTSUPP;
+   goto out;
+   }
+
/* Wait for room in the produce queue to enqueue our user's data. */
timeout = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);

--
2.25.1





Re: [RFC PATCH v5 07/17] vsock: read from socket's error queue

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:37AM +0300, Arseniy Krasnov wrote:

This adds handling of MSG_ERRQUEUE input flag in receive call. This flag
is used to read socket's error queue instead of data queue. Possible
scenario of error queue usage is receiving completions for transmission
with MSG_ZEROCOPY flag. This patch also adds 'SOL_VSOCK' define.

Signed-off-by: Arseniy Krasnov 
---
Changelog:
v4 -> v5:
 * Update commit message by adding sentence that 'SOL_VSOCK' is also
   added.


Reviewed-by: Stefano Garzarella 



include/linux/socket.h   | 1 +
net/vmw_vsock/af_vsock.c | 5 +
2 files changed, 6 insertions(+)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index bd1cc3238851..d79efd026880 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -382,6 +382,7 @@ struct ucred {
#define SOL_MPTCP   284
#define SOL_MCTP285
#define SOL_SMC 286
+#define SOL_VSOCK  287

/* IPX options */
#define IPX_TYPE1
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 45fd20c4ed50..07803d9fbf6d 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -110,6 +110,7 @@
#include 
#include 
#include 
+#include 

static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr);
static void vsock_sk_destruct(struct sock *sk);
@@ -2135,6 +2136,10 @@ vsock_connectible_recvmsg(struct socket *sock, struct 
msghdr *msg, size_t len,
int err;

sk = sock->sk;
+
+   if (unlikely(flags & MSG_ERRQUEUE))
+   return sock_recv_errqueue(sk, msg, len, SOL_VSOCK, 0);
+
vsk = vsock_sk(sk);
err = 0;

--
2.25.1





Re: [RFC PATCH v5 06/17] vsock: fix EPOLLERR set on non-empty error queue

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:36AM +0300, Arseniy Krasnov wrote:

If socket's error queue is not empty, EPOLLERR must be set. Otherwise,
reader of error queue won't detect data in it using EPOLLERR bit.
Currently for AF_VSOCK this is reproducible only with MSG_ZEROCOPY, as
this feature is the only user of an error queue of the socket.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")


Sorry if I confused you, but if without MSG_ZEROCOPY this is not an
issue, then we can remove the Fixes tag.


Signed-off-by: Arseniy Krasnov 
---
Changelog:
v4 -> v5:
 * Change commit message as Fix patch. Also add details that this
   problem could be reproduced only with MSG_ZEROCOPY transmission
   mode.

net/vmw_vsock/af_vsock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index efb8a0937a13..45fd20c4ed50 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1030,7 +1030,7 @@ static __poll_t vsock_poll(struct file *file, struct 
socket *sock,
poll_wait(file, sk_sleep(sk), wait);
mask = 0;

-   if (sk->sk_err)
+   if (sk->sk_err || !skb_queue_empty_lockless(&sk->sk_error_queue))
/* Signify that there has been an error on this socket. */
mask |= EPOLLERR;

--
2.25.1





Re: [RFC PATCH v5 05/17] vsock/virtio: MSG_ZEROCOPY flag support

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:35AM +0300, Arseniy Krasnov wrote:

This adds handling of the MSG_ZEROCOPY flag on the transmission path: if
this flag is set and zerocopy transmission is possible, then a non-linear
skb will be created and filled with the pages of the user's buffer. The
pages of the user's buffer are locked in memory by 'get_user_pages()'.
The second thing this patch does is change how the skb is owned: instead
of calling 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. The
reason for this change is that '__zerocopy_sg_from_iter()' increments
'sk_wmem_alloc' of the socket, so to decrease this field correctly a
proper skb destructor is needed: 'sock_wfree()'. This destructor is set
by 'skb_set_owner_w()'.

Signed-off-by: Arseniy Krasnov 
---
Changelog:
v4 -> v5:
 * Remove unused 'vsk' arg from 'virtio_transport_fill_linear_skb()'.
 * Remove old comment 'Returns a new packet on success...'.
 * Commit message update by adding details why this patch uses
   'skb_set_owner_w()' instead of previous 'skb_set_owner_sk_safe()'.
 * 'hdr' variable declaration and assignment in a single line.
 * Rename 'max_skb_cap' to 'max_skb_len'.
 * Use 'info->msg->msg_flags' to check MSG_ZEROCOPY flag  instead of
   field 'flags' of struct 'virtio_vsock_pkt_info'. This was a bug.

net/vmw_vsock/virtio_transport_common.c | 262 ++--
1 file changed, 200 insertions(+), 62 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index dfc48b56d0a2..2269530ea737 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -37,27 +37,99 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
return container_of(t, struct virtio_transport, transport);
}

-/* Returns a new packet on success, otherwise returns NULL.
- *
- * If NULL is returned, errp is set to a negative errno.
- */
-static struct sk_buff *
-virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
-  size_t len,
-  u32 src_cid,
-  u32 src_port,
-  u32 dst_cid,
-  u32 dst_port)
-{
-   const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
-   struct virtio_vsock_hdr *hdr;
-   struct sk_buff *skb;
+static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
+  size_t max_to_send)
+{
+   struct iov_iter *iov_iter;
+
+   if (!info->msg)
+   return false;
+
+   iov_iter = &info->msg->msg_iter;
+
+   /* Data is simple buffer. */
+   if (iter_is_ubuf(iov_iter))
+   return true;
+
+   if (!iter_is_iovec(iov_iter))
+   return false;
+
+   if (iov_iter->iov_offset)
+   return false;
+
+   /* We can't send whole iov. */
+   if (iov_iter->count > max_to_send)
+   return false;
+
+   return true;
+}
+
+static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
+  struct sk_buff *skb,
+  struct msghdr *msg,
+  bool zerocopy)
+{
+   struct ubuf_info *uarg;
+
+   if (msg->msg_ubuf) {
+   uarg = msg->msg_ubuf;
+   net_zcopy_get(uarg);
+   } else {
+   struct iov_iter *iter = &msg->msg_iter;
+   struct ubuf_info_msgzc *uarg_zc;
+   int len;
+
+   /* Only ITER_IOVEC or ITER_UBUF are allowed and
+* checked before.
+*/
+   if (iter_is_iovec(iter))
+   len = iov_length(iter->__iov, iter->nr_segs);
+   else
+   len = iter->count;
+
+   uarg = msg_zerocopy_realloc(sk_vsock(vsk),
+   len,
+   NULL);
+


We can remove this extra blank line.


+   if (!uarg)
+   return -1;
+
+   uarg_zc = uarg_to_msgzc(uarg);
+   uarg_zc->zerocopy = zerocopy ? 1 : 0;
+   }
+
+   skb_zcopy_init(skb, uarg);
+
+   return 0;
+}
+
+static int virtio_transport_fill_linear_skb(struct sk_buff *skb,
+   struct virtio_vsock_pkt_info *info,
+   size_t len)
+{
void *payload;
int err;

-   skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
-   if (!skb)
-   return NULL;
+   payload = skb_put(skb, len);
+   err = memcpy_from_msg(payload, info->msg, len);
+   if (err)
+   return -1;
+
+   if (msg_data_left(info->msg))
+   return 0;
+
+   return 0;
+}
+
+static void virtio_transport_init_hdr(struct sk_buff *skb,
+ struct virtio_vsock_pkt_info *info,
+ u32 src_cid,
+  

Re: [RFC PATCH v5 04/17] vsock/virtio: non-linear skb handling for tap

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:34AM +0300, Arseniy Krasnov wrote:

For the tap device a new skb is created and data from the current skb is
copied to it. This adds copying of data from a non-linear skb to the
new skb.

Signed-off-by: Arseniy Krasnov 
---
Changelog:
v4 -> v5:
 * Make 'skb' pointer constant because it is source.

net/vmw_vsock/virtio_transport_common.c | 31 ++---
1 file changed, 28 insertions(+), 3 deletions(-)


Reviewed-by: Stefano Garzarella 



diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index e5683af23e60..dfc48b56d0a2 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -106,6 +106,27 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info 
*info,
return NULL;
}

+static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
+   void *dst,
+   size_t len)
+{
+   struct iov_iter iov_iter = { 0 };
+   struct kvec kvec;
+   size_t to_copy;
+
+   kvec.iov_base = dst;
+   kvec.iov_len = len;
+
+   iov_iter.iter_type = ITER_KVEC;
+   iov_iter.kvec = &kvec;
+   iov_iter.nr_segs = 1;
+
+   to_copy = min_t(size_t, len, skb->len);
+
+   skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+  &iov_iter, to_copy);
+}
+
/* Packet capture */
static struct sk_buff *virtio_transport_build_skb(void *opaque)
{
@@ -114,7 +135,6 @@ static struct sk_buff *virtio_transport_build_skb(void 
*opaque)
struct af_vsockmon_hdr *hdr;
struct sk_buff *skb;
size_t payload_len;
-   void *payload_buf;

/* A packet could be split to fit the RX buffer, so we can retrieve
 * the payload length from the header and the buffer pointer taking
@@ -122,7 +142,6 @@ static struct sk_buff *virtio_transport_build_skb(void 
*opaque)
 */
pkt_hdr = virtio_vsock_hdr(pkt);
payload_len = pkt->len;
-   payload_buf = pkt->data;

skb = alloc_skb(sizeof(*hdr) + sizeof(*pkt_hdr) + payload_len,
GFP_ATOMIC);
@@ -165,7 +184,13 @@ static struct sk_buff *virtio_transport_build_skb(void 
*opaque)
skb_put_data(skb, pkt_hdr, sizeof(*pkt_hdr));

if (payload_len) {
-   skb_put_data(skb, payload_buf, payload_len);
+   if (skb_is_nonlinear(pkt)) {
+   void *data = skb_put(skb, payload_len);
+
+   virtio_transport_copy_nonlinear_skb(pkt, data, 
payload_len);
+   } else {
+   skb_put_data(skb, pkt->data, payload_len);
+   }
}

return skb;
--
2.25.1





Re: [RFC PATCH v5 03/17] vsock/virtio: support to send non-linear skb

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:33AM +0300, Arseniy Krasnov wrote:

For non-linear skb use its pages from fragment array as buffers in
virtio tx queue. These pages are already pinned by 'get_user_pages()'
during such skb creation.

Signed-off-by: Arseniy Krasnov 
---
Changelog:
v4 -> v5:
 * Use 'out_sgs' variable to index 'bufs', not only 'sgs'.
 * Move smaller branch above, see 'if (!skb_is_nonlinear(skb)').
 * Remove blank line.
 * R-b from Bobby Eshleman removed due to patch update.

net/vmw_vsock/virtio_transport.c | 40 +++-
1 file changed, 34 insertions(+), 6 deletions(-)


Reviewed-by: Stefano Garzarella 



diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index e95df847176b..6cbb45bb12d2 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -100,7 +100,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
vq = vsock->vqs[VSOCK_VQ_TX];

for (;;) {
-   struct scatterlist hdr, buf, *sgs[2];
+   /* +1 is for packet header. */
+   struct scatterlist *sgs[MAX_SKB_FRAGS + 1];
+   struct scatterlist bufs[MAX_SKB_FRAGS + 1];
int ret, in_sg = 0, out_sg = 0;
struct sk_buff *skb;
bool reply;
@@ -111,12 +113,38 @@ virtio_transport_send_pkt_work(struct work_struct *work)

virtio_transport_deliver_tap_pkt(skb);
reply = virtio_vsock_skb_reply(skb);
+   sg_init_one(&bufs[out_sg], virtio_vsock_hdr(skb),
+   sizeof(*virtio_vsock_hdr(skb)));
+   sgs[out_sg] = &bufs[out_sg];
+   out_sg++;
+
+   if (!skb_is_nonlinear(skb)) {
+   if (skb->len > 0) {
+   sg_init_one(&bufs[out_sg], skb->data, skb->len);
+   sgs[out_sg] = &bufs[out_sg];
+   out_sg++;
+   }
+   } else {
+   struct skb_shared_info *si;
+   int i;
+
+   si = skb_shinfo(skb);
+
+   for (i = 0; i < si->nr_frags; i++) {
+   skb_frag_t *skb_frag = &si->frags[i];
+   void *va = page_to_virt(skb_frag->bv_page);

-   sg_init_one(&hdr, virtio_vsock_hdr(skb),
sizeof(*virtio_vsock_hdr(skb)));
-   sgs[out_sg++] = &hdr;
-   if (skb->len > 0) {
-   sg_init_one(&buf, skb->data, skb->len);
-   sgs[out_sg++] = &buf;
+   /* We will use 'page_to_virt()' for userspace 
page here,
+* because virtio layer will call 
'virt_to_phys()' later
+* to fill buffer descriptor. We don't touch 
memory at
+* "virtual" address of this page.
+*/
+   sg_init_one(&bufs[out_sg],
+   va + skb_frag->bv_offset,
+   skb_frag->bv_len);
+   sgs[out_sg] = &bufs[out_sg];
+   out_sg++;
+   }
}

ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, 
GFP_KERNEL);
--
2.25.1





Re: [RFC PATCH v5 02/17] vhost/vsock: read data from non-linear skb

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:32AM +0300, Arseniy Krasnov wrote:

This adds copying to the guest's virtio buffers from non-linear skbs. Such
skbs are created by the protocol layer when the MSG_ZEROCOPY flag is used.
It replaces the call of 'copy_to_iter()' with 'skb_copy_datagram_iter()' -
the second function can read data from a non-linear skb. This patch also
uses the field 'frag_off' from the skb control block. This field holds the
current offset at which to read data from the skb, be it linear or not.

Signed-off-by: Arseniy Krasnov 
---
Changelog:
v4 -> v5:
 * Use local variable for 'frag_off'.
 * Update commit message by adding some details about 'frag_off' field.
 * R-b from Bobby Eshleman removed due to patch update.


I think we should merge this patch with the previous one, since
vhost-vsock for example uses virtio_transport_stream_do_dequeue()
that we change in the previous commit, so we will break the bisection.

The patch LGTM!

Stefano



drivers/vhost/vsock.c | 14 +-
1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 6578db78f0ae..cb00e0e059e4 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -114,6 +114,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
struct sk_buff *skb;
unsigned out, in;
size_t nbytes;
+   u32 frag_off;
int head;

skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
@@ -156,7 +157,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
}

iov_iter_init(&iov_iter, ITER_DEST, &vq->iov[out], in, iov_len);
-   payload_len = skb->len;
+   frag_off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
+   payload_len = skb->len - frag_off;
hdr = virtio_vsock_hdr(skb);

/* If the packet is greater than the space available in the
@@ -197,8 +199,10 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}

-   nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
-   if (nbytes != payload_len) {
+   if (skb_copy_datagram_iter(skb,
+  frag_off,
+  &iov_iter,
+  payload_len)) {
kfree_skb(skb);
vq_err(vq, "Faulted on copying pkt buf\n");
break;
@@ -212,13 +216,13 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
added = true;

-   skb_pull(skb, payload_len);
+   VIRTIO_VSOCK_SKB_CB(skb)->frag_off += payload_len;
total_len += payload_len;

/* If we didn't send all the payload we can requeue the packet
 * to send it with the next available buffer.
 */
-   if (skb->len > 0) {
+   if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off < skb->len) {
hdr->flags |= cpu_to_le32(flags_to_restore);

/* We are queueing the same skb to handle
--
2.25.1





Re: [RFC PATCH v5 01/17] vsock/virtio: read data from non-linear skb

2023-07-06 Thread Stefano Garzarella

On Sat, Jul 01, 2023 at 09:39:31AM +0300, Arseniy Krasnov wrote:

This is preparation patch for non-linear skbuff handling. It replaces
direct calls of 'memcpy_to_msg()' with 'skb_copy_datagram_iter()'. Main
advantage of the second one is that is can handle paged part of the skb


s/is that is/is that it/


by using 'kmap()' on each page, but if there are no pages in the skb,
it behaves like simple copying to iov iterator. This patch also adds
new field to the control block of skb - this value shows current offset
in the skb to read next portion of data (it doesn't matter linear it or
not). Idea is that 'skb_copy_datagram_iter()' handles both types of
skb internally - it just needs an offset from which to copy data from
the given skb. This offset is incremented on each read from skb. This
approach allows to avoid special handling of non-linear skbs:
1) We can't call 'skb_pull()' on it, because it updates 'data' pointer.
2) We need to update 'data_len' also on each read from this skb.


I would mention that this change is in preparation of zero-copy support.



Signed-off-by: Arseniy Krasnov 
---
Changelog:
v4 -> v5:
 * Use local variable for 'frag_off' in stream dequeue calback.
 * R-b from Bobby Eshleman removed due to patch update.

include/linux/virtio_vsock.h|  1 +
net/vmw_vsock/virtio_transport_common.c | 30 ++---
2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c58453699ee9..17dbb7176e37 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -12,6 +12,7 @@
struct virtio_vsock_skb_cb {
bool reply;
bool tap_delivered;
+   u32 frag_off;
};

#define VIRTIO_VSOCK_SKB_CB(skb) ((struct virtio_vsock_skb_cb *)((skb)->cb))
diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index b769fc258931..e5683af23e60 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -355,7 +355,7 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
spin_lock_bh(&vvs->rx_lock);

skb_queue_walk_safe(&vvs->rx_queue, skb, tmp) {
-   off = 0;
+   off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;

if (total == len)
break;
@@ -370,7 +370,10 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
 */
spin_unlock_bh(&vvs->rx_lock);

-   err = memcpy_to_msg(msg, skb->data + off, bytes);
+   err = skb_copy_datagram_iter(skb, off,
+&msg->msg_iter,
+bytes);
+
if (err)
goto out;

@@ -411,27 +414,35 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
}

while (total < len && !skb_queue_empty(&vvs->rx_queue)) {
+   u32 skb_rest_len;
+
skb = skb_peek(&vvs->rx_queue);

bytes = len - total;
-   if (bytes > skb->len)
-   bytes = skb->len;
+   skb_rest_len = skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
+
+   if (bytes > skb_rest_len)
+   bytes = skb_rest_len;


What about just:
bytes = min_t(size_t, len - total,
  skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off);

The rest LGTM!

Stefano



/* sk_lock is held by caller so no one else can dequeue.
 * Unlock rx_lock since memcpy_to_msg() may sleep.
 */
spin_unlock_bh(&vvs->rx_lock);

-   err = memcpy_to_msg(msg, skb->data, bytes);
+   err = skb_copy_datagram_iter(skb,
+VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+&msg->msg_iter, bytes);
+
if (err)
goto out;

spin_lock_bh(&vvs->rx_lock);

total += bytes;
-   skb_pull(skb, bytes);

-   if (skb->len == 0) {
+   VIRTIO_VSOCK_SKB_CB(skb)->frag_off += bytes;
+
+   if (skb->len == VIRTIO_VSOCK_SKB_CB(skb)->frag_off) {
u32 pkt_len = le32_to_cpu(virtio_vsock_hdr(skb)->len);

virtio_transport_dec_rx_pkt(vvs, pkt_len);
@@ -503,7 +514,10 @@ static int virtio_transport_seqpacket_do_dequeue(struct 
vsock_sock *vsk,
 */
spin_unlock_bh(&vvs->rx_lock);

-   err = memcpy_to_msg(msg, skb->data, 
bytes_to_copy);
+   err = skb_copy_datagram_iter(skb, 0,
+&msg->msg_iter,
+bytes_to_copy);
+
 

Re: [PATCH] Fix max/min warnings in virtio_net, amd/display, and io_uring

2023-07-06 Thread Jens Axboe
On 7/6/23 7:58 AM, Alex Deucher wrote:
> On Thu, Jul 6, 2023 at 3:37 AM Yang Rong  wrote:
>>
>> The files drivers/net/virtio_net.c, 
>> drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c, and io_uring/io_uring.c were 
>> modified to fix warnings.
>> Specifically, the opportunities for max() and min() were utilized to address 
>> the warnings.
> 
> Please split this into 3 patches, one for each component.

Don't bother with the io_uring one, code is far more readable as-is.

-- 
Jens Axboe



Re: [PATCH] Fix max/min warnings in virtio_net, amd/display, and io_uring

2023-07-06 Thread Michael S. Tsirkin
On Thu, Jul 06, 2023 at 10:06:16AM +0800, Yang Rong wrote:
> The files drivers/net/virtio_net.c, 
> drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c, and io_uring/io_uring.c were 
> modified to fix warnings.

what warnings? the point of the warning is to analyze it not "fix" it
blindly.

> Specifically, the opportunities for max() and min() were utilized to address 
> the warnings.
> 
> Signed-off-by: Yang Rong 
> ---
>  drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c | 6 +++---
>  drivers/net/virtio_net.c | 3 ++-
>  io_uring/io_uring.c  | 3 ++-
>  3 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c 
> b/drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c
> index c753c6f30dd7..df79aea49a3c 100644
> --- a/drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c
> +++ b/drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c
> @@ -22,7 +22,7 @@
>   * Authors: AMD
>   *
>   */
> -
> +#include 
>  #include "dc.h"
>  #include "dc_dmub_srv.h"
>  #include "../dmub/dmub_srv.h"
> @@ -481,7 +481,7 @@ static void populate_subvp_cmd_drr_info(struct dc *dc,
> max_drr_vblank_us = div64_u64((subvp_active_us - prefetch_us -
> dc->caps.subvp_fw_processing_delay_us - 
> drr_active_us), 2) + drr_active_us;
> max_drr_mallregion_us = subvp_active_us - prefetch_us - 
> mall_region_us - dc->caps.subvp_fw_processing_delay_us;
> -   max_drr_supported_us = max_drr_vblank_us > max_drr_mallregion_us ? 
> max_drr_vblank_us : max_drr_mallregion_us;
> +   max_drr_supported_us = max(max_drr_vblank_us, max_drr_mallregion_us);
> max_vtotal_supported = div64_u64(((uint64_t)drr_timing->pix_clk_100hz 
> * 100 * max_drr_supported_us),
> (((uint64_t)drr_timing->h_total * 100)));
> 
> @@ -771,7 +771,7 @@ void dc_dmub_setup_subvp_dmub_command(struct dc *dc,
> wm_val_refclk = 
> context->bw_ctx.bw.dcn.watermarks.a.cstate_pstate.pstate_change_ns *
> 
> (dc->res_pool->ref_clocks.dchub_ref_clock_inKhz / 1000) / 1000;
> 
> -   cmd.fw_assisted_mclk_switch_v2.config_data.watermark_a_cache 
> = wm_val_refclk < 0x ? wm_val_refclk : 0x;
> +   cmd.fw_assisted_mclk_switch_v2.config_data.watermark_a_cache 
> = min(wm_val_refclk, 0x);
> }
> 
> dm_execute_dmub_cmd(dc->ctx, &cmd, DM_DMUB_WAIT_TYPE_WAIT);
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 9b3721424e71..5bb7da885f00 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -1291,7 +1292,7 @@ static struct sk_buff *build_skb_from_xdp_buff(struct 
> net_device *dev,
> __skb_put(skb, data_len);
> 
> metasize = xdp->data - xdp->data_meta;
> -   metasize = metasize > 0 ? metasize : 0;
> +   metasize = max(metasize, 0);
> if (metasize)
> skb_metadata_set(skb, metasize);
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index e8096d502a7c..875ca657227d 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -47,6 +47,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -2660,7 +2661,7 @@ static void *__io_uaddr_map(struct page ***pages, 
> unsigned short *npages,
> page_array);
> if (ret != nr_pages) {
>  err:
> -   io_pages_free(&page_array, ret > 0 ? ret : 0);
> +   io_pages_free(&page_array, max(ret, 0));
> return ret < 0 ? ERR_PTR(ret) : ERR_PTR(-EFAULT);
> }
> /*
> --
> 2.35.3
> 
> 
> 


Re: [PATCH] Fix max/min warnings in virtio_net, amd/display, and io_uring

2023-07-06 Thread Alex Deucher
On Thu, Jul 6, 2023 at 3:37 AM Yang Rong  wrote:
>
> The files drivers/net/virtio_net.c, 
> drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c, and io_uring/io_uring.c were 
> modified to fix warnings.
> Specifically, the opportunities for max() and min() were utilized to address 
> the warnings.

Please split this into 3 patches, one for each component.

Alex

>
> Signed-off-by: Yang Rong 
> ---
>  drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c | 6 +++---
>  drivers/net/virtio_net.c | 3 ++-
>  io_uring/io_uring.c  | 3 ++-
>  3 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c 
> b/drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c
> index c753c6f30dd7..df79aea49a3c 100644
> --- a/drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c
> +++ b/drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c
> @@ -22,7 +22,7 @@
>   * Authors: AMD
>   *
>   */
> -
> +#include 
>  #include "dc.h"
>  #include "dc_dmub_srv.h"
>  #include "../dmub/dmub_srv.h"
> @@ -481,7 +481,7 @@ static void populate_subvp_cmd_drr_info(struct dc *dc,
> max_drr_vblank_us = div64_u64((subvp_active_us - prefetch_us -
> dc->caps.subvp_fw_processing_delay_us - 
> drr_active_us), 2) + drr_active_us;
> max_drr_mallregion_us = subvp_active_us - prefetch_us - 
> mall_region_us - dc->caps.subvp_fw_processing_delay_us;
> -   max_drr_supported_us = max_drr_vblank_us > max_drr_mallregion_us ? 
> max_drr_vblank_us : max_drr_mallregion_us;
> +   max_drr_supported_us = max(max_drr_vblank_us, max_drr_mallregion_us);
> max_vtotal_supported = div64_u64(((uint64_t)drr_timing->pix_clk_100hz 
> * 100 * max_drr_supported_us),
> (((uint64_t)drr_timing->h_total * 100)));
>
> @@ -771,7 +771,7 @@ void dc_dmub_setup_subvp_dmub_command(struct dc *dc,
> wm_val_refclk = 
> context->bw_ctx.bw.dcn.watermarks.a.cstate_pstate.pstate_change_ns *
> 
> (dc->res_pool->ref_clocks.dchub_ref_clock_inKhz / 1000) / 1000;
>
> -   cmd.fw_assisted_mclk_switch_v2.config_data.watermark_a_cache 
> = wm_val_refclk < 0x ? wm_val_refclk : 0x;
> +   cmd.fw_assisted_mclk_switch_v2.config_data.watermark_a_cache 
> = min(wm_val_refclk, 0x);
> }
>
> dm_execute_dmub_cmd(dc->ctx, &cmd, DM_DMUB_WAIT_TYPE_WAIT);
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 9b3721424e71..5bb7da885f00 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -1291,7 +1292,7 @@ static struct sk_buff *build_skb_from_xdp_buff(struct 
> net_device *dev,
> __skb_put(skb, data_len);
>
> metasize = xdp->data - xdp->data_meta;
> -   metasize = metasize > 0 ? metasize : 0;
> +   metasize = max(metasize, 0);
> if (metasize)
> skb_metadata_set(skb, metasize);
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index e8096d502a7c..875ca657227d 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -47,6 +47,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -2660,7 +2661,7 @@ static void *__io_uaddr_map(struct page ***pages, 
> unsigned short *npages,
> page_array);
> if (ret != nr_pages) {
>  err:
> -   io_pages_free(&page_array, ret > 0 ? ret : 0);
> +   io_pages_free(&page_array, max(ret, 0));
> return ret < 0 ? ERR_PTR(ret) : ERR_PTR(-EFAULT);
> }
> /*
> --
> 2.35.3
>
>

Re: Hyper-V vsock streams do not fill the supplied buffer in full

2023-07-06 Thread Stefano Garzarella
Hi Gary,

On Wed, Jul 5, 2023 at 12:45 AM Gary Guo  wrote:
>
> When recvmsg is called on a vsock stream with a buffer, it only fills
> the buffer with data from the first VM packet. Even if there are
> more VM packets at the time and the buffer is still not completely
> filled, it will just leave the buffer partially filled.
>
> This causes issues in WSLD, which uses the vsock in
> non-blocking mode with epoll.
>
> For stream-oriented sockets, the epoll man page [1] says that
>
> > For stream-oriented files (e.g., pipe, FIFO, stream socket),
> > the condition that the read/write I/O space is exhausted can
> > also be detected by checking the amount of data read from /
> > written to the target file descriptor.  For example, if you
> > call read(2) by asking to read a certain amount of data and
> > read(2) returns a lower number of bytes, you can be sure of
> > having exhausted the read I/O space for the file descriptor.
>
> This has been used as an optimisation in the wild to reduce the number
> of syscalls required for stream sockets (by asserting that the socket
> will not have to be polled to EAGAIN in edge-trigger mode, if the buffer
> given to recvmsg is not filled completely). An example is Tokio, which
> applies this optimisation starting in v1.21.0 [2].
>
> When this optimisation is combined with the behaviour of Hyper-V vsock, it
> causes an issue in this scenario:
> * the VM host sends data to the guest, and it is split into multiple
>   VM packets
> * sk_data_ready is called and epoll returns, notifying the userspace
>   that the socket is ready
> * userspace calls recvmsg with a buffer, and it is only partially filled
> * userspace assumes that the stream socket is depleted, and if new data
>   arrives epoll will notify it again.
> * the kernel, however, still considers the socket to be ready, and since
>   it's in edge-trigger mode, the epoll instance will never be notified again.
>
> This mismatch in how readiness is determined causes the userspace to
> block forever.

Thanks for the detailed description of the problem.

I think we should fix the hvs_stream_dequeue() in
net/vmw_vsock/hyperv_transport.c.
We can do something similar to what we do in
virtio_transport_stream_do_dequeue() in
net/vmw_vsock/virtio_transport_common.c
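
Something like the rough, untested sketch below, which keeps the shape of
the existing function (hvs_read_chunk() is a hypothetical helper standing
in for the current single-packet copy logic; error handling is omitted):

static ssize_t hvs_stream_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
                                  size_t len, int flags)
{
        ssize_t total = 0;

        if (flags & MSG_PEEK)
                return -EOPNOTSUPP;

        /* Keep draining queued VMBus packets until the user buffer is
         * full, instead of returning after the first packet.
         */
        while (len > 0 && hvs_stream_has_data(vsk) > 0) {
                ssize_t copied = hvs_read_chunk(vsk, msg, len); /* hypothetical */

                if (copied <= 0)
                        break;

                total += copied;
                len -= copied;
        }

        return total;
}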

@Dexuan WDYT?

Thanks,
Stefano


Re: [PATCH] vdpa: reject F_ENABLE_AFTER_DRIVER_OK if backend does not support it

2023-07-06 Thread Jason Wang
On Thu, Jul 6, 2023 at 3:55 PM Jason Wang  wrote:
>
> On Thu, Jul 6, 2023 at 3:06 PM Eugenio Perez Martin  
> wrote:
> >
> > On Thu, Jul 6, 2023 at 3:55 AM Jason Wang  wrote:
> > >
> > > On Wed, Jul 5, 2023 at 4:41 PM Michael S. Tsirkin  wrote:
> > > >
> > > > On Wed, Jul 05, 2023 at 03:49:58PM +0800, Jason Wang wrote:
> > > > > On Tue, Jul 4, 2023 at 11:45 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Jul 04, 2023 at 01:36:11PM +0200, Eugenio Perez Martin 
> > > > > > wrote:
> > > > > > > On Tue, Jul 4, 2023 at 12:38 PM Michael S. Tsirkin 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Tue, Jul 04, 2023 at 12:25:32PM +0200, Eugenio Perez Martin 
> > > > > > > > wrote:
> > > > > > > > > On Mon, Jul 3, 2023 at 4:52 PM Michael S. Tsirkin 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, Jul 03, 2023 at 04:22:18PM +0200, Eugenio Pérez 
> > > > > > > > > > wrote:
> > > > > > > > > > > With the current code it is accepted as long as userland 
> > > > > > > > > > > send it.
> > > > > > > > > > >
> > > > > > > > > > > Although userland should not set a feature flag that has 
> > > > > > > > > > > not been
> > > > > > > > > > > offered to it with VHOST_GET_BACKEND_FEATURES, the 
> > > > > > > > > > > current code will not
> > > > > > > > > > > complain for it.
> > > > > > > > > > >
> > > > > > > > > > > Since there is no specific reason for any parent to 
> > > > > > > > > > > reject that backend
> > > > > > > > > > > feature bit when it has been proposed, let's control it 
> > > > > > > > > > > at vdpa frontend
> > > > > > > > > > > level. Future patches may move this control to the parent 
> > > > > > > > > > > driver.
> > > > > > > > > > >
> > > > > > > > > > > Fixes: 967800d2d52e ("vdpa: accept 
> > > > > > > > > > > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend feature")
> > > > > > > > > > > Signed-off-by: Eugenio Pérez 
> > > > > > > > > >
> > > > > > > > > > Please do send v3. And again, I don't want to send "after 
> > > > > > > > > > driver ok" hack
> > > > > > > > > > upstream at all, I merged it in next just to give it some 
> > > > > > > > > > testing.
> > > > > > > > > > We want RING_ACCESS_AFTER_KICK or some such.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Current devices do not support that semantic.
> > > > > > > >
> > > > > > > > Which devices specifically access the ring after DRIVER_OK but 
> > > > > > > > before
> > > > > > > > a kick?
> > > > > > > >
> > > > > > >
> > > > > > > Previous versions of the QEMU LM series did a spurious kick to 
> > > > > > > start
> > > > > > > traffic at the LM destination [1]. When it was proposed, that 
> > > > > > > spurious
> > > > > > > kick was removed from the series because to check for descriptors
> > > > > > > after driver_ok, even without a kick, was considered work of the
> > > > > > > parent driver.
> > > > > > >
> > > > > > > I'm ok to go back to this spurious kick, but I'm not sure if the 
> > > > > > > hw
> > > > > > > will read the ring before the kick actually. I can ask.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > [1] 
> > > > > > > https://lists.nongnu.org/archive/html/qemu-devel/2023-01/msg02775.html
> > > > > >
> > > > > > Let's find out. We need to check for ENABLE_AFTER_DRIVER_OK too, no?
> > > > >
> > > > > My understanding is [1] assuming ACCESS_AFTER_KICK. This seems
> > > > > sub-optimal than assuming ENABLE_AFTER_DRIVER_OK.
> > > > >
> > > > > But this reminds me one thing, as the thread is going too long, I
> > > > > wonder if we simply assume ENABLE_AFTER_DRIVER_OK if RING_RESET is
> > > > > supported?
> > > > >
> > > > > Thanks
> > > >
> > > > I don't see what does one have to do with another ...
> > > >
> > > > I think with RING_RESET we had another solution, enable rings
> > > > mapping them to a zero page, then reset and re-enable later.
> > >
> > > As discussed before, this seems to have some problems:
> > >
> > > 1) The behaviour is not clarified in the document
> > > 2) zero is a valid IOVA
> > >
> >
> > I think we're not on the same page here.
> >
> > As I understood, rings mapped to a zero page means essentially an
> > avail ring whose avail_idx is always 0, offered to the device instead
> > of the guest's ring. Once all CVQ commands are processed, we use
> > RING_RESET to switch to the right ring, being guest's or SVQ vring.
>
> I get this. This seems more complicated in the destination: shadow vq + ASID?

So it's something like:

1) set all vq ASID to shadow virtqueue
2) do not add any bufs to data qp (stick 0 as avail index)
3) start to restore states via cvq
4) ring_reset for dataqp
5) set_vq_state for dataqp
6) re-initialize dataqp address etc
7) set data QP ASID to guest
8) set queue_enable

?
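
In vdpa config-ops terms I imagine something like the sketch below
(illustrative pseudocode only: reset_vq() and restore_state_via_cvq() do
not exist as such, the group/ASID constants are made up, and error
handling is omitted):

static void restore_then_start_dataplane(struct vdpa_device *vdpa,
                                         const struct vdpa_config_ops *ops,
                                         u16 data_qp,
                                         const struct vdpa_vq_state *state,
                                         u64 desc, u64 driver, u64 device)
{
        ops->set_group_asid(vdpa, DATA_VQ_GROUP, SHADOW_ASID);    /* 1) */
        /* 2) no buffers are added to the data qp, avail index stays 0 */
        restore_state_via_cvq(vdpa);                               /* 3) */
        reset_vq(vdpa, data_qp);                                   /* 4) needs RING_RESET */
        ops->set_vq_state(vdpa, data_qp, state);                   /* 5) */
        ops->set_vq_address(vdpa, data_qp, desc, driver, device);  /* 6) */
        ops->set_group_asid(vdpa, DATA_VQ_GROUP, GUEST_ASID);      /* 7) */
        ops->set_vq_ready(vdpa, data_qp, true);                    /* 8) */
}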

Thanks

>
> Thanks
>
> >
> >
> >
> > > Thanks
> > >
> > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > > > My plan was to convert
> > > > > > > > > it in vp_vdpa if needed, and reuse the current vdpa ops. 
> > > > > > > > > 

Re: [PATCH] vdpa: reject F_ENABLE_AFTER_DRIVER_OK if backend does not support it

2023-07-06 Thread Jason Wang
On Thu, Jul 6, 2023 at 3:06 PM Eugenio Perez Martin  wrote:
>
> On Thu, Jul 6, 2023 at 3:55 AM Jason Wang  wrote:
> >
> > On Wed, Jul 5, 2023 at 4:41 PM Michael S. Tsirkin  wrote:
> > >
> > > On Wed, Jul 05, 2023 at 03:49:58PM +0800, Jason Wang wrote:
> > > > On Tue, Jul 4, 2023 at 11:45 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Tue, Jul 04, 2023 at 01:36:11PM +0200, Eugenio Perez Martin wrote:
> > > > > > On Tue, Jul 4, 2023 at 12:38 PM Michael S. Tsirkin 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Tue, Jul 04, 2023 at 12:25:32PM +0200, Eugenio Perez Martin 
> > > > > > > wrote:
> > > > > > > > On Mon, Jul 3, 2023 at 4:52 PM Michael S. Tsirkin 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Jul 03, 2023 at 04:22:18PM +0200, Eugenio Pérez wrote:
> > > > > > > > > > With the current code it is accepted as long as userland 
> > > > > > > > > > send it.
> > > > > > > > > >
> > > > > > > > > > Although userland should not set a feature flag that has 
> > > > > > > > > > not been
> > > > > > > > > > offered to it with VHOST_GET_BACKEND_FEATURES, the current 
> > > > > > > > > > code will not
> > > > > > > > > > complain for it.
> > > > > > > > > >
> > > > > > > > > > Since there is no specific reason for any parent to reject 
> > > > > > > > > > that backend
> > > > > > > > > > feature bit when it has been proposed, let's control it at 
> > > > > > > > > > vdpa frontend
> > > > > > > > > > level. Future patches may move this control to the parent 
> > > > > > > > > > driver.
> > > > > > > > > >
> > > > > > > > > > Fixes: 967800d2d52e ("vdpa: accept 
> > > > > > > > > > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend feature")
> > > > > > > > > > Signed-off-by: Eugenio Pérez 
> > > > > > > > >
> > > > > > > > > Please do send v3. And again, I don't want to send "after 
> > > > > > > > > driver ok" hack
> > > > > > > > > upstream at all, I merged it in next just to give it some 
> > > > > > > > > testing.
> > > > > > > > > We want RING_ACCESS_AFTER_KICK or some such.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Current devices do not support that semantic.
> > > > > > >
> > > > > > > Which devices specifically access the ring after DRIVER_OK but 
> > > > > > > before
> > > > > > > a kick?
> > > > > > >
> > > > > >
> > > > > > Previous versions of the QEMU LM series did a spurious kick to start
> > > > > > traffic at the LM destination [1]. When it was proposed, that 
> > > > > > spurious
> > > > > > kick was removed from the series because to check for descriptors
> > > > > > after driver_ok, even without a kick, was considered work of the
> > > > > > parent driver.
> > > > > >
> > > > > > I'm ok to go back to this spurious kick, but I'm not sure if the hw
> > > > > > will read the ring before the kick actually. I can ask.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > [1] 
> > > > > > https://lists.nongnu.org/archive/html/qemu-devel/2023-01/msg02775.html
> > > > >
> > > > > Let's find out. We need to check for ENABLE_AFTER_DRIVER_OK too, no?
> > > >
> > > > My understanding is [1] assuming ACCESS_AFTER_KICK. This seems
> > > > sub-optimal than assuming ENABLE_AFTER_DRIVER_OK.
> > > >
> > > > But this reminds me one thing, as the thread is going too long, I
> > > > wonder if we simply assume ENABLE_AFTER_DRIVER_OK if RING_RESET is
> > > > supported?
> > > >
> > > > Thanks
> > >
> > > I don't see what does one have to do with another ...
> > >
> > > I think with RING_RESET we had another solution, enable rings
> > > mapping them to a zero page, then reset and re-enable later.
> >
> > As discussed before, this seems to have some problems:
> >
> > 1) The behaviour is not clarified in the document
> > 2) zero is a valid IOVA
> >
>
> I think we're not on the same page here.
>
> As I understood, rings mapped to a zero page means essentially an
> avail ring whose avail_idx is always 0, offered to the device instead
> of the guest's ring. Once all CVQ commands are processed, we use
> RING_RESET to switch to the right ring, being guest's or SVQ vring.

I get this. This seems more complicated in the destination: shadow vq + ASID?

Thanks

>
>
>
> > Thanks
> >
> > >
> > > > >
> > > > >
> > > > >
> > > > > > > > My plan was to convert
> > > > > > > > it in vp_vdpa if needed, and reuse the current vdpa ops. Sorry 
> > > > > > > > if I
> > > > > > > > was not explicit enough.
> > > > > > > >
> > > > > > > > The only solution I can see to that is to trap & emulate in the 
> > > > > > > > vdpa
> > > > > > > > (parent?) driver, as talked in virtio-comment. But that 
> > > > > > > > complicates
> > > > > > > > the architecture:
> > > > > > > > * Offer VHOST_BACKEND_F_RING_ACCESS_AFTER_KICK
> > > > > > > > * Store vq enable state separately, at
> > > > > > > > vdpa->config->set_vq_ready(true), but not transmit that enable 
> > > > > > > > to hw
> > > > > > > > * Store the doorbell state separately, but do not configure it 
> > > > > > > > to the
> > > > > > > > device 

Re: [PATCH] vdpa: reject F_ENABLE_AFTER_DRIVER_OK if backend does not support it

2023-07-06 Thread Michael S. Tsirkin
On Thu, Jul 06, 2023 at 08:35:30AM +0200, Eugenio Perez Martin wrote:
> On Thu, Jul 6, 2023 at 8:07 AM Michael S. Tsirkin  wrote:
> >
> > On Wed, Jul 05, 2023 at 05:07:11PM -0700, Shannon Nelson wrote:
> > > On 7/5/23 11:27 AM, Eugenio Perez Martin wrote:
> > > >
> > > > On Wed, Jul 5, 2023 at 9:50 AM Jason Wang  wrote:
> > > > >
> > > > > On Tue, Jul 4, 2023 at 11:45 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Jul 04, 2023 at 01:36:11PM +0200, Eugenio Perez Martin 
> > > > > > wrote:
> > > > > > > On Tue, Jul 4, 2023 at 12:38 PM Michael S. Tsirkin 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Tue, Jul 04, 2023 at 12:25:32PM +0200, Eugenio Perez Martin 
> > > > > > > > wrote:
> > > > > > > > > On Mon, Jul 3, 2023 at 4:52 PM Michael S. Tsirkin 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, Jul 03, 2023 at 04:22:18PM +0200, Eugenio Pérez 
> > > > > > > > > > wrote:
> > > > > > > > > > > With the current code it is accepted as long as userland 
> > > > > > > > > > > send it.
> > > > > > > > > > >
> > > > > > > > > > > Although userland should not set a feature flag that has 
> > > > > > > > > > > not been
> > > > > > > > > > > offered to it with VHOST_GET_BACKEND_FEATURES, the 
> > > > > > > > > > > current code will not
> > > > > > > > > > > complain for it.
> > > > > > > > > > >
> > > > > > > > > > > Since there is no specific reason for any parent to 
> > > > > > > > > > > reject that backend
> > > > > > > > > > > feature bit when it has been proposed, let's control it 
> > > > > > > > > > > at vdpa frontend
> > > > > > > > > > > level. Future patches may move this control to the parent 
> > > > > > > > > > > driver.
> > > > > > > > > > >
> > > > > > > > > > > Fixes: 967800d2d52e ("vdpa: accept 
> > > > > > > > > > > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend feature")
> > > > > > > > > > > Signed-off-by: Eugenio Pérez 
> > > > > > > > > >
> > > > > > > > > > Please do send v3. And again, I don't want to send "after 
> > > > > > > > > > driver ok" hack
> > > > > > > > > > upstream at all, I merged it in next just to give it some 
> > > > > > > > > > testing.
> > > > > > > > > > We want RING_ACCESS_AFTER_KICK or some such.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Current devices do not support that semantic.
> > > > > > > >
> > > > > > > > Which devices specifically access the ring after DRIVER_OK but 
> > > > > > > > before
> > > > > > > > a kick?
> > >
> > > The PDS vdpa device can deal with a call to .set_vq_ready after DRIVER_OK 
> > > is
> > > set.  And I'm told that our VQ activity should start without a kick.
> > >
> > > Our vdpa device FW doesn't currently have support for VIRTIO_F_RING_RESET,
> > > but I believe it could be added without too much trouble.
> > >
> > > sln
> > >
> >
> > OK it seems clear at least in the current version pds needs
> > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK.
> > However can we also code up the RING_RESET path as the default?
> > Then down the road vendors can choose what to do.
> >
> 
> Yes, the RING_RESET path can be coded & tested for vp_vdpa, for
> example. I'm ok with making it the default, and leaving
> _F_ENABLE_AFTER_DRIVER_OK as a fallback.

Sounds good.
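
To spell out what I think we are agreeing on, a minimal sketch of the
selection logic on the QEMU side (the helper names and the error path are
illustrative, not an actual implementation):

        if (device_features & BIT_ULL(VIRTIO_F_RING_RESET)) {
                /* default: park the data vqs on an empty avail ring, then
                 * RING_RESET and reconfigure them once the CVQ restore is
                 * done
                 */
                use_ring_reset_path();                     /* hypothetical */
        } else if (backend_features &
                   BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK)) {
                /* fallback: keep the data vqs disabled across DRIVER_OK and
                 * only set_vq_ready(true) once the CVQ commands have been
                 * consumed
                 */
                enable_vqs_after_driver_ok();              /* hypothetical */
        } else {
                /* no safe way to hold the dataplane while restoring via CVQ */
                return -ENOTSUP;
        }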

> >
> >
> >
> >
> > > > > > > >
> > > > > > >
> > > > > > > Previous versions of the QEMU LM series did a spurious kick to 
> > > > > > > start
> > > > > > > traffic at the LM destination [1]. When it was proposed, that 
> > > > > > > spurious
> > > > > > > kick was removed from the series because to check for descriptors
> > > > > > > after driver_ok, even without a kick, was considered work of the
> > > > > > > parent driver.
> > > > > > >
> > > > > > > I'm ok to go back to this spurious kick, but I'm not sure if the 
> > > > > > > hw
> > > > > > > will read the ring before the kick actually. I can ask.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > [1] 
> > > > > > > https://lists.nongnu.org/archive/html/qemu-devel/2023-01/msg02775.html
> > > > > >
> > > > > > Let's find out. We need to check for ENABLE_AFTER_DRIVER_OK too, no?
> > > > >
> > > > > My understanding is [1] assuming ACCESS_AFTER_KICK. This seems
> > > > > sub-optimal than assuming ENABLE_AFTER_DRIVER_OK.
> > > > >
> > > > > But this reminds me one thing, as the thread is going too long, I
> > > > > wonder if we simply assume ENABLE_AFTER_DRIVER_OK if RING_RESET is
> > > > > supported?
> > > > >
> > > >
> > > > The problem with that is that the device needs to support all
> > > > RING_RESET, like to be able to change vq address etc after DRIVER_OK.
> > > > Not all HW support it.
> > > >
> > > > We just need the subset of having the dataplane freezed until all CVQ
> > > > commands have been consumed. I'm sure current vDPA code already
> > > > supports it in some devices, like MLX and PSD.
> > > >
> > > > Thanks!
> > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > > > My plan was to convert
> > > > > > > > > it in 

Re: [PATCH] vdpa: reject F_ENABLE_AFTER_DRIVER_OK if backend does not support it

2023-07-06 Thread Michael S. Tsirkin
On Wed, Jul 05, 2023 at 05:07:11PM -0700, Shannon Nelson wrote:
> On 7/5/23 11:27 AM, Eugenio Perez Martin wrote:
> >
> > On Wed, Jul 5, 2023 at 9:50 AM Jason Wang  wrote:
> > > 
> > > On Tue, Jul 4, 2023 at 11:45 PM Michael S. Tsirkin  
> > > wrote:
> > > > 
> > > > On Tue, Jul 04, 2023 at 01:36:11PM +0200, Eugenio Perez Martin wrote:
> > > > > On Tue, Jul 4, 2023 at 12:38 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > > 
> > > > > > On Tue, Jul 04, 2023 at 12:25:32PM +0200, Eugenio Perez Martin 
> > > > > > wrote:
> > > > > > > On Mon, Jul 3, 2023 at 4:52 PM Michael S. Tsirkin 
> > > > > > >  wrote:
> > > > > > > > 
> > > > > > > > On Mon, Jul 03, 2023 at 04:22:18PM +0200, Eugenio Pérez wrote:
> > > > > > > > > With the current code it is accepted as long as userland send 
> > > > > > > > > it.
> > > > > > > > > 
> > > > > > > > > Although userland should not set a feature flag that has not 
> > > > > > > > > been
> > > > > > > > > offered to it with VHOST_GET_BACKEND_FEATURES, the current 
> > > > > > > > > code will not
> > > > > > > > > complain for it.
> > > > > > > > > 
> > > > > > > > > Since there is no specific reason for any parent to reject 
> > > > > > > > > that backend
> > > > > > > > > feature bit when it has been proposed, let's control it at 
> > > > > > > > > vdpa frontend
> > > > > > > > > level. Future patches may move this control to the parent 
> > > > > > > > > driver.
> > > > > > > > > 
> > > > > > > > > Fixes: 967800d2d52e ("vdpa: accept 
> > > > > > > > > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend feature")
> > > > > > > > > Signed-off-by: Eugenio Pérez 
> > > > > > > > 
> > > > > > > > Please do send v3. And again, I don't want to send "after 
> > > > > > > > driver ok" hack
> > > > > > > > upstream at all, I merged it in next just to give it some 
> > > > > > > > testing.
> > > > > > > > We want RING_ACCESS_AFTER_KICK or some such.
> > > > > > > > 
> > > > > > > 
> > > > > > > Current devices do not support that semantic.
> > > > > > 
> > > > > > Which devices specifically access the ring after DRIVER_OK but 
> > > > > > before
> > > > > > a kick?
> 
> The PDS vdpa device can deal with a call to .set_vq_ready after DRIVER_OK is
> set.  And I'm told that our VQ activity should start without a kick.
> 
> Our vdpa device FW doesn't currently have support for VIRTIO_F_RING_RESET,
> but I believe it could be added without too much trouble.
> 
> sln
> 

OK it seems clear at least in the current version pds needs
VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK.
However can we also code up the RING_RESET path as the default?
Then down the road vendors can choose what to do.





> > > > > > 
> > > > > 
> > > > > Previous versions of the QEMU LM series did a spurious kick to start
> > > > > traffic at the LM destination [1]. When it was proposed, that spurious
> > > > > kick was removed from the series because to check for descriptors
> > > > > after driver_ok, even without a kick, was considered work of the
> > > > > parent driver.
> > > > > 
> > > > > I'm ok to go back to this spurious kick, but I'm not sure if the hw
> > > > > will read the ring before the kick actually. I can ask.
> > > > > 
> > > > > Thanks!
> > > > > 
> > > > > [1] 
> > > > > https://lists.nongnu.org/archive/html/qemu-devel/2023-01/msg02775.html
> > > > 
> > > > Let's find out. We need to check for ENABLE_AFTER_DRIVER_OK too, no?
> > > 
> > > My understanding is [1] assuming ACCESS_AFTER_KICK. This seems
> > > sub-optimal than assuming ENABLE_AFTER_DRIVER_OK.
> > > 
> > > But this reminds me one thing, as the thread is going too long, I
> > > wonder if we simply assume ENABLE_AFTER_DRIVER_OK if RING_RESET is
> > > supported?
> > > 
> > 
> > The problem with that is that the device needs to support all
> > RING_RESET, like to be able to change vq address etc after DRIVER_OK.
> > Not all HW support it.
> > 
> > We just need the subset of having the dataplane freezed until all CVQ
> > commands have been consumed. I'm sure current vDPA code already
> > supports it in some devices, like MLX and PSD.
> > 
> > Thanks!
> > 
> > > Thanks
> > > 
> > > > 
> > > > 
> > > > 
> > > > > > > My plan was to convert
> > > > > > > it in vp_vdpa if needed, and reuse the current vdpa ops. Sorry if 
> > > > > > > I
> > > > > > > was not explicit enough.
> > > > > > > 
> > > > > > > The only solution I can see to that is to trap & emulate in the 
> > > > > > > vdpa
> > > > > > > (parent?) driver, as talked in virtio-comment. But that 
> > > > > > > complicates
> > > > > > > the architecture:
> > > > > > > * Offer VHOST_BACKEND_F_RING_ACCESS_AFTER_KICK
> > > > > > > * Store vq enable state separately, at
> > > > > > > vdpa->config->set_vq_ready(true), but not transmit that enable to 
> > > > > > > hw
> > > > > > > * Store the doorbell state separately, but do not configure it to 
> > > > > > > the
> > > > > > > device directly.
> > > > > > > 
> > > > > > > But how to recover if the device cannot configure them at kick 
> > > > > > > 

Re: [PATCH] vdpa: reject F_ENABLE_AFTER_DRIVER_OK if backend does not support it

2023-07-06 Thread Michael S. Tsirkin
On Thu, Jul 06, 2023 at 09:56:29AM +0800, Jason Wang wrote:
> On Wed, Jul 5, 2023 at 4:42 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, Jul 05, 2023 at 03:55:23PM +0800, Jason Wang wrote:
> > > On Tue, Jul 4, 2023 at 6:38 PM Michael S. Tsirkin  wrote:
> > > >
> > > > On Tue, Jul 04, 2023 at 12:25:32PM +0200, Eugenio Perez Martin wrote:
> > > > > On Mon, Jul 3, 2023 at 4:52 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Jul 03, 2023 at 04:22:18PM +0200, Eugenio Pérez wrote:
> > > > > > > With the current code it is accepted as long as userland send it.
> > > > > > >
> > > > > > > Although userland should not set a feature flag that has not been
> > > > > > > offered to it with VHOST_GET_BACKEND_FEATURES, the current code 
> > > > > > > will not
> > > > > > > complain for it.
> > > > > > >
> > > > > > > Since there is no specific reason for any parent to reject that 
> > > > > > > backend
> > > > > > > feature bit when it has been proposed, let's control it at vdpa 
> > > > > > > frontend
> > > > > > > level. Future patches may move this control to the parent driver.
> > > > > > >
> > > > > > > Fixes: 967800d2d52e ("vdpa: accept 
> > > > > > > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend feature")
> > > > > > > Signed-off-by: Eugenio Pérez 
> > > > > >
> > > > > > Please do send v3. And again, I don't want to send "after driver 
> > > > > > ok" hack
> > > > > > upstream at all, I merged it in next just to give it some testing.
> > > > > > We want RING_ACCESS_AFTER_KICK or some such.
> > > > > >
> > > > >
> > > > > Current devices do not support that semantic.
> > > >
> > > > Which devices specifically access the ring after DRIVER_OK but before
> > > > a kick?
> > >
> > > Vhost-net is one example at least. It polls a socket as well, so it
> > > starts to access the ring immediately after DRIVER_OK.
> > >
> > > Thanks
> >
> >
> > For sure but that is not vdpa.
> 
> Somehow it is via vp_vdpa that I'm usually testing vDPA patches.
> 
> The problem is that it's very hard to audit all vDPA devices now if it
> is not defined in the spec before they are invented.
> 
> Thanks

vp_vdpa is exactly the part that bothers me. If on the host it is backed
by e.g. vhost-user then it does not work. And you don't know what is
backing it.

OTOH it supports RING_RESET ...

So, proposal: include both this solution and, for drivers like
vp_vdpa, the RING_RESET trick.


Hmm?



> >
> > > >
> > > > > My plan was to convert
> > > > > it in vp_vdpa if needed, and reuse the current vdpa ops. Sorry if I
> > > > > was not explicit enough.
> > > > >
> > > > > The only solution I can see to that is to trap & emulate in the vdpa
> > > > > (parent?) driver, as talked in virtio-comment. But that complicates
> > > > > the architecture:
> > > > > * Offer VHOST_BACKEND_F_RING_ACCESS_AFTER_KICK
> > > > > * Store vq enable state separately, at
> > > > > vdpa->config->set_vq_ready(true), but not transmit that enable to hw
> > > > > * Store the doorbell state separately, but do not configure it to the
> > > > > device directly.
> > > > >
> > > > > But how to recover if the device cannot configure them at kick time,
> > > > > for example?
> > > > >
> > > > > Maybe we can just fail if the parent driver does not support enabling
> > > > > the vq after DRIVER_OK? That way no new feature flag is needed.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > >
> > > > > > > ---
> > > > > > > Sent with Fixes: tag pointing to 
> > > > > > > git.kernel.org/pub/scm/linux/kernel/git/mst
> > > > > > > commit. Please let me know if I should send a v3 of [1] instead.
> > > > > > >
> > > > > > > [1] 
> > > > > > > https://lore.kernel.org/lkml/20230609121244-mutt-send-email-...@kernel.org/T/
> > > > > > > ---
> > > > > > >  drivers/vhost/vdpa.c | 7 +--
> > > > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > > > > > > index e1abf29fed5b..a7e554352351 100644
> > > > > > > --- a/drivers/vhost/vdpa.c
> > > > > > > +++ b/drivers/vhost/vdpa.c
> > > > > > > @@ -681,18 +681,21 @@ static long 
> > > > > > > vhost_vdpa_unlocked_ioctl(struct file *filep,
> > > > > > >  {
> > > > > > >   struct vhost_vdpa *v = filep->private_data;
> > > > > > >   struct vhost_dev *d = >vdev;
> > > > > > > + const struct vdpa_config_ops *ops = v->vdpa->config;
> > > > > > >   void __user *argp = (void __user *)arg;
> > > > > > >   u64 __user *featurep = argp;
> > > > > > > - u64 features;
> > > > > > > + u64 features, parent_features = 0;
> > > > > > >   long r = 0;
> > > > > > >
> > > > > > >   if (cmd == VHOST_SET_BACKEND_FEATURES) {
> > > > > > >   if (copy_from_user(, featurep, 
> > > > > > > sizeof(features)))
> > > > > > >   return -EFAULT;
> > > > > > > + if (ops->get_backend_features)
> > > > > > > + parent_features = 
> > > > > > > ops->get_backend_features(v->vdpa);
> > > > > > >