Re: [PATCH v1 0/3] QIOChannel flags + multifd zerocopy

2021-08-31 Thread Peter Xu
On Tue, Aug 31, 2021 at 08:02:36AM -0300, Leonardo Bras wrote:
> Results:
> So far, the resource usage of __sys_sendmsg() dropped 15-fold, and the
> overall migration took 13-18% less time, based on a synthetic workload.

Leo,

Could you share some of the details of your tests?  E.g., what's the
configuration of your VM for testing?  What's the migration time before/after
the patchset is applied?  What network are you using?

Thanks,

-- 
Peter Xu




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-08-31 Thread Peter Xu
On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > 
> > Change the send_write() interface of multifd, allowing it to pass down
> > flags for qio_channel_write*().
> > 
> > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > other data being sent at the default copying approach.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/multifd-zlib.c | 7 ---
> >  migration/multifd-zstd.c | 7 ---
> >  migration/multifd.c  | 9 ++---
> >  migration/multifd.h  | 3 ++-
> >  4 files changed, 16 insertions(+), 10 deletions(-)
> 
> > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> >  }
> >  
> >  if (used) {
> > -ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > +ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > +  &local_err);
> 
> I don't think it is valid to unconditionally enable this feature due to the
> resource usage implications
> 
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> 
>   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
>if the socket option was not set, the socket exceeds its optmem 
>limit or the user exceeds its ulimit on locked pages."
> 
> The limit on locked pages is something that looks very likely to be
> exceeded unless you happen to be running a QEMU config that already
> implies locked memory (eg PCI assignment)

Yes, it would be great to have this as a migration capability in parallel to
multifd.  In the initial phase, if it's easier to implement on multifd only, we
can add a dependency between the caps.  In the future we can remove that
dependency when the code is ready to go without multifd.  Thanks,

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-08-31 Thread Peter Xu
On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > send calls. It does so by avoiding copying user data into kernel buffers.
> > 
> > To make it work, three steps are needed:
> > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > 3 - Process the socket's error queue, dealing with any error
> 
> AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> 
> It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> from a synchronous call to an asynchronous call.
> 
> It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> until an asynchronous completion notification has been received from
> the socket error queue. These notifications are not required to
> arrive in-order, even for a TCP stream, because the kernel hangs on
> to the buffer if a re-transmit is needed.
> 
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> 
>   "Page pinning also changes system call semantics. It temporarily 
>shares the buffer between process and network stack. Unlike with
>copying, the process cannot immediately overwrite the buffer 
>after system call return without possibly modifying the data in 
>flight. Kernel integrity is not affected, but a buggy program
>can possibly corrupt its own data stream."
> 
> AFAICT, the design added in this patch does not provide any way
> to honour these requirements around buffer lifetime.
> 
> I can't see how we can introduce MSG_ZEROCOPY in any seamless
> way. The buffer lifetime requirements imply need for an API
> design that is fundamentally different for asynchronous usage,
> with a callback to notify when the write has finished/failed.

Regarding buffer reuse - it indeed has very deep implications on buffer
availability, and it's not obvious at all.  Just to mention that the initial
user of this work will make sure all zero-copy buffers are guest pages only
(as it's only used in multi-fd), so they should remain present for the whole
process.

I think asking for a complete design still makes sense.  E.g., for precopy,
before we flush device state and complete the migration, we may want to at
least have a final ack on all the zero-copies of guest pages to guarantee they
are flushed.

IOW, we need to make sure the last piece of migration stream lands after the
guest pages so that the dest VM will always contain the latest page data when
dest VM starts.  So far I don't see how current code guaranteed that.

In short, we may want at least a way to make sure all zero-copy buffers are
finished being used and have been sent by the time some function returns
(e.g., qio_channel_flush()).  That may require us to do some accounting when
we call sendmsg(MSG_ZEROCOPY); meanwhile we should read out the ee_data field
within the SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() on the error queue,
and keep that information somewhere too.
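The accounting described above could look roughly like the sketch below.  All
names here are hypothetical (no such state exists in QEMU yet): one counter is
bumped per zerocopy sendmsg(), the other from each errqueue notification, and
the flush barrier holds once they match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state a qio_channel_flush() could maintain: count
 * zerocopy sends going out and completions coming back from the
 * socket error queue, and only report "flushed" when they match. */
typedef struct {
    uint32_t zc_sent;   /* sendmsg(..., MSG_ZEROCOPY) calls issued */
    uint32_t zc_acked;  /* sends retired via SO_EE_ORIGIN_ZEROCOPY */
} ZeroCopyAccounting;

/* Called after each successful zerocopy sendmsg() */
static void zc_on_send(ZeroCopyAccounting *s)
{
    s->zc_sent++;
}

/* Called per errqueue notification; the kernel coalesces completions,
 * so one message covers the sends numbered ee_info..ee_data inclusive */
static void zc_on_notification(ZeroCopyAccounting *s,
                               uint32_t ee_info, uint32_t ee_data)
{
    s->zc_acked += ee_data - ee_info + 1;
}

/* The "final ack" barrier: only once this is true is it safe to send
 * the last piece of the migration stream (device state) */
static bool zc_flushed(const ZeroCopyAccounting *s)
{
    return s->zc_acked == s->zc_sent;
}
```

A flush implementation would loop draining the error queue (feeding
zc_on_notification()) until zc_flushed() holds, which guarantees the guest
pages land before the device state does.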

Some other side notes that reached my mind..

The qio_channel_writev_full() may not be suitable for async operations, as the
name "full" implies synchronous to me.  So maybe we can add a new helper for
zero copy on the channel?

We may also want a new QIOChannelFeature like QIO_CHANNEL_FEATURE_ZEROCOPY;
then we fail qio_channel_writev_zerocopy() (or whatever name we come up with)
if that bit is not set in the QIOChannel features.

Thanks,

-- 
Peter Xu




Re: [PATCH] block/vvfat: Fix ro shared folder

2021-08-31 Thread Philippe Mathieu-Daudé
Hi Guillaume,

On 8/31/21 4:17 PM, Guillaume Roche wrote:
> QEMU exits in error when passing a vfat shared folder in read-only mode.
> 
> To fix this issue, this patch removes any potential write permission
> from cumulative_perms, when a read-only block device is in use.
> 
> Buglink: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=918950
> 
> Signed-off-by: Guillaume Roche 
> ---
> This is an attempt to fix this behavior, but it feels a bit hacky to me
> since this patch checks for the vvfat format in a generic function.

What about implementing bdrv_vvfat::bdrv_check_perm()?

> However, I'd be glad to have some advice to make it better. Anyway, I
> ran the block tests to ensure this does not introduce any regression.
> 
> To add some context: I know that this can be worked around by setting 
> the shared folder in rw mode. But our use-case requires using both
> shared and VM snapshots, and QEMU prevents using snapshot with a rw 
> shared folder.
> 
>  block.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/block.c b/block.c
> index e97ce0b1c8..3f59e3843f 100644
> --- a/block.c
> +++ b/block.c
> @@ -2383,6 +2383,12 @@ void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
>  cumulative_shared_perms &= c->shared_perm;
>  }
>  
> +/* Discard write permission if vvfat block device is read-only */
> +const char *format = bdrv_get_format_name(bs);
> +if (format != NULL && strncmp(format, "vvfat", 5) == 0 && bdrv_is_read_only(bs)) {
> +cumulative_perms &= ~BLK_PERM_WRITE;
> +}
> +
>  *perm = cumulative_perms;
>  *shared_perm = cumulative_shared_perms;
>  }
> 




[PATCH] block/vvfat: Fix ro shared folder

2021-08-31 Thread Guillaume Roche
QEMU exits in error when passing a vfat shared folder in read-only mode.

To fix this issue, this patch removes any potential write permission
from cumulative_perms, when a read-only block device is in use.

Buglink: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=918950

Signed-off-by: Guillaume Roche 
---
This is an attempt to fix this behavior, but it feels a bit hacky to me
since this patch checks for the vvfat format in a generic function.

However, I'd be glad to have some advice to make it better. Anyway, I
ran the block tests to ensure this does not introduce any regression.

To add some context: I know that this can be worked around by setting 
the shared folder in rw mode. But our use-case requires using both
shared and VM snapshots, and QEMU prevents using snapshot with a rw 
shared folder.

 block.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/block.c b/block.c
index e97ce0b1c8..3f59e3843f 100644
--- a/block.c
+++ b/block.c
@@ -2383,6 +2383,12 @@ void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
 cumulative_shared_perms &= c->shared_perm;
 }
 
+/* Discard write permission if vvfat block device is read-only */
+const char *format = bdrv_get_format_name(bs);
+if (format != NULL && strncmp(format, "vvfat", 5) == 0 && bdrv_is_read_only(bs)) {
+cumulative_perms &= ~BLK_PERM_WRITE;
+}
+
 *perm = cumulative_perms;
 *shared_perm = cumulative_shared_perms;
 }
-- 
2.31.1




[PATCH v1 1/3] io: Enable write flags for QIOChannel

2021-08-31 Thread Leonardo Bras
Some syscalls used for writing, such as sendmsg(), accept flags that
can modify their behavior, even allowing the usage of features such as
MSG_ZEROCOPY.

Change the qio_channel_write*() interface to allow passing down flags,
allowing a more flexible use of IOChannel.

At first, its use is enabled for QIOChannelSocket, but it can be easily
extended to any other QIOChannel implementation.

Signed-off-by: Leonardo Bras 
---
 chardev/char-io.c   |  2 +-
 hw/remote/mpqemu-link.c |  2 +-
 include/io/channel.h| 56 -
 io/channel-buffer.c |  1 +
 io/channel-command.c|  1 +
 io/channel-file.c   |  1 +
 io/channel-socket.c |  4 ++-
 io/channel-tls.c|  1 +
 io/channel-websock.c|  1 +
 io/channel.c| 53 ++-
 migration/rdma.c|  1 +
 scsi/pr-manager-helper.c|  2 +-
 tests/unit/test-io-channel-socket.c |  1 +
 13 files changed, 81 insertions(+), 45 deletions(-)

diff --git a/chardev/char-io.c b/chardev/char-io.c
index 8ced184160..4ea7b1ee2a 100644
--- a/chardev/char-io.c
+++ b/chardev/char-io.c
@@ -122,7 +122,7 @@ int io_channel_send_full(QIOChannel *ioc,
 
 ret = qio_channel_writev_full(
 ioc, &iov, 1,
-fds, nfds, NULL);
+fds, 0, nfds, NULL);
 if (ret == QIO_CHANNEL_ERR_BLOCK) {
 if (offset) {
 return offset;
diff --git a/hw/remote/mpqemu-link.c b/hw/remote/mpqemu-link.c
index 7e841820e5..0d13321ef0 100644
--- a/hw/remote/mpqemu-link.c
+++ b/hw/remote/mpqemu-link.c
@@ -69,7 +69,7 @@ bool mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc, Error **errp)
 }
 
 if (!qio_channel_writev_full_all(ioc, send, G_N_ELEMENTS(send),
-fds, nfds, errp)) {
+ fds, nfds, 0, errp)) {
 ret = true;
 } else {
 trace_mpqemu_send_io_error(msg->cmd, msg->size, nfds);
diff --git a/include/io/channel.h b/include/io/channel.h
index 88988979f8..dada9ebaaf 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -104,6 +104,7 @@ struct QIOChannelClass {
  size_t niov,
  int *fds,
  size_t nfds,
+ int flags,
  Error **errp);
 ssize_t (*io_readv)(QIOChannel *ioc,
 const struct iovec *iov,
@@ -260,6 +261,7 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
 size_t niov,
 int *fds,
 size_t nfds,
+int flags,
 Error **errp);
 
 /**
@@ -325,6 +327,7 @@ int qio_channel_readv_all(QIOChannel *ioc,
  * @ioc: the channel object
  * @iov: the array of memory regions to write data from
  * @niov: the length of the @iov array
+ * @flags: optional sending flags
  * @errp: pointer to a NULL-initialized error object
  *
  * Write data to the IO channel, reading it from the
@@ -339,10 +342,14 @@ int qio_channel_readv_all(QIOChannel *ioc,
  *
  * Returns: 0 if all bytes were written, or -1 on error
  */
-int qio_channel_writev_all(QIOChannel *ioc,
-   const struct iovec *iov,
-   size_t niov,
-   Error **erp);
+int qio_channel_writev_all_flags(QIOChannel *ioc,
+ const struct iovec *iov,
+ size_t niov,
+ int flags,
+ Error **errp);
+
+#define qio_channel_writev_all(ioc, iov, niov, errp) \
+qio_channel_writev_all_flags(ioc, iov, niov, 0, errp)
 
 /**
  * qio_channel_readv:
@@ -364,15 +371,21 @@ ssize_t qio_channel_readv(QIOChannel *ioc,
  * @ioc: the channel object
  * @iov: the array of memory regions to write data from
  * @niov: the length of the @iov array
+ * @flags: optional sending flags
  * @errp: pointer to a NULL-initialized error object
  *
  * Behaves as qio_channel_writev_full() but does not support
  * sending of file handles.
  */
-ssize_t qio_channel_writev(QIOChannel *ioc,
-   const struct iovec *iov,
-   size_t niov,
-   Error **errp);
+ssize_t qio_channel_writev_flags(QIOChannel *ioc,
+ const struct iovec *iov,
+ size_t niov,
+ int flags,
+ Error **errp);
+
+#define qio_channel_writev(ioc, iov, niov, errp) \
+qio_channel_writev_flags(ioc, iov, niov, 0, errp)
+
 
 /**
  * qio_channel_read:
@@ -395,16 +408,21 @@ ssize_t qio_channel_read(QIOChannel *ioc,
  * @ioc: the channel object
  * @buf: the memory regions to send data from

[PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-08-31 Thread Leonardo Bras
Call qio_channel_set_zerocopy(true) in the start of every multifd thread.

Change the send_write() interface of multifd, allowing it to pass down
flags for qio_channel_write*().

Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
other data being sent at the default copying approach.

Signed-off-by: Leonardo Bras 
---
 migration/multifd-zlib.c | 7 ---
 migration/multifd-zstd.c | 7 ---
 migration/multifd.c  | 9 ++---
 migration/multifd.h  | 3 ++-
 4 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index ab4ba75d75..d8cce1810a 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -160,12 +160,13 @@ static int zlib_send_prepare(MultiFDSendParams *p, uint32_t used, Error **errp)
  * @used: number of pages used
  * @errp: pointer to an error
  */
-static int zlib_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
+static int zlib_send_write(MultiFDSendParams *p, uint32_t used, int flags,
+   Error **errp)
 {
 struct zlib_data *z = p->data;
 
-return qio_channel_write_all(p->c, (void *)z->zbuff, p->next_packet_size,
- errp);
+return qio_channel_write_all_flags(p->c, (void *)z->zbuff,
+   p->next_packet_size, flags, errp);
 }
 
 /**
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 693bddf8c9..fa063fd33e 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -171,12 +171,13 @@ static int zstd_send_prepare(MultiFDSendParams *p, uint32_t used, Error **errp)
  * @used: number of pages used
  * @errp: pointer to an error
  */
-static int zstd_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
+static int zstd_send_write(MultiFDSendParams *p, uint32_t used, int flags,
+   Error **errp)
 {
 struct zstd_data *z = p->data;
 
-return qio_channel_write_all(p->c, (void *)z->zbuff, p->next_packet_size,
- errp);
+return qio_channel_write_all_flags(p->c, (void *)z->zbuff,
+   p->next_packet_size, flags, errp);
 }
 
 /**
diff --git a/migration/multifd.c b/migration/multifd.c
index 377da78f5b..097621c12c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -103,9 +103,10 @@ static int nocomp_send_prepare(MultiFDSendParams *p, uint32_t used,
  * @used: number of pages used
  * @errp: pointer to an error
  */
-static int nocomp_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
+static int nocomp_send_write(MultiFDSendParams *p, uint32_t used, int flags,
+ Error **errp)
 {
-return qio_channel_writev_all(p->c, p->pages->iov, used, errp);
+return qio_channel_writev_all_flags(p->c, p->pages->iov, used, flags, 
errp);
 }
 
 /**
@@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
 }
 
 if (used) {
-ret = multifd_send_state->ops->send_write(p, used, &local_err);
+ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
+  &local_err);
 if (ret != 0) {
 break;
 }
@@ -815,6 +817,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
 } else {
 /* update for tls qio channel */
 p->c = ioc;
+qio_channel_set_zerocopy(ioc, true);
 qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
QEMU_THREAD_JOINABLE);
}
diff --git a/migration/multifd.h b/migration/multifd.h
index 8d6751f5ed..7243ba4185 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -157,7 +157,8 @@ typedef struct {
 /* Prepare the send packet */
 int (*send_prepare)(MultiFDSendParams *p, uint32_t used, Error **errp);
 /* Write the send packet */
-int (*send_write)(MultiFDSendParams *p, uint32_t used, Error **errp);
+int (*send_write)(MultiFDSendParams *p, uint32_t used, int flags,
+  Error **errp);
 /* Setup for receiving side */
 int (*recv_setup)(MultiFDRecvParams *p, Error **errp);
 /* Cleanup for receiving side */
-- 
2.33.0




[PATCH v1 2/3] io: Add zerocopy and errqueue

2021-08-31 Thread Leonardo Bras
MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
send calls. It does so by avoiding copying user data into kernel buffers.

To make it work, three steps are needed:
1 - A setsockopt() system call, enabling SO_ZEROCOPY
2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
3 - Process the socket's error queue, dealing with any error
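The three steps above can be sketched outside QEMU as follows.  This is a
minimal illustration, not code from the series; the helper name
try_enable_zerocopy() is made up, and the fallback mirrors what the series
intends for kernels older than 4.14:

```c
#include <assert.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60          /* socket option, Linux >= 4.14 */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000  /* per-call send flag */
#endif

/* Step 1: opt in per socket.  Returns the flag to use for step 2's
 * sendmsg() calls, or 0 to fall back to ordinary copying sends
 * (e.g. on a pre-4.14 kernel, where the setsockopt() fails). */
static int try_enable_zerocopy(int fd)
{
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0) {
        return 0;
    }
    return MSG_ZEROCOPY;
}
```

A caller would then issue sendmsg(fd, &msg, flag) per packet (step 2), and
periodically drain recvmsg(fd, &msg, MSG_ERRQUEUE) to reap completions and
errors (step 3).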

Zerocopy has its costs, so it only gives improved performance if
the sending buffer is big (10KB, according to Linux docs).

Step 2 makes it possible to use the same socket to send data
using both zerocopy and the default copying approach, so the application
can choose what is best for each packet.

To implement step 1, an optional set_zerocopy() interface was created
in QIOChannel, allowing each caller to enable or disable it.

Step 2 will be enabled by the calling code at each qio_channel_write*()
that would benefit from zerocopy.

Step 3 is done with qio_channel_socket_errq_proc(), which runs after
SOCKET_ERRQ_THRESH (16k) iovs have been sent, dealing with any error found.

Signed-off-by: Leonardo Bras 
---
 include/io/channel-socket.h |  2 +
 include/io/channel.h| 29 ++
 io/channel-socket.c | 76 +
 io/channel-tls.c| 11 ++
 io/channel-websock.c|  9 +
 io/channel.c| 11 ++
 6 files changed, 138 insertions(+)

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index e747e63514..09dffe059f 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -47,6 +47,8 @@ struct QIOChannelSocket {
 socklen_t localAddrLen;
 struct sockaddr_storage remoteAddr;
 socklen_t remoteAddrLen;
+size_t errq_pending;
+bool zerocopy_enabled;
 };
 
 
diff --git a/include/io/channel.h b/include/io/channel.h
index dada9ebaaf..de10a78b10 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -137,6 +137,8 @@ struct QIOChannelClass {
   IOHandler *io_read,
   IOHandler *io_write,
   void *opaque);
+void (*io_set_zerocopy)(QIOChannel *ioc,
+bool enabled);
 };
 
 /* General I/O handling functions */
@@ -570,6 +572,33 @@ int qio_channel_shutdown(QIOChannel *ioc,
 void qio_channel_set_delay(QIOChannel *ioc,
bool enabled);
 
+/**
+ * qio_channel_set_zerocopy:
+ * @ioc: the channel object
+ * @enabled: the new flag state
+ *
+ * Controls whether the underlying transport is
+ * permitted to use zerocopy to avoid copying the
+ * sending buffer in kernel. If @enabled is true, then the
+ * writes may avoid buffer copy in kernel. If @enabled
+ * is false, writes will cause the kernel to always
+ * copy the buffer contents before sending.
+ *
+ * In order to make a write with the zerocopy feature,
+ * it's also necessary to send each packet with the
+ * MSG_ZEROCOPY flag. With this, it's possible to
+ * select only writes that would benefit from the
+ * use of the zerocopy feature, i.e. the ones with
+ * larger buffers.
+ *
+ * This feature was added in Linux 4.14, so older
+ * versions will fail to enable it. This is not an
+ * issue, since it will fall back to the default
+ * copying approach.
+ */
+void qio_channel_set_zerocopy(QIOChannel *ioc,
+  bool enabled);
+
 /**
  * qio_channel_set_cork:
  * @ioc: the channel object
diff --git a/io/channel-socket.c b/io/channel-socket.c
index e377e7303d..a69fec7315 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -26,8 +26,10 @@
 #include "io/channel-watch.h"
 #include "trace.h"
 #include "qapi/clone-visitor.h"
+#include 
 
 #define SOCKET_MAX_FDS 16
+#define SOCKET_ERRQ_THRESH 16384
 
 SocketAddress *
 qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
@@ -55,6 +57,8 @@ qio_channel_socket_new(void)
 
 sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
 sioc->fd = -1;
+sioc->zerocopy_enabled = false;
+sioc->errq_pending = 0;
 
 ioc = QIO_CHANNEL(sioc);
 qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
@@ -520,6 +524,54 @@ static ssize_t qio_channel_socket_readv(QIOChannel *ioc,
 return ret;
 }
 
+static void qio_channel_socket_errq_proc(QIOChannelSocket *sioc,
+ Error **errp)
+{
+int fd = sioc->fd;
+int ret;
+struct msghdr msg = {};
+struct sock_extended_err *serr;
+struct cmsghdr *cm;
+
+do {
+ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+if (ret <= 0) {
+if (ret == 0 || errno == EAGAIN) {
+/* Nothing on errqueue */
+ sioc->errq_pending = 0;
+ break;
+}
+if (errno == EINTR) {
+continue;
+}
+
+error_setg_errno(errp, errno,
+ "Unable to read errqueue");
+break;
+}
+
+cm = CMSG_F

[PATCH v1 0/3] QIOChannel flags + multifd zerocopy

2021-08-31 Thread Leonardo Bras
This patch series intends to enable MSG_ZEROCOPY in QIOChannel, and make
use of it for multifd migration performance improvement.

Patch #1 enables the use of flags on qio_channel_write*(), allowing
more flexibility in using the channel. 
It was designed for MSG_ZEROCOPY usage, for which it's a good idea
to have an easy way to choose which packets are sent with the flag, but
it also makes the channel more flexible for future usage.

Patch #2 just adds the MSG_ZEROCOPY feature, and defines the enablement
mechanics, while not enabling it in any code.

Patch #3 enables MSG_ZEROCOPY for migration / multifd.


Results:
So far, the resource usage of __sys_sendmsg() dropped 15-fold, and the
overall migration took 13-18% less time, based on a synthetic workload.

The objective is to reduce migration time on hosts with heavy CPU usage.

Leonardo Bras (3):
  io: Enable write flags for QIOChannel
  io: Add zerocopy and errqueue
  migration: multifd: Enable zerocopy

 chardev/char-io.c   |  2 +-
 hw/remote/mpqemu-link.c |  2 +-
 include/io/channel-socket.h |  2 +
 include/io/channel.h| 85 +++--
 io/channel-buffer.c |  1 +
 io/channel-command.c|  1 +
 io/channel-file.c   |  1 +
 io/channel-socket.c | 80 ++-
 io/channel-tls.c| 12 
 io/channel-websock.c| 10 
 io/channel.c| 64 +-
 migration/multifd-zlib.c|  7 ++-
 migration/multifd-zstd.c|  7 ++-
 migration/multifd.c |  9 ++-
 migration/multifd.h |  3 +-
 migration/rdma.c|  1 +
 scsi/pr-manager-helper.c|  2 +-
 tests/unit/test-io-channel-socket.c |  1 +
 18 files changed, 235 insertions(+), 55 deletions(-)

-- 
2.33.0




Re: Virtual FAT disk images

2021-08-31 Thread Pascal
up :-)
does nobody use this feature of QEMU?

On Fri, Aug 27, 2021 at 11:11 AM, Pascal  wrote:

> hello everybody,
>
> virtual FAT disk image - *which is a convenient way to transfer files to
> the guest without having to activate its network* - seems to work very
> poorly with Windows: do you have the same difficulties?
>
> context : up to date archlinux, qemu 6.0.0, fresh installed windows 10
> 21H1.
>
> /usr/bin/qemu-system-x86_64 -accel kvm -machine q35 -m 1024 -device
> nec-usb-xhci -device usb-tablet -cpu qemu64,kvm=off -parallel null -serial
> mon:stdio -hda windows.disk -hdb fat:rw:/tmp/test/
>
> access to the E: drive is extremely slow and the system events report many
> storahci (129: reinit \device\RaidPort0 sent) and disk (153: I/O @0x3f on
> disk 1 \device\0001f replayed) warnings.
>
> regards.
>


Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-08-31 Thread Daniel P . Berrangé
On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> 
> Change the send_write() interface of multifd, allowing it to pass down
> flags for qio_channel_write*().
> 
> Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> other data being sent at the default copying approach.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  migration/multifd-zlib.c | 7 ---
>  migration/multifd-zstd.c | 7 ---
>  migration/multifd.c  | 9 ++---
>  migration/multifd.h  | 3 ++-
>  4 files changed, 16 insertions(+), 10 deletions(-)

> @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
>  }
>  
>  if (used) {
> -ret = multifd_send_state->ops->send_write(p, used, &local_err);
> +ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> +  &local_err);

I don't think it is valid to unconditionally enable this feature due to the
resource usage implications

https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html

  "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
   if the socket option was not set, the socket exceeds its optmem 
   limit or the user exceeds its ulimit on locked pages."

The limit on locked pages is something that looks very likely to be
exceeded unless you happen to be running a QEMU config that already
implies locked memory (eg PCI assignment)
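A hedged sketch of the kind of precondition check this implies, comparing
RLIMIT_MEMLOCK against what migration would pin.  The helper is hypothetical,
not something the series adds:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical gate for a "zerocopy" capability: only allow
 * MSG_ZEROCOPY when the locked-pages ulimit can cover what we expect
 * to pin (roughly the in-flight guest pages per multifd channel).
 * Default soft limits are often tiny (64 KiB on many distros), which
 * is why unconditional enabling risks ENOBUFS at send time. */
static bool zerocopy_fits_memlock_limit(size_t bytes_to_pin)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) < 0) {
        return false;
    }
    return rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= bytes_to_pin;
}
```

QEMU configs that already lock memory (e.g. PCI assignment) typically run
with a raised or unlimited RLIMIT_MEMLOCK, which is why they are the
exception Daniel mentions.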


Regards,
Daniel
-- 
|: https://berrange.com       -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org        -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-08-31 Thread Daniel P . Berrangé
On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> send calls. It does so by avoiding copying user data into kernel buffers.
> 
> To make it work, three steps are needed:
> 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> 3 - Process the socket's error queue, dealing with any error

AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.

It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
from a synchronous call to an asynchronous call.

It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
until an asynchronous completion notification has been received from
the socket error queue. These notifications are not required to
arrive in-order, even for a TCP stream, because the kernel hangs on
to the buffer if a re-transmit is needed.

https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html

  "Page pinning also changes system call semantics. It temporarily 
   shares the buffer between process and network stack. Unlike with
   copying, the process cannot immediately overwrite the buffer 
   after system call return without possibly modifying the data in 
   flight. Kernel integrity is not affected, but a buggy program
   can possibly corrupt its own data stream."

AFAICT, the design added in this patch does not provide any way
to honour these requirements around buffer lifetime.

I can't see how we can introduce MSG_ZEROCOPY in any seamless
way. The buffer lifetime requirements imply need for an API
design that is fundamentally different for asynchronous usage,
with a callback to notify when the write has finished/failed.

> Zerocopy has its costs, so it only gives improved performance if
> the sending buffer is big (10KB, according to Linux docs).
> 
> Step 2 makes it possible to use the same socket to send data
> using both zerocopy and the default copying approach, so the application
> can choose what is best for each packet.
> 
> To implement step 1, an optional set_zerocopy() interface was created
> in QIOChannel, allowing each caller to enable or disable it.
> 
> Step 2 will be enabled by the calling code at each qio_channel_write*()
> that would benefit from zerocopy.
> 
> Step 3 is done with qio_channel_socket_errq_proc(), which runs after
> SOCKET_ERRQ_THRESH (16k) iovs have been sent, dealing with any error found.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  include/io/channel-socket.h |  2 +
>  include/io/channel.h| 29 ++
>  io/channel-socket.c | 76 +
>  io/channel-tls.c| 11 ++
>  io/channel-websock.c|  9 +
>  io/channel.c| 11 ++
>  6 files changed, 138 insertions(+)
> 
> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> index e747e63514..09dffe059f 100644
> --- a/include/io/channel-socket.h
> +++ b/include/io/channel-socket.h
> @@ -47,6 +47,8 @@ struct QIOChannelSocket {
>  socklen_t localAddrLen;
>  struct sockaddr_storage remoteAddr;
>  socklen_t remoteAddrLen;
> +size_t errq_pending;
> +bool zerocopy_enabled;
>  };
>  
>  
> diff --git a/include/io/channel.h b/include/io/channel.h
> index dada9ebaaf..de10a78b10 100644
> --- a/include/io/channel.h
> +++ b/include/io/channel.h
> @@ -137,6 +137,8 @@ struct QIOChannelClass {
>IOHandler *io_read,
>IOHandler *io_write,
>void *opaque);
> +void (*io_set_zerocopy)(QIOChannel *ioc,
> +bool enabled);
>  };
>  
>  /* General I/O handling functions */
> @@ -570,6 +572,33 @@ int qio_channel_shutdown(QIOChannel *ioc,
>  void qio_channel_set_delay(QIOChannel *ioc,
> bool enabled);
>  
> +/**
> + * qio_channel_set_zerocopy:
> + * @ioc: the channel object
> + * @enabled: the new flag state
> + *
> + * Controls whether the underlying transport is
> + * permitted to use zerocopy to avoid copying the
> + * sending buffer in kernel. If @enabled is true, then the
> + * writes may avoid buffer copy in kernel. If @enabled
> + * is false, writes will cause the kernel to always
> + * copy the buffer contents before sending.
> + *
> + * In order to make a write with the zerocopy feature,
> + * it's also necessary to send each packet with the
> + * MSG_ZEROCOPY flag. With this, it's possible to
> + * select only writes that would benefit from the
> + * use of the zerocopy feature, i.e. the ones with
> + * larger buffers.
> + *
> + * This feature was added in Linux 4.14, so older
> + * versions will fail to enable it. This is not an
> + * issue, since it will fall back to the default
> + * copying approach.
> + */
> +void qio_channel_set_zerocopy(QIOChannel *ioc,
> +  bool enabled);
> +
>  /**
>   * qio_channel_set_cork:
>   * @i

Re: block/curl: should we be checking curl_easy_setopt() for errors?

2021-08-31 Thread Peter Maydell
On Tue, 31 Aug 2021 at 09:44, Daniel P. Berrangé  wrote:
>
> On Mon, Aug 30, 2021 at 04:34:56PM +0100, Peter Maydell wrote:
> > Coverity complains (CID 1460331, 1459482, 1459336, 1458895)
> > that we call curl_easy_setopt(), which can return an error value,
> > but we never check the return value.
> >
> > Is it correct? Looking at the libcurl documentation, the function
> > does return an error status, and there's nothing that says it's
> > ok to ignore (e.g. that it's guaranteed that the library will
> > safely accumulate any errors and return them when you make the
> > subsequent curl_easy_perform() call). On the other hand, neither
> > the libcurl manpage example nor the handful of example programs
> > at https://curl.se/libcurl/c/example.html ever seem to check the
> > return value from curl_easy_setopt()...
>
> Options that accept a string can return CURLE_OUT_OF_MEMORY from
> curl_easy_setopt. Most other failures seem to be reporting caller
> errors such as forgotten arguments, or too long strings.
>
> Does look like we ought to check return status though for at
> least some of the options, and if you check it for some then
> coverity will also complain if you don't check it for all.

Coverity complains about them all regardless -- I think that the scan
servers have been updated with models for some of these common
software libraries, and it now "knows" that the function has
an error-return that must be checked.

-- PMM



Re: block/curl: should we be checking curl_easy_setopt() for errors?

2021-08-31 Thread Daniel P . Berrangé
On Mon, Aug 30, 2021 at 04:34:56PM +0100, Peter Maydell wrote:
> Coverity complains (CID 1460331, 1459482, 1459336, 1458895)
> that we call curl_easy_setopt(), which can return an error value,
> but we never check the return value.
> 
> Is it correct? Looking at the libcurl documentation, the function
> does return an error status, and there's nothing that says it's
> ok to ignore (e.g. that it's guaranteed that the library will
> safely accumulate any errors and return them when you make the
> subsequent curl_easy_perform() call). On the other hand, neither
> the libcurl manpage example nor the handful of example programs
> at https://curl.se/libcurl/c/example.html ever seem to check the
> return value from curl_easy_setopt()...

Options that accept a string can return CURLE_OUT_OF_MEMORY from
curl_easy_setopt. Most other failures seem to be reporting caller
errors such as forgotten arguments, or too long strings.

Does look like we ought to check return status though for at
least some of the options, and if you check it for some then
coverity will also complain if you don't check it for all.


Regards,
Daniel
-- 
|: https://berrange.com       -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org        -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|