On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote: > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket > send calls. It does so by avoiding copying user data into kernel buffers. > > To make it work, three steps are needed: > 1 - A setsockopt() system call, enabling SO_ZEROCOPY > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall > 3 - Process the socket's error queue, dealing with any error
AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY. It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg() from a synchronous call to an asynchronous call. It is forbidden to overwrite/reuse/free the buffer passed to sendmsg until an asynchronous completion notification has been received from the socket error queue. These notifications are not required to arrive in-order, even for a TCP stream, because the kernel hangs on to the buffer if a re-transmit is needed. https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html "Page pinning also changes system call semantics. It temporarily shares the buffer between process and network stack. Unlike with copying, the process cannot immediately overwrite the buffer after system call return without possibly modifying the data in flight. Kernel integrity is not affected, but a buggy program can possibly corrupt its own data stream." AFAICT, the design added in this patch does not provide any way to honour these requirements around buffer lifetime. I can't see how we can introduce MSG_ZEROCOPY in any seemless way. The buffer lifetime requirements imply need for an API design that is fundamentally different for asynchronous usage, with a callback to notify when the write has finished/failed. > Zerocopy has it's costs, so it will only get improved performance if > the sending buffer is big (10KB, according to Linux docs). > > The step 2 makes it possible to use the same socket to send data > using both zerocopy and the default copying approach, so the application > cat choose what is best for each packet. > > To implement step 1, an optional set_zerocopy() interface was created > in QIOChannel, allowing each using code to enable or disable it. > > Step 2 will be enabled by the using code at each qio_channel_write*() > that would benefit of zerocopy; > > Step 3 is done with qio_channel_socket_errq_proc(), that runs after > SOCKET_ERRQ_THRESH (16k) iovs sent, dealing with any error found. > > Signed-off-by: Leonardo Bras <leob...@redhat.com> > --- > include/io/channel-socket.h | 2 + > include/io/channel.h | 29 ++++++++++++++ > io/channel-socket.c | 76 +++++++++++++++++++++++++++++++++++++ > io/channel-tls.c | 11 ++++++ > io/channel-websock.c | 9 +++++ > io/channel.c | 11 ++++++ > 6 files changed, 138 insertions(+) > > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h > index e747e63514..09dffe059f 100644 > --- a/include/io/channel-socket.h > +++ b/include/io/channel-socket.h > @@ -47,6 +47,8 @@ struct QIOChannelSocket { > socklen_t localAddrLen; > struct sockaddr_storage remoteAddr; > socklen_t remoteAddrLen; > + size_t errq_pending; > + bool zerocopy_enabled; > }; > > > diff --git a/include/io/channel.h b/include/io/channel.h > index dada9ebaaf..de10a78b10 100644 > --- a/include/io/channel.h > +++ b/include/io/channel.h > @@ -137,6 +137,8 @@ struct QIOChannelClass { > IOHandler *io_read, > IOHandler *io_write, > void *opaque); > + void (*io_set_zerocopy)(QIOChannel *ioc, > + bool enabled); > }; > > /* General I/O handling functions */ > @@ -570,6 +572,33 @@ int qio_channel_shutdown(QIOChannel *ioc, > void qio_channel_set_delay(QIOChannel *ioc, > bool enabled); > > +/** > + * qio_channel_set_zerocopy: > + * @ioc: the channel object > + * @enabled: the new flag state > + * > + * Controls whether the underlying transport is > + * permitted to use zerocopy to avoid copying the > + * sending buffer in kernel. If @enabled is true, then the > + * writes may avoid buffer copy in kernel. If @enabled > + * is false, writes will cause the kernel to always > + * copy the buffer contents before sending. > + * > + * In order to use make a write with zerocopy feature, > + * it's also necessary to sent each packet with > + * MSG_ZEROCOPY flag. With this, it's possible to > + * to select only writes that would benefit from the > + * use of zerocopy feature, i.e. the ones with larger > + * buffers. > + * > + * This feature was added in Linux 4.14, so older > + * versions will fail on enabling. This is not an > + * issue, since it will fall-back to default copying > + * approach. > + */ > +void qio_channel_set_zerocopy(QIOChannel *ioc, > + bool enabled); > + > /** > * qio_channel_set_cork: > * @ioc: the channel object > diff --git a/io/channel-socket.c b/io/channel-socket.c > index e377e7303d..a69fec7315 100644 > --- a/io/channel-socket.c > +++ b/io/channel-socket.c > @@ -26,8 +26,10 @@ > #include "io/channel-watch.h" > #include "trace.h" > #include "qapi/clone-visitor.h" > +#include <linux/errqueue.h> That's going to fail to biuld on non-Linux > > #define SOCKET_MAX_FDS 16 > +#define SOCKET_ERRQ_THRESH 16384 > > SocketAddress * > qio_channel_socket_get_local_address(QIOChannelSocket *ioc, > @@ -55,6 +57,8 @@ qio_channel_socket_new(void) > > sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET)); > sioc->fd = -1; > + sioc->zerocopy_enabled = false; > + sioc->errq_pending = 0; > > ioc = QIO_CHANNEL(sioc); > qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN); > @@ -520,6 +524,54 @@ static ssize_t qio_channel_socket_readv(QIOChannel *ioc, > return ret; > } > > +static void qio_channel_socket_errq_proc(QIOChannelSocket *sioc, > + Error **errp) > +{ > + int fd = sioc->fd; > + int ret; > + struct msghdr msg = {}; > + struct sock_extended_err *serr; > + struct cmsghdr *cm; > + > + do { > + ret = recvmsg(fd, &msg, MSG_ERRQUEUE); > + if (ret <= 0) { > + if (ret == 0 || errno == EAGAIN) { > + /* Nothing on errqueue */ > + sioc->errq_pending = 0; > + break; > + } > + if (errno == EINTR) { > + continue; > + } > + > + error_setg_errno(errp, errno, > + "Unable to read errqueue"); > + break; > + } > + > + cm = CMSG_FIRSTHDR(&msg); > + if (cm->cmsg_level != SOL_IP && > + cm->cmsg_type != IP_RECVERR) { > + error_setg_errno(errp, EPROTOTYPE, > + "Wrong cmsg in errqueue"); > + break; > + } > + > + serr = (void *) CMSG_DATA(cm); > + if (serr->ee_errno != 0) { > + error_setg_errno(errp, serr->ee_errno, > + "Error on socket"); > + break; > + } > + if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) { > + error_setg_errno(errp, serr->ee_origin, > + "Error not from zerocopy"); > + break; > + } > + } while (true); > +} > + > static ssize_t qio_channel_socket_writev(QIOChannel *ioc, > const struct iovec *iov, > size_t niov, > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, > "Unable to write to socket"); > return -1; > } > + > + if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) { > + sioc->errq_pending += niov; > + if (sioc->errq_pending > SOCKET_ERRQ_THRESH) { > + qio_channel_socket_errq_proc(sioc, errp); > + } > + } This silently looses any errors set from upto the final SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set. Also means if you have a series of writes without MSG_ZEROCOPY, it'll delay checking any pending errors. I would suggest checkig in close(), but as mentioned earlier, I think the design is flawed because the caller fundamentally needs to know about completion for every single write they make in order to know when the buffer can be released / reused. > +static void > +qio_channel_socket_set_zerocopy(QIOChannel *ioc, > + bool enabled) > +{ > + QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc); > + int v = enabled ? 1 : 0; > + int ret; > + > + ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)); > + if (ret >= 0) { > + sioc->zerocopy_enabled = true; > + } > +} Surely we need to tell the caller wether this succeeed, otherwise the later sendmsg() is going to fail with EINVAL on older kernels where MSG_ZEROCOPY is not supported. > diff --git a/io/channel-tls.c b/io/channel-tls.c > index 4ce890a538..bf44b0f7b0 100644 > --- a/io/channel-tls.c > +++ b/io/channel-tls.c > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc, > qio_channel_set_delay(tioc->master, enabled); > } > > + > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc, > + bool enabled) > +{ > + QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc); > + > + qio_channel_set_zerocopy(tioc->master, enabled); > +} This is going to be unsafe because gnutls will discard/reuse the buffer for the ciphertext after every write(). I can't see a way to safely enable MSG_ZEROCOPY when TLS is used. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|