On Wed, Jan 29, 2014 at 01:04:28PM +0100, Antonios Motakis wrote:
> Hello,
>
> On Mon, Jan 27, 2014 at 5:49 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> > On Mon, Jan 27, 2014 at 05:37:02PM +0100, Antonios Motakis wrote:
> > > Hello again,
> > >
> > > On Wed, Jan 15, 2014 at 3:49 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> > > > On Wed, Jan 15, 2014 at 01:50:47PM +0100, Antonios Motakis wrote:
> > > > > On Wed, Jan 15, 2014 at 10:07 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> > > > > > On Tue, Jan 14, 2014 at 07:13:43PM +0100, Antonios Motakis wrote:
> > > > > > > On Tue, Jan 14, 2014 at 12:33 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> > > > > > > > On Mon, Jan 13, 2014 at 03:25:11PM +0100, Antonios Motakis wrote:
> > > > > > > > > In this patch series we would like to introduce our approach
> > > > > > > > > for putting a virtio-net backend in an external userspace
> > > > > > > > > process. Our eventual target is to run the network backend in
> > > > > > > > > the Snabbswitch ethernet switch, while receiving traffic from
> > > > > > > > > a guest inside QEMU/KVM which runs an unmodified virtio-net
> > > > > > > > > implementation.
> > > > > > > > >
> > > > > > > > > For this, we are working on extending vhost to allow
> > > > > > > > > equivalent functionality for userspace. Vhost already passes
> > > > > > > > > control of the data plane of virtio-net to the host kernel;
> > > > > > > > > we want to realize a similar model, but for userspace.
> > > > > > > > >
> > > > > > > > > In this patch series the concept of a vhost-backend is
> > > > > > > > > introduced.
> > > > > > > > >
> > > > > > > > > We define two vhost backend types - vhost-kernel and
> > > > > > > > > vhost-user. The former is the interface to the current kernel
> > > > > > > > > module implementation. Its control plane is ioctl based. The
> > > > > > > > > data plane is the kernel directly accessing the QEMU
> > > > > > > > > allocated, guest memory.
> > > > > > > > >
> > > > > > > > > In the new vhost-user backend, the control plane is based on
> > > > > > > > > communication between QEMU and another userspace process
> > > > > > > > > using a unix domain socket. This allows implementing a virtio
> > > > > > > > > backend for a guest running in QEMU, inside the other
> > > > > > > > > userspace process.
> > > > > > > > >
> > > > > > > > > We change -mem-path to QemuOpts and add prealloc, share and
> > > > > > > > > unlink as properties to it. The HugeTLBFS requirement of
> > > > > > > > > -mem-path is relaxed, so any valid path can be used now. The
> > > > > > > > > new properties allow more fine-grained control over the guest
> > > > > > > > > RAM backing store.
> > > > > > > > >
> > > > > > > > > The data path is realized by directly accessing the vrings
> > > > > > > > > and the buffer data off the guest's memory.
> > > > > > > > >
> > > > > > > > > The current user of vhost-user is only vhost-net. We add a
> > > > > > > > > new netdev backend that is intended to initialize vhost-net
> > > > > > > > > with the vhost-user backend.
> > > > > > > >
> > > > > > > > Some meta comments.
> > > > > > > >
> > > > > > > > Something that makes this patch harder to review is how it's
> > > > > > > > split up. Generally IMHO it's not a good idea to repeatedly
> > > > > > > > edit the same part of a file, adding stuff in patch after
> > > > > > > > patch; it only makes things harder to read if you add stubs,
> > > > > > > > then fill them up. (We do this sometimes when we are changing
> > > > > > > > existing code, but it is generally not needed when adding new
> > > > > > > > code.)
> > > > > > > >
> > > > > > > > Instead, split it like this:
> > > > > > > >
> > > > > > > > 1. general refactoring: split out linux-specific and generic
> > > > > > > >    parts and add the ops indirection
> > > > > > > > 2. add new files for vhost-user with a complete implementation.
> > > > > > > >    without a command line to support it, there will be no way
> > > > > > > >    to use it, but it should build fine.
> > > > > > > > 3. tie it all up with option parsing
> > > > > > > >
> > > > > > > > Generic vhost and vhost-net files should be kept separate.
> > > > > > > > Don't let vhost-net stuff seep back into generic files; we
> > > > > > > > have vhost-scsi too. I would also prefer that userspace vhost
> > > > > > > > has its own files.
> > > > > > >
> > > > > > > Ok, we'll take this into account.
> > > > > > >
> > > > > > > > We need a small test server qemu can talk to, to verify things
> > > > > > > > actually work.
> > > > > > >
> > > > > > > We have implemented such a test app:
> > > > > > > https://github.com/virtualopensystems/vapp
> > > > > > >
> > > > > > > We use it for testing, and also as a reference implementation.
> > > > > > > A client is also included.
> > > > > >
> > > > > > Sounds good. Can we include this in qemu and tie it into the
> > > > > > qtest framework? From a brief look, it merely needs to be tweaked
> > > > > > for portability, unless
> > > > > >
> > > > > > > > Already commented on: reuse the chardev syntax and preferably
> > > > > > > > code. We already support a bunch of options there for domain
> > > > > > > > sockets that will be useful here; they should work here as
> > > > > > > > well.
> > > > > > >
> > > > > > > We adapted the syntax for this to be consistent with chardev.
> > > > > > > What we didn't use, it is not at all obvious to us how they
> > > > > > > should be used; a lot of the chardev options just don't apply
> > > > > > > to us.
> > > > > >
> > > > > > Well, the server option should work at least. nowait can work
> > > > > > too?
> > > > > >
> > > > > > Also, if reconnect is useful it should be for chardevs too, so
> > > > > > if we don't share code, we need to code it in two places to stay
> > > > > > consistent.
> > > > > >
> > > > > > Overall sharing some code might be better ...
> > > > >
> > > > > What you have in mind is to use the functions chardev uses from
> > > > > qemu-sockets.c, right? Chardev itself doesn't look to have
> > > > > anything else that can be shared.
> > > >
> > > > Yes.
> > > >
> > > > > The problem with reconnect is that it is implemented at the
> > > > > protocol level; we are not just transparently reconnecting the
> > > > > socket. So the same approach would most likely not apply for
> > > > > chardev.
> > > >
> > > > Chardev mostly just could use transparent reconnect. vhost-user
> > > > could use that and get a callback to reconfigure everything after
> > > > reconnect.
> > > >
> > > > Once you write up the protocol in some text file we can discuss
> > > > this in more detail. For example, I wonder how feature negotiation
> > > > would work with reconnect: a new connection could be from another
> > > > application that does not support the same features, but virtio
> > > > assumes that device features never change.
> > >
> > > I attach the text document that we will include in the next version
> > > of the series, which describes the vhost-user protocol.
> > >
> > > The protocol is based on and very close to the vhost kernel
> > > protocol. Of note is the VHOST_USER_ECHO message, which is the only
> > > one that doesn't have an equivalent ioctl in the kernel version of
> > > vhost; this is the message that is being used to detect that the
> > > remote party is no longer on the socket. At that point QEMU will
> > > close the session and try to initiate a new one on the same socket.
> >
> > What if e.g. features change in between? Everything just goes south,
> > doesn't it?
> >
> > Is this detection and reconnect a must for your project?
> >
> > I think it would be simpler to
> > - generalize the char unix socket handling code and reuse it for
> >   vhost-user
>
> In our next version we will completely reuse the chardev
> infrastructure. In the process of doing that we are adding features we
> need to chardev (most specifically, support for ancillary data on the
> socket). So the end user will use something along these lines:
>
>   -chardev socket,path=/path,id=chr0 -netdev vhost-user,chardev=chr0
>
> Of course, this will only be usable with a socket chardev; otherwise
> we will fail gracefully.
>
> > - as a separate step, add live detection and reconnect abilities
> >   to the generic code
>
> So far we did live detection with a special ECHO message. Is it
> possible to detect if there is another listener on a unix socket in a
> generic way?
>
> Best regards,
> Antonios
I don't get why ECHO is necessary. Checking that the remote is alive
seems enough. IIRC testing that the fd is readable using poll or read
is enough for that.

> > > > > > > > In particular you shouldn't require filesystem access by
> > > > > > > > qemu, passing an fd for the domain socket should work.
> > > > > > >
> > > > > > > We can add an option to pass an fd for the domain socket if
> > > > > > > needed. However, as far as we understand, chardev doesn't do
> > > > > > > that either (at least from looking at the man page). Maybe we
> > > > > > > misunderstand what you mean.
> > > > > >
> > > > > > Sorry. I got confused with e.g. tap which has this. This might
> > > > > > be useful but does not have to block this patch.
> > > > > >
> > > > > > > > > Example usage:
> > > > > > > > >
> > > > > > > > > qemu -m 1024 -mem-path /hugetlbfs,prealloc=on,share=on \
> > > > > > > > >   -netdev type=vhost-user,id=net0,path=/path/to/sock,poll_time=2500 \
> > > > > > > > >   -device virtio-net-pci,netdev=net0
> > > > > > > >
> > > > > > > > It's not clear which parts of -mem-path are required for
> > > > > > > > vhost-user. It should be documented somewhere, made clear in
> > > > > > > > -help and should fail gracefully if misconfigured.
> > > > > > >
> > > > > > > Ok.
> > > > > > >
> > > > > > > > > Changes from v5:
> > > > > > > > >  - Split -mem-path unlink option to a separate patch
> > > > > > > > >  - Fds are passed only in the ancillary data
> > > > > > > > >  - Stricter message size checks on receive/send
> > > > > > > > >  - Netdev vhost-user now includes path and poll_time options
> > > > > > > > >  - The connection probing interval is configurable
> > > > > > > > >
> > > > > > > > > Changes from v4:
> > > > > > > > >  - Use error_report for errors
> > > > > > > > >  - VhostUserMsg has new field `size` indicating the
> > > > > > > > >    following payload length. Field `flags` now has version
> > > > > > > > >    and reply bits. The structure is packed.
> > > > > > > > >  - Send data is of variable length (`size` field in message)
> > > > > > > > >  - Receive in 2 steps, header and payload
> > > > > > > > >  - Add new message type VHOST_USER_ECHO, to check
> > > > > > > > >    connection status
> > > > > > > > >
> > > > > > > > > Changes from v3:
> > > > > > > > >  - Convert -mem-path to QemuOpts with prealloc, share and
> > > > > > > > >    unlink properties
> > > > > > > > >  - Set 1 sec timeout when read/write to the unix domain
> > > > > > > > >    socket
> > > > > > > > >  - Fix file descriptor leak
> > > > > > > > >
> > > > > > > > > Changes from v2:
> > > > > > > > >  - Reconnect when the backend disappears
> > > > > > > > >
> > > > > > > > > Changes from v1:
> > > > > > > > >  - Implementation of vhost-user netdev backend
> > > > > > > > >  - Code improvements
> > > > > > > > >
> > > > > > > > > Antonios Motakis (8):
> > > > > > > > >   Convert -mem-path to QemuOpts and add prealloc and share
> > > > > > > > >     properties
> > > > > > > > >   New -mem-path option - unlink.
> > > > > > > > >   Decouple vhost from kernel interface
> > > > > > > > >   Add vhost-user skeleton
> > > > > > > > >   Add domain socket communication for vhost-user backend
> > > > > > > > >   Add vhost-user calls implementation
> > > > > > > > >   Add new vhost-user netdev backend
> > > > > > > > >   Add vhost-user reconnection
> > > > > > > > >
> > > > > > > > >  exec.c                            |  57 +++-
> > > > > > > > >  hmp-commands.hx                   |   4 +-
> > > > > > > > >  hw/net/vhost_net.c                | 144 +++++++---
> > > > > > > > >  hw/net/virtio-net.c               |  42 ++-
> > > > > > > > >  hw/scsi/vhost-scsi.c              |  13 +-
> > > > > > > > >  hw/virtio/Makefile.objs           |   2 +-
> > > > > > > > >  hw/virtio/vhost-backend.c         | 556 ++++++++++++++++++++++++++++++++++++++
> > > > > > > > >  hw/virtio/vhost.c                 |  46 ++--
> > > > > > > > >  include/exec/cpu-all.h            |   3 -
> > > > > > > > >  include/hw/virtio/vhost-backend.h |  40 +++
> > > > > > > > >  include/hw/virtio/vhost.h         |   4 +-
> > > > > > > > >  include/net/vhost-user.h          |  17 ++
> > > > > > > > >  include/net/vhost_net.h           |  15 +-
> > > > > > > > >  net/Makefile.objs                 |   2 +-
> > > > > > > > >  net/clients.h                     |   3 +
> > > > > > > > >  net/hub.c                         |   1 +
> > > > > > > > >  net/net.c                         |   2 +
> > > > > > > > >  net/tap.c                         |  16 +-
> > > > > > > > >  net/vhost-user.c                  | 177 ++++++++++++
> > > > > > > > >  qapi-schema.json                  |  21 +-
> > > > > > > > >  qemu-options.hx                   |  24 +-
> > > > > > > > >  vl.c                              |  41 ++-
> > > > > > > > >  22 files changed, 1106 insertions(+), 124 deletions(-)
> > > > > > > > >  create mode 100644 hw/virtio/vhost-backend.c
> > > > > > > > >  create mode 100644 include/hw/virtio/vhost-backend.h
> > > > > > > > >  create mode 100644 include/net/vhost-user.h
> > > > > > > > >  create mode 100644 net/vhost-user.c
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > 1.8.3.2
> > >
> > >
> > > Vhost-user Protocol
> > > ===================
> > >
> > > This protocol aims to complement the ioctl interface used to
> > > control the vhost implementation in the Linux kernel.
> > > It implements the control plane needed to establish virtqueue
> > > sharing with a user space process on the same host. It uses
> > > communication over a Unix domain socket to share file descriptors
> > > in the ancillary data of the message.
> > >
> > > The protocol defines 2 sides of the communication, master and
> > > slave. Master is the application that shares its virtqueues, in
> > > our case QEMU. Slave is the consumer of the virtqueues.
> > >
> > > In the current implementation QEMU is the Master, and the Slave is
> > > intended to be a software ethernet switch running in user space,
> > > such as Snabbswitch.
> > >
> > > Master and slave can be either a client (i.e. connecting) or a
> > > server (listening) in the socket communication.
> > >
> > > Message Specification
> > > ---------------------
> > >
> > > Note that all numbers are in the machine native byte order. A
> > > vhost-user message consists of 3 header fields and a payload:
> > >
> > > ------------------------------------
> > > | request | flags | size | payload |
> > > ------------------------------------
> > >
> > > * Request: 32-bit type of the request
> > > * Flags: 32-bit bit field:
> > >   - Lower 2 bits are the version (currently 0x01)
> > >   - Bit 2 is the reply flag - needs to be set on each reply from
> > >     the slave
> > > * Size: 32-bit size of the payload
> > >
> > > Depending on the request type, the payload can be:
> > >
> > > * A single 64-bit integer
> > >   -------
> > >   | u64 |
> > >   -------
> > >
> > >   u64: a 64-bit unsigned integer
> > >
> > > * A vring state description
> > >   ---------------
> > >   | index | num |
> > >   ---------------
> > >
> > >   Index: a 32-bit index
> > >   Num: a 32-bit number
> > >
> > > * A vring address description
> > >   --------------------------------------------------------------
> > >   | index | flags | size | descriptor | used | available | log |
> > >   --------------------------------------------------------------
> > >
> > >   Index: a 32-bit vring index
> > >   Flags: 32-bit vring flags
> > >   Descriptor: a 64-bit user address of the vring descriptor table
> > >   Used: a 64-bit user address of the vring used ring
> > >   Available: a 64-bit user address of the vring available ring
> > >   Log: a 64-bit guest address for logging
> > >
> > > * Memory regions description
> > >   ---------------------------------------------------
> > >   | num regions | padding | region0 | ... | region7 |
> > >   ---------------------------------------------------
> > >
> > >   Num regions: a 32-bit number of regions
> > >   Padding: 32-bit
> > >
> > >   A region is:
> > >   ---------------------------------------
> > >   | guest address | size | user address |
> > >   ---------------------------------------
> > >
> > >   Guest address: a 64-bit guest address of the region
> > >   Size: a 64-bit size
> > >   User address: a 64-bit user address
> > >
> > > In QEMU the vhost-user message is implemented with the following
> > > struct:
> > >
> > > typedef struct VhostUserMsg {
> > >     VhostUserRequest request;
> > >     uint32_t flags;
> > >     uint32_t size;
> > >     union {
> > >         uint64_t u64;
> > >         struct vhost_vring_state state;
> > >         struct vhost_vring_addr addr;
> > >         VhostUserMemory memory;
> > >     };
> > > } QEMU_PACKED VhostUserMsg;
> > >
> > > Communication
> > > -------------
> > >
> > > The protocol for vhost-user is based on the existing implementation
> > > of vhost for the Linux kernel. Most messages that can be sent via
> > > the Unix domain socket implementing vhost-user have an equivalent
> > > ioctl in the kernel implementation.
> > >
> > > The communication consists of the master sending message requests
> > > and the slave sending message replies. Most of the requests don't
> > > require replies.
> > > Here is a list of the ones that do:
> > >
> > > * VHOST_USER_ECHO
> > > * VHOST_USER_GET_FEATURES
> > > * VHOST_USER_GET_VRING_BASE
> > >
> > > There are several messages that the master sends with file
> > > descriptors passed in the ancillary data:
> > >
> > > * VHOST_USER_SET_MEM_TABLE
> > > * VHOST_USER_SET_LOG_FD
> > > * VHOST_USER_SET_VRING_KICK
> > > * VHOST_USER_SET_VRING_CALL
> > > * VHOST_USER_SET_VRING_ERR
> > >
> > > If the Master is unable to send the full message, or receives a
> > > wrong reply, it will close the connection. An optional reconnection
> > > mechanism can be implemented.
> > >
> > > Message types
> > > -------------
> > >
> > > * VHOST_USER_ECHO
> > >
> > >   Id: 1
> > >   Equivalent ioctl: N/A
> > >   Master payload: N/A
> > >
> > >   ECHO request that is used to periodically probe the connection.
> > >   When received by the slave, it is expected that it will send back
> > >   an ECHO packet with the REPLY flag set.
> > >
> > > * VHOST_USER_GET_FEATURES
> > >
> > >   Id: 2
> > >   Equivalent ioctl: VHOST_GET_FEATURES
> > >   Master payload: N/A
> > >   Slave payload: u64
> > >
> > >   Get the feature bitmask from the underlying vhost implementation.
> > >
> > > * VHOST_USER_SET_FEATURES
> > >
> > >   Id: 3
> > >   Equivalent ioctl: VHOST_SET_FEATURES
> > >   Master payload: u64
> > >
> > >   Enable features in the underlying vhost implementation using a
> > >   bitmask.
> > >
> > > * VHOST_USER_SET_OWNER
> > >
> > >   Id: 4
> > >   Equivalent ioctl: VHOST_SET_OWNER
> > >   Master payload: N/A
> > >
> > >   Issued when a new connection is established. It sets the current
> > >   Master as the owner of the session. This can be used on the Slave
> > >   as a "session start" flag.
> > >
> > > * VHOST_USER_RESET_OWNER
> > >
> > >   Id: 5
> > >   Equivalent ioctl: VHOST_RESET_OWNER
> > >   Master payload: N/A
> > >
> > >   Issued when a connection is about to be closed. The Master will
> > >   no longer own this connection (and will usually close it).
> > > * VHOST_USER_SET_MEM_TABLE
> > >
> > >   Id: 6
> > >   Equivalent ioctl: VHOST_SET_MEM_TABLE
> > >   Master payload: memory regions description
> > >
> > >   Sets the memory map regions on the slave so it can translate the
> > >   vring addresses. In the ancillary data there is an array of file
> > >   descriptors, one for each memory mapped region. The size and
> > >   ordering of the fds match the number and ordering of the memory
> > >   regions.
> > >
> > > * VHOST_USER_SET_LOG_BASE
> > >
> > >   Id: 7
> > >   Equivalent ioctl: VHOST_SET_LOG_BASE
> > >   Master payload: u64
> > >
> > >   Sets the logging base address.
> > >
> > > * VHOST_USER_SET_LOG_FD
> > >
> > >   Id: 8
> > >   Equivalent ioctl: VHOST_SET_LOG_FD
> > >   Master payload: N/A
> > >
> > >   Sets the logging file descriptor, which is passed as ancillary
> > >   data.
> > >
> > > * VHOST_USER_SET_VRING_NUM
> > >
> > >   Id: 9
> > >   Equivalent ioctl: VHOST_SET_VRING_NUM
> > >   Master payload: vring state description
> > >
> > >   Sets the size (number of descriptors) of the vring.
> > >
> > > * VHOST_USER_SET_VRING_ADDR
> > >
> > >   Id: 10
> > >   Equivalent ioctl: VHOST_SET_VRING_ADDR
> > >   Master payload: vring address description
> > >   Slave payload: N/A
> > >
> > >   Sets the addresses of the different parts of the vring.
> > >
> > > * VHOST_USER_SET_VRING_BASE
> > >
> > >   Id: 11
> > >   Equivalent ioctl: VHOST_SET_VRING_BASE
> > >   Master payload: vring state description
> > >
> > >   Sets the base index in the available ring.
> > >
> > > * VHOST_USER_GET_VRING_BASE
> > >
> > >   Id: 12
> > >   Equivalent ioctl: VHOST_GET_VRING_BASE
> > >   Master payload: vring state description
> > >   Slave payload: vring state description
> > >
> > >   Get the vring base index.
> > >
> > > * VHOST_USER_SET_VRING_KICK
> > >
> > >   Id: 13
> > >   Equivalent ioctl: VHOST_SET_VRING_KICK
> > >   Master payload: N/A
> > >
> > >   Set the event file descriptor for adding buffers to the vring.
> > >   It is passed in the ancillary data.
> > > * VHOST_USER_SET_VRING_CALL
> > >
> > >   Id: 14
> > >   Equivalent ioctl: VHOST_SET_VRING_CALL
> > >   Master payload: N/A
> > >
> > >   Set the event file descriptor to signal when buffers are used.
> > >   It is passed in the ancillary data.
> > >
> > > * VHOST_USER_SET_VRING_ERR
> > >
> > >   Id: 15
> > >   Equivalent ioctl: VHOST_SET_VRING_ERR
> > >   Master payload: N/A
> > >
> > >   Set the event file descriptor to signal when an error occurs.
> > >   It is passed in the ancillary data.