Hi all, a few weeks ago Stefan Hajnoczi pointed me to his work on virtio-blk performance.
Stefan's work had two sides. First, he captured very nice performance data of the block layer at http://www.linux-kvm.org/page/Virtio/Block/Latency; second, in order to measure peak performance, he basically implemented "vhost-blk" in userspace. This second part adds a thread for each block device that is activated via ioeventfd and converts virtio requests to AIO calls on the host. It is quite restricted, as it only supports raw files and Linux AIO, but it achieved performance improvements of IIRC 2-3x on big machines.

We talked a bit about how to generalize the work and bring it upstream, and one idea was to make a fully multi-threaded block/AIO layer. This is a prerequisite for processing devices in their own threads, but it would also be an occasion to clean up several pieces of code, magically add AIO support for Windows, and probably be a good step towards making libblockformat.

And it turns out that multi-threading the block layer is not that difficult. There are basically two parts to converting from coroutines to threads. One is to protect shared data, and we do quite well here because we already protect most bits with CoMutexes. The second is to remove manual scheduling (CoQueues) and replace it with standard thread synchronization primitives, such as condition variables and semaphores.

For the first part, there are relatively few pieces of data that are shared by multiple coroutines and need to be protected by locks, namely bottom halves and timers. For the second part, CoQueues are used by the throttled and tracked request lists. These lists also have to be protected by their own mutex.

I have made an attempt at it in my github repo's thread-blocks branch. The ingredients are:

- a common representation of requests, based on BdrvTrackedRequest but subsuming RwCo and other structs. This representation follows a request from beginning to end. Lists of BdrvTrackedRequests replace the CoQueues;

- a generic thread pool, similar to the one in posix-aio-compat.c but with extra services similar to coroutine enter/yield. These services are just wrappers around condition variables, and do not prevent you from using regular mutexes and condvars in the work items;

- an AIO fast path, used when none of the coroutine-enabled goodies are active.

A work item in the thread pool replaces a coroutine and, unlike a coroutine, executes completely asynchronously with respect to the iothread and VCPU threads. The posix-aio-compat code is replaced by synchronous entry points into the "file" driver.

An interesting point is that (unlike coroutine-gthread) there is hardly any overhead from removing the coroutines, because:

- when using the raw format, more or less the same code is executed, only in raw-posix.c rather than in posix-aio-compat.c;

- when using qcow2, file I/O will execute synchronously in the same work item that is already being used for the format code. So format I/O is more expensive to start, but this is compensated completely by cheaper protocol I/O. There can be slowdowns in special cases such as reading from non-allocated clusters, of course.

qcow2 is the only format that uses CoQueues internally; these are replaced by a condition variable. rbd, iSCSI and other AIO-based protocols will require new locking, but are otherwise not a problem. NBD and Sheepdog have to be almost rewritten, but I expect the result to be simpler because they can mostly use blocking I/O instead of coroutines. Everything else works with s/CoMutex/QemuMutex/; s/co_mutex/mutex/.
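To make the CoQueue-to-condition-variable conversion concrete, here is a rough sketch of what a wait point can look like once a request runs inside a work item. This is illustration only, not code from the branch: the struct and the in_flight conflict check are simplified stand-ins, and the only real API I am assuming is the QemuMutex/QemuCond wrappers from qemu-thread.h.

#include "qemu-thread.h"

typedef struct TrackedRequestList {
    QemuMutex lock;      /* protects in_flight (and would protect the lists) */
    QemuCond retired;    /* broadcast whenever a request completes           */
    int in_flight;       /* invented stand-in for the real conflict check    */
} TrackedRequestList;

static void tracked_request_list_init(TrackedRequestList *l)
{
    qemu_mutex_init(&l->lock);
    qemu_cond_init(&l->retired);
    l->in_flight = 0;
}

/* Runs in a thread-pool work item; blocking here only stalls this
 * request's worker thread, never the iothread or a VCPU thread. */
static void wait_for_conflicting_requests(TrackedRequestList *l)
{
    qemu_mutex_lock(&l->lock);
    while (l->in_flight > 0) {           /* was: qemu_co_queue_wait() */
        qemu_cond_wait(&l->retired, &l->lock);
    }
    l->in_flight++;
    qemu_mutex_unlock(&l->lock);
}

static void tracked_request_end(TrackedRequestList *l)
{
    qemu_mutex_lock(&l->lock);
    l->in_flight--;
    qemu_cond_broadcast(&l->retired);    /* was: qemu_co_queue_restart_all() */
    qemu_mutex_unlock(&l->lock);
}

The point is simply that blocking is now allowed, so qemu_co_queue_wait()/qemu_co_queue_restart_all() map directly onto qemu_cond_wait()/qemu_cond_broadcast() taken under the list's own mutex.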
When using raw on Linux AIO, the QEMU thread pool can be bypassed completely and I/O can be submitted directly from the iothread. Stefan measured the cost of an io_submit to be comparable to the cost of waking up a thread in posix-aio-compat.c, and the same holds for my generic thread pool.

Except for fixing the non-file protocols (and testing and debugging, of course), this is where I'm at; the code is in the thread-blocks branch at git://github.com/bonzini/qemu.git. It boots a raw image, so it cannot be that bad! :) I haven't tested throttling or copy-on-read yet. However, the changes are for the most part mechanical and can be posted in fairly small chunks.

Anyhow, with this in place, aio.c can be rethought and generalized. Interaction with other threads and with the host can occur in terms of EventNotifiers, so that portability to Windows falls out almost automatically just by porting those. The main improvement then would be separate contexts for asynchronous I/O, so that it is possible to have per-device threads as in Stefan's original proof of concept.

Thoughts?

Paolo
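P.S. As a rough sketch of the EventNotifier idea (illustration only: the DeviceAioContext struct and function names are invented; only event_notifier_set()/event_notifier_test_and_clear(), the QemuMutex wrappers and container_of are the existing APIs), a per-device context could hand completions back to whichever thread runs its event loop like this:

#include "qemu-thread.h"
#include "event_notifier.h"
#include "osdep.h"           /* container_of */

typedef struct DeviceAioContext {
    EventNotifier done;      /* poked by worker threads                      */
    QemuMutex lock;
    int pending;             /* invented stand-in for a list of completions  */
} DeviceAioContext;

/* Worker-thread side: record the completion, then kick the notifier.
 * The kick is the only primitive that needs a host-specific backend. */
static void device_aio_complete(DeviceAioContext *ctx)
{
    qemu_mutex_lock(&ctx->lock);
    ctx->pending++;
    qemu_mutex_unlock(&ctx->lock);
    event_notifier_set(&ctx->done);
}

/* Event-loop side: registered as the handler for the notifier, in
 * whichever thread owns this device's AIO context. */
static void device_aio_dispatch(EventNotifier *e)
{
    DeviceAioContext *ctx = container_of(e, DeviceAioContext, done);

    event_notifier_test_and_clear(e);
    qemu_mutex_lock(&ctx->lock);
    while (ctx->pending > 0) {
        ctx->pending--;
        /* ...invoke the request's completion callback here... */
    }
    qemu_mutex_unlock(&ctx->lock);
}

The only cross-thread interaction is the EventNotifier kick, so the Windows port would indeed reduce to giving EventNotifier a native backend there.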