Hi all,

A few weeks ago Stefan Hajnoczi pointed me to his work on virtio-blk
performance.

Stefan's work had two sides.  First, he captured very nice performance
data of the block layer at
http://www.linux-kvm.org/page/Virtio/Block/Latency; second, in order to
measure peak performance, he basically implemented "vhost-blk" in
userspace.  This second part adds a per-device thread, woken via
ioeventfd, which converts virtio requests into AIO calls on the host.
It is quite restricted, as it only supports raw files and Linux AIO,
but it achieved performance improvements of, IIRC, 2-3x on big machines.

We talked a bit about how to generalize the work and bring it upstream,
and one idea was to make a fully multi-threaded block/AIO layer.  This
is a prerequisite for processing devices in their own thread, but it
would also be an occasion to clean up several pieces of code, magically
add AIO support for Windows, and probably be a good step towards making
libblockformat.

And, it turns out that multi-threading block is not that difficult.

There are basically two parts in converting from coroutines to threads.
 One is to protect shared data, and we do quite well here because we
already protect most bits with CoMutexes.  The second is to remove
manual scheduling (CoQueues) and replace it with standard thread
synchronization primitives, such as condition variables and semaphores.

For the first part, there are relatively few pieces of data that are
shared by multiple coroutines and need to be protected by locks, namely
bottom halves and timers.

For the second part, CoQueues are used by the throttled and tracked
request lists.  These lists also have to be protected by their own mutex.
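
For instance, the wait for overlapping tracked requests could look
roughly like this.  Only bs->tracked_requests exists today; reqs_lock,
reqs_cond and tracked_request_overlaps are names I made up for
illustration:

#include "qemu-thread.h"   /* QemuMutex, QemuCond */
#include "qemu-queue.h"    /* QLIST_FOREACH */
#include "block_int.h"     /* BlockDriverState, BdrvTrackedRequest */

/* Sketch only: reqs_lock/reqs_cond are hypothetical per-BDS fields,
 * tracked_request_overlaps a hypothetical overlap predicate. */
static void wait_for_overlapping_requests(BlockDriverState *bs,
                                          BdrvTrackedRequest *self)
{
    BdrvTrackedRequest *req;
    bool retry;

    qemu_mutex_lock(&bs->reqs_lock);
    do {
        retry = false;
        QLIST_FOREACH(req, &bs->tracked_requests, list) {
            if (req != self && tracked_request_overlaps(req, self)) {
                /* Completing requests broadcast on reqs_cond. */
                qemu_cond_wait(&bs->reqs_cond, &bs->reqs_lock);
                retry = true;
                break;
            }
        }
    } while (retry);
    qemu_mutex_unlock(&bs->reqs_lock);
}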

I have made an attempt at it in my github repo's thread-blocks branch.
The ingredients are:

- a common representation of requests, based on BdrvTrackedRequest but
subsuming RwCo and other structs.  This representation follows a request
from beginning to end.  Lists of BdrvTrackedRequests replace the CoQueues;

- a generic thread pool, similar to the one in posix-aio-compat.c but
with extra services similar to coroutine enter/yield.  These services
are just wrappers around condition variables, and do not prevent you
from using regular mutexes and condvars in the work items (see the
sketch after this list);

- an AIO fast path, used when none of the coroutine-enabled goodies are
active.
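
To give an idea of what the enter/yield-like wrappers boil down to,
here is a minimal sketch; the struct and function names are mine, not
necessarily those in the branch:

#include <stdbool.h>
#include "qemu-thread.h"

typedef struct ThreadPoolWork {
    QemuMutex lock;
    QemuCond  cond;
    bool      woken;
} ThreadPoolWork;

/* Analogue of qemu_coroutine_yield(): park the current work item
 * until another thread "enters" it.  Just a condition variable wait. */
static void thread_pool_work_yield(ThreadPoolWork *w)
{
    qemu_mutex_lock(&w->lock);
    while (!w->woken) {
        qemu_cond_wait(&w->cond, &w->lock);
    }
    w->woken = false;
    qemu_mutex_unlock(&w->lock);
}

/* Analogue of qemu_coroutine_enter(): wake a parked work item from
 * the iothread or any other thread. */
static void thread_pool_work_enter(ThreadPoolWork *w)
{
    qemu_mutex_lock(&w->lock);
    w->woken = true;
    qemu_cond_signal(&w->cond);
    qemu_mutex_unlock(&w->lock);
}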


A work item in the thread pool replaces a coroutine and, unlike a
coroutine, executes completely asynchronously with respect to the
iothread and VCPU threads.  posix-aio-compat code is replaced by
synchronous entry-points into the "file" driver.  An interesting point
is that (unlike coroutine-gthread) there is hardly any overhead from
removing the coroutines, because:

- when using the raw format, more or less the same code is executed,
only in raw-posix.c rather than in posix-aio-compat.c;

- when using qcow2, file I/O will execute synchronously in the same work
item that is already being used for the format code.  So format I/O is
more expensive to start, but this is compensated completely by cheaper
protocol I/O.  There can be slowdowns in special cases such as reading
from non-allocated clusters, of course.
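
For reference, such a synchronous entry point is just the obvious
blocking call; something like the following, where raw_preadv_sync is
an invented name and short-read handling is omitted:

#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

static ssize_t raw_preadv_sync(int fd, struct iovec *iov, int iovcnt,
                               off_t offset)
{
    ssize_t ret;

    /* Retry on signals; a real version would also loop on short reads. */
    do {
        ret = preadv(fd, iov, iovcnt, offset);
    } while (ret < 0 && errno == EINTR);

    return ret;
}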


qcow2 is the only format that uses CoQueues internally.  These are
replaced by a condition variable.

rbd, iSCSI and other aio-based protocols will require new locking, but
are otherwise not a problem.

NBD and sheepdog have to be almost rewritten, but I expect the result to
be simpler because they can mostly use blocking I/O instead of
coroutines.  Everything else works with s/CoMutex/QemuMutex/;
s/co_mutex/mutex/.

When using raw on AIO, the QEMU thread pool can be bypassed completely
and I/O can be submitted directly from the iothread.  Stefan measured
the cost of an io_submit to be comparable to the cost of waking up a
thread in posix-aio-compat.c, and the same holds for my generic thread pool.
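
In libaio terms, the fast path amounts to an io_submit issued straight
from the iothread, roughly as follows; error handling and batching are
omitted, and the eventfd wiring is just one possible completion
mechanism:

#include <libaio.h>
#include <sys/uio.h>

/* Submit one vectored read directly from the iothread; io_submit
 * returns the number of iocbs accepted, so 1 on success. */
static int fast_path_preadv(io_context_t ctx, int fd, struct iovec *iov,
                            int iovcnt, off_t offset, int completion_eventfd)
{
    struct iocb iocb;
    struct iocb *iocbs[1] = { &iocb };

    io_prep_preadv(&iocb, fd, iov, iovcnt, offset);
    /* Completion pokes an eventfd that the iothread polls. */
    io_set_eventfd(&iocb, completion_eventfd);
    return io_submit(ctx, 1, iocbs);
}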


Except for fixing non-file protocols (and testing and debugging of
course), this is where I'm at; code is in branch thread-blocks at
git://github.com/bonzini/qemu.git.  It boots a raw image, so it cannot
be that bad! :)  I didn't test throttling or copy-on-read at all.
However, the changes are for a large part mechanical and can be posted
in fairly small chunks.

Anyhow, with this in place, aio.c can be rethought and generalized.
Interaction with other threads and the system can occur in terms of
EventNotifiers, so that portability to Windows falls out almost
automatically just by porting those.  The main improvement then would be
separate contexts for asynchronous I/O, so that it is possible to have
per-device threads as in Stefan's original proof of concept.
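
As a rough sketch of the EventNotifier idea (the exact API and header
location are approximations on my part):

#include "event_notifier.h"   /* header location may vary */

static EventNotifier notifier;

/* Producer side, callable from any thread: on POSIX this boils down
 * to an eventfd write, on Windows to SetEvent. */
static void queue_completion_and_wake(void)
{
    /* ... push a completion onto a locked list ... */
    event_notifier_set(&notifier);
}

/* Consumer side, run by the iothread when the notifier fires. */
static void iothread_dispatch(void)
{
    if (event_notifier_test_and_clear(&notifier)) {
        /* ... drain the completion list and invoke callbacks ... */
    }
}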

Thoughts?

Paolo
