[this time including the mailing list]

This is yet another tiny bit of the multiqueue work, this time affecting
the synchronization infrastructure for coroutines.  Currently, coroutines
synchronize between the main I/O thread and the dataplane iothread
through the AioContext lock (sketched below).  However, for multiqueue a
single BDS will be used by multiple iothreads and hence multiple
AioContexts.  This calls for a different approach to coroutine
synchronization, and this series is my first attempt at it.

Besides multiqueue, I think throttling groups (which can already span
multiple AioContexts) could also take advantage of the new CoMutexes.
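For reference, this is what the status quo looks like: one big lock per
AioContext, taken by whatever thread wants to touch the BDS or enter
its coroutines.  A minimal sketch, where bs and co stand for whatever
BDS and coroutine are involved:

    AioContext *ctx = bdrv_get_aio_context(bs);

    /* Everything that touches the BDS runs under the context's
     * lock, so the main I/O thread and the dataplane iothread
     * never race with each other.
     */
    aio_context_acquire(ctx);
    qemu_coroutine_enter(co);
    aio_context_release(ctx);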
The series has two main parts:

- it makes CoMutex bind coroutines to an AioContext.  It is of course
  still possible to move coroutines around with explicit yields and
  qemu_coroutine_enter, but a CoMutex will now reenter the coroutine
  in the AioContext where it was running before, rather than in the
  AioContext of the previous owner.  To do this, a new function
  aio_co_schedule is introduced to run a coroutine on a given iothread
  (sketched after this letter).  I think this could be used in other
  places too; for now it lets a CoMutex protect stuff across multiple
  AioContexts without the coroutines moving around(*).  Of course
  currently a CoMutex is generally not used across multiple iothreads,
  because you have to acquire/release the AioContext around CoMutex
  critical sections.  However...

- ... the second change is exactly to make CoMutex thread-safe and
  remove the need for a "thread-based" mutex around it.  The new
  CoMutex is exactly the same as a mutex implementation that you'd
  find in an operating system: iothreads are the moral equivalent of
  CPUs in a kernel, while coroutines resemble kernel threads running
  without preemption on a CPU.  Even if you have to take concurrency
  into account, the lack of preemption while running coroutines or
  bottom halves keeps the complexity at bay.  For example, it is easy
  to synchronize between qemu_co_mutex_lock's yield and the
  qemu_coroutine_enter in aio_co_schedule's bottom half.

Same as before, CoMutex puts coroutines to sleep with
qemu_coroutine_yield, but it now wakes them up with the new
aio_co_schedule.  I could have wrapped CoMutex's CoQueue with a
"regular" thread mutex or spinlock; the resulting code would have
looked a lot like RFifoLock (with CoQueue replacing RFifoLock's
condition variable).  Instead, inspired by the parallel between
coroutines and non-preemptive kernel threads, I chose to adopt the
same lock-free mutex algorithm as OSv.  The algorithm only needs two
to four atomic ops for a lock-unlock pair (two when uncontended); the
fast paths are sketched below.

To cover CoQueue, each CoQueue is made to depend on a CoMutex, similar
to condition variables (usage sketch below).  Most CoQueues already
have a corresponding CoMutex, so this is not a big deal; converting
the others is left for a future series.  I did this because CoQueue
looks a lot like OSv's waitqueues; if necessary, we could even take
OSv's support for wait morphing (which avoids the thundering herd
problem) and add it to CoMutex and CoQueue.  This may be useful when
making tracked_requests thread-safe.

Kevin: this has nothing to do with my old plan from Brno, and it's
possibly a lot closer to what you wanted.  Your idea was to
automatically release the "thread mutex" when a coroutine yields; I
think you should be fine with not having a thread mutex at all!

(*) This will need some changes in the formats because, for multiqueue,
CoMutexes would need to be used like "real" thread mutexes.  Code like
this:

    ...
    qemu_co_mutex_unlock()
    ...
    /* still access shared data, but don't yield */
    qemu_coroutine_yield()

might be required to use this other pattern:

    ...
    /* access shared data, but don't yield */
    qemu_co_mutex_unlock()
    qemu_coroutine_yield()

because "adding a second CPU" already introduces concurrency that
wasn't there before.  The "non-preemptive multitasking" reference only
applies to things that access AioContext-local data.  This includes
the synchronization primitives implemented in this series, the thread
pool and the Linux AIO support, but not much else.  It still
simplifies _those_, though. :)  Anyhow, we'll always have some
BlockDrivers that do not support multiqueue, such as the network
protocols.  Thus it would be possible to handle the formats one at a
time.  raw-posix, raw and qcow2 would already form a pretty good set,
and the first two do not use CoMutex at all.
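Here are the promised sketches.  First, aio_co_schedule.  The
prototype is the one the series adds to aio.h; the bottom-half body is
a much-simplified illustration of the idea, not the actual patch (the
real code in async.c differs in detail, and co_schedule_bh and
pop_scheduled_coroutine are made-up names):

    /* Run a coroutine in the iothread that owns ctx; callable
     * from any thread.
     */
    void aio_co_schedule(AioContext *ctx, Coroutine *co);

    /* Simplified idea of the implementation: queue the coroutine
     * on the target context and kick a bottom half there.  The
     * bottom half reenters everything queued for this context.
     */
    static void co_schedule_bh(void *opaque)
    {
        AioContext *ctx = opaque;
        Coroutine *co;

        while ((co = pop_scheduled_coroutine(ctx)) != NULL) {
            qemu_coroutine_enter(co);
        }
    }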
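Second, where the two-to-four atomic ops come from.  This is a sketch
of the fast paths only; the slow paths, which park the coroutine on a
wait list and hand the mutex off to the next owner, contribute the
other one or two atomic ops, and the helper names below are
illustrative:

    void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex)
    {
        /* "locked" counts lock requests: 0 = free, 1 = taken,
         * >1 = taken with waiters.
         */
        if (atomic_fetch_inc(&mutex->locked) == 0) {
            return;                  /* uncontended: one atomic op */
        }
        qemu_co_mutex_lock_slowpath(mutex);  /* yield until woken */
    }

    void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex)
    {
        if (atomic_fetch_dec(&mutex->locked) == 1) {
            return;                  /* no waiters: one atomic op */
        }
        /* Hand off to a waiter and wake it with aio_co_schedule,
         * in the AioContext where it was running before.
         */
        qemu_co_mutex_wake_waiter(mutex);    /* made-up helper */
    }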
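Third, what the CoQueue change means for callers.  qemu_co_queue_wait
now takes the CoMutex and behaves like a condition variable's wait:
it releases the mutex before putting the coroutine to sleep and
reacquires it before returning.  The expected pattern, where
"condition" stands for whatever protected state you are waiting on:

    /* waiting side */
    qemu_co_mutex_lock(&mutex);
    while (!condition) {
        qemu_co_queue_wait(&queue, &mutex);
    }
    /* ... condition holds and the mutex is held ... */
    qemu_co_mutex_unlock(&mutex);

    /* waking side, also with the mutex held */
    qemu_co_queue_next(&queue);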
The series has quite a lot of new code, but about half of it is
testcases.  The new test covers correctness and (when run with
--verbose) also takes a stab at measuring performance; the result is
that the performance of CoMutex is comparable to pthread mutexes.  The
only snag is that you cannot make a direct comparison between CoMutex
(fair) and pthread_mutex_t (unfair).  For this reason the testcase
also measures the performance of a quick-and-dirty implementation of a
fair mutex, based on MCS locks + futexes.

There's a lot of meat in the above text, and I hope it will make the
code clearer and compensate for the terse commit messages. :)  I'll
probably write a single-threaded testcase too, just to provide some
more unit-test comparison of "before" and "after".

I haven't even started a guest with these patches, let alone run
qemu-iotests...  Generally the changes are well confined to
unit-tested code, though patch 2, for example, is completely untested.
There are a couple of other places that at least need more comments,
but I wanted to throw the series out for an early review of the
general approach.

Paolo Bonzini (11):
  coroutine: use QSIMPLEQ instead of QTAILQ
  throttle-groups: restart throttled requests from coroutine context
  coroutine: delete qemu_co_enter_next
  aio: introduce aio_co_schedule
  coroutine-lock: reschedule coroutine on the AioContext it was running on
  coroutine-lock: make CoMutex thread-safe
  coroutine-lock: add limited spinning to CoMutex
  test-aio-multithread: add performance comparison with thread-based mutexes
  coroutine-lock: place CoMutex before CoQueue in header
  coroutine-lock: add mutex argument to CoQueue APIs
  coroutine-lock: make CoRwlock thread-safe and fair

 async.c                      |  38 ++++
 block/backup.c               |   2 +-
 block/io.c                   |   2 +-
 block/qcow2-cluster.c        |   4 +-
 block/sheepdog.c             |   2 +-
 block/throttle-groups.c      |  71 +++++--
 hw/9pfs/9p.c                 |   2 +-
 include/block/aio.h          |  14 ++
 include/qemu/atomic.h        |   3 +
 include/qemu/coroutine.h     | 106 ++++++----
 include/qemu/coroutine_int.h |  14 +-
 iothread.c                   |  16 ++
 stubs/iothread.c             |  11 ++
 tests/Makefile               |  11 +-
 tests/iothread.c             | 107 ++++++++++
 tests/iothread.h             |  25 +++
 tests/test-aio-multithread.c | 452 +++++++++++++++++++++++++++++++++++++++++++
 trace-events                 |   8 +-
 util/qemu-coroutine-lock.c   | 257 ++++++++++++++++++++----
 util/qemu-coroutine.c        |   2 +-
 20 files changed, 1037 insertions(+), 110 deletions(-)
 create mode 100644 tests/iothread.c
 create mode 100644 tests/iothread.h
 create mode 100644 tests/test-aio-multithread.c

--
2.5.5