On Wed, Jul 15, 2015 at 07:13:23PM +0200, Paolo Bonzini wrote: > This patch rewrites the ctx->dispatching optimization, which was the cause > of some mysterious hangs that could be reproduced on aarch64 KVM only. > The hangs were indirectly caused by aio_poll() and in particular by > flash memory updates's call to blk_write(), which invokes aio_poll(). > Fun stuff: they had an extremely short race window, so much that > adding all kind of tracing to either the kernel or QEMU made it > go away (a single printf made it half as reproducible). > > On the plus side, the failure mode (a hang until the next keypress) > made it very easy to examine the state of the process with a debugger. > And there was a very nice reproducer from Laszlo, which failed pretty > often (more than half of the time) on any version of QEMU with a non-debug > kernel; it also failed fast, while still in the firmware. So, it could > have been worse. > > For some unknown reason they happened only with virtio-scsi, but > that's not important. It's more interesting that they disappeared with > io=native, making thread-pool.c a likely suspect for where the bug arose. > thread-pool.c is also one of the few places which use bottom halves > across threads, by the way. > > I hope that no other similar bugs exist, but just in case :) I am > going to describe how the successful debugging went... Since the > likely culprit was the ctx->dispatching optimization, which mostly > affects bottom halves, the first observation was that there are two > qemu_bh_schedule() invocations in the thread pool: the one in the aio > worker and the one in thread_pool_completion_bh. The latter always > causes the optimization to trigger, the former may or may not. In > order to restrict the possibilities, I introduced new functions > qemu_bh_schedule_slow() and qemu_bh_schedule_fast(): > > /* qemu_bh_schedule_slow: */ > ctx = bh->ctx; > bh->idle = 0; > if (atomic_xchg(&bh->scheduled, 1) == 0) { > event_notifier_set(&ctx->notifier); > } > > /* qemu_bh_schedule_fast: */ > ctx = bh->ctx; > bh->idle = 0; > assert(ctx->dispatching); > atomic_xchg(&bh->scheduled, 1); > > Notice how the atomic_xchg is still in qemu_bh_schedule_slow(). This > was already debated a few months ago, so I assumed it to be correct. > In retrospect this was a very good idea, as you'll see later. > > Changing thread_pool_completion_bh() to qemu_bh_schedule_fast() didn't > trigger the assertion (as expected). Changing the worker's invocation > to qemu_bh_schedule_slow() didn't hide the bug (another assumption > which luckily held). This already limited heavily the amount of > interaction between the threads, hinting that the problematic events > must have triggered around thread_pool_completion_bh(). > > As mentioned early, invoking a debugger to examine the state of a > hung process was pretty easy; the iothread was always waiting on a > poll(..., -1) system call. Infinite timeouts are much rarer on x86, > and this could be the reason why the bug was never observed there. > With the buggy sequence more or less resolved to an interaction between > thread_pool_completion_bh() and poll(..., -1), my "tracing" strategy was > to just add a few qemu_clock_get_ns(QEMU_CLOCK_REALTIME) calls, hoping > that the ordering of aio_ctx_prepare(), aio_ctx_dispatch, poll() and > qemu_bh_schedule_fast() would provide some hint. The output was: > > (gdb) p last_prepare > $3 = 103885451 > (gdb) p last_dispatch > $4 = 103876492 > (gdb) p last_poll > $5 = 115909333 > (gdb) p last_schedule > $6 = 115925212 > > Notice how the last call to qemu_poll_ns() came after aio_ctx_dispatch(). > This makes little sense unless there is an aio_poll() call involved, > and indeed with a slightly different instrumentation you can see that > there is one: > > (gdb) p last_prepare > $3 = 107569679 > (gdb) p last_dispatch > $4 = 107561600 > (gdb) p last_aio_poll > $5 = 110671400 > (gdb) p last_schedule > $6 = 110698917 > > So the scenario becomes clearer: > > iothread VCPU thread > -------------------------------------------------------------------------- > aio_ctx_prepare > aio_ctx_check > qemu_poll_ns(timeout=-1) > aio_poll > aio_dispatch > thread_pool_completion_bh > qemu_bh_schedule() > > At this point bh->scheduled = 1 and the iothread has not been woken up. > The solution must be close, but this alone should not be a problem, > because the bottom half is only rescheduled to account for rare situations > (see commit 3c80ca1, thread-pool: avoid deadlock in nested aio_poll() > calls, 2014-07-15). > > Introducing a third thread---a thread pool worker thread, which > also does qemu_bh_schedule()---does bring out the problematic case. > The third thread must be awakened *after* the callback is complete and > thread_pool_completion_bh has redone the whole loop, explaining the > short race window. And then this is what happens: > > thread pool worker > -------------------------------------------------------------------------- > <I/O completes> > qemu_bh_schedule() > > Tada, bh->scheduled is already 1, so qemu_bh_schedule() does nothing > and the iothread is never woken up. This is where the bh->scheduled > optimization comes into play---it is correct, but removing it would > have masked the bug. > > So, what is the bug? > > Well, the question asked by the ctx->dispatching optimization ("is any > active aio_poll dispatching?") was wrong. The right question to ask > instead is "is any active aio_poll *not* dispatching", i.e. in the prepare > or poll phases? In that case, the aio_poll is sleeping or might go to > sleep anytime soon, and the EventNotifier must be invoked to wake > it up. > > In any other case (including if there is *no* active aio_poll at all!) > we can just wait for the next prepare phase to pick up the event (e.g. a > bottom half); the prepare phase will avoid the blocking and service the > bottom half. > > Expressing the invariant with a logic formula, the broken one looked like: > > !(exists(thread): in_dispatching(thread)) => !optimize > > or equivalently: > > !(exists(thread): > in_aio_poll(thread) && in_dispatching(thread)) => !optimize > > In the correct one, the negation is in a slightly different place: > > (exists(thread): > in_aio_poll(thread) && !in_dispatching(thread)) => !optimize > > or equivalently: > > (exists(thread): in_prepare_or_poll(thread)) => !optimize > > Even if the difference boils down to moving an exclamation mark :) > the implementation is quite different. However, I think the new > one is simpler to understand. > > In the old implementation, the "exists" was implemented with a boolean > value. This didn't really support well the case of multiple concurrent > event loops, but I thought was that this was okay: aio_poll holds the > AioContext lock so there cannot be concurrent aio_poll invocations, and > I was just considering nested event loops. However, aio_poll _could_ > indeed be concurrent with the GSource. This is why I came up with the > wrong invariant. > > In the new implementation, "exists" is computed simply by counting how many > threads are in the prepare or poll phases. There are some interesting > points to consider, but the gist of the idea remains: > > 1) AioContext can be used through GSource as well; as mentioned in the > patch, bit 0 of the counter is reserved for the GSource. > > 2) the counter need not be updated for a non-blocking aio_poll, because > it won't sleep forever anyway. This is just a matter of checking > the "blocking" variable. This requires some changes to the win32 > implementation, but is otherwise not too complicated. > > 3) as mentioned above, the new implementation will not call aio_notify > when there is *no* active aio_poll at all. The tests have to be > adjusted for this change. The calls to aio_notify in async.c are fine; > they only want to kick aio_poll out of a blocking wait, but need not > do anything if aio_poll is not running. > > 4) nested aio_poll: these just work with the new implementation; when > a nested event loop is invoked, the outer event loop is never in the > prepare or poll phases. The outer event loop thus has already decremented > the counter. > > Reported-by: Richard W. M. Jones <rjo...@redhat.com> > Reported-by: Laszlo Ersek <ler...@redhat.com> > Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> > --- > aio-posix.c | 19 ++++++------ > aio-win32.c | 33 ++++++++++----------- > async.c | 21 +++++--------- > docs/aio_notify.promela | 77 > +++++++++++++++++++++---------------------------- > include/block/aio.h | 29 +++++++++++++++---- > tests/test-aio.c | 26 +++-------------- > 6 files changed, 91 insertions(+), 114 deletions(-)
Looks good for QEMU 2.4: Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com> I'll merge it for the next -rc unless there is any more discussion.
pgpSm07Pd9rPb.pgp
Description: PGP signature