comments below

On 07/15/15 19:13, Paolo Bonzini wrote:
> This patch rewrites the ctx->dispatching optimization, which was the cause
> of some mysterious hangs that could be reproduced on aarch64 KVM only.
> The hangs were indirectly caused by aio_poll() and in particular by
> flash memory updates' call to blk_write(), which invokes aio_poll().
> Fun stuff: they had an extremely short race window, so much so that
> adding any kind of tracing to either the kernel or QEMU made it
> go away (a single printf made it half as reproducible).
>
> On the plus side, the failure mode (a hang until the next keypress)
> made it very easy to examine the state of the process with a debugger.
> And there was a very nice reproducer from Laszlo, which failed pretty
> often (more than half of the time) on any version of QEMU with a non-debug
> kernel; it also failed fast, while still in the firmware.  So, it could
> have been worse.
>
> For some unknown reason they happened only with virtio-scsi, but
> that's not important.  It's more interesting that they disappeared with
> io=native, making thread-pool.c a likely suspect for where the bug arose.
> thread-pool.c is also one of the few places which use bottom halves
> across threads, by the way.
>
> I hope that no other similar bugs exist, but just in case :) I am
> going to describe how the successful debugging went...  Since the
> likely culprit was the ctx->dispatching optimization, which mostly
> affects bottom halves, the first observation was that there are two
> qemu_bh_schedule() invocations in the thread pool: the one in the aio
> worker and the one in thread_pool_completion_bh.  The latter always
> causes the optimization to trigger, the former may or may not.  In
> order to restrict the possibilities, I introduced new functions
> qemu_bh_schedule_slow() and qemu_bh_schedule_fast():
>
>     /* qemu_bh_schedule_slow: */
>     ctx = bh->ctx;
>     bh->idle = 0;
>     if (atomic_xchg(&bh->scheduled, 1) == 0) {
>         event_notifier_set(&ctx->notifier);
>     }
>
>     /* qemu_bh_schedule_fast: */
>     ctx = bh->ctx;
>     bh->idle = 0;
>     assert(ctx->dispatching);
>     atomic_xchg(&bh->scheduled, 1);
>
> Notice how the atomic_xchg is still in qemu_bh_schedule_slow().  This
> was already debated a few months ago, so I assumed it to be correct.
> In retrospect this was a very good idea, as you'll see later.
>
> Changing thread_pool_completion_bh() to qemu_bh_schedule_fast() didn't
> trigger the assertion (as expected).  Changing the worker's invocation
> to qemu_bh_schedule_slow() didn't hide the bug (another assumption
> which luckily held).  This already heavily limited the amount of
> interaction between the threads, hinting that the problematic events
> must have triggered around thread_pool_completion_bh().
>
> As mentioned earlier, invoking a debugger to examine the state of a
> hung process was pretty easy; the iothread was always waiting on a
> poll(..., -1) system call.  Infinite timeouts are much rarer on x86,
> and this could be the reason why the bug was never observed there.
> With the buggy sequence more or less resolved to an interaction between
> thread_pool_completion_bh() and poll(..., -1), my "tracing" strategy was
> to just add a few qemu_clock_get_ns(QEMU_CLOCK_REALTIME) calls, hoping
> that the ordering of aio_ctx_prepare(), aio_ctx_dispatch(), poll() and
> qemu_bh_schedule_fast() would provide some hint.
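(For concreteness, the timestamp "tracing" described above presumably
boiled down to something like the standalone sketch below.  This is a
reconstruction, not the actual debugging patch: now_ns() stands in for
qemu_clock_get_ns(QEMU_CLOCK_REALTIME), and the phase functions are
hypothetical stand-ins for the QEMU functions whose timestamps appear
in the gdb session quoted next.  The point of the technique is that the
globals survive the hang and can simply be read from the debugger.

    /* Build with: cc -std=c11 trace_sketch.c */
    #define _POSIX_C_SOURCE 199309L
    #include <stdint.h>
    #include <time.h>

    /* One global per phase of interest; after the process hangs they
     * can be inspected from gdb, e.g. "p last_prepare". */
    volatile int64_t last_prepare, last_dispatch, last_poll, last_schedule;

    static int64_t now_ns(void)  /* ~ qemu_clock_get_ns(QEMU_CLOCK_REALTIME) */
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    /* Each instrumented phase stamps its global on entry. */
    static void prepare_phase(void)  { last_prepare  = now_ns(); }
    static void dispatch_phase(void) { last_dispatch = now_ns(); }
    static void poll_phase(void)     { last_poll     = now_ns(); }
    static void schedule_bh(void)    { last_schedule = now_ns(); }

    int main(void)
    {
        prepare_phase();
        poll_phase();
        dispatch_phase();
        schedule_bh();
        return 0;
    }
)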
> The output was:
>
>     (gdb) p last_prepare
>     $3 = 103885451
>     (gdb) p last_dispatch
>     $4 = 103876492
>     (gdb) p last_poll
>     $5 = 115909333
>     (gdb) p last_schedule
>     $6 = 115925212
>
> Notice how the last call to qemu_poll_ns() came after aio_ctx_dispatch().
> This makes little sense unless there is an aio_poll() call involved,
> and indeed with a slightly different instrumentation you can see that
> there is one:
>
>     (gdb) p last_prepare
>     $3 = 107569679
>     (gdb) p last_dispatch
>     $4 = 107561600
>     (gdb) p last_aio_poll
>     $5 = 110671400
>     (gdb) p last_schedule
>     $6 = 110698917
>
> So the scenario becomes clearer:
>
>     iothread                            VCPU thread
> --------------------------------------------------------------------------
>     aio_ctx_prepare
>     aio_ctx_check
>     qemu_poll_ns(timeout=-1)
>                                         aio_poll
>                                           aio_dispatch
>                                             thread_pool_completion_bh
>                                               qemu_bh_schedule()
>
> At this point bh->scheduled = 1 and the iothread has not been woken up.
> The solution must be close, but this alone should not be a problem,
> because the bottom half is only rescheduled to account for rare situations
> (see commit 3c80ca1, thread-pool: avoid deadlock in nested aio_poll()
> calls, 2014-07-15).
>
> Introducing a third thread---a thread pool worker thread, which
> also does qemu_bh_schedule()---does bring out the problematic case.
> The third thread must be awakened *after* the callback is complete and
> thread_pool_completion_bh has redone the whole loop, explaining the
> short race window.  And then this is what happens:
>
>     thread pool worker
> --------------------------------------------------------------------------
>     <I/O completes>
>     qemu_bh_schedule()
>
> Tada, bh->scheduled is already 1, so qemu_bh_schedule() does nothing
> and the iothread is never woken up.  This is where the bh->scheduled
> optimization comes into play---it is correct, but removing it would
> have masked the bug.
>
> So, what is the bug?
>
> Well, the question asked by the ctx->dispatching optimization ("is any
> active aio_poll dispatching?") was wrong.  The right question to ask
> instead is "is any active aio_poll *not* dispatching", i.e. in the prepare
> or poll phases?  In that case, the aio_poll is sleeping or might go to
> sleep anytime soon, and the EventNotifier must be invoked to wake
> it up.
>
> In any other case (including if there is *no* active aio_poll at all!)
> we can just wait for the next prepare phase to pick up the event (e.g. a
> bottom half); the prepare phase will avoid the blocking and service the
> bottom half.
>
> Expressing the invariant with a logic formula, the broken one looked like:
>
>     !(exists(thread): in_dispatching(thread)) => !optimize
>
> or equivalently:
>
>     !(exists(thread):
>       in_aio_poll(thread) && in_dispatching(thread)) => !optimize
>
> In the correct one, the negation is in a slightly different place:
>
>     (exists(thread):
>      in_aio_poll(thread) && !in_dispatching(thread)) => !optimize
>
> or equivalently:
>
>     (exists(thread): in_prepare_or_poll(thread)) => !optimize
>
> Even if the difference boils down to moving an exclamation mark :)
> the implementation is quite different.  However, I think the new
> one is simpler to understand.
>
> In the old implementation, the "exists" was implemented with a boolean
> value.  This didn't really support well the case of multiple concurrent
> event loops, but I thought was that this was okay: aio_poll holds the
Absolutely stunning work. I hope others can review it in its full
glory; I don't know enough by far to do that. I'll just make two stupid
comments:

- the first "was" in the last line above is superfluous,

- I'd prefer if the constants passed to the atomic primitives matched
  the type of the "notify_me" field. The constants are "int", and the
  field has type uint32_t. Admittedly, I don't know how GCC builtins
  behave with regard to argument conversion; if they work just like
  functions with visible prototypes, then it should be fine. (Yes,
  this is an irrelevant comment. :))

Also I hope Rich can report back with some test results.

Acked-by: Laszlo Ersek <ler...@redhat.com>

Thank you.
Laszlo

> AioContext lock so there cannot be concurrent aio_poll invocations, and
> I was just considering nested event loops.  However, aio_poll _could_
> indeed be concurrent with the GSource.  This is why I came up with the
> wrong invariant.
>
> In the new implementation, "exists" is computed simply by counting how many
> threads are in the prepare or poll phases.  There are some interesting
> points to consider, but the gist of the idea remains:
>
> 1) AioContext can be used through GSource as well; as mentioned in the
> patch, bit 0 of the counter is reserved for the GSource.
>
> 2) the counter need not be updated for a non-blocking aio_poll, because
> it won't sleep forever anyway.  This is just a matter of checking
> the "blocking" variable.  This requires some changes to the win32
> implementation, but is otherwise not too complicated.
>
> 3) as mentioned above, the new implementation will not call aio_notify
> when there is *no* active aio_poll at all.  The tests have to be
> adjusted for this change.  The calls to aio_notify in async.c are fine;
> they only want to kick aio_poll out of a blocking wait, but need not
> do anything if aio_poll is not running.
>
> 4) nested aio_poll: these just work with the new implementation; when
> a nested event loop is invoked, the outer event loop is never in the
> prepare or poll phases.  The outer event loop thus has already decremented
> the counter.
>
> Reported-by: Richard W. M. Jones <rjo...@redhat.com>
> Reported-by: Laszlo Ersek <ler...@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
>  aio-posix.c             | 19 ++++++------
>  aio-win32.c             | 33 ++++++++++-----------
>  async.c                 | 21 +++++---------
>  docs/aio_notify.promela | 77 +++++++++++++++++++++----------------------
>  include/block/aio.h     | 29 +++++++++++++++----
>  tests/test-aio.c        | 26 +++--------------
>  6 files changed, 91 insertions(+), 114 deletions(-)
>
> diff --git a/aio-posix.c b/aio-posix.c
> index 4abec38..268d14d 100644
> --- a/aio-posix.c
> +++ b/aio-posix.c
> @@ -233,26 +233,23 @@ static void add_pollfd(AioHandler *node)
>  bool aio_poll(AioContext *ctx, bool blocking)
>  {
>      AioHandler *node;
> -    bool was_dispatching;
>      int i, ret;
>      bool progress;
>      int64_t timeout;
>
>      aio_context_acquire(ctx);
> -    was_dispatching = ctx->dispatching;
>      progress = false;
>
>      /* aio_notify can avoid the expensive event_notifier_set if
>       * everything (file descriptors, bottom halves, timers) will
>       * be re-evaluated before the next blocking poll().  This is
>       * already true when aio_poll is called with blocking == false;
> -     * if blocking == true, it is only true after poll() returns.
> -     *
> -     * If we're in a nested event loop, ctx->dispatching might be true.
> -     * In that case we can restore it just before returning, but we
> -     * have to clear it now.
> +     * if blocking == true, it is only true after poll() returns,
> +     * so disable the optimization now.
>       */
> -    aio_set_dispatching(ctx, !blocking);
> +    if (blocking) {
> +        atomic_add(&ctx->notify_me, 2);
> +    }
>
>      ctx->walking_handlers++;
>
> @@ -286,13 +283,15 @@ bool aio_poll(AioContext *ctx, bool blocking)
>      npfd = 0;
>      ctx->walking_handlers--;
>
> +    if (blocking) {
> +        atomic_sub(&ctx->notify_me, 2);
> +    }
> +
>      /* Run dispatch even if there were no readable fds to run timers */
> -    aio_set_dispatching(ctx, true);
>      if (aio_dispatch(ctx)) {
>          progress = true;
>      }
>
> -    aio_set_dispatching(ctx, was_dispatching);
>      aio_context_release(ctx);
>
>      return progress;
> diff --git a/aio-win32.c b/aio-win32.c
> index 233d8f5..ae7c6cf 100644
> --- a/aio-win32.c
> +++ b/aio-win32.c
> @@ -279,30 +279,25 @@ bool aio_poll(AioContext *ctx, bool blocking)
>  {
>      AioHandler *node;
>      HANDLE events[MAXIMUM_WAIT_OBJECTS + 1];
> -    bool was_dispatching, progress, have_select_revents, first;
> +    bool progress, have_select_revents, first;
>      int count;
>      int timeout;
>
>      aio_context_acquire(ctx);
> -    have_select_revents = aio_prepare(ctx);
> -    if (have_select_revents) {
> -        blocking = false;
> -    }
> -
> -    was_dispatching = ctx->dispatching;
>      progress = false;
>
>      /* aio_notify can avoid the expensive event_notifier_set if
>       * everything (file descriptors, bottom halves, timers) will
>       * be re-evaluated before the next blocking poll().  This is
>       * already true when aio_poll is called with blocking == false;
> -     * if blocking == true, it is only true after poll() returns.
> -     *
> -     * If we're in a nested event loop, ctx->dispatching might be true.
> -     * In that case we can restore it just before returning, but we
> -     * have to clear it now.
> +     * if blocking == true, it is only true after poll() returns,
> +     * so disable the optimization now.
>       */
> -    aio_set_dispatching(ctx, !blocking);
> +    if (blocking) {
> +        atomic_add(&ctx->notify_me, 2);
> +    }
> +
> +    have_select_revents = aio_prepare(ctx);
>
>      ctx->walking_handlers++;
>
> @@ -318,11 +313,11 @@ bool aio_poll(AioContext *ctx, bool blocking)
>      first = true;
>
>      /* wait until next event */
> -    while (count > 0) {
> +    do {
>          HANDLE event;
>          int ret;
>
> -        timeout = blocking
> +        timeout = blocking && !have_select_revents
>              ? qemu_timeout_ns_to_ms(aio_compute_timeout(ctx)) : 0;
>          if (timeout) {
>              aio_context_release(ctx);
> @@ -331,7 +326,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
>          if (timeout) {
>              aio_context_acquire(ctx);
>          }
> -        aio_set_dispatching(ctx, true);
> +        if (blocking) {
> +            assert(first);
> +            atomic_sub(&ctx->notify_me, 2);
> +        }
>
>          if (first && aio_bh_poll(ctx)) {
>              progress = true;
> @@ -351,11 +349,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
>              blocking = false;
>
>              progress |= aio_dispatch_handlers(ctx, event);
> -        }
> +        } while (count > 0);
>
>      progress |= timerlistgroup_run_timers(&ctx->tlg);
>
> -    aio_set_dispatching(ctx, was_dispatching);
>      aio_context_release(ctx);
>      return progress;
>  }
> diff --git a/async.c b/async.c
> index 77d080d..a232192 100644
> --- a/async.c
> +++ b/async.c
> @@ -184,6 +184,8 @@ aio_ctx_prepare(GSource *source, gint *timeout)
>  {
>      AioContext *ctx = (AioContext *) source;
>
> +    atomic_or(&ctx->notify_me, 1);
> +
>      /* We assume there is no timeout already supplied */
>      *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
>
> @@ -200,6 +202,7 @@ aio_ctx_check(GSource *source)
>      AioContext *ctx = (AioContext *) source;
>      QEMUBH *bh;
>
> +    atomic_and(&ctx->notify_me, ~1);
>      for (bh = ctx->first_bh; bh; bh = bh->next) {
>          if (!bh->deleted && bh->scheduled) {
>              return true;
> @@ -254,23 +257,13 @@ ThreadPool *aio_get_thread_pool(AioContext *ctx)
>      return ctx->thread_pool;
>  }
>
> -void aio_set_dispatching(AioContext *ctx, bool dispatching)
> -{
> -    ctx->dispatching = dispatching;
> -    if (!dispatching) {
> -        /* Write ctx->dispatching before reading e.g. bh->scheduled.
> -         * Optimization: this is only needed when we're entering the "unsafe"
> -         * phase where other threads must call event_notifier_set.
> -         */
> -        smp_mb();
> -    }
> -}
> -
>  void aio_notify(AioContext *ctx)
>  {
> -    /* Write e.g. bh->scheduled before reading ctx->dispatching. */
> +    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> +     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> +     */
>      smp_mb();
> -    if (!ctx->dispatching) {
> +    if (ctx->notify_me) {
>          event_notifier_set(&ctx->notifier);
>      }
>  }
> diff --git a/docs/aio_notify.promela b/docs/aio_notify.promela
> index ad3f6f0..fccc7ee 100644
> --- a/docs/aio_notify.promela
> +++ b/docs/aio_notify.promela
> @@ -1,5 +1,5 @@
>  /*
> - * This model describes the interaction between aio_set_dispatching()
> + * This model describes the interaction between ctx->notify_me
>   * and aio_notify().
>   *
>   * Author: Paolo Bonzini <pbonz...@redhat.com>
> @@ -14,57 +14,53 @@
>   *     spin -a docs/aio_notify.promela
>   *     gcc -O2 pan.c
>   *     ./a.out -a
> + *
> + * To verify it (with a bug planted in the model):
> + *     spin -a -DBUG docs/aio_notify.promela
> + *     gcc -O2 pan.c
> + *     ./a.out -a
>   */
>
>  #define MAX   4
>  #define LAST  (1 << (MAX - 1))
>  #define FINAL ((LAST << 1) - 1)
>
> -bool dispatching;
> +bool notify_me;
>  bool event;
>
> -int req, done;
> +int req;
> +int done;
>
>  active proctype waiter()
>  {
> -    int fetch, blocking;
> +    int fetch;
>
> -    do
> -       :: done != FINAL -> {
> -           // Computing "blocking" is separate from execution of the
> -           // "bottom half"
> -           blocking = (req == 0);
> -
> -           // This is our "bottom half"
> -           atomic { fetch = req; req = 0; }
> -           done = done | fetch;
> -
> -           // Wait for a nudge from the other side
> -           do
> -              :: event == 1 -> { event = 0; break; }
> -              :: !blocking  -> break;
> -           od;
> +    do
> +       :: true -> {
> +           notify_me++;
>
> -           dispatching = 1;
> +           if
> +#ifndef BUG
> +               :: (req > 0) -> skip;
> +#endif
> +               :: else ->
> +                   // Wait for a nudge from the other side
> +                   do
> +                      :: event == 1 -> { event = 0; break; }
> +                   od;
> +           fi;
>
> -           // If you are simulating this model, you may want to add
> -           // something like this here:
> -           //
> -           //     int foo; foo++; foo++; foo++;
> -           //
> -           // This only wastes some time and makes it more likely
> -           // that the notifier process hits the "fast path".
> +           notify_me--;
>
> -           dispatching = 0;
> +           atomic { fetch = req; req = 0; }
> +           done = done | fetch;
>         }
> -       :: else -> break;
>     od
>  }
>
>  active proctype notifier()
>  {
>      int next = 1;
> -    int sets = 0;
>
>      do
>         :: next <= LAST -> {
> @@ -74,8 +70,8 @@ active proctype notifier()
>
>             // aio_notify
>             if
> -              :: dispatching == 0 -> sets++; event = 1;
> -              :: else             -> skip;
> +              :: notify_me == 1 -> event = 1;
> +              :: else -> printf("Skipped event_notifier_set\n"); skip;
>             fi;
>
>             // Test both synchronous and asynchronous delivery
> @@ -86,19 +82,12 @@ active proctype notifier()
>            :: 1 -> skip;
>         fi;
>     }
> -       :: else -> break;
>      od;
> -    printf("Skipped %d event_notifier_set\n", MAX - sets);
>  }
>
> -#define p (done == FINAL)
> -
> -never {
> -    do
> -      :: 1                      // after an arbitrarily long prefix
> -      :: p -> break             // p becomes true
> -    od;
> -    do
> -      :: !p -> accept: break    // it then must remains true forever after
> -    od
> +never { /* [] done < FINAL */
> +accept_init:
> +    do
> +    :: done < FINAL -> skip;
> +    od;
>  }
> diff --git a/include/block/aio.h b/include/block/aio.h
> index b46103e..be91e3f 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -63,10 +63,30 @@ struct AioContext {
>       */
>      int walking_handlers;
>
> -    /* Used to avoid unnecessary event_notifier_set calls in aio_notify.
> -     * Writes protected by lock or BQL, reads are lockless.
> +    /* Used to avoid unnecessary event_notifier_set calls in aio_notify;
> +     * accessed with atomic primitives.  If this field is 0, everything
> +     * (file descriptors, bottom halves, timers) will be re-evaluated
> +     * before the next blocking poll(), thus the event_notifier_set call
> +     * can be skipped.  If it is non-zero, you may need to wake up a
> +     * concurrent aio_poll or the glib main event loop, making
> +     * event_notifier_set necessary.
> +     *
> +     * Bit 0 is reserved for GSource usage of the AioContext, and is 1
> +     * between a call to aio_ctx_check and the next call to aio_ctx_dispatch.
> +     * Bits 1-31 simply count the number of active calls to aio_poll
> +     * that are in the prepare or poll phase.
> +     *
> +     * The GSource and aio_poll must use a different mechanism because
> +     * there is no certainty that a call to GSource's prepare callback
> +     * (via g_main_context_prepare) is indeed followed by check and
> +     * dispatch.  It's not clear whether this would be a bug, but let's
> +     * play safe and allow it---it will just cause extra calls to
> +     * event_notifier_set until the next call to dispatch.
> +     *
> +     * Instead, the aio_poll calls include both the prepare and the
> +     * dispatch phase, hence a simple counter is enough for them.
>       */
> -    bool dispatching;
> +    uint32_t notify_me;
>
>      /* lock to protect between bh's adders and deleter */
>      QemuMutex bh_lock;
> @@ -89,9 +109,6 @@ struct AioContext {
>      QEMUTimerListGroup tlg;
>  };
>
> -/* Used internally to synchronize aio_poll against qemu_bh_schedule.  */
> -void aio_set_dispatching(AioContext *ctx, bool dispatching);
> -
>  /**
>   * aio_context_new: Allocate a new AioContext.
>   *
> diff --git a/tests/test-aio.c b/tests/test-aio.c
> index a7cb5c9..217e337 100644
> --- a/tests/test-aio.c
> +++ b/tests/test-aio.c
> @@ -97,14 +97,6 @@ static void event_ready_cb(EventNotifier *e)
>
>  /* Tests using aio_*.  */
>
> -static void test_notify(void)
> -{
> -    g_assert(!aio_poll(ctx, false));
> -    aio_notify(ctx);
> -    g_assert(!aio_poll(ctx, true));
> -    g_assert(!aio_poll(ctx, false));
> -}
> -
>  typedef struct {
>      QemuMutex start_lock;
>      bool thread_acquired;
> @@ -331,7 +323,7 @@ static void test_wait_event_notifier(void)
>      EventNotifierTestData data = { .n = 0, .active = 1 };
>      event_notifier_init(&data.e, false);
>      aio_set_event_notifier(ctx, &data.e, event_ready_cb);
> -    g_assert(!aio_poll(ctx, false));
> +    while (aio_poll(ctx, false));
>      g_assert_cmpint(data.n, ==, 0);
>      g_assert_cmpint(data.active, ==, 1);
>
> @@ -356,7 +348,7 @@ static void test_flush_event_notifier(void)
>      EventNotifierTestData data = { .n = 0, .active = 10, .auto_set = true };
>      event_notifier_init(&data.e, false);
>      aio_set_event_notifier(ctx, &data.e, event_ready_cb);
> -    g_assert(!aio_poll(ctx, false));
> +    while (aio_poll(ctx, false));
>      g_assert_cmpint(data.n, ==, 0);
>      g_assert_cmpint(data.active, ==, 10);
>
> @@ -494,14 +486,6 @@ static void test_timer_schedule(void)
>   * works well, and that's what I am using.
>   */
>
> -static void test_source_notify(void)
> -{
> -    while (g_main_context_iteration(NULL, false));
> -    aio_notify(ctx);
> -    g_assert(g_main_context_iteration(NULL, true));
> -    g_assert(!g_main_context_iteration(NULL, false));
> -}
> -
>  static void test_source_flush(void)
>  {
>      g_assert(!g_main_context_iteration(NULL, false));
> @@ -669,7 +653,7 @@ static void test_source_wait_event_notifier(void)
>      EventNotifierTestData data = { .n = 0, .active = 1 };
>      event_notifier_init(&data.e, false);
>      aio_set_event_notifier(ctx, &data.e, event_ready_cb);
> -    g_assert(g_main_context_iteration(NULL, false));
> +    while (g_main_context_iteration(NULL, false));
>      g_assert_cmpint(data.n, ==, 0);
>      g_assert_cmpint(data.active, ==, 1);
>
> @@ -694,7 +678,7 @@ static void test_source_flush_event_notifier(void)
>      EventNotifierTestData data = { .n = 0, .active = 10, .auto_set = true };
>      event_notifier_init(&data.e, false);
>      aio_set_event_notifier(ctx, &data.e, event_ready_cb);
> -    g_assert(g_main_context_iteration(NULL, false));
> +    while (g_main_context_iteration(NULL, false));
>      g_assert_cmpint(data.n, ==, 0);
>      g_assert_cmpint(data.active, ==, 10);
>
> @@ -830,7 +814,6 @@ int main(int argc, char **argv)
>      while (g_main_context_iteration(NULL, false));
>
>      g_test_init(&argc, &argv, NULL);
> -    g_test_add_func("/aio/notify", test_notify);
>      g_test_add_func("/aio/acquire", test_acquire);
>      g_test_add_func("/aio/bh/schedule", test_bh_schedule);
>      g_test_add_func("/aio/bh/schedule10", test_bh_schedule10);
> @@ -845,7 +828,6 @@ int main(int argc, char **argv)
>      g_test_add_func("/aio/event/flush", test_flush_event_notifier);
>      g_test_add_func("/aio/timer/schedule", test_timer_schedule);
>
> -    g_test_add_func("/aio-gsource/notify", test_source_notify);
>      g_test_add_func("/aio-gsource/flush", test_source_flush);
>      g_test_add_func("/aio-gsource/bh/schedule", test_source_bh_schedule);
>      g_test_add_func("/aio-gsource/bh/schedule10", test_source_bh_schedule10);
>
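As a closing illustration (not part of the patch): below is a minimal,
self-contained model of the notify_me handshake, written with C11
atomics and pthreads.  All names are invented for illustration; this is
a sketch of the idea in the patch above, not QEMU code.  The producer
plays the role of qemu_bh_schedule() plus aio_notify(); the consumer
plays a blocking aio_poll() that advertises its prepare/poll phase by
adding 2 to notify_me.  Under sequentially consistent atomics, either
the producer observes notify_me != 0 and sets the event, or the
consumer observes bh_scheduled and skips the blocking wait, so the lost
wakeup from the trace above cannot happen.  In QEMU itself the same
pairing is provided by the smp_mb() in aio_notify() against the
atomic_add/atomic_or on the sleeping side, as the patch's comments
note.

    /* Build with: cc -std=c11 -pthread notify_me_model.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_uint notify_me;     /* counts sleepers in prepare/poll */
    static atomic_bool event;         /* stands in for the EventNotifier */
    static atomic_bool bh_scheduled;  /* stands in for bh->scheduled */

    /* Producer side: qemu_bh_schedule() + aio_notify(). */
    static void schedule_bh(void)
    {
        /* The seq-cst exchange orders the write to bh_scheduled before
         * the read of notify_me, like the smp_mb() in aio_notify(). */
        if (!atomic_exchange(&bh_scheduled, true)) {
            if (atomic_load(&notify_me)) {
                /* Someone is in (or entering) a blocking wait: wake it. */
                atomic_store(&event, true);
            }
            /* Otherwise nobody is in prepare/poll; the next prepare
             * phase will see bh_scheduled and refuse to block. */
        }
    }

    /* Consumer side: one blocking aio_poll() iteration. */
    static void poll_once(void)
    {
        atomic_fetch_add(&notify_me, 2);        /* enter prepare phase */
        if (!atomic_load(&bh_scheduled)) {
            while (!atomic_load(&event)) {      /* "poll(..., -1)" */
                /* spin; a real implementation would sleep */
            }
        }
        atomic_store(&event, false);            /* reset the notifier */
        atomic_fetch_sub(&notify_me, 2);        /* leave poll phase */

        if (atomic_exchange(&bh_scheduled, false)) {
            puts("bottom half ran");            /* dispatch phase */
        }
    }

    static void *producer(void *arg)
    {
        (void)arg;
        schedule_bh();
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        poll_once();
        pthread_join(t, NULL);
        return 0;
    }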