Re: [Qemu-devel] [PATCH] AioContext: fix broken ctx->dispatching optimization

Stefan Hajnoczi Thu, 16 Jul 2015 03:18:38 -0700

On Wed, Jul 15, 2015 at 07:13:23PM +0200, Paolo Bonzini wrote:
> This patch rewrites the ctx->dispatching optimization, which was the cause
> of some mysterious hangs that could be reproduced on aarch64 KVM only.
> The hangs were indirectly caused by aio_poll() and in particular by
> flash memory updates's call to blk_write(), which invokes aio_poll().
> Fun stuff: they had an extremely short race window, so much that
> adding all kind of tracing to either the kernel or QEMU made it
> go away (a single printf made it half as reproducible).
> 
> On the plus side, the failure mode (a hang until the next keypress)
> made it very easy to examine the state of the process with a debugger.
> And there was a very nice reproducer from Laszlo, which failed pretty
> often (more than half of the time) on any version of QEMU with a non-debug
> kernel; it also failed fast, while still in the firmware.  So, it could
> have been worse.
> 
> For some unknown reason they happened only with virtio-scsi, but
> that's not important.  It's more interesting that they disappeared with
> io=native, making thread-pool.c a likely suspect for where the bug arose.
> thread-pool.c is also one of the few places which use bottom halves
> across threads, by the way.
> 
> I hope that no other similar bugs exist, but just in case :) I am
> going to describe how the successful debugging went...  Since the
> likely culprit was the ctx->dispatching optimization, which mostly
> affects bottom halves, the first observation was that there are two
> qemu_bh_schedule() invocations in the thread pool: the one in the aio
> worker and the one in thread_pool_completion_bh.  The latter always
> causes the optimization to trigger, the former may or may not.  In
> order to restrict the possibilities, I introduced new functions
> qemu_bh_schedule_slow() and qemu_bh_schedule_fast():
> 
>      /* qemu_bh_schedule_slow: */
>      ctx = bh->ctx;
>      bh->idle = 0;
>      if (atomic_xchg(&bh->scheduled, 1) == 0) {
>          event_notifier_set(&ctx->notifier);
>      }
> 
>      /* qemu_bh_schedule_fast: */
>      ctx = bh->ctx;
>      bh->idle = 0;
>      assert(ctx->dispatching);
>      atomic_xchg(&bh->scheduled, 1);
> 
> Notice how the atomic_xchg is still in qemu_bh_schedule_slow().  This
> was already debated a few months ago, so I assumed it to be correct.
> In retrospect this was a very good idea, as you'll see later.
> 
> Changing thread_pool_completion_bh() to qemu_bh_schedule_fast() didn't
> trigger the assertion (as expected).  Changing the worker's invocation
> to qemu_bh_schedule_slow() didn't hide the bug (another assumption
> which luckily held).  This already limited heavily the amount of
> interaction between the threads, hinting that the problematic events
> must have triggered around thread_pool_completion_bh().
> 
> As mentioned early, invoking a debugger to examine the state of a
> hung process was pretty easy; the iothread was always waiting on a
> poll(..., -1) system call.  Infinite timeouts are much rarer on x86,
> and this could be the reason why the bug was never observed there.
> With the buggy sequence more or less resolved to an interaction between
> thread_pool_completion_bh() and poll(..., -1), my "tracing" strategy was
> to just add a few qemu_clock_get_ns(QEMU_CLOCK_REALTIME) calls, hoping
> that the ordering of aio_ctx_prepare(), aio_ctx_dispatch, poll() and
> qemu_bh_schedule_fast() would provide some hint.  The output was:
> 
>     (gdb) p last_prepare
>     $3 = 103885451
>     (gdb) p last_dispatch
>     $4 = 103876492
>     (gdb) p last_poll
>     $5 = 115909333
>     (gdb) p last_schedule
>     $6 = 115925212
> 
> Notice how the last call to qemu_poll_ns() came after aio_ctx_dispatch().
> This makes little sense unless there is an aio_poll() call involved,
> and indeed with a slightly different instrumentation you can see that
> there is one:
> 
>     (gdb) p last_prepare
>     $3 = 107569679
>     (gdb) p last_dispatch
>     $4 = 107561600
>     (gdb) p last_aio_poll
>     $5 = 110671400
>     (gdb) p last_schedule
>     $6 = 110698917
> 
> So the scenario becomes clearer:
> 
>    iothread                   VCPU thread
> --------------------------------------------------------------------------
>    aio_ctx_prepare
>    aio_ctx_check
>    qemu_poll_ns(timeout=-1)
>                               aio_poll
>                                 aio_dispatch
>                                   thread_pool_completion_bh
>                                     qemu_bh_schedule()
> 
> At this point bh->scheduled = 1 and the iothread has not been woken up.
> The solution must be close, but this alone should not be a problem,
> because the bottom half is only rescheduled to account for rare situations
> (see commit 3c80ca1, thread-pool: avoid deadlock in nested aio_poll()
> calls, 2014-07-15).
> 
> Introducing a third thread---a thread pool worker thread, which
> also does qemu_bh_schedule()---does bring out the problematic case.
> The third thread must be awakened *after* the callback is complete and
> thread_pool_completion_bh has redone the whole loop, explaining the
> short race window.  And then this is what happens:
> 
>                                                       thread pool worker
> --------------------------------------------------------------------------
>                                                       <I/O completes>
>                                                       qemu_bh_schedule()
> 
> Tada, bh->scheduled is already 1, so qemu_bh_schedule() does nothing
> and the iothread is never woken up.  This is where the bh->scheduled
> optimization comes into play---it is correct, but removing it would
> have masked the bug.
> 
> So, what is the bug?
> 
> Well, the question asked by the ctx->dispatching optimization ("is any
> active aio_poll dispatching?") was wrong.  The right question to ask
> instead is "is any active aio_poll *not* dispatching", i.e. in the prepare
> or poll phases?  In that case, the aio_poll is sleeping or might go to
> sleep anytime soon, and the EventNotifier must be invoked to wake
> it up.
> 
> In any other case (including if there is *no* active aio_poll at all!)
> we can just wait for the next prepare phase to pick up the event (e.g. a
> bottom half); the prepare phase will avoid the blocking and service the
> bottom half.
> 
> Expressing the invariant with a logic formula, the broken one looked like:
> 
>    !(exists(thread): in_dispatching(thread)) => !optimize
> 
> or equivalently:
> 
>    !(exists(thread):
>           in_aio_poll(thread) && in_dispatching(thread)) => !optimize
> 
> In the correct one, the negation is in a slightly different place:
> 
>    (exists(thread):
>          in_aio_poll(thread) && !in_dispatching(thread)) => !optimize
> 
> or equivalently:
> 
>    (exists(thread): in_prepare_or_poll(thread)) => !optimize
> 
> Even if the difference boils down to moving an exclamation mark :)
> the implementation is quite different.  However, I think the new
> one is simpler to understand.
> 
> In the old implementation, the "exists" was implemented with a boolean
> value.  This didn't really support well the case of multiple concurrent
> event loops, but I thought was that this was okay: aio_poll holds the
> AioContext lock so there cannot be concurrent aio_poll invocations, and
> I was just considering nested event loops.  However, aio_poll _could_
> indeed be concurrent with the GSource.  This is why I came up with the
> wrong invariant.
> 
> In the new implementation, "exists" is computed simply by counting how many
> threads are in the prepare or poll phases.  There are some interesting
> points to consider, but the gist of the idea remains:
> 
> 1) AioContext can be used through GSource as well; as mentioned in the
> patch, bit 0 of the counter is reserved for the GSource.
> 
> 2) the counter need not be updated for a non-blocking aio_poll, because
> it won't sleep forever anyway.  This is just a matter of checking
> the "blocking" variable.  This requires some changes to the win32
> implementation, but is otherwise not too complicated.
> 
> 3) as mentioned above, the new implementation will not call aio_notify
> when there is *no* active aio_poll at all.  The tests have to be
> adjusted for this change.  The calls to aio_notify in async.c are fine;
> they only want to kick aio_poll out of a blocking wait, but need not
> do anything if aio_poll is not running.
> 
> 4) nested aio_poll: these just work with the new implementation; when
> a nested event loop is invoked, the outer event loop is never in the
> prepare or poll phases.  The outer event loop thus has already decremented
> the counter.
> 
> Reported-by: Richard W. M. Jones <rjo...@redhat.com>
> Reported-by: Laszlo Ersek <ler...@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
>  aio-posix.c             | 19 ++++++------
>  aio-win32.c             | 33 ++++++++++-----------
>  async.c                 | 21 +++++---------
>  docs/aio_notify.promela | 77 
> +++++++++++++++++++++----------------------------
>  include/block/aio.h     | 29 +++++++++++++++----
>  tests/test-aio.c        | 26 +++--------------
>  6 files changed, 91 insertions(+), 114 deletions(-)


Looks good for QEMU 2.4:

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>

I'll merge it for the next -rc unless there is any more discussion.

pgpSm07Pd9rPb.pgp
Description: PGP signature

Re: [Qemu-devel] [PATCH] AioContext: fix broken ctx->dispatching optimization

Reply via email to