On Tue, Sep 12, 2023 at 07:10:34PM -0400, Stefan Hajnoczi wrote: > This patch fixes a race condition in test-bdrv-drain that is difficult > to reproduce. test-bdrv-drain sometimes fails without an error message > on the block pull request sent by Kevin Wolf on Sep 4, 2023. I was able > to reproduce it locally and found that "block-backend: process I/O in > the current AioContext" (in this patch series) is the first commit where > it reproduces. > > I do not know why "block-backend: process I/O in the current AioContext" > exposes this bug. It might be related to the fact that the test's preadv > request runs in the main thread instead of IOThread a after my commit.
In reading the commit message before the impacted code, my first thought was that you had a typo of an extra word (that is, something to fix by s/a //), but reading further, a better fix would be calling attention to the fact that you are referencing a specific named thread, as in s/IOThread a/IOThread A/... > That might simply change the timing of the test. > > Now on to the race condition in test-bdrv-drain. The main thread > schedules a BH in IOThread a and then drains the BDS: ...and another spot with the same parse issue... > > aio_bh_schedule_oneshot(ctx_a, test_iothread_main_thread_bh, &data); > > /* The request is running on the IOThread a. Draining its block device ...but here you were quoting from the existing code base, which is where I finally realized it was more than just your commit message. > * will make sure that it has completed as far as the BDS is concerned, > * but the drain in this thread can continue immediately after > * bdrv_dec_in_flight() and aio_ret might be assigned only slightly > * later. */ > do_drain_begin(drain_type, bs); > > If the BH completes before do_drain_begin() then there is nothing to > worry about. > > If the BH invokes bdrv_flush() before do_drain_begin(), then > do_drain_begin() waits for it to complete. > > The problematic case is when do_drain_begin() runs before the BH enters > bdrv_flush(). Then do_drain_begin() misses the BH and the drain > mechanism has failed in quiescing I/O. > > Fix this by incrementing the in_flight counter so that do_drain_begin() > waits for test_iothread_main_thread_bh(). > > Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com> > --- > tests/unit/test-bdrv-drain.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c > index ccc453c29e..67a79aa3f0 100644 > --- a/tests/unit/test-bdrv-drain.c > +++ b/tests/unit/test-bdrv-drain.c > @@ -512,6 +512,7 @@ static void test_iothread_main_thread_bh(void *opaque) > * executed during drain, otherwise this would deadlock. */ > aio_context_acquire(bdrv_get_aio_context(data->bs)); > bdrv_flush(data->bs); > + bdrv_dec_in_flight(data->bs); /* incremented by test_iothread_common() */ > aio_context_release(bdrv_get_aio_context(data->bs)); > } > > @@ -583,6 +584,13 @@ static void test_iothread_common(enum drain_type > drain_type, int drain_thread) > aio_context_acquire(ctx_a); > } > > + /* > + * Increment in_flight so that do_drain_begin() waits for > + * test_iothread_main_thread_bh(). This prevents the race between > + * test_iothread_main_thread_bh() in IOThread a and do_drain_begin() > in > + * this thread. test_iothread_main_thread_bh() decrements in_flight. > + */ > + bdrv_inc_in_flight(bs); > aio_bh_schedule_oneshot(ctx_a, test_iothread_main_thread_bh, &data); > > /* The request is running on the IOThread a. Draining its block > device and indeed, your commit message is consistent with the current code's naming convention. If you have reason to respin, a pre-req patch to change the case before adding more references might be nice, but I won't insist. Reviewed-by: Eric Blake <ebl...@redhat.com> -- Eric Blake, Principal Software Engineer Red Hat, Inc. Virtualization: qemu.org | libguestfs.org