Re: test-blockjob: intermittent CI failures in msys2-64bit job

Hanna Czenczek Wed, 26 Apr 2023 01:05:18 -0700

On 26.04.23 09:38, Emanuele Giuseppe Esposito wrote:


Am 25/04/2023 um 18:48 schrieb Hanna Czenczek:

On 24.04.23 20:32, Vladimir Sementsov-Ogievskiy wrote:

On 24.04.23 16:36, Emanuele Giuseppe Esposito wrote:


Am 21/04/2023 um 12:13 schrieb Vladimir Sementsov-Ogievskiy:

On 17.03.23 15:35, Thomas Huth wrote:

On 17/03/2023 11.17, Peter Maydell wrote:

On Mon, 6 Mar 2023 at 11:16, Peter Maydell <peter.mayd...@linaro.org>
wrote:

On Fri, 3 Mar 2023 at 18:36, Peter Maydell
<peter.mayd...@linaro.org> wrote:

I've noticed that test-blockjob seems to fail intermittently
on the msys2-64bit job:

https://gitlab.com/qemu-project/qemu/-/jobs/3872508803
https://gitlab.com/qemu-project/qemu/-/jobs/3871061024
https://gitlab.com/qemu-project/qemu/-/jobs/3865312440

Sample output:
| 53/89
ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby:
assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR
53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3

Here's an intermittent failure from my macos x86 machine:

172/621 qemu:unit / test-blockjob
              ERROR           0.26s   killed by signal 6 SIGABRT

And an intermittent on the freebsd 13 CI job:
https://gitlab.com/qemu-project/qemu/-/jobs/3950667240

MALLOC_PERTURB_=197
G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit
G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit
/tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k

▶ 178/650 /blockjob/ids
              OK
178/650 qemu:unit / test-blockjob
              ERROR           0.31s   killed by signal 6 SIGABRT
――――――――――――――――――――――――――――――――――――― ✀
―――――――――――――――――――――――――――――――――――――
stderr:
Assertion failed: (job->status == JOB_STATUS_STANDBY), function
test_complete_in_standby, file ../tests/unit/test-blockjob.c, line
499.


TAP parsing error: Too few tests run (expected 9, got 1)
(test program exited with status code -6)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

Anybody in the block team looking at these, or shall we just
disable this test entirely ?

I ran into this issue today, too:

    https://gitlab.com/thuth/qemu/-/jobs/3954367101

... if nobody is interested in fixing this test, I think we should
disable it...

    Thomas

I'm looking at this now, and seems that it's broken since
6f592e5aca1a27fe1c1f6 "job.c: enable job lock/unlock and remove
Aiocontext locks", as it stops critical section by new
aio_context_release() call exactly after bdrv_drain_all_and(), so it's
not a surprise that job may start at that moment and the following
assertion fires.

Emanuele could please look at it?

Well if I understood correctly, the only thing that was preventing the
job from continuing was the aiocontext lock held.

The failing assertion even mentions that:
/* Lock the IO thread to prevent the job from being run */
[...]
/* But the job cannot run, so it will remain on standby */
assert(job->status == JOB_STATUS_STANDBY);

Essentially bdrv_drain_all_end() would wake up the job coroutine, but it
would anyways block somewhere after since the aiocontext lock was taken
by the test.

Now that we got rid of aiocontext lock in job code, nothing prevents the
test from resuming.
Mixing job lock and aiocontext acquire/release is not a good idea, and
it would anyways block job_resume() called by bdrv_drain_all_end(), so
the test itself would deadlock.

So unless @Kevin has a better idea, this seems to be just an "hack" to
test stuff that is not possible to do now anymore. What I would suggest
is to get rid of that test, or at least of that assert function. I
unfortunately cannot reproduce the failure, but I think the remaining
functions called by the test should run as expected.

Thanks! I agree. Probably, alternatively we could just expand the
drained section, like

@@ -488,12 +488,6 @@ static void test_complete_in_standby(void)
      bdrv_drain_all_begin();
      assert_job_status_is(job, JOB_STATUS_STANDBY);

-    /* Lock the IO thread to prevent the job from being run */
-    aio_context_acquire(ctx);
-    /* This will schedule the job to resume it */
-    bdrv_drain_all_end();
-    aio_context_release(ctx);
-
      WITH_JOB_LOCK_GUARD() {
          /* But the job cannot run, so it will remain on standby */
          assert(job->status == JOB_STATUS_STANDBY);
@@ -511,6 +505,7 @@ static void test_complete_in_standby(void)
          job_dismiss_locked(&job, &error_abort);
      }

+    bdrv_drain_all_end();
      aio_context_acquire(ctx);
      destroy_blk(blk);
      aio_context_release(ctx);


But, seems that test wanted to specifically test job, that still in
STANDBY exactly after drained section...

[cc: Hanna]

Hanna, it was your test (added in
c2c731a4d35062295cd3260e66b3754588a2fad4, two years ago). Don't you
remember was important to catch STANDBY job *after* a drained section?

I’m not quite sure, but I think the idea was that we basically try to
get as close to something that might come in over QMP.  Over QMP, you
can’t issue a job-complete while keeping everything drained, so I
wouldn’t just extend the drained section.

Getting rid of the assert function also seems pointless.  If we want to
test whether job-complete works on tests in standby, we must put the
test in standby, and verify this.  We can get rid of the test, of
course, but it is a regression test, so it isn’t like it was added just
for fun.  Then again, it’s now already effectively commented out via
environment variable, so it doesn’t seem like a loss in practice to to
fully drop it.

Anyway – the thing I wonder about is, if this is to test whether jobs in
standby can be completed…  Why don’t we just pause the job instead of
going through the context lock hassle?  I.e. just put a job_pause()
right after bdrv_drain_all_begin().

If I’m not mistaken, reproducing the bug in the test seems really simple
by adding a sleep(1) right before WITH_JOB_LOCK_GUARD(); and doing that
works just fine if only you have a job_pause() in the drained section.
(And dropping the aio_context_acquire()/release() calls, because they
don’t do anything anymore.)

Why sleep(1)? I have the patch ready:

diff --git a/tests/unit/test-blockjob.c b/tests/unit/test-blockjob.c
index a130f6fefb..8054b72afa 100644
--- a/tests/unit/test-blockjob.c
+++ b/tests/unit/test-blockjob.c
@@ -488,11 +488,15 @@ static void test_complete_in_standby(void)
      bdrv_drain_all_begin();
      assert_job_status_is(job, JOB_STATUS_STANDBY);

+    /*
+     * Increase pause_count so that the counter is
+     * unbalanced and job won't resume
+     */
+    job_pause(job);
+
      /* Lock the IO thread to prevent the job from being run */
-    aio_context_acquire(ctx);
      /* This will schedule the job to resume it */
      bdrv_drain_all_end();
-    aio_context_release(ctx);

      WITH_JOB_LOCK_GUARD() {
          /* But the job cannot run, so it will remain on standby */

But I don't get why we should sleep additionally.

Yep, that’s the patch. The sleep was just a note on how to get it toreproduce 100 %.


Hanna

Re: test-blockjob: intermittent CI failures in msys2-64bit job

Reply via email to