On Wed, May 17, 2023 at 05:28:34PM +0200, Kevin Wolf wrote: > This tests exercises graph locking, draining, and graph modifications > with AioContext switches a lot. Amongst others, it serves as a > regression test for bdrv_graph_wrlock() deadlocking because it is called > with a locked AioContext and for AioContext handling in the NBD server. > > Signed-off-by: Kevin Wolf <kw...@redhat.com>
I've now confirmed the following setups with just './check graph-changes-while-io -qcow2': patch 3 alone => test fails with wrlock assertion (good, we're catching the 8.0 regression where the new assertion failure is tripping) patch 1 and 3 => test fails differently than patch 3 alone (good, we're exposing the fact that NBD had a pre-existing bug, regardless of whether the added rwlock made it easier to spot) patch 2 and 3 => test passes (good, patch 2 appears to have fixed this particular bug, and when we are ready to revert patch 1 because we get rid of AioContext locking we'll still be okay) patch 1, 2, and 3 => test passes (good, we fixed the NBD bug, and have regression testing in place for a scenario that previously wasn't getting good testing) As such, I'm happy to supply: Tested-by: Eric Blake <ebl...@redhat.com> Now on to the patch itself... > --- > tests/qemu-iotests/iotests.py | 4 ++ > .../qemu-iotests/tests/graph-changes-while-io | 56 +++++++++++++++++-- > .../tests/graph-changes-while-io.out | 4 +- > 3 files changed, 58 insertions(+), 6 deletions(-) > > @@ -84,6 +84,54 @@ class TestGraphChangesWhileIO(QMPTestCase): > > bench_thr.join() > > + def test_commit_while_io(self) -> None: > + # Run qemu-img bench in the background > + bench_thr = Thread(target=do_qemu_img_bench, args=(200000, )) TIL - you can create a 1-item tuple in Python. It caught me off-guard, but makes sense now that I've re-read it. > + > + # While qemu-img bench is running, repeatedly commit overlay to node0 > + while bench_thr.is_alive(): > + result = self.qsd.qmp('block-commit', { > + 'job-id': 'job0', > + 'device': 'overlay', > + }) > + self.assert_qmp(result, 'return', {}) > + > + result = self.qsd.qmp('block-job-cancel', { > + 'device': 'job0', > + }) > + self.assert_qmp(result, 'return', {}) > + > + cancelled = False > + while not cancelled: > + for event in self.qsd.get_qmp().get_events(wait=10.0): The updated test took about 34 seconds on my machine; long enough that it is rightfully not part of './check -g quick', but still reasonable that I had no problems reproducing the issue while the test was running. > + if event['event'] != 'JOB_STATUS_CHANGE': > + continue > + if event['data']['status'] == 'null': > + cancelled = True > + > + bench_thr.join() > + It feels a bit odd that the test is skipped during './check -nbd', yet it IS utilizing nbd, and the fix in patch 2 is indeed in NBD code. But that's not a flaw in the test itself, just a limitation of what images we need in order to set up the NBD service in a way to trigger the problem. Reviewed-by: Eric Blake <ebl...@redhat.com> I'm in a spot where I can quickly queue this through my NBD tree so we can get it backported to 8.0.1; pull request coming up, provided the full series passes a few more tests on my end. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org