Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: > On 21.12.23 22:23, Kevin Wolf wrote: > > From: Stefan Hajnoczi<stefa...@redhat.com> > > > > Stop depending on the AioContext lock and instead access > > SCSIDevice->requests from only one thread at a time: > > - When the VM is running only the BlockBackend's AioContext may access > > the requests list. > > - When the VM is stopped only the main loop may access the requests > > list. > > > > These constraints protect the requests list without the need for locking > > in the I/O code path. > > > > Note that multiple IOThreads are not supported yet because the code > > assumes all SCSIRequests are executed from a single AioContext. Leave > > that as future work. > > > > Signed-off-by: Stefan Hajnoczi<stefa...@redhat.com> > > Reviewed-by: Eric Blake<ebl...@redhat.com> > > Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> > > Signed-off-by: Kevin Wolf<kw...@redhat.com> > > --- > > include/hw/scsi/scsi.h | 7 +- > > hw/scsi/scsi-bus.c | 181 ++++++++++++++++++++++++++++------------- > > 2 files changed, 131 insertions(+), 57 deletions(-) > > My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more > often because of this commit than because of the original bug, i.e. when > repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, > this tends to happen when unplugging the scsi-hd: > > {"execute":"device_del","arguments":{"id":"stg0"}} > {"return": {}} > qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: > Assertion `ctx == blk->ctx' failed. > > (gdb) bt > #0 0x00007f32a668d83c in () at /usr/lib/libc.so.6 > #1 0x00007f32a663d668 in raise () at /usr/lib/libc.so.6 > #2 0x00007f32a66254b8 in abort () at /usr/lib/libc.so.6 > #3 0x00007f32a66253dc in () at /usr/lib/libc.so.6 > #4 0x00007f32a6635d26 in () at /usr/lib/libc.so.6 > #5 0x0000556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at > ../block/block-backend.c:2429 > #6 blk_get_aio_context (blk=0x556e6e89ccf0) at > ../block/block-backend.c:2417 > #7 0x0000556e6b112d87 in scsi_device_for_each_req_async_bh > (opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128 > #8 0x0000556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at > ../util/async.c:218 > #9 0x0000556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, > blocking=blocking@entry=true) at ../util/aio-posix.c:722 > #10 0x0000556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920) > at ../iothread.c:63 > #11 0x0000556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at > ../util/qemu-thread-posix.c:541 > #12 0x00007f32a668b9eb in () at /usr/lib/libc.so.6 > #13 0x00007f32a670f7cc in () at /usr/lib/libc.so.6 > > I don’t know anything about the problem yet, but as usual, I like > speculation and discovering how wrong I was later on, so one thing I came > across that’s funny about virtio-scsi is that requests can happen even while > a disk is being attached or detached. That is, Linux seems to probe all > LUNs when a new virtio-scsi device is being attached, and it won’t stop just > because a disk is being attached or removed. So maybe that’s part of the > problem, that we get a request while the BB is being detached, and > temporarily in an inconsistent state (BDS context differs from BB context).
I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. At the first sight I don't see a reason why this would break something, but I've learnt not to trust my first impression with the graph locking work... Of course, I also didn't check if there are more things inside of the device emulation that need additional locking in this case, too. But even if so, blk_get_aio_context() should never see an inconsistent state. Kevin