On Wed, Jun 24, 2020 at 06:16:05PM +0200, Sergio Lopez wrote: > On Tue, Jun 23, 2020 at 03:24:54PM +0100, Stefan Hajnoczi wrote: > > On Mon, Jun 22, 2020 at 04:16:04PM +0200, Sergio Lopez wrote: > > > On Fri, Jun 19, 2020 at 02:04:06PM +0100, Stefan Hajnoczi wrote: > > > > On Thu, Jun 11, 2020 at 10:36:22AM +0200, Sergio Lopez wrote: > > > > > Hi, > > > > > > > > > > While debugging BZ#1844343, I managed to reproduce the issue which > > > > > leads to crash with a backtrace like this one: > > > > > > > > > > <---- snip ----> > > > > > Thread 2 (Thread 0x7fe208463f00 (LWP 1659571)): > > > > > #0 0x00007fe2033b78ed in __lll_lock_wait () at /lib64/libpthread.so.0 > > > > > #1 0x00007fe2033b0bd4 in pthread_mutex_lock () at > > > > > /lib64/libpthread.so.0 > > > > > #2 0x0000560caa8f1e6d in qemu_mutex_lock_impl > > > > > (mutex=0x560cacc68a10, file=0x560caaa9797f "util/async.c", > > > > > line=521) at util/qemu-thread-posix.c:78 > > > > > #3 0x0000560caa82414d in bdrv_set_aio_context_ignore > > > > > (bs=bs@entry=0x560cacc73570, > > > > > new_context=new_context@entry=0x560cacc5fed0, > > > > > ignore=ignore@entry=0x7ffe388b1cc0) at block.c:6192 > > > > > #4 0x0000560caa824503 in bdrv_child_try_set_aio_context > > > > > (bs=bs@entry=0x560cacc73570, ctx=0x560cacc5fed0, > > > > > ignore_child=<optimized out>, errp=<optimized out>) > > > > > at block.c:6272 > > > > > #5 0x0000560caa859e6b in blk_do_set_aio_context > > > > > (blk=0x560cacecf370, new_context=0x560cacc5fed0, > > > > > update_root_node=update_root_node@entry=true, errp=errp@entry=0x0) at > > > > > block/block-backend.c:1989 > > > > > #6 0x0000560caa85c501 in blk_set_aio_context > > > > > (blk=<optimized out>, new_context=<optimized out>, > > > > > errp=errp@entry=0x0) at block/block-backend.c:2010 > > > > > #7 0x0000560caa61db30 in virtio_scsi_hotunplug > > > > > (hotplug_dev=0x560cadaafbd0, dev=0x560cacec1210, > > > > > errp=0x7ffe388b1d80) > > > > > at > > > > > /usr/src/debug/qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64/hw/scsi/virtio-scsi.c:869 > > > > > #8 0x0000560caa6ccd1e in qdev_unplug (dev=0x560cacec1210, > > > > > errp=errp@entry=0x7ffe388b1db8) > > > > > at qdev-monitor.c:872 > > > > > #9 0x0000560caa6ccd9e in qmp_device_del (id=<optimized out>, > > > > > errp=errp@entry=0x7ffe388b1db8) > > > > > at qdev-monitor.c:884 > > > > > #10 0x0000560caa7ec4d3 in qmp_marshal_device_del > > > > > (args=<optimized out>, ret=<optimized out>, errp=0x7ffe388b1e18) > > > > > at qapi/qapi-commands-qdev.c:99 > > > > > #11 0x0000560caa8a45ec in do_qmp_dispatch > > > > > (errp=0x7ffe388b1e10, allow_oob=<optimized out>, > > > > > request=<optimized out>, cmds=0x560cab1928a0 <qmp_commands>) at > > > > > qapi/qmp-dispatch.c:132 > > > > > #12 0x0000560caa8a45ec in qmp_dispatch > > > > > (cmds=0x560cab1928a0 <qmp_commands>, request=<optimized out>, > > > > > allow_oob=<optimized out>) > > > > > at qapi/qmp-dispatch.c:175 > > > > > #13 0x0000560caa7c2521 in monitor_qmp_dispatch (mon=0x560cacca2f00, > > > > > req=<optimized out>) > > > > > at monitor/qmp.c:145 > > > > > #14 0x0000560caa7c2bba in monitor_qmp_bh_dispatcher (data=<optimized > > > > > out>) at monitor/qmp.c:234 > > > > > #15 0x0000560caa8ec716 in aio_bh_call (bh=0x560cacbd80e0) at > > > > > util/async.c:117 > > > > > #16 0x0000560caa8ec716 in aio_bh_poll (ctx=ctx@entry=0x560cacbd6da0) > > > > > at util/async.c:117 > > > > > #17 0x0000560caa8efb04 in aio_dispatch (ctx=0x560cacbd6da0) at > > > > > util/aio-posix.c:459 > > > > > #18 0x0000560caa8ec5f2 in aio_ctx_dispatch > > > > > 
(source=<optimized out>, callback=<optimized out>, > > > > > user_data=<optimized out>) at util/async.c:260 > > > > > #19 0x00007fe2078d167d in g_main_context_dispatch () at > > > > > /lib64/libglib-2.0.so.0 > > > > > #20 0x0000560caa8eebb8 in glib_pollfds_poll () at util/main-loop.c:219 > > > > > #21 0x0000560caa8eebb8 in os_host_main_loop_wait (timeout=<optimized > > > > > out>) at util/main-loop.c:242 > > > > > #22 0x0000560caa8eebb8 in main_loop_wait (nonblocking=<optimized > > > > > out>) at util/main-loop.c:518 > > > > > #23 0x0000560caa6cfe51 in main_loop () at vl.c:1828 > > > > > #24 0x0000560caa57b322 in main (argc=<optimized out>, argv=<optimized > > > > > out>, envp=<optimized out>) > > > > > at vl.c:4504 > > > > > > > > > > Thread 1 (Thread 0x7fe1fb059700 (LWP 1659573)): > > > > > #0 0x00007fe20301b70f in raise () at /lib64/libc.so.6 > > > > > #1 0x00007fe203005b25 in abort () at /lib64/libc.so.6 > > > > > #2 0x00007fe2030059f9 in _nl_load_domain.cold.0 () at > > > > > /lib64/libc.so.6 > > > > > #3 0x00007fe203013cc6 in .annobin_assert.c_end () at /lib64/libc.so.6 > > > > > #4 0x0000560caa85bfe4 in blk_get_aio_context (blk=0x560cacecf370) at > > > > > block/block-backend.c:1968 > > > > > #5 0x0000560caa85bfe4 in blk_get_aio_context (blk=0x560cacecf370) at > > > > > block/block-backend.c:1962 > > > > > #6 0x0000560caa61d79c in virtio_scsi_ctx_check (s=0x560cadaafbd0, > > > > > s=0x560cadaafbd0, d=0x560cacec1210) > > > > > at > > > > > /usr/src/debug/qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64/hw/scsi/virtio-scsi.c:250 > > > > > #7 0x0000560caa61d79c in virtio_scsi_handle_cmd_req_prepare > > > > > (req=0x7fe1ec013880, s=0x560cadaafbd0) > > > > > at > > > > > /usr/src/debug/qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64/hw/scsi/virtio-scsi.c:569 > > > > > #8 0x0000560caa61d79c in virtio_scsi_handle_cmd_vq > > > > > (s=s@entry=0x560cadaafbd0, vq=vq@entry=0x7fe1f82ac140) > > > > > at > > > > > /usr/src/debug/qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64/hw/scsi/virtio-scsi.c:612 > > > > > #9 0x0000560caa61e48e in virtio_scsi_data_plane_handle_cmd > > > > > (vdev=<optimized out>, vq=0x7fe1f82ac140) > > > > > at > > > > > /usr/src/debug/qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64/hw/scsi/virtio-scsi-dataplane.c:60 > > > > > #10 0x0000560caa62bfbe in virtio_queue_notify_aio_vq (vq=<optimized > > > > > out>) > > > > > at > > > > > /usr/src/debug/qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64/hw/virtio/virtio.c:2243 > > > > > #11 0x0000560caa8ef046 in run_poll_handlers_once > > > > > (ctx=ctx@entry=0x560cacc689b0, > > > > > timeout=timeout@entry=0x7fe1fb058658) at util/aio-posix.c:517 > > > > > #12 0x0000560caa8efbc5 in try_poll_mode (timeout=0x7fe1fb058658, > > > > > ctx=0x560cacc689b0) > > > > > at util/aio-posix.c:607 > > > > > #13 0x0000560caa8efbc5 in aio_poll (ctx=0x560cacc689b0, > > > > > blocking=blocking@entry=true) > > > > > at util/aio-posix.c:639 > > > > > #14 0x0000560caa6ca7f4 in iothread_run (opaque=0x560cacc1f000) at > > > > > iothread.c:75 > > > > > #15 0x0000560caa8f1d84 in qemu_thread_start (args=0x560cacc666f0) at > > > > > util/qemu-thread-posix.c:519 > > > > > #16 0x00007fe2033ae2de in start_thread () at /lib64/libpthread.so.0 > > > > > #17 0x00007fe2030dfe83 in clone () at /lib64/libc.so.6 > > > > > <---- snip ----> > > > > > > > > > > Both the code path initiated by virtio_scsi_data_plane_cmd_vq() and > > > > > the one coming down from virtio_scsi_hotunplug() should be protected > > > > > by virtio_scsi_acquire(), but this 
can still happen because the latter > > > > > works by acquiring the AioContext pointed by s->ctx, and we have this > > > > > in bdrv_set_aio_context_ignore(): > > > > > > > > > > 6140 void bdrv_set_aio_context_ignore(BlockDriverState *bs, > > > > > 6141 AioContext *new_context, > > > > > GSList **ignore) > > > > > 6142 { > > > > > (...) > > > > > 6179 /* > > > > > 6180 * If this function was recursively called from > > > > > 6181 * bdrv_set_aio_context_ignore(), there may be nodes in > > > > > the > > > > > 6182 * subtree that have not yet been moved to the new > > > > > AioContext. > > > > > 6183 * Release the old one so bdrv_drained_end() can poll > > > > > them. > > > > > 6184 */ > > > > > 6185 if (qemu_get_aio_context() != old_context) { > > > > > 6186 aio_context_release(old_context); > > > > > 6187 } > > > > > > > > > > My first thought here is that given the context and the apparent > > > > > expectations of where and how virtio_scsi_acquire() is used, we should > > > > > probably use an independent lock on VirtIOSCSI instead of acquiring > > > > > the AioContext. > > > > > > > > > > What do you think? > > > > > > > > The virtio_scsi_ctx_check() assertion seems reasonable. virtio-scsi > > > > shouldn't be processing requests while AioContexts are changing. > > > > > > I agree. > > > > > > > I'm surprised that aio_disable_external() doesn't already protect this > > > > code path? > > > > > > The conflict happens after the call to aio_enable_external(): > > > > > > 838 static void virtio_scsi_hotunplug(HotplugHandler *hotplug_dev, > > > DeviceState *dev, > > > 839 Error **errp) > > > 840 { > > > 841 VirtIODevice *vdev = VIRTIO_DEVICE(hotplug_dev); > > > 842 VirtIOSCSI *s = VIRTIO_SCSI(vdev); > > > 843 SCSIDevice *sd = SCSI_DEVICE(dev); > > > 844 AioContext *ctx = s->ctx ?: qemu_get_aio_context(); > > > 845 > > > 846 if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG)) { > > > 847 virtio_scsi_acquire(s); > > > 848 virtio_scsi_push_event(s, sd, > > > 849 VIRTIO_SCSI_T_TRANSPORT_RESET, > > > 850 VIRTIO_SCSI_EVT_RESET_REMOVED); > > > 851 virtio_scsi_release(s); > > > 852 } > > > 853 > > > 854 aio_disable_external(ctx); > > > 855 qdev_simple_device_unplug_cb(hotplug_dev, dev, errp); > > > 856 aio_enable_external(ctx); > > > 857 > > > 858 if (s->ctx) { > > > 859 virtio_scsi_acquire(s); > > > 860 /* If other users keep the BlockBackend in the iothread, > > > that's ok */ > > > 861 blk_set_aio_context(sd->conf.blk, qemu_get_aio_context(), > > > NULL); > > > ^^^^^^^^^^^^^^^^^^^ > > > |-> This is call that leads to releasing the context > > > > > > 862 virtio_scsi_release(s); > > > 863 } > > > 864 } > > > > > > I'm not sure if we can safely move the aio_enable_external() call to > > > the end of the function. > > > > I cannot think of a reason to resume virtqueue processing across lines > > 858-863. > > > > Instead of calling aio_disable_external() directly, line 854 could be > > changed to bdrv_drained_begin() and line 856 could be replaced with > > bdrv_drained_end() on line 864. > > > > This way in-flight requests are completed before unplug. Drained > > sections include the effect of aio_disable_external() so we no longer > > need to call it explicitly. > > Thanks for the suggestion, Stefan. I've tried this, along with other > approaches, but even if the blk_set_aio_context succeeds, we may get a > "blk_get_aio_context(d->conf.blk) == s->ctx)" later on. 
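For reference, the variant suggested above would look roughly like the following in hw/scsi/virtio-scsi.c. This is a sketch only, not a tested patch: the mail names only bdrv_drained_begin()/bdrv_drained_end(), so passing blk_bs(sd->conf.blk) to them is an assumption, and the ctx variable is dropped because it becomes unused once the aio_disable_external()/aio_enable_external() pair goes away.

static void virtio_scsi_hotunplug(HotplugHandler *hotplug_dev, DeviceState *dev,
                                  Error **errp)
{
    VirtIODevice *vdev = VIRTIO_DEVICE(hotplug_dev);
    VirtIOSCSI *s = VIRTIO_SCSI(vdev);
    SCSIDevice *sd = SCSI_DEVICE(dev);

    if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG)) {
        virtio_scsi_acquire(s);
        virtio_scsi_push_event(s, sd,
                               VIRTIO_SCSI_T_TRANSPORT_RESET,
                               VIRTIO_SCSI_EVT_RESET_REMOVED);
        virtio_scsi_release(s);
    }

    /* Complete in-flight requests; the drained section also includes the
     * effect of aio_disable_external(), so the explicit
     * aio_disable_external()/aio_enable_external() calls are not needed. */
    bdrv_drained_begin(blk_bs(sd->conf.blk));

    qdev_simple_device_unplug_cb(hotplug_dev, dev, errp);

    if (s->ctx) {
        virtio_scsi_acquire(s);
        /* If other users keep the BlockBackend in the iothread, that's ok */
        blk_set_aio_context(sd->conf.blk, qemu_get_aio_context(), NULL);
        virtio_scsi_release(s);
    }

    bdrv_drained_end(blk_bs(sd->conf.blk));
}

As noted in the previous paragraph, this alone did not make the assertion failure go away, which is what motivates the proposal that follows.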
> > So, in a general sense, seems like in virtio_scsi_hotunplug() we're > assuming after this call we shouldn't see any other requests for the > unplugged SCSI drive in the virtqueue, but that's something that may > actually happen: > > Thread 1 (Thread 0x7f71d97c4700 (LWP 2194329)): > #0 0x00007f71ef9167ff in raise () at /lib64/libc.so.6 > #1 0x00007f71ef900c35 in abort () at /lib64/libc.so.6 > #2 0x00007f71ef900b09 in _nl_load_domain.cold.0 () at /lib64/libc.so.6 > #3 0x00007f71ef90ede6 in .annobin_assert.c_end () at /lib64/libc.so.6 > #4 0x0000560bb4346e68 in virtio_scsi_ctx_check (d=0x560bb62a1160, > s=<optimized out>, s=<optimized out>) > at /root/src/qemu/hw/scsi/virtio-scsi.c:250 > #5 0x0000560bb4346e68 in virtio_scsi_ctx_check (s=0x560bb777f2d0, > s=0x560bb777f2d0, d=0x560bb62a1160) > at /root/src/qemu/hw/scsi/virtio-scsi.c:247 > #6 0x0000560bb4346e68 in virtio_scsi_handle_cmd_req_prepare > (req=0x7f71cc316ea0, s=0x560bb777f2d0) > at /root/src/qemu/hw/scsi/virtio-scsi.c:569 > #7 0x0000560bb4346e68 in virtio_scsi_handle_cmd_vq > (s=s@entry=0x560bb777f2d0, vq=vq@entry=0x560bb7787520) at > /root/src/qemu/hw/scsi/virtio-scsi.c:612 > #8 0x0000560bb4347b0a in virtio_scsi_data_plane_handle_cmd (vdev=<optimized > out>, vq=0x560bb7787520) > at /root/src/qemu/hw/scsi/virtio-scsi-dataplane.c:60 > #9 0x0000560bb435b6c6 in virtio_queue_notify_aio_vq (vq=0x560bb7787520) > at /root/src/qemu/hw/virtio/virtio.c:2324 > #10 0x0000560bb47194db in aio_dispatch_handler (ctx=ctx@entry=0x560bb642d750, > node=0x7f6f940059f0) > at /root/src/qemu/util/aio-posix.c:328 > #11 0x0000560bb471a0e4 in aio_dispatch_ready_handlers (ready_list=<optimized > out>, ctx=<optimized out>) > at /root/src/qemu/util/aio-posix.c:358 > #12 0x0000560bb471a0e4 in aio_poll (ctx=0x560bb642d750, > blocking=blocking@entry=true) > at /root/src/qemu/util/aio-posix.c:653 > #13 0x0000560bb440ac00 in iothread_run (opaque=opaque@entry=0x560bb6231760) > at /root/src/qemu/iothread.c:75 > #14 0x0000560bb471c9ba in qemu_thread_start (args=<optimized out>) > at /root/src/qemu/util/qemu-thread-posix.c:521 > #15 0x00007f71efcaa14a in start_thread () at /lib64/libpthread.so.0 > #16 0x00007f71ef9dbf23 in clone () at /lib64/libc.so.6 > > > I'm considering keeping a local list of connected LUNs in each > virtio-scsi controller, adding and removing them on hotplug and > hotunplug, and using it in virtio_scsi_device_find() to mask out > unplugged devices, so virtio_scsi_handle_cmd_req_prepare() can fail > gracefully returning VIRTIO_SCSI_S_BAD_TARGET to the guest. > > How does this sound to you?
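To make that proposal a bit more concrete, here is a minimal sketch of the per-controller LUN list. Every name in it (VirtIOSCSILun, attached_luns, the helper functions) is made up for illustration, and the attached_luns field would have to be added to struct VirtIOSCSI separately; the only point is that virtio_scsi_device_find() could consult the list, so a request for an unplugged device takes the existing "device not found" path in virtio_scsi_handle_cmd_req_prepare() and is completed with VIRTIO_SCSI_S_BAD_TARGET instead of tripping the assertion in virtio_scsi_ctx_check().

/* Sketch for hw/scsi/virtio-scsi.c -- illustrative only. */

#include "qemu/osdep.h"
#include "qemu/queue.h"
#include "hw/scsi/virtio-scsi.h"

typedef struct VirtIOSCSILun {
    SCSIDevice *sd;
    QTAILQ_ENTRY(VirtIOSCSILun) next;
} VirtIOSCSILun;

/* Hypothetical new field in struct VirtIOSCSI:
 *     QTAILQ_HEAD(, VirtIOSCSILun) attached_luns;
 * initialized with QTAILQ_INIT() at realize time.
 */

/* Hotplug: record the device under the controller lock */
static void virtio_scsi_lun_add(VirtIOSCSI *s, SCSIDevice *sd)
{
    VirtIOSCSILun *lun = g_new0(VirtIOSCSILun, 1);

    lun->sd = sd;
    virtio_scsi_acquire(s);
    QTAILQ_INSERT_TAIL(&s->attached_luns, lun, next);
    virtio_scsi_release(s);
}

/* Hot-unplug: drop the device before the SCSIDevice itself goes away, so
 * requests still sitting in the virtqueue no longer resolve to it */
static void virtio_scsi_lun_del(VirtIOSCSI *s, SCSIDevice *sd)
{
    VirtIOSCSILun *lun, *tmp;

    virtio_scsi_acquire(s);
    QTAILQ_FOREACH_SAFE(lun, &s->attached_luns, next, tmp) {
        if (lun->sd == sd) {
            QTAILQ_REMOVE(&s->attached_luns, lun, next);
            g_free(lun);
            break;
        }
    }
    virtio_scsi_release(s);
}

/* Consulted from virtio_scsi_device_find(): a device that is not on the
 * list is reported as absent, so the guest gets VIRTIO_SCSI_S_BAD_TARGET */
static bool virtio_scsi_lun_attached(VirtIOSCSI *s, SCSIDevice *sd)
{
    VirtIOSCSILun *lun;

    QTAILQ_FOREACH(lun, &s->attached_luns, next) {
        if (lun->sd == sd) {
            return true;
        }
    }
    return false;
}

virtio_scsi_acquire()/virtio_scsi_release() are used here only for symmetry with the rest of the file; as discussed earlier in the thread, an independent lock on VirtIOSCSI might be the safer choice.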
Maxim has a patch series that does something that sounds similar:

[PATCH v2 0/7] RFC/WIP: Fix scsi devices plug/unplug races w.r.t virtio-scsi iothread
https://patchew.org/QEMU/20200511160951.8733-1-mlevi...@redhat.com/

Stefan