On Tue, Nov 07, 2017 at 11:39:44AM +0100, Paolo Bonzini wrote: > On 07/11/2017 02:59, Fam Zheng wrote: > > On Mon, 11/06 17:33, Paolo Bonzini wrote: > >> On 06/11/2017 17:11, Kevin Wolf wrote: > >>> Am 03.11.2017 um 11:26 hat Stefan Hajnoczi geschrieben: > >>>> On Wed, Nov 01, 2017 at 06:42:33AM +0000, lizhengui wrote: > >>>>> Hi, when I attach virtio-scsi disk to VM, the qemu crash happened at > >>>>> very low probability. > >>>>> > >>>>> The qemu crash bt is below: > >>>>> > >>>>> #0 0x00007f2be3ada1d7 in raise () from /usr/lib64/libc.so.6 > >>>>> #1 0x00007f2be3adb8c8 in abort () from /usr/lib64/libc.so.6 > >>>>> #2 0x000000000084fe49 in PAT_abort () > >>>>> #3 0x000000000084ce8d in patchIllInsHandler () > >>>>> #4 <signal handler called> > >>>>> #5 0x00000000008228bb in qemu_strnlen () > >>>>> #6 0x0000000000822934 in strpadcpy () > >>>>> #7 0x0000000000684a88 in scsi_disk_emulate_inquiry () > >>>>> #8 0x000000000068744b in scsi_disk_emulate_command () > >>>>> #9 0x000000000068c481 in scsi_req_enqueue () > >>>>> #10 0x00000000004b1f00 in virtio_scsi_handle_cmd_req_submit () > >>>>> #11 0x00000000004b2e9e in virtio_scsi_handle_cmd_vq () > >>>>> #12 0x000000000076dba7 in aio_dispatch () > >>>>> #13 0x000000000076dd96 in aio_poll () > >>>>> #14 0x00000000007a8673 in blk_prw () > >>>>> #15 0x00000000007a922c in blk_pread () > >>>>> #16 0x00000000007a9cd0 in blk_pread_unthrottled () > >>>>> #17 0x00000000005cb404 in guess_disk_lchs () > >>>>> #18 0x00000000005cb5b4 in hd_geometry_guess () > >>>>> #19 0x00000000005cad56 in blkconf_geometry () > >>>>> #20 0x0000000000685956 in scsi_realize () > >>>>> #21 0x000000000068d3e3 in scsi_qdev_realize () > >>>>> #22 0x00000000005e3938 in device_set_realized () > >>>>> #23 0x000000000075bced in property_set_bool () > >>>>> #24 0x0000000000760205 in object_property_set_qobject () > >>>>> #25 0x000000000075df64 in object_property_set_bool () > >>>>> #26 0x00000000005580ad in qdev_device_add () > >>>>> #27 0x000000000055850b in qmp_device_add () > >>>>> #28 0x0000000000818b37 in do_qmp_dispatch.constprop.1 () > >>>>> #29 0x0000000000818d8b in qmp_dispatch () > >>>>> #30 0x000000000045d212 in handle_qmp_command () > >>>>> #31 0x000000000081f819 in json_message_process_token () > >>>>> #32 0x00000000008434d0 in json_lexer_feed_char () > >>>>> #33 0x00000000008435e6 in json_lexer_feed () > >>>>> #34 0x000000000045bd72 in monitor_qmp_read () > >>>>> #35 0x000000000055ecf3 in tcp_chr_read () > >>>>> #36 0x00007f2be4cf899a in g_main_context_dispatch () from > >>>>> /usr/lib64/libglib-2.0.so.0 > >>>>> #37 0x000000000076b86b in os_host_main_loop_wait () > >>>>> #38 0x000000000076b995 in main_loop_wait () > >>>>> #39 0x0000000000569c51 in main_loop () > >>>>> #40 0x0000000000420665 in main () > >>>>> > >>>>> From the qemu crash bt, we can see that the scsi_realize has not > >>>>> completed yet. Some fields sush as vendor, version in SCSIDiskState is > >>>>> Null at this moment. If qemu handles scsi request from this scsi disk > >>>>> at this moment, the qemu will access some null pointers and cause crash. > >>>>> How can I solve this problem? Should we add a check that whether the > >>>>> scsi disk has realized or not in scsi_disk_emulate_command before > >>>>> Handling scsi requests? > >>>> > >>>> Please try this patch: > >>>> > >>>> diff --git a/hw/block/block.c b/hw/block/block.c > >>>> index 27878d0087..df99ddb899 100644 > >>>> --- a/hw/block/block.c > >>>> +++ b/hw/block/block.c > >>>> @@ -120,9 +120,16 @@ void blkconf_geometry(BlockConf *conf, int *ptrans, > >>>> } > >>>> } > >>>> if (!conf->cyls && !conf->heads && !conf->secs) { > >>>> + AioContext *ctx = blk_get_aio_context(conf->blk); > >>>> + > >>>> + /* Callers may not expect this function to dispatch aio > >>>> handlers, so > >>>> + * disable external aio such as guest device emulation. > >>>> + */ > >>>> + aio_disable_external(ctx); > >>>> hd_geometry_guess(conf->blk, > >>>> &conf->cyls, &conf->heads, &conf->secs, > >>>> ptrans); > >>>> + aio_enable_external(ctx); > >>>> } else if (ptrans && *ptrans == BIOS_ATA_TRANSLATION_AUTO) { > >>>> *ptrans = hd_bios_chs_auto_trans(conf->cyls, conf->heads, > >>>> conf->secs); > >>>> } > >>> > >>> But why is the new disk even attached to the controller and visible to > >>> the guest at this point when it hasn't completed its initialisation yet? > >>> Isn't the root problem that we're initialising things in the wrong > >>> order? > >> > >> Well, the root cause then is that scsi_device_find is just reusing the > >> list of devices on the SCSI bus. Devices are added to that list very > >> early by qdev_set_parent_bus. > >> > >> Stefan's patch could make the issue even harder to hit, but I think that > >> with iothreads you could hit it anyway. > >> > >> The solution is probably to add an "online" flag to the device, and set > >> it in scsi_device_realize. But even that has the issue that access to > >> the list is not protected with a lock. What do you think? > > > > Can main thread somehow call aio_context_acquire(vs->ctx) (and release) > > around > > qdev_set_parent_bus()? virtio_scsi_device_find() takes the lock. > > No, the context is not set yet. But the locking is easy to add, > separately from the bug that Zhengui is reporting.
Do you want to submit a patch for this instead of the aio_disable_external() approach I posted? I think your idea is cleaner than modifying the geometry probing function. Stefan
signature.asc
Description: PGP signature