Am 12.09.2019 um 12:30 hat Sergio Lopez geschrieben:
> 
> Kevin Wolf <kw...@redhat.com> writes:
> 
> > Am 11.09.2019 um 23:33 hat Eric Blake geschrieben:
> >> On 9/11/19 12:21 PM, Eric Blake wrote:
> >> > On 9/11/19 11:15 AM, Sergio Lopez wrote:
> >> >> On creation, the export's AioContext is set to the same one as the
> >> >> BlockBackend, while the AioContext in the client QIOChannel is left
> >> >> untouched.
> >> >>
> >> >> As a result, when using data-plane, nbd_client_receive_next_request()
> >> >> schedules coroutines in the IOThread AioContext, while the client's
> >> >> QIOChannel is serviced from the main_loop, potentially triggering the
> >> >> assertion at qio_channel_restart_[read|write].
> >> >>
> >> >> To fix this, as soon as we have the export corresponding to the
> >> >> client, we call qio_channel_attach_aio_context() to attach the
> >> >> QIOChannel context to the export's AioContext. This matches the
> >> >> logic in blk_aio_attached().
> >> >>
> >> >> RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1748253
> >> >> Signed-off-by: Sergio Lopez <s...@redhat.com>
> >> >> ---
> >> >>  nbd/server.c | 2 ++
> >> >>  1 file changed, 2 insertions(+)
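
(For context, judging from the commit message the change boils down to
something like the following once the negotiation code has resolved the
client's export -- a sketch of the idea, not the literal hunk from the
patch:)

    /* Hand the client's channel over to the export's AioContext so that
     * requests are scheduled and serviced in the same (iothread) context. */
    qio_channel_attach_aio_context(client->ioc,
                                   blk_get_aio_context(exp->blk));
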
> >> > 
> >> > I'd like a second opinion from Kevin, but the description makes sense to
> >> > me.  I'm happy to queue this through my NBD tree.
> >> > 
> >> > Reviewed-by: Eric Blake <ebl...@redhat.com>
> >> 
> >> I tried to test this patch, but even with it applied, I still got an
> >> aio-context crasher by attempting an nbd-server-start, nbd-server-add,
> >> nbd-server-stop (intentionally skipping the nbd-server-remove step) on a
> >> domain using iothreads, with a backtrace of:
> >> 
> >> #0  0x00007ff09d070e35 in raise () from target:/lib64/libc.so.6
> >> #1  0x00007ff09d05b895 in abort () from target:/lib64/libc.so.6
> >> #2  0x000055dd03b9ab86 in error_exit (err=1, msg=0x55dd03d59fb0 <__func__.15769> "qemu_mutex_unlock_impl") at util/qemu-thread-posix.c:36
> >> #3  0x000055dd03b9adcf in qemu_mutex_unlock_impl (mutex=0x55dd062d5090, file=0x55dd03d59041 "util/async.c", line=523) at util/qemu-thread-posix.c:96
> >> #4  0x000055dd03b93433 in aio_context_release (ctx=0x55dd062d5030) at util/async.c:523
> >> #5  0x000055dd03ac421b in bdrv_do_drained_begin (bs=0x55dd0673a2d0, recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:428
> >> #6  0x000055dd03ac4299 in bdrv_drained_begin (bs=0x55dd0673a2d0) at block/io.c:434
> >> #7  0x000055dd03aafb54 in blk_drain (blk=0x55dd06a3ec40) at block/block-backend.c:1605
> >> #8  0x000055dd03aae054 in blk_remove_bs (blk=0x55dd06a3ec40) at block/block-backend.c:800
> >> #9  0x000055dd03aad54a in blk_delete (blk=0x55dd06a3ec40) at block/block-backend.c:420
> >> #10 0x000055dd03aad7d6 in blk_unref (blk=0x55dd06a3ec40) at block/block-backend.c:475
> >> #11 0x000055dd03aecb68 in nbd_export_put (exp=0x55dd0726f920) at nbd/server.c:1666
> >> #12 0x000055dd03aec8fe in nbd_export_close (exp=0x55dd0726f920) at nbd/server.c:1616
> >> #13 0x000055dd03aecbf1 in nbd_export_close_all () at nbd/server.c:1689
> >> #14 0x000055dd03748845 in qmp_nbd_server_stop (errp=0x7ffcdf3cb4e8) at blockdev-nbd.c:233
> >> ...
> >> 
> >> Does that match what you were seeing?  Does it mean we missed another
> >> spot where the context is set incorrectly?
> >
> > I think nbd_export_close_all() or one of the NBD functions called by it
> > needs to take the AioContext lock of the associated BlockBackend.
> >
> > The crash is because AIO_WAIT_WHILE() wants to temporarily drop the lock
> > that we're not even holding.
> 
> Yes, I think locking the context during the "if (exp->blk) {" block at
> nbd/server.c:1646 should do the trick.

We need to be careful to avoid locking things twice, so maybe
nbd_export_put() is already too deep inside the NBD server.

Its callers are:

* qmp_nbd_server_add(): like all other QMP handlers in blockdev-nbd.c it
  neglects to lock the AioContext, but it should. The lock is not only
  needed for the nbd_export_put() call, but already earlier in the
  handler.

* nbd_export_close(), which in turn is called from:
    * nbd_eject_notifier(): run in the main thread, not locked
    * nbd_export_remove():
        * qmp_nbd_server_remove(): see above
    * nbd_export_close_all():
        * bdrv_close_all()
        * qmp_nbd_server_stop()

There are also calls from qemu-nbd, but these can be ignored because we
don't have iothreads there.

I think the cleanest approach would be to take the lock in the outermost
callers, i.e. in all QMP handlers that deal with a specific export, in the
eject notifier and in nbd_export_close_all().
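
For nbd_export_close_all() that would be something like the sketch below
(untested, and I'm citing the list/field names from memory, so treat it as
an illustration rather than a patch):

    void nbd_export_close_all(void)
    {
        NBDExport *exp, *tmp;

        QTAILQ_FOREACH_SAFE(exp, &exports, next, tmp) {
            /* The BlockBackend may live in an iothread's AioContext;
             * nbd_export_close() can end up draining it, so take the
             * lock around the whole close. */
            AioContext *ctx = exp->blk ? blk_get_aio_context(exp->blk)
                                       : qemu_get_aio_context();

            aio_context_acquire(ctx);
            nbd_export_close(exp);
            aio_context_release(ctx);
        }
    }

The QMP handlers and the eject notifier would do the same around their
respective nbd_export_*() calls.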

> On the other hand, I wonder if there is any situation in which calling
> blk_unref() without locking the context could be safe. If there isn't
> any, perhaps we should assert that the lock is held if blk->ctx != NULL
> to catch this kind of bug earlier?

blk_unref() must be called from the main thread, and if the BlockBackend
to be unreferenced is not in the main AioContext, the lock must be held.

I'm not sure how to assert that locks are held, though. I once looked
for a way to do this, but it involved either looking at the internal
state of pthreads mutexes or hacking up QemuMutex with debug state.
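
If we went the QemuMutex route, I imagine it would look roughly like this
(purely hypothetical, none of this exists today; it's just the idea of
carrying debug state next to the mutex):

    #include "qemu/osdep.h"
    #include "qemu/thread.h"

    /* Wrapper that remembers which thread currently holds the lock. */
    typedef struct {
        QemuMutex lock;
        QemuThread owner;   /* only meaningful while 'locked' is true */
        bool locked;
    } QemuDebugMutex;

    static void debug_mutex_lock(QemuDebugMutex *m)
    {
        qemu_mutex_lock(&m->lock);
        qemu_thread_get_self(&m->owner);
        m->locked = true;
    }

    static void debug_mutex_unlock(QemuDebugMutex *m)
    {
        m->locked = false;
        qemu_mutex_unlock(&m->lock);
    }

    /* What blk_unref() could assert: the lock is held by this thread.
     * The read is racy if it isn't, but for a debug aid that's fine. */
    static void debug_mutex_assert_held_by_self(QemuDebugMutex *m)
    {
        assert(m->locked && qemu_thread_is_self(&m->owner));
    }

The invasive part is that all AioContext locking would have to go through
such a wrapper for the assertion to mean anything.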

Kevin
