On Fri, Jun 28, 2024 at 11:58:37AM GMT, Alexander Ivanov wrote: > Ping? > > On 6/7/24 17:00, Alexander Ivanov wrote: > > static void nbd_blockdev_client_closed(NBDClient *client, bool ignored) > > { > > nbd_client_put(client); > > + if (nbd_server == NULL) { > > + return; > > + } > > assert(nbd_server->connections > 0); > > nbd_server->connections--; > > nbd_update_server_watch(nbd_server); >
I've spent more time looking at this, and I'm stumped. Maybe Dan can help out. If I understand correctly, here's the race situation we have: qemu main loop coroutine client QMP nbd-server-start - calls nbd_server_start - assigns nbd_server = g_new0() - calls qio_net_listener_open_sync - calls nbd_server_update_watch - calls qio_net_listener_set_client_func(, nbd_accept, ) - waiting for client is now in coroutine - returns control to QMP main loop opens TCP socket - qio notices incoming connection - calls nbd_accept callback - starts NBD handshake nbd_client_new(, nbd_blockdev_client_closed) QMP nbd-server-stop - calls nbd_server_stop - calls nbd_server_free - calls qio_net_listener_disconnect() - marks qio channel as useless - sets nbd_server = NULL - qio sees channel is useless, disconnects client deals gracefully with dead connection - calls nbd_blockdev_client_closed callback - segfault since nbd_server->connections derefs NULL Thread 1 "qemu-kvm" received signal SIGSEGV, Segmentation fault. 0x000055555600af59 in nbd_blockdev_client_closed (client=0x55555810dfc0, ignored=false) at ../blockdev-nbd.c:56 56 assert(nbd_server->connections > 0); (gdb) p nbd_server $1 = (NBDServerData *) 0x0 (gdb) bt #0 0x000055555600af59 in nbd_blockdev_client_closed (client=0x55555810dfc0, ignored=false) at ../blockdev-nbd.c:56 #1 0x0000555555ff72f9 in client_close (client=0x55555810dfc0, negotiated=false) at ../nbd/server.c:1624 #2 0x0000555555ffbc49 in nbd_co_client_start (opaque=0x55555810dfc0) at ../nbd/server.c:3198 #3 0x0000555556220629 in coroutine_trampoline (i0=1488434896, i1=21845) Worse, the race could go another way: if another QMP nbd-server-start is called fast enough before the coroutine finishes the nbd_accept code from the first connection, it could attempt to modify the second nbd_server object, which may be associated with a completely different socket. As far as I can tell, the problem stems from the fact that the attempt to establish the connection with the client is continuing in a background coroutine despite our efforts to honor the QMP nbd-server-stop command. But I'm not sure on the proper way to inform qio to abandon all incoming traffic to a given net listener. Or maybe I should just change the semantics of QMP nbd-server-stop to fail if there are known connections from and nbd_accept() - but I still don't want that to wait indefinitely, so I still want some way to tell the qio channel that the server is going away and to tear down sockets as soon as possible. As a stopgap, something like this might prevent the SEGV: diff --git i/blockdev-nbd.c w/blockdev-nbd.c index 213012435f4..2f5ea094407 100644 --- i/blockdev-nbd.c +++ w/blockdev-nbd.c @@ -277,6 +277,12 @@ void qmp_nbd_server_stop(Error **errp) blk_exp_close_all_type(BLOCK_EXPORT_TYPE_NBD); + if (qio_net_listener_is_connected(nbd_server->listener) || + nbd_server->connections > 0) { + error_setg(errp, "NBD server still has connected clients"); + return; + } + nbd_server_free(nbd_server); nbd_server = NULL; } but it's not as graceful as I'd like (it would be nicer to have the nbd-server-stop command wait until it knows all connections are cleaned, instead of failing up front). -- Eric Blake, Principal Software Engineer Red Hat, Inc. Virtualization: qemu.org | libguestfs.org