On Mon, Mar 25, 2024 at 11:50:41AM -0400, Stefan Hajnoczi wrote: > On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote: > > If g_main_loop_run()/aio_poll() is called in the coroutine context, > > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup > > may be disordered. > > aio_poll() must not be called from coroutine context: > > bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking); > ^^^^^^^^^^^^^^^ > > Coroutines are not supposed to block. Instead, they should yield. > > > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means > > some listened events is completed. Therefore, the completion callback > > function is dispatched. > > > > If this callback function needs to invoke aio_co_enter(), it will only > > wake up the coroutine (because we are already in coroutine context), > > which may cause that the data on this listening event_fd/socket_fd > > is not read/cleared. When the next poll () exits, it will be woken up again > > and inserted into the wakeup queue again. > > > > For example, if TLS is enabled in NBD, the server will call > > g_main_loop_run() > > in the coroutine, and repeatedly wake up the io_read event on a socket. > > The call stack is as follows: > > > > aio_co_enter() > > aio_co_wake() > > qio_channel_restart_read() > > aio_dispatch_handler() > > aio_dispatch_handlers() > > aio_dispatch() > > aio_ctx_dispatch() > > g_main_context_dispatch() > > g_main_loop_run() > > nbd_negotiate_handle_starttls() > > This code does not look like it was designed to run in coroutine > context. Two options: > > 1. Don't run it in coroutine context (e.g. use a BH instead). This > avoids blocking the coroutine but calling g_main_loop_run() is still > ugly, in my opinion. > > 2. Get rid of data.loop and use coroutine APIs instead: > > while (!data.complete) { > qemu_coroutine_yield(); > } > > and update nbd_tls_handshake() to call aio_co_wake(data->co) instead > of g_main_loop_quit(data->loop). > > This requires auditing the code to check whether the event loop might > invoke something that interferes with > nbd_negotiate_handle_starttls(). Typically this means monitor > commands or fd activity that could change the state of this > connection while it is yielded. This is where the real work is but > hopefully it will not be that hard to figure out.
I agree that 1) is ugly. So far, I've added some temporary assertions, to see when qio_channel_tls_handshake is reached; it looks like nbd/server.c is calling it from within coroutine context, but nbd/client.c is calling it from the main loop. The qio code handles either, but now I'm stuck in trying to get client.c into having the right coroutine context; the TLS handshake is done before the usual BlockDriverState *bs object is available, so I'm not even sure what aio context, if any, I should be using. But on my first try, I'm hitting: qemu-img: ../util/async.c:707: aio_co_enter: Assertion `self != co' failed. so I obviously got something wrong. It may be possible to use block_gen_c to turn nbd_receive_negotiate and nbd_receive_export_list into co_wrapper functions, if I can audit that all of their code goes through qio (and is therefore coroutine-safe), but that work is still ongoing. At any rate, qemu-iotest 233 should be good for testing that changes in this area work; I've also been testing with libnbd (test interop/interop-qemu-nbd-tls-certs hits qemu's server.c) and nbdkit (test tests/test-tls-psk.sh hits qemu's client.c). -- Eric Blake, Principal Software Engineer Red Hat, Inc. Virtualization: qemu.org | libguestfs.org