Am 27.03.2024 um 23:13 hat Eric Blake geschrieben: > On Mon, Mar 25, 2024 at 11:50:41AM -0400, Stefan Hajnoczi wrote: > > On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote: > > > If g_main_loop_run()/aio_poll() is called in the coroutine context, > > > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup > > > may be disordered. > > > > aio_poll() must not be called from coroutine context: > > > > bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking); > > ^^^^^^^^^^^^^^^ > > > > Coroutines are not supposed to block. Instead, they should yield. > > > > > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means > > > some listened events is completed. Therefore, the completion callback > > > function is dispatched. > > > > > > If this callback function needs to invoke aio_co_enter(), it will only > > > wake up the coroutine (because we are already in coroutine context), > > > which may cause that the data on this listening event_fd/socket_fd > > > is not read/cleared. When the next poll () exits, it will be woken up > > > again > > > and inserted into the wakeup queue again. > > > > > > For example, if TLS is enabled in NBD, the server will call > > > g_main_loop_run() > > > in the coroutine, and repeatedly wake up the io_read event on a socket. > > > The call stack is as follows: > > > > > > aio_co_enter() > > > aio_co_wake() > > > qio_channel_restart_read() > > > aio_dispatch_handler() > > > aio_dispatch_handlers() > > > aio_dispatch() > > > aio_ctx_dispatch() > > > g_main_context_dispatch() > > > g_main_loop_run() > > > nbd_negotiate_handle_starttls() > > > > This code does not look like it was designed to run in coroutine > > context. Two options: > > > > 1. Don't run it in coroutine context (e.g. use a BH instead). This > > avoids blocking the coroutine but calling g_main_loop_run() is still > > ugly, in my opinion. > > > > 2. Get rid of data.loop and use coroutine APIs instead: > > > > while (!data.complete) { > > qemu_coroutine_yield(); > > } > > > > and update nbd_tls_handshake() to call aio_co_wake(data->co) instead > > of g_main_loop_quit(data->loop). > > > > This requires auditing the code to check whether the event loop might > > invoke something that interferes with > > nbd_negotiate_handle_starttls(). Typically this means monitor > > commands or fd activity that could change the state of this > > connection while it is yielded. This is where the real work is but > > hopefully it will not be that hard to figure out. > > I agree that 1) is ugly. So far, I've added some temporary > assertions, to see when qio_channel_tls_handshake is reached; it looks > like nbd/server.c is calling it from within coroutine context, but > nbd/client.c is calling it from the main loop. The qio code handles > either, but now I'm stuck in trying to get client.c into having the > right coroutine context; the TLS handshake is done before the usual > BlockDriverState *bs object is available, so I'm not even sure what > aio context, if any, I should be using. But on my first try, I'm > hitting: > > qemu-img: ../util/async.c:707: aio_co_enter: Assertion `self != co' failed. > > so I obviously got something wrong.
Hard to tell without seeing the code, but it looks like you're trying to wake up the coroutine while you're still executing in the context of the same coroutine. It looks like the documentation of qio_channel_tls_handshake() is wrong and the function can return and call the callback immediately without dropping out of coroutine context. A rather heavy-handed, but obvious approach would be moving the qio_channel_tls_handshake() into a BH, then you know you'll always be outside of coroutine context in the callback. But maybe it would be enough to just check if the coroutine isn't already active: if (!qemu_coroutine_entered(co)) { aio_wake_co(co); } > It may be possible to use block_gen_c to turn nbd_receive_negotiate > and nbd_receive_export_list into co_wrapper functions, if I can audit > that all of their code goes through qio (and is therefore > coroutine-safe), but that work is still ongoing. If it's possible, I think that would be nicer in the code and would also reduce the time the guest is blocked while we're creating a new NBD connection. *reads code* Hm... Actually, one thing I was completely unaware of is that all of this is running in a separate thread, so maybe the existing synchronous code already doesn't block the guest. nbd_co_establish_connection() starts this thread. The thread doesn't have an AioContext, so anything that involves scheduling something in an AioContext (including BHs and waking up coroutines) will result in code running in a different thread. I'm not sure why a thread is used in the first place (does it do something that coroutines can't do?) and if running parts of the code in a different thread would be a problem, but we should probably have a look at this first. Mixing threads and coroutines feels strange. Kevin