On Tue, Nov 11, 2025 at 01:09:24PM -0600, Eric Blake wrote:
> On Tue, Nov 11, 2025 at 02:43:16PM +0000, Daniel P. Berrangé wrote:
> > > > > Hoisting the reference like this will make it easier for an upcoming
> > > > > patch to still ensure the listener cannot be prematurely garbage
> > > > > collected during the user's callback, even when the callback no longer
> > > > > uses a per-sioc GSource.
> > > >
> > > > It isn't quite this simple. Glib reference counts the callback
> > > > func / data, holding a reference when dispatching the callback.
> > > >
> > > > IOW, even if the GSource is unrefed, the callback 'notify'
> > > > function won't be called if the main loop is in the process
> > > > of dispatching.
> > >
> > > I'm not sure I follow your argument. Glib holds a reference on the
> > > GSource object, not on the opaque data that is handed to the GSource.
> > > It is possible to use g_source_set_callback_indirect() where GSource
> > > can learn how to use the same reference counting on data as external
> > > code, by the use of function pointers for ref and unref, but QIO uses
> > > merely g_source_set_callback().
> > > https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gmain.c#L1844
> > > shows that glib then wraps that opaque pointer into an internal
> > > GSourceCallback object which itself is reference counted, so that the
> > > notify function is not called until the GSource is finalized, but that
> > > is reference counting on the container, not on the opaque object
> > > itself (which in this patch is the QIONetListener).
> > > >
> > > > With this change, the reference on 'listener' can now be
> > > > released even if the callback is currently dispatching.
> > >
> > > So if I'm understanding your concern, you're worried that the unwatch
> > > code can finish looping through the g_source_destroy and then reach
> > > the point where it unrefs listener, but that a late-breaking client
> > > connection can trigger a callback that can still be executing in
> > > another thread/coroutine after the listener is unref'ed but before the
> > > GSource has been finalized? If so, would squashing this in fix the
> > > problem you are seeing?
> >
> > Consider the following scenario, where we have two threads, one
> > is calling QIONetListener APIs, and one is the event thread.
> >
> > In the current code, when we unref() the GSource for the socket
> > watch, the destroy-notify does not get called, because the
> > event thread is in the middle of a dispatch callback for the
> > I/O event. When the dispatch callback returns control to the
> > event loop, the GSourceCallback is unrefed, and this triggers
> > the destroy-notify call, which unrefs the listener.
> >
> > The flow looks like this:
> >
> >   Thread 1:
> >     qio_net_listener_set_client_func(lstnr, f, ...);
> >       => foreach sock: socket
> >            => object_ref(lstnr)
> >            => sock_src = qio_channel_socket_add_watch_source(sock,
> >                              ...., lstnr, object_unref);
> >
> >   Thread 2:
> >     poll()
> >       => event POLLIN on socket
> >            => ref(GSourceCallback)
> >            => call dispatch(sock)
> >                 ...do stuff..
> >
> >   Thread 1:
> >     qio_net_listener_set_client_func(lstnr, NULL, ...);
> >       => foreach sock: socket
> >            => g_source_unref(sock_src)
> >     unref(lstnr)  (the final reference)
> >       => finalize(lstnr)
>
> If I understand correctly, _this_ unref(lstnr) is NOT the final
> reference, because of the additional reference still owned by the
> GSource that is still dispatching, so finalize(lstnr) is not reached
> here...

Sorry, yes, my bad.

> >   Thread 2:
> >     => return dispatch(sock)
> >       => unref(GSourceCallback)
> >            => destroy-notify
> >                 => object_unref
>
> ...but instead here. And that is the desirable property of the
> pre-patch behavior with a per-GSource reference on the listener object
> - we are guaranteed that listener can't be finalized while there are
> any pending dispatch in flight, even if the caller has unref'd their
> last mention of lstnr, and therefore the dispatch never has a
> use-after-free.

Correct.
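For illustration, the pre-patch arrangement the flow above describes looks
roughly like the sketch below. This is a sketch only, assuming the usual
QIO/QOM headers: the helper name listener_watch_all, the callback name
client_func, and the field names and argument order are approximated from
the flow above rather than copied from the tree.

  static void listener_watch_all(QIONetListener *listener,
                                 GMainContext *context)
  {
      size_t i;

      for (i = 0; i < listener->nsioc; i++) {
          /* One reference per watch: owned by the GSource, and released
           * via the GDestroyNotify only after the source has been
           * destroyed *and* any in-flight dispatch has returned. */
          object_ref(OBJECT(listener));
          listener->io_source[i] = qio_channel_socket_add_watch_source(
              listener->sioc[i], G_IO_IN,
              client_func,                    /* dispatch callback */
              listener,                       /* opaque data */
              (GDestroyNotify)object_unref,   /* the destroy-notify */
              context);
      }
  }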
> > That appears to work ok, however, there's still a race window that is
> > not solved. Between the time thread 2 sees POLLIN, and when it calls
> > the dispatch(sock) function, it is possible that thread 1 will drop
> > the last reference:
> >
> >   Thread 1:
> >     qio_net_listener_set_client_func(lstnr, f, ...);
> >       => object_ref(listener)
> >       => foreach sock: socket
> >            => sock_src = qio_channel_socket_add_watch_source(sock,
> >                              ...., lstnr, NULL);
> >
> >   Thread 2:
> >     poll()
> >       => event POLLIN on socket
> >
> >   Thread 1:
> >     qio_net_listener_set_client_func(lstnr, NULL, ...);
> >       => foreach sock: socket
> >            => g_source_unref(sock_src)
> >       => object_unref(listener)
> >     unref(lstnr)  (still 1 reference left)
>
> Another copy-paste problem? I think your argument is that if we lose
> this race, this particular unref() drops the count to 0 triggering
> finalize() early...

Sigh, yes, somehow I got the examples inverted slightly. It was correct
in my head at least :-)

> >   Thread 2:
> >     => call dispatch(sock)
> >       => ref(lstnr)
>
> ...before this one can compensate, and thus having a UAF for a
> finalized lstnr passed to the callback. Yes, I can see the race now;
> thanks for persisting with the explanation.
>
> >            ...do stuff..
> >       => unref(lstnr)  (the final reference)
> >          => finalize(lstnr)
> >     => return dispatch(sock)
> >       => unref(GSourceCallback)
> >
> > I don't see a way to solve this without synchronization with the event
> > loop for releasing the reference on the opaque data for the dispatcher
> > callback. That's what the current code does, but I'm seeing no way for
> > the AioContext event loop callbacks to have anything equivalent. This
> > feels like a gap in the AioContext design.
>
> Yes, that was one thing I was frustrated with while writing v2 - the
> AioContext code simply lacks a notify hook, even though it is
> demonstrably useful with GSource. But plumbing in a notify hook is
> more invasive, and puts the series at risk of missing 10.2. Still,
> it's something worth getting right, so I'll try a v3.

I don't mind a short term NBD-only solution for 10.2 if we think we can
rationalize that the specific NBD usage pattern is safe, if it is easier
to delay AioContext thoughts until the 11.0 dev cycle.

> > This is admittedly an incredibly hard to trigger race condition. It would
> > need a client to be calling a QMP command that tears down the NBD server,
> > at the exact same time as a new NBD client was incoming. Or the same kind
> > of scenario for other pieces of QEMU code using QIONetListener. This still
> > makes me worried though, as rare races have a habit of hitting QEMU
> > eventually.
>
> Indeed. And a hard-to-trigger race is, in some regards, worse than a
> deadlock because it's harder to reproduce and prove whether it is
> fixed.
>
> I'm thinking about whether any of the GSourceCallback solution (using
> a reference-counted wrapper around the user's original opaque data to
> prove how many in-flight callbacks are still being dispatched) may be
> usable with AioContext.
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.
> Virtualization: qemu.org | libguestfs.org
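For what it's worth, a rough sketch of the shape such a wrapper could take
is below. The type and function names are invented for illustration, not an
existing QEMU or glib API; the idea simply mirrors glib's internal
GSourceCallback: an atomically refcounted container around the user's
opaque data, whose destroy-notify can only fire once the creation
reference *and* every per-dispatch reference have been dropped.

  #include <glib.h>

  typedef struct QIOHandlerData {
      gint refs;               /* atomic reference count */
      gpointer opaque;         /* the user's original opaque data */
      GDestroyNotify notify;   /* called when the last ref is dropped */
  } QIOHandlerData;

  static QIOHandlerData *qio_handler_data_new(gpointer opaque,
                                              GDestroyNotify notify)
  {
      QIOHandlerData *d = g_new0(QIOHandlerData, 1);
      d->refs = 1;             /* creation ref, owned by the fd handler */
      d->opaque = opaque;
      d->notify = notify;
      return d;
  }

  static QIOHandlerData *qio_handler_data_ref(QIOHandlerData *d)
  {
      g_atomic_int_inc(&d->refs);   /* taken around each dispatch */
      return d;
  }

  static void qio_handler_data_unref(QIOHandlerData *d)
  {
      if (g_atomic_int_dec_and_test(&d->refs)) {
          if (d->notify) {
              d->notify(d->opaque);
          }
          g_free(d);
      }
  }

The unresolved part is exactly the synchronization point discussed above:
the event loop itself would have to take the per-dispatch reference before
the removal path can drop the creation reference, otherwise the same
window between poll() seeing the event and dispatch running remains.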
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
