On Tue, Nov 11, 2025 at 02:43:16PM +0000, Daniel P. Berrangé wrote:
> > > > Hoisting the reference like this will make it easier for an upcoming
> > > > patch to still ensure the listener cannot be prematurely garbage
> > > > collected during the user's callback, even when the callback no longer
> > > > uses a per-sioc GSource.
> > > 
> > > It isn't quite this simple. Glib reference counts the callback
> > > func / data, holding a reference when dispatching the callback.
> > > 
> > > IOW, even if the GSource is unrefed, the callback 'notify'
> > > function won't be called if the main loop is in the process
> > > of dispatching.
> > 
> > I'm not sure I follow your argument.  Glib holds a reference on the
> > GSource object, not on the opaque data that is handed to the GSource.
> > It is possible to use g_source_set_callback_indirect() where GSource
> > can learn how to use the same reference counting on data as external
> > code, by the use of function pointers for ref and unref, but QIO uses
> > merely g_source_set_callback().
> > https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gmain.c#L1844
> > shows that glib then wraps that opaque pointer into an internal
> > GSourceCallback object which itself is reference counted, so that the
> > notify function is not called until the GSource is finalized, but that
> > is reference counting on the container, not on the opaque object
> > itself (which in this patch is the QIONetListener).
> > 
> > > 
> > > With this change, the reference on 'listener' can now be
> > > released even if the callback is currently dispatching.
> > 
> > So if I'm understanding your concern, you're worried that the unwatch
> > code can finish looping through the g_source_destroy and then reach
> > the point where it unrefs listener, but that a late-breaking client
> > connection can trigger a callback that can still be executing in
> > another thread/coroutine after the listener is unref'ed but before the
> > GSource has been finalized?  If so, would squashing this in fix the
> > problem you are seeing?
> 
> Consider the following scenario, where we have two threads, one
> is calling QIONetListener APIs, and one is the event thread.
> 
> In the current code, when we unref() the GSource for the socket
> watch, the destroy-notify does not get called, because the
> event thread is in the middle of a dispatch callback for the
> I/O event.  When the dispatch callback returns control to the
> event loop, the GSourceCallback is unrefed, and this triggers
> the destroy-notify call, which unrefs the listener.
> 
> The flow looks like this:
> 
>   Thread 1:
>        qio_net_listener_set_client_func(lstnr, f, ...);
>            => foreach sock: socket
>                => object_ref(lstnr)
>                => sock_src = qio_channel_socket_add_watch_source(sock, ....,
>                                                                  lstnr, object_unref);
> 
>   Thread 2:
>        poll()
>           => event POLLIN on socket
>                => ref(GSourceCallback)
>                => call dispatch(sock)
>                     ...do stuff..
> 
>   Thread 1:
>        qio_net_listener_set_client_func(lstnr, NULL, ...);
>           => foreach sock: socket
>                => g_source_unref(sock_src)
>        unref(lstnr)  (the final reference)
>           => finalize(lstnr)

If I understand correctly, _this_ unref(lstnr) is NOT the final
reference, because of the additional reference still owned by the
GSource whose callback is mid-dispatch, so finalize(lstnr) is not
reached here...

> 
>   Thread 2:
>               => return dispatch(sock)
>               => unref(GSourceCallback)
>                   => destroy-notify
>                      => object_unref

...but instead here.  And that is the desirable property of the
pre-patch behavior with a per-GSource reference on the listener object
- we are guaranteed that the listener can't be finalized while any
dispatch is still in flight, even if the caller has dropped their own
last reference to lstnr, and therefore the dispatch never has a
use-after-free.
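
For concreteness, the per-sioc setup in the pre-patch code is roughly
this shape (a paraphrase of your flow above, with the middle arguments
elided the same way; not a verbatim quote of io/net-listener.c):

  /* One listener reference per GSource; object_unref runs as the
   * destroy-notify, i.e. only once glib finalizes the GSource and
   * its GSourceCallback container, never mid-dispatch. */
  object_ref(OBJECT(listener));
  sock_src = qio_channel_socket_add_watch_source(sock, ....,
                                                 listener,
                                                 (GDestroyNotify)object_unref);

The whole guarantee comes from glib deferring that notify until the
last GSourceCallback reference is dropped.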

> 
> 
> 
> > diff --git i/io/net-listener.c w/io/net-listener.c
> > index 9f4e3c0be0c..1fcbbeb7a76 100644
> > --- i/io/net-listener.c
> > +++ w/io/net-listener.c
> > @@ -67,8 +67,10 @@ static void qio_net_listener_aio_func(void *opaque)
> >  {
> >      QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(opaque);
> > 
> > +    object_ref(OBJECT(sioc->listener));
> >      qio_net_listener_channel_func(QIO_CHANNEL(sioc), G_IO_IN,
> >                                    sioc->listener);
> > +    object_unref(OBJECT(sioc->listener));
> >  }
> 
> Now consider this patch, plus this extra hunk...
> 
>   Thread 1:
>        qio_net_listener_set_client_func(lstnr, f, ...);
>            => object_ref(listener)
>            => foreach sock: socket
>                => sock_src = qio_channel_socket_add_watch_source(sock, ....,
>                                                                  lstnr, NULL);
> 
>   Thread 2:
>        poll()
>           => event POLLIN on socket
>                => call dispatch(sock)
>                     => ref(lstnr)
>                     ...do stuff..
> 
>   Thread 1:
>        qio_net_listener_set_client_func(lstnr, NULL, ...);
>           => foreach sock: socket
>                => g_source_unref(sock_src)
>           => object_unref(listener)
>        unref(lstnr)  (still 1 reference left)
> 
>   Thread 2:
>                     => unref(lstnr)    (the final reference)
>                         => finalize(lstnr)
>               => return dispatch(sock)
>               => unref(GSourceCallback)
>

Yes, that's the race I plugged between v1 and v2 by adding the
ref/unref pair around the client callback.

> 
> That appears to work ok, however, there's still a race window that is
> not solved. Between the time thread 2 sees POLLIN, and when it calls
> the dispatch(sock) function, it is possible that thread 1 will drop
> the last reference:
> 
> 
> 
>   Thread 1:
>        qio_net_listener_set_client_func(lstnr, f, ...);
>            => object_ref(listener)
>            => foreach sock: socket
>                => sock_src = qio_channel_socket_add_watch_source(sock, ....,
>                                                                  lstnr, NULL);
> 
>   Thread 2:
>        poll()
>           => event POLLIN on socket
> 
>   Thread 1:
>        qio_net_listener_set_client_func(lstnr, NULL, ...);
>           => foreach sock: socket
>                => g_source_unref(sock_src)
>           => object_unref(listener)
>        unref(lstnr)  (still 1 reference left)

Is the "(still 1 reference left)" annotation another copy-paste
problem?  I think your argument is that if we lose this race, this
particular unref() drops the count to 0, triggering finalize() early...

> 
>   Thread 2:
>                => call dispatch(sock)
>                     => ref(lstnr)

...before this ref() can compensate, and thus a use-after-free on the
already-finalized lstnr passed to the callback.  Yes, I can see the
race now; thanks for persisting with the explanation.

>                     ...do stuff..
>                     => unref(lstnr)    (the final reference)
>                         => finalize(lstnr)
>               => return dispatch(sock)
>               => unref(GSourceCallback)
> 
> 
> I don't see a way to solve this without synchronization with the event
> loop for releasing the reference on the opaque data for the dispatcher
> callback.  That's what the current code does, but I'm seeing no way for
> the AioContext event loop callbacks to have anything equivalent. This
> feels like a gap in the AioContext design.

Yes, that was one thing I was frustrated with while writing v2 - the
AioContext code simply lacks a destroy-notify hook, even though GSource
demonstrates how useful one is.  But plumbing in such a hook is more
invasive, and puts the series at risk of missing 10.2.  Still, it's
worth getting right, so I'll try a v3.
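
To make that concrete, the shape I have in mind is purely hypothetical
- nothing like it exists in the AioContext API today - a registration
variant that accepts a destroy-notify for the opaque pointer, mirroring
g_source_set_callback():

  /* HYPOTHETICAL sketch, invented name: like aio_set_fd_handler(),
   * but with a destroy-notify for 'opaque' that the event loop would
   * invoke only once it can prove no dispatch still using the old
   * opaque value is in flight (other parameters omitted). */
  void aio_set_fd_handler_notify(AioContext *ctx, int fd,
                                 IOHandler *io_read,
                                 IOHandler *io_write,
                                 void *opaque,
                                 GDestroyNotify notify);

The hard part is that "no dispatch in flight" guarantee, which is
exactly what glib's GSourceCallback machinery provides for GSource.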

> 
> This is admittedly an incredibly hard to trigger race condition. It would
> need a client to be calling a QMP command that tears down the NBD server,
> at the exact same time as a new NBD client was incoming. Or the same kind
> of scenario for other pieces of QEMU code using QIONetListener. This still
> makes me worried though, as rare races have a habit of hitting QEMU
> eventually.

Indeed.  And a hard-to-trigger race is, in some regards, worse than a
deadlock because it's harder to reproduce and prove whether it is
fixed.

I'm still thinking about whether the GSourceCallback approach (a
reference-counted wrapper around the user's original opaque data,
tracking how many in-flight callbacks are still being dispatched) can
be adapted to AioContext.
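
As an untested strawman (invented names), the wrapper could look like:

  /* Strawman: wrap the user's opaque data and destroy-notify in a
   * reference-counted container.  The registration holds one
   * reference; the dispatch trampoline takes another for the
   * duration of each callback, so the notify fires only after the
   * last in-flight dispatch returns. */
  typedef struct DispatchData {
      gint refs;               /* registration + in-flight dispatches */
      gpointer data;           /* user's original opaque pointer */
      GDestroyNotify notify;   /* user's original destroy-notify */
  } DispatchData;

  static DispatchData *dispatch_data_ref(DispatchData *d)
  {
      g_atomic_int_inc(&d->refs);
      return d;
  }

  static void dispatch_data_unref(DispatchData *d)
  {
      if (g_atomic_int_dec_and_test(&d->refs)) {
          if (d->notify) {
              d->notify(d->data);
          }
          g_free(d);
      }
  }

The catch, per your second scenario, is that dropping the registration
reference must itself be serialized with the event loop (say, by
scheduling the final dispatch_data_unref() as a bottom half in the
AioContext); otherwise the POLLIN-to-dispatch window you describe is
still open.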

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org

