On Tue, Nov 11, 2025 at 02:07:50PM -0600, Eric Blake wrote:
> On Tue, Nov 11, 2025 at 01:09:24PM -0600, Eric Blake wrote:
> > > In the current code, when we unref() the GSource for the socket
> > > watch, the destroy-notify does not get called, because the
> > > event thread is in the middle of a dispatch callback for the
> > > I/O event. When the dispatch callback returns control to the
> > > event loop, the GSourceCallback is unrefed, and this triggers
> > > the destroy-notify call, which unrefs the listener.
> > >
> > > The flow looks like this:
> > >
> > > Thread 1:
> > >   qio_net_listener_set_client_func(lstnr, f, ...);
> > >   => foreach sock: socket
> > >      => object_ref(lstnr)
> > >      => sock_src = qio_channel_socket_add_watch_source(sock,
> > >                        ...., lstnr, object_unref);
> > >
> > > Thread 2:
> > >   poll()
> > >   => event POLLIN on socket
> > >      => ref(GSourceCallback)
> > >      => call dispatch(sock)
> > >         ...do stuff..
> > >
> > > Thread 1:
> > >   qio_net_listener_set_client_func(lstnr, NULL, ...);
> > >   => foreach sock: socket
> > >      => g_source_unref(sock_src)
> > >   unref(lstnr) (the final reference)
> > >   => finalize(lstnr)
> >
> > If I understand correctly, _this_ unref(lstnr) is NOT the final
> > reference, because of the additional reference still owned by the
> > GSource that is still dispatching, so finalize(lstnr) is not reached
> > here...
> >
> > >
> > > Thread 2:
> > >   => return dispatch(sock)
> > >   => unref(GSourceCallback)
> > >      => destroy-notify
> > >         => object_unref
> >
> > ...but instead here. And that is the desirable property of the
> > pre-patch behavior with a per-GSource reference on the listener
> > object - we are guaranteed that the listener can't be finalized
> > while any dispatch is still in flight, even if the caller has
> > unref'd their last reference to lstnr, and therefore the dispatch
> > never has a use-after-free.
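Right - and that pattern is just the standard GLib destroy-notify
idiom. Roughly (a sketch in plain GLib terms, rather than via our
qio_channel wrappers):

  /* Each GSource owns its own reference on the listener. GLib only
   * fires the destroy-notify once the callback can no longer be
   * dispatched, so this object_unref() never runs mid-dispatch. */
  object_ref(OBJECT(lstnr));
  g_source_set_callback(sock_src,
                        (GSourceFunc)qio_net_listener_channel_func,
                        lstnr, (GDestroyNotify)object_unref);

so the final unref of lstnr is always sequenced after the last
dispatch has returned.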
>
> But thinking further, even if it is not an object_unref, but merely a
> change of the async callback function, is this the sort of thing where
> a mutex is needed to make things safer?
>
> Even without my patch series, we have this scenario:
>
> Thread 1:
>   qio_net_listener_set_client_func(lstnr, f1, ...);
>   => foreach sock: socket
>      => object_ref(lstnr)
>      => sock_src = qio_channel_socket_add_watch_source(sock, ....,
>                        lstnr, object_unref);
>
> Thread 2:
>   poll()
>   => event POLLIN on socket
>      => ref(GSourceCallback) // while lstnr->io_func is f1
>   ...do stuff..
>
> Thread 1:
>   qio_net_listener_set_client_func(lstnr, f2, ...);
>   => foreach sock: socket
>      => g_source_unref(sock_src)
>   => foreach sock: socket
>      => object_ref(lstnr)
>      => sock_src = qio_channel_socket_add_watch_source(sock, ....,
>                        lstnr, object_unref);
>
> Thread 2:
>   => access lstnr->io_func // now sees f2
>   => call f2(sock)
>   => return dispatch(sock)
>   => unref(GSourceCallback)
>      => destroy-notify
>         => object_unref
>
> So even though thread 2 noticed a client while f1 was registered,
> because we lack a mutex and are not using atomic access primitives,
> there is nothing preventing thread 1 from changing the io_func from f1
> to f2 before thread 2 finally calls the callback. And if f2 is NULL
> (because the caller is deregistering the callback), I could see the
> race hitting qio_net_listener_channel_func() with its pre-patch code
> of:
>
> if (listener->io_func) {
>     listener->io_func(listener, sioc, listener->io_data);
> }
>
> resulting in a NULL-pointer deref if thread 1 changed out
> listener->io_func after the if but before the dispatch.
Hmm, yes, that is a risk I hadn't thought of before.
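To make the window concrete, the pre-patch dispatch can lose the race
between the NULL check and the call:

  if (listener->io_func) {     /* thread 2: sees f1, non-NULL */
      /* thread 1: set_client_func(lstnr, NULL, ...) lands here */
      listener->io_func(listener, sioc, listener->io_data); /* boom */
  }

and even copying io_func/io_data into locals first only narrows the
window - without synchronization the pair can still be torn, invoking
f1 with f2's opaque data.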
> Adding a mutex might make QIONetListener slightly slower due to the
> time spent grabbing the mutex (even if it is uncontended most of the
> time), but with a proper mutex in place, we could at least guarantee
> that listener->io_func, listener->io_data, and listener->io_notify
> are changed as a group, rather than torn apart if the polling thread
> is dispatching with listener as its opaque data at the same time the
> user's thread is trying to change the registered callback.
Yeah, a mutex should be ok, as it will be uncontended 99.999% of
the time.
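On the dispatch side, something like this (a sketch only - it assumes
a new 'lock' QemuMutex field in QIONetListener, initialized in
instance_init and destroyed in finalize):

  QIONetListenerClientFunc func;
  gpointer data;

  qemu_mutex_lock(&listener->lock);
  /* snapshot the callback pair under the lock, so we never see a
   * half-updated func/data combination */
  func = listener->io_func;
  data = listener->io_data;
  qemu_mutex_unlock(&listener->lock);

  if (func) {
      func(listener, sioc, data);
  }

with qio_net_listener_set_client_func() taking the same lock while it
swaps out the func/data/notify triple. Note this deliberately drops
the lock before invoking the callback, so a dispatch that raced with
the change can still run the old func once, but it can never
mix-and-match the pair or deref NULL.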
> By itself, GSource DOES have a mutex lock around the GMainContext it
> is attached to; but our particular use of GSource is operating outside
> that locking scheme, so it looks like I've uncovered another latent
> problem.
Agreed.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|