> On Jul 18, 2022, at 2:03 AM, Taylor R Campbell 
> <campbell+netbsd-tech-k...@mumble.net> wrote:
> 
>> Date: Sun, 17 Jul 2022 12:54:56 -0700
>> From: Jason Thorpe <thor...@me.com>
>> 
>> Aaaaand another new version.  This:
>> 
>> ==> Creates a knote_impl structure that's private to kern_event.c
>> that has the new lock.  I took the opportunity to move kn_influx to
>> the knote_impl as well, since absolutely no one outside of
>> kern_event.c should touch it.  This change is ABI-compatible.
>> 
>> ==> Improves the comments about the locking rules.
>> 
>> There are no functional changes since the last patch.
> 
> LGTM, thanks for working on this!
> 
> If I understand correctly, the issue is that removing the kqueue
> listener (i.e., filter_detach) and detaching the device notifier
> (i.e., klist_fini, via seldestroy) can race.  A process can close its
> kqueue file descriptor, or EV_DELETE a specific event subscription, at
> the same time as a device it had subscribed to is detaching and
> calling seldestroy.  That is, the lifetimes of:
> 
> (a) a knote in a kqueue listener for any of its event subscriptions, and
> (b) a knote in a driver's klist for any of its subscribers,
> 
> are overlapping but controlled independently, one by the kqueue user
> and the other by the driver, e.g. when the device is physically
> yanked, so the destruction by one path might happen while the other is
> still in use.  filter_detach can't safely call into the driver if the
> knote is being freed by klist_fini; klist_fini can't safely return to
> the driver if filter_detach is still in progress.

It’s actually simpler than that, and doesn’t really even involve much of a 
race… it’s that any knote associated with a device that has just been yanked 
may still refer to that device instance’s private data (via kn->kn_hook), and 
has a kn_fop pointer that points to functions provided by the now-detached 
device instance.  In the case of a driver that is statically linked into the 
kernel, at least the function pointers are probably stable and the problem is a 
simple use-after-free… but in the case of a dynamically-loaded driver module, 
kn_fop itself might now point at garbage, if the driver is unloaded.

So, if the driver detach happens at any time before the kqueue listener is 
detached, there is the potential to go BOOM (it’s not guaranteed, because UAF 
bugs are sometimes like that).
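
To make that concrete, here is a rough sketch of the listener-side detach step 
(a paraphrase for illustration, not a verbatim copy of kern_event.c):

#include <sys/param.h>
#include <sys/event.h>

/*
 * Paraphrase of the listener-side detach step (close / EV_DELETE);
 * not a verbatim copy of kern_event.c.
 */
static void
filter_detach(struct knote *kn)
{
	/*
	 * Indirect call through driver-provided filter ops, passing a knote
	 * whose kn_hook points into driver-owned memory.  If the device
	 * instance is already gone, kn_hook is dangling (use-after-free);
	 * if the driver was a loadable module and has been unloaded, kn_fop
	 * and the f_detach text it points to may not even be mapped anymore.
	 */
	kn->kn_fop->f_detach(kn);
}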

> Serializing the call to .f_detach and the replacement of kn_fop in
> klist_fini with a mutex ensures that one happens after the other for
> each knote.  Using a mutex attached to the knote itself avoids
> problems with the identity of the knote changing (close, EV_DELETE)
> and avoids problems with the identity of the device notifier changing
> (device detach) while either one is trying to run.

More or less… the mutex around the filter ops ensures that kn_fop will point to 
EITHER the driver-provided filter ops OR the stub do-nothing filter ops.  And 
because the mutex is held across each call into the filter ops, klist_fini() 
cannot complete while a filter op call is in progress, which in turn ensures 
that the memory backing the klist being torn down is not freed until there are 
no callers inside the about-to-be-gone filter ops.
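
Roughly, the pattern looks like this (a sketch only, not the literal patch; the 
foplock helper names and the stub filter ops name are illustrative):

#include <sys/param.h>
#include <sys/event.h>
#include <sys/queue.h>

/* Illustrative do-nothing stubs that klist_fini() swaps in. */
static void
filt_nopdetach(struct knote *kn)
{
}

static int
filt_nopevent(struct knote *kn, long hint)
{
	return 0;
}

static const struct filterops nop_filtops = {
	.f_detach = filt_nopdetach,
	.f_event = filt_nopevent,
};

/* Listener side: close / EV_DELETE eventually calls the filter's detach. */
static void
filter_detach(struct knote *kn)
{
	knote_foplock_enter(kn);	/* illustrative per-knote foplock helper */
	kn->kn_fop->f_detach(kn);	/* driver ops, or the do-nothing stubs */
	knote_foplock_exit(kn);
}

/* Driver side: seldestroy() -> klist_fini() on device detach. */
void
klist_fini(struct klist *list)
{
	struct knote *kn;

	SLIST_FOREACH(kn, list, kn_selnext) {
		knote_foplock_enter(kn);
		/*
		 * Any subsequent filter op call through this knote lands in
		 * the do-nothing stubs rather than in the about-to-be-freed
		 * driver.  Taking each knote's foplock here also means we
		 * cannot get past this point (and the driver cannot free its
		 * memory) while a filter op call is still in progress.
		 */
		kn->kn_fop = &nop_filtops;
		knote_foplock_exit(kn);
	}
}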

> The kn_fop member is supposed to be stable at all use points, so it
> doesn't require additional synchronization like atomic_load/store_*:
> 
> - Before .f_attach returns, the kqueue logic won't touch it, so
>  .f_attach can safely set it without synchronization.
> 
> - By the time the driver calls klist_fini (seldestroy), the driver
>  promises never to KNOTE again so it will never be touched by
>  filter_event.

It also promises not to allow any further f_attach calls to succeed.  (This is 
not new; the same promise is already made for select, and has been for many 
years.)  A rough driver-side sketch of these rules is below.

> - filter_touch is used only for EVFILT_TIMER and EVFILT_USER, which
>  don't put the knote on any klist so never use klist_fini on it.

Right.

> - filter_detach and klist_fini are serialized by the knote foplock.

Right.

-- thorpej

