Re: Problem with outstanding knotes and device detach - and a fix

2022-07-18 Thread Jason Thorpe


> On Jul 18, 2022, at 2:03 AM, Taylor R Campbell wrote:
> 
>> Date: Sun, 17 Jul 2022 12:54:56 -0700
>> From: Jason Thorpe 
>> 
>> And another new version.  This:
>> 
>> ==> Creates a knote_impl structure that's private to kern_event.c
>> that has the new lock.  I took the opportunity to move kn_influx to
>> the knote_impl as well, since absolutely no one outside of
>> kern_event.c should touch it.  This change is ABI-compatible.
>> 
>> ==> Improves the comments about the locking rules.
>> 
>> There are no functional changes since the last patch.
> 
> LGTM, thanks for working on this!
> 
> If I understand correctly, the issue is that removing the kqueue
> listener (i.e., filter_detach) and detaching the device notifier
> (i.e., klist_fini, via seldestroy) can race.  A process can close its
> kqueue file descriptor, or EV_DELETE a specific event subscription, at
> the same time as a device it had subscribed to is detaching and
> calling seldestroy.  That is, the lifetimes of:
> 
> (a) a knote in a kqueue listener for any of its event subscriptions, and
> (b) a knote in a driver's klist for any of its subscribers,
> 
> are overlapping but controlled independently, one by the kqueue user
> and the other by the driver, e.g. when the device is physically
> yanked, so the destruction by one path might happen while the other is
> still in use.  filter_detach can't safely call into the driver if the
> knote is being freed by klist_fini; klist_fini can't safely return to
> the driver if filter_detach is still in progress.

It’s actually simpler than that, and doesn’t really even involve much of a 
race… it’s that any knote associated with a device that has just been yanked 
may be referring to a device instance’s private data (via kn->kn_hook), and has 
a kn_fop pointer that points to functions provided by the now-detached device 
instance.  In the case of a driver that is statically linked into the kernel, 
at least the function pointers are probably stable and the problem is a simple 
use-after-free… but in the case of a dynamically-loaded driver module, kn_fop 
might itself now point at garbage, if the driver is unloaded.

So, if the driver detach happens at any time before the kqueue listener is 
detached, there is the potential to go BOOM (it’s not guaranteed, because UAF 
bugs are sometimes like that).

> Serializing the call to .f_detach and the replacement of kn_fop in
> klist_fini with a mutex ensures that one happens after the other for
> each knote.  Using a mutex attached to the knote itself avoids
> problems with the identity of the knote changing (close, EV_DELETE)
> and avoids problems with the identity of the device notifier changing
> (device detach) while either one is trying to run.

More or less… the mutex around the filter ops ensures that kn_fop will point to 
EITHER the driver-provided filter ops OR the stub do-nothing filter ops, and 
holding the mutex across the call into the filter ops ensures that 
klist_fini() will not complete while a filter op call is in progress, thus 
ensuring that the memory backing the klist being torn down is not freed until 
there are no callers inside the about-to-be-gone filter ops.

> The kn_fop member is supposed to be stable at all use points, so it
> doesn't require additional synchronization like atomic_load/store_*:
> 
> - Before .f_attach returns, the kqueue logic won't touch it, so
>  .f_attach can safely set it without synchronization.
> 
> - By the time the driver calls klist_fini (seldestroy), the driver
>  promises never to KNOTE again so it will never be touched by
>  filter_event.

It also promises to not allow any further f_attach calls to succeed.  (This is 
not new; the same promise is already made for select, and has been for many 
years).

> - filter_touch is used only for EVFILT_TIMER and EVFILT_USER, which
>  don't put the knote on any klist so never use klist_fini on it.

Right.

> - filter_detach and klist_fini are serialized by the knote foplock.

Right.

-- thorpej




Re: Problem with outstanding knotes and device detach - and a fix

2022-07-18 Thread Taylor R Campbell
> Date: Sun, 17 Jul 2022 12:54:56 -0700
> From: Jason Thorpe 
> 
> And another new version.  This:
> 
> ==> Creates a knote_impl structure that's private to kern_event.c
> that has the new lock.  I took the opportunity to move kn_influx to
> the knote_impl as well, since absolutely no one outside of
> kern_event.c should touch it.  This change is ABI-compatible.
> 
> ==> Improves the comments about the locking rules.
> 
> There are no functional changes since the last patch.

LGTM, thanks for working on this!

If I understand correctly, the issue is that removing the kqueue
listener (i.e., filter_detach) and detaching the device notifier
(i.e., klist_fini, via seldestroy) can race.  A process can close its
kqueue file descriptor, or EV_DELETE a specific event subscription, at
the same time as a device it had subscribed to is detaching and
calling seldestroy.  That is, the lifetimes of:

(a) a knote in a kqueue listener for any of its event subscriptions, and
(b) a knote in a driver's klist for any of its subscribers,

are overlapping but controlled independently, one by the kqueue user
and the other by the driver, e.g. when the device is physically
yanked, so the destruction by one path might happen while the other is
still in use.  filter_detach can't safely call into the driver if the
knote is being freed by klist_fini; klist_fini can't safely return to
the driver if filter_detach is still in progress.

Serializing the call to .f_detach and the replacement of kn_fop in
klist_fini with a mutex ensures that one happens after the other for
each knote.  Using a mutex attached to the knote itself avoids
problems with the identity of the knote changing (close, EV_DELETE)
and avoids problems with the identity of the device notifier changing
(device detach) while either one is trying to run.

The kn_fop member is supposed to be stable at all use points, so it
doesn't require additional synchronization like atomic_load/store_*:

- Before .f_attach returns, the kqueue logic won't touch it, so
  .f_attach can safely set it without synchronization.

- By the time the driver calls klist_fini (seldestroy), the driver
  promises never to KNOTE again so it will never be touched by
  filter_event.

- filter_touch is used only for EVFILT_TIMER and EVFILT_USER, which
  don't put the knote on any klist so never use klist_fini on it.

- filter_detach and klist_fini are serialized by the knote foplock.


Re: Problem with outstanding knotes and device detach - and a fix

2022-07-17 Thread Jason Thorpe

> On Jul 13, 2022, at 7:18 PM, Jason Thorpe  wrote:
> 
> Ok, new version.  Main differences:

And another new version.  This:

==> Creates a knote_impl structure that’s private to kern_event.c that has the 
new lock.  I took the opportunity to move kn_influx to the knote_impl as well, 
since absolutely no one outside of kern_event.c should touch it.  This change 
is ABI-compatible.

==> Improves the comments about the locking rules.

There are no functional changes since the last patch.

Index: kern/kern_event.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_event.c,v
retrieving revision 1.143
diff -u -p -r1.143 kern_event.c
--- kern/kern_event.c   13 Jul 2022 14:11:46 -  1.143
+++ kern/kern_event.c   17 Jul 2022 19:44:06 -
@@ -120,6 +120,61 @@ static void filt_userdetach(struct knote
 static int filt_user(struct knote *, long hint);
 static int filt_usertouch(struct knote *, struct kevent *, long type);
 
+/*
+ * Private knote state that should never be exposed outside
+ * of kern_event.c
+ *
+ * Field locking:
+ *
+ * q   kn_kq->kq_lock
+ */
+struct knote_impl {
+   struct knote    ki_knote;
+   unsigned int    ki_influx;  /* q: in-flux counter */
+   kmutex_t        ki_foplock; /* for kn_filterops */
+};
+
+#define KIMPL_TO_KNOTE(kip) (&(kip)->ki_knote)
+#define KNOTE_TO_KIMPL(knp) container_of((knp), struct knote_impl, ki_knote)
+
+static inline struct knote *
+knote_alloc(bool sleepok)
+{
+   struct knote_impl *ki;
+
+   ki = kmem_zalloc(sizeof(*ki), sleepok ? KM_SLEEP : KM_NOSLEEP);
+   mutex_init(&ki->ki_foplock, MUTEX_DEFAULT, IPL_NONE);
+
+   return KIMPL_TO_KNOTE(ki);
+}
+
+static inline void
+knote_free(struct knote *kn)
+{
+   struct knote_impl *ki = KNOTE_TO_KIMPL(kn);
+
+   mutex_destroy(&ki->ki_foplock);
+   kmem_free(ki, sizeof(*ki));
+}
+
+static inline void
+knote_foplock_enter(struct knote *kn)
+{
+   mutex_enter(&KNOTE_TO_KIMPL(kn)->ki_foplock);
+}
+
+static inline void
+knote_foplock_exit(struct knote *kn)
+{
+   mutex_exit(&KNOTE_TO_KIMPL(kn)->ki_foplock);
+}
+
+static inline bool
+knote_foplock_owned(struct knote *kn)
+{
+   return mutex_owned(&KNOTE_TO_KIMPL(kn)->ki_foplock);
+}
+
 static const struct fileops kqueueops = {
.fo_name = "kqueue",
.fo_read = (void *)enxio,
@@ -133,6 +188,31 @@ static const struct fileops kqueueops = 
.fo_restart = kqueue_restart,
 };
 
+static void
+filt_nopdetach(struct knote *kn __unused)
+{
+}
+
+static int
+filt_nopevent(struct knote *kn __unused, long hint __unused)
+{
+   return 0;
+}
+
+static const struct filterops nop_fd_filtops = {
+   .f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE,
+   .f_attach = NULL,
+   .f_detach = filt_nopdetach,
+   .f_event = filt_nopevent,
+};
+
+static const struct filterops nop_filtops = {
+   .f_flags = FILTEROP_MPSAFE,
+   .f_attach = NULL,
+   .f_detach = filt_nopdetach,
+   .f_event = filt_nopevent,
+};
+
 static const struct filterops kqread_filtops = {
.f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE,
.f_attach = NULL,
@@ -232,15 +312,41 @@ static size_t user_kfiltersz; /* size 
  * -> object lock (e.g., device driver lock, )
  * -> kn_kq->kq_lock
  *
- * Locking rules:
+ * Locking rules.  ==> indicates the lock is acquired by the backing
+ * object, locks prior are acquired before calling filter ops:
+ *
+ * f_attach: fdp->fd_lock -> knote foplock ->
+ *   (maybe) KERNEL_LOCK ==> backing object lock
+ *
+ * f_detach: fdp->fd_lock -> knote foplock ->
+ *(maybe) KERNEL_LOCK ==> backing object lock
+ *
+ * f_event via kevent: fdp->fd_lock -> knote foplock ->
+ *(maybe) KERNEL_LOCK ==> backing object lock
+ *N.B. NOTE_SUBMIT will never be set in the "hint" argument
+ *in this case.
  *
- * f_attach: fdp->fd_lock, KERNEL_LOCK
- * f_detach: fdp->fd_lock, KERNEL_LOCK
- * f_event(!NOTE_SUBMIT) via kevent: fdp->fd_lock, _no_ object lock
- * f_event via knote: whatever caller guarantees
- * Typically,  f_event(NOTE_SUBMIT) via knote: object lock
- * f_event(!NOTE_SUBMIT) via knote: nothing,
- * acquires/releases object lock inside.
+ * f_event via knote (via backing object): Whatever caller guarantees.
+ * Typically:
+ * f_event(NOTE_SUBMIT): caller has already acquired backing
+ * object lock.
+ * f_event(!NOTE_SUBMIT): caller has not acquired backing object
+ * lock, or has possibly acquired KERNEL_LOCK.  Backing object
+ * lock may or may not be acquired as-needed.
+ * N.B. the knote foplock will **not** be acquired in this case.  The
+ * caller guarantees that klist_fini() will not be called concurrently
+ * with knote().
+ *
+ * f_touch: 

Re: Problem with outstanding knotes and device detach - and a fix

2022-07-13 Thread Jason Thorpe

> On Jul 13, 2022, at 12:02 PM, Jason Thorpe  wrote:
> 
>> On Jul 13, 2022, at 11:25 AM, Taylor R Campbell wrote:
>> 
>> Sorry, haven't had time yet to do a full review, but I spot at least
>> one problem that means this won't fly as is: kqueue_register and
>> kqueue_scan both call filter_touch under a spin lock, but with your
>> patch filter_touch now takes an rwlock -- which is forbidden under a
>> spin lock (and it'll crash under LOCKDEBUG).
> 
> I found another issue as well and will post a follow-up later.

Ok, new version.  Main differences:

==> each knote gets its own mutex, rather than using a global rwlock.

==> Moved the locking outside of the filter_*() functions.

==> “Dealt” with the locking issue around the “touch” op, by only allowing 
user_filtops and timer_filtops to use it, because they’re known to be safe to 
call “touch” without holding the knote foplock (they don’t put notes on any 
lists).

The knote foplock got tossed onto the end of struct knote with the idea being 
that it doesn’t change the knote ABI for modules (they can’t allocate their own 
knotes).  If this gets aligned with another change that bumps the kernel 
version, I’ll put the foplock up next to fop.

This passes the kqueue tests on a LOCKDEBUG kernel.

Index: kern/kern_event.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_event.c,v
retrieving revision 1.143
diff -u -p -r1.143 kern_event.c
--- kern/kern_event.c   13 Jul 2022 14:11:46 -  1.143
+++ kern/kern_event.c   14 Jul 2022 02:01:44 -
@@ -133,6 +133,31 @@ static const struct fileops kqueueops = 
.fo_restart = kqueue_restart,
 };
 
+static void
+filt_nopdetach(struct knote *kn __unused)
+{
+}
+
+static int
+filt_nopevent(struct knote *kn __unused, long hint __unused)
+{
+   return 0;
+}
+
+static const struct filterops nop_fd_filtops = {
+   .f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE,
+   .f_attach = NULL,
+   .f_detach = filt_nopdetach,
+   .f_event = filt_nopevent,
+};
+
+static const struct filterops nop_filtops = {
+   .f_flags = FILTEROP_MPSAFE,
+   .f_attach = NULL,
+   .f_detach = filt_nopdetach,
+   .f_event = filt_nopevent,
+};
+
 static const struct filterops kqread_filtops = {
.f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE,
.f_attach = NULL,
@@ -235,12 +260,14 @@ static size_t user_kfiltersz; /* size 
  * Locking rules:
  *
  * f_attach: fdp->fd_lock, KERNEL_LOCK
- * f_detach: fdp->fd_lock, KERNEL_LOCK
- * f_event(!NOTE_SUBMIT) via kevent: fdp->fd_lock, _no_ object lock
- * f_event via knote: whatever caller guarantees
+ * f_detach: fdp->fd_lock, knote foplock, KERNEL_LOCK
+ * f_event(!NOTE_SUBMIT) via kevent: fdp->fd_lock, knote foplock,
+ *   acquires/releases object lock inside
+ * f_event via knote: whatever caller guarantees,
+ * knote foplock _not_ acquired
  * Typically,  f_event(NOTE_SUBMIT) via knote: object lock
- * f_event(!NOTE_SUBMIT) via knote: nothing,
- * acquires/releases object lock inside.
+ * f_event(!NOTE_SUBMIT) via knote:
+ * acquires/releases object lock inside
  *
  * Locking rules when detaching knotes:
  *
@@ -429,6 +456,7 @@ knote_alloc(bool sleepok)
struct knote *kn;
 
kn = kmem_zalloc(sizeof(*kn), sleepok ? KM_SLEEP : KM_NOSLEEP);
+   mutex_init(&kn->kn_foplock, MUTEX_DEFAULT, IPL_NONE);
 
return kn;
 }
@@ -436,14 +464,28 @@ knote_alloc(bool sleepok)
 static inline void
 knote_free(struct knote *kn)
 {
+   mutex_destroy(&kn->kn_foplock);
kmem_free(kn, sizeof(*kn));
 }
 
+/*
+ * Calls into the filterops need to be resilient against things which
+ * destroy a klist, e.g. device detach, freeing a vnode, etc., to avoid
+ * chasing garbage pointers (to data, or even potentially code in a
+ * module about to be unloaded).  To that end, we acquire the
+ * knote foplock before calling into the filter ops.  When a driver
+ * (or anything else) is tearing down its klist, klist_fini() enumerates
+ * each knote, acquires its foplock, and replaces the filterops with a
+ * nop stub, allowing knote detach (when descriptors are closed) to safely
+ * proceed.
+ */
+
 static int
 filter_attach(struct knote *kn)
 {
int rv;
 
+   KASSERT(mutex_owned(&kn->kn_foplock));
KASSERT(kn->kn_fop != NULL);
KASSERT(kn->kn_fop->f_attach != NULL);
 
@@ -465,6 +507,8 @@ filter_attach(struct knote *kn)
 static void
 filter_detach(struct knote *kn)
 {
+
+   KASSERT(mutex_owned(&kn->kn_foplock));
KASSERT(kn->kn_fop != NULL);
KASSERT(kn->kn_fop->f_detach != NULL);
 
@@ -478,10 +522,12 @@ filter_detach(struct knote *kn)
 }
 
 static int
-filter_event(struct knote *kn, long hint)
+filter_event(struct 

Re: Problem with outstanding knotes and device detach - and a fix

2022-07-13 Thread Jason Thorpe


> On Jul 13, 2022, at 11:25 AM, Taylor R Campbell wrote:
> 
> Sorry, haven't had time yet to do a full review, but I spot at least
> one problem that means this won't fly as is: kqueue_register and
> kqueue_scan both call filter_touch under a spin lock, but with your
> patch filter_touch now takes an rwlock -- which is forbidden under a
> spin lock (and it'll crash under LOCKDEBUG).

I found another issue as well and will post a follow-up later.

-- thorpej



Re: Problem with outstanding knotes and device detach - and a fix

2022-07-13 Thread Taylor R Campbell
Sorry, haven't had time yet to do a full review, but I spot at least
one problem that means this won't fly as is: kqueue_register and
kqueue_scan both call filter_touch under a spin lock, but with your
patch filter_touch now takes an rwlock -- which is forbidden under a
spin lock (and it'll crash under LOCKDEBUG).


Re: Problem with outstanding knotes and device detach - and a fix

2022-07-12 Thread Jason Thorpe


> On Jul 12, 2022, at 7:54 AM, Jason Thorpe  wrote:
> 
> The following patch rectifies this situation by having klist_fini() traverse 
> a list of knotes and substitute their filterops with no-op stubs.  This 
> requires synchronizing with any calls into the filterops themselves.  We have 
> convenient funnel-points for this in kern_event.c already, so it’s not 
> particularly difficult… the main issue is that the filterops themselves are 
> allowed to block.  My solution to this was to have a single global rwlock, 
> knote_filterops_lock, that’s acquired for reading when a call through a 
> knote's filterops is made, and in klist_fini() is acquired for writing for 
> the short duration needed to stub out filterops.  This lock should **very 
> rarely** be contended, but it will be *hot* (lots of acquire-for-read), so it 
> gets its own cache line.
> 
> If someone has ideas about another synchronization mechanism, I’m all ears… 
> again, the main issue is that the filterops calls themselves are allowed to 
> block, so I’m not sure that any of our passive serialization mechanisms would 
> work here.

FWIW, another alternative is to put the rwlock into the knote itself, if we 
think that lock will be Too Hot (which is not an unreasonable concern on a 
machine with many cores).  The trade-off is memory.  I could see putting the 
lock in the knote itself on amd64 and aarch64 (where we would expect to see 
high core count machines) and a global lock on everything else.  Then again, on 
amd64, struct knote is 136 bytes, so it already spills into the kmem-00192 
bucket, and on i386 it’s 84 bytes, so kmem-00128 bucket.  I guess each of those 
could easily accommodate an additional pointer-sized struct member to alleviate 
any scalability concern about a global rwlock.

-- thorpej




Re: Problem with outstanding knotes and device detach - and a fix

2022-07-12 Thread Jason Thorpe


> On Jul 12, 2022, at 10:27 AM, Jason Thorpe  wrote:
> 
> 
>> On Jul 12, 2022, at 7:54 AM, Jason Thorpe  wrote:
>> 
>> If someone has ideas about another synchronization mechanism, I’m all ears… 
>> again, the main issue is that the filterops calls themselves are allowed to 
>> block, so I’m not sure that any of our passive serialization mechanisms 
>> would work here.
>> 
>> 
> 
> Well, for some reason, I mistakenly posted an old version of this patch.  
> I’ll post a refresh later today.

Oh, no, never mind, I’m wrong about this :-). NOTHING TO SEE HERE *whistles*

-- thorpej




Re: Problem with outstanding knotes and device detach - and a fix

2022-07-12 Thread Jason Thorpe


> On Jul 12, 2022, at 7:54 AM, Jason Thorpe  wrote:
> 
> If someone has ideas about another synchronization mechanism, I’m all ears… 
> again, the main issue is that the filterops calls themselves are allowed to 
> block, so I’m not sure that any of our passive serialization mechanisms would 
> work here.
> 
> 

Well, for some reason, I mistakenly posted an old version of this patch.  I’ll 
post a refresh later today.

-- thorpej