> > in check_dead_utrace -- it can happen if utrace_clear_tsk() skipped > > clearing ->utrace pointer in otherwise normal detaching sequence. This > > can happen, due to utrace->u.live.signal being valid pointer at that time. > > Now this can happen when execution of resumed task starts: > > > > do_notify_resume > > get_signal_to_deliver > > tracehook_get_signal > > utrace_get_signal > > [utrace pointer found, utrace->lock taken] > [> "not taken", of course.] > > utrace_quiescent > > [signal is valid here, put onto live struct utrace without > > locking] > > > > So, ->utrace clearance skipped > > wake_quiescent > > check_dead_utrace > > BUG_ON(tsk->utrace == utrace);
> Again let me conjure the theory on what's supposed to prevent this one, > and maybe you will see the holes in the logic (or maybe I will). > > utrace_detach holds the utrace lock. To be in the remove_engine and > wake_quiescent path, quiesce returned 1, meaning it was in TASK_TRACED. > After that remove_engine called utrace_clear_tsk, and u.live.signal was > set then. The theory is that when check_dead_utrace checks u.live.signal > it will match what utrace_clear_tsk saw. When it sees something there, > it doesn't take the path to rcu_utrace_free. The BUG_ON hits because > the theory does not hold. > > Since we hold the lock, the only thing waking the target up before us is > SIGKILL. When that happens, utrace_quiescent will clear u.live.signal > asynchronously without regard to utrace->lock. If this happens after > utrace_clear_tsk and before check_dead_utrace, then we hit the anomaly. > > Do you think that logic is sound? The original theory was based on the > expectation that nothing else de-quiesces the target while we hold the > lock. I think that's still sound with the exception of SIGKILL. > Do you think that is valid otherwise? > > If this logic does explain the problem, then perhaps it is fixed by the > patch below. The comment gives the rationale for why it should cover the > SIGKILL scenario. What do you think? > --- a/kernel/utrace.c > +++ b/kernel/utrace.c > @@ -1423,6 +1423,23 @@ restart: > */ > BUG_ON(tsk->utrace != utrace); > BUG_ON(utrace->u.live.signal != signal); > + if (unlikely(killed)) { > + /* > + * Synchronize with any utrace_detach that > + * might be in progress. It expected us to > + * stay quiescent, but SIGKILL broke that. > + * Taking this lock here serializes its > + * work so that if it had the lock and > + * thought we were still in TASK_TRACED, we > + * block until it has finished looking at > + * utrace. A utrace_detach that gets the > + * lock after we release it here will not > + * think we are quiescent at all, since we > + * are in TASK_RUNNING state now. > + */ > + spin_lock(&utrace->lock); > + spin_unlock(&utrace->lock); > + } > utrace->u.live.signal = NULL; > } How can this fix anything? utrace->u.live.signal was still set to valid on-stack pointer many lines above without any exclusion from utrace_detach().