> > in check_dead_utrace -- it can happen if utrace_clear_tsk() skipped
> > clearing ->utrace pointer in otherwise normal detaching sequence. This
> > can happen, due to utrace->u.live.signal being valid pointer at that time.
> > Now this can happen when execution of resumed task starts:
> > 
> >     do_notify_resume
> >     get_signal_to_deliver
> >     tracehook_get_signal
> >     utrace_get_signal
> >     [utrace pointer found, utrace->lock taken]
> [>                                    "not taken", of course.]
> >     utrace_quiescent
> >     [signal is valid here, put onto live struct utrace without
> >      locking]
> > 
> > So, ->utrace clearance skipped
> >     wake_quiescent
> >     check_dead_utrace
> >     BUG_ON(tsk->utrace == utrace);

> Again let me conjure the theory on what's supposed to prevent this one,
> and maybe you will see the holes in the logic (or maybe I will).
> 
> utrace_detach holds the utrace lock.  To be in the remove_engine and
> wake_quiescent path, quiesce returned 1, meaning it was in TASK_TRACED.
> After that remove_engine called utrace_clear_tsk, and u.live.signal was
> set then.  The theory is that when check_dead_utrace checks u.live.signal
> it will match what utrace_clear_tsk saw.  When it sees something there,
> it doesn't take the path to rcu_utrace_free.  The BUG_ON hits because
> the theory does not hold.
> 
> Since we hold the lock, the only thing waking the target up before us is
> SIGKILL.  When that happens, utrace_quiescent will clear u.live.signal
> asynchronously without regard to utrace->lock.  If this happens after
> utrace_clear_tsk and before check_dead_utrace, then we hit the anomaly.
> 
> Do you think that logic is sound?  The original theory was based on the
> expectation that nothing else de-quiesces the target while we hold the
> lock.  I think that's still sound with the exception of SIGKILL.
> Do you think that is valid otherwise?
> 
> If this logic does explain the problem, then perhaps it is fixed by the
> patch below.  The comment gives the rationale for why it should cover the
> SIGKILL scenario.  What do you think?

> --- a/kernel/utrace.c
> +++ b/kernel/utrace.c
> @@ -1423,6 +1423,23 @@ restart:
>                        */
>                       BUG_ON(tsk->utrace != utrace);
>                       BUG_ON(utrace->u.live.signal != signal);
> +                     if (unlikely(killed)) {
> +                             /*
> +                              * Synchronize with any utrace_detach that
> +                              * might be in progress.  It expected us to
> +                              * stay quiescent, but SIGKILL broke that.
> +                              * Taking this lock here serializes its
> +                              * work so that if it had the lock and
> +                              * thought we were still in TASK_TRACED, we
> +                              * block until it has finished looking at
> +                              * utrace.  A utrace_detach that gets the
> +                              * lock after we release it here will not
> +                              * think we are quiescent at all, since we
> +                              * are in TASK_RUNNING state now.
> +                              */
> +                             spin_lock(&utrace->lock);
> +                             spin_unlock(&utrace->lock);
> +                     }
>                       utrace->u.live.signal = NULL;
>               }

How can this fix anything? utrace->u.live.signal was still set to
valid on-stack pointer many lines above without any exclusion from
utrace_detach().

Reply via email to