> Another bug which I _think_, can be triggered is "BUG_ON(tsk->utrace == 
> utrace);"

This one has indeed been happening.  I've seen it, but still haven't
reproduced it sufficiently repeatably to be able to work on it.

> in check_dead_utrace -- it can happen if utrace_clear_tsk() skipped
> clearing ->utrace pointer in otherwise normal detaching sequence. This
> can happen, due to utrace->u.live.signal being valid pointer at that time.
> Now this can happen when execution of resumed task starts:
> 
>       do_notify_resume
>       get_signal_to_deliver
>       tracehook_get_signal
>       utrace_get_signal
>       [utrace pointer found, utrace->lock taken]
[>                                      "not taken", of course.]
>       utrace_quiescent
>       [signal is valid here, put onto live struct utrace without
>        locking]
> 
> So, ->utrace clearance skipped
>       wake_quiescent
>       check_dead_utrace
>       BUG_ON(tsk->utrace == utrace);
> 
> I can't reproduce this on -rc8 at will, but I don't see anything that
> prevents above race as well. Probably window for #1 far wider :-(

Again let me conjure the theory on what's supposed to prevent this one,
and maybe you will see the holes in the logic (or maybe I will).

utrace_detach holds the utrace lock.  To be in the remove_engine and
wake_quiescent path, quiesce returned 1, meaning it was in TASK_TRACED.
After that remove_engine called utrace_clear_tsk, and u.live.signal was
set then.  The theory is that when check_dead_utrace checks u.live.signal
it will match what utrace_clear_tsk saw.  When it sees something there,
it doesn't take the path to rcu_utrace_free.  The BUG_ON hits because
the theory does not hold.

Since we hold the lock, the only thing waking the target up before us is
SIGKILL.  When that happens, utrace_quiescent will clear u.live.signal
asynchronously without regard to utrace->lock.  If this happens after
utrace_clear_tsk and before check_dead_utrace, then we hit the anomaly.

Do you think that logic is sound?  The original theory was based on the
expectation that nothing else de-quiesces the target while we hold the
lock.  I think that's still sound with the exception of SIGKILL.
Do you think that is valid otherwise?

If this logic does explain the problem, then perhaps it is fixed by the
patch below.  The comment gives the rationale for why it should cover the
SIGKILL scenario.  What do you think?


Thanks,
Roland


--- a/kernel/utrace.c
+++ b/kernel/utrace.c
@@ -1423,6 +1423,23 @@ restart:
                         */
                        BUG_ON(tsk->utrace != utrace);
                        BUG_ON(utrace->u.live.signal != signal);
+                       if (unlikely(killed)) {
+                               /*
+                                * Synchronize with any utrace_detach that
+                                * might be in progress.  It expected us to
+                                * stay quiescent, but SIGKILL broke that.
+                                * Taking this lock here serializes its
+                                * work so that if it had the lock and
+                                * thought we were still in TASK_TRACED, we
+                                * block until it has finished looking at
+                                * utrace.  A utrace_detach that gets the
+                                * lock after we release it here will not
+                                * think we are quiescent at all, since we
+                                * are in TASK_RUNNING state now.
+                                */
+                               spin_lock(&utrace->lock);
+                               spin_unlock(&utrace->lock);
+                       }
                        utrace->u.live.signal = NULL;
                }
 

Reply via email to