The patch I posted had a braino, but not one apropos to your question.
The GIT trees and 2.6-current/ patches have the proper version of that
change.  

> How can this fix anything? utrace->u.live.signal was still set to
> valid on-stack pointer many lines above without any exclusion from
> utrace_detach().

utrace_detach is excluded by the conventions of "quiescence"
synchronization, which is what this piece of the implementation
is all about.  That is, the remove_engine + wake_quiescent path
is excluded because quiesce() will see that the target is not in
TASK_TRACED state.  So, code that looks at utrace->u.live.signal
is excluded up until the set_current_state(TASK_TRACED) happens.
quiesce and utrace_quiescent use the siglock to synchronize the
check for TASK_TRACED with the transition into TASK_TRACED.

I did manage to reproduce the BUG_ON(tsk->utrace == utrace) crash
fairly reliably in one setup and diagnose what it was doing.  AFAICT
it was behaving as I described in my last posting.  Synchronization
on the way into TASK_TRACED was working fine, as I just described it
here.  utrace_detach's quiesce call decided the task was quiescent,
because it really was in TASK_TRACED.  The problem was that the task
woke up in parallel due to SIGKILL, and immediately cleared
utrace->u.live.signal with no synchronization.  The utrace_clear_tsk
and check_dead_utrace code expects that change to have been excluded
when it thinks the task is quiescent.  When that change was made
without synchronization and came in between utrace_clear_tsk's check
and check_dead_utrace's check, it caused the bookkeeping mismatch
that produced the BUG_ON crash.

The change I've now made is this: if we did enter TASK_TRACED, then upon
waking up, take and release utrace->lock.  If and only if we could have
been observed to be in TASK_TRACED, then we might still be considered
quiescent by the thread that holds utrace->lock.  While quiescent, we
must not touch our struct utrace.  So, use utrace->lock to serialize
after anyone who might consider us quiescent.  We are already in
TASK_RUNNING when we get the lock, so just by releasing it immediately we
exclude anyone holding the lock and seeing us as quiescent.  At that
point, we are no longer quiescent in anyone's eyes, and so it is safe to
clear utrace->u.live.signal.

In the testing I've been able to do so far, this change is working.
(In a setup where the test case crashed within a few minutes before,
it ran all night with no problems.)  It certainly needs more testing
to be confident about the fix.  I would very much appreciate your
analysis and checking of my logic.  

The ptrace crash you hit is still unresolved.  Since I never did
reproduce that particular crash myself, I don't have great confidence
in the patch I posted for it.  I have not put that patch in yet.  I'd
be very grateful for any feedback you can give from testing or from
reviewing the logic of my change.


Thanks,
Roland

Reply via email to