The patch I posted had a braino, but not one apropos to your question. The GIT trees and 2.6-current/ patches have the proper version of that change.
> How can this fix anything? utrace->u.live.signal was still set to > valid on-stack pointer many lines above without any exclusion from > utrace_detach(). utrace_detach is excluded by the conventions of "quiescence" synchronization, which is what this piece of the implementation is all about. That is, the remove_engine + wake_quiescent path is excluded because quiesce() will see that the target is not in TASK_TRACED state. So, code that looks at utrace->u.live.signal is excluded up until the set_current_state(TASK_TRACED) happens. quiesce and utrace_quiescent use the siglock to synchronize the check for TASK_TRACED with the transition into TASK_TRACED. I did manage to reproduce the BUG_ON(tsk->utrace == utrace) crash fairly reliably in one setup and diagnose what it was doing. AFAICT it was behaving as I described in my last posting. Synchronization on the way into TASK_TRACED was working fine, as I just described it here. utrace_detach's quiesce call decided the task was quiescent, because it really was in TASK_TRACED. The problem was that the task woke up in parallel due to SIGKILL, and immediately cleared utrace->u.live.signal with no synchronization. The utrace_clear_tsk and check_dead_utrace code expects that change to have been excluded when it thinks the task is quiescent. When that change was made without synchronization and came in between utrace_clear_tsk's check and check_dead_utrace's check, it caused the bookkeeping mismatch that produced the BUG_ON crash. The change I've now made is this: if we did enter TASK_TRACED, then upon waking up, take and release utrace->lock. If and only if we could have been observed to be in TASK_TRACED, then we might still be considered quiescent by the thread that holds utrace->lock. While quiescent, we must not touch our struct utrace. So, use utrace->lock to serialize after anyone who might consider us quiescent. We are already in TASK_RUNNING when we get the lock, so just by releasing it immediately we exclude anyone holding the lock and seeing us as quiescent. At that point, we are no longer quiescent in anyone's eyes, and so it is safe to clear utrace->u.live.signal. In the testing I've been able to do so far, this change is working. (In a setup where the test case crashed within a few minutes before, it ran all night with no problems.) It certainly needs more testing to be confident about the fix. I would very much appreciate your analysis and checking of my logic. The ptrace crash you hit is still unresolved. Since I never did reproduce that particular crash myself, I don't have great confidence in the patch I posted for it. I have not put that patch in yet. I'd be very grateful for any feedback you can give from testing or from reviewing the logic of my change. Thanks, Roland