Re: Stopped detach/attach status
My point was, the discussed problems with ptrace stop probably are not ptrace-only, we need other changes. Hopefully we should address them after v1. Sure. But I'd prefer to delay this discussion unless you think we should fix this right now. I'm glad to stop thinking about it for a while. (Let's just not forget for another year.) I mean, I just can't think about this until we solve all known problems in utrace-ptrace. When I am trying to think about 2 different problems at the same time, I can't make any progress ;) I understand the problem. (I do four or five at once and like to think I make marginal progress on one, but that is my famous optimism.) I dunno. I am not arguing, just I don't know. But, stopped-attach-transparency seem to check this. I am not sure, but Excessive waiting SIGSTOP after it was already waited for\n looks like it it does. I'll re-check. I don't know for sure either. I'm just speculating. Thanks, Roland
Re: Stopped detach/attach status
On 10/11, Roland McGrath wrote: Yes. In particular, ptrace(PTRACE_DETACH, SIGKILL) should cancel SIGNAL_STOP_STOPPED, yes? Yes. OK. SIGNAL_STOP_DEQUEUED exists for one purpose. It's to ensure that SIGCONT and SIGKILL can clear it to make complete their required effect of clearing all pending stop signals. (It fills the hole when another thread has dequeued a stop signal and then dropped the siglock to make its call to is_current_pgrp_orphaned()--so that half-delivered signal is still considered pending and thus must be cancelled by SIGCONT or SIGKILL.) Yes, except this doesn't really work. We have a lot of races, afaics, even without ptrace. The problem is, once we drop -siglock, we can't trust SIGNAL_STOP_DEQUEUED. And siglock is dropped by dequeue_signal(), it is dropped by get_signal_to_deliver(), by ptrace_signal/utrace too. Example. get_signal_to_deliver() dequeues, say, SIGTTIN, drops -siglock for is_current_pgrp_orphaned(). SIGCONT comes, clears SIGNAL_STOP_DEQUEUED - we shouldn't stop. Another thread deques SIGTTOU but this signal has a handler, so we should not stop. However, SIGTTOU restores SIGNAL_STOP_DEQUEUED. This means the first thread will initiate the group-stop later. Another way to put it is to say that the exists for one purpose statement above implies that only an actual SIGCONT or SIGKILL should ever clear SIGNAL_STOP_DEQUEUED. Yes, agreed. But I really need to re-read this code. The remaining two places are the -flags = SIGNAL_STOP_STOPPED cases in do_signal_stop and exit_signals. Since SIGNAL_STOP_DEQUEUED must always have been set before if you can get to those situations, it is harmless to use -flags = SIGNAL_STOP_STOPPED | SIGNAL_STOP_DEQUEUED instead of: sig-flags = SIGNAL_STOP_DEQUEUED; sig-flags |= SIGNAL_STOP_STOPPED; or anything like that. It's probably cleanest to consolidate those two cases to call a single subroutine that does the tracehook_notify_jctl logic, unlock and do_notify_parent_cldstop. It can take a caller flag or just check PF_EXITING to omit the -exit_code + -state change of the do_signal_stop version of the code. That one subroutine can have a clear comment about the nonobvious flag usage. Hmm... not sure, but I need to think more. But please remember, the patch above is not complete of course and currently I do not see the good solution. What's incomplete aside from handling the exit_signals case the same way? Say, ptrace(DETACH/CONT, SIGSTOP). This should work, this means SIGNAL_STOP_DEQUEUED should be set even before the tracee calls do_signal_stop(). But otoh it doesn't look right to set this flag each time the tracee sees a stop signal from the debugger (especially on detach), -real_parent should not see multiple notifications. I am starting to think we should forget about these bugs, merge utrace-ptrace, and then try to fix them. If we can have utrace-ptrace code whose corner behavior matches the old code and is itself clean, then I don't care about the order of the changes going in. But it's not really clear to me that we can even describe the old behavior in terms clean enough to make an exact work-alike implementation that could possibly be clean. Yep. The current utrace-ptrace behaviour is close enough to upstream behaviour. I hope. I am going to restore the detach always wakeup logic to be compatible. Hopefully in this case utrace-ptrace will be equally buggy. I just can't concentrate on these problems until I finish utrace-ptrace. Because every week I should explain 1-2 times why it is still not ready for submission and when I am going to send it ;) Oleg.
Re: Stopped detach/attach status
Incomplete reply, just can't read/think/concentrate today... On 10/07, Roland McGrath wrote: We had a lengthy discussion about this. Yes. I only ever wanted that revert then because it was too late in the 2.6.30 cycle to hash this all out and get it really right. I meant that we should leave wrong enough alone in 2.6.30 but get it all worked out more properly in 2.6.31, but I forgot to follow up on it. If we can iron out the behavior now and the upstream version of implementing it is not big new hair, it might still be possible to get it fixed in 2.6.32. That piece of implementation is 100% wrong. But we have to figure out what the manifest semantics are today from the userland perspective and decide what exactly we want them to be before we implement those precise semantics in some sensible way. Yes. In particular, ptrace(PTRACE_DETACH, SIGKILL) should cancel SIGNAL_STOP_STOPPED, yes? - sig-flags = SIGNAL_STOP_STOPPED; + sig-flags = SIGNAL_STOP_STOPPED | SIGNAL_STOP_DEQUEUED; Boy, do I not understand why that does anything about this at all! But I am barely awake tonight. Ok, I guess I do sort of if it goes along with some other patch to set SIGNAL_STOP_STOPPED. But since you've verified you really understand what happens, you can tell us! Two threads T1 and T2, both ptraced by P, both TASK_TRACED, T2 sleeps in ptrace_signal(). P does: ptrace(DETACH, T1, SIGSTOP); ptrace(DETACH, T2, SIGSTOP); The first DETACH wakes up T1, it dequeues SIGSTOP, calls do_signal_stop(). T2 is still TASK_TRACED, this means T1 completes the group-stop and sets sig-flags = SIGNAL_STOP_STOPPED. The second detach wakes up T2, it returns from ptrace_signal() and calls do_signal_stop() which does nothing without SIGNAL_STOP_DEQUEUED. But please remember, the patch above is not complete of course and currently I do not see the good solution. I am starting to think we should forget about these bugs, merge utrace-ptrace, and then try to fix them. Even the first detach can fail to stop T1, because SIGNAL_STOP_DEQUEUED can be cleared before. I never knew what user-space actually does with ptrace, now I am really surprized gdb/etc assume it can trust ptrace(SIGSTOP). Sometime it works, but only by accident. Oleg.
Re: Stopped detach/attach status
On 10/06, Jan Kratochvil wrote: On Mon, 05 Oct 2009 04:32:08 +0200, Oleg Nesterov wrote: [...] Firstly, I think we should un-revert edaba2c5334492f82d39ec35637c6dea5176a977. This unconditional wakeup is hopelessly wrong imho, and it is removed from utrace-ptrace code. But this breaks another test-case, attach-wait-on-stopped. I still think this test-case is wrong. We had a lengthy discussion about this. Now, this patch --- TTT_32/kernel/signal.c~PT_STOP 2009-10-04 04:08:36.0 +0200 +++ TTT_32/kernel/signal.c 2009-10-05 03:17:39.0 +0200 @@ -1708,7 +1708,7 @@ static int do_signal_stop(int signr) */ if (sig-group_stop_count) { if (!--sig-group_stop_count) - sig-flags = SIGNAL_STOP_STOPPED; + sig-flags = SIGNAL_STOP_STOPPED | SIGNAL_STOP_DEQUEUED; current-exit_code = sig-group_exit_code; __set_current_state(TASK_STOPPED); } fixes the tests above. Of course this change is not enough, I did it just to verify I really understand what happens. Except, stopped-attach-transparency prints Excessive waiting SIGSTOP after the second attach/detach afaics the test-case is not right here. attach_detach() leaves the traced threads in STOPPED state, why pid_notifying_sigstop() should fail? Tried the patch above: http://koji.fedoraproject.org/scratch/jkratoch/task_1730038/ but it does not break stopped-attach-transparency for me. You mean, it doesn't fix stopped-attach-transparency? Strange... I tested it, with this patch stopped-attach-transparency succeeds on my machine. But! please note this change is not enough above, the right fix is not simple. However I thought it should work in this case... Should I also apply back edaba2c5334492f82d39ec35637c6dea5176a977? This revert should be unreverted anyway, imho. This wakeup is wrong in many ways, but note that the changelog says in particular: this wake_up_process() can break another assumption: PTRACE_DETACH with SIGSTOP should leave the tracee in TASK_STOPPED case. Because the untraced child can dequeue SIGSTOP and call do_signal_stop() before ptrace_detach() calls wake_up_process(). However, this race is unlikely, I don't think it can explain why stopped-attach-transparency fails with the patch above. Oleg.
Re: Stopped detach/attach status
I do not know. I'd leave this to Roland. I mean, if he thinks this should be fixed - I'll try to fix. But. This all looks unfixeable to me. In my opinion, the kernel is obviously wrong, and test-case are wrong too. And any fix in this area is user-visible and can break the current expectations. My view is that the handling of a tracee that was already job-stopped (left by ^Z, etc.) or of a tracee that you want to leave job-stopped for its natural parent to see on detach, has never really worked. In the past the kernel and GDB (let alone strace) have both been wrong and even internally inconsistent, and each changed incompatibly with itself (not to mention with each other) over the years. So for those cases as such, I don't think any particular compatibility matters so much as just settling on something acceptable now. I expect that there is no one behavior for this case that is actually compatible with the expectations of a significant range of tools/versions such that nearly any change at all wouldn't be a regression for some and an improvement for others. However, we unfortunately cannot say the same for all quirks that we would probably like to change. That is, the treatment of SIGSTOP and what wakeups get done and so forth is an intricate dependency of plain attach/detach logic in all manner of debuggeresque tools for numerous corner cases (many MT-related). So I think the main constraint on what we can come up with as newly almost sane semantics is that we really should not regress for any other corner of attach/detach et al logic in any current or past version of gdb, or strace, or anything else. Firstly, I think we should un-revert edaba2c5334492f82d39ec35637c6dea5176a977. Yes. That unconditional wakeup was one of the very first things years ago that made me shake my head and marvel about what sucker they would ever get to maintain what all the purported ABI semantics corners of this ludicrous ptrace() implementation were supposed to be, glad that it was not my problem to touch that wormy can of flaming obvious wrong with a ten foot pole. At the time it was probably already obvious to most people other than me that I would be that very sucker for years to come. Hmm, perhaps I have not set this up very well now for asking you to think up all the details of matching the established semantics (that really sounds better than purported, doesn't it?) of this insane implementation. :-) We had a lengthy discussion about this. Yes. I only ever wanted that revert then because it was too late in the 2.6.30 cycle to hash this all out and get it really right. I meant that we should leave wrong enough alone in 2.6.30 but get it all worked out more properly in 2.6.31, but I forgot to follow up on it. If we can iron out the behavior now and the upstream version of implementing it is not big new hair, it might still be possible to get it fixed in 2.6.32. That piece of implementation is 100% wrong. But we have to figure out what the manifest semantics are today from the userland perspective and decide what exactly we want them to be before we implement those precise semantics in some sensible way. We may have to settle on something that, while more consistent than today's kernel, is still somewhat wrong in the abstract, when we need to preserve application compatibility. - sig-flags = SIGNAL_STOP_STOPPED; + sig-flags = SIGNAL_STOP_STOPPED | SIGNAL_STOP_DEQUEUED; Boy, do I not understand why that does anything about this at all! But I am barely awake tonight. Ok, I guess I do sort of if it goes along with some other patch to set SIGNAL_STOP_STOPPED. But since you've verified you really understand what happens, you can tell us! But as I said, I do not really understand what this test-case tries to do. What ptrace(PTRACE_DETACH, SIGSTOP) should mean? I think that ptrace(PTRACE_DETACH, signr) should mean the tracee should proceed with this signal, as if it was sent by, say, kill. Except for not being possibly intercepted again, roughly yes. But note that this really is only delivery, not generation (POSIX signal terms). (In the kludge cases for blocked signals and non-signal stops, sometimes it really is generation like kill, but here I mean resuming from real signal stops.) That means that magic generation-time effects, such as SIGSTOP clearing pending SIGCONT and vice versa, do not happen (if passing on the signal, they happened before at original generation). Internally, dequeue-time effects, i.e. SIGNAL_STOP_DEQUEUED and timer rearming, also don't happen (because they happened before). But since SIGNAL_STOP_DEQUEUED is pure internal bookkeeping, that is something we could change as an implementation detail for the semantics we want. In this case, I don't understand why stopped-attach-transparency sends SIGSTOP to every sub-thread. If the tracer wants to stop the thread group after detach, it can do
Re: Stopped detach/attach status
On Mon, 05 Oct 2009 04:32:08 +0200, Oleg Nesterov wrote: On 10/01, Jan Kratochvil wrote: the ptrace-testsuite http://sourceware.org/systemtap/wiki/utrace/tests currently FAILs (also) on Fedora 12 kernel-2.6.31.1-48.fc12.x86_64 for: FAIL: detach-stopped FAIL: stopped-attach-transparency [...] As for user-space, I don't really understand the second test-case, this again means I don't understand the supposed behaviour. The high level goal is described at its top. Users expect that if they run `gstack PID' or `gcore PID' the target PID will be absolutely in the same state as before gstack/gcore. That means it will keep both whether it was / was not stopped and also any possible existing / non-existing pending signal for a possible future waitpid() from its real (non-ptrace) parent PID. Another question whether technically what it does is right but this high level goal is hopefully valid. Except, stopped-attach-transparency prints Excessive waiting SIGSTOP after the second attach/detach afaics the test-case is not right here. attach_detach() leaves the traced threads in STOPPED state, why pid_notifying_sigstop() should fail? [ Not replying this part, have not built a kernel with this patch now. ] In this case, I don't understand why stopped-attach-transparency sends SIGSTOP to every sub-thread. If the tracer wants to stop the thread group after detach, it can do ptrace(PTRACE_DETACH, anythread, SIGSTOP); for_each_other_thread(pid) ptrace(PTRACE_DETACH, anythread, 0); or just kill(SIGSTOP); for_each_thread(pid) ptrace(PTRACE_DETACH, anythread, 0); OK, it this is the recommended way I can fix the testcase this way. The all-threads-being-sent-SIGSTOP way IIRC worked on linux-2.6.9 but I do not think this part of the compatibility must be kept. Thanks, Jan
Re: Stopped detach/attach status
(add Roland) On 10/01, Jan Kratochvil wrote: the ptrace-testsuite http://sourceware.org/systemtap/wiki/utrace/tests currently FAILs (also) on Fedora 12 kernel-2.6.31.1-48.fc12.x86_64 for: FAIL: detach-stopped FAIL: stopped-attach-transparency Do you agree with the testcases and is it planned to fix them for F12? I do not know. I'd leave this to Roland. I mean, if he thinks this should be fixed - I'll try to fix. But. This all looks unfixeable to me. In my opinion, the kernel is obviously wrong, and test-case are wrong too. And any fix in this area is user-visible and can break the current expectations. As for kernel, I lost any hope to understand what is the _supposed_ behaviour. As for user-space, I don't really understand the second test-case, this again means I don't understand the supposed behaviour. Firstly, I think we should un-revert edaba2c5334492f82d39ec35637c6dea5176a977. This unconditional wakeup is hopelessly wrong imho, and it is removed from utrace-ptrace code. But this breaks another test-case, attach-wait-on-stopped. I still think this test-case is wrong. We had a lengthy discussion about this. Now, this patch --- TTT_32/kernel/signal.c~PT_STOP 2009-10-04 04:08:36.0 +0200 +++ TTT_32/kernel/signal.c 2009-10-05 03:17:39.0 +0200 @@ -1708,7 +1708,7 @@ static int do_signal_stop(int signr) */ if (sig-group_stop_count) { if (!--sig-group_stop_count) - sig-flags = SIGNAL_STOP_STOPPED; + sig-flags = SIGNAL_STOP_STOPPED | SIGNAL_STOP_DEQUEUED; current-exit_code = sig-group_exit_code; __set_current_state(TASK_STOPPED); } fixes the tests above. Of course this change is not enough, I did it just to verify I really understand what happens. Except, stopped-attach-transparency prints Excessive waiting SIGSTOP after the second attach/detach afaics the test-case is not right here. attach_detach() leaves the traced threads in STOPPED state, why pid_notifying_sigstop() should fail? But as I said, I do not really understand what this test-case tries to do. What ptrace(PTRACE_DETACH, SIGSTOP) should mean? I think that ptrace(PTRACE_DETACH, signr) should mean the tracee should proceed with this signal, as if it was sent by, say, kill. In this case, I don't understand why stopped-attach-transparency sends SIGSTOP to every sub-thread. If the tracer wants to stop the thread group after detach, it can do ptrace(PTRACE_DETACH, anythread, SIGSTOP); for_each_other_thread(pid) ptrace(PTRACE_DETACH, anythread, 0); or just kill(SIGSTOP); for_each_thread(pid) ptrace(PTRACE_DETACH, anythread, 0); I do not say this will really work with the current implementaion, we have other bugs/races. I mean I'd expect this should be the right way to do detach+stop. And. Currently PTRACE_CONT/PTRACE_DETACH/etc wakes up the tracee even if the thread group is stopped. This is obviously not right, but utrace-ptrace does the same. I guess we can't fix this without breaking existing applications. In short: I don't know what to do ;) Oleg.
Stopped detach/attach status
Hi Oleg, the ptrace-testsuite http://sourceware.org/systemtap/wiki/utrace/tests currently FAILs (also) on Fedora 12 kernel-2.6.31.1-48.fc12.x86_64 for: FAIL: detach-stopped FAIL: stopped-attach-transparency Do you agree with the testcases and is it planned to fix them for F12? Thanks, Jan
Re: Stopped detach/attach status
Hi Jan, On 10/01, Jan Kratochvil wrote: the ptrace-testsuite http://sourceware.org/systemtap/wiki/utrace/tests currently FAILs (also) on Fedora 12 kernel-2.6.31.1-48.fc12.x86_64 for: FAIL: detach-stopped Please recall our previous discussion about 95a3540da9c81a5987be810e1d9a83640a366bd5 which was then reverted ;) I do not know what I can do. In short, I think this wakeup is absolutely wrong (not only because of this test-case), but if we kill it we break another test-case: attach-wait-on-stopped I believe this test-case is wrong. I can send you mbox with the previous discussion. FAIL: stopped-attach-transparency I'll try to take a look on Monday, thanks. Oleg.