Re: Stopped detach/attach status

2009-10-16 Thread Roland McGrath
 My point was, the discussed problems with ptrace  stop probably are
 not ptrace-only, we need other changes. Hopefully we should address
 them after v1.

Sure.

 But I'd prefer to delay this discussion unless you think we should
 fix this right now.

I'm glad to stop thinking about it for a while.
(Let's just not forget for another year.)

 I mean, I just can't think about this until we solve all known
 problems in utrace-ptrace. When I am trying to think about 2 different
 problems at the same time, I can't make any progress ;)

I understand the problem.  (I do four or five at once and like to think I
make marginal progress on one, but that is my famous optimism.)

 I dunno. I am not arguing, just I don't know. But, stopped-attach-transparency
 seem to check this. I am not sure, but
 Excessive waiting SIGSTOP after it was already waited for\n looks like it
 it does. I'll re-check.

I don't know for sure either.  I'm just speculating.


Thanks,
Roland



Re: Stopped detach/attach status

2009-10-12 Thread Oleg Nesterov
On 10/11, Roland McGrath wrote:

  Yes. In particular, ptrace(PTRACE_DETACH, SIGKILL) should cancel
  SIGNAL_STOP_STOPPED, yes?

 Yes.

OK.

 SIGNAL_STOP_DEQUEUED exists for one purpose.  It's to ensure that SIGCONT
 and SIGKILL can clear it to make complete their required effect of clearing
 all pending stop signals.  (It fills the hole when another thread has
 dequeued a stop signal and then dropped the siglock to make its call to
 is_current_pgrp_orphaned()--so that half-delivered signal is still
 considered pending and thus must be cancelled by SIGCONT or SIGKILL.)

Yes, except this doesn't really work. We have a lot of races, afaics,
even without ptrace. The problem is, once we drop -siglock, we can't
trust SIGNAL_STOP_DEQUEUED. And siglock is dropped by dequeue_signal(),
it is dropped by get_signal_to_deliver(), by ptrace_signal/utrace too.

Example. get_signal_to_deliver() dequeues, say, SIGTTIN, drops -siglock
for is_current_pgrp_orphaned().

SIGCONT comes, clears SIGNAL_STOP_DEQUEUED - we shouldn't stop.

Another thread deques SIGTTOU but this signal has a handler, so we should
not stop. However, SIGTTOU restores SIGNAL_STOP_DEQUEUED. This means the
first thread will initiate the group-stop later.

 Another way to put it is to say that the exists for one purpose
 statement above implies that only an actual SIGCONT or SIGKILL should
 ever clear SIGNAL_STOP_DEQUEUED.

Yes, agreed. But I really need to re-read this code.

 The remaining two places are the -flags = SIGNAL_STOP_STOPPED cases in
 do_signal_stop and exit_signals.  Since SIGNAL_STOP_DEQUEUED must always
 have been set before if you can get to those situations, it is harmless
 to use -flags = SIGNAL_STOP_STOPPED | SIGNAL_STOP_DEQUEUED instead of:

   sig-flags = SIGNAL_STOP_DEQUEUED;
   sig-flags |= SIGNAL_STOP_STOPPED;

 or anything like that.  It's probably cleanest to consolidate those two
 cases to call a single subroutine that does the tracehook_notify_jctl
 logic, unlock and do_notify_parent_cldstop.  It can take a caller flag
 or just check PF_EXITING to omit the -exit_code + -state change of the
 do_signal_stop version of the code.  That one subroutine can have a
 clear comment about the nonobvious flag usage.

Hmm... not sure, but I need to think more.

  But please remember, the patch above is not complete of course and currently
  I do not see the good solution.

 What's incomplete aside from handling the exit_signals case the same way?

Say, ptrace(DETACH/CONT, SIGSTOP). This should work, this means
SIGNAL_STOP_DEQUEUED should be set even before the tracee calls
do_signal_stop(). But otoh it doesn't look right to set this flag
each time the tracee sees a stop signal from the debugger (especially
on detach), -real_parent should not see multiple notifications.

  I am starting to think we should forget
  about these bugs, merge utrace-ptrace, and then try to fix them.

 If we can have utrace-ptrace code whose corner behavior matches the old
 code and is itself clean, then I don't care about the order of the
 changes going in.  But it's not really clear to me that we can even
 describe the old behavior in terms clean enough to make an exact
 work-alike implementation that could possibly be clean.

Yep.

The current utrace-ptrace behaviour is close enough to upstream behaviour.
I hope. I am going to restore the detach always wakeup logic to be
compatible. Hopefully in this case utrace-ptrace will be equally buggy.

I just can't concentrate on these problems until I finish utrace-ptrace.
Because every week I should explain 1-2 times why it is still not ready
for submission and when I am going to send it ;)

Oleg.



Re: Stopped detach/attach status

2009-10-08 Thread Oleg Nesterov
Incomplete reply, just can't read/think/concentrate today...

On 10/07, Roland McGrath wrote:

  We had a lengthy discussion about this.

 Yes.  I only ever wanted that revert then because it was too late in the
 2.6.30 cycle to hash this all out and get it really right.  I meant that
 we should leave wrong enough alone in 2.6.30 but get it all worked out
 more properly in 2.6.31, but I forgot to follow up on it.  If we can
 iron out the behavior now and the upstream version of implementing it is
 not big new hair, it might still be possible to get it fixed in 2.6.32.

 That piece of implementation is 100% wrong.  But we have to figure out
 what the manifest semantics are today from the userland perspective and
 decide what exactly we want them to be before we implement those precise
 semantics in some sensible way.

Yes. In particular, ptrace(PTRACE_DETACH, SIGKILL) should cancel
SIGNAL_STOP_STOPPED, yes?

  -   sig-flags = SIGNAL_STOP_STOPPED;
  +   sig-flags = SIGNAL_STOP_STOPPED | 
  SIGNAL_STOP_DEQUEUED;

 Boy, do I not understand why that does anything about this at all!
 But I am barely awake tonight.  Ok, I guess I do sort of if it goes
 along with some other patch to set SIGNAL_STOP_STOPPED.  But since
 you've verified you really understand what happens, you can tell us!

Two threads T1 and T2, both ptraced by P, both TASK_TRACED, T2 sleeps
in ptrace_signal().

P does:

ptrace(DETACH, T1, SIGSTOP);
ptrace(DETACH, T2, SIGSTOP);

The first DETACH wakes up T1, it dequeues SIGSTOP, calls do_signal_stop().
T2 is still TASK_TRACED, this means T1 completes the group-stop and sets
sig-flags = SIGNAL_STOP_STOPPED.

The second detach wakes up T2, it returns from ptrace_signal() and calls
do_signal_stop() which does nothing without SIGNAL_STOP_DEQUEUED.

But please remember, the patch above is not complete of course and currently
I do not see the good solution. I am starting to think we should forget
about these bugs, merge utrace-ptrace, and then try to fix them.

Even the first detach can fail to stop T1, because SIGNAL_STOP_DEQUEUED
can be cleared before.

I never knew what user-space actually does with ptrace, now I am really
surprized gdb/etc assume it can trust ptrace(SIGSTOP). Sometime it works,
but only by accident.

Oleg.



Re: Stopped detach/attach status

2009-10-07 Thread Oleg Nesterov
On 10/06, Jan Kratochvil wrote:

 On Mon, 05 Oct 2009 04:32:08 +0200, Oleg Nesterov wrote:
 [...]
  Firstly, I think we should un-revert 
  edaba2c5334492f82d39ec35637c6dea5176a977.
  This unconditional wakeup is hopelessly wrong imho, and it is removed
  from utrace-ptrace code. But this breaks another test-case,
  attach-wait-on-stopped. I still think this test-case is wrong.
  We had a lengthy discussion about this.
 
  Now, this patch
 
  --- TTT_32/kernel/signal.c~PT_STOP  2009-10-04 04:08:36.0 
  +0200
  +++ TTT_32/kernel/signal.c  2009-10-05 03:17:39.0 +0200
  @@ -1708,7 +1708,7 @@ static int do_signal_stop(int signr)
   */
  if (sig-group_stop_count) {
  if (!--sig-group_stop_count)
  -   sig-flags = SIGNAL_STOP_STOPPED;
  +   sig-flags = SIGNAL_STOP_STOPPED | 
  SIGNAL_STOP_DEQUEUED;
  current-exit_code = sig-group_exit_code;
  __set_current_state(TASK_STOPPED);
  }
 
  fixes the tests above. Of course this change is not enough, I did
  it just to verify I really understand what happens.
 
  Except, stopped-attach-transparency prints
 
  Excessive waiting SIGSTOP after the second attach/detach
 
  afaics the test-case is not right here. attach_detach() leaves the
  traced threads in STOPPED state, why pid_notifying_sigstop() should
  fail?

 Tried the patch above:
   http://koji.fedoraproject.org/scratch/jkratoch/task_1730038/

 but it does not break stopped-attach-transparency for me.

You mean, it doesn't fix stopped-attach-transparency?

Strange... I tested it, with this patch stopped-attach-transparency
succeeds on my machine.

But! please note this change is not enough above, the right fix
is not simple. However I thought it should work in this case...

 Should I also apply back edaba2c5334492f82d39ec35637c6dea5176a977?

This revert should be unreverted anyway, imho. This wakeup is wrong
in many ways, but note that the changelog says in particular:

this wake_up_process() can break another
assumption: PTRACE_DETACH with SIGSTOP should leave the tracee in
TASK_STOPPED case.  Because the untraced child can dequeue SIGSTOP and
call do_signal_stop() before ptrace_detach() calls wake_up_process().

However, this race is unlikely, I don't think it can explain why
stopped-attach-transparency fails with the patch above.

Oleg.



Re: Stopped detach/attach status

2009-10-07 Thread Roland McGrath
 I do not know. I'd leave this to Roland. I mean, if he thinks this
 should be fixed - I'll try to fix.
 
 But. This all looks unfixeable to me. In my opinion, the kernel is
 obviously wrong, and test-case are wrong too. And any fix in this
 area is user-visible and can break the current expectations.

My view is that the handling of a tracee that was already job-stopped
(left by ^Z, etc.) or of a tracee that you want to leave job-stopped for
its natural parent to see on detach, has never really worked.  In the
past the kernel and GDB (let alone strace) have both been wrong and even
internally inconsistent, and each changed incompatibly with itself (not
to mention with each other) over the years.  So for those cases as such,
I don't think any particular compatibility matters so much as just
settling on something acceptable now.  I expect that there is no one
behavior for this case that is actually compatible with the expectations
of a significant range of tools/versions such that nearly any change at
all wouldn't be a regression for some and an improvement for others.

However, we unfortunately cannot say the same for all quirks that we
would probably like to change.  That is, the treatment of SIGSTOP and
what wakeups get done and so forth is an intricate dependency of plain
attach/detach logic in all manner of debuggeresque tools for numerous
corner cases (many MT-related).  So I think the main constraint on what
we can come up with as newly almost sane semantics is that we really
should not regress for any other corner of attach/detach et al logic in
any current or past version of gdb, or strace, or anything else.

 Firstly, I think we should un-revert edaba2c5334492f82d39ec35637c6dea5176a977.

Yes.  That unconditional wakeup was one of the very first things years
ago that made me shake my head and marvel about what sucker they would
ever get to maintain what all the purported ABI semantics corners of
this ludicrous ptrace() implementation were supposed to be, glad that it
was not my problem to touch that wormy can of flaming obvious wrong with
a ten foot pole.  At the time it was probably already obvious to most
people other than me that I would be that very sucker for years to come.
Hmm, perhaps I have not set this up very well now for asking you to
think up all the details of matching the established semantics (that
really sounds better than purported, doesn't it?) of this insane
implementation. :-)

 We had a lengthy discussion about this.

Yes.  I only ever wanted that revert then because it was too late in the
2.6.30 cycle to hash this all out and get it really right.  I meant that
we should leave wrong enough alone in 2.6.30 but get it all worked out
more properly in 2.6.31, but I forgot to follow up on it.  If we can
iron out the behavior now and the upstream version of implementing it is
not big new hair, it might still be possible to get it fixed in 2.6.32.

That piece of implementation is 100% wrong.  But we have to figure out
what the manifest semantics are today from the userland perspective and
decide what exactly we want them to be before we implement those precise
semantics in some sensible way.  We may have to settle on something
that, while more consistent than today's kernel, is still somewhat wrong
in the abstract, when we need to preserve application compatibility.

   -   sig-flags = SIGNAL_STOP_STOPPED;
   +   sig-flags = SIGNAL_STOP_STOPPED | 
 SIGNAL_STOP_DEQUEUED;

Boy, do I not understand why that does anything about this at all!
But I am barely awake tonight.  Ok, I guess I do sort of if it goes
along with some other patch to set SIGNAL_STOP_STOPPED.  But since
you've verified you really understand what happens, you can tell us!

 But as I said, I do not really understand what this test-case tries
 to do. What ptrace(PTRACE_DETACH, SIGSTOP) should mean? I think that
 ptrace(PTRACE_DETACH, signr) should mean the tracee should proceed
 with this signal, as if it was sent by, say, kill.

Except for not being possibly intercepted again, roughly yes.  But note
that this really is only delivery, not generation (POSIX signal terms).
(In the kludge cases for blocked signals and non-signal stops, sometimes
it really is generation like kill, but here I mean resuming from real
signal stops.)  That means that magic generation-time effects, such as
SIGSTOP clearing pending SIGCONT and vice versa, do not happen (if
passing on the signal, they happened before at original generation).

Internally, dequeue-time effects, i.e. SIGNAL_STOP_DEQUEUED and timer
rearming, also don't happen (because they happened before).  But since
SIGNAL_STOP_DEQUEUED is pure internal bookkeeping, that is something we
could change as an implementation detail for the semantics we want.

 In this case, I don't understand why stopped-attach-transparency
 sends SIGSTOP to every sub-thread. If the tracer wants to stop
 the thread group after detach, it can do
 
   

Re: Stopped detach/attach status

2009-10-05 Thread Jan Kratochvil
On Mon, 05 Oct 2009 04:32:08 +0200, Oleg Nesterov wrote:
 On 10/01, Jan Kratochvil wrote:
 
  the ptrace-testsuite
  http://sourceware.org/systemtap/wiki/utrace/tests
 
  currently FAILs (also) on Fedora 12 kernel-2.6.31.1-48.fc12.x86_64 for:
  FAIL: detach-stopped
  FAIL: stopped-attach-transparency
[...]
 As for user-space, I don't really understand the second test-case,
 this again means I don't understand the supposed behaviour.

The high level goal is described at its top.  Users expect that if they run
`gstack PID' or `gcore PID' the target PID will be absolutely in the same
state as before gstack/gcore.

That means it will keep both whether it was / was not stopped and also any
possible existing / non-existing pending signal for a possible future
waitpid() from its real (non-ptrace) parent PID.

Another question whether technically what it does is right but this high level
goal is hopefully valid.


 Except, stopped-attach-transparency prints
 
   Excessive waiting SIGSTOP after the second attach/detach
 
 afaics the test-case is not right here. attach_detach() leaves the
 traced threads in STOPPED state, why pid_notifying_sigstop() should
 fail?

[ Not replying this part, have not built a kernel with this patch now.  ]


 In this case, I don't understand why stopped-attach-transparency
 sends SIGSTOP to every sub-thread. If the tracer wants to stop
 the thread group after detach, it can do
 
   ptrace(PTRACE_DETACH, anythread, SIGSTOP);
   for_each_other_thread(pid)
   ptrace(PTRACE_DETACH, anythread, 0);
 
 or just
 
   kill(SIGSTOP);
   for_each_thread(pid)
   ptrace(PTRACE_DETACH, anythread, 0);

OK, it this is the recommended way I can fix the testcase this way.
The all-threads-being-sent-SIGSTOP way IIRC worked on linux-2.6.9 but I do not
think this part of the compatibility must be kept.



Thanks,
Jan



Re: Stopped detach/attach status

2009-10-04 Thread Oleg Nesterov
(add Roland)

On 10/01, Jan Kratochvil wrote:

 the ptrace-testsuite
   http://sourceware.org/systemtap/wiki/utrace/tests

 currently FAILs (also) on Fedora 12 kernel-2.6.31.1-48.fc12.x86_64 for:
   FAIL: detach-stopped
   FAIL: stopped-attach-transparency

 Do you agree with the testcases and is it planned to fix them for F12?

I do not know. I'd leave this to Roland. I mean, if he thinks this
should be fixed - I'll try to fix.

But. This all looks unfixeable to me. In my opinion, the kernel is
obviously wrong, and test-case are wrong too. And any fix in this
area is user-visible and can break the current expectations.

As for kernel, I lost any hope to understand what is the _supposed_
behaviour.

As for user-space, I don't really understand the second test-case,
this again means I don't understand the supposed behaviour.


Firstly, I think we should un-revert edaba2c5334492f82d39ec35637c6dea5176a977.
This unconditional wakeup is hopelessly wrong imho, and it is removed
from utrace-ptrace code. But this breaks another test-case,
attach-wait-on-stopped. I still think this test-case is wrong.
We had a lengthy discussion about this.

Now, this patch

--- TTT_32/kernel/signal.c~PT_STOP  2009-10-04 04:08:36.0 
+0200
+++ TTT_32/kernel/signal.c  2009-10-05 03:17:39.0 +0200
@@ -1708,7 +1708,7 @@ static int do_signal_stop(int signr)
 */
if (sig-group_stop_count) {
if (!--sig-group_stop_count)
-   sig-flags = SIGNAL_STOP_STOPPED;
+   sig-flags = SIGNAL_STOP_STOPPED | 
SIGNAL_STOP_DEQUEUED;
current-exit_code = sig-group_exit_code;
__set_current_state(TASK_STOPPED);
}

fixes the tests above. Of course this change is not enough, I did
it just to verify I really understand what happens.

Except, stopped-attach-transparency prints

Excessive waiting SIGSTOP after the second attach/detach

afaics the test-case is not right here. attach_detach() leaves the
traced threads in STOPPED state, why pid_notifying_sigstop() should
fail?


But as I said, I do not really understand what this test-case tries
to do. What ptrace(PTRACE_DETACH, SIGSTOP) should mean? I think that
ptrace(PTRACE_DETACH, signr) should mean the tracee should proceed
with this signal, as if it was sent by, say, kill.

In this case, I don't understand why stopped-attach-transparency
sends SIGSTOP to every sub-thread. If the tracer wants to stop
the thread group after detach, it can do

ptrace(PTRACE_DETACH, anythread, SIGSTOP);
for_each_other_thread(pid)
ptrace(PTRACE_DETACH, anythread, 0);

or just

kill(SIGSTOP);
for_each_thread(pid)
ptrace(PTRACE_DETACH, anythread, 0);

I do not say this will really work with the current implementaion,
we have other bugs/races. I mean I'd expect this should be the right
way to do detach+stop.


And. Currently PTRACE_CONT/PTRACE_DETACH/etc wakes up the tracee even
if the thread group is stopped. This is obviously not right, but
utrace-ptrace does the same. I guess we can't fix this without breaking
existing applications.


In short: I don't know what to do ;)

Oleg.



Stopped detach/attach status

2009-10-01 Thread Jan Kratochvil
Hi Oleg,

the ptrace-testsuite
http://sourceware.org/systemtap/wiki/utrace/tests

currently FAILs (also) on Fedora 12 kernel-2.6.31.1-48.fc12.x86_64 for:
FAIL: detach-stopped
FAIL: stopped-attach-transparency

Do you agree with the testcases and is it planned to fix them for F12?


Thanks,
Jan



Re: Stopped detach/attach status

2009-10-01 Thread Oleg Nesterov
Hi Jan,

On 10/01, Jan Kratochvil wrote:

 the ptrace-testsuite
   http://sourceware.org/systemtap/wiki/utrace/tests

 currently FAILs (also) on Fedora 12 kernel-2.6.31.1-48.fc12.x86_64 for:
   FAIL: detach-stopped

Please recall our previous discussion about 
95a3540da9c81a5987be810e1d9a83640a366bd5
which was then reverted ;)

I do not know what I can do.

In short, I think this wakeup is absolutely wrong (not only because
of this test-case), but if we kill it we break another test-case:

attach-wait-on-stopped

I believe this test-case is wrong. I can send you mbox with the previous
discussion.

   FAIL: stopped-attach-transparency

I'll try to take a look on Monday, thanks.

Oleg.