Sorry I'm so long in following up on this. I know you've hacked a lot more since.
> The vanilla kernel relies on syscall_trace_leave() which does > send_sigtrap(TRAP_BRKPT). But with utrace-ptrace the tracee sleeps > in do_notify_resume() path, syscall_trace_leave() won't be called. It's even weirder than this, in nonobvious ways. I'm not even sure I'm saying anything contrary to what you've coded for, but I want to be clear and explicit about everything. I'm concerned we might accidentally code to quirks of the x86 implementation details where we should instead make the arch tracehook calls' interface contracts more clear and make sure everything we do in generic utrace and ptrace layers jibes in theory and not just in observed x86 behavior. What happens today in the vanilla x86 kernel is this: 1. user_enable_single_step -> TIF_SINGLESTEP done to user state before a syscall insn makes sure (if nothing else) we'll hit syscall_trace_leave after sys_foo returns. 1b. If done to a thread state of "in the syscall" as in syscall-entry, clone, or exec report, that too makes sure we get there even if we hadn't been single-stepping into the syscall insn. 2. syscall_trace_leave -> tracehook_report_syscall_exit ->... ptrace_stop 3a. user_enable_single_step -> TIF_SINGLESTEP (was already and still is set) 3b. user_disable_single_step (PTRACE_CONT et al) -> clear TIF_SINGLESTEP 4. tracehook_report_syscall_exit returns to syscall_trace_leave -> 5a. (if 3a) test_thread_flag(TIF_SINGLESTEP)=>1 -> send_sigtrap 5b. (if 3b) test_thread_flag(TIF_SINGLESTEP)=>0 -> no signal, no stop 6. (if 5a) do_notify_resume -> dequeue SIGTRAP, ptrace_signal->ptrace_stop I found it nonobvious that what we really do is ptrace_stop inside tracehook_report_syscall_exit and then upon resumption look afresh at TIF_SINGLESTEP to determine whether we really send a SIGTRAP. Note that #6 could always be preceded e.g. by a dequeue of SIGHUP, ptrace_stop for that, handler setup for that with its own ptrace_notify() stop--not really a signal and so not touching the queued SIGTRAP--inside tracehook_signal_handler (if used PTRACE_SINGLESTEP,SIGHUP to enter the handler), etc., so the "stepped" state finally observed at the signal stop for that SIGTRAP could be wildly different from the post-syscall state. In comparison, on powerpc it is the same up to #2 and then do_syscall_trace_leave does: step = test_thread_flag(TIF_SINGLESTEP); if (step || test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall_exit(regs, step); And that's it. (x86 in #2 always calls it with step=0.) Vanilla tracehook_report_syscall_exit in fact ignores the step flag, though I thought at some point before it had a "ptrace_notify(SIGTRAP)" there. So on powerpc, PTRACE_SINGLESTEP done from the syscall-exit stop will not give an immediate (fake) single-step trap at the immediate post-syscall insn, but just single-step the next one insn (I think). I'm sure nobody intended this to be an arch difference, and I'm not sure how much we are really stuck with either choice in particular for application compatibility on one arch or the other. There is no particular reason for one arch to queue a SIGTRAP and another to use tracehook_report_syscall_exit for single-step cases, just the taste of each arch maintainer. IIRC there were even different opinions about whether PTRACE_SINGLESTEP producing a SYSGOOD-style SIGTRAP|0x80 when you happened to step over a syscall insn was a helpful feature giving one more bit of information or an unhelpful bug making PTRACE_SINGLESTEP behave inconsistently with one particular sort of user insn vs all others. I didn't want to keep arguing it out at the time. I included the step flag when formulating tracehook_report_syscall_exit with the hope that arch maintainers could be easily convinced to pass in that flag so at least future generic code (utrace, whatever) could have enough information for whatever behavior we wanted without more arch work. That much seems almost to have panned out (some arch's do pass in that flag). (Probably the biggest reason that x86 does what it does and others don't is just that it has the send_sigtrap() helper so it's a clean-looking one-line call.) As to the ptrace behavior, I don't think it much matters one way or the other about the two stops. If you just used PTRACE_SYSCALL (twice) and so are now at the syscall-exit stop, and then use PTRACE_SINGLESTEP, you really don't need an extra stop with exactly the same register state. The only thing that can happen is a signal, and ptrace is always going to give you an intervening stop for that anyway. OTOH, I never thought it was right to give a SYSGOOD status code in response to PTRACE_SINGLESTEP that happens to be over a syscall insn, so neither the x86 status quo nor the powerpc status quo really floats my boat. The tracehook and utrace layers have yet another set of considerations that never came up with ptrace. You can't use both PTRACE_SINGLESTEP and PTRACE_SYSCALL and simultaneously. But in utrace you can single-step and get syscall-exit callbacks, either both by one engine or some engines doing one and some doing the other. I don't have any conclusions or prescriptions about any of this right at the moment. It just seems important to cite that all this is kicking around to begin with before I threw in the new wrinkle of utrace not wanting to stop inside tracehook_report_syscall_exit but instead only later after its caller will have returned. > - perhaps it makes sense to change force_sig_info() to use > lock_task_sighand(), then we can avoid taking taskist around > send_sigtrap(). This feels like an indicator that it would be better to call send_sigtrap in the tracee instead. > > - I guess, we should also send SIGTRAP when SINGLESTEP resumes > > the tracee from SYSCALL_EXIT stop. This is trivial, but I need > > to write the test-case first. > > Yes, with this change the kernel passes all tests plus the test-case > below. I wrote it because I lost the hope to understand the required > behaviour ;) We should get a case for this into ptrace-tests and then look into (with Jan) what we want to do about consistency across arch's. > Note also the couple of "assert(regs.rax == -ENOSYS)" which were commented > out, here we differ from vanilla kernel and this can't be "fixed". But, > as you pointed out previously, we do not need to be bug-compatible here. Right. The test can assert(regs.rax == -ENOSYS || regs.rax == child_pid). We want those sites clearly commented to go along with the explanation to upstream about the intentional semantics change. Thanks, Roland