The way UTRACE_STOP works for event callbacks is that it does not always cause an immediate stop after the reporting pass where someone used UTRACE_STOP. The various events all happen such that nothing else with user-visible side effects is going to happen before getting to user mode (which is to say, the signals-check path before user mode). So when you ask for UTRACE_STOP, you actually might still get more callbacks up through report_quiesce(0) or report_signal(UTRACE_SIGNAL_REPORT), and only actually stop if you still want to in that callback (or if some engine that used UTRACE_STOP isn't taking that callback at all).
The logic of this is that utrace is about user state, and nothing user-mode can see happens in between, so there is no need to stop in between. Moreover, in cases like CLONE, stopping right there means it's before the clone/fork syscall's return value has reached the task_pt_regs(), i.e. you do not see the thread's "after the event" user state. At the syscall-exit tracing point, and at the final report, you can see the proper register values. So this is all well and good for all such events. But there are some events where something else is going to happen, so we do stop before getting to the return-to-user path. Those are a CLONE in case of CLONE_VFORK, EXIT, and SYSCALL_ENTRY. After the CLONE event for a vfork, the thread will go into "vfork wait", an uninterruptible block until its new child dies or execs. It won't "do" anything else before getting to user mode. But in "vfork wait", it won't be considered safe to examine (so you can use ptrace, e.g.). So in this case, UTRACE_STOP does cause a proper stop that is right after the CLONE event callback returned. When you resume from that stop, you go into "vfork wait" (or don't, if the child is already gone by then). When the child goes away and that wait is done, you can hit the SYSCALL_EXIT event if you care, and final or signal reports, and as normal nothing else user visible is happening before the final report. In EXIT, obviously you are never going back to user mode again. But you can stop here to keep the thread's state alive, which lets you pause and examine its fds and so forth before you let it die. Your notification after all UTRACE_STOPs are lifted is the DEATH event. (I've just glazed over re-reading my own documentation, and I can't tell if any of this is adequately explained anywhere at all. If it's not, please suggest where in kerneldoc comments or utrace.tmpl would be a good place to add a paragraph about it.) So those two are fine. They stop immediately after the callback loop if someone wants UTRACE_STOP, and there are other events to catch after they resume if you need to notice when other engines have resumed. SYSCALL_ENTRY is unlike all other events. Right after this callback loop is when the important user-visible stuff happens (the system call). So we stop immediately there as for the other two. But, if another engine used UTRACE_STOP and maybe did something asynchronously, like modifying the syscall argument registers, you get no opportunity to see what happened. Once all engines lift UTRACE_STOP, the system call runs. Enter Renzo Davoli. He uses a syscall_entry callback whose purpose is to change the syscall arguments. But, it doesn't do it immediately in the callback. It wants to wake up its user-mode component to decide what to do about this particular call. So it uses UTRACE_STOP, that's why we have it. The other components cooperating with the engine contemplate the situation and kibitz, then the engine modifies the registers and calls utrace_control(UTRACE_RESUME). The system call goes ahead. Enter Renzo Davoli. He uses a syscall_entry callback whose purpose is tracing. This engine was attached first, so its normal event callbacks come before the other Renzo's callbacks. And now, the other Renzo's syscall_entry callback comes before this Renzo's, since the order is reversed from normal. So, the other Renzo's callback returns UTRACE_STOP. Then this Renzo's callback gets its SYSCALL_ENTRY. The registers it sees have not been changed yet, so they are not reflective of the actual call that will be made. This callback can look at utrace_resume_action(action) and see UTRACE_STOP, so it knows some previous engine wanted a stop and might change the state before resume. But there is nothing it can do. (So this Renzo packs up Renzo Jr. and flies to San Francisco for a weekend on the town. ;-) So, we have the "nested Renzos" problem. Renzo proposed the solution of having UTRACE_STOP cause an immediate stop after just your own callback. Then you'd have to resume before the next engine gets any callback at all, so when it gets one, it can see whatever you've done. This approach flies in the face of one of the key ideas of well-behaved engine coexistence in utrace. Your engine does not get to delay prompt notification of my engine about the state of the user thread. You get to delay the user-visible behavior of the thread itself arbitrarily with UTRACE_STOP. But that doesn't keep another engine from keeping track of things. This idea is of a piece with the notion that your callback should not go and block forever. You are all parasites on the same user thread, and while it's OK to paralyze the host and dig in, to be a good citizen of verminkind, you can't monopolize the scene at the expense of all your fellow parasites. So I have an alternate proposal. This approach keeps the norms of engine interaction and noninterference of callbacks while covering this special case of syscall entry stops. As explained above, the norm of interacting with other engines and their use of UTRACE_STOP is to use the final report. When your callback's action argument includes UTRACE_STOP, you know an earlier engine might be fiddling before the thread resumes. So, your callback can decide to return UTRACE_REPORT. That ensures that some report_quiesce (or report_signal/UTRACE_SIGNAL_REPORT) callback will be made after the other engine lifts its UTRACE_STOP and before user mode. At that point, you can see what user register values it might have installed, etc. In all events but syscall entry, a final report_quiesce(0) serves this need. My proposal is to extend this "resume report" approach to the syscall entry case. That is, after when some report_syscall_entry returned UTRACE_STOP so we've stopped, allow for a second reporting pass after we've been resumed, before running the system call. You'd get this pass if someone used UTRACE_REPORT. That is, in the first callback loop, one engine used UTRACE_STOP and another used UTRACE_REPORT. Then when the first engine used utrace_control() to resume, there would be a second reporting pass because of the second engine's earlier request. Or, even if there was just one engine, but it used UTRACE_STOP and then used utrace_control(UTRACE_REPORT) to resume, then it would get the second reporting pass. If someone uses UTRACE_STOP+UTRACE_REPORT in that pass, there would be a third pass, etc. What I have in mind is that the second (and however many more) pass would just be another report_syscall_entry callback to everyone with UTRACE_EVENT(SYSCALL_ENTRY) set. A flag bit in the action argument says this is a repeat notification. I think this strikes a decent balance of not adding more callbacks and more arguments to bloat the API in general, while imposing a fairly simple burden on engines to avoid getting confused by multiple calls. A tracing-only engine that just wants to see the syscall that is going to be done can just do: if (utrace_resume_action(action) == UTRACE_STOP) return UTRACE_REPORT; at the top of report_syscall_entry, so it just doesn't think about it until it thinks the call will go now through. Say an engine has a different agenda, just to see what syscall argument values came in from user mode before someone else changes them. It does: if (action & UTRACE_SYSCALL_RESUMED) return UTRACE_RESUME; to ignore the additional callbacks that might come after somebody decided to stop and report. It just does its work on the first one. Here comes Renzo again! He wants to have two or three or nineteen layers of the first kind of Renzo engine: each one stops at syscall entry, then resumes after changing some registers. He wants these to "nest", meaning that after the "outermost" one stops, fiddles, and resumes, the "next one in" stops, looks at the register as fiddled by the outermost guy, fiddles in a different way, and resumes, and on and on. Perhaps the first model (if last guy is stopping, punt to look again at resume report) works for that. Or perhaps the engine also needs to keep track with its own state flag it sets whenever it does its work, and then resets in exit tracing to prepare for next time. Renzo points out that this is a lot of iterations over the engine list. Each callback loop starts from the beginning again and calls each engine to decide this is a useless repetition for it (or a useless report now because it won't care until the later repetition). He suggests maybe the resume report callbacks could pick up in the middle of the list with the first guy to ask for UTRACE_STOP. (The ones earlier in the list already did what they do and said to let the system call go ahead, so they should not be interested in the second callback now just because of what a later engine did.) I take these points to heart. But I also know about the intricacies of the list management and asynchronous detach while stopped. They quickly make what might sound simple in the abstract not seem so simple in the implementation. I also wonder how much this simple-seeming idea might actually complicate the corners of the API (implementation aside) even beyond these intricacies I'm describing here. I know we have more concrete hashing out of engine interaction to do over time as we have more different engines doing interesting things. It looks easy to see what makes sense for stacking several of Renzo's engines that all do the same thing in utrace terms, exciting semantics aside. I think we'll find it more complex with a mix of other engines as they come about. I also know that this level of optimization can wait until we have a realistic plan of seeing more than about three engines on a task. So, even I can't write that much text and still think this interface choice is simple to understand. But I kind of think it's around as simple as it can be for its mandates. I'd appreciate any feedback. The implementation of this would be about like the patch below. Thanks, Roland ====== --- a/kernel/utrace.c +++ b/kernel/utrace.c @@ -1273,6 +1273,8 @@ struct utrace_report { #define INIT_REPORT(var) \ struct utrace_report var = { UTRACE_RESUME, 0, \ false, false, false, false } +#define RESET_REPORT(var) \ + ((var).detaches = (var).reports = (var).takers = (var).killed = false) /* * We are now making the report, so clear the flag saying we need one. @@ -1499,22 +1501,41 @@ bool utrace_report_syscall_entry(struct struct task_struct *task = current; struct utrace *utrace = task_utrace_struct(task); INIT_REPORT(report); + u32 resume_report = 0; +report: start_report(utrace); REPORT_CALLBACKS(_reverse, task, utrace, &report, UTRACE_EVENT(SYSCALL_ENTRY), report_syscall_entry, - report.result | report.action, engine, current, regs); + resume_report | report.result | report.action, + engine, current, regs); finish_report(&report, task, utrace); - if (report.action == UTRACE_STOP && - unlikely(utrace_stop(task, utrace, false))) + if (report.action == UTRACE_STOP) { + if (utrace_stop(task, utrace, report.reports)) + /* + * We are continuing despite UTRACE_STOP because of a + * SIGKILL. Don't let the system call actually proceed. + */ + return true; + /* - * We are continuing despite UTRACE_STOP because of a - * SIGKILL. Don't let the system call actually proceed. + * If we've been asked for another report after our stop, + * go back to report (and maybe stop) again before running + * the system call. The second (and later) reports are + * marked with the UTRACE_SYSCALL_RESUMED flag so that + * engines know this is a second report at the same entry. + * This gives them the chance to examine the registers anew + * after they might have been changed while we were stopped. */ - return true; + if (utrace->report) { + RESET_REPORT(report); + resume_report = UTRACE_SYSCALL_RESUMED; + goto report; + } + } - return report.result == UTRACE_SYSCALL_ABORT; + return utrace_syscall_action(report.result) == UTRACE_SYSCALL_ABORT; } /*