resuming after stop at syscall_entry

Roland McGrath Fri, 17 Apr 2009 21:27:29 -0700

The way UTRACE_STOP works for event callbacks is that it does not always
cause an immediate stop after the reporting pass where someone used
UTRACE_STOP.  The various events all happen such that nothing else with
user-visible side effects is going to happen before getting to user mode
(which is to say, the signals-check path before user mode).  So when you
ask for UTRACE_STOP, you actually might still get more callbacks up through
report_quiesce(0) or report_signal(UTRACE_SIGNAL_REPORT), and only actually
stop if you still want to in that callback (or if some engine that used
UTRACE_STOP isn't taking that callback at all).


The logic of this is that utrace is about user state, and nothing user-mode
can see happens in between, so there is no need to stop in between.
Moreover, in cases like CLONE, stopping right there means it's before the
clone/fork syscall's return value has reached the task_pt_regs(), i.e. you
do not see the thread's "after the event" user state.  At the syscall-exit
tracing point, and at the final report, you can see the proper register
values.  So this is all well and good for all such events.

But there are some events where something else is going to happen, so we do
stop before getting to the return-to-user path.  Those are a CLONE in case
of CLONE_VFORK, EXIT, and SYSCALL_ENTRY.

After the CLONE event for a vfork, the thread will go into "vfork wait", an
uninterruptible block until its new child dies or execs.  It won't "do"
anything else before getting to user mode.  But in "vfork wait", it won't
be considered safe to examine (so you can use ptrace, e.g.).  So in this
case, UTRACE_STOP does cause a proper stop that is right after the CLONE
event callback returned.  When you resume from that stop, you go into
"vfork wait" (or don't, if the child is already gone by then).  When the
child goes away and that wait is done, you can hit the SYSCALL_EXIT event
if you care, and final or signal reports, and as normal nothing else user
visible is happening before the final report.

In EXIT, obviously you are never going back to user mode again.  But you
can stop here to keep the thread's state alive, which lets you pause and
examine its fds and so forth before you let it die.  Your notification
after all UTRACE_STOPs are lifted is the DEATH event.

(I've just glazed over re-reading my own documentation, and I can't tell if
any of this is adequately explained anywhere at all.  If it's not, please
suggest where in kerneldoc comments or utrace.tmpl would be a good place to
add a paragraph about it.)

So those two are fine.  They stop immediately after the callback loop if
someone wants UTRACE_STOP, and there are other events to catch after
they resume if you need to notice when other engines have resumed.

SYSCALL_ENTRY is unlike all other events.  Right after this callback
loop is when the important user-visible stuff happens (the system call).
So we stop immediately there as for the other two.  But, if another
engine used UTRACE_STOP and maybe did something asynchronously, like
modifying the syscall argument registers, you get no opportunity to see
what happened.  Once all engines lift UTRACE_STOP, the system call runs.

Enter Renzo Davoli.  He uses a syscall_entry callback whose purpose is
to change the syscall arguments.  But, it doesn't do it immediately in
the callback.  It wants to wake up its user-mode component to decide
what to do about this particular call.  So it uses UTRACE_STOP, that's
why we have it.  The other components cooperating with the engine
contemplate the situation and kibitz, then the engine modifies the
registers and calls utrace_control(UTRACE_RESUME).  The system call goes
ahead.

Enter Renzo Davoli.  He uses a syscall_entry callback whose purpose is
tracing.  This engine was attached first, so its normal event callbacks
come before the other Renzo's callbacks.  And now, the other Renzo's
syscall_entry callback comes before this Renzo's, since the order is
reversed from normal.  So, the other Renzo's callback returns
UTRACE_STOP.  Then this Renzo's callback gets its SYSCALL_ENTRY.  The
registers it sees have not been changed yet, so they are not reflective
of the actual call that will be made.  This callback can look at
utrace_resume_action(action) and see UTRACE_STOP, so it knows some
previous engine wanted a stop and might change the state before resume.
But there is nothing it can do.  (So this Renzo packs up Renzo Jr. and
flies to San Francisco for a weekend on the town. ;-)

So, we have the "nested Renzos" problem.  Renzo proposed the solution of
having UTRACE_STOP cause an immediate stop after just your own callback.
Then you'd have to resume before the next engine gets any callback at
all, so when it gets one, it can see whatever you've done.

This approach flies in the face of one of the key ideas of well-behaved
engine coexistence in utrace.  Your engine does not get to delay prompt
notification of my engine about the state of the user thread.  You get
to delay the user-visible behavior of the thread itself arbitrarily with
UTRACE_STOP.  But that doesn't keep another engine from keeping track of
things.  This idea is of a piece with the notion that your callback
should not go and block forever.  You are all parasites on the same user
thread, and while it's OK to paralyze the host and dig in, to be a good
citizen of verminkind, you can't monopolize the scene at the expense of
all your fellow parasites.

So I have an alternate proposal.  This approach keeps the norms of
engine interaction and noninterference of callbacks while covering
this special case of syscall entry stops.

As explained above, the norm of interacting with other engines and their
use of UTRACE_STOP is to use the final report.  When your callback's
action argument includes UTRACE_STOP, you know an earlier engine might
be fiddling before the thread resumes.  So, your callback can decide to
return UTRACE_REPORT.  That ensures that some report_quiesce (or
report_signal/UTRACE_SIGNAL_REPORT) callback will be made after the
other engine lifts its UTRACE_STOP and before user mode.  At that point,
you can see what user register values it might have installed, etc.  In
all events but syscall entry, a final report_quiesce(0) serves this need.

My proposal is to extend this "resume report" approach to the syscall
entry case.  That is, after when some report_syscall_entry returned
UTRACE_STOP so we've stopped, allow for a second reporting pass after
we've been resumed, before running the system call.  You'd get this pass
if someone used UTRACE_REPORT.  That is, in the first callback loop, one
engine used UTRACE_STOP and another used UTRACE_REPORT.  Then when the
first engine used utrace_control() to resume, there would be a second
reporting pass because of the second engine's earlier request.  Or, even
if there was just one engine, but it used UTRACE_STOP and then used
utrace_control(UTRACE_REPORT) to resume, then it would get the second
reporting pass.  If someone uses UTRACE_STOP+UTRACE_REPORT in that pass,
there would be a third pass, etc.

What I have in mind is that the second (and however many more) pass
would just be another report_syscall_entry callback to everyone with
UTRACE_EVENT(SYSCALL_ENTRY) set.  A flag bit in the action argument says
this is a repeat notification.

I think this strikes a decent balance of not adding more callbacks and
more arguments to bloat the API in general, while imposing a fairly
simple burden on engines to avoid getting confused by multiple calls.

A tracing-only engine that just wants to see the syscall that is going
to be done can just do:

        if (utrace_resume_action(action) == UTRACE_STOP)
                return UTRACE_REPORT;

at the top of report_syscall_entry, so it just doesn't think about it
until it thinks the call will go now through.  

Say an engine has a different agenda, just to see what syscall argument
values came in from user mode before someone else changes them.  It does:

        if (action & UTRACE_SYSCALL_RESUMED)
                return UTRACE_RESUME;

to ignore the additional callbacks that might come after somebody
decided to stop and report.  It just does its work on the first one.

Here comes Renzo again!  He wants to have two or three or nineteen
layers of the first kind of Renzo engine: each one stops at syscall
entry, then resumes after changing some registers.  He wants these to
"nest", meaning that after the "outermost" one stops, fiddles, and
resumes, the "next one in" stops, looks at the register as fiddled by
the outermost guy, fiddles in a different way, and resumes, and on and
on.  Perhaps the first model (if last guy is stopping, punt to look
again at resume report) works for that.  Or perhaps the engine also
needs to keep track with its own state flag it sets whenever it does its
work, and then resets in exit tracing to prepare for next time.

Renzo points out that this is a lot of iterations over the engine list.
Each callback loop starts from the beginning again and calls each engine
to decide this is a useless repetition for it (or a useless report now
because it won't care until the later repetition).  He suggests maybe
the resume report callbacks could pick up in the middle of the list with
the first guy to ask for UTRACE_STOP.  (The ones earlier in the list
already did what they do and said to let the system call go ahead, so
they should not be interested in the second callback now just because of
what a later engine did.)  

I take these points to heart.  But I also know about the intricacies of
the list management and asynchronous detach while stopped.  They quickly
make what might sound simple in the abstract not seem so simple in the
implementation.  I also wonder how much this simple-seeming idea might
actually complicate the corners of the API (implementation aside) even
beyond these intricacies I'm describing here.  I know we have more
concrete hashing out of engine interaction to do over time as we have
more different engines doing interesting things.  It looks easy to see
what makes sense for stacking several of Renzo's engines that all do the
same thing in utrace terms, exciting semantics aside.  I think we'll
find it more complex with a mix of other engines as they come about.  I
also know that this level of optimization can wait until we have a
realistic plan of seeing more than about three engines on a task.

So, even I can't write that much text and still think this interface
choice is simple to understand.  But I kind of think it's around as
simple as it can be for its mandates.  I'd appreciate any feedback.

The implementation of this would be about like the patch below.


Thanks,
Roland

======

--- a/kernel/utrace.c
+++ b/kernel/utrace.c
@@ -1273,6 +1273,8 @@ struct utrace_report {
 #define INIT_REPORT(var) \
        struct utrace_report var = { UTRACE_RESUME, 0, \
                                     false, false, false, false }
+#define RESET_REPORT(var) \
+       ((var).detaches = (var).reports = (var).takers = (var).killed = false)
 
 /*
  * We are now making the report, so clear the flag saying we need one.
@@ -1499,22 +1501,41 @@ bool utrace_report_syscall_entry(struct 
        struct task_struct *task = current;
        struct utrace *utrace = task_utrace_struct(task);
        INIT_REPORT(report);
+       u32 resume_report = 0;
 
+report:
        start_report(utrace);
        REPORT_CALLBACKS(_reverse, task, utrace, &report,
                         UTRACE_EVENT(SYSCALL_ENTRY), report_syscall_entry,
-                        report.result | report.action, engine, current, regs);
+                        resume_report | report.result | report.action,
+                        engine, current, regs);
        finish_report(&report, task, utrace);
 
-       if (report.action == UTRACE_STOP &&
-           unlikely(utrace_stop(task, utrace, false)))
+       if (report.action == UTRACE_STOP) {
+               if (utrace_stop(task, utrace, report.reports))
+                       /*
+                        * We are continuing despite UTRACE_STOP because of a
+                        * SIGKILL.  Don't let the system call actually proceed.
+                        */
+                       return true;
+
                /*
-                * We are continuing despite UTRACE_STOP because of a
-                * SIGKILL.  Don't let the system call actually proceed.
+                * If we've been asked for another report after our stop,
+                * go back to report (and maybe stop) again before running
+                * the system call.  The second (and later) reports are
+                * marked with the UTRACE_SYSCALL_RESUMED flag so that
+                * engines know this is a second report at the same entry.
+                * This gives them the chance to examine the registers anew
+                * after they might have been changed while we were stopped.
                 */
-               return true;
+               if (utrace->report) {
+                       RESET_REPORT(report);
+                       resume_report = UTRACE_SYSCALL_RESUMED;
+                       goto report;
+               }
+       }
 
-       return report.result == UTRACE_SYSCALL_ABORT;
+       return utrace_syscall_action(report.result) == UTRACE_SYSCALL_ABORT;
 }
 
 /*

resuming after stop at syscall_entry

Reply via email to