> I think Roland is right. Let's forget about utrace for the moment,
> this code looks like
> 
>       if (!CONDITION) {
>               set_task_state(TASK_XXX);
>               schedule();
>       }
> 
> and it can obviously race with
> 
>       CONDITION = true;
>       wake_up_xxx();

This is Linux in-kernel synchronization 101.
The correct blocks of this sort do:

        set_current_state(TASK_UNINTERRUPTIBLE);
        while (!CONDITION) {
                schedule();
        }
        set_current_state(TASK_RUNNING);

This is entirely orthogonal to anything having to do with utrace per se.
This pattern is race-free because wake_up_* will reset the waiter to
TASK_RUNNING after making CONDITION true.  So in the race where the
"while (!CONDITION)" clause lets it get into schedule(), that just
returns immediately, after which the second iteration sees CONDITION true.

This is why I recommended using higher-level things like <linux/wait.h>
facilities, where the correct usage patterns are more obvious.
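For reference, the open-coded loop above collapses to a single call with
those facilities; wait_event() expands to essentially the same
set_current_state/schedule loop:

        static DECLARE_WAIT_QUEUE_HEAD(wq);

        /* Waiter side: sleeps until CONDITION is true. */
        wait_event(wq, CONDITION);

        /* Waker side: make the condition true, then wake the waiters. */
        CONDITION = true;
        wake_up(&wq);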

But the race we're actually talking about here is on the other side of
the coin.

> I am wondering if there is another way to handle such a race...
> Suppose we
> 
>       - add UTRACE_STOP_XXX
> 
>       - modify utrace_control(UTRACE_STOP_XXX) to do
>         mark_engine_wants_stop() and nothing more.

utrace_control(current, ..., UTRACE_STOP) doesn't really do anything more.

So a callback doing this closes the old race: UTRACE_STOP is in force
when the tracer side wakes up, since utrace_control-on-current was done
before wake_up_tracer().  But...

>       - finish_callback_report(UTRACE_STOP_XXX) does
> 
>               spin_lock(&utrace->lock);
>               if (engine_wants_stop(engine))
>                       action = UTRACE_STOP;
>               else
>               action = UTRACE_RESUME;
>               spin_unlock(&utrace->lock);

Let's clarify what this means for the people following the utrace API
but not its implementation internals: currently, the callback's return
value overrides the engine's last-set state such as UTRACE_STOP, whether
that came from a previous utrace_control before the callback, a
utrace_control call inside the callback, or from the return value of a
previous callback.

So now we've moved to a different race (or a different instance of the
original race, if you like): after wake_up_tracer(), the tracer can come
and clear our engine's UTRACE_STOP state with its utrace_control call;
but then our callback return value just reestablishes UTRACE_STOP and we
still stop, having "lost" the asynchronous resumption from the tracer.

So Oleg has an idea for a new kind of callback return value, which I'd call
perhaps UTRACE_NOP.  This would mean to stay in UTRACE_STOP if the last
utrace_control call on this engine used UTRACE_STOP, or the last callback
return value left the engine in that state, otherwise UTRACE_RESUME.
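A sketch of what resolving that might look like inside
finish_callback_report, reusing the internal names from the snippet
quoted above (UTRACE_NOP is hypothetical, not part of the current API):

        spin_lock(&utrace->lock);
        if (action == UTRACE_NOP)
                action = engine_wants_stop(engine)
                        ? UTRACE_STOP : UTRACE_RESUME;
        spin_unlock(&utrace->lock);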

> Then, probably producer could do
> 
>       utrace_control(current, UTRACE_STOP_XXX);
> 
>       slot = book_message_in_buffer(buf);
>       if (slot < 0)
>               return UTRACE_STOP_XXX;
> 
>       ...
>       return UTRACE_RESUME;
> 
> Even better, UTRACE_STOP_XXX can live outside of UTRACE_RESUME_MASK.

Right, it doesn't belong in enum utrace_resume_action--it doesn't make
sense as an argument for utrace_control, for example.  So it could be
something like a return-value flag bit saying "stop if marked for stop",
e.g. UTRACE_SINGLESTEP|UTRACE_STICKY_STOP meaning: single-step, unless
the engine is marked to stop, in which case stay stopped.
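A sketch of how that flag bit might look (the name and value here are
hypothetical, chosen for illustration):

        /* Hypothetical return-value flag bit, outside UTRACE_RESUME_MASK. */
        #define UTRACE_STICKY_STOP      0x100

        /* Callback: single-step, but stay stopped if already marked. */
        return UTRACE_SINGLESTEP | UTRACE_STICKY_STOP;

and on the utrace side, when finishing the callback:

        if ((action & UTRACE_STICKY_STOP) && engine_wants_stop(engine))
                action = UTRACE_STOP;
        action &= UTRACE_RESUME_MASK;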

So this would fix that race.  For the scenario we're discussing here,
that's probably just fine.  The only reason to stop was for the consumer
to run.  If it woke up quickly and ran enough to call utrace_control
while we were still finishing the return of the callback and utrace
bookkeeping, then we just don't stop at all.  (But note if there is
another engine that decides to stop anyway, then we'll have used a
second CPU and/or ping-pong scheduled to run the tracer thread in
parallel with a very short amount of work in the tracee before it just
blocks anyway.)

Now let's consider another scenario.  Take the normal debugger scenario,
i.e. a ptrace implementation done cleanly on the utrace API without the
magic kludges we have today, or a ptrace replacement built cleanly from
scratch in the ideal way, or the pending gdbstub implementation.  Here
what the callback does is wake up the tracer, and return UTRACE_STOP.
What the tracer thread does is block in some fashion until the tracee's
wake_up_tracer(), then do whatever logic (in ptrace, that means return
to user mode and let it make another ptrace system call, but that all
might be quite fast), and decide to examine the thread.  To make sure
its own callback is finished and UTRACE_STOP will be in force, it has to
synchronize with utrace_barrier.  So in maximum parallelism, this means,
having preempted the tracee just now by waking up, we now yield back to
the tracee to finish our callback, other engines' callbacks, and utrace
bookkeeping.  (This is currently a nasty busy-yield, but even in the best
theory of today's API, it's another structured block waiting for wake-up.)
Then when the tracee yields back, we wake up, utrace_barrier() returns
success, and we go on to utrace_prepare_examine().  In extremis, this
ping-pongs again waiting for the tracee to get from the end of utrace
bookkeeping to the middle of scheduler bookkeeping.  Finally, we are
running in the tracer and can examine the tracee safely.

For this sort of scenario, it is clearly better by far to have the
wake_up_tracer() done as late as possible in the sequence.  Ideally,
when the tracer actually gets made runnable in the scheduler, the tracee
is already simultaneously fully descheduled and safe.  Second best, it
should have only a small race window purely inside the scheduler
mechanics.  That's what we have in old vanilla ptrace today, where
(roughly) schedule() vs wait_task_inactive() is the only race.  (What
little ptrace/core bookkeeping is between wake_up_tracer() and schedule(),
we do with preemption disabled and then use preempt_enable_no_resched()
immediately before schedule(), so it should be rare that an interrupt
right there causes us to yield and ping-pong with the tracer before we
just block for the first and last time in the sequence.)

To try to eliminate the races wait_task_inactive() handles probably
requires deep scheduler hacking and I'm not particularly trying to open
that can of worms.  But we should at least consider how to make multiple
utrace engines using only clean APIs reach the status quo ante level of
optimal that old vanilla ptrace gets.

In the past, Oleg has suggested something very general, like a
utrace_engine_ops hook that gets called right before schedule().
I shy away from something like that, since I think it degrades the
intended property of the API that it makes it harder to write truly
arcane bugs with terrible effects on the system, not easier.  It's
just too big a sledgehammer to wield above the thinly-protected toes
of the utrace engine writer.

To avoid something dangerously general, the only real option is to give
something safely constrained and specific.  It's straightforward enough to
come up with something for any single synchronization method.  For
example, if every need for this kind of synchronization with tracees is
based on <linux/wait.h> facilities, then we can have a utrace API feature
whereby a callback can queue up a wait_queue_head_t pointer to be passed
to wake_up_interruptible_sync() after all the callbacks are done and
UTRACE_STOP is being performed.  It's probably sufficient to allow one
pending wake-up per engine, which makes it pretty cheap and simple to
implement.
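To make that concrete, here is one hypothetical shape for such a feature
(every name below is invented for illustration, not an existing API):

        /* Callback side: queue one wake-up for when the stop commits. */
        int utrace_set_stop_waitq(struct utrace_engine *engine,
                                  wait_queue_head_t *wqh);

        /* In the report callback, instead of waking the tracer directly: */
        utrace_set_stop_waitq(engine, &tracer_wait);
        return UTRACE_STOP;

        /* utrace internals, after all callbacks are done and UTRACE_STOP
           is actually in force: */
        if (engine->stop_waitq) {
                wake_up_interruptible_sync(engine->stop_waitq);
                engine->stop_waitq = NULL;
        }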


Thanks,
Roland
