> I think Roland is right. Let's forget about utrace for the moment,
> this code looks like
>
> 	if (!CONDITION) {
> 		set_task_state(TASK_XXX);
> 		schedule();
> 	}
>
> and it can obviously race with
>
> 	CONDITION = true;
> 	wake_up_xxx();
This is Linux in-kernel synchronization 101. The correct blocks of this
sort do:

	set_current_state(TASK_UNINTERRUPTIBLE);
	while (!CONDITION) {
		schedule();
	}
	set_current_state(TASK_RUNNING);

This is entirely orthogonal to anything having to do with utrace per se.
This pattern is race-free because wake_up_* will reset the waiter to
TASK_RUNNING after making CONDITION true. So in the race where the
"while (!CONDITION)" check lets it get into schedule(), that just
returns immediately, after which the second iteration sees CONDITION
true.

This is why I recommended using higher-level things like <linux/wait.h>
facilities, where the correct usage patterns are more obvious.

But the race we're actually talking about here is on the other side of
the coin.

> I am wondering if there is another way to handle such a race...
> Suppose we
>
> - add UTRACE_STOP_XXX
>
> - modify utrace_control(UTRACE_STOP_XXX) to do
>   mark_engine_wants_stop() and nothing more.

utrace_control(current,,UTRACE_STOP) doesn't really do anything more.
So a callback doing this closes the old race: UTRACE_STOP is in force
when the tracer side wakes up, since utrace_control-on-current was done
before wake_up_tracer(). But...

> - finish_callback_report(UTRACE_STOP_XXX) does
>
> 	spin_lock(&utrace->lock);
> 	if (engine_wants_stop(engine))
> 		action = UTRACE_STOP;
> 	else
> 		action = UTRACE_RESUME;
> 	spin_unlock(&utrace->lock);

Let's clarify what this means for the people following the utrace API
but not its implementation internals: currently, the callback's return
value overrides the engine's last-set state such as UTRACE_STOP, whether
that came from a previous utrace_control call before the callback, a
utrace_control call inside the callback, or the return value of a
previous callback.
So now we've moved to a different race (or a different instance of the
original race, if you like): after wake_up_tracer(), the tracer can come
and clear our engine's UTRACE_STOP state with its utrace_control call;
but then our callback return value just reestablishes UTRACE_STOP and we
still stop, having "lost" the asynchronous resumption from the tracer.

So Oleg has an idea for a new kind of callback return value, which I'd
call perhaps UTRACE_NOP. This would mean: stay in UTRACE_STOP if the
last utrace_control call on this engine used UTRACE_STOP, or the last
callback return value left the engine in that state; otherwise
UTRACE_RESUME.

> Then, probably producer could do
>
> 	utrace_control(current, UTRACE_STOP_XXX);
>
> 	slot = book_message_in_buffer(buf);
> 	if (slot < 0)
> 		return UTRACE_STOP_XXX;
>
> 	...
> 	return UTRACE_RESUME;
>
> Even better, UTRACE_STOP_XXX can live outside of UTRACE_RESUME_MASK.

Right, it doesn't belong in enum utrace_resume_action--it doesn't make
sense as an argument for utrace_control, for example. So it could be
something like a return-value flag bit saying "stop if marked for stop",
e.g. UTRACE_SINGLESTEP|UTRACE_STICKY_STOP meaning single-step, unless
marked to stop, in which case stay stopped.

So this would fix that race. For the scenario we're discussing here,
that's probably just fine. The only reason to stop was for the consumer
to run. If it woke up quickly and ran enough to call utrace_control
while we were still finishing the return of the callback and utrace
bookkeeping, then we just don't stop at all. (But note that if there is
another engine that decides to stop anyway, then we'll have used a
second CPU and/or ping-pong scheduled to run the tracer thread in
parallel with a very short amount of work in the tracee before it just
blocks anyway.)

Now let's consider another scenario. Take the normal debugger scenario,
i.e.
a ptrace implementation done cleanly on the utrace API without the magic
kludges we have today, or a ptrace replacement built cleanly from
scratch in the ideal way, or the pending gdbstub implementation.

Here what the callback does is wake up the tracer, and return
UTRACE_STOP. What the tracer thread does is block in some fashion until
the tracee's wake_up_tracer(), then do whatever logic (in ptrace, that
means return to user mode and let it make another ptrace system call,
but that all might be quite fast), and decide to examine the thread. To
make sure its own callback is finished and UTRACE_STOP will be in force,
it has to synchronize with utrace_barrier.

So at maximum parallelism, this means that, having preempted the tracee
just now by waking up, we now yield back to the tracee to finish our
callback, other engines' callbacks, and utrace bookkeeping. (This is
currently a nasty busy-yield, but even in the best theory of today's
API, another structured block waiting for wake-up.) Then when the tracee
yields back, we wake up, utrace_barrier() returns success, and we go on
to utrace_prepare_examine(). In extremis, this ping-pongs again, waiting
for the tracee to get from the end of utrace bookkeeping to the middle
of scheduler bookkeeping. Finally, we are running in the tracer and can
examine the tracee safely.

For this sort of scenario, it is clearly better by far to have the
wake_up_tracer() done as late as possible in the sequence. Ideally, when
the tracer actually gets made runnable in the scheduler, the tracee is
already simultaneously fully descheduled and safe. Second best, it
should have only a small race window purely inside the scheduler
mechanics. That's what we have in old vanilla ptrace today, where
(roughly) schedule() vs wait_task_inactive() is the only race.
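In outline, the tracer-side sequence above looks something like this.
(A kernel-style sketch, not compilable as is: wait_for_tracee_wakeup()
and examine_tracee_state() are illustrative stand-ins, and error
handling is reduced to blind retries.)

	/* Tracer thread, woken by the tracee's report callback. */
	for (;;) {
		wait_for_tracee_wakeup();	/* block until wake_up_tracer() */

		/*
		 * Ensure our callback has fully returned and UTRACE_STOP
		 * is in force; this may yield back to the tracee (the
		 * first ping-pong described above).
		 */
		if (utrace_barrier(tracee, engine))
			continue;

		/*
		 * May ping-pong once more while the tracee gets from the
		 * end of utrace bookkeeping into the scheduler.
		 */
		if (utrace_prepare_examine(tracee, engine, &exam))
			continue;

		examine_tracee_state();

		if (!utrace_finish_examine(tracee, engine, &exam))
			break;			/* examination wasn't raced */
	}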
(What little ptrace/core bookkeeping there is between wake_up_tracer()
and schedule(), we do with preemption disabled, and then use
preempt_enable_no_resched() immediately before schedule(), so it should
be rare that an interrupt right there causes us to yield and ping-pong
with the tracer before we just block for the first and last time in the
sequence.)

To try to eliminate the races that wait_task_inactive() handles probably
requires deep scheduler hacking, and I'm not particularly trying to open
that can of worms. But we should at least consider how to let multiple
utrace engines using only clean APIs reach the status quo ante level of
optimality that old vanilla ptrace gets.

In the past, Oleg has suggested something very general, like a
utrace_engine_ops hook that gets called right before schedule(). I shy
away from something like that, since I think it degrades the intended
property of the API that it makes it harder, not easier, to write truly
arcane bugs with terrible effects on the system. It's just too big a
sledgehammer to wield above the thinly-protected toes of the utrace
engine writer.

To avoid something dangerously general, the only real option is to give
something safely constrained and specific. It's straightforward enough
to come up with something for any single synchronization method. For
example, if every need for this kind of synchronization with tracees is
based on <linux/wait.h> facilities, then we can have a utrace API
feature whereby a callback can queue up a wait_queue_head_t pointer to
be passed to wake_up_interruptible_sync() after all the callbacks are
done and UTRACE_STOP is being performed. It's probably sufficient to
allow one pending wake-up per engine, which makes it pretty cheap and
simple to implement.

Thanks,
Roland