Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
Hi - On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote: [...] As far as I know, utrace supports multiple trace-engines on a process. Since ptrace is just an engine of utrace, you can add another engine on utrace. utrace-+-ptrace_engine---owner_process | +-systemtap_module | +-ftrace_plugin Right. In this way, utrace is simply a multiplexing intermediary. Here, Frank had posted an example of utrace-ftrace engine. http://lkml.org/lkml/2009/1/27/294 And here is the latest his patch(which seems to support syscall tracing...) http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124 Reminder : we are looking at system-wide tracing here. Here are some comments about the current utrace implementation. Looking at include/linux/utrace.h from the tree 17 * A tracing engine starts by calling utrace_attach_task() or 18 * utrace_attach_pid() on the chosen thread, passing in a set of hooks 19 * (struct utrace_engine_ops), and some associated data. This produces a 20 * struct utrace_engine, which is the handle used for all other 21 * operations. An attached engine has its ops vector, its data, and an 22 * event mask controlled by utrace_set_events(). So if the system has, say 3000 threads, then we have 3000 struct utrace_engine created ? I wonder what effet this could have one cachelines if this is used to trace hot paths like system call entry/exit. Have you benchmarked this kind of scenario under tbench ? It has not been a problem, since utrace_engines are designed to be lightweight. Starting or stopping a systemtap script of the form probe process.syscall {} appears to have no noticable impact on a tbench suite. 24 * For each event bit that is set, that engine will get the 25 * appropriate ops-report_*() callback when the event occurs. The 26 * struct utrace_engine_ops need not provide callbacks for an event 27 * unless the engine sets one of the associated event bits. Looking at utrace_set_events(), we seem to be limited to 32 events on a 32-bits architectures because it uses a bitmask ? Isn't it a bit small? There are only a few types of thread events that involve different classes of treatment, or different degrees of freedom in terms of interference with the uninstrumented fast path of the threads. For example, it does not make sense to have different flag bits for different system calls, since choosing to trace *any* system call involves taking the thread off of the fast path with the TIF_ flag. Once it's off the fast path, it doesn't matter whether the utrace core or some client performs syscall discrimination, so it is left to the client. 682 /** 683 * utrace_set_events_pid - choose which event reports a tracing engine gets 684 * @pid:thread to affect 685 * @engine: attached engine to affect 686 * @eventmask: new event mask 687 * 688 * This is the same as utrace_set_events(), but takes a struct pid 689 * pointer rather than a struct task_struct pointer. The caller must 690 * hold a ref on @pid, but does not need to worry about the task 691 * staying valid. If it's been reaped so that @pid points nowhere, 692 * then this call returns -%ESRCH. Comments like but does not need to worry about the task staying valid does not make me feel safe and comfortable at all, could you explain how you can assume that derefencing an invalid pointer will return NULL ? (We're doing a final round of internal (pre-LKML) reviews of the utrace implementation right now on utrace-devel@redhat.com, where such comments get fastest attention from the experts.) For this particular issue, the utrace documentation file explains the liveness rules for the various pointers that can be fed to or received from utrace functions. This is not about feeling safe, it's about what the mechanism is deliberately designed to permit. About the utrace_attach_task() : 244 if (unlikely(target-flags PF_KTHREAD)) 245 /* 246 * Silly kernel, utrace is for users! 247 */ 248 return ERR_PTR(-EPERM); So we cannot trace kernel threads ? I'm not quite sure about all the reasons for this, but I believe that kernel threads don't tend to engage in job control / signal / system-call activities the same way as normal user threads do. 118 /* 119 * Called without locks, when we might be the first utrace engine to attach. 120 * If this is a newborn thread and we are not the creator, we have to wait 121 * for it. The creator gets the first chance to attach. The PF_STARTING 122 * flag is cleared after its report_clone hook has had a chance to run. 123 */ 124 static inline int utrace_attach_delay(struct task_struct *target) 125
Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
On Mon, Mar 16, 2009 at 06:18:00PM -0400, Frank Ch. Eigler wrote: Hi - On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote: [...] As far as I know, utrace supports multiple trace-engines on a process. Since ptrace is just an engine of utrace, you can add another engine on utrace. utrace-+-ptrace_engine---owner_process | +-systemtap_module | +-ftrace_plugin Right. In this way, utrace is simply a multiplexing intermediary. Here, Frank had posted an example of utrace-ftrace engine. http://lkml.org/lkml/2009/1/27/294 And here is the latest his patch(which seems to support syscall tracing...) http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124 Reminder : we are looking at system-wide tracing here. Here are some comments about the current utrace implementation. Looking at include/linux/utrace.h from the tree 17 * A tracing engine starts by calling utrace_attach_task() or 18 * utrace_attach_pid() on the chosen thread, passing in a set of hooks 19 * (struct utrace_engine_ops), and some associated data. This produces a 20 * struct utrace_engine, which is the handle used for all other 21 * operations. An attached engine has its ops vector, its data, and an 22 * event mask controlled by utrace_set_events(). So if the system has, say 3000 threads, then we have 3000 struct utrace_engine created ? I wonder what effet this could have one cachelines if this is used to trace hot paths like system call entry/exit. Have you benchmarked this kind of scenario under tbench ? It has not been a problem, since utrace_engines are designed to be lightweight. Starting or stopping a systemtap script of the form probe process.syscall {} appears to have no noticable impact on a tbench suite. 24 * For each event bit that is set, that engine will get the 25 * appropriate ops-report_*() callback when the event occurs. The 26 * struct utrace_engine_ops need not provide callbacks for an event 27 * unless the engine sets one of the associated event bits. Looking at utrace_set_events(), we seem to be limited to 32 events on a 32-bits architectures because it uses a bitmask ? Isn't it a bit small? There are only a few types of thread events that involve different classes of treatment, or different degrees of freedom in terms of interference with the uninstrumented fast path of the threads. For example, it does not make sense to have different flag bits for different system calls, since choosing to trace *any* system call involves taking the thread off of the fast path with the TIF_ flag. Once it's off the fast path, it doesn't matter whether the utrace core or some client performs syscall discrimination, so it is left to the client. 682 /** 683 * utrace_set_events_pid - choose which event reports a tracing engine gets 684 * @pid:thread to affect 685 * @engine: attached engine to affect 686 * @eventmask: new event mask 687 * 688 * This is the same as utrace_set_events(), but takes a struct pid 689 * pointer rather than a struct task_struct pointer. The caller must 690 * hold a ref on @pid, but does not need to worry about the task 691 * staying valid. If it's been reaped so that @pid points nowhere, 692 * then this call returns -%ESRCH. Comments like but does not need to worry about the task staying valid does not make me feel safe and comfortable at all, could you explain how you can assume that derefencing an invalid pointer will return NULL ? (We're doing a final round of internal (pre-LKML) reviews of the utrace implementation right now on utrace-devel@redhat.com, where such comments get fastest attention from the experts.) For this particular issue, the utrace documentation file explains the liveness rules for the various pointers that can be fed to or received from utrace functions. This is not about feeling safe, it's about what the mechanism is deliberately designed to permit. About the utrace_attach_task() : 244 if (unlikely(target-flags PF_KTHREAD)) 245 /* 246 * Silly kernel, utrace is for users! 247 */ 248 return ERR_PTR(-EPERM); So we cannot trace kernel threads ? I'm not quite sure about all the reasons for this, but I believe that kernel threads don't tend to engage in job control / signal / system-call activities the same way as normal user threads do. Some of them use some syscalls, but it doesn't involve a user/kernel switch. So it's not tracable by hooking syscall_entry/exit or using tracehooks. It would require specific hooks on sys_* functions for that. So this check
Re: Q: utrace-stopped utrace_report_jctl()
On 03/15, Roland McGrath wrote: Then we re-do this (well, almost) check under -siglock, } else if (task_is_stopped(target)) { if (!(target-utrace_flags UTRACE_EVENT(JCTL))) utrace-stopped = stopped = true; } But this is not nice. Let's suppose the task is already stopped, we do UTRACE_ATTACH + utrace_set_events(JCTL). This is exactly why utrace_set_events() sets -stopped preemptively for that case. Yes, thanks. I saw this code in utrace_set_events(), but then forgot. REPORT(task, utrace, report, UTRACE_EVENT(JCTL), report_jctl, what, notify); instead. There is a bug, but your fix changes a key API choice. I've put in this fix: -report_jctl, was_stopped ? CLD_STOPPED : CLD_CONTINUED, what); +report_jctl, was_stopped ? CLD_STOPPED : CLD_CONTINUED, +notify ? what : 0); The @type argument shows the state we will be in after the callback. If the state changes, there will be another callback. That's what a state-tracking tracer needs, e.g. to keep a little light on the screen red while the thread is stopped and green while it's running. The @notify argument shows what SIGCHLD the parent sees (if it were dequeuing all possible SIGCHLD postings as quickly as they come). That's what an event-tracking tracer needs, e.g. to match up with what SIGCHLDs are expected in the parent. I see, thanks. With the first patch, we call utrace_report_jctl() before we actually stop. do_signal_stop() can fail then, but I think this is OK, we can pretend that SIGCONT/SIGKILL happened after we stopped. It is not complete, and with this patch we always call -report_jctl with notify == 0. Just for discussion. I think I sort of understand the intent of your patch. If we change the calling convention for tracehook_notify_jctl, I think we want to preserve the aspect that the hook decides about sending the notification. That's how the ptrace quirks can be reimplemented differently later without changing the tracehook layer again. Also, we certainly don't want one tracehook call with two different locking conditions. Agreed, bool sig_locked is awful. But we can avoid it. The real problem is how to figure out the correct notify argument. I'll try to think more, but I am not sure I will find the clean solution :( Just in case. We can introduce PF_SIGCONTED flag and change prepare_signal(SIGCONT) and signal_wake_up(resume = 1) to set this flag. Since the task never changes its -flags in finish_stop() path, it is safe to do this under -siglock. This way utrace_report_jctl() can drop TASK_STOPPED before REPORT() and then check !PF_SIGCONTED before restoring the -state. But yes sure, this is very, very ugly. Oleg.
Re: utrace_set_events/utrace_control death/reap checks
On 03/15, Roland McGrath wrote: utrace_set_events: (utrace-death ((old_flags ~events) DEATH_EVENTS)) (old_flags ~events) DEATH_EVENTS) means the caller tries to clear DEATH/QUIESCE. Why this is not allowed? And why this is not allowed _only_ when the target runs utrace_report_death()-REPORT()? This is specifically documented for -EALREADY, and in the DocBook section Interlock with final callbacks. The idea is this: Aha, I didn't know. And I don't understand why do we need utrace-death at all. ... it is only used by utrace_control(UTRACE_DETACH). ... that it will (or it's already happening). utrace_control as the synchronizing step of asynchronous tear-down. If it returns 0, then report_death will not and it is safe to destroy data structures the callback code would use. Yes, with your explanation above this is clear. But can't we simplify this check a little bit? utrace_control: else if (unlikely(target-utrace_flags DEATH_EVENTS) || unlikely(utrace-death)) { return -EALREADY; can't we just do else if (unlikely(utrace-death)) { return -EALREADY; I guess I missed something, but can't understand why do we need to check -utrace_flags. We are going to call mark_engine_detached() below which clears engine-flags, and we hold utrace-lock. If utrace_flags DEATH_EVENTS is true, the subsequent utrace_report_death() must see engine-flags == 0 (it takes utrace-lock before REPORT_CALLBACKS), so it won't call any callback. Yes, it can play with engine itself, but this should be safe because struct utrace has a reference to attached engine. No? Oleg.
Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
On 03/16, Mathieu Desnoyers wrote: utrace_add_engine() set_notify_resume(target); ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice that it is set asynchronously with the execution of the target thread (as I do with my TIF_KERNEL_TRACE thread flag). However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in entry_64.S int_signal: and retint_signal: code paths. However, if there is no syscall tracing to do upon syscall entry, the thread flags are not re-read at syscall exit and you will miss the syscall exit returning from your target thread if this thread was blocked while you set its TIF_NOTIFY_RESUME. Afaics, TIF_NOTIFY_RESUME is not needed to trace syscall entry/exit. If engine wants the syscall tracing, utrace_set_events(UTRACE_SYSCALL_xxx) sets TIF_SYSCALL_TRACE. And syscall_trace_enter/syscall_trace_leave call tracehook_report_syscall_xxx(). Oleg.