Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
* Frank Ch. Eigler (f...@redhat.com) wrote:
> Hi -
>
> On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
> > [...]
> > > As far as I know, utrace supports multiple trace-engines on a
> > > process.  Since ptrace is just an engine of utrace, you can add
> > > another engine on utrace.
> > >
> > >   utrace-+-ptrace_engine---owner_process
> > >          |
> > >          +-systemtap_module
> > >          |
> > >          +-ftrace_plugin
>
> Right.  In this way, utrace is simply a multiplexing intermediary.
>
> > > Here, Frank had posted an example of utrace-ftrace engine.
> > > http://lkml.org/lkml/2009/1/27/294
> > > And here is his latest patch (which seems to support syscall tracing...):
> > > http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
> >
> > Reminder: we are looking at system-wide tracing here.
> >
> > Here are some comments about the current utrace implementation.
> >
> > Looking at include/linux/utrace.h from the tree:
> >
> >  * A tracing engine starts by calling utrace_attach_task() or
> >  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
> >  * (struct utrace_engine_ops), and some associated data.  This produces a
> >  * struct utrace_engine, which is the handle used for all other
> >  * operations.  An attached engine has its ops vector, its data, and an
> >  * event mask controlled by utrace_set_events().
> >
> > So if the system has, say, 3000 threads, then we have 3000 struct
> > utrace_engine created?  I wonder what effect this could have on
> > cachelines if this is used to trace hot paths like system call
> > entry/exit.  Have you benchmarked this kind of scenario under tbench?
>
> It has not been a problem, since utrace_engines are designed to be
> lightweight.  Starting or stopping a systemtap script of the form
>
>    probe process.syscall {}
>
> appears to have no noticeable impact on a tbench suite.

Do you mean starting this script for a single process, or for _all_ the
processes and threads running on the system?
> >  * For each event bit that is set, that engine will get the
> >  * appropriate ops->report_*() callback when the event occurs.  The
> >  * struct utrace_engine_ops need not provide callbacks for an event
> >  * unless the engine sets one of the associated event bits.
> >
> > Looking at utrace_set_events(), we seem to be limited to 32 events on
> > 32-bit architectures because it uses a bitmask?  Isn't it a bit small?
>
> There are only a few types of thread events that involve different
> classes of treatment, or different degrees of freedom in terms of
> interference with the uninstrumented fast path of the threads.  For
> example, it does not make sense to have different flag bits for
> different system calls, since choosing to trace *any* system call
> involves taking the thread off of the fast path with the TIF_ flag.
> Once it's off the fast path, it doesn't matter whether the utrace core
> or some client performs syscall discrimination, so it is left to the
> client.

If we limit ourselves to thread-interaction events, I agree that they
are limited.  But in the system-wide tracing scenario, the criteria for
filtering can apply to many more event categories.

Referring to Roland's reply, I think using utrace to enable system-wide
collection of data would just be a waste of resources.  Going through a
more lightweight system-wide activation seems more appropriate to me.
Utrace is still a very promising tool to trace process-specific
activity, though.

Mathieu

> > /**
> >  * utrace_set_events_pid - choose which event reports a tracing engine gets
> >  * @pid:	thread to affect
> >  * @engine:	attached engine to affect
> >  * @eventmask:	new event mask
> >  *
> >  * This is the same as utrace_set_events(), but takes a struct pid
> >  * pointer rather than a struct task_struct pointer.  The caller must
> >  * hold a ref on @pid, but does not need to worry about the task
> >  * staying valid.  If it's been reaped so that @pid points nowhere,
> >  * then this call returns -%ESRCH.
> > Comments like "but does not need to worry about the task staying
> > valid" do not make me feel safe and comfortable at all.  Could you
> > explain how you can assume that dereferencing an invalid pointer will
> > return NULL?
>
> (We're doing a final round of internal (pre-LKML) reviews of the utrace
> implementation right now on utrace-devel@redhat.com, where such
> comments get fastest attention from the experts.)
>
> For this particular issue, the utrace documentation file explains the
> liveness rules for the various pointers that can be fed to or received
> from utrace functions.  This is not about feeling safe; it's about what
> the mechanism is deliberately designed to permit.

> > About utrace_attach_task():
> >
> > 	if (unlikely(target->flags & PF_KTHREAD))
> > 		/*
Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
The utrace API itself is not a good fit for global tracing, since its
purpose is tracing and control of individual user threads.  There is no
reason to allocate its per-task data structures when you are going to
treat all tasks the same anyway.

The points that I think are being missed are about the possibilities of
overloading TIF_SYSCALL_TRACE.

It's true that ptrace uses TIF_SYSCALL_TRACE as a flag for whether you
are in the middle of a PTRACE_SYSCALL, so it can be confused by setting
it for other purposes on a task that is also ptrace'd (but not with
PTRACE_SYSCALL).  Until we are able to do away with these parts of the
old ptrace innards, you can't overload TIF_SYSCALL_TRACE without
perturbing ptrace behavior.

The utrace code does not have this problem.  It keeps its own state
bits, so for it, TIF_SYSCALL_TRACE means exactly "the task will call
tracehook_report_syscall_*" and no more.

To use TIF_SYSCALL_TRACE for another purpose, just set it on all the
tasks you like (and/or set it on new tasks in fork.c) and add your code
(tracepoints, whatever) to tracehook_report_syscall_* alongside the
calls there into utrace.  There is exactly one place in utrace code that
clears TIF_SYSCALL_TRACE, and you just add
!global_syscall_tracing_enabled to the condition there.

You don't need to bother clearing TIF_SYSCALL_TRACE on any task when
you're done.  If your global_syscall_tracing_enabled (or whatever it is)
is clear, each task will lazily fall into utrace at its next syscall
entry/exit, and then utrace will reset TIF_SYSCALL_TRACE when it finds
no reason left to have it on.

I'm not really going to delve into utrace internals in this thread.
Please raise those questions in review of the utrace patches when
current code is actually posted, where they belong.  Here I'll just
mention the relevant things that relate to the underlying issue you
raised about synchronization.
As thoroughly documented, utrace_set_events() is a quick, asynchronous
call that itself makes no guarantees about how quickly a running task
will start to report the newly-requested events.  For purposes relevant
here, it just sets TIF_SYSCALL_TRACE and nothing else.

In utrace, if you want synchronous assurance that a task misses no
events you ask for, then you must first use utrace_control() (et al) to
stop it and synchronize.  That is not something that makes much sense at
all for a "flip on global tracing" operation, which is not generally
especially synchronous with anything else.

If you want a best effort that a task will pick up newly-requested
events Real Soon Now, you can use utrace_control() with just
UTRACE_REPORT.  For purposes relevant here, this just does
set_notify_resume().  That will send an IPI if the task is running, and
then it will start noticing before it returns to user mode.  So:

	set_tsk_thread_flag(task, TIF_SYSCALL_TRACE);
	set_notify_resume(task);

is what I would expect you to do on each task if you want to quickly get
it to start hitting tracehook_report_syscall_*.  (I'm a bit dubious that
there is really any need to speed it up with set_notify_resume(), but
that's just me.)

Finally, some broader points about TIF_SYSCALL_TRACE that I think have
been overlooked.  The key special feature of TIF_SYSCALL_TRACE is that
it gets you to a place where full user_regset access is available.
Debuggers need this to read (and write) the full user register state
arbitrarily, which they also need to do user backtraces and the like.
If you do not need user_regset to work, then you don't need to be on
this (slowest) code path.  If you are only interested in reading syscall
arguments and results (or even in changing syscall results in exit
tracing), then you do not need user_regset and you do not need to take
the slowest syscall path.  (If you are doing backtraces but already rely
on full kernel stack unwinding to do it, you also do not need
user_regset.)
From anywhere inside the kernel, you can use the asm/syscall.h calls to
read syscall args, whichever entry path the task took.

The other mechanism to hook into every syscall entry/exit is
TIF_SYSCALL_AUDIT.  On some machines (like x86), this takes a third,
"warm" code path that is faster than the TIF_SYSCALL_TRACE path (though
obviously still off the fastest direct code path).  It can be faster
precisely because it doesn't need to allow for user_regset access, nor
for modification of syscall arguments in entry tracing.  For normal
read-only tracing of just the actual syscall details, it has all you
need.  Unfortunately, the arch code all looks like:

	if (unlikely(current->audit_context))
		audit_syscall_{entry,exit}(...);

So we need to change that to:

	if (unlikely(test_thread_flag(TIF_SYSCALL_AUDIT)))
		audit_syscall_{entry,exit}(...);

But that is pretty easy to get right, even doing it blind on arch's you
can't test.  Far better than adding new asm hackery for each arch that's
almost identical to TIF_SYSCALL_TRACE or
Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
Hi -

On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
> [...]
> > As far as I know, utrace supports multiple trace-engines on a
> > process.  Since ptrace is just an engine of utrace, you can add
> > another engine on utrace.
> >
> >   utrace-+-ptrace_engine---owner_process
> >          |
> >          +-systemtap_module
> >          |
> >          +-ftrace_plugin

Right.  In this way, utrace is simply a multiplexing intermediary.

> > Here, Frank had posted an example of utrace-ftrace engine.
> > http://lkml.org/lkml/2009/1/27/294
> > And here is his latest patch (which seems to support syscall tracing...):
> > http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
>
> Reminder: we are looking at system-wide tracing here.
>
> Here are some comments about the current utrace implementation.
>
> Looking at include/linux/utrace.h from the tree:
>
>  * A tracing engine starts by calling utrace_attach_task() or
>  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
>  * (struct utrace_engine_ops), and some associated data.  This produces a
>  * struct utrace_engine, which is the handle used for all other
>  * operations.  An attached engine has its ops vector, its data, and an
>  * event mask controlled by utrace_set_events().
>
> So if the system has, say, 3000 threads, then we have 3000 struct
> utrace_engine created?  I wonder what effect this could have on
> cachelines if this is used to trace hot paths like system call
> entry/exit.  Have you benchmarked this kind of scenario under tbench?

It has not been a problem, since utrace_engines are designed to be
lightweight.  Starting or stopping a systemtap script of the form

   probe process.syscall {}

appears to have no noticeable impact on a tbench suite.

>  * For each event bit that is set, that engine will get the
>  * appropriate ops->report_*() callback when the event occurs.
>  * The struct utrace_engine_ops need not provide callbacks for an event
>  * unless the engine sets one of the associated event bits.
>
> Looking at utrace_set_events(), we seem to be limited to 32 events on
> 32-bit architectures because it uses a bitmask?  Isn't it a bit small?

There are only a few types of thread events that involve different
classes of treatment, or different degrees of freedom in terms of
interference with the uninstrumented fast path of the threads.  For
example, it does not make sense to have different flag bits for
different system calls, since choosing to trace *any* system call
involves taking the thread off of the fast path with the TIF_ flag.
Once it's off the fast path, it doesn't matter whether the utrace core
or some client performs syscall discrimination, so it is left to the
client.

> /**
>  * utrace_set_events_pid - choose which event reports a tracing engine gets
>  * @pid:	thread to affect
>  * @engine:	attached engine to affect
>  * @eventmask:	new event mask
>  *
>  * This is the same as utrace_set_events(), but takes a struct pid
>  * pointer rather than a struct task_struct pointer.  The caller must
>  * hold a ref on @pid, but does not need to worry about the task
>  * staying valid.  If it's been reaped so that @pid points nowhere,
>  * then this call returns -%ESRCH.
>
> Comments like "but does not need to worry about the task staying
> valid" do not make me feel safe and comfortable at all.  Could you
> explain how you can assume that dereferencing an invalid pointer will
> return NULL?

(We're doing a final round of internal (pre-LKML) reviews of the utrace
implementation right now on utrace-devel@redhat.com, where such comments
get fastest attention from the experts.)

For this particular issue, the utrace documentation file explains the
liveness rules for the various pointers that can be fed to or received
from utrace functions.  This is not about feeling safe; it's about what
the mechanism is deliberately designed to permit.
> About utrace_attach_task():
>
> 	if (unlikely(target->flags & PF_KTHREAD))
> 		/*
> 		 * Silly kernel, utrace is for users!
> 		 */
> 		return ERR_PTR(-EPERM);
>
> So we cannot trace kernel threads?

I'm not quite sure about all the reasons for this, but I believe that
kernel threads don't tend to engage in job control / signal /
system-call activities the same way as normal user threads do.

> /*
>  * Called without locks, when we might be the first utrace engine to attach.
>  * If this is a newborn thread and we are not the creator, we have to wait
>  * for it.  The creator gets the first chance to attach.  The PF_STARTING
>  * flag is cleared after its report_clone hook has had a chance to run.
>  */
> static inline int utrace_attach_delay(struct task_struct *target)
Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
On Mon, Mar 16, 2009 at 06:18:00PM -0400, Frank Ch. Eigler wrote:
> Hi -
>
> On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
> > [...]
> > > As far as I know, utrace supports multiple trace-engines on a
> > > process.  Since ptrace is just an engine of utrace, you can add
> > > another engine on utrace.
> > >
> > >   utrace-+-ptrace_engine---owner_process
> > >          |
> > >          +-systemtap_module
> > >          |
> > >          +-ftrace_plugin
>
> Right.  In this way, utrace is simply a multiplexing intermediary.
>
> > > Here, Frank had posted an example of utrace-ftrace engine.
> > > http://lkml.org/lkml/2009/1/27/294
> > > And here is his latest patch (which seems to support syscall tracing...):
> > > http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
> >
> > Reminder: we are looking at system-wide tracing here.
> >
> > Here are some comments about the current utrace implementation.
> >
> > Looking at include/linux/utrace.h from the tree:
> >
> >  * A tracing engine starts by calling utrace_attach_task() or
> >  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
> >  * (struct utrace_engine_ops), and some associated data.  This produces a
> >  * struct utrace_engine, which is the handle used for all other
> >  * operations.  An attached engine has its ops vector, its data, and an
> >  * event mask controlled by utrace_set_events().
> >
> > So if the system has, say, 3000 threads, then we have 3000 struct
> > utrace_engine created?  I wonder what effect this could have on
> > cachelines if this is used to trace hot paths like system call
> > entry/exit.  Have you benchmarked this kind of scenario under tbench?
>
> It has not been a problem, since utrace_engines are designed to be
> lightweight.  Starting or stopping a systemtap script of the form
>
>    probe process.syscall {}
>
> appears to have no noticeable impact on a tbench suite.
>
> >  * For each event bit that is set, that engine will get the
> >  * appropriate ops->report_*() callback when the event occurs.
> >  * The struct utrace_engine_ops need not provide callbacks for an event
> >  * unless the engine sets one of the associated event bits.
> >
> > Looking at utrace_set_events(), we seem to be limited to 32 events on
> > 32-bit architectures because it uses a bitmask?  Isn't it a bit small?
>
> There are only a few types of thread events that involve different
> classes of treatment, or different degrees of freedom in terms of
> interference with the uninstrumented fast path of the threads.  For
> example, it does not make sense to have different flag bits for
> different system calls, since choosing to trace *any* system call
> involves taking the thread off of the fast path with the TIF_ flag.
> Once it's off the fast path, it doesn't matter whether the utrace core
> or some client performs syscall discrimination, so it is left to the
> client.
>
> > /**
> >  * utrace_set_events_pid - choose which event reports a tracing engine gets
> >  * @pid:	thread to affect
> >  * @engine:	attached engine to affect
> >  * @eventmask:	new event mask
> >  *
> >  * This is the same as utrace_set_events(), but takes a struct pid
> >  * pointer rather than a struct task_struct pointer.  The caller must
> >  * hold a ref on @pid, but does not need to worry about the task
> >  * staying valid.  If it's been reaped so that @pid points nowhere,
> >  * then this call returns -%ESRCH.
> >
> > Comments like "but does not need to worry about the task staying
> > valid" do not make me feel safe and comfortable at all.  Could you
> > explain how you can assume that dereferencing an invalid pointer will
> > return NULL?
>
> (We're doing a final round of internal (pre-LKML) reviews of the utrace
> implementation right now on utrace-devel@redhat.com, where such
> comments get fastest attention from the experts.)
>
> For this particular issue, the utrace documentation file explains the
> liveness rules for the various pointers that can be fed to or received
> from utrace functions.  This is not about feeling safe; it's about what
> the mechanism is deliberately designed to permit.
> > About utrace_attach_task():
> >
> > 	if (unlikely(target->flags & PF_KTHREAD))
> > 		/*
> > 		 * Silly kernel, utrace is for users!
> > 		 */
> > 		return ERR_PTR(-EPERM);
> >
> > So we cannot trace kernel threads?
>
> I'm not quite sure about all the reasons for this, but I believe that
> kernel threads don't tend to engage in job control / signal /
> system-call activities the same way as normal user threads do.

Some of them use some syscalls, but it doesn't involve a user/kernel
switch.  So it's not traceable by hooking syscall entry/exit or using
tracehooks.  It would require specific hooks on the sys_* functions for
that.  So this check
Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
On 03/16, Mathieu Desnoyers wrote:
>
> utrace_add_engine()
> 	set_notify_resume(target);
>
> ok, so this is where the TIF_NOTIFY_RESUME thread flag is set.  I
> notice that it is set asynchronously with the execution of the target
> thread (as I do with my TIF_KERNEL_TRACE thread flag).  However, on
> x86_64, _TIF_DO_NOTIFY_MASK is only tested in the entry_64.S
> int_signal: and retint_signal: code paths.  If there is no syscall
> tracing to do upon syscall entry, the thread flags are not re-read at
> syscall exit, and you will miss the syscall exit of your target thread
> if that thread was blocked while you set its TIF_NOTIFY_RESUME.

Afaics, TIF_NOTIFY_RESUME is not needed to trace syscall entry/exit.
If an engine wants syscall tracing, utrace_set_events(UTRACE_SYSCALL_xxx)
sets TIF_SYSCALL_TRACE, and syscall_trace_enter/syscall_trace_leave call
tracehook_report_syscall_xxx().

Oleg.