Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

2009-03-23 Thread Mathieu Desnoyers
* Frank Ch. Eigler (f...@redhat.com) wrote:
 Hi -
 
 
 On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
 
  [...]
   As far as I know, utrace supports multiple trace-engines on a process.
   Since ptrace is just one engine of utrace, you can add another engine on
   top of utrace.
   
   utrace-+-ptrace_engine---owner_process
          |
          +-systemtap_module
          |
          +-ftrace_plugin
 
 Right.  In this way, utrace is simply a multiplexing intermediary.
 
 
   Here, Frank had posted an example of utrace-ftrace engine.
   http://lkml.org/lkml/2009/1/27/294
   
   And here is his latest patch (which seems to support syscall
   tracing...)
   http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
   
  
  Reminder: we are looking at system-wide tracing here. Here are some
  comments about the current utrace implementation.
  
  Looking at include/linux/utrace.h from the tree
  
  17  * A tracing engine starts by calling utrace_attach_task() or
  18  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
  19  * (struct utrace_engine_ops), and some associated data.  This produces a
  20  * struct utrace_engine, which is the handle used for all other
  21  * operations.  An attached engine has its ops vector, its data, and an
  22  * event mask controlled by utrace_set_events().
  
  So if the system has, say, 3000 threads, then we have 3000 struct
  utrace_engine instances created? I wonder what effect this could have on
  cachelines if this is used to trace hot paths like system call
  entry/exit. Have you benchmarked this kind of scenario under tbench?
 
 It has not been a problem, since utrace_engines are designed to be
 lightweight.  Starting or stopping a systemtap script of the form
 
 probe process.syscall {}
 
 appears to have no noticeable impact on a tbench suite.
 

Do you mean starting this script for a single process or for _all_ the
processes and threads running on the system?

 
  24  * For each event bit that is set, that engine will get the
  25  * appropriate ops->report_*() callback when the event occurs.  The
  26  * struct utrace_engine_ops need not provide callbacks for an event
  27  * unless the engine sets one of the associated event bits.
  
  Looking at utrace_set_events(), we seem to be limited to 32 events on
  32-bit architectures because it uses a bitmask? Isn't that a bit small?
 
 There are only a few types of thread events that involve different
 classes of treatment, or different degrees of freedom in terms of
 interference with the uninstrumented fast path of the threads.
 
 For example, it does not make sense to have different flag bits for
 different system calls, since choosing to trace *any* system call
 involves taking the thread off of the fast path with the TIF_ flag.
 Once it's off the fast path, it doesn't matter whether the utrace core
 or some client performs syscall discrimination, so it is left to the
 client.
 

If we limit ourselves to thread-interaction events, I agree that they are
limited. But in the system-wide tracing scenario, the criteria for
filtering can apply to many more event categories.

Referring to Roland's reply, I think using utrace to enable system-wide
collection of data would just be a waste of resources. Going through a
more lightweight system-wide activation seems more appropriate to me.
Utrace is still a very promising tool to trace process-specific activity
though.

Mathieu

 
  682 /**
  683  * utrace_set_events_pid - choose which event reports a tracing engine gets
  684  * @pid: thread to affect
  685  * @engine: attached engine to affect
  686  * @eventmask:  new event mask
  687  *
  688  * This is the same as utrace_set_events(), but takes a struct pid
  689  * pointer rather than a struct task_struct pointer.  The caller must
  690  * hold a ref on @pid, but does not need to worry about the task
  691  * staying valid.  If it's been reaped so that @pid points nowhere,
  692  * then this call returns -%ESRCH.
  
  
  Comments like "but does not need to worry about the task staying valid"
  do not make me feel safe and comfortable at all. Could you explain
  how you can assume that dereferencing an invalid pointer will return
  NULL?
 
 (We're doing a final round of internal (pre-LKML) reviews of the
 utrace implementation right now on utrace-devel@redhat.com, where such
 comments get fastest attention from the experts.)
 
 For this particular issue, the utrace documentation file explains the
 liveness rules for the various pointers that can be fed to or received
 from utrace functions.  This is not about feeling safe, it's about
 what the mechanism is deliberately designed to permit.
 
 
  About utrace_attach_task():
  
  244 if (unlikely(target->flags & PF_KTHREAD))
  245 /*
  246  * Silly kernel, utrace is for users!
  247  */
  248 return ERR_PTR(-EPERM);

Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

2009-03-19 Thread Roland McGrath
The utrace API itself is not a good fit for global tracing, since its
purpose is tracing and control of individual user threads.  There is
no reason to allocate its per-task data structures when you are going
to treat all tasks the same anyway.  The points that I think are being
missed are about the possibilities of overloading TIF_SYSCALL_TRACE.

It's true that ptrace uses TIF_SYSCALL_TRACE as a flag for whether you are
in the middle of a PTRACE_SYSCALL, so it can be confused by setting it for 
other purposes on a task that is also ptrace'd (but not with PTRACE_SYSCALL).
Until we are able to do away with these parts of the old ptrace innards,
you can't overload TIF_SYSCALL_TRACE without perturbing ptrace behavior.

The utrace code does not have this problem.  It keeps its own state bits,
so for it, TIF_SYSCALL_TRACE means exactly that the task will call
tracehook_report_syscall_* and no more.  To use TIF_SYSCALL_TRACE for
another purpose, just set it on all the tasks you like (and/or set it on
new tasks in fork.c) and add your code (tracepoints, whatever) to
tracehook_report_syscall_* alongside the calls there into utrace.  There is
exactly one place in utrace code that clears TIF_SYSCALL_TRACE, and you
just add && !global_syscall_tracing_enabled to the condition there.  You
don't need to bother clearing TIF_SYSCALL_TRACE on any task when you're
done.  If your global_syscall_tracing_enabled (or whatever it is) is
clear, each task will lazily fall into utrace at its next syscall
entry/exit and then utrace will reset TIF_SYSCALL_TRACE when it finds no
reason left to have it on.
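
To make that concrete, here is a rough sketch of the kind of hook I mean
(global_syscall_tracing_enabled and trace_syscall_entry() are hypothetical
placeholders for whatever you add, and this is untested):

	/* include/linux/tracehook.h -- sketch only, not the real code */
	static inline __must_check int tracehook_report_syscall_entry(
						struct pt_regs *regs)
	{
		if (unlikely(global_syscall_tracing_enabled))	/* hypothetical */
			trace_syscall_entry(regs);	/* your tracepoint */
		ptrace_report_syscall(regs);	/* existing call (utrace's, in its tree) */
		return 0;
	}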

I'm not really going to delve into utrace internals in this thread.  Please
raise those questions in review of the utrace patches when current code is
actually posted, where they belong.  Here I'll just mention the relevant
things that relate to the underlying issue you raised about synchronization.
As thoroughly documented, utrace_set_events() is a quick, asynchronous call
that itself makes no guarantees about how quickly a running task will start
to report the newly-requested events.  For purposes relevant here, it just
sets TIF_SYSCALL_TRACE and nothing else.  In utrace, if you want synchronous
assurance that a task misses no events you ask for, then you must first use
utrace_control (et al) to stop it and synchronize.  That is not something
that makes much sense at all for a "flip on global tracing" operation, which
is not generally especially synchronous with anything else.  If you want
best effort that a task will pick up newly-requested events Real Soon Now,
you can use utrace_control with just UTRACE_REPORT.  For purposes relevant
here, this just does set_notify_resume().  That will send an IPI if the task
is running, and then it will start noticing before it returns to user mode.
So:
	set_tsk_thread_flag(task, TIF_SYSCALL_TRACE);
	set_notify_resume(task);
is what I would expect you to do on each task if you want to quickly get it
to start hitting tracehook_report_syscall_*.  (I'm a bit dubious that there
is really any need to speed it up with set_notify_resume, but that's just me.)
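
Spelled out as a loop over all tasks, that comes to something like this
(a sketch only, untested):

	static void enable_global_syscall_tracing(void)
	{
		struct task_struct *g, *t;

		read_lock(&tasklist_lock);
		do_each_thread(g, t) {
			set_tsk_thread_flag(t, TIF_SYSCALL_TRACE);
			set_notify_resume(t);	/* the optional nudge */
		} while_each_thread(g, t);
		read_unlock(&tasklist_lock);
	}

plus the corresponding setting for new tasks in fork.c, as noted above.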

Finally, some broader points about TIF_SYSCALL_TRACE that I think
have been overlooked.  The key special feature of TIF_SYSCALL_TRACE
is that it gets you to a place where full user_regset access is
available.  Debuggers need this to read (and write) the full user
register state arbitrarily, which they also need to do user
backtraces and the like.  If you do not need user_regset to work,
then you don't need to be on this (slowest) code path.
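
For reference, fetching the full register state through user_regset looks
roughly like this (a sketch, untested; taking regsets[0] as the
general-register set is an arch convention, true at least on x86):

	#include <linux/regset.h>

	static int read_gprs(struct task_struct *task, void *buf, unsigned int size)
	{
		const struct user_regset_view *view = task_user_regset_view(task);
		const struct user_regset *regset = &view->regsets[0];

		return regset->get(task, regset, 0, size, buf, NULL);
	}

And it is only safe from places where the task is stopped or otherwise
known not to be running, which is what the slow path gives you.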

If you are only interested in reading syscall arguments and results
(or even in changing syscall results in exit tracing) then you do
not need user_regset and you do not need to take the slowest syscall
path.  (If you are doing backtraces but already rely on full kernel
stack unwinding to do it, you also do not need user_regset.)  From
anywhere inside the kernel, you can use the asm/syscall.h calls to
read syscall args, whichever entry path the task took.
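
For instance (a sketch, untested; the accessors shown use the
asm/syscall.h signatures of this era):

	#include <asm/syscall.h>

	static void show_syscall(struct task_struct *task, struct pt_regs *regs)
	{
		unsigned long args[6];
		int nr = syscall_get_nr(task, regs);

		syscall_get_arguments(task, regs, 0, 6, args);
		printk(KERN_DEBUG "syscall %d: %lx %lx %lx\n",
		       nr, args[0], args[1], args[2]);
	}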

The other mechanism to hook into every syscall entry/exit is
TIF_SYSCALL_AUDIT.  On some machines (like x86), this takes a third,
warm code path that is faster than the TIF_SYSCALL_TRACE path
(though obviously still off the fastest direct code path).  It can
be faster precisely because it doesn't need to allow for user_regset
access, nor for modification of syscall arguments in entry tracing.
For normal read-only tracing of just the actual syscall details,
it has all you need.

Unfortunately the arch code all looks like:

	if (unlikely(current->audit_context))
		audit_syscall_{entry,exit}(...);

So we need to change that to:

	if (unlikely(test_thread_flag(TIF_SYSCALL_AUDIT)))
		audit_syscall_{entry,exit}(...);

But that is pretty easy to get right, even doing it blind on arches
you can't test.  Far better than adding new asm hackery for each arch
that's almost identical to TIF_SYSCALL_TRACE or 

Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

2009-03-16 Thread Frank Ch. Eigler
Hi -


On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:

 [...]
  As far as I know, utrace supports multiple trace-engines on a process.
  Since ptrace is just one engine of utrace, you can add another engine on
  top of utrace.
  
  utrace-+-ptrace_engine---owner_process
         |
         +-systemtap_module
         |
         +-ftrace_plugin

Right.  In this way, utrace is simply a multiplexing intermediary.


  Here, Frank had posted an example of utrace-ftrace engine.
  http://lkml.org/lkml/2009/1/27/294
  
  And here is his latest patch (which seems to support syscall tracing...)
  http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
  
 
 Reminder: we are looking at system-wide tracing here. Here are some
 comments about the current utrace implementation.
 
 Looking at include/linux/utrace.h from the tree
 
 17  * A tracing engine starts by calling utrace_attach_task() or
 18  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
 19  * (struct utrace_engine_ops), and some associated data.  This produces a
 20  * struct utrace_engine, which is the handle used for all other
 21  * operations.  An attached engine has its ops vector, its data, and an
 22  * event mask controlled by utrace_set_events().
 
 So if the system has, say, 3000 threads, then we have 3000 struct
 utrace_engine instances created? I wonder what effect this could have on
 cachelines if this is used to trace hot paths like system call
 entry/exit. Have you benchmarked this kind of scenario under tbench?

It has not been a problem, since utrace_engines are designed to be
lightweight.  Starting or stopping a systemtap script of the form

probe process.syscall {}

appears to have no noticeable impact on a tbench suite.


 24  * For each event bit that is set, that engine will get the
 25  * appropriate ops->report_*() callback when the event occurs.  The
 26  * struct utrace_engine_ops need not provide callbacks for an event
 27  * unless the engine sets one of the associated event bits.
 
 Looking at utrace_set_events(), we seem to be limited to 32 events on
 32-bit architectures because it uses a bitmask? Isn't that a bit small?

There are only a few types of thread events that involve different
classes of treatment, or different degrees of freedom in terms of
interference with the uninstrumented fast path of the threads.

For example, it does not make sense to have different flag bits for
different system calls, since choosing to trace *any* system call
involves taking the thread off of the fast path with the TIF_ flag.
Once it's off the fast path, it doesn't matter whether the utrace core
or some client performs syscall discrimination, so it is left to the
client.
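
For reference, attaching an engine that asks only for syscall-entry
reports looks roughly like this (a sketch against the utrace API in the
tree above -- callback signatures track that tree, and this is untested):

	static u32 my_report_syscall_entry(u32 action,
					   struct utrace_engine *engine,
					   struct task_struct *task,
					   struct pt_regs *regs)
	{
		/* per-syscall discrimination is done here, by the client */
		return UTRACE_RESUME;
	}

	static const struct utrace_engine_ops my_ops = {
		.report_syscall_entry = my_report_syscall_entry,
	};

	/* attach to one task and enable just the syscall-entry event */
	engine = utrace_attach_task(task, UTRACE_ATTACH_CREATE, &my_ops, NULL);
	if (!IS_ERR(engine))
		utrace_set_events(task, engine, UTRACE_EVENT(SYSCALL_ENTRY));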


 682 /**
 683  * utrace_set_events_pid - choose which event reports a tracing engine gets
 684  * @pid: thread to affect
 685  * @engine: attached engine to affect
 686  * @eventmask:  new event mask
 687  *
 688  * This is the same as utrace_set_events(), but takes a struct pid
 689  * pointer rather than a struct task_struct pointer.  The caller must
 690  * hold a ref on @pid, but does not need to worry about the task
 691  * staying valid.  If it's been reaped so that @pid points nowhere,
 692  * then this call returns -%ESRCH.
 
 
 Comments like "but does not need to worry about the task staying valid"
 do not make me feel safe and comfortable at all. Could you explain
 how you can assume that dereferencing an invalid pointer will return
 NULL?

(We're doing a final round of internal (pre-LKML) reviews of the
utrace implementation right now on utrace-devel@redhat.com, where such
comments get fastest attention from the experts.)

For this particular issue, the utrace documentation file explains the
liveness rules for the various pointers that can be fed to or received
from utrace functions.  This is not about feeling safe, it's about
what the mechanism is deliberately designed to permit.


 About utrace_attach_task():
 
 244 if (unlikely(target->flags & PF_KTHREAD))
 245 /*
 246  * Silly kernel, utrace is for users!
 247  */
 248 return ERR_PTR(-EPERM);
 
 So we cannot trace kernel threads?

I'm not quite sure about all the reasons for this, but I believe that
kernel threads don't tend to engage in job control / signal /
system-call activities the same way as normal user threads do.


 118 /*
 119  * Called without locks, when we might be the first utrace engine to attach.
 120  * If this is a newborn thread and we are not the creator, we have to wait
 121  * for it.  The creator gets the first chance to attach.  The PF_STARTING
 122  * flag is cleared after its report_clone hook has had a chance to run.
 123  */
 124 static inline int utrace_attach_delay(struct task_struct *target)
 125 {

Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

2009-03-16 Thread Frederic Weisbecker
On Mon, Mar 16, 2009 at 06:18:00PM -0400, Frank Ch. Eigler wrote:
 Hi -
 
 
 On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
 
  [...]
   As far as I know, utrace supports multiple trace-engines on a process.
   Since ptrace is just one engine of utrace, you can add another engine on
   top of utrace.
   
   utrace-+-ptrace_engine---owner_process
          |
          +-systemtap_module
          |
          +-ftrace_plugin
 
 Right.  In this way, utrace is simply a multiplexing intermediary.
 
 
   Here, Frank had posted an example of utrace-ftrace engine.
   http://lkml.org/lkml/2009/1/27/294
   
   And here is his latest patch (which seems to support syscall
   tracing...)
   http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
   
  
  Reminder: we are looking at system-wide tracing here. Here are some
  comments about the current utrace implementation.
  
  Looking at include/linux/utrace.h from the tree
  
  17  * A tracing engine starts by calling utrace_attach_task() or
  18  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
  19  * (struct utrace_engine_ops), and some associated data.  This produces a
  20  * struct utrace_engine, which is the handle used for all other
  21  * operations.  An attached engine has its ops vector, its data, and an
  22  * event mask controlled by utrace_set_events().
  
  So if the system has, say, 3000 threads, then we have 3000 struct
  utrace_engine instances created? I wonder what effect this could have on
  cachelines if this is used to trace hot paths like system call
  entry/exit. Have you benchmarked this kind of scenario under tbench?
 
 It has not been a problem, since utrace_engines are designed to be
 lightweight.  Starting or stopping a systemtap script of the form
 
 probe process.syscall {}
 
 appears to have no noticeable impact on a tbench suite.
 
 
  24  * For each event bit that is set, that engine will get the
  25  * appropriate ops->report_*() callback when the event occurs.  The
  26  * struct utrace_engine_ops need not provide callbacks for an event
  27  * unless the engine sets one of the associated event bits.
  
  Looking at utrace_set_events(), we seem to be limited to 32 events on
  32-bit architectures because it uses a bitmask? Isn't that a bit small?
 
 There are only a few types of thread events that involve different
 classes of treatment, or different degrees of freedom in terms of
 interference with the uninstrumented fast path of the threads.
 
 For example, it does not make sense to have different flag bits for
 different system calls, since choosing to trace *any* system call
 involves taking the thread off of the fast path with the TIF_ flag.
 Once it's off the fast path, it doesn't matter whether the utrace core
 or some client performs syscall discrimination, so it is left to the
 client.
 
 
  682 /**
  683  * utrace_set_events_pid - choose which event reports a tracing engine gets
  684  * @pid: thread to affect
  685  * @engine: attached engine to affect
  686  * @eventmask:  new event mask
  687  *
  688  * This is the same as utrace_set_events(), but takes a struct pid
  689  * pointer rather than a struct task_struct pointer.  The caller must
  690  * hold a ref on @pid, but does not need to worry about the task
  691  * staying valid.  If it's been reaped so that @pid points nowhere,
  692  * then this call returns -%ESRCH.
  
  
  Comments like "but does not need to worry about the task staying valid"
  do not make me feel safe and comfortable at all. Could you explain
  how you can assume that dereferencing an invalid pointer will return
  NULL?
 
 (We're doing a final round of internal (pre-LKML) reviews of the
 utrace implementation right now on utrace-devel@redhat.com, where such
 comments get fastest attention from the experts.)
 
 For this particular issue, the utrace documentation file explains the
 liveness rules for the various pointers that can be fed to or received
 from utrace functions.  This is not about feeling safe, it's about
 what the mechanism is deliberately designed to permit.
 
 
  About utrace_attach_task():
  
  244 if (unlikely(target->flags & PF_KTHREAD))
  245 /*
  246  * Silly kernel, utrace is for users!
  247  */
  248 return ERR_PTR(-EPERM);
  
  So we cannot trace kernel threads?
 
 I'm not quite sure about all the reasons for this, but I believe that
 kernel threads don't tend to engage in job control / signal /
 system-call activities the same way as normal user threads do.
 


Some of them use syscalls, but without a user/kernel switch, so they are
not traceable by hooking syscall entry/exit or by using tracehooks.
Tracing those would require specific hooks on the sys_* functions themselves.
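
One way to get such a hook, as a rough sketch (hypothetical and untested):
a kprobe placed directly on a sys_* symbol, so that direct in-kernel
callers are caught too:

	#include <linux/kprobes.h>

	static int my_pre_handler(struct kprobe *p, struct pt_regs *regs)
	{
		if (current->flags & PF_KTHREAD)
			printk(KERN_DEBUG "kthread %s entered sys_open\n",
			       current->comm);
		return 0;
	}

	static struct kprobe my_kp = {
		.symbol_name	= "sys_open",
		.pre_handler	= my_pre_handler,
	};

	/* register_kprobe(&my_kp) from module init; unregister_kprobe() on exit */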

So this check 

Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

2009-03-16 Thread Oleg Nesterov
On 03/16, Mathieu Desnoyers wrote:

 utrace_add_engine()
   set_notify_resume(target);

 ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
 that it is set asynchronously with the execution of the target thread
 (as I do with my TIF_KERNEL_TRACE thread flag).

 However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in the entry_64.S

 int_signal:
 and
 retint_signal:

 code paths. However, if there is no syscall tracing to do upon syscall
 entry, the thread flags are not re-read at syscall exit and you will
 miss the syscall exit returning from your target thread if this thread
 was blocked while you set its TIF_NOTIFY_RESUME.

Afaics, TIF_NOTIFY_RESUME is not needed to trace syscall entry/exit.
If an engine wants syscall tracing, utrace_set_events(UTRACE_SYSCALL_xxx)
sets TIF_SYSCALL_TRACE, and syscall_trace_enter/syscall_trace_leave call
tracehook_report_syscall_xxx().
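
For reference, the x86 logic in question looks roughly like this
(a simplified sketch along the lines of arch/x86/kernel/ptrace.c of this
era, with the audit and single-step details trimmed):

	asmregparm long syscall_trace_enter(struct pt_regs *regs)
	{
		long ret = 0;

		if (test_thread_flag(TIF_SYSCALL_TRACE) &&
		    tracehook_report_syscall_entry(regs))
			ret = -1L;	/* engine asked to abort the syscall */

		return ret ?: regs->orig_ax;
	}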

Oleg.