Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

2009-03-16 Thread Frank Ch. Eigler
Hi -


On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:

 [...]
  As far as I know, utrace supports multiple trace-engines on a process.
  Since ptrace is just an engine of utrace, you can add another engine on 
  utrace.
  
  utrace-+-ptrace_engine---owner_process
 |
 +-systemtap_module
 |
 +-ftrace_plugin

Right.  In this way, utrace is simply a multiplexing intermediary.


  Here, Frank had posted an example of utrace-ftrace engine.
  http://lkml.org/lkml/2009/1/27/294
  
  And here is the latest his patch(which seems to support syscall tracing...)
  http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
  
 
 Reminder : we are looking at system-wide tracing here. Here are some
 comments about the current utrace implementation.
 
 Looking at include/linux/utrace.h from the tree
 
 17  * A tracing engine starts by calling utrace_attach_task() or
 18  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
 19  * (struct utrace_engine_ops), and some associated data.  This produces a
 20  * struct utrace_engine, which is the handle used for all other
 21  * operations.  An attached engine has its ops vector, its data, and an
 22  * event mask controlled by utrace_set_events().
 
 So if the system has, say 3000 threads, then we have 3000 struct
 utrace_engine created ? I wonder what effet this could have one
 cachelines if this is used to trace hot paths like system call
 entry/exit. Have you benchmarked this kind of scenario under tbench ?

It has not been a problem, since utrace_engines are designed to be
lightweight.  Starting or stopping a systemtap script of the form

probe process.syscall {}

appears to have no noticable impact on a tbench suite.


 24  * For each event bit that is set, that engine will get the
 25  * appropriate ops-report_*() callback when the event occurs.  The
 26  * struct utrace_engine_ops need not provide callbacks for an event
 27  * unless the engine sets one of the associated event bits.
 
 Looking at utrace_set_events(), we seem to be limited to 32 events on a
 32-bits architectures because it uses a bitmask ? Isn't it a bit small?

There are only a few types of thread events that involve different
classes of treatment, or different degrees of freedom in terms of
interference with the uninstrumented fast path of the threads.

For example, it does not make sense to have different flag bits for
different system calls, since choosing to trace *any* system call
involves taking the thread off of the fast path with the TIF_ flag.
Once it's off the fast path, it doesn't matter whether the utrace core
or some client performs syscall discrimination, so it is left to the
client.


 682 /**
 683  * utrace_set_events_pid - choose which event reports a tracing engine 
 gets
 684  * @pid:thread to affect
 685  * @engine: attached engine to affect
 686  * @eventmask:  new event mask
 687  *
 688  * This is the same as utrace_set_events(), but takes a struct pid
 689  * pointer rather than a struct task_struct pointer.  The caller must
 690  * hold a ref on @pid, but does not need to worry about the task
 691  * staying valid.  If it's been reaped so that @pid points nowhere,
 692  * then this call returns -%ESRCH.
 
 
 Comments like but does not need to worry about the task staying valid
 does not make me feel safe and comfortable at all, could you explain
 how you can assume that derefencing an invalid pointer will return
 NULL ?

(We're doing a final round of internal (pre-LKML) reviews of the
utrace implementation right now on utrace-devel@redhat.com, where such
comments get fastest attention from the experts.)

For this particular issue, the utrace documentation file explains the
liveness rules for the various pointers that can be fed to or received
from utrace functions.  This is not about feeling safe, it's about
what the mechanism is deliberately designed to permit.


 About the utrace_attach_task() :
 
 244 if (unlikely(target-flags  PF_KTHREAD))
 245 /*
 246  * Silly kernel, utrace is for users!
 247  */
 248 return ERR_PTR(-EPERM);
 
 So we cannot trace kernel threads ?

I'm not quite sure about all the reasons for this, but I believe that
kernel threads don't tend to engage in job control / signal /
system-call activities the same way as normal user threads do.


 118 /*
 119  * Called without locks, when we might be the first utrace engine to 
 attach.
 120  * If this is a newborn thread and we are not the creator, we have to wait
 121  * for it.  The creator gets the first chance to attach.  The PF_STARTING
 122  * flag is cleared after its report_clone hook has had a chance to run.
 123  */
 124 static inline int utrace_attach_delay(struct task_struct *target)
 125 

Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

2009-03-16 Thread Frederic Weisbecker
On Mon, Mar 16, 2009 at 06:18:00PM -0400, Frank Ch. Eigler wrote:
 Hi -
 
 
 On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
 
  [...]
   As far as I know, utrace supports multiple trace-engines on a process.
   Since ptrace is just an engine of utrace, you can add another engine on 
   utrace.
   
   utrace-+-ptrace_engine---owner_process
  |
  +-systemtap_module
  |
  +-ftrace_plugin
 
 Right.  In this way, utrace is simply a multiplexing intermediary.
 
 
   Here, Frank had posted an example of utrace-ftrace engine.
   http://lkml.org/lkml/2009/1/27/294
   
   And here is the latest his patch(which seems to support syscall 
   tracing...)
   http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
   
  
  Reminder : we are looking at system-wide tracing here. Here are some
  comments about the current utrace implementation.
  
  Looking at include/linux/utrace.h from the tree
  
  17  * A tracing engine starts by calling utrace_attach_task() or
  18  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
  19  * (struct utrace_engine_ops), and some associated data.  This produces 
  a
  20  * struct utrace_engine, which is the handle used for all other
  21  * operations.  An attached engine has its ops vector, its data, and an
  22  * event mask controlled by utrace_set_events().
  
  So if the system has, say 3000 threads, then we have 3000 struct
  utrace_engine created ? I wonder what effet this could have one
  cachelines if this is used to trace hot paths like system call
  entry/exit. Have you benchmarked this kind of scenario under tbench ?
 
 It has not been a problem, since utrace_engines are designed to be
 lightweight.  Starting or stopping a systemtap script of the form
 
 probe process.syscall {}
 
 appears to have no noticable impact on a tbench suite.
 
 
  24  * For each event bit that is set, that engine will get the
  25  * appropriate ops-report_*() callback when the event occurs.  The
  26  * struct utrace_engine_ops need not provide callbacks for an event
  27  * unless the engine sets one of the associated event bits.
  
  Looking at utrace_set_events(), we seem to be limited to 32 events on a
  32-bits architectures because it uses a bitmask ? Isn't it a bit small?
 
 There are only a few types of thread events that involve different
 classes of treatment, or different degrees of freedom in terms of
 interference with the uninstrumented fast path of the threads.
 
 For example, it does not make sense to have different flag bits for
 different system calls, since choosing to trace *any* system call
 involves taking the thread off of the fast path with the TIF_ flag.
 Once it's off the fast path, it doesn't matter whether the utrace core
 or some client performs syscall discrimination, so it is left to the
 client.
 
 
  682 /**
  683  * utrace_set_events_pid - choose which event reports a tracing engine 
  gets
  684  * @pid:thread to affect
  685  * @engine: attached engine to affect
  686  * @eventmask:  new event mask
  687  *
  688  * This is the same as utrace_set_events(), but takes a struct pid
  689  * pointer rather than a struct task_struct pointer.  The caller must
  690  * hold a ref on @pid, but does not need to worry about the task
  691  * staying valid.  If it's been reaped so that @pid points nowhere,
  692  * then this call returns -%ESRCH.
  
  
  Comments like but does not need to worry about the task staying valid
  does not make me feel safe and comfortable at all, could you explain
  how you can assume that derefencing an invalid pointer will return
  NULL ?
 
 (We're doing a final round of internal (pre-LKML) reviews of the
 utrace implementation right now on utrace-devel@redhat.com, where such
 comments get fastest attention from the experts.)
 
 For this particular issue, the utrace documentation file explains the
 liveness rules for the various pointers that can be fed to or received
 from utrace functions.  This is not about feeling safe, it's about
 what the mechanism is deliberately designed to permit.
 
 
  About the utrace_attach_task() :
  
  244 if (unlikely(target-flags  PF_KTHREAD))
  245 /*
  246  * Silly kernel, utrace is for users!
  247  */
  248 return ERR_PTR(-EPERM);
  
  So we cannot trace kernel threads ?
 
 I'm not quite sure about all the reasons for this, but I believe that
 kernel threads don't tend to engage in job control / signal /
 system-call activities the same way as normal user threads do.
 


Some of them use some syscalls, but it doesn't involve a user/kernel switch.
So it's not tracable by hooking syscall_entry/exit or using tracehooks.
It would require specific hooks on sys_* functions for that.

So this check 

Re: Q: utrace-stopped utrace_report_jctl()

2009-03-16 Thread Oleg Nesterov
On 03/15, Roland McGrath wrote:

  Then we re-do this (well, almost) check under -siglock,
 
  } else if (task_is_stopped(target)) {
  if (!(target-utrace_flags  UTRACE_EVENT(JCTL)))
  utrace-stopped = stopped = true;
  }
 
  But this is not nice. Let's suppose the task is already stopped, we do
  UTRACE_ATTACH + utrace_set_events(JCTL).

 This is exactly why utrace_set_events() sets -stopped preemptively for
 that case.

Yes, thanks. I saw this code in utrace_set_events(), but then forgot.

  REPORT(task, utrace, report, UTRACE_EVENT(JCTL),
 report_jctl, what, notify);
 
  instead.

 There is a bug, but your fix changes a key API choice.
 I've put in this fix:

 -report_jctl, was_stopped ? CLD_STOPPED : CLD_CONTINUED, what);
 +report_jctl, was_stopped ? CLD_STOPPED : CLD_CONTINUED,
 +notify ? what : 0);

 The @type argument shows the state we will be in after the callback.
 If the state changes, there will be another callback.  That's what a
 state-tracking tracer needs, e.g. to keep a little light on the screen red
 while the thread is stopped and green while it's running.

 The @notify argument shows what SIGCHLD the parent sees (if it were
 dequeuing all possible SIGCHLD postings as quickly as they come).  That's
 what an event-tracking tracer needs, e.g. to match up with what SIGCHLDs
 are expected in the parent.

I see, thanks.

  With the first patch, we call utrace_report_jctl() before we actually
  stop. do_signal_stop() can fail then, but I think this is OK, we can
  pretend that SIGCONT/SIGKILL happened after we stopped. It is not complete,
  and with this patch we always call -report_jctl with notify == 0. Just for
  discussion.

 I think I sort of understand the intent of your patch.  If we change the
 calling convention for tracehook_notify_jctl, I think we want to preserve
 the aspect that the hook decides about sending the notification.  That's
 how the ptrace quirks can be reimplemented differently later without
 changing the tracehook layer again.  Also, we certainly don't want one
 tracehook call with two different locking conditions.

Agreed, bool sig_locked is awful. But we can avoid it. The real problem
is how to figure out the correct notify argument. I'll try to think more,
but I am not sure I will find the clean solution :(

Just in case. We can introduce PF_SIGCONTED flag and change
prepare_signal(SIGCONT) and signal_wake_up(resume = 1) to set this flag.
Since the task never changes its -flags in finish_stop() path, it is safe
to do this under -siglock. This way utrace_report_jctl() can drop
TASK_STOPPED before REPORT() and then check !PF_SIGCONTED before restoring
the -state. But yes sure, this is very, very ugly.

Oleg.



Re: utrace_set_events/utrace_control death/reap checks

2009-03-16 Thread Oleg Nesterov
On 03/15, Roland McGrath wrote:

  utrace_set_events:
 
  (utrace-death  ((old_flags  ~events)  DEATH_EVENTS))
 
  (old_flags  ~events)  DEATH_EVENTS) means the caller tries to
  clear DEATH/QUIESCE. Why this is not allowed? And why this is not
  allowed _only_ when the target runs utrace_report_death()-REPORT()?

 This is specifically documented for -EALREADY, and in the DocBook section
 Interlock with final callbacks.  The idea is this:

Aha, I didn't know.

  And I don't understand why do we need utrace-death at all.
 ...
  it is only used by
  utrace_control(UTRACE_DETACH).
 ...
 that it will (or it's already happening).

 utrace_control as the synchronizing step of
 asynchronous tear-down.  If it returns 0, then report_death will not and it
 is safe to destroy data structures the callback code would use.

Yes, with your explanation above this is clear.


But can't we simplify this check a little bit?

utrace_control:

else if (unlikely(target-utrace_flags  DEATH_EVENTS) ||
   unlikely(utrace-death)) {
return -EALREADY;

can't we just do

else if (unlikely(utrace-death)) {
return -EALREADY;

I guess I missed something, but can't understand why do we need to
check -utrace_flags. We are going to call mark_engine_detached()
below which clears engine-flags, and we hold utrace-lock.

If utrace_flags  DEATH_EVENTS is true, the subsequent
utrace_report_death() must see engine-flags == 0 (it takes
utrace-lock before REPORT_CALLBACKS), so it won't call any
callback. Yes, it can play with engine itself, but this should
be safe because struct utrace has a reference to attached
engine.

No?

Oleg.



Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

2009-03-16 Thread Oleg Nesterov
On 03/16, Mathieu Desnoyers wrote:

 utrace_add_engine()
   set_notify_resume(target);

 ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
 that it is set asynchronously with the execution of the target thread
 (as I do with my TIF_KERNEL_TRACE thread flag).

 However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in
 entry_64.S

 int_signal:
 and
 retint_signal:

 code paths. However, if there is no syscall tracing to do upon syscall
 entry, the thread flags are not re-read at syscall exit and you will
 miss the syscall exit returning from your target thread if this thread
 was blocked while you set its TIF_NOTIFY_RESUME.

Afaics, TIF_NOTIFY_RESUME is not needed to trace syscall entry/exit.
If engine wants the syscall tracing, utrace_set_events(UTRACE_SYSCALL_xxx)
sets TIF_SYSCALL_TRACE. And syscall_trace_enter/syscall_trace_leave call
tracehook_report_syscall_xxx().

Oleg.