Re: global tracing

2008-08-06 Thread Petr Tesarik
On Sun, 2008-08-03 at 19:50 -0700, Roland McGrath wrote:
 [...]
For global tracing, those checks would be:
 
   if ((current->utrace_flags | utrace_global_flags) & mask) slow path;
 
The cost is now two or three instructions with one load.  It would
increase to four or five instructions with two loads.  By and large,
these checks are already in places that take a lot of locks and so
forth, so this addition seems pretty tiny.  It's certainly no worse
than adding a marker (in the current markers implementation), and
probably usually far better, since it combines with the existing
utrace check.

If you really want to avoid it, there is a way:

  * Create another global variable utrace_possible_flags. Each bit
is set only if there is either a global tracer for the event,
or at least one tracer in the system (keep a global counter).
  * Always check utrace_possible_flags first, and if it is set
(thus requesting the slow path anyway), only then check the
per-thread and global flags.
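
A minimal userspace sketch of the two bullets above, not kernel code. The event numbering, the per-event counter array, and every helper name here are assumptions made for illustration; only utrace_possible_flags, utrace_global_flags and utrace_flags come from the discussion. It is single-threaded and ignores locking.

```c
#include <stdbool.h>

#define NR_EVENTS 3                      /* hypothetical number of event types */

struct task {
	unsigned long utrace_flags;      /* per-thread event mask */
};

static unsigned long utrace_global_flags;    /* events wanted by global tracers */
static unsigned long utrace_possible_flags;  /* events wanted by anyone at all */
static unsigned int tracer_count[NR_EVENTS]; /* the "global counter", one per event */

static void tracer_added(int ev)
{
	if (tracer_count[ev]++ == 0)
		utrace_possible_flags |= 1UL << ev;
}

static void tracer_removed(int ev)
{
	if (--tracer_count[ev] == 0)
		utrace_possible_flags &= ~(1UL << ev);
}

/* The model assumes at most one per-task tracer per event. */
static void attach_task_tracer(struct task *t, int ev)
{
	t->utrace_flags |= 1UL << ev;
	tracer_added(ev);
}

static void detach_task_tracer(struct task *t, int ev)
{
	t->utrace_flags &= ~(1UL << ev);
	tracer_removed(ev);
}

static void attach_global_tracer(int ev)
{
	utrace_global_flags |= 1UL << ev;
	tracer_added(ev);
}

/* Fast path: a single global load; true means "leave the fast path". */
static bool fast_path_check(unsigned long mask)
{
	return (utrace_possible_flags & mask) != 0;
}

/* Slow path: only now look at the per-thread and global masks. */
static bool slow_path_has_work(const struct task *t, unsigned long mask)
{
	return ((t->utrace_flags | utrace_global_flags) & mask) != 0;
}
```

In this model a thread with no tracer of its own still leaves the fast path whenever any tracer for that event exists anywhere in the system, and only then finds that slow_path_has_work() is false.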

OK, this has its cost (maintaining the counter and one extra check in
case where utrace _is_ actually used), but as long as there is no
tracing happening in the system, performance does not suffer a single
CPU cycle. It may even be a tiny bit better for architectures where
accessing a global variable is cheaper than accessing
current->utrace_flags...

I'm not 100% sold that it's worth the complexity, but we can go that way
if a nitpicker jumps up and argues.

Petr Tesarik




Re: global tracing

2008-08-06 Thread Frank Ch. Eigler
Hi -

  The alternative I considered is the nonexistence of global tracing
  support, thus no utrace_global_flags test in the syscall fast path.
 
 It will never be in the fast path.  It will always require
 TIF_SYSCALL_TRACE to be set on each thread, which means the slow path.
 [...]

OK, I must have misunderstood your original posting:

# [...]
# d. Kernel already has checks here, so almost free.
# 
#The utrace event hooks are at places where the kernel has had old
#ptrace checks forever.  The old code has fast paths that do:
#   if (current->ptrace & mask) slow path;
#Now in those same places there is:
#   if (current->utrace_flags & mask) slow path;
#So the cost of the checks is identical to what's already there.  [...]
#For global tracing, those checks would be:
#   if ((current->utrace_flags | utrace_global_flags) & mask) slow path;
# [...]


- FChE



Re: global tracing

2008-08-06 Thread Roland McGrath
  It will never be in the fast path.  It will always require
  TIF_SYSCALL_TRACE to be set on each thread, which means the slow path.
  [...]
 
 OK, I must have misunderstood your original posting:
 
 # [...]
 # d. Kernel already has checks here, so almost free.

This refers to all the other cases, where there is just a check at the time
of the event.  The syscall case is special, requiring TIF_SYSCALL_TRACE.


Thanks,
Roland



Re: global tracing

2008-08-06 Thread Roland McGrath
 Actually, this point is where I'm stuck on these weeks.
 
 If we add marker or tracepoint to trace every syscalls,
 we might have to put it in the tracehook or audit and set
 TIF_SYSCALL_TRACE for every process, or put tracepoint
 in the syscall entrance/exit asm-code and check another
 flag. Since latter adds additional flag-checking in fast-path,
 I think it is not acceptable.

I agree completely that it would be wrong to do any new arch work for this,
especially assembly hacking.  Certainly piggy-backing on the existing
TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT in some fashion is the way to go.

If you don't need complete user register access at your tracepoint, then
the audit path is an option.  I suspect you do, and so TIF_SYSCALL_TRACE
is what to use.  Then you can put tracepoints in tracehook_report_syscall_*.

It's straightforward to write a loop to set TIF_SYSCALL_TRACE on every
task.  The only wrinkle is dealing with clearing the flag correctly.  You
don't need a loop, because it can be cleared lazily by each thread when it
gets into the slow path and finds it has no reason to be there.  This is
not very hard.  It only requires adding a few lines in the utrace code to
check your global-syscall-trace flag in deciding when to clear
TIF_SYSCALL_TRACE.
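
The set-with-a-loop, clear-lazily scheme described above can be sketched in userspace C roughly as follows. This is a model under stated assumptions, not the utrace code: struct task, its two boolean fields, global_syscall_trace and all function names are hypothetical stand-ins, and locking is ignored.

```c
#include <stdbool.h>
#include <stddef.h>

struct task {
	bool tif_syscall_trace;   /* models the TIF_SYSCALL_TRACE thread flag */
	bool wants_syscall_trace; /* models "a per-task tracer wants syscall events" */
};

static bool global_syscall_trace; /* hypothetical global-syscall-trace flag */

/* Switching on: one loop over every task to force the slow path. */
static void global_syscall_trace_on(struct task *tasks, size_t n)
{
	global_syscall_trace = true;
	for (size_t i = 0; i < n; i++)
		tasks[i].tif_syscall_trace = true;
}

/* Switching off: no loop, stale thread flags are left behind on purpose. */
static void global_syscall_trace_off(void)
{
	global_syscall_trace = false;
}

/*
 * Syscall slow path, reached only when tif_syscall_trace is set.
 * Returns true if an event should be reported.  A thread that finds
 * it has no reason to be here clears its own flag.
 */
static bool syscall_slow_path(struct task *t)
{
	if (t->wants_syscall_trace || global_syscall_trace)
		return true;
	t->tif_syscall_trace = false; /* lazy clear */
	return false;
}
```

After global_syscall_trace_off(), each thread pays for one more trip through the slow path and then returns to the fast path on its own.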

This would be unlike a plain tracepoint only in that you have to make this
explicit call to switch it on and off.  (Maybe this could be rolled into
the tracepoint probe registration API.)

I'm not at all arguing against having utrace global tracing to provide you
this feature instead.  (I already raised the pros/cons about that generally
and that discussion can continue.)  But this is how you'd do it sensibly
with tracepoints IMHO.  (The details I just described are not much different
from what utrace global tracing would have for handling TIF_SYSCALL_TRACE.)


Thanks,
Roland



Re: global tracing

2008-08-06 Thread Roland McGrath
   * Create another global variable utrace_possible_flags. Each bit
 is set only if there is either a global tracer for the event,
 or at least one tracer in the system (keep a global counter).
   * Always check utrace_possible_flags first, and if it is set
 (thus requesting the slow path anyway), only then check the
 per-thread and global flags.

That seems like a bad trade-off.  The common case to optimize is that this
event is not going to be traced.  If someone somewhere is running strace
on their programs, my task should not go through any slow paths just
because of them.  That's a degradation from today's performance with plain
old ptrace.

The fast path having two negative tests in the common case is surely better
than what should be the fast path having a slow false positive for me, because
someone else somewhere ran strace -f sleep 99 .

If it comes down to exactly the current check being the only acceptable
cost, then the opposite direction is what makes sense to me.  That is,
have global tracing go do:
        task->utrace_flags |= global_utrace_flags;
on every task whenever a new bit is set in global_utrace_flags.  (Then
there can be some lazy fixup for stale task->utrace_flags values after
global_utrace_flags has bits cleared.  It's essentially the same plan
as for setting TIF_SYSCALL_TRACE for global syscall tracing.)
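
A rough userspace model of this direction, assuming a hypothetical engine_flags field that records what a task's own engines asked for, so that stale bits can be recomputed. The function names are invented for illustration and locking is ignored. The point it shows is that the fast-path test stays the single-load check that exists today.

```c
#include <stdbool.h>
#include <stddef.h>

struct task {
	unsigned long utrace_flags; /* the only word the fast path tests */
	unsigned long engine_flags; /* hypothetical: events wanted by per-task engines */
};

static unsigned long global_utrace_flags;

/* Setting new global bits: push them into every task. */
static void global_flags_set(struct task *tasks, size_t n, unsigned long bits)
{
	global_utrace_flags |= bits;
	for (size_t i = 0; i < n; i++)
		tasks[i].utrace_flags |= global_utrace_flags;
}

/* Clearing global bits: no loop, per-task copies go stale. */
static void global_flags_clear(unsigned long bits)
{
	global_utrace_flags &= ~bits;
}

/* Fast path: unchanged from the per-task-only check. */
static bool fast_path_check(const struct task *t, unsigned long mask)
{
	return (t->utrace_flags & mask) != 0;
}

/* Slow path: recompute the task's mask, dropping stale global bits. */
static bool slow_path(struct task *t, unsigned long mask)
{
	t->utrace_flags = t->engine_flags | global_utrace_flags;
	return (t->utrace_flags & mask) != 0;
}
```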


Thanks,
Roland



Re: global tracing

2008-08-06 Thread Masami Hiramatsu
Hi,

Roland McGrath wrote:
 Actually, this point is where I'm stuck on these weeks.

 If we add marker or tracepoint to trace every syscalls,
 we might have to put it in the tracehook or audit and set
 TIF_SYSCALL_TRACE for every process, or put tracepoint
 in the syscall entrance/exit asm-code and check another
 flag. Since latter adds additional flag-checking in fast-path,
 I think it is not acceptable.
 
 I agree completely that it would be wrong to do any new arch work for this,
 especially assembly hacking.  Certainly piggy-backing on the existing
 TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT in some fashion is the way to go.
 
 If you don't need complete user register access at your tracepoint, then
 the audit path is an option.  I suspect you do, and so TIF_SYSCALL_TRACE
 is what to use.  Then you can put tracepoints in tracehook_report_syscall_*.

Actually, I tried that and found it is not simple to hook the syscall audit
path.  It seems that the audit flag is not set/cleared synchronously on
processes with an audit_context.
I think tracehook is a better and simpler way to do that. But there is still
some audit-related problem when I set the TIF_SYSCALL_TRACE flag on every
process, and I'm investigating that.

Maybe I need to improve syscall audit.

 It's straightforward to write a loop to set TIF_SYSCALL_TRACE on every
 task.  The only wrinkle is dealing with clearing the flag correctly.  You
 don't need a loop, because it can be cleared lazily by each thread when it
 gets into the slow path and finds it has no reason to be there.  This is
 not very hard.  It only requires adding a few lines in the utrace code to
 check your global-syscall-trace flag in deciding when to clear
 TIF_SYSCALL_TRACE.

That's a good idea. I'll check that.

 This would be unlike a plain tracepoint only in that you have to make this
 explicit call to switch it on and off.  (Maybe this could be rolled into
 the tracepoint probe registration API.)

Sure, even though, we can enable it when initializing tracepoint-marker
conversion module.

 I'm not at all arguing against having utrace global tracing to provide you
 this feature instead.  (I already raised the pros/cons about that generally
 and that discussion can continue.)  But this is how you'd do it sensibly
 with tracepoints IMHO.  (The details I just described are not much different
 from what utrace global tracing would have for handling TIF_SYSCALL_TRACE.)

I agree with that.
I think if I can set TIF_SYSCALL_TRACE on each process safely, it can work
with utrace global tracing too.
In that case, I can move to utrace global tracing feature.

Thank you,

 
 
 Thanks,
 Roland

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [EMAIL PROTECTED]



Re: global tracing

2008-08-05 Thread Roland McGrath
 Answer to (a) is surely yes, but...

Since you're sure, what would you say to convince a skeptic?

 ... wouldn't it be better to first push the base utrace upstream and add
 this as a feature thereafter?

I think this is probably how it will go anyway.  I want to get a plan on
the table now.  The consensus here about the details will inform my ideas
for implementing it.  I want to think it through enough to see how the
innards would be and figure out if it will entail significant rearrangement
of the utrace implementation.


Thanks,
Roland



Re: global tracing

2008-08-05 Thread Roland McGrath
 This kind of interface would be nice to have in utrace only if it were
 significantly cheaper than doing what we do now: potentially attaching
 utrace-engines to each thread -- or (in the near future, systemtap
 bug# 6445) to subtrees of the process hierarchy.  

The overhead (memory + setup/teardown cost) is per-thread X per-tracer.
We'd have to measure what it is in practice.  I'd guess the memory won't be
an issue unless you were really milking the system for performance.  I'd
guess the first issue will be big chunks of slow at script setup/teardown
when there are lots of threads on the system.

The main feature of global tracing is that it avoids this overhead.
It goes without saying that you could always just trace every thread
individually and produce the same result at high level.  

The other feature is its simplicity.  The baseline work to do global
tracing via by-thread is not entirely trivial, as David will attest.

For subtrees, there wouldn't any time soon be an option other than
global or by-each-thread.  In the long run, there might be some new
optimizations for using utrace to treat many threads all the same.
Whatever comes along to benefit that case, I don't think it will
constitute an argument either for or against global tracing.

 (An extra chunk of work per clone() may well be cheaper than extra work
 at every system call.)

I assume what you mean here is for global syscall tracing.  There is no
such trade-off.  With vanilla utrace, you always do both.  With global
tracing, you still always do the latter.

  Systemtap doesn't currently change outcomes in a callback, so reason
  c. doesn't apply much.  [...]
 
 Actually, this is the main reasons that utrace-level support sounds
 interesting to me.  We have had requests for exposing some thread
 control primitives to systemtap probe handlers - to block/resume, send
 signals, that sort of stuff.  *If* going through utrace (as opposed to
 a separate API) would make this smoother and compose better (should
 e.g.  there be different systemtap scripts fighting over the threads),
 that could be worthwhile.

We'd have to discuss concrete scenarios to get entirely clear on this.
But off hand those sound like things that make sense to do with vanilla
utrace on individual threads.  i.e. blocking a thread implies that you
maintain per-thread state, as opposed to just a per-event consideration
of the thread on hand.  (Also, for blocking specifically, utrace is the
only kosher way to go about it--anything else fails badly at playing
nicely with other tracing and debugging facilities.)  

So to me this says you just need whatever global tracing facility you're
using to have a good place to make utrace setup calls when you discover
you want to do this sort of thing.  That's a feature that utrace global
tracing clearly has.  But given a particular scenario and a given other
means of getting its necessary event hooks, that other means might well
be fine in this regard too.  To know, we'd have to get concrete about
each of the specific tracepoints you would use instead.


Thanks,
Roland



Re: global tracing

2008-08-05 Thread Frank Ch. Eigler
Hi -

On Tue, Aug 05, 2008 at 03:32:42PM -0700, Roland McGrath wrote:
  This kind of interface would be nice to have in utrace only if it were
  significantly cheaper than doing what we do now: potentially attaching
  utrace-engines to each thread -- or (in the near future, systemtap
  bug# 6445) to subtrees of the process hierarchy.  
 
 The overhead (memory + setup/teardown cost) is per-thread X per-tracer.
 We'd have to measure what it is in practice.  [...]

Right.


 The other feature is its simplicity.  The baseline work to do global
 tracing via by-thread is not entirely trivial, as David will attest.

Right, though once it's done, it's done ...

 For subtrees, there wouldn't any time soon be an option other than
 global or by-each-thread.  [...]

... and is necessary for this part anyway.


  (An extra chunk of work per clone() may well be cheaper than extra work
  at every system call.)
 
 I assume what you mean here is for global syscall tracing.  There is no
 such trade-off.  With vanilla utrace, you always do both.  With global
 tracing, you still always do the latter.

The alternative I considered is the nonexistence of global tracing
support, thus no utrace_global_flags test in the syscall fast path.


   Systemtap doesn't currently change outcomes in a callback, so reason
   c. doesn't apply much.  [...]
  
  Actually, this is the main reasons that utrace-level support sounds
  interesting to me.  We have had requests for exposing some thread
  control primitives to systemtap probe handlers - to block/resume, send
  signals, that sort of stuff.  [...]

 We'd have to discuss concrete scenarios to get entirely clear on this.
 [...]

Well, it would be desirable to have some facility to block/resume and
send signals to threads.  It would be desirable for this to be
available not only to utrace probes, and not only to target the
currently utrace-hooked thread, but to enqueue the command to an
arbitrary one.


- FChE



Re: global tracing

2008-08-04 Thread David Smith
Roland McGrath wrote:
 We've mentioned global tracing.  I think it's time now to discuss it
 thoroughly and decide what we do or don't want to do.

...

 2. Why do we want utrace global tracing?

From a systemtap point of view, we'd certainly use global tracing.

...

 3. What would it look like?
 
 Global engines' callbacks all run after all per-task engine callbacks.
 (This could change in future.)

I guess in a perfect world callbacks would still be called in the
order they were attached.  But, if calling the global callbacks last
makes things easier, I think systemtap could handle it.

 I had originally planned to rule out SYSCALL events for global tracing.
 The reason is that this is not like other event checks where a simple
 flag gets checked cheaply.  Instead, it requires setting the low-level
 TIF_SYSCALL_TRACE on a thread, which makes it take a far slower path on
 system call entry and exit, and has a big impact on performance just
 from that alone.  Global tracing has to set this individually on every
 thread, and then pay that big overhead across the board.

If we had utrace memory map tracing (I believe it is on your TODO list),
systemtap wouldn't use global (or even per-thread) SYSCALL events as much.

...

 I'd kind of prefer to exclude REAP events for global tracing.

Currently systemtap only uses DEATH events, so I don't have much of an
opinion there.

...

 4. So, what's the plan?
 
 I need folks who might use global tracing to answer these questions:
 
a. Do we want it?

Yes.  Systemtap already does global tracing now, in a manner similar
to crash-suspend.c.  The code looks for global CLONE, EXEC, and DEATH
events, so systemtap knows when threads come and go.  Once systemtap
finds a process the user has told us he's interested in, it attaches
some additional per-thread engine(s).

In the future, Frank has mentioned trying to do global memory map
tracing, which would require global syscall tracing (or future global
memory map tracing).

b. Do we want it right now?

Yes.  If you need beta testers, let me know.

c. What justifies doing it in utrace (vs leaving it purely to
   tracepoints et al), to placate upstream critics?

 Please don't say, "That would be nice; your reasons sound good."
 That just does not help at all.  The reasons in #2 above are ones I can
 think of, but I'm not arguing for them or for the feature.  If you want
 the feature, *you* will be justifying it to the upstream critics.  Let's
 here be as skeptical about adding the new complexity, before we decide on
 doing it, as our unsympathetic reviewers will be.

Global tracing would be *really* nice; your reasons sound *great*.
How's that? :-)

Seriously, your reasons a. (Event vocabulary clearly aligned with
utrace events), b. (Coordinated with per-task utrace callbacks), and
d. (Kernel already has checks here, so almost free) apply most
clearly to systemtap.  Systemtap doesn't currently change outcomes in a
callback, so reason c. doesn't apply much.  Systemtap is interested in
performance impacts and the a./b. advantages seem quite obvious to me.
Avoiding the complexities of manually attaching/detaching to every
thread in the system seems important also.

-- 
David Smith
[EMAIL PROTECTED]
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)



Re: global tracing

2008-08-04 Thread Roland McGrath
  2. Why do we want utrace global tracing?
 
 From a systemtap point of view, we'd certainly use global tracing.

You're using tracepoints/markers too.  (You'll use anything, you minx.)
What we need is reasons for this to be a utrace feature.

 Global tracing would be *really* nice; your reasons sound *great*.
 How's that? :-)

Cursing me with loud praise!

 Seriously, your reasons a. (Event vocabulary clearly aligned with
 utrace events), b. (Coordinated with per-task utrace callbacks), and
 d. (Kernel already has checks here, so almost free) apply most
 clearly to systemtap.  Systemtap doesn't currently change outcomes in a
 callback, so reason c. doesn't apply much.  Systemtap is interested in
 performance impacts and the a./b. advantages seem quite obvious to me.

Ok.  Since a. is basically aesthetic, I think what would be concrete here
is to see how you'd use it in practice such that b. matters to you.

 Avoiding the complexities of manually attaching/detaching to every
 thread in the system seems important also.

That's a reason to have some kind of global tracing as opposed to none.
Sold.  It's not a reason to have utrace global tracing instead of only
tracepoints and markers.


Thanks,
Roland



global tracing

2008-08-03 Thread Roland McGrath
We've mentioned global tracing.  I think it's time now to discuss it
thoroughly and decide what we do or don't want to do.

1. So, what is global tracing?

It's an interface to trace the events that a utrace engine can trace,
but generically across the whole system without attaching to
specific threads.


2. Why do we want utrace global tracing?

(I won't go into what the ability to trace things is good for in the
abstract, I assume we're all sold on that.)  This has been an item on
the utrace TODO list for a long time, since before we had any other
plan for system-wide hooks in the kernel.  Now we have tracepoints and
markers (et al).  

So the question here is, why do we want to do this in utrace?  In each
place that utrace has a tracing hook (now all in linux/tracehook.h),
you could easily add a tracepoint/marker.  So what does utrace global
tracing offer over using tracepoints?

Here are my thoughts on this.  I'm not 100% sold that these justify it.
There is a clear argument not to add another feature that provides a
second way to do what you can already do with tracepoints.

a. Event vocabulary clearly aligned with utrace events.

   The identifiers for and details of all the places you can get events
   and what information is on hand match the per-task utrace interface.
   This makes it very straightforward to compose higher-level interfaces
   that describe events uniformly, whether they are tracked via the
   global or per-task mechanism.

   This is quite a weak argument.  It would never be difficult to map
   the two different mechanisms to a uniform higher-level event vocabulary.

b. Coordinated with per-task utrace callbacks.

   If system-wide hooks are an independent mechanism, it won't be
   obvious (or necessarily stay reliable) whether the tracepoint is
   before or after the utrace callbacks, etc.  

   As part of a unified interface, that will be well-specified.  (If we
   grow some complex callback order priority feature, the global hooks
   might have detailed options for where to land in the ordering with
   various per-task callbacks.)  Moreover, it's natural for a global
   tracing callback to get informed directly about what other utrace
   engines are doing.  e.g., a system-wide catch-all hook for debugging
   stray crashes can tell if an active debugger is doing something to
   the particular task and get out of the way.

c. Callbacks can change outcomes.

   In utrace, the syscall and signal callbacks can affect what the task
   actually does in a well-specified way.  Tracepoints just report events.

   For syscalls, off hand I can only see wanting this for fault injection.
   There might be other sensible uses.

   For signals, this might be crucial to doing the crash-catcher of
   last resort sort of thing (at least, to do it more efficiently than
   giving every task in the system a utrace engine just for that).  What
   I'd expect this to do is catch SIGNAL_CORE with a global tracing
   callback that attaches a new per-task engine, ignores and pushes back
   the signal (like crash-suspend does), and the new engine UTRACE_STOPs
   until some user-level crash handling stuff wakes up and takes over.

d. Kernel already has checks here, so almost free.

   The utrace event hooks are at places where the kernel has had old
   ptrace checks forever.  The old code has fast paths that do:

        if (current->ptrace & mask) slow path;

   Now in those same places there is:

        if (current->utrace_flags & mask) slow path;

   So the cost of the checks is identical to what's already there.  This
   is the main thing I've expected to soothe the upstream performance
   nit-pickers about utrace: zero new overhead if you ain't usin' it.

   For global tracing, those checks would be:

        if ((current->utrace_flags | utrace_global_flags) & mask) slow path;

   The cost is now two or three instructions with one load.  It would
   increase to four or five instructions with two loads.  By and large,
   these checks are already in places that take a lot of locks and so
   forth, so this addition seems pretty tiny.  It's certainly no worse
   than adding a marker (in the current markers implementation), and
   probably usually far better, since it combines with the existing
   utrace check.
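
For concreteness, the two checks can be written out as a small userspace model. This is only a sketch: struct task stands in for the real task structure, the EV_* bits are hypothetical, and the function names are invented here.

```c
#include <stdbool.h>

/* Hypothetical event bits, for illustration only. */
#define EV_CLONE (1UL << 0)
#define EV_EXEC  (1UL << 1)
#define EV_DEATH (1UL << 2)

/* Stand-in for the per-task state; the real structure is much larger. */
struct task {
	unsigned long utrace_flags;
};

/* Stand-in for the proposed system-wide mask. */
static unsigned long utrace_global_flags;

/* Current check: one load, one AND, one branch. */
static bool needs_slow_path(const struct task *t, unsigned long mask)
{
	return (t->utrace_flags & mask) != 0;
}

/* Check with global tracing: a second load and an OR before the AND. */
static bool needs_slow_path_global(const struct task *t, unsigned long mask)
{
	return ((t->utrace_flags | utrace_global_flags) & mask) != 0;
}
```

With both masks zero, the two functions agree and the task stays on the fast path; they differ only once a bit is set in utrace_global_flags.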


3. What would it look like?

Global tracing would use the same struct utrace_engine_ops, sharing all
the same signatures for the callbacks.  There would be a call to
register a global tracing engine, which would give you an engine
represented by the same struct utrace_attached_engine type (so this
pointer is passed to your callbacks).

All the calls to administer global tracing engines would be separate
from the existing per-task utrace calls, though we overload the same
types and use the same callbacks.  Perhaps only register/unregister
calls, though maybe also a set_events to change your event mask after
the fact.  I'm leaving aside the asynchronous detach details for now.

Callbacks would

Re: asynchronous detach, global tracing

2008-07-31 Thread Frank Ch. Eigler
Roland McGrath [EMAIL PROTECTED] writes:

 [...]
 What the utrace interface has always said about this is, "So don't do
 that."  [...]
 What I overlooked is that not just your data structures, but your
 callbacks too might be going away, i.e. unloading the kernel module.

I don't think the module-unloading case is so special.  If there exist
races involving utrace detach, then they will affect long-lasting
modules too that may want to do some utracing then some other stuff,
then perhaps return to utracing again.  In this scenario, the data too
is volatile or could be repurposed between utrace sessions.  Such a
module would need to know positively when no further callbacks will
arrive.

 [...]  For global engines' detach, one option is to offer no help
 with your own data structures but to solve the module-unload problem
 using the module refcount.  [...]

If having a per-cpu counter vector is sufficiently low weight for
utrace to update it around every callback, how about letting a utrace
engine specify an (optional?) percpu-integer vector?  Then, the utrace
client could use a similar synchronization algorithm as that of
module/refcount unloading to assure itself of a complete and final
utrace detach.  It could even opt to reuse the counters between
engines, or between utracing sessions, if it knows that its data/code
lifetimes can work with that.
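
One way to picture the counter-vector idea is the userspace sketch below, which uses C11 atomics in place of real per-cpu variables. Everything in it is an assumption for illustration: struct engine, NR_CPUS, the detached flag and the function names are not part of any utrace interface, and a real implementation would have to deal with preemption and CPU migration.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical */

struct engine {
	atomic_bool detached;           /* set once the client asks to detach */
	atomic_long in_flight[NR_CPUS]; /* callbacks currently running, per CPU */
	void (*callback)(void *data);
	void *data;
};

/* Example callback used for demonstration: counts invocations. */
static void count_hit(void *data)
{
	++*(int *)data;
}

/* Tracing core: bracket every callback with the counter update. */
static bool run_callback(struct engine *e, int cpu)
{
	atomic_fetch_add(&e->in_flight[cpu], 1);
	bool run = !atomic_load(&e->detached);
	if (run)
		e->callback(e->data);
	atomic_fetch_sub(&e->in_flight[cpu], 1);
	return run;
}

/* Client: stop new callbacks from starting. */
static void begin_detach(struct engine *e)
{
	atomic_store(&e->detached, true);
}

/* Client: true once no callback can still be running or start later. */
static bool detach_complete(struct engine *e)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += atomic_load(&e->in_flight[cpu]);
	return atomic_load(&e->detached) && sum == 0;
}
```

A client would call begin_detach() and then poll or sleep until detach_complete() returns true before freeing its data or unloading its code.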

- FChE