syscall tracing overheads: utrace vs. kprobes

2009-04-28 Thread Frank Ch. Eigler
Hi -

In a few contexts, it comes up as to whether it is faster to probe
process syscalls with kprobes or with something higher level such as
utrace.  (There are other hypothetical options too (per-syscall
tracepoints) that could be measured this way in the future.)  

It was time to check the intuitions about the overheads.  So, choosing
a syscall that won't get short-circuited via vdso:

% cat foo.c
#include unistd.h

int main ()
{
  unsigned c;
  for (c=0; c1000; c++)
 (void) close (1000);
}
% gcc foo.c

Now we compare these scenarios:

# stap -e 'probe never {}' -t --vp 1 -c a.out

Here, no actual probing occurs so we get a measurement of the plain
uninstrumented run time of ten million close(2)s.

# stap -e 'probe process.syscall {}' -t --vp 1 -c a.out

Here, we intercept sys_close with a kprobe.  If the system is not too
busy, we should pick up only the close(2)s coming from a.out, though a
few close(2)'s executed by other processes may show up.

# stap -e 'probe syscall.close {}' -t --vp 1 -c a.out

Here, we intercept all a.out's syscalls with utrace.  Other processes
are not affected at all, but other syscalls by a.out would be --
though in our test, there are hardly any of those.


Some typical results on my 2.66GHz 2*Xeon5150 machine runnin Fedora 9 -
2.6.27.12:

never:  
Pass 5: run completed in 740usr/3310sys/4155real ms.

kprobe: 
probe syscall.close (input:1:1), hits: 1028, cycles: 176min/202avg/3632max
Pass 5: run completed in 750usr/9320sys/10193real ms.

utrace: 
probe process.syscall (input:1:1), hits: 1025, cycles: 
176min/209avg/184392max
Pass 5: run completed in 1670usr/6860sys/8645real ms.

So utrace added 4.5 seconds, and kprobes added 6.0 seconds to the
uninstrumented 4.1 second run time.  But wait: we should subtract the
time taken by the probe handler itself: 200ish cycles at 2.66 GHz,
which is about 0.75 seconds.  So the overheads are approximately:

never: n/a
kprobe: 5.2 seconds = 0.52 us per hit
utrace: 3.6 seconds = 0.36 us per hit


Note that these are microbenchmarks that represent an ideal case
compared to a larger run, since they probably fit comfily inside
caches.  They probably also undercount the probe handler's run time.


- FChE



Re: syscall tracing overheads: utrace vs. kprobes

2009-04-28 Thread Roland McGrath
 Certainly, in general.  But in this specific test, only the under-test
 system calls occurred in essnetially the whole system, so the overhead
 measurements were in a way the bare minimums imposed by the kprobes
 vs. utrace callback infrastructure itself.

Yes.  That's why I meant to explain how these numbers are true but not
necessarily the numbers that matter.

  In real situations [...]  the distribution of overheads will be
  quite different.
 
 Or rather, the basic overhead quanta measured above may be multiplied
 along several different axes.

Yes.



Re: syscall tracing overheads: utrace vs. kprobes

2009-04-28 Thread David Smith
Frank Ch. Eigler wrote:
 Hi -
 
 In a few contexts, it comes up as to whether it is faster to probe
 process syscalls with kprobes or with something higher level such as
 utrace.  (There are other hypothetical options too (per-syscall
 tracepoints) that could be measured this way in the future.)  

These scenarios are a bit wrong:

 Now we compare these scenarios:
 
 # stap -e 'probe never {}' -t --vp 1 -c a.out
 
 Here, no actual probing occurs so we get a measurement of the plain
 uninstrumented run time of ten million close(2)s.

The above one is fine.

 # stap -e 'probe process.syscall {}' -t --vp 1 -c a.out
 
 Here, we intercept sys_close with a kprobe.  If the system is not too
 busy, we should pick up only the close(2)s coming from a.out, though a
 few close(2)'s executed by other processes may show up.
 
 # stap -e 'probe syscall.close {}' -t --vp 1 -c a.out
 
 Here, we intercept all a.out's syscalls with utrace.  Other processes
 are not affected at all, but other syscalls by a.out would be --
 though in our test, there are hardly any of those.

These 2 are swapped:  the 'process.syscall' probe is a utrace-based
probe and the 'syscall.close' probe is a kprobe-based probe.

Note that in the results, the description and probe types matched correctly.

 Some typical results on my 2.66GHz 2*Xeon5150 machine runnin Fedora 9 -
 2.6.27.12:
 
 never:  
 Pass 5: run completed in 740usr/3310sys/4155real ms.
 
 kprobe: 
 probe syscall.close (input:1:1), hits: 1028, cycles: 
 176min/202avg/3632max
 Pass 5: run completed in 750usr/9320sys/10193real ms.
 
 utrace: 
 probe process.syscall (input:1:1), hits: 1025, cycles: 
 176min/209avg/184392max
 Pass 5: run completed in 1670usr/6860sys/8645real ms.
 
 So utrace added 4.5 seconds, and kprobes added 6.0 seconds to the
 uninstrumented 4.1 second run time.  But wait: we should subtract the
 time taken by the probe handler itself: 200ish cycles at 2.66 GHz,
 which is about 0.75 seconds.  So the overheads are approximately:
 
 never: n/a
 kprobe: 5.2 seconds = 0.52 us per hit
 utrace: 3.6 seconds = 0.36 us per hit
 
 
 Note that these are microbenchmarks that represent an ideal case
 compared to a larger run, since they probably fit comfily inside
 caches.  They probably also undercount the probe handler's run time.

-- 
David Smith
dsm...@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)