Re: Instrumenting for a different profiling algorithm

2007-03-29 Thread Michael Veksler

Ian Lance Taylor wrote:

> It's really a lot easier to do this as a source code modification than
> as a compiler change. Unless you already have a lot of experience
> with the compiler, I think you'd be lucky or very good to get it done
> in two weeks.

I have already done it as a source code modification, once. It has its
limitations, which may be overcome in the compiler (e.g. using a
specialized ABI to call the profiler functions).

> For the prologue changes look at FUNCTION_PROFILER and friends in the
> internal documentation. There is currently no support for profiling
> in the epilogue. It could be added along the same lines as
> FUNCTION_PROFILER.

Thanks for the pointer. I have now read this documentation, but I am not
sure it suits my needs:

* No support for an epilogue. Isn't there some support for finalization
  through other means? My epilogue should be called even when
  exceptions are thrown.
* I need a static variable per function (not a counter, but a struct
  or a pointer to a struct). I hope that the proposed
  fprintf(f, "LP%d", labelno) is sufficient.
* I need something more portable than assembler code. I'd like to
  define the profiler in C or even in a *portable* lower level language.
* The documentation repeatedly mentions 'mcount'. Is it just an
  example of the most common profiling function, or is it a must? I
  would like to define something different, which does something
  completely different.


> If I were you I wouldn't bother with __FUNCTION__ and would just deal
> with PC addresses. Use a post-processing pass to convert those back
> into function names, using, e.g., addr2line.

That's not a perfect solution.

* I am not sure if there is a portable way to extract the PC from the
  stack. For now I can do this for Linux/x86 but it's not perfect (also,
  how can this be done both with -fomit-frame-pointer and with
  -fno-omit-frame-pointer?)
* Inline functions won't be profiled (I'd like to be able to
  selectively profile some of the inline functions using
  __attribute__((profile)) or some such).
* There are many great features that are more difficult to implement
  if the PC is used instead of a string.
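
On the first point, one GCC-specific possibility (an assumption added here, not something proposed in the thread) is the `__builtin_return_address(0)` builtin. Called from inside a profiling hook, it yields an address inside the function that called the hook, and level 0 does not depend on the frame-pointer setting. The names `record_caller_pc`, `profiled_work`, `print_pc` and `last_pc` below are hypothetical. A minimal sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* PC most recently recorded by the hook (hypothetical name). */
uintptr_t last_pc;

/* Hypothetical entry hook. noinline keeps it a real call, so its return
   address is an instruction inside the function that called it. */
__attribute__((noinline))
void record_caller_pc(void)
{
    /* Level 0 is this function's own return address; GCC warns that
       nonzero levels can misbehave, so only level 0 is used. */
    last_pc = (uintptr_t)__builtin_return_address(0);
    assert(last_pc != 0);
}

/* A function we pretend to profile. */
__attribute__((noinline))
int profiled_work(int x)
{
    record_caller_pc();
    return x * 2;
}

/* Print the PC as a hex address, the form addr2line accepts. */
void print_pc(FILE *out)
{
    fprintf(out, "0x%lx\n", (unsigned long)last_pc);
}
```

The printed address could then be handed to addr2line together with the executable. For a position-independent executable the load address would have to be subtracted first, and this does nothing for the inline-function or string-based concerns in the other two points.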

Thanks
Michael

--
Michael Veksler
http://tx.technion.ac.il/~mveksler




Instrumenting for a different profiling algorithm

2007-03-28 Thread Michael Veksler

Hello all,

[Sorry for the excessively long mail. It contains an introduction, a
problem explanation, a solution and a set of how-to questions].


== Introduction ==
Because gprof is completely useless in some cases (see below), I had to
develop (in 1999) a new profiling algorithm. Unfortunately, my
implementation requires manual instrumentation of all interesting
functions with RAII objects (C++ only). Aside from the manual
instrumentation, the new profiler works perfectly, with remarkable
accuracy and low overhead (a bit faster than gcc -O2 -pg -g).

Instead of doing this manually, I would like to tell GCC to instrument
the code so that it will:


  1. Create a static variable per profiled function (could be void*).
  2. Call a profiling-function upon entry (and pass this pointer, and
     __FUNCTION__ - or whatever its name is in c99/c++0x).
  3. Call a conclude-function upon exit from the profiled function.
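
At source level, the three steps above can be approximated in C with GCC's `cleanup` variable attribute, which runs a function when a variable goes out of scope. This is only a sketch of the shape of the instrumentation, not the compiler change being asked for. The struct layout, the `PROFILE_FUNCTION` macro, `profile_scope_exit`, `profile_depth` and `last_state` are hypothetical names, and the hook bodies are placeholders:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-function record; the mail only says "could be void*". */
struct profile_state {
    const char *name;     /* filled in on first entry */
    unsigned long calls;  /* number of completed calls */
};

int profile_depth;        /* current nesting depth of profiled functions */

/* Entry hook: receives the per-function static and the function name. */
void start_profile_function(struct profile_state *state, const char *name)
{
    if (state->name == NULL)
        state->name = name;
    profile_depth++;
}

/* Exit hook. */
void finalize_profile_function(struct profile_state *state)
{
    assert(profile_depth > 0);
    profile_depth--;
    state->calls++;
}

/* Adapter with the signature the cleanup attribute requires: it is
   passed a pointer to the guarded variable. */
void profile_scope_exit(struct profile_state **guard)
{
    finalize_profile_function(*guard);
}

/* Steps 1-3 in one macro. GCC runs the cleanup on every scope exit and,
   when compiled with -fexceptions, during stack unwinding as well. */
#define PROFILE_FUNCTION()                                              \
    static struct profile_state profile_state_;                         \
    struct profile_state *profile_guard_                                \
        __attribute__((cleanup(profile_scope_exit))) = &profile_state_; \
    start_profile_function(profile_guard_, __func__)

struct profile_state *last_state;  /* lets a caller inspect the record */

int fast(int x)
{
    PROFILE_FUNCTION();
    last_state = profile_guard_;
    if (x < 0)
        return 0;          /* an early return still runs the exit hook */
    return x + 1;
}
```

`__func__` is the C99 spelling of `__FUNCTION__`. The sketch still needs one macro line per function, which is exactly the manual step the mail wants the compiler to take over.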

== Problem description ==

gcc -pg and gprof can't accumulate the time of a function's children.
Instead, gprof interpolates based on call count. For example, look at
the attached file. In practice 'fast' takes less than a millisecond and
'slow' takes around a second, yet gprof manages to reverse their order
(when compiling with gcc -pg -g t.c, and running gprof a.out):

                  0.00    0.00       1/1001       slow [4]
                  1.05    0.00    1000/1001       fast [3]
   [1]    100.0   1.06    0.00    1001        f [1]
   -----------------------------------------------
   [2]    100.0   0.00    1.06                main [2]
                  0.00    1.05    1000/1000       fast [3]
                  0.00    0.00       1/1          slow [4]

The only correct information is the time that the leaf f takes.

Measuring the time manually (see t.c) gives (contradicting gprof):

   Slow took 1.15 seconds
   Sum of fast took 0.00 seconds

(You can tune t.c to make the result as inaccurate as you like)

== Solution ==

Record the approximate time of entry and exit of each function (a
volatile tick counter is updated by a SIGPROF handler). Accumulate time
for each:

  1. Time for a function (unlike -pg this will be function+children time).
  2. Time for a pair of caller-callee functions (the time that the
     callee contributes to the caller's total).
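
A minimal sketch of that scheme, assuming a POSIX system: `setitimer(ITIMER_PROF)` delivers SIGPROF as the process consumes CPU time, the handler advances a tick counter, and at function exit the elapsed ticks are billed both to the function and to the caller-callee pair. The names `profile_ticks`, `start_tick_timer`, `tick_total` and `bill_elapsed` are hypothetical, not taken from the original implementation:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Tick counter advanced by the SIGPROF handler. */
volatile sig_atomic_t profile_ticks;

static void on_sigprof(int signo)
{
    (void)signo;
    profile_ticks++;
}

/* Arm ITIMER_PROF to fire every interval_usec (below one second) of
   consumed CPU time. Returns 0 on success, -1 on failure. */
int start_tick_timer(long interval_usec)
{
    struct sigaction sa;
    struct itimerval tv;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigprof;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;

    tv.it_interval.tv_sec = 0;
    tv.it_interval.tv_usec = interval_usec;
    tv.it_value = tv.it_interval;
    return setitimer(ITIMER_PROF, &tv, NULL);
}

/* Hypothetical accumulator, used for per-function totals and for
   caller-callee pairs alike. */
struct tick_total {
    long ticks;
};

/* Called at function exit with the tick value saved at entry. Bills the
   elapsed ticks (function plus everything it called) to the function's
   total and to the caller-callee pair. */
void bill_elapsed(long entry_tick,
                  struct tick_total *function_total,
                  struct tick_total *edge_total)
{
    long elapsed = (long)profile_ticks - entry_tick;

    assert(elapsed >= 0);
    function_total->ticks += elapsed;
    edge_total->ticks += elapsed;
}
```

The resolution is one timer interval, so a single short call usually records zero ticks; the totals only become meaningful once many calls have been accumulated. Recursive calls would need extra care to avoid billing the same ticks twice.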

In my experience, this scheme works much better than the current
-pg+gprof (the latter is totally useless for my needs; I would do better
flipping a coin to guess which function is expensive).


To do:
I would like to add the instrumentation (as explained above). To avoid
billing the overhead of the profiling to the instrumented functions, I
need to maintain

    volatile unsigned in_profiler;

and avoid counting time when in_profiler != 0. in_profiler is updated as
follows:


  1. in_profiler=1 right before parameters are passed to the calls of
     start_profile_function(state, __FUNCTION__); and
     finalize_profile_function(state);
  2. in_profiler=0 immediately after *_profile_function() cleans up
     the remains of the call (depending on the ABI, it is possible that
     no such clean up occurs).
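
Written out by hand, the protocol might look like the sketch below. The hook bodies are placeholders, and `on_profile_tick`, `billed_ticks` and `profiled` are hypothetical names; only `in_profiler` and the two `*_profile_function` names come from the text above. The signal handler drops any tick that arrives while the flag is set:

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

volatile unsigned in_profiler;      /* nonzero while profiler code runs */
volatile sig_atomic_t billed_ticks; /* ticks charged to profiled code */

/* SIGPROF handler: drop ticks that land inside the profiler itself. */
void on_profile_tick(int signo)
{
    (void)signo;
    if (in_profiler == 0)
        billed_ticks++;
}

/* Hypothetical hooks with trivial bodies; a real profiler would record
   tick values here. */
void start_profile_function(const char **state, const char *name)
{
    assert(in_profiler != 0);   /* the caller must have raised the flag */
    if (*state == NULL)
        *state = name;
}

void finalize_profile_function(const char **state)
{
    assert(in_profiler != 0);
    (void)state;
}

int profiled(int x)
{
    static const char *state;   /* the per-function static variable */
    int result;

    in_profiler = 1;            /* step 1: before arguments are set up */
    start_profile_function(&state, __func__);
    in_profiler = 0;            /* step 2: after the call is cleaned up */

    result = x + 1;             /* the function's real work */

    in_profiler = 1;
    finalize_profile_function(&state);
    in_profiler = 0;
    return result;
}
```

Because `in_profiler` is volatile the stores themselves are not removed, but the compiler may still move unrelated work across them, which is the problem raised in the first how-to question.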

== How to: ==

I would like to find out how to force gcc (and its optimizers) to:

  1. Not move stuff across in_profiler=* assignments, for all
     optimizers. This was measured to skew profiling by 10-20% on x86
     Linux, and by more than 2x on tiny functions on PPC-AIX (adding an
     __asm__ register+memory barrier lowered this to 1-4% on x86, but
     not for PPC-AIX's ABI).
  2. Instrument start_profile_function as early as possible in the
     profiled function (if possible before most of the preamble).
  3. Instrument finalize_profile_function as late as possible in the
     profiled function (if possible after most of the epilogue).
  4. Take exceptions into account.
  5. Extract the __FUNCTION__ information from inside
     start_profile_function() instead of passing it explicitly (debug
     info?)
       1. If __FUNCTION__ can be extracted from the stack then maybe
          ditch in_profiler=* altogether (this way the profiler will be
          billed correctly for its time, without such variables).
       2. Use a specialized ABI to call *_profile_function, saving
          the need for register save/restore for the profiler's
          function calls.
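
For item 1, the kind of barrier mentioned there can be sketched with an empty GCC `asm` statement that clobbers memory. The macro names, `profiler_hook`, `hook_calls` and `sum_to` are hypothetical. The barrier emits no instruction and only stops the compiler from moving memory accesses across it; computations held purely in registers can still be scheduled across it, which may account for part of the residual skew reported above:

```c
#include <assert.h>

volatile unsigned in_profiler;
unsigned long hook_calls;   /* how many times the placeholder hook ran */

/* Compiler-only barrier: GCC must assume all memory may have changed
   and will not move loads or stores across this point. It is not a
   CPU memory fence. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Fence both sides of each flag update so that user code stays outside
   the flagged region and profiler code stays inside it. */
#define PROFILER_ENTER() \
    do { COMPILER_BARRIER(); in_profiler = 1; COMPILER_BARRIER(); } while (0)
#define PROFILER_LEAVE() \
    do { COMPILER_BARRIER(); in_profiler = 0; COMPILER_BARRIER(); } while (0)

/* Placeholder for start_profile_function / finalize_profile_function. */
void profiler_hook(void)
{
    assert(in_profiler == 1);
    hook_calls++;
}

int sum_to(int n)
{
    int i, sum = 0;

    PROFILER_ENTER();
    profiler_hook();
    PROFILER_LEAVE();

    for (i = 1; i <= n; i++)   /* the profiled work */
        sum += i;

    PROFILER_ENTER();
    profiler_hook();
    PROFILER_LEAVE();
    return sum;
}
```

This is a source-level workaround and does not address items 2 and 3, which concern where the calls are placed relative to the prologue and epilogue.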

It took me about a week to implement it in C++ using STL (using RAII and
manual instrumentation). I hope that it would take me no more than two
weeks to do this for C. Anything more than that will shift my
benefit/cost ratio toward finding a weaker solution.


I would be happy to get any comments or suggestions (especially of the
"don't waste your time, it is not needed because..." kind).


Note that I am not 100% sure whether I am allowed to assign copyright to
the FSF. I am doing this with my student hat on (I need it for accurate
profiling of my research code), but my IBM researcher hat may (or may
not) interfere, and I'll need advice from our legal department.


Thanks
 Michael

--
Michael Veksler
http://tx.technion.ac.il/~mveksler


Re: Instrumenting for a different profiling algorithm

2007-03-28 Thread Ian Lance Taylor
Michael Veksler writes:

> I would like to find out how to force gcc (and its optimizers) to:
>
>   1. Not move stuff across in_profiler=* assignments, for all
>      optimizers. This was measured to skew profiling by 10-20% on x86
>      Linux, and by more than 2x on tiny functions on PPC-AIX (adding an
>      __asm__ register+memory barrier lowered this to 1-4% on x86, but
>      not for PPC-AIX's ABI).
>   2. Instrument start_profile_function as early as possible in the
>      profiled function (if possible before most of the preamble).
>   3. Instrument finalize_profile_function as late as possible in the
>      profiled function (if possible after most of the epilogue).
>   4. Take exceptions into account.
>   5. Extract the __FUNCTION__ information from inside
>      start_profile_function() instead of passing it explicitly (debug
>      info?)
>        1. If __FUNCTION__ can be extracted from the stack then maybe
>           ditch in_profiler=* altogether (this way the profiler will be
>           billed correctly for its time, without such variables).
>        2. Use a specialized ABI to call *_profile_function, saving
>           the need for register save/restore for the profiler's
>           function calls.
>
> It took me about a week to implement it in C++ using STL (using RAII
> and manual instrumentation). I hope that it would take me no more than
> two weeks to do this for C. Anything more than that will shift my
> benefit/cost ratio toward finding a weaker solution.

It's really a lot easier to do this as a source code modification than
as a compiler change.  Unless you already have a lot of experience
with the compiler, I think you'd be lucky or very good to get it done
in two weeks.

For the prologue changes look at FUNCTION_PROFILER and friends in the
internal documentation.  There is currently no support for profiling
in the epilogue.  It could be added along the same lines as
FUNCTION_PROFILER.

If I were you I wouldn't bother with __FUNCTION__ and would just deal
with PC addresses.  Use a post-processing pass to convert those back
into function names, using, e.g., addr2line.

Hope this helps.

Ian