Nick's performance theory - was inline mania

2000-08-01 Thread Nick Ing-Simmons

John Tobey [EMAIL PROTECTED] writes:
>> Maybe not for void functions with no args, tail-called and with
>> no prefix, but in more typical cases yes it can be different -
>> the "function-ness" of i_foo places constraints on where args
>> and "result" are which the optimizer _may_ not be able to unravel.

> May not be able because of what the Standard says, or because of
> suboptimal optimization?

Suboptimal optimization (i.e. lack of knowledge about the rest of the
program at the time of expansion) - note that a suitably "optimal"
optimizer _could_ turn 100,000 #define-d lines back into "local real
functions".

But it is usually much easier to add entropy - so start with it as
the same function, call it, and let the compiler decide which ones
to expand.
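
A minimal sketch of that approach (hypothetical foo/use_foo names):
one body with ordinary function semantics, expansion left entirely to
the optimizer.

    /* sketch.c - with gcc -O2 the call below stays a real call;
     * with -O3 or -finline-functions the compiler may expand it
     * in place. */
    static int
    foo(int a, int b)
    {
        return a * b + 1;
    }

    int
    use_foo(int x)
    {
        return foo(x, 3);   /* compiler's choice whether to inline */
    }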


> GCC won't unless you go -O3 or above.  This is why many people (me
> included) stop at -O2 for most programs.

Me too - because I _fundamentally_ believe inlining is nearly always
sub-optimal for real programs.

But -O3 (or -finline-functions) is there for the folks who want
to believe the opposite.

And there is -Dinline -D__inline__ for the inline case.
What there isn't, though, is -fhash_define-as-inline or -fno-macros,
so at the very least let's avoid _that_ path.
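
A sketch of how to keep that escape hatch open (i_foo is the thread's
example name; PERL_INLINE is a hypothetical macro, not an existing
one): funnel the keyword through a single macro, so a build can define
it to nothing instead of being stuck with #define bodies.

    /* Overriding PERL_INLINE to empty on the command line
     * (e.g. -DPERL_INLINE=) turns i_foo back into a plain
     * static function; by default gcc sees an inline hint. */
    #ifndef PERL_INLINE
    #  define PERL_INLINE __inline__
    #endif

    static PERL_INLINE int
    i_foo(int x)
    {
        return x + 1;
    }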


>>> Non-inline functions have their place in reducing code size
>>> and easing debugging.  I just want an i_foo for every foo that
>>> callers will have the option of using.
>>
>> Before we make any promises to do all that extra work, can we
>> measure (for various architectures) the cost of a real call vs.
>> inline?
>>
>> I want proof that inline makes X% difference.

> I'm not going to prove that.  A normal C function call involves
> several instructions and a jump, most likely across page boundaries.
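
For what it's worth, a crude harness for the measurement being argued
about might look like this (a sketch with hypothetical names; the
numbers will vary wildly with architecture, compiler, and flags, which
is rather the point):

    /* call_cost.c - build with gcc -O2 and compare the two times. */
    #include <stdio.h>
    #include <time.h>

    #define I_FOO(x) ((x) + 1)                         /* "inlined" */

    static unsigned foo(unsigned x) { return x + 1; }  /* real call */

    /* Calling through a volatile function pointer stops the
     * compiler from inlining foo behind our backs, so the first
     * loop measures genuine call overhead. */
    static unsigned (* volatile foo_p)(unsigned) = foo;

    int
    main(void)
    {
        long i, n = 100000000L;
        unsigned acc = 0;
        clock_t t0, t1, t2;

        t0 = clock();
        for (i = 0; i < n; i++)
            acc += foo_p(acc);
        t1 = clock();
        for (i = 0; i < n; i++)
            acc += I_FOO(acc);
        t2 = clock();

        printf("real call: %.2fs\nmacro:     %.2fs  (acc=%u)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, acc);
        return 0;
    }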

I have said this before, but the gist of the Nick-theory is:

Page boundaries are a don't-care unless there is a page miss.
Page misses are so costly that everything else can be ignored,
but for sane programs they should only be incurred at "startup".
(Reducing code size - e.g. by not inlining - only helps here: fewer
pages to load.)

It is cache that matters.

Modern processors (can) execute several instructions per cycle.
In contrast, a cache miss to 100MHz SDRAM costs a 500MHz processor
more than 5 cycles (say up to 10 instructions for a 2-way
super-scalar) per word missed.

I used to think that this was a "RISC processor only" argument,
but it seems (no hard numbers yet) that the Pentium at least follows
the same pattern.
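
The cache claim is easy to demonstrate in miniature. A sketch
(assuming 64-byte cache lines and a buffer larger than any cache
level): the two loops below do the same number of additions and
differ only in how kindly they treat the cache.

    /* cache_cost.c - same work, different cache behaviour. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (4 * 1024 * 1024)   /* 16MB of ints */
    #define STRIDE 16             /* 16 ints = 64 bytes = one line */

    int
    main(void)
    {
        int *a = malloc(N * sizeof(int));
        long i, j;
        unsigned long sum = 0;
        clock_t t0, t1, t2;

        if (!a)
            return 1;
        for (i = 0; i < N; i++)
            a[i] = (int)i;

        /* Sequential: each cache line is fetched once, fully used. */
        t0 = clock();
        for (i = 0; i < N; i++)
            sum += a[i];
        t1 = clock();

        /* Strided: each pass uses one int per line, so every line
         * is fetched from memory STRIDE times instead of once. */
        for (j = 0; j < STRIDE; j++)
            for (i = j; i < N; i += STRIDE)
                sum += a[i];
        t2 = clock();

        printf("sequential: %.2fs\nstrided:    %.2fs  (sum=%lu)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
        free(a);
        return 0;
    }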

> If someone else wants to prove this, great.  I just don't think it's
> that much trouble.  (mostly psychological - what will people think if
> they see that all our code is in headers and all our C files are
> autogenerated?)

We can unlink the .c files once we have compiled them ;-)

-- 
Nick Ing-Simmons




Re: Nick's performance theory - was inline mania

2000-08-01 Thread John Tobey

Nick Ing-Simmons [EMAIL PROTECTED] wrote:
> But it is usually much easier to add entropy - so start with it as
> the same function, call it, and let the compiler decide which ones
> to expand.

You'll get no argument on that point.  Please stop suggesting that I
want to take the power of decision away from programmers *OR*
compilers.

>> If someone else wants to prove this, great.  I just don't think it's
>> that much trouble.  (mostly psychological - what will people think if
>> they see that all our code is in headers and all our C files are
>> autogenerated?)
>
> We can unlink the .c files once we have compiled them ;-)

Nope.  Messes up source debuggers.