Re: benchmarking - it's now all(-1,0,1,5,6)% faster

perl6-internals-return-14948-archive=jab . org Mon, 10 Feb 2003 14:09:12 -0800

On Sun, Jan 12, 2003 at 10:24:23AM +0100, Leopold Toetsch wrote:
> In perl.perl6.internals, you wrote:
> > --- Leopold Toetsch <[EMAIL PROTECTED]> wrote:
> >>   * SLOW (same slow with register or odd aligned)
> >>   * 0x818118a <jit_func+194>:    sub    0x8164cac,%ebx
> >>   * 0x8181190 <jit_func+200>:    jne    0x818118a <jit_func+194>
> 
> > The slow one has the loop crossing over a 16 byte boundary. Try moving it
> > over a bit.
> 
> Yep, actually it looks like a 8 byte boundary:
> Following program:


> And here is the output:
> 
>   0   790.826400 M op/s
>   1   523.305494 M op/s
>   2   788.544190 M op/s
>   3   783.447189 M op/s
>   4   783.975462 M op/s
>   5   788.208178 M op/s
>   6   782.466484 M op/s
>   7   788.059343 M op/s
>   8   788.836349 M op/s
>   9   522.986581 M op/s
>  10   788.895326 M op/s
>  11   784.021624 M op/s
>  12   789.773978 M op/s
>  13   788.065635 M op/s
>  14   783.558056 M op/s
>  15   789.010709 M op/s
>  16   782.463565 M op/s
>  17   523.049517 M op/s
>  18   781.350657 M op/s

etc

> This of course has the assumption, that the program did run at the
> same address, which is - from my experience with gdb - usually true.
> 
> So moving the critical part of a program by just one byte can cause a
> huge slowdown.

I don't think that I ever mailed what seemed to be the answer back to p5p
or p6i. Thanks to Leo's suggestions I went hunting in the gcc man pages.
The ones for 2.95 and 3.0 are quite informative.

-falign-functions
-falign-labels
-falign-loops
-falign-jumps

all fall back to a machine-dependent default. That default isn't documented
explicitly, but I presume that on x86 it's the same as for the x86-specific -m
options of the same name (deprecated in gcc 3.0, removed along with their
documentation by 3.2).

*Their* alignment defaults are:

`-malign-loops=NUM'
     Align loops to a 2 raised to a NUM byte boundary.  If
     `-malign-loops' is not specified, the default is 2 unless gas 2.8
     (or later) is being used in which case the default is to align the
     loop on a 16 byte boundary if it is less than 8 bytes away.

sooooo

50% of the time your function/label/loop/jump is 16 byte aligned.
50% of the time your function/label/loop/jump is "randomly" aligned.
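
To spell out where the 50% comes from, here is a minimal sketch of the
quoted rule. It is only my reading of the man page text (pad up to the next
16 byte boundary when that costs fewer than 8 bytes), not anything taken
from gas itself, and padded_start is a made-up name:

```python
def padded_start(addr, boundary=16, max_skip=7):
    """Where a label at addr ends up, assuming the quoted rule:
    pad to the next boundary only if that takes max_skip bytes or fewer."""
    gap = (-addr) % boundary  # bytes to the next boundary, 0 if already on it
    return addr + gap if gap <= max_skip else addr

# Try every possible starting offset within a 16 byte block.
aligned = [a for a in range(16) if padded_start(a) % 16 == 0]
print(aligned)            # offsets that end up 16 byte aligned
print(len(aligned) / 16)  # fraction of offsets that end up aligned
```

Under that assumption offsets 0 and 9 to 15 end up aligned and offsets 1 to
8 are left where they fall, which is the half and half split.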

So a slight code size change early on in a file can flip the remaining
functions either onto or off alignment. Hence later loops in completely
unrelated code can happen to become optimally aligned, and go faster.
Similarly, other loops which were optimally aligned can now end up
unaligned, and go more slowly.
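
A toy illustration of that knock-on effect, again assuming my reading of
the padding rule (pad to 16 only when fewer than 8 bytes away). The
function sizes are invented for the example and layout is a hypothetical
helper, not a model of what the linker really does:

```python
def padded_start(addr, boundary=16, max_skip=7):
    """Assumed rule: pad to the next boundary only if it is close enough."""
    gap = (-addr) % boundary
    return addr + gap if gap <= max_skip else addr

def layout(sizes):
    """Place code chunks one after another; return each chunk's start address."""
    starts, addr = [], 0
    for size in sizes:
        addr = padded_start(addr)
        starts.append(addr)
        addr += size
    return starts

before = layout([40, 23, 50])  # made-up sizes in bytes
after = layout([41, 23, 50])   # first chunk grows by a single byte

print([s % 16 == 0 for s in before])
print([s % 16 == 0 for s in after])
```

With these sizes the second chunk moves onto a 16 byte boundary and the
third one falls off it, although neither of them changed.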

This is probably the right default for the general case, but it is
counterproductive for benchmarking small code changes. So on gcc 2.95 I'm
compiling with:

-O -malign-loops=3 -malign-jumps=3 -malign-functions=3 -mpreferred-stack-boundary=3 
-march=i686

(that's 2**3, i.e. 8)

and on gcc 3.2 on a different machine:
-O3 -falign-loops=16 -falign-jumps=16 -falign-functions=16 
-mpreferred-stack-boundary=3 -march=i586

This seems to smooth out the jumps.
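
Note that the two sets of flags use different units: as far as I can tell
from the docs, the -malign-* options take a power of two exponent while the
-falign-* options take a byte count. A small sketch of the conversion, with
helper names invented for the purpose:

```python
def malign_to_bytes(exponent):
    """-malign-*=NUM style: the boundary is 2 raised to NUM bytes."""
    return 1 << exponent

def bytes_to_malign(nbytes):
    """Inverse conversion; only meaningful for power-of-two byte counts."""
    if nbytes <= 0 or nbytes & (nbytes - 1):
        raise ValueError("not a power of two: %d" % nbytes)
    return nbytes.bit_length() - 1

print(malign_to_bytes(3))   # boundary in bytes for -malign-loops=3
print(bytes_to_malign(16))  # exponent matching -falign-loops=16
```

So the gcc 2.95 build above aligns to 8 bytes and the gcc 3.2 build to 16.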
In the end, copy-on-write regexps are on average 0% faster on the fast PIII
machine with gcc 2.95, and about 2% faster on the slower Cyrix with gcc 3.2,
based on what perlbench thinks.

Nicholas Clark
