[Bug c/70686] New: -fprofile-generate (not fprofile-use) somehow produces much faster binary

alekshs at hotmail dot com Fri, 15 Apr 2016 11:36:58 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686


            Bug ID: 70686
           Summary: -fprofile-generate (not fprofile-use) somehow produces
                    much faster binary
           Product: gcc
           Version: 5.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alekshs at hotmail dot com
  Target Milestone: ---

I have this small benchmark that does 100mn loops of 20 divisions by 2.
Periodically it bumps up the values so that it continues to have something to
divide /2. I time this and see the results. 

-------------------------------
<start of code>


#include <math.h>     
#include <stdio.h>     
#include <time.h>

int main() 
{
printf("\n");

const double a = 3333333.3456743289;  //initial randomly assigned values to
start halving
const double aa = 4444555.444334244;
const double aaa = 6666777.66666666;
const double aaaa = 32769999.123458;

unsigned int i;
double score;
double g; //the number to be used for making the divisions, so essentially
halving everything each round

double b; 
double bb;
double bbb;
double bbbb;

g = 2;  

b = a;
bb = aa;
bbb = aaa;
bbbb = aaaa;

double total_time;
clock_t start, end;

start = 0;
end = 0;
score = 0;

start = clock();

 for (i = 1; i <100000001; i++) 
 {
   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;

   if (b    < 1.0000001)  {b=b+i+12.432432432;}  //just adding more stuff  in
order for the number
   if (bb   < 1.0000001)  {bb=bb+i+15.4324442;}  //to return back to larger
values after several
   if (bbb  < 1.0000001)  {bbb=bbb+i+19.42884;}  //rounds of halving
   if (bbbb < 1.0000001)  {bbbb=bbbb+i+34.481;} 
}

 end = clock();

 total_time = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;

 score = (10000000 / total_time);
 printf("\nFinal number: %0.20f", (b+bb+bbb+bbbb));

 printf("\nTime elapsed: %0.0f msecs", total_time);   
 printf("\nScore: %0.0f\n", score);

 return 0;
}
-----------------------
<end of code>


This is run on a quad q8200 @ 1.75ghz

Now notice the times:

gcc Maths4asm.c -lm -O0  => 6224ms
gcc Maths4asm.c -lm -O2 and -O3  => 1527ms
gcc Maths4asm.c -lm -Ofast  => 1227ms
gcc Maths4asm.c -lm -Ofast -march=nocona => 1236ms
gcc Maths4asm.c -lm -Ofast -march=core2 => 1197ms  (I have a core quad,
technically it's core2 arch)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-generate => 624ms.
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-generate => 530ms. 
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-use => 1258ms (slower than
without PGO, slower than -fprofile-generate)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-use => 1222ms (slower than
without PGO, slower than -fprofile-generate).

So PGO optimization made it worse, but the most mindblowing thing is the
running of the profiler, getting execution times down to 530ms. The profiler
run (-generate) should normally take this to 4000-5000ms or above, as it
monitors the process to create a log file. 

I have never run into a -fprofile-generate build that wasn't at least 2-3 times
slower than a normal build - let alone 2-3 times faster. This is totally
absurd. 

And then, to top it all, -fprofile-use (using the logfile to create the best
binary) created worse binaries. 

Oh, and "nocona" (pentium4+) suddenly became ...the better architecture instead
of core2.

This stuff is almost unbelievable. I thought initially that the profiler must
be activating multithreading, but no. I scripted simultaneous use of 4 runs,
they all give the same time - that means, there was no extra cpu use in other
threads.

The implication of all these is that if -fprofile-generate can somehow give
code that is executing at 500ms, and the non -fprofile-generate binaries are
running at 1200ms, then serious performance is left on the table on normal
builds.

[Bug c/70686] New: -fprofile-generate (not fprofile-use) somehow produces much faster binary

Reply via email to