https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686
Bug ID: 70686 Summary: -fprofile-generate (not fprofile-use) somehow produces much faster binary Product: gcc Version: 5.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: alekshs at hotmail dot com Target Milestone: --- I have this small benchmark that does 100mn loops of 20 divisions by 2. Periodically it bumps up the values so that it continues to have something to divide /2. I time this and see the results. ------------------------------- <start of code> #include <math.h> #include <stdio.h> #include <time.h> int main() { printf("\n"); const double a = 3333333.3456743289; //initial randomly assigned values to start halving const double aa = 4444555.444334244; const double aaa = 6666777.66666666; const double aaaa = 32769999.123458; unsigned int i; double score; double g; //the number to be used for making the divisions, so essentially halving everything each round double b; double bb; double bbb; double bbbb; g = 2; b = a; bb = aa; bbb = aaa; bbbb = aaaa; double total_time; clock_t start, end; start = 0; end = 0; score = 0; start = clock(); for (i = 1; i <100000001; i++) { b=b/g; b=b/g; b=b/g; b=b/g; b=b/g; bb=bb/g; bb=bb/g; bb=bb/g; bb=bb/g; bb=bb/g; bbb=bbb/g; bbb=bbb/g; bbb=bbb/g; bbb=bbb/g; bbb=bbb/g; bbbb=bbbb/g; bbbb=bbbb/g; bbbb=bbbb/g; bbbb=bbbb/g; bbbb=bbbb/g; if (b < 1.0000001) {b=b+i+12.432432432;} //just adding more stuff in order for the number if (bb < 1.0000001) {bb=bb+i+15.4324442;} //to return back to larger values after several if (bbb < 1.0000001) {bbb=bbb+i+19.42884;} //rounds of halving if (bbbb < 1.0000001) {bbbb=bbbb+i+34.481;} } end = clock(); total_time = ((double) (end - start)) / CLOCKS_PER_SEC * 1000; score = (10000000 / total_time); printf("\nFinal number: %0.20f", (b+bb+bbb+bbbb)); printf("\nTime elapsed: %0.0f msecs", total_time); printf("\nScore: %0.0f\n", score); return 0; } ----------------------- <end of code> This is run on a quad q8200 @ 1.75ghz Now notice the times: gcc Maths4asm.c -lm -O0 => 6224ms gcc Maths4asm.c -lm -O2 and -O3 => 1527ms gcc Maths4asm.c -lm -Ofast => 1227ms gcc Maths4asm.c -lm -Ofast -march=nocona => 1236ms gcc Maths4asm.c -lm -Ofast -march=core2 => 1197ms (I have a core quad, technically it's core2 arch) gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-generate => 624ms. gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-generate => 530ms. gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-use => 1258ms (slower than without PGO, slower than -fprofile-generate) gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-use => 1222ms (slower than without PGO, slower than -fprofile-generate). So PGO optimization made it worse, but the most mindblowing thing is the running of the profiler, getting execution times down to 530ms. The profiler run (-generate) should normally take this to 4000-5000ms or above, as it monitors the process to create a log file. I have never run into a -fprofile-generate build that wasn't at least 2-3 times slower than a normal build - let alone 2-3 times faster. This is totally absurd. And then, to top it all, -fprofile-use (using the logfile to create the best binary) created worse binaries. Oh, and "nocona" (pentium4+) suddenly became ...the better architecture instead of core2. This stuff is almost unbelievable. I thought initially that the profiler must be activating multithreading, but no. I scripted simultaneous use of 4 runs, they all give the same time - that means, there was no extra cpu use in other threads. The implication of all these is that if -fprofile-generate can somehow give code that is executing at 500ms, and the non -fprofile-generate binaries are running at 1200ms, then serious performance is left on the table on normal builds.