https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686
--- Comment #2 from alekshs at hotmail dot com ---
(In reply to Richard Biener from comment #1)
> It's not so mind-blowing - it's simply that -fprofile-generate makes our
> GIMPLE level if-conversion no longer apply. Without -fprofile-generate
> we if-convert the loop into
>
>   for (i = 1; i <100000001; i++)
>     {
>       ...
>
>       b = b + (b < 1.00001) ? i + 12.43 : 0.0;
>       ...
>     }
>
> thus we always evaluate the i + 12.43 and one additional addition of zero.
>
> We do this to eventually enable vectorization but without any check
> on whether it would be profitable when not vectorizing (your testcase
> shows it's not profitable).
>
> Confirmed. -fno-tree-loop-if-convert should fix it in this particular case.

Aha, thanks for the swift reply.

Regarding profitability, I should note that PGO entirely misses the fact that 20 mulsd could become 10 mulpd:

  400560: f2 0f 59 e9    mulsd %xmm1,%xmm5
  400564: f2 0f 59 e1    mulsd %xmm1,%xmm4
  400568: f2 0f 59 d9    mulsd %xmm1,%xmm3
  40056c: f2 0f 59 d1    mulsd %xmm1,%xmm2
  400570: f2 0f 59 e9    mulsd %xmm1,%xmm5
  400574: f2 0f 59 e1    mulsd %xmm1,%xmm4
  400578: f2 0f 59 d9    mulsd %xmm1,%xmm3
  40057c: f2 0f 59 d1    mulsd %xmm1,%xmm2
  400580: f2 0f 59 e9    mulsd %xmm1,%xmm5
  400584: f2 0f 59 e1    mulsd %xmm1,%xmm4
  400588: f2 0f 59 d9    mulsd %xmm1,%xmm3
  40058c: f2 0f 59 d1    mulsd %xmm1,%xmm2
  400590: f2 0f 59 e9    mulsd %xmm1,%xmm5
  400594: f2 0f 59 e1    mulsd %xmm1,%xmm4
  400598: f2 0f 59 d9    mulsd %xmm1,%xmm3
  40059c: f2 0f 59 d1    mulsd %xmm1,%xmm2
  4005a0: f2 0f 59 e9    mulsd %xmm1,%xmm5
  4005a4: f2 0f 59 e1    mulsd %xmm1,%xmm4
  4005a8: f2 0f 59 d9    mulsd %xmm1,%xmm3
  4005ac: f2 0f 59 d1    mulsd %xmm1,%xmm2
  ...

So there was work to be done there. That's at -O3 -march=native, btw (to preserve accuracy, unlike -Ofast). -Ofast doesn't pack them either. It sort of splits into scalar muls and packed adds.
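For reference, the if-conversion described in the quoted comment can be sketched in plain C. This is only an illustration under assumptions: the loop body of the original testcase is elided in the quote, so step(), the starting value and the function names below are hypothetical stand-ins, and the ternary is parenthesised so that it parses as intended.

```c
#include <assert.h>

/* Hypothetical stand-in for the elided loop body: halving b makes it
   drop below the threshold every few iterations. */
static double step(double b)
{
    return b * 0.5;
}

/* Branchy form: i + 12.43 is only computed and added when the test holds. */
double run_branchy(unsigned n)
{
    double b = 1000.0;
    for (unsigned i = 1; i <= n; i++) {
        b = step(b);
        if (b < 1.00001)
            b += i + 12.43;
    }
    return b;
}

/* If-converted form: i + 12.43 and one addition are evaluated on every
   iteration; the select picks 0.0 when the test fails. */
double run_if_converted(unsigned n)
{
    double b = 1000.0;
    for (unsigned i = 1; i <= n; i++) {
        b = step(b);
        b = b + ((b < 1.00001) ? i + 12.43 : 0.0);
    }
    return b;
}

/* Both forms compute the same value; only the amount of work differs. */
void check_equivalence(unsigned n)
{
    assert(run_branchy(n) == run_if_converted(n));
}
```

The two functions return the same result, so the difference is purely in how much arithmetic runs per iteration, which is where the unvectorized if-converted loop loses.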

It's a similar situation with another small benchmark I made, which does 4 sqrts per iteration (with some stuff added when the values get too low, so as to keep going). The 2 packed sqrts I did in asm were much faster than the 4 scalar ones that gcc was generating (at every level of optimization and profiling it never did 2x packed, it kept doing 4x scalar). I'm attaching the bench at the end.

It seems like gcc avoids packing instructions like the plague in non-array code, even when there are obvious and serious measurable benefits. Perhaps the heuristics need some tuning for both profiled and non-profiled compilation.

----- code of sqrtbench.c -----
#include <math.h>
#include <stdio.h>
#include <time.h>

int main()
{
    const double a = 911798473;   // assigning some randomly chosen constants to begin math functions
    const double aa = 143314345;
    const double aaa = 531432117;
    const double aaaa = 343211418;

    unsigned int i;   //loop counter

    double b;         //variables that will be used for storing square roots
    double bb;
    double bbb;
    double bbbb;

    b = a;            //assign some large values to the variables in order to start finding square roots
    bb = aa;
    bbb = aaa;
    bbbb = aaaa;

    double score;        // score
    double time1;        //how much time the program took
    clock_t start, end;  //stopwatch timers

    start = clock();

    for (i = 1; i <100000001; i++)
    {
        b=sqrt (b);
        bb=sqrt(bb);
        bbb=sqrt(bbb);
        bbbb=sqrt(bbbb);
        if (b <= 1.0000001) {b=b+i+12.432432432;}
        if (bb <= 1.0000001) {bb=bb+i+15.4324442;}
        if (bbb <= 1.0000001) {bbb=bbb+i+19.42884;}
        if (bbbb <= 1.0000001) {bbbb=bbbb+i+34.481;}
    }

    end = clock();

    time1 = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;
    score = (10000000 / time1);  // Just a way to give a "score" instead of just time elapsed.
                                 // Baseline calibration is at 1000 points rewarded for 10000ms delay...
                                 // In other words if you finish 5 times faster, say 2000ms, you get 5000 points

    printf("\nFinal number: %0.16f", (b+bb+bbb+bbbb));  // The number that resulted from all the math functions - useful for checking math accuracy from unsafe optimizations

    if (b+bb+bbb+bbbb > 4.0000032938759028) {printf(" Result [INCORRECT - 4.0000032938759027 expected]");}  //checking result
    if (b+bb+bbb+bbbb < 4.0000032938759026) {printf(" Result [INCORRECT - 4.0000032938759027 expected]");}  //checking result

    printf("\nTime elapsed: %0.0f msecs", time1);  // Time elapsed announced to the user
    printf("\nScore: %0.0f\n", score);             // Score announced to the user

    return 0;
}
-----end code ----

(The above consistently generates 4 sqrtsd instead of 2 sqrtpd, at -O3 and with PGO.)
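
For comparison, this is roughly what the packed form of the four square roots looks like when written with SSE2 intrinsics. It is a sketch under assumptions, not the hand-written asm mentioned above (which is not part of this comment): the names sqrt_scalar, sqrt4_packed and check_packed_matches_scalar are made up for illustration, the code is x86-only, and it covers only the sqrt step of the loop, not the conditional additions.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_sqrt_pd, _mm_sqrt_sd */

/* Scalar reference: one sqrtsd-style operation per value (no libm needed). */
static double sqrt_scalar(double x)
{
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(_mm_sqrt_sd(v, v));
}

/* Packed version: four square roots done as two 2-wide sqrtpd operations. */
void sqrt4_packed(double v[4])
{
    __m128d lo = _mm_loadu_pd(&v[0]);  /* v[0], v[1] */
    __m128d hi = _mm_loadu_pd(&v[2]);  /* v[2], v[3] */
    _mm_storeu_pd(&v[0], _mm_sqrt_pd(lo));
    _mm_storeu_pd(&v[2], _mm_sqrt_pd(hi));
}

/* Checks that packing does not change the results for the benchmark's
   starting constants. */
void check_packed_matches_scalar(void)
{
    double v[4] = { 911798473.0, 143314345.0, 531432117.0, 343211418.0 };
    double ref[4];

    for (int i = 0; i < 4; i++)
        ref[i] = sqrt_scalar(v[i]);

    sqrt4_packed(v);

    for (int i = 0; i < 4; i++)
        assert(v[i] == ref[i]);
}
```

Scalar and packed SSE2 square roots are both correctly rounded, so the packed results match the scalar ones bit for bit. Packing here costs no accuracy, unlike the -Ofast route.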