https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686

--- Comment #2 from alekshs at hotmail dot com ---
(In reply to Richard Biener from comment #1)
> It's not so mind-blowing - it's simply that -fprofile-generate makes our
> GIMPLE level if-conversion no longer apply.  Without -fprofile-generate
> we if-convert the loop into
> 
>  for (i = 1; i <100000001; i++) 
>  {
>  ...
>     
>    b = b + ((b < 1.00001) ? i + 12.43 : 0.0); 
> ...
> }
> 
> thus we always evaluate the i + 12.43 and one additional addition of zero.
> 
> We do this to eventually enable vectorization but without any check
> on whether it would be profitable when not vectorizing (your testcase
> shows it's not profitable).
> 
> Confirmed.  -fno-tree-loop-if-convert should fix it in this particular case.

Aha, thanks for the swift reply.
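
To make the transformation concrete, here is a minimal sketch of the two loop
shapes. It is not the original testcase: the function names `branchy` and
`if_converted`, the parameter `n`, and the reduced loop body are assumptions
made for illustration, and the elided parts of the loop are simply left out.
Only the constants come from the quoted snippet.

```c
/* Hypothetical illustration only: names and the reduced loop body are
   assumptions, not the original testcase. */

/* Branchy form: the addition runs only when the condition holds. */
double branchy(double b, unsigned n)
{
    for (unsigned i = 1; i < n; i++) {
        if (b < 1.00001)
            b = b + (i + 12.43);
    }
    return b;
}

/* If-converted form: i + 12.43 is computed on every iteration, a select
   picks either that value or 0.0, and the add to b is unconditional. */
double if_converted(double b, unsigned n)
{
    for (unsigned i = 1; i < n; i++) {
        double t = i + 12.43;                 /* always evaluated */
        b = b + ((b < 1.00001) ? t : 0.0);    /* extra add of zero when false */
    }
    return b;
}
```

Both functions return the same value for the same arguments; the second one
just does the extra work on every iteration, which only pays off if the loop
ends up vectorized.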

Regarding profitability, I should note that PGO entirely misses the fact
that these 20 mulsd could become 10 mulpd:


  400560:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
  400564:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
  400568:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
  40056c:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
  400570:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
  400574:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
  400578:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
  40057c:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
  400580:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
  400584:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
  400588:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
  40058c:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
  400590:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
  400594:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
  400598:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
  40059c:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
  4005a0:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
  4005a4:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
  4005a8:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
  4005ac:       f2 0f 59 d1             mulsd  %xmm1,%xmm2


...So there was work left to be done there. That's at -O3 -march=native, by the
way (to preserve accuracy, unlike -Ofast). -Ofast doesn't pack them either. It
kind of splits the work into scalar muls and packed adds.

It's a similar situation with another small benchmark I made, which computes
4 sqrts per iteration (with some values added whenever the results get too low,
so as to keep going). The 2 packed sqrts I did in asm were much faster than the
4 scalar ones that gcc generated: at every level of optimization and profiling
it kept using 4 scalar sqrts rather than 2 packed ones. I'm attaching the bench
at the end.

It seems like gcc avoids packing instructions like the plague in non-array code
even when there are obvious and serious measurable benefits. Perhaps the
heuristics need some tune up for both profiled and non-profiled compilation.


-----
code of sqrtbench.c
-----

#include <math.h>     
#include <stdio.h>     
#include <time.h>

int main() 
{
const double a = 911798473;  // assigning some randomly chosen constants to begin math functions
const double aa = 143314345;
const double aaa = 531432117;
const double aaaa = 343211418;

unsigned int i; //loop counter

double b; //variables that will be used for storing square roots
double bb;
double bbb;
double bbbb;

b = a;  //assign some large values to the variables in order to start finding square roots
bb = aa;
bbb = aaa;
bbbb = aaaa;

double score; // score
double time1; //how much time the program took

clock_t start, end; //stopwatch timers

start = clock();

 for (i = 1; i <100000001; i++) 
 {
   b=sqrt (b);
   bb=sqrt(bb);
   bbb=sqrt(bbb);
   bbbb=sqrt(bbbb);

   if (b    <= 1.0000001)  {b=b+i+12.432432432;} 
   if (bb   <= 1.0000001)  {bb=bb+i+15.4324442;} 
   if (bbb  <= 1.0000001)  {bbb=bbb+i+19.42884;}
   if (bbbb <= 1.0000001)  {bbbb=bbbb+i+34.481;}
  }

 end = clock();

 time1 = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;

 score = (10000000 / time1); // Just a way to give a "score" instead of just time elapsed.
                             // Baseline calibration is at 1000 points rewarded for 10000ms delay...
                             // In other words if you finish 5 times faster, say 2000ms, you get 5000 points

 printf("\nFinal number: %0.16f", (b+bb+bbb+bbbb));  // The number that resulted from all the math functions - useful for checking math accuracy from unsafe optimizations

 if (b+bb+bbb+bbbb > 4.0000032938759028) {printf("    Result [INCORRECT - 4.0000032938759027 expected]");} //checking result
 if (b+bb+bbb+bbbb < 4.0000032938759026) {printf("    Result [INCORRECT - 4.0000032938759027 expected]");} //checking result

 printf("\nTime elapsed: %0.0f msecs", time1);   // Time elapsed announced to the user
 printf("\nScore: %0.0f\n", score);  // Score announced to the user

 return 0;
}

-----end code ----
(the above generates, consistently, 4 sqrtsd instead of 2 sqrtpd, at -O3 and
PGO).
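
For comparison, one way the loop body could be expressed with 2 packed sqrts
is sketched below using SSE2 intrinsics. This is not the asm version mentioned
above; the helper names `step_pair` and `sqrt_step_packed` and the array-based
interface are assumptions made for illustration, and an x86-64 target is
assumed. The conditional adds are done with a compare mask, keeping the
(value + i) + constant order of the scalar code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86-64 target */

/* Hypothetical helper: one loop iteration for a pair of values.
   v holds two of the benchmark's variables, k the matching constants. */
static __m128d step_pair(__m128d v, __m128d k, double i)
{
    v = _mm_sqrt_pd(v);                                      /* one sqrtpd, two roots */
    __m128d mask = _mm_cmple_pd(v, _mm_set1_pd(1.0000001));  /* lanes with v <= limit */
    __m128d bumped = _mm_add_pd(_mm_add_pd(v, _mm_set1_pd(i)), k); /* (v + i) + k */
    /* Take the bumped value where the mask is set, the plain root elsewhere. */
    return _mm_or_pd(_mm_and_pd(mask, bumped), _mm_andnot_pd(mask, v));
}

/* One iteration over {b, bb, bbb, bbbb}, stored as v[0..3]. */
void sqrt_step_packed(double v[4], unsigned i)
{
    /* _mm_set_pd takes the high lane first, then the low lane. */
    const __m128d k_lo = _mm_set_pd(15.4324442, 12.432432432);
    const __m128d k_hi = _mm_set_pd(34.481, 19.42884);
    __m128d lo = step_pair(_mm_loadu_pd(&v[0]), k_lo, (double)i);
    __m128d hi = step_pair(_mm_loadu_pd(&v[2]), k_hi, (double)i);
    _mm_storeu_pd(&v[0], lo);
    _mm_storeu_pd(&v[2], hi);
}
```

Calling `sqrt_step_packed(v, i)` for i from 1 to 100000000, with v initialised
to the four starting constants, would play the role of the loop in
sqrtbench.c.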
