https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92335

            Bug ID: 92335
           Summary: missed transformation to branchless
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

In the following code (compiled with -O2 or -O3, even with -march=haswell),
gcc uses a branchless construct in foo but not in bar. Changing the type from
float to int does not change the behavior.
(See https://godbolt.org/z/0ZWKb5 )

With -Ofast both functions compile to the same vectorized branchless loop.
Still, I do not see why the branch should be retained in bar at -O2.

For random "x", the branchless version is 6 times faster on any out-of-order CPU.

float foo(float const * __restrict__ x,
          float const * __restrict__ y) {
  float ret = 0.f;
  for (int i = 0; i < 1024; ++i) {
    auto k = y[i];
    ret += x[i] > 0.f ? k : 0.f;
  }
  return ret;
}



float bar(float const * __restrict__ x,
          float const * __restrict__ y) {
  float ret = 0.f;
  for (int i = 0; i < 1024; ++i) {
    auto k = y[i];
    if (x[i] > 0.f) ret += k;
  }
  return ret;
}
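
For reference, here is a sketch of the same loop with the conditional update written as a select on the accumulator. The name bar_select is hypothetical and not part of the report; bar is copied from above so the block stands alone. The sketch only shows that a source-level form exists that performs exactly the same additions as bar. Whether gcc emits branchless code for it at -O2 is an assumption that would have to be checked on the compiler in question.

```cpp
// Copy of bar from the report: the accumulator is updated under a branch.
float bar(float const * __restrict__ x,
          float const * __restrict__ y) {
  float ret = 0.f;
  for (int i = 0; i < 1024; ++i) {
    auto k = y[i];
    if (x[i] > 0.f) ret += k;
  }
  return ret;
}

// Hypothetical variant (name is an assumption): same additions as bar,
// but the choice between "ret + k" and "ret" is written as a select.
// Unlike foo, no "+ 0.f" is performed when the condition is false.
float bar_select(float const * __restrict__ x,
                 float const * __restrict__ y) {
  float ret = 0.f;
  for (int i = 0; i < 1024; ++i) {
    auto k = y[i];
    ret = x[i] > 0.f ? ret + k : ret;
  }
  return ret;
}
```

Both functions add the same values in the same order, so when compiled without fast-math options they should return the same result for the same inputs.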
