https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125919

            Bug ID: 125919
           Summary: __builtin_powi(-1.0,n) handling
           Product: gcc
           Version: 17.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

Found while looking at PR 125914.

Consider the following four equivalent functions, compiled on amd64 with -O3:

double v1(int n)
{
  return __builtin_powi (-1.0, n);
}

double v2(int n)
{
  return 1.0 - 2.0 * (n & 1);
}

double v3(int n)
{
  return (double) (1 - 2 * (n & 1));
}

double v4(int n)
{
  return n & 1 ? -1.0 : 1.0;
}

With a relatively recent trunk, the output of "gcc -O3 -S powi.c" is

v1:
        andl    $1, %edi
        movsd   .LC1(%rip), %xmm0
        je      .L3
        movsd   .LC0(%rip), %xmm0
.L3:
        ret
v2:
        andl    $1, %edi
        pxor    %xmm0, %xmm0
        movsd   .LC1(%rip), %xmm1
        cvtsi2sdl       %edi, %xmm0
        addsd   %xmm0, %xmm0
        subsd   %xmm0, %xmm1
        movapd  %xmm1, %xmm0
        ret
v3:
        andl    $1, %edi
        movl    $1, %eax
        pxor    %xmm0, %xmm0
        addl    %edi, %edi
        subl    %edi, %eax
        cvtsi2sdl       %eax, %xmm0
        ret
v4:
        andl    $1, %edi
        movsd   .LC1(%rip), %xmm0
        je      .L8
        movsd   .LC0(%rip), %xmm0
.L8:
        ret
.LC0:
        .long   0
        .long   -1074790400
.LC1:
        .long   0
        .long   1072693248
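
For reference, the two constants can be decoded by hand. The following is
only a sketch: the helper name double_from_longs is made up for this
illustration, and it assumes the little-endian amd64 layout, where the
first .long of each pair is the low half of the 64-bit pattern.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: rebuild a double from the two 32-bit words of a
   .long pair.  Assumes little-endian layout, so the first .long (lo) is
   the low half and the second .long (hi) is the high half.  */
double double_from_longs(int32_t lo, int32_t hi)
{
  uint64_t bits = ((uint64_t) (uint32_t) hi << 32) | (uint32_t) lo;
  double d;
  memcpy (&d, &bits, sizeof d);   /* reinterpret the bit pattern */
  return d;
}
```

Under those assumptions, double_from_longs (0, -1074790400) corresponds to
the pattern 0xBFF0000000000000, i.e. -1.0 (.LC0), and
double_from_longs (0, 1072693248) to 0x3FF0000000000000, i.e. 1.0 (.LC1).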

v1 (what __builtin_powi expands to) and v4 are identical: they use
two loads (latency around 4 to 5 cycles) and a conditional jump
(with the risk of misprediction and extra pressure on the branch
predictor), and the two constants take up data cache space.

v2 has a load plus a two-instruction dependency chain, so 6-7 cycles,
but is branchless.

v3 appears to be best: it is branchless, its first three instructions
can run in parallel, and they are followed by a dependency chain of two
single-cycle instructions plus one conversion, so let's say five cycles.
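
Replacing one form by another relies on all four agreeing for every n,
negative values included (with two's complement, n & 1 is 1 for odd
negative n as well). A minimal sketch of a check follows; the helper
count_mismatches is a name made up here, and the four functions are
simply repeated from above so that the snippet is self-contained.

```c
#include <limits.h>

double v1(int n) { return __builtin_powi (-1.0, n); }
double v2(int n) { return 1.0 - 2.0 * (n & 1); }
double v3(int n) { return (double) (1 - 2 * (n & 1)); }
double v4(int n) { return n & 1 ? -1.0 : 1.0; }

/* Hypothetical helper: count the n in [lo, hi] for which v2, v3 or v4
   differs from v1.  The results are exactly +1.0 or -1.0, so comparing
   with != is safe here.  */
long long count_mismatches(int lo, int hi)
{
  long long bad = 0;
  /* A long long counter avoids signed overflow when hi == INT_MAX.  */
  for (long long i = lo; i <= hi; i++)
    {
      int n = (int) i;
      double ref = v1 (n);
      if (v2 (n) != ref || v3 (n) != ref || v4 (n) != ref)
        bad++;
    }
  return bad;
}
```

Calling count_mismatches (-1000, 1000), or a small window next to INT_MIN
and INT_MAX, is expected to return 0. This only checks the values, not
the generated code or its timing.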

Maybe something for match.pd?
