https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125919
Bug ID: 125919
Summary: __builtin_powi(-1.0,n) handling
Product: gcc
Version: 17.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tkoenig at gcc dot gnu.org
Target Milestone: ---
Found while looking at PR 125914.
Consider these four equivalent functions, compiled on amd64 with -O3:
double v1(int n)
{
  return __builtin_powi (-1.0, n);
}

double v2(int n)
{
  return 1.0 - 2.0 * (n & 1);
}

double v3(int n)
{
  return (double) (1 - 2 * (n & 1));
}

double v4(int n)
{
  return n & 1 ? -1.0 : 1.0;
}
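
The claim that the four variants are equivalent can be checked with a small harness. This is only a sketch: check_powi_variants is a hypothetical helper, not part of the report, and it assumes a compiler that provides __builtin_powi (GCC or Clang). All results are exactly +1.0 or -1.0, so comparing with == is safe here.

```c
#include <assert.h>
#include <limits.h>

/* The four variants from the report, unchanged. */
double v1(int n) { return __builtin_powi (-1.0, n); }
double v2(int n) { return 1.0 - 2.0 * (n & 1); }
double v3(int n) { return (double) (1 - 2 * (n & 1)); }
double v4(int n) { return n & 1 ? -1.0 : 1.0; }

/* Hypothetical helper (not from the report): assert that all four
   variants agree for every n in [lo, hi].  The loop counter is a
   long long so that hi == INT_MAX cannot overflow it.  */
void check_powi_variants(int lo, int hi)
{
  for (long long i = lo; i <= hi; ++i)
    {
      int n = (int) i;
      double expected = v4 (n);
      assert (v1 (n) == expected);
      assert (v2 (n) == expected);
      assert (v3 (n) == expected);
    }
}
```

Calling check_powi_variants (-1000, 1000), plus short ranges next to INT_MIN and INT_MAX, covers negative exponents and the extremes, where n & 1 still yields the parity on a two's complement target.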
With a relatively recent trunk, the output of "gcc -O3 -S powi.c" is
v1:
	andl $1, %edi
	movsd .LC1(%rip), %xmm0
	je .L3
	movsd .LC0(%rip), %xmm0
.L3:
	ret
v2:
	andl $1, %edi
	pxor %xmm0, %xmm0
	movsd .LC1(%rip), %xmm1
	cvtsi2sdl %edi, %xmm0
	addsd %xmm0, %xmm0
	subsd %xmm0, %xmm1
	movapd %xmm1, %xmm0
	ret
v3:
	andl $1, %edi
	movl $1, %eax
	pxor %xmm0, %xmm0
	addl %edi, %edi
	subl %edi, %eax
	cvtsi2sdl %eax, %xmm0
	ret
v4:
	andl $1, %edi
	movsd .LC1(%rip), %xmm0
	je .L8
	movsd .LC0(%rip), %xmm0
.L8:
	ret
.LC0:
	.long 0
	.long -1074790400
.LC1:
	.long 0
	.long 1072693248
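
For readers decoding the constant pool: each constant is a double stored as two 32-bit words, low word first. A minimal sketch, assuming a little-endian target with 64-bit IEEE doubles such as amd64 (double_from_longs is a hypothetical helper, not something GCC provides), shows that .LC0 is -1.0 and .LC1 is 1.0:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: rebuild a double from the two .long values
   emitted by the compiler, low word first.  Assumes doubles are
   64-bit IEEE values with the same byte order as integers.  */
double double_from_longs(int32_t lo, int32_t hi)
{
  uint64_t bits = ((uint64_t) (uint32_t) hi << 32) | (uint32_t) lo;
  double d;

  assert (sizeof d == sizeof bits);
  memcpy (&d, &bits, sizeof d);  /* Reinterpret the bit pattern.  */
  return d;
}
```

With the values from the listing, double_from_longs (0, -1074790400) corresponds to the high word 0xBFF00000 and double_from_longs (0, 1072693248) to 0x3FF00000; the two differ only in the sign bit.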
v1 (what __builtin_powi expands to) and v4 are identical: each contains
two loads from the constant pool (latency around 4 to 5 cycles, the
second one executed only when n is odd) and a conditional jump
(with the danger of misprediction and added load on the branch
predictors), plus cache use for the constants.
v2 has a load plus a two-instruction dependency chain, so 6-7 cycles,
but is branchless.
v3 appears to be best: it is branchless, its first three instructions
can run in parallel, and they are followed by a dependency chain of two
single-cycle instructions plus one conversion, so let's say five cycles.
Maybe something for match.pd?