https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124384
Bug ID: 124384
Summary: [14/15/16 Regression] hot path is slowed down when the
cold return path is merged into it due to early-ra
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Keywords: missed-optimization, ra
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
CC: nsz at gcc dot gnu.org
Target Milestone: ---
Target: aarch64
I see suboptimal code generation for
float foo (float x)
{
  if (__builtin_expect (x > 0, 0))
    if (x>2) return 0;
  return x*x;
}
because merging the cold return path into the hot one causes an extra register move in the hot path:
https://godbolt.org/g/AZxxrR
aarch64 at -O2:
foo:
fcmpe s0, #0.0
bgt .L8
.L2:
fmul s1, s0, s0
.L1:
fmov s0, s1 // extra reg move
ret
.p2align 3
.L8:
fmov s2, 2.0e+0
movi v1.2s, #0
fcmpe s0, s2
ble .L2
b .L1 // need not jmp back
With -O2 -mearly-ra=none (and GCC 9-13 at -O2) we get:
```
foo(float):
.LFB0:
.cfi_startproc
fcmpe s0, #0.0
fmov s31, s0
bgt .L6
fmul s0, s31, s31
.L1:
ret
.p2align 2,,3
.L6:
fmov s30, 2.0e+0
movi v0.2s, #0
fcmpe s31, s30
bgt .L1
fmul s0, s31, s31
b .L1
```
The only thing missing there is that the final b .L1 should be turned into a ret.
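
For anyone reproducing this, here is a sketch of a self-contained version of the test case with a small semantic check for the three paths (hot path, cold path returning x*x, cold path returning 0). The noinline attribute and the check_foo helper are additions for illustration and are not part of the original test case; they only exist so the calls are not folded away.

```c
#include <assert.h>
#include <stdio.h>

/* Test case from the report.  noinline is an addition (not in the
   original) so that the calls in check_foo are not constant-folded.  */
__attribute__ ((noinline)) float
foo (float x)
{
  if (__builtin_expect (x > 0, 0))
    if (x > 2)
      return 0;
  return x * x;
}

/* Hypothetical helper: exercises each path once and returns 0 if all
   checks pass.  The values are chosen so the results are exact.  */
int
check_foo (void)
{
  /* Hot path: x <= 0 falls straight through to x*x.  */
  assert (foo (-3.0f) == 9.0f);
  assert (foo (0.0f) == 0.0f);
  /* Cold path with 0 < x <= 2: still returns x*x.  */
  assert (foo (1.5f) == 2.25f);
  assert (foo (2.0f) == 4.0f);
  /* Cold path with x > 2: returns 0.  */
  assert (foo (4.0f) == 0.0f);
  puts ("ok");
  return 0;
}
```

Calling check_foo from main and compiling on aarch64 once with -O2 and once with -O2 -mearly-ra=none (adding -S to inspect the assembly) should allow the two code sequences for foo to be compared.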