[Bug target/124384] New: [14/15/16 Regression] hot path is slowed down when the cold return path is merged into it due to early-ra

pinskia at gcc dot gnu.org via Gcc-bugs Thu, 05 Mar 2026 18:33:48 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124384


            Bug ID: 124384
           Summary: [14/15/16 Regression] hot path is slowed down when the
                    cold return path is merged into it due to early-ra
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, ra
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
                CC: nsz at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

I see suboptimal code generation for:

```
float foo (float x)
{
  if (__builtin_expect (x > 0, 0))
    if (x>2) return 0;
  return x*x;
}
```

because merging the return paths causes an extra register move in the hot path:
https://godbolt.org/g/AZxxrR

aarch64 at -O2:

```
foo:
        fcmpe   s0, #0.0
        bgt     .L8
.L2:
        fmul    s1, s0, s0
.L1:
        fmov    s0, s1   // extra reg move
        ret
        .p2align 3
.L8:
        fmov    s2, 2.0e+0
        movi    v1.2s, #0
        fcmpe   s0, s2
        ble     .L2
        b       .L1    // need not jump back
```

With -O2 -mearly-ra=none (and GCC 9-13 at -O2) we get:
```
foo(float):
.LFB0:
        .cfi_startproc
        fcmpe   s0, #0.0
        fmov    s31, s0
        bgt     .L6
        fmul    s0, s31, s31
.L1:
        ret
        .p2align 2,,3
.L6:
        fmov    s30, 2.0e+0
        movi    v0.2s, #0
        fcmpe   s31, s30
        bgt     .L1
        fmul    s0, s31, s31
        b       .L1
```

The only thing missing is that the final "b .L1" should be turned into a ret.
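
For anyone comparing the two code shapes locally, here is a minimal sketch of the testcase with a small self-check attached. The function foo is copied from the report; foo_selfcheck and its input values are hypothetical additions for illustration only. It assumes GCC (or a compiler that provides __builtin_expect), and it only checks that the results are the same on the hot and cold paths; it says nothing about the generated code.

```c
#include <assert.h>

/* Testcase from the report: the x > 0 branch is marked as unlikely,
   so "return x*x" for x <= 0 is the hot path.  */
float foo (float x)
{
  if (__builtin_expect (x > 0, 0))
    if (x > 2) return 0;
  return x * x;
}

/* Hypothetical helper (not part of the report): exercises each path.
   All expected values are exactly representable as float.  */
void foo_selfcheck (void)
{
  assert (foo (-1.5f) == 2.25f);  /* hot path: x <= 0 */
  assert (foo (0.0f) == 0.0f);    /* hot path: x == 0 */
  assert (foo (0.5f) == 0.25f);   /* cold path, falls through to x*x */
  assert (foo (2.0f) == 4.0f);    /* cold path, x > 2 is false */
  assert (foo (3.0f) == 0.0f);    /* cold path, early return 0 */
}
```

Compiling this file for aarch64 with -S, once at -O2 and once at -O2 -mearly-ra=none, should allow the assembly emitted for foo to be compared against the two listings above.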

