[Bug tree-optimization/123603] [16 Regression] 13% slowdown of exchange2_r on Zen4 since r16-6767-g948d33f490a6b0

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 15 Jan 2026 05:02:15 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123603


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2026-01-15
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The following is the opt-info difference w/o PGO and LTO, a configuration in
which I cannot reproduce a difference in runtime.

+exchange2.fppized.f90:1003:32: optimized: basic block part vectorized using 8 byte vectors
+exchange2.fppized.f90:1003:32: optimized: basic block part vectorized using 8 byte vectors
+exchange2.fppized.f90:1003:32: optimized: basic block part vectorized using 8 byte vectors
+exchange2.fppized.f90:1003:32: optimized: basic block part vectorized using 8 byte vectors

the location looks like a fallback (function-scope) one.

I can't reproduce with -march=x86-64-v3 {-O2, -O3, -O3 -flto}, so it seems
PGO is required :/

The opt-info difference for -O2 -march=x86-64-v3 -flto with PGO (training with
the train set) is

-exchange2.fppized.f90:1207:71: optimized: loop vectorized using 8 byte vectors and unroll factor 1
+exchange2.fppized.f90:1207:71: optimized: loop vectorized using 16 byte vectors and unroll factor 2
+exchange2.fppized.f90:1207:71: optimized: epilogue loop vectorized using 8 byte vectors and unroll factor 1
...

this is in the notoriously critical digits_2 subroutine.  Let me see
if there's something obviously wrong now (I fear not).
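As a rough illustration of what the two PGO opt-info variants mean for iteration counts, here is a simplified model. It assumes that the reported "unroll factor" is the vectorization factor (scalar iterations handled per vector iteration) and ignores alignment peeling, versioning and the cost model. The helper `split_iterations` and the trip count of 9 are hypothetical; the actual trip count of the loop at line 1207 is not given above.

```python
def split_iterations(trip_count: int, main_vf: int, epilogue_vf: int) -> tuple[int, int, int]:
    """Return (main vector iterations, epilogue vector iterations, scalar leftovers).

    Simplified model (assumption): no alignment peeling or versioning, and the
    vectorized epilogue handles as much of the remainder as its factor allows.
    """
    main_iters = trip_count // main_vf             # full vector iterations of the main loop
    remainder = trip_count - main_iters * main_vf  # scalar iterations left over
    epi_iters = remainder // epilogue_vf           # iterations covered by the vector epilogue
    scalar_left = remainder - epi_iters * epilogue_vf
    return main_iters, epi_iters, scalar_left

# Before: 8 byte vectors, unroll factor 1 (hypothetical trip count 9).
print(split_iterations(9, 1, 1))  # (9, 0, 0)
# After: 16 byte vectors with unroll factor 2, plus 8 byte epilogue with factor 1.
print(split_iterations(9, 2, 1))  # (4, 1, 0)
```

Under this model the new code runs half as many main-loop iterations but enters the vector epilogue whenever the trip count is odd. The comment above does not establish whether that is related to the slowdown.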
