https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557
            Bug ID: 89557
           Summary: [7/8 regression] 4*movq to 2*movaps IPC performance
                    regression on znver1
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: 0xe2.0x9a.0x9b at gmail dot com
  Target Milestone: ---

Approximate C++ source code:

  struct __attribute__((aligned(16))) A {
    union {
      struct { uint64_t a; double b; };
      uint64_t data[2];
    };
  };

  A a;
  a.a = 2;
  a.b = x*y;
  return a;

CPU: AMD Ryzen 5 1600 Six-Core Processor

GCC 7.4.0 generates (no -march/mtune):

  movq   $2, 0x80(%rsp)
  movsd  %xmm0, 0x88(%rsp)
  mov    0x80(%rsp), %rax
  mov    0x88(%rsp), %rdx
  mov    %rax, 0x30(%rsp)
  mov    %rdx, 0x38(%rsp)

GCC 7.4.0 generates (no -march, -mtune=native):

  movq   $2, 0x80(%rsp)
  movsd  %xmm0, 0x88(%rsp)
  movaps 0x80(%rsp), %xmm6
  movaps %xmm6, 0x30(%rsp)

GCC 8.2.0 generates (no -march/mtune):

  movq   $2, 0x80(%rsp)
  movsd  %xmm0, 0x88(%rsp)
  movdqa 0x80(%rsp), %xmm6
  movaps %xmm6, 0x30(%rsp)

GCC 8.2.0 generates (no -march, -mtune=native):

  movq   $2, 0x80(%rsp)
  movsd  %xmm0, 0x88(%rsp)
  movaps 0x80(%rsp), %xmm6
  movaps %xmm6, 0x30(%rsp)

IPC of an executable which uses the above code (perf stat):

GCC 7.4.0 (no -march/mtune):

      617.233116  task-clock (msec)  #  0.997 CPUs utilized
   4,139,124,553  instructions       #  1.94  insn per cycle

GCC 7.4.0 (no -march, -mtune=native):

     1106.252920  task-clock (msec)  #  1.000 CPUs utilized
   3,995,268,509  instructions       #  1.02  insn per cycle

GCC 8.2.0 (no -march/mtune):

     1096.852485  task-clock (msec)  #  1.000 CPUs utilized
   3,790,839,401  instructions       #  0.97  insn per cycle

GCC 8.2.0 (no -march, -mtune=native):

     1105.693441  task-clock (msec)  #  1.000 CPUs utilized
   4,041,957,928  instructions       #  1.04  insn per cycle

Summary: Using 2*movaps instead of 4*movq severely lowers IPC on znver1 CPUs
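
For anyone who wants to inspect the generated code themselves, the approximate source in this report could be fleshed out into a compilable file along the following lines. This is only a sketch: the function names make and run_loop, the noinline attributes, and the loop body are assumptions made for illustration and are not taken from the original program. It may therefore not produce exactly the instruction sequences quoted in this report.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layout taken from the report: two 8-byte fields in a 16-byte-aligned struct.
// The anonymous struct inside the union is a GNU extension accepted by GCC.
struct __attribute__((aligned(16))) A {
    union {
        struct { std::uint64_t a; double b; };
        std::uint64_t data[2];
    };
};

static_assert(sizeof(A) == 16, "A is expected to occupy 16 bytes");
static_assert(alignof(A) == 16, "A is expected to be 16-byte aligned");

// Hypothetical producer mirroring the fragment in the report.
// noinline (an assumption) keeps the by-value return visible in the assembly.
__attribute__((noinline)) A make(double x, double y) {
    A a;
    a.a = 2;
    a.b = x * y;
    return a;
}

// Hypothetical driver: builds and copies many A values so that the
// copy sequence is executed often enough to show up in perf stat.
__attribute__((noinline)) double run_loop(std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        A tmp = make(static_cast<double>(i), 0.5);
        assert(tmp.a == 2);  // sanity check: the integer field survives the copy
        sum += tmp.b + static_cast<double>(tmp.a);
    }
    return sum;
}
```

One possible way to use it: call run_loop from a small main, compile with g++ -O2 -S once without any -mtune option and once with -mtune=native, compare how the 16-byte copy of A is emitted, and then run the resulting executables under perf stat to compare the insn per cycle figures.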