https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557

            Bug ID: 89557
           Summary: [7/8 regression] 4*movq to 2*movaps IPC performance
                    regression on znver1
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: 0xe2.0x9a.0x9b at gmail dot com
  Target Milestone: ---

Approximate C++ source code:

  struct __attribute__((aligned(16))) A {
    union {
      struct {
        uint64_t a;
        double b;
      };
      uint64_t data[2];
    };
  };

  A a;
  a.a = 2;
  a.b = x*y;
  return a;
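
The fragment above is not compilable as written. A minimal
self-contained sketch is given here, assuming <cstdint> for uint64_t
and a hypothetical wrapper function make_a (neither appears in the
report). It is not guaranteed to produce the same stack-to-stack copy
as the original program, since the surrounding code that forces the
copy is not shown:

```cpp
#include <cstdint>

// Anonymous struct inside a union is a GNU extension accepted by g++.
struct __attribute__((aligned(16))) A {
  union {
    struct {
      uint64_t a;
      double b;
    };
    uint64_t data[2];
  };
};

static_assert(sizeof(A) == 16, "A is expected to be one 16-byte slot");
static_assert(alignof(A) == 16, "A is expected to be 16-byte aligned");

// make_a is a hypothetical name; noinline keeps the two 8-byte stores
// and the following copy of the object visible in the generated code.
__attribute__((noinline)) A make_a(double x, double y) {
  A a;
  a.a = 2;      // 8-byte integer store
  a.b = x * y;  // 8-byte floating-point store
  return a;     // 16-byte copy of the whole object
}
```

Comparing the output of g++ -O2 -S with and without -mtune=native
should show whether the copy is done with movq or movaps in a given
context.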

CPU: AMD Ryzen 5 1600 Six-Core Processor

GCC 7.4.0 generates (no -march/mtune):

  movq $2, 0x80(%rsp)
  movsd %xmm0, 0x88(%rsp)
  mov 0x80(%rsp), %rax
  mov 0x88(%rsp), %rdx
  mov %rax, 0x30(%rsp)
  mov %rdx, 0x38(%rsp)

GCC 7.4.0 generates (no -march, -mtune=native):

  movq $2, 0x80(%rsp)
  movsd %xmm0, 0x88(%rsp)
  movaps 0x80(%rsp), %xmm6
  movaps %xmm6, 0x30(%rsp)

GCC 8.2.0 generates (no -march/mtune):

  movq $2, 0x80(%rsp)
  movsd %xmm0, 0x88(%rsp)
  movdqa 0x80(%rsp), %xmm6
  movaps %xmm6, 0x30(%rsp)

GCC 8.2.0 generates (no -march, -mtune=native):

  movq $2, 0x80(%rsp)
  movsd %xmm0, 0x88(%rsp)
  movaps 0x80(%rsp), %xmm6
  movaps %xmm6, 0x30(%rsp)

IPC of an executable which uses the above code (perf stat):

  GCC 7.4.0 (no -march/mtune):
        617.233116      task-clock (msec)         #    0.997 CPUs utilized 
     4,139,124,553      instructions              #    1.94  insn per cycle

  GCC 7.4.0 (no -march, -mtune=native):
       1106.252920      task-clock (msec)         #    1.000 CPUs utilized      
     3,995,268,509      instructions              #    1.02  insn per cycle

  GCC 8.2.0 (no -march/mtune):
       1096.852485      task-clock (msec)         #    1.000 CPUs utilized
     3,790,839,401      instructions              #    0.97  insn per cycle

  GCC 8.2.0 (no -march, -mtune=native):
       1105.693441      task-clock (msec)         #    1.000 CPUs utilized     
     4,041,957,928      instructions              #    1.04  insn per cycle

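Summary: Copying the 16-byte object with a movaps/movdqa load plus a
movaps store instead of 4*movq roughly halves IPC on znver1 (1.94 vs.
about 1.0 insn per cycle) and nearly doubles run time (about 617 ms
vs. about 1100 ms), although the instruction count is slightly lower.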
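
The perf stat figures can be cross-checked by deriving the cycle count
from instructions and IPC. The helper names below (implied_cycles,
implied_ghz) are hypothetical and only illustrate the arithmetic:

```cpp
#include <cstdint>

// Cycles implied by a perf stat line: instructions / (insn per cycle).
double implied_cycles(std::uint64_t instructions, double ipc) {
  return static_cast<double>(instructions) / ipc;
}

// Average clock in GHz: cycles / (task-clock in msec * 1e6).
double implied_ghz(double cycles, double task_clock_ms) {
  return cycles / (task_clock_ms * 1e6);
}
```

For GCC 7.4.0 this gives about 2.13e9 cycles without -mtune
(4,139,124,553 / 1.94) and about 3.92e9 cycles with -mtune=native
(3,995,268,509 / 1.02), a ratio of roughly 1.8. That matches the
task-clock ratio (1106 ms / 617 ms), so the slowdown comes from extra
cycles per instruction and not from a change in clock frequency.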
