https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102438

            Bug ID: 102438
           Summary: [x86-64] Failure to optimize out random extra
                    store+load in vector code when memcpy is used
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gabravier at gmail dot com
  Target Milestone: ---

#include <stddef.h>

typedef double simde_float64x1_t __attribute__((__vector_size__(8)));

simde_float64x1_t simde_vabs_f64(simde_float64x1_t a) {
    simde_float64x1_t r;
    r[0] = -a[0];
    return (simde_float64x1_t)r;
}

On AMD64 with -O3, GCC emits:

simde_vabs_f64(double __vector(1)):
        movsd   xmm0, QWORD PTR [rsp+8]
        xorpd   xmm0, XMMWORD PTR .LC0[rip]
        mov     rax, rdi
        movsd   QWORD PTR [rsp-24], xmm0
        mov     rdx, QWORD PTR [rsp-24]
        mov     QWORD PTR [rdi], rdx
        ret

If we instead return `r` directly (without the cast), GCC emits:

simde_vabs_f64(double __vector(1)):
        movsd   xmm0, QWORD PTR [rsp+8]
        xorpd   xmm0, XMMWORD PTR .LC0[rip]
        mov     rax, rdi
        movsd   QWORD PTR [rdi], xmm0
        ret

It seems as though the presence of a cast (to the same type, no less) confuses
GCC into spilling the result into memory.
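
For side-by-side comparison, here is a minimal sketch that puts both variants in one translation unit. The names `negate_with_cast`, `negate_without_cast` and `variants_agree` are hypothetical and do not appear in the original test case; only the typedef and the function bodies are taken from the report. The two functions should be semantically identical, so any difference in the generated assembly (for example when compiling with -O3 -S -masm=intel) comes from how the cast is handled.

```c
#include <stddef.h>

typedef double simde_float64x1_t __attribute__((__vector_size__(8)));

/* Variant from the report: returns r through a cast to its own type. */
simde_float64x1_t negate_with_cast(simde_float64x1_t a) {
    simde_float64x1_t r;
    r[0] = -a[0];
    return (simde_float64x1_t)r;
}

/* Variant without the cast (hypothetical name, same body otherwise). */
simde_float64x1_t negate_without_cast(simde_float64x1_t a) {
    simde_float64x1_t r;
    r[0] = -a[0];
    return r;
}

/* Hypothetical helper: returns 1 when both variants yield the same lane. */
int variants_agree(double x) {
    simde_float64x1_t v = { x };
    simde_float64x1_t with_cast = negate_with_cast(v);
    simde_float64x1_t without_cast = negate_without_cast(v);
    return with_cast[0] == without_cast[0];
}
```

Since both variants compute the same value, the extra store+load in the first one is purely a missed optimization, not a correctness difference.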

The optimized GIMPLE output differs between the two versions, so I don't know
how much of this is target-specific to x86, but I haven't been able to
reproduce it anywhere else, so ¯\_(ツ)_/¯.

PS: The same bug can also be reproduced with -m32.
