https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102438
Bug ID: 102438
Summary: [x86-64] Failure to optimize out random extra store+load in vector code when memcpy is used
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: gabravier at gmail dot com
Target Milestone: ---

#include <stddef.h>

typedef double simde_float64x1_t __attribute__((__vector_size__(8)));

simde_float64x1_t simde_vabs_f64(simde_float64x1_t a) {
    simde_float64x1_t r;
    r[0] = -a[0];
    return (simde_float64x1_t)r;
}

On AMD64 with -O3, this is output:

simde_vabs_f64(double __vector(1)):
        movsd   xmm0, QWORD PTR [rsp+8]
        xorpd   xmm0, XMMWORD PTR .LC0[rip]
        mov     rax, rdi
        movsd   QWORD PTR [rsp-24], xmm0
        mov     rdx, QWORD PTR [rsp-24]
        mov     QWORD PTR [rdi], rdx
        ret

If we instead just return `r` (without the cast), this is output instead:

simde_vabs_f64(double __vector(1)):
        movsd   xmm0, QWORD PTR [rsp+8]
        xorpd   xmm0, XMMWORD PTR .LC0[rip]
        mov     rax, rdi
        movsd   QWORD PTR [rdi], xmm0
        ret

It seems as though the presence of a cast (to the same type, no less) confuses GCC into spilling the result through a stack slot. The optimized GIMPLE output differs between the two versions, so I don't know how much of this is specific to the x86 target, but I haven't been able to reproduce it anywhere else.

PS: The same bug can also be reproduced with -m32.