https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47582
--- Comment #2 from Tony Poppleton <tony.poppleton at gmail dot com> ---
Re-testing this with GCC 5.1, the code appears to be even less efficient in both cases.

DM1:

.LFB0:
    .cfi_startproc
    movss   b(%rip), %xmm0
    xorl    %eax, %eax
    movss   %xmm0, a(%rip)
    movss   b+4(%rip), %xmm0
    movss   %xmm0, a+4(%rip)
    movss   b+8(%rip), %xmm0
    movss   %xmm0, a+8(%rip)
    movss   b+12(%rip), %xmm0
    movss   %xmm0, a+12(%rip)
    movss   b+16(%rip), %xmm0
    movss   %xmm0, a+16(%rip)
    ret
    .cfi_endproc

Second case:

.LFB0:
    .cfi_startproc
    movq    b(%rip), %rax
    movq    %rax, a(%rip)
    movq    b+8(%rip), %rax
    movq    %rax, a+8(%rip)
    movl    b+16(%rip), %eax
    movl    %eax, a+16(%rip)
    xorl    %eax, %eax
    ret
    .cfi_endproc

Why does the "xorl" appear in both cases? Should this be logged as a separate bug?

Incidentally, compiling with -O1 on GCC 5.1 produces the same code that -O2 produced on older GCCs (as in the description comment above).

This is a complete guess, but it may be due to a and b not having any initial values, with an optimization that takes value ranges into account getting confused.
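For anyone wanting to check the initial-value guess, here is a minimal sketch of a reproducer. It is hypothetical: the struct, the names copy_floats and copies_match, and the INIT macro are assumptions, not the test case from the description. It copies a 20-byte global of five floats (matching the five movss load/store pairs in the first listing). Compiling it twice, e.g. gcc -O2 -S repro.c and gcc -O2 -S -DINIT repro.c, and diffing the .s files would show whether giving a and b initial values changes the output.

```c
/* Hypothetical reproducer (an assumption, not the original test case):
 * two 20-byte globals copied by struct assignment.
 * Build once plainly and once with -DINIT to compare the generated code. */
#include <string.h>

struct five_floats { float v[5]; };   /* 5 x 4 bytes = 20 bytes */

#ifdef INIT
/* Variant with explicit initial values, to test the guess above. */
struct five_floats a = {{0}};
struct five_floats b = {{1.0f, 2.0f, 3.0f, 4.0f, 5.0f}};
#else
/* Variant with no initializers (zero-initialized globals). */
struct five_floats a, b;
#endif

/* Copies b into a. It returns void, so any xorl %eax, %eax in its
 * output could not be setting a return value. */
void copy_floats(void)
{
    a = b;
}

/* Helper for checking the copy independently of the generated code:
 * returns 1 if a and b hold identical bytes. */
int copies_match(void)
{
    return memcmp(&a, &b, sizeof a) == 0;
}
```

The function deliberately returns void so that the "xorl" question and the initial-value question can be looked at separately.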