http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54422
Bug #: 54422
Summary: Merge adjacent stores of elements of a vector (or loads)
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: gli...@gcc.gnu.org
Target: x86_64-linux-gnu

Hello,

#include <x86intrin.h>

void f1(__m128d *dd, __m128d e) {
  double *d = (double *)dd;
  d[0] = e[0];
  d[1] = e[1];
}

void f2(__m128d *dd, __m128d e) {
  _mm_storeu_pd((double *)dd, e);
}

void f3(__m128d *dd, __m128d e) {
  __builtin_memcpy(dd, &e, 16);
}

For this code, gcc -O3 -mavx2 generates:

for f2:
        vmovupd %xmm0, (%rdi)

(it could possibly have guessed that the alignment was right, but I don't
mind today)

for f1:
        vmovlpd %xmm0, (%rdi)
        vmovhpd %xmm0, 8(%rdi)

(this is my main issue: could it merge those into a single vmovupd?)

for f3:
        vmovdqa %xmm0, -40(%rsp)
        movq    -40(%rsp), %rax
        vmovapd %xmm0, -24(%rsp)
        movq    %rax, (%rdi)
        movq    -16(%rsp), %rax
        movq    %rax, 8(%rdi)

(I hope the SSE memcpy patch at
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00336.html will eventually help
with that)

At the tree level, for f1, we have:

  _3 = BIT_FIELD_REF <e_5(D), 64, 0>;
  MEM[(double *)dd_1(D)] = _3;
  _6 = BIT_FIELD_REF <e_5(D), 64, 64>;
  MEM[(double *)dd_1(D) + 8B] = _6;

Merging those two stores looks like it might be possible (though I am not
familiar with that part of the compiler; maybe only the backend can handle
it).

Note that I am interested in both the aligned and unaligned cases (i.e. if
f1 takes a double* argument instead of a __m128d*), and in both loads and
stores; a sketch of those variants follows below.

The most relevant other bugs I found are PR 41464, PR 23684, and PR 47059.
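
For concreteness, here is a minimal sketch of those unaligned and load
variants (the names f1u, g1 and g2 are mine and not part of the original
testcase; they exhibit the same element-wise accesses):

#include <x86intrin.h>

/* Unaligned store variant: the same element-wise stores as f1, but through
   a plain double*, so only 8-byte alignment may be assumed and the merged
   store would have to be an unaligned one (vmovupd). */
void f1u(double *d, __m128d e) {
  d[0] = e[0];
  d[1] = e[1];
}

/* Element-wise load variant: builds the vector from two adjacent doubles;
   the hoped-for output is a single vmovupd load. */
__m128d g1(const double *d) {
  return (__m128d){ d[0], d[1] };
}

/* Reference version using the dedicated unaligned-load intrinsic,
   analogous to f2 on the store side. */
__m128d g2(const double *d) {
  return _mm_loadu_pd(d);
}

Presumably g1 shows the analogous missed merge on the load side; the desired
output for both f1u and g1 is the single vmovupd that f2 and g2 already get.
All of the above can be inspected with gcc -O3 -mavx2 -S.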