http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
Bug ID: 60086
Summary: suboptimal asm generated for a loop (store/load false aliasing)
Product: gcc
Version: 4.7.3
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: marcin.krotkiewski at gmail dot com

Created attachment 32060
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32060&action=edit
source code that compiles

Hello,

I am seeing suboptimal performance of the following loop compiled with gcc 4.7.3 (and also with 4.4.7; Ubuntu, full test code attached):

    for(i=0; i<NSIZE; i++){
        a[i] += b[i];
        c[i] += d[i];
    }

The arrays are dynamically allocated and aligned to a page boundary, and are declared with __restrict__ and __attribute__((aligned(32))). I am running on an Intel i7-2620M (Sandy Bridge).

The problem is IMHO related to '4k aliasing'. It happens in the most common case of a/b/c/d starting at a page boundary (e.g., the natural result of malloc). To demonstrate, here is the assembly generated with 'gcc -mtune=native -mavx -O3':

    .L8:
        vmovapd (%rdx,%rdi), %ymm0          #1  load a
        addq    $1, %r8                     #2
        vaddpd  (%rcx,%rdi), %ymm0, %ymm0   #3  load b and add
        vmovapd %ymm0, (%rdx,%rdi)          #4  store a
        vmovapd (%rax,%rdi), %ymm0          #5  load c
        vaddpd  (%rsi,%rdi), %ymm0, %ymm0   #6  load d and add
        vmovapd %ymm0, (%rax,%rdi)          #7  store c
        addq    $32, %rdi                   #8
        cmpq    %r8, %r12                   #9
        ja      .L8                         #10

The 4k aliasing problem is caused by lines 4 and 5 (the store of the result to array a, followed immediately by a load from c). From my tests this seems to be the default behavior for both the AVX and SSE2 instruction sets, and for both vectorized and non-vectorized code.

It is easy to fix the problem by placing the two writes together, at the end of the iteration, e.g.:

    .L8:
        vmovapd (%rdx,%rdi), %ymm1          #1
        addq    $1, %r8                     #2
        vaddpd  (%rcx,%rdi), %ymm1, %ymm1   #3
        vmovapd (%rax,%rdi), %ymm0          #4
        vaddpd  (%rsi,%rdi), %ymm0, %ymm0   #5
        vmovapd %ymm1, (%rdx,%rdi)          #6
        vmovapd %ymm0, (%rax,%rdi)          #7
        addq    $32, %rdi                   #8
        cmpq    %r8, %r12                   #9
        ja      .L8                         #10

In this case the writes happen after all the loads.
The above code is (almost) what ICC generates for this case. For problem sizes small enough to fit in L1 the speedup is roughly 50%.
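
For illustration only, here is a minimal C sketch of the two ideas discussed above. It is not taken from the attached test case, and the names page_offset, may_4k_alias and add_pairs_loads_first are hypothetical. The alias check is simplified: it only compares the low 12 bits of the two starting addresses and ignores access width. The loop expresses the "loads first, stores last" order at source level, but the compiler remains free to schedule the instructions differently, so the generated assembly still has to be inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of an address within its 4 KiB page (the low 12 bits). */
static unsigned page_offset(const void *p)
{
    return (unsigned)((uintptr_t)p & 0xFFFu);
}

/* Simplified check (assumption: starting addresses only, access width
 * ignored): nonzero if a load from load_addr has the same low 12 address
 * bits as an earlier store to store_addr, which is the situation the
 * report calls '4k aliasing'. */
static int may_4k_alias(const void *store_addr, const void *load_addr)
{
    return page_offset(store_addr) == page_offset(load_addr);
}

/* Same arithmetic as the loop in the report, written so that both
 * loads and adds come before both stores in each iteration.
 * Hypothetical helper; the compiler may still reorder the accesses. */
static void add_pairs_loads_first(double *restrict a, const double *restrict b,
                                  double *restrict c, const double *restrict d,
                                  size_t n)
{
    assert(a && b && c && d);
    for (size_t i = 0; i < n; i++) {
        double ta = a[i] + b[i];   /* load a, load b, add */
        double tc = c[i] + d[i];   /* load c, load d, add */
        a[i] = ta;                 /* store a */
        c[i] = tc;                 /* store c */
    }
}
```

With page-aligned arrays, may_4k_alias(&a[i], &c[i]) is true for every i, which matches the case described in the report; offsetting one of the allocations by a cache line or so would be one way to test whether the slowdown goes away.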