For the following test case, prefetches will be inserted for both the load and store of a[i] if the loop is vectorized:

float a[1024], b[1024];

void foo(int beta)
{
  int i;
  for(i=0; i<1024; i++)
    a[i] = a[i] + beta * b[i];
}

With gcc -O3 -fprefetch-loop-arrays -march=amdfam10 -S, a piece of the assembly is:

        movaps  (%rcx), %xmm0
        addl    $4, %edi
        prefetcht0      (%rdx)
        prefetcht0      240(%rcx)
        prefetchw       (%rdx)
        leaq    64(%rax), %rsi
        mulps   %xmm1, %xmm0

If we don't vectorize the loop, we only generate prefetch for the load a[i]:

        addl    $16, %eax
        salq    $2, %rcx
        mulss   %xmm1, %xmm0
        prefetcht0      a+92(%rcx)
        prefetcht0      b+92(%rcx)
        movl    %esi, %ecx

-- 
           Summary: Redundant prefetches for the vectorized loop
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: changpeng dot fang at amd dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45021
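
For comparison with the vectorized output above, here is a rough hand-written sketch of the same loop that issues only one prefetch per array per cache line: a write prefetch for a[] (which also covers the later load of a[i]) and a read prefetch for b[]. It is not what the compiler generates and not a proposed fix. The names foo_prefetch, FLOATS_PER_LINE and PREFETCH_AHEAD are hypothetical, and the 64-byte cache line and the prefetch distance are assumptions chosen for illustration only.

```c
#include <stddef.h>

#define N 1024
#define FLOATS_PER_LINE 16  /* assumption: 64-byte cache line, 4-byte float */
#define PREFETCH_AHEAD  64  /* hypothetical prefetch distance, in elements */

float a[N], b[N];

/* Same computation as foo() in the report, with explicit prefetches. */
void foo_prefetch(int beta)
{
  int i, j;

  /* Walk the arrays one assumed cache line at a time (N is a multiple
     of FLOATS_PER_LINE, so no remainder loop is needed). */
  for (i = 0; i < N; i += FLOATS_PER_LINE)
    {
      if (i + PREFETCH_AHEAD < N)
        {
          /* rw = 1: a[] is read and then written, so a single write
             prefetch is requested instead of a read plus a write.  */
          __builtin_prefetch (&a[i + PREFETCH_AHEAD], 1, 3);
          /* rw = 0: b[] is only read.  */
          __builtin_prefetch (&b[i + PREFETCH_AHEAD], 0, 3);
        }

      for (j = i; j < i + FLOATS_PER_LINE; j++)
        a[j] = a[j] + beta * b[j];
    }
}
```

Compiling this with the same -O3 -march=amdfam10 -S options and looking at the prefetch instructions in the inner loop should make it easier to see which of the three prefetches in the vectorized output is the redundant one; the exact instructions chosen for rw = 1 depend on the target.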