http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
--- Comment #7 from Uros Bizjak <ubizjak at gmail dot com> 2011-11-22 22:00:36 UTC ---
(In reply to comment #3)
> Your testcase doesn't resemble the original, the inner for cycles need
> clearing of the iteration variable.

Ah, indeed... fingers were too fast.

One additional data point with -O2 -ftree-vectorize -mfma4 -mavx with all
loops:

        movslq  %r8d, %rax
        movl    $C+32, %edx
        xorl    %esi, %esi
        leaq    B(,%rax,8), %rcx
        movl    $C, %eax
.L3:
>>      vmovsd  80(%rcx), %xmm1
        addl    $2, %esi
        vmovapd A(%rdi), %ymm0
>>      vmovddup        %xmm1, %xmm1
        vbroadcastsd    (%rcx), %ymm2
        addq    $160, %rcx
>>      vinsertf128     $1, %xmm1, %ymm1, %ymm1
        vfmaddpd        (%rax), %ymm2, %ymm0, %ymm2
        vmovapd %ymm2, (%rax)
        addq    $64, %rax
        vfmaddpd        (%rdx), %ymm1, %ymm0, %ymm0
        vmovapd %ymm0, (%rdx)
        addq    $64, %rdx
        cmpl    $10, %esi
        jne     .L3

The three instructions marked with ">>" could be just
"vbroadcastsd 80(%rcx), %ymm1". For some reason the combine pass does not
form it.
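To illustrate why the vmovsd / vmovddup / vinsertf128 sequence is equivalent to a single vbroadcastsd, here is a rough lane-level sketch. It is only a toy model: a ymm register is represented as a list of four doubles (lane 0 being the lowest 64 bits), and all function names are made up for this illustration. It is neither GCC code nor the testcase from this PR.

```python
# Toy lane-level model of the AVX instructions discussed above.
# Assumption: a 256-bit ymm register is a list of four doubles,
# lane 0 = bits 63:0. The xmm register is the low two lanes.
# All names here are hypothetical helpers, not a real API.

def vmovsd_load(value):
    """vmovsd mem, %xmm: scalar goes to lane 0, all other lanes are zeroed."""
    return [value, 0.0, 0.0, 0.0]

def vmovddup_xmm(src):
    """vmovddup %xmm, %xmm: copy lane 0 into lanes 0 and 1; upper half zeroed."""
    return [src[0], src[0], 0.0, 0.0]

def vinsertf128_hi(xmm, ymm):
    """vinsertf128 $1, %xmm, %ymm, %ymm: keep the low half of ymm and
    place the low 128 bits of xmm into the high half."""
    return [ymm[0], ymm[1], xmm[0], xmm[1]]

def vbroadcastsd(value):
    """vbroadcastsd mem, %ymm: scalar replicated into all four lanes."""
    return [value] * 4

def three_insn_sequence(value):
    """The sequence emitted in the loop: load, duplicate, insert."""
    reg = vmovsd_load(value)
    reg = vmovddup_xmm(reg)
    # Source and destination are the same register (%ymm1) in the listing.
    return vinsertf128_hi(reg, reg)
```

In this model, three_insn_sequence(x) and vbroadcastsd(x) produce the same four lanes for any x, which is the equivalence one would expect the combine pass to recognize.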