http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56676
Bug #: 56676
Summary: unnecesary splitted load when using avx2
Classification: Unclassified
Product: gcc
Version: 4.7.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: nel...@seznam.cz

Compile the notorious example

int foo(int *a, int *b)
{
  int i;
  int r = 0;
  for (i = 0; i < 32; i++)
    r += a[i] * b[i];
  return r;
}

with -O3 -mavx2. GCC generates code that is suboptimal in several ways.
The part relevant to this bug is that a 32-byte load is split into two
16-byte loads:

.L5:
        vmovdqu (%r8,%rdx), %xmm1
        addl    $1, %ecx
        vinserti128     $0x1, 16(%r8,%rdx), %ymm1, %ymm1
        vpmulld (%rbx,%rdx), %ymm1, %ymm1