> /usr/local/gcc44/bin/gcc -v
[..]
gcc version 4.4.0 20080503 (experimental) (GCC)
> gcc -O3 -mfpmath=sse -fno-pic -fno-tree-vectorize -S himenoBMTxps.c

With -O2/-O3, the inner loop in jacobi() in this program ends up containing
a lot of this:

        movss   _p-4(%edi,%edx,4), %xmm0
        movl    -96(%ebp), %edi
        subss   _p-4(%edi,%edx,4), %xmm0
        movl    -108(%ebp), %edi
        subss   _p-4(%edi,%edx,4), %xmm0
        movl    -92(%ebp), %edi
        addss   _p-4(%edi,%edx,4), %xmm0
        movl    -124(%ebp), %edi

At -O1 or -Os, it instead produces:

        movss   34056(%eax), %xmm0
        subss   33024(%eax), %xmm0
        subss   -33024(%eax), %xmm0
        addss   -34056(%eax), %xmm0

which is much better: every load is addressed off a single register with a
constant displacement, instead of reloading a different index from the stack
before each access. On a Core 2 the program reports being 40% faster at -Os.

IIRC this isn't a problem on x86-64, but IRA + -O3 was much worse again.

-- 
           Summary: bad choice of loop IVs above -Os on x86
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: astrange at ithinksw dot com
GCC target triplet: i?86-*-*

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36127
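
For anyone who wants the two code shapes spelled out in C, here is a rough sketch. It is not the jacobi() source: the grid size, the fill pattern, and all function names are made up for illustration, and the grid is a flat array so that the negative offsets stay well defined. The first function keeps four separately computed offsets, similar in spirit to the -O2/-O3 listing; the second uses one pointer plus constant displacements, similar to the -O1/-Os listing. Both compute the same value, so the difference is only in how many induction variables the compiler has to keep live.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical grid dimensions, NOT the ones used by himenoBMTxps.c. */
enum { NI = 4, NJ = 5, NK = 6 };

/* Flat storage for an NI x NJ x NK grid. */
static float p[NI * NJ * NK];

/* Fill the grid with small integers so float arithmetic is exact. */
void fill_grid(void)
{
    for (size_t n = 0; n < (size_t)(NI * NJ * NK); n++)
        p[n] = (float)((n * n) % 17);
}

/* Shape A: one separately computed row offset per neighbour.
   With few registers, each offset may end up in its own stack slot. */
float cross_term_separate(size_t i, size_t j, size_t k)
{
    size_t off_pp = ((i + 1) * NJ + (j + 1)) * NK;
    size_t off_pm = ((i + 1) * NJ + (j - 1)) * NK;
    size_t off_mp = ((i - 1) * NJ + (j + 1)) * NK;
    size_t off_mm = ((i - 1) * NJ + (j - 1)) * NK;
    return p[off_pp + k] - p[off_pm + k] - p[off_mp + k] + p[off_mm + k];
}

/* Shape B: a single pointer to the centre element; every neighbour is
   reached through a compile-time constant displacement. */
float cross_term_single(size_t i, size_t j, size_t k)
{
    enum { DI = NJ * NK, DJ = NK };
    const float *c = &p[(i * NJ + j) * NK + k];
    return c[DI + DJ] - c[DI - DJ] - c[-DI + DJ] + c[-DI - DJ];
}

/* Check that both shapes agree on every interior point. */
void self_check(void)
{
    fill_grid();
    for (size_t i = 1; i + 1 < NI; i++)
        for (size_t j = 1; j + 1 < NJ; j++)
            for (size_t k = 0; k < NK; k++)
                assert(cross_term_separate(i, j, k) == cross_term_single(i, j, k));
}
```

Writing the source in shape B does not guarantee the second listing, since the induction variable optimization is free to rewrite either form; the sketch is only meant to show what the two assembly listings correspond to.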