When comparing the code ICC outputs with gcc's, it's really obvious that gcc could make better use of x86's baroque addressing modes. More specifically, I rarely ever see it use the *8 scale factor, even when addressing nicely power-of-2-sized data, and that's definitely a performance problem when dealing with large SSE vectors.
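To make the addressing point concrete, here is a minimal sketch in C. It is not the testcase from this report: the names load8, load8_manual, byte_offset48 and elem48_t are hypothetical, and the 48-byte element is an assumption picked only to match the 0x30 multiplier quoted further down. It shows what address arithmetic is involved, not what any particular gcc version emits.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 48-byte element (three 16-byte SSE-sized chunks). */
typedef struct { float v[12]; } elem48_t;

/* 8-byte elements: the address is base + i*8, which is exactly what a
   single x86 memory operand of the form (base,index,8) can express. */
double load8(const double *base, size_t i)
{
    return base[i];
}

/* The same access with the byte arithmetic written out by hand. */
double load8_manual(const double *base, size_t i)
{
    double out;
    memcpy(&out, (const unsigned char *)base + i * 8, sizeof out);
    return out;
}

/* 48 is not a valid scale factor (only 1, 2, 4 and 8 are), but
   48 = 3 * 16, so i*48 can be built as (i + i*2) << 4. On x86 the
   i + i*2 part fits one lea, so no imul is strictly required. */
size_t byte_offset48(size_t i)
{
    return (i + i * 2) << 4;
}
```

With this, load8(a, 2) and load8_manual(a, 2) read the same element, and byte_offset48(i) equals i * sizeof(elem48_t) on targets where float is 4 bytes.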
In the testcase the *8 scale factor is used only once, and even if it's questionable whether a fancier mode would help in this particular testcase, there's no doubt it would in the Real World. Also, take note of the horrible code for massage48; in the Real World it's even worse:

  4012a6: mov $0x30,%edx
  4012af: imul %edx,%eax

That's not from the testcase; it's in a loop, and edx gets reloaded each time.

Tested with today's cvs and something like
  -O3 -march=k8 -fomit-frame-pointer -mfpmath=sse
and
  -O3 -march=pentium4 -fomit-frame-pointer -mfpmath=sse

--
Summary: [missed-optimization] gcc4 is really reluctant to use fancy x86 addressing modes
Product: gcc
Version: 4.0.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P2
Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: tbptbp at gmail dot com
CC: gcc-bugs at gcc dot gnu dot org
GCC host triplet: cygwin

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19680