------- Comment #6 from ubizjak at gmail dot com 2008-11-17 18:11 ------- I think that
        addps   .LC10(%rip), %xmm0
        mulps   %xmm1, %xmm0
        addps   .LC11(%rip), %xmm0
        mulps   %xmm1, %xmm0
        addps   .LC12(%rip), %xmm0
        mulps   %xmm1, %xmm0
        addps   .LC13(%rip), %xmm0
        mulps   %xmm1, %xmm0
        addps   .LC14(%rip), %xmm0
        mulps   %xmm1, %xmm0

is the bottleneck. Perhaps we should split implicit memory operands out of the insn by some generic peephole (if a register is available) and schedule the loads appropriately. OTOH, the loop optimizer should detect invariant loads and move them out of the loop.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134