------- Comment #8 from bangerth at dealii dot org 2006-12-14 15:35 -------
Here is an analysis of the assembler code we get when using my first command line in my previous comment, i.e. no hand unrolling. I'm using 4.1.0, btw.
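
For orientation, the stores annotated in the listing that follows imply a loop of roughly the shape sketched here. This is a reconstruction from those annotations, not the original testcase: the function names fill and run, the int type of factor, the start value of 0, and the 3x3 row-major layout of data are all assumptions made for illustration.

```c
/* Sketch only: reconstructed from the annotations in the assembler
   listing, not copied from the original testcase. */
#define N 3

double data[N * N];

/* One pass over the assumed 3x3 matrix: 'factor' goes on the diagonal
   (data[0], data[4], data[8]), zero everywhere else. */
void fill(int factor)
{
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      data[i * N + j] = (i == j) ? (double) factor : 0.0;
}

/* Outer loop over 'factor'; the listing compares against 1000000000.
   The start value 0 is an assumption. */
void run(int n)
{
  for (int factor = 0; factor < n; ++factor)
    fill(factor);
}
```

Calling run(1000000000) would correspond to the trip count seen in the cmpl instruction. With the two inner loops fully unrolled, each outer iteration reduces to nine stores, which is what the listing shows.
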
The main loop looks like this:
--------------------------
.L2:
        pushl   %edx                // push 'factor'
        xorl    %eax, %eax          // eax=0
        fildl   (%esp)              // st(0)=(double)factor
        addl    $1, %edx            // ++factor
        fstl    data                // data[0]=factor
        movl    %eax, (%esp)        // push 0
        fildl   (%esp)              // st(0)=0
        addl    $4, %esp
        cmpl    $1000000000, %edx
        fstl    data+24             // data[3]=0
        fstl    data+48             // data[6]=0
        fstl    data+8              // data[1]=0
        fxch    %st(1)              // st(0)=factor
        fstl    data+32             // data[4]=factor
        fxch    %st(1)              // st(0)=0
        fstl    data+56             // data[7]=0
        fstl    data+16             // data[2]=0
        fstpl   data+40             // data[5]=0; st(0)=factor
        fstpl   data+64             // data[8]=factor
        jne     .L2
---------------------

I can find several things wrong with this:

a/ the sequence
        xorl    %eax, %eax
        movl    %eax, (%esp)
        fildl   (%esp)
   could certainly be made more efficient by using fldz.

b/ I find the use of fstpl at the end of the loop quite ingenious, since it
   avoids another fxch. However, the two uses of fxch in the middle may
   nevertheless be avoided if we manage to realize that we can reorder all
   those stores.

So, in summary, it is not that gcc doesn't realize that it can unroll these loops -- it actually does that, the slowdown comes from other places.

W.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30201