On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene <t...@moene.org> wrote:
> L.S.,
>
> Due to the discussion on register allocation, I went back to a hobby of
> mine: Studying the assembly output of the compiler.
>
> For this Fortran subroutine (note: unless otherwise told to the Fortran
> front end, reals are 32 bit floating point numbers):
>
>       subroutine sum(a, b, c, n)
>       integer i, n
>       real a(n), b(n), c(n)
>       do i = 1, n
>          c(i) = a(i) + b(i)
>       enddo
>       end
>
> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>
>         xorps   %xmm2, %xmm2
>         ....
> .L6:
>         movaps  %xmm2, %xmm0
>         movaps  %xmm2, %xmm1
>         movlps  (%r9,%rax), %xmm0
>         movlps  (%r8,%rax), %xmm1
>         movhps  8(%r9,%rax), %xmm0
>         movhps  8(%r8,%rax), %xmm1
>         incl    %ecx
>         addps   %xmm1, %xmm0
>         movaps  %xmm0, 0(%rbp,%rax)
>         addq    $16, %rax
>         cmpl    %ebx, %ecx
>         jb      .L6
>
> I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1}
> have to be zero'd (%xmm2 is set to zero by xor'ing it with itself), before
> they are completely filled with the mov{l,h}ps instructions ?
>
I think it is used to avoid a partial SSE register stall.

--
H.J.
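
For illustration only, here is a rough C sketch of the same load pattern using SSE intrinsics. movlps and movhps each write only half of the destination register and keep the other half, so each load depends on whatever was in the register before; zeroing it first gives the load a fresh starting value. The names load_halves and sum_sse are made up for this example, and the compiler is free to emit different instructions than the ones named in the comments, so treat it as a sketch rather than a reproduction of what GCC 4.5 generated.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadl_pi, ... */

/* Hypothetical helper: fill a vector from two 64-bit loads, starting
   from a zeroed register so the partial writes do not depend on stale
   register contents. */
__m128 load_halves(const float *p)
{
    __m128 v = _mm_setzero_ps();                  /* cf. xorps / movaps  */
    v = _mm_loadl_pi(v, (const __m64 *)p);        /* cf. movlps: low 64  */
    v = _mm_loadh_pi(v, (const __m64 *)(p + 2));  /* cf. movhps: high 64 */
    return v;
}

/* Hypothetical C counterpart of the Fortran subroutine:
   c[i] = a[i] + b[i], four floats per iteration plus a scalar tail. */
void sum_sse(const float *a, const float *b, float *c, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = load_halves(a + i);
        __m128 vb = load_halves(b + i);
        /* Unaligned store, since c is not assumed to be 16-byte aligned
           here (the quoted loop uses an aligned movaps store). */
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  /* cf. addps */
    }
    for (; i < n; i++)                             /* leftover elements */
        c[i] = a[i] + b[i];
}
```

Compiling this with -O2 -S and comparing against a version that uses _mm_loadu_ps for the loads is one way to see which instructions a given GCC version actually picks.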