On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene <t...@moene.org> wrote:
> L.S.,
>
> Due to the discussion on register allocation, I went back to a hobby of
> mine: Studying the assembly output of the compiler.
>
> For this Fortran subroutine (note: unless otherwise told to the Fortran
> front end, reals are 32 bit floating point numbers):
>
>      subroutine sum(a, b, c, n)
>      integer i, n
>      real a(n), b(n), c(n)
>      do i = 1, n
>         c(i) = a(i) + b(i)
>      enddo
>      end
>
> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>
>        xorps   %xmm2, %xmm2
>        ....
> .L6:
>        movaps  %xmm2, %xmm0
>        movaps  %xmm2, %xmm1
>        movlps  (%r9,%rax), %xmm0
>        movlps  (%r8,%rax), %xmm1
>        movhps  8(%r9,%rax), %xmm0
>        movhps  8(%r8,%rax), %xmm1
>        incl    %ecx
>        addps   %xmm1, %xmm0
>        movaps  %xmm0, 0(%rbp,%rax)
>        addq    $16, %rax
>        cmpl    %ebx, %ecx
>        jb      .L6
>
> I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1}
> have to be zero'd (%xmm2 is set to zero by xor'ing it with itself), before
> they are completely filled with the mov{l,h}ps instructions ?
>

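I think the zeroing is there to avoid a partial SSE register stall:
movlps and movhps each write only one half of the destination register,
so copying the zeroed %xmm2 into %xmm0 and %xmm1 first means the loads
do not depend on whatever those registers held in the previous iteration.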
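
For what it's worth, here is a rough C sketch of the same zero-then-fill
pattern written with SSE intrinsics. It is only an illustration, not what
the vectorizer does internally: the function name sum_vec4 and the
restrictions (n a multiple of 4, c 16-byte aligned) are my assumptions,
and a compiler is free to pick different instructions for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86 / x86_64 only */

/* Hypothetical helper: c[i] = a[i] + b[i], four floats per iteration.
   Assumes n is a multiple of 4 and c is 16-byte aligned, which is what
   the movaps store in the quoted loop requires. a and b may be unaligned. */
void sum_vec4(const float *a, const float *b, float *c, size_t n)
{
    assert(n % 4 == 0);
    assert(((uintptr_t)c & 15) == 0);

    for (size_t i = 0; i < n; i += 4) {
        /* Start from a fresh zero value each iteration, like the movaps
           from %xmm2, so the half-register loads below merge into a
           known value rather than last iteration's result. */
        __m128 va = _mm_setzero_ps();
        __m128 vb = _mm_setzero_ps();

        /* Low 64 bits (two floats), upper half left as it was. */
        va = _mm_loadl_pi(va, (const __m64 *)(a + i));      /* movlps */
        vb = _mm_loadl_pi(vb, (const __m64 *)(b + i));

        /* High 64 bits (two floats), lower half left as it was. */
        va = _mm_loadh_pi(va, (const __m64 *)(a + i + 2));  /* movhps */
        vb = _mm_loadh_pi(vb, (const __m64 *)(b + i + 2));

        /* addps, then an aligned 16-byte store (movaps). */
        _mm_store_ps(c + i, _mm_add_ps(va, vb));
    }
}
```

Compiling this with -O3 -S and comparing against the loop above should
show whether the compiler keeps the explicit zeroing or folds the two
half loads into a single unaligned load.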


-- 
H.J.
