https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #8 from Chris Elrod <elrodc at gmail dot com> ---
Created attachment 45358
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45358&action=edit
gfortran compiled assembly for the transposed version of the original code.

Here is the assembly for the loop body of the transposed version of the code,
compiled by gfortran:


.L8:
        vmovss  36(%rsi), %xmm0
        addq    $40, %rsi
        vrsqrtss        %xmm0, %xmm2, %xmm2
        addq    $12, %rdi
        vmulss  %xmm0, %xmm2, %xmm0
        vmulss  %xmm2, %xmm0, %xmm0
        vmulss  %xmm7, %xmm2, %xmm2
        vaddss  %xmm8, %xmm0, %xmm0
        vmulss  %xmm2, %xmm0, %xmm0
        vmulss  -8(%rsi), %xmm0, %xmm5
        vmulss  -12(%rsi), %xmm0, %xmm4
        vmulss  -32(%rsi), %xmm0, %xmm0
        vmovaps %xmm5, %xmm3
        vfnmadd213ss    -16(%rsi), %xmm5, %xmm3
        vmovaps %xmm4, %xmm2
        vfnmadd213ss    -20(%rsi), %xmm5, %xmm2
        vmovss  %xmm0, -4(%rdi)
        vrsqrtss        %xmm3, %xmm1, %xmm1
        vmulss  %xmm3, %xmm1, %xmm3
        vmulss  %xmm1, %xmm3, %xmm3
        vmulss  %xmm7, %xmm1, %xmm1
        vaddss  %xmm8, %xmm3, %xmm3
        vmulss  %xmm1, %xmm3, %xmm3
        vmulss  %xmm3, %xmm2, %xmm6
        vmovaps %xmm4, %xmm2
        vfnmadd213ss    -24(%rsi), %xmm4, %xmm2
        vfnmadd231ss    %xmm6, %xmm6, %xmm2
        vrsqrtss        %xmm2, %xmm10, %xmm10
        vmulss  %xmm2, %xmm10, %xmm1
        vmulss  %xmm10, %xmm1, %xmm1
        vmulss  %xmm7, %xmm10, %xmm10
        vaddss  %xmm8, %xmm1, %xmm1
        vmulss  %xmm10, %xmm1, %xmm1
        vmulss  %xmm1, %xmm3, %xmm2
        vmulss  %xmm6, %xmm2, %xmm2
        vmovss  -36(%rsi), %xmm6
        vxorps  %xmm9, %xmm2, %xmm2
        vmulss  %xmm6, %xmm2, %xmm10
        vmulss  %xmm2, %xmm5, %xmm2
        vfmadd231ss     -40(%rsi), %xmm1, %xmm10
        vfmadd132ss     %xmm4, %xmm2, %xmm1
        vfnmadd132ss    %xmm0, %xmm10, %xmm1
        vmulss  %xmm0, %xmm5, %xmm0
        vmovss  %xmm1, -12(%rdi)
        vsubss  %xmm0, %xmm6, %xmm0
        vmulss  %xmm3, %xmm0, %xmm3
        vmovss  %xmm3, -8(%rdi)
        cmpq    %rsi, %rax
        jne     .L8


While Flang had a second loop of scalar code (to catch the N mod [SIMD vector
width] remainder of the vectorized loop), there are no secondary loops in the
gfortran code, meaning these must all be scalar operations (I have a hard time
telling SSE apart from scalar code, but nearly every instruction above carries
the `ss` suffix, i.e. scalar single precision, rather than the packed `ps`).
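
For readers unfamiliar with the "main loop plus remainder loop" shape mentioned
above, here is a minimal sketch in C (not the Fortran from this report). The
function name `square_all` and the width of 8 floats are assumptions chosen for
illustration only; a real compiler picks the width from the target and emits
packed instructions for the first loop.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed SIMD width in floats (e.g. one 256-bit register); illustrative only. */
enum { WIDTH = 8 };

/* Hypothetical kernel: y[i] = x[i]^2, written in the shape a vectorizer produces. */
void square_all(const float *x, float *y, size_t n) {
    assert(x != NULL && y != NULL);

    size_t i = 0;
    size_t main_end = n - n % WIDTH;  /* largest multiple of WIDTH not above n */

    /* Main loop: WIDTH contiguous elements per trip. The inner loop stands in
       for a single packed instruction in real vectorized code. */
    for (; i < main_end; i += WIDTH) {
        for (size_t k = 0; k < WIDTH; ++k) {
            y[i + k] = x[i + k] * x[i + k];
        }
    }

    /* Remainder loop: the last n mod WIDTH elements, one at a time. */
    for (; i < n; ++i) {
        y[i] = x[i] * x[i];
    }
}
```

With n = 11 and WIDTH = 8, the main loop runs once and the remainder loop
handles the last three elements. A listing with only one loop, like the
gfortran output above, has no such split.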

It looks similar in the operations it performs to Flang's vectorized loop,
except that it operates on only a single number at a time.
This is because efficient vectorization needs corresponding elements to be
contiguous (i.e., all the input1s together, all the input2s together).
We do not get any benefit from having all the different elements with the same
index (the first input1 next to the first input2, next to the first input3...)
being contiguous.
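
The two layouts can be sketched in C as follows. This is a simplified
illustration under stated assumptions, not the code from this report: the
three-input kernel `a*b + c`, the constant `NIN`, and the function names are
all hypothetical (the listing above advances its input pointer by 40 bytes,
i.e. ten floats, per iteration).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical number of inputs per problem; illustrative only. */
enum { NIN = 3 };

/* Interleaved ("transposed") layout: the NIN inputs of problem i sit next to
   each other, so consecutive problems are NIN floats apart. Loading input1
   for several problems at once would need strided (gather) loads. */
void kernel_interleaved(const float *in, float *out, size_t n) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        float a = in[NIN * i + 0];
        float b = in[NIN * i + 1];
        float c = in[NIN * i + 2];
        out[i] = a * b + c;
    }
}

/* Planar layout: all the input1s are contiguous, then all the input2s, and so
   on. Each array can be read with unit stride, which is what lets a compiler
   load several consecutive problems into one vector register. */
void kernel_planar(const float *a, const float *b, const float *c,
                   float *out, size_t n) {
    assert(a != NULL && b != NULL && c != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] * b[i] + c[i];
    }
}
```

Both functions compute the same results; only the stride between the values
that one vector instruction would have to load differs.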


The memory layout I used in the original code is performance-optimal, but it
is something that gfortran unfortunately often cannot vectorize automatically
(without manual unrolling).
This is why I filed a report on Bugzilla.
