https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
One could, in a somewhat awkward way, also rewrite the loop to use integer
SSE (so we have access to min/max), relying on zero-filling scalar moves
into %xmm registers and then using vector integer operations.  The loads
need to be split from the arithmetic for that, of course.  With that, IACA
is happier (14 uops for Haswell, throughput 3.5 cycles vs. 4 before).  But
this is quite aggressive "STV".

        vmovd  %eax, %xmm10
        vmovd  %ebx, %xmm12
        .p2align 4,,10
        .p2align 3
.L34:
#       addl    -8(%r9,%rcx,4), %eax
        vmovd   -8(%r9,%rcx,4), %xmm13
        vpaddd  %xmm13, %xmm10, %xmm10
#       movl    %eax, -4(%r13,%rcx,4)
        vmovd   %xmm10, -4(%r13,%rcx,4)
#       movl    -8(%r8,%rcx,4), %esi
        vmovd   -8(%r8,%rcx,4), %xmm11
#       addl    -8(%rdx,%rcx,4), %esi
        vmovd   -8(%rdx,%rcx,4), %xmm13
        vpaddd  %xmm13, %xmm11, %xmm11
#       cmpl    %eax, %esi
#       cmovge  %esi, %eax
        vpmaxsd %xmm11, %xmm10, %xmm10
        movl    %ecx, %esi
#       cmpl    %ebx, %eax
#       cmovl   %ebx, %eax
        vpmaxsd %xmm12, %xmm10, %xmm10
#       movl    %eax, -4(%r13,%rcx,4)
        vmovd   %xmm10, -4(%r13,%rcx,4)
        incq    %rcx
        cmpq    %rcx, %rdi
        jne     .L34
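
In C with SSE4.1 intrinsics the same trick might look roughly like the
sketch below (kernel, the array names and the exact recurrence are
illustrative guesses read back from the assembly, not the real 456.hmmer
identifiers; the double store mirrors the two vmovd stores above):

#include <smmintrin.h>   /* SSE4.1 for _mm_max_epi32 (pmaxsd) */

void
kernel (int *out, const int *a, const int *b, const int *c,
        int init, int lower, long n)
{
  __m128i acc = _mm_cvtsi32_si128 (init);   /* vmovd %eax, %xmm10 */
  __m128i lo  = _mm_cvtsi32_si128 (lower);  /* vmovd %ebx, %xmm12 */
  for (long i = 0; i < n; i++)
    {
      /* acc += a[i], via a zero-filling scalar load into lane 0  */
      acc = _mm_add_epi32 (acc, _mm_cvtsi32_si128 (a[i]));
      out[i] = _mm_cvtsi128_si32 (acc);
      /* t = b[i] + c[i]; acc = max (max (acc, t), lower)         */
      __m128i t = _mm_add_epi32 (_mm_cvtsi32_si128 (b[i]),
                                 _mm_cvtsi32_si128 (c[i]));
      acc = _mm_max_epi32 (acc, t);    /* vpmaxsd                 */
      acc = _mm_max_epi32 (acc, lo);   /* vpmaxsd, clamp at lower */
      out[i] = _mm_cvtsi128_si32 (acc);
    }
}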

An interesting fact is that doing this _improves_ 456.hmmer by 9% beyond
fixing the original regression...!

Note I carefully avoided crossing the integer/vector domain boundary and
thus didn't try to do just the max operation in the vector domain.  At
least on AMD CPUs, moving data between the integer and FP/vector units is
slow (IIRC Intel doesn't care).  With AVX512, can we even do
vpaddd %eax, %xmm0, %xmm0 (I don't care whether %eax is splat or just in
lane zero)?  IIRC there was support for 'scalar' operands on some ops.

The above experiment also clearly shows that scalar integer max/min
operations are desperately missing and that cmp + cmov or
cmp + branch + mov isn't a good substitute.
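
To make that concrete: scalar signed max has no direct x86 instruction,
while SSE4.1 has pmaxsd.  A minimal sketch (imax/imax_sse are hypothetical
names; the vector form only pays off when the whole dependency chain stays
in vector registers, as in the loop above):

#include <smmintrin.h>

/* Scalar signed max: no x86 instruction for this, so the compiler
   emits cmp + cmov (or cmp + branch + mov).  */
static inline int
imax (int a, int b)
{
  return a > b ? a : b;
}

/* The same operation is a single pmaxsd once the values live in
   vector registers.  */
static inline __m128i
imax_sse (__m128i a, __m128i b)
{
  return _mm_max_epi32 (a, b);
}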

Now, isn't STV (the scalar-to-vector pass) supposed to handle exactly
cases like this?  Well, it seems it only looks for TImode operations.

Note that ICC, when tuning for Haswell, produces exactly the code we are
producing now (which is slow):

..B1.80:                        # Preds ..B1.80 ..B1.79
                                # Execution count [1.25e+01]
        movl      4(%r15,%rdi,4), %eax                          #146.10
        movl      4(%rsi,%rdi,4), %edx                          #147.12
        addl      4(%r13,%rdi,4), %eax                          #146.19
        addl      4(%r11,%rdi,4), %edx                          #147.20
        cmpl      %edx, %eax                                    #149.2
        cmovge    %eax, %edx                                    #149.2
        addl      4(%rbx,%rdi,4), %edx                          #148.2
        cmpl      $-987654321, %edx                             #149.2
        cmovl     %r14d, %edx                                   #149.2
        movl      %edx, 4(%r8,%rdi,4)                           #149.26
        incq      %rdi                                          #133.5
        cmpq      %rcx, %rdi                                    #133.5
        jb        ..B1.80       # Prob 82%                      #133.5

It seems ICC uses cmov quite aggressively everywhere...
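
For reference, the source-loop shape one can read back from the annotated
assembly above is roughly the following (a hedged reconstruction; all
identifiers are illustrative, only the clamp constant -987654321 is taken
from the code above):

void
loop_shape (int *out, const int *a, const int *b, const int *c,
            const int *d, const int *e, long n)
{
  for (long i = 0; i < n; i++)
    {
      int x = a[i] + b[i];     /* movl + addl          */
      int y = c[i] + d[i];     /* movl + addl          */
      if (x >= y)              /* cmpl + cmovge        */
        y = x;
      y += e[i];               /* addl                 */
      if (y < -987654321)      /* cmpl + cmovl (clamp) */
        y = -987654321;
      out[i] = y;              /* movl                 */
    }
}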
