https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The naive masked epilogue (--param vect-partial-vector-usage=1 and support
for whileult as in a prototype I have) then looks like

        leal    -1(%rdx), %eax
        cmpl    $62, %eax
        jbe     .L11

.L11:
        xorl    %ecx, %ecx
        jmp     .L4

.L4:
        movl    %ecx, %eax
        subl    %ecx, %edx
        addq    %rax, %rsi
        addq    %rax, %rdi
        addq    %r8, %rax
        cmpl    $64, %edx
        jl      .L8 
        kxorq   %k1, %k1, %k1
        kxnorq  %k1, %k1, %k1
.L7:
        vmovdqu8        (%rsi), %zmm0{%k1}{z}
        vmovdqu8        (%rdi), %zmm1{%k1}{z}
        vpavgb  %zmm1, %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rax){%k1}
.L21:
        vzeroupper
        ret

.L8:
        vmovdqa64       .LC0(%rip), %zmm1
        vpbroadcastb    %edx, %zmm0
        vpcmpb  $1, %zmm0, %zmm1, %k1
        jmp     .L7

RTL isn't good at jump threading the mess caused by my ad-hoc whileult
RTL expansion - representing this at a higher level is probably the way
to go.  What you should basically get for the epilogue (which is also
used when the main vectorized loop isn't entered) is:

        vmovdqa64       .LC0(%rip), %zmm1
        vpbroadcastb    %edx, %zmm0
        vpcmpb  $1, %zmm0, %zmm1, %k1
        vmovdqu8        (%rsi), %zmm0{%k1}{z}
        vmovdqu8        (%rdi), %zmm1{%k1}{z}
        vpavgb  %zmm1, %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rax){%k1}

that is, a compare of the vector { niter, niter, ... } against { 0, 1, 2, 3, ... }
producing the mask (which has a latency of 3 according to Agner) and then
simply the vectorized code under that mask.  You could probably hand-code
that in assembly if you'd be interested in the (optimal) performance outcome.
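
In C with AVX-512BW intrinsics that epilogue shape would look roughly like
the following (just a sketch; the byte-average body matches the vpavgb code
above, but the function and variable names are made up for illustration):

#include <immintrin.h>
#include <stddef.h>

/* Handle the final n (0 < n < 64) elements in one masked step:
   build the mask by comparing { 0, 1, ..., 63 } < { n, n, ..., n },
   then execute the vector body under that mask.  */
static void
avg_epilogue (unsigned char *dst, const unsigned char *a,
              const unsigned char *b, size_t n)
{
  static const unsigned char iota[64] = {
     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
    16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
    32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
    48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
  };
  __m512i idx = _mm512_loadu_si512 (iota);
  __m512i cnt = _mm512_set1_epi8 ((char) n);  /* n < 64, fits a byte */
  /* vpcmpb $1: signed less-than, one mask bit per byte lane.  */
  __mmask64 k = _mm512_cmplt_epi8_mask (idx, cnt);
  /* Zero-masked loads and a masked store; inactive lanes never touch
     memory, so no scalar remainder loop is needed.  */
  __m512i x = _mm512_maskz_loadu_epi8 (k, a);
  __m512i y = _mm512_maskz_loadu_epi8 (k, b);
  _mm512_mask_storeu_epi8 (dst, k, _mm512_avg_epu8 (x, y));
}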

For now we probably want to have the main loop traditionally vectorized
without masking, because Intel has poor mask support and AMD has bad
latency on the mask-producing compares.  But having a masked vectorized
epilogue avoids the need for a scalar epilogue, saving code size, and
avoids the need to vectorize that epilogue multiple times (or to choose
SSE vectors here).  For Zen4 the above will of course still occupy both
256-bit halves of each 512-bit op even when one half is fully masked
(well, I suppose at least that this is the case).
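
For illustration, that overall shape would be something like the sketch
below (again my own sketch, not GCC output): an unmasked 512-bit main loop
plus the single masked step from above, replacing both the scalar epilogue
and any additionally vectorized epilogues:

static void
avg (unsigned char *dst, const unsigned char *a,
     const unsigned char *b, size_t n)
{
  size_t i = 0;
  /* Main loop: traditional, unmasked 512-bit vectorization.  */
  for (; i + 64 <= n; i += 64)
    {
      __m512i x = _mm512_loadu_si512 (a + i);
      __m512i y = _mm512_loadu_si512 (b + i);
      _mm512_storeu_si512 (dst + i, _mm512_avg_epu8 (x, y));
    }
  /* Masked vectorized epilogue for the remaining n - i < 64 elements;
     this also handles the case where the main loop is never entered.  */
  if (i < n)
    avg_epilogue (dst + i, a + i, b + i, n - i);
}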
