https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

            Bug ID: 107093
           Summary: AVX512 mask operations not simplified in fully masked
                    loop
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

While trying to implement WHILE_ULT for AVX512 I ran into optimization issues.
Consider

double a[1024], b[1024];

void foo (int n)
{
  for (int i = 0; i < n; ++i)
    a[i] = b[i] * 3.;
}

compiled with -O3 -march=cascadelake --param vect-partial-vector-usage=2

I get snippets like

        kxnorb  %k1, %k1, %k1
        kortestb        %k1, %k1
        je      .L11

or

        kxorb   %k1, %k1, %k1
        kxnorb  %k1, %k1, %k1

where we fail to simplify the operations.  Looking at the RTL, this looks
like missed jump threading, but I do see the ops being

(insn 18 72 74 5 (parallel [
            (set (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86])
                (not:QI (xor:QI (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86])
                        (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86]))))
            (unspec [
                    (const_int 0 [0])
                ] UNSPEC_MASKOP)
        ]) 1912 {kxnorqi}
     (expr_list:REG_EQUAL (const_int -1 [0xffffffffffffffff])
        (nil)))

thus having an UNSPEC in them.  When I instead emit a SET from constm1, I end
up with mask<->GPR moves and if-converted code, which isn't optimal either.
With -fno-if-conversion I get

.L7:
        vmovapd b(%rax), %ymm1{%k1}
        addl    $4, %ecx
        movl    %edi, %edx
        vmulpd  %ymm2, %ymm1, %ymm0
        subl    %ecx, %edx
        vmovapd %ymm0, a(%rax){%k1}
        kxnorb  %k1, %k1, %k1
        cmpl    $4, %edx
        jge     .L5
        vpbroadcastd    %edx, %xmm0
        vpcmpd  $1, %xmm0, %xmm3, %k1
.L5:
        addq    $32, %rax
        kortestb        %k1, %k1
        jne     .L7

which also doesn't have the desired short-cut from the cmpl $4, %edx.
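
To make the expected simplifications concrete, here is a rough scalar C model
of the mask operations and of the loop with the short-cut applied.  It is only
a sketch under assumptions: the helper names (kxor8, kxnor8, kortest8_zf,
remaining_mask, foo_model) are hypothetical, %xmm3 is assumed to hold the lane
indices {0,1,2,3}, and the initial mask is assumed to be computed the same way
as the in-loop one.  It is not what the vectorizer emits.

```c
#include <assert.h>
#include <stdint.h>

enum { VF = 4 };   /* assumed: 4 doubles per ymm vector, as in the asm above */

/* Bitwise models of the 8-bit mask instructions.  */
static uint8_t kxor8 (uint8_t a, uint8_t b)  { return (uint8_t) (a ^ b); }
static uint8_t kxnor8 (uint8_t a, uint8_t b) { return (uint8_t) ~(a ^ b); }
/* kortestb sets ZF when the OR of its two operands is zero.  */
static int kortest8_zf (uint8_t a, uint8_t b) { return (uint8_t) (a | b) == 0; }

/* Exhaustively check the identities one would like to see folded.  */
static void check_mask_identities (void)
{
  for (unsigned v = 0; v < 256; ++v)
    {
      uint8_t k = (uint8_t) v;
      /* kxnorb %k1, %k1, %k1 yields all-ones whatever k1 held before ...  */
      assert (kxnor8 (k, k) == 0xFF);
      /* ... so a preceding kxorb %k1, %k1, %k1 is dead.  */
      assert (kxnor8 (kxor8 (k, k), kxor8 (k, k)) == 0xFF);
      /* kortestb on an all-ones mask never sets ZF, so the je is never taken.  */
      assert (!kortest8_zf (kxnor8 (k, k), kxnor8 (k, k)));
    }
}

/* Scalar model of the vpbroadcastd/vpcmpd pair: bit l is set iff
   l < remaining (signed compare), i.e. lane l is still in range.  */
static uint8_t remaining_mask (int remaining)
{
  uint8_t k = 0;
  for (int l = 0; l < VF; ++l)
    if (l < remaining)
      k |= (uint8_t) (1u << l);
  return k;
}

/* Model of the masked loop with the short-cut: when at least VF elements
   remain, the next mask is known to be all-ones, so the back edge can be
   taken without testing the mask.  Takes pointers instead of the globals
   a and b so it can be exercised on small arrays.  */
static void foo_model (double *a, const double *b, int n)
{
  uint8_t k = remaining_mask (n);      /* mask for the first iteration */
  if (kortest8_zf (k, k))
    return;
  int i = 0;
  for (;;)
    {
      /* Masked load, multiply and masked store, one lane at a time.  */
      for (int l = 0; l < VF; ++l)
        if (k & (1u << l))
          a[i + l] = b[i + l] * 3.;
      i += VF;
      int remaining = n - i;
      if (remaining >= VF)             /* the cmpl $4, %edx */
        {
          k = kxnor8 (k, k);           /* all-ones, nonzero by construction */
          continue;                    /* short-cut: no kortestb needed */
        }
      k = remaining_mask (remaining);
      if (kortest8_zf (k, k))
        break;
    }
}
```

In this model the mask is only tested on the path where it was recomputed from
the remaining count, which is the control flow I would expect jump threading
to produce from the cmpl $4, %edx.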
