https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115693

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
             Blocks|                            |53947
   Last reconfirmed|                            |2024-06-28
             Target|                            |x86_64-*-*
     Ever confirmed|0                           |1
                 CC|                            |crazylht at gmail dot com

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Xi Ruoyao from comment #1)
> I'm transferring it to tree-optimization as the following cases are compiled
> to stupid code:
> 
> char a[8], b[8];
> 
> int test()
> {
>       for (int i = 0; i < 8; i++)
>               if (a[i] != b[i])
>                       return 0;
> 
>       return 1;
> }
> 
> int test1()
> {
>       int ret = 0;
>       for (int i = 0; i < 8; i++)
>               ret = ret || a[i] != b[i];
> 
>       return ret;
> }
> 
> So it makes more sense to fix this in the optimization passes instead of an
> ad-hoc hack in libstdc++.
> 
> But I'm not sure if there already exists a dup.

Let's keep this bug for the above testcase(s).  For test(), the issue is
that even with SSE4.1 we don't seem to support ptest for V8QImode?
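
For reference, one possible intrinsics-level shape for test() with SSE4.1,
hand-written as a sketch (load both 8-byte arrays with movq, XOR, then
ptest; the function name is made up for illustration):

#include <immintrin.h>

extern char a[8], b[8];

int test_ptest (void)
{
  __m128i va = _mm_loadl_epi64 ((const __m128i *) a);
  __m128i vb = _mm_loadl_epi64 ((const __m128i *) b);
  __m128i diff = _mm_xor_si128 (va, vb);  /* nonzero bytes where a, b differ */
  /* The upper 8 bytes of both movq loads are zero, so ptest returning
     "all zero" means all 8 compared bytes were equal.  */
  return _mm_testz_si128 (diff, diff);
}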

For test1, the cost model deems vectorization worthwhile, though with just
SSE2 we get

test1:
.LFB1:
        .cfi_startproc
        movq    a(%rip), %xmm1
        pxor    %xmm2, %xmm2
        movq    b(%rip), %xmm0
        pcmpeqb %xmm1, %xmm0
        movq    .LC0(%rip), %xmm1
        pandn   %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        punpcklbw       %xmm2, %xmm0
        punpcklbw       %xmm2, %xmm1
        pshufd  $78, %xmm0, %xmm0
        pxor    %xmm2, %xmm2
        movdqa  %xmm0, %xmm3
        punpcklwd       %xmm2, %xmm0
        punpcklwd       %xmm2, %xmm3
        pshufd  $78, %xmm0, %xmm0
        por     %xmm3, %xmm0
        movdqa  %xmm1, %xmm3
        punpcklwd       %xmm2, %xmm1
        punpcklwd       %xmm2, %xmm3
        pshufd  $78, %xmm1, %xmm1
        por     %xmm3, %xmm1
        por     %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        psrlq   $32, %xmm1
        por     %xmm1, %xmm0
        movd    %xmm0, %eax
        ret

With SSE4.2 it's a bit better and "just"

test1:
.LFB1:
        .cfi_startproc
        movq    a(%rip), %xmm1
        movq    b(%rip), %xmm0
        pcmpeqb %xmm1, %xmm0
        movq    .LC0(%rip), %xmm1
        pandn   %xmm1, %xmm0
        pmovzxbw        %xmm0, %xmm2
        psrlq   $32, %xmm0
        pmovzxbw        %xmm0, %xmm0
        pmovzxwd        %xmm0, %xmm1
        psrlq   $32, %xmm0
        pmovzxwd        %xmm0, %xmm0
        por     %xmm1, %xmm0
        pmovzxwd        %xmm2, %xmm1
        psrlq   $32, %xmm2
        pmovzxwd        %xmm2, %xmm2
        por     %xmm2, %xmm1
        por     %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        psrlq   $32, %xmm1
        por     %xmm1, %xmm0
        movd    %xmm0, %eax
        ret

but we fail to realize that the bitwise-OR reduction could be narrowed to
char:

  <bb 3> [local count: 954449106]:
  # ret_12 = PHI <iftmp.0_5(7), 0(15)>
  # i_14 = PHI <i_7(7), 0(15)>
  # ivtmp_4 = PHI <ivtmp_3(7), 8(15)>
  _1 = a[i_14];
  _2 = b[i_14];
  _8 = _1 != _2; 
  _9 = (int) _8;
  iftmp.0_5 = _9 | ret_12;
  i_7 = i_14 + 1;
  ivtmp_3 = ivtmp_4 - 1;
  if (ivtmp_3 != 0)
    goto <bb 7>; [87.50%]

Instead, we keep four V2SImode "accumulators" and widen the compare results.
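
A hand-written scalar variant with the reduction already narrowed, just to
show the shape this is asking for (the widening to int then happens once,
after the loop):

int test1_narrow (void)
{
  unsigned char ret = 0;
  for (int i = 0; i < 8; i++)
    ret |= (a[i] != b[i]);
  return ret;  /* single char -> int widening after the reduction */
}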

The best would be if the scalar opts turned this into a bool reduction,
though IIRC we have a PR for that not being handled.
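
As a rough illustration of what such a bool any-mismatch reduction would
enable for the 8-byte case (hand-written SSE2 intrinsics sketch, not what
GCC emits today):

#include <emmintrin.h>

extern char a[8], b[8];

int test1_mask (void)
{
  __m128i va = _mm_loadl_epi64 ((const __m128i *) a);
  __m128i vb = _mm_loadl_epi64 ((const __m128i *) b);
  __m128i eq = _mm_cmpeq_epi8 (va, vb);
  /* The low 8 mask bits cover a[0..7]; they are all set iff every byte
     matched, so any cleared bit means "mismatch found".  */
  return (_mm_movemask_epi8 (eq) & 0xff) != 0xff;
}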


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
