[Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb

2024-05-15 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #6 from Hongtao Liu  ---
Fixed in GCC 15.

[Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb

2024-05-15 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

--- Comment #5 from GCC Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:090714e6cf8029f4ff8883dce687200024adbaeb

commit r15-530-g090714e6cf8029f4ff8883dce687200024adbaeb
Author: liuhongt 
Date:   Wed May 15 10:56:24 2024 +0800

Set d.one_operand_p to true when TARGET_SSSE3 in
ix86_expand_vecop_qihi_partial.

pshufb is available under TARGET_SSSE3, so
ix86_expand_vec_perm_const_1 must return true when TARGET_SSSE3 is enabled.

With the patch, under -march=x86-64-v2:

v8qi
foo (v8qi a)
{
  return a >> 5;
}

<   pmovsxbw %xmm0, %xmm0
<   psraw   $5, %xmm0
<   pshufb  .LC0(%rip), %xmm0

vs.

>   movdqa  %xmm0, %xmm1
>   pcmpeqd %xmm0, %xmm0
>   pmovsxbw %xmm1, %xmm1
>   psrlw   $8, %xmm0
>   psraw   $5, %xmm1
>   pand    %xmm1, %xmm0
>   packuswb %xmm0, %xmm0

Although there's a memory load from the constant pool, it should still be
better when it's inside a loop, since the load from the constant pool can
be hoisted out: it's 1 instruction vs. 4 instructions.

<   pshufb  .LC0(%rip), %xmm0

vs.

>   pcmpeqd %xmm0, %xmm0
>   psrlw   $8, %xmm0
>   pand    %xmm1, %xmm0
>   packuswb %xmm0, %xmm0

gcc/ChangeLog:

PR target/114514
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial):
Set d.one_operand_p to true when TARGET_SSSE3.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr114514-shufb.c: New test.
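
For reference, here is a self-contained version of the snippet from the
commit message. The typedef is an assumption on my part (the commit's
actual testcase is gcc.target/i386/pr114514-shufb.c), so treat this as a
sketch:

```
/* Sketch of a standalone testcase; the v8qi typedef is assumed, not
   taken from the commit.  Compiling with -O2 -march=x86-64-v2 should
   show the pmovsxbw/psraw/pshufb sequence quoted above.  */
typedef signed char v8qi __attribute__((vector_size (8)));

v8qi
foo (v8qi a)
{
  return a >> 5;
}
```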

[Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb

2024-05-15 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

--- Comment #4 from GCC Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:0cc0956b3bb8bcbc9196075b9073a227d799e042

commit r15-529-g0cc0956b3bb8bcbc9196075b9073a227d799e042
Author: liuhongt 
Date:   Tue May 14 18:39:54 2024 +0800

Optimize ashift >> 7 to vpcmpgtb for vector int8.

Since there is no corresponding instruction, the shift operation for
vector int8 is implemented using the vector int16 instructions; for some
special shift counts it can instead be transformed into vpcmpgtb.

gcc/ChangeLog:

PR target/114514
* config/i386/i386-expand.cc
(ix86_expand_vec_shift_qihi_constant): Optimize ashift >> 7 to
vpcmpgtb.
(ix86_expand_vecop_qihi_partial): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr114514-shift.c: New test.
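
A sketch of why the transform is valid (my illustration, not part of the
commit): arithmetic right shift by 7 replicates the sign bit of each byte,
producing 0 for non-negative elements and -1 for negative ones, which is
exactly the 0/-1 mask that pcmpgtb/vpcmpgtb computes when comparing each
element against zero.

```
/* Illustration only: for signed bytes, (a >> 7) and the 0/-1 mask of
   (a < 0) are the same value per element; the latter maps directly to
   pcmpgtb/vpcmpgtb with a zeroed operand.  */
#define vector __attribute__((vector_size(16)))
typedef vector signed char v16qi;

v16qi
shift7 (v16qi a)
{
  return a >> 7;            /* expected to become (v)pcmpgtb */
}

v16qi
ltzero_mask (v16qi a)
{
  return (v16qi) (a < 0);   /* same 0 / -1 per-element result */
}
```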

[Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb

2024-03-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

--- Comment #3 from Hongtao Liu  ---
(In reply to Andrew Pinski from comment #1)
> Confirmed.
> 
> Note non sign bit can be improved too:
> ```
I assume you're talking about broadcasting from an immediate or loading
directly from the constant pool. GCC chooses the former; with -Os we can
also generate the latter. According to a microbenchmark, the former is
better. I also tried disabling broadcasting from an immediate and testing
with stress-ng vecmath; the performance is similar.

[Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb

2024-03-28 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

--- Comment #2 from Andrew Pinski  ---
For a non-constant shift count, clang produces:
```
signedshiftright:
movzbl  %dil, %eax
movd    %eax, %xmm1
psrlw   %xmm1, %xmm0
pcmpeqd %xmm2, %xmm2
psrlw   %xmm1, %xmm2
movdqa  .LCPI0_0(%rip), %xmm3   # xmm3 = [32896,32896,32896,32896,32896,32896,32896,32896]
psrlw   %xmm1, %xmm3
psrlw   $8, %xmm2
punpcklbw   %xmm2, %xmm2        # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw $0, %xmm2, %xmm1        # xmm1 = xmm2[0,0,0,0,4,5,6,7]
pshufd  $0, %xmm1, %xmm1        # xmm1 = xmm1[0,0,0,0]
pand    %xmm1, %xmm0
pxor    %xmm3, %xmm0
psubb   %xmm3, %xmm0
retq

unsignedshiftrtight:
movzbl  %dil, %eax
movd    %eax, %xmm1
psrlw   %xmm1, %xmm0
pcmpeqd %xmm2, %xmm2
psrlw   %xmm1, %xmm2
psrlw   $8, %xmm2
punpcklbw   %xmm2, %xmm2        # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw $0, %xmm2, %xmm1        # xmm1 = xmm2[0,0,0,0,4,5,6,7]
pshufd  $0, %xmm1, %xmm1        # xmm1 = xmm1[0,0,0,0]
pand    %xmm1, %xmm0
retq
```

I am not sure which way is faster here though.
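
A scalar model of what both sequences compute per byte may make the
structure clearer; this is my own sketch (function names are
hypothetical), not code from the bug:

```
/* Per-byte model (sketch) of clang's variable-count sequences above:
   shift the whole 16-bit lane, mask off bits that leak in from the
   neighbouring byte, and for the signed case sign-extend with the
   xor/sub bias trick (the 32896 == 0x8080 per-word constant).  */
#include <stdint.h>

static uint8_t
usr8 (uint8_t x, unsigned n)            /* n in [0, 7] */
{
  uint8_t mask = 0xff >> n;             /* pcmpeqd; psrlw n; psrlw $8 */
  return (uint8_t) ((x >> n) & mask);   /* psrlw n; pand              */
}

static int8_t
ssr8 (int8_t x, unsigned n)             /* n in [0, 7] */
{
  uint8_t bias = 0x80 >> n;             /* 0x8080 per word, psrlw n   */
  uint8_t u = usr8 ((uint8_t) x, n);
  return (int8_t) ((u ^ bias) - bias);  /* pxor; psubb                */
}
```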

[Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb

2024-03-28 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

Andrew Pinski  changed:

   What|Removed |Added

   Severity|normal  |enhancement
   Last reconfirmed||2024-03-28
 CC||pinskia at gcc dot gnu.org
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW

--- Comment #1 from Andrew Pinski  ---
Confirmed.

Note non sign bit can be improved too:
```
#define vector __attribute__((vector_size(16)))

typedef vector signed char v16qi;
typedef vector unsigned char v16uqi;

v16qi
foo2 (v16qi a, v16qi b)
{
  return a >> 6;
}
v16uqi
foo1 (v16uqi a, v16uqi b)
{
  return a >> 6;
}
```

clang produces:
```
_Z4foo2Dv16_aS_:
psrlw   $6, %xmm0
pand    .LCPI0_0(%rip), %xmm0   # {3,3,3,...}
movdqa  .LCPI0_1(%rip), %xmm1   # {2,2,2,...}
pxor    %xmm1, %xmm0
psubb   %xmm1, %xmm0
retq
_Z4foo1Dv16_hS_:
psrlw   $6, %xmm0
pand    .LCPI1_0(%rip), %xmm0   # {3,3,3,...}
retq
```
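
As a quick sanity check of the constant-shift case (my addition, not from
the bug), the scalar identity behind clang's foo2 sequence,
(((u >> 6) & 3) ^ 2) - 2, can be verified exhaustively against an
arithmetic shift of the signed byte:

```
/* Exhaustive scalar check (my addition) that clang's foo2 sequence
   (psrlw $6; pand {3,...}; pxor {2,...}; psubb {2,...}) matches a
   signed arithmetic shift by 6 for every byte value.  */
#include <assert.h>
#include <stdint.h>

int
main (void)
{
  for (int i = 0; i < 256; i++)
    {
      uint8_t u = (uint8_t) i;
      int8_t expect = (int8_t) ((int8_t) i >> 6);        /* a >> 6    */
      int8_t got = (int8_t) ((((u >> 6) & 3) ^ 2) - 2);  /* asm model */
      assert (expect == got);
    }
  return 0;
}
```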