[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2021-12-20 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

Andrew Pinski  changed:

   What|Removed |Added

Summary|Missed vectorization under  |[12 Regression] Missed
   |-mavx512f -mavx512vl after  |vectorization under
   |r12-5489|-mavx512f -mavx512vl after
   ||r12-5489
   Target Milestone|--- |12.0
   Keywords||missed-optimization

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-04 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #1 from Tamar Christina  ---
Looks like the change causes the simpler conditional to be detected by the
vectorizer as a masked operation, which in principle makes sense:

note:   vect_recog_mask_conversion_pattern: detected: iftmp.0_21 = x.1_14 > 255
? iftmp.0_19 : iftmp.0_20;
note:   mask_conversion pattern recognized: patt_43 = patt_42 ? iftmp.0_19 :
iftmp.0_20;
note:   extra pattern stmt: patt_40 = x.1_14 > 255;
note:   extra pattern stmt: patt_42 = () patt_40;

However not quite sure how the masking works on x86.  The additional statement
generated for patt_42 causes it to fail during vectorization:

note:   ==> examining pattern def statement: patt_42 = ()
patt_40;
note:   ==> examining statement: patt_42 = () patt_40;
note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
note:   vect_is_simple_use: vectype vector(8) 
missed:   conversion not supported by target.
note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
note:   vect_is_simple_use: vectype vector(8) 
note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
note:   vect_is_simple_use: vectype vector(8) 
missed:   not vectorized: relevant stmt not supported: patt_42 =
() patt_40;
missed:  bad operation or unsupported loop bound.
note:  * Analysis  failed with vector mode V32QI

as there's no conversion patterns for `VEC_UNPACK_LO_EXPR` between bool and a
mask.

which explains why it works for AVX2 and AVX512BW. AVX512F doesn't seem to
allow any QI mode conversions [1] so it fails..

Not sure why it's doing the replacement without checking to see that the target
is able to vectorize the statements it generates later. Specifically it doesn't
check if what's returned by build_mask_conversion is supported or not.

My guess is because vectorizable_condition will fail anyway without the type of
the conditional being a vector boolean.

With -mavx512vl V32QI seems to generate in the pattern mask conversions between
vector (8)  and without it vector(32) . I
think some x86 person needs to give a hint here :)

[1] https://www.felixcloutier.com/x86/kunpckbw:kunpckwd:kunpckdq

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-05 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #2 from Hongtao.liu  ---
(In reply to Tamar Christina from comment #1)
> Looks like the change causes the simpler conditional to be detected by the
> vectorizer as a masked operation, which in principle makes sense:
> 
> note:   vect_recog_mask_conversion_pattern: detected: iftmp.0_21 = x.1_14 >
> 255 ? iftmp.0_19 : iftmp.0_20;
> note:   mask_conversion pattern recognized: patt_43 = patt_42 ? iftmp.0_19 :
> iftmp.0_20;
> note:   extra pattern stmt: patt_40 = x.1_14 > 255;
> note:   extra pattern stmt: patt_42 = () patt_40;
> 
> However not quite sure how the masking works on x86.  The additional
> statement generated for patt_42 causes it to fail during vectorization:
> 
> note:   ==> examining pattern def statement: patt_42 = ()
> patt_40;
> note:   ==> examining statement: patt_42 = () patt_40;
> note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
> note:   vect_is_simple_use: vectype vector(8) 
> missed:   conversion not supported by target.
> note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
> note:   vect_is_simple_use: vectype vector(8) 
> note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
> note:   vect_is_simple_use: vectype vector(8) 
> missed:   not vectorized: relevant stmt not supported: patt_42 =
> () patt_40;
> missed:  bad operation or unsupported loop bound.
> note:  * Analysis  failed with vector mode V32QI
> 
> as there's no conversion patterns for `VEC_UNPACK_LO_EXPR` between bool and
> a mask.
W/ avx512, we're using scalar mode for mask, can we use VEC_UNPACKS_SBOOL_LO_
here?
Since we have vec_unpsack_sbool_lo/hi_qi which should be used for conversion
from vector<8>  to vector<4> .
> 
> which explains why it works for AVX2 and AVX512BW. AVX512F doesn't seem to
> allow any QI mode conversions [1] so it fails..
> 
> Not sure why it's doing the replacement without checking to see that the
> target is able to vectorize the statements it generates later. Specifically
> it doesn't check if what's returned by build_mask_conversion is supported or
> not.
> 
> My guess is because vectorizable_condition will fail anyway without the type
> of the conditional being a vector boolean.
> 
> With -mavx512vl V32QI seems to generate in the pattern mask conversions
> between vector (8)  and without it vector(32)
> . I think some x86 person needs to give a hint here :)
> 
> [1] https://www.felixcloutier.com/x86/kunpckbw:kunpckwd:kunpckdq

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #3 from Hongtao.liu  ---
for patt_42 = () patt_40;

vectype_in (QImode:nunits 4)

 
unit-size 
align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fffea18ddc8 precision:1 min  max >
QI size  unit-size 
align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fffea18de70 nunits:4>

vectype_out(HImode)

 
unit-size 
align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fffea18ddc8 precision:1 min  max >
HI
size  constant 16>
unit-size  constant 2>
align:16 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fffea18df18 nunits:16>

And ‘vec_pack_sbool_trunc_m’ only handle situation when input and output have
same mode.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #4 from Hongtao.liu  ---

(In reply to Hongtao.liu from comment #3)
> for patt_42 = () patt_40;
> 
> vectype_in (QImode:nunits 4)
> 
>   type  size 
> unit-size 
> align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
> 0x7fffea18ddc8 precision:1 min  max
> >
> QI size  unit-size  0x7fffea2e2e58 1>
> align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
> 0x7fffea18de70 nunits:4>
> 
> vectype_out(HImode)
> 
>   type  size 
> unit-size 
> align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
> 0x7fffea18ddc8 precision:1 min  max
> >
> HI
> size  bitsizetype> constant 16>
> unit-size  sizetype> constant 2>
> align:16 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
> 0x7fffea18df18 nunits:16>
> 
> And ‘vec_pack_sbool_trunc_m’ only handle situation when input and output
> have same mode.

And GCC vectorizer only handle 2X elements, but not 4X,8X,...

  /* For scalar masks we may have different boolean
 vector types having the same QImode.  Thus we
 add additional check for elements number.  */
 if (known_eq (TYPE_VECTOR_SUBPARTS (vectype) * 2,  
  TYPE_VECTOR_SUBPARTS (narrow_vectype)))

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #5 from Hongtao.liu  ---

> And GCC vectorizer only handle 2X elements, but not 4X,8X,...
> 
>   /* For scalar masks we may have different boolean
>vector types having the same QImode.  Thus we
>add additional check for elements number.  */
>  if (known_eq (TYPE_VECTOR_SUBPARTS (vectype) * 2,
>   TYPE_VECTOR_SUBPARTS (narrow_vectype)))


We can have vec_pack_trunc_qi + vec_pack_sbool_trunc_qi by extending
vec_pack_sbool_trunc_qi to handle 8/4, 4/2, 2/1 vectype_out/vectype_in
elements.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

Richard Biener  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2022-01-13
 CC||rguenth at gcc dot gnu.org,
   ||rsandifo at gcc dot gnu.org
 Status|UNCONFIRMED |NEW

--- Comment #6 from Richard Biener  ---
(In reply to Tamar Christina from comment #1)
> Looks like the change causes the simpler conditional to be detected by the
> vectorizer as a masked operation, which in principle makes sense:
> 
> note:   vect_recog_mask_conversion_pattern: detected: iftmp.0_21 = x.1_14 >
> 255 ? iftmp.0_19 : iftmp.0_20;
> note:   mask_conversion pattern recognized: patt_43 = patt_42 ? iftmp.0_19 :
> iftmp.0_20;
> note:   extra pattern stmt: patt_40 = x.1_14 > 255;
> note:   extra pattern stmt: patt_42 = () patt_40;

if we look at the ifcvt result we see

  iftmp.0_19 = (unsigned char) _18;
  iftmp.0_20 = (unsigned char) _5;
  iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20;
  *_6 = iftmp.0_21;

I suspect we intended to carry the fact that x.1_14 > 255 will produce
a mask from a SImode vector element compare but we need a mask suitable
for a QImode vector element select (iftmp.0_19 and iftmp.0_20).  That's
not something we can express in scalar code and thus a pattern I think.

So the pattern as generated doesn't make very much sense to me.  I suppose
it might try to convert a AVX512 mask to a AVX2 style mask but that needs
to be done with

  patt_42 = patt_40 ? ()-1 : 0;

not with a conversion.  But it's also somewhat pointless since it will
simply cause the same issue again - the COND_EXPR vectorization will need
to cobble up 4 AVX512 masks to produce the desired result.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #7 from Richard Biener  ---
Forcing the pattern to not trigger produces the expected

t.c:8:6: missed:   not vectorized: relevant stmt not supported: iftmp.0_21 =
x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20;

since condition vectorization itself doesn't know how to handle this, we end
up at

  if (vectype1 && !useless_type_conversion_p (vectype, vectype1))
return false;

with vectype V32QI and vectype1 V8SI.

Splitting out the compare from the COND_EXPR in the pattern but leaving out
the attempt to "widen" it reveals the same fact that vectorizable_condition
doesn't support packing of multiple vector defs for the mask operand.

I think that is what we need to add.  We also don't have a good representation
for "packing" of masks.

diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 3ea905538e1..729a1d32612 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -4679,8 +4679,10 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo,
  rhs1_type);
}

-  if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
-   TYPE_VECTOR_SUBPARTS (vectype2)))
+  /* AVX512 style masks cannot be packed/unpacked.  */
+  if (TYPE_PRECISION (TREE_TYPE (vectype2)) != 1
+ && maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
+  TYPE_VECTOR_SUBPARTS (vectype2)))
tmp = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo);
   else
tmp = rhs1;

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #8 from Hongtao.liu  ---
(In reply to Richard Biener from comment #7)
> Forcing the pattern to not trigger produces the expected
> 
> t.c:8:6: missed:   not vectorized: relevant stmt not supported: iftmp.0_21 =
> x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20;
> 
> since condition vectorization itself doesn't know how to handle this, we end
> up at
> 
>   if (vectype1 && !useless_type_conversion_p (vectype, vectype1))
> return false;
> 
> with vectype V32QI and vectype1 V8SI.
> 
> Splitting out the compare from the COND_EXPR in the pattern but leaving out
> the attempt to "widen" it reveals the same fact that vectorizable_condition
> doesn't support packing of multiple vector defs for the mask operand.
> 
> I think that is what we need to add.  We also don't have a good
> representation
> for "packing" of masks.
> 
shouldn't multi_step_cvt be supposed to handle this, just a little ambiguous
between ‘vec_pack_sbool_trunc_m’ and ‘vec_pack_trunc_m’, need to make
vectorizer use vec_pack_sbool_trunc_qi + vec_pack_trunc_qi to get a HI mask,
just like how we pack 4 QImode to 1 SImode mask by vec_pack_trunc_qi +
vec_pack_trunc_hi (in the main loop with -mprefer-vector-width=256)

 mask_patt_40.26_118 = vect_x.18_95 > { 255, 255, 255, 255, 255, 255, 255, 255
};
  mask_patt_40.26_119 = vect_x.18_96 > { 255, 255, 255, 255, 255, 255, 255, 255
};
  mask_patt_40.26_120 = vect_x.18_97 > { 255, 255, 255, 255, 255, 255, 255, 255
};
  mask_patt_40.26_121 = vect_x.18_98 > { 255, 255, 255, 255, 255, 255, 255, 255
};
  mask_patt_42.28_122 = VEC_PACK_TRUNC_EXPR ;
  mask_patt_42.28_123 = VEC_PACK_TRUNC_EXPR ;
  mask_patt_42.27_124 = VEC_PACK_TRUNC_EXPR ;
  vect_patt_43.29_125 = VEC_COND_EXPR ;
  iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20;

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #9 from rsandifo at gcc dot gnu.org  
---
(In reply to Richard Biener from comment #7)
> I think that is what we need to add.  We also don't have a good
> representation
> for "packing" of masks.
> 
> diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
> index 3ea905538e1..729a1d32612 100644
> --- a/gcc/tree-vect-patterns.c
> +++ b/gcc/tree-vect-patterns.c
> @@ -4679,8 +4679,10 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo,
>   rhs1_type);
> }
>  
> -  if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
> -   TYPE_VECTOR_SUBPARTS (vectype2)))
> +  /* AVX512 style masks cannot be packed/unpacked.  */
> +  if (TYPE_PRECISION (TREE_TYPE (vectype2)) != 1
> + && maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
> +  TYPE_VECTOR_SUBPARTS (vectype2)))
> tmp = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo);
>else
> tmp = rhs1;
Haven't had time to look at it properly yet, but my first impression
is that that's likely to regress SVE.  Packing and unpacking are
natural operations on boolean vector modes.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #10 from Hongtao.liu  ---
with
@@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code code,
   c1 = VEC_PACK_TRUNC_EXPR;
   if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
  && VECTOR_BOOLEAN_TYPE_P (vectype)
- && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
+ && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
+ || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT))
  && SCALAR_INT_MODE_P (TYPE_MODE (vectype)))
optab1 = vec_pack_sbool_trunc_optab;
   else
@@ -12213,6 +12214,7 @@ supportable_narrowing_operation (enum tree_code code,
   if (VECTOR_BOOLEAN_TYPE_P (intermediate_type)
  && VECTOR_BOOLEAN_TYPE_P (prev_type)
  && intermediate_mode == prev_mode
+ && known_lt (TYPE_VECTOR_SUBPARTS (intermediate_type), BITS_PER_UNIT)
  && SCALAR_INT_MODE_P (prev_mode))
interm_optab = vec_pack_sbool_trunc_optab;
   else

-march=icelake-server -O3 -mprefer-vector-width=128 now can get vectorized
loop.


vmovdqu8(%rsi,%rax), %xmm0
vpmovzxbw   %xmm0, %xmm2
vpmovzxwd   %xmm2, %xmm1
vpsrldq $8, %xmm0, %xmm0
vpsrldq $8, %xmm2, %xmm2
vpmovzxbw   %xmm0, %xmm0
vpmovzxwd   %xmm2, %xmm2
vpmulld %xmm9, %xmm1, %xmm1
vpmulld %xmm9, %xmm2, %xmm2
vpmovzxwd   %xmm0, %xmm4
vpsrldq $8, %xmm0, %xmm0
vpmovzxwd   %xmm0, %xmm0
vpmulld %xmm9, %xmm4, %xmm4
vpmulld %xmm9, %xmm0, %xmm0
vpcmpud $6, %xmm6, %xmm1, %k0
vpsubd  %xmm1, %xmm7, %xmm3
vpcmpud $6, %xmm6, %xmm2, %k1
vpsubd  %xmm2, %xmm7, %xmm5
vpsrad  $31, %xmm5, %xmm5
vpsrad  $31, %xmm3, %xmm3
vpermt2w%xmm5, %xmm8, %xmm3
vpsubd  %xmm0, %xmm7, %xmm10
vpsubd  %xmm4, %xmm7, %xmm5
kshiftlb$4, %k1, %k1
vpcmpud $6, %xmm6, %xmm0, %k2
vpsrad  $31, %xmm5, %xmm5
vpsrad  $31, %xmm10, %xmm10
kandb   %k3, %k0, %k0
korb%k1, %k0, %k0
vpcmpud $6, %xmm6, %xmm4, %k1
vpermt2w%xmm10, %xmm8, %xmm5
vpermt2w%xmm2, %xmm8, %xmm1
vpermt2w%xmm0, %xmm8, %xmm4
vpermt2b%xmm5, %xmm11, %xmm3
vpermt2b%xmm4, %xmm11, %xmm1
kandb   %k3, %k1, %k1
kshiftlb$4, %k2, %k2
korb%k2, %k1, %k1
kunpckbw%k0, %k1, %k1
vmovdqu8%xmm3, %xmm1{%k1}
vmovdqu8%xmm1, (%rdi,%rax)
addq$16, %rax
cmpq%rax, %r8
jne .L4

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #11 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #10)
> with
> @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code code,
>c1 = VEC_PACK_TRUNC_EXPR;
>if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
> && VECTOR_BOOLEAN_TYPE_P (vectype)
> -   && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> +   && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> +   || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT))
> && SCALAR_INT_MODE_P (TYPE_MODE (vectype)))
>   optab1 = vec_pack_sbool_trunc_optab;
>else
> @@ -12213,6 +12214,7 @@ supportable_narrowing_operation (enum tree_code code,
>if (VECTOR_BOOLEAN_TYPE_P (intermediate_type)
> && VECTOR_BOOLEAN_TYPE_P (prev_type)
> && intermediate_mode == prev_mode
> +   && known_lt (TYPE_VECTOR_SUBPARTS (intermediate_type), BITS_PER_UNIT)
> && SCALAR_INT_MODE_P (prev_mode))
>   interm_optab = vec_pack_sbool_trunc_optab;
>else
> 
> -march=icelake-server -O3 -mprefer-vector-width=128 now can get vectorized
> loop.
> 
> 
>   vmovdqu8(%rsi,%rax), %xmm0
>   vpmovzxbw   %xmm0, %xmm2
>   vpmovzxwd   %xmm2, %xmm1
>   vpsrldq $8, %xmm0, %xmm0
>   vpsrldq $8, %xmm2, %xmm2
>   vpmovzxbw   %xmm0, %xmm0
>   vpmovzxwd   %xmm2, %xmm2
>   vpmulld %xmm9, %xmm1, %xmm1
>   vpmulld %xmm9, %xmm2, %xmm2
>   vpmovzxwd   %xmm0, %xmm4
>   vpsrldq $8, %xmm0, %xmm0
>   vpmovzxwd   %xmm0, %xmm0
>   vpmulld %xmm9, %xmm4, %xmm4
>   vpmulld %xmm9, %xmm0, %xmm0
>   vpcmpud $6, %xmm6, %xmm1, %k0
>   vpsubd  %xmm1, %xmm7, %xmm3
>   vpcmpud $6, %xmm6, %xmm2, %k1
>   vpsubd  %xmm2, %xmm7, %xmm5
>   vpsrad  $31, %xmm5, %xmm5
>   vpsrad  $31, %xmm3, %xmm3
>   vpermt2w%xmm5, %xmm8, %xmm3
>   vpsubd  %xmm0, %xmm7, %xmm10
>   vpsubd  %xmm4, %xmm7, %xmm5
>   kshiftlb$4, %k1, %k1
>   vpcmpud $6, %xmm6, %xmm0, %k2
>   vpsrad  $31, %xmm5, %xmm5
>   vpsrad  $31, %xmm10, %xmm10
>   kandb   %k3, %k0, %k0
>   korb%k1, %k0, %k0
>   vpcmpud $6, %xmm6, %xmm4, %k1
>   vpermt2w%xmm10, %xmm8, %xmm5
>   vpermt2w%xmm2, %xmm8, %xmm1
>   vpermt2w%xmm0, %xmm8, %xmm4
>   vpermt2b%xmm5, %xmm11, %xmm3
>   vpermt2b%xmm4, %xmm11, %xmm1
>   kandb   %k3, %k1, %k1
>   kshiftlb$4, %k2, %k2
>   korb%k2, %k1, %k1
>   kunpckbw%k0, %k1, %k1
>   vmovdqu8%xmm3, %xmm1{%k1}
>   vmovdqu8%xmm1, (%rdi,%rax)
>   addq$16, %rax
>   cmpq%rax, %r8
>   jne .L4

But still not as good as before, since original version we only need to pack
data which is produced by vec_cond_expr, but now need to extraly pack mask.

before

  # x_24 = PHI 
  # vectp_src.11_73 = PHI 
  # vectp_dst.23_112 = PHI 
  # ivtmp_115 = PHI 
  # DEBUG x => NULL
  # DEBUG BEGIN_STMT
  _1 = (sizetype) x_24;
  _2 = src_11(D) + _1;
  vect__3.13_75 = MEM  [(uint8_t *)vectp_src.11_73];
  _3 = *_2;
  vect__4.15_76 = [vec_unpack_lo_expr] vect__3.13_75;
  vect__4.15_77 = [vec_unpack_hi_expr] vect__3.13_75;
  vect__4.14_78 = [vec_unpack_lo_expr] vect__4.15_76;
  vect__4.14_79 = [vec_unpack_hi_expr] vect__4.15_76;
  vect__4.14_80 = [vec_unpack_lo_expr] vect__4.15_77;
  vect__4.14_81 = [vec_unpack_hi_expr] vect__4.15_77;
  _4 = (int) _3;
  vect__5.16_83 = vect__4.14_78 * vect_cst__82;
  vect__5.16_84 = vect__4.14_79 * vect_cst__82;
  vect__5.16_85 = vect__4.14_80 * vect_cst__82;
  vect__5.16_86 = vect__4.14_81 * vect_cst__82;
  _5 = _4 * i_scale_12(D);
  _6 = dst_13(D) + _1;
  # DEBUG x => NULL
  # DEBUG INLINE_ENTRY x264_clip_uint8
  # DEBUG BEGIN_STMT
  vect__14.17_88 = vect__5.16_83 & vect_cst__87;
  vect__14.17_89 = vect__5.16_84 & vect_cst__87;
  vect__14.17_90 = vect__5.16_85 & vect_cst__87;
  vect__14.17_91 = vect__5.16_86 & vect_cst__87;
  _14 = _5 & -256;
  vect__17.18_92 = -vect__5.16_83;
  vect__17.18_93 = -vect__5.16_84;
  vect__17.18_94 = -vect__5.16_85;
  vect__17.18_95 = -vect__5.16_86;
  _17 = -_5;
  vect__18.19_96 = vect__17.18_92 >> 31;
  vect__18.19_97 = vect__17.18_93 >> 31;
  vect__18.19_98 = vect__17.18_94 >> 31;
  vect__18.19_99 = vect__17.18_95 >> 31;
  _18 = _17 >> 31;
  iftmp.0_19 = (unsigned char) _18;
  iftmp.0_20 = (unsigned char) _5;
  _101 = vect__14.17_88 != vect_cst__100;
  vect_patt_40.20_102 = VEC_COND_EXPR <_101, vect__18.19_96, vect__5.16_83>;
  _103 = vect__14.17_89 != vect_cst__100;
  vect_patt_40.20_104 = VEC_COND_EXPR <_103, vect__18.19_97, vect__5.16_84>;
  _105 = vect__14.17_90 != vect_cst__100;
  vect_patt_40.20_106 = VEC_COND_EXPR <_105, vect__18.19_98, vect__5.16_85>;
  _107 = vect__14.17_91 != vect_cst__100;
  vect_patt_40.20_108 = VEC_COND_EXPR <_107, vect__18.19_99, vect__5.16_86>;
  vect_patt_41.22_109 = VEC_PACK_TRUNC_EXPR ;
  vect_p

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #12 from rguenther at suse dot de  ---
On Thu, 13 Jan 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> 
> --- Comment #10 from Hongtao.liu  ---
> with
> @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code code,
>c1 = VEC_PACK_TRUNC_EXPR;
>if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
>   && VECTOR_BOOLEAN_TYPE_P (vectype)
> - && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> + && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> + || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT))
>   && SCALAR_INT_MODE_P (TYPE_MODE (vectype)))

I think we instead simply want

 if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
 && TYPE_PRECISION (TREE_TYPE (narrow_vectype)) == 1
 && VECTOR_BOOLEAN_TYPE_P (vectype)
 && TYPE_PRECISION (TREE_TYPE (vectype)) == 1)

note the docs of vec_pack_sbool_trunc say

This instruction pattern is used when all the vector input and output
operands have the same scalar mode @var{m} and thus using
@code{vec_pack_trunc_@var{m}} would be ambiguous.

It also says "_Narrow_ and merge the elements of two vectors.", I think
"narrow" is misleading here, _trunc in the optab name as well.  So
with the above it suggests we could have used vect_pack_trunc_hi here?

To avoid breaking things for the VnBImode using targets we probably
want to retain the SCALAR_INT_MODE_P (prev_mode) check.  And we
probably want to adjust the documentation a bit.

This all is with my pasted pattern patch or is this with the weird
inserted conversion still?

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #13 from rguenther at suse dot de  ---
On Thu, 13 Jan 2022, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> 
> --- Comment #9 from rsandifo at gcc dot gnu.org  
> ---
> (In reply to Richard Biener from comment #7)
> > I think that is what we need to add.  We also don't have a good
> > representation
> > for "packing" of masks.
> > 
> > diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
> > index 3ea905538e1..729a1d32612 100644
> > --- a/gcc/tree-vect-patterns.c
> > +++ b/gcc/tree-vect-patterns.c
> > @@ -4679,8 +4679,10 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo,
> >   rhs1_type);
> > }
> >  
> > -  if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
> > -   TYPE_VECTOR_SUBPARTS (vectype2)))
> > +  /* AVX512 style masks cannot be packed/unpacked.  */
> > +  if (TYPE_PRECISION (TREE_TYPE (vectype2)) != 1
> > + && maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
> > +  TYPE_VECTOR_SUBPARTS (vectype2)))
> > tmp = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo);
> >else
> > tmp = rhs1;
> Haven't had time to look at it properly yet, but my first impression
> is that that's likely to regress SVE.  Packing and unpacking are
> natural operations on boolean vector modes.

Sure, but we can't produce scalar code mimicking this for
1 bit element vectors.  It "works" for the others because based
on the width of the elements we choose a vector with different
number of elements.  But here the pattern produces a 8 bit element
which isn't want is desired here.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #14 from Hongtao.liu  ---

> 
> But still not as good as before, since original version we only need to pack
> data which is produced by vec_cond_expr, but now need to extraly pack mask.
> 
>

Also for non-avx512 target, it looks like a regression for non-sve target.

https://godbolt.org/z/4aavMjjG4

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #15 from Hongtao.liu  ---
(In reply to rguent...@suse.de from comment #12)
> On Thu, 13 Jan 2022, crazylht at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> > 
> > --- Comment #10 from Hongtao.liu  ---
> > with
> > @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code 
> > code,
> >c1 = VEC_PACK_TRUNC_EXPR;
> >if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
> >   && VECTOR_BOOLEAN_TYPE_P (vectype)
> > - && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> > + && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> > + || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT))
> >   && SCALAR_INT_MODE_P (TYPE_MODE (vectype)))
> 
> I think we instead simply want
> 
>  if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
>  && TYPE_PRECISION (TREE_TYPE (narrow_vectype)) == 1
>  && VECTOR_BOOLEAN_TYPE_P (vectype)
>  && TYPE_PRECISION (TREE_TYPE (vectype)) == 1)
> 
> note the docs of vec_pack_sbool_trunc say
> 
> This instruction pattern is used when all the vector input and output
> operands have the same scalar mode @var{m} and thus using
> @code{vec_pack_trunc_@var{m}} would be ambiguous.
> 
> It also says "_Narrow_ and merge the elements of two vectors.", I think
> "narrow" is misleading here, _trunc in the optab name as well.  So
> with the above it suggests we could have used vect_pack_trunc_hi here?
> 
> To avoid breaking things for the VnBImode using targets we probably
> want to retain the SCALAR_INT_MODE_P (prev_mode) check.  And we
> probably want to adjust the documentation a bit.
> 
> This all is with my pasted pattern patch or is this with the weird
> inserted conversion still?

It's w/o your patch, I'm try to handle the weird conversion with multi
steps(first pack QI:4 -> QI:8 through vec_pack_sbool_trunc_qi, then pack QI:8
-> HI:16 through vec_pack_sbool_trunc_hi). But on the othersize the weird
inserted conversion shouldn't be existed.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #16 from rguenther at suse dot de  ---
On Thu, 13 Jan 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> 
> --- Comment #15 from Hongtao.liu  ---
> (In reply to rguent...@suse.de from comment #12)
> > On Thu, 13 Jan 2022, crazylht at gmail dot com wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> > > 
> > > --- Comment #10 from Hongtao.liu  ---
> > > with
> > > @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code 
> > > code,
> > >c1 = VEC_PACK_TRUNC_EXPR;
> > >if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
> > >   && VECTOR_BOOLEAN_TYPE_P (vectype)
> > > - && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> > > + && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> > > + || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT))
> > >   && SCALAR_INT_MODE_P (TYPE_MODE (vectype)))
> > 
> > I think we instead simply want
> > 
> >  if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
> >  && TYPE_PRECISION (TREE_TYPE (narrow_vectype)) == 1
> >  && VECTOR_BOOLEAN_TYPE_P (vectype)
> >  && TYPE_PRECISION (TREE_TYPE (vectype)) == 1)
> > 
> > note the docs of vec_pack_sbool_trunc say
> > 
> > This instruction pattern is used when all the vector input and output
> > operands have the same scalar mode @var{m} and thus using
> > @code{vec_pack_trunc_@var{m}} would be ambiguous.
> > 
> > It also says "_Narrow_ and merge the elements of two vectors.", I think
> > "narrow" is misleading here, _trunc in the optab name as well.  So
> > with the above it suggests we could have used vect_pack_trunc_hi here?
> > 
> > To avoid breaking things for the VnBImode using targets we probably
> > want to retain the SCALAR_INT_MODE_P (prev_mode) check.  And we
> > probably want to adjust the documentation a bit.
> > 
> > This all is with my pasted pattern patch or is this with the weird
> > inserted conversion still?
> 
> It's w/o your patch, I'm try to handle the weird conversion with multi
> steps(first pack QI:4 -> QI:8 through vec_pack_sbool_trunc_qi, then pack QI:8
> -> HI:16 through vec_pack_sbool_trunc_hi). But on the othersize the weird
> inserted conversion shouldn't be existed.

But the weird conversion suggests packing { 0, 1, 0, 1 } as
{ 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ... }
thus expanding each bit to 8 bits.  So it's rather an unpacking :/
As said, the scalar conversion does not make any sense...

But maybe I'm missing something very obvious?

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-13 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #17 from Hongtao.liu  ---
> As said, the scalar conversion does not make any sense...
Agree.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-14 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #18 from rsandifo at gcc dot gnu.org  
---
Again haven't really looked at this in detail, so could be wrong, but:

(In reply to rguent...@suse.de from comment #13)
> On Thu, 13 Jan 2022, rsandifo at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> > 
> > --- Comment #9 from rsandifo at gcc dot gnu.org  > gnu.org> ---
> > (In reply to Richard Biener from comment #7)
> > > I think that is what we need to add.  We also don't have a good
> > > representation
> > > for "packing" of masks.
> > > 
> > > diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
> > > index 3ea905538e1..729a1d32612 100644
> > > --- a/gcc/tree-vect-patterns.c
> > > +++ b/gcc/tree-vect-patterns.c
> > > @@ -4679,8 +4679,10 @@ vect_recog_mask_conversion_pattern (vec_info 
> > > *vinfo,
> > >   rhs1_type);
> > > }
> > >  
> > > -  if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
> > > -   TYPE_VECTOR_SUBPARTS (vectype2)))
> > > +  /* AVX512 style masks cannot be packed/unpacked.  */
> > > +  if (TYPE_PRECISION (TREE_TYPE (vectype2)) != 1
> > > + && maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
> > > +  TYPE_VECTOR_SUBPARTS (vectype2)))
> > > tmp = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo);
> > >else
> > > tmp = rhs1;
> > Haven't had time to look at it properly yet, but my first impression
> > is that that's likely to regress SVE.  Packing and unpacking are
> > natural operations on boolean vector modes.
> 
> Sure, but we can't produce scalar code mimicking this for
> 1 bit element vectors.
Yeah, but at least in the SVE case, we're not aiming to do that.
The vector boolean type that we want to use is recorded in the
STMT_VINFO_VECTYPE of the conversion instead, and doesn't get
recomputed later.

Like you said in the earlier comment, the fact that we can't
do that on scalar code is why this needs to be a vector pattern.

(As discussed elsewhere, it would be good if we didn't commit
vector types early like this.  But in this case I don't think
we can rely on scalar code to model the conversion either.)

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-14 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #19 from Richard Biener  ---
Ah, so the issue is missing -mavx512bw which means we end up with a AVX2 style
mask for V32QImode.  With -mavx512bw the code vectorizes fine.

I guess ix86_get_mask_mode is too restrictive here?

And indeed to build a AVX2 style mask vector from a AVX512 style mask vector
we'd need to use a ?:, not a simple conversion.  Unfortunately we don't support
vectorizing that:

t.c:10:21: note:   ==> examining statement: patt_42 = patt_40 ? -1 : 0;
t.c:10:21: note:   vect_is_simple_use: operand x.1_14 > 255, type of def:
internal
t.c:10:21: note:   vect_is_simple_use: vectype vector(8) 
t.c:10:21: note:   vect_is_simple_use: operand -1, type of def: constant
t.c:10:21: note:   vect_is_simple_use: operand 0, type of def: constant
t.c:8:6: missed:   not vectorized: relevant stmt not supported: patt_42 =
patt_40 ? -1 : 0;

or maybe we need to tweak the conversion vectorization for better support
of this case as noted in comment#10 ... but the code generated from that
AFAIU is to produce a HImode mask, not a V32QImode one which puts the fault
on ix86_get_mask_mode again.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-17 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #20 from Hongtao.liu  ---
(In reply to Richard Biener from comment #19)
> Ah, so the issue is missing -mavx512bw which means we end up with a AVX2
> style
> mask for V32QImode.  With -mavx512bw the code vectorizes fine.

Vectorization code is worse than before, now we need to pack vectorized mask
which takes extra 3 instructions.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-18 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #21 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #20)
> (In reply to Richard Biener from comment #19)
> > Ah, so the issue is missing -mavx512bw which means we end up with a AVX2
> > style
> > mask for V32QImode.  With -mavx512bw the code vectorizes fine.
> 
> Vectorization code is worse than before, now we need to pack vectorized mask
> which takes extra 3 instructions.

Current ifcvt convert

-dump of .ch_vect---
  if (x.1_14 > 255)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 477815112]:
  _17 = -_5;
  _18 = _17 >> 31;
  iftmp.0_19 = (unsigned char) _18;
  goto ; [100.00%]

   [local count: 477815112]:
  iftmp.0_20 = (unsigned char) _5;

   [local count: 955630225]:
  # iftmp.0_21 = PHI 
---dump end-


to 
 dump of .ifcvt-
  _41 = -x.1_14;
  _17 = (int) _41;
  _18 = _17 >> 31;
  iftmp.0_19 = (unsigned char) _18; -- vec_pack_trunc
  iftmp.0_20 = (unsigned char) _5; -- vec_pack_trunc
  iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; -- vec_pack_trunc
  *_6 = iftmp.0_21;
  x_16 = x_24 + 1;
-dump end--


if ifcvt output things like
optimal .ifcvt--
  _41 = -x.1_14;
  _17 = (int) _41;
  _18 = _17 >> 31;
  iftmp.0_21 = x.1_14 > 255 ? _18 : _5;
  iftmp.0_22 = (unsigned char) iftmp.0_21; --- vec_pack_trunc
  *_6 = iftmp.0_22;
  x_16 = x_24 + 1;
end

we can save operations for packing mask(3 vec_pack_trunc vs 1 vec_pack_trunc?).

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-18 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #22 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #21)
> (In reply to Hongtao.liu from comment #20)
> > (In reply to Richard Biener from comment #19)
> > > Ah, so the issue is missing -mavx512bw which means we end up with a AVX2
> > > style
> > > mask for V32QImode.  With -mavx512bw the code vectorizes fine.
> > 
> > Vectorization code is worse than before, now we need to pack vectorized mask
> > which takes extra 3 instructions.
> 
> Current ifcvt convert
> 
> -dump of .ch_vect---
>   if (x.1_14 > 255)
> goto ; [50.00%]
>   else
> goto ; [50.00%]
> 
>[local count: 477815112]:
>   _17 = -_5;
>   _18 = _17 >> 31;
>   iftmp.0_19 = (unsigned char) _18;
>   goto ; [100.00%]
> 
>[local count: 477815112]:
>   iftmp.0_20 = (unsigned char) _5;
> 
>[local count: 955630225]:
>   # iftmp.0_21 = PHI 
> ---dump end-
> 
> 
> to 
>  dump of .ifcvt-
>   _41 = -x.1_14;
>   _17 = (int) _41;
>   _18 = _17 >> 31;
>   iftmp.0_19 = (unsigned char) _18; -- vec_pack_trunc
>   iftmp.0_20 = (unsigned char) _5; -- vec_pack_trunc
>   iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; -- vec_pack_trunc
>   *_6 = iftmp.0_21;
>   x_16 = x_24 + 1;
> -dump end--
> 
> 
> if ifcvt output things like
> optimal .ifcvt--
>   _41 = -x.1_14;
>   _17 = (int) _41;
>   _18 = _17 >> 31;
>   iftmp.0_21 = x.1_14 > 255 ? _18 : _5;
>   iftmp.0_22 = (unsigned char) iftmp.0_21; --- vec_pack_trunc
>   *_6 = iftmp.0_22;
>   x_16 = x_24 + 1;
> end
> 
> we can save operations for packing mask(3 vec_pack_trunc vs 1
> vec_pack_trunc?).

Or maybe a gimple simplification for it?

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-18 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #23 from rguenther at suse dot de  ---
On Tue, 18 Jan 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> 
> --- Comment #22 from Hongtao.liu  ---
> (In reply to Hongtao.liu from comment #21)
> > (In reply to Hongtao.liu from comment #20)
> > > (In reply to Richard Biener from comment #19)
> > > > Ah, so the issue is missing -mavx512bw which means we end up with a AVX2
> > > > style
> > > > mask for V32QImode.  With -mavx512bw the code vectorizes fine.
> > > 
> > > Vectorization code is worse than before, now we need to pack vectorized 
> > > mask
> > > which takes extra 3 instructions.
> > 
> > Current ifcvt convert
> > 
> > -dump of .ch_vect---
> >   if (x.1_14 > 255)
> > goto ; [50.00%]
> >   else
> > goto ; [50.00%]
> > 
> >[local count: 477815112]:
> >   _17 = -_5;
> >   _18 = _17 >> 31;
> >   iftmp.0_19 = (unsigned char) _18;
> >   goto ; [100.00%]
> > 
> >[local count: 477815112]:
> >   iftmp.0_20 = (unsigned char) _5;
> > 
> >[local count: 955630225]:
> >   # iftmp.0_21 = PHI 
> > ---dump end-
> > 
> > 
> > to 
> >  dump of .ifcvt-
> >   _41 = -x.1_14;
> >   _17 = (int) _41;
> >   _18 = _17 >> 31;
> >   iftmp.0_19 = (unsigned char) _18; -- vec_pack_trunc
> >   iftmp.0_20 = (unsigned char) _5; -- vec_pack_trunc
> >   iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; -- vec_pack_trunc
> >   *_6 = iftmp.0_21;
> >   x_16 = x_24 + 1;
> > -dump end--
> > 
> > 
> > if ifcvt output things like
> > optimal .ifcvt--
> >   _41 = -x.1_14;
> >   _17 = (int) _41;
> >   _18 = _17 >> 31;
> >   iftmp.0_21 = x.1_14 > 255 ? _18 : _5;
> >   iftmp.0_22 = (unsigned char) iftmp.0_21; --- vec_pack_trunc
> >   *_6 = iftmp.0_22;
> >   x_16 = x_24 + 1;
> > end
> > 
> > we can save operations for packing mask(3 vec_pack_trunc vs 1
> > vec_pack_trunc?).
> 
> Or maybe a gimple simplification for it?

Yes, I think that's a candidate for a match.pd simplification.
Fortunately if-conversion already folds the built stmts.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-18 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

Andrew Pinski  changed:

   What|Removed |Added

 CC||pinskia at gcc dot gnu.org

--- Comment #24 from Andrew Pinski  ---
Note phiopt has code which is supposed to factor out the cast but many reasons,
I have noticed it almost never does it (I still don't understand the cases when
it does and when it does not yet). This was something which I was going to fix
for GCC 13.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-18 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #25 from Hongtao.liu  ---
in fold_unary_loc
---cut from fold-const.cc-
 9276  else if (TREE_CODE (arg0) == COND_EXPR)
 9277{
 9278  tree arg01 = TREE_OPERAND (arg0, 1);
 9279  tree arg02 = TREE_OPERAND (arg0, 2);
 9280  if (! VOID_TYPE_P (TREE_TYPE (arg01)))
 9281arg01 = fold_build1_loc (loc, code, type,
 9282 fold_convert_loc (loc,
 9283   TREE_TYPE (op0),
arg01));
 9284  if (! VOID_TYPE_P (TREE_TYPE (arg02)))
 9285arg02 = fold_build1_loc (loc, code, type,
 9286 fold_convert_loc (loc,
 9287   TREE_TYPE (op0),
arg02));
 9288=>tem = fold_build3_loc (loc, COND_EXPR, type, TREE_OPERAND (arg0,
0),
 9289 arg01, arg02);

---end---

gcc always tries to simplify (convert (cond (cmp a b) c d)  > (cond (cmp a
b) (convert c) (convert d)), exactly the opposite of what this case wants.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-18 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #26 from Andrew Pinski  ---
(In reply to Hongtao.liu from comment #25)
> 
> gcc always tries to simplify (convert (cond (cmp a b) c d)  > (cond (cmp
> a b) (convert c) (convert d)), exactly the opposite of what this case wants.

Right that is for generic. For the gimple level the function
factor_out_conditional_conversion in phiopt is supposed to undo this but as I
mentioned it does not always.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-18 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #27 from Hongtao.liu  ---
(In reply to Andrew Pinski from comment #26)
> (In reply to Hongtao.liu from comment #25)
> > 
> > gcc always tries to simplify (convert (cond (cmp a b) c d)  > (cond (cmp
> > a b) (convert c) (convert d)), exactly the opposite of what this case wants.
> 
> Right that is for generic. For the gimple level the function
> factor_out_conditional_conversion in phiopt is supposed to undo this but as
Oh, thanks for the info. I'll leave it to you in GCC13.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-18 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #28 from Richard Biener  ---
(In reply to Hongtao.liu from comment #25)
> in fold_unary_loc
> ---cut from fold-const.cc-
>  9276  else if (TREE_CODE (arg0) == COND_EXPR)
>  9277{
>  9278  tree arg01 = TREE_OPERAND (arg0, 1);
>  9279  tree arg02 = TREE_OPERAND (arg0, 2);
>  9280  if (! VOID_TYPE_P (TREE_TYPE (arg01)))
>  9281arg01 = fold_build1_loc (loc, code, type,
>  9282 fold_convert_loc (loc,
>  9283   TREE_TYPE (op0),
> arg01));
>  9284  if (! VOID_TYPE_P (TREE_TYPE (arg02)))
>  9285arg02 = fold_build1_loc (loc, code, type,
>  9286 fold_convert_loc (loc,
>  9287   TREE_TYPE (op0),
> arg02));
>  9288=>tem = fold_build3_loc (loc, COND_EXPR, type, TREE_OPERAND
> (arg0, 0),
>  9289 arg01, arg02);
> 
> ---end---
> 
> gcc always tries to simplify (convert (cond (cmp a b) c d)  > (cond (cmp
> a b) (convert c) (convert d)), exactly the opposite of what this case wants.

It also then undos this if the result didn't simplify and plays trick to avoid
recursions.

I think this particular transform ought to be specialized, maybe to
(T)p?(T')a:(T')b or maybe done during gimplification or RTL expansion only.

The "cheap" way of avoiding a conflict is to wrap the match.pd pattern
with opposite logic in

#if GIMPLE
#endif

(with a comment explaining this)

Note that we can move a conversion out only if the sources of the conversions
have compatible types but we always can move a conversion in.

Alternatively this transform can also be done in a vectorizer pattern based
on vector compatibility of the ?: predicate with the data.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-19 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #29 from Hongtao.liu  ---
(In reply to Richard Biener from comment #28)
> (In reply to Hongtao.liu from comment #25)
> > in fold_unary_loc
> > ---cut from fold-const.cc-
> >  9276  else if (TREE_CODE (arg0) == COND_EXPR)
> >  9277{
> >  9278  tree arg01 = TREE_OPERAND (arg0, 1);
> >  9279  tree arg02 = TREE_OPERAND (arg0, 2);
> >  9280  if (! VOID_TYPE_P (TREE_TYPE (arg01)))
> >  9281arg01 = fold_build1_loc (loc, code, type,
> >  9282 fold_convert_loc (loc,
> >  9283   TREE_TYPE (op0),
> > arg01));
> >  9284  if (! VOID_TYPE_P (TREE_TYPE (arg02)))
> >  9285arg02 = fold_build1_loc (loc, code, type,
> >  9286 fold_convert_loc (loc,
> >  9287   TREE_TYPE (op0),
> > arg02));
> >  9288=>tem = fold_build3_loc (loc, COND_EXPR, type, TREE_OPERAND
> > (arg0, 0),
> >  9289 arg01, arg02);
> > 
> > ---end---
> > 
> > gcc always tries to simplify (convert (cond (cmp a b) c d)  > (cond (cmp
> > a b) (convert c) (convert d)), exactly the opposite of what this case wants.
> 
> It also then undos this if the result didn't simplify and plays trick to
> avoid
> recursions.
> 
> I think this particular transform ought to be specialized, maybe to
> (T)p?(T')a:(T')b or maybe done during gimplification or RTL expansion only.
> 
> The "cheap" way of avoiding a conflict is to wrap the match.pd pattern
> with opposite logic in
> 
> #if GIMPLE
> #endif
> 
It doesn't work, 
> (with a comment explaining this)
> 
> Note that we can move a conversion out only if the sources of the conversions
> have compatible types but we always can move a conversion in.
> 
> Alternatively this transform can also be done in a vectorizer pattern based
> on vector compatibility of the ?: predicate with the data.
yes, I'm thinking of doing this in fold_build_cond_expr which is only used by
pass_ifcvt to generate cond_expr.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-19 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #30 from rguenther at suse dot de  ---
On Wed, 19 Jan 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> 
> --- Comment #29 from Hongtao.liu  ---
> (In reply to Richard Biener from comment #28)
> > (In reply to Hongtao.liu from comment #25)
> > > in fold_unary_loc
> > > ---cut from fold-const.cc-
> > >  9276  else if (TREE_CODE (arg0) == COND_EXPR)
> > >  9277{
> > >  9278  tree arg01 = TREE_OPERAND (arg0, 1);
> > >  9279  tree arg02 = TREE_OPERAND (arg0, 2);
> > >  9280  if (! VOID_TYPE_P (TREE_TYPE (arg01)))
> > >  9281arg01 = fold_build1_loc (loc, code, type,
> > >  9282 fold_convert_loc (loc,
> > >  9283   TREE_TYPE (op0),
> > > arg01));
> > >  9284  if (! VOID_TYPE_P (TREE_TYPE (arg02)))
> > >  9285arg02 = fold_build1_loc (loc, code, type,
> > >  9286 fold_convert_loc (loc,
> > >  9287   TREE_TYPE (op0),
> > > arg02));
> > >  9288=>tem = fold_build3_loc (loc, COND_EXPR, type, TREE_OPERAND
> > > (arg0, 0),
> > >  9289 arg01, arg02);
> > > 
> > > ---end---
> > > 
> > > gcc always tries to simplify (convert (cond (cmp a b) c d)  > (cond 
> > > (cmp
> > > a b) (convert c) (convert d)), exactly the opposite of what this case 
> > > wants.
> > 
> > It also then undos this if the result didn't simplify and plays trick to
> > avoid
> > recursions.
> > 
> > I think this particular transform ought to be specialized, maybe to
> > (T)p?(T')a:(T')b or maybe done during gimplification or RTL expansion only.
> > 
> > The "cheap" way of avoiding a conflict is to wrap the match.pd pattern
> > with opposite logic in
> > 
> > #if GIMPLE
> > #endif
> > 
> It doesn't work, 
> > (with a comment explaining this)
> > 
> > Note that we can move a conversion out only if the sources of the 
> > conversions
> > have compatible types but we always can move a conversion in.
> > 
> > Alternatively this transform can also be done in a vectorizer pattern based
> > on vector compatibility of the ?: predicate with the data.
> yes, I'm thinking of doing this in fold_build_cond_expr which is only used by
> pass_ifcvt to generate cond_expr.

That's also a reasonable place but the vectorizer pattern recog phase
might have more contextual information to determine the best types
to use.   fold_build_cond_expr is probably easiest to adjust though.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-20 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #31 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:8bc700f4c3fbe405413db02281ef2918bfa831fc

commit r12-6756-g8bc700f4c3fbe405413db02281ef2918bfa831fc
Author: liuhongt 
Date:   Mon Jan 17 10:47:46 2022 +0800

Enhance vec_pack_trunc for integral mode mask.

For testcase in PR, the patch supports QI:4 -> HI:16 pack with
multi steps(first pack QI:4 -> QI:8 through vec_pack_sbool_trunc_qi,
then pack QI:8 -> HI:16 through vec_pack_trunc_hi).
Similar for QI:2 -> HI:16 which is test4 in mask-pack-prefer-128.c.

gcc/ChangeLog:

PR target/103771
* tree-vect-stmts.cc (supportable_narrowing_operation): Enhance
integral mode mask pack by multi steps which takes
vec_pack_sbool_trunc_optab as start when elements number is
less than BITS_PER_UNITS.

gcc/testsuite/ChangeLog:

* gcc.target/i386/mask-pack-prefer128.c: New test.
* gcc.target/i386/mask-pack-prefer256.c: New test.
* gcc.target/i386/pr103771.c: New test.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #32 from Hongtao.liu  ---
A patch is posted at
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589168.html

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-31 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

Andrew Pinski  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |pinskia at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #33 from Andrew Pinski  ---
Oh yes I remember why factor_out_conditional_conversion is not doing it. it is
because we have two bb but the current code does not handle that case.
That is we have:
  if (x.1_1 > 255)
goto ; [INV]
  else
goto ; [INV]

   :
  _2 = -x_5(D);
  _3 = _2 >> 31;
  iftmp.0_7 = (uint8_t) _3;
  goto ; [INV]

   :
  iftmp.0_6 = (uint8_t) x_5(D);

   :
  # iftmp.0_4 = PHI 

While the correct code only handles the case where the one of the bb4 does not
exist. Yes I am working on this for GCC 13.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-02-01 Thread law at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #34 from Jeffrey A. Law  ---
I've always wanted to see us be able to push something like those matched
conversions down through the PHI.

That would make the code look like:

  if (x.1_1 > 255)
goto ; [INV]
  else
goto ; [INV]

   :
  _2 = -x_5(D);
  _3 = _2 >> 31;
  goto ; [INV]

   :

   :
  # tmp = PHI <_3, x_5(4)>
  iftmp.0_4 = (uint8_t) tmp

And presumably we'd clean up the empty bb4 which would in turn unblock other
optimizations.

Is that what you're working on?

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-02-01 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #35 from Andrew Pinski  ---
(In reply to Jeffrey A. Law from comment #34)
> I've always wanted to see us be able to push something like those matched
> conversions down through the PHI.
> 


> Is that what you're working on?

Yes.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-02-13 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #36 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:7e204bd2f189850cb940677c99d8d93eb7dd40cd

commit r12-7216-g7e204bd2f189850cb940677c99d8d93eb7dd40cd
Author: liuhongt 
Date:   Mon Jan 24 11:05:47 2022 +0800

Add vect_recog_cond_expr_convert_pattern.

The pattern converts (cond (cmp a b) (convert c) (convert d))
to (convert (cond (cmp a b) c d)) when
1) types_match (c, d)
2) single_use for (convert c) and (convert d)
3) TYPE_PRECISION (TREE_TYPE (c)) == TYPE_PRECISION (TREE_TYPE (a))
4) INTEGERAL_TYPE_P (TREE_TYPE (c))

The pattern can save packing of mask and data(partial for data, 2 vs
1).

gcc/ChangeLog:

PR target/103771
* match.pd (cond_expr_convert_p): New match.
* tree-vect-patterns.cc (gimple_cond_expr_convert_p): Declare.
(vect_recog_cond_expr_convert_pattern): New.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr103771-2.c: New test.
* gcc.target/i386/pr103771-3.c: New test.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-02-13 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #37 from Richard Biener  ---
Is this now "good enough" for GCC 12?

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-02-13 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #38 from Hongtao.liu  ---
(In reply to Richard Biener from comment #37)
> Is this now "good enough" for GCC 12?

Yes.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-02-13 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P2
   Target Milestone|12.0|13.0

--- Comment #39 from Richard Biener  ---
Fixed for the target regression part, Andrew is working on a more generic fix
for GCC 13.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-02-17 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #40 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:754dce903ca28c4c2f2bc8614a8de5e631655f2e

commit r12-7276-g754dce903ca28c4c2f2bc8614a8de5e631655f2e
Author: liuhongt 
Date:   Wed Feb 16 12:15:18 2022 +0800

Restrict the two sources of vect_recog_cond_expr_convert_pattern to be of
the same type when convert is extension.

It's not equal to transform

 (cond (cmp @1 @2) (convert@3 @4) (convert@5 @6))

 to

 (convert (cmp @1 @2) (convert)@4 @6)

when(convert@3 @4) is extension because it's zero_extend vs sign_extend.

gcc/ChangeLog:

PR tree-optimization/104551
PR tree-optimization/103771
* match.pd (cond_expr_convert_p): Add types_match check when
convert is extension.
* tree-vect-patterns.cc
(gimple_cond_expr_convert_p): Adjust comments.
(vect_recog_cond_expr_convert_pattern): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr104551.c: New test.