[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 Andrew Pinski changed: What|Removed |Added Summary|Missed vectorization under |[12 Regression] Missed |-mavx512f -mavx512vl after |vectorization under |r12-5489|-mavx512f -mavx512vl after ||r12-5489 Target Milestone|--- |12.0 Keywords||missed-optimization
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #1 from Tamar Christina --- Looks like the change causes the simpler conditional to be detected by the vectorizer as a masked operation, which in principle makes sense: note: vect_recog_mask_conversion_pattern: detected: iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; note: mask_conversion pattern recognized: patt_43 = patt_42 ? iftmp.0_19 : iftmp.0_20; note: extra pattern stmt: patt_40 = x.1_14 > 255; note: extra pattern stmt: patt_42 = () patt_40; However not quite sure how the masking works on x86. The additional statement generated for patt_42 causes it to fail during vectorization: note: ==> examining pattern def statement: patt_42 = () patt_40; note: ==> examining statement: patt_42 = () patt_40; note: vect_is_simple_use: operand x.1_14 > 255, type of def: internal note: vect_is_simple_use: vectype vector(8) missed: conversion not supported by target. note: vect_is_simple_use: operand x.1_14 > 255, type of def: internal note: vect_is_simple_use: vectype vector(8) note: vect_is_simple_use: operand x.1_14 > 255, type of def: internal note: vect_is_simple_use: vectype vector(8) missed: not vectorized: relevant stmt not supported: patt_42 = () patt_40; missed: bad operation or unsupported loop bound. note: * Analysis failed with vector mode V32QI as there's no conversion patterns for `VEC_UNPACK_LO_EXPR` between bool and a mask. which explains why it works for AVX2 and AVX512BW. AVX512F doesn't seem to allow any QI mode conversions [1] so it fails.. Not sure why it's doing the replacement without checking to see that the target is able to vectorize the statements it generates later. Specifically it doesn't check if what's returned by build_mask_conversion is supported or not. My guess is because vectorizable_condition will fail anyway without the type of the conditional being a vector boolean. With -mavx512vl V32QI seems to generate in the pattern mask conversions between vector (8) and without it vector(32) . I think some x86 person needs to give a hint here :) [1] https://www.felixcloutier.com/x86/kunpckbw:kunpckwd:kunpckdq
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #2 from Hongtao.liu --- (In reply to Tamar Christina from comment #1) > Looks like the change causes the simpler conditional to be detected by the > vectorizer as a masked operation, which in principle makes sense: > > note: vect_recog_mask_conversion_pattern: detected: iftmp.0_21 = x.1_14 > > 255 ? iftmp.0_19 : iftmp.0_20; > note: mask_conversion pattern recognized: patt_43 = patt_42 ? iftmp.0_19 : > iftmp.0_20; > note: extra pattern stmt: patt_40 = x.1_14 > 255; > note: extra pattern stmt: patt_42 = () patt_40; > > However not quite sure how the masking works on x86. The additional > statement generated for patt_42 causes it to fail during vectorization: > > note: ==> examining pattern def statement: patt_42 = () > patt_40; > note: ==> examining statement: patt_42 = () patt_40; > note: vect_is_simple_use: operand x.1_14 > 255, type of def: internal > note: vect_is_simple_use: vectype vector(8) > missed: conversion not supported by target. > note: vect_is_simple_use: operand x.1_14 > 255, type of def: internal > note: vect_is_simple_use: vectype vector(8) > note: vect_is_simple_use: operand x.1_14 > 255, type of def: internal > note: vect_is_simple_use: vectype vector(8) > missed: not vectorized: relevant stmt not supported: patt_42 = > () patt_40; > missed: bad operation or unsupported loop bound. > note: * Analysis failed with vector mode V32QI > > as there's no conversion patterns for `VEC_UNPACK_LO_EXPR` between bool and > a mask. W/ avx512, we're using scalar mode for mask, can we use VEC_UNPACKS_SBOOL_LO_ here? Since we have vec_unpsack_sbool_lo/hi_qi which should be used for conversion from vector<8> to vector<4> . > > which explains why it works for AVX2 and AVX512BW. AVX512F doesn't seem to > allow any QI mode conversions [1] so it fails.. > > Not sure why it's doing the replacement without checking to see that the > target is able to vectorize the statements it generates later. Specifically > it doesn't check if what's returned by build_mask_conversion is supported or > not. > > My guess is because vectorizable_condition will fail anyway without the type > of the conditional being a vector boolean. > > With -mavx512vl V32QI seems to generate in the pattern mask conversions > between vector (8) and without it vector(32) > . I think some x86 person needs to give a hint here :) > > [1] https://www.felixcloutier.com/x86/kunpckbw:kunpckwd:kunpckdq
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #3 from Hongtao.liu --- for patt_42 = () patt_40; vectype_in (QImode:nunits 4) unit-size align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7fffea18ddc8 precision:1 min max > QI size unit-size align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7fffea18de70 nunits:4> vectype_out(HImode) unit-size align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7fffea18ddc8 precision:1 min max > HI size constant 16> unit-size constant 2> align:16 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7fffea18df18 nunits:16> And ‘vec_pack_sbool_trunc_m’ only handle situation when input and output have same mode.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #4 from Hongtao.liu --- (In reply to Hongtao.liu from comment #3) > for patt_42 = () patt_40; > > vectype_in (QImode:nunits 4) > > type size > unit-size > align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type > 0x7fffea18ddc8 precision:1 min max > > > QI size unit-size 0x7fffea2e2e58 1> > align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type > 0x7fffea18de70 nunits:4> > > vectype_out(HImode) > > type size > unit-size > align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type > 0x7fffea18ddc8 precision:1 min max > > > HI > size bitsizetype> constant 16> > unit-size sizetype> constant 2> > align:16 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type > 0x7fffea18df18 nunits:16> > > And ‘vec_pack_sbool_trunc_m’ only handle situation when input and output > have same mode. And GCC vectorizer only handle 2X elements, but not 4X,8X,... /* For scalar masks we may have different boolean vector types having the same QImode. Thus we add additional check for elements number. */ if (known_eq (TYPE_VECTOR_SUBPARTS (vectype) * 2, TYPE_VECTOR_SUBPARTS (narrow_vectype)))
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #5 from Hongtao.liu --- > And GCC vectorizer only handle 2X elements, but not 4X,8X,... > > /* For scalar masks we may have different boolean >vector types having the same QImode. Thus we >add additional check for elements number. */ > if (known_eq (TYPE_VECTOR_SUBPARTS (vectype) * 2, > TYPE_VECTOR_SUBPARTS (narrow_vectype))) We can have vec_pack_trunc_qi + vec_pack_sbool_trunc_qi by extending vec_pack_sbool_trunc_qi to handle 8/4, 4/2, 2/1 vectype_out/vectype_in elements.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 Richard Biener changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2022-01-13 CC||rguenth at gcc dot gnu.org, ||rsandifo at gcc dot gnu.org Status|UNCONFIRMED |NEW --- Comment #6 from Richard Biener --- (In reply to Tamar Christina from comment #1) > Looks like the change causes the simpler conditional to be detected by the > vectorizer as a masked operation, which in principle makes sense: > > note: vect_recog_mask_conversion_pattern: detected: iftmp.0_21 = x.1_14 > > 255 ? iftmp.0_19 : iftmp.0_20; > note: mask_conversion pattern recognized: patt_43 = patt_42 ? iftmp.0_19 : > iftmp.0_20; > note: extra pattern stmt: patt_40 = x.1_14 > 255; > note: extra pattern stmt: patt_42 = () patt_40; if we look at the ifcvt result we see iftmp.0_19 = (unsigned char) _18; iftmp.0_20 = (unsigned char) _5; iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; *_6 = iftmp.0_21; I suspect we intended to carry the fact that x.1_14 > 255 will produce a mask from a SImode vector element compare but we need a mask suitable for a QImode vector element select (iftmp.0_19 and iftmp.0_20). That's not something we can express in scalar code and thus a pattern I think. So the pattern as generated doesn't make very much sense to me. I suppose it might try to convert a AVX512 mask to a AVX2 style mask but that needs to be done with patt_42 = patt_40 ? ()-1 : 0; not with a conversion. But it's also somewhat pointless since it will simply cause the same issue again - the COND_EXPR vectorization will need to cobble up 4 AVX512 masks to produce the desired result.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #7 from Richard Biener --- Forcing the pattern to not trigger produces the expected t.c:8:6: missed: not vectorized: relevant stmt not supported: iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; since condition vectorization itself doesn't know how to handle this, we end up at if (vectype1 && !useless_type_conversion_p (vectype, vectype1)) return false; with vectype V32QI and vectype1 V8SI. Splitting out the compare from the COND_EXPR in the pattern but leaving out the attempt to "widen" it reveals the same fact that vectorizable_condition doesn't support packing of multiple vector defs for the mask operand. I think that is what we need to add. We also don't have a good representation for "packing" of masks. diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c index 3ea905538e1..729a1d32612 100644 --- a/gcc/tree-vect-patterns.c +++ b/gcc/tree-vect-patterns.c @@ -4679,8 +4679,10 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo, rhs1_type); } - if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1), - TYPE_VECTOR_SUBPARTS (vectype2))) + /* AVX512 style masks cannot be packed/unpacked. */ + if (TYPE_PRECISION (TREE_TYPE (vectype2)) != 1 + && maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1), + TYPE_VECTOR_SUBPARTS (vectype2))) tmp = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo); else tmp = rhs1;
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #8 from Hongtao.liu --- (In reply to Richard Biener from comment #7) > Forcing the pattern to not trigger produces the expected > > t.c:8:6: missed: not vectorized: relevant stmt not supported: iftmp.0_21 = > x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; > > since condition vectorization itself doesn't know how to handle this, we end > up at > > if (vectype1 && !useless_type_conversion_p (vectype, vectype1)) > return false; > > with vectype V32QI and vectype1 V8SI. > > Splitting out the compare from the COND_EXPR in the pattern but leaving out > the attempt to "widen" it reveals the same fact that vectorizable_condition > doesn't support packing of multiple vector defs for the mask operand. > > I think that is what we need to add. We also don't have a good > representation > for "packing" of masks. > shouldn't multi_step_cvt be supposed to handle this, just a little ambiguous between ‘vec_pack_sbool_trunc_m’ and ‘vec_pack_trunc_m’, need to make vectorizer use vec_pack_sbool_trunc_qi + vec_pack_trunc_qi to get a HI mask, just like how we pack 4 QImode to 1 SImode mask by vec_pack_trunc_qi + vec_pack_trunc_hi (in the main loop with -mprefer-vector-width=256) mask_patt_40.26_118 = vect_x.18_95 > { 255, 255, 255, 255, 255, 255, 255, 255 }; mask_patt_40.26_119 = vect_x.18_96 > { 255, 255, 255, 255, 255, 255, 255, 255 }; mask_patt_40.26_120 = vect_x.18_97 > { 255, 255, 255, 255, 255, 255, 255, 255 }; mask_patt_40.26_121 = vect_x.18_98 > { 255, 255, 255, 255, 255, 255, 255, 255 }; mask_patt_42.28_122 = VEC_PACK_TRUNC_EXPR ; mask_patt_42.28_123 = VEC_PACK_TRUNC_EXPR ; mask_patt_42.27_124 = VEC_PACK_TRUNC_EXPR ; vect_patt_43.29_125 = VEC_COND_EXPR ; iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20;
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #9 from rsandifo at gcc dot gnu.org --- (In reply to Richard Biener from comment #7) > I think that is what we need to add. We also don't have a good > representation > for "packing" of masks. > > diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c > index 3ea905538e1..729a1d32612 100644 > --- a/gcc/tree-vect-patterns.c > +++ b/gcc/tree-vect-patterns.c > @@ -4679,8 +4679,10 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo, > rhs1_type); > } > > - if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1), > - TYPE_VECTOR_SUBPARTS (vectype2))) > + /* AVX512 style masks cannot be packed/unpacked. */ > + if (TYPE_PRECISION (TREE_TYPE (vectype2)) != 1 > + && maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1), > + TYPE_VECTOR_SUBPARTS (vectype2))) > tmp = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo); >else > tmp = rhs1; Haven't had time to look at it properly yet, but my first impression is that that's likely to regress SVE. Packing and unpacking are natural operations on boolean vector modes.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #10 from Hongtao.liu --- with @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code code, c1 = VEC_PACK_TRUNC_EXPR; if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype) && VECTOR_BOOLEAN_TYPE_P (vectype) - && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype) + && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype) + || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT)) && SCALAR_INT_MODE_P (TYPE_MODE (vectype))) optab1 = vec_pack_sbool_trunc_optab; else @@ -12213,6 +12214,7 @@ supportable_narrowing_operation (enum tree_code code, if (VECTOR_BOOLEAN_TYPE_P (intermediate_type) && VECTOR_BOOLEAN_TYPE_P (prev_type) && intermediate_mode == prev_mode + && known_lt (TYPE_VECTOR_SUBPARTS (intermediate_type), BITS_PER_UNIT) && SCALAR_INT_MODE_P (prev_mode)) interm_optab = vec_pack_sbool_trunc_optab; else -march=icelake-server -O3 -mprefer-vector-width=128 now can get vectorized loop. vmovdqu8(%rsi,%rax), %xmm0 vpmovzxbw %xmm0, %xmm2 vpmovzxwd %xmm2, %xmm1 vpsrldq $8, %xmm0, %xmm0 vpsrldq $8, %xmm2, %xmm2 vpmovzxbw %xmm0, %xmm0 vpmovzxwd %xmm2, %xmm2 vpmulld %xmm9, %xmm1, %xmm1 vpmulld %xmm9, %xmm2, %xmm2 vpmovzxwd %xmm0, %xmm4 vpsrldq $8, %xmm0, %xmm0 vpmovzxwd %xmm0, %xmm0 vpmulld %xmm9, %xmm4, %xmm4 vpmulld %xmm9, %xmm0, %xmm0 vpcmpud $6, %xmm6, %xmm1, %k0 vpsubd %xmm1, %xmm7, %xmm3 vpcmpud $6, %xmm6, %xmm2, %k1 vpsubd %xmm2, %xmm7, %xmm5 vpsrad $31, %xmm5, %xmm5 vpsrad $31, %xmm3, %xmm3 vpermt2w%xmm5, %xmm8, %xmm3 vpsubd %xmm0, %xmm7, %xmm10 vpsubd %xmm4, %xmm7, %xmm5 kshiftlb$4, %k1, %k1 vpcmpud $6, %xmm6, %xmm0, %k2 vpsrad $31, %xmm5, %xmm5 vpsrad $31, %xmm10, %xmm10 kandb %k3, %k0, %k0 korb%k1, %k0, %k0 vpcmpud $6, %xmm6, %xmm4, %k1 vpermt2w%xmm10, %xmm8, %xmm5 vpermt2w%xmm2, %xmm8, %xmm1 vpermt2w%xmm0, %xmm8, %xmm4 vpermt2b%xmm5, %xmm11, %xmm3 vpermt2b%xmm4, %xmm11, %xmm1 kandb %k3, %k1, %k1 kshiftlb$4, %k2, %k2 korb%k2, %k1, %k1 kunpckbw%k0, %k1, %k1 vmovdqu8%xmm3, %xmm1{%k1} vmovdqu8%xmm1, (%rdi,%rax) addq$16, %rax cmpq%rax, %r8 jne .L4
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #11 from Hongtao.liu --- (In reply to Hongtao.liu from comment #10) > with > @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code code, >c1 = VEC_PACK_TRUNC_EXPR; >if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype) > && VECTOR_BOOLEAN_TYPE_P (vectype) > - && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype) > + && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype) > + || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT)) > && SCALAR_INT_MODE_P (TYPE_MODE (vectype))) > optab1 = vec_pack_sbool_trunc_optab; >else > @@ -12213,6 +12214,7 @@ supportable_narrowing_operation (enum tree_code code, >if (VECTOR_BOOLEAN_TYPE_P (intermediate_type) > && VECTOR_BOOLEAN_TYPE_P (prev_type) > && intermediate_mode == prev_mode > + && known_lt (TYPE_VECTOR_SUBPARTS (intermediate_type), BITS_PER_UNIT) > && SCALAR_INT_MODE_P (prev_mode)) > interm_optab = vec_pack_sbool_trunc_optab; >else > > -march=icelake-server -O3 -mprefer-vector-width=128 now can get vectorized > loop. > > > vmovdqu8(%rsi,%rax), %xmm0 > vpmovzxbw %xmm0, %xmm2 > vpmovzxwd %xmm2, %xmm1 > vpsrldq $8, %xmm0, %xmm0 > vpsrldq $8, %xmm2, %xmm2 > vpmovzxbw %xmm0, %xmm0 > vpmovzxwd %xmm2, %xmm2 > vpmulld %xmm9, %xmm1, %xmm1 > vpmulld %xmm9, %xmm2, %xmm2 > vpmovzxwd %xmm0, %xmm4 > vpsrldq $8, %xmm0, %xmm0 > vpmovzxwd %xmm0, %xmm0 > vpmulld %xmm9, %xmm4, %xmm4 > vpmulld %xmm9, %xmm0, %xmm0 > vpcmpud $6, %xmm6, %xmm1, %k0 > vpsubd %xmm1, %xmm7, %xmm3 > vpcmpud $6, %xmm6, %xmm2, %k1 > vpsubd %xmm2, %xmm7, %xmm5 > vpsrad $31, %xmm5, %xmm5 > vpsrad $31, %xmm3, %xmm3 > vpermt2w%xmm5, %xmm8, %xmm3 > vpsubd %xmm0, %xmm7, %xmm10 > vpsubd %xmm4, %xmm7, %xmm5 > kshiftlb$4, %k1, %k1 > vpcmpud $6, %xmm6, %xmm0, %k2 > vpsrad $31, %xmm5, %xmm5 > vpsrad $31, %xmm10, %xmm10 > kandb %k3, %k0, %k0 > korb%k1, %k0, %k0 > vpcmpud $6, %xmm6, %xmm4, %k1 > vpermt2w%xmm10, %xmm8, %xmm5 > vpermt2w%xmm2, %xmm8, %xmm1 > vpermt2w%xmm0, %xmm8, %xmm4 > vpermt2b%xmm5, %xmm11, %xmm3 > vpermt2b%xmm4, %xmm11, %xmm1 > kandb %k3, %k1, %k1 > kshiftlb$4, %k2, %k2 > korb%k2, %k1, %k1 > kunpckbw%k0, %k1, %k1 > vmovdqu8%xmm3, %xmm1{%k1} > vmovdqu8%xmm1, (%rdi,%rax) > addq$16, %rax > cmpq%rax, %r8 > jne .L4 But still not as good as before, since original version we only need to pack data which is produced by vec_cond_expr, but now need to extraly pack mask. before # x_24 = PHI # vectp_src.11_73 = PHI # vectp_dst.23_112 = PHI # ivtmp_115 = PHI # DEBUG x => NULL # DEBUG BEGIN_STMT _1 = (sizetype) x_24; _2 = src_11(D) + _1; vect__3.13_75 = MEM [(uint8_t *)vectp_src.11_73]; _3 = *_2; vect__4.15_76 = [vec_unpack_lo_expr] vect__3.13_75; vect__4.15_77 = [vec_unpack_hi_expr] vect__3.13_75; vect__4.14_78 = [vec_unpack_lo_expr] vect__4.15_76; vect__4.14_79 = [vec_unpack_hi_expr] vect__4.15_76; vect__4.14_80 = [vec_unpack_lo_expr] vect__4.15_77; vect__4.14_81 = [vec_unpack_hi_expr] vect__4.15_77; _4 = (int) _3; vect__5.16_83 = vect__4.14_78 * vect_cst__82; vect__5.16_84 = vect__4.14_79 * vect_cst__82; vect__5.16_85 = vect__4.14_80 * vect_cst__82; vect__5.16_86 = vect__4.14_81 * vect_cst__82; _5 = _4 * i_scale_12(D); _6 = dst_13(D) + _1; # DEBUG x => NULL # DEBUG INLINE_ENTRY x264_clip_uint8 # DEBUG BEGIN_STMT vect__14.17_88 = vect__5.16_83 & vect_cst__87; vect__14.17_89 = vect__5.16_84 & vect_cst__87; vect__14.17_90 = vect__5.16_85 & vect_cst__87; vect__14.17_91 = vect__5.16_86 & vect_cst__87; _14 = _5 & -256; vect__17.18_92 = -vect__5.16_83; vect__17.18_93 = -vect__5.16_84; vect__17.18_94 = -vect__5.16_85; vect__17.18_95 = -vect__5.16_86; _17 = -_5; vect__18.19_96 = vect__17.18_92 >> 31; vect__18.19_97 = vect__17.18_93 >> 31; vect__18.19_98 = vect__17.18_94 >> 31; vect__18.19_99 = vect__17.18_95 >> 31; _18 = _17 >> 31; iftmp.0_19 = (unsigned char) _18; iftmp.0_20 = (unsigned char) _5; _101 = vect__14.17_88 != vect_cst__100; vect_patt_40.20_102 = VEC_COND_EXPR <_101, vect__18.19_96, vect__5.16_83>; _103 = vect__14.17_89 != vect_cst__100; vect_patt_40.20_104 = VEC_COND_EXPR <_103, vect__18.19_97, vect__5.16_84>; _105 = vect__14.17_90 != vect_cst__100; vect_patt_40.20_106 = VEC_COND_EXPR <_105, vect__18.19_98, vect__5.16_85>; _107 = vect__14.17_91 != vect_cst__100; vect_patt_40.20_108 = VEC_COND_EXPR <_107, vect__18.19_99, vect__5.16_86>; vect_patt_41.22_109 = VEC_PACK_TRUNC_EXPR ; vect_p
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #12 from rguenther at suse dot de --- On Thu, 13 Jan 2022, crazylht at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 > > --- Comment #10 from Hongtao.liu --- > with > @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code code, >c1 = VEC_PACK_TRUNC_EXPR; >if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype) > && VECTOR_BOOLEAN_TYPE_P (vectype) > - && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype) > + && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype) > + || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT)) > && SCALAR_INT_MODE_P (TYPE_MODE (vectype))) I think we instead simply want if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype) && TYPE_PRECISION (TREE_TYPE (narrow_vectype)) == 1 && VECTOR_BOOLEAN_TYPE_P (vectype) && TYPE_PRECISION (TREE_TYPE (vectype)) == 1) note the docs of vec_pack_sbool_trunc say This instruction pattern is used when all the vector input and output operands have the same scalar mode @var{m} and thus using @code{vec_pack_trunc_@var{m}} would be ambiguous. It also says "_Narrow_ and merge the elements of two vectors.", I think "narrow" is misleading here, _trunc in the optab name as well. So with the above it suggests we could have used vect_pack_trunc_hi here? To avoid breaking things for the VnBImode using targets we probably want to retain the SCALAR_INT_MODE_P (prev_mode) check. And we probably want to adjust the documentation a bit. This all is with my pasted pattern patch or is this with the weird inserted conversion still?
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #13 from rguenther at suse dot de --- On Thu, 13 Jan 2022, rsandifo at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 > > --- Comment #9 from rsandifo at gcc dot gnu.org > --- > (In reply to Richard Biener from comment #7) > > I think that is what we need to add. We also don't have a good > > representation > > for "packing" of masks. > > > > diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c > > index 3ea905538e1..729a1d32612 100644 > > --- a/gcc/tree-vect-patterns.c > > +++ b/gcc/tree-vect-patterns.c > > @@ -4679,8 +4679,10 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo, > > rhs1_type); > > } > > > > - if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1), > > - TYPE_VECTOR_SUBPARTS (vectype2))) > > + /* AVX512 style masks cannot be packed/unpacked. */ > > + if (TYPE_PRECISION (TREE_TYPE (vectype2)) != 1 > > + && maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1), > > + TYPE_VECTOR_SUBPARTS (vectype2))) > > tmp = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo); > >else > > tmp = rhs1; > Haven't had time to look at it properly yet, but my first impression > is that that's likely to regress SVE. Packing and unpacking are > natural operations on boolean vector modes. Sure, but we can't produce scalar code mimicking this for 1 bit element vectors. It "works" for the others because based on the width of the elements we choose a vector with different number of elements. But here the pattern produces a 8 bit element which isn't want is desired here.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #14 from Hongtao.liu --- > > But still not as good as before, since original version we only need to pack > data which is produced by vec_cond_expr, but now need to extraly pack mask. > > Also for non-avx512 target, it looks like a regression for non-sve target. https://godbolt.org/z/4aavMjjG4
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #15 from Hongtao.liu --- (In reply to rguent...@suse.de from comment #12) > On Thu, 13 Jan 2022, crazylht at gmail dot com wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 > > > > --- Comment #10 from Hongtao.liu --- > > with > > @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code > > code, > >c1 = VEC_PACK_TRUNC_EXPR; > >if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype) > > && VECTOR_BOOLEAN_TYPE_P (vectype) > > - && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype) > > + && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype) > > + || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT)) > > && SCALAR_INT_MODE_P (TYPE_MODE (vectype))) > > I think we instead simply want > > if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype) > && TYPE_PRECISION (TREE_TYPE (narrow_vectype)) == 1 > && VECTOR_BOOLEAN_TYPE_P (vectype) > && TYPE_PRECISION (TREE_TYPE (vectype)) == 1) > > note the docs of vec_pack_sbool_trunc say > > This instruction pattern is used when all the vector input and output > operands have the same scalar mode @var{m} and thus using > @code{vec_pack_trunc_@var{m}} would be ambiguous. > > It also says "_Narrow_ and merge the elements of two vectors.", I think > "narrow" is misleading here, _trunc in the optab name as well. So > with the above it suggests we could have used vect_pack_trunc_hi here? > > To avoid breaking things for the VnBImode using targets we probably > want to retain the SCALAR_INT_MODE_P (prev_mode) check. And we > probably want to adjust the documentation a bit. > > This all is with my pasted pattern patch or is this with the weird > inserted conversion still? It's w/o your patch, I'm try to handle the weird conversion with multi steps(first pack QI:4 -> QI:8 through vec_pack_sbool_trunc_qi, then pack QI:8 -> HI:16 through vec_pack_sbool_trunc_hi). But on the othersize the weird inserted conversion shouldn't be existed.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #16 from rguenther at suse dot de --- On Thu, 13 Jan 2022, crazylht at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 > > --- Comment #15 from Hongtao.liu --- > (In reply to rguent...@suse.de from comment #12) > > On Thu, 13 Jan 2022, crazylht at gmail dot com wrote: > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 > > > > > > --- Comment #10 from Hongtao.liu --- > > > with > > > @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code > > > code, > > >c1 = VEC_PACK_TRUNC_EXPR; > > >if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype) > > > && VECTOR_BOOLEAN_TYPE_P (vectype) > > > - && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype) > > > + && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype) > > > + || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT)) > > > && SCALAR_INT_MODE_P (TYPE_MODE (vectype))) > > > > I think we instead simply want > > > > if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype) > > && TYPE_PRECISION (TREE_TYPE (narrow_vectype)) == 1 > > && VECTOR_BOOLEAN_TYPE_P (vectype) > > && TYPE_PRECISION (TREE_TYPE (vectype)) == 1) > > > > note the docs of vec_pack_sbool_trunc say > > > > This instruction pattern is used when all the vector input and output > > operands have the same scalar mode @var{m} and thus using > > @code{vec_pack_trunc_@var{m}} would be ambiguous. > > > > It also says "_Narrow_ and merge the elements of two vectors.", I think > > "narrow" is misleading here, _trunc in the optab name as well. So > > with the above it suggests we could have used vect_pack_trunc_hi here? > > > > To avoid breaking things for the VnBImode using targets we probably > > want to retain the SCALAR_INT_MODE_P (prev_mode) check. And we > > probably want to adjust the documentation a bit. > > > > This all is with my pasted pattern patch or is this with the weird > > inserted conversion still? > > It's w/o your patch, I'm try to handle the weird conversion with multi > steps(first pack QI:4 -> QI:8 through vec_pack_sbool_trunc_qi, then pack QI:8 > -> HI:16 through vec_pack_sbool_trunc_hi). But on the othersize the weird > inserted conversion shouldn't be existed. But the weird conversion suggests packing { 0, 1, 0, 1 } as { 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ... } thus expanding each bit to 8 bits. So it's rather an unpacking :/ As said, the scalar conversion does not make any sense... But maybe I'm missing something very obvious?
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #17 from Hongtao.liu --- > As said, the scalar conversion does not make any sense... Agree.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #18 from rsandifo at gcc dot gnu.org --- Again haven't really looked at this in detail, so could be wrong, but: (In reply to rguent...@suse.de from comment #13) > On Thu, 13 Jan 2022, rsandifo at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 > > > > --- Comment #9 from rsandifo at gcc dot gnu.org > gnu.org> --- > > (In reply to Richard Biener from comment #7) > > > I think that is what we need to add. We also don't have a good > > > representation > > > for "packing" of masks. > > > > > > diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c > > > index 3ea905538e1..729a1d32612 100644 > > > --- a/gcc/tree-vect-patterns.c > > > +++ b/gcc/tree-vect-patterns.c > > > @@ -4679,8 +4679,10 @@ vect_recog_mask_conversion_pattern (vec_info > > > *vinfo, > > > rhs1_type); > > > } > > > > > > - if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1), > > > - TYPE_VECTOR_SUBPARTS (vectype2))) > > > + /* AVX512 style masks cannot be packed/unpacked. */ > > > + if (TYPE_PRECISION (TREE_TYPE (vectype2)) != 1 > > > + && maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1), > > > + TYPE_VECTOR_SUBPARTS (vectype2))) > > > tmp = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo); > > >else > > > tmp = rhs1; > > Haven't had time to look at it properly yet, but my first impression > > is that that's likely to regress SVE. Packing and unpacking are > > natural operations on boolean vector modes. > > Sure, but we can't produce scalar code mimicking this for > 1 bit element vectors. Yeah, but at least in the SVE case, we're not aiming to do that. The vector boolean type that we want to use is recorded in the STMT_VINFO_VECTYPE of the conversion instead, and doesn't get recomputed later. Like you said in the earlier comment, the fact that we can't do that on scalar code is why this needs to be a vector pattern. (As discussed elsewhere, it would be good if we didn't commit vector types early like this. But in this case I don't think we can rely on scalar code to model the conversion either.)
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #19 from Richard Biener --- Ah, so the issue is missing -mavx512bw which means we end up with a AVX2 style mask for V32QImode. With -mavx512bw the code vectorizes fine. I guess ix86_get_mask_mode is too restrictive here? And indeed to build a AVX2 style mask vector from a AVX512 style mask vector we'd need to use a ?:, not a simple conversion. Unfortunately we don't support vectorizing that: t.c:10:21: note: ==> examining statement: patt_42 = patt_40 ? -1 : 0; t.c:10:21: note: vect_is_simple_use: operand x.1_14 > 255, type of def: internal t.c:10:21: note: vect_is_simple_use: vectype vector(8) t.c:10:21: note: vect_is_simple_use: operand -1, type of def: constant t.c:10:21: note: vect_is_simple_use: operand 0, type of def: constant t.c:8:6: missed: not vectorized: relevant stmt not supported: patt_42 = patt_40 ? -1 : 0; or maybe we need to tweak the conversion vectorization for better support of this case as noted in comment#10 ... but the code generated from that AFAIU is to produce a HImode mask, not a V32QImode one which puts the fault on ix86_get_mask_mode again.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #20 from Hongtao.liu --- (In reply to Richard Biener from comment #19) > Ah, so the issue is missing -mavx512bw which means we end up with a AVX2 > style > mask for V32QImode. With -mavx512bw the code vectorizes fine. Vectorization code is worse than before, now we need to pack vectorized mask which takes extra 3 instructions.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #21 from Hongtao.liu --- (In reply to Hongtao.liu from comment #20) > (In reply to Richard Biener from comment #19) > > Ah, so the issue is missing -mavx512bw which means we end up with a AVX2 > > style > > mask for V32QImode. With -mavx512bw the code vectorizes fine. > > Vectorization code is worse than before, now we need to pack vectorized mask > which takes extra 3 instructions. Current ifcvt convert -dump of .ch_vect--- if (x.1_14 > 255) goto ; [50.00%] else goto ; [50.00%] [local count: 477815112]: _17 = -_5; _18 = _17 >> 31; iftmp.0_19 = (unsigned char) _18; goto ; [100.00%] [local count: 477815112]: iftmp.0_20 = (unsigned char) _5; [local count: 955630225]: # iftmp.0_21 = PHI ---dump end- to dump of .ifcvt- _41 = -x.1_14; _17 = (int) _41; _18 = _17 >> 31; iftmp.0_19 = (unsigned char) _18; -- vec_pack_trunc iftmp.0_20 = (unsigned char) _5; -- vec_pack_trunc iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; -- vec_pack_trunc *_6 = iftmp.0_21; x_16 = x_24 + 1; -dump end-- if ifcvt output things like optimal .ifcvt-- _41 = -x.1_14; _17 = (int) _41; _18 = _17 >> 31; iftmp.0_21 = x.1_14 > 255 ? _18 : _5; iftmp.0_22 = (unsigned char) iftmp.0_21; --- vec_pack_trunc *_6 = iftmp.0_22; x_16 = x_24 + 1; end we can save operations for packing mask(3 vec_pack_trunc vs 1 vec_pack_trunc?).
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #22 from Hongtao.liu --- (In reply to Hongtao.liu from comment #21) > (In reply to Hongtao.liu from comment #20) > > (In reply to Richard Biener from comment #19) > > > Ah, so the issue is missing -mavx512bw which means we end up with a AVX2 > > > style > > > mask for V32QImode. With -mavx512bw the code vectorizes fine. > > > > Vectorization code is worse than before, now we need to pack vectorized mask > > which takes extra 3 instructions. > > Current ifcvt convert > > -dump of .ch_vect--- > if (x.1_14 > 255) > goto ; [50.00%] > else > goto ; [50.00%] > >[local count: 477815112]: > _17 = -_5; > _18 = _17 >> 31; > iftmp.0_19 = (unsigned char) _18; > goto ; [100.00%] > >[local count: 477815112]: > iftmp.0_20 = (unsigned char) _5; > >[local count: 955630225]: > # iftmp.0_21 = PHI > ---dump end- > > > to > dump of .ifcvt- > _41 = -x.1_14; > _17 = (int) _41; > _18 = _17 >> 31; > iftmp.0_19 = (unsigned char) _18; -- vec_pack_trunc > iftmp.0_20 = (unsigned char) _5; -- vec_pack_trunc > iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; -- vec_pack_trunc > *_6 = iftmp.0_21; > x_16 = x_24 + 1; > -dump end-- > > > if ifcvt output things like > optimal .ifcvt-- > _41 = -x.1_14; > _17 = (int) _41; > _18 = _17 >> 31; > iftmp.0_21 = x.1_14 > 255 ? _18 : _5; > iftmp.0_22 = (unsigned char) iftmp.0_21; --- vec_pack_trunc > *_6 = iftmp.0_22; > x_16 = x_24 + 1; > end > > we can save operations for packing mask(3 vec_pack_trunc vs 1 > vec_pack_trunc?). Or maybe a gimple simplification for it?
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #23 from rguenther at suse dot de --- On Tue, 18 Jan 2022, crazylht at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 > > --- Comment #22 from Hongtao.liu --- > (In reply to Hongtao.liu from comment #21) > > (In reply to Hongtao.liu from comment #20) > > > (In reply to Richard Biener from comment #19) > > > > Ah, so the issue is missing -mavx512bw which means we end up with a AVX2 > > > > style > > > > mask for V32QImode. With -mavx512bw the code vectorizes fine. > > > > > > Vectorization code is worse than before, now we need to pack vectorized > > > mask > > > which takes extra 3 instructions. > > > > Current ifcvt convert > > > > -dump of .ch_vect--- > > if (x.1_14 > 255) > > goto ; [50.00%] > > else > > goto ; [50.00%] > > > >[local count: 477815112]: > > _17 = -_5; > > _18 = _17 >> 31; > > iftmp.0_19 = (unsigned char) _18; > > goto ; [100.00%] > > > >[local count: 477815112]: > > iftmp.0_20 = (unsigned char) _5; > > > >[local count: 955630225]: > > # iftmp.0_21 = PHI > > ---dump end- > > > > > > to > > dump of .ifcvt- > > _41 = -x.1_14; > > _17 = (int) _41; > > _18 = _17 >> 31; > > iftmp.0_19 = (unsigned char) _18; -- vec_pack_trunc > > iftmp.0_20 = (unsigned char) _5; -- vec_pack_trunc > > iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; -- vec_pack_trunc > > *_6 = iftmp.0_21; > > x_16 = x_24 + 1; > > -dump end-- > > > > > > if ifcvt output things like > > optimal .ifcvt-- > > _41 = -x.1_14; > > _17 = (int) _41; > > _18 = _17 >> 31; > > iftmp.0_21 = x.1_14 > 255 ? _18 : _5; > > iftmp.0_22 = (unsigned char) iftmp.0_21; --- vec_pack_trunc > > *_6 = iftmp.0_22; > > x_16 = x_24 + 1; > > end > > > > we can save operations for packing mask(3 vec_pack_trunc vs 1 > > vec_pack_trunc?). > > Or maybe a gimple simplification for it? Yes, I think that's a candidate for a match.pd simplification. Fortunately if-conversion already folds the built stmts.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 Andrew Pinski changed: What|Removed |Added CC||pinskia at gcc dot gnu.org --- Comment #24 from Andrew Pinski --- Note phiopt has code which is supposed to factor out the cast but many reasons, I have noticed it almost never does it (I still don't understand the cases when it does and when it does not yet). This was something which I was going to fix for GCC 13.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #25 from Hongtao.liu --- in fold_unary_loc ---cut from fold-const.cc- 9276 else if (TREE_CODE (arg0) == COND_EXPR) 9277{ 9278 tree arg01 = TREE_OPERAND (arg0, 1); 9279 tree arg02 = TREE_OPERAND (arg0, 2); 9280 if (! VOID_TYPE_P (TREE_TYPE (arg01))) 9281arg01 = fold_build1_loc (loc, code, type, 9282 fold_convert_loc (loc, 9283 TREE_TYPE (op0), arg01)); 9284 if (! VOID_TYPE_P (TREE_TYPE (arg02))) 9285arg02 = fold_build1_loc (loc, code, type, 9286 fold_convert_loc (loc, 9287 TREE_TYPE (op0), arg02)); 9288=>tem = fold_build3_loc (loc, COND_EXPR, type, TREE_OPERAND (arg0, 0), 9289 arg01, arg02); ---end--- gcc always tries to simplify (convert (cond (cmp a b) c d) > (cond (cmp a b) (convert c) (convert d)), exactly the opposite of what this case wants.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #26 from Andrew Pinski --- (In reply to Hongtao.liu from comment #25) > > gcc always tries to simplify (convert (cond (cmp a b) c d) > (cond (cmp > a b) (convert c) (convert d)), exactly the opposite of what this case wants. Right that is for generic. For the gimple level the function factor_out_conditional_conversion in phiopt is supposed to undo this but as I mentioned it does not always.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #27 from Hongtao.liu --- (In reply to Andrew Pinski from comment #26) > (In reply to Hongtao.liu from comment #25) > > > > gcc always tries to simplify (convert (cond (cmp a b) c d) > (cond (cmp > > a b) (convert c) (convert d)), exactly the opposite of what this case wants. > > Right that is for generic. For the gimple level the function > factor_out_conditional_conversion in phiopt is supposed to undo this but as Oh, thanks for the info. I'll leave it to you in GCC13.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #28 from Richard Biener --- (In reply to Hongtao.liu from comment #25) > in fold_unary_loc > ---cut from fold-const.cc- > 9276 else if (TREE_CODE (arg0) == COND_EXPR) > 9277{ > 9278 tree arg01 = TREE_OPERAND (arg0, 1); > 9279 tree arg02 = TREE_OPERAND (arg0, 2); > 9280 if (! VOID_TYPE_P (TREE_TYPE (arg01))) > 9281arg01 = fold_build1_loc (loc, code, type, > 9282 fold_convert_loc (loc, > 9283 TREE_TYPE (op0), > arg01)); > 9284 if (! VOID_TYPE_P (TREE_TYPE (arg02))) > 9285arg02 = fold_build1_loc (loc, code, type, > 9286 fold_convert_loc (loc, > 9287 TREE_TYPE (op0), > arg02)); > 9288=>tem = fold_build3_loc (loc, COND_EXPR, type, TREE_OPERAND > (arg0, 0), > 9289 arg01, arg02); > > ---end--- > > gcc always tries to simplify (convert (cond (cmp a b) c d) > (cond (cmp > a b) (convert c) (convert d)), exactly the opposite of what this case wants. It also then undos this if the result didn't simplify and plays trick to avoid recursions. I think this particular transform ought to be specialized, maybe to (T)p?(T')a:(T')b or maybe done during gimplification or RTL expansion only. The "cheap" way of avoiding a conflict is to wrap the match.pd pattern with opposite logic in #if GIMPLE #endif (with a comment explaining this) Note that we can move a conversion out only if the sources of the conversions have compatible types but we always can move a conversion in. Alternatively this transform can also be done in a vectorizer pattern based on vector compatibility of the ?: predicate with the data.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #29 from Hongtao.liu --- (In reply to Richard Biener from comment #28) > (In reply to Hongtao.liu from comment #25) > > in fold_unary_loc > > ---cut from fold-const.cc- > > 9276 else if (TREE_CODE (arg0) == COND_EXPR) > > 9277{ > > 9278 tree arg01 = TREE_OPERAND (arg0, 1); > > 9279 tree arg02 = TREE_OPERAND (arg0, 2); > > 9280 if (! VOID_TYPE_P (TREE_TYPE (arg01))) > > 9281arg01 = fold_build1_loc (loc, code, type, > > 9282 fold_convert_loc (loc, > > 9283 TREE_TYPE (op0), > > arg01)); > > 9284 if (! VOID_TYPE_P (TREE_TYPE (arg02))) > > 9285arg02 = fold_build1_loc (loc, code, type, > > 9286 fold_convert_loc (loc, > > 9287 TREE_TYPE (op0), > > arg02)); > > 9288=>tem = fold_build3_loc (loc, COND_EXPR, type, TREE_OPERAND > > (arg0, 0), > > 9289 arg01, arg02); > > > > ---end--- > > > > gcc always tries to simplify (convert (cond (cmp a b) c d) > (cond (cmp > > a b) (convert c) (convert d)), exactly the opposite of what this case wants. > > It also then undos this if the result didn't simplify and plays trick to > avoid > recursions. > > I think this particular transform ought to be specialized, maybe to > (T)p?(T')a:(T')b or maybe done during gimplification or RTL expansion only. > > The "cheap" way of avoiding a conflict is to wrap the match.pd pattern > with opposite logic in > > #if GIMPLE > #endif > It doesn't work, > (with a comment explaining this) > > Note that we can move a conversion out only if the sources of the conversions > have compatible types but we always can move a conversion in. > > Alternatively this transform can also be done in a vectorizer pattern based > on vector compatibility of the ?: predicate with the data. yes, I'm thinking of doing this in fold_build_cond_expr which is only used by pass_ifcvt to generate cond_expr.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #30 from rguenther at suse dot de --- On Wed, 19 Jan 2022, crazylht at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 > > --- Comment #29 from Hongtao.liu --- > (In reply to Richard Biener from comment #28) > > (In reply to Hongtao.liu from comment #25) > > > in fold_unary_loc > > > ---cut from fold-const.cc- > > > 9276 else if (TREE_CODE (arg0) == COND_EXPR) > > > 9277{ > > > 9278 tree arg01 = TREE_OPERAND (arg0, 1); > > > 9279 tree arg02 = TREE_OPERAND (arg0, 2); > > > 9280 if (! VOID_TYPE_P (TREE_TYPE (arg01))) > > > 9281arg01 = fold_build1_loc (loc, code, type, > > > 9282 fold_convert_loc (loc, > > > 9283 TREE_TYPE (op0), > > > arg01)); > > > 9284 if (! VOID_TYPE_P (TREE_TYPE (arg02))) > > > 9285arg02 = fold_build1_loc (loc, code, type, > > > 9286 fold_convert_loc (loc, > > > 9287 TREE_TYPE (op0), > > > arg02)); > > > 9288=>tem = fold_build3_loc (loc, COND_EXPR, type, TREE_OPERAND > > > (arg0, 0), > > > 9289 arg01, arg02); > > > > > > ---end--- > > > > > > gcc always tries to simplify (convert (cond (cmp a b) c d) > (cond > > > (cmp > > > a b) (convert c) (convert d)), exactly the opposite of what this case > > > wants. > > > > It also then undos this if the result didn't simplify and plays trick to > > avoid > > recursions. > > > > I think this particular transform ought to be specialized, maybe to > > (T)p?(T')a:(T')b or maybe done during gimplification or RTL expansion only. > > > > The "cheap" way of avoiding a conflict is to wrap the match.pd pattern > > with opposite logic in > > > > #if GIMPLE > > #endif > > > It doesn't work, > > (with a comment explaining this) > > > > Note that we can move a conversion out only if the sources of the > > conversions > > have compatible types but we always can move a conversion in. > > > > Alternatively this transform can also be done in a vectorizer pattern based > > on vector compatibility of the ?: predicate with the data. > yes, I'm thinking of doing this in fold_build_cond_expr which is only used by > pass_ifcvt to generate cond_expr. That's also a reasonable place but the vectorizer pattern recog phase might have more contextual information to determine the best types to use. fold_build_cond_expr is probably easiest to adjust though.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #31 from CVS Commits --- The master branch has been updated by hongtao Liu : https://gcc.gnu.org/g:8bc700f4c3fbe405413db02281ef2918bfa831fc commit r12-6756-g8bc700f4c3fbe405413db02281ef2918bfa831fc Author: liuhongt Date: Mon Jan 17 10:47:46 2022 +0800 Enhance vec_pack_trunc for integral mode mask. For testcase in PR, the patch supports QI:4 -> HI:16 pack with multi steps(first pack QI:4 -> QI:8 through vec_pack_sbool_trunc_qi, then pack QI:8 -> HI:16 through vec_pack_trunc_hi). Similar for QI:2 -> HI:16 which is test4 in mask-pack-prefer-128.c. gcc/ChangeLog: PR target/103771 * tree-vect-stmts.cc (supportable_narrowing_operation): Enhance integral mode mask pack by multi steps which takes vec_pack_sbool_trunc_optab as start when elements number is less than BITS_PER_UNITS. gcc/testsuite/ChangeLog: * gcc.target/i386/mask-pack-prefer128.c: New test. * gcc.target/i386/mask-pack-prefer256.c: New test. * gcc.target/i386/pr103771.c: New test.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #32 from Hongtao.liu --- A patch is posted at https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589168.html
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 Andrew Pinski changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |pinskia at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #33 from Andrew Pinski --- Oh yes I remember why factor_out_conditional_conversion is not doing it. it is because we have two bb but the current code does not handle that case. That is we have: if (x.1_1 > 255) goto ; [INV] else goto ; [INV] : _2 = -x_5(D); _3 = _2 >> 31; iftmp.0_7 = (uint8_t) _3; goto ; [INV] : iftmp.0_6 = (uint8_t) x_5(D); : # iftmp.0_4 = PHI While the correct code only handles the case where the one of the bb4 does not exist. Yes I am working on this for GCC 13.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #34 from Jeffrey A. Law --- I've always wanted to see us be able to push something like those matched conversions down through the PHI. That would make the code look like: if (x.1_1 > 255) goto ; [INV] else goto ; [INV] : _2 = -x_5(D); _3 = _2 >> 31; goto ; [INV] : : # tmp = PHI <_3, x_5(4)> iftmp.0_4 = (uint8_t) tmp And presumably we'd clean up the empty bb4 which would in turn unblock other optimizations. Is that what you're working on?
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #35 from Andrew Pinski --- (In reply to Jeffrey A. Law from comment #34) > I've always wanted to see us be able to push something like those matched > conversions down through the PHI. > > Is that what you're working on? Yes.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #36 from CVS Commits --- The master branch has been updated by hongtao Liu : https://gcc.gnu.org/g:7e204bd2f189850cb940677c99d8d93eb7dd40cd commit r12-7216-g7e204bd2f189850cb940677c99d8d93eb7dd40cd Author: liuhongt Date: Mon Jan 24 11:05:47 2022 +0800 Add vect_recog_cond_expr_convert_pattern. The pattern converts (cond (cmp a b) (convert c) (convert d)) to (convert (cond (cmp a b) c d)) when 1) types_match (c, d) 2) single_use for (convert c) and (convert d) 3) TYPE_PRECISION (TREE_TYPE (c)) == TYPE_PRECISION (TREE_TYPE (a)) 4) INTEGERAL_TYPE_P (TREE_TYPE (c)) The pattern can save packing of mask and data(partial for data, 2 vs 1). gcc/ChangeLog: PR target/103771 * match.pd (cond_expr_convert_p): New match. * tree-vect-patterns.cc (gimple_cond_expr_convert_p): Declare. (vect_recog_cond_expr_convert_pattern): New. gcc/testsuite/ChangeLog: * gcc.target/i386/pr103771-2.c: New test. * gcc.target/i386/pr103771-3.c: New test.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #37 from Richard Biener --- Is this now "good enough" for GCC 12?
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #38 from Hongtao.liu --- (In reply to Richard Biener from comment #37) > Is this now "good enough" for GCC 12? Yes.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 Richard Biener changed: What|Removed |Added Priority|P3 |P2 Target Milestone|12.0|13.0 --- Comment #39 from Richard Biener --- Fixed for the target regression part, Andrew is working on a more generic fix for GCC 13.
[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771 --- Comment #40 from CVS Commits --- The master branch has been updated by hongtao Liu : https://gcc.gnu.org/g:754dce903ca28c4c2f2bc8614a8de5e631655f2e commit r12-7276-g754dce903ca28c4c2f2bc8614a8de5e631655f2e Author: liuhongt Date: Wed Feb 16 12:15:18 2022 +0800 Restrict the two sources of vect_recog_cond_expr_convert_pattern to be of the same type when convert is extension. It's not equal to transform (cond (cmp @1 @2) (convert@3 @4) (convert@5 @6)) to (convert (cmp @1 @2) (convert)@4 @6) when(convert@3 @4) is extension because it's zero_extend vs sign_extend. gcc/ChangeLog: PR tree-optimization/104551 PR tree-optimization/103771 * match.pd (cond_expr_convert_p): Add types_match check when convert is extension. * tree-vect-patterns.cc (gimple_cond_expr_convert_p): Adjust comments. (vect_recog_cond_expr_convert_pattern): Ditto. gcc/testsuite/ChangeLog: * gcc.target/i386/pr104551.c: New test.