[Bug c/111794] RISC-V: Missed SLP optimization due to mask mode precision
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111794

--- Comment #10 from Robin Dapp ---
From what I can tell with my barely working connection, there are no regressions on x86, aarch64, or power10 with the adjusted check.
--- Comment #9 from Robin Dapp ---
Yes, that's from pattern recog:

  slp.c:11:20: note: === vect_pattern_recog ===
  slp.c:11:20: note: vect_recog_mask_conversion_pattern: detected: _5 = _2 & _4;
  slp.c:11:20: note: mask_conversion pattern recognized: patt_157 = patt_156 & _4;
  slp.c:11:20: note: extra pattern stmt: patt_156 = () _2;
  slp.c:11:20: note: vect_recog_bool_pattern: detected: _6 = (int) _5;
  slp.c:11:20: note: bool pattern recognized: patt_159 = (int) patt_158;
  slp.c:11:20: note: extra pattern stmt: patt_158 = _5 ? 1 : 0;
  slp.c:11:20: note: vect_recog_mask_conversion_pattern: detected: _11 = _8 & _10;
  slp.c:11:20: note: mask_conversion pattern recognized: patt_161 = patt_160 & _10;
  slp.c:11:20: note: extra pattern stmt: patt_160 = () _8;
  ...

In vect_recog_mask_conversion_pattern we arrive at

  if (TYPE_PRECISION (rhs1_type) < TYPE_PRECISION (rhs2_type))
    {
      vectype1 = get_mask_type_for_scalar_type (vinfo, rhs1_type);
      if (!vectype1)
        return NULL;
      rhs2 = build_mask_conversion (vinfo, rhs2, vectype1, stmt_vinfo);
    }
  else
    {
      vectype1 = get_mask_type_for_scalar_type (vinfo, rhs2_type);
      if (!vectype1)
        return NULL;
      rhs1 = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo);
    }

  lhs = vect_recog_temp_ssa_var (TREE_TYPE (lhs), NULL);
  pattern_stmt = gimple_build_assign (lhs, rhs_code, rhs1, rhs2);

vectype1 is then e.g. vector([8,8]). Then vect_recog_bool_pattern creates the COND_EXPR. Testsuites are running with your proposed change.
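The asymmetric if/else above boils down to: whichever AND operand has the lower precision determines the common mask type, and the other operand receives a mask conversion to it. A minimal sketch of that choice (the `mask_type` struct and function name here are illustrative, not GCC's actual API):

```c
#include <assert.h>

/* Illustrative model of the choice vect_recog_mask_conversion_pattern
   makes above: pick the lower-precision operand's mask type as the
   common type; the other operand gets a build_mask_conversion to it.  */
typedef struct { int precision; } mask_type;

static mask_type
common_mask_type (mask_type rhs1_type, mask_type rhs2_type)
{
  if (rhs1_type.precision < rhs2_type.precision)
    return rhs1_type;   /* rhs2 is converted to rhs1's mask type.  */
  else
    return rhs2_type;   /* rhs1 is converted to rhs2's mask type.  */
}
```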
--- Comment #8 from rguenther at suse dot de ---
On Mon, 16 Oct 2023, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111794
>
> --- Comment #7 from Robin Dapp ---
>   vectp.4_188 = x_50(D);
>   vect__1.5_189 = MEM [(int *)vectp.4_188];
>   mask__2.6_190 = { 1, 1, 1, 1, 1, 1, 1, 1 } == vect__1.5_189;
>   mask_patt_156.7_191 = VIEW_CONVERT_EXPR >(mask__2.6_190);
>   _1 = *x_50(D);
>   _2 = _1 == 1;
>   vectp.9_192 = y_51(D);
>   vect__3.10_193 = MEM [(short int *)vectp.9_192];
>   mask__4.11_194 = { 2, 2, 2, 2, 2, 2, 2, 2 } == vect__3.10_193;
>   mask_patt_157.12_195 = mask_patt_156.7_191 & mask__4.11_194;
>   vect_patt_158.13_196 = VEC_COND_EXPR 1, 1, 1 }, { 0, 0, 0, 0, 0, 0, 0, 0 }>;
>   vect_patt_159.14_197 = (vector(8) int) vect_patt_158.13_196;
>
> This yields the following assembly:
>
>         vsetivli        zero,8,e32,m2,ta,ma
>         vle32.v v2,0(a0)
>         vmv.v.i v4,1
>         vle16.v v1,0(a1)
>         vmseq.vv        v0,v2,v4
>         vsetvli zero,zero,e16,m1,ta,ma
>         vmseq.vi        v1,v1,2
>         vsetvli zero,zero,e32,m2,ta,ma
>         vmv.v.i v2,0
>         vmand.mm        v0,v0,v1
>         vmerge.vvm      v2,v2,v4,v0
>         vse32.v v2,0(a0)
>
> Apart from CSE'ing v4 this looks pretty good to me. My connection is
> really poor at the moment so I cannot quickly compare what aarch64 does
> for that example.

That looks reasonable. Note this then goes through vectorizable_assignment as a no-op move. The question is whether we can arrive here with signed bool : 2 vs. _Bool : 2 somehow (I wonder how we arrive with signed bool : 1 here - that's from pattern recog, right? why didn't that produce a COND_EXPR for this?).

I think for more thorough testing the condition should change to

          /* But a conversion that does not change the bit-pattern is ok.  */
          && !(INTEGRAL_TYPE_P (TREE_TYPE (scalar_dest))
               && INTEGRAL_TYPE_P (TREE_TYPE (op))
               && (((TYPE_PRECISION (TREE_TYPE (scalar_dest))
                     > TYPE_PRECISION (TREE_TYPE (op)))
                    && TYPE_UNSIGNED (TREE_TYPE (op)))
                   || (TYPE_PRECISION (TREE_TYPE (scalar_dest))
                       == TYPE_PRECISION (TREE_TYPE (op)))))

rather than just doing >=, which would be odd (why allow skipping sign-extending from the unsigned MSB but not allow skipping zero-extending from it?).
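The proposed condition can be restated as a small predicate. The sketch below is an illustrative model, not GCC code; it captures when a bit-precision conversion is treated as leaving the bit-pattern intact: either it widens from an unsigned source (zero-extends into the padding) or it keeps the precision unchanged.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the relaxed check proposed above: a conversion
   between bit-precision integral types preserves the bit-pattern when it
   widens from an unsigned source, or when source and destination have
   the same precision.  */
static bool
conversion_keeps_bit_pattern (int dest_prec, int src_prec, bool src_unsigned)
{
  return (dest_prec > src_prec && src_unsigned)
         || dest_prec == src_prec;
}
```

In particular, equal precisions are now accepted regardless of signedness, which is what the `_Bool` to `signed-boolean:1` no-op move in this PR needs.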
--- Comment #7 from Robin Dapp ---

  vectp.4_188 = x_50(D);
  vect__1.5_189 = MEM [(int *)vectp.4_188];
  mask__2.6_190 = { 1, 1, 1, 1, 1, 1, 1, 1 } == vect__1.5_189;
  mask_patt_156.7_191 = VIEW_CONVERT_EXPR>(mask__2.6_190);
  _1 = *x_50(D);
  _2 = _1 == 1;
  vectp.9_192 = y_51(D);
  vect__3.10_193 = MEM [(short int *)vectp.9_192];
  mask__4.11_194 = { 2, 2, 2, 2, 2, 2, 2, 2 } == vect__3.10_193;
  mask_patt_157.12_195 = mask_patt_156.7_191 & mask__4.11_194;
  vect_patt_158.13_196 = VEC_COND_EXPR ;
  vect_patt_159.14_197 = (vector(8) int) vect_patt_158.13_196;

This yields the following assembly:

        vsetivli        zero,8,e32,m2,ta,ma
        vle32.v v2,0(a0)
        vmv.v.i v4,1
        vle16.v v1,0(a1)
        vmseq.vv        v0,v2,v4
        vsetvli zero,zero,e16,m1,ta,ma
        vmseq.vi        v1,v1,2
        vsetvli zero,zero,e32,m2,ta,ma
        vmv.v.i v2,0
        vmand.mm        v0,v0,v1
        vmerge.vvm      v2,v2,v4,v0
        vse32.v v2,0(a0)

Apart from CSE'ing v4 this looks pretty good to me. My connection is really poor at the moment, so I cannot quickly compare what aarch64 does for that example.
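For reference, the dump corresponds to a computation of roughly the following scalar shape (the original testcase is not quoted in this thread, so the function name and fixed trip count of 8 are assumptions inferred from the 8-element vectors above): each lane stores (x[i] == 1) & (y[i] == 2) back into x[i], which the basic-block SLP vectorizer turns into the masked compare/merge sequence shown.

```c
/* Hypothetical scalar reconstruction of the vectorized computation:
   an int compare and a short compare are ANDed and the 0/1 result is
   stored back as an int.  */
static void
foo (int *x, short *y)
{
  for (int i = 0; i < 8; i++)
    x[i] = (x[i] == 1) & (y[i] == 2);
}
```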
--- Comment #6 from rguenther at suse dot de ---
On Mon, 16 Oct 2023, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111794
>
> --- Comment #5 from Robin Dapp ---
> Disregarding the reasons for the precision adjustment, for this case here,
> we seem to fail at:
>
>   /* We do not handle bit-precision changes.  */
>   if ((CONVERT_EXPR_CODE_P (code)
>        || code == VIEW_CONVERT_EXPR)
>       && ((INTEGRAL_TYPE_P (TREE_TYPE (scalar_dest))
>            && !type_has_mode_precision_p (TREE_TYPE (scalar_dest)))
>           || (INTEGRAL_TYPE_P (TREE_TYPE (op))
>               && !type_has_mode_precision_p (TREE_TYPE (op))))
>       /* But a conversion that does not change the bit-pattern is ok.  */
>       && !(INTEGRAL_TYPE_P (TREE_TYPE (scalar_dest))
>            && INTEGRAL_TYPE_P (TREE_TYPE (op))
>            && (TYPE_PRECISION (TREE_TYPE (scalar_dest))
>                > TYPE_PRECISION (TREE_TYPE (op)))
>            && TYPE_UNSIGNED (TREE_TYPE (op))))
>     {
>       if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "type conversion to/from bit-precision "
>                          "unsupported.\n");
>       return false;
>     }
>
> for the expression
>
>   patt_156 = () _2;
>
> where _2 (op) is of type _Bool (i.e. TYPE_MODE QImode) and patt_156
> (scalar_dest) is signed-boolean:1. In that case the mode's precision (8)
> does not match the type's precision (1) for both op and scalar_dest.
>
> The second part of the condition I don't fully get. When does a
> conversion change the bit pattern? When the source has higher precision
> than the dest we would need to truncate, which we probably don't want.
> When the dest has higher precision that's considered ok? What about
> equality?
>
> If both op and dest have precision 1 the padding could differ (or rather
> the 1 could be at different positions) but do we even support that? In
> other words, could we relax the condition to
> TYPE_PRECISION (TREE_TYPE (scalar_dest)) >= TYPE_PRECISION (TREE_TYPE (op))
> (>= instead of >)?
>
> FWIW bootstrap and testsuite are unchanged with >= instead of > on x86,
> aarch64 and power10, but we might not have a proper test for that?

It's about sign- vs. zero-extending into padding. What kind of code does the vectorizer emit?
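The sign- vs. zero-extension concern can be seen in plain C with 1-bit bit-fields (an illustrative analogy, not the vectorizer's representation): the same single stored bit reads back as a different wide value depending on whether the field is signed or unsigned, which is exactly the kind of bit-pattern difference the condition guards against.

```c
/* A 1-bit signed field whose bit is set reads back as -1
   (sign-extended into the container), while the identical
   bit-pattern in a 1-bit unsigned field reads back as 1
   (zero-extended).  */
struct one_bit
{
  signed int s : 1;
  unsigned int u : 1;
};

static int
read_signed_bit (void)
{
  struct one_bit b = { 0, 0 };
  b.s = -1;             /* stored bit-pattern: 1 */
  return b.s;           /* sign-extended on read  */
}

static int
read_unsigned_bit (void)
{
  struct one_bit b = { 0, 0 };
  b.u = 1;              /* same stored bit-pattern: 1 */
  return b.u;           /* zero-extended on read       */
}
```

Widening from an unsigned source always zero-extends, so the bit-pattern of the low bits is preserved; widening from a signed source may smear the sign bit across the new bits, hence the TYPE_UNSIGNED check.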
--- Comment #5 from Robin Dapp ---
Disregarding the reasons for the precision adjustment, for this case here, we seem to fail at:

  /* We do not handle bit-precision changes.  */
  if ((CONVERT_EXPR_CODE_P (code)
       || code == VIEW_CONVERT_EXPR)
      && ((INTEGRAL_TYPE_P (TREE_TYPE (scalar_dest))
           && !type_has_mode_precision_p (TREE_TYPE (scalar_dest)))
          || (INTEGRAL_TYPE_P (TREE_TYPE (op))
              && !type_has_mode_precision_p (TREE_TYPE (op))))
      /* But a conversion that does not change the bit-pattern is ok.  */
      && !(INTEGRAL_TYPE_P (TREE_TYPE (scalar_dest))
           && INTEGRAL_TYPE_P (TREE_TYPE (op))
           && (TYPE_PRECISION (TREE_TYPE (scalar_dest))
               > TYPE_PRECISION (TREE_TYPE (op)))
           && TYPE_UNSIGNED (TREE_TYPE (op))))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "type conversion to/from bit-precision "
                         "unsupported.\n");
      return false;
    }

for the expression

  patt_156 = () _2;

where _2 (op) is of type _Bool (i.e. TYPE_MODE QImode) and patt_156 (scalar_dest) is signed-boolean:1. In that case the mode's precision (8) does not match the type's precision (1) for both op and scalar_dest.

The second part of the condition I don't fully get. When does a conversion change the bit pattern? When the source has higher precision than the dest, we would need to truncate, which we probably don't want. When the dest has higher precision, that's considered ok? What about equality?

If both op and dest have precision 1, the padding could differ (or rather the 1 could be at different positions), but do we even support that? In other words, could we relax the condition to TYPE_PRECISION (TREE_TYPE (scalar_dest)) >= TYPE_PRECISION (TREE_TYPE (op)) (>= instead of >)?

FWIW, bootstrap and testsuite are unchanged with >= instead of > on x86, aarch64 and power10, but we might not have a proper test for that?
--- Comment #4 from Robin Dapp ---
Just to mention here as well: as this seems to be instance++ of the adjust_precision thing coming back to bite us, I'm going to go back and check whether the issue it was introduced for (DCE?) cannot be solved differently. I'd rather have us not deviate from other backends at such a central part as mode precisions.
Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #3 from Richard Biener ---
I'm failing to see the issue, as with -march=rv64gcv I run into

  t.c:4:8: missed: not vectorized: relevant stmt not supported: *x_50(D) = _6;
  t.c:4:8: note: removing SLP instance operations starting from: *x_50(D) = _6;
  t.c:4:8: missed: not vectorized: bad operation in basic block.

but just guessing, the issue is bool pattern recognition and

  t.c:12:1: note: using normal nonmask vectors for _2 = _1 == 1;
  t.c:12:1: note: using normal nonmask vectors for _4 = _3 == 2;
  t.c:12:1: note: using normal nonmask vectors for _5 = _2 & _4;
  ...

? To vectorize you'd want to see

  t.c:12:1: note: using boolean precision 32 for _2 = _1 == 1;
  t.c:12:1: note: using boolean precision 16 for _4 = _3 == 2;
  t.c:12:1: note: using boolean precision 16 for _5 = _2 & _4;
  ...

and a pattern used for the value use:

  t.c:12:1: note: extra pattern stmt: patt_62 = _5 ? 1 : 0;

You need to see why this doesn't work (it's a very delicate area).
--- Comment #2 from JuzheZhong ---
Note that the reason we adjust the mask mode precision here is the DSE bug for "small mask mode":

https://github.com/gcc-mirror/gcc/commit/247cacc9e381d666a492dfa4ed61b7b19e2d008f

This is the commit that shows why we adjust the precision.
--- Comment #1 from JuzheZhong ---
This is a RISC-V target-specific issue. ARM SVE can vectorize it.