On Wed, 25 Oct 2023 at 02:58, Richard Sandiford <richard.sandif...@arm.com> wrote: > > Hi, > > Sorry the slow review. I clearly didn't think this through properly > when doing the review of the original patch, so I wanted to spend > some time working on the code to get a better understanding of > the problem. > > Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes: > > Hi, > > For the following test-case: > > > > typedef float __attribute__((__vector_size__ (16))) F; > > F foo (F a, F b) > > { > > F v = (F) { 9 }; > > return __builtin_shufflevector (v, v, 1, 0, 1, 2); > > } > > > > Compiling with -O2 results in following ICE: > > foo.c: In function ‘foo’: > > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314 > > 6 | return __builtin_shufflevector (v, v, 1, 0, 1, 2); > > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > 0x7f3185 wi::int_traits<std::pair<rtx_def*, machine_mode> > >>::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode> > > const&) > > ../../gcc/gcc/rtl.h:2314 > > 0x7f3185 wide_int_ref_storage<false, > > false>::wide_int_ref_storage<std::pair<rtx_def*, machine_mode> > >>(std::pair<rtx_def*, machine_mode> const&) > > ../../gcc/gcc/wide-int.h:1089 > > 0x7f3185 generic_wide_int<wide_int_ref_storage<false, false> > >>::generic_wide_int<std::pair<rtx_def*, machine_mode> > >>(std::pair<rtx_def*, machine_mode> const&) > > ../../gcc/gcc/wide-int.h:847 > > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false, > > false> > >::poly_int<std::pair<rtx_def*, machine_mode> > >>(poly_int_full, std::pair<rtx_def*, machine_mode> const&) > > ../../gcc/gcc/poly-int.h:467 > > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false, > > false> > >::poly_int<std::pair<rtx_def*, machine_mode> > >>(std::pair<rtx_def*, machine_mode> const&) > > ../../gcc/gcc/poly-int.h:453 > > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode) > > ../../gcc/gcc/rtl.h:2383 > > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const > > ../../gcc/gcc/rtx-vector-builder.h:122 > > 0xfd4e1b vector_builder<rtx_def*, machine_mode, > > rtx_vector_builder>::elt(unsigned int) const > > ../../gcc/gcc/vector-builder.h:253 > > 0xfd4d11 rtx_vector_builder::build() > > ../../gcc/gcc/rtx-vector-builder.cc:73 > > 0xc21d9c const_vector_from_tree > > ../../gcc/gcc/expr.cc:13487 > > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > > expand_modifier, rtx_def**, bool) > > ../../gcc/gcc/expr.cc:11059 > > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier) > > ../../gcc/gcc/expr.h:310 > > 0xaee682 expand_return > > ../../gcc/gcc/cfgexpand.cc:3809 > > 0xaee682 expand_gimple_stmt_1 > > ../../gcc/gcc/cfgexpand.cc:3918 > > 0xaee682 expand_gimple_stmt > > ../../gcc/gcc/cfgexpand.cc:4044 > > 0xaf28f0 expand_gimple_basic_block > > ../../gcc/gcc/cfgexpand.cc:6100 > > 0xaf4996 execute > > ../../gcc/gcc/cfgexpand.cc:6835 > > > > IIUC, the issue is that fold_vec_perm returns a vector having float element > > type with res_nelts_per_pattern == 3, and later ICE's when it tries > > to derive element v[3], not present in the encoding, while trying to > > build rtx vector > > in rtx_vector_builder::build(): > > for (unsigned int i = 0; i < nelts; ++i) > > RTVEC_ELT (v, i) = elt (i); > > > > The attached patch tries to fix this by returning false from > > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and > > input vector has non-integral element type, so for VLA vectors, it > > will only build result with dup sequence (nelts_per_pattern < 3) for > > non-integral element type. > > > > For VLS vectors, this will still work for stepped sequence since it > > will then use the "VLS exception" in fold_vec_perm_cst, and set: > > res_npattern = res_nelts and > > res_nelts_per_pattern = 1 > > > > and fold the above case to: > > F foo (F a, F b) > > { > > <bb 2> [local count: 1073741824]: > > return { 0.0, 9.0e+0, 0.0, 0.0 }; > > } > > > > But I am not sure if this is entirely correct, since: > > tree res = out_elts.build (); > > will canonicalize the encoding and may result in a stepped sequence > > (vector_builder::finalize() may reduce npatterns at the cost of increasing > > nelts_per_pattern) ? > > > > PS: This issue is now latent after PR111648 fix, since > > valid_mask_for_fold_vec_perm_cst with sel = {1, 0, 1, ...} returns > > false because the corresponding pattern in arg0 is not a natural > > stepped sequence, and folds correctly using VLS exception. However, I > > guess the underlying issue of dealing with non-integral element types > > in fold_vec_perm_cst still remains ? > > > > The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu, > > and on x86_64-linux-gnu. > > I think the problem is instead in the way that we're calculating > res_npatterns and res_nelts_per_pattern. > > If the selector is a duplication of { a1, ..., an }, then the > result will be a duplication of n elements, regardless of the shape > of the other arguments. > > Similarly, if the selector is { a1, ...., an } followed by a > duplication of { b1, ..., bn }, the result be n elements followed > by a duplication of n elements, regardless of the shape of the other > arguments. > > So for these two cases, res_npatterns and res_nelts_per_pattern > can come directly from the selector's encoding. > > If: > > (1) the selector is an n-pattern stepped sequence > (2) the stepped part of each pattern selects from the same input pattern > (3) the stepped part of each pattern does not select the first element > of the input pattern, or the full input pattern is stepped > (your previous patch) > > then the result is stepped only if one of the inputs is stepped. > This is because, if an input pattern has 1 or 2 elements, (3) means > that each element of the stepped sequence will select the same value, > as if the selector step had been 0. Hi Richard, Thanks for the suggestions! I agree when selector is dup of {a1, ... an, ...} or base elements followed up dup {a1, .. an, b1, ... bn, ...} in that case we can set res_nelts_per_pattern from selector's encoding. However even if we provide more nelts_per_pattern that necessary, I guess vector_builder::finalize() will canonicalize it to the correct encoding for result ? > > So I think the PR could be solved by something like the attached. > Do you agree? If so, could you base the patch on this instead? > > Only tested against the self-tests. > > Thanks, > Richard > > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc > index 40767736389..00fce4945a7 100644 > --- a/gcc/fold-const.cc > +++ b/gcc/fold-const.cc > @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, > const vec_perm_indices &sel, > unsigned res_npatterns, res_nelts_per_pattern; > unsigned HOST_WIDE_INT res_nelts; > > - /* (1) If SEL is a suitable mask as determined by > - valid_mask_for_fold_vec_perm_cst_p, then: > - res_npatterns = max of npatterns between ARG0, ARG1, and SEL > - res_nelts_per_pattern = max of nelts_per_pattern between > - ARG0, ARG1 and SEL. > - (2) If SEL is not a suitable mask, and TYPE is VLS then: > - res_npatterns = nelts in result vector. > - res_nelts_per_pattern = 1. > - This exception is made so that VLS ARG0, ARG1 and SEL work as before. > */ > - if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason)) > - { > - res_npatterns > - = std::max (VECTOR_CST_NPATTERNS (arg0), > - std::max (VECTOR_CST_NPATTERNS (arg1), > - sel.encoding ().npatterns ())); > + /* First try to implement the fold in a VLA-friendly way. > + > + (1) If the selector is simply a duplication of N elements, the > + result is likewise a duplication of N elements. > + > + (2) If the selector is N elements followed by a duplication > + of N elements, the result is too. > > - res_nelts_per_pattern > - = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0), > - std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1), > - sel.encoding ().nelts_per_pattern ())); > + (3) If the selector is N elements followed by an interleaving > + of N linear series, the situation is more complex. > > + valid_mask_for_fold_vec_perm_cst_p detects whether we > + can handle this case. If we can, then each of the N linear > + series either (a) selects the same element each time or > + (b) selects a linear series from one of the input patterns. > + > + If (b) holds for one of the linear series, the result > + will contain a linear series, and so the result will have > + the same shape as the selector. If (a) holds for all of > + the lienar series, the result will be the same as (2) above. > + > + (b) can only hold if one of the inputs pattern has a > + stepped encoding. */ > + if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason)) > + { > + res_npatterns = sel.encoding ().npatterns (); > + res_nelts_per_pattern = sel.encoding ().nelts_per_pattern (); > + if (res_nelts_per_pattern == 3 > + && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3 > + && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3) > + res_nelts_per_pattern = 2; Um, in this case, should we set: res_nelts_per_pattern = max (nelts_per_pattern (arg0), nelts_per_pattern(arg1)) if both have nelts_per_pattern == 1 ?
Also I suppose this matters only for non-integral element type, since for integral element type, vector_cst_elt will return the correct value even if the element is not explicitly encoded and input vector is dup ? Thanks, Prathamesh > res_nelts = res_npatterns * res_nelts_per_pattern; > } > else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts)) > -- > 2.25.1 >