Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes: > On Thu, 26 Oct 2023 at 09:43, Prathamesh Kulkarni > <prathamesh.kulka...@linaro.org> wrote: >> >> On Thu, 26 Oct 2023 at 04:09, Richard Sandiford >> <richard.sandif...@arm.com> wrote: >> > >> > Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes: >> > > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford >> > > <richard.sandif...@arm.com> wrote: >> > >> So I think the PR could be solved by something like the attached. >> > >> Do you agree? If so, could you base the patch on this instead? >> > >> >> > >> Only tested against the self-tests. >> > >> >> > >> Thanks, >> > >> Richard >> > >> >> > >> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc >> > >> index 40767736389..00fce4945a7 100644 >> > >> --- a/gcc/fold-const.cc >> > >> +++ b/gcc/fold-const.cc >> > >> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree >> > >> arg1, const vec_perm_indices &sel, >> > >> unsigned res_npatterns, res_nelts_per_pattern; >> > >> unsigned HOST_WIDE_INT res_nelts; >> > >> >> > >> - /* (1) If SEL is a suitable mask as determined by >> > >> - valid_mask_for_fold_vec_perm_cst_p, then: >> > >> - res_npatterns = max of npatterns between ARG0, ARG1, and SEL >> > >> - res_nelts_per_pattern = max of nelts_per_pattern between >> > >> - ARG0, ARG1 and SEL. >> > >> - (2) If SEL is not a suitable mask, and TYPE is VLS then: >> > >> - res_npatterns = nelts in result vector. >> > >> - res_nelts_per_pattern = 1. >> > >> - This exception is made so that VLS ARG0, ARG1 and SEL work as >> > >> before. */ >> > >> - if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason)) >> > >> - { >> > >> - res_npatterns >> > >> - = std::max (VECTOR_CST_NPATTERNS (arg0), >> > >> - std::max (VECTOR_CST_NPATTERNS (arg1), >> > >> - sel.encoding ().npatterns ())); >> > >> + /* First try to implement the fold in a VLA-friendly way. >> > >> + >> > >> + (1) If the selector is simply a duplication of N elements, the >> > >> + result is likewise a duplication of N elements. >> > >> + >> > >> + (2) If the selector is N elements followed by a duplication >> > >> + of N elements, the result is too. >> > >> >> > >> - res_nelts_per_pattern >> > >> - = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0), >> > >> - std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1), >> > >> - sel.encoding ().nelts_per_pattern ())); >> > >> + (3) If the selector is N elements followed by an interleaving >> > >> + of N linear series, the situation is more complex. >> > >> >> > >> + valid_mask_for_fold_vec_perm_cst_p detects whether we >> > >> + can handle this case. If we can, then each of the N linear >> > >> + series either (a) selects the same element each time or >> > >> + (b) selects a linear series from one of the input patterns. >> > >> + >> > >> + If (b) holds for one of the linear series, the result >> > >> + will contain a linear series, and so the result will have >> > >> + the same shape as the selector. If (a) holds for all of >> > >> + the lienar series, the result will be the same as (2) above. >> > >> + >> > >> + (b) can only hold if one of the inputs pattern has a >> > >> + stepped encoding. */ >> > >> + if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason)) >> > >> + { >> > >> + res_npatterns = sel.encoding ().npatterns (); >> > >> + res_nelts_per_pattern = sel.encoding ().nelts_per_pattern (); >> > >> + if (res_nelts_per_pattern == 3 >> > >> + && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3 >> > >> + && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3) >> > >> + res_nelts_per_pattern = 2; >> > > Um, in this case, should we set: >> > > res_nelts_per_pattern = max (nelts_per_pattern (arg0), >> > > nelts_per_pattern(arg1)) >> > > if both have nelts_per_pattern == 1 ? >> > >> > No, it still needs to be 2 even if arg0 and arg1 are duplicates. >> > E.g. consider a selector that picks the first element of arg0 >> > followed by a duplicate of the first element of arg1. >> > >> > > Also I suppose this matters only for non-integral element type, since >> > > for integral element type, >> > > vector_cst_elt will return the correct value even if the element is >> > > not explicitly encoded and input vector is dup ? >> > >> > Yeah, but it might help even for integers. If we build fewer >> > elements explicitly, and so read fewer implicitly-encoded inputs, >> > there's less risk of running into: >> > >> > if (!can_div_trunc_p (sel[i], len, &q, &r)) >> > { >> > if (reason) >> > *reason = "cannot divide selector element by arg len"; >> > return NULL_TREE; >> > } >> Ah right, thanks for the clarification! >> I am currently away on vacation and will return next Thursday, and >> will post a follow up patch based on your patch. >> Sorry for the delay. > Hi, > Sorry for slow response, I have rebased your patch and added couple of tests. > The attached patch resulted in fallout for aarch64/sve/slp_3.c and > aarch64/sve/slp_4.c. > > Specifically for slp_3.c, we didn't fold following case: > arg0, arg1 are dup vectors. > sel = { 0, len, 1, len + 1, 2, len + 2, ... } // (npatterns = 2, > nelts_per_pattern = 3) > because res_nelts_per_pattern was set to 3, and upon encountering 2, > fold_vec_perm_cst returned false. > > With patch, we set res_nelts_per_pattern = 2 (since input vectors are > dup), and thus gets folded to: > res = { arg0[0], arg1[0], ... } // (2, 1) > > Which results in using ldrqd for loading the result instead of doing > the permutation at runtime with mov and zip1. > I have adjusted the tests for new code-gen. > Does it look OK ? > > There's also this strange failure observed on x86_64, as well as on aarch64: > New tests that FAIL (1 tests): > libitm.c++/dropref.C -B > /home/prathamesh.kulkarni/gnu-toolchain/gcc/gnu-964-5/bootstrap-build-after/aarch64-unknown-linux-gnu/./libitm/../libstdc++-v3/src/.libs > execution test > > Looking at dropref.C: > /* { dg-xfail-run-if "unsupported" { *-*-* } } */ > #include <libitm.h> > > char *pp; > > int main() > { > __transaction_atomic { > _ITM_dropReferences (pp, 555); > } > return 0; > } > > doesn't seem relevant to VEC_PERM_EXPR folding ? > The patch otherwise passes bootstrap+test on aarch64-linux-gnu with > and without SVE, and on x86_64-linux-gnu. > > Thanks, > Prathamesh >> >> Thanks, >> Prathamesh >> > >> > Thanks, >> > Richard > > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc > index 40767736389..75410869796 100644 > --- a/gcc/fold-const.cc > +++ b/gcc/fold-const.cc > @@ -10743,27 +10743,38 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, > const vec_perm_indices &sel, > unsigned res_npatterns, res_nelts_per_pattern; > unsigned HOST_WIDE_INT res_nelts; > > - /* (1) If SEL is a suitable mask as determined by > - valid_mask_for_fold_vec_perm_cst_p, then: > - res_npatterns = max of npatterns between ARG0, ARG1, and SEL > - res_nelts_per_pattern = max of nelts_per_pattern between > - ARG0, ARG1 and SEL. > - (2) If SEL is not a suitable mask, and TYPE is VLS then: > - res_npatterns = nelts in result vector. > - res_nelts_per_pattern = 1. > - This exception is made so that VLS ARG0, ARG1 and SEL work as before. > */ > - if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason)) > - { > - res_npatterns > - = std::max (VECTOR_CST_NPATTERNS (arg0), > - std::max (VECTOR_CST_NPATTERNS (arg1), > - sel.encoding ().npatterns ())); > + /* First try to implement the fold in a VLA-friendly way. > + > + (1) If the selector is simply a duplication of N elements, the > + result is likewise a duplication of N elements. > + > + (2) If the selector is N elements followed by a duplication > + of N elements, the result is too. > + > + (3) If the selector is N elements followed by an interleaving > + of N linear series, the situation is more complex. > + > + valid_mask_for_fold_vec_perm_cst_p detects whether we > + can handle this case. If we can, then each of the N linear > + series either (a) selects the same element each time or > + (b) selects a linear series from one of the input patterns. > + > + If (b) holds for one of the linear series, the result > + will contain a linear series, and so the result will have > + the same shape as the selector. If (a) holds for all of > + the lienar series, the result will be the same as (2) above.
my typo: linear > > - res_nelts_per_pattern > - = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0), > - std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1), > - sel.encoding ().nelts_per_pattern ())); > + (b) can only hold if one of the input patterns has a > + stepped encoding. */ > > + if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason)) > + { > + res_npatterns = sel.encoding ().npatterns (); > + res_nelts_per_pattern = sel.encoding ().nelts_per_pattern (); > + if (res_nelts_per_pattern == 3 > + && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3 > + && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3) > + res_nelts_per_pattern = 2; > res_nelts = res_npatterns * res_nelts_per_pattern; > } > else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts)) > @@ -17562,6 +17573,29 @@ test_nunits_min_2 (machine_mode vmode) > tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) }; > validate_res (1, 3, res, expected_res); > } > + > + /* Case 8: Same as aarch64/sve/slp_3.c: > + arg0, arg1 are dup vectors. > + sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3) > + So res = { arg0[0], arg1[0], ... } // (2, 1) > + > + In this case, since the input vectors are dup, only the first two > + elements per pattern in sel are considered significant. */ > + { > + tree arg0 = build_vec_cst_rand (vmode, 1, 1); > + tree arg1 = build_vec_cst_rand (vmode, 1, 1); > + poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)); > + > + vec_perm_builder builder (len, 2, 3); > + poly_uint64 mask_elems[] = { 0, len, 1, len + 1, 2, len + 2 }; > + builder_push_elems (builder, mask_elems); > + > + vec_perm_indices sel (builder, 2, len); > + tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel); > + > + tree expected_res[] = { ARG0(0), ARG1(0) }; > + validate_res (2, 1, res, expected_res); > + } > } > } > > @@ -17730,6 +17764,45 @@ test_nunits_min_4 (machine_mode vmode) > ASSERT_TRUE (res == NULL_TREE); > ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns")); > } > + > + /* Case 8: PR111754: When input vector is not a stepped sequence, > + check that the result is not a stepped sequence either, even > + if sel has a stepped sequence. */ > + { > + tree arg0 = build_vec_cst_rand (vmode, 1, 2); > + tree arg1 = build_vec_cst_rand (vmode, 1, 2); > + poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)); > + > + vec_perm_builder builder (len, 1, 3); > + poly_uint64 mask_elems[] = { 0, 1, 2 }; > + builder_push_elems (builder, mask_elems); > + > + vec_perm_indices sel (builder, 2, len); > + tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel); > + > + tree expected_res[] = { ARG0(0), ARG0(1) }; > + validate_res (sel.encoding ().npatterns (), 2, res, expected_res); The test is OK, but I think it's worth noting that the fold_vec_perm_cst arguments aren't canonical. Since sel selects only from the first input, the canonical form would be: tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg0, sel); So OK with a comment, but also OK with the line above instead (and no arg1). > + } > + > + /* Case 9: If sel doesn't contain a stepped sequence, > + check that the result has same encoding as sel, irrespective > + of shape of input vectors. */ > + { > + tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1); > + tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1); > + poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)); > + > + vec_perm_builder builder (len, 1, 2); > + poly_uint64 mask_elems[] = { 0, len }; > + builder_push_elems (builder, mask_elems); > + > + vec_perm_indices sel (builder, 2, len); > + tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel); > + > + tree expected_res[] = { ARG0(0), ARG1(0) }; > + validate_res (sel.encoding ().npatterns (), > + sel.encoding ().nelts_per_pattern (), res, expected_res); > + } > } > } > > diff --git a/gcc/testsuite/gcc.dg/vect/pr111754.c > b/gcc/testsuite/gcc.dg/vect/pr111754.c > new file mode 100644 > index 00000000000..7c1c16875c7 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/pr111754.c > @@ -0,0 +1,13 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-optimized" } */ > + > +typedef float __attribute__((__vector_size__ (16))) F; > + > +F foo (F a, F b) > +{ > + F v = (F) { 9 }; > + return __builtin_shufflevector (v, v, 1, 0, 1, 2); > +} > + > +/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */ > +/* { dg-final { scan-tree-dump "return \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" > "optimized" } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c > b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c > index 82dd43a4d98..cb649bc1aa9 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c > @@ -33,21 +33,15 @@ TEST_ALL (VEC_PERM) > > /* 1 for each 8-bit type. */ > /* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */ > -/* 1 for each 16-bit type plus 1 for double. */ > -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 } } */ > +/* 1 for each 16-bit type */ > +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */ > /* 1 for each 32-bit type. */ > /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #25\n} 2 } } */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #31\n} 2 } } */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #62\n} 2 } } */ Let's replace the deleted lines with: /* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 6 } } */ > -/* 3 for double. */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 3 } } */ > /* The 64-bit types need: > > ZIP1 ZIP1 (2 ZIP2s optimized away) This line should be deleted, now that the ZIP1s are gone. > ZIP1 ZIP2. */ > -/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, > z[0-9]+\.d\n} 9 } } */ > +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, > z[0-9]+\.d\n} 3 } } */ > /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, > z[0-9]+\.d\n} 3 } } */ > > /* The loop should be fully-masked. The 64-bit types need two loads > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c > b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c > index b1fa5e3cf68..ce940a28597 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c > @@ -35,20 +35,10 @@ vec_slp_##TYPE (TYPE *restrict a, int n) > \ > > TEST_ALL (VEC_PERM) > > -/* 1 for each 8-bit type, 4 for each 32-bit type and 4 for double. */ > -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 18 } } */ > +/* 1 for each 8-bit type */ > +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 2 } } */ > /* 1 for each 16-bit type. */ > /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 } } */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #11\n} 2 } } */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #17\n} 2 } } */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #80\n} 2 } } */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #63\n} 2 } } */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #37\n} 2 } } */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #24\n} 2 } } */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #81\n} 2 } } */ > -/* 4 for double. */ > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 4 } } */ Similarly here: /* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 18 } } */ > /* The 32-bit types need: > > ZIP1 ZIP1 (2 ZIP2s optimized away) This line should be deleted. > @@ -59,7 +49,7 @@ TEST_ALL (VEC_PERM) > ZIP1 ZIP1 ZIP1 ZIP1 (4 ZIP2s optimized away) Same here. OK with those changes, and sorry for the slow review. Thanks, Richard > ZIP1 ZIP2 ZIP1 ZIP2 > ZIP1 ZIP2 ZIP1 ZIP2. */ > -/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, > z[0-9]+\.d\n} 33 } } */ > +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, > z[0-9]+\.d\n} 15 } } */ > /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, > z[0-9]+\.d\n} 15 } } */ > > /* The loop should be fully-masked. The 32-bit types need two loads