On Thu, 26 Oct 2023 at 09:43, Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote: > > On Thu, 26 Oct 2023 at 04:09, Richard Sandiford > <richard.sandif...@arm.com> wrote: > > > > Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes: > > > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford > > > <richard.sandif...@arm.com> wrote: > > >> > > >> Hi, > > >> > > >> Sorry the slow review. I clearly didn't think this through properly > > >> when doing the review of the original patch, so I wanted to spend > > >> some time working on the code to get a better understanding of > > >> the problem. > > >> > > >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes: > > >> > Hi, > > >> > For the following test-case: > > >> > > > >> > typedef float __attribute__((__vector_size__ (16))) F; > > >> > F foo (F a, F b) > > >> > { > > >> > F v = (F) { 9 }; > > >> > return __builtin_shufflevector (v, v, 1, 0, 1, 2); > > >> > } > > >> > > > >> > Compiling with -O2 results in following ICE: > > >> > foo.c: In function ‘foo’: > > >> > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314 > > >> > 6 | return __builtin_shufflevector (v, v, 1, 0, 1, 2); > > >> > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > >> > 0x7f3185 wi::int_traits<std::pair<rtx_def*, machine_mode> > > >> >>::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode> > > >> > const&) > > >> > ../../gcc/gcc/rtl.h:2314 > > >> > 0x7f3185 wide_int_ref_storage<false, > > >> > false>::wide_int_ref_storage<std::pair<rtx_def*, machine_mode> > > >> >>(std::pair<rtx_def*, machine_mode> const&) > > >> > ../../gcc/gcc/wide-int.h:1089 > > >> > 0x7f3185 generic_wide_int<wide_int_ref_storage<false, false> > > >> >>::generic_wide_int<std::pair<rtx_def*, machine_mode> > > >> >>(std::pair<rtx_def*, machine_mode> const&) > > >> > ../../gcc/gcc/wide-int.h:847 > > >> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false, > > >> > false> > >::poly_int<std::pair<rtx_def*, machine_mode> > > 
>> >>(poly_int_full, std::pair<rtx_def*, machine_mode> const&) > > >> > ../../gcc/gcc/poly-int.h:467 > > >> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false, > > >> > false> > >::poly_int<std::pair<rtx_def*, machine_mode> > > >> >>(std::pair<rtx_def*, machine_mode> const&) > > >> > ../../gcc/gcc/poly-int.h:453 > > >> > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode) > > >> > ../../gcc/gcc/rtl.h:2383 > > >> > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const > > >> > ../../gcc/gcc/rtx-vector-builder.h:122 > > >> > 0xfd4e1b vector_builder<rtx_def*, machine_mode, > > >> > rtx_vector_builder>::elt(unsigned int) const > > >> > ../../gcc/gcc/vector-builder.h:253 > > >> > 0xfd4d11 rtx_vector_builder::build() > > >> > ../../gcc/gcc/rtx-vector-builder.cc:73 > > >> > 0xc21d9c const_vector_from_tree > > >> > ../../gcc/gcc/expr.cc:13487 > > >> > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > > >> > expand_modifier, rtx_def**, bool) > > >> > ../../gcc/gcc/expr.cc:11059 > > >> > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, > > >> > expand_modifier) > > >> > ../../gcc/gcc/expr.h:310 > > >> > 0xaee682 expand_return > > >> > ../../gcc/gcc/cfgexpand.cc:3809 > > >> > 0xaee682 expand_gimple_stmt_1 > > >> > ../../gcc/gcc/cfgexpand.cc:3918 > > >> > 0xaee682 expand_gimple_stmt > > >> > ../../gcc/gcc/cfgexpand.cc:4044 > > >> > 0xaf28f0 expand_gimple_basic_block > > >> > ../../gcc/gcc/cfgexpand.cc:6100 > > >> > 0xaf4996 execute > > >> > ../../gcc/gcc/cfgexpand.cc:6835 > > >> > > > >> > IIUC, the issue is that fold_vec_perm returns a vector having float > > >> > element > > >> > type with res_nelts_per_pattern == 3, and later ICE's when it tries > > >> > to derive element v[3], not present in the encoding, while trying to > > >> > build rtx vector > > >> > in rtx_vector_builder::build(): > > >> > for (unsigned int i = 0; i < nelts; ++i) > > >> > RTVEC_ELT (v, i) = elt (i); > > >> > > > >> > The attached patch tries to fix 
this by returning false from > > >> > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and > > >> > input vector has non-integral element type, so for VLA vectors, it > > >> > will only build result with dup sequence (nelts_per_pattern < 3) for > > >> > non-integral element type. > > >> > > > >> > For VLS vectors, this will still work for stepped sequence since it > > >> > will then use the "VLS exception" in fold_vec_perm_cst, and set: > > >> > res_npattern = res_nelts and > > >> > res_nelts_per_pattern = 1 > > >> > > > >> > and fold the above case to: > > >> > F foo (F a, F b) > > >> > { > > >> > <bb 2> [local count: 1073741824]: > > >> > return { 0.0, 9.0e+0, 0.0, 0.0 }; > > >> > } > > >> > > > >> > But I am not sure if this is entirely correct, since: > > >> > tree res = out_elts.build (); > > >> > will canonicalize the encoding and may result in a stepped sequence > > >> > (vector_builder::finalize() may reduce npatterns at the cost of > > >> > increasing > > >> > nelts_per_pattern) ? > > >> > > > >> > PS: This issue is now latent after PR111648 fix, since > > >> > valid_mask_for_fold_vec_perm_cst with sel = {1, 0, 1, ...} returns > > >> > false because the corresponding pattern in arg0 is not a natural > > >> > stepped sequence, and folds correctly using VLS exception. However, I > > >> > guess the underlying issue of dealing with non-integral element types > > >> > in fold_vec_perm_cst still remains ? > > >> > > > >> > The patch passes bootstrap+test with and without SVE on > > >> > aarch64-linux-gnu, > > >> > and on x86_64-linux-gnu. > > >> > > >> I think the problem is instead in the way that we're calculating > > >> res_npatterns and res_nelts_per_pattern. > > >> > > >> If the selector is a duplication of { a1, ..., an }, then the > > >> result will be a duplication of n elements, regardless of the shape > > >> of the other arguments. 
> > >> > > >> Similarly, if the selector is { a1, ...., an } followed by a > > >> duplication of { b1, ..., bn }, the result be n elements followed > > >> by a duplication of n elements, regardless of the shape of the other > > >> arguments. > > >> > > >> So for these two cases, res_npatterns and res_nelts_per_pattern > > >> can come directly from the selector's encoding. > > >> > > >> If: > > >> > > >> (1) the selector is an n-pattern stepped sequence > > >> (2) the stepped part of each pattern selects from the same input pattern > > >> (3) the stepped part of each pattern does not select the first element > > >> of the input pattern, or the full input pattern is stepped > > >> (your previous patch) > > >> > > >> then the result is stepped only if one of the inputs is stepped. > > >> This is because, if an input pattern has 1 or 2 elements, (3) means > > >> that each element of the stepped sequence will select the same value, > > >> as if the selector step had been 0. > > > Hi Richard, > > > Thanks for the suggestions! I agree when selector is dup of {a1, ... an, > > > ...} or > > > base elements followed up dup {a1, .. an, b1, ... bn, ...} in that > > > case we can set > > > res_nelts_per_pattern from selector's encoding. However even if we provide > > > more nelts_per_pattern that necessary, I guess vector_builder::finalize() > > > will > > > canonicalize it to the correct encoding for result ? > > > > Right. The elements before finalize is called have to be correct, > > but they don't need to be canonical or minimal. > > > > After the change, we'll build no more elements than we did before. > > > > >> So I think the PR could be solved by something like the attached. > > >> Do you agree? If so, could you base the patch on this instead? > > >> > > >> Only tested against the self-tests. 
> > >> > > >> Thanks, > > >> Richard > > >> > > >> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc > > >> index 40767736389..00fce4945a7 100644 > > >> --- a/gcc/fold-const.cc > > >> +++ b/gcc/fold-const.cc > > >> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree > > >> arg1, const vec_perm_indices &sel, > > >> unsigned res_npatterns, res_nelts_per_pattern; > > >> unsigned HOST_WIDE_INT res_nelts; > > >> > > >> - /* (1) If SEL is a suitable mask as determined by > > >> - valid_mask_for_fold_vec_perm_cst_p, then: > > >> - res_npatterns = max of npatterns between ARG0, ARG1, and SEL > > >> - res_nelts_per_pattern = max of nelts_per_pattern between > > >> - ARG0, ARG1 and SEL. > > >> - (2) If SEL is not a suitable mask, and TYPE is VLS then: > > >> - res_npatterns = nelts in result vector. > > >> - res_nelts_per_pattern = 1. > > >> - This exception is made so that VLS ARG0, ARG1 and SEL work as > > >> before. */ > > >> - if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason)) > > >> - { > > >> - res_npatterns > > >> - = std::max (VECTOR_CST_NPATTERNS (arg0), > > >> - std::max (VECTOR_CST_NPATTERNS (arg1), > > >> - sel.encoding ().npatterns ())); > > >> + /* First try to implement the fold in a VLA-friendly way. > > >> + > > >> + (1) If the selector is simply a duplication of N elements, the > > >> + result is likewise a duplication of N elements. > > >> + > > >> + (2) If the selector is N elements followed by a duplication > > >> + of N elements, the result is too. > > >> > > >> - res_nelts_per_pattern > > >> - = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0), > > >> - std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1), > > >> - sel.encoding ().nelts_per_pattern ())); > > >> + (3) If the selector is N elements followed by an interleaving > > >> + of N linear series, the situation is more complex. > > >> > > >> + valid_mask_for_fold_vec_perm_cst_p detects whether we > > >> + can handle this case. 
If we can, then each of the N linear > > >> + series either (a) selects the same element each time or > > >> + (b) selects a linear series from one of the input patterns. > > >> + > > >> + If (b) holds for one of the linear series, the result > > >> + will contain a linear series, and so the result will have > > >> + the same shape as the selector. If (a) holds for all of > > >> + the lienar series, the result will be the same as (2) above. > > >> + > > >> + (b) can only hold if one of the inputs pattern has a > > >> + stepped encoding. */ > > >> + if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason)) > > >> + { > > >> + res_npatterns = sel.encoding ().npatterns (); > > >> + res_nelts_per_pattern = sel.encoding ().nelts_per_pattern (); > > >> + if (res_nelts_per_pattern == 3 > > >> + && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3 > > >> + && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3) > > >> + res_nelts_per_pattern = 2; > > > Um, in this case, should we set: > > > res_nelts_per_pattern = max (nelts_per_pattern (arg0), > > > nelts_per_pattern(arg1)) > > > if both have nelts_per_pattern == 1 ? > > > > No, it still needs to be 2 even if arg0 and arg1 are duplicates. > > E.g. consider a selector that picks the first element of arg0 > > followed by a duplicate of the first element of arg1. > > > > > Also I suppose this matters only for non-integral element type, since > > > for integral element type, > > > vector_cst_elt will return the correct value even if the element is > > > not explicitly encoded and input vector is dup ? > > > > Yeah, but it might help even for integers. If we build fewer > > elements explicitly, and so read fewer implicitly-encoded inputs, > > there's less risk of running into: > > > > if (!can_div_trunc_p (sel[i], len, &q, &r)) > > { > > if (reason) > > *reason = "cannot divide selector element by arg len"; > > return NULL_TREE; > > } > Ah right, thanks for the clarification! 
> I am currently away on vacation and will return next Thursday, and > will post a follow up patch based on your patch. > Sorry for the delay. Hi, Sorry for the slow response. I have rebased your patch and added a couple of tests. The attached patch resulted in fallout for aarch64/sve/slp_3.c and aarch64/sve/slp_4.c.
Specifically for slp_3.c, we didn't fold the following case: arg0, arg1 are dup vectors. sel = { 0, len, 1, len + 1, 2, len + 2, ... } // (npatterns = 2, nelts_per_pattern = 3) because res_nelts_per_pattern was set to 3, and upon encountering 2, fold_vec_perm_cst returned false. With the patch, we set res_nelts_per_pattern = 2 (since the input vectors are dups), and it thus gets folded to: res = { arg0[0], arg1[0], ... } // (2, 1) This results in using ldrqd to load the result instead of doing the permutation at runtime with mov and zip1. I have adjusted the tests for the new code-gen. Does it look OK? There's also this strange failure observed on x86_64, as well as on aarch64: New tests that FAIL (1 tests): libitm.c++/dropref.C -B /home/prathamesh.kulkarni/gnu-toolchain/gcc/gnu-964-5/bootstrap-build-after/aarch64-unknown-linux-gnu/./libitm/../libstdc++-v3/src/.libs execution test Looking at dropref.C: /* { dg-xfail-run-if "unsupported" { *-*-* } } */ #include <libitm.h> char *pp; int main() { __transaction_atomic { _ITM_dropReferences (pp, 555); } return 0; } it doesn't seem relevant to VEC_PERM_EXPR folding? The patch otherwise passes bootstrap+test on aarch64-linux-gnu with and without SVE, and on x86_64-linux-gnu. Thanks, Prathamesh > > Thanks, > Prathamesh > > > > Thanks, > > Richard
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc index 40767736389..75410869796 100644 --- a/gcc/fold-const.cc +++ b/gcc/fold-const.cc @@ -10743,27 +10743,38 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel, unsigned res_npatterns, res_nelts_per_pattern; unsigned HOST_WIDE_INT res_nelts; - /* (1) If SEL is a suitable mask as determined by - valid_mask_for_fold_vec_perm_cst_p, then: - res_npatterns = max of npatterns between ARG0, ARG1, and SEL - res_nelts_per_pattern = max of nelts_per_pattern between - ARG0, ARG1 and SEL. - (2) If SEL is not a suitable mask, and TYPE is VLS then: - res_npatterns = nelts in result vector. - res_nelts_per_pattern = 1. - This exception is made so that VLS ARG0, ARG1 and SEL work as before. */ - if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason)) - { - res_npatterns - = std::max (VECTOR_CST_NPATTERNS (arg0), - std::max (VECTOR_CST_NPATTERNS (arg1), - sel.encoding ().npatterns ())); + /* First try to implement the fold in a VLA-friendly way. + + (1) If the selector is simply a duplication of N elements, the + result is likewise a duplication of N elements. + + (2) If the selector is N elements followed by a duplication + of N elements, the result is too. + + (3) If the selector is N elements followed by an interleaving + of N linear series, the situation is more complex. + + valid_mask_for_fold_vec_perm_cst_p detects whether we + can handle this case. If we can, then each of the N linear + series either (a) selects the same element each time or + (b) selects a linear series from one of the input patterns. + + If (b) holds for one of the linear series, the result + will contain a linear series, and so the result will have + the same shape as the selector. If (a) holds for all of + the linear series, the result will be the same as (2) above. 
- res_nelts_per_pattern - = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0), - std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1), - sel.encoding ().nelts_per_pattern ())); + (b) can only hold if one of the input patterns has a + stepped encoding. */ + if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason)) + { + res_npatterns = sel.encoding ().npatterns (); + res_nelts_per_pattern = sel.encoding ().nelts_per_pattern (); + if (res_nelts_per_pattern == 3 + && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3 + && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3) + res_nelts_per_pattern = 2; res_nelts = res_npatterns * res_nelts_per_pattern; } else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts)) @@ -17562,6 +17573,29 @@ test_nunits_min_2 (machine_mode vmode) tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) }; validate_res (1, 3, res, expected_res); } + + /* Case 8: Same as aarch64/sve/slp_3.c: + arg0, arg1 are dup vectors. + sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3) + So res = { arg0[0], arg1[0], ... } // (2, 1) + + In this case, since the input vectors are dup, only the first two + elements per pattern in sel are considered significant. 
*/ + { + tree arg0 = build_vec_cst_rand (vmode, 1, 1); + tree arg1 = build_vec_cst_rand (vmode, 1, 1); + poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)); + + vec_perm_builder builder (len, 2, 3); + poly_uint64 mask_elems[] = { 0, len, 1, len + 1, 2, len + 2 }; + builder_push_elems (builder, mask_elems); + + vec_perm_indices sel (builder, 2, len); + tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel); + + tree expected_res[] = { ARG0(0), ARG1(0) }; + validate_res (2, 1, res, expected_res); + } } } @@ -17730,6 +17764,45 @@ test_nunits_min_4 (machine_mode vmode) ASSERT_TRUE (res == NULL_TREE); ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns")); } + + /* Case 8: PR111754: When input vector is not a stepped sequence, + check that the result is not a stepped sequence either, even + if sel has a stepped sequence. */ + { + tree arg0 = build_vec_cst_rand (vmode, 1, 2); + tree arg1 = build_vec_cst_rand (vmode, 1, 2); + poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)); + + vec_perm_builder builder (len, 1, 3); + poly_uint64 mask_elems[] = { 0, 1, 2 }; + builder_push_elems (builder, mask_elems); + + vec_perm_indices sel (builder, 2, len); + tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel); + + tree expected_res[] = { ARG0(0), ARG0(1) }; + validate_res (sel.encoding ().npatterns (), 2, res, expected_res); + } + + /* Case 9: If sel doesn't contain a stepped sequence, + check that the result has same encoding as sel, irrespective + of shape of input vectors. 
*/ + { + tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1); + tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1); + poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)); + + vec_perm_builder builder (len, 1, 2); + poly_uint64 mask_elems[] = { 0, len }; + builder_push_elems (builder, mask_elems); + + vec_perm_indices sel (builder, 2, len); + tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel); + + tree expected_res[] = { ARG0(0), ARG1(0) }; + validate_res (sel.encoding ().npatterns (), + sel.encoding ().nelts_per_pattern (), res, expected_res); + } } } diff --git a/gcc/testsuite/gcc.dg/vect/pr111754.c b/gcc/testsuite/gcc.dg/vect/pr111754.c new file mode 100644 index 00000000000..7c1c16875c7 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr111754.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ + +typedef float __attribute__((__vector_size__ (16))) F; + +F foo (F a, F b) +{ + F v = (F) { 9 }; + return __builtin_shufflevector (v, v, 1, 0, 1, 2); +} + +/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */ +/* { dg-final { scan-tree-dump "return \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "optimized" } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c index 82dd43a4d98..cb649bc1aa9 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c @@ -33,21 +33,15 @@ TEST_ALL (VEC_PERM) /* 1 for each 8-bit type. */ /* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */ -/* 1 for each 16-bit type plus 1 for double. */ -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 } } */ +/* 1 for each 16-bit type */ +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */ /* 1 for each 32-bit type. 
*/ /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #25\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #31\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #62\n} 2 } } */ -/* 3 for double. */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 3 } } */ /* The 64-bit types need: ZIP1 ZIP1 (2 ZIP2s optimized away) ZIP1 ZIP2. */ -/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 9 } } */ +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */ /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */ /* The loop should be fully-masked. The 64-bit types need two loads diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c index b1fa5e3cf68..ce940a28597 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c @@ -35,20 +35,10 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \ TEST_ALL (VEC_PERM) -/* 1 for each 8-bit type, 4 for each 32-bit type and 4 for double. */ -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 18 } } */ +/* 1 for each 8-bit type */ +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 2 } } */ /* 1 for each 16-bit type. 
*/ /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 } } */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #11\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #17\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #80\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #63\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #37\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #24\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #81\n} 2 } } */ -/* 4 for double. */ -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 4 } } */ /* The 32-bit types need: ZIP1 ZIP1 (2 ZIP2s optimized away) @@ -59,7 +49,7 @@ TEST_ALL (VEC_PERM) ZIP1 ZIP1 ZIP1 ZIP1 (4 ZIP2s optimized away) ZIP1 ZIP2 ZIP1 ZIP2 ZIP1 ZIP2 ZIP1 ZIP2. */ -/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 33 } } */ +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */ /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */ /* The loop should be fully-masked. The 32-bit types need two loads