On Tue, 14 May 2019, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Tue, 14 May 2019, Richard Sandiford wrote:
> > 
> >> Richard Biener <rguent...@suse.de> writes:
> >> > The following makes SSA rewrite (update-address-taken) recognize
> >> > sets of aligned sub-vectors in aligned positions
> >> > (v2qi into v16qi, but esp. v8qi into v16qi).  It uses the
> >> > BIT_INSERT_EXPR support for this, enabling that for vector
> >> > typed values.  This makes us turn for example
> >> >
> >> > typedef unsigned char v16qi __attribute__((vector_size(16)));
> >> > v16qi load (const void *p)
> >> > {
> >> >   v16qi r;
> >> >   __builtin_memcpy (&r, p, 8);
> >> >   return r;
> >> > }
> >> >
> >> > into the following:
> >> >
> >> > load (const void * p)
> >> > {
> >> >   v16qi r;
> >> >   long unsigned int _3;
> >> >   v16qi _5;
> >> >   vector(8) unsigned char _7;
> >> >
> >> >   <bb 2> :
> >> >   _3 = MEM[(char * {ref-all})p_2(D)];
> >> >   _7 = VIEW_CONVERT_EXPR<vector(8) unsigned char>(_3);
> >> >   r_9 = BIT_INSERT_EXPR <r_8(D), _7, 0 (64 bits)>;
> >> >   _5 = r_9;
> >> >   return _5;
> >> >
> >> > This isn't yet nicely expanded since the BIT_INSERT_EXPR
> >> > expansion simply goes through store_bit_field and there's
> >> > no vector-mode vec_set.
> >> >
> >> > As with the single-element insert that SSA rewrite already
> >> > handles, the transform is conditional on the involved
> >> > vector types having non-BLKmode.  This is somewhat bad
> >> > since the transform is supposed to enable SSA optimizations
> >> > by rewriting memory vectors into SSA form.  Since splitting
> >> > of larger generic vectors only happens very much later,
> >> > this pessimizes their use.  But the BIT_INSERT_EXPR
> >> > expansion doesn't cope with BLKmode entities (source or
> >> > destination).
> >> >
> >> > Extending BIT_INSERT_EXPR this way seems natural given
> >> > the support of CONSTRUCTORs with smaller vectors.
> >> > BIT_FIELD_REF isn't particularly restricted, so it can be
> >> > used to extract sub-vectors as well.
> >> >
> >> > Code generation is as bad as before (RTL expansion eventually
> >> > spills) but SSA optimizations are enabled on less trivial
> >> > testcases.
> >> >
> >> > Bootstrap / regtest running on x86_64-unknown-linux-gnu.
> >> >
> >> > Comments?
> >> >
> >> > Richard.
> >> >
> >> > 2019-05-14  Richard Biener  <rguent...@suse.de>
> >> >
> >> > 	PR tree-optimization/90424
> >> > 	* tree-ssa.c (non_rewritable_lvalue_p): Handle inserts from
> >> > 	aligned subvectors.
> >> > 	(execute_update_addresses_taken): Likewise.
> >> > 	* tree-cfg.c (verify_gimple_assign_ternary): Likewise.
> >> >
> >> > 	* g++.target/i386/pr90424-1.C: New testcase.
> >> > 	* g++.target/i386/pr90424-2.C: Likewise.
> >> >
> >> > Index: gcc/tree-ssa.c
> >> > ===================================================================
> >> > --- gcc/tree-ssa.c	(revision 271155)
> >> > +++ gcc/tree-ssa.c	(working copy)
> >> > @@ -1521,14 +1521,28 @@ non_rewritable_lvalue_p (tree lhs)
> >> >        if (DECL_P (decl)
> >> >  	  && VECTOR_TYPE_P (TREE_TYPE (decl))
> >> >  	  && TYPE_MODE (TREE_TYPE (decl)) != BLKmode
> >> > -	  && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),
> >> > -			      TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))), 0)
> >> > +	  && multiple_of_p (sizetype,
> >> > +			    TYPE_SIZE_UNIT (TREE_TYPE (decl)),
> >> > +			    TYPE_SIZE_UNIT (TREE_TYPE (lhs)))
> >> >  	  && known_ge (mem_ref_offset (lhs), 0)
> >> >  	  && known_gt (wi::to_poly_offset (TYPE_SIZE_UNIT (TREE_TYPE (decl))),
> >> >  		       mem_ref_offset (lhs))
> >> >  	  && multiple_of_p (sizetype, TREE_OPERAND (lhs, 1),
> >> >  			    TYPE_SIZE_UNIT (TREE_TYPE (lhs))))
> >> > -	return false;
> >> > +	{
> >> > +	  if (operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),
> >> > +			       TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))),
> >> > +			       0))
> >> > +	    return false;
> >> > +	  /* For sub-vector inserts the insert vector mode has to be
> >> > +	     supported.  */
> >> > +	  tree vtype = build_vector_type
> >> > +	    (TREE_TYPE (TREE_TYPE (decl)),
> >> > +	     tree_to_uhwi (TYPE_SIZE (TREE_TYPE (lhs)))
> >> > +	     / tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (decl)))));
> >> 
> >> AFAICT nothing guarantees tree_fits_uhwi_p for the lhs, so this isn't
> >> poly-int clean.  Is there a guarantee that lhs is a multiple of the
> >> element size even for integers?  Or are we just relying on a vector
> >> of 0 elements being rejected?
> > 
> > That it is a multiple of the element size is tested indirectly by
> > 
> >> > +	  && multiple_of_p (sizetype,
> >> > +			    TYPE_SIZE_UNIT (TREE_TYPE (decl)),
> >> > +			    TYPE_SIZE_UNIT (TREE_TYPE (lhs)))
> > 
> > given 'decl' has vector type.  That also guarantees a non-zero size?
> 
> I was thinking of the case where the element size is still bigger
> than lhs, so the division would end up being 0.  Maybe one of those
> conditions indirectly prevents that though.
> 
> > But yes, I guess TYPE_SIZE_UNIT might be a poly-int.
> > 
> >> Maybe something like:
> >> 
> >>   tree elt_type = TREE_TYPE (TREE_TYPE (decl));
> >>   unsigned int elt_bits = tree_to_uhwi (TYPE_SIZE (elt_type));
> >>   poly_uint64 lhs_bits, nelts;
> >>   if (poly_int_tree_p (TYPE_SIZE (TREE_TYPE (lhs)), &lhs_bits)
> >>       && multiple_p (lhs_bits, elt_bits, &nelts))
> >>     {
> >>       tree vtype = build_vector_type (elt_type, nelts);
> > 
> > This should work.
> > 
> > Of course it even more so makes duplicating all the bits in
> > two places ugly :/
> > 
> > I guess I might take it as an opportunity to refactor the pass
> 
> Sounds good :-)
> 
> > (as an excuse to not work on that SLP thing...).
> 
> except for that bit.
The following seems to work for me then ;)

Bootstrap / regtest running on x86_64-unknown-linux-gnu.

To whoever is listening: the refactoring would queue up the stmts we
need to rewrite, as discovered by non_rewritable_lvalue_p or
non_rewritable_mem_ref_base, together with a transform kind.  During
the transform we would then simply walk those instead of re-analyzing
everything.  Probably an easyhack, but given the ugliness of the code
it might be easy to introduce mistakes as well ;)

Richard.

2019-05-14  Richard Biener  <rguent...@suse.de>

	PR tree-optimization/90424
	* tree-ssa.c (non_rewritable_lvalue_p): Handle inserts from
	aligned subvectors.
	(execute_update_addresses_taken): Likewise.
	* tree-cfg.c (verify_gimple_assign_ternary): Likewise.

	* g++.target/i386/pr90424-1.C: New testcase.
	* g++.target/i386/pr90424-2.C: Likewise.

Index: gcc/tree-ssa.c
===================================================================
--- gcc/tree-ssa.c	(revision 271203)
+++ gcc/tree-ssa.c	(working copy)
@@ -1521,14 +1521,29 @@ non_rewritable_lvalue_p (tree lhs)
       if (DECL_P (decl)
 	  && VECTOR_TYPE_P (TREE_TYPE (decl))
 	  && TYPE_MODE (TREE_TYPE (decl)) != BLKmode
-	  && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),
-			      TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))), 0)
 	  && known_ge (mem_ref_offset (lhs), 0)
 	  && known_gt (wi::to_poly_offset (TYPE_SIZE_UNIT (TREE_TYPE (decl))),
 		       mem_ref_offset (lhs))
 	  && multiple_of_p (sizetype, TREE_OPERAND (lhs, 1),
 			    TYPE_SIZE_UNIT (TREE_TYPE (lhs))))
-	return false;
+	{
+	  poly_uint64 lhs_bits, nelts;
+	  if (poly_int_tree_p (TYPE_SIZE (TREE_TYPE (lhs)), &lhs_bits)
+	      && multiple_p (lhs_bits,
+			     tree_to_uhwi
+			       (TYPE_SIZE (TREE_TYPE (TREE_TYPE (decl)))),
+			     &nelts))
+	    {
+	      if (known_eq (nelts, 1))
+		return false;
+	      /* For sub-vector inserts the insert vector mode has to be
+		 supported.  */
+	      tree vtype = build_vector_type (TREE_TYPE (TREE_TYPE (decl)),
+					      nelts);
+	      if (TYPE_MODE (vtype) != BLKmode)
+		return false;
+	    }
+	}
     }
 
   /* A vector-insert using a BIT_FIELD_REF is rewritable using
@@ -1866,20 +1881,30 @@ execute_update_addresses_taken (void)
 		    && bitmap_bit_p (suitable_for_renaming, DECL_UID (sym))
 		    && VECTOR_TYPE_P (TREE_TYPE (sym))
 		    && TYPE_MODE (TREE_TYPE (sym)) != BLKmode
-		    && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),
-					TYPE_SIZE_UNIT
-					  (TREE_TYPE (TREE_TYPE (sym))), 0)
-		    && tree_fits_uhwi_p (TREE_OPERAND (lhs, 1))
-		    && tree_int_cst_lt (TREE_OPERAND (lhs, 1),
-					TYPE_SIZE_UNIT (TREE_TYPE (sym)))
-		    && (tree_to_uhwi (TREE_OPERAND (lhs, 1))
-			% tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (lhs)))) == 0)
+		    && known_ge (mem_ref_offset (lhs), 0)
+		    && known_gt (wi::to_poly_offset
+				   (TYPE_SIZE_UNIT (TREE_TYPE (sym))),
+				 mem_ref_offset (lhs))
+		    && multiple_of_p (sizetype,
+				      TREE_OPERAND (lhs, 1),
+				      TYPE_SIZE_UNIT (TREE_TYPE (lhs))))
 		  {
 		    tree val = gimple_assign_rhs1 (stmt);
 		    if (! types_compatible_p (TREE_TYPE (val),
 					      TREE_TYPE (TREE_TYPE (sym))))
 		      {
-			tree tem = make_ssa_name (TREE_TYPE (TREE_TYPE (sym)));
+			poly_uint64 lhs_bits, nelts;
+			tree temtype = TREE_TYPE (TREE_TYPE (sym));
+			if (poly_int_tree_p (TYPE_SIZE (TREE_TYPE (lhs)),
+					     &lhs_bits)
+			    && multiple_p (lhs_bits,
+					   tree_to_uhwi
+					     (TYPE_SIZE (TREE_TYPE
+							   (TREE_TYPE (sym)))),
+					   &nelts)
+			    && maybe_ne (nelts, 1))
+			  temtype = build_vector_type (temtype, nelts);
+			tree tem = make_ssa_name (temtype);
 			gimple *pun
 			  = gimple_build_assign (tem,
 						 build1 (VIEW_CONVERT_EXPR,
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c	(revision 271203)
+++ gcc/tree-cfg.c	(working copy)
@@ -4263,8 +4263,17 @@ verify_gimple_assign_ternary (gassign *s
 	    }
 	  if (! ((INTEGRAL_TYPE_P (rhs1_type)
 		  && INTEGRAL_TYPE_P (rhs2_type))
+		 /* Vector element insert.  */
 		 || (VECTOR_TYPE_P (rhs1_type)
-		     && types_compatible_p (TREE_TYPE (rhs1_type), rhs2_type))))
+		     && types_compatible_p (TREE_TYPE (rhs1_type), rhs2_type))
+		 /* Aligned sub-vector insert.  */
+		 || (VECTOR_TYPE_P (rhs1_type)
+		     && VECTOR_TYPE_P (rhs2_type)
+		     && types_compatible_p (TREE_TYPE (rhs1_type),
+					    TREE_TYPE (rhs2_type))
+		     && multiple_p (TYPE_VECTOR_SUBPARTS (rhs1_type),
+				    TYPE_VECTOR_SUBPARTS (rhs2_type))
+		     && multiple_of_p (bitsizetype, rhs3,
+				       TYPE_SIZE (rhs2_type)))))
 	    {
 	      error ("not allowed type combination in BIT_INSERT_EXPR");
 	      debug_generic_expr (rhs1_type);
Index: gcc/testsuite/g++.target/i386/pr90424-1.C
===================================================================
--- gcc/testsuite/g++.target/i386/pr90424-1.C	(nonexistent)
+++ gcc/testsuite/g++.target/i386/pr90424-1.C	(working copy)
@@ -0,0 +1,32 @@
+/* { dg-do compile { target c++11 } } */
+/* { dg-options "-O2 -msse2 -fdump-tree-optimized" } */
+
+template <class T>
+using V [[gnu::vector_size(16)]] = T;
+
+template <class T, unsigned M = sizeof(V<T>)>
+V<T> load(const void *p) {
+  using W = V<T>;
+  W r;
+  __builtin_memcpy(&r, p, M);
+  return r;
+}
+
+// movq or movsd
+template V<char> load<char, 8>(const void *);     // bad
+template V<short> load<short, 8>(const void *);   // bad
+template V<int> load<int, 8>(const void *);       // bad
+template V<long> load<long, 8>(const void *);     // good
+// the following is disabled because V2SF isn't a supported mode
+// template V<float> load<float, 8>(const void *);   // bad
+template V<double> load<double, 8>(const void *); // good (movsd?)
+
+// movd or movss
+template V<char> load<char, 4>(const void *);     // bad
+template V<short> load<short, 4>(const void *);   // bad
+template V<int> load<int, 4>(const void *);       // good
+template V<float> load<float, 4>(const void *);   // good
+
+/* We should end up with one load and one insert for each function.  */
+/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 9 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "MEM" 9 "optimized" } } */
Index: gcc/testsuite/g++.target/i386/pr90424-2.C
===================================================================
--- gcc/testsuite/g++.target/i386/pr90424-2.C	(nonexistent)
+++ gcc/testsuite/g++.target/i386/pr90424-2.C	(working copy)
@@ -0,0 +1,31 @@
+/* { dg-do compile { target c++11 } } */
+/* { dg-options "-O2 -msse2 -fdump-tree-optimized" } */
+
+template <class T>
+using V [[gnu::vector_size(16)]] = T;
+
+template <class T, unsigned M = sizeof(V<T>)>
+V<T> load(const void *p) {
+  V<T> r = {};
+  __builtin_memcpy(&r, p, M);
+  return r;
+}
+
+// movq or movsd
+template V<char> load<char, 8>(const void *);     // bad
+template V<short> load<short, 8>(const void *);   // bad
+template V<int> load<int, 8>(const void *);       // bad
+template V<long> load<long, 8>(const void *);     // good
+// the following is disabled because V2SF isn't a supported mode
+// template V<float> load<float, 8>(const void *);   // bad
+template V<double> load<double, 8>(const void *); // good (movsd?)
+
+// movd or movss
+template V<char> load<char, 4>(const void *);     // bad
+template V<short> load<short, 4>(const void *);   // bad
+template V<int> load<int, 4>(const void *);       // good
+template V<float> load<float, 4>(const void *);   // good
+
+/* We should end up with one load and one insert for each function.  */
+/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 9 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "MEM" 9 "optimized" } } */