[gcc r15-2374] vect: Fix single_imm_use in tree_vect_patterns
https://gcc.gnu.org/g:49339d8b7e03a7ba0d4a5e118af993f175485b41

commit r15-2374-g49339d8b7e03a7ba0d4a5e118af993f175485b41
Author: Feng Xue
Date:   Fri Jun 14 15:49:23 2024 +0800

    vect: Fix single_imm_use in tree_vect_patterns

    Since pattern statements coexist with normal statements in a way that they
    are not linked into the function body, we should not invoke utility
    procedures that depend on the def/use graph on a pattern statement, such as
    counting the uses of a pseudo value defined by a pattern statement. This
    patch fixes a bug of this type in vect pattern formation.

    2024-06-14  Feng Xue

    gcc/
            * tree-vect-patterns.cc (vect_recog_bitfield_ref_pattern): Only
            call single_imm_use if statement is not generated from pattern
            recognition.

Diff:
---
 gcc/tree-vect-patterns.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 5fbd1a4fa6b4..4674a16d15f4 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -2702,7 +2702,8 @@ vect_recog_bitfield_ref_pattern (vec_info *vinfo, stmt_vec_info stmt_info,
   /* If the only use of the result of this BIT_FIELD_REF + CONVERT is a
      PLUS_EXPR then do the shift last as some targets can combine the shift and
      add into a single instruction.  */
-  if (lhs && single_imm_use (lhs, &use_p, &use_stmt))
+  if (lhs && !is_pattern_stmt_p (stmt_info)
+      && single_imm_use (lhs, &use_p, &use_stmt))
     {
       if (gimple_code (use_stmt) == GIMPLE_ASSIGN
	  && gimple_assign_rhs_code (use_stmt) == PLUS_EXPR)
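The pattern involved here, vect_recog_bitfield_ref_pattern, matches a bit-field read followed by a conversion, and defers the shift when the single use is a PLUS_EXPR. A minimal, hypothetical C source shape that exercises it (the struct and function names are illustrative, not taken from the commit):

```c
#include <stdint.h>

struct rec
{
  uint32_t field : 9;   /* bit-field read lowers to BIT_FIELD_REF
                           + shift/mask in GIMPLE */
  uint32_t rest : 23;
};

/* The add after the extraction is the PLUS_EXPR single use: some
   targets can then fuse the deferred shift with the add.  */
int32_t
sum_fields (const struct rec *r, int n)
{
  int32_t sum = 0;
  for (int i = 0; i < n; i++)
    sum += r[i].field;
  return sum;
}
```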
[gcc r15-2097] vect: Optimize order of lane-reducing operations in loop def-use cycles
https://gcc.gnu.org/g:db3c8c9726d0bafbb9f85b6d7027fe83602643e7

commit r15-2097-gdb3c8c9726d0bafbb9f85b6d7027fe83602643e7
Author: Feng Xue
Date:   Wed May 29 17:28:14 2024 +0800

    vect: Optimize order of lane-reducing operations in loop def-use cycles

    When transforming multiple lane-reducing operations in a loop reduction
    chain, the corresponding vectorized statements were originally generated
    into def-use cycles starting from 0. A def-use cycle with a smaller index
    would contain more statements, which means more instruction dependency.
    For example:

       int sum = 1;
       for (i)
         {
           sum += d0[i] * d1[i];      // dot-prod
           sum += w[i];               // widen-sum
           sum += abs(s0[i] - s1[i]); // sad
           sum += n[i];               // normal
         }

    Original transformation result:

       for (i / 16)
         {
           sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
           sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           ...
         }

    For higher instruction parallelism in the final vectorized loop, the
    optimal approach is to distribute the effective vector lane-reducing ops
    evenly among all def-use cycles. Transformed as below, DOT_PROD, WIDEN_SUM
    and the SADs are generated into disparate cycles, so instruction dependency
    among them can be eliminated.

       for (i / 16)
         {
           sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           sum_v0 = sum_v0;  // copy
           sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           sum_v0 = sum_v0;  // copy
           sum_v1 = sum_v1;  // copy
           sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
           sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);

           ...
         }

    2024-03-22  Feng Xue

    gcc/
            PR tree-optimization/114440
            * tree-vectorizer.h (struct _stmt_vec_info): Add a new field
            reduc_result_pos.
            * tree-vect-loop.cc (vect_transform_reduction): Generate
            lane-reducing statements in an optimized order.

Diff:
---
 gcc/tree-vect-loop.cc | 64 +--
 gcc/tree-vectorizer.h |  6 +
 2 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 1c3dbf4bc71b..d7d628efa60f 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8844,6 +8844,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
	   sum += d0[i] * d1[i];      // dot-prod
	   sum += w[i];               // widen-sum
	   sum += abs(s0[i] - s1[i]); // sad
+	   sum += n[i];               // normal
	 }

    The vector size is 128-bit,vectorization factor is 16.  Reduction
@@ -8861,19 +8862,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
	 sum_v2 = sum_v2;  // copy
	 sum_v3 = sum_v3;  // copy

-	 sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
-	 sum_v1 = sum_v1;  // copy
+	 sum_v0 = sum_v0;  // copy
+	 sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
	 sum_v2 = sum_v2;  // copy
	 sum_v3 = sum_v3;  // copy

-	 sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-	 sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-	 sum_v2 = sum_v2;  // copy
+	 sum_v0 = sum_v0;  // copy
+	 sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+	 sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
	 sum_v3 = sum_v3;  // copy
+
+	 sum_v0 += n_v0[i: 0 ~ 3 ];
+	 sum_v1 += n_v1[i: 4 ~ 7 ];
+	 sum_v2 += n_v2[i: 8 ~ 11];
+	 sum_v3 += n_v3[i: 12 ~ 15];
       }
-   sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1
- */
+Moreover, for a higher instruction parallelism in final vectorized
+loop, it is considered to make those effective vector lane-reducing
+ops be distributed evenly among all def-use cycles. In the above
+example, DOT_PROD, WIDEN_SUM and SADs are generated into
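As a scalar reference, the reduction chain from the example above can be written as plain C; every statement feeds the single accumulator `sum`, which is what lets the vectorizer split the work across the four def-use cycles (array lengths and values here are illustrative):

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar semantics of the loop in the commit message: dot-prod,
   widen-sum, sad and a normal add, all reducing into one sum.  */
int
dot_widen_sad_normal (const int8_t *d0, const int8_t *d1,
                      const int8_t *w, const int8_t *s0,
                      const int8_t *s1, const int *n, int len)
{
  int sum = 1;                        /* initial value as in the example */
  for (int i = 0; i < len; i++)
    {
      sum += d0[i] * d1[i];           /* dot-prod */
      sum += w[i];                    /* widen-sum */
      sum += abs (s0[i] - s1[i]);     /* sad */
      sum += n[i];                    /* normal */
    }
  return sum;
}
```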
[gcc r15-2096] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
https://gcc.gnu.org/g:178cc419512f7e358f88dfe2336625aa99cd7438

commit r15-2096-g178cc419512f7e358f88dfe2336625aa99cd7438
Author: Feng Xue
Date:   Wed May 29 17:22:36 2024 +0800

    vect: Support multiple lane-reducing operations for loop reduction [PR114440]

    For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction,
    the current vectorizer can only handle the pattern if the reduction chain
    contains no other operation, whether normal or lane-reducing. This patch
    removes some constraints in reduction analysis to allow multiple arbitrary
    lane-reducing operations with mixed input vectypes in a loop reduction
    chain. For example:

       int sum = 1;
       for (i)
         {
           sum += d0[i] * d1[i];      // dot-prod
           sum += w[i];               // widen-sum
           sum += abs(s0[i] - s1[i]); // sad
         }

    The vector size is 128-bit, and the vectorization factor is 16. The
    reduction statements would be transformed as:

       vector<4> int sum_v0 = { 0, 0, 0, 1 };
       vector<4> int sum_v1 = { 0, 0, 0, 0 };
       vector<4> int sum_v2 = { 0, 0, 0, 0 };
       vector<4> int sum_v3 = { 0, 0, 0, 0 };

       for (i / 16)
         {
           sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
           sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy
         }

       sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1

    2024-03-22  Feng Xue

    gcc/
            PR tree-optimization/114440
            * tree-vectorizer.h (vectorizable_lane_reducing): New function
            declaration.
            * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
            vectorizable_lane_reducing to analyze lane-reducing operation.
            * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost
            computation code related to emulated_mixed_dot_prod.
            (vectorizable_lane_reducing): New function.
            (vectorizable_reduction): Allow multiple lane-reducing operations
            in loop reduction. Move some original lane-reducing related code
            to vectorizable_lane_reducing.
            (vect_transform_reduction): Adjust comments with updated example.

    gcc/testsuite/
            PR tree-optimization/114440
            * gcc.dg/vect/vect-reduc-chain-1.c
            * gcc.dg/vect/vect-reduc-chain-2.c
            * gcc.dg/vect/vect-reduc-chain-3.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
            * gcc.dg/vect/vect-reduc-dot-slp-1.c

Diff:
---
 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c     |  64 ++
 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c     |  79 +++
 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c     |  68 ++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c       |  95
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c       |  67 ++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c       |  79 +++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c       |  63 ++
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c   |  60 +
 gcc/tree-vect-loop.cc                              | 241 +++--
 gcc/tree-vect-stmts.cc                             |   2 +
 gcc/tree-vectorizer.h                              |   2 +
 11 files changed, 750 insertions(+), 70 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index ..80b0089ea0fa
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,64 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon } */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_2 char *restrict c,
+   SIGNEDNESS_2 char *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+
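The transformation described in this commit can be emulated in scalar C to see why it is correct: four accumulators stand in for the four lanes' def-use cycles, DOT_PROD and WIDEN_SUM consume 16 chars per vector iteration into one cycle, SAD consumes 8 chars into each of two cycles, and untouched cycles are pass-through copies. This is a sketch under the simplifying assumption of a single 16-element iteration; the names mirror the commit message, not vectorizer APIs:

```c
#include <stdint.h>
#include <stdlib.h>

/* Emulation of the transformed reduction for len == 16.  */
int
emulated_reduction (const int8_t *d0, const int8_t *d1, const int8_t *w,
                    const int8_t *s0, const int8_t *s1)
{
  int sum_v0 = 1, sum_v1 = 0, sum_v2 = 0, sum_v3 = 0;
  for (int i = 0; i < 16; i++)
    sum_v0 += d0[i] * d1[i];        /* DOT_PROD into cycle 0 */
  for (int i = 0; i < 16; i++)
    sum_v0 += w[i];                 /* WIDEN_SUM into cycle 0 */
  for (int i = 0; i < 8; i++)
    sum_v0 += abs (s0[i] - s1[i]);  /* SAD, lower half, cycle 0 */
  for (int i = 8; i < 16; i++)
    sum_v1 += abs (s0[i] - s1[i]);  /* SAD, upper half, cycle 1 */
  /* sum_v2 and sum_v3 stay pass-through copies.  */
  return sum_v0 + sum_v1 + sum_v2 + sum_v3;
}

/* Scalar reference loop from the commit message.  */
int
scalar_reduction (const int8_t *d0, const int8_t *d1, const int8_t *w,
                  const int8_t *s0, const int8_t *s1)
{
  int sum = 1;
  for (int i = 0; i < 16; i++)
    sum += d0[i] * d1[i] + w[i] + abs (s0[i] - s1[i]);
  return sum;
}
```

Both functions always agree, which is the invariant the final fold `sum_v0 + sum_v1 + sum_v2 + sum_v3` relies on.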
[gcc r15-2095] vect: Refit lane-reducing to be normal operation
https://gcc.gnu.org/g:8b59fa9d8ca25bdf0792390a8bdeae151532a530

commit r15-2095-g8b59fa9d8ca25bdf0792390a8bdeae151532a530
Author: Feng Xue
Date:   Tue Jul 2 17:12:00 2024 +0800

    vect: Refit lane-reducing to be normal operation

    The vector stmts number of an operation is calculated based on the output
    vectype. This is over-estimated for a lane-reducing operation, which would
    cause vector def/use mismatches when we want to support loop reduction
    mixed with lane-reducing and normal operations. One solution is to refit
    lane-reducing to make it behave like a normal one, by adding new
    pass-through copies to fix the possible def/use gap. The resultant
    superfluous statements can be optimized away after vectorization.
    For example:

       int sum = 1;
       for (i)
         {
           sum += d0[i] * d1[i];      // dot-prod
         }

    The vector size is 128-bit, and the vectorization factor is 16. The
    reduction statements would be transformed as:

       vector<4> int sum_v0 = { 0, 0, 0, 1 };
       vector<4> int sum_v1 = { 0, 0, 0, 0 };
       vector<4> int sum_v2 = { 0, 0, 0, 0 };
       vector<4> int sum_v3 = { 0, 0, 0, 0 };

       for (i / 16)
         {
           sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy
         }

       sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0

    2024-07-02  Feng Xue

    gcc/
            * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage):
            Calculate effective vector stmts number with generic
            vect_get_num_copies.
            (vect_transform_reduction): Insert copies for lane-reducing so as
            to fix over-estimated vector stmts number.
            (vect_transform_cycle_phi): Calculate vector PHI number only based
            on output vectype.
            * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove
            adjustment on vector stmts number specific to slp reduction.

Diff:
---
 gcc/tree-vect-loop.cc | 134 ++
 gcc/tree-vect-slp.cc  |  27 +++---
 2 files changed, 121 insertions(+), 40 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a464bc8607c2..9c5c30535713 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7472,12 +7472,8 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
     = get_masked_reduction_fn (reduc_fn, vectype_in);
   vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
   vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
-  unsigned nvectors;
-
-  if (slp_node)
-    nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
-  else
-    nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
+  unsigned nvectors = vect_get_num_copies (loop_vinfo, slp_node,
+					   vectype_in);

   if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS)
     vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1);
@@ -8599,12 +8595,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
   gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt);
   int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info);
-  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+
+  if (!vectype_in)
+    vectype_in = STMT_VINFO_VECTYPE (stmt_info);

   if (slp_node)
     {
       ncopies = 1;
-      vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+      vec_num = vect_get_num_copies (loop_vinfo, slp_node, vectype_in);
     }
   else
     {
@@ -8662,13 +8661,40 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   bool lane_reducing = lane_reducing_op_p (code);
   gcc_assert (single_defuse_cycle || lane_reducing);

+  if (lane_reducing)
+    {
+      /* The last operand of lane-reducing op is for reduction.  */
+      gcc_assert (reduc_index == (int) op.num_ops - 1);
+    }
+
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
   tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);

+  if (lane_reducing && !slp_node && !single_defuse_cycle)
+    {
+      /* Note: there are still vectorizable cases that can not be handled by
+	 single-lane slp.  Probably it would take some time to evolve the
+	 feature to a mature state.  So we have to keep the below non-slp code
+	 path as failsafe for lane-reducing support.  */
+      gcc_assert (op.num_ops <= 3);
+      for (unsigned i = 0; i < op.num_ops; i++)
+	{
+	  unsigned oprnd_ncopies = ncopies;
+
+	  if ((int) i == reduc_index)
+	    {
+	      tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+	      oprnd_ncopies = vect_get_num_copies (loop_vinfo,
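The effect of the pass-through copies can be shown in scalar C: only cycle 0 does real work for the dot-product, the other three accumulators are carried through unchanged and folded at the end. A sketch, not vectorizer code; the fixed length 16 matches the example above:

```c
#include <stdint.h>

/* Four accumulators mirror the four vector<4> int def-use cycles;
   DOT_PROD puts all 16 lanes into cycle 0, cycles 1..3 are copies.  */
int
dotprod_with_copies (const int8_t *d0, const int8_t *d1)
{
  int sum_v[4] = { 1, 0, 0, 0 };
  for (int i = 0; i < 16; i++)
    sum_v[0] += d0[i] * d1[i];      /* DOT_PROD into cycle 0 */
  /* sum_v[1] = sum_v[1]; etc.: pass-through copies that later
     optimizations delete.  */
  return sum_v[0] + sum_v[1] + sum_v[2] + sum_v[3];   /* == sum_v[0] */
}
```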
[gcc r15-2094] vect: Add a unified vect_get_num_copies for slp and non-slp
https://gcc.gnu.org/g:e7fbae834f8db2508d3161d88efe7ddbb702e437

commit r15-2094-ge7fbae834f8db2508d3161d88efe7ddbb702e437
Author: Feng Xue
Date:   Fri Jul 12 16:38:28 2024 +0800

    vect: Add a unified vect_get_num_copies for slp and non-slp

    Extend the original vect_get_num_copies (pure loop-based) to calculate the
    number of vector stmts for an slp node regarding a generic vect region.

    2024-07-12  Feng Xue

    gcc/
            * tree-vectorizer.h (vect_get_num_copies): New overload function.
            * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Calculate
            number of vector stmts for slp node with vect_get_num_copies.
            (vect_slp_analyze_node_operations): Calculate number of vector
            elements for constant/external slp node with vect_get_num_copies.

Diff:
---
 gcc/tree-vect-slp.cc  | 19 +++
 gcc/tree-vectorizer.h | 28 +++-
 2 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index d0a8531fd3b3..4dadbc6854de 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -6573,17 +6573,7 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
	}
     }
   else
-    {
-      poly_uint64 vf;
-      if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
-	vf = loop_vinfo->vectorization_factor;
-      else
-	vf = 1;
-      unsigned int group_size = SLP_TREE_LANES (node);
-      tree vectype = SLP_TREE_VECTYPE (node);
-      SLP_TREE_NUMBER_OF_VEC_STMTS (node)
-	= vect_get_num_vectors (vf * group_size, vectype);
-    }
+    SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node);

   /* Handle purely internal nodes.  */
   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
@@ -6851,12 +6841,9 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
		      && j == 1);
	  continue;
	}
-      unsigned group_size = SLP_TREE_LANES (child);
-      poly_uint64 vf = 1;
-      if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
-	vf = loop_vinfo->vectorization_factor;
+
       SLP_TREE_NUMBER_OF_VEC_STMTS (child)
-	= vect_get_num_vectors (vf * group_size, vector_type);
+	= vect_get_num_copies (vinfo, child);
       /* And cost them.  */
       vect_prologue_cost_for_slp (child, cost_vec);
     }
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8eb3ec4df869..1e2121abaffc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2080,6 +2080,32 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype)
   return exact_div (nunits, TYPE_VECTOR_SUBPARTS (vectype)).to_constant ();
 }

+/* Return the number of vectors in the context of vectorization region VINFO,
+   needed for a group of statements, whose size is specified by lanes of NODE,
+   if NULL, it is 1.  The statements are supposed to be interleaved together
+   with no gap, and all operate on vectors of type VECTYPE, if NULL, the
+   vectype of NODE is used.  */
+
+inline unsigned int
+vect_get_num_copies (vec_info *vinfo, slp_tree node, tree vectype = NULL)
+{
+  poly_uint64 vf;
+
+  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+    vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  else
+    vf = 1;
+
+  if (node)
+    {
+      vf *= SLP_TREE_LANES (node);
+      if (!vectype)
+	vectype = SLP_TREE_VECTYPE (node);
+    }
+
+  return vect_get_num_vectors (vf, vectype);
+}
+
 /* Return the number of copies needed for loop vectorization when
    a statement operates on vectors of type VECTYPE.  This is the
    vectorization factor divided by the number of elements in
@@ -2088,7 +2114,7 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype)
 inline unsigned int
 vect_get_num_copies (loop_vec_info loop_vinfo, tree vectype)
 {
-  return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo), vectype);
+  return vect_get_num_copies (loop_vinfo, NULL, vectype);
 }

 /* Update maximum unit count *MAX_NUNITS so that it accounts for
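The unified overload reduces to simple lane arithmetic: the group must cover VF * lanes scalar results, and each vector of the chosen vectype holds TYPE_VECTOR_SUBPARTS of them. A plain-integer sketch, with poly_uint64 and exact_div collapsed to unsigned division for illustration:

```c
/* nvectors = (vf * slp_lanes) / subparts; in the real exact_div-based
   implementation the division is required to be exact.  slp_lanes is 1
   for the non-slp (pure loop-based) case, matching a NULL node.  */
unsigned
num_copies_sketch (unsigned vf, unsigned slp_lanes, unsigned subparts)
{
  return (vf * slp_lanes) / subparts;
}
```

For example, with VF 16 and a 4-lane int vector, a 2-lane SLP group needs 16 * 2 / 4 = 8 vector statements.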
[gcc r15-1727] vect: Determine input vectype for multiple lane-reducing operations
https://gcc.gnu.org/g:3aa004f1db327d5728a8fd0afcfed24e767f0499

commit r15-1727-g3aa004f1db327d5728a8fd0afcfed24e767f0499
Author: Feng Xue
Date:   Sun Jun 16 13:00:32 2024 +0800

    vect: Determine input vectype for multiple lane-reducing operations

    The input vectype of a reduction PHI statement must be determined before
    vect cost computation for the reduction. Since a lane-reducing operation
    has a different input vectype from a normal one, we need to traverse all
    reduction statements to find the input vectype with the least lanes, and
    set that on the PHI statement.

    2024-06-16  Feng Xue

    gcc/
            * tree-vect-loop.cc (vectorizable_reduction): Determine input
            vectype during traversal of reduction statements.

Diff:
---
 gcc/tree-vect-loop.cc | 79 ---
 1 file changed, 56 insertions(+), 23 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6f32867f85a..3095ff5ab6b 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7643,7 +7643,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
     {
       stmt_vec_info def = loop_vinfo->lookup_def (reduc_def);
       stmt_vec_info vdef = vect_stmt_to_vectorize (def);
-      if (STMT_VINFO_REDUC_IDX (vdef) == -1)
+      int reduc_idx = STMT_VINFO_REDUC_IDX (vdef);
+
+      if (reduc_idx == -1)
	{
	  if (dump_enabled_p ())
	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -7686,10 +7688,57 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
	      return false;
	    }
	}
-      else if (!stmt_info)
-	/* First non-conversion stmt.  */
-	stmt_info = vdef;
-      reduc_def = op.ops[STMT_VINFO_REDUC_IDX (vdef)];
+      else
+	{
+	  /* First non-conversion stmt.  */
+	  if (!stmt_info)
+	    stmt_info = vdef;
+
+	  if (lane_reducing_op_p (op.code))
+	    {
+	      enum vect_def_type dt;
+	      tree vectype_op;
+
+	      /* The last operand of lane-reducing operation is for
+		 reduction.  */
+	      gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - 1);
+
+	      if (!vect_is_simple_use (op.ops[0], loop_vinfo, &dt, &vectype_op))
+		return false;
+
+	      tree type_op = TREE_TYPE (op.ops[0]);
+
+	      if (!vectype_op)
+		{
+		  vectype_op = get_vectype_for_scalar_type (loop_vinfo,
+							    type_op);
+		  if (!vectype_op)
+		    return false;
+		}
+
+	      /* For lane-reducing operation vectorizable analysis needs the
+		 reduction PHI information.  */
+	      STMT_VINFO_REDUC_DEF (def) = phi_info;
+
+	      /* Each lane-reducing operation has its own input vectype, while
+		 reduction PHI will record the input vectype with the least
+		 lanes.  */
+	      STMT_VINFO_REDUC_VECTYPE_IN (vdef) = vectype_op;
+
+	      /* To accommodate lane-reducing operations of mixed input
+		 vectypes, choose input vectype with the least lanes for the
+		 reduction PHI statement, which would result in the most
+		 ncopies for vectorized reduction results.  */
+	      if (!vectype_in
+		  || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
+		      < GET_MODE_SIZE (SCALAR_TYPE_MODE (type_op))))
+		vectype_in = vectype_op;
+	    }
+	  else
+	    vectype_in = STMT_VINFO_VECTYPE (phi_info);
+	}
+
+      reduc_def = op.ops[reduc_idx];
       reduc_chain_length++;
       if (!stmt_info && slp_node)
	slp_for_stmt_info = SLP_TREE_CHILDREN (slp_for_stmt_info)[0];
@@ -7747,6 +7796,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,

   tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
   STMT_VINFO_REDUC_VECTYPE (reduc_info) = vectype_out;
+  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
+
   gimple_match_op op;
   if (!gimple_extract_op (stmt_info->stmt, &op))
     gcc_unreachable ();
@@ -7831,16 +7882,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
	  = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op.ops[i]),
					 slp_op[i]);

-      /* To properly compute ncopies we are interested in the widest
-	 non-reduction input type in case we're looking at a widening
-	 accumulation that we later handle in vect_transform_reduction.  */
-      if (lane_reducing
-	  && vectype_op[i]
-	  && (!vectype_in
-	      || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
-		  < GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_op[i]))))))
-	vectype_in = vectype_op[i];
-
       /* Record how the non-reduction-def value of COND_EXPR is
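The "least lanes" rule can be checked with simple arithmetic: for a fixed vector size, the widest scalar element gives the fewest lanes per vector, and hence the most copies for a given VF. An illustrative sketch; the 128-bit figures match the examples in these commits, and the function names are hypothetical:

```c
/* Lanes per vector for a given scalar element width.  */
unsigned
lanes (unsigned vector_bits, unsigned elem_bits)
{
  return vector_bits / elem_bits;
}

/* ncopies implied for the reduction PHI when, as the patch does, the
   input vectype with the fewest lanes (widest scalar element among the
   two candidate element widths) is chosen.  */
unsigned
phi_ncopies (unsigned vf, unsigned vector_bits,
             unsigned elem_bits_a, unsigned elem_bits_b)
{
  unsigned widest = elem_bits_a > elem_bits_b ? elem_bits_a : elem_bits_b;
  return vf / lanes (vector_bits, widest);
}
```

With 128-bit vectors and VF 16, a char-input op has 16 lanes (1 copy) while a short-input op has 8 lanes (2 copies); the PHI records the 8-lane vectype.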
[gcc r15-1726] vect: Fix shift-by-induction for single-lane slp
https://gcc.gnu.org/g:1ff5f8f8a05dd57620a1e2abbf87bd511b113cce

commit r15-1726-g1ff5f8f8a05dd57620a1e2abbf87bd511b113cce
Author: Feng Xue
Date:   Wed Jun 26 22:02:53 2024 +0800

    vect: Fix shift-by-induction for single-lane slp

    Allow shift-by-induction for an slp node when it is single lane, which is
    aligned with the original loop-based handling.

    2024-06-26  Feng Xue

    gcc/
            * tree-vect-stmts.cc (vectorizable_shift): Allow shift-by-induction
            for single-lane slp node.

    gcc/testsuite/
            * gcc.dg/vect/vect-shift-6.c
            * gcc.dg/vect/vect-shift-7.c

Diff:
---
 gcc/testsuite/gcc.dg/vect/vect-shift-6.c | 52 ++++++++++++++++
 gcc/testsuite/gcc.dg/vect/vect-shift-7.c | 69 ++++++++++++++++++++
 gcc/tree-vect-stmts.cc                   |  2 +-
 3 files changed, 122 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/vect-shift-6.c b/gcc/testsuite/gcc.dg/vect/vect-shift-6.c
new file mode 100644
index 000..277093bc7bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-shift-6.c
@@ -0,0 +1,52 @@
+/* { dg-require-effective-target vect_shift } */
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include <stdlib.h>
+#include "tree-vect.h"
+
+#define N 32
+
+int32_t A[N];
+int32_t B[N];
+
+#define FN(name) \
+__attribute__((noipa)) \
+void name(int32_t *a) \
+{ \
+  for (int i = 0; i < N / 2; i++) \
+    { \
+      a[2 * i + 0] <<= i; \
+      a[2 * i + 1] <<= i; \
+    } \
+}
+
+
+FN(foo_vec)
+
+#pragma GCC push_options
+#pragma GCC optimize ("O0")
+FN(foo_novec)
+#pragma GCC pop_options
+
+int main ()
+{
+  int i;
+
+  check_vect ();
+
+#pragma GCC novector
+  for (i = 0; i < N; i++)
+    A[i] = B[i] = -(i + 1);
+
+  foo_vec(A);
+  foo_novec(B);
+
+  /* check results:  */
+#pragma GCC novector
+  for (i = 0; i < N; i++)
+    if (A[i] != B[i])
+      abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-shift-7.c b/gcc/testsuite/gcc.dg/vect/vect-shift-7.c
new file mode 100644
index 000..6de3f39a87f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-shift-7.c
@@ -0,0 +1,69 @@
+/* { dg-require-effective-target vect_shift } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "--param max-completely-peel-times=6" } */
+
+#include <stdint.h>
+#include <stdlib.h>
+#include "tree-vect.h"
+
+#define N 16
+#define M 16
+
+int32_t A[N];
+int32_t B[N];
+
+#define FN(name) \
+__attribute__((noipa)) \
+void name(int32_t *a, int m) \
+{ \
+  for (int i = 0; i < N / 2; i++) \
+    { \
+      int s1 = i; \
+      int s2 = s1 + 1; \
+      int32_t r1 = 0; \
+      int32_t r2 = 7; \
+      int32_t t1 = m; \
+      \
+      for (int j = 0; j < M; j++) \
+	{ \
+	  r1 += t1 << s1; \
+	  r2 += t1 << s2; \
+	  t1++; \
+	  s1++; \
+	  s2++; \
+	} \
+      \
+      a[2 * i + 0] = r1; \
+      a[2 * i + 1] = r2; \
+    } \
+}
+
+
+FN(foo_vec)
+
+#pragma GCC push_options
+#pragma GCC optimize ("O0")
+FN(foo_novec)
+#pragma GCC pop_options
+
+int main ()
+{
+  int i;
+
+  check_vect ();
+
+#pragma GCC novector
+  for (i = 0; i < N; i++)
+    A[i] = B[i] = 0;
+
+  foo_vec(A, 0);
+  foo_novec(B, 0);
+
+  /* check results:  */
+#pragma GCC novector
+  for (i = 0; i < N; i++)
+    if (A[i] != B[i])
+      abort ();
+
+  return 0;
+}
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 7b889f31645..aab3aa59962 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -6175,7 +6175,7 @@ vectorizable_shift (vec_info *vinfo,
   if ((dt[1] == vect_internal_def
        || dt[1] == vect_induction_def
        || dt[1] == vect_nested_cycle)
-      && !slp_node)
+      && (!slp_node || SLP_TREE_LANES (slp_node) == 1))
     scalar_shift_arg = false;
   else if (dt[1] == vect_constant_def
	   || dt[1] == vect_external_def
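What the relaxed check enables, in source form, is a shift whose amount is the loop induction variable itself, as in the new vect-shift-6.c test. A trimmed scalar version of that kernel:

```c
#include <stdint.h>

/* The shift amount i is a vect_induction_def; with this patch a
   single-lane slp node no longer forces a scalar shift argument.  */
void
shift_by_iv (int32_t *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i] <<= i;
}
```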
[gcc r15-1465] vect: Tighten an assertion for lane-reducing in transform
https://gcc.gnu.org/g:ecbc96bb2873e453b0bd33d602ce34ad0d9d9cfd

commit r15-1465-gecbc96bb2873e453b0bd33d602ce34ad0d9d9cfd
Author: Feng Xue
Date:   Sun Jun 16 13:33:52 2024 +0800

    vect: Tighten an assertion for lane-reducing in transform

    According to the logic of the code near the assertion, no lane-reducing
    operation should appear there, not just DOT_PROD_EXPR. Since
    "use_mask_by_cond_expr_p" treats SAD_EXPR the same as DOT_PROD_EXPR, and
    WIDEN_SUM_EXPR would not be allowed by the following assertion
    "gcc_assert (commutative_binary_op_p (...))", tighten the assertion.

    2024-06-16  Feng Xue

    gcc/
            * tree-vect-loop.cc (vect_transform_reduction): Change assertion
            to cover all lane-reducing ops.

Diff:
---
 gcc/tree-vect-loop.cc | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 1d60ac47e553..347dac97e497 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8618,7 +8618,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
     }

   bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
-  gcc_assert (single_defuse_cycle || lane_reducing_op_p (code));
+  bool lane_reducing = lane_reducing_op_p (code);
+  gcc_assert (single_defuse_cycle || lane_reducing);

   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
@@ -8674,8 +8675,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
       tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
       if (masked_loop_p && !mask_by_cond_expr)
	{
-	  /* No conditional ifns have been defined for dot-product yet.  */
-	  gcc_assert (code != DOT_PROD_EXPR);
+	  /* No conditional ifns have been defined for lane-reducing op
+	     yet.  */
+	  gcc_assert (!lane_reducing);

	  /* Make sure that the reduction accumulator is vop[0].  */
	  if (reduc_index == 1)
[gcc r15-1464] vect: Use an array to replace 3 relevant variables
https://gcc.gnu.org/g:b9c369d900ccfbd2271028611af3f08b5cf6f998

commit r15-1464-gb9c369d900ccfbd2271028611af3f08b5cf6f998
Author: Feng Xue
Date:   Sun Jun 16 13:21:13 2024 +0800

    vect: Use an array to replace 3 relevant variables

    It is better to place the 3 relevant independent variables into an array,
    since we need to access them via an index in a following patch. At the
    same time, this change makes some duplicated code more compact.

    2024-06-16  Feng Xue

    gcc/
            * tree-vect-loop.cc (vect_transform_reduction): Replace
            vec_oprnds0/1/2 with one new array variable vec_oprnds[3].

Diff:
---
 gcc/tree-vect-loop.cc | 43 ++-
 1 file changed, 18 insertions(+), 25 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 27f77ed8b0b6..1d60ac47e553 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8580,9 +8580,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,

   /* Transform.  */
   tree new_temp = NULL_TREE;
-  auto_vec<tree> vec_oprnds0;
-  auto_vec<tree> vec_oprnds1;
-  auto_vec<tree> vec_oprnds2;
+  auto_vec<tree> vec_oprnds[3];
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location, "transform reduction.\n");

@@ -8630,14 +8628,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
	 definition.  */
      if (!cond_fn_p)
	{
+	  gcc_assert (reduc_index >= 0 && reduc_index <= 2);
	  vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies,
			     single_defuse_cycle && reduc_index == 0
			     ? NULL_TREE : op.ops[0], &vec_oprnds[0],
			     single_defuse_cycle && reduc_index == 1
			     ? NULL_TREE : op.ops[1], &vec_oprnds[1],
			     op.num_ops == 3
			     && !(single_defuse_cycle && reduc_index == 2)
			     ? op.ops[2] : NULL_TREE, &vec_oprnds[2]);
	}
      else
	{
	  /* For a conditional operation pass the truth type as mask
	     vectype.  */
	  gcc_assert (single_defuse_cycle
		      && (reduc_index == 1 || reduc_index == 2));
-	  vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies,
-			     op.ops[0], truth_type_for (vectype_in), &vec_oprnds0,
+	  vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies, op.ops[0],
+			     truth_type_for (vectype_in), &vec_oprnds[0],
			     reduc_index == 1 ? NULL_TREE : op.ops[1],
-			     NULL_TREE, &vec_oprnds1,
+			     NULL_TREE, &vec_oprnds[1],
			     reduc_index == 2 ? NULL_TREE : op.ops[2],
-			     NULL_TREE, &vec_oprnds2);
+			     NULL_TREE, &vec_oprnds[2]);
	}

  /* For single def-use cycles get one copy of the vectorized reduction
     definition.  */
  if (single_defuse_cycle)
    {
      vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, 1,
-			 reduc_index == 0 ? op.ops[0] : NULL_TREE, &vec_oprnds0,
-			 reduc_index == 1 ? op.ops[1] : NULL_TREE, &vec_oprnds1,
+			 reduc_index == 0 ? op.ops[0] : NULL_TREE,
+			 &vec_oprnds[0],
+			 reduc_index == 1 ? op.ops[1] : NULL_TREE,
+			 &vec_oprnds[1],
			 reduc_index == 2 ? op.ops[2] : NULL_TREE,
-			 &vec_oprnds2);
+			 &vec_oprnds[2]);
    }

  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
+  unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();

-  unsigned num = (reduc_index == 0
-		  ? vec_oprnds1.length () : vec_oprnds0.length ());
  for (unsigned i = 0; i < num; ++i)
    {
      gimple *new_stmt;
-      tree vop[3] = { vec_oprnds0[i], vec_oprnds1[i], NULL_TREE };
+      tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
      if (masked_loop_p && !mask_by_cond_expr)
	{
	  /* No conditional ifns have been defined for dot-product yet.  */
@@ -8696,7 +8696,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
      else
	{
	  if (op.num_ops >= 3)
-	    vop[2] = vec_oprnds2[i];
+	    vop[2] = vec_oprnds[2][i];

	  if (masked_loop_p && mask_by_cond_expr)
	    {
@@ -8727,14 +8727,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
	}

      if (single_defuse_cycle && i < num - 1)
-	{
-	  if (reduc_index == 0)
-	    vec_oprnds0.safe_push (gimple_get_lhs (new_stmt));
-	  else if (reduc_index == 1)
-	    vec_oprnds1.safe_push (gimple_get_lhs (new_stmt));
-	  else if (reduc_index == 2)
-
[gcc r15-1463] vect: Use one reduction_type local variable
https://gcc.gnu.org/g:0726f1cde5459ccdbaa6af8c6904276a28d572ba

commit r15-1463-g0726f1cde5459ccdbaa6af8c6904276a28d572ba
Author: Feng Xue
Date:   Sun Jun 16 12:17:26 2024 +0800

    vect: Use one reduction_type local variable

    Two local variables were defined to refer to the same
    STMT_VINFO_REDUC_TYPE; it is better to keep only one.

    2024-06-16  Feng Xue

    gcc/
            * tree-vect-loop.cc (vectorizable_reduction): Remove v_reduc_type,
            and replace it with another local variable reduction_type.

Diff:
---
 gcc/tree-vect-loop.cc | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index aab408d1019d..27f77ed8b0b6 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7868,10 +7868,10 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (lane_reducing)
     STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in;

-  enum vect_reduction_type v_reduc_type = STMT_VINFO_REDUC_TYPE (phi_info);
-  STMT_VINFO_REDUC_TYPE (reduc_info) = v_reduc_type;
+  enum vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (phi_info);
+  STMT_VINFO_REDUC_TYPE (reduc_info) = reduction_type;
   /* If we have a condition reduction, see if we can simplify it further.  */
-  if (v_reduc_type == COND_REDUCTION)
+  if (reduction_type == COND_REDUCTION)
     {
       if (slp_node && SLP_TREE_LANES (slp_node) != 1)
	return false;
@@ -8038,7 +8038,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,

   STMT_VINFO_REDUC_CODE (reduc_info) = orig_code;

-  vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info);
+  reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info);
   if (reduction_type == TREE_CODE_REDUCTION)
     {
       /* Check whether it's ok to change the order of the computation.
[gcc r15-1462] vect: Remove duplicated check on reduction operand
https://gcc.gnu.org/g:a944e57506fc64b8eede79c2405ba0b498461f0b commit r15-1462-ga944e57506fc64b8eede79c2405ba0b498461f0b Author: Feng Xue Date: Sun Jun 16 12:08:56 2024 +0800 vect: Remove duplicated check on reduction operand In vectorizable_reduction, one check on a reduction operand via its index is subsumed by another check via pointer equality, so remove the former. 2024-06-16 Feng Xue gcc/ * tree-vect-loop.cc (vectorizable_reduction): Remove the duplicated check. Diff: --- gcc/tree-vect-loop.cc | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index eeb75c09e91a..aab408d1019d 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7815,11 +7815,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo, "use not simple.\n"); return false; } - if (i == STMT_VINFO_REDUC_IDX (stmt_info)) - continue; - /* For an IFN_COND_OP we might hit the reduction definition operand -twice (once as definition, once as else). */ + /* Skip reduction operands, and for an IFN_COND_OP we might hit the +reduction operand twice (once as definition, once as else). */ if (op.ops[i] == op.ops[STMT_VINFO_REDUC_IDX (stmt_info)]) continue;
[gcc r15-1461] vect: Add a function to check lane-reducing stmt
https://gcc.gnu.org/g:70466e6f9d9fb87f78ffe2e397ca876b380cb493 commit r15-1461-g70466e6f9d9fb87f78ffe2e397ca876b380cb493 Author: Feng Xue Date: Sat Jun 15 23:17:10 2024 +0800 vect: Add a function to check lane-reducing stmt Add a utility function to check if a statement is a lane-reducing operation, which could simplify some existing code. 2024-06-16 Feng Xue gcc/ * tree-vectorizer.h (lane_reducing_stmt_p): New function. * tree-vect-slp.cc (vect_analyze_slp): Use new function lane_reducing_stmt_p to check statement. Diff: --- gcc/tree-vect-slp.cc | 4 +--- gcc/tree-vectorizer.h | 12 ++++++++++++ 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 7d18b5bfee5d..a5665946a4eb 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -3919,7 +3919,6 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size) scalar_stmts.create (loop_vinfo->reductions.length ()); for (auto next_info : loop_vinfo->reductions) { - gassign *g; next_info = vect_stmt_to_vectorize (next_info); if ((STMT_VINFO_RELEVANT_P (next_info) || STMT_VINFO_LIVE_P (next_info)) @@ -3931,8 +3930,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size) { /* Do not discover SLP reductions combining lane-reducing ops, that will fail later. */ - if (!(g = dyn_cast <gassign *> (STMT_VINFO_STMT (next_info))) - || !lane_reducing_op_p (gimple_assign_rhs_code (g))) + if (!lane_reducing_stmt_p (STMT_VINFO_STMT (next_info))) scalar_stmts.quick_push (next_info); else { diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 6bb0f5c3a56f..60224f4e2847 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -2169,12 +2169,24 @@ vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo) && th >= vect_vf_for_cost (loop_vinfo)); } +/* Return true if CODE is a lane-reducing opcode.
*/ + inline bool lane_reducing_op_p (code_helper code) { return code == DOT_PROD_EXPR || code == WIDEN_SUM_EXPR || code == SAD_EXPR; } +/* Return true if STMT is a lane-reducing statement. */ + +inline bool +lane_reducing_stmt_p (gimple *stmt) +{ + if (auto *assign = dyn_cast <gassign *> (stmt)) +return lane_reducing_op_p (gimple_assign_rhs_code (assign)); + return false; +} + /* Source location + hotness information. */ extern dump_user_location_t vect_location;
[gcc r15-963] vect: Bind input vectype to lane-reducing operation
https://gcc.gnu.org/g:d53f555edb95248dbf81347ba5e4136e9a491eca commit r15-963-gd53f555edb95248dbf81347ba5e4136e9a491eca Author: Feng Xue Date: Wed May 29 16:41:57 2024 +0800 vect: Bind input vectype to lane-reducing operation The input vectype is an attribute of the lane-reducing operation rather than of the reduction PHI it is associated with, since there might be more than one lane-reducing operation, each with a different input type, in a loop reduction chain. So bind each lane-reducing operation to its own input vectype. 2024-05-29 Feng Xue gcc/ * tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Remove parameter loop_vinfo. Get input vectype from stmt_info instead of reduction PHI. (vect_model_reduction_cost): Remove loop_vinfo argument of call to vect_is_emulated_mixed_dot_prod. (vect_transform_reduction): Likewise. (vectorizable_reduction): Likewise, and bind input vectype to lane-reducing operation. Diff: --- gcc/tree-vect-loop.cc | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 7a6a6b6161d..5b85cffb37f 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -5270,8 +5270,7 @@ have_whole_vector_shift (machine_mode mode) See vect_emulate_mixed_dot_prod for the actual sequence used.
*/ static bool -vect_is_emulated_mixed_dot_prod (loop_vec_info loop_vinfo, -stmt_vec_info stmt_info) +vect_is_emulated_mixed_dot_prod (stmt_vec_info stmt_info) { gassign *assign = dyn_cast <gassign *> (stmt_info->stmt); if (!assign || gimple_assign_rhs_code (assign) != DOT_PROD_EXPR) @@ -5282,10 +5281,9 @@ vect_is_emulated_mixed_dot_prod (loop_vec_info loop_vinfo, if (TYPE_SIGN (TREE_TYPE (rhs1)) == TYPE_SIGN (TREE_TYPE (rhs2))) return false; - stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info); - gcc_assert (reduc_info->is_reduc_info); + gcc_assert (STMT_VINFO_REDUC_VECTYPE_IN (stmt_info)); return !directly_supported_p (DOT_PROD_EXPR, - STMT_VINFO_REDUC_VECTYPE_IN (reduc_info), + STMT_VINFO_REDUC_VECTYPE_IN (stmt_info), optab_vector_mixed_sign); } @@ -5324,8 +5322,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo, if (!gimple_extract_op (orig_stmt_info->stmt, &op)) gcc_unreachable (); - bool emulated_mixed_dot_prod -= vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info); + bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); + if (reduction_type == EXTRACT_LAST_REDUCTION) /* No extra instructions are needed in the prologue. The loop body operations are costed in vectorizable_condition. */ @@ -7837,6 +7835,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo, vectype_in = STMT_VINFO_VECTYPE (phi_info); STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in; + /* Each lane-reducing operation has its own input vectype, while reduction + PHI records the input vectype with least lanes. */ + if (lane_reducing) +STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in; + enum vect_reduction_type v_reduc_type = STMT_VINFO_REDUC_TYPE (phi_info); STMT_VINFO_REDUC_TYPE (reduc_info) = v_reduc_type; /* If we have a condition reduction, see if we can simplify it further.
*/ @@ -8363,7 +8366,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, if (single_defuse_cycle || lane_reducing) { int factor = 1; - if (vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info)) + if (vect_is_emulated_mixed_dot_prod (stmt_info)) /* Three dot-products and a subtraction. */ factor = 4; record_stmt_cost (cost_vec, ncopies * factor, vector_stmt, @@ -8615,8 +8618,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, : &vec_oprnds2)); } - bool emulated_mixed_dot_prod -= vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info); + bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); + FOR_EACH_VEC_ELT (vec_oprnds0, i, def0) { gimple *new_stmt;
[gcc r15-962] vect: Split out partial vect checking for reduction into a function
https://gcc.gnu.org/g:79c3547b8adfdfdb2a167c1b9c9428902510adab commit r15-962-g79c3547b8adfdfdb2a167c1b9c9428902510adab Author: Feng Xue Date: Wed May 29 13:45:09 2024 +0800 vect: Split out partial vect checking for reduction into a function Partial vectorization checking for vectorizable_reduction is a piece of relatively isolated code that may be reused elsewhere. Move the code into a new function for sharing. 2024-05-29 Feng Xue gcc/ * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): New function. (vectorizable_reduction): Move partial vectorization checking code to vect_reduction_update_partial_vector_usage. Diff: --- gcc/tree-vect-loop.cc | 137 -- 1 file changed, 77 insertions(+), 60 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index a42d79c7cbf..7a6a6b6161d 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7391,6 +7391,79 @@ build_vect_cond_expr (code_helper code, tree vop[3], tree mask, } } +/* Given an operation with CODE in the loop reduction path whose reduction PHI + is specified by REDUC_INFO, the operation has TYPE of scalar result, and its + input vectype is represented by VECTYPE_IN. The vectype of the vectorized + result may differ from VECTYPE_IN, either in base type or in lanes, as is + the case for lane-reducing operations. This function checks whether partial + vectorization of the operation is possible, and if so, how to perform it in + the context of LOOP_VINFO.
*/ + +static void +vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo, + stmt_vec_info reduc_info, + slp_tree slp_node, + code_helper code, tree type, + tree vectype_in) +{ + enum vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (reduc_info); + internal_fn reduc_fn = STMT_VINFO_REDUC_FN (reduc_info); + internal_fn cond_fn = get_conditional_internal_fn (code, type); + + if (reduc_type != FOLD_LEFT_REDUCTION + && !use_mask_by_cond_expr_p (code, cond_fn, vectype_in) + && (cond_fn == IFN_LAST + || !direct_internal_fn_supported_p (cond_fn, vectype_in, + OPTIMIZE_FOR_SPEED))) +{ + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, +"can't operate on partial vectors because" +" no conditional operation is available.\n"); + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; +} + else if (reduc_type == FOLD_LEFT_REDUCTION + && reduc_fn == IFN_LAST + && !expand_vec_cond_expr_p (vectype_in, truth_type_for (vectype_in), + SSA_NAME)) +{ + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't operate on partial vectors because" + " no conditional operation is available.\n"); + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; +} + else if (reduc_type == FOLD_LEFT_REDUCTION + && internal_fn_mask_index (reduc_fn) == -1 + && FLOAT_TYPE_P (vectype_in) + && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in)) +{ + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, +"can't operate on partial vectors because" +" signed zeros cannot be preserved.\n"); + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; +} + else +{ + internal_fn mask_reduc_fn + = get_masked_reduction_fn (reduc_fn, vectype_in); + vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); + vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); + unsigned nvectors; + + if (slp_node) + nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); + else + nvectors = vect_get_num_copies (loop_vinfo,
vectype_in); + + if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS) + vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1); + else + vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_in, NULL); +} +} + /* Function vectorizable_reduction. Check if STMT_INFO performs a reduction operation that can be vectorized. @@ -7456,7 +7529,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, bool single_defuse_cycle = false; bool nested_cycle = false; bool double_reduc = false; - int vec_num; tree cr_index_scalar_type = NULL_TREE, cr_index_vector_type = NULL_TREE; tree cond_reduc_val = NULL_TREE; @@ -8283,11 +8355,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, return false; } - if
[gcc r15-961] vect: Add a function to check lane-reducing code
https://gcc.gnu.org/g:c0f31701556c4162463f28bc0f03007f40a6176e commit r15-961-gc0f31701556c4162463f28bc0f03007f40a6176e Author: Feng Xue Date: Wed May 29 13:12:12 2024 +0800 vect: Add a function to check lane-reducing code Checking whether an operation is lane-reducing requires comparing its code against three tree codes (DOT_PROD_EXPR/WIDEN_SUM_EXPR/SAD_EXPR). Add a utility function to make the check handy and concise. 2024-05-29 Feng Xue gcc/ * tree-vectorizer.h (lane_reducing_op_p): New function. * tree-vect-slp.cc (vect_analyze_slp): Use new function lane_reducing_op_p to check statement code. * tree-vect-loop.cc (vect_transform_reduction): Likewise. (vectorizable_reduction): Likewise, and change name of a local variable that holds the result flag. Diff: --- gcc/tree-vect-loop.cc | 29 - gcc/tree-vect-slp.cc | 4 +--- gcc/tree-vectorizer.h | 6 ++ 3 files changed, 19 insertions(+), 20 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 04a9ac64df7..a42d79c7cbf 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7650,9 +7650,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, gimple_match_op op; if (!gimple_extract_op (stmt_info->stmt, &op)) gcc_unreachable (); - bool lane_reduc_code_p = (op.code == DOT_PROD_EXPR - || op.code == WIDEN_SUM_EXPR - || op.code == SAD_EXPR); + bool lane_reducing = lane_reducing_op_p (op.code); if (!POINTER_TYPE_P (op.type) && !INTEGRAL_TYPE_P (op.type) && !SCALAR_FLOAT_TYPE_P (op.type)) @@ -7664,7 +7662,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, /* For lane-reducing ops we're reducing the number of reduction PHIs which means the only use of that may be in the lane-reducing operation. */ - if (lane_reduc_code_p + if (lane_reducing && reduc_chain_length != 1 && !only_slp_reduc_chain) { @@ -7678,7 +7676,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, since we'll mix lanes belonging to different reductions.
But it's OK to use them in a reduction chain or when the reduction group has just one element. */ - if (lane_reduc_code_p + if (lane_reducing && slp_node && !REDUC_GROUP_FIRST_ELEMENT (stmt_info) && SLP_TREE_LANES (slp_node) > 1) @@ -7738,7 +7736,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, /* To properly compute ncopies we are interested in the widest non-reduction input type in case we're looking at a widening accumulation that we later handle in vect_transform_reduction. */ - if (lane_reduc_code_p + if (lane_reducing && vectype_op[i] && (!vectype_in || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in))) @@ -8211,7 +8209,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, && loop_vinfo->suggested_unroll_factor == 1) single_defuse_cycle = true; - if (single_defuse_cycle || lane_reduc_code_p) + if (single_defuse_cycle || lane_reducing) { gcc_assert (op.code != COND_EXPR); @@ -8227,7 +8225,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, mixed-sign dot-products can be implemented using signed dot-products. */ machine_mode vec_mode = TYPE_MODE (vectype_in); - if (!lane_reduc_code_p + if (!lane_reducing && !directly_supported_p (op.code, vectype_in, optab_vector)) { if (dump_enabled_p ()) @@ -8252,7 +8250,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, For the other cases try without the single cycle optimization. */ if (!ok) { - if (lane_reduc_code_p) + if (lane_reducing) return false; else single_defuse_cycle = false; @@ -8263,7 +8261,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, /* If the reduction stmt is one of the patterns that have lane reduction embedded we cannot handle the case of ! single_defuse_cycle. */ if ((ncopies > 1 && ! 
single_defuse_cycle) - && lane_reduc_code_p) + && lane_reducing) { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, @@ -8274,7 +8272,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, if (slp_node && !(!single_defuse_cycle - && !lane_reduc_code_p + && !lane_reducing && reduction_type != FOLD_LEFT_REDUCTION)) for (i = 0; i < (int) op.num_ops; i++) if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i])) @@ -8295,7 +8293,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, /* Cost the reduction op inside the loop if transformed via vect_transform_reduction. Otherwise this is costed by the separate vectorizable_* routines. */ - if
[gcc r15-904] Delete a file due to push error
https://gcc.gnu.org/g:b24b081113c696f4e523c8ae53fc3ab89c3b4e4d commit r15-904-gb24b081113c696f4e523c8ae53fc3ab89c3b4e4d Author: Feng Xue Date: Wed May 29 22:20:45 2024 +0800 Delete a file due to push error gcc/ * tree-vect-loop.c : Removed. Diff: --- gcc/tree-vect-loop.c | 0 1 file changed, 0 insertions(+), 0 deletions(-) diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c deleted file mode 100644 index e69de29bb2d..000
[gcc r15-903] vect: Unify bbs in loop_vec_info and bb_vec_info
https://gcc.gnu.org/g:9c747183efa555e45200523c162021e385511be5 commit r15-903-g9c747183efa555e45200523c162021e385511be5 Author: Feng Xue Date: Thu May 16 11:08:38 2024 +0800 vect: Unify bbs in loop_vec_info and bb_vec_info Both derived classes have their own "bbs" field, which have exactly same purpose of recording all basic blocks inside the corresponding vect region, while the fields are composed by different data type, one is normal array, the other is auto_vec. This difference causes some duplicated code even handling the same stuff, almost in tree-vect-patterns. One refinement is lifting this field into the base class "vec_info", and reset its value to the continuous memory area pointed by two old "bbs" in each constructor of derived classes. 2024-05-16 Feng Xue gcc/ * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Move initialization of bbs to explicit construction code. Adjust the definition of nbbs. (update_epilogue_loop_vinfo): Update nbbs for epilog vinfo. * tree-vect-patterns.cc (vect_determine_precisions): Make loop_vec_info and bb_vec_info share same code. (vect_pattern_recog): Remove duplicated vect_pattern_recog_1 loop. * tree-vect-slp.cc (vect_get_and_check_slp_defs): Access to bbs[0] via base vec_info class. (_bb_vec_info::_bb_vec_info): Initialize bbs and nbbs using data fields of input auto_vec<> bbs. (vect_slp_region): Use access to nbbs to replace original bbs.length(). (vect_schedule_slp_node): Access to bbs[0] via base vec_info class. * tree-vectorizer.cc (vec_info::vec_info): Add initialization of bbs and nbbs. (vec_info::insert_seq_on_entry): Access to bbs[0] via base vec_info class. * tree-vectorizer.h (vec_info): Add new fields bbs and nbbs. (LOOP_VINFO_NBBS): New macro. (BB_VINFO_BBS): Rename BB_VINFO_BB to BB_VINFO_BBS. (BB_VINFO_NBBS): New macro. (_loop_vec_info): Remove field bbs. (_bb_vec_info): Rename field bbs. 
Diff: --- gcc/tree-vect-loop.c | 0 gcc/tree-vect-loop.cc | 7 ++- gcc/tree-vect-patterns.cc | 142 +- gcc/tree-vect-slp.cc | 23 +--- gcc/tree-vectorizer.cc| 7 ++- gcc/tree-vectorizer.h | 23 6 files changed, 74 insertions(+), 128 deletions(-) diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c new file mode 100644 index 000..e69de29bb2d diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 3b94bb13a8b..04a9ac64df7 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -1028,7 +1028,6 @@ bb_in_loop_p (const_basic_block bb, const void *data) _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared) : vec_info (vec_info::loop, shared), loop (loop_in), -bbs (XCNEWVEC (basic_block, loop->num_nodes)), num_itersm1 (NULL_TREE), num_iters (NULL_TREE), num_iters_unchanged (NULL_TREE), @@ -1079,8 +1078,9 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared) case of the loop forms we allow, a dfs order of the BBs would the same as reversed postorder traversal, so we are safe. */ - unsigned int nbbs = dfs_enumerate_from (loop->header, 0, bb_in_loop_p, - bbs, loop->num_nodes, loop); + bbs = XCNEWVEC (basic_block, loop->num_nodes); + nbbs = dfs_enumerate_from (loop->header, 0, bb_in_loop_p, bbs, +loop->num_nodes, loop); gcc_assert (nbbs == loop->num_nodes); for (unsigned int i = 0; i < nbbs; i++) @@ -11667,6 +11667,7 @@ update_epilogue_loop_vinfo (class loop *epilogue, tree advance) free (LOOP_VINFO_BBS (epilogue_vinfo)); LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs; + LOOP_VINFO_NBBS (epilogue_vinfo) = epilogue->num_nodes; /* Advance data_reference's with the number of iterations of the previous loop and its prologue. 
*/ diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index 8929e5aa7f3..88e7e34d78d 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -6925,81 +6925,41 @@ vect_determine_stmt_precisions (vec_info *vinfo, stmt_vec_info stmt_info) void vect_determine_precisions (vec_info *vinfo) { + basic_block *bbs = vinfo->bbs; + unsigned int nbbs = vinfo->nbbs; + DUMP_VECT_SCOPE ("vect_determine_precisions"); - if (loop_vec_info loop_vinfo = dyn_cast (vinfo)) + for (unsigned int i = 0; i < nbbs; i++) { - class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); - basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); - unsigned int nbbs = loop->num_nodes; - - for (unsigned int i = 0; i < nbbs; i++) + basic_block bb = bbs[i]; + for (auto gsi
[gcc r15-863] vect: Use vect representative statement instead of original in pattern recog [PR115060]
https://gcc.gnu.org/g:a3aeff4ce95bd616a2108dc2363d9cbaba53b170 commit r15-863-ga3aeff4ce95bd616a2108dc2363d9cbaba53b170 Author: Feng Xue Date: Thu May 23 15:25:53 2024 +0800 vect: Use vect representative statement instead of original in pattern recog [PR115060] Some utility functions (such as vect_look_through_possible_promotion) that are meant to find a certain kind of direct or indirect definition SSA for a value may return the original SSA, not its pattern representative SSA, even when a pattern is involved. For example, a = (T1) patt_b; patt_b = (T2) c; // b = ... patt_c = not-a-cast; // c = ... Given 'a', the mentioned function will return 'c', instead of 'patt_c'. This subtlety would make pattern recog code that is unaware of it misuse the original instead of the new pattern statement, which is inconsistent with the processing logic of the pattern formation pass. This patch corrects the issue by forcing another utility function (vect_get_internal_def) to return the pattern statement information to the caller by default. 2024-05-23 Feng Xue gcc/ PR tree-optimization/115060 * tree-vect-patterns.cc (vect_get_internal_def): Return statement for vectorization. (vect_widened_op_tree): Call vect_get_internal_def instead of lookup_def to get statement information. (vect_recog_widen_abd_pattern): No need to call vect_stmt_to_vectorize.
Diff: --- gcc/tree-vect-patterns.cc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index a313dc64643..8929e5aa7f3 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -266,7 +266,7 @@ vect_get_internal_def (vec_info *vinfo, tree op) stmt_vec_info def_stmt_info = vinfo->lookup_def (op); if (def_stmt_info && STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_internal_def) -return def_stmt_info; +return vect_stmt_to_vectorize (def_stmt_info); return NULL; } @@ -655,7 +655,8 @@ vect_widened_op_tree (vec_info *vinfo, stmt_vec_info stmt_info, tree_code code, /* Recursively process the definition of the operand. */ stmt_vec_info def_stmt_info - = vinfo->lookup_def (this_unprom->op); + = vect_get_internal_def (vinfo, this_unprom->op); + nops = vect_widened_op_tree (vinfo, def_stmt_info, code, widened_code, shift_p, max_nops, this_unprom, common_type, @@ -1739,7 +1740,6 @@ vect_recog_widen_abd_pattern (vec_info *vinfo, stmt_vec_info stmt_vinfo, if (!abd_pattern_vinfo) return NULL; - abd_pattern_vinfo = vect_stmt_to_vectorize (abd_pattern_vinfo); gcall *abd_stmt = dyn_cast <gcall *> (STMT_VINFO_STMT (abd_pattern_vinfo)); if (!abd_stmt || !gimple_call_internal_p (abd_stmt)