[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 Feng Xue changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #10 from Feng Xue --- Fixed
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #9 from GCC Commits --- The master branch has been updated by Feng Xue : https://gcc.gnu.org/g:57f611604e8bab67af6c0bcfe6ea88c001408412 commit r14-7272-g57f611604e8bab67af6c0bcfe6ea88c001408412 Author: Feng Xue Date: Thu Dec 28 16:55:39 2023 +0800 Do not count unused scalar use when marking STMT_VINFO_LIVE_P [PR113091] When pattern recognition is involved, a statement whose definition is consumed in some pattern, may not be included in the final replacement pattern statements, and would be skipped when building SLP graph. * Original char a_c = *(char *) a; char b_c = *(char *) b; unsigned short a_s = (unsigned short) a_c; int a_i = (int) a_s; int b_i = (int) b_c; int r_i = a_i - b_i; * After pattern replacement a_s = (unsigned short) a_c; a_i = (int) a_s; patt_b_s = (unsigned short) b_c;// b_i = (int) b_c patt_b_i = (int) patt_b_s; // b_i = (int) b_c patt_r_s = widen_minus(a_c, b_c); // r_i = a_i - b_i patt_r_i = (int) patt_r_s; // r_i = a_i - b_i The definitions of a_i(original statement) and b_i(pattern statement) are related to, but actually not part of widen_minus pattern. Vectorizing the pattern does not cause these definition statements to be marked as PURE_SLP. For this case, we need to recursively check whether their uses are all absorbed into vectorized code. But there is an exception that some use may participate in an vectorized operation via an external SLP node containing that use as an element. gcc/ChangeLog: PR tree-optimization/113091 * tree-vect-slp.cc (vect_slp_has_scalar_use): New function. (vect_bb_slp_mark_live_stmts): New parameter scalar_use_map, check scalar use with new function. (vect_bb_slp_mark_live_stmts): New function as entry to existing overriden functions with same name. (vect_slp_analyze_operations): Call new entry function to mark live statements. gcc/testsuite/ChangeLog: * gcc.target/aarch64/bb-slp-pr113091.c: New test.
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #8 from Feng Xue --- https://gcc.gnu.org/pipermail/gcc-patches/2023-December/641547.html
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #7 from Feng Xue --- > The issue here is that because the "outer" pattern consumes > patt_64 = (int) patt_63 it should have adjusted _2 = (int) _1 > stmt-to-vectorize > as being the outer pattern root stmt for all this logic to work correctly. We could not simply make this adjustment since pattern recognition does not require replaced SSA to be of single-use. If we change the above case to attach another scalar use to "_2" as: int foo(char *a, char *b) { unsigned array[8]; int a0 = a[0]; // _2 = (int) _1; array[0] = (a0 - b[0]); array[1] = (a[1] - b[1]); array[2] = (a[2] - b[2]); array[3] = (a[3] - b[3]); array[4] = (a[4] - b[4]); array[5] = (a[5] - b[5]); array[6] = (a[6] - b[6]); array[7] = (a[7] - b[7]); return test(array) + a0; } The pattern statement "patt_64 = (int) patt_63" for "_2 = (int) _1" should be kept. So we also need the check of "identifying whether a scalar stmt takes part in vectorization or not" to ensure the adjustment is doable.
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #6 from Feng Xue --- (In reply to Richard Sandiford from comment #5) > > The issue here is that because the "outer" pattern consumes > > patt_64 = (int) patt_63 it should have adjusted _2 = (int) _1 > > stmt-to-vectorize > > as being the outer pattern root stmt for all this logic to work correctly. > > I don't think it can though, at least not in general. The final pattern > stmt has to compute the same value as the original scalar stmt. Could current pattern replacement support N:1 mapping (N stmts -> 1 pattern)? If not, probably this handing would break related code somewhere.
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #5 from Richard Sandiford --- > The issue here is that because the "outer" pattern consumes > patt_64 = (int) patt_63 it should have adjusted _2 = (int) _1 > stmt-to-vectorize > as being the outer pattern root stmt for all this logic to work correctly. I don't think it can though, at least not in general. The final pattern stmt has to compute the same value as the original scalar stmt.
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 Richard Biener changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2023-12-21 Status|UNCONFIRMED |NEW --- Comment #4 from Richard Biener --- "The use stmt is "_2 = (int) _1", whose pattern statement is "patt_64 = (int) patt_63", which is not referenced by any original or other pattern statements. Or in other word, the orig_stmt could be absorbed into a vector operation, without any outlier scalar use." That means the code sees that _2 = (int) _1 isn't vectorized (the pattern stmt isn't actually used) which means _2 = (int) _1 stays in the code and thus _1 is live. The issue here is that because the "outer" pattern consumes patt_64 = (int) patt_63 it should have adjusted _2 = (int) _1 stmt-to-vectorize as being the outer pattern root stmt for all this logic to work correctly. Otherwise we have no means of identifying whether a scalar stmt takes part in vectorization or not. I'm not sure what restrictions we place on pattern recognition of patterns - do we require single-uses or do we allow the situation that one vectorization path picks up the "inner" pattern while another picks the "outer" one? In theory we can hack up the liveness analysis but as you noticed that isn't the part doing the costing. The costing part is just written in the very same way (vect_bb_vectorization_profitable_p, specifically vect_slp_gather_vectorized_scalar_stmts and vect_bb_slp_scalar_cost). Basically the scalar cost is the cost of the scalar stmts that are fully replaced (can be DCEd after vectorization) by the vector stmts.
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #3 from Feng Xue --- The function vect_bb_vectorization_profitable_p resorts to a recursive way to identify scalar use, for this case, setting STMT_VINFO_LIVE_P or not would not change scalar cost computation.
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #2 from Feng Xue --- (In reply to Richard Biener from comment #1) > It's the logic > > FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info) > { > if (svisited.contains (stmt_info)) > continue; > stmt_vec_info orig_stmt_info = vect_orig_stmt (stmt_info); > if (STMT_VINFO_IN_PATTERN_P (orig_stmt_info) > && STMT_VINFO_RELATED_STMT (orig_stmt_info) != stmt_info) > /* Only the pattern root stmt computes the original scalar value. */ > continue; > bool mark_visited = true; > gimple *orig_stmt = orig_stmt_info->stmt; > ssa_op_iter op_iter; > def_operand_p def_p; > FOR_EACH_PHI_OR_STMT_DEF (def_p, orig_stmt, op_iter, SSA_OP_DEF) > { > imm_use_iterator use_iter; > gimple *use_stmt; > stmt_vec_info use_stmt_info; > FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, DEF_FROM_PTR (def_p)) > if (!is_gimple_debug (use_stmt)) > { > use_stmt_info = bb_vinfo->lookup_stmt (use_stmt); > if (!use_stmt_info > || !PURE_SLP_STMT (vect_stmt_to_vectorize > (use_stmt_info))) > { > STMT_VINFO_LIVE_P (stmt_info) = true; > > specifically the last check. That's supposed to pick up the "main" pattern > that's now covering the scalar stmt. > > But somehow the "main" pattern, > > patt_67 = .VEC_WIDEN_MINUS (_1, _3); // _5 = _2 - _4; > patt_68 = (signed short) patt_67; // _5 = _2 - _4; > patt_69 = (int) patt_68; // _5 = _2 - _4; > > doesn't get picked up here. I wonder what's the orig_stmt and the def > picked and what original scalar use we end up in where the > vect_stmt_to_vectorize isn't the "last" pattern. Maybe we really want This problem happens at slp node: note: node 0x425bc38 (max_nunits=8, refcnt=1) vector(8) char note: op template: _1 = *a_50(D); note: stmt 0 _1 = *a_50(D); note: stmt 1 _7 = MEM[(char *)a_50(D) + 1B]; note: stmt 2 _13 = MEM[(char *)a_50(D) + 2B]; note: stmt 3 _19 = MEM[(char *)a_50(D) + 3B]; note: stmt 4 _25 = MEM[(char *)a_50(D) + 4B]; note: stmt 5 _31 = MEM[(char *)a_50(D) + 5B]; note: stmt 6 _37 = MEM[(char *)a_50(D) + 6B]; note: stmt 7 _43 = MEM[(char *)a_50(D) + 7B]; The orig_stmt is "_1 = *a_50(D)" The use stmt is "_2 = (int) _1", whose pattern statement is "patt_64 = (int) patt_63", which is not referenced by any original or other pattern statements. Or in other word, the orig_stmt could be absorbed into a vector operation, without any outlier scalar use. The fore-mentioned "last check" in vect_bb_slp_mark_live_stmts would make the orig_stmt to be STMT_VINFO_LIVE_P, which actually implies it has scalar use (though it should not have), the difference is re-generating the def somewhere, rather than retaining the original scalar statement. And the following "vectorizable_live_operation" would account the new operations into vectorization cost of the SLP instance. The function vect_bb_vectorization_profitable_p resorts to a recursive way to identify scalar use, for this case, setting STMT_VINFO_LIVE_P or not would change scalar cost computation. If we can avoid such fake-liveness adjustment on the statements we are interested in, vectorization cost could beat scalar cost, and make the former succeed. Unvectorized: mov x2, x0 stp x29, x30, [sp, -48]! mov x29, sp ldrbw3, [x1] ldrbw4, [x1, 1] add x0, sp, 16 ldrbw9, [x2] ldrbw8, [x2, 1] sub w9, w9, w3 ldrbw7, [x2, 2] ldrbw3, [x1, 2] sub w8, w8, w4 ldrbw6, [x2, 3] ldrbw4, [x1, 3] sub w7, w7, w3 ldrbw10, [x1, 5] ldrbw3, [x1, 4] sub w6, w6, w4 ldrbw5, [x2, 4] ldrbw4, [x2, 5] sub w5, w5, w3 ldrbw3, [x2, 6] sub w4, w4, w10 ldrbw2, [x2, 7] ldrbw10, [x1, 6] ldrbw1, [x1, 7] sub w3, w3, w10 stp w9, w8, [sp, 16] sub w1, w2, w1 stp w7, w6, [sp, 24] stp w5, w4, [sp, 32] stp w3, w1, [sp, 40] bl test ldp x29, x30, [sp], 48 ret Vectorized: mov x2, x0 stp x29, x30, [sp, -48]! mov x29, sp ldr d1, [x1] add x0, sp, 16 ldr d0, [x2] usubl v0.8h, v0.8b, v1.8b sxtlv1.4s, v0.4h sxtl2 v0.4s, v0.8h stp q1, q0, [sp, 16] bl test ldp x29, x30, [sp], 48 ret > these "overlapping" patterns, but IMHO having "two entries" into > a chain of scalar stmts is bad and we should link up the whole matched > sequence to the final "root" instead? > > That said, the
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 Richard Biener changed: What|Removed |Added CC||rguenth at gcc dot gnu.org, ||rsandifo at gcc dot gnu.org --- Comment #1 from Richard Biener --- It's the logic FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info) { if (svisited.contains (stmt_info)) continue; stmt_vec_info orig_stmt_info = vect_orig_stmt (stmt_info); if (STMT_VINFO_IN_PATTERN_P (orig_stmt_info) && STMT_VINFO_RELATED_STMT (orig_stmt_info) != stmt_info) /* Only the pattern root stmt computes the original scalar value. */ continue; bool mark_visited = true; gimple *orig_stmt = orig_stmt_info->stmt; ssa_op_iter op_iter; def_operand_p def_p; FOR_EACH_PHI_OR_STMT_DEF (def_p, orig_stmt, op_iter, SSA_OP_DEF) { imm_use_iterator use_iter; gimple *use_stmt; stmt_vec_info use_stmt_info; FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, DEF_FROM_PTR (def_p)) if (!is_gimple_debug (use_stmt)) { use_stmt_info = bb_vinfo->lookup_stmt (use_stmt); if (!use_stmt_info || !PURE_SLP_STMT (vect_stmt_to_vectorize (use_stmt_info))) { STMT_VINFO_LIVE_P (stmt_info) = true; specifically the last check. That's supposed to pick up the "main" pattern that's now covering the scalar stmt. But somehow the "main" pattern, patt_67 = .VEC_WIDEN_MINUS (_1, _3); // _5 = _2 - _4; patt_68 = (signed short) patt_67; // _5 = _2 - _4; patt_69 = (int) patt_68; // _5 = _2 - _4; doesn't get picked up here. I wonder what's the orig_stmt and the def picked and what original scalar use we end up in where the vect_stmt_to_vectorize isn't the "last" pattern. Maybe we really want these "overlapping" patterns, but IMHO having "two entries" into a chain of scalar stmts is bad and we should link up the whole matched sequence to the final "root" instead? That said, the current code doesn't see that wherever we end up isn't dead code (aka fully covered by the vectorization). IMO vect_stmt_to_vectorize for each of those stmts should end up at patt_69 = (int) patt_68;