On Thu, Aug 22, 2019 at 3:09 PM Kewen.Lin <li...@linux.ibm.com> wrote: > > Hi Bin, > > on 2019/8/22 下午1:46, Bin.Cheng wrote: > > On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin <li...@linux.ibm.com> wrote: > >> > >> Hi Bin, > >> > >> Thanks for your time! > >> > >> on 2019/8/21 下午8:32, Bin.Cheng wrote: > >>> On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <li...@linux.ibm.com> wrote: > >>>> > >>>> Hi! > >>>> > >>>> Comparing to the previous versions of implementation mainly based on the > >>>> existing IV cands but zeroing the related group/use cost, this new one > >>>> is based > >>>> on Richard and Segher's suggestion introducing one doloop dedicated IV > >>>> cand. > >>>> > >>>> Some key points are listed below: > >>>> 1) New field doloop_p in struct iv_cand to indicate doloop dedicated > >>>> IV cand. > >>>> 2) Special name "doloop" assigned. > >>>> 3) Doloop IV cand with form (niter+1, +, -1) > >>>> 4) For doloop IV cand, no extra one cost like BIV, assign zero cost > >>>> for step. > >>>> 5) Support may_be_zero (regressed PR is in this case), the base of > >>>> doloop IV > >>>> can be COND_EXPR, add handlings in cand_value_at and > >>>> may_eliminate_iv. > >>>> 6) Add more expr support in force_expr_to_var_cost for reasonable cost > >>>> calculation on the IV base with may_be_zero (like COND_EXPR). > >>>> 7) Set zero cost when using doloop IV cand for doloop use. > >>>> 8) Add three hooks (should we merge _generic and _address?). > >>>> *) have_count_reg_decr_p, is to indicate the target has special > >>>> hardware > >>>> count register, we shouldn't consider the impact of doloop IV when > >>>> calculating register pressures. > >>>> *) doloop_cost_for_generic, is the extra cost when using doloop IV > >>>> cand for > >>>> generic type IV use. > >>>> *) doloop_cost_for_address, is the extra cost when using doloop IV > >>>> cand for > >>>> address type IV use. > >>> What will happen if doloop IV cand be used for generic/address type iv > >>> use? Can RTL doloop can still perform doloop optimization in this > >>> case? > >>> > >> > >> On Power, we put the iteration count into hardware count register, it > >> takes very > >> high cost to move the count to GPR, so the cost is set as INF to make it > >> impossible > >> to use it for generic/address type iv use. But as some discussion before, > >> on some > >> targets using GPR instead of hardware count register, they probably want > >> to use this > >> doloop iv used for other uses if profitable. These two hooks offer the > >> possibility. > >> In that case, I think RTL doloop can still perform since it can still get > >> the > >> pattern and transform. The generic/address uses can still use it. > >>>> > >>>> Bootstrapped on powerpc64le-linux-gnu and regression testing passed > >>>> excepting > >>>> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is > >>>> tracked > >>>> by PR89983. > >>>> > >>>> Any comments and suggestions are highly appreciated. Thanks! > >>> Not sure if I understand the patch correctly, some comments embedded. > >>> > >>> + /* The number of doloop candidate in the set. */ > >>> + unsigned n_doloop_cands; > >>> + > >>> This is unnecessary. See below comments. > >>> > >>> - add_candidate_1 (data, base, step, important, > >>> - IP_NORMAL, use, NULL, orig_iv); > >>> + add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, > >>> doloop, > >>> + orig_iv); > >>> if (ip_end_pos (data->current_loop) > >>> && allow_ip_end_pos_p (data->current_loop)) > >>> - add_candidate_1 (data, base, step, important, IP_END, use, NULL, > >>> orig_iv); > >>> + add_candidate_1 (data, base, step, important, IP_END, use, NULL, > >>> doloop, > >>> + orig_iv); > >>> Do we need to skip ip_end_pos case for doloop candidate? Because the > >>> candidate increment will be inserted in latch, i.e, increment position > >>> is after exit condition. > >>> > >> > >> Yes, we should skip it. Currently function find_doloop_use has the check > >> on an > >> empty latch and gimple_cond to latch, partially excluding it. But it's > >> still good > >> to guard it directly here. > >> > >>> - tree_to_aff_combination (iv->base, type, val); > >>> + tree base = iv->base; > >>> + /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to > >>> extract > >>> + the value under !may_be_zero to get the compact bound which also > >>> well fits > >>> + for may_be_zero since we ensure the value for it is const one. */ > >>> + if (cand->doloop_p && desc->may_be_zero && !integer_zerop > >>> (desc->may_be_zero)) > >>> + base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter), > >>> + unshare_expr (rewrite_to_non_trapping_overflow > >>> (niter)), > >>> + build_int_cst (TREE_TYPE (niter), 1)); > >>> + tree_to_aff_combination (base, type, val); > >>> I don't quite follow here. The iv->base is computed from niter, I > >>> suppose compact bound is for cheaper candidate initialization? Why > >>> it's possible to extract !may_be_zero niter for may_be_zero here? The > >>> niter under !may_be_zero has no indication about the real niter under > >>> may_be_zero. > >>> > >> > >> As you note below, the cand_value for doloop would be zero, but for the > >> case > >> may_be_zero set, the current calculation would take care of the whole niter > >> expression including the cond_expr introduced by may_be_zero check, it's > >> unexpected. The purpose is to use the value under condition !may_be_zero > >> for the calculation, and yes, to get expected zero finally. > >> > >>> - cand_value_at (loop, cand, use->stmt, desc->niter, &bnd); > >>> + cand_value_at (loop, cand, use->stmt, desc, &bnd); > >>> If I understand correctly, doloop use/cand will only be > >>> identified/added for single exit loop, and there will be only one > >>> cond(doloop) iv_use and only one doloop cand for doloop loop. So the > >>> cand_value at niter at use position would be 0. If that's the case, > >>> we can skip calling cand_value_at here for doloop cand. The change to > >>> cand_value_at would be unnecessary neither. > >>> > >> > >> Exactly, I'll add the early return with zero bound for doloop. > >> > >>> - expensive. */ > >>> - if (!integer_zerop (desc->may_be_zero)) > >>> + expensive. > >>> + > >>> + For doloop candidate, we have considered MAY_BE_ZERO for IV base, > >>> need to > >>> + support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check. */ > >>> + if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p) > >>> return iv_elimination_compare_lt (data, cand, comp, desc); > >>> And we can early return before this? > >>> > >> > >> OK. > >> > >>> + if (may_be_zero) > >>> + { > >>> + if (COMPARISON_CLASS_P (may_be_zero)) > >>> + { > >>> + niter = fold_build3 (COND_EXPR, ntype, may_be_zero, > >>> + build_int_cst (ntype, 0), > >>> + rewrite_to_non_trapping_overflow (niter)); > >>> + } > >>> + /* Don't try to obtain the iteration count expression when > >>> may_be_zero is > >>> + integer_nonzerop (actually iteration count is one) or else. */ > >>> + else > >>> + return; > >>> + } > >>> + > >>> + tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter), > >>> + build_int_cst (ntype, 1)); > >>> niter is the number of latch executions, so niter + 1 could wrap here, > >>> but guess it's not a problem the similar issue is not handled in > >>> vectorizer neither. > >>> > >> > >> OK. > >> > >>> + unsigned n_old = data->regs_used, n_spr_for_doloop = 0; > >>> + /* If target supports count register for doloop, it doesn't take GPR. > >>> */ > >>> + if (targetm.have_count_reg_decr_p) > >>> + n_spr_for_doloop = n_doloop_cands; > >>> + unsigned n_new = n_invs + n_cands - n_spr_for_doloop; > >>> Not necessary. See below. > >> > >>> - cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands); > >>> + cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands, > >>> + ivs->n_doloop_cands); > >>> Also. > >>> > >>> ivs->n_cands--; > >>> + if (cp->cand->doloop_p) > >>> + ivs->n_doloop_cands--; > >>> > >>> ivs->n_cands++; > >>> + if (cp->cand->doloop_p) > >>> + ivs->n_doloop_cands++; > >>> You can just book n_cands under condition !cp->cand->doloop_p. > >> > >> If my understanding is correct, you are suggesting the code like: > >> > >> if (!cp->cand->doloop_p) > >> ivs->n_cands++; > >> > >> But I'm afraid that it can NOT satisfy the need in function > >> ivopts_estimate_reg_pressure. As the comments, "if target supports > >> count register for doloop it doesn't take GPR.". If we make doloop > >> cand invisible in n_cands, it's fine for target with count register, > >> but we may miss to count them on targets without count register. > > Why not one more step do checks: > > if (!cp->cand->doloop_p || !targetm.have_count_reg_decr_p) > > ivs->n_cands++; > > > > Yes, it works. Thanks! > > The new patch addressing the comments is attached. > Could you please have a look again? Thanks in advance! Thanks for working on this. A bit more nit-pickings.
- add_candidate_1 (data, base, step, important, - IP_NORMAL, use, NULL, orig_iv); - if (ip_end_pos (data->current_loop) + add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop, + orig_iv); + if (!doloop && ip_end_pos (data->current_loop) Could you add some comments elaborating why ip_end_pos candidate shouldn't be added for doloop case? Because the increment position is wrong. Also if you make doloop the last default parameter of add_candidate_1, you can save more unnecessary changes to calls to add_candidate? - cost = get_computation_cost (data, use, cand, false, - &inv_vars, NULL, &inv_expr); + { + cost = get_computation_cost (data, use, cand, false, &inv_vars, NULL, + &inv_expr); + if (cand->doloop_p) + cost += targetm.doloop_cost_for_generic; + } This adjustment cost = get_computation_cost (data, use, cand, true, &inv_vars, &can_autoinc, &inv_expr); + if (cand->doloop_p) + cost += targetm.doloop_cost_for_address; + and this adjustment can be moved into get_computation_cost where all cost adjustments are done. + /* For doloop candidate/use pair, adjust to zero cost. */ + if (group->doloop_p && cand->doloop_p) + cost = no_cost; Note above code handles comparing against zero case and decreases the cost by one (which prefers the same kind candidate as doloop one), it's very possible to have -1 cost for doloop cand here. how about just set to no_cost if it's positive? Your call. +/* For the targets which support doloop, to predict whether later RTL doloop + transformation will perform on this loop, further detect the doloop use and + mark the flag doloop_use_p if predicted. */ + +void +predict_and_process_doloop (struct ivopts_data *data) A better name here? Sorry I don't have another candidate in mind... + data->doloop_use_p = false; This can be moved to the beginning of above 'predict_and_process_doloop' function. Lastly, could you please add some brief description/comment about doloop handling as a subsection in the file head comment? Otherwise, the ivopt changes look good to me. Thanks, bin > > > Kewen > > --------- > > gcc/ChangeLog > > 2019-08-22 Kewen Lin <li...@gcc.gnu.org> > > PR middle-end/80791 > * config/rs6000/rs6000.c (TARGET_HAVE_COUNT_REG_DECR_P): New macro. > (TARGET_DOLOOP_COST_FOR_GENERIC): Likewise. > (TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise. > * target.def (have_count_reg_decr_p): New hook. > (doloop_cost_for_generic): Likewise. > (doloop_cost_for_address): Likewise. > * doc/tm.texi.in (TARGET_HAVE_COUNT_REG_DECR_P): Likewise. > (TARGET_DOLOOP_COST_FOR_GENERIC): Likewise. > (TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise. > * doc/tm.texi: Regenerate. > * tree-ssa-loop-ivopts.c (comp_cost::operator+=): Consider infinite > cost > addend. > (record_group): Init doloop_p. > (add_candidate_1): Add optional argument doloop, change the handlings > accordingly. > (add_candidate): Likewise. > (add_iv_candidate_for_biv): Update the call to add_candidate. > (generic_predict_doloop_p): Update attribute. > (force_expr_to_var_cost): Add costing for expressions > COND_EXPR/LT_EXPR/ > LE_EXPR/GT_EXPR/GE_EXPR/EQ_EXPR/NE_EXPR/UNORDERED_EXPR/ORDERED_EXPR/ > UNLT_EXPR/UNLE_EXPR/UNGT_EXPR/UNGE_EXPR/UNEQ_EXPR/LTGT_EXPR/MAX_EXPR/ > MIN_EXPR. > (determine_group_iv_cost_generic): Update for doloop IV cand. > (determine_group_iv_cost_address): Likewise. > (determine_group_iv_cost_cond): Likewise. > (determine_iv_cost): Likewise. > (ivopts_estimate_reg_pressure): Likewise. > (may_eliminate_iv): Likewise. > (add_iv_candidate_for_doloop): New function. > (find_iv_candidates): Call function add_iv_candidate_for_doloop. > (iv_ca_set_no_cp): Update for doloop IV cand. > (iv_ca_set_cp): Likewise. > (iv_ca_dump): Dump register cost. > (find_doloop_use): New function. > (predict_and_process_doloop): Likewise. > (tree_ssa_iv_optimize_loop): Call function predict_and_process_doloop. > > gcc/testsuite/ChangeLog > > 2019-08-22 Kewen Lin <li...@gcc.gnu.org> > > PR middle-end/80791 > * gcc.dg/tree-ssa/ivopts-3.c: Adjust for doloop change. > * gcc.dg/tree-ssa/ivopts-lt.c: Likewise. > * gcc.dg/tree-ssa/pr32044.c: Likewise. >