Re: [PATCH v6 3/3] PR80791 Consider doloop cmp use in ivopts

Bin.Cheng Thu, 22 Aug 2019 19:19:48 -0700

On Thu, Aug 22, 2019 at 3:09 PM Kewen.Lin <li...@linux.ibm.com> wrote:
>
> Hi Bin,
>
> on 2019/8/22 下午1:46, Bin.Cheng wrote:
> > On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin <li...@linux.ibm.com> wrote:
> >>
> >> Hi Bin,
> >>
> >> Thanks for your time!
> >>
> >> on 2019/8/21 下午8:32, Bin.Cheng wrote:
> >>> On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <li...@linux.ibm.com> wrote:
> >>>>
> >>>> Hi!
> >>>>
> >>>> Comparing to the previous versions of implementation mainly based on the
> >>>> existing IV cands but zeroing the related group/use cost, this new one 
> >>>> is based
> >>>> on Richard and Segher's suggestion introducing one doloop dedicated IV 
> >>>> cand.
> >>>>
> >>>> Some key points are listed below:
> >>>>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated 
> >>>> IV cand.
> >>>>   2) Special name "doloop" assigned.
> >>>>   3) Doloop IV cand with form (niter+1, +, -1)
> >>>>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost 
> >>>> for step.
> >>>>   5) Support may_be_zero (regressed PR is in this case), the base of 
> >>>> doloop IV
> >>>>      can be COND_EXPR, add handlings in cand_value_at and 
> >>>> may_eliminate_iv.
> >>>>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
> >>>>      calculation on the IV base with may_be_zero (like COND_EXPR).
> >>>>   7) Set zero cost when using doloop IV cand for doloop use.
> >>>>   8) Add three hooks (should we merge _generic and _address?).
> >>>>     *) have_count_reg_decr_p, is to indicate the target has special 
> >>>> hardware
> >>>>        count register, we shouldn't consider the impact of doloop IV when
> >>>>        calculating register pressures.
> >>>>     *) doloop_cost_for_generic, is the extra cost when using doloop IV 
> >>>> cand for
> >>>>        generic type IV use.
> >>>>     *) doloop_cost_for_address, is the extra cost when using doloop IV 
> >>>> cand for
> >>>>        address type IV use.
> >>> What will happen if doloop IV cand be used for generic/address type iv
> >>> use?  Can RTL doloop can still perform doloop optimization in this
> >>> case?
> >>>
> >>
> >> On Power, we put the iteration count into hardware count register, it 
> >> takes very
> >> high cost to move the count to GPR, so the cost is set as INF to make it 
> >> impossible
> >> to use it for generic/address type iv use.  But as some discussion before, 
> >> on some
> >> targets using GPR instead of hardware count register, they probably want 
> >> to use this
> >> doloop iv used for other uses if profitable.  These two hooks offer the 
> >> possibility.
> >> In that case, I think RTL doloop can still perform since it can still get 
> >> the
> >> pattern and transform.  The generic/address uses can still use it.
> >>>>
> >>>> Bootstrapped on powerpc64le-linux-gnu and regression testing passed 
> >>>> excepting
> >>>> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is 
> >>>> tracked
> >>>> by PR89983.
> >>>>
> >>>> Any comments and suggestions are highly appreciated.  Thanks!
> >>> Not sure if I understand the patch correctly, some comments embedded.
> >>>
> >>> +  /* The number of doloop candidate in the set.  */
> >>> +  unsigned n_doloop_cands;
> >>> +
> >>> This is unnecessary.  See below comments.
> >>>
> >>> -    add_candidate_1 (data, base, step, important,
> >>> -                    IP_NORMAL, use, NULL, orig_iv);
> >>> +    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, 
> >>> doloop,
> >>> +                    orig_iv);
> >>>    if (ip_end_pos (data->current_loop)
> >>>        && allow_ip_end_pos_p (data->current_loop))
> >>> -    add_candidate_1 (data, base, step, important, IP_END, use, NULL, 
> >>> orig_iv);
> >>> +    add_candidate_1 (data, base, step, important, IP_END, use, NULL, 
> >>> doloop,
> >>> +                    orig_iv);
> >>> Do we need to skip ip_end_pos case for doloop candidate?  Because the
> >>> candidate increment will be inserted in latch, i.e, increment position
> >>> is after exit condition.
> >>>
> >>
> >> Yes, we should skip it.  Currently function find_doloop_use has the check 
> >> on an
> >> empty latch and gimple_cond to latch, partially excluding it.  But it's 
> >> still good
> >> to guard it directly here.
> >>
> >>> -  tree_to_aff_combination (iv->base, type, val);
> >>> +  tree base = iv->base;
> >>> +  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to 
> >>> extract
> >>> +     the value under !may_be_zero to get the compact bound which also 
> >>> well fits
> >>> +     for may_be_zero since we ensure the value for it is const one.  */
> >>> +  if (cand->doloop_p && desc->may_be_zero && !integer_zerop
> >>> (desc->may_be_zero))
> >>> +    base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter),
> >>> +                       unshare_expr (rewrite_to_non_trapping_overflow 
> >>> (niter)),
> >>> +                       build_int_cst (TREE_TYPE (niter), 1));
> >>> +  tree_to_aff_combination (base, type, val);
> >>> I don't quite follow here.  The iv->base is computed from niter, I
> >>> suppose compact bound is for cheaper candidate initialization?  Why
> >>> it's possible to extract !may_be_zero niter for may_be_zero here?  The
> >>> niter under !may_be_zero has no indication about the real niter under
> >>> may_be_zero.
> >>>
> >>
> >> As you note below, the cand_value for doloop would be zero, but for the 
> >> case
> >> may_be_zero set, the current calculation would take care of the whole niter
> >> expression including the cond_expr introduced by may_be_zero check, it's
> >> unexpected.  The purpose is to use the value under condition !may_be_zero
> >> for the calculation, and yes, to get expected zero finally.
> >>
> >>> -  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
> >>> +  cand_value_at (loop, cand, use->stmt, desc, &bnd);
> >>> If I understand correctly, doloop use/cand will only be
> >>> identified/added for single exit loop, and there will be only one
> >>> cond(doloop) iv_use and only one doloop cand for doloop loop.  So the
> >>> cand_value at niter at use position would be 0.  If that's the case,
> >>> we can skip calling cand_value_at here for doloop cand.  The change to
> >>> cand_value_at would be unnecessary neither.
> >>>
> >>
> >> Exactly, I'll add the early return with zero bound for doloop.
> >>
> >>> -          expensive.  */
> >>> -  if (!integer_zerop (desc->may_be_zero))
> >>> +          expensive.
> >>> +
> >>> +     For doloop candidate, we have considered MAY_BE_ZERO for IV base, 
> >>> need to
> >>> +     support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check.  */
> >>> +  if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p)
> >>>      return iv_elimination_compare_lt (data, cand, comp, desc);
> >>> And we can early return before this?
> >>>
> >>
> >> OK.
> >>
> >>> +  if (may_be_zero)
> >>> +    {
> >>> +      if (COMPARISON_CLASS_P (may_be_zero))
> >>> +       {
> >>> +         niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
> >>> +                              build_int_cst (ntype, 0),
> >>> +                              rewrite_to_non_trapping_overflow (niter));
> >>> +       }
> >>> +      /* Don't try to obtain the iteration count expression when 
> >>> may_be_zero is
> >>> +        integer_nonzerop (actually iteration count is one) or else.  */
> >>> +      else
> >>> +       return;
> >>> +    }
> >>> +
> >>> +  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
> >>> +                          build_int_cst (ntype, 1));
> >>> niter is the number of latch executions, so niter + 1 could wrap here,
> >>> but guess it's not a problem the similar issue is not handled in
> >>> vectorizer neither.
> >>>
> >>
> >> OK.
> >>
> >>> +  unsigned n_old = data->regs_used, n_spr_for_doloop = 0;
> >>> +  /* If target supports count register for doloop, it doesn't take GPR.  
> >>> */
> >>> +  if (targetm.have_count_reg_decr_p)
> >>> +    n_spr_for_doloop = n_doloop_cands;
> >>> +  unsigned n_new = n_invs + n_cands - n_spr_for_doloop;
> >>> Not necessary.  See below.
> >>
> >>> -  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands);
> >>> +  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands,
> >>> +                                       ivs->n_doloop_cands);
> >>> Also.
> >>>
> >>>        ivs->n_cands--;
> >>> +      if (cp->cand->doloop_p)
> >>> +       ivs->n_doloop_cands--;
> >>>
> >>>           ivs->n_cands++;
> >>> +         if (cp->cand->doloop_p)
> >>> +           ivs->n_doloop_cands++;
> >>> You can just book n_cands under condition !cp->cand->doloop_p.
> >>
> >> If my understanding is correct, you are suggesting the code like:
> >>
> >> if (!cp->cand->doloop_p)
> >>   ivs->n_cands++;
> >>
> >> But I'm afraid that it can NOT satisfy the need in function
> >> ivopts_estimate_reg_pressure.  As the comments, "if target supports
> >> count register for doloop it doesn't take GPR.".  If we make doloop
> >> cand invisible in n_cands, it's fine for target with count register,
> >> but we may miss to count them on targets without count register.
> > Why not one more step do checks:
> > if (!cp->cand->doloop_p || !targetm.have_count_reg_decr_p)
> >   ivs->n_cands++;
> >
>
> Yes, it works.  Thanks!
>
> The new patch addressing the comments is attached.
> Could you please have a look again?  Thanks in advance!
Thanks for working on this.  A bit more nit-pickings.


-    add_candidate_1 (data, base, step, important,
-                    IP_NORMAL, use, NULL, orig_iv);
-  if (ip_end_pos (data->current_loop)
+    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
+                    orig_iv);
+  if (!doloop && ip_end_pos (data->current_loop)
Could you add some comments elaborating why ip_end_pos candidate
shouldn't be added for doloop case?  Because the increment position is
wrong.

Also if you make doloop the last default parameter of add_candidate_1,
you can save more unnecessary changes to calls to add_candidate?

-    cost = get_computation_cost (data, use, cand, false,
-                                &inv_vars, NULL, &inv_expr);
+    {
+      cost = get_computation_cost (data, use, cand, false, &inv_vars, NULL,
+                                  &inv_expr);
+      if (cand->doloop_p)
+       cost += targetm.doloop_cost_for_generic;
+    }
This adjustment

   cost = get_computation_cost (data, use, cand, true,
                               &inv_vars, &can_autoinc, &inv_expr);

+  if (cand->doloop_p)
+    cost += targetm.doloop_cost_for_address;
+
and this adjustment can be moved into get_computation_cost where all
cost adjustments are done.

+      /* For doloop candidate/use pair, adjust to zero cost.  */
+      if (group->doloop_p && cand->doloop_p)
+       cost = no_cost;
Note above code handles comparing against zero case and decreases the
cost by one (which prefers the same kind candidate as doloop one),
it's very possible to have -1 cost for doloop cand here.  how about
just set to no_cost if it's positive?  Your call.

+/* For the targets which support doloop, to predict whether later RTL doloop
+   transformation will perform on this loop, further detect the doloop use and
+   mark the flag doloop_use_p if predicted.  */
+
+void
+predict_and_process_doloop (struct ivopts_data *data)
A better name here? Sorry I don't have another candidate in mind...

+  data->doloop_use_p = false;
This can be moved to the beginning of above
'predict_and_process_doloop' function.

Lastly, could you please add some brief description/comment about
doloop handling as a subsection in the file head comment?

Otherwise, the ivopt changes look good to me.

Thanks,
bin

>
>
> Kewen
>
> ---------
>
> gcc/ChangeLog
>
> 2019-08-22  Kewen Lin  <li...@gcc.gnu.org>
>
>         PR middle-end/80791
>         * config/rs6000/rs6000.c (TARGET_HAVE_COUNT_REG_DECR_P): New macro.
>         (TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
>         (TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
>         * target.def (have_count_reg_decr_p): New hook.
>         (doloop_cost_for_generic): Likewise.
>         (doloop_cost_for_address): Likewise.
>         * doc/tm.texi.in (TARGET_HAVE_COUNT_REG_DECR_P): Likewise.
>         (TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
>         (TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
>         * doc/tm.texi: Regenerate.
>         * tree-ssa-loop-ivopts.c (comp_cost::operator+=): Consider infinite 
> cost
>         addend.
>         (record_group): Init doloop_p.
>         (add_candidate_1): Add optional argument doloop, change the handlings
>         accordingly.
>         (add_candidate): Likewise.
>         (add_iv_candidate_for_biv): Update the call to add_candidate.
>         (generic_predict_doloop_p): Update attribute.
>         (force_expr_to_var_cost): Add costing for expressions 
> COND_EXPR/LT_EXPR/
>         LE_EXPR/GT_EXPR/GE_EXPR/EQ_EXPR/NE_EXPR/UNORDERED_EXPR/ORDERED_EXPR/
>         UNLT_EXPR/UNLE_EXPR/UNGT_EXPR/UNGE_EXPR/UNEQ_EXPR/LTGT_EXPR/MAX_EXPR/
>         MIN_EXPR.
>         (determine_group_iv_cost_generic): Update for doloop IV cand.
>         (determine_group_iv_cost_address): Likewise.
>         (determine_group_iv_cost_cond): Likewise.
>         (determine_iv_cost): Likewise.
>         (ivopts_estimate_reg_pressure): Likewise.
>         (may_eliminate_iv): Likewise.
>         (add_iv_candidate_for_doloop): New function.
>         (find_iv_candidates): Call function add_iv_candidate_for_doloop.
>         (iv_ca_set_no_cp): Update for doloop IV cand.
>         (iv_ca_set_cp): Likewise.
>         (iv_ca_dump): Dump register cost.
>         (find_doloop_use): New function.
>         (predict_and_process_doloop): Likewise.
>         (tree_ssa_iv_optimize_loop): Call function predict_and_process_doloop.
>
> gcc/testsuite/ChangeLog
>
> 2019-08-22  Kewen Lin  <li...@gcc.gnu.org>
>
>         PR middle-end/80791
>         * gcc.dg/tree-ssa/ivopts-3.c: Adjust for doloop change.
>         * gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
>         * gcc.dg/tree-ssa/pr32044.c: Likewise.
>

Re: [PATCH v6 3/3] PR80791 Consider doloop cmp use in ivopts

Reply via email to