On Mon, Aug 10, 2020 at 10:41 PM Kewen.Lin <li...@linux.ibm.com> wrote:
>
> Hi Bin,
>
> on 2020/8/10 下午8:38, Bin.Cheng wrote:
> > On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin <li...@linux.ibm.com> wrote:
> >>
> >> Hi Bin,
> >>
> >> Thanks for the review!!
> >>
> >> on 2020/8/8 下午4:01, Bin.Cheng wrote:
> >>> Hi Kewen,
> >>> Sorry for the late reply.
> >>> The patch's most important change is below cost computation:
> >>>
> >>>> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, 
> >>>> struct iv_cand *cand)
> >>>>     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
> >>>>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
> >>>>
> >>>> +  /* Consider additional step updates during unrolling.  */
> >>>> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> >>>> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
> >>> This is a bit strange, to me the add instructions are additional
> >>> computation caused by unrolling+addressing_mode, rather than a native
> >>> part in candidate itself.  Specifically, an additional cost is needed
> >>> if candidates (without reg_offset_p) are chosen for the address type
> >>> group/uses.
> >>
> >> Good point, ideally it should be one additional cost for each cand set,
> >> when we select one cand for one group, we need to check this pair need
> >> more (estimated_unroll - 1) step costs, we probably need to care about
> >> this during remove/replace etc.  IIUC the current IVOPTs cost framework
> >> doesn't support this and it could increase the selection complexity and
> >> time.  I hesitated to do it and put it to cand step cost initially instead.
> >>
> >> I was thinking those candidates with reg_offset_p should be only used for
> >> those reg_offset_p groups in most cases (very limited) meanwhile the others
> >> are simply scaled up like before.  But indeed this can cover some similar
> >> cases like one cand is only used for the compare type group which is for
> >> loop closing, then it doesn't need more step costs for unrolling.
> >>
> >> Do you prefer me to improve the current cost framework?
> > No, I don't think it's relevant to the candidate selecting algorithm.
> > I was thinking about adjusting cost somehow in
> > determine_group_iv_cost_address. Given we don't expose selected
> > addressing mode in this function, you may need to do it in
> > get_address_cost, either way.
> >
>
> Thanks for your suggestion!
>
> Sorry, I may miss something, but I still think the additional cost is
> per candidate.  The justification is that we miss to model the iv
> candidate step well in the context of unrolling, the step cost is part
> of candidate cost, which is per candidate.
>
> To initialize it in determine_iv_cost isn't perfect as you pointed out,
> ideally we should check any uses of the candidate requires iv update
> after each replicated iteration, and take extra step costs into account
> if at least one needs, meanwhile scaling up all the computation cost to
> reflect unrolling cost nature.
I see, it's similar to the auto-increment case where cost should be
recorded only once.  So this is okay given 1) fine predicting
rtl-unroll is likely impossible here; 2) the patch has very limited
impact.

Thanks,
bin
>
> Besides, the reg_offset desirable pair already takes zero cost for
> cand/group cost, IIRC negative cost isn't preferred in IVOPTs, are you
> suggesting increasing the cost for non reg_offset pairs?  If so and per
> pair, the extra cost looks possible to be computed several times
> unexpectedly.
>
> >>
> >>>> +
> >>>>   /* Prefer the original ivs unless we may gain something by replacing 
> >>>> it.
> >>>>      The reason is to make debugging simpler; so this is not relevant for
> >>>>      artificial ivs created by other optimization passes.  */
> >>>>
> >>>
> >>>> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
> >>>>       return;
> >>>>     }
> >>>>
> >>>> +  /* Since we priced more on non reg_offset IV cand step cost, we 
> >>>> should scale
> >>>> +     up the appropriate IV group costs.  Simply consider USE_COMPARE at 
> >>>> the
> >>>> +     loop exit, FIXME if multiple exits supported or no loop exit 
> >>>> comparisons
> >>>> +     matter.  */
> >>>> +  if (data->consider_reg_offset_for_unroll_p
> >>>> +      && group->vuses[0]->type != USE_COMPARE)
> >>>> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
> >>> Not quite follow here, giving "pricing more on on-reg_offset IV cand"
> >>> doesn't make much sense to me.  Also why generic type uses are not
> >>> skipped?  We want to model the cost required for address computation,
> >>> however, for generic type uses there is no way to save the computation
> >>> in "address expression".  Once unrolled, the computation is always
> >>> there?
> >>>
> >>
> >> The main intention is to scale up the group/cand cost for unrolling since
> >> we have scaled up the step costs.  The assumption is that the original
> > If we adjust cost appropriately in function *group_iv_cost_address,
> > this would become unnecessary, right?  And naturally.
> >> costing (without this patch) can be viewed as either for all unrolled
> >> iterations or just one single iteration.  Since IVOPTs doesn't support
> >> fractional costing, I interpreted it as single iterations, to emulate
> >> unrolling scenario based on the previous step cost scaling, we need to
> >> scale up the cost for all computation.
> >>
> >> In most cases, the compare type use is for loop closing, there is still
> >> only one computation even unrolling happens, so I excluded it here.
> >> As "FIXME", if we find some cases are off, we can further restrict it to
> >> those USE_COMPARE uses which is exactly for loop closing.
> >>
> >>> And what's the impact on targets supporting [base + index + offset]
> >>> addressing mode?
> >>
> >> Good question, I didn't notice it since power doesn't support it.
> >> I noticed the comments of function addr_offset_valid_p only mentioning
> >> [base + offset], I guess it excludes [base + index + offset]?
> >> But I guess the address-based IV can work for this mode?
> > No, addr_offset_valid_p is only used to split address use groups.  See
> > get_address_cost and struct mem_address.
> > I forgot to ask, what about target only supports [base + offset]
> > addressing mode like RISC-V?  I would expect it's not affected at all.
> >
> addr_offset_valid_p is also used in this patch as Richard S. and Segher
> suggested to check the offset after unrolling (like: offset+(uf-1)*step)
> is still valid for the target.  If address-based IV can not work for
> [base + index + offset], it won't affect [base + index + offset].
>
> It can help all targets which supports [base + offset], so I'm afraid
> it can affect RISC-V too, but I would expect it's positive.  Or do you
> happen to see some potential issues? or have some concerns?
>
> And as Richard S. suggested before, it has one parameter to control,
> target can simply disable this if it dislikes.
>
> >>
> >>>
> >>> Given the patch is not likely to harm because rtl loop unrolling is
> >>> (or was?) by default disabled, so I am OK once the above comments are
> >>> addressed.
> >>>
> >>
> >> Yes, it needs explicit unrolling options, excepting for some targets
> >> wants to enable it for some cases with specific loop_unroll_adjust checks.
> >>
> >>> I wonder if it's possible to get 10% of (all which should be unrolled)
> >>> loops unrolled (conservatively) on gimple and enable it by default at
> >>> O3, rather than teaching ivopts to model a future pass which not
> >>> likely be used outside of benchmarks?
> >>>
> >>
> >> Yeah, it would be nice if the unrolling happen before IVOPTs and won't
> >> have any future unrollings to get it off.  PR[1] seems to have some
> >> discussion on gimple unrolling.
> > Thanks for directing me to the discussion.  I am on Wilco's side on
> > this problem, IMHO, It might be useful getting small loops unrolled
> > (at O3 by default) by simply saving induction variable stepping and
> > exit condition check, which can be modeled on gimple.  Especially for
> > RISC-V, it doesn't support index addressing mode, which means there
> > will be as many induction variables as distinct arrays.  Also
> > interleaving after unrolling is not that important, it's at high
> > chance that small loops eligible for interleaving are handled by
> > vectorizer already.
> >
>
> Good idea, CC Jeff since I think Jeff (Jiufu) has been working to see
> whether we can bring in one gimple unrolling pass.  Regardless we
> have/don't have it, if the RTL unrolling is there, I guess this patch
> set is still beneficial?  If so, I would take it separately.
>
> BR,
> Kewen

Reply via email to