Hi Bin,

on 2020/8/10 下午8:38, Bin.Cheng wrote:
> On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin <li...@linux.ibm.com> wrote:
>>
>> Hi Bin,
>>
>> Thanks for the review!!
>>
>> on 2020/8/8 下午4:01, Bin.Cheng wrote:
>>> Hi Kewen,
>>> Sorry for the late reply.
>>> The patch's most important change is below cost computation:
>>>
>>>> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct 
>>>> iv_cand *cand)
>>>>     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
>>>>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
>>>>
>>>> +  /* Consider additional step updates during unrolling.  */
>>>> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
>>>> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
>>> This is a bit strange, to me the add instructions are additional
>>> computation caused by unrolling+addressing_mode, rather than a native
>>> part in candidate itself.  Specifically, an additional cost is needed
>>> if candidates (without reg_offset_p) are chosen for the address type
>>> group/uses.
>>
>> Good point, ideally it should be one additional cost for each cand set,
>> when we select one cand for one group, we need to check this pair need
>> more (estimated_unroll - 1) step costs, we probably need to care about
>> this during remove/replace etc.  IIUC the current IVOPTs cost framework
>> doesn't support this and it could increase the selection complexity and
>> time.  I hesitated to do it and put it to cand step cost initially instead.
>>
>> I was thinking those candidates with reg_offset_p should be only used for
>> those reg_offset_p groups in most cases (very limited) meanwhile the others
>> are simply scaled up like before.  But indeed this can cover some similar
>> cases like one cand is only used for the compare type group which is for
>> loop closing, then it doesn't need more step costs for unrolling.
>>
>> Do you prefer me to improve the current cost framework?
> No, I don't think it's relevant to the candidate selecting algorithm.
> I was thinking about adjusting cost somehow in
> determine_group_iv_cost_address. Given we don't expose selected
> addressing mode in this function, you may need to do it in
> get_address_cost, either way.
> 

Thanks for your suggestion!

Sorry, I may miss something, but I still think the additional cost is
per candidate.  The justification is that we miss to model the iv
candidate step well in the context of unrolling, the step cost is part
of candidate cost, which is per candidate.

To initialize it in determine_iv_cost isn't perfect as you pointed out,
ideally we should check any uses of the candidate requires iv update
after each replicated iteration, and take extra step costs into account
if at least one needs, meanwhile scaling up all the computation cost to
reflect unrolling cost nature.

Besides, the reg_offset desirable pair already takes zero cost for
cand/group cost, IIRC negative cost isn't preferred in IVOPTs, are you
suggesting increasing the cost for non reg_offset pairs?  If so and per
pair, the extra cost looks possible to be computed several times
unexpectedly.

>>
>>>> +
>>>>   /* Prefer the original ivs unless we may gain something by replacing it.
>>>>      The reason is to make debugging simpler; so this is not relevant for
>>>>      artificial ivs created by other optimization passes.  */
>>>>
>>>
>>>> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
>>>>       return;
>>>>     }
>>>>
>>>> +  /* Since we priced more on non reg_offset IV cand step cost, we should 
>>>> scale
>>>> +     up the appropriate IV group costs.  Simply consider USE_COMPARE at 
>>>> the
>>>> +     loop exit, FIXME if multiple exits supported or no loop exit 
>>>> comparisons
>>>> +     matter.  */
>>>> +  if (data->consider_reg_offset_for_unroll_p
>>>> +      && group->vuses[0]->type != USE_COMPARE)
>>>> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
>>> Not quite follow here, giving "pricing more on on-reg_offset IV cand"
>>> doesn't make much sense to me.  Also why generic type uses are not
>>> skipped?  We want to model the cost required for address computation,
>>> however, for generic type uses there is no way to save the computation
>>> in "address expression".  Once unrolled, the computation is always
>>> there?
>>>
>>
>> The main intention is to scale up the group/cand cost for unrolling since
>> we have scaled up the step costs.  The assumption is that the original
> If we adjust cost appropriately in function *group_iv_cost_address,
> this would become unnecessary, right?  And naturally.
>> costing (without this patch) can be viewed as either for all unrolled
>> iterations or just one single iteration.  Since IVOPTs doesn't support
>> fractional costing, I interpreted it as single iterations, to emulate
>> unrolling scenario based on the previous step cost scaling, we need to
>> scale up the cost for all computation.
>>
>> In most cases, the compare type use is for loop closing, there is still
>> only one computation even unrolling happens, so I excluded it here.
>> As "FIXME", if we find some cases are off, we can further restrict it to
>> those USE_COMPARE uses which is exactly for loop closing.
>>
>>> And what's the impact on targets supporting [base + index + offset]
>>> addressing mode?
>>
>> Good question, I didn't notice it since power doesn't support it.
>> I noticed the comments of function addr_offset_valid_p only mentioning
>> [base + offset], I guess it excludes [base + index + offset]?
>> But I guess the address-based IV can work for this mode?
> No, addr_offset_valid_p is only used to split address use groups.  See
> get_address_cost and struct mem_address.
> I forgot to ask, what about target only supports [base + offset]
> addressing mode like RISC-V?  I would expect it's not affected at all.
> 
addr_offset_valid_p is also used in this patch as Richard S. and Segher
suggested to check the offset after unrolling (like: offset+(uf-1)*step)
is still valid for the target.  If address-based IV can not work for
[base + index + offset], it won't affect [base + index + offset].

It can help all targets which supports [base + offset], so I'm afraid
it can affect RISC-V too, but I would expect it's positive.  Or do you
happen to see some potential issues? or have some concerns?

And as Richard S. suggested before, it has one parameter to control,
target can simply disable this if it dislikes.

>>
>>>
>>> Given the patch is not likely to harm because rtl loop unrolling is
>>> (or was?) by default disabled, so I am OK once the above comments are
>>> addressed.
>>>
>>
>> Yes, it needs explicit unrolling options, excepting for some targets
>> wants to enable it for some cases with specific loop_unroll_adjust checks.
>>
>>> I wonder if it's possible to get 10% of (all which should be unrolled)
>>> loops unrolled (conservatively) on gimple and enable it by default at
>>> O3, rather than teaching ivopts to model a future pass which not
>>> likely be used outside of benchmarks?
>>>
>>
>> Yeah, it would be nice if the unrolling happen before IVOPTs and won't
>> have any future unrollings to get it off.  PR[1] seems to have some
>> discussion on gimple unrolling.
> Thanks for directing me to the discussion.  I am on Wilco's side on
> this problem, IMHO, It might be useful getting small loops unrolled
> (at O3 by default) by simply saving induction variable stepping and
> exit condition check, which can be modeled on gimple.  Especially for
> RISC-V, it doesn't support index addressing mode, which means there
> will be as many induction variables as distinct arrays.  Also
> interleaving after unrolling is not that important, it's at high
> chance that small loops eligible for interleaving are handled by
> vectorizer already.
> 

Good idea, CC Jeff since I think Jeff (Jiufu) has been working to see
whether we can bring in one gimple unrolling pass.  Regardless we
have/don't have it, if the RTL unrolling is there, I guess this patch
set is still beneficial?  If so, I would take it separately.

BR,
Kewen

Reply via email to