On Wed, Nov 9, 2016 at 1:02 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Thu, Nov 3, 2016 at 4:00 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Thu, Nov 3, 2016 at 1:35 PM, Evgeny Kudryashov <kudryas...@ispras.ru> wrote:
>>> Hello,
>>>
>>> I'm facing the following problem related to ivopts: GCC generates a lot
>>> of induction variables, and as a result there is an unnecessary increase
>>> in stack usage and register pressure.
>>>
>>> For instance, for the attached testcase (tc_ivopts.c) GCC generates 26
>>> induction variables: one for each of the 25 lhsX[{0-4}][{0-4}][k]
>>> accesses and one for rhs[k][j][{0-2}]. You should be able to reproduce
>>> this issue on an arm target using GCC with "-O2 -mcpu=cortex-a9
>>> -mtune=cortex-a9". For reference, I'm attaching the assembly I get on
>>> current trunk.
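>>>
>>> Roughly, the loop has this shape (illustrative only, reconstructed
>>> from the access patterns above; the exact code is in the attached
>>> tc_ivopts.c):
>>>
>>>     for (k = 0; k < n; k++)
>>>       {
>>>         lhsX[0][0][k] = /* ... */;
>>>         lhsX[0][1][k] = /* ... */;
>>>         /* ... 25 lhsX[i][j][k] stores in total, plus reads of
>>>            rhs[k][j][0..2] ... */
>>>         lhsX[4][4][k] = /* ... */;
>>>       }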
>>>
>>> The reason might be in the way use-group costs are estimated. Currently,
>>> the cost of a (group, candidate) pair is the sum of the per-use costs in
>>> the group, so the cost of a group grows in proportion to the number of
>>> uses in it. This has a negative effect on the algorithm that searches
>>> for the best set of induction variables: the part of the total cost that
>>> accounts for adding a new candidate is almost washed out by the group
>>> cost. In addition, when there are many groups with many uses each and
>>> the target runs out of free registers, the spill cost is washed out too.
>>> As a result, GCC prefers the native candidate (the candidate created for
>>> that particular group) for each group of uses instead of considering the
>>> real cost of introducing a new variable into the set.
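>>>
>>> (Hypothetical numbers to illustrate: with 25 uses in a group, a native
>>> candidate costing 4 per use sums to 100, while a shared candidate
>>> costing 5 per use sums to 125; the few units charged for registering
>>> one more candidate are negligible against that gap, so the native
>>> candidate wins even though it burns an extra register.)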
>>>
>>> The summing approach was added as part of this patch
>>> (https://gcc.gnu.org/ml/gcc-patches/2015-05/msg00641.html), and the
>>> motivation for taking the sum does not seem to have been specifically
>>> discussed.
>>>
>>> I propose the following patch, which changes the group cost from the sum
>>> of the per-use costs to the largest cost inside the group. For the given
>>> test case, the required stack size decreases by a factor of almost 3 and
>>> the number of ldr/str instructions by a factor of slightly less than 2.
>>> I'm also attaching the assembly after applying the patch.
>>>
>>> The essential change in the patch is just:
>>>
>>> diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
>>> index f9211ad..a149418 100644
>>> --- a/gcc/tree-ssa-loop-ivopts.c
>>> +++ b/gcc/tree-ssa-loop-ivopts.c
>>> @@ -5151,7 +5151,8 @@ determine_group_iv_cost_address (struct ivopts_data *data,
>>>          offset and where the iv_use is.  */
>>>         cost = get_computation_cost (data, next, cand, true,
>>>                                      NULL, &can_autoinc, NULL);
>>> -      sum_cost += cost;
>>> +      if (sum_cost < cost)
>>> +        sum_cost = cost;
>>>      }
>>>    set_group_iv_cost (data, group, cand, sum_cost, depends_on,
>>>                      NULL_TREE, ERROR_MARK, inv_expr);
>>>
>>> Any suggestions?
>> Hi Evgeny,
>>
>> I tend to be practical here.  Given that cost is not modeled that well
>> in IVOPTS, any heuristic tuning of the cost is quite likely to give
>> improvements in some cases and regressions in others, so at the least
>> we need to run a good number of benchmarks for such changes.  Specific
>> to this one, summing costs within a group is a natural consequence of
>> the original code, in which the uses' costs were added up before
>> candidate selection.  Taking your code as an example, choosing the
>> loop's original candidate for a group results in lots of memory
>> accesses with the [base + index << scale] addressing mode, which could
>> have a higher cost than the [base + offset] mode on a given
>> micro-architecture; IMHO, it makes sense to add up this cost.  OTOH, I
>> totally agree that IVOPTS tends to choose too many candidates at the
>> moment, especially for large loops.  Is it possible to achieve the same
>> effect by penalizing the register pressure cost?  Meanwhile, I can
>> collect benchmark data for your patch on AArch64 and get back to you
>> later.
> I collected spec2k6 data on one AArch64 processor; the patch gives no
> meaningful improvement or regression.  Looking at the test, it boils
> down to how we choose the induction variable for loops with memory
> references like the ones below:
> for (biv)
>   MEM[base + biv << scale + off1];
>   MEM[base + biv << scale + off2];
>   // ...
>   MEM[base + biv << scale + offn];
>
> On targets that support the [base + index << scale + offset] addressing
> mode, the biv should be preferred (if the cost of the addressing mode
> is not blocking), and register pressure is then 2.  On targets that
> only support [base + index << scale], it is more complicated: choosing
> the biv actually increases register pressure by 1 compared with
> choosing {base_1 + biv << scale, step} as the candidate, or else an
> additional computation outside of the address expression is required
> for each memory reference.  Well, there is one exception, when "base"
> is actually anticipated on the exit edge of this loop.  So this is
> really a target-dependent cost model issue; IMHO, decreasing the group
> cost in a target-independent way is not that good.  What do you think?
> I will look deeper into why choosing the biv can get rid of the
> spilling.
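>
> To illustrate on such a target (a rough sketch; it assumes plain
> [base + offset] addressing is still available, as on arm, that the
> biv starts at 0, and that base_1 stands for base + off1):
>
>   // candidate {base_1 + biv << scale, step}: only p stays live
>   p = base_1;
>   loop:
>     MEM[p];  MEM[p + (off2 - off1)];  /* ... */
>     p += step;
>
>   // candidate biv: base and biv stay live, and each reference
>   // needs base + off_i computed outside the address expression
>   loop:
>     t1 = base + off1;  MEM[t1 + biv << scale];
>     /* ... */
>     biv += 1;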
>
> BTW, the case can be further extended:
> for (biv)
>   MEM[base_1 + biv << scale + off1];
>   MEM[base_1 + biv << scale + off2];
>   // ...
>   MEM[base_1 + biv << scale + offn];
>   //
>   // ...
>   //
>   MEM[base_m + biv << scale + off1];
>   MEM[base_m + biv << scale + off2];
>   // ...
>   MEM[base_m + biv << scale + offn];
>
> IVOPTS doesn't work well for this either.

Hi,
I see the cost problem with your test now.  When an address type iv_use
is computed with a candidate, the computation consists of two parts: the
part that can be represented by the addressing mode is done inside the
memory reference, and the part that cannot is done outside of it.  The
final cost is the sum of the costs of the two parts.
For the address iv_use:
    MEM[base + biv << scale + offset]
when it is computed with the candidate below on a target that only
supports the [base + biv << scale] addressing mode:
    biv
the computations would be:
    base' = base + offset
    MEM[base' + biv << scale]
Each computation has its own cost: the first is a normal RTX cost, the
second is an addressing mode cost.  The final cost adds up both parts.
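In other words (informal notation), each use is charged:
    cost(use) = rtx_cost (base' = base + offset)
                + address_cost ([base' + biv << scale])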

Normally, all these costs should be added up in the cost model, but
there is one exception, found in your test: if the iv_uses of a group
have exactly the same iv ({base, step}), the first computation (the RTX
part) can be shared among all the iv_uses, so its cost should only be
counted once.  That is, we should be able to model such CSE
opportunities.  Obviously we can't CSE the second computation; there
are no CSE opportunities inside an address expression anyway.
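For example, for a group of two uses with exactly the same iv (sketch):
    base' = base + offset;               // RTX part: computed once
    loop:
      ... = MEM[base' + biv << scale];   // use 1: address cost only
      MEM[base' + biv << scale] = ...;   // use 2: address cost only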

That said, the patch should differentiate between the cost of the RTX
computation and the cost of the address expression, and only add the
RTX cost once when it can be CSEed.  Well, it might not be trivial to
check for CSE opportunities in the RTX computation, for example when
some iv_uses of the group are the same and others are not.
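A rough sketch of the shape such a change could take (not actual GCC
code; split_computation_cost and same_iv_p are hypothetical helpers):

    /* Charge the shareable RTX part once per group; every use still
       pays its own addressing mode cost.  */
    comp_cost group_cost = no_cost;
    struct iv_use *first = group->vuses[0];
    for (unsigned i = 0; i < group->vuses.length (); i++)
      {
        struct iv_use *next = group->vuses[i];
        comp_cost rtx_part, addr_part;
        /* Hypothetical split of get_computation_cost into the RTX
           computation part and the address expression part.  */
        split_computation_cost (data, next, cand, &rtx_part, &addr_part);
        group_cost = add_costs (group_cost, addr_part);
        /* Pay the RTX part for the first use, and again for any use
           whose iv differs from the first (no CSE possible there).  */
        if (next == first || !same_iv_p (next, first))
          group_cost = add_costs (group_cost, rtx_part);
      }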

Thanks,
bin
