Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-17 Thread Segher Boessenkool
Hi Jeff,

On Thu, Sep 17, 2020 at 05:12:17PM -0600, Jeff Law wrote:
> On 9/3/20 4:37 PM, Segher Boessenkool wrote:
> >> Apart from that, one P9 specific point is that the update form load isn't
> >> preferred,  the reason is that the instruction can not retire until both
> >> parts complete, it can hold up subsequent instructions from retiring.
> >> If the addi stalls (starvation), the instruction can not retire and can
> >> cause things stuck.  It seems also something we can model here?
> > This is (almost) no problem on p9, since we no longer have issue groups.
> > It can hold up older insns from retiring, sure, but they *will* have
> > finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> > is gone.  The only problem is if things stick around so long that
> > resources run out...  but you're talking 100s of insns there.
> 
> So the PA8xxx had the same issue with its dual output insns -- the big
> difference is the PA8xxx systems were considered retirement bandwidth
> limited (2 memory and 2 non-memory per cycle, with just a 56 entry
> reorder buffer, split between memory and non-memory ops).  Holding a
> slot in the reorder buffer was relatively costly.
> 
> 
> If you can retire 64 ops per cycle and you've probably got an enormous
> reorder buffer, so I wouldn't worry much about holding up insns from
> retiring on your target.

Power9 doesn't have a reorder buffer or anything similar at all -- it
uses history buffers, so committing insns (pretty much what you call
retiring) is essentially for free (restoring old register values after
flushes now costs more though).


Segher


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-17 Thread Jeff Law via Gcc-patches

On 9/3/20 4:37 PM, Segher Boessenkool wrote:
>> Apart from that, one P9 specific point is that the update form load isn't
>> preferred,  the reason is that the instruction can not retire until both
>> parts complete, it can hold up subsequent instructions from retiring.
>> If the addi stalls (starvation), the instruction can not retire and can
>> cause things stuck.  It seems also something we can model here?
> This is (almost) no problem on p9, since we no longer have issue groups.
> It can hold up older insns from retiring, sure, but they *will* have
> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> is gone.  The only problem is if things stick around so long that
> resources run out...  but you're talking 100s of insns there.

So the PA8xxx had the same issue with its dual output insns -- the big
difference is the PA8xxx systems were considered retirement bandwidth
limited (2 memory and 2 non-memory per cycle, with just a 56 entry
reorder buffer, split between memory and non-memory ops).  Holding a
slot in the reorder buffer was relatively costly.


If you can retire 64 ops per cycle and you've probably got an enormous
reorder buffer, so I wouldn't worry much about holding up insns from
retiring on your target.


jeff



pEpkey.asc
Description: application/pgp-keys


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-15 Thread Kewen.Lin via Gcc-patches
Hi Hans,

on 2020/9/6 上午10:47, Hans-Peter Nilsson wrote:
> On Tue, 1 Sep 2020, Bin.Cheng via Gcc-patches wrote:
>>> Great idea!  With explicitly specified -funroll-loops, it's bootstrapped
>>> but the regression testing did show one failure (the only one):
>>>
>>>   PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1
>>>
>>> It exposes two issues:
>>>
>>> 1) Currently address_cost hook on rs6000 always return zero, but at least
>>> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
>>> have to take the address update into account (scalar normal operation).
>>> Since IVOPTs reduces the cost_step for ainc candidates, it makes us prefer
>>> ainc candidates.  In this case, the cand/group cost is -4 (minus cost_step),
>>> with scaling up, the off becomes much.  With one simple hack on for pre_inc/
>>> pre_dec in rs6000 address_cost, the case passed.  It should be handled in
>>> one separated issue.
>>>
>>> 2) This case makes me think we should exclude ainc candidates in function
>>> mark_reg_offset_candidates.  The justification is that: ainc candidate
>>> handles step update itself and when we calculate the cost for it against
>>> its ainc_use, the cost_step has been reduced. When unrolling happens,
>>> the ainc computations are replicated and it doesn't save step updates
>>> like normal reg_offset_p candidates.
>> Though auto-inc candidate embeds stepping operation into memory
>> access, we might want to avoid it in case of unroll if there are many
>> sequences of memory accesses, and if the unroll factor is big.  The
>> rationale is embedded stepping is a u-arch operation and does have its
>> cost.
> 
> Forgive me for barging in here (though the context is powerpc,
> the dialogue and the patch seems to be generic ivopts), but
> that's not a general remark I hope, about auto-inc (always)
> having a cost?
> 
> For some architectures, auto-inc *is* free, as free as
> register-indirect, so the more auto-inc use, the better.  All
> this should be reflected by the address-cost, IMHO, and not
> hardcoded into ivopts.
> 

Yeah, now ivopts doesn't hardcode the cost for auto-inc (always),
instead it allows targets to set its cost by themselves through
address_cost hook.  As the function get_address_cost_ainc, it
checks auto-inc operations supported or not and set the cost
as address_cost hook further.

One example on Power is listed as below:

Group 0:
  cand  costcompl.  inv.expr.   inv.vars
  1 4   1   NIL;1
  3 0   0   NIL;NIL;
  4 0   1   NIL;1
  5 0   1   NIL;NIL;
  130   1   NIL;NIL;
  18-4  0   NIL;NIL;

Cand 18 is one auto-inc candidate, whose group 0/cand cost is
-4 (minus step_cost), the iv_cost of cand 18 is 5 (step_cost +
non-original_iv cost), when it's selected, the step_cost parts
counteract, the remaining cost (1) is for non-original iv,
it shows it doesn't put any hardcoded cost to this ainc_cost
candidate.

I guess some misunderstanding was derived from some discussion
above.  Sorry if some of my previous comments misled you.

BR,
Kewen


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-05 Thread Hans-Peter Nilsson
On Tue, 1 Sep 2020, Bin.Cheng via Gcc-patches wrote:
> > Great idea!  With explicitly specified -funroll-loops, it's bootstrapped
> > but the regression testing did show one failure (the only one):
> >
> >   PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1
> >
> > It exposes two issues:
> >
> > 1) Currently address_cost hook on rs6000 always return zero, but at least
> > from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> > have to take the address update into account (scalar normal operation).
> > Since IVOPTs reduces the cost_step for ainc candidates, it makes us prefer
> > ainc candidates.  In this case, the cand/group cost is -4 (minus cost_step),
> > with scaling up, the off becomes much.  With one simple hack on for pre_inc/
> > pre_dec in rs6000 address_cost, the case passed.  It should be handled in
> > one separated issue.
> >
> > 2) This case makes me think we should exclude ainc candidates in function
> > mark_reg_offset_candidates.  The justification is that: ainc candidate
> > handles step update itself and when we calculate the cost for it against
> > its ainc_use, the cost_step has been reduced. When unrolling happens,
> > the ainc computations are replicated and it doesn't save step updates
> > like normal reg_offset_p candidates.
> Though auto-inc candidate embeds stepping operation into memory
> access, we might want to avoid it in case of unroll if there are many
> sequences of memory accesses, and if the unroll factor is big.  The
> rationale is embedded stepping is a u-arch operation and does have its
> cost.

Forgive me for barging in here (though the context is powerpc,
the dialogue and the patch seems to be generic ivopts), but
that's not a general remark I hope, about auto-inc (always)
having a cost?

For some architectures, auto-inc *is* free, as free as
register-indirect, so the more auto-inc use, the better.  All
this should be reflected by the address-cost, IMHO, and not
hardcoded into ivopts.

brgds, H-P


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-04 Thread Kewen.Lin via Gcc-patches
Hi Segher,

on 2020/9/4 下午10:16, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Sep 04, 2020 at 04:47:37PM +0800, Kewen.Lin wrote:
 Apart from that, one P9 specific point is that the update form load isn't
 preferred,  the reason is that the instruction can not retire until both
 parts complete, it can hold up subsequent instructions from retiring.
 If the addi stalls (starvation), the instruction can not retire and can
 cause things stuck.  It seems also something we can model here?
>>>
>>> This is (almost) no problem on p9, since we no longer have issue groups.
>>> It can hold up older insns from retiring, sure, but they *will* have
>>> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
>>> is gone.  The only problem is if things stick around so long that
>>> resources run out...  but you're talking 100s of insns there.
>>
>> Theoretically it's fine, but the addi starvation was observed in the FP/SIMD
>> instructions intensive loop code, which did cause some worse performance.  :(
> 
> "addi starvation" has nothing to do with addi (it also happens for other
> insns), and nothing with update form memory insns either.  What happens
> is simply that no shorter latency insns are issued by the core so long
> as longer latency insns (like most float insns) are available.  So in
> really nice floating point loops we execute the few integer add insns
> much too late, much later than they were in the machine code, which then
> makes the memory insns late as well, etc.
> 

Yeah, the starvation issue isn't addi specific, but in the FP/SIMD
insns intensive loop, "addi/add" is the major/all portion of the
shorter latency insns in most time.  So I'd argue that it's related. :) 
Since they are mainly for IV updates, memory insns depend on it,
the FP/SIMD insns depend on the memory insns, ..., it can easily
cause the stall chain reaction, I guess that's why some people call
it as "addi starvation".

As the example Bin gave in another email, more auto-inc candidates
would have more iv updates (cracked ADDIs), if one/several common
index iv can be shared among the memory insns (fewer ADDIs), we can
reduce the number of shorter latency insns.  As I know, some compiler
did implement not to perfer auto-inc candidates, it can mitigate
starvation issue in those FP/SIMD intensive loops to some extent.

BR,
Kewen


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-04 Thread Segher Boessenkool
Hi!

On Fri, Sep 04, 2020 at 04:47:37PM +0800, Kewen.Lin wrote:
> >> Apart from that, one P9 specific point is that the update form load isn't
> >> preferred,  the reason is that the instruction can not retire until both
> >> parts complete, it can hold up subsequent instructions from retiring.
> >> If the addi stalls (starvation), the instruction can not retire and can
> >> cause things stuck.  It seems also something we can model here?
> > 
> > This is (almost) no problem on p9, since we no longer have issue groups.
> > It can hold up older insns from retiring, sure, but they *will* have
> > finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> > is gone.  The only problem is if things stick around so long that
> > resources run out...  but you're talking 100s of insns there.
> 
> Theoretically it's fine, but the addi starvation was observed in the FP/SIMD
> instructions intensive loop code, which did cause some worse performance.  :(

"addi starvation" has nothing to do with addi (it also happens for other
insns), and nothing with update form memory insns either.  What happens
is simply that no shorter latency insns are issued by the core so long
as longer latency insns (like most float insns) are available.  So in
really nice floating point loops we execute the few integer add insns
much too late, much later than they were in the machine code, which then
makes the memory insns late as well, etc.


Segher


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-04 Thread Segher Boessenkool
Hi Bin,

On Fri, Sep 04, 2020 at 04:27:32PM +0800, Bin.Cheng wrote:
> On Fri, Sep 4, 2020 at 6:37 AM Segher Boessenkool
>  wrote:
> > It should have cost, certainly, but not address_cost I think.  The total
> > cost of an ldu should be a tiny bit less than that of ld + that of addi;
> > the address_cost of ldu should be the same as that of ld.
> Hi Segher,
> In simple cases, yes, and it is also the (rough) idea of modeling
> auto-inc addressing mode in ivopts, however, things are different if
> loop gets complicated.

The address_cost function is used for many other things, not just
ivopts, so this shouldn't be done there.  That is all :-)

> Considering the case choosing 10 auto-inc
> addressing_mode/candidate vs. [base_x + iv_index].  The latter only
> needs one add instruction, while the former needs 10 embedded auto-inc
> operations.

Yeah.

> Another issue is register pressure, choosing auto-inc candidates could
> result in more IV, while choosing IV_index results in one IV (and more
> Base pointers), however, spilling base pointer (which is loop
> invariant) is usually cheaper than IV.
> Another issue is auto-inc candidates probably lead to more bloated
> setup code in the preheader BB, due to problems in expression
> canonicalization, CSE, etc..
> 
> So it's not that easy to answer the question for complicated cases.
> As for simple cases, the current model works fine with auto-inc
> (somehow) preferred.

Right, I wasn't saying that at all, sorry if I confused things.

Thanks,


Segher


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-04 Thread Kewen.Lin via Gcc-patches
Hi Segher,

>> Good question!  I agree that they can execute in parallel, but it depends
>> on how we interprete the addressing cost, if it's for required execution
>> resource, I think it's off, since comparing with ld, the ldu has two iops
>> and extra ALU requirement.
> 
> OTOH, if you do not use an ldu you need to use a real addi insn, which
> gives you all the same cost (plus it takes more code space and decode etc.
> resources).

Agreed.

> 
>> I'm not sure its usage elsewhere, but in the
>> context of IVOPTs on Power, for one normal candidate, its step cost is 4,
>> the cost for group (1) is zero, total cost is 4 for this combination.
>> for the scenario like:
>> ldx rx, iv // (1)
>> ...
>> iv = iv + step // (2)
>>
>> While for ainc_use candidate (like ldu), its step cost is 4, but the cost
>> for group (1) is (-4 // minus step cost), total cost is 0.  It looks to
>> say the step update is free.
> 
> That seems wrong, but the address_cost is used in more places, that is
> not where to fix this?

Good point, I had this question in mind too, it's used somewhere, one of
them even uses one magic number, I planned to check all its usages once
started to investigate it.  But as your comment below, this hook looks
not appropriate.

> 
>> We can also see (1) and (2) can also execute in parallel (same iteration).
>> If we consider the next iteration, it will have the dependency, but it's
>> the same for ldu.  So basically they are similar, I think it's unfair to
>> have this difference in the cost modeling.  The cracked addi should have
>> its cost here.  Does it make sense?
> 
> It should have cost, certainly, but not address_cost I think.  The total
> cost of an ldu should be a tiny bit less than that of ld + that of addi;
> the address_cost of ldu should be the same as that of ld.

OK, I'll check whether there is some other way suitable for this in the
context of IVOPTs.  Good to see that we agree on the current modeling is
a bit off on Power.  :)

> 
>> Apart from that, one P9 specific point is that the update form load isn't
>> preferred,  the reason is that the instruction can not retire until both
>> parts complete, it can hold up subsequent instructions from retiring.
>> If the addi stalls (starvation), the instruction can not retire and can
>> cause things stuck.  It seems also something we can model here?
> 
> This is (almost) no problem on p9, since we no longer have issue groups.
> It can hold up older insns from retiring, sure, but they *will* have
> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> is gone.  The only problem is if things stick around so long that
> resources run out...  but you're talking 100s of insns there.
> 

Theoretically it's fine, but the addi starvation was observed in the FP/SIMD
instructions intensive loop code, which did cause some worse performance.  :(

BR,
Kewen


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-04 Thread Bin.Cheng via Gcc-patches
On Fri, Sep 4, 2020 at 6:37 AM Segher Boessenkool
 wrote:
>
> On Thu, Sep 03, 2020 at 10:24:21AM +0800, Kewen.Lin wrote:
> > on 2020/9/2 下午6:25, Segher Boessenkool wrote:
> > > On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
> > >> on 2020/9/1 上午3:41, Segher Boessenkool wrote:
> > >>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
> >  1) Currently address_cost hook on rs6000 always return zero, but at 
> >  least
> >  from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> >  have to take the address update into account (scalar normal operation).
> > >>>
> > >>> From Power4 on already (not sure about Power6, but does anyone care?)
> > >>
> > >> Thanks for the information, it looks this issue exists for a long time.
> > >
> > > Well, *is* it an issue?  The addressing doesn't get more expensive...
> > > For example, an
> > >   ldu 3,16(4)
> > > is cracked to an
> > >   ld 3,16(4)
> > > and an
> > >   addi 4,4,16
> > > (the addi is not on the critical path of the load).  So it seems to me
> > > this shouldn't increase the addressing cost at all?  (The instruction of
> > > course is really two insns in one.)
> >
> > Good question!  I agree that they can execute in parallel, but it depends
> > on how we interprete the addressing cost, if it's for required execution
> > resource, I think it's off, since comparing with ld, the ldu has two iops
> > and extra ALU requirement.
>
> OTOH, if you do not use an ldu you need to use a real addi insn, which
> gives you all the same cost (plus it takes more code space and decode etc.
> resources).
>
> > I'm not sure its usage elsewhere, but in the
> > context of IVOPTs on Power, for one normal candidate, its step cost is 4,
> > the cost for group (1) is zero, total cost is 4 for this combination.
> > for the scenario like:
> > ldx rx, iv // (1)
> > ...
> > iv = iv + step // (2)
> >
> > While for ainc_use candidate (like ldu), its step cost is 4, but the cost
> > for group (1) is (-4 // minus step cost), total cost is 0.  It looks to
> > say the step update is free.
>
> That seems wrong, but the address_cost is used in more places, that is
> not where to fix this?
>
> > We can also see (1) and (2) can also execute in parallel (same iteration).
> > If we consider the next iteration, it will have the dependency, but it's
> > the same for ldu.  So basically they are similar, I think it's unfair to
> > have this difference in the cost modeling.  The cracked addi should have
> > its cost here.  Does it make sense?
>
> It should have cost, certainly, but not address_cost I think.  The total
> cost of an ldu should be a tiny bit less than that of ld + that of addi;
> the address_cost of ldu should be the same as that of ld.
Hi Segher,
In simple cases, yes, and it is also the (rough) idea of modeling
auto-inc addressing mode in ivopts, however, things are different if
loop gets complicated.  Considering the case choosing 10 auto-inc
addressing_mode/candidate vs. [base_x + iv_index].  The latter only
needs one add instruction, while the former needs 10 embedded auto-inc
operations.
Another issue is register pressure, choosing auto-inc candidates could
result in more IV, while choosing IV_index results in one IV (and more
Base pointers), however, spilling base pointer (which is loop
invariant) is usually cheaper than IV.
Another issue is auto-inc candidates probably lead to more bloated
setup code in the preheader BB, due to problems in expression
canonicalization, CSE, etc..

So it's not that easy to answer the question for complicated cases.
As for simple cases, the current model works fine with auto-inc
(somehow) preferred.

Thanks,
bin
>
> > Apart from that, one P9 specific point is that the update form load isn't
> > preferred,  the reason is that the instruction can not retire until both
> > parts complete, it can hold up subsequent instructions from retiring.
> > If the addi stalls (starvation), the instruction can not retire and can
> > cause things stuck.  It seems also something we can model here?
>
> This is (almost) no problem on p9, since we no longer have issue groups.
> It can hold up older insns from retiring, sure, but they *will* have
> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> is gone.  The only problem is if things stick around so long that
> resources run out...  but you're talking 100s of insns there.
>
>
> Segher


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-03 Thread Segher Boessenkool
On Thu, Sep 03, 2020 at 10:24:21AM +0800, Kewen.Lin wrote:
> on 2020/9/2 下午6:25, Segher Boessenkool wrote:
> > On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
> >> on 2020/9/1 上午3:41, Segher Boessenkool wrote:
> >>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
>  1) Currently address_cost hook on rs6000 always return zero, but at least
>  from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
>  have to take the address update into account (scalar normal operation).
> >>>
> >>> From Power4 on already (not sure about Power6, but does anyone care?)
> >>
> >> Thanks for the information, it looks this issue exists for a long time.
> > 
> > Well, *is* it an issue?  The addressing doesn't get more expensive...
> > For example, an
> >   ldu 3,16(4)
> > is cracked to an
> >   ld 3,16(4)
> > and an
> >   addi 4,4,16
> > (the addi is not on the critical path of the load).  So it seems to me
> > this shouldn't increase the addressing cost at all?  (The instruction of
> > course is really two insns in one.)
> 
> Good question!  I agree that they can execute in parallel, but it depends
> on how we interprete the addressing cost, if it's for required execution
> resource, I think it's off, since comparing with ld, the ldu has two iops
> and extra ALU requirement.

OTOH, if you do not use an ldu you need to use a real addi insn, which
gives you all the same cost (plus it takes more code space and decode etc.
resources).

> I'm not sure its usage elsewhere, but in the
> context of IVOPTs on Power, for one normal candidate, its step cost is 4,
> the cost for group (1) is zero, total cost is 4 for this combination.
> for the scenario like:
> ldx rx, iv // (1)
> ...
> iv = iv + step // (2)
> 
> While for ainc_use candidate (like ldu), its step cost is 4, but the cost
> for group (1) is (-4 // minus step cost), total cost is 0.  It looks to
> say the step update is free.

That seems wrong, but the address_cost is used in more places, that is
not where to fix this?

> We can also see (1) and (2) can also execute in parallel (same iteration).
> If we consider the next iteration, it will have the dependency, but it's
> the same for ldu.  So basically they are similar, I think it's unfair to
> have this difference in the cost modeling.  The cracked addi should have
> its cost here.  Does it make sense?

It should have cost, certainly, but not address_cost I think.  The total
cost of an ldu should be a tiny bit less than that of ld + that of addi;
the address_cost of ldu should be the same as that of ld.

> Apart from that, one P9 specific point is that the update form load isn't
> preferred,  the reason is that the instruction can not retire until both
> parts complete, it can hold up subsequent instructions from retiring.
> If the addi stalls (starvation), the instruction can not retire and can
> cause things stuck.  It seems also something we can model here?

This is (almost) no problem on p9, since we no longer have issue groups.
It can hold up older insns from retiring, sure, but they *will* have
finished, and p9 can retire 64 insns per cycle.  The "completion wall"
is gone.  The only problem is if things stick around so long that
resources run out...  but you're talking 100s of insns there.


Segher


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-02 Thread Kewen.Lin via Gcc-patches
Hi Segher,

on 2020/9/2 下午6:25, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
>> on 2020/9/1 上午3:41, Segher Boessenkool wrote:
>>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
 1) Currently address_cost hook on rs6000 always return zero, but at least
 from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
 have to take the address update into account (scalar normal operation).
>>>
>>> From Power4 on already (not sure about Power6, but does anyone care?)
>>
>> Thanks for the information, it looks this issue exists for a long time.
> 
> Well, *is* it an issue?  The addressing doesn't get more expensive...
> For example, an
>   ldu 3,16(4)
> is cracked to an
>   ld 3,16(4)
> and an
>   addi 4,4,16
> (the addi is not on the critical path of the load).  So it seems to me
> this shouldn't increase the addressing cost at all?  (The instruction of
> course is really two insns in one.)
> 

Good question!  I agree that they can execute in parallel, but it depends
on how we interprete the addressing cost, if it's for required execution
resource, I think it's off, since comparing with ld, the ldu has two iops
and extra ALU requirement.  I'm not sure its usage elsewhere, but in the
context of IVOPTs on Power, for one normal candidate, its step cost is 4,
the cost for group (1) is zero, total cost is 4 for this combination.
for the scenario like:
ldx rx, iv // (1)
...
iv = iv + step // (2)

While for ainc_use candidate (like ldu), its step cost is 4, but the cost
for group (1) is (-4 // minus step cost), total cost is 0.  It looks to
say the step update is free.

We can also see (1) and (2) can also execute in parallel (same iteration).
If we consider the next iteration, it will have the dependency, but it's
the same for ldu.  So basically they are similar, I think it's unfair to
have this difference in the cost modeling.  The cracked addi should have
its cost here.  Does it make sense?

Apart from that, one P9 specific point is that the update form load isn't
preferred,  the reason is that the instruction can not retire until both
parts complete, it can hold up subsequent instructions from retiring.
If the addi stalls (starvation), the instruction can not retire and can
cause things stuck.  It seems also something we can model here?

BR,
Kewen


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-02 Thread Segher Boessenkool
Hi!

On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
> on 2020/9/1 上午3:41, Segher Boessenkool wrote:
> > On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
> >> 1) Currently address_cost hook on rs6000 always return zero, but at least
> >> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> >> have to take the address update into account (scalar normal operation).
> > 
> > From Power4 on already (not sure about Power6, but does anyone care?)
> 
> Thanks for the information, it looks this issue exists for a long time.

Well, *is* it an issue?  The addressing doesn't get more expensive...
For example, an
  ldu 3,16(4)
is cracked to an
  ld 3,16(4)
and an
  addi 4,4,16
(the addi is not on the critical path of the load).  So it seems to me
this shouldn't increase the addressing cost at all?  (The instruction of
course is really two insns in one.)


Segher


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-01 Thread Kewen.Lin via Gcc-patches
Hi Bin,

 I've updated the patch to punt ainc_use candidates as below:

> + /* Skip AINC candidate since it contains address update itself,
> +the replicated AINC computations when unrolling still have
> +updates, unlike reg_offset_p candidates can save step updates
> +when unrolling.  */
> + if (cand->ainc_use)
> +   continue;
> +

 For this new attached patch, it's bootstrapped and regress-tested without
 explicit unrolling, while the only one failure has been identified as
 rs6000 specific address_cost issue in bootstrapping and regression testing
 with explicit unrolling.

 By the way, with above simple hack of address_cost, I also did one
 bootstrapping and regression testing with explicit unrolling, the above
 sms-4.c did pass as I expected but had two failures instead:

   PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
   PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts 
 "PHI" 2

 By further investigation, the 2nd one is expected due to the adddress_cost 
 hook
 hack, while the 1st one one exposed -fcompare-debug issue in sms.  The RTL
 sequence starts to off from sms, just some NOTE_INSN_DELETED positions 
 change.
 I believe it's just exposed by this combination unluckily/luckily ;-) I 
 will
 send a patch separately for it once I got time to look into it, but it 
 should
 be unrelated to this patch series for sure.
>>> This is the kind of situation I intended to avoid before.  IMHO, this
>>> isn't a neat change (it can't be given we are predicting the future
>>> transformation in compilation pipeline), accumulation of such changes
>>> could make IVOPTs break in one way or another.  So as long as you make
>>> sure it doesn't have functional impact in case of no-rtl_unroll, I am
>>> fine.
>>
>> Yeah, I admit it's not neat, but the proposals in the previous discussions
>> without predicting unroll factor can not work well for all scenarios with
>> different unroll factors, they could over-blame some kind of candidates.
>> For the case of no-rtl_unroll, unroll factor estimation should set
>> loop->estimated_unroll to zero, all these changes won't take effect. The
 
Oops, one correction, should set it to *one* rather than zero.  My memory... :(

>> estimation function follows the same logics as that of RTL unroll factor
>> calculation, I did test with explicit unrolling disablement before, it
>> worked expectedly.
> Thanks for working on this, also sorry for being nitpicking.

Oh, you don't!  Your constructive advices/comments make me consider/test
more scenarios/things that I didn't realize before, it's really helpful
to get this patch better, really appreciate that!!!

BR,
Kewen


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-01 Thread Bin.Cheng via Gcc-patches
On Wed, Sep 2, 2020 at 11:50 AM Kewen.Lin  wrote:
>
> Hi Bin,
>
> >> 2) This case makes me think we should exclude ainc candidates in function
> >> mark_reg_offset_candidates.  The justification is that: ainc candidate
> >> handles step update itself and when we calculate the cost for it against
> >> its ainc_use, the cost_step has been reduced. When unrolling happens,
> >> the ainc computations are replicated and it doesn't save step updates
> >> like normal reg_offset_p candidates.
> > Though auto-inc candidate embeds stepping operation into memory
> > access, we might want to avoid it in case of unroll if there are many
> > sequences of memory accesses, and if the unroll factor is big.  The
> > rationale is embedded stepping is a u-arch operation and does have its
> > cost.
> >
>
> Thanks for the comments!  Agree!  Excluding them from reg_offset_p
> candidates here is consistent with this expectation, it makes us
> consider the unroll factor effect when checking the corresponding
> step cost and the embedded stepping cost (in group/candidate cost,
> minus step cost and use the cost from the address_cost hook).
>
> >>
> >> I've updated the patch to punt ainc_use candidates as below:
> >>
> >>> + /* Skip AINC candidate since it contains address update itself,
> >>> +the replicated AINC computations when unrolling still have
> >>> +updates, unlike reg_offset_p candidates can save step updates
> >>> +when unrolling.  */
> >>> + if (cand->ainc_use)
> >>> +   continue;
> >>> +
> >>
> >> For this new attached patch, it's bootstrapped and regress-tested without
> >> explicit unrolling, while the only one failure has been identified as
> >> rs6000 specific address_cost issue in bootstrapping and regression testing
> >> with explicit unrolling.
> >>
> >> By the way, with above simple hack of address_cost, I also did one
> >> bootstrapping and regression testing with explicit unrolling, the above
> >> sms-4.c did pass as I expected but had two failures instead:
> >>
> >>   PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
> >>   PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts 
> >> "PHI" 2
> >>
> >> By further investigation, the 2nd one is expected due to the adddress_cost 
> >> hook
> >> hack, while the 1st one one exposed -fcompare-debug issue in sms.  The RTL
> >> sequence starts to off from sms, just some NOTE_INSN_DELETED positions 
> >> change.
> >> I believe it's just exposed by this combination unluckily/luckily ;-) I 
> >> will
> >> send a patch separately for it once I got time to look into it, but it 
> >> should
> >> be unrelated to this patch series for sure.
> > This is the kind of situation I intended to avoid before.  IMHO, this
> > isn't a neat change (it can't be given we are predicting the future
> > transformation in compilation pipeline), accumulation of such changes
> > could make IVOPTs break in one way or another.  So as long as you make
> > sure it doesn't have functional impact in case of no-rtl_unroll, I am
> > fine.
>
> Yeah, I admit it's not neat, but the proposals in the previous discussions
> without predicting unroll factor can not work well for all scenarios with
> different unroll factors, they could over-blame some kind of candidates.
> For the case of no-rtl_unroll, unroll factor estimation should set
> loop->estimated_unroll to zero, all these changes won't take effect. The
> estimation function follows the same logics as that of RTL unroll factor
> calculation, I did test with explicit unrolling disablement before, it
> worked expectedly.
Thanks for working on this, also sorry for being nitpicking.

Thanks,
bin


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-01 Thread Kewen.Lin via Gcc-patches
Hi Bin,

>> 2) This case makes me think we should exclude ainc candidates in function
>> mark_reg_offset_candidates.  The justification is that: ainc candidate
>> handles step update itself and when we calculate the cost for it against
>> its ainc_use, the cost_step has been reduced. When unrolling happens,
>> the ainc computations are replicated and it doesn't save step updates
>> like normal reg_offset_p candidates.
> Though auto-inc candidate embeds stepping operation into memory
> access, we might want to avoid it in case of unroll if there are many
> sequences of memory accesses, and if the unroll factor is big.  The
> rationale is embedded stepping is a u-arch operation and does have its
> cost.
> 

Thanks for the comments!  Agree!  Excluding them from reg_offset_p
candidates here is consistent with this expectation, it makes us
consider the unroll factor effect when checking the corresponding
step cost and the embedded stepping cost (in group/candidate cost,
minus step cost and use the cost from the address_cost hook).

>>
>> I've updated the patch to punt ainc_use candidates as below:
>>
>>> + /* Skip AINC candidate since it contains address update itself,
>>> +the replicated AINC computations when unrolling still have
>>> +updates, unlike reg_offset_p candidates can save step updates
>>> +when unrolling.  */
>>> + if (cand->ainc_use)
>>> +   continue;
>>> +
>>
>> For this new attached patch, it's bootstrapped and regress-tested without
>> explicit unrolling, while the only one failure has been identified as
>> rs6000 specific address_cost issue in bootstrapping and regression testing
>> with explicit unrolling.
>>
>> By the way, with above simple hack of address_cost, I also did one
>> bootstrapping and regression testing with explicit unrolling, the above
>> sms-4.c did pass as I expected but had two failures instead:
>>
>>   PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
>>   PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2
>>
>> By further investigation, the 2nd one is expected due to the adddress_cost 
>> hook
>> hack, while the 1st one one exposed -fcompare-debug issue in sms.  The RTL
>> sequence starts to off from sms, just some NOTE_INSN_DELETED positions 
>> change.
>> I believe it's just exposed by this combination unluckily/luckily ;-) I will
>> send a patch separately for it once I got time to look into it, but it should
>> be unrelated to this patch series for sure.
> This is the kind of situation I intended to avoid before.  IMHO, this
> isn't a neat change (it can't be given we are predicting the future
> transformation in compilation pipeline), accumulation of such changes
> could make IVOPTs break in one way or another.  So as long as you make
> sure it doesn't have functional impact in case of no-rtl_unroll, I am
> fine.

Yeah, I admit it's not neat, but the proposals in the previous discussions
without predicting unroll factor can not work well for all scenarios with
different unroll factors, they could over-blame some kind of candidates.
For the case of no-rtl_unroll, unroll factor estimation should set
loop->estimated_unroll to zero, all these changes won't take effect. The
estimation function follows the same logics as that of RTL unroll factor
calculation, I did test with explicit unrolling disablement before, it
worked expectedly.

BR,
Kewen


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-01 Thread Kewen.Lin via Gcc-patches
Hi Segher,

on 2020/9/1 上午3:41, Segher Boessenkool wrote:
> Hi!
> 
> Just a note:
> 
> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
>> 1) Currently address_cost hook on rs6000 always return zero, but at least
>> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
>> have to take the address update into account (scalar normal operation).
> 
> From Power4 on already (not sure about Power6, but does anyone care?)
> 

Thanks for the information, it looks this issue exists for a long time.

BR,
Kewen


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-01 Thread Bin.Cheng via Gcc-patches
On Tue, Aug 25, 2020 at 8:47 PM Kewen.Lin  wrote:
>
> Hi Bin,
>
> >>
> >> For one particular case like:
> >>
> >> for (i = 0; i < SIZE; i++)
> >>   y[i] = a * x[i] + z[i];
> >>
> >> we will mark reg_offset_p for IV candidates on x as below:
> >>- (unsigned long) (x_18(D) + 8)// only mark this before.
> >>- x_18(D) + 8
> >>- (unsigned long) (x_18(D) + 24)
> >>- (unsigned long) ((vector(2) double *) (x_18(D) + 8) + 
> >> 18446744073709551600)
> >>...
> >>
> >> Do you mind to have a review again?  Thanks in advance!
> > I trust you with the change.
>
> Thanks again!  Sorry for the late since it took some time to investigate
> the exposed issues.
>
> >>
> >> Bootstrapped/regtested on powerpc64le-linux-gnu P8 and P9.
> >>
> >> SPEC2017 P9 performance run has no remarkable degradations/improvements.
> > Is this run with unroll-loops?
>
> Yes, the options what I used were:
>
>-Ofast -mcpu=power9 -fpeel-loops -mrecip -funroll-loops
>
> I also re-tested the newly attached patch, nothing changes for SPEC2017 data.
>
> > Could you exercise the code with unroll-loops enabled when
> > bootstrap/regtest please?  It doesn't matter if cases fail with
> > unroll-loops, just making sure there is no fallout.  Otherwise it's
> > fine with me.
>
> Great idea!  With explicitly specified -funroll-loops, it's bootstrapped
> but the regression testing did show one failure (the only one):
>
>   PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1
>
> It exposes two issues:
>
> 1) Currently address_cost hook on rs6000 always return zero, but at least
> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> have to take the address update into account (scalar normal operation).
> Since IVOPTs reduces the cost_step for ainc candidates, it makes us prefer
> ainc candidates.  In this case, the cand/group cost is -4 (minus cost_step),
> with scaling up, the off becomes much.  With one simple hack on for pre_inc/
> pre_dec in rs6000 address_cost, the case passed.  It should be handled in
> one separated issue.
>
> 2) This case makes me think we should exclude ainc candidates in function
> mark_reg_offset_candidates.  The justification is that: ainc candidate
> handles step update itself and when we calculate the cost for it against
> its ainc_use, the cost_step has been reduced. When unrolling happens,
> the ainc computations are replicated and it doesn't save step updates
> like normal reg_offset_p candidates.
Though auto-inc candidate embeds stepping operation into memory
access, we might want to avoid it in case of unroll if there are many
sequences of memory accesses, and if the unroll factor is big.  The
rationale is embedded stepping is a u-arch operation and does have its
cost.

>
> I've updated the patch to punt ainc_use candidates as below:
>
> > + /* Skip AINC candidate since it contains address update itself,
> > +the replicated AINC computations when unrolling still have
> > +updates, unlike reg_offset_p candidates can save step updates
> > +when unrolling.  */
> > + if (cand->ainc_use)
> > +   continue;
> > +
>
> For this new attached patch, it's bootstrapped and regress-tested without
> explicit unrolling, while the only one failure has been identified as
> rs6000 specific address_cost issue in bootstrapping and regression testing
> with explicit unrolling.
>
> By the way, with above simple hack of address_cost, I also did one
> bootstrapping and regression testing with explicit unrolling, the above
> sms-4.c did pass as I expected but had two failures instead:
>
>   PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
>   PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2
>
> By further investigation, the 2nd one is expected due to the adddress_cost 
> hook
> hack, while the 1st one one exposed -fcompare-debug issue in sms.  The RTL
> sequence starts to off from sms, just some NOTE_INSN_DELETED positions change.
> I believe it's just exposed by this combination unluckily/luckily ;-) I will
> send a patch separately for it once I got time to look into it, but it should
> be unrelated to this patch series for sure.
This is the kind of situation I intended to avoid before.  IMHO, this
isn't a neat change (it can't be given we are predicting the future
transformation in compilation pipeline), accumulation of such changes
could make IVOPTs break in one way or another.  So as long as you make
sure it doesn't have functional impact in case of no-rtl_unroll, I am
fine.
>
> In a word, bootstrapping/regress-testing with unroll-loops enabled shows this
> patch looks fine.
>
> BR,
> Kewen
> -
> gcc/ChangeLog:
>
> * tree-ssa-loop-ivopts.c (struct iv_cand): New field reg_offset_p.
> (struct ivopts_data): New field consider_reg_offset_for_unroll_p.
> (mark_reg_offset_candidates): New function.
> (add_candidate_1): Set reg

Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-08-31 Thread Segher Boessenkool
Hi!

Just a note:

On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
> 1) Currently address_cost hook on rs6000 always return zero, but at least
> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> have to take the address update into account (scalar normal operation).

>From Power4 on already (not sure about Power6, but does anyone care?)


Segher


[PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-08-25 Thread Kewen.Lin via Gcc-patches
Hi Bin,

>>
>> For one particular case like:
>>
>> for (i = 0; i < SIZE; i++)
>>   y[i] = a * x[i] + z[i];
>>
>> we will mark reg_offset_p for IV candidates on x as below:
>>- (unsigned long) (x_18(D) + 8)// only mark this before.
>>- x_18(D) + 8
>>- (unsigned long) (x_18(D) + 24)
>>- (unsigned long) ((vector(2) double *) (x_18(D) + 8) + 
>> 18446744073709551600)
>>...
>>
>> Do you mind to have a review again?  Thanks in advance!
> I trust you with the change.

Thanks again!  Sorry for the late since it took some time to investigate
the exposed issues.

>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu P8 and P9.
>>
>> SPEC2017 P9 performance run has no remarkable degradations/improvements.
> Is this run with unroll-loops?

Yes, the options what I used were: 

   -Ofast -mcpu=power9 -fpeel-loops -mrecip -funroll-loops

I also re-tested the newly attached patch, nothing changes for SPEC2017 data.

> Could you exercise the code with unroll-loops enabled when
> bootstrap/regtest please?  It doesn't matter if cases fail with
> unroll-loops, just making sure there is no fallout.  Otherwise it's
> fine with me.

Great idea!  With explicitly specified -funroll-loops, it's bootstrapped
but the regression testing did show one failure (the only one):

  PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1

It exposes two issues:

1) Currently address_cost hook on rs6000 always return zero, but at least
from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
have to take the address update into account (scalar normal operation).
Since IVOPTs reduces the cost_step for ainc candidates, it makes us prefer
ainc candidates.  In this case, the cand/group cost is -4 (minus cost_step),
with scaling up, the off becomes much.  With one simple hack on for pre_inc/
pre_dec in rs6000 address_cost, the case passed.  It should be handled in
one separated issue.

2) This case makes me think we should exclude ainc candidates in function
mark_reg_offset_candidates.  The justification is that: ainc candidate
handles step update itself and when we calculate the cost for it against
its ainc_use, the cost_step has been reduced. When unrolling happens,
the ainc computations are replicated and it doesn't save step updates 
like normal reg_offset_p candidates.

I've updated the patch to punt ainc_use candidates as below:

> + /* Skip AINC candidate since it contains address update itself,
> +the replicated AINC computations when unrolling still have
> +updates, unlike reg_offset_p candidates can save step updates
> +when unrolling.  */
> + if (cand->ainc_use)
> +   continue;
> +

For this new attached patch, it's bootstrapped and regress-tested without
explicit unrolling, while the only one failure has been identified as
rs6000 specific address_cost issue in bootstrapping and regression testing
with explicit unrolling.

By the way, with above simple hack of address_cost, I also did one
bootstrapping and regression testing with explicit unrolling, the above
sms-4.c did pass as I expected but had two failures instead:

  PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
  PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2

By further investigation, the 2nd one is expected due to the adddress_cost hook
hack, while the 1st one one exposed -fcompare-debug issue in sms.  The RTL
sequence starts to off from sms, just some NOTE_INSN_DELETED positions change.
I believe it's just exposed by this combination unluckily/luckily ;-) I will
send a patch separately for it once I got time to look into it, but it should
be unrelated to this patch series for sure.

In a word, bootstrapping/regress-testing with unroll-loops enabled shows this
patch looks fine.

BR,
Kewen
-
gcc/ChangeLog:

* tree-ssa-loop-ivopts.c (struct iv_cand): New field reg_offset_p.
(struct ivopts_data): New field consider_reg_offset_for_unroll_p.
(mark_reg_offset_candidates): New function.
(add_candidate_1): Set reg_offset_p to false for new candidate.
(set_group_iv_cost): Scale up group cost with estimate_unroll_factor if
consider_reg_offset_for_unroll_p.
(determine_iv_cost): Increase step cost with estimate_unroll_factor if
consider_reg_offset_for_unroll_p.
(tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
consider_reg_offset_for_unroll_p.
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 1d2697ae1ba..4b58b620ddd 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -473,6 +473,9 @@ struct iv_cand
   struct iv *orig_iv;  /* The original iv if this cand is added from biv with
   smaller type.  */
   bool doloop_p;   /* Whether this is a doloop candidate.  */
+  bool reg_offset_p;   /* Whether this is available for an address type group