[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem

2017-10-11 Thread wilson at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

Jim Wilson changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #10 from Jim Wilson  ---
Patch checked in to fix it.

[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem

2017-10-10 Thread wilson at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #9 from Jim Wilson  ---
Author: wilson
Date: Wed Oct 11 03:23:41 2017
New Revision: 253628

URL: https://gcc.gnu.org/viewcvs?rev=253628&root=gcc&view=rev
Log:
Allow 2 insns from sched group to issue in same cycle, if no stalls needed.

gcc/
PR rtl-optimization/81434
* haifa-sched.c (prune_ready_list): Init min_cost_group to 0.  Update
comment for main loop.  In sched_group_found if, also add checks for
pass and min_cost_group.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/haifa-sched.c

[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem

2017-08-13 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu.org

--- Comment #9 from Andrew Pinski  ---
(In reply to jim.wilson from comment #7) 
> If we represent a fused pair as a single instruction, then we are
> getting the issue count wrong, as they take two issue slots, but one
> function unit slot.

It depends on how the fusion works.  On the processor I am working on, the
fused instruction will take only one issue slot but two decode slots.

>  However, there is a way to deal with the issue
> count.  We could use TARGET_SCHED_VARIABLE_ISSUE to make the single
> fused insn take two issue slots.  I already wrote a patch like that
> for a different reason as an experiment so I know this can work.

If this is done, please make it dependent on the micro-arch.

I have another issue which I am trying to figure out how to handle.  In some
cases, the instructions don't take up any issue slots at all and their latency
is zero, but only if there is one of those per dispatch group.

[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem

2017-07-20 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #8 from Wilco  ---
(In reply to jim.wilson from comment #7)
> On Thu, Jul 20, 2017 at 4:20 AM, wilco at gcc dot gnu.org wrote:
> > Do you think it might be feasible to update resource usage of a schedule
> > group?  Or would it be easier to replace a fused pair with a single
> > instruction with correct resource usage, and expand after scheduling in
> > say split5?
> >
> > For some cases (single destination reg like ADRP/ADD, AES, MOV/MOVK) it
> > would be simpler to treat them as a single instruction from early on.
>
> I haven't looked at this issue yet.  This problem wastes one issue
> slot per cycle, whereas the SCHED_GROUP problem wastes up to N-1
> issue slots per cycle, where N is the issue rate.  For Falkor, this is
> up to 3 issue slots per cycle.  Since the SCHED_GROUP problem is more
> serious, I looked at that one first.
> 
> Thinking about this a bit, I don't know if there is an easy way to
> correct resource usage for a fused pair if represented as two insns.
> If we represent a fused pair as a single instruction, then we are
> getting the issue count wrong, as they take two issue slots, but one
> function unit slot.  However, there is a way to deal with the issue
> count.  We could use TARGET_SCHED_VARIABLE_ISSUE to make the single
> fused insn take two issue slots.  I already wrote a patch like that
> for a different reason as an experiment so I know this can work.  We
> may also have to worry about instruction lengths, depending on when
> exactly we split the fused insn into two insns.  There could be some
> other details that turn up when we try to implement this.  What do we
> do if a fused insn takes the last issue slot for instance?  Maybe we
> try to prevent that, or maybe we reduce the issue rate by one for the
> next cycle.  This may depend on how the hardware implements insn
> fusing.

Yes, the details depend on how the hardware implements fusion, but based on
the tuning I did on the Cortex-A53 model I'd say that you get good schedules
with a reasonable approximation.  So if it is the last issue slot then it may
be best to force it to the next cycle like you say - that's no worse than if
the fusion didn't happen.
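
A minimal sketch of the "force it to the next cycle" approximation, written
as a toy Python model rather than GCC code.  The function name and the
parameters (issue_rate, fused_slots) are hypothetical, and the model assumes
a fused insn needs fused_slots issue slots and that issue_rate is at least
that large.

```python
def issue_fused(cycle, slots_left, issue_rate, fused_slots=2):
    """Issue one fused insn; return (cycle, slots_left) afterwards.

    If the fused insn does not fit in the slots remaining in the current
    cycle, it is pushed to the start of the next cycle instead of being
    split across two cycles.
    """
    if slots_left < fused_slots:
        cycle += 1               # give up the leftover slot(s)
        slots_left = issue_rate  # a fresh cycle has all slots free
    return cycle, slots_left - fused_slots
```

With an issue rate of 4, a fused insn arriving with one slot left is placed
in the next cycle, while one arriving with three slots left stays in the
current cycle.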

[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem

2017-07-20 Thread jim.wilson at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #7 from jim.wilson at linaro dot org ---
On Thu, Jul 20, 2017 at 4:20 AM, wilco at gcc dot gnu.org wrote:
> Do you think it might be feasible to update resource usage of a schedule
> group?  Or would it be easier to replace a fused pair with a single
> instruction with correct resource usage, and expand after scheduling in
> say split5?
>
> For some cases (single destination reg like ADRP/ADD, AES, MOV/MOVK) it would
> be simpler to treat them as a single instruction from early on.

I haven't looked at this issue yet.  This problem wastes one issue
slot per cycle, whereas the SCHED_GROUP problem wastes up to N-1
issue slots per cycle, where N is the issue rate.  For Falkor, this is
up to 3 issue slots per cycle.  Since the SCHED_GROUP problem is more
serious, I looked at that one first.

Thinking about this a bit, I don't know if there is an easy way to
correct resource usage for a fused pair if represented as two insns.
If we represent a fused pair as a single instruction, then we are
getting the issue count wrong, as they take two issue slots, but one
function unit slot.  However, there is a way to deal with the issue
count.  We could use TARGET_SCHED_VARIABLE_ISSUE to make the single
fused insn take two issue slots.  I already wrote a patch like that
for a different reason as an experiment so I know this can work.  We
may also have to worry about instruction lengths, depending on when
exactly we split the fused insn into two insns.  There could be some
other details that turn up when we try to implement this.  What do we
do if a fused insn takes the last issue slot for instance?  Maybe we
try to prevent that, or maybe we reduce the issue rate by one for the
next cycle.  This may depend on how the hardware implements insn
fusing.
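
The issue-count idea can be sketched with a toy Python model.  This is not
GCC code: variable_issue only mimics the general shape of a
TARGET_SCHED_VARIABLE_ISSUE hook (given the slots still free, return how
many remain after issuing one insn).  The names and the fused_slots
parameter are hypothetical; fused_slots stands in for the
hardware-dependent cost of a fused insn.

```python
def variable_issue(is_fused, more, fused_slots=2):
    """Return the issue slots left in this cycle after issuing one insn.

    A normal insn costs one slot; a fused insn costs fused_slots.
    The result is clamped at zero so the count never goes negative.
    """
    cost = fused_slots if is_fused else 1
    return max(more - cost, 0)


def slots_left_after(insns, issue_rate, fused_slots=2):
    """Apply variable_issue to each insn of one cycle, in order.

    insns is a list of booleans: True for a fused insn, False otherwise.
    """
    more = issue_rate
    for is_fused in insns:
        more = variable_issue(is_fused, more, fused_slots)
    return more
```

For example, with an issue rate of 4, one fused insn followed by one normal
insn leaves a single slot free.  Passing fused_slots=1 models hardware on
which a fused insn occupies only one issue slot.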

[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem

2017-07-20 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

Wilco changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-07-20
     Ever confirmed|0                           |1

--- Comment #6 from Wilco  ---
(In reply to jim.wilson from comment #5)
> On Wed, Jul 19, 2017 at 4:25 AM, wilco at gcc dot gnu.org wrote:
> > To more accurately schedule fusion pairs wouldn't we need to specify the
> > scheduling behaviour of the group as well? From the dumps below it schedules
> > the adrp/add in the same cycle if we're lucky, but it is modelled as using 2
> > ALUs rather than a new single fused instruction.
> 
> The fusion pair takes two issue slots and uses one ALU, but is modeled
> as taking two issue slots and using two ALUs.  I haven't tried to
> address this problem.  I'm just trying to get the issue slot handling
> correct for now, so that they can issue in the same cycle.

Do you think it might be feasible to update resource usage of a schedule group?
Or would it be easier to replace a fused pair with a single instruction with
correct resource usage, and expand after scheduling in say split5?

For some cases (single destination reg like ADRP/ADD, AES, MOV/MOVK) it would
be simpler to treat them as a single instruction from early on.

> > Also is your change fixing the issue that the scheduler cannot schedule 2
> > instructions with a zero latency dependency between them in the same cycle?
> > That's a very similar bug.
> 
> The scheduler can schedule 2 insns w/ zero latency dependency in the
> same cycle.  However, it does not do so when a SCHED_GROUP is
> involved.  This is the bug that my patch fixes.

You're right.  Now that I have tried a few cases, it works fine on non-fused
instructions.  I can't find my original example, but it's possible it was
caused by fused instructions messing up the schedule.

[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem

2017-07-19 Thread jim.wilson at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #5 from jim.wilson at linaro dot org ---
On Wed, Jul 19, 2017 at 4:25 AM, wilco at gcc dot gnu.org wrote:
> To more accurately schedule fusion pairs wouldn't we need to specify the
> scheduling behaviour of the group as well? From the dumps below it schedules
> the adrp/add in the same cycle if we're lucky, but it is modelled as using 2
> ALUs rather than a new single fused instruction.

The fusion pair takes two issue slots and uses one ALU, but is modeled
as taking two issue slots and using two ALUs.  I haven't tried to
address this problem.  I'm just trying to get the issue slot handling
correct for now, so that they can issue in the same cycle.
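
A toy Python model of the resource mismatch, under stated assumptions: the
machine size (2 ALUs, issue rate 4), the function names, and the
(name, fused_with_previous) tuple layout are all hypothetical and are not
taken from any GCC pipeline description.

```python
def alus_needed(insns, fused_pairs_share_alu):
    """Count the ALU reservations for the insns of one cycle.

    Each entry is (name, fused_with_previous).  When fused pairs share an
    ALU, the second half of a pair adds no reservation of its own.
    """
    count = 0
    for _name, fused_with_previous in insns:
        if fused_with_previous and fused_pairs_share_alu:
            continue  # rides on the ALU already reserved by the first half
        count += 1
    return count


def fits_in_cycle(insns, num_alus, issue_rate, fused_pairs_share_alu):
    """True if the insns fit both the issue slots and the ALUs of a cycle."""
    return (len(insns) <= issue_rate
            and alus_needed(insns, fused_pairs_share_alu) <= num_alus)
```

For a cycle containing adrp, a fused add, and an independent sub, the
two-ALU accounting rejects the cycle on a 2-ALU machine, while the
shared-ALU accounting accepts it.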

> Also is your change fixing the issue that the scheduler cannot schedule 2
> instructions with a zero latency dependency between them in the same cycle?
> That's a very similar bug.

The scheduler can schedule 2 insns w/ zero latency dependency in the
same cycle.  However, it does not do so when a SCHED_GROUP is
involved.  This is the bug that my patch fixes.

Jim
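
The effect described above can be illustrated with a toy in-order issue
model in Python.  It is not the haifa-sched.c algorithm: it ignores
latencies, dependencies, and function units, and the Insn class, the
schedule function, and the group_forces_new_cycle flag are hypothetical
names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    name: str
    sched_group: bool = False  # must issue right after the previous insn


def schedule(insns, issue_rate, group_forces_new_cycle):
    """Return the cycle assigned to each insn, in program order.

    With group_forces_new_cycle=True, an insn that belongs to a sched
    group is always pushed to the next cycle, even if slots remain.
    With False, it may share the cycle of the insn it follows.
    """
    cycles = []
    cycle, used = 0, 0
    for insn in insns:
        stall = insn.sched_group and group_forces_new_cycle
        if used == issue_rate or (stall and used > 0):
            cycle += 1
            used = 0
        cycles.append(cycle)
        used += 1
    return cycles


block = [Insn("adrp"), Insn("add", sched_group=True),
         Insn("ldr"), Insn("mul")]
```

With an issue rate of 4, forcing the grouped insn into a new cycle leaves
three slots of the first cycle unused, whereas allowing it to share the
cycle lets all four insns issue together.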

[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem

2017-07-19 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

Wilco changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #4 from Wilco  ---
To more accurately schedule fusion pairs wouldn't we need to specify the
scheduling behaviour of the group as well? From the dumps below it schedules
the adrp/add in the same cycle if we're lucky, but it is modelled as using 2
ALUs rather than a new single fused instruction.

Also is your change fixing the issue that the scheduler cannot schedule 2
instructions with a zero latency dependency between them in the same cycle?
That's a very similar bug.

[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem

2017-07-13 Thread wilson at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #3 from Jim Wilson  ---
Created attachment 41754
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41754&action=edit
Assembly output with patch.

[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem

2017-07-13 Thread wilson at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #2 from Jim Wilson  ---
Created attachment 41753
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41753&action=edit
Assembly output without patch.

[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem

2017-07-13 Thread wilson at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #1 from Jim Wilson  ---
Created attachment 41752
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41752&action=edit
Proposed patch to fix scheduler/fusing problem.