[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

Jim Wilson changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #10 from Jim Wilson ---
Patch checked in to fix it.
[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #9 from Jim Wilson ---
Author: wilson
Date: Wed Oct 11 03:23:41 2017
New Revision: 253628

URL: https://gcc.gnu.org/viewcvs?rev=253628&root=gcc&view=rev
Log:
Allow 2 insns from sched group to issue in same cycle, if no stalls needed.

        gcc/
        PR rtl-optimization/81434
        * haifa-sched.c (prune_ready_list): Init min_cost_group to 0.  Update
        comment for main loop.  In sched_group_found if, also add checks for
        pass and min_cost_group.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/haifa-sched.c
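For readers who want a feel for what the log message means, here is a toy Python model of the effect. It is not the haifa-sched.c logic: the function name, the flat in-order insn list, and the rule that a blocked group member simply starts a new cycle are all assumptions made for illustration.

```python
def count_cycles(is_group_follower, issue_rate, allow_same_cycle):
    """Count cycles needed to issue insns in order (toy model, not GCC code).

    is_group_follower: one bool per insn, True if the insn is the second
        half of a fused pair (the insn a SCHED_GROUP would tie to its leader).
    allow_same_cycle: False models the old behaviour, where the follower
        never shares a cycle with earlier insns; True models the fix.
    """
    if not is_group_follower:
        return 0
    cycles = 1
    used = 0  # issue slots consumed in the current cycle
    for follower in is_group_follower:
        blocked = follower and not allow_same_cycle and used > 0
        if used == issue_rate or blocked:
            cycles += 1  # remaining slots of the old cycle are wasted
            used = 0
        used += 1
    return cycles


# Four fused pairs on a hypothetical 4-issue machine.
pairs = [False, True] * 4
print(count_cycles(pairs, 4, allow_same_cycle=False))  # 5
print(count_cycles(pairs, 4, allow_same_cycle=True))   # 2
```

In this model the old behaviour can leave up to issue_rate - 1 slots empty in a cycle, which matches the N-1 figure quoted elsewhere in the thread.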
[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu.org

--- Comment #9 from Andrew Pinski ---
(In reply to jim.wilson from comment #7)
> If we represent a fused pair as a single instruction, then we are
> getting the issue count wrong, as they take two issue slots, but one
> function unit slot.

It depends on how the fusion works. On the processor I am working on, the
fused instruction will take only one issue slot but two decode slots.

> However, there is a way to deal with the issue
> count. We could use TARGET_SCHED_VARIABLE_ISSUE to make the single
> fused insn take two issue slots. I already wrote a patch like that
> for a different reason as an experiment so I know this can work.

If this is done, please make it dependent on the micro-arch.

I have another issue which I am trying to figure out how to handle. In some
cases, the instructions don't take up an issue slot at all and their latency
is zero, but only if there is one of them per dispatch group.
[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #8 from Wilco ---
(In reply to jim.wilson from comment #7)
> On Thu, Jul 20, 2017 at 4:20 AM, wilco at gcc dot gnu.org wrote:
> > Do you think it might be feasible to update resource usage of a schedule
> > group?
> > Or would it be easier to replace a fused pair with a single instruction with
> > correct resource usage, and expand after scheduling in say split5?
> >
> > For some cases (single destination reg like ADRP/ADD, AES, MOV/MOVK) it
> > would be simpler to treat them as a single instruction from early on.
>
> I haven't looked at this issue yet. This problem wastes one issue
> slot per cycle, whereas the SCHED_GROUP problem wastes up to N-1
> issue slots per cycle, where N is the issue rate. For falkor, this is
> up to 3 issue slots per cycle. Since the SCHED_GROUP problem is more
> serious, I looked at that one first.
>
> Thinking about this a bit, I don't know if there is an easy way to
> correct resource usage for a fused pair if represented as two insns.
> If we represent a fused pair as a single instruction, then we are
> getting the issue count wrong, as they take two issue slots, but one
> function unit slot. However, there is a way to deal with the issue
> count. We could use TARGET_SCHED_VARIABLE_ISSUE to make the single
> fused insn take two issue slots. I already wrote a patch like that
> for a different reason as an experiment so I know this can work. We
> may also have to worry about instruction lengths, depending on when
> exactly we split the fused insn into two insns. There could be some
> other details that turn up when we try to implement this. What do we
> do if a fused insn takes the last issue slot for instance? Maybe we
> try to prevent that, or maybe we reduce the issue rate by one for the
> next cycle. This may depend on how the hardware implements insn
> fusing.

Yes, the details depend on how the hardware implements fusion, but based on
the tuning I did on a Cortex-A53 model I'd say that you get good schedules
with a reasonable approximation. So if the fused insn would take the last
issue slot, it may be best to force it to the next cycle like you say; that's
no worse than if the fusion didn't happen.
[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #7 from jim.wilson at linaro dot org ---
On Thu, Jul 20, 2017 at 4:20 AM, wilco at gcc dot gnu.org wrote:
> Do you think it might be feasible to update resource usage of a schedule
> group?
> Or would it be easier to replace a fused pair with a single instruction with
> correct resource usage, and expand after scheduling in say split5?
>
> For some cases (single destination reg like ADRP/ADD, AES, MOV/MOVK) it would
> be simpler to treat them as a single instruction from early on.

I haven't looked at this issue yet. This problem wastes one issue
slot per cycle, whereas the SCHED_GROUP problem wastes up to N-1
issue slots per cycle, where N is the issue rate. For falkor, this is
up to 3 issue slots per cycle. Since the SCHED_GROUP problem is more
serious, I looked at that one first.

Thinking about this a bit, I don't know if there is an easy way to
correct resource usage for a fused pair if represented as two insns.
If we represent a fused pair as a single instruction, then we are
getting the issue count wrong, as they take two issue slots, but one
function unit slot. However, there is a way to deal with the issue
count. We could use TARGET_SCHED_VARIABLE_ISSUE to make the single
fused insn take two issue slots. I already wrote a patch like that
for a different reason as an experiment so I know this can work. We
may also have to worry about instruction lengths, depending on when
exactly we split the fused insn into two insns. There could be some
other details that turn up when we try to implement this. What do we
do if a fused insn takes the last issue slot for instance? Maybe we
try to prevent that, or maybe we reduce the issue rate by one for the
next cycle. This may depend on how the hardware implements insn
fusing.
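The two options for the last issue slot can be compared with a small Python sketch. This is only a toy model: `variable_issue` imitates the contract of TARGET_SCHED_VARIABLE_ISSUE (return how many insns can still issue this cycle), while `schedule`, the policy names "defer" and "borrow", and the slot-cost list are hypothetical and are not taken from any patch.

```python
def variable_issue(more, slots_needed):
    """Toy stand-in for the hook: slots left this cycle after one insn."""
    return more - slots_needed


def schedule(slot_costs, issue_rate, policy):
    """Assign a cycle number to each insn, in order (toy model).

    slot_costs: 1 for a plain insn, 2 for a fused pair kept as one insn.
    policy: "defer"  -> a fused insn that does not fit waits a cycle;
            "borrow" -> it takes the last slot and the next cycle has
                        one slot fewer.
    """
    if policy not in ("defer", "borrow"):
        raise ValueError(policy)
    if issue_rate < 2:
        raise ValueError("model assumes an issue rate of at least 2")
    cycles = []
    cycle = 0
    more = issue_rate
    for cost in slot_costs:
        if more <= 0 or (policy == "defer" and cost > more):
            debt = -more if more < 0 else 0  # slots borrowed from next cycle
            cycle += 1
            more = issue_rate - debt
        more = variable_issue(more, cost)
        cycles.append(cycle)
    return cycles


insns = [1, 1, 1, 2, 1, 1, 1]  # the fused insn arrives with one slot left
print(schedule(insns, 4, "defer"))   # [0, 0, 0, 1, 1, 1, 2]
print(schedule(insns, 4, "borrow"))  # [0, 0, 0, 0, 1, 1, 1]
```

Which policy is closer to reality depends on the hardware, as noted above; the sketch only shows that both are easy to express once the fused insn is charged two slots.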
[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

Wilco changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-07-20
     Ever confirmed|0                           |1

--- Comment #6 from Wilco ---
(In reply to jim.wilson from comment #5)
> On Wed, Jul 19, 2017 at 4:25 AM, wilco at gcc dot gnu.org wrote:
> > To more accurately schedule fusion pairs wouldn't we need to specify the
> > scheduling behaviour of the group as well? From the dumps below it schedules
> > the adrp/add in the same cycle if we're lucky, but it is modelled as using 2
> > ALUs rather than a new single fused instruction.
>
> The fusion pair takes two issue slots and uses one alu, but is modeled
> as taking two issue slots and using two alus. I haven't tried to
> address this problem. I'm just trying to get the issue slot handling
> correct for now, so that they can issue in the same cycle.

Do you think it might be feasible to update resource usage of a schedule
group? Or would it be easier to replace a fused pair with a single
instruction with correct resource usage, and expand after scheduling in say
split5?

For some cases (single destination reg like ADRP/ADD, AES, MOV/MOVK) it would
be simpler to treat them as a single instruction from early on.

> > Also is your change fixing the issue that the scheduler cannot schedule 2
> > instructions with a zero latency dependency between them in the same cycle?
> > That's a very similar bug.
>
> The scheduler can schedule 2 insns w/ zero latency dependency in the
> same cycle. However, it does not do so when a SCHED_GROUP is
> involved. This is the bug that my patch fixes.

You're right: now that I try a few cases, it works fine on non-fused
instructions. I can't find my original example, but it's possible it was
caused by fused instructions messing up the schedule.
[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #5 from jim.wilson at linaro dot org ---
On Wed, Jul 19, 2017 at 4:25 AM, wilco at gcc dot gnu.org wrote:
> To more accurately schedule fusion pairs wouldn't we need to specify the
> scheduling behaviour of the group as well? From the dumps below it schedules
> the adrp/add in the same cycle if we're lucky, but it is modelled as using 2
> ALUs rather than a new single fused instruction.

The fusion pair takes two issue slots and uses one alu, but is modeled
as taking two issue slots and using two alus. I haven't tried to
address this problem. I'm just trying to get the issue slot handling
correct for now, so that they can issue in the same cycle.

> Also is your change fixing the issue that the scheduler cannot schedule 2
> instructions with a zero latency dependency between them in the same cycle?
> That's a very similar bug.

The scheduler can schedule 2 insns w/ zero latency dependency in the
same cycle. However, it does not do so when a SCHED_GROUP is
involved. This is the bug that my patch fixes.

Jim
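To see why the ALU accounting matters even once the issue slots are right, here is a rough back-of-the-envelope model in Python. The function name and the machine parameters (4 issue slots, 2 ALUs) are made up for illustration and do not describe any particular core or pipeline description.

```python
def cycles_for_pairs(num_pairs, issue_rate, num_alus, alus_per_pair):
    """Cycles to issue num_pairs fused pairs under a toy resource model.

    Each pair always takes two issue slots; alus_per_pair is 1 for the
    real hardware behaviour and 2 for the way the pair is modelled when
    it is kept as two separate insns.
    """
    pairs_by_issue = issue_rate // 2
    pairs_by_alu = num_alus // alus_per_pair
    per_cycle = min(pairs_by_issue, pairs_by_alu)
    if per_cycle == 0:
        raise ValueError("machine cannot issue a fused pair in one cycle")
    return -(-num_pairs // per_cycle)  # ceiling division


# Hypothetical machine: 4 issue slots and 2 ALUs per cycle.
print(cycles_for_pairs(4, 4, 2, alus_per_pair=2))  # 4 (as modelled)
print(cycles_for_pairs(4, 4, 2, alus_per_pair=1))  # 2 (as the hardware behaves)
```

Under these assumed numbers the scheduler's model is twice as pessimistic as the hardware for back-to-back fused pairs, which is the remaining problem left open here.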
[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

Wilco changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #4 from Wilco ---
To more accurately schedule fusion pairs wouldn't we need to specify the
scheduling behaviour of the group as well? From the dumps below it schedules
the adrp/add in the same cycle if we're lucky, but it is modelled as using 2
ALUs rather than a new single fused instruction.

Also is your change fixing the issue that the scheduler cannot schedule 2
instructions with a zero latency dependency between them in the same cycle?
That's a very similar bug.
[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #3 from Jim Wilson ---
Created attachment 41754
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41754&action=edit
Assembly output with patch.
[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #2 from Jim Wilson ---
Created attachment 41753
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41753&action=edit
Assembly output without patch.
[Bug rtl-optimization/81434] AArch64 instruction fusing and pipeline scheduling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

--- Comment #1 from Jim Wilson ---
Created attachment 41752
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41752&action=edit
Proposed patch to fix scheduler/fusing problem.