[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #23 from collison at gcc dot gnu.org ---
Author: collison
Date: Tue Mar 10 07:34:20 2015
New Revision: 221302

URL: https://gcc.gnu.org/viewcvs?rev=221302&root=gcc&view=rev
Log:
2015-03-10  Michael Collison  michael.colli...@linaro.org

    Backport from trunk r217780.
    2014-11-19  Wilco Dijkstra  wdijk...@arm.com

    PR target/61915
    * config/aarch64/aarch64.c (generic_regmove_cost): Increase FP move
    cost.

Modified:
    branches/linaro/gcc-4_9-branch/gcc/ChangeLog.linaro
    branches/linaro/gcc-4_9-branch/gcc/config/aarch64/aarch64.c
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #21 from Jiong Wang jiwang at gcc dot gnu.org ---
Author: jiwang
Date: Wed Nov 19 14:40:26 2014
New Revision: 217780

URL: https://gcc.gnu.org/viewcvs?rev=217780&root=gcc&view=rev
Log:
[AArch64] Adjust generic move costs

2014-11-19  Wilco Dijkstra  wdijk...@arm.com

    PR target/61915
    * config/aarch64/aarch64.c (generic_regmove_cost): Increase FP move
    cost.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64.c
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Wilco wdijkstr at arm dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #22 from Wilco wdijkstr at arm dot com ---
Fixed
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #20 from Evandro e.menezes at samsung dot com ---
(In reply to Ramana Radhakrishnan from comment #19)
> To my mind it seems like 407 fmoves is just a bit too berserk and regardless of how efficient your core is, there is no point in having so many moves back and forth.

It seems that the only LRA parameter exposed is lra-max-considered-reload-pseudos. It defaults to 500. Decreasing it results in more FMOVs, while increasing it results in fewer. Values over 1000 have no further effect. At 1000, the number of FMOVs decreases by 5% in some cases.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Ramana Radhakrishnan ramana at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[AArch64] High amounts of   |[AArch64] High amounts of
                   |GP to FP register moves     |GP to FP register moves
                   |using LRA on AArch64        |using LRA on AArch64 -
                   |                            |Improve Generic
                   |                            |register_move_cost and
                   |                            |memory_move_cost

--- Comment #19 from Ramana Radhakrishnan ramana at gcc dot gnu.org ---
To my mind it seems like 407 fmoves is just a bit too berserk and regardless of how efficient your core is, there is no point in having so many moves back and forth.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Ramana Radhakrishnan ramana at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|ramana at gcc dot gnu.org   |wdijkstr at arm dot com
   Target Milestone|---                         |5.0
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #18 from Wilco wdijkstr at arm dot com ---
(In reply to Andrew Pinski from comment #17)
> (In reply to Wilco from comment #16)
> > (In reply to Andrew Pinski from comment #13)
> > > (In reply to Wilco from comment #9)
> > > > I committed a workaround (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the int-fp move cost. Can you try this and check the issue has indeed gone? You need -mcpu=cortex-a57.
> > >
> > > Note when I submitted ThunderX support I used a base of 2 instead of a base of 1 due to 2 being the default and all values are relative to that. This is mentioned in https://gcc.gnu.org/onlinedocs/gccint/Costs.html . In fact a value of 2 means reload will not look at the constraints of a move instruction. So I think the cortex* cpus should also re-base these values based on 2 being gpr-to-gpr value.
> >
> > You mean only use multiples of 2? That's interesting as I've not seen that done elsewhere. Are these costs in any way related to real issue and latency cycles? Most targets have complex tables with all the exact latencies for every little uarch detail, but from what I've seen in the allocator these costs have almost no meaning.
>
> Not always multiples of 2, though in the case of ThunderX they are multiples of two. The costs are not directly related to latency; they are relative to one another. So I could have used 2, 3, 4 (meaning latency of 1, 2, 3) instead. I used the factor of 2 instead of 1 for ThunderX because 2 + 3 != 4 but rather 5.

OK.

> > So did you find that setting the FP move cost so low actually works alright on ThunderX? I'd like to figure out a setting for the generic target that works out well across all AArch64 implementations.
>
> Yes, it seems to, at least on the things we have benchmarked, but we have not run many big benchmarks like SPEC yet.

Well, in one testcase I'm seeing 11 str and 26 ldr spills on a53/a57 but 407 fmoves on thunderx. I don't see how that could be a good tradeoff unless fmov has negative latency...
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #11 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #9)
> The performance cost is a much bigger issue than codesize. The problem is that when register pressure is high, the register allocator decides to allocate integer liveranges to FP registers and insert int-fp moves for every use/define (ie. you end up with far more moves than you would if it were spilled, so it is a bad thing even if int-fp moves are cheap).
>
> I committed a workaround (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the int-fp move cost. Can you try this and check the issue has indeed gone? You need -mcpu=cortex-a57.

I believe that it pretty much is, after a cursory examination. The code size after the patch is back down about 2% for the test case above. Of note, the prolog and epilog are much smaller, because the FP registers don't have to be saved and restored anymore, and the stack frame shrank correspondingly.

Do you have an idea of the performance impact of this patch?
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #12 from Evandro e.menezes at samsung dot com ---
(In reply to Evandro from comment #11)
> Do you have an idea of the performance impact of this patch?

At least in Dhrystone, it improved by over 2% on A57.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #13 from Andrew Pinski pinskia at gcc dot gnu.org ---
(In reply to Wilco from comment #9)
> I committed a workaround (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the int-fp move cost. Can you try this and check the issue has indeed gone? You need -mcpu=cortex-a57.

Note when I submitted ThunderX support I used a base of 2 instead of a base of 1 due to 2 being the default and all values are relative to that. This is mentioned in https://gcc.gnu.org/onlinedocs/gccint/Costs.html . In fact a value of 2 means reload will not look at the constraints of a move instruction. So I think the cortex* cpus should also re-base these values based on 2 being gpr-to-gpr value.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #14 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #10)
> Note currently it is not possible to use FP registers for spilling using the hooks - basically you still end up with int-fp moves for every definition and use (even when multiple uses are right next to each other), and rematerialization does not happen at all.

Vladimir, I had also noticed that the hooks that you pointed me to didn't seem to work as documented. Are we missing anything?
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #15 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro from comment #12)
> (In reply to Evandro from comment #11)
> > Do you have an idea of the performance impact of this patch?
>
> At least in Dhrystone, it improved by over 2% on A57.

It was ~2% on SPECINT2k, ~3% on SPECFP2k. There were large gains on other benchmarks where the allocator had gone berserk on FP moves inside the hot loop. The removal of the redundant FP saves/restores in many functions helps as well.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #16 from Wilco wdijkstr at arm dot com ---
(In reply to Andrew Pinski from comment #13)
> (In reply to Wilco from comment #9)
> > I committed a workaround (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the int-fp move cost. Can you try this and check the issue has indeed gone? You need -mcpu=cortex-a57.
>
> Note when I submitted ThunderX support I used a base of 2 instead of a base of 1 due to 2 being the default and all values are relative to that. This is mentioned in https://gcc.gnu.org/onlinedocs/gccint/Costs.html . In fact a value of 2 means reload will not look at the constraints of a move instruction. So I think the cortex* cpus should also re-base these values based on 2 being gpr-to-gpr value.

You mean only use multiples of 2? That's interesting as I've not seen that done elsewhere. Are these costs in any way related to real issue and latency cycles? Most targets have complex tables with all the exact latencies for every little uarch detail, but from what I've seen in the allocator these costs have almost no meaning.

So did you find that setting the FP move cost so low actually works alright on ThunderX? I'd like to figure out a setting for the generic target that works out well across all AArch64 implementations.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #17 from Andrew Pinski pinskia at gcc dot gnu.org ---
(In reply to Wilco from comment #16)
> (In reply to Andrew Pinski from comment #13)
> > (In reply to Wilco from comment #9)
> > > I committed a workaround (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the int-fp move cost. Can you try this and check the issue has indeed gone? You need -mcpu=cortex-a57.
> >
> > Note when I submitted ThunderX support I used a base of 2 instead of a base of 1 due to 2 being the default and all values are relative to that. This is mentioned in https://gcc.gnu.org/onlinedocs/gccint/Costs.html . In fact a value of 2 means reload will not look at the constraints of a move instruction. So I think the cortex* cpus should also re-base these values based on 2 being gpr-to-gpr value.
>
> You mean only use multiples of 2? That's interesting as I've not seen that done elsewhere. Are these costs in any way related to real issue and latency cycles? Most targets have complex tables with all the exact latencies for every little uarch detail, but from what I've seen in the allocator these costs have almost no meaning.

Not always multiples of 2, though in the case of ThunderX they are multiples of two. The costs are not directly related to latency; they are relative to one another. So I could have used 2, 3, 4 (meaning latency of 1, 2, 3) instead. I used the factor of 2 instead of 1 for ThunderX because 2 + 3 != 4 but rather 5.

> So did you find that setting the FP move cost so low actually works alright on ThunderX? I'd like to figure out a setting for the generic target that works out well across all AArch64 implementations.

Yes, it seems to, at least on the things we have benchmarked, but we have not run many big benchmarks like SPEC yet.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Wilco wdijkstr at arm dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wdijkstr at arm dot com

--- Comment #9 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro Menezes from comment #0)
> The issue that I observed in code size due to the default use of the LRA results in the spilling of the FP register used to spill variables into, which increases code-size.

The performance cost is a much bigger issue than codesize. The problem is that when register pressure is high, the register allocator decides to allocate integer liveranges to FP registers and insert int-fp moves for every use/define (ie. you end up with far more moves than you would if it were spilled, so it is a bad thing even if int-fp moves are cheap).

I committed a workaround (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the int-fp move cost. Can you try this and check the issue has indeed gone? You need -mcpu=cortex-a57.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #10 from Wilco wdijkstr at arm dot com ---
(In reply to Andrew Pinski from comment #2)
> https://gcc.gnu.org/ml/gcc/2014-05/msg00160.html

Note currently it is not possible to use FP registers for spilling using the hooks - basically you still end up with int-fp moves for every definition and use (even when multiple uses are right next to each other), and rematerialization does not happen at all. However what you'd expect is that all spill optimizations apply first and if all else fails every load/store of a stack slot is turned into an int-fp move.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #6 from Vladimir Makarov vmakarov at gcc dot gnu.org ---
(In reply to Evandro Menezes from comment #5)
> Created attachment 33249 [details]
> Dhrystone, part 2 of 3
>
> I firstly observed this issue when looking into Dhrystone built with fairly standard options:
>
> -O2 -fno-short-enums -fno-inline -fno-inline-functions -fno-inline-small-functions -fno-inline-functions-called-once -fomit-frame-pointer -funroll-all-loops
>
> If I add -mno-lra, the code size in dhry_1.o is about 2% smaller.

Evandro, thanks for reporting this. Sorry, I am busy with other things these days. I'll start to work on this PR in September to try to make some progress for the next GCC release. Maybe the better rematerialization in LRA that I am working on now will help the PR too.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #7 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Vladimir Makarov from comment #6)
> Evandro, thanks for reporting this. Sorry, I am busy with other things these days. I'll start to work on this PR in September to try to make some progress for the next GCC release. Maybe the better rematerialization in LRA that I am working on now will help the PR too.

Vladimir, I was thinking about using the hook function to avoid using FPR, at least when -Os is specified, for the time being. This way, registers would still be allocated by the LRA, but this side-effect would be under control.

Or do y'all think that it's better to wait a little while longer?
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #8 from Vladimir Makarov vmakarov at gcc dot gnu.org ---
(In reply to Evandro Menezes from comment #5)
> Created attachment 33249 [details]
> Dhrystone, part 2 of 3
>
> I firstly observed this issue when looking into Dhrystone built with fairly standard options:
>
> -O2 -fno-short-enums -fno-inline -fno-inline-functions -fno-inline-small-functions -fno-inline-functions-called-once -fomit-frame-pointer -funroll-all-loops
>
> If I add -mno-lra, the code size in dhry_1.o is about 2% smaller.

Evandro, thanks for reporting this. Sorry, I am busy with other things these days. I'll start to work on this PR in September to try to make some progress for the next GCC release. Maybe the better rematerialization in LRA that I am working on now will help the PR too.

(In reply to Evandro Menezes from comment #7)
> (In reply to Vladimir Makarov from comment #6)
> > Evandro, thanks for reporting this. Sorry, I am busy with other things these days. I'll start to work on this PR in September to try to make some progress for the next GCC release. Maybe the better rematerialization in LRA that I am working on now will help the PR too.
>
> Vladimir, I was thinking about using the hook function to avoid using FPR, at least when -Os is specified, for the time being. This way, registers would still be allocated by the LRA, but this side-effect would be under control.
>
> Or do y'all think that it's better to wait a little while longer?

If it works and it is ok for the ARM maintainers, it is ok for me too. I will look at this from the LRA point of view, to see whether the code size can be decreased or not. Your solution is in the machine-dependent part, so it is up to you and the ARM maintainers. I think you should not wait for what I may or may not find in LRA itself to fix it.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Ramana Radhakrishnan ramana at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |aarch64-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-08-05
           Assignee|unassigned at gcc dot gnu.org |ramana at gcc dot gnu.org
            Summary|[AArch64] Default use of    |[AArch64] High amounts of
                   |the LRA results in extra    |GP to FP register moves
                   |code size                   |using LRA on AArch64
     Ever confirmed|0                           |1

--- Comment #4 from Ramana Radhakrishnan ramana at gcc dot gnu.org ---
We've noticed this overeagerness hurting in a number of places, including SPEC2k(6) for Cortex-A57, and are in the process of fixing up REGISTER_MOVE_COST and MEMORY_MOVE_COST to address this for those cores. That is the first step towards reducing the number of these moves.

If you have more examples and more analysis from outside these benchmarks, it would be useful to help look for such cases.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #5 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33249
  -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33249&action=edit
Dhrystone, part 2 of 3

I firstly observed this issue when looking into Dhrystone built with fairly standard options:

-O2 -fno-short-enums -fno-inline -fno-inline-functions -fno-inline-small-functions -fno-inline-functions-called-once -fomit-frame-pointer -funroll-all-loops

If I add -mno-lra, the code size in dhry_1.o is about 2% smaller.