[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost

2015-03-10 Thread collison at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #23 from collison at gcc dot gnu.org ---
Author: collison
Date: Tue Mar 10 07:34:20 2015
New Revision: 221302

URL: https://gcc.gnu.org/viewcvs?rev=221302&root=gcc&view=rev
Log:
2015-03-10  Michael Collison  michael.colli...@linaro.org

Backport from trunk r217780.
2014-11-19  Wilco Dijkstra  wdijk...@arm.com

PR target/61915
* config/aarch64/aarch64.c (generic_regmove_cost): Increase FP move
cost.


Modified:
branches/linaro/gcc-4_9-branch/gcc/ChangeLog.linaro
branches/linaro/gcc-4_9-branch/gcc/config/aarch64/aarch64.c


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost

2014-11-19 Thread jiwang at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #21 from Jiong Wang jiwang at gcc dot gnu.org ---
Author: jiwang
Date: Wed Nov 19 14:40:26 2014
New Revision: 217780

URL: https://gcc.gnu.org/viewcvs?rev=217780&root=gcc&view=rev
Log:
[AArch64] Adjust generic move costs

  2014-11-19  Wilco Dijkstra  wdijk...@arm.com

PR target/61915
* config/aarch64/aarch64.c (generic_regmove_cost): Increase FP move cost.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/aarch64/aarch64.c
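
The ChangeLog entry above only says that the FP move cost in generic_regmove_cost was increased. As a rough illustration of why that matters, the sketch below models a register move cost table and the comparison an allocator might make between parking an integer value in an FP register and spilling it to memory. The struct, the field names, the function and all numeric values are assumptions made for this example; they are not copied from aarch64.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cost table.  Field names and values are illustrative
   only and are not taken from the GCC sources.  */
struct regmove_cost_sketch {
  int gp2gp;  /* GP register to GP register */
  int gp2fp;  /* GP register to FP register */
  int fp2gp;  /* FP register to GP register */
  int fp2fp;  /* FP register to FP register */
};

/* Returns true when moving a GP value into an FP register and back
   again is costed no higher than one store plus one load.  */
static bool fp_parking_looks_cheaper(const struct regmove_cost_sketch *c,
                                     int store_cost, int load_cost)
{
  assert(c != NULL);
  return c->gp2fp + c->fp2gp <= store_cost + load_cost;
}

/* Two made-up tables that differ only in the int<->fp entries.  */
static const struct regmove_cost_sketch low_fp_cost  = { 1, 2, 2, 2 };
static const struct regmove_cost_sketch high_fp_cost = { 1, 5, 5, 2 };
```

With a store and a load costed at 4 each, the first table makes the FP round trip look cheaper than memory and the second does not, which is the kind of shift a higher int-fp move cost is meant to produce.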


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost

2014-11-19 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Wilco wdijkstr at arm dot com changed:

   What       |Removed  |Added

   Status     |ASSIGNED |RESOLVED
   Resolution |---      |FIXED

--- Comment #22 from Wilco wdijkstr at arm dot com ---
Fixed


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost

2014-10-31 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #20 from Evandro e.menezes at samsung dot com ---
(In reply to Ramana Radhakrishnan from comment #19)
 To my mind it seems like 407 fmoves is just a bit too berserk and regardless
 of how efficient your core is, there is no point in having so many moves
 back and forth.

It seems that the only LRA parameter exposed is
lra-max-considered-reload-pseudos. It defaults to 500; decreasing it results
in more FMOVs, and increasing it in fewer. It has no further effect above
1000. At 1000, the number of FMOVs decreases by 5% in some cases.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost

2014-10-28 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Ramana Radhakrishnan ramana at gcc dot gnu.org changed:

   What    |Removed                   |Added

   Summary |[AArch64] High amounts of |[AArch64] High amounts of
           |GP to FP register moves   |GP to FP register moves
           |using LRA on AArch64      |using LRA on AArch64 -
           |                          |Improve Generic
           |                          |register_move_cost and
           |                          |memory_move_cost

--- Comment #19 from Ramana Radhakrishnan ramana at gcc dot gnu.org ---
To my mind it seems like 407 fmoves is just a bit too berserk and regardless of
how efficient your core is, there is no point in having so many moves back and
forth.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost

2014-10-28 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Ramana Radhakrishnan ramana at gcc dot gnu.org changed:

   What             |Removed                   |Added

   Status           |NEW                       |ASSIGNED
   Assignee         |ramana at gcc dot gnu.org |wdijkstr at arm dot com
   Target Milestone |---                       |5.0


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-27 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #18 from Wilco wdijkstr at arm dot com ---
(In reply to Andrew Pinski from comment #17)
 (In reply to Wilco from comment #16)
  (In reply to Andrew Pinski from comment #13)
   (In reply to Wilco from comment #9)
    I committed a workaround
    (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing
    the int-fp move cost. Can you try this and check the issue has indeed
    gone? You need -mcpu=cortex-a57.
   
   Note when I submitted ThunderX support I used a base of 2 instead of a
   base of 1 due to 2 being the default and all values are relative to that.
   This is mentioned in https://gcc.gnu.org/onlinedocs/gccint/Costs.html .
   In fact a value of 2 means reload will not look at the constraints of a
   move instruction.
   
   So I think the cortex* cpus should also re-base these values based on 2
   being gpr-to-gpr value.
  
  You mean only use multiples of 2? That's interesting as I've not seen that
  done elsewhere. Are these costs in any way related to real issue and latency
  cycles? Most targets have complex tables with all the exact latencies for
  every little uarch detail, but from what I've seen in the allocator these
  costs have almost no meaning.
 
 Not always multiples of 2, though in the case of ThunderX they are.  The
 costs are not directly related to latency, but they are relative to one
 another.  So I could have used 2, 3, 4 (meaning latencies of 1, 2, 3)
 instead.  I used a factor of 2 instead of 1 for ThunderX because 2 + 3
 != 4 but rather 5.

OK.

  So did you find that setting the FP move cost so low actually works alright
  on ThunderX? I'd like to figure out a setting for the generic target that
  works out well across all AArch64 implementations.
 
 Yes, it seems to, at least on the things we have benchmarked, but we have
 not run many big benchmarks like SPEC yet.

Well in one testcase I'm seeing 11 str and 26 ldr spills on a53/a57 but 407
fmoves on thunderx. I don't see how that could be a good tradeoff unless fmov
has negative latency...
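
The arithmetic behind that objection can be made explicit. The counts come from the comment above (11 str plus 26 ldr, so 37 memory operations, against 407 fmovs); the unit costs passed in are free parameters, and the function names are made up for this sketch. Since 407 = 11 x 37, the fmov version only comes out ahead if a single spill load or store is costed at more than 11 times an fmov.

```c
#include <assert.h>

/* Total cost of a group of identical operations.  */
static long total_cost(long count, long unit_cost)
{
  assert(count >= 0 && unit_cost >= 0);
  return count * unit_cost;
}

/* Returns 1 when the fmov-based code is strictly cheaper than the
   spill-based code under the given unit costs, 0 otherwise.  */
static int fmovs_cheaper(long n_fmov, long fmov_cost,
                         long n_mem_ops, long mem_cost)
{
  return total_cost(n_fmov, fmov_cost) < total_cost(n_mem_ops, mem_cost);
}
```

For example, fmovs_cheaper(407, 1, 37, 11) is 0 (both sides total 407), while fmovs_cheaper(407, 1, 37, 12) is 1.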


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #11 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #9)
 The performance cost is a much bigger issue than codesize. The problem is
 that when register pressure is high, the register allocator decides to
 allocate integer liveranges to FP registers and insert int-fp moves for
 every use/define (i.e. you end up with far more moves than you would if it
 were spilled, so it is a bad thing even if int-fp moves are cheap).
 
 I committed a workaround
 (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the
 int-fp move cost. Can you try this and check the issue has indeed gone?
 You need -mcpu=cortex-a57.

I believe that it pretty much is, after a cursory examination.  The code size 
after the patch is back down about 2% for the test case above.  Of note, the
prolog and epilog are much smaller, because the FP registers don't have to be
saved and restored anymore, and the stack frame shrank correspondingly.

Do you have an idea of the performance impact of this patch?


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #12 from Evandro e.menezes at samsung dot com ---
(In reply to Evandro from comment #11)
 Do you have an idea of the performance impact of this patch?

At least in Dhrystone, it improved by over 2% on A57.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #13 from Andrew Pinski pinskia at gcc dot gnu.org ---
(In reply to Wilco from comment #9)
 I committed a workaround
 (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the
 int-fp move cost. Can you try this and check the issue has indeed gone?
 You need -mcpu=cortex-a57.

Note when I submitted ThunderX support I used a base of 2 instead of a base of
1 due to 2 being the default and all values are relative to that.  This is
mentioned in https://gcc.gnu.org/onlinedocs/gccint/Costs.html .  In fact a
value of 2 means reload will not look at the constraints of a move instruction.

So I think the cortex* cpus should also re-base these values based on 2 being
gpr-to-gpr value.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #14 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #10)
 Note currently it is not possible to use FP registers for spilling using the
 hooks - basically you still end up with int-fp moves for every definition
 and use (even when multiple uses are right next to each other), and
 rematerialization does not happen at all.

Vladimir,

I had also noticed that the hooks that you pointed me to didn't seem to work as
documented.  Are we missing anything?


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #15 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro from comment #12)
 (In reply to Evandro from comment #11)
  Do you have an idea of the performance impact of this patch?
 
 At least in Dhrystone, it improved by over 2% on A57.

It was ~2% on SPECINT2k, ~3% on SPECFP2k. There were large gains on other
benchmarks where the allocator had gone berserk on FP moves inside the hot
loop. The removal of the redundant FP saves/restores in many functions helps as
well.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #16 from Wilco wdijkstr at arm dot com ---
(In reply to Andrew Pinski from comment #13)
 (In reply to Wilco from comment #9)
  I committed a workaround
  (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the
  int-fp move cost. Can you try this and check the issue has indeed gone?
  You need -mcpu=cortex-a57.
 
 Note when I submitted ThunderX support I used a base of 2 instead of a base
 of 1 due to 2 being the default and all values are relative to that.  This
 is mentioned in https://gcc.gnu.org/onlinedocs/gccint/Costs.html .  In fact
 a value of 2 means reload will not look at the constraints of a move
 instruction.
 
 So I think the cortex* cpus should also re-base these values based on 2
 being gpr-to-gpr value.

You mean only use multiples of 2? That's interesting as I've not seen that done
elsewhere. Are these costs in any way related to real issue and latency cycles?
Most targets have complex tables with all the exact latencies for every little
uarch detail, but from what I've seen in the allocator these costs have almost
no meaning.

So did you find that setting the FP move cost so low actually works alright on
ThunderX? I'd like to figure out a setting for the generic target that works
out well across all AArch64 implementations.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #17 from Andrew Pinski pinskia at gcc dot gnu.org ---
(In reply to Wilco from comment #16)
 (In reply to Andrew Pinski from comment #13)
  (In reply to Wilco from comment #9)
   I committed a workaround
   (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing
   the int-fp move cost. Can you try this and check the issue has indeed gone?
   You need -mcpu=cortex-a57.
  
  Note when I submitted ThunderX support I used a base of 2 instead of a base
  of 1 due to 2 being the default and all values are relative to that.  This
  is mentioned in https://gcc.gnu.org/onlinedocs/gccint/Costs.html .  In fact
  a value of 2 means reload will not look at the constraints of a move
  instruction.
  
  So I think the cortex* cpus should also re-base these values based on 2
  being gpr-to-gpr value.
 
 You mean only use multiples of 2? That's interesting as I've not seen that
 done elsewhere. Are these costs in any way related to real issue and latency
 cycles? Most targets have complex tables with all the exact latencies for
 every little uarch detail, but from what I've seen in the allocator these
 costs have almost no meaning.

Not always multiples of 2, though in the case of ThunderX they are.  The costs
are not directly related to latency, but they are relative to one another.  So
I could have used 2, 3, 4 (meaning latencies of 1, 2, 3) instead.  I used a
factor of 2 instead of 1 for ThunderX because 2 + 3 != 4 but rather 5.
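
One possible reading of the "2 + 3 != 4 but rather 5" remark, offered here as an interpretation rather than as something the comment states: if costs are derived from latencies by adding an offset (latency 1, 2, 3 becomes cost 2, 3, 4), then adding two costs does not give the cost of the summed latency, whereas scaling by 2 keeps costs additive. The two helper functions below are hypothetical and exist only to show that difference.

```c
#include <assert.h>

/* Offset scheme: latency 1, 2, 3 maps to cost 2, 3, 4.  */
static int offset_cost(int latency)
{
  assert(latency >= 1);
  return latency + 1;
}

/* Scaled scheme: latency 1, 2, 3 maps to cost 2, 4, 6.  */
static int scaled_cost(int latency)
{
  assert(latency >= 1);
  return 2 * latency;
}
```

Under the offset scheme, offset_cost(1) + offset_cost(2) is 5 while offset_cost(3) is 4; under the scaled scheme, scaled_cost(1) + scaled_cost(2) equals scaled_cost(3).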

 
 So did you find that setting the FP move cost so low actually works alright
 on ThunderX? I'd like to figure out a setting for the generic target that
 works out well across all AArch64 implementations.

Yes, it seems to, at least on the things we have benchmarked, but we have not
run many big benchmarks like SPEC yet.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-22 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Wilco wdijkstr at arm dot com changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #9 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro Menezes from comment #0)
 The code-size issue that I observed with the default use of the LRA comes
 from the FP registers that variables are spilled into: those registers
 must themselves be spilled, which increases code size.

The performance cost is a much bigger issue than codesize. The problem is that
when register pressure is high, the register allocator decides to allocate
integer liveranges to FP registers and insert int-fp moves for every
use/define (i.e. you end up with far more moves than you would if it were
spilled, so it is a bad thing even if int-fp moves are cheap).

I committed a workaround
(http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the
int-fp move cost. Can you try this and check the issue has indeed gone? You
need -mcpu=cortex-a57.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-22 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #10 from Wilco wdijkstr at arm dot com ---
(In reply to Andrew Pinski from comment #2)
 https://gcc.gnu.org/ml/gcc/2014-05/msg00160.html

Note currently it is not possible to use FP registers for spilling using the
hooks - basically you still end up with int-fp moves for every definition and
use (even when multiple uses are right next to each other), and
rematerialization does not happen at all.

However, what you'd expect is that all spill optimizations apply first and, if
all else fails, every load/store of a stack slot is turned into an int-fp
move.
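
A toy model of the difference described above, under simplifying assumptions that are not taken from LRA itself: a live range is written as a string of events ('D' for a definition, 'U' for a use, '.' for an unrelated instruction). In the FP-register case every definition and every use gets its own int-fp move. In the idealised spill case each definition gets one store and each run of adjacent uses shares a single reload. The function names and the event encoding are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* One int-fp move per definition and per use.  */
static int count_fp_moves(const char *events)
{
  assert(events != NULL);
  int n = 0;
  for (const char *p = events; *p != '\0'; ++p)
    if (*p == 'D' || *p == 'U')
      ++n;
  return n;
}

/* One store per definition, one load per run of adjacent uses.  */
static int count_spill_ops(const char *events)
{
  assert(events != NULL);
  int n = 0;
  char prev = '.';
  for (const char *p = events; *p != '\0'; ++p) {
    if (*p == 'D')
      ++n;                      /* store after the definition */
    else if (*p == 'U' && prev != 'U')
      ++n;                      /* first use of a run needs a reload */
    prev = *p;
  }
  return n;
}
```

For the event string "DUUU.UU" this model counts 6 int-fp moves but only 3 memory operations, which is the shape of the gap the comment complains about.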


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-08-14 Thread vmakarov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #6 from Vladimir Makarov vmakarov at gcc dot gnu.org ---
(In reply to Evandro Menezes from comment #5)
 Created attachment 33249 [details]
 Dhrystone, part 2 of 3
 
 I first observed this issue when looking into Dhrystone built with fairly
 standard options:
 
 -O2 -fno-short-enums -fno-inline -fno-inline-functions
 -fno-inline-small-functions -fno-inline-functions-called-once
 -fomit-frame-pointer -funroll-all-loops
 
 If I add -mno-lra, the code size in dhry_1.o is about 2% smaller.

Evandro, thanks for reporting this.  Sorry, I am busy with other things these
days.  I'll start to work on this PR in September to try to make some progress
for the next GCC release.

Maybe the better rematerialization in LRA that I am working on now will help
this PR too.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-08-14 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #7 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Vladimir Makarov from comment #6)
 
 Evandro, thanks for reporting this.  Sorry, I am busy with other things
 these days.  I'll start to work on this PR in September to try to make some
 progress for the next GCC release.
 
 Maybe the better rematerialization in LRA that I am working on now will
 help this PR too.

Vladimir,

I was thinking about using the hook function to avoid using FPR, at least when
-Os is specified, for the time being.  This way, registers would still be
allocated by the LRA, but this side-effect would be under control.  Or do y'all
think that it's better to wait a little while longer?
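
The thread does not name the hook or show its signature, so the following is only a standalone model of the idea in this comment: a target-side decision function that offers FP registers as a spill destination for integer values, except when optimizing for size. The enum, its members and the function are all hypothetical and are not GCC interfaces.

```c
#include <assert.h>

/* Hypothetical register classes for this sketch only.  */
enum reg_class_sketch { SK_NO_REGS, SK_GENERAL_REGS, SK_FP_REGS };

/* Returns the class that values of RCLASS may be spilled into, or
   SK_NO_REGS to mean "spill to memory".  When OPTIMIZE_SIZE is
   nonzero, FP registers are never offered.  */
static enum reg_class_sketch
spill_class_sketch(enum reg_class_sketch rclass, int optimize_size)
{
  assert(rclass >= SK_NO_REGS && rclass <= SK_FP_REGS);
  if (optimize_size)
    return SK_NO_REGS;
  return rclass == SK_GENERAL_REGS ? SK_FP_REGS : SK_NO_REGS;
}
```

In this model the register allocator still runs unchanged; only the answer to "where may this class be spilled?" depends on the size-optimization setting.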


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-08-14 Thread vmakarov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #8 from Vladimir Makarov vmakarov at gcc dot gnu.org ---
(In reply to Evandro Menezes from comment #5)
 Created attachment 33249 [details]
 Dhrystone, part 2 of 3
 
 I first observed this issue when looking into Dhrystone built with fairly
 standard options:
 
 -O2 -fno-short-enums -fno-inline -fno-inline-functions
 -fno-inline-small-functions -fno-inline-functions-called-once
 -fomit-frame-pointer -funroll-all-loops
 
 If I add -mno-lra, the code size in dhry_1.o is about 2% smaller.

Evandro, thanks for reporting this.  Sorry, I am busy with other things these
days.  I'll start to work on this PR in September to try to make some progress
for the next GCC release.

Maybe the better rematerialization in LRA that I am working on now will help
this PR too.

(In reply to Evandro Menezes from comment #7)
 (In reply to Vladimir Makarov from comment #6)
  
  Evandro, thanks for reporting this.  Sorry, I am busy with other things
  these days.  I'll start to work on this PR in September to try to make some
  progress for the next GCC release.
  
  Maybe the better rematerialization in LRA that I am working on now will
  help this PR too.
 
 Vladimir,
 
 I was thinking about using the hook function to avoid using FPR, at least
 when -Os is specified, for the time being.  This way, registers would still
 be allocated by the LRA, but this side-effect would be under control.  Or do
 y'all think that it's better to wait a little while longer?

If it works and it is OK with the ARM maintainers, it is OK with me too.

I will look at this from the LRA side, to see whether the code size can be
decreased or not.

Your solution is in the machine-dependent part, so it is up to you and the ARM
maintainers.  I think you should not wait for what I may or may not find in LRA
itself to fix it.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-08-05 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Ramana Radhakrishnan ramana at gcc dot gnu.org changed:

   What             |Removed                       |Added

   Keywords         |                              |missed-optimization
   Target           |                              |aarch64-*
   Status           |UNCONFIRMED                   |NEW
   Last reconfirmed |                              |2014-08-05
   Assignee         |unassigned at gcc dot gnu.org |ramana at gcc dot gnu.org
   Summary          |[AArch64] Default use of      |[AArch64] High amounts of
                    |the LRA results in extra      |GP to FP register moves
                    |code size                     |using LRA on AArch64
   Ever confirmed   |0                             |1

--- Comment #4 from Ramana Radhakrishnan ramana at gcc dot gnu.org ---
We've noticed this overeagerness hurting in a number of places, including
SPEC2k(6) on Cortex-A57, and are in the process of adjusting REGISTER_MOVE_COST
and MEMORY_MOVE_COST for those cores. That is the first step towards reducing
the number of these moves.

If you have more examples and more analysis from outside these benchmarks it
would be useful to help look for such cases.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-08-05 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #5 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33249
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33249&action=edit
Dhrystone, part 2 of 3

I first observed this issue when looking into Dhrystone built with fairly
standard options:

-O2 -fno-short-enums -fno-inline -fno-inline-functions
-fno-inline-small-functions -fno-inline-functions-called-once
-fomit-frame-pointer -funroll-all-loops

If I add -mno-lra, the code size in dhry_1.o is about 2% smaller.