[Bug target/63304] Aarch64 pc-relative load offset out of range

2015-11-13 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304

--- Comment #37 from Evandro  ---
Here's what I had in mind:
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01787.html

Feedback is welcome.

[Bug target/63304] Aarch64 pc-relative load offset out of range

2015-11-06 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304

--- Comment #36 from Evandro  ---
(In reply to Ramana Radhakrishnan from comment #35)
> (In reply to Evandro from comment #32)
> > Because of side effects of the Haifa scheduler, the loads now pile up, and
> > the ADRPs may affect the load issue rate rather badly if not fused.  At least
> > on our processor.
> 
> In straight line code I can imagine this happening - In loopy code I would
> have expected the constants to be hoisted - at least that's what I remember
> seeing in my analysis. You have seen -mpc-relative-literal-loads haven't you?

The cases that I have in mind involve straight-line code in functions which are
called from a loop.  Since they are external, only LTO would address such
cases.  And, since we do not control how they are built, we have to handle them
as they come.

As long as there's an opening to investigate the benefits and drawbacks of
reverting to the legacy way considering the function size, I think that it's
interesting to find out the results.

Thank you.

[Bug target/63304] Aarch64 pc-relative load offset out of range

2015-11-06 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304

--- Comment #30 from Evandro  ---
The performance impact of always referring to constants as if they were far
away is significant on targets which do not fuse ADRP and LDR together.  What's
the status of the solution that evaluates the function size?  Should this be
optionally enabled only?  Would it be the case to come up with a medium code
model? :-P  Could the assembler be left to address this issue by relaxing such
loads? :-P  

Thank you.

[Bug target/63304] Aarch64 pc-relative load offset out of range

2015-11-06 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304

--- Comment #32 from Evandro  ---
(In reply to Ramana Radhakrishnan from comment #31)
> (In reply to Evandro from comment #30)
> > The performance impact of always referring to constants as if they were far
> > away is significant on targets which do not fuse ADRP and LDR together. 
> 
> What happens if you split them up and schedule them appropriately ? I didn't
> see any significant impact in my benchmarking on implementations that did
> not implement such fusion. Where people want performance in these cases they
> can well use -mpc-relative-literal-loads or -mcmodel=tiny - it's in there
> already.

Because of side effects of the Haifa scheduler, the loads now pile up, and the
ADRPs may affect the load issue rate rather badly if not fused.  At least on our
processor.

Which brings another point, shouldn't there be just one ADRP per BB or,
ideally, per function?  Or am I missing something?

> > What's the status of the solution that evaluates the function size? 
> 
> I am not working on that follow-up as I didn't see the real need for it in
> the benchmarking results I was looking at. You are welcome to investigate.

OK

[Bug target/63304] Aarch64 pc-relative load offset out of range

2015-11-06 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304

--- Comment #34 from Evandro  ---
(In reply to Wilco from comment #33)
> (In reply to Evandro from comment #32)
> ADRP latency to load-address should be zero on any OoO core - ADRP is
> basically a move-immediate, so can execute early and hide any latency.

In an ideal world, yes.  In the actual world, they compete for limited
resources that could be used by other insns.

> > Which brings another point, shouldn't there be just one ADRP per BB or,
> > ideally, per function?  Or am I missing something?
> 
> That's not possible in this case as the section is mergeable. An alternative
> implementation using anchors may be feasible, but GCC is extremely bad at
> using anchors efficiently - functions using several global variables also
> end up with a large number of ADRPs when you'd expect a single ADRP.

I see.  

I'll investigate placing the constant after the function, as before, if the
estimated function size allows for it.  I think that eliminating the ADRPs
could potentially be more beneficial to code size than merging constants in a
common literal pool (v. http://bit.ly/1Ptc8nh).

Thank you.
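
The size check proposed above can be sketched as follows.  This is only an
illustration, not GCC code: the function name literal_pool_reachable, its
parameters, and the pool-right-after-the-function layout are assumptions.  The
one architectural fact it relies on is that LDR (literal) encodes a signed
19-bit offset scaled by 4, so its PC-relative range is roughly +/-1 MiB.

```c
#include <stdbool.h>
#include <stddef.h>

/* Farthest forward byte offset encodable by LDR (literal):
   a signed 19-bit immediate scaled by 4, i.e. (2^18 - 1) * 4.  */
#define LDR_LITERAL_MAX_FWD ((size_t) (((1u << 18) - 1) * 4))

/* Hypothetical helper: returns true if a literal pool of POOL_BYTES
   placed right after a function estimated at EST_FUNC_BYTES is
   reachable from the function's first instruction, the worst case.
   The check is conservative: it requires the end of the pool, not
   just the start of its last entry, to be within range.  */
bool
literal_pool_reachable (size_t est_func_bytes, size_t pool_bytes)
{
  if (est_func_bytes > LDR_LITERAL_MAX_FWD)
    return false;
  return pool_bytes <= LDR_LITERAL_MAX_FWD - est_func_bytes;
}
```

A function that fails this check would keep the ADRP and LDR pair; one that
passes could use a single pc-relative load per constant.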

[Bug target/58623] lack of ldp/stp optimization

2014-12-15 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58623

--- Comment #6 from Evandro e.menezes at samsung dot com ---
What's the PR of the fwprop issue?

Thank you.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost

2014-10-31 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #20 from Evandro e.menezes at samsung dot com ---
(In reply to Ramana Radhakrishnan from comment #19)
> To my mind it seems like 407 fmoves is just a bit too berserk and regardless
> of how efficient your core is, there is no point in having so many moves
> back and forth.

It seems that the only LRA parameter exposed is
lra-max-considered-reload-pseudos.  It defaults to 500; decreasing it results in
more FMOVs, and increasing it results in fewer.  It doesn't have any further
effect above 1000.  At 1000, the number of FMOVs decreases by 5% in some cases.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-28 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #21 from Evandro e.menezes at samsung dot com ---
(In reply to ramana.radhakrish...@arm.com from comment #20)
> What's the kind of performance delta you see if you managed to unroll
> the loop just a wee bit ? Probably not much looking at the code produced
> here.

Comparing the cycle counts on Juno when running the program from the matrix
multiplication test above built with -Ofast and unrolling:

-fno-unroll-loops: 592000
-funroll-loops --param max-unroll-times=2: 594000
-funroll-loops --param max-unroll-times=4: 592000
-funroll-loops: 59 (implies --param max-unroll-times=8)
-funroll-loops --param max-unroll-times=16: 581000

It seems to me that without effective iv-opt in place, loops have to be
unrolled too aggressively to make any difference in this case, greatly
sacrificing code size.
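
To put the cycle counts above in relative terms, here is a small helper.  The
function name improvement_pct is an assumption made for illustration; the
inputs are the figures quoted in this comment.

```c
/* Percentage by which CYCLES improves on BASELINE.
   A negative result means CYCLES is slower than BASELINE.  */
double
improvement_pct (double baseline, double cycles)
{
  return (baseline - cycles) / baseline * 100.0;
}
```

Taking -fno-unroll-loops (592000 cycles) as the baseline,
improvement_pct (592000, 581000) is about 1.9 for max-unroll-times=16, while
improvement_pct (592000, 594000) is about -0.3 for max-unroll-times=2.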


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-28 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #23 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #22)
> Unrolling alone isn't good enough in sum reductions. As I mentioned before,
> GCC doesn't enable any of the useful loop optimizations by default. So add
> -fvariable-expansion-in-unroller to get a good speedup with unrolling. Again
> these are all generic GCC issues.

Adding -fvariable-expansion-in-unroller when using -funroll-loops results in
practically the same code being emitted.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #11 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #9)
> The performance cost is a much bigger issue than codesize. The problem is
> that when register pressure is high, the register allocator decides to
> allocate integer liveranges to FP registers and insert int-fp moves for
> every use/define (ie. you end up with far more moves than you would if it
> were spilled, so it is a bad thing even if int-fp moves are cheap).
> 
> I committed a workaround
> (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the
> int-fp move cost. Can you try this and check the issue has indeed gone?
> You need -mcpu=cortex-a57.

I believe that it pretty much is, after a cursory examination.  The code size 
after the patch is back down about 2% for the test case above.  Of note, the
prolog and epilog are much smaller, because the FP registers don't have to be
saved and restored anymore, and the stack frame shrank correspondingly.

Do you have an idea of the performance impact of this patch?


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #12 from Evandro e.menezes at samsung dot com ---
(In reply to Evandro from comment #11)
> Do you have an idea of the performance impact of this patch?

At least in Dhrystone, it improved by over 2% on A57.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #14 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #10)
> Note currently it is not possible to use FP registers for spilling using the
> hooks - basically you still end up with int-fp moves for every definition
> and use (even when multiple uses are right next to each other), and
> rematerialization does not happen at all.

Vladimir,

I had also noticed that the hooks that you pointed me to didn't seem to work as
documented.  Are we missing anything?


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #14 from Evandro Menezes e.menezes at samsung dot com ---
Compiling the test-case above with just -O2, I can reproduce the code I
mentioned initially and easily measure the cycle count to run it on target
using perf.

The binary created by GCC runs in about 447000 user cycles and the one created
by LLVM, in about 499000 user cycles.  IOW, fused multiply-add is a win on A57.

Looking further why Geekbench's {D,S}GEMM performs worse with GCC than with
LLVM, both using -Ofast, GCC fails to vectorize the loop in
gemm_block_kernel, while LLVM does.

I should've done a more detailed analysis of this issue before submitting this
bug, sorry.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #16 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #15)
> Using -Ofast is not any different from -O3 -ffast-math when compiling
> non-Fortran code. As comment 10 shows, both loops are vectorized, however
> LLVM unrolls twice and uses multiple accumulators while GCC doesn't.

You're right.  LLVM produces:

.LBB0_1:                        // %vector.body
                                // =>This Inner Loop Header: Depth=1
        add     x11, x9, x8
        add     x12, x10, x8
        ldp     q2, q3, [x11]
        ldp     q4, q5, [x12]
        add     x8, x8, #32             // =32
        fmla    v0.2d, v2.2d, v4.2d
        fmla    v1.2d, v3.2d, v5.2d
        cmp     x8, #128, lsl #12       // =524288
        b.ne    .LBB0_1

And GCC:

.L3:
        ldr     q2, [x2, x0]
        add     w1, w1, 1
        ldr     q1, [x3, x0]
        cmp     w1, w4
        add     x0, x0, 16
        fmla    v0.2d, v2.2d, v1.2d
        bcc     .L3

> I still don't see what this has to do with A57. You should open a generic
> bug about GCC not applying basic loop optimizations with -O3 (in fact
> limited unrolling is useful even for -O2).

Indeed, but I think that there's still a code-generation opportunity for A57
here.

Note above that LLVM loads the registers in pairs, while GCC, when it unrolls
the loop (more aggressively, BTW), loads each vector individually:

.L3:
        ldr     q28, [x15, x16]
        add     x17, x16, 16
        ldr     q29, [x14, x16]
        add     x0, x16, 32
        ldr     q30, [x15, x17]
        add     x18, x16, 48
        ldr     q31, [x14, x17]
        add     x1, x16, 64
        ...
        fmla    v27.2d, v28.2d, v29.2d
        ...
        fmla    v27.2d, v30.2d, v31.2d
        ... # Rest of 8x unroll
        bcc     .L3

It also goes without saying that this code could also benefit from the
post-increment addressing mode.
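
At the source level, the "unroll twice with multiple accumulators" shape that
LLVM produces above (v0 and v1) can be sketched like this.  The function name
sum2 is an assumption, and the code is a hand-written illustration rather than
output from either compiler.

```c
#include <stddef.h>

/* Sum reduction unrolled twice with two independent accumulators, so
   that consecutive multiply-adds do not depend on each other.  This
   reassociates the additions, which a compiler may only do on its own
   under -ffast-math or an equivalent option.  */
double
sum2 (const double *a, const double *b, size_t n)
{
  double acc0 = 0.0, acc1 = 0.0;
  size_t i;

  for (i = 0; i + 1 < n; i += 2)
    {
      acc0 += a[i] * b[i];
      acc1 += a[i + 1] * b[i + 1];
    }

  /* Handle the leftover element when N is odd.  */
  if (i < n)
    acc0 += a[i] * b[i];

  return acc0 + acc1;
}
```

GCC's 8x-unrolled loop, by contrast, funnels every fmla into the single
accumulator v27, so the dependency chain stays as long as in the rolled loop.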


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #17 from Evandro e.menezes at samsung dot com ---
Created attachment 33785
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33785&action=edit
Simple matrix multiplication


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Evandro e.menezes at samsung dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #33774|0                           |1
        is obsolete|                            |

--- Comment #18 from Evandro e.menezes at samsung dot com ---
Created attachment 33786
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33786&action=edit
Simple test-case


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-21 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #12 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33774
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33774&action=edit
Simple test-case


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-14 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #8 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Ramana Radhakrishnan from comment #7)
> As Evandro doesn't mention flags it's hard to say whether there really is a
> problem here or not.

Both GCC and LLVM were given -O3 -ffast-math.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-14 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #9 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Wilco from comment #6)
> I ran the assembler examples on A57 hardware with identical input. The FMADD
> code is ~20% faster irrespectively of the size of the input. This is not a
> surprise given that the FMADD latency is lower than the FADD and FMUL
> latency.

I ran the same Geekbench binaries on A53 and the result is about the same
between the GCC and the LLVM code, if with a slight (< 1%) advantage for GCC.


[Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

            Bug ID: 63503
           Summary: [AArch64] A57 executes fused multiply-add poorly in
                    some situations
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: e.menezes at samsung dot com
                CC: spop at gcc dot gnu.org
            Target: aarch64-*

Curious why Geekbench's {D,S}GEMM by GCC were 8-9% slower than by LLVM, I was
baffled to find that the code emitted by GCC for the innermost loop in the
algorithm core is actually very good:

.L8:
ldr d2, [x8, w5, uxtw 3]
ldr d1, [x7, w5, uxtw 3]
add w5, w5, 1
cmp w5, w6
fmadd   d0, d2, d1, d0
bne .L8

LLVM's code is not so neat:

.LBB0_10:
ldr d1, [x27, x22, lsl #3]
ldr d2, [x9, x22, lsl #3]
fmul d1, d1, d2
fadd d0, d0, d1
add w21, w21, #1
add x22, x22, #1
cmp w21, w24, uxtw
b.ne .LBB0_10

However, it runs faster.

Methinks that the A57 microarchitecture is performing tricks for discrete FP
operations but not for fused multiply-add, since both code sequences are
semantically the same.  Whatever it is, it seems that fused multiply-add, and
perhaps its cousins, is actually a performance hit only when one depends on the
results of a previous one, as in this case on the results of the fused
operation in the previous loop iteration.

I'll try to create a simple test-case, but, in the meantime, please chime in
about your thoughts.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #3 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Andrew Pinski from comment #1)
> The other question here are there denormals happening?  That might cause
> some performance differences between using fmadd and fmul/fadd.

Nope, no denormals.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #4 from Evandro Menezes e.menezes at samsung dot com ---
Here's a simplified code to reproduce these results:

double sum(double *A, double *B, int n)
{
  int i;
  double res = 0;

  for (i = 0; i < n; i++)
    res += A [i] * B [i];

  return res;
}


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-08-14 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #7 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Vladimir Makarov from comment #6)
 
> Evandro, thanks for reporting this.  Sorry, I am busy with other things these
> days.  I'll start to work on this PR in September to try to make some
> progress for the next GCC release.
> 
> Maybe a better rematerialization in LRA I am working on now will help the
> PR too.

Vladimir,

I was thinking about using the hook function to avoid using FPR, at least when
-Os is specified, for the time being.  This way, registers would still be
allocated by the LRA, but this side-effect would be under control.  Or do y'all
think that it's better to wait a little while longer?


[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-06 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014

Evandro Menezes e.menezes at samsung dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |RESOLVED
         Resolution|---                         |WORKSFORME

--- Comment #16 from Evandro Menezes e.menezes at samsung dot com ---
If it's working, it's good for me.


[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-08-05 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #5 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33249
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33249&action=edit
Dhrystone, part 2 of 3

I first observed this issue when looking into Dhrystone built with fairly
standard options:

-O2 -fno-short-enums -fno-inline -fno-inline-functions
-fno-inline-small-functions -fno-inline-functions-called-once
-fomit-frame-pointer -funroll-all-loops

If I add -mno-lra, the code size in dhry_1.o is about 2% smaller.


[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-05 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014

--- Comment #9 from Evandro Menezes e.menezes at samsung dot com ---
It seems to me that it's the LRA which is forcing the use of FP registers, so,
even if the patterns are fixed, I believe that in the end the combiner would
just give up and ICE.  With this assumption, which is open to corrections, I
believe that the LRA needs to be properly managed according to the options
passed on to the target.


[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-05 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014

--- Comment #11 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to ktkachov from comment #10)
> What we really need here is a preprocessed testcase showing the problem.
> It should be fairly easy to lock down on the problem then

I'm on it.


[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-05 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014

--- Comment #13 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33253
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33253&action=edit
Test-case

This test-case is a stripped-down version of Dhrystone, where the issue was
first observed.

Built with just -O2, it ends up with a handful of FMOVs, which are then avoided
if -mno-lra is also specified.


[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-05 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014

Evandro Menezes e.menezes at samsung dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #33246|0                           |1
        is obsolete|                            |

--- Comment #14 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33254
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33254&action=edit
Patch

For the sake of correctness, this patch uses the more generic flag to qualify
the spilling.


[Bug target/62014] New: [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-04 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014

            Bug ID: 62014
           Summary: [AArch64] Using -mgeneral-regs-only may lead to ICE
           Product: gcc
           Version: 4.10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: e.menezes at samsung dot com

Created attachment 33245
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33245&action=edit
This patch should fix this issue, though it needs a test-case.

In some cases, when the LRA spills a register into an FP register, with the
option -mgeneral-regs-only specified, there is an ICE.

It seems to be caused by the LRA assuming that the FP registers are always
available and not being told when they aren't by the target.
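
The kind of target-side decision described above can be sketched in isolation.
Everything here is a hypothetical stand-in written for illustration: the enum,
the function name spill_class_for, and the general_regs_only parameter are
assumptions, not GCC's actual definitions or the attached patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the target's register classes.  */
enum reg_class { NO_REGS, GENERAL_REGS, FP_REGS };

/* Sketch of a spill-class decision: offer FP registers as spill space
   for general registers, unless the user asked for general registers
   only.  NO_REGS here means "no register class; spill to memory".  */
enum reg_class
spill_class_for (enum reg_class rclass, bool general_regs_only)
{
  /* Do not spill into FP registers when they are unavailable.  */
  if (general_regs_only)
    return NO_REGS;

  return rclass == GENERAL_REGS ? FP_REGS : NO_REGS;
}
```

With general_regs_only set, the allocator would never be handed an FP class
and would fall back to stack slots, which is the behavior the report asks for.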


[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-04 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014

--- Comment #2 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Andrew Pinski from comment #1)
> +  /* Do not spill into FP registers when -mgeneral-regs-only is
> specified. *
> 
> You are missing a / in your comment.

Ermahgerd!


[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-04 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014

Evandro Menezes e.menezes at samsung dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #33245|0                           |1
        is obsolete|                            |

--- Comment #3 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33246
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33246&action=edit
This patch should fix this issue, though it needs a test-case


[Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size

2014-07-25 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

            Bug ID: 61915
           Summary: [AArch64] Default use of the LRA results in extra code
                    size
           Product: gcc
           Version: 4.10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: e.menezes at samsung dot com

With the LRA used by default, I observed that the FP register into which
variables are spilled ends up being spilled itself, which increases code size.

For example, in Dhrystone, out of dhry_1.c I see sequences like this:

  ldr   d9, [sp, 144]
  ...
  fmov  x0, d9
  bl    printf
  ...
  fmov  x0, d9
  ...
  bl    printf

By disabling the LRA, the code is a tad leaner (2%):

  ldr   x0, [sp, 144]
  ...
  bl    printf
  ...
  ldr   x0, [sp, 144]
  ...
  bl    printf

Moreover, is transferring registers between the GP and the FP register files
always cheap?  In some x86 processors this used to be accomplished internally
through the load-store unit anyway (e.g., Opteron).  How is this accomplished
internally in A53 and A57?

Is using the LRA by default clearly beneficial in other cases?

At the Cauldron I mentioned some variables that could be rematerialized when
needed instead of being spilled, but I could not reproduce that.  I'll try some
more to spot this behavior.


[Bug target/61915] [AArch64] Default use of the LRA results in extra code size

2014-07-25 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #3 from Evandro Menezes e.menezes at samsung dot com ---
In Opteron, there was a path from the FP to the GP registers, but not the other
way around.  That path was made symmetric only later, in Barcelona.