[Bug target/63304] Aarch64 pc-relative load offset out of range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304 --- Comment #37 from Evandro --- Here's what I had in mind: https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01787.html Feedback is welcome.
[Bug target/63304] Aarch64 pc-relative load offset out of range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304 --- Comment #36 from Evandro --- (In reply to Ramana Radhakrishnan from comment #35)
> (In reply to Evandro from comment #32)
> > Because of side effects of the Haifa scheduler, the loads now pile up, and
> > the ADRPs may affect the load issue rate rather badly if not fused. At least
> > on our processor.
>
> In straight-line code I can imagine this happening - in loopy code I would
> have expected the constants to be hoisted - at least that's what I remember
> seeing in my analysis. You have seen -mpc-relative-literal-loads, haven't you?

The cases that I have in mind involve straight-line code in functions which are called from a loop. Since they are external, only LTO would address such cases. And, since we do not control how they are built, we have to handle them as they come. As long as there's an opening to investigate the benefits and drawbacks of reverting to the legacy way considering the function size, I think that it's interesting to find out the results. Thank you.
[Bug target/63304] Aarch64 pc-relative load offset out of range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304 --- Comment #30 from Evandro --- The performance impact of always referring to constants as if they were far away is significant on targets which do not fuse ADRP and LDR together. What's the status of the solution that evaluates the function size? Should this be enabled only optionally? Would it be worth coming up with a medium code model? :-P Could this issue be left to the assembler to address by relaxing such loads? :-P Thank you.
[Bug target/63304] Aarch64 pc-relative load offset out of range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304 --- Comment #32 from Evandro --- (In reply to Ramana Radhakrishnan from comment #31)
> (In reply to Evandro from comment #30)
> > The performance impact of always referring to constants as if they were far
> > away is significant on targets which do not fuse ADRP and LDR together.
>
> What happens if you split them up and schedule them appropriately? I didn't
> see any significant impact in my benchmarking on implementations that did
> not implement such fusion. Where people want performance in these cases they
> can well use -mpc-relative-literal-loads or -mcmodel=tiny - it's in there
> already.

Because of side effects of the Haifa scheduler, the loads now pile up, and the ADRPs may affect the load issue rate rather badly if not fused. At least on our processor. Which brings up another point: shouldn't there be just one ADRP per BB or, ideally, per function? Or am I missing something?

> > What's the status of the solution that evaluates the function size?
>
> I am not working on that follow-up as I didn't see the real need for it in
> the benchmarking results I was looking at. You are welcome to investigate.

OK
[Bug target/63304] Aarch64 pc-relative load offset out of range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304 --- Comment #34 from Evandro --- (In reply to Wilco from comment #33)
> (In reply to Evandro from comment #32)
> ADRP latency to load-address should be zero on any OoO core - ADRP is
> basically a move-immediate, so can execute early and hide any latency.

In an ideal world, yes. In the actual world, they compete for limited resources that could be used by other insns.

> > Which brings another point, shouldn't there be just one ADRP per BB or,
> > ideally, per function? Or am I missing something?
>
> That's not possible in this case as the section is mergeable. An alternative
> implementation using anchors may be feasible, but GCC is extremely bad at
> using anchors efficiently - functions using several global variables also
> end up with a large number of ADRPs when you'd expect a single ADRP.

I see. I'll investigate placing the constant after the function, as before, if the estimated function size allows for it. I think that eliminating the ADRPs could potentially be more beneficial to code size than merging constants in a common literal pool (v. http://bit.ly/1Ptc8nh). Thank you.
[Bug target/58623] lack of ldp/stp optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58623 --- Comment #6 from Evandro e.menezes at samsung dot com --- What's the PR of the fwprop issue? Thank you.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #20 from Evandro e.menezes at samsung dot com --- (In reply to Ramana Radhakrishnan from comment #19)
> To my mind it seems like 407 fmoves is just a bit too berserk and,
> regardless of how efficient your core is, there is no point in having so
> many moves back and forth.

It seems that the only LRA parameter exposed is lra-max-considered-reload-pseudos. It defaults to 500: decreasing it results in more FMOVs; increasing it, in fewer. It has no effect above 1000. At 1000, the number of FMOVs decreases by 5% in some cases.
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #21 from Evandro e.menezes at samsung dot com --- (In reply to ramana.radhakrish...@arm.com from comment #20)
> What's the kind of performance delta you see if you managed to unroll the
> loop just a wee bit? Probably not much looking at the code produced here.

Comparing the cycle counts on Juno when running the program from the matrix multiplication test above built with -Ofast and unrolling:

-fno-unroll-loops: 592000
-funroll-loops --param max-unroll-times=2: 594000
-funroll-loops --param max-unroll-times=4: 592000
-funroll-loops: 59 (implies --param max-unroll-times=8)
-funroll-loops --param max-unroll-times=16: 581000

It seems to me that without effective iv-opt in place, loops have to be unrolled too aggressively to make any difference in this case, greatly sacrificing code size.
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #23 from Evandro e.menezes at samsung dot com --- (In reply to Wilco from comment #22)
> Unrolling alone isn't good enough in sum reductions. As I mentioned before,
> GCC doesn't enable any of the useful loop optimizations by default. So add
> -fvariable-expansion-in-unroller to get a good speedup with unrolling.
> Again these are all generic GCC issues.

Adding -fvariable-expansion-in-unroller when using -funroll-loops results in practically the same code being emitted.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #11 from Evandro e.menezes at samsung dot com --- (In reply to Wilco from comment #9)
> The performance cost is a much bigger issue than codesize. The problem is
> that when register pressure is high, the register allocator decides to
> allocate integer liveranges to FP registers and insert int-fp moves for
> every use/define (ie. you end up with far more moves than you would if it
> were spilled, so it is a bad thing even if int-fp moves are cheap). I
> committed a workaround (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html)
> by increasing the int-fp move cost. Can you try this and check the issue
> has indeed gone? You need -mcpu=cortex-a57.

I believe that it pretty much has, after a cursory examination. The code size after the patch is back down about 2% for the test case above. Of note, the prolog and epilog are much smaller, because the FP registers don't have to be saved and restored anymore, and the stack frame shrank correspondingly. Do you have an idea of the performance impact of this patch?
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #12 from Evandro e.menezes at samsung dot com --- (In reply to Evandro from comment #11)
> Do you have an idea of the performance impact of this patch?

At least in Dhrystone, it improved by over 2% on A57.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #14 from Evandro e.menezes at samsung dot com --- (In reply to Wilco from comment #10)
> Note currently it is not possible to use FP registers for spilling using
> the hooks - basically you still end up with int-fp moves for every
> definition and use (even when multiple uses are right next to each other),
> and rematerialization does not happen at all.

Vladimir, I had also noticed that the hooks that you pointed me to didn't seem to work as documented. Are we missing anything?
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #14 from Evandro Menezes e.menezes at samsung dot com --- Compiling the test-case above with just -O2, I can reproduce the code I mentioned initially and easily measure the cycle count to run it on target using perf. The binary created by GCC runs in about 447000 user cycles and the one created by LLVM, in about 499000 user cycles. IOW, fused multiply-add is a win on A57. Looking further into why Geekbench's {D,S}GEMM performs worse with GCC than with LLVM, both using -Ofast, GCC fails to vectorize the loop in gemm_block_kernel, while LLVM does. I should've done a more detailed analysis of this issue before submitting this bug, sorry.
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #16 from Evandro e.menezes at samsung dot com --- (In reply to Wilco from comment #15)
> Using -Ofast is not any different from -O3 -ffast-math when compiling
> non-Fortran code. As comment 10 shows, both loops are vectorized, however
> LLVM unrolls twice and uses multiple accumulators while GCC doesn't.

You're right. LLVM produces:

.LBB0_1:                       // %vector.body
                               // =>This Inner Loop Header: Depth=1
        add  x11, x9, x8
        add  x12, x10, x8
        ldp  q2, q3, [x11]
        ldp  q4, q5, [x12]
        add  x8, x8, #32       // =32
        fmla v0.2d, v2.2d, v4.2d
        fmla v1.2d, v3.2d, v5.2d
        cmp  x8, #128, lsl #12 // =524288
        b.ne .LBB0_1

And GCC:

.L3:
        ldr  q2, [x2, x0]
        add  w1, w1, 1
        ldr  q1, [x3, x0]
        cmp  w1, w4
        add  x0, x0, 16
        fmla v0.2d, v2.2d, v1.2d
        bcc  .L3

> I still don't see what this has to do with A57. You should open a generic
> bug about GCC not applying basic loop optimizations with -O3 (in fact
> limited unrolling is useful even for -O2).

Indeed, but I think that there's still a code-generation opportunity for A57 here. Note above that the registers are loaded in pairs by LLVM, while with GCC, when it unrolls the loop (more aggressively, BTW), each vector is loaded individually:

.L3:
        ldr  q28, [x15, x16]
        add  x17, x16, 16
        ldr  q29, [x14, x16]
        add  x0, x16, 32
        ldr  q30, [x15, x17]
        add  x18, x16, 48
        ldr  q31, [x14, x17]
        add  x1, x16, 64
        ...
        fmla v27.2d, v28.2d, v29.2d
        ...
        fmla v27.2d, v30.2d, v31.2d
        ...
        # Rest of 8x unroll
        bcc  .L3

It also goes without saying that this code could benefit from the post-increment addressing mode.
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #17 from Evandro e.menezes at samsung dot com --- Created attachment 33785 -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33785&action=edit Simple matrix multiplication
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 Evandro e.menezes at samsung dot com changed:

           What             |Removed |Added
           Attachment #33774|0       |1
           is obsolete      |        |

--- Comment #18 from Evandro e.menezes at samsung dot com --- Created attachment 33786 -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33786&action=edit Simple test-case
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #12 from Evandro Menezes e.menezes at samsung dot com --- Created attachment 33774 -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33774&action=edit Simple test-case
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #8 from Evandro Menezes e.menezes at samsung dot com --- (In reply to Ramana Radhakrishnan from comment #7)
> As Evandro doesn't mention flags it's hard to say whether there really is a
> problem here or not.

Both GCC and LLVM were given -O3 -ffast-math.
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #9 from Evandro Menezes e.menezes at samsung dot com --- (In reply to Wilco from comment #6)
> I ran the assembler examples on A57 hardware with identical input. The
> FMADD code is ~20% faster irrespective of the size of the input. This is
> not a surprise given that the FMADD latency is lower than the FADD and
> FMUL latency.

I ran the same Geekbench binaries on A53 and the result is about the same between the GCC and the LLVM code, if with a slight (< 1%) advantage for GCC.
[Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

            Bug ID: 63503
           Summary: [AArch64] A57 executes fused multiply-add poorly in
                    some situations
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: e.menezes at samsung dot com
                CC: spop at gcc dot gnu.org
            Target: aarch64-*

Curious why Geekbench's {D,S}GEMM as built by GCC was 8-9% slower than by LLVM, I was baffled to find that the code emitted by GCC for the innermost loop in the algorithm core is actually very good:

.L8:
        ldr   d2, [x8, w5, uxtw 3]
        ldr   d1, [x7, w5, uxtw 3]
        add   w5, w5, 1
        cmp   w5, w6
        fmadd d0, d2, d1, d0
        bne   .L8

LLVM's code is not so neat:

.LBB0_10:
        ldr  d1, [x27, x22, lsl #3]
        ldr  d2, [x9, x22, lsl #3]
        fmul d1, d1, d2
        fadd d0, d0, d1
        add  w21, w21, #1
        add  x22, x22, #1
        cmp  w21, w24, uxtw
        b.ne .LBB0_10

However, it runs faster. Methinks that the A57 microarchitecture is performing tricks for discrete FP operations but not for fused multiply-add, since both code sequences are semantically the same. Whatever it is, it seems that fused multiply-add, and perhaps its cousins, is actually a performance hit only when one depends on the results of a previous one, as in this case on the results of the fused operation in the previous loop iteration. I'll try to create a simple test-case, but, in the meantime, please chime in with your thoughts.
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #3 from Evandro Menezes e.menezes at samsung dot com --- (In reply to Andrew Pinski from comment #1)
> The other question here: are there denormals happening? That might cause
> some performance differences between using fmadd and fmul/fadd.

Nope, no denormals.
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #4 from Evandro Menezes e.menezes at samsung dot com --- Here's a simplified code to reproduce these results:

double sum(double *A, double *B, int n)
{
   int i;
   double res = 0;

   for (i = 0; i < n; i++)
      res += A[i] * B[i];

   return res;
}
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #7 from Evandro Menezes e.menezes at samsung dot com --- (In reply to Vladimir Makarov from comment #6)
> Evandro, thanks for reporting this. Sorry, I am busy with other things
> these days. I'll start to work on this PR in September to try to make some
> progress for the next GCC release. Maybe the better rematerialization in
> LRA I am working on now will help the PR too.

Vladimir, I was thinking about using the hook function to avoid using FP registers, at least when -Os is specified, for the time being. This way, registers would still be allocated by the LRA, but this side effect would be under control. Or do y'all think that it's better to wait a little while longer?
[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 Evandro Menezes e.menezes at samsung dot com changed:

           What      |Removed |Added
           Status    |WAITING |RESOLVED
           Resolution|---     |WORKSFORME

--- Comment #16 from Evandro Menezes e.menezes at samsung dot com --- If it's working, it's good for me.
[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #5 from Evandro Menezes e.menezes at samsung dot com --- Created attachment 33249 -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33249&action=edit Dhrystone, part 2 of 3

I first observed this issue when looking into Dhrystone built with fairly standard options:

-O2 -fno-short-enums -fno-inline -fno-inline-functions -fno-inline-small-functions -fno-inline-functions-called-once -fomit-frame-pointer -funroll-all-loops

If I add -mno-lra, the code size in dhry_1.o is about 2% smaller.
[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 --- Comment #9 from Evandro Menezes e.menezes at samsung dot com --- It seems to me that it's the LRA which is forcing the use of FP registers, so, even if the patterns are fixed, I believe that in the end the combiner would just give up and ICE. With this assumption, which is open to corrections, I believe that the LRA needs to be properly managed according to the options passed on to the target.
[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 --- Comment #11 from Evandro Menezes e.menezes at samsung dot com --- (In reply to ktkachov from comment #10)
> What we really need here is a preprocessed testcase showing the problem.
> It should be fairly easy to lock down on the problem then.

I'm on it.
[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 --- Comment #13 from Evandro Menezes e.menezes at samsung dot com --- Created attachment 33253 -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33253&action=edit Test-case

This test-case is a stripped-down version of Dhrystone, where the issue was first observed. Built with just -O2, it ends up with a handful of FMOVs, which are then avoided if -mno-lra is also specified.
[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 Evandro Menezes e.menezes at samsung dot com changed:

           What             |Removed |Added
           Attachment #33246|0       |1
           is obsolete      |        |

--- Comment #14 from Evandro Menezes e.menezes at samsung dot com --- Created attachment 33254 -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33254&action=edit Patch

For the sake of correctness, this patch uses the more generic flag to qualify the spilling.
[Bug target/62014] New: [AArch64] Using -mgeneral-regs-only may lead to ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014

            Bug ID: 62014
           Summary: [AArch64] Using -mgeneral-regs-only may lead to ICE
           Product: gcc
           Version: 4.10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: e.menezes at samsung dot com

Created attachment 33245 -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33245&action=edit This patch should fix this issue, though it needs a test-case.

In some cases, when the LRA spills a register into an FP register with the option -mgeneral-regs-only specified, there is an ICE. It seems to be caused by the LRA assuming that the FP registers are always available and not being told by the target when they aren't.
[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 --- Comment #2 from Evandro Menezes e.menezes at samsung dot com --- (In reply to Andrew Pinski from comment #1)
> > + /* Do not spill into FP registers when -mgeneral-regs-only is specified. *
> You are missing a / in your comment.

Ermahgerd!
[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 Evandro Menezes e.menezes at samsung dot com changed:

           What             |Removed |Added
           Attachment #33245|0       |1
           is obsolete      |        |

--- Comment #3 from Evandro Menezes e.menezes at samsung dot com --- Created attachment 33246 -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33246&action=edit This patch should fix this issue, though it needs a test-case
[Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

            Bug ID: 61915
           Summary: [AArch64] Default use of the LRA results in extra code
                    size
           Product: gcc
           Version: 4.10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: e.menezes at samsung dot com

The code-size issue that I observed with the default use of the LRA comes from variables being spilled into FP registers, which then have to be saved and restored, increasing code size. For example, in Dhrystone, out of dhry_1.c I see sequences like this:

        ldr  d9, [sp, 144]
        ...
        fmov x0, d9
        bl   printf
        ...
        fmov x0, d9
        ...
        bl   printf

By disabling the LRA, the code is a tad leaner (2%):

        ldr  x0, [sp, 144]
        ...
        bl   printf
        ...
        ldr  x0, [sp, 144]
        ...
        bl   printf

Moreover, is transferring registers between the GP and the FP register files always cheap? In some x86 processors this used to be accomplished internally through the load-store unit anyway (e.g., Opteron). How is this accomplished internally in A53 and A57? Is using the LRA by default clearly beneficial in other cases? At the Cauldron I mentioned some variables that could be rematerialized when needed instead of being spilled, but I could not reproduce that. I'll try some more to spot this behavior.
[Bug target/61915] [AArch64] Default use of the LRA results in extra code size
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #3 from Evandro Menezes e.menezes at samsung dot com --- In Opteron, there was a path from FP to the GP registers, but not the other way around. That path was eventually made symmetric in Barcelona only.