https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #20 from ramana.radhakrishnan at arm dot com <ramana.radhakrishnan at arm dot com> ---
On 23/10/14 00:28, e.menezes at samsung dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
>
> --- Comment #16 from Evandro <e.menezes at samsung dot com> ---
> (In reply to Wilco from comment #15)
>> Using -Ofast is not any different from -O3 -ffast-math when compiling
>> non-Fortran code. As comment 10 shows, both loops are vectorized, however
>> LLVM unrolls twice and uses multiple accumulators while GCC doesn't.
>
> You're right.  LLVM produces:
>
> .LBB0_1:                                // %vector.body
>                                         // =>This Inner Loop Header: Depth=1
>         add     x11, x9, x8
>         add     x12, x10, x8
>         ldp     q2, q3, [x11]
>         ldp     q4, q5, [x12]
>         add     x8, x8, #32             // =32
>         fmla    v0.2d, v2.2d, v4.2d
>         fmla    v1.2d, v3.2d, v5.2d
>         cmp     x8, #128, lsl #12       // =524288
>         b.ne    .LBB0_1
>
> And GCC:
>
> .L3:
>         ldr     q2, [x2, x0]
>         add     w1, w1, 1
>         ldr     q1, [x3, x0]
>         cmp     w1, w4
>         add     x0, x0, 16
>         fmla    v0.2d, v2.2d, v1.2d
>         bcc     .L3
>
>> I still don't see what this has to do with A57. You should open a generic
>> bug about GCC not applying basic loop optimizations with -O3 (in fact
>> limited unrolling is useful even for -O2).
>
> Indeed, but I think that there's still a code-generation opportunity for A57
> here.

What you mention is a general code generation improvement for AArch64.
There's nothing Cortex-A57 specific about it. In the AArch64 backend, we
think architecture and then micro-architecture.

>
> Note above that the registers are loaded in pairs by LLVM, while GCC, when it
> unrolls the loop, more aggressively BTW, each vector is loaded individually:
>
> .L3:
>         ldr     q28, [x15, x16]
>         add     x17, x16, 16
>         ldr     q29, [x14, x16]
>         add     x0, x16, 32
>         ldr     q30, [x15, x17]
>         add     x18, x16, 48
>         ldr     q31, [x14, x17]
>         add     x1, x16, 64
>         ...
>         fmla    v27.2d, v28.2d, v29.2d
>         ...
>         fmla    v27.2d, v30.2d, v31.2d
>         ...                             # Rest of 8x unroll
>         bcc     .L3
>
> It also goes without saying that this code could also benefit from the
> post-increment addressing mode.

What's the kind of performance delta you see if you managed to unroll
the loop just a wee bit? Probably not much looking at the code produced
here.

Ramana
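
For anyone following the thread, here is a minimal C sketch of the "multiple accumulators" transformation Wilco refers to, written at the scalar level. The function names dot_one_acc and dot_two_acc are hypothetical, and this is not the test case attached to the bug; it only illustrates the shape of what the LLVM loop above does with v0 and v1, assuming a plain dot product of two double arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Single accumulator: every addition depends on the previous one,
   so the loop is bound by the latency of the multiply-add chain. */
double dot_one_acc(const double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled twice with two independent accumulators (hypothetical
   hand-written equivalent of the v0/v1 split in the LLVM output).
   The two dependency chains can overlap in the pipeline. */
double dot_two_acc(const double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    double sum0 = 0.0, sum1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                  /* odd element left over */
        sum0 += a[i] * b[i];
    return sum0 + sum1;         /* combine the partial sums once */
}
```

Note that the two-accumulator version adds the products in a different order, so for general floating-point inputs the result can differ in the last bits. That reassociation is why a compiler would normally only make this change on its own under -ffast-math (or -Ofast), as in this report.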