https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #20 from ramana.radhakrishnan at arm dot com <ramana.radhakrishnan at arm dot com> ---
On 23/10/14 00:28, e.menezes at samsung dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
>
> --- Comment #16 from Evandro <e.menezes at samsung dot com> ---
> (In reply to Wilco from comment #15)
>> Using -Ofast is not any different from -O3 -ffast-math when compiling
>> non-Fortran code. As comment 10 shows, both loops are vectorized, however
>> LLVM unrolls twice and uses multiple accumulators while GCC doesn't.
>
> You're right.  LLVM produces:
>
> .LBB0_1:                                // %vector.body
>                                          // =>This Inner Loop Header: Depth=1
>          add      x11, x9, x8
>          add      x12, x10, x8
>          ldp      q2, q3, [x11]
>          ldp      q4, q5, [x12]
>          add      x8, x8, #32             // =32
>          fmla     v0.2d, v2.2d, v4.2d
>          fmla     v1.2d, v3.2d, v5.2d
>          cmp      x8, #128, lsl #12      // =524288
>          b.ne    .LBB0_1
>
> And GCC:
>
> .L3:
>          ldr     q2, [x2, x0]
>          add     w1, w1, 1
>          ldr     q1, [x3, x0]
>          cmp     w1, w4
>          add     x0, x0, 16
>          fmla    v0.2d, v2.2d, v1.2d
>          bcc     .L3
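
For reference, the loop being compiled here is presumably along these lines
(a sketch only: the function and array names are made up, and the trip count
of 65536 doubles is just inferred from the 524288-byte bound in the LLVM
code above):

    /* Hypothetical reconstruction, not the PR's actual test case.  */
    double
    dot (const double *a, const double *b, int n)
    {
      double acc = 0.0;
      for (int i = 0; i < n; i++)
        acc += a[i] * b[i];   /* becomes the single fmla chain once vectorized */
      return acc;
    }
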
>
>> I still don't see what this has to do with A57. You should open a generic
>> bug about GCC not applying basic loop optimizations with -O3 (in fact
>> limited unrolling is useful even for -O2).
>
> Indeed, but I think that there's still a code-generation opportunity for A57
> here.

What you mention is a general code generation improvement for AArch64.

There's nothing Cortex-A57 specific about it. In the AArch64 backend, we 
think architecture first and then micro-architecture.
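
At the source level the improvement amounts to unrolling the reduction and
keeping independent partial sums, roughly as below (again only a sketch of
the hypothetical loop above, and legal only because -ffast-math lets the
floating-point reduction be reassociated):

    /* Two independent accumulators break the serial dependence on a single
       fmla destination, mirroring v0/v1 in the LLVM code above.  Assumes n
       is a multiple of 4; illustration only, not the PR's test case.  */
    double
    dot_unrolled (const double *a, const double *b, int n)
    {
      double acc0 = 0.0, acc1 = 0.0;
      for (int i = 0; i < n; i += 4)
        {
          acc0 += a[i]     * b[i];
          acc0 += a[i + 1] * b[i + 1];
          acc1 += a[i + 2] * b[i + 2];
          acc1 += a[i + 3] * b[i + 3];
        }
      return acc0 + acc1;
    }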

>
> Note above that LLVM loads the registers in pairs, whereas GCC, when it
> unrolls the loop (more aggressively, BTW), loads each vector individually:
>
> .L3:
>          ldr     q28, [x15, x16]
>          add     x17, x16, 16
>          ldr     q29, [x14, x16]
>          add     x0, x16, 32
>          ldr     q30, [x15, x17]
>          add     x18, x16, 48
>          ldr     q31, [x14, x17]
>          add     x1, x16, 64
>          ...
>          fmla    v27.2d, v28.2d, v29.2d
>          ...
>          fmla    v27.2d, v30.2d, v31.2d
>          ...     # Rest of 8x unroll
>          bcc     .L3
>
> It also goes without saying that this code could benefit from the
> post-increment addressing mode.


What kind of performance delta do you see if you manage to unroll the 
loop just a wee bit? Probably not much, looking at the code produced 
here.

Ramana

>
