When I compile the following code with 'gcc -O3 --save-temps -c':

double foo(double x, double y)
{
     return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

gcc 3.x gives one load of the constant 0.1234, one multiplication
0.1234 * y, one addition, one subtraction, and the final
multiplication: total = one constant (load) and four fp operations.

gcc 4.0 (20050213 snapshot), on the other hand, compiles (x - 0.1234 *
y) as (x + (-0.1234) * y), and thus doesn't recognize that it is the
same constant as in the other expression.  Thus, it produces *two*
constants (2 loads), and *five* fp operations (3 multiplications):

foo:
        pushl   %ebp
        movl    %esp, %ebp
        fldl    16(%ebp)
        fld     %st(0)
        fldl    8(%ebp)
        fxch    %st(1)
        fmull   .LC0
        fxch    %st(2)
        popl    %ebp
        fmull   .LC1
        fxch    %st(2)
        fadd    %st(1), %st
        fxch    %st(1)
        faddp   %st, %st(2)
        fmulp   %st, %st(1)
        ret

As you can imagine, this leads to a major slowdown in code that has
lots of multiply-add and multiply-subtract combinations...in
particular any FFT (such as our FFTW, www.fftw.org) could
suffer a lot.
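
For illustration, the two code shapes described above can be written out
at the source level.  This is only a sketch: the function names are
hypothetical, and it shows the operations the generated assembly
performs, not what gcc does internally.

```c
/* Illustrative sketch only; the function names are hypothetical. */

/* Shape of the gcc 3.x output on x86: one constant, four fp operations. */
double foo_one_constant(double x, double y)
{
    double t = 0.1234 * y;      /* single load of 0.1234, one multiplication */
    return (x + t) * (x - t);   /* addition, subtraction, final multiplication */
}

/* Shape of the gcc 4.0 snapshot output: two constants, five fp operations. */
double foo_two_constants(double x, double y)
{
    double a = x + 0.1234 * y;      /* multiply by  0.1234, then add */
    double b = x + (-0.1234) * y;   /* multiply by -0.1234, then add */
    return a * b;                   /* final multiplication */
}
```

Both functions compute the same mathematical value; the second simply
needs an extra constant and an extra multiplication to get there.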

Thanks for your efforts,
Steven G Johnson

PS. When you fix this, please don't re-introduce another optimizer bug
that appears in gcc 3.x.  In particular, when compiling for a PowerPC
target, it *should* produce one constant load, one fused multiply-add,
one fused multiply-subtract, and one multiplication.  gcc 3.x, on the
other hand, pulls out the (0.1234 * y) in CSE, and thus does not
exploit the fma.  gcc 4.0 on PowerPC (MacOS 10.3) produces:

_foo:
        mflr r0
        bcl 20,31,"L00000000001$pb"
"L00000000001$pb":
        stw r31,-4(r1)
        fmr f13,f1
        mflr r31
        stw r0,8(r1)
        lwz r0,8(r1)
        addis r2,r31,ha16(LC0-"L00000000001$pb")
        lfd f1,lo16(LC0-"L00000000001$pb")(r2)
        addis r2,r31,ha16(LC1-"L00000000001$pb")
        lfd f0,lo16(LC1-"L00000000001$pb")(r2)
        mtlr r0
        fmadd f1,f2,f1,f13
        lwz r31,-4(r1)
        fmadd f2,f2,f0,f13
        fmul f1,f1,f2
        blr

which utilizes the fma, but loads the constant twice (as 0.1234 and
-0.1234) instead of using fmadd and fmsub.

PPS. In general, turning negative constants into positive constants by
changing additions into subtractions can lead to substantial speedups
by reducing the number of fp constants in certain kinds of code.
e.g. "manually" doing this in FFTW gained us 10-15% in speed; YMMV.
Something to think about.
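
A minimal sketch of that kind of manual rewrite follows.  The constant K
and the function names are hypothetical and are not taken from FFTW;
whether a particular compiler keeps the two forms distinct is, of
course, the subject of this report.

```c
/* Hypothetical constant and function names, for illustration only. */
#define K 0.1234

/* Before: written with a negated constant, so both K and -K appear. */
void rotate_two_constants(double re, double im,
                          double *out_re, double *out_im)
{
    *out_re = re + (-K) * im;
    *out_im = im + K * re;
}

/* After: the addition of (-K) * im becomes a subtraction of K * im,
   so only the single constant K appears in the source. */
void rotate_one_constant(double re, double im,
                         double *out_re, double *out_im)
{
    *out_re = re - K * im;
    *out_im = im + K * re;
}
```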

Environment:
System: Linux fftw.org 2.6.3-1-686-smp #2 SMP Tue Feb 24 20:29:08 EST 2004 i686 GNU/Linux
Architecture: i686

        
host: i686-pc-linux-gnu
build: i686-pc-linux-gnu
target: i686-pc-linux-gnu
configured with: ../configure --prefix=/home/stevenj/gcc4

How-To-Repeat:
Compile the foo() subroutine above with gcc -O3 -c --save-temps and look
at the assembler output.

-- 
           Summary: [4.0 Regression] pessimizes fp multiply-add/subtract
                    combo
           Product: gcc
           Version: 4.0.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: stevenj at fftw dot org
                CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19988
