https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #12 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
OK. I found the problem even without vectorization.

GCC generates worse code than Clang:

https://godbolt.org/z/addr54Gc6

GCC (15 instructions shown inside the loop):

        fld     fa3,0(a0)
        fld     fa5,8(a0)
        fld     fa1,16(a0)
        fsub.d  fa4,ft2,fa3
        addi    a0,a0,160
        fadd.d  fa5,fa5,fa1
        addi    a1,a1,160
        addi    a5,a5,160
        fmadd.d fa4,fa4,fa2,fa3
        fnmsub.d        fa5,fa5,ft1,ft0
        fsd     fa4,-160(a1)
        fld     fa4,-152(a0)
        fadd.d  fa4,fa4,fa0
        fmadd.d fa5,fa5,fa2,fa4
        fsd     fa5,-160(a5)

Clang (13 instructions shown inside the loop):

        fld     fa1, -8(a0)
        fld     fa0, 0(a0)
        fld     ft0, 8(a0)
        fmadd.d fa1, fa1, fa4, fa5
        fsd     fa1, 0(a1)
        fld     fa1, 0(a0)
        fadd.d  fa0, ft0, fa0
        fmadd.d fa0, fa0, fa2, fa3
        fadd.d  fa1, fa0, fa1
        add     a4, a1, a3
        fsd     fa1, -376(a4)
        addi    a1, a1, 160
        addi    a0, a0, 160

The critical difference is this:

GCC has:

        fsub.d  fa4,ft2,fa3
        fadd.d  fa5,fa5,fa1
        fmadd.d fa4,fa4,fa2,fa3
        fnmsub.d        fa5,fa5,ft1,ft0
        fadd.d  fa4,fa4,fa0
        fmadd.d fa5,fa5,fa2,fa4

6 floating-point operations.

Clang has:

        fmadd.d fa1, fa1, fa4, fa5
        fadd.d  fa0, ft0, fa0
        fmadd.d fa0, fa0, fa2, fa3
        fadd.d  fa1, fa0, fa1

Clang has 4.

I think the 2 extra floating-point operations are critical to performance,
since double-precision floating-point operations are usually costly on real
hardware.
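
For anyone who wants to re-check the floating-point counts, here is a minimal
sketch that tallies them from the two loop bodies quoted above. The set of
mnemonics treated as "floating-point arithmetic" is my own assumption (loads
and stores such as fld/fsd are deliberately excluded), and the helper names
are purely illustrative, not part of any GCC or LLVM tooling.

```python
# Mnemonics counted as double-precision FP arithmetic in this sketch.
# Assumption: fld/fsd (memory traffic) and addi/add (integer) are excluded.
FP_ARITH = {"fadd.d", "fsub.d", "fmul.d", "fdiv.d",
            "fmadd.d", "fmsub.d", "fnmadd.d", "fnmsub.d"}

# Loop bodies copied from the comment above.
GCC_LOOP = """
        fld     fa3,0(a0)
        fld     fa5,8(a0)
        fld     fa1,16(a0)
        fsub.d  fa4,ft2,fa3
        addi    a0,a0,160
        fadd.d  fa5,fa5,fa1
        addi    a1,a1,160
        addi    a5,a5,160
        fmadd.d fa4,fa4,fa2,fa3
        fnmsub.d        fa5,fa5,ft1,ft0
        fsd     fa4,-160(a1)
        fld     fa4,-152(a0)
        fadd.d  fa4,fa4,fa0
        fmadd.d fa5,fa5,fa2,fa4
        fsd     fa5,-160(a5)
"""

CLANG_LOOP = """
        fld     fa1, -8(a0)
        fld     fa0, 0(a0)
        fld     ft0, 8(a0)
        fmadd.d fa1, fa1, fa4, fa5
        fsd     fa1, 0(a1)
        fld     fa1, 0(a0)
        fadd.d  fa0, ft0, fa0
        fmadd.d fa0, fa0, fa2, fa3
        fadd.d  fa1, fa0, fa1
        add     a4, a1, a3
        fsd     fa1, -376(a4)
        addi    a1, a1, 160
        addi    a0, a0, 160
"""


def mnemonics(listing):
    """Return the first token (the mnemonic) of every non-empty line."""
    return [line.split()[0] for line in listing.splitlines() if line.strip()]


def count_fp_arith(listing):
    """Count the instructions whose mnemonic is in FP_ARITH."""
    return sum(1 for m in mnemonics(listing) if m in FP_ARITH)


if __name__ == "__main__":
    print("GCC   FP arithmetic ops:", count_fp_arith(GCC_LOOP))    # 6
    print("Clang FP arithmetic ops:", count_fp_arith(CLANG_LOOP))  # 4
```

This only counts instructions. It says nothing about latency or throughput on
a particular core, so it supports the "2 extra operations" observation but not
any specific performance number.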
