https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
--- Comment #12 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
OK. I found that GCC is worse than Clang even without vectorization:

https://godbolt.org/z/addr54Gc6

GCC (14 instructions inside the loop):

        fld     fa3,0(a0)
        fld     fa5,8(a0)
        fld     fa1,16(a0)
        fsub.d  fa4,ft2,fa3
        addi    a0,a0,160
        fadd.d  fa5,fa5,fa1
        addi    a1,a1,160
        addi    a5,a5,160
        fmadd.d fa4,fa4,fa2,fa3
        fnmsub.d        fa5,fa5,ft1,ft0
        fsd     fa4,-160(a1)
        fld     fa4,-152(a0)
        fadd.d  fa4,fa4,fa0
        fmadd.d fa5,fa5,fa2,fa4
        fsd     fa5,-160(a5)

Clang (12 instructions inside the loop):

        fld     fa1, -8(a0)
        fld     fa0, 0(a0)
        fld     ft0, 8(a0)
        fmadd.d fa1, fa1, fa4, fa5
        fsd     fa1, 0(a1)
        fld     fa1, 0(a0)
        fadd.d  fa0, ft0, fa0
        fmadd.d fa0, fa0, fa2, fa3
        fadd.d  fa1, fa0, fa1
        add     a4, a1, a3
        fsd     fa1, -376(a4)
        addi    a1, a1, 160
        addi    a0, a0, 160

The critical point is that GCC has 6 floating-point operations:

        fsub.d  fa4,ft2,fa3
        fadd.d  fa5,fa5,fa1
        fmadd.d fa4,fa4,fa2,fa3
        fnmsub.d        fa5,fa5,ft1,ft0
        fadd.d  fa4,fa4,fa0
        fmadd.d fa5,fa5,fa2,fa4

while Clang has 4:

        fmadd.d fa1, fa1, fa4, fa5
        fadd.d  fa0, ft0, fa0
        fmadd.d fa0, fa0, fa2, fa3
        fadd.d  fa1, fa0, fa1

I think the 2 extra floating-point operations are critical to performance, since double-precision floating-point operations are usually costly on real hardware.
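
The floating-point operation counts quoted above can be checked mechanically. The following is a minimal sketch, not part of the original report: it copies the two loop bodies verbatim from the comment and counts the arithmetic instructions. The helper name `fp_arith_ops` and the prefix list are illustrative assumptions, and "floating-point operation" is assumed to mean an FP arithmetic instruction (so `fld`/`fsd` and integer `add`/`addi` are excluded, and a fused multiply-add counts as one instruction).

```python
# Loop bodies as quoted in the comment, one instruction per line.
GCC_LOOP = """
fld fa3,0(a0)
fld fa5,8(a0)
fld fa1,16(a0)
fsub.d fa4,ft2,fa3
addi a0,a0,160
fadd.d fa5,fa5,fa1
addi a1,a1,160
addi a5,a5,160
fmadd.d fa4,fa4,fa2,fa3
fnmsub.d fa5,fa5,ft1,ft0
fsd fa4,-160(a1)
fld fa4,-152(a0)
fadd.d fa4,fa4,fa0
fmadd.d fa5,fa5,fa2,fa4
fsd fa5,-160(a5)
"""

CLANG_LOOP = """
fld fa1, -8(a0)
fld fa0, 0(a0)
fld ft0, 8(a0)
fmadd.d fa1, fa1, fa4, fa5
fsd fa1, 0(a1)
fld fa1, 0(a0)
fadd.d fa0, ft0, fa0
fmadd.d fa0, fa0, fa2, fa3
fadd.d fa1, fa0, fa1
add a4, a1, a3
fsd fa1, -376(a4)
addi a1, a1, 160
addi a0, a0, 160
"""

# Assumption: these mnemonic prefixes cover the FP arithmetic seen here
# (plain and fused operations); loads, stores and integer ops do not match.
FP_ARITH_PREFIXES = (
    "fadd.", "fsub.", "fmul.", "fdiv.",
    "fmadd.", "fmsub.", "fnmadd.", "fnmsub.",
)


def fp_arith_ops(listing):
    """Return the FP arithmetic mnemonics in a listing, in program order."""
    ops = []
    for line in listing.strip().splitlines():
        mnemonic = line.split()[0]
        if mnemonic.startswith(FP_ARITH_PREFIXES):
            ops.append(mnemonic)
    return ops


gcc_ops = fp_arith_ops(GCC_LOOP)
clang_ops = fp_arith_ops(CLANG_LOOP)
print("GCC  :", len(gcc_ops), gcc_ops)
print("Clang:", len(clang_ops), clang_ops)
```

Counting instructions this way says nothing about latency or dependency chains; it only confirms the 6 versus 4 figure for the listings as quoted.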