https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106553
--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> --- (In reply to Alexander Monakov from comment #1) > Are you sure the testcase is correctly reduced, i.e. does it show the same > performance degradation? Latency-wise the scheduler is making the correct > decision here: we really want to schedule second-to-last FMA The reduction wasn't tested for performance, I'm not even claiming the final result is optimal because scheduling was completely disabled. > > y = v_fma_f32 (y, r2, x); > > earlier than its predecessor > > r = v_fma_f32 (y, r2, z); > > because we need to compute y*r2 before the last FMA. The relative order of the instructions didn't change as far as I can tell in the reduced example. I had expected the mul to be moved earlier. __v_sinf(__Float32x4_t, __Float32x4_t, __Float32x4_t, __Float32x4_t): fmul v3.4s, v3.4s, v3.4s mov v5.16b, v0.16b mov v4.16b, v0.16b fmla v5.4s, v2.4s, v3.4s fmla v4.4s, v5.4s, v3.4s fmla v0.4s, v4.4s, v3.4s fmul v6.4s, v3.4s, v0.4s mov v0.16b, v1.16b fmla v0.4s, v4.4s, v3.4s fmla v0.4s, v6.4s, v0.4s ret as the copy of v0 into v2 is still redundant. However looking at the RTL of the reduction, I don't really understand why the mov existed. The bad case is (insn 13 11 12 2 (set (reg:V4SF 94 [ _9 ]) (fma:V4SF (reg:V4SF 96 [ _11 ]) (reg/v:V4SF 93 [ r2 ]) (reg/v:V4SF 99 [ x ]))) "":14605:10 2206 {fmav4sf4} (expr_list:REG_DEAD (reg/v:V4SF 99 [ x ]) (nil))) (insn 12 13 14 2 (set (reg:V4SF 95 [ _10 ]) (fma:V4SF (reg:V4SF 96 [ _11 ]) (reg/v:V4SF 93 [ r2 ]) (reg:V4SF 106))) "":14605:10 2206 {fmav4sf4} (expr_list:REG_DEAD (reg:V4SF 106) (expr_list:REG_DEAD (reg:V4SF 96 [ _11 ]) (nil)))) (insn 14 12 19 2 (set (reg:V4SF 103) (mult:V4SF (reg/v:V4SF 93 [ r2 ]) (reg:V4SF 94 [ _9 ]))) "":20:7 2186 {mulv4sf3} (expr_list:REG_DEAD (reg:V4SF 94 [ _9 ]) (expr_list:REG_DEAD (reg/v:V4SF 93 [ r2 ]) (nil)))) and the good case (insn 12 11 13 2 (set (reg:V4SF 95 [ _10 ]) (fma:V4SF (reg:V4SF 96 [ _11 ]) (reg/v:V4SF 93 [ r2 ]) (reg:V4SF 106))) "":14605:10 2206 {fmav4sf4} (expr_list:REG_DEAD (reg:V4SF 106) (nil))) (insn 13 12 14 2 (set (reg:V4SF 94 [ _9 ]) (fma:V4SF (reg:V4SF 96 [ _11 ]) (reg/v:V4SF 93 [ r2 ]) (reg/v:V4SF 99 [ x ]))) "":14605:10 2206 {fmav4sf4} (expr_list:REG_DEAD (reg/v:V4SF 99 [ x ]) (expr_list:REG_DEAD (reg:V4SF 96 [ _11 ]) (nil)))) (insn 14 13 15 2 (set (reg:V4SF 103) (mult:V4SF (reg/v:V4SF 93 [ r2 ]) (reg:V4SF 94 [ _9 ]))) "":20:7 2186 {mulv4sf3} (expr_list:REG_DEAD (reg:V4SF 94 [ _9 ]) (expr_list:REG_DEAD (reg/v:V4SF 93 [ r2 ]) (nil)))) So I don't really see why the live range of _9 was extended... so maybe this is register allocation after all. It's weird that -fno-schedule-insns removes the redundant moves in all cases though.. But perhaps that's just coincidence...