https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106553

--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #1)
> Are you sure the testcase is correctly reduced, i.e. does it show the same
> performance degradation? Latency-wise the scheduler is making the correct
> decision here: we really want to schedule second-to-last FMA


The reduction wasn't tested for performance, I'm not even claiming the final
result is optimal because scheduling was completely disabled.

> 
>   y = v_fma_f32 (y, r2, x);
> 
> earlier than its predecessor
> 
>   r = v_fma_f32 (y, r2, z);
> 
> because we need to compute y*r2 before the last FMA.

The relative order of the instructions didn't change as far as I can tell in
the reduced example.

I had expected the mul to be moved earlier.

__v_sinf(__Float32x4_t, __Float32x4_t, __Float32x4_t, __Float32x4_t):
        fmul    v3.4s, v3.4s, v3.4s
        mov     v5.16b, v0.16b
        mov     v4.16b, v0.16b
        fmla    v5.4s, v2.4s, v3.4s
        fmla    v4.4s, v5.4s, v3.4s
        fmla    v0.4s, v4.4s, v3.4s
        fmul    v6.4s, v3.4s, v0.4s
        mov     v0.16b, v1.16b
        fmla    v0.4s, v4.4s, v3.4s
        fmla    v0.4s, v6.4s, v0.4s
        ret

as the copy of v0 into v2 is still redundant. However looking at the RTL of the
reduction, I don't really understand why the mov existed.

The bad case is

(insn 13 11 12 2 (set (reg:V4SF 94 [ _9 ])
        (fma:V4SF (reg:V4SF 96 [ _11 ])
            (reg/v:V4SF 93 [ r2 ])
            (reg/v:V4SF 99 [ x ]))) "":14605:10 2206 {fmav4sf4}
     (expr_list:REG_DEAD (reg/v:V4SF 99 [ x ])
        (nil)))
(insn 12 13 14 2 (set (reg:V4SF 95 [ _10 ])
        (fma:V4SF (reg:V4SF 96 [ _11 ])
            (reg/v:V4SF 93 [ r2 ])
            (reg:V4SF 106))) "":14605:10 2206 {fmav4sf4}
     (expr_list:REG_DEAD (reg:V4SF 106)
        (expr_list:REG_DEAD (reg:V4SF 96 [ _11 ])
            (nil))))
(insn 14 12 19 2 (set (reg:V4SF 103)
        (mult:V4SF (reg/v:V4SF 93 [ r2 ])
            (reg:V4SF 94 [ _9 ]))) "":20:7 2186 {mulv4sf3}
     (expr_list:REG_DEAD (reg:V4SF 94 [ _9 ])
        (expr_list:REG_DEAD (reg/v:V4SF 93 [ r2 ])
            (nil))))

and the good case

(insn 12 11 13 2 (set (reg:V4SF 95 [ _10 ])
        (fma:V4SF (reg:V4SF 96 [ _11 ])
            (reg/v:V4SF 93 [ r2 ])
            (reg:V4SF 106))) "":14605:10 2206 {fmav4sf4}
     (expr_list:REG_DEAD (reg:V4SF 106)
        (nil)))
(insn 13 12 14 2 (set (reg:V4SF 94 [ _9 ])
        (fma:V4SF (reg:V4SF 96 [ _11 ])
            (reg/v:V4SF 93 [ r2 ])
            (reg/v:V4SF 99 [ x ]))) "":14605:10 2206 {fmav4sf4}
     (expr_list:REG_DEAD (reg/v:V4SF 99 [ x ])
        (expr_list:REG_DEAD (reg:V4SF 96 [ _11 ])
            (nil))))
(insn 14 13 15 2 (set (reg:V4SF 103)
        (mult:V4SF (reg/v:V4SF 93 [ r2 ])
            (reg:V4SF 94 [ _9 ]))) "":20:7 2186 {mulv4sf3}
     (expr_list:REG_DEAD (reg:V4SF 94 [ _9 ])
        (expr_list:REG_DEAD (reg/v:V4SF 93 [ r2 ])
            (nil))))

So I don't really see why the live range of _9 was extended... so maybe this is
register allocation after all.

It's weird that -fno-schedule-insns removes the redundant moves in all cases
though.. But perhaps that's just coincidence...

Reply via email to