Hi,
While looking into why we don't generate fmls with a scalar register
as the last operand, I found a difference in how fnma<mode>4 is
described in RTL between the scalar and vector patterns, which I think
is causing the missed optimization.
Look at the scalar version:
(define_insn "fnma<mode>4"
  [(set (match_operand:GPF_F16 0 "register_operand" "=w")
        (fma:GPF_F16
          (neg:GPF_F16 (match_operand:GPF_F16 1 "register_operand" "w"))
          (match_operand:GPF_F16 2 "register_operand" "w")
          (match_operand:GPF_F16 3 "register_operand" "w")))]
  "TARGET_FLOAT"
  "fmsub\\t%<s>0, %<s>1, %<s>2, %<s>3"
  [(set_attr "type" "fmac<stype>")]
)
vs the vector version:
(define_insn "fnma<mode>4"
  [(set (match_operand:VHSDF 0 "register_operand" "=w")
        (fma:VHSDF
          (match_operand:VHSDF 1 "register_operand" "w")
          (neg:VHSDF
            (match_operand:VHSDF 2 "register_operand" "w"))
          (match_operand:VHSDF 3 "register_operand" "0")))]
  "TARGET_SIMD"
  "fmls\\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
  [(set_attr "type" "neon_fp_mla_<stype><q>")]
)
Notice that the neg is in a different position in each pattern: the
scalar version negates operand 1, while the vector version negates
operand 2. What is the reason for that?
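
For reference, both placements compute the same value, op3 - op1 * op2,
so the difference is only in the shape of the RTL and not in the
arithmetic. A minimal C sketch of that, where the function names are
made up for illustration and a plain multiply-add stands in for the
fused operation:

```c
#include <assert.h>

/* Shape of the scalar pattern: the neg wraps operand 1. */
float fnma_neg_op1(float a, float b, float c)
{
    return (-a) * b + c;
}

/* Shape of the vector pattern: the neg wraps operand 2. */
float fnma_neg_op2(float a, float b, float c)
{
    return a * (-b) + c;
}

/* Hypothetical helper: for non-NaN inputs the two forms agree, because
   negating either factor only flips the sign of the product. */
void check_forms_agree(float a, float b, float c)
{
    assert(fnma_neg_op1(a, b, c) == fnma_neg_op2(a, b, c));
}
```

For example, check_forms_agree(2.0f, 3.0f, 10.0f) passes, with both
functions returning 4.0f.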
Thanks,
Andrew