Hi Juzhe, Demin, Jeff,

This email is intended to continue the discussion started in
https://marc.info/?l=gcc-patches&m=170927452922009&w=2 about combining vfmv.v.f
and vfmxx.vv instructions into the scalar-vector form vfmxx.vf.
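
For readers without the earlier thread at hand, the sketch below shows the kind
of source loop where this matters. It is an illustrative assumption, not the
testcase linked further down: the function name scaled_accumulate and its
parameters are made up. When such a loop is auto-vectorized for RVV, the scalar
multiplicand may be broadcast into a vector register (vfmv.v.f) and then
consumed by a vector-vector FMA (vfmacc.vv), whereas the scalar-vector form
(vfmacc.vf) would take the scalar directly.

```c
#include <stddef.h>

/* Hypothetical example, not the godbolt testcase from this email:
   accumulate scalar * row[i] into acc[i].  The multiply-add has one
   scalar multiplicand and one vector multiplicand, which is the shape
   that a scalar-vector FMA instruction can express without a separate
   broadcast.  */
void
scaled_accumulate (float *restrict acc, const float *restrict row,
                   float scalar, size_t n)
{
  for (size_t i = 0; i < n; i++)
    acc[i] += scalar * row[i];
}
```

Whether the broadcast is actually kept separate from the FMA depends on the
compiler version and options used.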

That thread mentioned that the late-combine pass (added last month in
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=792f97b44ffc5e6a967292b3747fd835e99396e7)
might be useful for this transformation. However, when I tried it out with my
testcase at https://godbolt.org/z/o8oPzo7qY, I found it unable to handle these
complex post-split1 patterns for broadcast and vfmacc:

(insn 129 128 130 3 (set (reg:RVVM4SF 168 [ _61 ])
        (if_then_else:RVVM4SF (unspec:RVVMF8BI [
                    (const_vector:RVVMF8BI [
                            (const_int 1 [0x1]) repeated x16
                        ])
                    (const_int 16 [0x10])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (vec_duplicate:RVVM4SF (mem:SF (reg:SI 143 [ ivtmp.21 ]) [1 MEM[(float *)_145]+0 S4 A32]))
            (unspec:RVVM4SF [
                    (reg:SI 0 zero)
                ] UNSPEC_VUNDEF))) "/app/example.c":19:53 4019 {*pred_broadcastrvvm4sf_zvfh}
     (nil))
[ ... ]
(insn 131 130 34 3 (set (reg:RVVM4SF 139 [ D__lsm.10 ])
        (if_then_else:RVVM4SF (unspec:RVVMF8BI [
                    (const_vector:RVVMF8BI [
                            (const_int 1 [0x1]) repeated x16
                        ])
                    (const_int 16 [0x10])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (const_int 7 [0x7])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                    (reg:SI 69 frm)
                ] UNSPEC_VPREDICATE)
            (plus:RVVM4SF (mult:RVVM4SF (reg/v:RVVM4SF 135 [ row ])
                    (reg:RVVM4SF 168 [ _61 ]))
                (reg:RVVM4SF 139 [ D__lsm.10 ]))
            (unspec:RVVM4SF [
                    (reg:SI 0 zero)
                ] UNSPEC_VUNDEF))) "/app/example.c":19:36 15007 {*pred_mul_addrvvm4sf_undef}
     (nil))

I'm no expert on this, but what's stopping us from adding some vector-scalar
split patterns alongside the vector-vector ones in autovec.md to fix this? For
instance, adding an fma<mode>4_scalar define_insn_and_split like this:

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index d5793ac..bf54d71 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1229,2 +1229,22 @@

+(define_insn_and_split "fma<mode>4_scalar"
+  [(set (match_operand:V_VLSF 0 "register_operand")
+        (plus:V_VLSF
+         (mult:V_VLSF
+           (vec_duplicate:V_VLSF (match_operand:SF 1 "direct_broadcast_operand"))
+           (match_operand:V_VLSF 2 "register_operand"))
+         (match_operand:V_VLSF 3 "register_operand")))]
+  "TARGET_VECTOR && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+    rtx ops[] = {operands[0], operands[1], operands[2], operands[3],
+                 operands[0]};
+    riscv_vector::emit_vlmax_insn (code_for_pred_mul_scalar (PLUS, <MODE>mode),
+                                  riscv_vector::TERNARY_OP_FRM_DYN, ops);
+    DONE;
+  }
+  [(set_attr "type" "vector")])
+
 ;; -------------------------------------------------------------------------

does lead to vfmacc.vf instructions being emitted instead of vfmacc.vv's for the
testcase linked above.

What do you think about this approach to implementing the optimization? Am I
missing anything important? Maybe split1 is too early to determine the final
instruction format (.vf vs .vv), and we should instead aim to recombine during
late-combine2?

Also, is there anyone working on this optimization at the present moment?

Many thanks in advance,
Artemiy
