On 7/24/24 11:25 AM, Artemiy Volkov wrote:
Hi Juzhe, Demin, Jeff,

This email is intended to continue the discussion started in
https://marc.info/?l=gcc-patches&m=170927452922009&w=2 about combining vfmv.v.f
and vfmxx.vv instructions into the scalar-vector form vfmxx.vf.

There was a mention on that thread of the potential usefulness of the late-combine
pass (added last month in
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=792f97b44ffc5e6a967292b3747fd835e99396e7)
in making this transformation. However, when I tried it out with my testcase at
https://godbolt.org/z/o8oPzo7qY, I found it unable to handle these complex
post-split1 patterns for broadcast and vfmacc:

(insn 129 128 130 3 (set (reg:RVVM4SF 168 [ _61 ])
         (if_then_else:RVVM4SF (unspec:RVVMF8BI [
                     (const_vector:RVVMF8BI [
                             (const_int 1 [0x1]) repeated x16
                         ])
                     (const_int 16 [0x10])
                     (const_int 2 [0x2]) repeated x2
                     (const_int 0 [0])
                     (reg:SI 66 vl)
                     (reg:SI 67 vtype)
                 ] UNSPEC_VPREDICATE)
             (vec_duplicate:RVVM4SF (mem:SF (reg:SI 143 [ ivtmp.21 ]) [1 MEM[(float *)_145]+0 S4 A32]))
             (unspec:RVVM4SF [
                     (reg:SI 0 zero)
                 ] UNSPEC_VUNDEF))) "/app/example.c":19:53 4019 {*pred_broadcastrvvm4sf_zvfh}
      (nil))
[ ... ]
(insn 131 130 34 3 (set (reg:RVVM4SF 139 [ D__lsm.10 ])
         (if_then_else:RVVM4SF (unspec:RVVMF8BI [
                     (const_vector:RVVMF8BI [
                             (const_int 1 [0x1]) repeated x16
                         ])
                     (const_int 16 [0x10])
                     (const_int 2 [0x2]) repeated x2
                     (const_int 0 [0])
                     (const_int 7 [0x7])
                     (reg:SI 66 vl)
                     (reg:SI 67 vtype)
                     (reg:SI 69 frm)
                 ] UNSPEC_VPREDICATE)
             (plus:RVVM4SF (mult:RVVM4SF (reg/v:RVVM4SF 135 [ row ])
                     (reg:RVVM4SF 168 [ _61 ]))
                 (reg:RVVM4SF 139 [ D__lsm.10 ]))
             (unspec:RVVM4SF [
                     (reg:SI 0 zero)
                 ] UNSPEC_VUNDEF))) "/app/example.c":19:36 15007 {*pred_mul_addrvvm4sf_undef}
      (nil))

I'm no expert on this, but what's stopping us from adding some vector-scalar
split patterns alongside vector-vector ones in autovec.md to fix this? For
instance, the addition of fma<mode>4_scalar insn_and_split like this:

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index d5793ac..bf54d71 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1229,2 +1229,22 @@

+(define_insn_and_split "fma<mode>4_scalar"
+  [(set (match_operand:V_VLSF 0 "register_operand")
+        (plus:V_VLSF
+         (mult:V_VLSF
+           (vec_duplicate:V_VLSF (match_operand:SF 1 "direct_broadcast_operand"))
+           (match_operand:V_VLSF 2 "register_operand"))
+         (match_operand:V_VLSF 3 "register_operand")))]
+  "TARGET_VECTOR && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+    rtx ops[] = {operands[0], operands[1], operands[2], operands[3],
+                 operands[0]};
+    riscv_vector::emit_vlmax_insn (code_for_pred_mul_scalar (PLUS, <MODE>mode),
+                                  riscv_vector::TERNARY_OP_FRM_DYN, ops);
+    DONE;
+  }
+  [(set_attr "type" "vector")])
+
  ;; -------------------------------------------------------------------------

does lead to vfmacc.vf instructions being emitted instead of vfmacc.vv's for the
testcase linked above.
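
For readers who cannot open the Godbolt link, the following is a minimal sketch of the kind of loop under discussion. It is not the linked testcase: the function name, signature, and compile flags are assumptions made for illustration. The loop multiplies a vector by a loop-invariant scalar and accumulates, which is the shape that can be vectorized either as a broadcast (vfmv.v.f) followed by vfmacc.vv or as a single vfmacc.vf. Which form a given GCC revision emits is not guaranteed.

```c
#include <stddef.h>

/* Hypothetical kernel (not the testcase from the Godbolt link):
   acc[i] += s * row[i], with s invariant across the loop.
   Assumed compile line for inspecting the output, e.g.:
     riscv64-unknown-elf-gcc -O3 -march=rv64gcv -mabi=lp64d -S axpy.c  */
void
axpy_f32 (float *restrict acc, const float *restrict row, float s, size_t n)
{
  for (size_t i = 0; i < n; i++)
    acc[i] += s * row[i];   /* scalar * vector + vector accumulate */
}
```

Adding -fdump-rtl-all to the compile line writes per-pass RTL dumps, which is one way to compare the insns before and after combine, late-combine, and register allocation.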

What do you think about this approach to implement this optimization? Am I
missing anything important? Maybe split1 is too early to determine the final
instruction format (.vf vs .vv) and we should strive to recombine during
late-combine2?

Also, is there anyone working on this optimization at the present moment?

Before jumping straight to a new combiner pattern (especially a define_insn_and_split), I would want to have a clearer understanding of the code before/after instruction combination as well as before/after register allocation and reloading.

When I was looking at the results of late-combine it did seem to fairly consistently work well for generating .vx and .vf forms rather than .vv, so odds are something simple is missing somewhere.



I'm not aware of anyone working on this, except perhaps Demin. So there's ample room to work without stepping on anyone's toes.

jeff

