On 6/28/23 16:00, 钟居哲 wrote:
You can see here:
https://godbolt.org/z/d78646hWb
So just to be explicit, I see no difference with that test before/after
your proposed change. Nor would I expect one based on my understanding
of the patch.
The explicit conversions I see are because we need the output of the
conversion in multiple vfmul instructions. That won't be helped by the
patch you've proposed.
To be more concrete:
vsetvli t1,t5,e32,mf2,ta,ma # 99 [c=0 l=4] vsetvldi
vle32.v v2,0(a4) # 23 [c=4 l=4] pred_movvnx2sf/1
vle32.v v1,0(a5) # 25 [c=4 l=4] pred_movvnx2sf/1
vsetvli t0,zero,e32,mf2,ta,ma # 101 [c=0 l=4] vsetvldi
vfwcvt.f.f.v v3,v2 # 77 [c=4 l=4] pred_extendvnx2df/0
vfwcvt.f.f.v v2,v1 # 79 [c=4 l=4] pred_extendvnx2df/0
vsetvli zero,t1,e32,mf2,ta,ma # 102 [c=0 l=4] vsetvl_discard_resultdi
vle32.v v5,0(a6) # 31 [c=4 l=4] pred_movvnx2sf/1
vle32.v v4,0(a7) # 39 [c=4 l=4] pred_movvnx2sf/1
vsetvli t0,zero,e32,mf2,ta,ma # 103 [c=0 l=4] vsetvldi
vfwcvt.f.f.v v1,v5 # 81 [c=4 l=4] pred_extendvnx2df/0
vsetvli zero,zero,e64,m1,ta,ma # 104 [c=16 l=4] vsetvl_vtype_change_only
vfmul.vv v5,v2,v3 # 29 [c=4 l=4] pred_mulvnx2df/2
vfmul.vv v2,v1,v2 # 34 [c=4 l=4] pred_mulvnx2df/2
vsetvli zero,t1,e64,m1,ta,ma # 105 [c=0 l=4] vsetvl_discard_resultdi
vse64.v v2,0(a1) # 35 [c=4 l=4] pred_storevnx2df
vse64.v v5,0(a0) # 30 [c=4 l=4] pred_storevnx2df
vsetvli t6,zero,e64,m1,ta,ma # 106 [c=0 l=4] vsetvldi
vfmul.vv v1,v1,v3 # 37 [c=4 l=4] pred_mulvnx2df/2
vsetvli zero,zero,e32,mf2,ta,ma # 107 [c=20 l=4] vsetvl_vtype_change_only
vfwcvt.f.f.v v2,v4 # 83 [c=4 l=4] pred_extendvnx2df/0
vsetvli zero,t1,e64,m1,ta,ma # 108 [c=0 l=4] vsetvl_discard_resultdi
vse64.v v1,0(a2) # 38 [c=4 l=4] pred_storevnx2df
vsetvli t6,zero,e64,m1,ta,ma # 109 [c=0 l=4] vsetvldi
slli t4,t1,2 # 22 [c=4 l=4] ashldi3
slli t3,t1,3 # 27 [c=4 l=4] ashldi3
vfmul.vv v1,v2,v3 # 42 [c=4 l=4] pred_mulvnx2df/2
Note how the output of the explicit conversion done in insn 77 is used
by the vfmul in insns 29, 37 and 42. Similarly for the other explicit
conversions.
Your pattern isn't going to help that problem.
You could model this as a dependency height reduction. I think that
will get you where you want to go.
You'll need a pattern that matches this:
(parallel [
(set (reg:VNx2DF 160 [ vect__11.15 ])
(if_then_else:VNx2DF (unspec:VNx2BI [
(const_vector:VNx2BI repeat [
(const_int 1 [0x1])
])
(reg:DI 169)
(const_int 2 [0x2]) repeated x2
(const_int 1 [0x1])
(const_int 7 [0x7])
(reg:SI 66 vl)
(reg:SI 67 vtype)
(reg:SI 69 frm)
] UNSPEC_VPREDICATE)
(mult:VNx2DF (float_extend:VNx2DF (reg:VNx2SF 144 [ vect__7.13 ]))
(float_extend:VNx2DF (reg:VNx2SF 146 [ vect__4.9 ])))
(unspec:VNx2DF [
(reg:SI 0 zero)
] UNSPEC_VUNDEF)))
(set (reg:VNx2DF 143 [ vect__8.14 ])
(float_extend:VNx2DF (reg:VNx2SF 144 [ vect__7.13 ])))
(set (reg:VNx2DF 145 [ vect__5.10 ])
(float_extend:VNx2DF (reg:VNx2SF 146 [ vect__4.9 ])))
])
It'll need to be a define_insn_and_split as it's a 3->3 splitter. The
split will emit the two extensions and the widening multiply as 3
distinct insns.
This has two positive effects. First, the widening multiply is no longer
data dependent on the float_extend, so it can issue whenever r144
and r146 are ready rather than when r143 and r145 are ready.
The second effect is that I think this pattern will end up matching all
the multiplies in this sample code. As a result, all the float_extend
insns you generated when splitting become dead and should be removed by
DCE.
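To make the two effects concrete, here is a toy Python model. It is only a sketch and has nothing to do with GCC's internals. The uniform latency of 4 cycles, the helper names ready_times and dead_code_eliminate, and the use of r143..r160 as plain strings are all assumptions made for illustration.

```python
# Toy model only: a uniform 4-cycle latency is an assumption, not a
# measured figure for any real RISC-V core.
LATENCY = 4

def ready_times(insns, live_in):
    """Return the cycle at which each register's value becomes available.

    insns is a list of (dest, sources) pairs in program order; an insn
    issues as soon as all of its sources are ready.
    """
    ready = {reg: 0 for reg in live_in}
    for dest, srcs in insns:
        issue = max(ready[s] for s in srcs)
        ready[dest] = issue + LATENCY
    return ready

def dead_code_eliminate(insns, live_out):
    """Drop insns whose result is never used (single backward pass)."""
    live = set(live_out)
    kept = []
    for dest, srcs in reversed(insns):
        if dest in live:
            kept.append((dest, srcs))
            live.discard(dest)
            live.update(srcs)
    kept.reverse()
    return kept

# Before: the multiply consumes the outputs of the two float_extends.
before = [
    ("r143", ["r144"]),          # float_extend
    ("r145", ["r146"]),          # float_extend
    ("r160", ["r143", "r145"]),  # multiply on the extended values
]

# After the split: the widening multiply reads r144/r146 directly.
after = [
    ("r143", ["r144"]),          # float_extend, kept by the split
    ("r145", ["r146"]),          # float_extend, kept by the split
    ("r160", ["r144", "r146"]),  # widening multiply
]

before_ready = ready_times(before, ["r144", "r146"])
after_ready = ready_times(after, ["r144", "r146"])

# If every multiply ends up matching, nothing reads r143/r145 any more.
after_dce = dead_code_eliminate(after, ["r160"])

print(before_ready["r160"], after_ready["r160"], after_dce)
```

In this model the multiply's result is ready one latency step earlier after the split, and the backward pass drops both extensions once r160 is the only live output.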
Jeff