On 6/28/23 16:00, 钟居哲 wrote:
> You can see here:
>
> https://godbolt.org/z/d78646hWb

So just to be explicit, I see no difference with that test before/after your proposed change. Nor would I expect one based on my understanding of the patch.

The explicit conversions I see are because we need the output of the conversion in multiple vfmul instructions. That won't be helped by the patch you've proposed.

To be more concrete:

        vsetvli t1,t5,e32,mf2,ta,ma     # 99    [c=0 l=4]  vsetvldi
        vle32.v v2,0(a4)        # 23    [c=4 l=4]  pred_movvnx2sf/1
        vle32.v v1,0(a5)        # 25    [c=4 l=4]  pred_movvnx2sf/1
        vsetvli t0,zero,e32,mf2,ta,ma   # 101   [c=0 l=4]  vsetvldi
        vfwcvt.f.f.v    v3,v2   # 77    [c=4 l=4]  pred_extendvnx2df/0
        vfwcvt.f.f.v    v2,v1   # 79    [c=4 l=4]  pred_extendvnx2df/0
        vsetvli zero,t1,e32,mf2,ta,ma   # 102   [c=0 l=4]  vsetvl_discard_resultdi
        vle32.v v5,0(a6)        # 31    [c=4 l=4]  pred_movvnx2sf/1
        vle32.v v4,0(a7)        # 39    [c=4 l=4]  pred_movvnx2sf/1
        vsetvli t0,zero,e32,mf2,ta,ma   # 103   [c=0 l=4]  vsetvldi
        vfwcvt.f.f.v    v1,v5   # 81    [c=4 l=4]  pred_extendvnx2df/0
        vsetvli zero,zero,e64,m1,ta,ma  # 104   [c=16 l=4]  vsetvl_vtype_change_only
        vfmul.vv        v5,v2,v3        # 29    [c=4 l=4]  pred_mulvnx2df/2
        vfmul.vv        v2,v1,v2        # 34    [c=4 l=4]  pred_mulvnx2df/2
        vsetvli zero,t1,e64,m1,ta,ma    # 105   [c=0 l=4]  vsetvl_discard_resultdi
        vse64.v v2,0(a1)        # 35    [c=4 l=4]  pred_storevnx2df
        vse64.v v5,0(a0)        # 30    [c=4 l=4]  pred_storevnx2df
        vsetvli t6,zero,e64,m1,ta,ma    # 106   [c=0 l=4]  vsetvldi
        vfmul.vv        v1,v1,v3        # 37    [c=4 l=4]  pred_mulvnx2df/2
        vsetvli zero,zero,e32,mf2,ta,ma # 107   [c=20 l=4]  vsetvl_vtype_change_only
        vfwcvt.f.f.v    v2,v4   # 83    [c=4 l=4]  pred_extendvnx2df/0
        vsetvli zero,t1,e64,m1,ta,ma    # 108   [c=0 l=4]  vsetvl_discard_resultdi
        vse64.v v1,0(a2)        # 38    [c=4 l=4]  pred_storevnx2df
        vsetvli t6,zero,e64,m1,ta,ma    # 109   [c=0 l=4]  vsetvldi
        slli    t4,t1,2 # 22    [c=4 l=4]  ashldi3
        slli    t3,t1,3 # 27    [c=4 l=4]  ashldi3
        vfmul.vv        v1,v2,v3        # 42    [c=4 l=4]  pred_mulvnx2df/2


Note how the output of the explicit conversion done in insn 77 is used by the vfmul in insns 29, 37 and 42. Similarly for the other explicit conversions.

Your pattern isn't going to help that problem.
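
To make the sharing concrete, here is a hypothetical C kernel with the same shape as the dump above. It is not the source behind the godbolt link; the function name, parameter names and output ordering are assumptions chosen only so that each widened value feeds more than one multiply, as insn 77 does.

```c
#include <stddef.h>

/* Hypothetical kernel, NOT the source behind the godbolt link.  It only
   reproduces the sharing pattern: a float input is widened once and the
   widened value then feeds more than one double multiply.  */
void
widen_mul_shared (double *restrict out0, double *restrict out1,
                  double *restrict out2, double *restrict out3,
                  const float *restrict a, const float *restrict b,
                  const float *restrict c, const float *restrict d,
                  size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      double wa = a[i];   /* float_extend, read by three multiplies */
      double wb = b[i];   /* float_extend, read by two multiplies */
      double wc = c[i];   /* float_extend, read by two multiplies */
      double wd = d[i];   /* float_extend, read by one multiply */

      out0[i] = wa * wb;
      out1[i] = wb * wc;
      out2[i] = wa * wc;
      out3[i] = wa * wd;
    }
}
```

Because wa, wb and wc each have several readers, a pattern that folds an extend into a single multiply cannot by itself make the explicit conversions go away.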

You could model this as a dependency height reduction. I think that will get you where you want to go.

You'll need a pattern that matches this:

(parallel [ (set (reg:VNx2DF 160 [ vect__11.15 ])
            (if_then_else:VNx2DF (unspec:VNx2BI [
                        (const_vector:VNx2BI repeat [
                                (const_int 1 [0x1])
                            ])
                        (reg:DI 169)
                        (const_int 2 [0x2]) repeated x2
                        (const_int 1 [0x1])
                        (const_int 7 [0x7])
                        (reg:SI 66 vl)
                        (reg:SI 67 vtype)
                        (reg:SI 69 frm)
                    ] UNSPEC_VPREDICATE)
                (mult:VNx2DF (float_extend:VNx2DF (reg:VNx2SF 144 [ vect__7.13 ]))
                    (float_extend:VNx2DF (reg:VNx2SF 146 [ vect__4.9 ])))
                (unspec:VNx2DF [
                        (reg:SI 0 zero)
                    ] UNSPEC_VUNDEF)))
        (set (reg:VNx2DF 143 [ vect__8.14 ])
            (float_extend:VNx2DF (reg:VNx2SF 144 [ vect__7.13 ])))
        (set (reg:VNx2DF 145 [ vect__5.10 ])
            (float_extend:VNx2DF (reg:VNx2SF 146 [ vect__4.9 ])))
    ])

It'll need to be a define_insn_and_split, as it's a 3->3 splitter. The split will emit the two extensions and the widening multiply as three distinct insns.

This has two positive effects. First, the widening multiply is no longer data dependent on the float_extend insns, so it can issue whenever r144 and r146 are ready rather than when r143 and r145 are ready.
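
A toy model of that first effect, as a sketch only: the latencies below are made-up illustrative numbers rather than values from any real pipeline description, and the widening multiply is assumed to cost the same as a plain multiply. The function names are hypothetical.

```c
/* Toy issue-time model.  Both latencies are illustrative assumptions,
   not taken from any real RISC-V scheduling model.  */
enum { EXT_LATENCY = 4, MUL_LATENCY = 4 };

static int
max_int (int x, int y)
{
  return x > y ? x : y;
}

/* Before: the multiply reads r143/r145, so it must wait for both
   float_extend results.  */
int
mul_done_via_extends (int r144_ready, int r146_ready)
{
  int r143_ready = r144_ready + EXT_LATENCY;
  int r145_ready = r146_ready + EXT_LATENCY;
  return max_int (r143_ready, r145_ready) + MUL_LATENCY;
}

/* After: the widening multiply reads r144/r146 directly, so the extends
   are off its dependency chain.  */
int
mul_done_widening (int r144_ready, int r146_ready)
{
  return max_int (r144_ready, r146_ready) + MUL_LATENCY;
}
```

Under these assumptions the multiply result is available one extend latency earlier, which is the height reduction in question.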

The second effect is that I think this pattern will end up matching all the multiplies in this sample code. As a result, all the float_extend insns you generated when splitting become dead and should be removed by DCE.
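
The dead-code argument can be sketched with a toy use check. This is an illustration under simplifying assumptions, not GCC's DCE or its data structures: struct toy_insn and both functions are hypothetical, and an extend counts as dead when no insn in the list reads its destination.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy insn: one destination register and two source registers.
   These are not GCC's real data structures.  */
struct toy_insn
{
  bool is_extend;
  int dest;
  int src[2];   /* -1 marks an unused source slot */
};

/* Return true if any insn in INSNS reads register REG.  */
static bool
reg_is_used (const struct toy_insn *insns, size_t n, int reg)
{
  for (size_t i = 0; i < n; i++)
    if (insns[i].src[0] == reg || insns[i].src[1] == reg)
      return true;
  return false;
}

/* Count extend insns whose result nobody reads.  */
size_t
count_dead_extends (const struct toy_insn *insns, size_t n)
{
  size_t dead = 0;
  for (size_t i = 0; i < n; i++)
    if (insns[i].is_extend && !reg_is_used (insns, n, insns[i].dest))
      dead++;
  return dead;
}
```

With a multiply that reads r143 and r145, neither extend is dead. Once every multiply has been rewritten to read r144 and r146 directly, both extends have no remaining readers.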


Jeff
