https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111317

--- Comment #1 from Robin Dapp <rdapp at gcc dot gnu.org> ---
I think the default cost model is not too bad for these simple cases.  Our
emitted instructions match gimple pretty well.

The thing we don't model is vsetvl.  We could ignore it under the assumption
that it is going to be rather cheap on most uarchs.

Something that needs to be fixed is the general costing used for
length-masking:

            /* Each may need two MINs and one MINUS to update lengths in body
               for next iteration.  */
            if (need_iterate_p)
              body_stmts += 3 * num_vectors;

We don't actually need the MINs with vsetvl (vsetvl itself computes them), so
this count would need to be adjusted down, provided vsetvl is cheap.
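
To illustrate why the MINs come for free, here is a rough C model of the
length-controlled loop.  It is only a sketch: model_vsetvl and widen_copy are
made-up names, vlmax stands in for the hardware vector length, and real vsetvl
is allowed to return less than MIN (remaining, vlmax) in some cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for vsetvl: the number of elements to process in
   this iteration, modelled here as MIN (remaining, vlmax).  */
static size_t
model_vsetvl (size_t remaining, size_t vlmax)
{
  return remaining < vlmax ? remaining : vlmax;
}

/* Sign-extend n int32 elements into int64, at most vlmax per iteration.  */
static void
widen_copy (const int32_t *src, int64_t *dst, size_t n, size_t vlmax)
{
  assert (vlmax > 0);
  while (n > 0)
    {
      /* The only MIN in the loop, and it is the vsetvl itself.  */
      size_t vl = model_vsetvl (n, vlmax);

      /* Stands in for vle32.v / vsext.vf2 / vse64.v.  */
      for (size_t i = 0; i < vl; i++)
        dst[i] = src[i];

      src += vl;
      dst += vl;
      n -= vl;  /* The one MINUS that updates the remaining length.  */
    }
}
```

In this model the per-iteration length update is one vsetvl plus one MINUS,
rather than the two MINs and one MINUS that the generic costing assumes.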

This is the scalar baseline:
.L3:
        lw      a5,0(a0)
        sd      a5,0(a1)
        addi    a0,a0,4
        addi    a1,a1,8
        bne     a4,a0,.L3


While this is what zvl128b would emit:
.L3:
        vsetvli a5,a2,e8,mf8,ta,ma
        vle32.v v2,0(a0)
        vsetvli a4,zero,e64,m1,ta,ma
        vsext.vf2       v1,v2
        vsetvli zero,a2,e64,m1,ta,ma
        vse64.v v1,0(a1)
        slli    a4,a5,2
        add     a0,a0,a4
        slli    a4,a5,3
        add     a1,a1,a4
        sub     a2,a2,a5
        bne     a2,zero,.L3

With a vectorization factor of 2 (it might effectively be higher, of course,
but is possibly unknown at compile time), I'm not sure vectorization is always
a win, and the costs actually reflect that.  If we disregard vsetvl and the
loop branch for now, we have 8 instructions in the vectorized loop and 2 * 4
instructions in the scalar loop for the same amount of data.  Factoring in the
vsetvls, I'd say the vectorized loop is worse.  Once we statically know the VF
is higher, we will vectorize.
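
The arithmetic can be written down as a small C sketch.  This is a raw
instruction count read off the two listings (loop branch excluded), not the
vectorizer's actual cost model, and the names are made up for illustration.

```c
/* Per-iteration counts taken from the two loops, without the branch.  */
enum
{
  SCALAR_INSNS = 4,  /* lw, sd, addi, addi */
  VECTOR_INSNS = 8,  /* vle32, vsext, vse64, slli, add, slli, add, sub */
  VSETVL_INSNS = 3   /* the three vsetvli */
};

/* Instructions the scalar loop needs to process vf elements.  */
static int
scalar_insns_for (int vf)
{
  return SCALAR_INSNS * vf;
}

/* Instructions in one vector iteration, optionally counting vsetvl.  */
static int
vector_insns_for (int count_vsetvl)
{
  return VECTOR_INSNS + (count_vsetvl ? VSETVL_INSNS : 0);
}

/* Nonzero if one vector iteration needs strictly fewer instructions than
   the scalar loop does for the same vf elements.  */
static int
vector_wins (int vf, int count_vsetvl)
{
  return vector_insns_for (count_vsetvl) < scalar_insns_for (vf);
}
```

With VF = 2 this gives 8 vs. 8 without vsetvl and 11 vs. 8 with it, so the
vector loop does not win; with VF = 4 it is 11 vs. 16 even when the vsetvls
are counted.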
