> 32872 spends 2 scalar instructions + 1 scalar_to_vec cost:
> 
> li      a4,-32768
> addiw   a4,a4,104
> vmv.v.x v16,a4
> 
> It seems reasonable but only can fix test with -march=rv64gcv_zvl256b but 
> failed on -march=rv64gcv_zvl4096b.

The scalar version also needs both instructions:

        li      a0,32768
        addiw   a0,a0,104

Therefore I don't think we should just add them to the
vectorization costs.  That would only be necessary if we needed
to synthesize a different constant (e.g. if a scalar constant
cannot be used directly in a vector setting).
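
To double-check that both sequences build the same constant, here is
a small arithmetic sketch in Python (not GCC code).  It assumes only
the low 16 bits matter for the vector case, since the operand is an
unsigned short; that is my reading of the dump, not something the
sequences themselves state.

```python
TARGET = 32872

# Scalar sequence: li a0,32768 ; addiw a0,a0,104
scalar_value = 32768 + 104

# Vector-side sequence: li a4,-32768 ; addiw a4,a4,104
vector_value = -32768 + 104

# Assumption: the element type is 16 bits wide, so only the low
# 16 bits of a4 end up in each vector element.
low16 = vector_value & 0xFFFF

print(scalar_value, vector_value, low16)
```

Both come out as 32872 once truncated to 16 bits, so the vector side
does not need a different constant here.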

Currently, scalar_outside_cost = 0 so we don't model it on the
scalar side either.

With scalar_to_vec = 2 we first try RVVMF2QI, vf = 8 at zvl256b:

a.4_25 = PHI <1(2), _4(11)> 1 times vector_stmt costs 1 in body
a.4_25 = PHI <1(2), _4(11)> 2 times scalar_to_vec costs 4 in prologue
(unsigned short) a.4_25 1 times vector_stmt costs 1 in body
MIN_EXPR <patt_28, 15> 1 times scalar_to_vec costs 2 in prologue
MIN_EXPR <patt_28, 15> 1 times vector_stmt costs 1 in body
32872 >> patt_26 1 times scalar_to_vec costs 2 in prologue
32872 >> patt_26 1 times vector_stmt costs 1 in body
<unknown> 1 times scalar_stmt costs 1 in prologue
<unknown> 1 times scalar_stmt costs 1 in body

  Vector inside of loop cost: 5
  Scalar iteration cost: 1 (shouldn't that be 2? but anyway)

So one vector iteration costs 5 right now regardless of
scalar_to_vec because there are 5 vector operations (phi,
promote, min, shift, vsetvl/len adjustment).
The scalar_to_vec costs are added to the prologue because it is
assumed that broadcasts are hoisted out of the loop.

Then vectorization is supposed to be profitable if
#iterations = 18 > (body_cost * min_iters)
   + vector_outside_cost - scalar_outside_cost + 1 = 15.
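
As a sanity check, the numbers from the dump can be plugged into that
inequality.  This is a minimal Python sketch of the simplified check
as written above, not the actual vectorizer code.  The variable names
are mine, and min_iters = 1 is an assumption inferred from the stated
result of 15.

```python
# Costs read off the dump (scalar_to_vec = 2, RVVMF2QI, zvl256b).
body_cost = 5                    # "Vector inside of loop cost: 5"
prologue_costs = [4, 2, 2, 1]    # three scalar_to_vec lines + one scalar_stmt
vector_outside_cost = sum(prologue_costs)
scalar_outside_cost = 0
min_iters = 1                    # assumption: inferred, not shown in the dump
niters = 18

threshold = (body_cost * min_iters
             + vector_outside_cost - scalar_outside_cost + 1)
profitable = niters > threshold

print(threshold, profitable)
```

The prologue sums to 9, which gives the threshold of 15, and 18 > 15
makes the loop look profitable.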

If we really don't want to vectorize, then we can either
further increase the prologue cost or the body itself.  The
body statements are all vector_stmts, though.  For the
prologue we need a good argument for increasing scalar_to_vec
beyond, say, 2.

> Is it reasonable ? IMHO, scalar move (vmv.v.x or vfmv.v.f) should be
> more costly than normal vadd.vv since it is transferring data between
> different pipeline/register class.

We want it to be more expensive, yes.  In one of the last
messages I explained how I would model it using either
register_move_cost or using (tune-specific) costs directly.
I would start with scalar_to_vec = 1 and add 1 or 2 depending
on the uarch/tune-specific reg-move costs.
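
To see how sensitive this test case is to that choice, here is a
rough Python sketch that reruns the simplified profitability check
for scalar_to_vec = 1, 2 and 3.  It is not GCC code: the function
name is hypothetical, and it assumes the body cost stays at 5, that
min_iters is 1, and that the same four scalar_to_vec operations plus
one scalar_stmt land in the prologue for every setting.

```python
def vect_threshold(scalar_to_vec, body_cost=5, min_iters=1,
                   scalar_outside_cost=0):
    """Hypothetical helper: iteration count the loop must exceed."""
    # The dump lists 2 + 1 + 1 = 4 scalar_to_vec operations and
    # one scalar_stmt (cost 1) in the prologue.
    vector_outside_cost = 4 * scalar_to_vec + 1
    return (body_cost * min_iters
            + vector_outside_cost - scalar_outside_cost + 1)

NITERS = 18

results = {}
for cost in (1, 2, 3):
    thr = vect_threshold(cost)
    results[cost] = (thr, NITERS > thr)
    print(f"scalar_to_vec={cost}: threshold={thr}, "
          f"vectorize={NITERS > thr}")
```

Under these assumptions the thresholds are 11, 15 and 19, so this
particular loop would only stop being vectorized at scalar_to_vec = 3.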

Regards
 Robin
