On Tue, May 11, 2021 at 12:50 PM Kewen.Lin <li...@linux.ibm.com> wrote:
>
> Hi Richi,
>
> >>>>> OTOH we already pass scalar_stmt to individual add_stmt_cost,
> >>>>> so I'm not sure whether the context really matters.  That said,
> >>>>> the density test looks "interesting" ... the intent was that finish_cost
> >>>>> might look at gathered data from add_stmt, not that it looks at
> >>>>> the GIMPLE IL ... so why are you not counting vector_stmt vs.
> >>>>> scalar_stmt entries in vect_body and using that for this metric?
> >>>>>
> >>>>
> >>>> Good to know the intention behind finish_cost, thanks!
> >>>>
> >>>> I'm afraid that checking the vector_stmt and scalar_stmt entries
> >>>> from add_stmt_cost doesn't work for the density test here.  The
> >>>> density test focuses on the vector version itself: some stmts
> >>>> have their relevance marked as vect_unused_in_scope, and IIUC
> >>>> they aren't passed down when costing either version.  But the
> >>>> existing density check wants to know the cost of the
> >>>> non-vectorized part.  The current implementation does:
> >>>>
> >>>>   vec_cost = data->cost[vect_body];
> >>>>
> >>>>   if (!STMT_VINFO_RELEVANT_P (stmt_info)
> >>>>       && !STMT_VINFO_IN_PATTERN_P (stmt_info))
> >>>>     not_vec_cost++;
> >>>>
> >>>>   density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
> >>>>
> >>>> It takes those irrelevant stmts into account; with costs for
> >>>> both the non-vectorized part (not_vec_cost) and the
> >>>> vectorized part (cost[vect_body]), it can calculate the
> >>>> vectorized code density ratio.
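
For reference, the existing density hook is roughly the sketch below
(reconstructed from memory of rs6000.c, so names and the exact
threshold constants may differ); it re-walks the loop body at
finish_cost time:

  static void
  rs6000_density_test (rs6000_cost_data *data)
  {
    /* Illustrative constants; see rs6000.c for the real values.  */
    const int DENSITY_PCT_THRESHOLD = 85;
    const int DENSITY_SIZE_THRESHOLD = 70;
    const int DENSITY_PENALTY = 10;

    struct loop *loop = data->loop_info;
    basic_block *bbs = get_loop_body (loop);
    loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
    int vec_cost = data->cost[vect_body], not_vec_cost = 0;

    for (unsigned i = 0; i < loop->num_nodes; i++)
      for (gimple_stmt_iterator gsi = gsi_start_bb (bbs[i]);
           !gsi_end_p (gsi); gsi_next (&gsi))
        {
          gimple *stmt = gsi_stmt (gsi);
          stmt_vec_info stmt_info = loop_vinfo->lookup_stmt (stmt);

          /* Count stmts the vectorizer never costed: IV updates,
             the exit test and the like.  */
          if (!STMT_VINFO_RELEVANT_P (stmt_info)
              && !STMT_VINFO_IN_PATTERN_P (stmt_info))
            not_vec_cost++;
        }
    free (bbs);

    int density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
    if (density_pct > DENSITY_PCT_THRESHOLD
        && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
      /* Dense vector body: penalize its cost.  */
      data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
  }
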
> >>>
> >>> Yes, but then which "relevant" stmts are actually needed and which
> >>> are not is missed by your heuristic.  It's really somewhat GIGO,
> >>> I fear - each vectorized data reference will add a pointer IV
> >>> (eventually commoned by IVOPTs later) and pointer value updates
> >>> that are not accounted for in costing (the IVs and updates in the
> >>> scalar code are marked as not relevant).  Are those the stmts
> >>> this heuristic wants to look at?
> >>
> >> Yes, the IVs and updates (even the comparison for the exit) are what
> >> the heuristic tries to count.  In most cases, the non-vectorized
> >> part of the loop consists of IV updates.  It's true that the
> >> collected not_vec_cost may not be accurate, but it seems hard
> >> to predict the cost exactly here?
> >>
> >> Assuming this not_vec_cost is overpriced, it could result
> >> in a lower density ratio than it should be.  Also assuming
> >> the density threshold is relatively conservative, if the ratio
> >> still exceeds the density threshold in this case, we can say the
> >> loop is really dense.  It could fail to catch some "dense" loops,
> >> but I hope it won't treat "non-dense" loops as "dense" unexpectedly.
> >
> > So we could in principle include IVs and updates in the costing, but
> > the vectorizer isn't absolutely careful about doing scalar cleanups
> > and instead expects IVOPTs to create canonical IVs.  Note that for
> > the scalar part those stmts are not costed either; we'd have to
> > change that as well.  What this would mean is that for a scalar
> > loop accessing a[i] and b[i] we'd have one original IV + update,
> > while the vectorizer generates two pointer IVs + updates.
> >
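To illustrate that with a concrete (made-up) loop:

  /* Scalar loop: a single IV 'i' plus its update and exit test.  */
  for (int i = 0; i < n; ++i)
    sum += a[i] * b[i];

The vectorized body instead carries two pointer IVs (one per data
reference; the vectorizer names them something like vectp_a and
vectp_b), each with its own update per iteration - and neither
version's IV stmts are passed to the costing hooks today.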
>
>
> I've broken down my understanding below to make sure it's correct.
>
>   - We can pass those "irrelevant" stmts down into add_stmt_cost
>     for both the scalar and vector versions; then targets can check
>     the stmts accordingly instead of scanning the IL by themselves.
>     For the scalar version, these are mainly the original IV + update
>     + some address ref calculation, while for the vector version
>     they are mainly pointer IVs + updates.
>
>   - What cost is assigned to these "irrelevant" stmts?
>     The comments seem to imply we want to cost them?  If so,
>     I am worried that this can break some current costing
>     heuristics which don't consider these costs.  Besides,
>     these "irrelevant" stmts can be optimized away later; if we
>     consider them somewhere like calculating the profitable min
>     iters, could that result in worse code?
>     Can we pass them down but cost them as free?
>
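Re "cost them freely": something like the hypothetical add_stmt_cost
fragment below would let a target gather the data without perturbing
any existing cost sums (the unused_stmt cost kind and the not_vec_cost
field are made up; neither exists today):

  case unused_stmt:  /* hypothetical vect_cost_for_stmt value */
    if (where == vect_body)
      data->not_vec_cost++;  /* feed the density statistic */
    stmt_cost = 0;           /* contributes nothing to the cost sums */
    break;
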
> > But in the end the vector code shouldn't end up worse than the
> > scalar code with respect to IVs - the cases where it would should
> > be already costed.  So I wonder if you have specific examples
> > where things go worse enough for the heuristic to trigger?
> >
>
> One typical case where I worked on reusing this density check is the
> function mat_times_vec in the source file block_solver.fppized.f of
> SPEC2017 503.bwaves_r; the density with the existing heuristic is 83
> (unluckily, it doesn't exceed the threshold).  The interesting loop is
> the innermost one with the option set "-O2 -mcpu=power8 -ffast-math
> -ftree-vectorize".  We have verified that this loop isn't profitable
> to vectorize at -O2 (without loop interchange).

Yeah, but isn't that because the loop only runs 5 iterations, not
because of some "density" (which suggests AGU overloading or some
such)?  Because if you modify it so that it iterates more, then,
keeping the "density" measurement constant, it suddenly becomes
profitable?

The loop does have quite a few memory streams, so optimizing
the (few) arithmetic ops by vectorizing them might not be worth
the trouble, esp. since most of the loads are "strided" (composed
from scalars) when no interchange is performed.  So it's probably
more a "density" of # memory streams vs. # arithmetic ops, with
this balance being worse in the vector case, esp. with any
non-consecutive vector loads?
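
If one wanted to measure that balance directly, a target could -
purely hypothetically, none of these counters exist today - classify
what it sees in add_stmt_cost and compare the counts in finish_cost:

  /* Hypothetical bookkeeping in a target's add_stmt_cost hook.  */
  if (where == vect_body)
    {
      if (kind == scalar_load || kind == vector_load
          || kind == unaligned_load || kind == vec_construct)
        data->nload++;
      else if (kind == scalar_store || kind == vector_store
               || kind == unaligned_store)
        data->nstore++;
      else if (kind == vector_stmt || kind == vec_to_scalar)
        data->narith++;
    }

  /* ... and in finish_cost, penalize loops whose memory streams
     dominate the (few) arithmetic ops - the ratio is made up.  */
  if (data->nload + data->nstore > 4 * data->narith)
    data->cost[vect_body] += penalty;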

The x86 add_stmt_cost has

  /* If we do elementwise loads into a vector then we are bound by
     latency and execution resources for the many scalar loads
     (AGU and load ports).  Try to account for this by scaling the
     construction cost by the number of elements involved.  */
  if ((kind == vec_construct || kind == vec_to_scalar)
      && stmt_info
      && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
          || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
      && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
    {
      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
    }

so it penalizes VMAT_ELEMENTWISE with a variable step for both loads and
stores.  The above materialized over PRs 84037, 85491 and 87561, so not
specifically for the bwaves case.  IIRC, on x86 bwaves at -O2 is slower
with vectorization as well.
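
To put a number on that scaling: for, say, an elementwise V4SF load,
TYPE_VECTOR_SUBPARTS is 4, so the vec_construct cost gets multiplied
by 5 - presumably covering the four scalar loads feeding the
construction plus the vector build itself.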

Oh, and yes - the info the vectorizer presents the target with for
vector loads/stores leaves a lot to be desired ...

Richard.

> Another function, shell, which also comes from the 503.bwaves_r source
> file shell_lam.fppized.f, does hit this threshold; the loop is the one
> starting at line 228.
>
> BR,
> Kewen
