Hi Richi,

>>>>> OTOH we already pass scalar_stmt to individual add_stmt_cost,
>>>>> so not sure whether the context really matters.  That said,
>>>>> the density test looks "interesting" ... the intent was that finish_cost
>>>>> might look at gathered data from add_stmt, not that it looks at
>>>>> the GIMPLE IL ... so why are you not counting vector_stmt vs.
>>>>> scalar_stmt entries in vect_body and using that for this metric?
>>>>>
>>>>
>>>> Good to know the intention behind finish_cost, thanks!
>>>>
>>>> I'm afraid that checking the vector_stmt and scalar_stmt entries
>>>> from add_stmt_cost doesn't work for the density test here.  The
>>>> density test focuses on the vector version itself; there are some
>>>> stmts whose relevance is marked as vect_unused_in_scope, and IIUC
>>>> those aren't passed down when costing either version.  But the
>>>> existing density check wants to know the cost of the
>>>> non-vectorized part.  The current implementation does:
>>>>
>>>>   vec_cost = data->cost[vect_body];
>>>>
>>>>   /* for each stmt_info in the loop: */
>>>>   if (!STMT_VINFO_RELEVANT_P (stmt_info)
>>>>       && !STMT_VINFO_IN_PATTERN_P (stmt_info))
>>>>     not_vec_cost++;
>>>>
>>>>   density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
>>>>
>>>> It takes those non-relevant stmts into account; having both the
>>>> cost of the non-vectorized part (not_vec_cost) and that of the
>>>> vectorized part (cost[vect_body]), it can calculate the density
>>>> ratio of the vectorized code.
>>>
>>> Yes, but then what "relevant" stmts are actually needed and what
>>> not is missed by your heuristics.  It's really some GIGO one
>>> I fear - each vectorized data reference will add a pointer IV
>>> (eventually commoned by IVOPTs later) and pointer value updates
>>> that are not accounted for in costing (the IVs and updates in the
>>> scalar code are marked as not relevant).  Are those the stmts
>>> this heuristic wants to look at?
>>
>> Yes, the IVs and updates (even the comparison for the exit) are
>> what the heuristic tries to count.  In most cases, the
>> non-vectorized part of the loop consists of IV updates.  It's true
>> that the collected not_vec_cost may not be accurate, but it seems
>> hard to predict the cost exactly here?
>>
>> Assuming this not_vec_cost is overpriced, it would result in a
>> lower density ratio than it should be.  Also assuming the density
>> threshold is relatively conservative, if the ratio still exceeds
>> the threshold then we can say the loop is really dense.  The check
>> could fail to catch some "dense" loops, but I hope it won't take
>> "non-dense" loops as "dense" unexpectedly.
> 
> So we could in principle include IVs and updates in the costing but
> then the vectorizer isn't absolutely careful for doing scalar cleanups
> and instead expects IVOPTs to create canonical IVs.  Note for
> the scalar part those stmts are not costed either, we'd have to
> change that as well.  What this would mean is that for a scalar
> loop accessing a[i] and b[i] we'd have one original IV + update
> and the vectorizer generates two pointer IVs + updates.
> 


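To double-check that I follow the a[i]/b[i] example above, the shape
I have in mind is roughly the following (simplified pseudo-C rather
than actual GIMPLE; "vec" stands for a vector type and VF for the
vectorization factor):

  /* Scalar loop: one original IV "i" with a single update per
     iteration (plus the exit comparison).  */
  for (int i = 0; i < n; i++)
    a[i] = a[i] + b[i];

  /* Vectorized loop (conceptually): each data reference gets its
     own pointer IV, so there are two pointer updates per iteration
     instead of the single i++ above (possibly commoned by IVOPTs
     later).  */
  vec *ap = (vec *) &a[0];
  vec *bp = (vec *) &b[0];
  for (int iv = 0; iv < n / VF; iv++)
    {
      *ap = *ap + *bp;
      ap++;  /* pointer IV update for a[], advances by VF elements */
      bp++;  /* pointer IV update for b[] */
    }
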
Let me break down my understanding a bit below to ensure it's correct.

  - We can pass those "non-relevant" stmts down to add_stmt_cost for
    both the scalar and vector versions, so that targets can check
    the stmts there instead of scanning the IL by themselves.
    For the scalar version, these are mainly the original IV + update
    plus some address reference calculation; for the vector version,
    they are mainly the pointer IVs + updates.
  
  - What cost should be assigned to these "non-relevant" stmts?
    Your comment seems to imply we want to cost them?  If so,
    I'm worried that it could break some existing costing
    heuristics which don't consider these costs.  Besides, these
    "non-relevant" stmts can still be optimized away later; if we
    consider them somewhere like calculating the profitable min
    iters, it could result in worse code?
    Can we pass them down but cost them as free?  (See the sketch
    right after this list.)
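
Something like the following is what I had in mind for that second
point.  It's only a sketch with made-up names (everything prefixed
example_* is hypothetical, and the real add_stmt_cost target hook
takes different arguments); it's just to show "record it for the
density statistics but cost it as free":

  unsigned
  example_add_stmt_cost (void *data, int count,
                         enum vect_cost_for_stmt kind,
                         stmt_vec_info stmt_info, int misalign,
                         enum vect_cost_model_location where)
  {
    example_cost_data *cost_data = (example_cost_data *) data;

    if (stmt_info
        && !STMT_VINFO_RELEVANT_P (stmt_info)
        && !STMT_VINFO_IN_PATTERN_P (stmt_info))
      {
        /* Record the stmt for the density statistics, but treat it
           as free so the existing costing heuristics (and things
           like the profitable min iters calculation) stay
           unaffected.  */
        if (where == vect_body)
          cost_data->not_vec_cost += count;
        return 0;
      }

    /* ... normal costing path for relevant stmts ... */
    return count * example_builtin_vectorization_cost (kind, misalign);
  }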

> But in the end the vector code shouldn't end up worse than the
> scalar code with respect to IVs - the cases where it would should
> be already costed.  So I wonder if you have specific examples
> where things go worse enough for the heuristic to trigger?
> 

One typical case where I worked on reusing this density check is the
function mat_times_vec from the src file block_solver.fppized.f of
SPEC2017 503.bwaves_r; the density with the existing heuristic is 83,
which unfortunately doesn't exceed the threshold.  The interesting
loop is the innermost one, with the option set "-O2 -mcpu=power8
-ffast-math -ftree-vectorize".  We have verified that this loop isn't
profitable to vectorize at O2 (without loop interchange).
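
Just to relate that number back to the formula: with the integer math
of the current implementation, a density of 83 corresponds to a cost
ratio of roughly 5:1, e.g. (illustrative costs only, not the actual
values collected for mat_times_vec):

  vec_cost = 20, not_vec_cost = 4
  density_pct = (20 * 100) / (20 + 4) = 83

which stays below the 85% threshold currently used in rs6000.c, so
the heuristic doesn't fire for this loop.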

Another function, shell, which also comes from the 503.bwaves_r src
file shell_lam.fppized.f, does hit this threshold; the loop is the
one starting at line 228.

BR,
Kewen
