Hi!

On Tue, Sep 28, 2021 at 04:16:04PM +0800, Kewen.Lin wrote:
> This patch follows the discussions here[1][2], where Segher
> pointed out the existing way to guard the extra penalized
> cost for strided/elementwise loads with a magic bound does
> not scale.
> 
> The previous way of using nunits * stmt_cost can produce a
> much exaggerated penalized cost; for example, for V16QI on
> P8 it is 16 * 20 = 320, which is why we needed a bound.  To
> make it better and more readable, the penalized cost is
> simplified as:
> 
>     unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>     unsigned extra_cost = nunits * adjusted_cost;

> For V2DI/V2DF, it uses 2 penalized cost for each scalar load
> while for the other modes, it uses 1.

So for V2D[IF] we get 4, for V4S[IF] we get 4, for V8HI it's 8, and
for V16QI it is 16?  Pretty terrible as well, heh (I would expect all
vector ops to be similar cost).

> It's mainly concluded from the performance evaluations.
> One related factor is that the more units a vector has, the
> more instructions are needed to construct it.

Yes, but how often does that happen, compared to actual vector ops?

This also suggests we should cost vector construction separately, which
would pretty obviously be a good thing anyway (it happens often, it has
a quite different cost structure).

> That gives more chances to schedule them better (they can
> even run in parallel when enough units are available at the
> time), so it seems reasonable not to penalize them more.

Yes.

> +       /* Don't expect strided/elementwise loads for just 1 nunit.  */

"We don't expect" etc.

Okay for trunk.  Thanks!  This probably isn't the last word in this
story, but it is an improvement in any case :-)


Segher
