Hi!

On Tue, Sep 28, 2021 at 04:16:04PM +0800, Kewen.Lin wrote:
> This patch follows the discussions here[1][2], where Segher
> pointed out the existing way to guard the extra penalized
> cost for strided/elementwise loads with a magic bound does
> not scale.
>
> The way with nunits * stmt_cost can get one much
> exaggerated penalized cost, such as: for V16QI on P8, it's
> 16 * 20 = 320, that's why we need one bound.  To make it
> better and more readable, the penalized cost is simplified
> as:
>
>     unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>     unsigned extra_cost = nunits * adjusted_cost;
> For V2DI/V2DF, it uses 2 penalized cost for each scalar load
> while for the other modes, it uses 1.

So for V2D[IF] we get 4, for V4S[IF] we get 4, for V8HI it's 8, and for
V16QI it is 16?  Pretty terrible as well, heh (I would expect all vector
ops to be similar cost).

> It's mainly concluded
> from the performance evaluations.  One thing might be
> related is that: More units vector gets constructed, more
> instructions are used.

Yes, but how often does that happen, compared to actual vector ops?

This also suggests we should cost vector construction separately, which
would pretty obviously be a good thing anyway (it happens often, it has
a quite different cost structure).

> It has more chances to schedule them
> better (even run in parallelly when enough available units
> at that time), so it seems reasonable not to penalize more
> for them.

Yes.

> +  /* Don't expect strided/elementwise loads for just 1 nunit.  */

"We don't expect" etc.

Okay for trunk.  Thanks!  This probably isn't the last word in this
story, but it is an improvement in any case :-)


Segher