https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072

--- Comment #3 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #2)
> I thought the SLP algorithm was bottom up and stores were
> already sinks?
Yeah, they are.  But the point is that we're vectorising
the stores in isolation, with no knowledge of what happens
later.  The reason the code here is particularly bad is
that the array is later loaded into a vector.  But the
vectoriser doesn't know that.
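
To make the scenario concrete, here is a hypothetical reduction of the
kind of pattern being described (not the testcase from this PR): scalar
stores fill an array that is immediately reloaded as a vector, but when
the vectoriser costs the store group it has no view of the later load.

  #include <arm_neon.h>

  /* Hypothetical illustration only: the four scalar stores fill tmp[],
     and tmp[] is then loaded straight back into a vector.  When the
     store group is costed in isolation, the vectoriser cannot see the
     vld1q_f32 that follows.  */
  float32x4_t
  build_vec (float a, float b, float c, float d)
  {
    float tmp[4];
    tmp[0] = a;
    tmp[1] = b;
    tmp[2] = c;
    tmp[3] = d;
    return vld1q_f32 (tmp);
  }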

> Ah, guess there are two problems.
> 
> 1. how did we end up with such poor scalar code, at least 5 instructions are
> unneeded (separate issue)
> 2. The costing of the above, I guess I'm still slightly confused how we got
> to that cost
The patch that introduced the regression uses an on-the-side costing
scheme for store sequences.  If it thinks that the scalar code is
better, it manipulates the vector body cost so that the body is twice
as expensive as the scalar body.  The prologue cost (1 for the
scalar_to_vec) is then added on top.
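
As a rough sketch of the arithmetic only (not the actual aarch64
costing hook, which differs in detail), the adjustment amounts to
something like:

  /* Hypothetical sketch of the adjustment described above.  */
  unsigned int
  adjusted_vector_cost (unsigned int scalar_body_cost,
                        unsigned int vector_body_cost,
                        unsigned int prologue_cost,
                        int scalar_seems_better)
  {
    if (scalar_seems_better)
      /* Force the vector body to look twice as expensive as the
         scalar body.  */
      vector_body_cost = 2 * scalar_body_cost;
    return vector_body_cost + prologue_cost;
  }

On that reading, the 9-versus-12 difference quoted below is exactly
the difference between a scalar_to_vec prologue cost of 1 and of 4.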

> If it's costing purely on latency then the two are equivalent, no?  If you
> take throughput into account the first would win, but the difference in
> costs is still a lot higher than I would have expected.
> 
> In this case:
> 
> node 0x4f45480 1 times scalar_to_vec costs 4 in prologue
> 
> seems quite high, but I guess it doesn't know that there's no regfile
> transfer?
Which -mcpu/-mtune are you using?  For generic it's 1 rather than 4
(so that the vector cost is 9 rather than 12, although still
higher than the scalar cost).
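
As for the regfile transfer point: the scalar_to_vec entry is charging
for materialising a scalar value in a vector register, which on AArch64
may or may not involve a general-to-SIMD move depending on where the
value lives.  A hypothetical illustration, not taken from this PR:

  #include <arm_neon.h>

  /* An integer argument arrives in a GPR, so broadcasting it needs a
     GPR-to-SIMD move (typically "dup v0.4s, w0").  A float argument
     already arrives in a SIMD/FP register, so no cross-file transfer
     is needed ("dup v0.4s, v0.s[0]").  */
  int32x4_t
  splat_int (int x)
  {
    return vdupq_n_s32 (x);
  }

  float32x4_t
  splat_float (float x)
  {
    return vdupq_n_f32 (x);
  }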
