https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072
--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to rsand...@gcc.gnu.org from comment #1)
> (In reply to Tamar Christina from comment #0)
> > The SLP costs went from:
> >
> > Vector cost: 2
> > Scalar cost: 4
> >
> > to:
> >
> > Vector cost: 12
> > Scalar cost: 4
> >
> > it looks like it's no longer costing it as a duplicate but instead 4 vec
> > inserts.
> We do cost it as a duplicate, but we only try to vectorize up to
> the stores, rather than up to the load back. So we're costing
> the difference between:
>
>         fmov    s1, s0
>         stp     s1, s1, [x0]
>         stp     s1, s1, [x0, 8]
>
> (no idea why we have an fmov, pretend we don't) and:
>
>         fmov    s1, s0
>         dup     v1.4s, v1.s[0]
>         str     q1, [x0]
>
> If we want the latter as a general principle, the PR is
> easy to fix. But if we don't, we'd need to make the
> vectoriser start at the load or (alternatively) fold
> to a constructor independently of vectorisation.

I thought the SLP algorithm was bottom-up and stores were already sinks? So is
this maybe a bug?

Ah, I guess there are two problems.

1. How did we end up with such poor scalar code? At least 5 instructions are
   unneeded (separate issue).
2. The costing of the above. I guess I'm still slightly confused about how we
   got to that cost. If it's costing purely on latency then the two are
   equivalent, no? If you take throughput into account the first would win,
   but the difference in costs is still a lot higher than I would have
   expected.

In this case:

node 0x4f45480 1 times scalar_to_vec costs 4 in prologue

seems quite high, but I guess it doesn't know that there's no regfile transfer?
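For readers following the two instruction sequences quoted above, here is a minimal C sketch of the kind of source that leads to that choice: four adjacent stores of the same scalar, which the SLP vectoriser can either leave as scalar stores (the `stp` pair) or turn into one duplicate plus one vector store (`dup` + `str`). This is a hypothetical reduction, not the PR's testcase; the function names `splat4_scalar`, `splat4_loop` and `splat4_self_check` are made up for illustration, and the sketch is not claimed to reproduce the exact costs reported in this PR.

```c
#include <assert.h>

/* Hypothetical reduction (not the PR's testcase): four adjacent scalar
   stores of the same value.  An SLP vectoriser may replace these with a
   single duplicate of v into a 4-lane vector and one vector store.  */
void splat4_scalar(float *dst, float v)
{
    dst[0] = v;
    dst[1] = v;
    dst[2] = v;
    dst[3] = v;
}

/* The same semantics written as a loop, for comparison.  */
void splat4_loop(float *dst, float v)
{
    for (int i = 0; i < 4; i++)
        dst[i] = v;
}

/* Checks that both forms store identical values into all four lanes.  */
void splat4_self_check(void)
{
    float a[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    float b[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    splat4_scalar(a, 1.5f);
    splat4_loop(b, 1.5f);

    for (int i = 0; i < 4; i++) {
        assert(a[i] == 1.5f);
        assert(b[i] == 1.5f);
    }
}
```

Compiling something like this for AArch64 at `-O2` with `-fdump-tree-slp-details` should show the vectoriser's "Vector cost" and "Scalar cost" lines for the store group, which is where numbers like the ones discussed in this comment come from. The actual values depend on the GCC revision and the tuning in use.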