https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072

--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to rsand...@gcc.gnu.org from comment #1)
> (In reply to Tamar Christina from comment #0)
> > The SLP costs went from:
> > 
> >   Vector cost: 2
> >   Scalar cost: 4
> > 
> > to:
> > 
> >   Vector cost: 12
> >   Scalar cost: 4
> > 
> > it looks like it's no longer costing it as a duplicate but instead 4 vec
> > inserts.
> We do cost it as a duplicate, but we only try to vectorize up to
> the stores, rather than up to the load back.  So we're costing
> the difference between:
> 
>         fmov    s1, s0
>         stp     s1, s1, [x0]
>         stp     s1, s1, [x0, 8]
> 
> (no idea why we have an fmov, pretend we don't) and:
> 
>         fmov    s1, s0
>         dup     v1.4s, v1.s[0]
>         str     q1, [x0]
> 
> If we want the latter as a general principle, the PR is
> easy to fix.  But if we don't, we'd need to make the
> vectoriser start at the load or (alternatively) fold
> to a constructor independently of vectorisation.

I thought the SLP algorithm was bottom up and stores were
already sinks?  So is this maybe a bug?

Ah, guess there are two problems.

1. How did we end up with such poor scalar code?  At least 5 instructions are
unneeded (separate issue).
2. The costing of the above.  I guess I'm still slightly confused about how we
got to that cost.

If it's costing purely on latency then the two are equivalent, no?  If you take
throughput into account the first would win, but the difference in costs is
still a lot higher than I would have expected.
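
To make the numbers concrete, here is a rough sketch of the comparison being
discussed.  It is a simplified model rather than GCC's actual code: the struct
and function names are hypothetical, and it assumes the decision reduces to
"vectorise only if the total vector cost is lower than the scalar cost".

```c
#include <stdbool.h>

/* Hypothetical, simplified model of an SLP cost summary.
   This is NOT a GCC data structure; it only mirrors the two
   numbers printed in the dump ("Vector cost" / "Scalar cost").  */
struct slp_cost
{
  unsigned vector_cost;
  unsigned scalar_cost;
};

/* Assumed rule: vectorising pays off only when the vector
   sequence is strictly cheaper than the scalar one.  */
bool
slp_profitable (struct slp_cost c)
{
  return c.vector_cost < c.scalar_cost;
}
```

Under that assumption, the old costs (vector 2, scalar 4) come out as
profitable and the new ones (vector 12, scalar 4) do not, which matches the
change in behaviour reported in comment #0.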

In this case:

node 0x4f45480 1 times scalar_to_vec costs 4 in prologue

seems quite high, but I guess it doesn't know that there's no regfile transfer?
