https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #5 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #3)
> We are vectorizing the store it dst[] now at -O2 since that appears
> profitable:
> 
> t.c:10:10: note: Cost model analysis:
> r0.0_12 1 times scalar_store costs 12 in body
> r1.1_13 1 times scalar_store costs 12 in body
> r2.2_14 1 times scalar_store costs 12 in body
> r3.3_15 1 times scalar_store costs 12 in body
> r0.0_12 2 times unaligned_store (misalign -1) costs 24 in body
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> t.c:10:10: note: Cost model analysis for part in loop 0:
>   Vector cost: 32
>   Scalar cost: 48
> t.c:10:10: note: Basic block will be vectorized using SLP

That makes no sense.
4 scalar-to-vector moves + 2 vector shuffles + 2 vector stores are a lot more
costly than 4 scalar stores.
Even more so considering that the scalar stores go to adjacent addresses, so on
good CPUs they are likely to be combined in the store queue, and the cache
subsystem sees fewer operations.
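
To make the comparison concrete, here is a rough sketch of the two store
sequences. It assumes four 64-bit values; the element type, the function names
and the use of the GCC/Clang vector extension are my assumptions for
illustration, not something taken from t.c or from the dump.

```c
#include <stdint.h>
#include <string.h>

/* GCC/Clang vector extension: two 64-bit lanes in one 16-byte vector.
   The 64-bit element type is an assumption made for this sketch. */
typedef uint64_t v2u64 __attribute__((vector_size(16)));

/* Scalar form: four stores to adjacent addresses. */
static void store4_scalar(uint64_t *dst, uint64_t r0, uint64_t r1,
                          uint64_t r2, uint64_t r3)
{
    dst[0] = r0;
    dst[1] = r1;
    dst[2] = r2;
    dst[3] = r3;
}

/* SLP-like form: build two vectors from scalars (vec_construct),
   then issue two vector stores with no alignment guarantee. */
static void store4_vector(uint64_t *dst, uint64_t r0, uint64_t r1,
                          uint64_t r2, uint64_t r3)
{
    v2u64 lo = { r0, r1 };              /* scalars moved into a vector */
    v2u64 hi = { r2, r3 };
    memcpy(dst, &lo, sizeof lo);        /* unaligned vector store */
    memcpy(dst + 2, &hi, sizeof hi);    /* unaligned vector store */
}
```

Both functions leave the same bytes in dst[]; the difference is only in how
many moves, shuffles and stores the CPU has to execute to get there.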


Either your cost model is broken or there are bugs in the summation.
I'd guess that the compiler somehow thinks that moves have zero cost. But
scalar-to-vector moves certainly do not have zero cost.
Even scalar-to-scalar or vector-to-vector moves that are eliminated at the
renamer do not have zero cost, because quite often the renamer itself is the
narrowest performance bottleneck. But those moves... I don't think that they
are eliminated by the renamer.
Also, it's likely that the compiler thinks that a scalar store costs the same
as a vector store. That's also generally incorrect, esp. when you don't know
your target CPU and don't know whether the stores are aligned, as in this case.
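
For reference, the totals can be re-derived from the per-statement entries in
the quoted dump. The sketch below only redoes that arithmetic; the constant and
function names are mine. Note that 2 unaligned stores costing 24 works out to
12 per store, which is the same figure the dump gives for one scalar store.

```c
/* Costs copied from the cost model dump quoted above. */
enum {
    SCALAR_STORE_COST     = 12, /* "1 times scalar_store costs 12", 4 entries */
    UNALIGNED_STORE_TOTAL = 24, /* "2 times unaligned_store ... costs 24"     */
    VEC_CONSTRUCT_COST    = 4   /* "1 times vec_construct costs 4", 2 entries */
};

/* Sum of the four scalar_store entries. */
static int scalar_cost(void)
{
    return 4 * SCALAR_STORE_COST;
}

/* Sum of the unaligned_store entry and the two vec_construct entries. */
static int vector_cost(void)
{
    return UNALIGNED_STORE_TOTAL + 2 * VEC_CONSTRUCT_COST;
}

/* Per-store cost implied by the unaligned_store entry. */
static int unaligned_store_cost_each(void)
{
    return UNALIGNED_STORE_TOTAL / 2;
}
```

These sums match the reported "Scalar cost: 48" and "Vector cost: 32", so the
questionable part appears to be the individual costs (4 per vec_construct, and
a vector store priced like a scalar store), not the addition.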
