https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #5 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #3)
> We are vectorizing the store it dst[] now at -O2 since that appears
> profitable:
>
> t.c:10:10: note: Cost model analysis:
> r0.0_12 1 times scalar_store costs 12 in body
> r1.1_13 1 times scalar_store costs 12 in body
> r2.2_14 1 times scalar_store costs 12 in body
> r3.3_15 1 times scalar_store costs 12 in body
> r0.0_12 2 times unaligned_store (misalign -1) costs 24 in body
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> t.c:10:10: note: Cost model analysis for part in loop 0:
> Vector cost: 32
> Scalar cost: 48
> t.c:10:10: note: Basic block will be vectorized using SLP

That makes no sense.
4 scalar-to-vector moves + 2 vector shuffles + 2 vector stores are a lot more
costly than 4 scalar stores.
Even more so considering that the scalar stores go to adjacent addresses, so on
good CPUs they are likely to be combined at the level of the store queue, and
the cache subsystem sees fewer operations.

Either your cost model is broken or there are bugs in the summation.
I'd guess that the compiler somehow thinks the moves have zero cost. But
scalar-to-vector moves are certainly not of zero cost.
Even scalar-to-scalar or vector-to-vector moves that are hoisted at the renamer
do not have zero cost, because quite often the renamer itself is the narrowest
performance bottleneck. And these moves, I don't think they are hoisted by the
renamer at all.

Also, it's likely that the compiler thinks a scalar store costs the same as a
vector store. That's also generally incorrect, esp. when you don't know your
target CPU and don't know whether the stores are aligned or not, like in this
case.
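To make the summation in the quoted dump easy to check, here is a minimal sketch in C that just re-adds the numbers printed above. It is not GCC's cost model code: the struct, the tables and the function names (cost_entry, scalar_cost, vector_cost, vector_cost_with_construct) are hypothetical and exist only for this illustration. The only inputs are the per-line costs from the dump.

```c
#include <stddef.h>

/* One line of the quoted cost dump. Hypothetical type, not a GCC structure. */
struct cost_entry {
    const char *kind;  /* operation name as printed in the dump */
    int count;         /* the "N times" field */
    int total_cost;    /* the "costs C" field, already the total for the line */
};

/* Scalar side of the dump: four scalar stores at 12 each. */
static const struct cost_entry scalar_entries[] = {
    { "scalar_store", 1, 12 },
    { "scalar_store", 1, 12 },
    { "scalar_store", 1, 12 },
    { "scalar_store", 1, 12 },
};

/* Vector side of the dump: two unaligned stores (24 in total) in the body
   plus two vec_construct entries at 4 each in the prologue. */
static const struct cost_entry vector_entries[] = {
    { "unaligned_store", 2, 24 },
    { "vec_construct",   1, 4 },
    { "vec_construct",   1, 4 },
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Add up the "costs C" column of a table. */
static int sum_costs(const struct cost_entry *entries, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += entries[i].total_cost;
    return total;
}

int scalar_cost(void)
{
    return sum_costs(scalar_entries, ARRAY_LEN(scalar_entries));
}

int vector_cost(void)
{
    return sum_costs(vector_entries, ARRAY_LEN(vector_entries));
}

/* Hypothetical what-if: keep the store cost from the dump (24) and assume
   a different cost for each of the two vec_construct nodes. */
int vector_cost_with_construct(int construct_cost)
{
    return vector_entries[0].total_cost + 2 * construct_cost;
}
```

With the numbers from the dump, scalar_cost() is 48 and vector_cost() is 32, so the summation itself matches what the compiler printed. In this sketch the vector side stops looking cheaper once each vec_construct is assumed to cost 12 or more (24 + 2 * 12 = 48). So the argument is really about whether 4 per vec_construct and 12 per unaligned vector store are realistic weights, not about the addition.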