https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Another thing to note is that the loop performs no vector loads/stores at all; all of them are strided. If we improved SLP analysis we could get equivalent (but VF==1) basic-block vectorization, with the caveat of having to deal with the possible aliasing of XPQKL(MPQ,MKL) and XPQKL(MRS,MKL). Still, in a case where there's no aliasing, BB vectorization will eventually be a better solution.

That said, an x86 backend specific thing could be to count the number of vector loads/stores as well as the number of strided loads/stores and apply the biasing based on that at finish_cost time, rather than on each individual case. We can also count the number of "other" stmts in the loop body so as to weigh the ratio between them. For gamess it's 10 vector stmts vs. 6 strided loads + 2 strided stores.

We could simply sum the vector stmts (including vector loads and stores), subtract the "emulated scalar" ones (maybe weighting the variably strided cases with a factor of two) and require the outcome to be > 0 for vectorization to be worthwhile.

Eventually the finish_cost hook should get a bool result to indicate that, independent of the cost of the scalar loop, we do not want this vectorization (that's nicer than returning an arbitrarily high number, for example).
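The counting heuristic proposed above can be sketched as follows. This is only an illustration under stated assumptions, not GCC code: the struct, its fields and both functions are hypothetical names, and the split between constant-stride and variably strided accesses is a guess at how the "factor of two" weighting would be applied.

```cpp
// Hypothetical sketch of the proposed finish_cost-time heuristic.
// None of these names exist in GCC; they are made up for illustration.

// Per-loop statement counts the backend would collect while costing.
struct LoopStmtCounts {
    int vector_stmts;              // vector stmts, including real vector loads/stores
    int strided_accesses;          // strided loads + stores ("emulated scalar")
    int variable_strided_accesses; // strided accesses whose stride is not constant
};

// Assumed weight for variably strided accesses (the "factor of two").
constexpr int variable_stride_weight = 2;

// Sum the vector stmts and subtract the emulated-scalar ones.
constexpr int vectorization_score(const LoopStmtCounts &c) {
    return c.vector_stmts
           - c.strided_accesses
           - variable_stride_weight * c.variable_strided_accesses;
}

// Vectorization is considered worthwhile only if the score is > 0.
// This bool stands in for the proposed boolean result of finish_cost.
constexpr bool worthwhile_to_vectorize(const LoopStmtCounts &c) {
    return vectorization_score(c) > 0;
}
```

With the gamess counts (10 vector stmts, 6 strided loads + 2 strided stores), the score is 10 - 8 = 2 if all eight accesses are treated as constant-stride, and 10 - 16 = -6 if all eight are treated as variably strided. The comment does not say which case applies, so both are shown only to illustrate how the weighting changes the decision.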