https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247
--- Comment #1 from Robin Dapp <rdapp at gcc dot gnu.org> ---
Hmm, so I tried reproducing this, and without a vector cost model we indeed
vectorize.  My qemu dynamic instruction count results are not as abysmal as
yours but still bad enough (a 20-30% increase in dynamic instructions).

However, as soon as I use the vector cost model, enabled by
-mtune=generic-ooo, the sha256 function is not vectorized anymore:

  bla.c:95:5: note: Cost model analysis for part in loop 0:
    Vector cost: 294
    Scalar cost: 185
  bla.c:95:5: missed: not vectorized: vectorization is not profitable.

Without that we have:

  bla.c:95:5: note: Cost model analysis for part in loop 0:
    Vector cost: 173
    Scalar cost: 185
  bla.c:95:5: note: Basic block will be vectorized using SLP

(Those costs are obtained via default_builtin_vectorization_cost.)

The main difference is the vec_to_scalar cost, which is 1 by default and 2 in
our cost model, as well as vec_perm = 2.  Given our limited permute
capabilities I think a cost of 2 makes sense.  We can also argue in favor of
vec_to_scalar = 2 because we need to slide down elements for extraction and
cannot extract directly.  Setting scalar_to_vec = 2 is debatable and I'd
rather keep it at 1.

For the future we need to decide whether to continue with generic-ooo as the
default vector cost model, or whether to set latencies to a few uniform
values so that scheduling does not introduce spilling and waiting for
dependencies.  To help with that decision, could you run some benchmarks with
the generic-ooo tuning and see whether things get better or worse?
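
Regarding the vec_to_scalar = 2 argument above, a rough illustration (the
exact sequence the backend emits may differ): RVV has no direct extract of an
arbitrary lane, so pulling element i out of a vector register is typically a
two-instruction slide-down plus scalar move, which is why a cost of 2 fits:

  vslidedown.vx  v1, v2, a0   # shift element i of v2 down to position 0
  vmv.x.s        a1, v1       # copy element 0 of v1 into scalar register a1

Only element 0 can be moved out directly via vmv.x.s; every other lane pays
for the extra slide.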