https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247
--- Comment #2 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Robin Dapp from comment #1)
> Hmm, so I tried reproducing this and without a vector cost model we indeed
> vectorize. My qemu dynamic instruction count results are not as abysmal as
> yours but still bad enough (20-30% increase in dynamic instructions).
>
> However, as soon as I use the vector cost model, enabled by
> -mtune=generic-ooo, the sha256 function is not vectorized anymore:
>
> bla.c:95:5: note: Cost model analysis for part in loop 0:
>   Vector cost: 294
>   Scalar cost: 185
> bla.c:95:5: missed: not vectorized: vectorization is not profitable.
>
> Without that we have:
>
> bla.c:95:5: note: Cost model analysis for part in loop 0:
>   Vector cost: 173
>   Scalar cost: 185
> bla.c:95:5: note: Basic block will be vectorized using SLP
>
> (Those costs are obtained via default_builtin_vectorization_cost.)
>
> The main difference is the vec_to_scalar cost being 1 by default and 2 in
> our cost model, as well as vec_perm = 2. Given our limited permute
> capabilities I think a cost of 2 makes sense. We can also argue in favor of
> vec_to_scalar = 2 because we need to slide down elements for extraction and
> cannot extract directly. Setting scalar_to_vec = 2 is debatable and I'd
> rather keep it at 1.
>
> For the future we need to decide whether to continue with generic-ooo as
> the default vector cost model, or whether to set latencies to a few uniform
> values so that scheduling does not introduce spilling and waiting for
> dependencies.
>
> To help with that decision, could you run some benchmarks with the
> generic-ooo tuning and see whether things get better or worse?

generic-ooo uses the generic cost model I added. Since the default tuning
info is rocket, which doesn't have a cost model, we currently fall back to
the default builtin cost model. I think it is more reasonable to adjust the
code in builtin_vectorize_cost:

  if (!cost)
    cost = &generic_vector_cost;