https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #1 from Robin Dapp <rdapp at gcc dot gnu.org> ---
Hmm, so I tried reproducing this, and without a vector cost model we indeed
vectorize.  My QEMU dynamic instruction count results are not as abysmal as
yours but still bad enough (a 20-30% increase in dynamic instructions).

However, as soon as I use the vector cost model, enabled by -mtune=generic-ooo,
the sha256 function is not vectorized anymore:

bla.c:95:5: note: Cost model analysis for part in loop 0:
  Vector cost: 294
  Scalar cost: 185
bla.c:95:5: missed: not vectorized: vectorization is not profitable.

Without that we have:
bla.c:95:5: note: Cost model analysis for part in loop 0:
  Vector cost: 173
  Scalar cost: 185
bla.c:95:5: note: Basic block will be vectorized using SLP

(Those costs are obtained via default_builtin_vectorization_cost).

The main difference is the vec_to_scalar cost, which is 1 by default and 2 in
our cost model, as well as vec_perm = 2.  Given our limited permute
capabilities I think a cost of 2 makes sense.  We can also argue in favor of
vec_to_scalar = 2 because we need to slide down elements for extraction and
cannot extract directly.  Setting scalar_to_vec = 2 is debatable and I'd
rather keep it at 1.

Going forward we need to decide whether to continue with generic-ooo as the
default vector model, or whether to set latencies to a few uniform values so
that scheduling does not introduce spilling and stalls on dependencies.

To help with that decision, could you run some benchmarks with the
generic-ooo tuning and see whether things get better or worse?