https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113600

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'll note that two-lane reductions in particular (or, in general, two-lane BB
vectorization) are hardly profitable on modern x86 uarchs unless the vectorized
code is interleaved with other non-vectorized code that can execute at the same
time.  Vectorizing two lanes only makes them dependent on each other, whereas
when left scalar modern uarchs have no difficulty executing them in parallel
(and without the tied dependences).  It's only when there's sufficient
benefit, aka more lanes, approaching the issue width or the number of available
ports for the ops, or the whole SLP mostly consisting of loads/stores, that BB
vectorization is going to be profitable.  Note the cost model only ever looks
at the stmts participating in the vectorization, not the "surrounding" code,
and it would be difficult to include that since the schedule on GIMPLE isn't
even close to what we get later.  The reduction op is also a serialization
point on the scalar side, of course; whether that means that BB reductions
with two lanes are possibly better candidates than grouped BB stores with
two lanes is another question.
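
To make the two shapes concrete, here is a minimal sketch of what a two-lane
BB reduction and a two-lane grouped BB store can look like at the source level.
The function names dot2 and add2 are hypothetical and not taken from the PR or
from 525.x264_r; whether GCC actually vectorizes either one depends on the
target, the flags and the cost model.

```c
/* Hypothetical two-lane BB reduction.  The two multiplies are independent
   and can issue in parallel when left scalar; the final add is the
   serialization point on the scalar side.  */
double dot2(const double *a, const double *b)
{
    return a[0] * b[0] + a[1] * b[1];
}

/* Hypothetical two-lane grouped BB store.  The two adds are fully
   independent when left scalar; a two-lane vector add ties them
   together into one dependent operation.  */
void add2(double *restrict out, const double *a, const double *b)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
}
```

Compiling something like this with -O3 -fopt-info-vec should show whether the
BB vectorizer picked up either function on a given target.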

The BB reduction op itself is costed properly.

So the 525.x264_r case might be loop vectorization; OTOH the epilogue
cost is hardly ever a knob that decides whether a vectorization is profitable.

I think we need to figure out what exactly gets slower (and hope it's not
scattered all over the place).
