https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113600
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> --- I'll note that especially two-lane reductions (or, in general, two-lane BB vectorization) are hardly profitable on modern x86 uarchs unless the vectorized code is interleaved with other non-vectorized code that can execute at the same time. Vectorizing two lanes only makes them dependent on each other, whereas, when not vectorized, modern uarchs have no difficulty executing them in parallel (but without the tied dependences). BB vectorization is only going to be profitable when there's sufficient benefit, aka more lanes, approaching the issue width or the number of available ports for the ops, or when the whole SLP graph mostly consists of loads/stores.

Note the cost model only ever looks at the stmts participating in the vectorization, not the "surrounding" code, and it would be difficult to include that since the schedule on GIMPLE isn't even close to what we get later.

The reduction op is of course also a serialization point on the scalar side; whether that makes BB reductions with two lanes better candidates than grouped BB stores with two lanes is another question. The BB reduction op itself is costed properly.

So the 525.x264_r case might be loop vectorization; OTOH the epilogue cost is hardly ever the knob that decides whether a vectorization is profitable. I think we need to figure out what exactly gets slower (and hope it's not scattered all over the place).