https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136
--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
Whoops, the compiler-explorer link had aligned=1. This one produces the asm I showed in the original report: https://godbolt.org/g/WsZ5S9

See bug 82137 for a much more efficient vectorization strategy gcc should use instead, with just in-lane shuffle + blend and some duplicated work.