https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136
--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Richard Biener from comment #2)
> And it gets worse because of the splitting
> which isn't exposed to the vectorizer.

Split loads/stores can be a useful shuffling strategy even on Haswell/Skylake (which don't normally benefit from split loads/stores for contiguous-but-unaligned data). vinsertf128 with a memory source only requires a load uop + a blend uop which can run on any of the three vector ALU ports. (The register-source version is a shuffle, though.) vextractf128 to memory is a pure store with no ALU uop, but its store-address and store-data uops can't micro-fuse. So if store-port throughput isn't a problem but shuffle throughput is, it could be a win.

Unaligned overlapping loads are another interesting way to replace shuffles in some cases.

> not sure what ends up messing things up here (I guess AVX256 doesn't have
> full width extract even/odd and interleave high/low ...).

Yep, replied to that in detail at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137#c2

> Looks like with -mprefer-avx128 we never try the larger vector size (Oops?).
> At least we figure vectorization isn't profitable.

AVX actually makes the SSE2 vectorization strategy a lot better (it saves a lot of movdqa reg,reg copies), and it implies that the CPU has efficient unaligned vector loads/stores. So probably the best choice for -mavx -mprefer-avx128 is to vectorize like SSE2, but with unaligned loads/stores.

256b lane-crossing shuffles are extra expensive with -mtune=znver1 and -mtune=bdver* (which enable -mprefer-avx128), so it's a good thing that -mprefer-avx128 doesn't enable the current AVX1/2 256b vectorization strategy. (Which is horrible in this case anyway: bug 82137.) With a good 256b vectorization strategy, 256b vectors might still be a win on Ryzen: according to Agner Fog, Ryzen (unlike Bulldozer-family) has better total uop throughput when running 2-uop instructions like 256b AVX ops, so -mprefer-avx128 isn't always appropriate for Ryzen.
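To make the split-halves pattern concrete, here's a minimal sketch (not from the original comment; the helper name copy8f_split is mine) of a 256b unaligned copy written as two 128b halves. Compilers generally fold the second _mm_loadu_ps into a memory-source vinsertf128 (load uop + any-port blend uop on Haswell/Skylake), and _mm256_extractf128_ps straight to memory becomes vextractf128, a pure store with no ALU uop:

```c
#include <immintrin.h>

/* Sketch: load 8 floats as two 128b halves, then store them back the same
 * way.  The insert half is intended to compile to vinsertf128 with a memory
 * source; the extract half to vextractf128 to memory (no shuffle uop). */
__attribute__((target("avx")))
static void copy8f_split(float *dst, const float *src)
{
    __m256 v = _mm256_castps128_ps256(_mm_loadu_ps(src)); /* vmovups xmm  */
    v = _mm256_insertf128_ps(v, _mm_loadu_ps(src + 4), 1); /* vinsertf128 */
    _mm_storeu_ps(dst, _mm256_castps256_ps128(v));         /* vmovups     */
    _mm_storeu_ps(dst + 4, _mm256_extractf128_ps(v, 1));   /* vextractf128 */
}
```

Whether this beats a plain vmovups ymm depends on the surrounding port pressure, which is exactly the cost question the vectorizer would need to model.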
> So all this probably boils down to costs of permutes not being modeled.

That would certainly explain gcc's behaviour in a lot of cases. It often seems pretty shuffle-happy, even though shuffles are a very limited execution resource. (Especially with unrolling, so front-end bottlenecks don't hide the shuffle bottlenecks.)
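As an illustration of the unaligned-overlapping-load trick mentioned above (my own sketch, hypothetical helper names): to get the window a[1..4] from an array, a shuffle-based approach combines two aligned vectors with palignr (a shuffle-port uop), while a single unaligned load at a+1 costs only a load-port uop:

```c
#include <immintrin.h>

/* Shuffle version: two aligned loads combined with palignr.
 * Requires 16-byte-aligned a; the palignr is a shuffle uop. */
__attribute__((target("ssse3")))
static __m128 window_palignr(const float *a)
{
    __m128i lo = _mm_load_si128((const __m128i *)a);        /* a[0..3] */
    __m128i hi = _mm_load_si128((const __m128i *)(a + 4));  /* a[4..7] */
    /* bytes 4..19 of the hi:lo concatenation = a[1..4] */
    return _mm_castsi128_ps(_mm_alignr_epi8(hi, lo, 4));
}

/* Load version: one unaligned load that overlaps the aligned vectors.
 * Same result, no shuffle uop at all. */
static __m128 window_loadu(const float *a)
{
    return _mm_loadu_ps(a + 1);                             /* a[1..4] */
}
```

When shuffle throughput is the bottleneck, a cost model that knew about this substitution could trade load-port uops for shuffle-port uops.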