https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136

--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Richard Biener from comment #2)
> And it gets worse because of the splitting
> which isn't exposed to the vectorizer.

Split loads/stores can be a useful shuffling strategy even on Haswell/Skylake
(which don't normally benefit from splitting contiguous-but-unaligned
loads/stores).  vinsertf128 with a memory source only requires a load uop + a
blend uop which can run on any of the three vector ALU ports.  (The
register-source version is a shuffle, though.)  vextractf128 to memory is a
pure store with no ALU uop, but the store-address and store-data uops can't
micro-fuse.  If store-port throughput isn't a problem but shuffle throughput
is, it can be a win.
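
As a concrete illustration, here's a minimal sketch (the function and pointer
names are made up) of using insert/extract as the shuffle strategy; with luck
the compiler folds the 128b loads into vinsertf128's memory operand:

  #include <immintrin.h>

  /* Gather two 128b halves from separate locations into one 256b
     vector, then split it again on the way out.  vinsertf128 with a
     memory source is a load + an ALU uop that can run on any vector
     ALU port; vextractf128 to memory is a pure store. */
  void split_copy(float *dst0, float *dst1,
                  const float *src0, const float *src1)
  {
      __m256 v = _mm256_castps128_ps256(_mm_loadu_ps(src0)); /* low half  */
      v = _mm256_insertf128_ps(v, _mm_loadu_ps(src1), 1);    /* high half */
      /* ... 256b ALU work on v would go here ... */
      _mm_storeu_ps(dst0, _mm256_castps256_ps128(v));
      _mm_storeu_ps(dst1, _mm256_extractf128_ps(v, 1));
  }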

Unaligned overlapping loads are another interesting way to replace shuffles in
some cases.
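
For example (a hand-written sketch, not gcc output): summing adjacent pairs
can reload at an offset of one element instead of using a shuffle to line the
data up:

  #include <immintrin.h>

  /* Two loads that overlap in memory replace a palignr/vpalignr;
     cheap on CPUs with efficient unaligned loads. */
  __m128 adjacent_sums(const float *a)
  {
      __m128 lo = _mm_loadu_ps(a);      /* a[0..3] */
      __m128 hi = _mm_loadu_ps(a + 1);  /* a[1..4], overlaps lo */
      return _mm_add_ps(lo, hi);        /* a[i] + a[i+1], no shuffle */
  }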

> not sure what ends up messing things up here (I guess AVX256 doesn't have
> full width extract even/odd and interleave high/low ...).

Yep, I replied to that in detail at
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137#c2

> Looks like with -mprefer-avx128 we never try the larger vector size (Oops?).
> At least we figure vectorization isn't profitable.

AVX actually makes the SSE2 vectorization strategy a lot better (the
non-destructive 3-operand VEX encoding saves a lot of movdqa reg,reg copies),
and it implies that the CPU has efficient unaligned vector loads/stores.  So
probably the best choice for -mavx -mprefer-avx128 is to vectorize like SSE2,
but with unaligned loads/stores.
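
The movdqa savings come from VEX's non-destructive destination.  A trivial
illustration (the function name is made up):

  #include <immintrin.h>

  /* c = a + b with a still live afterwards:
       SSE2:  movdqa xmm2, xmm0
              paddd  xmm2, xmm1
       AVX:   vpaddd xmm2, xmm0, xmm1
     Same 128b work, one instruction instead of two. */
  __m128i add_keep(__m128i a, __m128i b)
  {
      return _mm_add_epi32(a, b);
  }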

256b lane-crossing shuffles are extra expensive with -mtune=znver1 and
-mtune=bdver* (which enable -mprefer-avx128), so it's a good thing that
-mprefer-avx128 doesn't enable the current AVX1/2 256b vectorization strategy. 
(Which is horrible in this case anyway: bug 82137).
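
For reference, a sketch of the distinction (generic intrinsics examples, not
from this testcase): znver1/bdver* split 256b ops into two 128b halves
internally, so in-lane shuffles stay cheap while lane-crossing ones need
extra uops:

  #include <immintrin.h>

  __m256 in_lane(__m256 a, __m256 b)
  {   /* vshufps: shuffles within each 128b lane, cheap everywhere */
      return _mm256_shuffle_ps(a, b, 0x44);
  }

  __m256 lane_cross(__m256 a, __m256 b)
  {   /* vperm2f128: moves data between 128b lanes, expensive on
         CPUs that split 256b ops in half */
      return _mm256_permute2f128_ps(a, b, 0x21);
  }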

With a good 256b vectorization strategy, 256b vectors might be a win even for
Ryzen: unlike Bulldozer-family, it has good total uop throughput when running
2-uop instructions like 256b AVX ops (according to Agner Fog), so
-mprefer-avx128 isn't always appropriate for Ryzen.

> So all this probably boils down to costs of permutes not being modeled.

That would certainly explain gcc's behaviour in a lot of cases.  It often seems
pretty shuffle-happy, even though shuffles are a very limited execution
resource.  (Especially with unrolling, so front-end bottlenecks don't hide the
shuffle bottlenecks.)
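
E.g. a scalar even/odd de-interleave like the one below (a made-up example,
not the testcase from this bug) needs shuffles to separate the elements when
vectorized, and Haswell/Skylake have only one shuffle port (port 5), so
shuffle throughput caps at one per clock no matter how far the loop is
unrolled:

  void deinterleave(float *even, float *odd, const float *in, int n)
  {
      for (int i = 0; i < n; i += 2) {
          even[i/2] = in[i];      /* even elements */
          odd[i/2]  = in[i + 1];  /* odd elements  */
      }
  }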
