https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625
--- Comment #2 from Chris Elrod <elrodc at gmail dot com> ---
Created attachment 44418
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44418&action=edit
Code to reproduce slow vectorization pattern and unnecessary loads & stores

(Sorry if this goes to the bottom instead of the top; I am trying to attach a file in place of a link, but I can't edit the old comment.)

Attached is sample code to reproduce the problem in gcc 8.1.1.

As observed by amonakov, compiling with -O3/-Ofast reproduces the full problem, e.g.:

gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops -S kernels.f90 -o kernels.s

Compiling with -O3 -fdisable-tree-cunrolli or -O2 -ftree-vectorize fixes the incorrect vectorization pattern, but still leaves a lot of unnecessary broadcast loads and stores.
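
To compare the three flag combinations from this comment side by side, something like the following could be used. It is only a sketch: it prints one gfortran command per variant rather than running the compiler. Keeping the -march/-mprefer-vector-width/-funroll-loops flags for the -O3 and -O2 variants is an assumption, and the helper name `variant_cmd` and the `kernels_<label>.s` output names are made up for illustration.

```shell
#!/bin/sh
# Target flags copied from the command in this comment. Reusing them for the
# other two variants is an assumption, not something the report states.
TARGET_FLAGS="-march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops"

# Print the gfortran command for one variant.
#   $1 = label used in the (hypothetical) output file name
#   $2 = optimization flags for that variant
variant_cmd() {
    printf 'gfortran %s %s -S kernels.f90 -o kernels_%s.s\n' "$2" "$TARGET_FLAGS" "$1"
}

variant_cmd ofast       "-Ofast"
variant_cmd no_cunrolli "-O3 -fdisable-tree-cunrolli"
variant_cmd o2_vect     "-O2 -ftree-vectorize"
```

Piping the printed lines to sh in the directory containing kernels.f90 should produce one assembly file per variant, which can then be diffed or searched for broadcast instructions.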