https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740
--- Comment #10 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Created attachment 44121
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121&action=edit
Test case for vectorizing, inc. assembly

This test case, written by Nicolas König, shows a proof of concept
for vectorizing a maxloc loop using AVX2. It lacks loop peeling, so
the overhead for real-life cases will be a little higher.

Compiled with

$ gcc -Ofast -funroll-loops -Wall main-gp.c maxloc_unroll.s maxloc_nounroll.s

on a Ryzen 1700, the results are

# Ints per cycle
#         n    normal    expect      AVX2  AVX2_unroll
        128  0.179272  0.107563  0.537815  0.537815
        256  0.396285  0.442907  1.075630  1.254902
        512  0.456328  0.654731  1.368984  1.254902
       1024  0.501961  0.912656  1.771626  1.434174
       2048  0.552617  1.368984  1.434174  1.771626
       4096  0.568257  1.338562  1.673203  2.041874
       8192  0.529541  1.392724  1.661663  1.958871
      16384  0.533056  1.342291  1.204706  1.617055
      32768  0.535128  1.478167  1.451453  1.832252
      65536  0.435206  1.479301  1.612995  2.077079
     131072  0.586499  1.366558  1.603602  2.081565
     262144  0.533831  1.366315  1.708424  2.071499
     524288  0.582885  1.458868  1.601936  2.053294
    1048576  0.580800  1.367224  1.563047  2.060289
    2097152  0.567541  1.120453  1.552151  1.420402
    4194304  0.555527  0.934199  1.001712  1.210831
    8388608  0.553388  0.968962  1.211283  1.010533
   16777216  0.529886  0.972008  1.143200  1.291568
   33554432  0.538509  0.971257  1.303704  1.245031
   67108864  0.567889  1.016425  1.320057  1.413633
  134217728  0.572604  1.043561  1.338406  1.437528
  268435456  0.578321  1.046748  1.420888  1.450485
  536870912  0.572145  1.042577  1.445350  1.424409
  536870912  0.458679  1.072158  1.451617  1.442926
  268435456  0.460019  1.011306  1.380031  1.404121
  134217728  0.460833  1.007872  1.318771  1.409553
   67108864  0.457754  1.008473  1.281166  1.387048
   33554432  0.425367  0.933973  1.378792  1.078957
   16777216  0.449178  0.903576  1.371796  1.416321
    8388608  0.436478  1.043172  1.298123  1.344617
    4194304  0.421925  1.023954  1.214096  1.288334
    2097152  0.458274  1.068309  1.060794  1.147894
    1048576  0.470394  1.488727  1.247491  1.777959
     524288  0.473536  1.493919  2.073448  2.096280
     262144  0.476521  1.504707  2.261032  2.067056
     131072  0.473536  1.494209  2.189131  2.060427
      65536  0.432861  1.477034  2.238710  2.018355
      32768  0.468757  1.482715  2.072612  2.024716
      16384  0.470129  1.455838  2.068165  1.974928
       8192  0.469671  1.417301  2.190374  2.041874
       4096  0.474294  1.368984  2.007843  1.974928
       2048  0.459811  1.075630  2.007843  2.077079
       1024  0.406995  1.254902  1.882353  1.505882
        512  0.406995  0.792570  1.673203  0.941176
        256  0.327366  0.627451  1.254902  0.752941
        128  0.342246  0.941176  0.941176  0.537815

so

a) __builtin_expect is a big win
b) AVX2 is an even bigger win
c) Unrolling the AVX2 loop is a win for intermediate sizes; at large
   sizes (outside of any cache) it makes little difference

Dunno how realistic it is to emit this kind of code for the general
case, or what we could do in the Fortran front end to encourage
vectorization like that. Use vector data types and do the loop
peeling by hand, I presume.
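
For illustration only, here is a rough C sketch of the two ideas above: a scalar maxloc with a __builtin_expect hint, and a version using GCC's generic vector data types with a scalar loop for the leftover elements. This is not the attached test case (which uses hand-written assembly). The names maxloc_expect, maxloc_vec and v8si are made up for this sketch, the index is 0-based rather than Fortran's 1-based, and it assumes n fits in an int. It handles only the remainder after the vector loop and does no peeling for alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Eight 32-bit ints per vector: the width of one AVX2 register.
   (GCC generic vector extension; the type name is hypothetical.) */
typedef int v8si __attribute__((vector_size(32)));

/* Scalar maxloc: 0-based index of the first maximum. The hint tells
   the compiler that finding a new maximum is the unlikely case. */
size_t maxloc_expect(const int *a, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (__builtin_expect(a[i] > a[best], 0))
            best = i;
    return best;
}

/* Vector maxloc: every lane tracks its own running maximum and the
   index where it was first seen. Assumes n <= INT_MAX. */
size_t maxloc_vec(const int *a, size_t n)
{
    assert(n > 0);
    if (n < 8)
        return maxloc_expect(a, n);

    const v8si step = {8, 8, 8, 8, 8, 8, 8, 8};
    v8si vidx = {0, 1, 2, 3, 4, 5, 6, 7};  /* index of each lane's maximum */
    v8si vcur = vidx;                      /* indices of the current block */
    v8si vmax;
    memcpy(&vmax, a, sizeof vmax);         /* unaligned load of a[0..7] */

    size_t i = 8;
    for (; i + 8 <= n; i += 8) {
        v8si v;
        memcpy(&v, a + i, sizeof v);
        vcur += step;
        v8si gt = v > vmax;                /* all-ones where v is larger */
        vmax = (gt & v) | (~gt & vmax);    /* branch-free select */
        vidx = (gt & vcur) | (~gt & vidx);
    }

    /* Horizontal reduction: largest value, ties go to the lowest index. */
    int bestv = vmax[0];
    size_t best = (size_t)vidx[0];
    for (int l = 1; l < 8; l++) {
        size_t idx = (size_t)vidx[l];
        if (vmax[l] > bestv || (vmax[l] == bestv && idx < best)) {
            bestv = vmax[l];
            best = idx;
        }
    }

    /* Scalar loop for the leftover elements (fewer than 8). */
    for (; i < n; i++)
        if (a[i] > bestv) {
            bestv = a[i];
            best = i;
        }
    return best;
}
```

Both functions should return the same index for any non-empty array. Whether the vector version actually turns into AVX2 instructions depends on the target flags (e.g. -mavx2); without them GCC lowers the vector operations to whatever the target supports.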