https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
Freddie Witherden <freddie at witherden dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |freddie at witherden dot org

--- Comment #11 from Freddie Witherden <freddie at witherden dot org> ---
I've been looking into this and the big difference appears to be that when
Clang unrolls the loop it does so using multiple accumulators (and indeed does
this without needing to be told to unroll).  Given:

    double acc(double *x, int n)
    {
        double a = 0;
        #pragma omp simd
        for (int i = 0; i < n; i++)
            a += x[i];
        return a;
    }

and compiling with clang -march=native -Ofast -fopenmp -S the core loop reads
as:

        vaddpd  (%rdi,%rsi,8), %ymm0, %ymm0
        vaddpd  32(%rdi,%rsi,8), %ymm1, %ymm1
        vaddpd  64(%rdi,%rsi,8), %ymm2, %ymm2
        vaddpd  96(%rdi,%rsi,8), %ymm3, %ymm3
        vaddpd  128(%rdi,%rsi,8), %ymm0, %ymm0
        vaddpd  160(%rdi,%rsi,8), %ymm1, %ymm1
        vaddpd  192(%rdi,%rsi,8), %ymm2, %ymm2
        vaddpd  224(%rdi,%rsi,8), %ymm3, %ymm3
        vaddpd  256(%rdi,%rsi,8), %ymm0, %ymm0
        vaddpd  288(%rdi,%rsi,8), %ymm1, %ymm1
        vaddpd  320(%rdi,%rsi,8), %ymm2, %ymm2
        vaddpd  352(%rdi,%rsi,8), %ymm3, %ymm3
        vaddpd  384(%rdi,%rsi,8), %ymm0, %ymm0
        vaddpd  416(%rdi,%rsi,8), %ymm1, %ymm1
        vaddpd  448(%rdi,%rsi,8), %ymm2, %ymm2
        vaddpd  480(%rdi,%rsi,8), %ymm3, %ymm3

which is heavily unrolled and uses four separate accumulators to hide the
latency of the vector adds.  Interestingly, one could argue that Clang is not
using enough registers given that Skylake can dual-issue adds and they have a
latency of 4 cycles (implying you want 8 separate accumulators).

GCC 10 with gcc -march=skylake -Ofast -fopenmp -S test.c -funroll-loops gives:

        vaddpd  -224(%r8), %ymm1, %ymm2
        vaddpd  -192(%r8), %ymm2, %ymm3
        vaddpd  -160(%r8), %ymm3, %ymm4
        vaddpd  -128(%r8), %ymm4, %ymm5
        vaddpd  -96(%r8), %ymm5, %ymm6
        vaddpd  -64(%r8), %ymm6, %ymm7
        vaddpd  -32(%r8), %ymm7, %ymm0

which, although it is unrolled, is not a useful unrolling due to the dependency
chain: each add takes as input the register written by the previous one.
Indeed, I would not be surprised if the performance is similar to the unrolled
code as the loop-related cruft can be hidden.
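
For illustration only, the multiple-accumulator transformation described above
can be written out by hand at the source level.  This is a minimal sketch, not
taken from either compiler's output: the name acc4 and the choice of four
scalar partial sums are assumptions made to mirror the four ymm accumulators in
the Clang listing.  Note that it reassociates the floating-point additions,
which is why a compiler only does this on its own under -Ofast or
-ffast-math.

```c
/* Hypothetical helper (not from the bug report): sums x[0..n-1] using four
   independent partial sums, so that consecutive adds do not depend on each
   other's results. */
double acc4(const double *x, int n)
{
    double a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    int i = 0;

    /* Main loop: four independent dependency chains per iteration. */
    for (; n - i >= 4; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }

    /* Remainder loop: handle the last n % 4 elements. */
    for (; i < n; i++)
        a0 += x[i];

    /* Combine the partial sums once, outside the hot loop. */
    return (a0 + a1) + (a2 + a3);
}
```

By the latency-times-throughput argument above (4 cycles x 2 adds per cycle),
the same pattern with eight partial sums would be the one to try on Skylake;
whether it actually helps would need to be measured.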