https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93588
Bug ID: 93588
Summary: Vectorized load followed by FMA pessimizes on Haswell from version 8.1
Product: gcc
Version: 8.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alex.reinking at gmail dot com
Target Milestone: ---

Compiling the following loop, which uses vector intrinsics (via immintrin.h), with GCC 7.3 and -O3 -march=haswell:

---
for (int k = 0; k < n; ++k) {
    ymm12 = _mm256_broadcast_sd(&b[k]);
    ymm13 = _mm256_broadcast_sd(&b[k + ldb]);
    ymm14 = _mm256_broadcast_sd(&b[k + 2 * ldb]);

    ymm15 = _mm256_loadu_pd(&a[k * lda]);
    ymm0 = _mm256_fmadd_pd(ymm15, ymm12, ymm0);
    ymm4 = _mm256_fmadd_pd(ymm15, ymm13, ymm4);
    ymm8 = _mm256_fmadd_pd(ymm15, ymm14, ymm8);

    ymm15 = _mm256_loadu_pd(&a[4 + k * lda]);
    ymm1 = _mm256_fmadd_pd(ymm15, ymm12, ymm1);
    ymm5 = _mm256_fmadd_pd(ymm15, ymm13, ymm5);
    ymm9 = _mm256_fmadd_pd(ymm15, ymm14, ymm9);

    ymm15 = _mm256_loadu_pd(&a[8 + k * lda]);
    ymm2 = _mm256_fmadd_pd(ymm15, ymm12, ymm2);
    ymm6 = _mm256_fmadd_pd(ymm15, ymm13, ymm6);
    ymm10 = _mm256_fmadd_pd(ymm15, ymm14, ymm10);

    ymm15 = _mm256_loadu_pd(&a[12 + k * lda]);
    ymm3 = _mm256_fmadd_pd(ymm15, ymm12, ymm3);
    ymm7 = _mm256_fmadd_pd(ymm15, ymm13, ymm7);
    ymm11 = _mm256_fmadd_pd(ymm15, ymm14, ymm11);
}
---

produces this inner loop:

---
.L3:
        lea     rax, [r8+rcx]
        vbroadcastsd    ymm2, QWORD PTR [rcx]
        vmovupd ymm3, YMMWORD PTR [rsi]
        add     rcx, 8
        vbroadcastsd    ymm1, QWORD PTR [rax]
        vbroadcastsd    ymm0, QWORD PTR [rax+r8]
        vfmadd231pd     ymm15, ymm3, ymm2
        vfmadd231pd     ymm11, ymm3, ymm1
        vfmadd231pd     ymm7, ymm3, ymm0
        vmovupd ymm3, YMMWORD PTR [rsi+32]
        vfmadd231pd     ymm14, ymm3, ymm2
        vfmadd231pd     ymm10, ymm3, ymm1
        vfmadd231pd     ymm6, ymm3, ymm0
        vmovupd ymm3, YMMWORD PTR [rsi+64]
        vfmadd231pd     ymm13, ymm3, ymm2
        vfmadd231pd     ymm9, ymm3, ymm1
        vfmadd231pd     ymm5, ymm3, ymm0
        vmovupd ymm3, YMMWORD PTR [rsi+96]
        add     rsi, rdx
        vfmadd231pd     ymm12, ymm3, ymm2
        vfmadd231pd     ymm8, ymm3, ymm1
        vfmadd231pd     ymm4, ymm3, ymm0
        cmp     rdi, rcx
        jne     .L3
---

This loads each vector of a into a register once and reuses it across the three FMAs (and in fact uses all 16 ymm registers). However, compiling with GCC 8.1 or newer gives:

---
.L3:
        vbroadcastsd    ymm2, QWORD PTR [rcx]
        lea     rax, [r8+rcx]
        add     rcx, 8
        vbroadcastsd    ymm1, QWORD PTR [rax]
        vbroadcastsd    ymm0, QWORD PTR [rax+r8]
        vfmadd231pd     ymm14, ymm2, YMMWORD PTR [rsi]
        vfmadd231pd     ymm10, ymm1, YMMWORD PTR [rsi]
        vfmadd231pd     ymm6, ymm0, YMMWORD PTR [rsi]
        vfmadd231pd     ymm13, ymm2, YMMWORD PTR [rsi+32]
        vfmadd231pd     ymm9, ymm1, YMMWORD PTR [rsi+32]
        vfmadd231pd     ymm5, ymm0, YMMWORD PTR [rsi+32]
        vfmadd231pd     ymm12, ymm2, YMMWORD PTR [rsi+64]
        vfmadd231pd     ymm8, ymm1, YMMWORD PTR [rsi+64]
        vfmadd231pd     ymm4, ymm0, YMMWORD PTR [rsi+64]
        vfmadd231pd     ymm11, ymm2, YMMWORD PTR [rsi+96]
        vfmadd231pd     ymm7, ymm1, YMMWORD PTR [rsi+96]
        vfmadd231pd     ymm3, ymm0, YMMWORD PTR [rsi+96]
        add     rsi, rdx
        cmp     rdi, rcx
        jne     .L3
---

This code has half the throughput of the GCC 7.3 version on both my i9-7900X and NERSC's Xeon E5-2698 v3. Enabling -mtune=skylake "fixes" the problem, but it isn't clear why it does, or how this code could be written so that it is more robust to compiler changes. The intrinsics are supposed to map to the corresponding assembly instructions, no?

Here are some Compiler Explorer links showing the behavior:

[GCC 7.3]                 https://gcc.godbolt.org/z/nLHD47
[GCC 8.1]                 https://gcc.godbolt.org/z/6EEt2N
[GCC 8.1 -mtune=skylake]  https://gcc.godbolt.org/z/XGZKtX
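
For convenience, below is a stand-alone sketch of a compilable translation unit for reproducing this locally. The inner loop is taken verbatim from the report; the function name kernel_16x3, its signature, the accumulator setup, and the final stores to c are illustrative assumptions (the original surrounding code is not shown in the report), so only the generated inner loop is meaningful for comparison.

---
/* Hypothetical stand-alone reproduction harness for the reported codegen
 * difference.  Only the k-loop body comes from the report; everything else
 * (name, signature, zero-initialization, write-back) is an assumed wrapper.
 *
 * Compare the inner loop in the output of, e.g.:
 *   gcc-7 -O3 -march=haswell -S kernel.c
 *   gcc-8 -O3 -march=haswell -S kernel.c
 *   gcc-8 -O3 -march=haswell -mtune=skylake -S kernel.c
 */
#include <immintrin.h>

void kernel_16x3(const double *a, long lda,
                 const double *b, long ldb,
                 double *c, int n)
{
    /* 12 accumulators: 4 vectors of a (16 doubles) x 3 columns of b. */
    __m256d ymm0  = _mm256_setzero_pd(), ymm1  = _mm256_setzero_pd(),
            ymm2  = _mm256_setzero_pd(), ymm3  = _mm256_setzero_pd(),
            ymm4  = _mm256_setzero_pd(), ymm5  = _mm256_setzero_pd(),
            ymm6  = _mm256_setzero_pd(), ymm7  = _mm256_setzero_pd(),
            ymm8  = _mm256_setzero_pd(), ymm9  = _mm256_setzero_pd(),
            ymm10 = _mm256_setzero_pd(), ymm11 = _mm256_setzero_pd();
    __m256d ymm12, ymm13, ymm14, ymm15;

    for (int k = 0; k < n; ++k) {
        ymm12 = _mm256_broadcast_sd(&b[k]);
        ymm13 = _mm256_broadcast_sd(&b[k + ldb]);
        ymm14 = _mm256_broadcast_sd(&b[k + 2 * ldb]);

        ymm15 = _mm256_loadu_pd(&a[k * lda]);
        ymm0 = _mm256_fmadd_pd(ymm15, ymm12, ymm0);
        ymm4 = _mm256_fmadd_pd(ymm15, ymm13, ymm4);
        ymm8 = _mm256_fmadd_pd(ymm15, ymm14, ymm8);

        ymm15 = _mm256_loadu_pd(&a[4 + k * lda]);
        ymm1 = _mm256_fmadd_pd(ymm15, ymm12, ymm1);
        ymm5 = _mm256_fmadd_pd(ymm15, ymm13, ymm5);
        ymm9 = _mm256_fmadd_pd(ymm15, ymm14, ymm9);

        ymm15 = _mm256_loadu_pd(&a[8 + k * lda]);
        ymm2 = _mm256_fmadd_pd(ymm15, ymm12, ymm2);
        ymm6 = _mm256_fmadd_pd(ymm15, ymm13, ymm6);
        ymm10 = _mm256_fmadd_pd(ymm15, ymm14, ymm10);

        ymm15 = _mm256_loadu_pd(&a[12 + k * lda]);
        ymm3 = _mm256_fmadd_pd(ymm15, ymm12, ymm3);
        ymm7 = _mm256_fmadd_pd(ymm15, ymm13, ymm7);
        ymm11 = _mm256_fmadd_pd(ymm15, ymm14, ymm11);
    }

    /* Illustrative write-back so the accumulators are not optimized away. */
    _mm256_storeu_pd(&c[0],  ymm0);  _mm256_storeu_pd(&c[4],  ymm1);
    _mm256_storeu_pd(&c[8],  ymm2);  _mm256_storeu_pd(&c[12], ymm3);
    _mm256_storeu_pd(&c[16], ymm4);  _mm256_storeu_pd(&c[20], ymm5);
    _mm256_storeu_pd(&c[24], ymm6);  _mm256_storeu_pd(&c[28], ymm7);
    _mm256_storeu_pd(&c[32], ymm8);  _mm256_storeu_pd(&c[36], ymm9);
    _mm256_storeu_pd(&c[40], ymm10); _mm256_storeu_pd(&c[44], ymm11);
}
---

Comparing the -S output of the three builds should show the same difference as the Godbolt links above: separate vmovupd loads reused by three FMAs versus the loads folded into the vfmadd231pd memory operands.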