https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136
Bug ID: 82136
Summary: x86: -mavx256-split-unaligned-load should try to fold other
         shuffles into the load/vinsertf128
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

static const int aligned = 0;
void pairs_double(double blocks[]) {
    if (aligned)
        blocks = __builtin_assume_aligned(blocks, 64);
    for (int i = 0; i < 10240; i += 2) {
        double x = blocks[i];
        double y = blocks[i+1];
        blocks[i]   = x*y;
        blocks[i+1] = x+y;
    }
}

i.e. load a pair of 64-bit elements and replace them with two different
combinations of the pair.

https://godbolt.org/g/Y9eJF3 (also includes a uint64_t version that shuffles
similarly with AVX2).  See
https://stackoverflow.com/questions/46038401/unpack-m128i-m256i-to-m64-mmx-sse2-avx2
for the original integer question.

GCC auto-vectorizes this poorly with AVX, and *very* poorly in the unaligned
case with split loads/stores: it emits two vmovupd/vinsertf128 pairs to
emulate unaligned ymm loads, then shuffles with vinsertf128 / vperm2f128.
Using split loads from non-contiguous addresses would allow dropping at
least one shuffle.

With gcc 8.0.0 20170907, -xc -std=gnu11 -O3 -Wall -march=sandybridge:

pairs_double:
        leaq    81920(%rdi), %rax
.L2:
        vmovupd (%rdi), %xmm1
        vinsertf128     $0x1, 16(%rdi), %ymm1, %ymm1
        addq    $64, %rdi
        vmovupd -32(%rdi), %xmm2
        vinsertf128     $0x1, -16(%rdi), %ymm2, %ymm2
        vinsertf128     $1, %xmm2, %ymm1, %ymm0
        vperm2f128      $49, %ymm2, %ymm1, %ymm1
        vunpcklpd       %ymm1, %ymm0, %ymm2
        vunpckhpd       %ymm1, %ymm0, %ymm0
        vmulpd  %ymm2, %ymm0, %ymm1
        vaddpd  %ymm2, %ymm0, %ymm0
        vinsertf128     $1, %xmm1, %ymm1, %ymm2
        vperm2f128      $49, %ymm1, %ymm1, %ymm1
        vinsertf128     $1, %xmm0, %ymm0, %ymm3
        vperm2f128      $49, %ymm0, %ymm0, %ymm0
        vshufpd $12, %ymm3, %ymm2, %ymm2
        vshufpd $12, %ymm0, %ymm1, %ymm0
        vmovups %xmm2, -64(%rdi)
        vextractf128    $0x1, %ymm2, -48(%rdi)
        vextractf128    $0x1, %ymm0, -16(%rdi)
        vmovups %xmm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L2
        vzeroupper
        ret

This is pretty horrible: far too much shuffling just to set up for a mul and
an add.  Things are not bad with -mprefer-avx128 and aligned pointers
(4x vunpckl/h per 2 vectors), which should be good on Ryzen or SnB-family,
but maybe not faster than scalar on Haswell, where shuffle throughput is
limited.  (Same for the uint64_t version of the same thing.)
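
For reference, a minimal intrinsics sketch of how little shuffling the
per-pair computation actually needs: one in-lane vpermilpd plus one vblendpd
per vector, with no lane-crossing shuffles at all.  The function name
pairs_double_avx and the plain unaligned 256-bit loads/stores are my own
illustration, not a proposal for exactly what the vectorizer should emit:

    #include <immintrin.h>

    /* Hand-written AVX version of the same loop (illustration only).
     * Each 256-bit vector holds two (x, y) pairs; swap the elements within
     * each pair, compute x*y and x+y, and blend the results back together. */
    void pairs_double_avx(double blocks[])
    {
        for (int i = 0; i < 10240; i += 4) {
            __m256d v    = _mm256_loadu_pd(&blocks[i]);     /* [x0, y0, x1, y1]            */
            __m256d swap = _mm256_permute_pd(v, 0x5);       /* [y0, x0, y1, x1], in-lane   */
            __m256d mul  = _mm256_mul_pd(v, swap);          /* x*y in both elements of a pair */
            __m256d add  = _mm256_add_pd(v, swap);          /* x+y in both elements of a pair */
            __m256d res  = _mm256_blend_pd(mul, add, 0xA);  /* [x0*y0, x0+y0, x1*y1, x1+y1] */
            _mm256_storeu_pd(&blocks[i], res);
        }
    }

Because the blend picks the mul result for even elements and the add result
for odd elements, everything stays within 128-bit lanes, so no vperm2f128 or
vextractf128 is needed.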