https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136

            Bug ID: 82136
           Summary: x86: -mavx256-split-unaligned-load should try to fold
                    other shuffles into the load/vinsertf128
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

static const int aligned = 0;

void pairs_double(double blocks[]) {
    if(aligned) blocks = __builtin_assume_aligned(blocks, 64);
    for (int i = 0 ; i<10240 ; i+=2) {
        double x = blocks[i];
        double y = blocks[i+1];
        blocks[i] = x*y;
        blocks[i+1] = x+y;
    }
}

i.e. load a pair of 64-bit elements and replace them with two different
combinations of the pair (their product and their sum).

https://godbolt.org/g/Y9eJF3  (also includes a uint64_t version that shuffles
similarly with AVX2).  See
https://stackoverflow.com/questions/46038401/unpack-m128i-m256i-to-m64-mmx-sse2-avx2
for the original integer question.


GCC autovectorizes this poorly with AVX, and *very* poorly in the unaligned
case with split loads/stores.  It's doing two movupd/vinsertf128 pairs to
emulate two unaligned ymm loads, and then shuffling with vinsertf128 /
vperm2f128.  Doing the split loads from non-contiguous addresses instead
would definitely allow dropping at least one shuffle.
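
Roughly what I'd hope for, as a hand-written intrinsics sketch (untested, and
the function name is just for illustration): load the 128-bit halves of each
ymm from non-contiguous addresses so the vinsertf128 does the lane-crossing
part of the deinterleave for free, and mirror that with split stores on the
way out.  The only remaining shuffles are in-lane vunpckl/hpd.

#include <immintrin.h>

void pairs_double_manual(double blocks[]) {
    for (int i = 0 ; i<10240 ; i+=8) {
        /* split loads from non-contiguous addresses: each ymm holds two
           (x,y) pairs, one per 128-bit lane */
        __m256d p04 = _mm256_castpd128_pd256(_mm_loadu_pd(blocks + i));
        p04 = _mm256_insertf128_pd(p04, _mm_loadu_pd(blocks + i + 4), 1); /* b0 b1 | b4 b5 */
        __m256d p26 = _mm256_castpd128_pd256(_mm_loadu_pd(blocks + i + 2));
        p26 = _mm256_insertf128_pd(p26, _mm_loadu_pd(blocks + i + 6), 1); /* b2 b3 | b6 b7 */

        __m256d x = _mm256_unpacklo_pd(p04, p26);    /* b0 b2 | b4 b6 */
        __m256d y = _mm256_unpackhi_pd(p04, p26);    /* b1 b3 | b5 b7 */
        __m256d prod = _mm256_mul_pd(x, y);
        __m256d sum  = _mm256_add_pd(x, y);

        /* re-interleave in-lane, then split the stores to non-contiguous
           addresses, mirroring the loads */
        __m256d lo = _mm256_unpacklo_pd(prod, sum);  /* b0*b1 b0+b1 | b4*b5 b4+b5 */
        __m256d hi = _mm256_unpackhi_pd(prod, sum);  /* b2*b3 b2+b3 | b6*b7 b6+b7 */
        _mm_storeu_pd(blocks + i,     _mm256_castpd256_pd128(lo));
        _mm_storeu_pd(blocks + i + 4, _mm256_extractf128_pd(lo, 1));
        _mm_storeu_pd(blocks + i + 2, _mm256_castpd256_pd128(hi));
        _mm_storeu_pd(blocks + i + 6, _mm256_extractf128_pd(hi, 1));
    }
}

That's 4 in-lane shuffles per 8 doubles (plus the split loads/stores that the
tuning option wants anyway), instead of the vinsertf128 / vperm2f128 /
vshufpd chains in the output below.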

With gcc 8.0.0 20170907, -xc -std=gnu11 -O3 -Wall -march=sandybridge:

pairs_double:
        leaq    81920(%rdi), %rax
.L2:
        vmovupd (%rdi), %xmm1
        vinsertf128     $0x1, 16(%rdi), %ymm1, %ymm1
        addq    $64, %rdi
        vmovupd -32(%rdi), %xmm2
        vinsertf128     $0x1, -16(%rdi), %ymm2, %ymm2
        vinsertf128     $1, %xmm2, %ymm1, %ymm0
        vperm2f128      $49, %ymm2, %ymm1, %ymm1
        vunpcklpd       %ymm1, %ymm0, %ymm2
        vunpckhpd       %ymm1, %ymm0, %ymm0
        vmulpd  %ymm2, %ymm0, %ymm1
        vaddpd  %ymm2, %ymm0, %ymm0
        vinsertf128     $1, %xmm1, %ymm1, %ymm2
        vperm2f128      $49, %ymm1, %ymm1, %ymm1
        vinsertf128     $1, %xmm0, %ymm0, %ymm3
        vperm2f128      $49, %ymm0, %ymm0, %ymm0
        vshufpd $12, %ymm3, %ymm2, %ymm2
        vshufpd $12, %ymm0, %ymm1, %ymm0
        vmovups %xmm2, -64(%rdi)
        vextractf128    $0x1, %ymm2, -48(%rdi)
        vextractf128    $0x1, %ymm0, -16(%rdi)
        vmovups %xmm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L2
        vzeroupper
        ret


This is obviously pretty horrible, with far too much shuffling just to set up
for a mul and an add.  Things aren't bad with -mprefer-avx128 and aligned
pointers (4x vunpckl/h per 2 vectors), which should be good on Ryzen or
SnB-family CPUs, but maybe not faster than scalar on Haswell, with its
limited shuffle throughput.  (Same goes for the uint64_t version of the same
thing.)
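
For reference, that 4x vunpckl/h per 2 vectors pattern looks something like
this in 128-bit form (untested sketch with SSE2 intrinsics; the function name
is just for illustration):

#include <emmintrin.h>

void pairs_double_128(double blocks[]) {
    for (int i = 0 ; i<10240 ; i+=4) {
        __m128d v0 = _mm_loadu_pd(blocks + i);      /* b0 b1 */
        __m128d v1 = _mm_loadu_pd(blocks + i + 2);  /* b2 b3 */
        __m128d x  = _mm_unpacklo_pd(v0, v1);       /* b0 b2 */
        __m128d y  = _mm_unpackhi_pd(v0, v1);       /* b1 b3 */
        __m128d prod = _mm_mul_pd(x, y);
        __m128d sum  = _mm_add_pd(x, y);
        _mm_storeu_pd(blocks + i,     _mm_unpacklo_pd(prod, sum)); /* b0*b1 b0+b1 */
        _mm_storeu_pd(blocks + i + 2, _mm_unpackhi_pd(prod, sum)); /* b2*b3 b2+b3 */
    }
}

Two unpacks on each side of the mul/add per two pairs, with no lane-crossing
at all.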
