https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428

--- Comment #6 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #4)
> 
> while the lack of cross-lane shuffles in AVX2 requires a
> 
> .L3:
>         vmovupd (%rsi,%rax), %xmm5
>         vmovupd 32(%rsi,%rax), %xmm6
>         vinsertf128     $0x1, 16(%rsi,%rax), %ymm5, %ymm1
>         vinsertf128     $0x1, 48(%rsi,%rax), %ymm6, %ymm3
>         vmovupd (%rcx,%rax), %xmm7
>         vmovupd 32(%rcx,%rax), %xmm5
>         vinsertf128     $0x1, 16(%rcx,%rax), %ymm7, %ymm0
>         vinsertf128     $0x1, 48(%rcx,%rax), %ymm5, %ymm2
>         vunpcklpd       %ymm3, %ymm1, %ymm4
>         vunpckhpd       %ymm3, %ymm1, %ymm1
>         vpermpd $216, %ymm4, %ymm4
>         vpermpd $216, %ymm1, %ymm1
>         vmovupd %xmm4, (%rdi,%rax,2)
>         vextractf128    $0x1, %ymm4, 16(%rdi,%rax,2)
>         vmovupd %xmm1, 32(%rdi,%rax,2)
>         vextractf128    $0x1, %ymm1, 48(%rdi,%rax,2)
>         vunpcklpd       %ymm2, %ymm0, %ymm1
>         vunpckhpd       %ymm2, %ymm0, %ymm0
>         vpermpd $216, %ymm1, %ymm1
>         vpermpd $216, %ymm0, %ymm0
>         vmovupd %xmm1, 64(%rdi,%rax,2)
>         vextractf128    $0x1, %ymm1, 80(%rdi,%rax,2)
>         vextractf128    $0x1, %ymm0, 112(%rdi,%rax,2)
>         vmovupd %xmm0, 96(%rdi,%rax,2)
>         addq    $64, %rax
>         cmpq    %rax, %rdx
>         jne     .L3
> 

I don't follow. AVX2 indeed can't do the transpose with 4 shuffles per loop
iteration, but it is fully capable of doing it with 8 shuffles per iteration,
as you demonstrated a few posts above:
.L8:
        vmovupd (%rdi), %ymm1
        vmovupd 32(%rdi), %ymm4
        vmovupd (%rax), %ymm0
        vmovupd 32(%rax), %ymm3
        vunpcklpd       %ymm4, %ymm1, %ymm2
        vunpckhpd       %ymm4, %ymm1, %ymm1
        vpermpd $216, %ymm1, %ymm1
        vmovupd %ymm1, 32(%r8)
        vunpcklpd       %ymm3, %ymm0, %ymm1
        vunpckhpd       %ymm3, %ymm0, %ymm0
        vpermpd $216, %ymm2, %ymm2
        vpermpd $216, %ymm1, %ymm1
        vpermpd $216, %ymm0, %ymm0
        addq    $64, %rdi
        vmovupd %ymm2, (%r8)
        vmovupd %ymm1, 64(%r8)
        vmovupd %ymm0, 96(%r8)
        addq    $64, %rax
        subq    $-128, %r8
        cmpq    %rdi, %rdx
        jne     .L8

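For reference, a minimal C intrinsics sketch of the same shuffle pattern (not
the PR's test case; split_even_odd is a hypothetical name used only for
illustration): per 64 input bytes it spends 2x unpack + 2x vpermpd, i.e. the
.L8 loop above is this applied to two input streams per iteration, giving the
8 shuffles in question.

#include <immintrin.h>

/* Hypothetical helper: write src[0,2,4,...] to even[] and src[1,3,5,...]
   to odd[].  n is assumed to be a multiple of 8.  */
static void split_even_odd(double *even, double *odd,
                           const double *src, long n)
{
  for (long i = 0; i < n; i += 8) {
    __m256d v0 = _mm256_loadu_pd(src + i);        /* s0 s1 s2 s3 */
    __m256d v1 = _mm256_loadu_pd(src + i + 4);    /* s4 s5 s6 s7 */
    __m256d lo = _mm256_unpacklo_pd(v0, v1);      /* s0 s4 s2 s6 */
    __m256d hi = _mm256_unpackhi_pd(v0, v1);      /* s1 s5 s3 s7 */
    /* 0xd8 == 216, i.e. lane order 0,2,1,3 -- the cross-lane fixup
       that the vpermpd $216 instructions above perform.  */
    __m256d ev = _mm256_permute4x64_pd(lo, 0xd8); /* s0 s2 s4 s6 */
    __m256d od = _mm256_permute4x64_pd(hi, 0xd8); /* s1 s3 s5 s7 */
    _mm256_storeu_pd(even + i / 2, ev);
    _mm256_storeu_pd(odd + i / 2, od);
  }
}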