https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428
--- Comment #6 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #4)
> while the lack of cross-lane shuffles in AVX2 requires a
>
> .L3:
>         vmovupd (%rsi,%rax), %xmm5
>         vmovupd 32(%rsi,%rax), %xmm6
>         vinsertf128 $0x1, 16(%rsi,%rax), %ymm5, %ymm1
>         vinsertf128 $0x1, 48(%rsi,%rax), %ymm6, %ymm3
>         vmovupd (%rcx,%rax), %xmm7
>         vmovupd 32(%rcx,%rax), %xmm5
>         vinsertf128 $0x1, 16(%rcx,%rax), %ymm7, %ymm0
>         vinsertf128 $0x1, 48(%rcx,%rax), %ymm5, %ymm2
>         vunpcklpd %ymm3, %ymm1, %ymm4
>         vunpckhpd %ymm3, %ymm1, %ymm1
>         vpermpd $216, %ymm4, %ymm4
>         vpermpd $216, %ymm1, %ymm1
>         vmovupd %xmm4, (%rdi,%rax,2)
>         vextractf128 $0x1, %ymm4, 16(%rdi,%rax,2)
>         vmovupd %xmm1, 32(%rdi,%rax,2)
>         vextractf128 $0x1, %ymm1, 48(%rdi,%rax,2)
>         vunpcklpd %ymm2, %ymm0, %ymm1
>         vunpckhpd %ymm2, %ymm0, %ymm0
>         vpermpd $216, %ymm1, %ymm1
>         vpermpd $216, %ymm0, %ymm0
>         vmovupd %xmm1, 64(%rdi,%rax,2)
>         vextractf128 $0x1, %ymm1, 80(%rdi,%rax,2)
>         vextractf128 $0x1, %ymm0, 112(%rdi,%rax,2)
>         vmovupd %xmm0, 96(%rdi,%rax,2)
>         addq $64, %rax
>         cmpq %rax, %rdx
>         jne .L3

I don't follow. AVX2 indeed can't do this transpose with 4 shuffles per loop
iteration, but it is fully capable of doing it with 8 shuffles per iteration,
as you demonstrated a few posts above:

.L8:
        vmovupd (%rdi), %ymm1
        vmovupd 32(%rdi), %ymm4
        vmovupd (%rax), %ymm0
        vmovupd 32(%rax), %ymm3
        vunpcklpd %ymm4, %ymm1, %ymm2
        vunpckhpd %ymm4, %ymm1, %ymm1
        vpermpd $216, %ymm1, %ymm1
        vmovupd %ymm1, 32(%r8)
        vunpcklpd %ymm3, %ymm0, %ymm1
        vunpckhpd %ymm3, %ymm0, %ymm0
        vpermpd $216, %ymm2, %ymm2
        vpermpd $216, %ymm1, %ymm1
        vpermpd $216, %ymm0, %ymm0
        addq $64, %rdi
        vmovupd %ymm2, (%r8)
        vmovupd %ymm1, 64(%r8)
        vmovupd %ymm0, 96(%r8)
        addq $64, %rax
        subq $-128, %r8
        cmpq %rdi, %rdx
        jne .L8
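
For reference, the .L8 loop roughly corresponds to the AVX2 intrinsics below.
This is a minimal sketch of the same 8-shuffles-per-iteration pattern (4
in-lane unpacks + 4 cross-lane vpermpd per two input vectors); the function
name, parameter names, and the framing of the data as even/odd-interleaved
doubles are my own illustration, not taken from the testcase in this PR:

#include <immintrin.h>

/* Illustrative kernel: per iteration, split 8 doubles from each of two
   streams into even-indexed elements followed by odd-indexed elements.
   n is assumed to be a multiple of 8.  */
static void deinterleave2(const double *a, const double *b,
                          double *out, long n)
{
  for (long i = 0; i < n; i += 8) {
    __m256d al = _mm256_loadu_pd(a + i);      /* a[i+0..i+3] */
    __m256d ah = _mm256_loadu_pd(a + i + 4);  /* a[i+4..i+7] */
    __m256d bl = _mm256_loadu_pd(b + i);
    __m256d bh = _mm256_loadu_pd(b + i + 4);

    /* vunpcklpd/vunpckhpd: in-lane interleave of the two halves.  */
    __m256d ae = _mm256_unpacklo_pd(al, ah);  /* a[i+0] a[i+4] a[i+2] a[i+6] */
    __m256d ao = _mm256_unpackhi_pd(al, ah);  /* a[i+1] a[i+5] a[i+3] a[i+7] */
    __m256d be = _mm256_unpacklo_pd(bl, bh);
    __m256d bo = _mm256_unpackhi_pd(bl, bh);

    /* vpermpd $216: 216 == 0xD8 selects elements {0,2,1,3}, the one
       cross-lane step needed to put each quadruple back in order.  */
    ae = _mm256_permute4x64_pd(ae, 0xD8);     /* a[i+0] a[i+2] a[i+4] a[i+6] */
    ao = _mm256_permute4x64_pd(ao, 0xD8);     /* a[i+1] a[i+3] a[i+5] a[i+7] */
    be = _mm256_permute4x64_pd(be, 0xD8);
    bo = _mm256_permute4x64_pd(bo, 0xD8);

    _mm256_storeu_pd(out + 2*i,      ae);
    _mm256_storeu_pd(out + 2*i + 4,  ao);
    _mm256_storeu_pd(out + 2*i + 8,  be);
    _mm256_storeu_pd(out + 2*i + 12, bo);
  }
}

Comparing the two listings, the shuffle work is identical (4 unpacks and 4
vpermpd per iteration in both); the .L3 version merely spends 4 extra
vinsertf128 and 4 extra vextractf128 on split 128-bit loads and stores where
.L8 uses plain 256-bit vmovupd.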