https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68655
--- Comment #8 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 3 Dec 2015, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68655
>
> --- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> I guess it needs analysis.
> Some examples of changes:
> vshuf-v16qi.c -msse2 test_2, scalar code vs. punpcklqdq, clear win
> vshuf-v16qi.c -msse4 test_2, pshufb -> punpcklqdq (is this a win or not?)
> (similarly for -mavx, -mavx2, -mavx512f, -mavx512bw)
> vshuf-v16si.c -mavx512{f,bw} test_2:
> -       vpermi2d        %zmm1, %zmm1, %zmm0
> +       vmovdqa64       .LC2(%rip), %zmm0
> +       vpermi2q        %zmm1, %zmm1, %zmm0
> looks like pessimization.
> vshuf-v32hi.c -mavx512bw test_2, similar pessimization.
> vshuf-v32hi.c -mavx512bw test_2, similarly:
> -       vpermi2w        %zmm1, %zmm1, %zmm0
> +       vmovdqa64       .LC2(%rip), %zmm0
> +       vpermi2q        %zmm1, %zmm1, %zmm0
> vshuf-v4si.c -msse2 test_183, another pessimization:
> -       pshufd  $78, %xmm0, %xmm1
> +       movdqa  %xmm0, %xmm1
>         movd    b(%rip), %xmm4
>         pshufd  $255, %xmm0, %xmm2
> +       shufpd  $1, %xmm0, %xmm1
> vshuf-v4si.c -msse4 test_183, another pessimization:
> -       pshufd  $78, %xmm1, %xmm0
> +       movdqa  %xmm1, %xmm0
> +       palignr $8, %xmm0, %xmm0
> vshuf-v4si.c -mavx test_183:
> -       vpshufd $78, %xmm1, %xmm0
> +       vpalignr        $8, %xmm1, %xmm1, %xmm0
> vshuf-v64qi.c -mavx512bw, desirable change:
> -       vpermi2w        %zmm1, %zmm1, %zmm0
> -       vpshufb .LC3(%rip), %zmm0, %zmm1
> -       vpshufb .LC4(%rip), %zmm0, %zmm0
> -       vporq   %zmm0, %zmm1, %zmm0
> +       vpermi2q        %zmm1, %zmm1, %zmm0
> vshuf-v8hi.c -msse2 test_1 another scalar to punpcklqdq, win
> vshuf-v8hi.c -msse4 test_2 (supposedly a win):
> -       pshufb  .LC3(%rip), %xmm0
> +       punpcklqdq      %xmm0, %xmm0
> vshuf-v8hi.c -mavx test_2, similarly:
> -       vpshufb .LC3(%rip), %xmm0, %xmm0
> +       vpunpcklqdq     %xmm0, %xmm0, %xmm0
> vshuf-v8si.c -mavx2 test_2, another win:
> -       vmovdqa a(%rip), %ymm0
> -       vperm2i128      $0, %ymm0, %ymm0, %ymm0
> +       vpermq  $68, a(%rip), %ymm0
> vshuf-v8si.c -mavx2 test_5, another win:
> -       vmovdqa .LC6(%rip), %ymm0
> -       vmovdqa .LC7(%rip), %ymm1
> -       vmovdqa %ymm0, -48(%rbp)
>         vmovdqa a(%rip), %ymm0
> -       vpermd  %ymm0, %ymm1, %ymm1
> -       vpshufb .LC8(%rip), %ymm0, %ymm3
> -       vpshufb .LC10(%rip), %ymm0, %ymm0
> -       vmovdqa %ymm1, c(%rip)
> -       vmovdqa b(%rip), %ymm1
> -       vpermq  $78, %ymm3, %ymm3
> -       vpshufb .LC9(%rip), %ymm1, %ymm2
> -       vpshufb .LC11(%rip), %ymm1, %ymm1
> -       vpor    %ymm3, %ymm0, %ymm0
> -       vpermq  $78, %ymm2, %ymm2
> -       vpor    %ymm2, %ymm1, %ymm1
> -       vpor    %ymm1, %ymm0, %ymm0
> +       vmovdqa .LC7(%rip), %ymm2
> +       vmovdqa .LC6(%rip), %ymm1
> +       vpermd  %ymm0, %ymm2, %ymm2
> +       vpermd  b(%rip), %ymm1, %ymm3
> +       vmovdqa %ymm1, -48(%rbp)
> +       vmovdqa %ymm2, c(%rip)
> +       vpermd  %ymm0, %ymm1, %ymm0
> +       vmovdqa .LC8(%rip), %ymm2
> +       vpand   %ymm2, %ymm1, %ymm1
> +       vpcmpeqd        %ymm2, %ymm1, %ymm1
> +       vpblendvb       %ymm1, %ymm3, %ymm0, %ymm0
> vshuf-v8si.c -mavx512f test_2, another win?
> -       vmovdqa a(%rip), %ymm0
> -       vperm2i128      $0, %ymm0, %ymm0, %ymm0
> +       vpermq  $68, a(%rip), %ymm0
>
> The above does not list all changes, I've been often ignoring further changes
> in the file if say one change adds or removes a .LC*, then everything else is
> renumbered (and doesn't sometimes list cases where the same or similar change
> appears with multiple ISAs).  So the results are clearly mixed.
>
> Perhaps I should just try doing this at the end of expand_vec_perm_1 (i.e. if
> we (most likely) couldn't get a single insn normally, see if we would get it
> otherwise), and at the end of ix86_expand_vec_perm_const_1 (as the fallback
> after all sequences).

Yeah, I would have done it only if we fail to permute, not generally.

I think you need to stop at 16 byte boundaries (TImode) only for AVX256
and 32 byte (OImode) for AVX512.  Not sure if there are cases where
an "effective" DImode permute works with SImode but not DImode, say
{ 4, 5, 6, 7, 0, 1, 2, 3 } HImode can be done with both an SImode
{ 2, 3, 0, 1 } or a DImode { 1, 0 } permute.

> It won't catch some beneficial one insn to one insn
> changes (e.g. where in the original case the insn needs a constant operand in
> memory) though.

True.  I fear that at some point we want a generator covering all
possible permutes using permute patterns (input would be the .md file
and a list of insns to consider - or maybe even autodetect those).

The code handling permutation is already quite unwieldy (and it
tries generating RTL ...) :/
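
For reference, the HImode -> SImode -> DImode example above can be checked mechanically. The following is only a sketch in plain C, not code from the i386 backend: widen_perm is a hypothetical helper name, and it assumes the selector is a flat array of element indices. A selector can be rewritten for elements twice as wide only if each aligned pair of indices selects an aligned, adjacent pair of source elements.

```c
#include <stddef.h>

/* Hypothetical helper (illustration only, not GCC code): try to express
   a permutation of NELT narrow elements as a permutation of NELT/2
   elements that are twice as wide.  This works only if every aligned
   pair of selector indices picks an aligned pair of adjacent source
   elements.  On success, write NELT/2 indices to OUT and return 1;
   otherwise return 0 (OUT may be partially written).  */
int
widen_perm (const unsigned *perm, size_t nelt, unsigned *out)
{
  if (nelt < 2 || nelt % 2 != 0)
    return 0;

  for (size_t i = 0; i < nelt; i += 2)
    {
      /* The first index of the pair must be even, and the second must
         select the element immediately following it.  */
      if (perm[i] % 2 != 0 || perm[i + 1] != perm[i] + 1)
        return 0;
      out[i / 2] = perm[i] / 2;
    }
  return 1;
}
```

Applied to { 4, 5, 6, 7, 0, 1, 2, 3 } this yields { 2, 3, 0, 1 }, and applying it again yields { 1, 0 }; a third application fails because index 1 is odd. The sketch deliberately knows nothing about the 16 byte / 32 byte stopping points mentioned above, so a caller would have to bound the widening itself.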