PING^7: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
On Thu, Feb 6, 2020 at 8:17 PM H.J. Lu wrote:
> On Mon, Jan 27, 2020 at 10:59 AM H.J. Lu wrote:
> > On Mon, Jul 8, 2019 at 8:19 AM H.J. Lu wrote:
> > > On Tue, Jun 18, 2019 at 8:59 AM H.J. Lu wrote:
> > > > On Fri, May 31, 2019 at 10:38 AM H.J. Lu wrote:
> > > > > On Tue, May 21, 2019 at 2:43 PM H.J. Lu wrote:
> > > > > > On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu wrote:

> Hi Jan, Uros,
>
> This patch fixes the wrong code bug:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229
>
> Tested on AVX2 and AVX512 with and without --with-arch=native.
>
> OK for trunk?
>
> Thanks.
>
> H.J.
> --
> The i386 backend has
>
> INT_MODE (OI, 32);
> INT_MODE (XI, 64);
>
> So XImode represents 64 INTEGER bytes = 64 * 8 = 512-bit operation;
> in case of const_1, all 512 bits are set.
>
> We can load zeros with a narrower instruction (e.g. 256-bit via the
> inherent zeroing of the high part in case of a 128-bit xor), so TImode
> in this case.
>
> Some targets prefer V4SF mode, so they will emit float xorps for zeroing.
>
> sse.md has
>
> (define_insn "mov<mode>_internal"
>   [(set (match_operand:VMOVE 0 "nonimmediate_operand"
>          "=v,v ,v ,m")
>         (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
>          " C,BC,vm,v"))]
>
>   /* There is no evex-encoded vmov* for sizes smaller than 64-bytes
>      in avx512f, so we need to use workarounds, to access sse registers
>      16-31, which are evex-only.  In avx512vl we don't need
>      workarounds.  */
>   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
>       && (EXT_REX_SSE_REG_P (operands[0])
>           || EXT_REX_SSE_REG_P (operands[1])))
>     {
>       if (memory_operand (operands[0], <MODE>mode))
>         {
>           if (<MODE_SIZE> == 32)
>             return "vextract<shuffletype>64x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
>           else if (<MODE_SIZE> == 16)
>             return "vextract<shuffletype>32x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
>           else
>             gcc_unreachable ();
>         }
>       ...
>
> However, since ix86_hard_regno_mode_ok has
>
>   /* TODO check for QI/HI scalars.  */
>   /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
>   if (TARGET_AVX512VL
>       && (mode == OImode
>           || mode == TImode
>           || VALID_AVX256_REG_MODE (mode)
>           || VALID_AVX512VL_128_REG_MODE (mode)))
>     return true;
>
>   /* xmm16-xmm31 are only available for AVX-512.  */
>   if (EXT_REX_SSE_REGNO_P (regno))
>     return false;
>
> the check
>
>   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
>       && (EXT_REX_SSE_REG_P (operands[0])
>           || EXT_REX_SSE_REG_P (operands[1])))
>
> is dead code.
>
> Also, for
>
> long long *p;
> volatile __m256i yy;
>
> void
> foo (void)
> {
>   _mm256_store_epi64 (p, yy);
> }
>
> with AVX512VL, we should generate
>
> vmovdqa %ymm0, (%rax)
>
> not
>
> vmovdqa64 %ymm0, (%rax)
>
> All TYPE_SSEMOV vector moves are consolidated to ix86_output_ssemov:
>
> 1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX vector
>    moves will be generated.
> 2. If xmm16-xmm31/ymm16-ymm31 registers are used:
>    a. With AVX512VL, AVX512VL vector moves will be generated.
>    b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
>       moves will be done with zmm register moves.
>
> ext_sse_reg_operand is removed since it is no longer needed.
>
> Tested on AVX2 and AVX512 with and without --with-arch=native.
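The mnemonic-selection rules (1, 2a, 2b) above can be sketched as a small decision table. This is an illustrative helper invented here, not GCC's actual ix86_get_ssemov, which additionally handles float modes, scalar modes, and misaligned moves:

```c
#include <stdbool.h>

/* Hedged sketch of the mnemonic choice described above for an aligned
   integer vector move (128- or 256-bit):
   - no xmm16-31/ymm16-31 operand: plain VEX-encoded "vmovdqa",
   - extended register with AVX512VL: EVEX-encoded "vmovdqa64",
   - extended register without AVX512VL: fall back to a 512-bit
     zmm-to-zmm "vmovdqa64" (register operands only, since VEX
     encodings cannot reach registers 16-31).  */
static const char *
pick_vmovdqa (bool uses_ext_reg, bool has_avx512vl)
{
  if (!uses_ext_reg)
    return "vmovdqa";            /* Rule 1: SSE/AVX move.  */
  if (has_avx512vl)
    return "vmovdqa64";          /* Rule 2a: EVEX at natural width.  */
  return "vmovdqa64 (zmm)";      /* Rule 2b: widen to a zmm move.  */
}
```

The point of the patch is that rule 2a fires only when an extended register is actually present, so code that never touches xmm16-xmm31/ymm16-ymm31 keeps the shorter VEX encoding.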
PING^6: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
> gcc/
>
> 	PR target/89229
> 	PR target/89346
> 	* config/i386/i386-protos.h (ix86_output_ssemov): New
> 	prototype.
> 	* config/i386/i386.c (ix86_get_ssemov): New function.
> 	(ix86_output_ssemov): Likewise.
> 	* config/i386/i386.md (*movxi_internal_avx512f): Call
> 	ix86_output_ssemov for TYPE_SSEMOV.
> 	(*movoi_internal_avx): Call ix86_output_ssemov for
> 	TYPE_SSEMOV.
> 	Remove ext_sse_reg_operand and TARGET_AVX512VL check.
> 	(*movti_internal): Likewise.
> 	(*movdi_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> 	Remove ext_sse_reg_operand check.
> 	(*movsi_internal): Likewise.
> 	(*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> 	(*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> 	Remove TARGET_AVX512F, TARGET_PREFER_AVX256, TARGET_AVX512VL
> 	and ext_sse_reg_operand check.
> 	(*movsf_internal_avx): Call ix86_output_ssemov for
> 	TYPE_SSEMOV.
> 	Remove TARGET_PREFER_AVX256, TARGET_AVX512VL and
> 	ext_sse_reg_operand check.
> 	* config/i386/mmx.md (MMXMODE:*mov<mode>_internal): Call
> 	ix86_output_ssemov for TYPE_SSEMOV.  Remove
> 	ext_sse_reg_operand check.
> 	* config/i386/sse.md (VMOVE:mov<mode>_internal): Call
> 	ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL
> 	check.
> 	* config/i386/predicates.md (ext_sse_reg_operand): Removed.
PING^5: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
> gcc/testsuite/
>
> 	PR target/89229
> 	PR target/89346
> 	* gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated.
> 	* gcc.target/i386/pr89229-2a.c: New test.
> 	* gcc.target/i386/pr89229-2b.c: Likewise.
> 	* gcc.target/i386/pr89229-2c.c: Likewise.
> 	* gcc.target/i386/pr89229-3a.c: Likewise.
> 	* gcc.target/i386/pr89229-3b.c: Likewise.
> 	* gcc.target/i386/pr89229-3c.c: Likewise.
> 	* gcc.target/i386/pr89229-4a.c: Likewise.
> 	* gcc.target/i386/pr89229-4b.c: Likewise.
> 	* gcc.target/i386/pr89229-4c.c: Likewise.
> 	* gcc.target/i386/pr89229-5a.c: Likewise.
> 	* gcc.target/i386/pr89229-5b.c: Likewise.
> 	* gcc.target/i386/pr89229-5c.c: Likewise.
> 	* gcc.target/i386/pr89229-6a.c: Likewise.
> 	* gcc.target/i386/pr89229-6b.c: Likewise.
> 	* gcc.target/i386/pr89229-6c.c: Likewise.
> 	* gcc.target/i386/pr89229-7a.c: Likewise.
> 	* gcc.target/i386/pr89229-7b.c: Likewise.
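For the pre-AVX512VL store path that the patch removes, the sse.md workaround quoted earlier picks an extract of the low part of the containing zmm register, keyed on the mode size in bytes. A minimal sketch, where `workaround_store_insn` is an illustrative helper, not a GCC function (the real template's `<shuffletype>` iterator selects the `i`/`f` variant):

```c
#include <stddef.h>

/* Sketch of the old AVX512F-without-AVX512VL store workaround: there
   is no EVEX vmov* narrower than 64 bytes, so a store from an
   EVEX-only register (xmm16-31/ymm16-31) is emitted as an extract of
   the low 256 or 128 bits of the enclosing zmm register.  */
static const char *
workaround_store_insn (int mode_size)
{
  switch (mode_size)
    {
    case 32:   /* 256-bit mode: store the low ymm of the zmm.  */
      return "vextracti64x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
    case 16:   /* 128-bit mode: store the low xmm of the zmm.  */
      return "vextracti32x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
    default:   /* 64-byte modes have a real EVEX vmovdqa64.  */
      return NULL;
    }
}
```

As the thread argues, this branch was unreachable in practice: ix86_hard_regno_mode_ok never allocates 128/256-bit modes to xmm16-xmm31 without AVX512VL in the first place.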
Re: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
On 2/22/19 9:24 AM, H.J. Lu wrote:
PING^4: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
On Tue, Jun 18, 2019 at 8:59 AM H.J. Lu wrote:
PING^3: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
On Fri, May 31, 2019 at 10:38 AM H.J. Lu wrote:
PING^2: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
On Tue, May 21, 2019 at 2:43 PM H.J. Lu wrote:
PING^1: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu wrote:
[PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
Hi Jan, Uros,

This patch fixes the wrong-code bug:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229

Tested on AVX2 and AVX512 with and without --with-arch=native.

OK for trunk?

Thanks.

H.J.
--
The i386 backend has

INT_MODE (OI, 32);
INT_MODE (XI, 64);

So XImode represents 64 integer bytes = 64 * 8 = 512 bits per operation; in the case of const_1, all 512 bits are set.

We can load zeros with a narrower instruction (e.g. 256 bits via the inherent zeroing of the high part in the case of a 128-bit xor), so TImode in this case.

Some targets prefer V4SF mode, so they will emit a float xorps for zeroing.

sse.md has

(define_insn "mov<mode>_internal"
  [(set (match_operand:VMOVE 0 "nonimmediate_operand"
         "=v,v ,v ,m")
        (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
         " C,BC,vm,v"))]

  /* There is no evex-encoded vmov* for sizes smaller than 64-bytes
     in avx512f, so we need to use workarounds, to access sse registers
     16-31, which are evex-only.  In avx512vl we don't need workarounds.  */
  if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
      && (EXT_REX_SSE_REG_P (operands[0])
          || EXT_REX_SSE_REG_P (operands[1])))
    {
      if (memory_operand (operands[0], <MODE>mode))
        {
          if (<MODE_SIZE> == 32)
            return "vextract<shuffletype>64x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
          else if (<MODE_SIZE> == 16)
            return "vextract<shuffletype>32x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
          else
            gcc_unreachable ();
        }
      ...

However, since ix86_hard_regno_mode_ok has

  /* TODO check for QI/HI scalars.  */
  /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
  if (TARGET_AVX512VL
      && (mode == OImode
          || mode == TImode
          || VALID_AVX256_REG_MODE (mode)
          || VALID_AVX512VL_128_REG_MODE (mode)))
    return true;

  /* xmm16-xmm31 are only available for AVX-512.  */
  if (EXT_REX_SSE_REGNO_P (regno))
    return false;

the condition

  if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
      && (EXT_REX_SSE_REG_P (operands[0])
          || EXT_REX_SSE_REG_P (operands[1])))

is dead code.
Also for

long long *p;
volatile __m256i yy;

void
foo (void)
{
  _mm256_store_epi64 (p, yy);
}

with AVX512VL, we should generate

        vmovdqa    %ymm0, (%rax)

not

        vmovdqa64  %ymm0, (%rax)

All TYPE_SSEMOV vector moves are consolidated to ix86_output_ssemov:

1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX vector
   moves will be generated.
2. If xmm16-xmm31/ymm16-ymm31 registers are used:
   a. With AVX512VL, AVX512VL vector moves will be generated.
   b. Without AVX512VL, an xmm16-xmm31/ymm16-ymm31 register-to-register
      move will be done with a zmm register move.

ext_sse_reg_operand is removed since it is no longer needed.

Tested on AVX2 and AVX512 with and without --with-arch=native.

gcc/

	PR target/89229
	PR target/89346
	* config/i386/i386-protos.h (ix86_output_ssemov): New prototype.
	* config/i386/i386.c (ix86_get_ssemov): New function.
	(ix86_output_ssemov): Likewise.
	* config/i386/i386.md (*movxi_internal_avx512f): Call
	ix86_output_ssemov for TYPE_SSEMOV.
	(*movoi_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
	Remove ext_sse_reg_operand and TARGET_AVX512VL check.
	(*movti_internal): Likewise.
	(*movdi_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
	Remove ext_sse_reg_operand check.
	(*movsi_internal): Likewise.
	(*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
	(*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
	Remove TARGET_AVX512F, TARGET_PREFER_AVX256, TARGET_AVX512VL
	and ext_sse_reg_operand check.
	(*movsf_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
	Remove TARGET_PREFER_AVX256, TARGET_AVX512VL and
	ext_sse_reg_operand check.
	* config/i386/mmx.md (MMXMODE:*mov<mode>_internal): Call
	ix86_output_ssemov for TYPE_SSEMOV.  Remove ext_sse_reg_operand
	check.
	* config/i386/sse.md (VMOVE:mov<mode>_internal): Call
	ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL
	check.
	* config/i386/predicates.md (ext_sse_reg_operand): Removed.

gcc/testsuite/

	PR target/89229
	PR target/89346
	* gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated.
	* gcc.target/i386/pr89229-2a.c: New test.
	* gcc.target/i386/pr89229-2b.c: Likewise.
	* gcc.target/i386/pr89229-2c.c: Likewise.
	* gcc.target/i386/pr89229-3a.c: Likewise.
	* gcc.target/i386/pr89229-3b.c: Likewise.
	* gcc.target/i386/pr89229-3c.c: Likewise.
	* gcc.target/i386/pr89229-4a.c: Likewise.
	* gcc.target/i386/pr89229-4b.c: Likewise.
	* gcc.target/i386/pr89229-4c.c: Likewise.
	* gcc.target/i386/pr89229-5a.c: Likewise.
	* gcc.target/i386/pr89229-5b.c: Likewise.
[PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
On Mon, Feb 11, 2019 at 05:24:24PM +0100, Jakub Jelinek wrote:
> On Mon, Feb 11, 2019 at 04:56:45PM +0100, Uros Bizjak wrote:
> > > Let's first define what MODE_XI means in standard_sse_constant_opcode
> > > as well as in all these mov patterns for with and without AVX512VL.
> > > Without a clear definition, we can't get out of this mess.
> >
> > INT_MODE (OI, 32);
> > INT_MODE (XI, 64);
> >
> > So, XImode represents 64 integer bytes = 64 * 8 = 512 bit operation,
> > in case of const_1, all 512 bits set.
> >
> > We can load zeros with narrower instruction, (e.g. 256 bit by inherent
> > zeroing of highpart in case of 128 bit xor), so TImode in this case.
> >
> > Some targets prefer V4SF mode, so they will emit float xorps for zeroing.
> >
> > Then the introduction of AVX512F fubared everything by overloading the
> > meaning of insn mode.
>
> I don't see much changes in AVX512F here, most of the behavior has been
> there already in AVX.
> Most of the SSE/AVX/AVX512 instructions affect the whole register,
> usually there is DEST[MAX_VL-1:VL] <- 0 at the end of each instruction.
> But using the MAX_VL to determine get_attr_mode doesn't seem really useful,
> because that changes dynamically at runtime based on the actual hw, not on
> what we've been compiled for.
> So, I believe we want to use that VL value to determine the bitsize of the
> mode corresponding to get_attr_mode.  And in that case, for
> *movoi_internal_avx and *movti_internal, I believe the right mode is MODE_OI
> resp. MODE_TI for AVX512VL, because e.g.
> 	vmovdqa32  %ymm12, %ymm23
> is a VL = 256 instruction, not VL = 512.  Similarly, if we want to set
> %ymm25 to all ones, i.e. movoi_internal_avx, we use
> 	vpternlogd $0xFF, %ymm25, %ymm25, %ymm25
> which is again a VL = 256 instruction, so it should use MODE_OI.
> We'd need to use
> 	vmovdqa32  %zmm12, %zmm23
> or
> 	vpternlogd $0xFF, %zmm25, %zmm25, %zmm25
> instructions for AVX512F without AVX512VL, but as has been discussed, this
> won't really happen, because hard_regno_mode_ok refuses to allocate 256-bit
> or 128-bit modes in ext sse registers.

Here is the patch.  Tested on AVX2/x86-64 and AVX512/x86-64 with and
without --with-arch=native.

H.J.
---
[patch description and ChangeLog identical to the message above; trimmed]