PING^7: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2020-02-13 Thread H.J. Lu
On Thu, Feb 6, 2020 at 8:17 PM H.J. Lu wrote:
>
> On Mon, Jan 27, 2020 at 10:59 AM H.J. Lu wrote:
> >
> > On Mon, Jul 8, 2019 at 8:19 AM H.J. Lu wrote:
> > >
> > > On Tue, Jun 18, 2019 at 8:59 AM H.J. Lu wrote:
> > > >
> > > > On Fri, May 31, 2019 at 10:38 AM H.J. Lu wrote:
> > > > >
> > > > > On Tue, May 21, 2019 at 2:43 PM H.J. Lu wrote:
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu wrote:
> > > > > > >
> > > > > > > Hi Jan, Uros,
> > > > > > >
> > > > > > > This patch fixes the wrong code bug:
> > > > > > >
> > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229
> > > > > > >
> > > > > > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > > > > > >
> > > > > > > OK for trunk?
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > H.J.
> > > > > > > --
> > > > > > > i386 backend has
> > > > > > >
> > > > > > > INT_MODE (OI, 32);
> > > > > > > INT_MODE (XI, 64);
> > > > > > >
> > > > > > > So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit 
> > > > > > > operation,
> > > > > > > in case of const_1, all 512 bits set.
> > > > > > >
> > > > > > > We can load zeros with narrower instruction, (e.g. 256 bit by 
> > > > > > > inherent
> > > > > > > zeroing of highpart in case of 128 bit xor), so TImode in this 
> > > > > > > case.
> > > > > > >
> > > > > > > Some targets prefer V4SF mode, so they will emit float xorps for 
> > > > > > > zeroing.
> > > > > > >
> > > > > > > sse.md has
> > > > > > >
> > > > > > > (define_insn "mov<mode>_internal"
> > > > > > >   [(set (match_operand:VMOVE 0 "nonimmediate_operand"
> > > > > > >          "=v,v ,v ,m")
> > > > > > >         (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
> > > > > > >          " C,BC,vm,v"))]
> > > > > > >
> > > > > > >   /* There is no evex-encoded vmov* for sizes smaller than 64-bytes
> > > > > > >      in avx512f, so we need to use workarounds, to access sse registers
> > > > > > >      16-31, which are evex-only. In avx512vl we don't need workarounds.  */
> > > > > > >   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
> > > > > > >       && (EXT_REX_SSE_REG_P (operands[0])
> > > > > > >           || EXT_REX_SSE_REG_P (operands[1])))
> > > > > > >     {
> > > > > > >       if (memory_operand (operands[0], <MODE>mode))
> > > > > > >         {
> > > > > > >           if (<MODE_SIZE> == 32)
> > > > > > >             return "vextract<shuffletype>64x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
> > > > > > >           else if (<MODE_SIZE> == 16)
> > > > > > >             return "vextract<shuffletype>32x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
> > > > > > >           else
> > > > > > >             gcc_unreachable ();
> > > > > > >         }
> > > > > > > ...
> > > > > > >
> > > > > > > However, since ix86_hard_regno_mode_ok has
> > > > > > >
> > > > > > >  /* TODO check for QI/HI scalars.  */
> > > > > > >   /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
> > > > > > >   if (TARGET_AVX512VL
> > > > > > >   && (mode == OImode
> > > > > > >   || mode == TImode
> > > > > > >   || VALID_AVX256_REG_MODE (mode)
> > > > > > >   || VALID_AVX512VL_128_REG_MODE (mode)))
> > > > > > > return true;
> > > > > > >
> > > > > > >   /* xmm16-xmm31 are only available for AVX-512.  */
> > > > > > >   if (EXT_REX_SSE_REGNO_P (regno))
> > > > > > > return false;
> > > > > > >
> > > > > > >   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
> > > > > > >       && (EXT_REX_SSE_REG_P (operands[0])
> > > > > > >           || EXT_REX_SSE_REG_P (operands[1])))
> > > > > > >
> > > > > > > is dead code.
> > > > > > >
> > > > > > > Also for
> > > > > > >
> > > > > > > long long *p;
> > > > > > > volatile __m256i yy;
> > > > > > >
> > > > > > > void
> > > > > > > foo (void)
> > > > > > > {
> > > > > > >_mm256_store_epi64 (p, yy);
> > > > > > > }
> > > > > > >
> > > > > > > with AVX512VL, we should generate
> > > > > > >
> > > > > > > vmovdqa %ymm0, (%rax)
> > > > > > >
> > > > > > > not
> > > > > > >
> > > > > > > vmovdqa64   %ymm0, (%rax)
> > > > > > >
> > > > > > > All TYPE_SSEMOV vector moves are consolidated to 
> > > > > > > ix86_output_ssemov:
> > > > > > >
> > > > > > > 1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX 
> > > > > > > vector
> > > > > > > moves will be generated.
> > > > > > > 2. If xmm16-xmm31/ymm16-ymm31 registers are used:
> > > > > > >a. With AVX512VL, AVX512VL vector moves will be generated.
> > > > > > >b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to 
> > > > > > > register
> > > > > > >   move will be done with zmm register move.
> > > > > > >
> > > > > > > ext_sse_reg_operand is removed since it is no longer needed.
> > > > > > >
> > > > > > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > > > > > >
> > > > > > > 

PING^6: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2020-02-06 Thread H.J. Lu
> > > > > >   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
> > > > > >       && (EXT_REX_SSE_REG_P (operands[0])
> > > > > >           || EXT_REX_SSE_REG_P (operands[1])))
> > > > > >
> > > > > > is dead code.
> > > > > >
> > > > > > Also for
> > > > > >
> > > > > > long long *p;
> > > > > > volatile __m256i yy;
> > > > > >
> > > > > > void
> > > > > > foo (void)
> > > > > > {
> > > > > >_mm256_store_epi64 (p, yy);
> > > > > > }
> > > > > >
> > > > > > with AVX512VL, we should generate
> > > > > >
> > > > > > vmovdqa %ymm0, (%rax)
> > > > > >
> > > > > > not
> > > > > >
> > > > > > vmovdqa64   %ymm0, (%rax)
> > > > > >
> > > > > > All TYPE_SSEMOV vector moves are consolidated to ix86_output_ssemov:
> > > > > >
> > > > > > 1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX vector
> > > > > > moves will be generated.
> > > > > > 2. If xmm16-xmm31/ymm16-ymm31 registers are used:
> > > > > >a. With AVX512VL, AVX512VL vector moves will be generated.
> > > > > >b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
> > > > > >   move will be done with zmm register move.
> > > > > >
> > > > > > ext_sse_reg_operand is removed since it is no longer needed.
> > > > > >
> > > > > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > > > > >
> > > > > > gcc/
> > > > > >
> > > > > > PR target/89229
> > > > > > PR target/89346
> > > > > > * config/i386/i386-protos.h (ix86_output_ssemov): New 
> > > > > > prototype.
> > > > > > * config/i386/i386.c (ix86_get_ssemov): New function.
> > > > > > (ix86_output_ssemov): Likewise.
> > > > > > * config/i386/i386.md (*movxi_internal_avx512f): Call
> > > > > > ix86_output_ssemov for TYPE_SSEMOV.
> > > > > > (*movoi_internal_avx): Call ix86_output_ssemov for 
> > > > > > TYPE_SSEMOV.
> > > > > > Remove ext_sse_reg_operand and TARGET_AVX512VL check.
> > > > > > (*movti_internal): Likewise.
> > > > > > (*movdi_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > > > > Remove ext_sse_reg_operand check.
> > > > > > (*movsi_internal): Likewise.
> > > > > > (*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > > > > (*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > > > > Remove TARGET_AVX512F, TARGET_PREFER_AVX256, TARGET_AVX512VL
> > > > > > and ext_sse_reg_operand check.
> > > > > > (*movsf_internal_avx): Call ix86_output_ssemov for 
> > > > > > TYPE_SSEMOV.
> > > > > > Remove TARGET_PREFER_AVX256, TARGET_AVX512VL and
> > > > > > ext_sse_reg_operand check.
> > > > > > * config/i386/mmx.md (MMXMODE:*mov<mode>_internal): Call
> > > > > > ix86_output_ssemov for TYPE_SSEMOV.  Remove 
> > > > > > ext_sse_reg_operand
> > > > > > check.
> > > > > > * config/i386/sse.md (VMOVE:mov<mode>_internal): Call
> > > > > > ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL
> > > > > > check.
> > > > > > * config/i386/predicates.md (ext_sse_reg_operand): Removed.
> > > > > >
> > > > > > gcc/testsuite/
> > > > > >
> > > > > > PR target/89229
> > > > > > PR target/89346
> > > > > > * gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated.
> > > > > > * gcc.target/i386/pr89229-2a.c: New test.
> > > > > > * gcc.target/i386/pr89229-2b.c: Likewise.
> > > > > > * gcc.target/i386/pr89229-2c.c: Likewise.
> > > > > > * gcc.target/i386/pr89229-3a.c: Likewise.
> > > > > > * gcc.target/i386/pr89229-3b.c

PING^5: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2020-01-27 Thread H.J. Lu
> > > > >
> > > > > with AVX512VL, we should generate
> > > > >
> > > > > vmovdqa %ymm0, (%rax)
> > > > >
> > > > > not
> > > > >
> > > > > vmovdqa64   %ymm0, (%rax)
> > > > >
> > > > > All TYPE_SSEMOV vector moves are consolidated to ix86_output_ssemov:
> > > > >
> > > > > 1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX vector
> > > > > moves will be generated.
> > > > > 2. If xmm16-xmm31/ymm16-ymm31 registers are used:
> > > > >a. With AVX512VL, AVX512VL vector moves will be generated.
> > > > >b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
> > > > >   move will be done with zmm register move.
> > > > >
> > > > > ext_sse_reg_operand is removed since it is no longer needed.
> > > > >
> > > > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > > > >
> > > > > gcc/
> > > > >
> > > > > PR target/89229
> > > > > PR target/89346
> > > > > * config/i386/i386-protos.h (ix86_output_ssemov): New 
> > > > > prototype.
> > > > > * config/i386/i386.c (ix86_get_ssemov): New function.
> > > > > (ix86_output_ssemov): Likewise.
> > > > > * config/i386/i386.md (*movxi_internal_avx512f): Call
> > > > > ix86_output_ssemov for TYPE_SSEMOV.
> > > > > (*movoi_internal_avx): Call ix86_output_ssemov for 
> > > > > TYPE_SSEMOV.
> > > > > Remove ext_sse_reg_operand and TARGET_AVX512VL check.
> > > > > (*movti_internal): Likewise.
> > > > > (*movdi_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > > > Remove ext_sse_reg_operand check.
> > > > > (*movsi_internal): Likewise.
> > > > > (*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > > > (*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > > > Remove TARGET_AVX512F, TARGET_PREFER_AVX256, TARGET_AVX512VL
> > > > > and ext_sse_reg_operand check.
> > > > > (*movsf_internal_avx): Call ix86_output_ssemov for 
> > > > > TYPE_SSEMOV.
> > > > >     Remove TARGET_PREFER_AVX256, TARGET_AVX512VL and
> > > > > ext_sse_reg_operand check.
> > > > > * config/i386/mmx.md (MMXMODE:*mov<mode>_internal): Call
> > > > > ix86_output_ssemov for TYPE_SSEMOV.  Remove 
> > > > > ext_sse_reg_operand
> > > > > check.
> > > > > * config/i386/sse.md (VMOVE:mov<mode>_internal): Call
> > > > > ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL
> > > > > check.
> > > > > * config/i386/predicates.md (ext_sse_reg_operand): Removed.
> > > > >
> > > > > gcc/testsuite/
> > > > >
> > > > > PR target/89229
> > > > > PR target/89346
> > > > > * gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated.
> > > > > * gcc.target/i386/pr89229-2a.c: New test.
> > > > > * gcc.target/i386/pr89229-2b.c: Likewise.
> > > > > * gcc.target/i386/pr89229-2c.c: Likewise.
> > > > > * gcc.target/i386/pr89229-3a.c: Likewise.
> > > > > * gcc.target/i386/pr89229-3b.c: Likewise.
> > > > > * gcc.target/i386/pr89229-3c.c: Likewise.
> > > > > * gcc.target/i386/pr89229-4a.c: Likewise.
> > > > > * gcc.target/i386/pr89229-4b.c: Likewise.
> > > > > * gcc.target/i386/pr89229-4c.c: Likewise.
> > > > > * gcc.target/i386/pr89229-5a.c: Likewise.
> > > > > * gcc.target/i386/pr89229-5b.c: Likewise.
> > > > > * gcc.target/i386/pr89229-5c.c: Likewise.
> > > > > * gcc.target/i386/pr89229-6a.c: Likewise.
> > > > > * gcc.target/i386/pr89229-6b.c: Likewise.
> > > > > * gcc.target/i386/pr89229-6c.c: Likewise.
> > > > > * gcc.target/i386/pr89229-7a.c: Likewise.
> > > > > * gcc.target/i386/pr89229-7b.c: Likewise.
> > > > > 

Re: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2019-07-22 Thread Jeff Law
On 2/22/19 9:24 AM, H.J. Lu wrote:
> Hi Jan, Uros,
> 
> This patch fixes the wrong code bug:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229
> 
> Tested on AVX2 and AVX512 with and without --with-arch=native.
> 
> OK for trunk?
> 
> Thanks.
> 
> H.J.
> --
> i386 backend has
> 
> INT_MODE (OI, 32);
> INT_MODE (XI, 64);
> 
> So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit operation,
> in case of const_1, all 512 bits set.
> 
> We can load zeros with narrower instruction, (e.g. 256 bit by inherent
> zeroing of highpart in case of 128 bit xor), so TImode in this case.
> 
> Some targets prefer V4SF mode, so they will emit float xorps for zeroing.
> 
> sse.md has
> 
> (define_insn "mov<mode>_internal"
>   [(set (match_operand:VMOVE 0 "nonimmediate_operand"
>          "=v,v ,v ,m")
>         (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
>          " C,BC,vm,v"))]
>
>   /* There is no evex-encoded vmov* for sizes smaller than 64-bytes
>      in avx512f, so we need to use workarounds, to access sse registers
>      16-31, which are evex-only. In avx512vl we don't need workarounds.  */
>   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
>       && (EXT_REX_SSE_REG_P (operands[0])
>           || EXT_REX_SSE_REG_P (operands[1])))
>     {
>       if (memory_operand (operands[0], <MODE>mode))
>         {
>           if (<MODE_SIZE> == 32)
>             return "vextract<shuffletype>64x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
>           else if (<MODE_SIZE> == 16)
>             return "vextract<shuffletype>32x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
>           else
>             gcc_unreachable ();
>         }
> ...
> 
> However, since ix86_hard_regno_mode_ok has
> 
>  /* TODO check for QI/HI scalars.  */
>   /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
>   if (TARGET_AVX512VL
>   && (mode == OImode
>   || mode == TImode
>   || VALID_AVX256_REG_MODE (mode)
>   || VALID_AVX512VL_128_REG_MODE (mode)))
> return true;
> 
>   /* xmm16-xmm31 are only available for AVX-512.  */
>   if (EXT_REX_SSE_REGNO_P (regno))
> return false;
> 
>   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
>       && (EXT_REX_SSE_REG_P (operands[0])
>           || EXT_REX_SSE_REG_P (operands[1])))
> 
> is dead code.
> 
> Also for
> 
> long long *p;
> volatile __m256i yy;
> 
> void
> foo (void)
> {
>_mm256_store_epi64 (p, yy);
> }
> 
> with AVX512VL, we should generate
> 
>   vmovdqa %ymm0, (%rax)
> 
> not
> 
>   vmovdqa64   %ymm0, (%rax)
> 
> All TYPE_SSEMOV vector moves are consolidated to ix86_output_ssemov:
> 
> 1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX vector
> moves will be generated.
> 2. If xmm16-xmm31/ymm16-ymm31 registers are used:
>a. With AVX512VL, AVX512VL vector moves will be generated.
>b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
>   move will be done with zmm register move.
> 
> ext_sse_reg_operand is removed since it is no longer needed.
> 
> Tested on AVX2 and AVX512 with and without --with-arch=native.
> 
> gcc/
> 
>   PR target/89229
>   PR target/89346
>   * config/i386/i386-protos.h (ix86_output_ssemov): New prototype.
>   * config/i386/i386.c (ix86_get_ssemov): New function.
>   (ix86_output_ssemov): Likewise.
>   * config/i386/i386.md (*movxi_internal_avx512f): Call
>   ix86_output_ssemov for TYPE_SSEMOV.
>   (*movoi_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
>   Remove ext_sse_reg_operand and TARGET_AVX512VL check.
>   (*movti_internal): Likewise.
>   (*movdi_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
>   Remove ext_sse_reg_operand check.
>   (*movsi_internal): Likewise.
>   (*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
>   (*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
>   Remove TARGET_AVX512F, TARGET_PREFER_AVX256, TARGET_AVX512VL
>   and ext_sse_reg_operand check.
>   (*movsf_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
>   Remove TARGET_PREFER_AVX256, TARGET_AVX512VL and
>   ext_sse_reg_operand check.
>   * config/i386/mmx.md (MMXMODE:*mov<mode>_internal): Call
>   ix86_output_ssemov for TYPE_SSEMOV.  Remove ext_sse_reg_operand
>   check.
>   * config/i386/sse.md (VMOVE:mov<mode>_internal): Call
>   ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL
>   check.
>   * config/i386/predicates.md (ext_sse_reg_operand): Removed.
> 
> gcc/testsuite/
> 
>   PR target/89229
>   PR target/89346
>   * gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated.
>   * gcc.target/i386/pr89229-2a.c: New test.
>   * gcc.target/i386/pr89229-2b.c: Likewise.
>   * gcc.target/i386/pr89229-2c.c: Likewise.
>   * gcc.target/i386/pr89229-3a.c: Likewise.
>   * gcc.target/i386/pr89229-3b.c: Likewise.
>   * gcc.target/i386/pr89229-3c.c: 

PING^4: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2019-07-08 Thread H.J. Lu
On Tue, Jun 18, 2019 at 8:59 AM H.J. Lu wrote:
>
> On Fri, May 31, 2019 at 10:38 AM H.J. Lu wrote:
> >
> > On Tue, May 21, 2019 at 2:43 PM H.J. Lu wrote:
> > >
> > > On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu wrote:
> > > >
> > > > Hi Jan, Uros,
> > > >
> > > > This patch fixes the wrong code bug:
> > > >
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229
> > > >
> > > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > > >
> > > > OK for trunk?
> > > >
> > > > Thanks.
> > > >
> > > > H.J.
> > > > --
> > > > i386 backend has
> > > >
> > > > INT_MODE (OI, 32);
> > > > INT_MODE (XI, 64);
> > > >
> > > > So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit operation,
> > > > in case of const_1, all 512 bits set.
> > > >
> > > > We can load zeros with narrower instruction, (e.g. 256 bit by inherent
> > > > zeroing of highpart in case of 128 bit xor), so TImode in this case.
> > > >
> > > > Some targets prefer V4SF mode, so they will emit float xorps for 
> > > > zeroing.
> > > >
> > > > sse.md has
> > > >
> > > > (define_insn "mov<mode>_internal"
> > > >   [(set (match_operand:VMOVE 0 "nonimmediate_operand"
> > > >          "=v,v ,v ,m")
> > > >         (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
> > > >          " C,BC,vm,v"))]
> > > >
> > > >   /* There is no evex-encoded vmov* for sizes smaller than 64-bytes
> > > >      in avx512f, so we need to use workarounds, to access sse registers
> > > >      16-31, which are evex-only. In avx512vl we don't need workarounds.  */
> > > >   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
> > > >       && (EXT_REX_SSE_REG_P (operands[0])
> > > >           || EXT_REX_SSE_REG_P (operands[1])))
> > > >     {
> > > >       if (memory_operand (operands[0], <MODE>mode))
> > > >         {
> > > >           if (<MODE_SIZE> == 32)
> > > >             return "vextract<shuffletype>64x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
> > > >           else if (<MODE_SIZE> == 16)
> > > >             return "vextract<shuffletype>32x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
> > > >           else
> > > >             gcc_unreachable ();
> > > >         }
> > > > ...
> > > >
> > > > However, since ix86_hard_regno_mode_ok has
> > > >
> > > >  /* TODO check for QI/HI scalars.  */
> > > >   /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
> > > >   if (TARGET_AVX512VL
> > > >   && (mode == OImode
> > > >   || mode == TImode
> > > >   || VALID_AVX256_REG_MODE (mode)
> > > >   || VALID_AVX512VL_128_REG_MODE (mode)))
> > > > return true;
> > > >
> > > >   /* xmm16-xmm31 are only available for AVX-512.  */
> > > >   if (EXT_REX_SSE_REGNO_P (regno))
> > > > return false;
> > > >
> > > >   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
> > > >       && (EXT_REX_SSE_REG_P (operands[0])
> > > >           || EXT_REX_SSE_REG_P (operands[1])))
> > > >
> > > > is dead code.
> > > >
> > > > Also for
> > > >
> > > > long long *p;
> > > > volatile __m256i yy;
> > > >
> > > > void
> > > > foo (void)
> > > > {
> > > >_mm256_store_epi64 (p, yy);
> > > > }
> > > >
> > > > with AVX512VL, we should generate
> > > >
> > > > vmovdqa %ymm0, (%rax)
> > > >
> > > > not
> > > >
> > > > vmovdqa64   %ymm0, (%rax)
> > > >
> > > > All TYPE_SSEMOV vector moves are consolidated to ix86_output_ssemov:
> > > >
> > > > 1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX vector
> > > > moves will be generated.
> > > > 2. If xmm16-xmm31/ymm16-ymm31 registers are used:
> > > >a. With AVX512VL, AVX512VL vector moves will be generated.
> > > >b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
> > > >   move will be done with zmm register move.
> > > >
> > > > ext_sse_reg_operand is removed since it is no longer needed.
> > > >
> > > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > > >
> > > > gcc/
> > > >
> > > > PR target/89229
> > > > PR target/89346
> > > > * config/i386/i386-protos.h (ix86_output_ssemov): New prototype.
> > > > * config/i386/i386.c (ix86_get_ssemov): New function.
> > > > (ix86_output_ssemov): Likewise.
> > > > * config/i386/i386.md (*movxi_internal_avx512f): Call
> > > > ix86_output_ssemov for TYPE_SSEMOV.
> > > > (*movoi_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > > Remove ext_sse_reg_operand and TARGET_AVX512VL check.
> > > > (*movti_internal): Likewise.
> > > > (*movdi_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > > Remove ext_sse_reg_operand check.
> > > > (*movsi_internal): Likewise.
> > > > (*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > > (*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > > Remove TARGET_AVX512F, TARGET_PREFER_AVX256, TARGET_AVX512VL
> > > > 

PING^3: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2019-06-18 Thread H.J. Lu
On Fri, May 31, 2019 at 10:38 AM H.J. Lu wrote:
>
> On Tue, May 21, 2019 at 2:43 PM H.J. Lu wrote:
> >
> > On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu wrote:
> > >
> > > Hi Jan, Uros,
> > >
> > > This patch fixes the wrong code bug:
> > >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229
> > >
> > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > >
> > > OK for trunk?
> > >
> > > Thanks.
> > >
> > > H.J.
> > > --
> > > i386 backend has
> > >
> > > INT_MODE (OI, 32);
> > > INT_MODE (XI, 64);
> > >
> > > So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit operation,
> > > in case of const_1, all 512 bits set.
> > >
> > > We can load zeros with narrower instruction, (e.g. 256 bit by inherent
> > > zeroing of highpart in case of 128 bit xor), so TImode in this case.
> > >
> > > Some targets prefer V4SF mode, so they will emit float xorps for zeroing.
> > >
> > > sse.md has
> > >
> > > (define_insn "mov<mode>_internal"
> > >   [(set (match_operand:VMOVE 0 "nonimmediate_operand"
> > >          "=v,v ,v ,m")
> > >         (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
> > >          " C,BC,vm,v"))]
> > >
> > >   /* There is no evex-encoded vmov* for sizes smaller than 64-bytes
> > >      in avx512f, so we need to use workarounds, to access sse registers
> > >      16-31, which are evex-only. In avx512vl we don't need workarounds.  */
> > >   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
> > >       && (EXT_REX_SSE_REG_P (operands[0])
> > >           || EXT_REX_SSE_REG_P (operands[1])))
> > >     {
> > >       if (memory_operand (operands[0], <MODE>mode))
> > >         {
> > >           if (<MODE_SIZE> == 32)
> > >             return "vextract<shuffletype>64x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
> > >           else if (<MODE_SIZE> == 16)
> > >             return "vextract<shuffletype>32x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
> > >           else
> > >             gcc_unreachable ();
> > >         }
> > > ...
> > >
> > > However, since ix86_hard_regno_mode_ok has
> > >
> > >  /* TODO check for QI/HI scalars.  */
> > >   /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
> > >   if (TARGET_AVX512VL
> > >   && (mode == OImode
> > >   || mode == TImode
> > >   || VALID_AVX256_REG_MODE (mode)
> > >   || VALID_AVX512VL_128_REG_MODE (mode)))
> > > return true;
> > >
> > >   /* xmm16-xmm31 are only available for AVX-512.  */
> > >   if (EXT_REX_SSE_REGNO_P (regno))
> > > return false;
> > >
> > >   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
> > >       && (EXT_REX_SSE_REG_P (operands[0])
> > >           || EXT_REX_SSE_REG_P (operands[1])))
> > >
> > > is dead code.
> > >
> > > Also for
> > >
> > > long long *p;
> > > volatile __m256i yy;
> > >
> > > void
> > > foo (void)
> > > {
> > >_mm256_store_epi64 (p, yy);
> > > }
> > >
> > > with AVX512VL, we should generate
> > >
> > > vmovdqa %ymm0, (%rax)
> > >
> > > not
> > >
> > > vmovdqa64   %ymm0, (%rax)
> > >
> > > All TYPE_SSEMOV vector moves are consolidated to ix86_output_ssemov:
> > >
> > > 1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX vector
> > > moves will be generated.
> > > 2. If xmm16-xmm31/ymm16-ymm31 registers are used:
> > >a. With AVX512VL, AVX512VL vector moves will be generated.
> > >b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
> > >   move will be done with zmm register move.
> > >
> > > ext_sse_reg_operand is removed since it is no longer needed.
> > >
> > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > >
> > > gcc/
> > >
> > > PR target/89229
> > > PR target/89346
> > > * config/i386/i386-protos.h (ix86_output_ssemov): New prototype.
> > > * config/i386/i386.c (ix86_get_ssemov): New function.
> > > (ix86_output_ssemov): Likewise.
> > > * config/i386/i386.md (*movxi_internal_avx512f): Call
> > > ix86_output_ssemov for TYPE_SSEMOV.
> > > (*movoi_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > Remove ext_sse_reg_operand and TARGET_AVX512VL check.
> > > (*movti_internal): Likewise.
> > > (*movdi_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > Remove ext_sse_reg_operand check.
> > > (*movsi_internal): Likewise.
> > > (*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > (*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > Remove TARGET_AVX512F, TARGET_PREFER_AVX256, TARGET_AVX512VL
> > > and ext_sse_reg_operand check.
> > > (*movsf_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
> > > Remove TARGET_PREFER_AVX256, TARGET_AVX512VL and
> > > ext_sse_reg_operand check.
> > > * config/i386/mmx.md (MMXMODE:*mov<mode>_internal): Call
> > > ix86_output_ssemov for 

PING^2: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2019-05-31 Thread H.J. Lu
On Tue, May 21, 2019 at 2:43 PM H.J. Lu wrote:
>
> On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu wrote:
> >
> > Hi Jan, Uros,
> >
> > This patch fixes the wrong code bug:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229
> >
> > Tested on AVX2 and AVX512 with and without --with-arch=native.
> >
> > OK for trunk?
> >
> > Thanks.
> >
> > H.J.
> > --
> > i386 backend has
> >
> > INT_MODE (OI, 32);
> > INT_MODE (XI, 64);
> >
> > So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit operation,
> > in case of const_1, all 512 bits set.
> >
> > We can load zeros with narrower instruction, (e.g. 256 bit by inherent
> > zeroing of highpart in case of 128 bit xor), so TImode in this case.
> >
> > Some targets prefer V4SF mode, so they will emit float xorps for zeroing.
> >
> > sse.md has
> >
> > (define_insn "mov<mode>_internal"
> >   [(set (match_operand:VMOVE 0 "nonimmediate_operand"
> >          "=v,v ,v ,m")
> >         (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
> >          " C,BC,vm,v"))]
> >
> >   /* There is no evex-encoded vmov* for sizes smaller than 64-bytes
> >      in avx512f, so we need to use workarounds, to access sse registers
> >      16-31, which are evex-only. In avx512vl we don't need workarounds.  */
> >   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
> >       && (EXT_REX_SSE_REG_P (operands[0])
> >           || EXT_REX_SSE_REG_P (operands[1])))
> >     {
> >       if (memory_operand (operands[0], <MODE>mode))
> >         {
> >           if (<MODE_SIZE> == 32)
> >             return "vextract<shuffletype>64x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
> >           else if (<MODE_SIZE> == 16)
> >             return "vextract<shuffletype>32x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
> >           else
> >             gcc_unreachable ();
> >         }
> > ...
> >
> > However, since ix86_hard_regno_mode_ok has
> >
> >  /* TODO check for QI/HI scalars.  */
> >   /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
> >   if (TARGET_AVX512VL
> >   && (mode == OImode
> >   || mode == TImode
> >   || VALID_AVX256_REG_MODE (mode)
> >   || VALID_AVX512VL_128_REG_MODE (mode)))
> > return true;
> >
> >   /* xmm16-xmm31 are only available for AVX-512.  */
> >   if (EXT_REX_SSE_REGNO_P (regno))
> > return false;
> >
> >   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
> >       && (EXT_REX_SSE_REG_P (operands[0])
> >           || EXT_REX_SSE_REG_P (operands[1])))
> >
> > is dead code.
> >
> > Also for
> >
> > long long *p;
> > volatile __m256i yy;
> >
> > void
> > foo (void)
> > {
> >_mm256_store_epi64 (p, yy);
> > }
> >
> > with AVX512VL, we should generate
> >
> > vmovdqa %ymm0, (%rax)
> >
> > not
> >
> > vmovdqa64   %ymm0, (%rax)
> >
> > All TYPE_SSEMOV vector moves are consolidated to ix86_output_ssemov:
> >
> > 1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX vector
> > moves will be generated.
> > 2. If xmm16-xmm31/ymm16-ymm31 registers are used:
> >a. With AVX512VL, AVX512VL vector moves will be generated.
> >b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
> >   move will be done with zmm register move.
> >
> > ext_sse_reg_operand is removed since it is no longer needed.
> >
> > Tested on AVX2 and AVX512 with and without --with-arch=native.
> >
> > gcc/
> >
> > PR target/89229
> > PR target/89346
> > * config/i386/i386-protos.h (ix86_output_ssemov): New prototype.
> > * config/i386/i386.c (ix86_get_ssemov): New function.
> > (ix86_output_ssemov): Likewise.
> > * config/i386/i386.md (*movxi_internal_avx512f): Call
> > ix86_output_ssemov for TYPE_SSEMOV.
> > (*movoi_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
> > Remove ext_sse_reg_operand and TARGET_AVX512VL check.
> > (*movti_internal): Likewise.
> > (*movdi_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > Remove ext_sse_reg_operand check.
> > (*movsi_internal): Likewise.
> > (*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > (*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> > Remove TARGET_AVX512F, TARGET_PREFER_AVX256, TARGET_AVX512VL
> > and ext_sse_reg_operand check.
> > (*movsf_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
> > Remove TARGET_PREFER_AVX256, TARGET_AVX512VL and
> > ext_sse_reg_operand check.
> > * config/i386/mmx.md (MMXMODE:*mov<mode>_internal): Call
> > ix86_output_ssemov for TYPE_SSEMOV.  Remove ext_sse_reg_operand
> > check.
> > * config/i386/sse.md (VMOVE:mov<mode>_internal): Call
> > ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL
> > check.
> > * config/i386/predicates.md (ext_sse_reg_operand): Removed.
> >
> > gcc/testsuite/
> >
> > PR target/89229

PING^1: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2019-05-21 Thread H.J. Lu
On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu wrote:
>
> Hi Jan, Uros,
>
> This patch fixes the wrong code bug:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229
>
> Tested on AVX2 and AVX512 with and without --with-arch=native.
>
> OK for trunk?
>
> Thanks.
>
> H.J.
> --
> i386 backend has
>
> INT_MODE (OI, 32);
> INT_MODE (XI, 64);
>
> So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit operation,
> in case of const_1, all 512 bits set.
>
> We can load zeros with narrower instruction, (e.g. 256 bit by inherent
> zeroing of highpart in case of 128 bit xor), so TImode in this case.
>
> Some targets prefer V4SF mode, so they will emit float xorps for zeroing.
>
> sse.md has
>
> (define_insn "mov<mode>_internal"
>   [(set (match_operand:VMOVE 0 "nonimmediate_operand"
>          "=v,v ,v ,m")
>         (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
>          " C,BC,vm,v"))]
>
>   /* There is no evex-encoded vmov* for sizes smaller than 64-bytes
>      in avx512f, so we need to use workarounds, to access sse registers
>      16-31, which are evex-only. In avx512vl we don't need workarounds.  */
>   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
>       && (EXT_REX_SSE_REG_P (operands[0])
>           || EXT_REX_SSE_REG_P (operands[1])))
>     {
>       if (memory_operand (operands[0], <MODE>mode))
>         {
>           if (<MODE_SIZE> == 32)
>             return "vextract<shuffletype>64x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
>           else if (<MODE_SIZE> == 16)
>             return "vextract<shuffletype>32x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
>           else
>             gcc_unreachable ();
>         }
> ...
>
> However, since ix86_hard_regno_mode_ok has
>
>  /* TODO check for QI/HI scalars.  */
>   /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
>   if (TARGET_AVX512VL
>   && (mode == OImode
>   || mode == TImode
>   || VALID_AVX256_REG_MODE (mode)
>   || VALID_AVX512VL_128_REG_MODE (mode)))
> return true;
>
>   /* xmm16-xmm31 are only available for AVX-512.  */
>   if (EXT_REX_SSE_REGNO_P (regno))
> return false;
>
>   if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
>       && (EXT_REX_SSE_REG_P (operands[0])
>           || EXT_REX_SSE_REG_P (operands[1])))
>
> is dead code.
>
> Also for
>
> long long *p;
> volatile __m256i yy;
>
> void
> foo (void)
> {
>_mm256_store_epi64 (p, yy);
> }
>
> with AVX512VL, we should generate
>
> vmovdqa %ymm0, (%rax)
>
> not
>
> vmovdqa64   %ymm0, (%rax)
>
> All TYPE_SSEMOV vector moves are consolidated to ix86_output_ssemov:
>
> 1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX vector
> moves will be generated.
> 2. If xmm16-xmm31/ymm16-ymm31 registers are used:
>a. With AVX512VL, AVX512VL vector moves will be generated.
>b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
>   move will be done with zmm register move.
>
> ext_sse_reg_operand is removed since it is no longer needed.
>
> Tested on AVX2 and AVX512 with and without --with-arch=native.
>
> gcc/
>
> PR target/89229
> PR target/89346
> * config/i386/i386-protos.h (ix86_output_ssemov): New prototype.
> * config/i386/i386.c (ix86_get_ssemov): New function.
> (ix86_output_ssemov): Likewise.
> * config/i386/i386.md (*movxi_internal_avx512f): Call
> ix86_output_ssemov for TYPE_SSEMOV.
> (*movoi_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
> Remove ext_sse_reg_operand and TARGET_AVX512VL check.
> (*movti_internal): Likewise.
> (*movdi_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> Remove ext_sse_reg_operand check.
> (*movsi_internal): Likewise.
> (*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> (*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
> Remove TARGET_AVX512F, TARGET_PREFER_AVX256, TARGET_AVX512VL
> and ext_sse_reg_operand check.
> (*movsf_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
> Remove TARGET_PREFER_AVX256, TARGET_AVX512VL and
> ext_sse_reg_operand check.
> * config/i386/mmx.md (MMXMODE:*mov<mode>_internal): Call
> ix86_output_ssemov for TYPE_SSEMOV.  Remove ext_sse_reg_operand
> check.
> * config/i386/sse.md (VMOVE:mov<mode>_internal): Call
> ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL
> check.
> * config/i386/predicates.md (ext_sse_reg_operand): Removed.
>
> gcc/testsuite/
>
> PR target/89229
> PR target/89346
> * gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated.
> * gcc.target/i386/pr89229-2a.c: New test.
> * gcc.target/i386/pr89229-2b.c: Likewise.
> * gcc.target/i386/pr89229-2c.c: Likewise.
> * gcc.target/i386/pr89229-3a.c: Likewise.
> * gcc.target/i386/pr89229-3b.c: 

[PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2019-02-22 Thread H.J. Lu
Hi Jan, Uros,

This patch fixes the wrong code bug:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229

Tested on AVX2 and AVX512 with and without --with-arch=native.

OK for trunk?

Thanks.

H.J.
--
i386 backend has

INT_MODE (OI, 32);
INT_MODE (XI, 64);

So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit operation,
in case of const_1, all 512 bits set.

We can load zeros with narrower instruction, (e.g. 256 bit by inherent
zeroing of highpart in case of 128 bit xor), so TImode in this case.

Some targets prefer V4SF mode, so they will emit float xorps for zeroing.
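
As an illustration of the narrower zeroing described above (a
standalone example for this write-up, not part of the patch):

#include <immintrin.h>

/* A 128-bit xor is enough to zero a 256-bit register: under VEX
   encoding, writing the low xmm half also clears the upper bits of
   the containing ymm register.  */
__m256i
zero256 (void)
{
  /* Typically assembles to: vpxor %xmm0, %xmm0, %xmm0  */
  return _mm256_setzero_si256 ();
}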

sse.md has

(define_insn "mov<mode>_internal"
  [(set (match_operand:VMOVE 0 "nonimmediate_operand"
         "=v,v ,v ,m")
        (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
         " C,BC,vm,v"))]

  /* There is no evex-encoded vmov* for sizes smaller than 64-bytes
     in avx512f, so we need to use workarounds, to access sse registers
     16-31, which are evex-only. In avx512vl we don't need workarounds.  */
  if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
      && (EXT_REX_SSE_REG_P (operands[0])
          || EXT_REX_SSE_REG_P (operands[1])))
    {
      if (memory_operand (operands[0], <MODE>mode))
        {
          if (<MODE_SIZE> == 32)
            return "vextract<shuffletype>64x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
          else if (<MODE_SIZE> == 16)
            return "vextract<shuffletype>32x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
          else
            gcc_unreachable ();
        }
...

However, since ix86_hard_regno_mode_ok has

 /* TODO check for QI/HI scalars.  */
  /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
  if (TARGET_AVX512VL
  && (mode == OImode
  || mode == TImode
  || VALID_AVX256_REG_MODE (mode)
  || VALID_AVX512VL_128_REG_MODE (mode)))
return true;

  /* xmm16-xmm31 are only available for AVX-512.  */
  if (EXT_REX_SSE_REGNO_P (regno))
return false;

  if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
      && (EXT_REX_SSE_REG_P (operands[0])
          || EXT_REX_SSE_REG_P (operands[1])))

is dead code: without AVX512VL, ix86_hard_regno_mode_ok rejects
xmm16-xmm31 for any mode smaller than 64 bytes, so the operands can
never be extended SSE registers when this condition is checked.

Also for

long long *p;
volatile __m256i yy;

void
foo (void)
{
   _mm256_store_epi64 (p, yy);
}

with AVX512VL, we should generate

vmovdqa %ymm0, (%rax)

not

vmovdqa64   %ymm0, (%rax)
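
(One way to observe this, assuming a compiler with this patch applied:
compile the snippet above with "gcc -O2 -mavx512vl -S" and check that
the store in foo uses vmovdqa, not vmovdqa64.)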

All TYPE_SSEMOV vector moves are consolidated to ix86_output_ssemov:

1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX vector
moves will be generated.
2. If xmm16-xmm31/ymm16-ymm31 registers are used:
   a. With AVX512VL, AVX512VL vector moves will be generated.
   b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
  move will be done with zmm register move.
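
As a rough standalone model of that selection logic (illustration
only; the helper name and hard-coded strings below are made up for
this write-up, while the real logic lives in ix86_get_ssemov):

#include <stdbool.h>
#include <stdio.h>

/* Pick the instruction for a 256-bit integer vector move, given
   whether ymm16-ymm31 is involved and whether AVX512VL is available.  */
static const char *
ymm_move_insn (bool uses_ext_reg, bool has_avx512vl)
{
  if (!uses_ext_reg)
    return "vmovdqa %ymm1, %ymm0";     /* case 1: plain AVX move */
  if (has_avx512vl)
    return "vmovdqa64 %ymm17, %ymm16"; /* case 2a: EVEX at 256-bit VL */
  /* Case 2b: without AVX512VL there is no 256-bit EVEX move, so a
     register-to-register move is emitted as a zmm move.  */
  return "vmovdqa64 %zmm17, %zmm16";
}

int
main (void)
{
  printf ("%s\n", ymm_move_insn (false, false));
  printf ("%s\n", ymm_move_insn (true, true));
  printf ("%s\n", ymm_move_insn (true, false));
  return 0;
}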

ext_sse_reg_operand is removed since it is no longer needed.

Tested on AVX2 and AVX512 with and without --with-arch=native.

gcc/

PR target/89229
PR target/89346
* config/i386/i386-protos.h (ix86_output_ssemov): New prototype.
* config/i386/i386.c (ix86_get_ssemov): New function.
(ix86_output_ssemov): Likewise.
* config/i386/i386.md (*movxi_internal_avx512f): Call
ix86_output_ssemov for TYPE_SSEMOV.
(*movoi_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
Remove ext_sse_reg_operand and TARGET_AVX512VL check.
(*movti_internal): Likewise.
(*movdi_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
Remove ext_sse_reg_operand check.
(*movsi_internal): Likewise.
(*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
(*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
Remove TARGET_AVX512F, TARGET_PREFER_AVX256, TARGET_AVX512VL
and ext_sse_reg_operand check.
(*movsf_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
Remove TARGET_PREFER_AVX256, TARGET_AVX512VL and
ext_sse_reg_operand check.
* config/i386/mmx.md (MMXMODE:*mov<mode>_internal): Call
ix86_output_ssemov for TYPE_SSEMOV.  Remove ext_sse_reg_operand
check.
* config/i386/sse.md (VMOVE:mov<mode>_internal): Call
ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL
check.
* config/i386/predicates.md (ext_sse_reg_operand): Removed.

gcc/testsuite/

PR target/89229
PR target/89346
* gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated.
* gcc.target/i386/pr89229-2a.c: New test.
* gcc.target/i386/pr89229-2b.c: Likewise.
* gcc.target/i386/pr89229-2c.c: Likewise.
* gcc.target/i386/pr89229-3a.c: Likewise.
* gcc.target/i386/pr89229-3b.c: Likewise.
* gcc.target/i386/pr89229-3c.c: Likewise.
* gcc.target/i386/pr89229-4a.c: Likewise.
* gcc.target/i386/pr89229-4b.c: Likewise.
* gcc.target/i386/pr89229-4c.c: Likewise.
* gcc.target/i386/pr89229-5a.c: Likewise.
* gcc.target/i386/pr89229-5b.c: Likewise.

[PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2019-02-13 Thread H.J. Lu
On Mon, Feb 11, 2019 at 05:24:24PM +0100, Jakub Jelinek wrote:
> On Mon, Feb 11, 2019 at 04:56:45PM +0100, Uros Bizjak wrote:
> > > Let's first define what MODE_XI means in standard_sse_constant_opcode
> > > as well as in all these mov patterns for with and without AVX512VL.   
> > > Without
> > > a clear definition, we can't get out of this mess.
> > 
> > INT_MODE (OI, 32);
> > INT_MODE (XI, 64);
> > 
> > So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit operation,
> > in case of const_1, all 512 bits set.
> > 
> > We can load zeros with narrower instruction, (e.g. 256 bit by inherent
> > zeroing of highpart in case of 128 bit xor), so TImode in this case.
> > 
> > Some targets prefer V4SF mode, so they will emit float xorps for zeroing
> > 
> > Then the introduction of AVX512F fubared everything by overloading the
> > meaning of insn mode.
> 
> I don't see much changes in AVX512F here, most of the behavior has been
> there already in AVX.
> Most of the SSE/AVX/AVX512 instructions affect the whole register,
> usually there is DEST[MAX_VL-1:VL] <- 0 at the end of each instruction.
> But, using the MAX_VL to determine get_attr_mode doesn't seem really useful,
> because that changes dynamically at runtime based on the actual hw, not on
> what we've been compiled for.
> So, I believe we want to use that VL value to determine the bitsize of the
> mode corresponding to get_attr_mode.  And in that case, for
> *movoi_internal_avx and *movti_internal, I believe the right mode is MODE_OI
> resp. MODE_TI for AVX512VL, because e.g.
> vmovdqa32 %ymm12, %ymm23
> is a VL = 256 instruction, not VL = 512.  Similarly, if we want to set
> %ymm25 to all ones, i.e. movoi_internal_avx, we use
> vpternlogd $0xFF, %ymm25, %ymm25, %ymm25
> which is again VL = 256 instruction, so should use MODE_OI.
> We'd need to use
> vmovdqa32 %zmm12, %zmm23
> or
> vpternlogd $0xFF, %zmm25, %zmm25, %zmm25
> instructions for AVX512F without AVX512VL, but as has been discussed, this
> won't really happen, because hard_regno_mode_ok refuses to allocate 256-bit
> or 128-bit modes in ext sse registers.
> 

Here is the patch.  Tested on AVX2/x86-64 and AVX512/x86-64 with
and without --with-arch=native.


H.J.
---
i386 backend has

INT_MODE (OI, 32);
INT_MODE (XI, 64);

So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit operation,
in case of const_1, all 512 bits set.

We can load zeros with narrower instruction, (e.g. 256 bit by inherent
zeroing of highpart in case of 128 bit xor), so TImode in this case.

Some targets prefer V4SF mode, so they will emit float xorps for zeroing.

sse.md has

(define_insn "mov<mode>_internal"
  [(set (match_operand:VMOVE 0 "nonimmediate_operand"
         "=v,v ,v ,m")
        (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
         " C,BC,vm,v"))]

  /* There is no evex-encoded vmov* for sizes smaller than 64-bytes
     in avx512f, so we need to use workarounds, to access sse registers
     16-31, which are evex-only. In avx512vl we don't need workarounds.  */
  if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
      && (EXT_REX_SSE_REG_P (operands[0])
          || EXT_REX_SSE_REG_P (operands[1])))
    {
      if (memory_operand (operands[0], <MODE>mode))
        {
          if (<MODE_SIZE> == 32)
            return "vextract<shuffletype>64x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
          else if (<MODE_SIZE> == 16)
            return "vextract<shuffletype>32x4\t{$0x0, %g1, %0|%0, %g1, 0x0}";
          else
            gcc_unreachable ();
        }
...

However, since ix86_hard_regno_mode_ok has

 /* TODO check for QI/HI scalars.  */
  /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
  if (TARGET_AVX512VL
  && (mode == OImode
  || mode == TImode
  || VALID_AVX256_REG_MODE (mode)
  || VALID_AVX512VL_128_REG_MODE (mode)))
return true;

  /* xmm16-xmm31 are only available for AVX-512.  */
  if (EXT_REX_SSE_REGNO_P (regno))
return false;

  if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
      && (EXT_REX_SSE_REG_P (operands[0])
          || EXT_REX_SSE_REG_P (operands[1])))

is dead code.

Also for

long long *p;
volatile __m256i yy;

void
foo (void)
{
   _mm256_store_epi64 (p, yy);
}

with AVX512VL, we should generate

vmovdqa %ymm0, (%rax)

not

vmovdqa64   %ymm0, (%rax)

All TYPE_SSEMOV vector moves are consolidated to ix86_output_ssemov:

1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX vector
moves will be generated.
2. If xmm16-xmm31/ymm16-ymm31 registers are used:
   a. With AVX512VL, AVX512VL vector moves will be generated.
   b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
  move will be done with zmm register move.

ext_sse_reg_operand is removed since it is no longer needed.

gcc/

PR target/89229
PR target/89346
* config/i386/i386-protos.h (ix86_output_ssemov): New prototype.
*