Re: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

Jeff Law Mon, 22 Jul 2019 16:11:39 -0700

On 2/22/19 9:24 AM, H.J. Lu wrote:
> Hi Jan, Uros,
> 
> This patch fixes the wrong code bug:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229
> 
> Tested on AVX2 and AVX512 with and without --with-arch=native.
> 
> OK for trunk?
> 
> Thanks.
> 
> H.J.
> --
> i386 backend has
> 
> INT_MODE (OI, 32);
> INT_MODE (XI, 64);
> 
> So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit operation,
> in case of const_1, all 512 bits set.
> 
> We can load zeros with narrower instruction, (e.g. 256 bit by inherent
> zeroing of highpart in case of 128 bit xor), so TImode in this case.
> 
> Some targets prefer V4SF mode, so they will emit float xorps for zeroing.
> 
> sse.md has
> 
> (define_insn "mov<mode>_internal"
>   [(set (match_operand:VMOVE 0 "nonimmediate_operand"
>          "=v,v ,v ,m")
>         (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
>          " C,BC,vm,v"))]
> ....
>       /* There is no evex-encoded vmov* for sizes smaller than 64-bytes
>          in avx512f, so we need to use workarounds, to access sse registers
>          16-31, which are evex-only. In avx512vl we don't need workarounds.  
> */
>       if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
>           && (EXT_REX_SSE_REG_P (operands[0])
>               || EXT_REX_SSE_REG_P (operands[1])))
>         {
>           if (memory_operand (operands[0], <MODE>mode))
>             {
>               if (<MODE_SIZE> == 32)
>                 return "vextract<shuffletype>64x4\t{$0x0, %g1, %0|%0, %g1, 
> 0x0}";
>               else if (<MODE_SIZE> == 16)
>                 return "vextract<shuffletype>32x4\t{$0x0, %g1, %0|%0, %g1, 
> 0x0}";
>               else
>                 gcc_unreachable ();
>             }
> ...
> 
> However, since ix86_hard_regno_mode_ok has
> 
>      /* TODO check for QI/HI scalars.  */
>       /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
>       if (TARGET_AVX512VL
>           && (mode == OImode
>               || mode == TImode
>               || VALID_AVX256_REG_MODE (mode)
>               || VALID_AVX512VL_128_REG_MODE (mode)))
>         return true;
> 
>       /* xmm16-xmm31 are only available for AVX-512.  */
>       if (EXT_REX_SSE_REGNO_P (regno))
>         return false;
> 
>       if (TARGET_AVX512F && <MODE_SIZE> < 64 && !TARGET_AVX512VL
>           && (EXT_REX_SSE_REG_P (operands[0])
>               || EXT_REX_SSE_REG_P (operands[1])))
> 
> is a dead code.
> 
> Also for
> 
> long long *p;
> volatile __m256i yy;
> 
> void
> foo (void)
> {
>    _mm256_store_epi64 (p, yy);
> }
> 
> with AVX512VL, we should generate
> 
>       vmovdqa         %ymm0, (%rax)
> 
> not
> 
>       vmovdqa64       %ymm0, (%rax)
> 
> All TYPE_SSEMOV vector moves are consolidated to ix86_output_ssemov:
> 
> 1. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE/AVX vector
> moves will be generated.
> 2. If xmm16-xmm31/ymm16-ymm31 registers are used:
>    a. With AVX512VL, AVX512VL vector moves will be generated.
>    b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
>       move will be done with zmm register move.
> 
> ext_sse_reg_operand is removed since it is no longer needed.
> 
> Tested on AVX2 and AVX512 with and without --with-arch=native.
> 
> gcc/
> 
>       PR target/89229
>       PR target/89346
>       * config/i386/i386-protos.h (ix86_output_ssemov): New prototype.
>       * config/i386/i386.c (ix86_get_ssemov): New function.
>       (ix86_output_ssemov): Likewise.
>       * config/i386/i386.md (*movxi_internal_avx512f): Call
>       ix86_output_ssemov for TYPE_SSEMOV.
>       (*movoi_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
>       Remove ext_sse_reg_operand and TARGET_AVX512VL check.
>       (*movti_internal): Likewise.
>       (*movdi_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
>       Remove ext_sse_reg_operand check.
>       (*movsi_internal): Likewise.
>       (*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
>       (*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
>       Remove TARGET_AVX512F, TARGET_PREFER_AVX256, TARGET_AVX512VL
>       and ext_sse_reg_operand check.
>       (*movsf_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
>       Remove TARGET_PREFER_AVX256, TARGET_AVX512VL and
>       ext_sse_reg_operand check.
>       * config/i386/mmx.md (MMXMODE:*mov<mode>_internal): Call
>       ix86_output_ssemov for TYPE_SSEMOV.  Remove ext_sse_reg_operand
>       check.
>       * config/i386/sse.md (VMOVE:mov<mode>_internal): Call
>       ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL
>       check.
>       * config/i386/predicates.md (ext_sse_reg_operand): Removed.
> 
> gcc/testsuite/
> 
>       PR target/89229
>       PR target/89346
>       * gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated.
>       * gcc.target/i386/pr89229-2a.c: New test.
>       * gcc.target/i386/pr89229-2b.c: Likewise.
>       * gcc.target/i386/pr89229-2c.c: Likewise.
>       * gcc.target/i386/pr89229-3a.c: Likewise.
>       * gcc.target/i386/pr89229-3b.c: Likewise.
>       * gcc.target/i386/pr89229-3c.c: Likewise.
>       * gcc.target/i386/pr89229-4a.c: Likewise.
>       * gcc.target/i386/pr89229-4b.c: Likewise.
>       * gcc.target/i386/pr89229-4c.c: Likewise.
>       * gcc.target/i386/pr89229-5a.c: Likewise.
>       * gcc.target/i386/pr89229-5b.c: Likewise.
>       * gcc.target/i386/pr89229-5c.c: Likewise.
>       * gcc.target/i386/pr89229-6a.c: Likewise.
>       * gcc.target/i386/pr89229-6b.c: Likewise.
>       * gcc.target/i386/pr89229-6c.c: Likewise.
>       * gcc.target/i386/pr89229-7a.c: Likewise.
>       * gcc.target/i386/pr89229-7b.c: Likewise.
>       * gcc.target/i386/pr89229-7c.c: Likewise.
I've tried to follow what you're doing here, but frankly all this code
is an absolute mess.  Some comments about the difference cases would
likely help me and anyone else that needed to look at this in the future.


I like that we're consolidating things, but it's just damn hard to map
from what we do now to what you're doing in this patch and verify that
you're just changing the cases that you really want to be changing.

Is there any way to break this down into more manageable hunks?  Perhaps
changing one pattern from the md file at a time and walking through any
changes in code generation for the change (as part of the patch
discusion, not necessarily as comments in the patch?)

Again, what I'm trying to do is cut this down into something that is
understandable to someone that isn't intimately familiar with the code
and what you're trying to change.

Just an example, I'm having trouble just following how this affects the
one pattern in sse.md you're changing.  I can't see that the cases that
should stay the same are staying the same nor is it easy to tease out
what cases you want to change for that pattern.




>  
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 81dfed12837..80ebc187041 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -10286,6 +10286,280 @@ ix86_standard_x87sse_constant_load_p (const 
> rtx_insn *insn, rtx dst)
>    return true;
>  }
>  
> +/* Return the opcode of the TYPE_SSEMOV instruction.  To move from
> +   or to xmm16-xmm31/ymm16-ymm31 registers, we either require
> +   TARGET_AVX512VL or it is a register to register move which can
> +   be done with zmm register move. */
> +
> +static const char *
> +ix86_get_ssemov (rtx *operands, unsigned size,
> +              enum attr_mode insn_mode, machine_mode mode)
> +{
> +  char buf[128];
> +  bool misaligned_p = (misaligned_operand (operands[0], mode)
> +                    || misaligned_operand (operands[1], mode));
> +  bool evex_reg_p = (EXT_REX_SSE_REG_P (operands[0])
> +                  || EXT_REX_SSE_REG_P (operands[1]));
> +  machine_mode scalar_mode;
> +
> +  const char *opcode = NULL;
> +  enum
> +    {
> +      opcode_int,
> +      opcode_float,
> +      opcode_double
> +    } type = opcode_int;
> +
> +  switch (insn_mode)
> +    {
> +    case MODE_V16SF:
> +    case MODE_V8SF:
> +    case MODE_V4SF:
> +      scalar_mode = E_SFmode;
> +      break;
> +    case MODE_V8DF:
> +    case MODE_V4DF:
> +    case MODE_V2DF:
> +      scalar_mode = E_DFmode;
> +      break;
> +    case MODE_XI:
> +    case MODE_OI:
> +    case MODE_TI:
> +      scalar_mode = GET_MODE_INNER (mode);
> +      break;
> +    default:
> +      gcc_unreachable ();
> +    }
So why are the switches split across functions?  Is there some reason
why you don't have output_ssemov first compute the size with its
existing switch, then a switch like the one above to compute the scalar
mode to pass down to get_ssemov?  Or put the two switches in get_ssemov?




Jeff

Re: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

Reply via email to