Forgot to add Uros - adding now.

On 18 April 2013 15:53, Michael Zolotukhin
<michael.v.zolotuk...@gmail.com> wrote:
> Hi,
> Jan, thanks for the review - I hope to prepare an updated version of
> the patch shortly.  Please see my answers to your comments below.
>
> Uros, there is a question about the best approach for generating wide
> moves.  Could you please comment on it (see details in bullets 3 and 5)?
>
> 1.
>> +static int smallest_pow2_greater_than (int);
>>
>> Perhaps it is easier to use existing 1<<ceil_log2?
> Well, yep.  Actually, this routine has already been used there, so I
> continued using it.  I guess we could change its implementation to
> call ceil_log2/floor_log2 or remove it entirely.
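> For reference, here is a sketch of the difference (assuming I recall
> the helper's definition correctly; ceil_log2 (x) returns the least N
> with 2^N >= x):
>
>   /* The existing helper returns the smallest power of two STRICTLY
>      greater than VAL.  */
>   static int
>   smallest_pow2_greater_than (int val)
>   {
>     int ret = 1;
>     while (ret <= val)
>       ret <<= 1;
>     return ret;
>   }
>
> So it corresponds to 1 << ceil_log2 (val + 1) rather than
> 1 << ceil_log2 (val) - the two differ when VAL is itself a power of
> two, which matters because the caller halves the result.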
>
> 2.
>> -  y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
>> -  srcmem = change_address (srcmem, mode, y_addr);
>> +  srcmem = offset_address (srcmem, copy_rtx (tmp), piece_size_n);
>> +  srcmem = adjust_address (srcmem, mode, 0);
>> ...
>> This change looks OK and can go into mainline independently.  Just
>> please ensure that changing the way the address is computed is not
>> making us preserve the alias set.  Memmove cannot rely on the alias
>> set of the src/destination objects.
> Could you explain this in more detail?  Do you mean that at the
> beginning DST and SRC could point to one memory location and have the
> corresponding alias sets, and I just change the addresses they point
> to without invalidating the alias sets?  I haven't thought about this,
> and it seems like a possible bug, but I guess it could simply be fixed
> by calling change_address at the end.
>
> 3.
>> +  /* Find the widest mode in which we could perform moves.
>> +     Start with the biggest power of 2 less than SIZE_TO_MOVE and half
>> +     it until move of such size is supported.  */
>> +  piece_size = smallest_pow2_greater_than (size_to_move) >> 1;
>> +  move_mode = mode_for_size (piece_size * BITS_PER_UNIT, MODE_INT, 0);
>>
>> I suppose this is a problem with SSE moves ending up in integer
>> registers, since you get TImode rather than a vectorized mode, like
>> V8QImode, here.  Why not stick with the original mode parameter?
> Yes, here we choose TImode instead of a vector mode, but that was
> actually done intentionally.  I tried several approaches here and
> decided that using the widest integer mode is the best one for now.
> We could try to find a particular (vector) mode in which we want to
> perform the copying, but isn't it better to rely on the machine
> description here?  My idea was to just request a copy of, for
> instance, a 128-bit piece (i.e. one TI-move) and leave it to the
> compiler to choose the most optimal way of performing it.  Currently,
> the compiler thinks that a move of 128 bits should be split into two
> 64-bit moves (this transformation is done in the split2 pass) - if
> that is actually suboptimal, we should fix it there, IMHO.
>
> I think Uros could advise on whether this is a reasonable approach or
> whether it should be changed.
>
> Also, I tried to avoid such fixes in this patch - that doesn't mean
> I'm not going to work on them, quite the contrary.  But it'd be easier
> to work on them if we had code in the trunk that could reveal the
> problem.
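> To illustrate (a made-up example, not from the patch or its
> testsuite): for something like
>
>   void
>   copy32 (char *dst, const char *src)
>   {
>     __builtin_memcpy (dst, src, 32);
>   }
>
> the expander would request 16-byte (TImode) pieces, and today split2
> turns each TImode move into two DImode moves; if a single vector move
> is preferable there, split2 is the place to teach it that.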
>
> 4.
>> Doesn't this effectively kill support for TARGET_SINGLE_STRINGOP?  It
>> is useful as a size optimization.
> Do you mean removing emit_strmov?  I don't think it'll kill anything,
> as the new emit_memmov is capable of doing what emit_strmov did and is
> just an extended version of it.  BTW, under the TARGET_SINGLE_STRINGOP
> switch gen_strmov is used, not emit_strmov - the behaviour there
> hasn't been changed by this patch.
>
> 5.
>> For SSE codegen, won't we need to track down whether the destination
>> was aligned to generate aligned/unaligned moves?
> We try to achieve the required alignment in the prologue, so in the
> main loop the destination is aligned properly.  The source, meanwhile,
> could be misaligned, so unaligned moves could be generated for it.
> Here I also rely on the fact that we have an optimal description of
> aligned/unaligned moves in the MD file, i.e. if it's better to emit
> two DI-moves instead of one unaligned TI-move, then splits/expands
> will manage to do that.
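> To be concrete (a sketch of the mechanism, not new code): the prologue
> records the achieved destination alignment on the MEM, roughly
>
>   destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
>   ...
>   set_mem_align (destmem, i * 2 * BITS_PER_UNIT);
>
> while the source MEM keeps whatever alignment is actually known for
> it, so the mov<mode> patterns can pick an aligned store for the
> destination and an unaligned load for the source where appropriate.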
>
> 6.
>> +  else if (TREE_CODE (expr) == MEM_REF)
>> +    {
>> +      tree base = TREE_OPERAND (expr, 0);
>> +      tree byte_offset = TREE_OPERAND (expr, 1);
>> +      if (TREE_CODE (base) != ADDR_EXPR
>> +          || TREE_CODE (byte_offset) != INTEGER_CST)
>> +        return -1;
>> +      if (!DECL_P (TREE_OPERAND (base, 0))
>> +          || DECL_ALIGN (TREE_OPERAND (base, 0)) < align)
>>
>> You can use TYPE_ALIGN here?  In general can't we replace all the
>> GIMPLE handling by get_object_alignment?
>>
>> +        return -1;
>> +      offset += tree_low_cst (byte_offset, 1);
>> +    }
>>    else
>>      return -1;
>>
>> This change ought to go in independently.  I can not review it.
>> I will make a first pass over the patch shortly, but please send an
>> updated patch fixing the problem with integer regs.
> Actually, I don't know what the right way to find out the alignment
> is, but the existing one didn't work.  The routine get_mem_align_offset
> didn't handle MEM_REFs at all, so I added some handling there - I'm
> not sure it's complete and absolutely correct, but it currently works
> for me.  I'd be glad to hear any suggestions on how this should be
> done - whom should I ask about it?
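> (If get_object_alignment is indeed the right interface, I'd guess the
> MEM_REF case above would collapse to something like
>
>   unsigned int obj_align = get_object_alignment (expr);
>   if (obj_align < (unsigned int) align)
>     return -1;
>
> since it already walks MEM_REFs and declarations internally - but I
> haven't checked whether it fits the offset bookkeeping this caller
> needs.)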
>
> ---
> Thanks, Michael
>
>
> On 17 April 2013 19:12, Jan Hubicka <hubi...@ucw.cz> wrote:
>> @@ -2392,6 +2392,7 @@ static void ix86_set_current_function (tree);
>>  static unsigned int ix86_minimum_incoming_stack_boundary (bool);
>>
>>  static enum calling_abi ix86_function_abi (const_tree);
>> +static int smallest_pow2_greater_than (int);
>>
>> Perhaps it is easier to use existing 1<<ceil_log2?
>>
>>  #ifndef SUBTARGET32_DEFAULT_CPU
>> @@ -21829,11 +21830,10 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
>>  {
>>    rtx out_label, top_label, iter, tmp;
>>    enum machine_mode iter_mode = counter_mode (count);
>> -  rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
>> +  int piece_size_n = GET_MODE_SIZE (mode) * unroll;
>> +  rtx piece_size = GEN_INT (piece_size_n);
>>    rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
>>    rtx size;
>> -  rtx x_addr;
>> -  rtx y_addr;
>>    int i;
>>
>>    top_label = gen_label_rtx ();
>> @@ -21854,13 +21854,18 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
>>    emit_label (top_label);
>>
>>    tmp = convert_modes (Pmode, iter_mode, iter, true);
>> -  x_addr = gen_rtx_PLUS (Pmode, destptr, tmp);
>> -  destmem = change_address (destmem, mode, x_addr);
>> +
>> +  /* This assert could be relaxed - in this case we'll need to compute
>> +     smallest power of two, containing in PIECE_SIZE_N and pass it to
>> +     offset_address.  */
>> +  gcc_assert ((piece_size_n & (piece_size_n - 1)) == 0);
>> +  destmem = offset_address (destmem, tmp, piece_size_n);
>> +  destmem = adjust_address (destmem, mode, 0);
>>
>>    if (srcmem)
>>      {
>> -      y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
>> -      srcmem = change_address (srcmem, mode, y_addr);
>> +      srcmem = offset_address (srcmem, copy_rtx (tmp), piece_size_n);
>> +      srcmem = adjust_address (srcmem, mode, 0);
>>
>>        /* When unrolling for chips that reorder memory reads and writes,
>>           we can save registers by using single temporary.
>> @@ -22039,13 +22044,61 @@ expand_setmem_via_rep_stos (rtx destmem, rtx destptr, rtx value,
>>    emit_insn (gen_rep_stos (destptr, countreg, destmem, value, destexp));
>>  }
>>
>> This change looks OK and can go into mainline independently.  Just
>> please ensure that changing the way the address is computed is not
>> making us preserve the alias set.  Memmove cannot rely on the alias
>> set of the src/destination objects.
>>
>> -static void
>> -emit_strmov (rtx destmem, rtx srcmem,
>> -             rtx destptr, rtx srcptr, enum machine_mode mode, int offset)
>> -{
>> -  rtx src = adjust_automodify_address_nv (srcmem, mode, srcptr, offset);
>> -  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
>> -  emit_insn (gen_strmov (destptr, dest, srcptr, src));
>> +/* This function emits moves to copy SIZE_TO_MOVE bytes from SRCMEM to
>> +   DESTMEM.
>> +   SRC is passed by pointer to be updated on return.
>> +   Return value is updated DST.  */
>> +static rtx
>> +emit_memmov (rtx destmem, rtx *srcmem, rtx destptr, rtx srcptr,
>> +             HOST_WIDE_INT size_to_move)
>> +{
>> +  rtx dst = destmem, src = *srcmem, adjust, tempreg;
>> +  enum insn_code code;
>> +  enum machine_mode move_mode;
>> +  int piece_size, i;
>> +
>> +  /* Find the widest mode in which we could perform moves.
>> +     Start with the biggest power of 2 less than SIZE_TO_MOVE and half
>> +     it until move of such size is supported.  */
>> +  piece_size = smallest_pow2_greater_than (size_to_move) >> 1;
>> +  move_mode = mode_for_size (piece_size * BITS_PER_UNIT, MODE_INT, 0);
>>
>> I suppose this is a problem with SSE moves ending up in integer
>> registers, since you get TImode rather than a vectorized mode, like
>> V8QImode, here.  Why not stick with the original mode parameter?
>>
>> +  code = optab_handler (mov_optab, move_mode);
>> +  while (code == CODE_FOR_nothing && piece_size > 1)
>> +    {
>> +      piece_size >>= 1;
>> +      move_mode = mode_for_size (piece_size * BITS_PER_UNIT, MODE_INT, 0);
>> +      code = optab_handler (mov_optab, move_mode);
>> +    }
>> +  gcc_assert (code != CODE_FOR_nothing);
>> +
>> +  dst = adjust_automodify_address_nv (dst, move_mode, destptr, 0);
>> +  src = adjust_automodify_address_nv (src, move_mode, srcptr, 0);
>> +
>> +  /* Emit moves.  We'll need SIZE_TO_MOVE/PIECE_SIZES moves.  */
>> +  gcc_assert (size_to_move % piece_size == 0);
>> +  adjust = GEN_INT (piece_size);
>> +  for (i = 0; i < size_to_move; i += piece_size)
>> +    {
>> +      /* We move from memory to memory, so we'll need to do it via
>> +         a temporary register.  */
>> +      tempreg = gen_reg_rtx (move_mode);
>> +      emit_insn (GEN_FCN (code) (tempreg, src));
>> +      emit_insn (GEN_FCN (code) (dst, tempreg));
>> +
>> +      emit_move_insn (destptr,
>> +                      gen_rtx_PLUS (Pmode, copy_rtx (destptr), adjust));
>> +      emit_move_insn (srcptr,
>> +                      gen_rtx_PLUS (Pmode, copy_rtx (srcptr), adjust));
>> +
>> +      dst = adjust_automodify_address_nv (dst, move_mode, destptr,
>> +                                          piece_size);
>> +      src = adjust_automodify_address_nv (src, move_mode, srcptr,
>> +                                          piece_size);
>> +    }
>> +
>> +  /* Update DST and SRC rtx.  */
>> +  *srcmem = src;
>> +  return dst;
>>
>> Doesn't this effectively kill support for TARGET_SINGLE_STRINGOP?  It
>> is useful as a size optimization.
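>> If I followed the callers correctly, only powers of two are ever
>> passed as SIZE_TO_MOVE here - e.g. for size_to_move == 8 this gives
>> smallest_pow2_greater_than (8) == 16, halved to piece_size == 8, i.e.
>> a single DImode move; a non-power-of-two size such as 12 would yield
>> piece_size == 8 and trip the size_to_move % piece_size assert.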
>>  }
>>
>>  /* Output code to copy at most count & (max_size - 1) bytes from SRC to
>>     DEST.  */
>> @@ -22057,44 +22110,17 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
>>    if (CONST_INT_P (count))
>>      {
>>        HOST_WIDE_INT countval = INTVAL (count);
>> -      int offset = 0;
>> +      HOST_WIDE_INT epilogue_size = countval % max_size;
>> +      int i;
>>
>> -      if ((countval & 0x10) && max_size > 16)
>> -        {
>> -          if (TARGET_64BIT)
>> -            {
>> -              emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
>> -              emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
>> -            }
>> -          else
>> -            gcc_unreachable ();
>> -          offset += 16;
>> -        }
>> -      if ((countval & 0x08) && max_size > 8)
>> -        {
>> -          if (TARGET_64BIT)
>> -            emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
>> -          else
>> -            {
>> -              emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
>> -              emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
>> -            }
>> -          offset += 8;
>> -        }
>> -      if ((countval & 0x04) && max_size > 4)
>> +      /* For now MAX_SIZE should be a power of 2.  This assert could be
>> +         relaxed, but it'll require a bit more complicated epilogue
>> +         expanding.  */
>> +      gcc_assert ((max_size & (max_size - 1)) == 0);
>> +      for (i = max_size; i >= 1; i >>= 1)
>>          {
>> -          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
>> -          offset += 4;
>> -        }
>> -      if ((countval & 0x02) && max_size > 2)
>> -        {
>> -          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
>> -          offset += 2;
>> -        }
>> -      if ((countval & 0x01) && max_size > 1)
>> -        {
>> -          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
>> -          offset += 1;
>> +          if (epilogue_size & i)
>> +            destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
>>          }
>>        return;
>>      }
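>> If I read the new loop right: with max_size == 16 and, say,
>> countval % max_size == 13 (0b1101), the iterations i = 8, 4 and 1
>> satisfy (epilogue_size & i) and emit 8-, 4- and 1-byte moves, which
>> matches what the old explicit countval & 0x08/0x04/0x01 sequence did.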
>> @@ -22330,47 +22356,33 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
>>  }
>>
>>  /* Copy enough from DEST to SRC to align DEST known to by aligned by ALIGN to
>> -   DESIRED_ALIGNMENT.  */
>> -static void
>> +   DESIRED_ALIGNMENT.
>> +   Return value is updated DESTMEM.  */
>> +static rtx
>>  expand_movmem_prologue (rtx destmem, rtx srcmem,
>>                          rtx destptr, rtx srcptr, rtx count,
>>                          int align, int desired_alignment)
>>  {
>> -  if (align <= 1 && desired_alignment > 1)
>> -    {
>> -      rtx label = ix86_expand_aligntest (destptr, 1, false);
>> -      srcmem = change_address (srcmem, QImode, srcptr);
>> -      destmem = change_address (destmem, QImode, destptr);
>> -      emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
>> -      ix86_adjust_counter (count, 1);
>> -      emit_label (label);
>> -      LABEL_NUSES (label) = 1;
>> -    }
>> -  if (align <= 2 && desired_alignment > 2)
>> -    {
>> -      rtx label = ix86_expand_aligntest (destptr, 2, false);
>> -      srcmem = change_address (srcmem, HImode, srcptr);
>> -      destmem = change_address (destmem, HImode, destptr);
>> -      emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
>> -      ix86_adjust_counter (count, 2);
>> -      emit_label (label);
>> -      LABEL_NUSES (label) = 1;
>> -    }
>> -  if (align <= 4 && desired_alignment > 4)
>> +  int i;
>> +  for (i = 1; i < desired_alignment; i <<= 1)
>>      {
>> -      rtx label = ix86_expand_aligntest (destptr, 4, false);
>> -      srcmem = change_address (srcmem, SImode, srcptr);
>> -      destmem = change_address (destmem, SImode, destptr);
>> -      emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
>> -      ix86_adjust_counter (count, 4);
>> -      emit_label (label);
>> -      LABEL_NUSES (label) = 1;
>> +      if (align <= i)
>> +        {
>> +          rtx label = ix86_expand_aligntest (destptr, i, false);
>> +          destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
>> +          ix86_adjust_counter (count, i);
>> +          emit_label (label);
>> +          LABEL_NUSES (label) = 1;
>> +          set_mem_align (destmem, i * 2 * BITS_PER_UNIT);
>> +        }
>>      }
>> -  gcc_assert (desired_alignment <= 8);
>> +  return destmem;
>>  }
>>
>>  /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
>> -   ALIGN_BYTES is how many bytes need to be copied.  */
>> +   ALIGN_BYTES is how many bytes need to be copied.
>> +   The function updates DST and SRC, namely, it sets proper alignment.
>> +   DST is returned via return value, SRC is updated via pointer SRCP.  */
>>  static rtx
>>  expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
>>                                   int desired_align, int align_bytes)
>>  {
>>    rtx src = *srcp;
>>    rtx orig_dst = dst;
>>    rtx orig_src = src;
>> -  int off = 0;
>> +  int piece_size = 1;
>> +  int copied_bytes = 0;
>>    int src_align_bytes = get_mem_align_offset (src, desired_align * BITS_PER_UNIT);
>>    if (src_align_bytes >= 0)
>>      src_align_bytes = desired_align - src_align_bytes;
>> -  if (align_bytes & 1)
>> -    {
>> -      dst = adjust_automodify_address_nv (dst, QImode, destreg, 0);
>> -      src = adjust_automodify_address_nv (src, QImode, srcreg, 0);
>> -      off = 1;
>> -      emit_insn (gen_strmov (destreg, dst, srcreg, src));
>> -    }
>> -  if (align_bytes & 2)
>> -    {
>> -      dst = adjust_automodify_address_nv (dst, HImode, destreg, off);
>> -      src = adjust_automodify_address_nv (src, HImode, srcreg, off);
>> -      if (MEM_ALIGN (dst) < 2 * BITS_PER_UNIT)
>> -        set_mem_align (dst, 2 * BITS_PER_UNIT);
>> -      if (src_align_bytes >= 0
>> -          && (src_align_bytes & 1) == (align_bytes & 1)
>> -          && MEM_ALIGN (src) < 2 * BITS_PER_UNIT)
>> -        set_mem_align (src, 2 * BITS_PER_UNIT);
>> -      off = 2;
>> -      emit_insn (gen_strmov (destreg, dst, srcreg, src));
>> -    }
>> -  if (align_bytes & 4)
>> +
>> +  for (piece_size = 1;
>> +       piece_size <= desired_align && copied_bytes < align_bytes;
>> +       piece_size <<= 1)
>>      {
>> -      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
>> -      src = adjust_automodify_address_nv (src, SImode, srcreg, off);
>> -      if (MEM_ALIGN (dst) < 4 * BITS_PER_UNIT)
>> -        set_mem_align (dst, 4 * BITS_PER_UNIT);
>> -      if (src_align_bytes >= 0)
>> +      if (align_bytes & piece_size)
>>          {
>> -          unsigned int src_align = 0;
>> -          if ((src_align_bytes & 3) == (align_bytes & 3))
>> -            src_align = 4;
>> -          else if ((src_align_bytes & 1) == (align_bytes & 1))
>> -            src_align = 2;
>> -          if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
>> -            set_mem_align (src, src_align * BITS_PER_UNIT);
>> +          dst = emit_memmov (dst, &src, destreg, srcreg, piece_size);
>> +          copied_bytes += piece_size;
>>          }
>> -      off = 4;
>> -      emit_insn (gen_strmov (destreg, dst, srcreg, src));
>>      }
>> -  dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
>> -  src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
>> +
>>    if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
>>      set_mem_align (dst, desired_align * BITS_PER_UNIT);
>>    if (src_align_bytes >= 0)
>>      {
>> -      unsigned int src_align = 0;
>> -      if ((src_align_bytes & 7) == (align_bytes & 7))
>> -        src_align = 8;
>> -      else if ((src_align_bytes & 3) == (align_bytes & 3))
>> -        src_align = 4;
>> -      else if ((src_align_bytes & 1) == (align_bytes & 1))
>> -        src_align = 2;
>> +      unsigned int src_align;
>> +      for (src_align = desired_align; src_align >= 2; src_align >>= 1)
>> +        {
>> +          if ((src_align_bytes & (src_align - 1))
>> +              == (align_bytes & (src_align - 1)))
>> +            break;
>> +        }
>>        if (src_align > (unsigned int) desired_align)
>>          src_align = desired_align;
>>        if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
>>          set_mem_align (src, src_align * BITS_PER_UNIT);
>> @@ -22666,42 +22650,24 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
>>  static int
>>  decide_alignment (int align,
>>                    enum stringop_alg alg,
>> -                  int expected_size)
>> +                  int expected_size,
>> +                  enum machine_mode move_mode)
>>  {
>>    int desired_align = 0;
>> -  switch (alg)
>> -    {
>> -    case no_stringop:
>> -      gcc_unreachable ();
>> -    case loop:
>> -    case unrolled_loop:
>> -      desired_align = GET_MODE_SIZE (Pmode);
>> -      break;
>> -    case rep_prefix_8_byte:
>> -      desired_align = 8;
>> -      break;
>> -    case rep_prefix_4_byte:
>> -      /* PentiumPro has special logic triggering for 8 byte aligned blocks.
>> -         copying whole cacheline at once.  */
>> -      if (TARGET_PENTIUMPRO)
>> -        desired_align = 8;
>> -      else
>> -        desired_align = 4;
>> -      break;
>> -    case rep_prefix_1_byte:
>> -      /* PentiumPro has special logic triggering for 8 byte aligned blocks.
>> -         copying whole cacheline at once.  */
>> -      if (TARGET_PENTIUMPRO)
>> -        desired_align = 8;
>> -      else
>> -        desired_align = 1;
>> -      break;
>> -    case loop_1_byte:
>> -      desired_align = 1;
>> -      break;
>> -    case libcall:
>> -      return 0;
>> -    }
>> +
>> +  gcc_assert (alg != no_stringop);
>> +
>> +  if (alg == libcall)
>> +    return 0;
>> +  if (move_mode == VOIDmode)
>> +    return 0;
>> +
>> +  desired_align = GET_MODE_SIZE (move_mode);
>> +  /* PentiumPro has special logic triggering for 8 byte aligned blocks.
>> +     copying whole cacheline at once.  */
>> +  if (TARGET_PENTIUMPRO
>> +      && (alg == rep_prefix_4_byte || alg == rep_prefix_1_byte))
>> +    desired_align = 8;
>>
>>    if (optimize_size)
>>      desired_align = 1;
>> @@ -22709,6 +22675,7 @@ decide_alignment (int align,
>>      desired_align = align;
>>    if (expected_size != -1 && expected_size < 4)
>>      desired_align = align;
>> +
>>    return desired_align;
>>  }
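>> A sanity check of the new logic (my reading, not part of the patch):
>> for the loop algorithms move_mode is word_mode, whose size matches the
>> old GET_MODE_SIZE (Pmode) choice on typical targets, and for
>> vector_loop with, say, a 16-byte move_mode the desired alignment
>> automatically grows to 16 to match the chunk size.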
>> @@ -22765,6 +22732,8 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
>>    int dynamic_check;
>>    bool need_zero_guard = false;
>>    bool noalign;
>> +  enum machine_mode move_mode = VOIDmode;
>> +  int unroll_factor = 1;
>>
>>    if (CONST_INT_P (align_exp))
>>      align = INTVAL (align_exp);
>> @@ -22788,50 +22757,60 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
>>
>>    /* Step 0: Decide on preferred algorithm, desired alignment and
>>       size of chunks to be copied by main loop.  */
>> -
>>    alg = decide_alg (count, expected_size, false, &dynamic_check, &noalign);
>> -  desired_align = decide_alignment (align, alg, expected_size);
>> -
>> -  if (!TARGET_ALIGN_STRINGOPS || noalign)
>> -    align = desired_align;
>> -
>>    if (alg == libcall)
>>      return false;
>>    gcc_assert (alg != no_stringop);
>> +
>>    if (!count)
>>      count_exp = copy_to_mode_reg (GET_MODE (count_exp), count_exp);
>>    destreg = copy_addr_to_reg (XEXP (dst, 0));
>>    srcreg = copy_addr_to_reg (XEXP (src, 0));
>> +
>> +  unroll_factor = 1;
>> +  move_mode = word_mode;
>>    switch (alg)
>>      {
>>      case libcall:
>>      case no_stringop:
>>        gcc_unreachable ();
>> +    case loop_1_byte:
>> +      need_zero_guard = true;
>> +      move_mode = QImode;
>> +      break;
>>      case loop:
>>        need_zero_guard = true;
>> -      size_needed = GET_MODE_SIZE (word_mode);
>>        break;
>>      case unrolled_loop:
>>        need_zero_guard = true;
>> -      size_needed = GET_MODE_SIZE (word_mode) * (TARGET_64BIT ? 4 : 2);
>> +      unroll_factor = (TARGET_64BIT ? 4 : 2);
>> +      break;
>> +    case vector_loop:
>> +      need_zero_guard = true;
>> +      unroll_factor = 4;
>> +      /* Find the widest supported mode.  */
>> +      move_mode = Pmode;
>> +      while (optab_handler (mov_optab, GET_MODE_WIDER_MODE (move_mode))
>> +             != CODE_FOR_nothing)
>> +        move_mode = GET_MODE_WIDER_MODE (move_mode);
>>        break;
>>      case rep_prefix_8_byte:
>> -      size_needed = 8;
>> +      move_mode = DImode;
>>        break;
>>      case rep_prefix_4_byte:
>> -      size_needed = 4;
>> +      move_mode = SImode;
>>        break;
>>      case rep_prefix_1_byte:
>> -      size_needed = 1;
>> -      break;
>> -    case loop_1_byte:
>> -      need_zero_guard = true;
>> -      size_needed = 1;
>> +      move_mode = QImode;
>>        break;
>>      }
>> -
>> +  size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>>    epilogue_size_needed = size_needed;
>>
>> +  desired_align = decide_alignment (align, alg, expected_size, move_mode);
>> +  if (!TARGET_ALIGN_STRINGOPS || noalign)
>> +    align = desired_align;
>>
>> For SSE codegen, won't we need to track down whether the destination
>> was aligned to generate aligned/unaligned moves?
>>
>> Otherwise the patch seems reasonable.  Thanks for submitting it by
>> pieces so the review is easier.
>>
>> Honza
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
--
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.