Hi,
sorry for the delay with the review.  This is my first pass through the
backend part; hopefully someone else will do the middle-end bits.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2c53423..d7c4330 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),                   /* cost of FABS instruction.  */
   COSTS_N_BYTES (2),                   /* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),                   /* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},

I am a bit concerned about the explosion of variants, but adding aligned
variants probably makes sense.  I guess we should specify what alignment
needs to be known, i.e. is an alignment of 2 enough, or should the
alignment match the size of the loads/stores produced?
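To make the question concrete: is the idea that the expander checks
something like

      if (MEM_ALIGN (dst) >= 2 * BITS_PER_UNIT)

or rather

      if (MEM_ALIGN (dst) >= GET_MODE_BITSIZE (move_mode))

before picking an aligned variant?  (Both checks are only my guesses at
what the test would look like, not code from the patch.)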

@@ -15266,6 +15362,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+         && !TARGET_64BIT
+         && TARGET_SSE2
+         && MEM_P (op0)
+         && MEM_P (op1)
+         && !push_operand (op0, mode)
+         && can_create_pseudo_p ())
+       {
+         rtx temp = gen_reg_rtx (V2DImode);
+         emit_insn (gen_sse2_loadq (temp, op1));
+         emit_insn (gen_sse_storeq (op0, temp));
+         return;
+       }
+      if (mode == DImode
+         && !TARGET_64BIT
+         && TARGET_SSE
+         && !MEM_P (op1)
+         && GET_MODE (op1) == V2DImode)
+       {
+         emit_insn (gen_sse_storeq (op0, op1));
+         return;
+       }
+      if (mode == TImode
+         && TARGET_AVX2
+         && MEM_P (op0)
+         && !MEM_P (op1)
+         && GET_MODE (op1) == V4DImode)
+       {
+         op0 = convert_to_mode (V2DImode, op0, 1);
+         emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+         return;
+       }

This hunk seems dangerous in that emitting the explicit loadq/storeq pairs
(and similar) will prevent the use of integer registers for 64bit/128bit
arithmetic.

I guess we could play such tricks for memory-memory moves & constant stores. 
With gimple optimizations
we already know pretty well that the moves will stay as they are.  That might 
be enough for you?
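I.e. I would keep only something along the lines of the first hunk,

      if (mode == DImode
          && !TARGET_64BIT
          && TARGET_SSE2
          && MEM_P (op0)
          && (MEM_P (op1) || CONSTANT_P (op1))
          && !push_operand (op0, mode)
          && can_create_pseudo_p ())

and drop the variants taking a vector register operand (just a sketch of
what I mean; the constant case would of course need its own expansion
instead of sse2_loadq).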

If you go this way, please make it a separate patch so it can be
benchmarked.  While the moves are faster, there are always problems with
size mismatches in the load/store buffers.

I think for string operations we should output the SSE/AVX instruction
variants by hand, plus we could try to instruct IRA to actually prefer the
use of SSE/AVX when feasible.  This has traditionally been a problem with
the old RA, because it did not see that a pseudo which is eventually used
for arithmetic cannot go into an SSE register.  So it was not possible to
make MMX/SSE/AVX the preferred variants for 64bit/128bit manipulations w/o
hurting the performance of code that does arithmetic on long long and
__int128.  Perhaps IRA can solve this now.  Vladimir, do you have any
idea?

+/* Helper function for expand_set_or_movmem_via_loop.
+   This function can reuse iter rtx from another loop and don't generate
+   code for updating the addresses.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+                                        rtx destptr, rtx srcptr, rtx value,
+                                        rtx count, rtx iter,
+                                        enum machine_mode mode, int unroll,
+                                        int expected_size, bool change_ptrs)

I wrote the original function, but it is not really clear to me what the
function does now, i.e. what the code for updating the addresses is and
what reusing the iter means.  I guess reusing iter means that we won't
start the loop from 0.  Could you expand the comments a bit more?

I know I did not document them originally, but all the parameters ought
to be explicitly documented in the function comment.
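Something along these lines would already help (the parameter meanings
below are only my guesses from the names, so please correct whatever I got
wrong):

/* Expand a memset/memcpy loop copying COUNT bytes from SRCMEM to DESTMEM
   (or storing VALUE into DESTMEM for memset), with DESTPTR/SRCPTR as the
   address registers, MODE as the mode of one chunk and UNROLL as the
   number of chunks handled per iteration.  ITER is an existing induction
   register to reuse, so the loop does not need to start counting from 0.
   EXPECTED_SIZE is the profile-based size hint.  If CHANGE_PTRS is false,
   do not emit code updating DESTPTR/SRCPTR after the loop.  Return the
   final value of the induction register.  */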
@@ -20913,7 +21065,27 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instuction.  If RHS is constant, and vector mode will be used,
+   then move this consatnt to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+            rtx destptr, enum machine_mode mode, int offset)

This seems to belong more naturally in the gen_strset expander?
        {
-         if (TARGET_64BIT)
-           {
-             dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-             emit_insn (gen_strset (destptr, dest, value));
-           }
-         else
-           {
-             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-             emit_insn (gen_strset (destptr, dest, value));
-             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-             emit_insn (gen_strset (destptr, dest, value));
-           }
-         offset += 8;
+         if (GET_MODE (destmem) != move_mode)
+           destmem = change_address (destmem, move_mode, destptr);
AFAIK change_address is not equivalent to adjust_automodify_address_nv in
that it copies the memory aliasing attributes, and those need to be zapped
here since stringops behave funnily WRT aliasing.
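I.e. I would expect something like

          destmem = adjust_automodify_address_nv (destmem, move_mode,
                                                  destptr, 0);

here (the 0 offset is only my guess; the surrounding code may need to
pass the real offset).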
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
       if (TARGET_64BIT)
        {
-         dest = change_address (destmem, DImode, destptr);
-         emit_insn (gen_strset (destptr, dest, value));
-         emit_insn (gen_strset (destptr, dest, value));
+         destmem = change_address (destmem, DImode, destptr);
+         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+                                                               value)));
+         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+                                                               value)));

No use for 128bit moves here?
        }
       else
        {
-         dest = change_address (destmem, SImode, destptr);
-         emit_insn (gen_strset (destptr, dest, value));
-         emit_insn (gen_strset (destptr, dest, value));
-         emit_insn (gen_strset (destptr, dest, value));
-         emit_insn (gen_strset (destptr, dest, value));
+         destmem = change_address (destmem, SImode, destptr);
+         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+                                                               value)));
+         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+                                                               value)));
+         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+                                                               value)));
+         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+                                                               value)));

And here?
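In both places, once VALUE has been promoted into a 128-bit register,
something like

          destmem = change_address (destmem, V16QImode, destptr);
          emit_insn (gen_strset (destptr, destmem,
                                 gen_lowpart (V16QImode, value)));

could store all 16 bytes at once (assuming the new vector strset patterns
accept V16QImode; I did not check which modes they cover).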
@@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      srcmem = change_address (srcmem, QImode, srcptr);
-      destmem = change_address (destmem, QImode, destptr);
+      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
+      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);

You always want to use adjust_automodify_address or
adjust_automodify_address_nv; adjust_automodify_address_1 is not intended
for general use.
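I.e. with identical behaviour you can simply write

      srcmem = adjust_automodify_address (srcmem, QImode, srcptr, 0);
      destmem = adjust_automodify_address (destmem, QImode, destptr, 0);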
@@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+       {
+         dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+         src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+         emit_insn (gen_strmov (destreg, dst, srcreg, src));
+       }
+      else
+       {
+         dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+         src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+         emit_insn (gen_strmov (destreg, dst, srcreg, src));
+         emit_insn (gen_strmov (destreg, dst, srcreg, src));

Again, no use for vector moves?
+/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
+   supposed to represent one byte.  MODE could be a vector mode.
+   Example:
+   1) VAL = const_int (0xAB), mode = SImode,
+   the result is const_int (0xABABABAB).

This can be handled in a machine-independent way, right?

+   2) if VAL isn't const, then the result will be the result of MUL-instruction
+   of VAL and const_int (0x01010101) (for SImode).  */

This would probably go better as a named expansion pattern, like we do for
other machine description interfaces.
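For the constant case a completely machine-independent helper would do,
something along these lines (only a sketch with a made-up name, not code
from the patch, and it assumes an integer MODE no wider than a
HOST_WIDE_INT):

static rtx
broadcast_byte_value (rtx val, enum machine_mode mode)
{
  unsigned HOST_WIDE_INT mask = 0;
  unsigned int i;

  /* Build the 0x0101...01 pattern with one set bit per byte of MODE.  */
  for (i = 0; i < GET_MODE_SIZE (mode); i++)
    mask |= (unsigned HOST_WIDE_INT) 1 << (i * BITS_PER_UNIT);

  if (CONST_INT_P (val))
    return gen_int_mode ((UINTVAL (val) & 0xff) * mask, mode);

  /* Non-constant byte: compute VAL * 0x0101...01.  */
  return expand_simple_binop (mode, MULT,
                              convert_to_mode (mode, val, 1),
                              gen_int_mode (mask, mode),
                              NULL_RTX, 1, OPTAB_WIDEN);
}

The vector case is where the named expansion pattern would come in.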
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 90cef1c..4b7d67b 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.

New hooks should go into the machine-independent part of the patch.

Honza
