Hi all,

This is a respin of James's patch from: 
https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
The original patch was approved and committed but was later reverted because of 
failures on big-endian.
This tweaked version fixes the big-endian failures in 
aarch64_expand_vector_init by picking the right
element of VALS to move into the low part of the vector register depending on 
endianness. The rest of the patch
stays the same. I'm looking for approval on the aarch64 parts, as they are the 
ones that have changed
since the last approved version of the patch.

-----------------------------------------------------------------------

In the testcase in this patch we create an SLP vector with only two
elements. Our current vector initialisation code will first duplicate
the first element to both lanes, then overwrite the top lane with a new
value.

This duplication can be clunky and wasteful.

Better would be to use the fact that we will always be overwriting
the remaining bits, and simply move the first element to the correct
place (implicitly zeroing all other bits).
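
As an illustration (hypothetical registers, not output taken from the
patch), initialising a V2DI vector from two general-purpose registers
currently generates something like:

  dup	v0.2d, x0
  ins	v0.d[1], x1

whereas moving the first element directly into place needs only:

  fmov	d0, x0
  ins	v0.d[1], x1

since the fmov implicitly zeroes the upper 64 bits of the vector
register.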

This reduces the code generated for this case, and can allow more
efficient addressing modes and other second-order benefits for AArch64
code which has been vectorized to V2DI mode.

Note that the change is generic enough to catch the case for any vector
mode, but is expected to be most useful for 2x64-bit vectorization.

Unfortunately, on its own, this would cause failures in
gcc.target/aarch64/load_v2vec_lanes_1.c and
gcc.target/aarch64/store_v2vec_lanes.c, which expect to see vec_merge
and vec_duplicate patterns for their simplifications to apply.  To fix
this, add a special case to the AArch64 code for when we are loading
from two memory addresses, and use the load_pair_lanes patterns
directly.
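
For example (illustrative only), when the two elements are loaded from
consecutive memory addresses, as in v = {p[0], p[1]} with p in x0, the
load_pair_lanes<mode> pattern covers the whole initialisation with a
single load:

  ldr q0, [x0]

rather than two element loads and a lane insert.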

We also need a new pattern in simplify-rtx.c:simplify_ternary_operation,
to catch:

  (vec_merge:OUTER
     (vec_duplicate:OUTER x:INNER)
     (subreg:OUTER y:INNER 0)
     (const_int N))

And simplify it to:

  (vec_concat:OUTER x:INNER y:INNER) if N == 1, or
  (vec_concat:OUTER y:INNER x:INNER) if N == 2.

This is similar to the existing patterns which are tested in this
function, without requiring the second operand to also be a
vec_duplicate.
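
As a concrete instance, with OUTER == V2DI and N == 2 (the shape we get
when inserting y into lane 1 of a vector whose lane 0 was set from x):

  (vec_merge:V2DI (vec_duplicate:V2DI (reg:DI y))
                  (subreg:V2DI (reg:DI x) 0)
                  (const_int 2))

simplifies to:

  (vec_concat:V2DI (reg:DI x) (reg:DI y))

taking lane 0 (x) from the defined low part of the paradoxical subreg
and lane 1 (y) from the vec_duplicate.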

Bootstrapped and tested on aarch64-none-linux-gnu and tested on
aarch64-none-elf.

Note that this requires
https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
if we don't want to ICE creating broken vector zero extends.

Are the non-AArch64 parts OK?

Thanks,
James

---
2018-05-15  James Greenhalgh  <james.greenha...@arm.com>
            Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

        * config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify
        code generation for cases where splatting a value is not useful.
        * simplify-rtx.c (simplify_ternary_operation): Simplify
        vec_merge across a vec_duplicate and a paradoxical subreg
        forming a vector mode to a vec_concat.

2018-05-15  James Greenhalgh  <james.greenha...@arm.com>

        * gcc.target/aarch64/vect-slp-dup.c: New.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 4b5183b602b8786307deb8e3d8056323028b50a2..562eb315f881a1e0d8aa3ba946c99b4c6f25949b 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13901,9 +13901,53 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 	    maxv = matches[i][1];
 	  }
 
-      /* Create a duplicate of the most common element.  */
-      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
-      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+      /* Create a duplicate of the most common element, unless all elements
+	 are equally useless to us, in which case just immediately set the
+	 vector register using the first element.  */
+
+      if (maxv == 1)
+	{
+	  /* For vectors of two 64-bit elements, we can do even better.  */
+	  if (n_elts == 2
+	      && (inner_mode == E_DImode
+		  || inner_mode == E_DFmode))
+	    {
+	      rtx x0 = XVECEXP (vals, 0, 0);
+	      rtx x1 = XVECEXP (vals, 0, 1);
+	      /* Combine can pick up this case, but handling it directly
+		 here leaves clearer RTL.
+
+		 This is load_pair_lanes<mode>, and also gives us a clean-up
+		 for store_pair_lanes<mode>.  */
+	      if (memory_operand (x0, inner_mode)
+		  && memory_operand (x1, inner_mode)
+		  && !STRICT_ALIGNMENT
+		  && rtx_equal_p (XEXP (x1, 0),
+				  plus_constant (Pmode,
+						 XEXP (x0, 0),
+						 GET_MODE_SIZE (inner_mode))))
+		{
+		  rtx t;
+		  if (inner_mode == DFmode)
+		    t = gen_load_pair_lanesdf (target, x0, x1);
+		  else
+		    t = gen_load_pair_lanesdi (target, x0, x1);
+		  emit_insn (t);
+		  return;
+		}
+	    }
+	  /* The subreg-move sequence below will move into lane zero of the
+	     vector register.  For big-endian we want that position to hold
+	     the last element of VALS.  */
+	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
+	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+	}
+      else
+	{
+	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+	  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+	}
 
       /* Insert the rest.  */
       for (int i = 0; i < n_elts; i++)
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 23244a12545ba2f9db21f66a63a6d36ff8fd29fc..d32cdd19ecd9b8ca6a2957d115bec9c6613d3836 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5891,6 +5891,36 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
 		return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
 	    }
 
+	  /* Replace:
+
+	      (vec_merge:outer (vec_duplicate:outer x:inner)
+			       (subreg:outer y:inner 0)
+			       (const_int N))
+
+	     with (vec_concat:outer x:inner y:inner) if N == 1,
+	     or (vec_concat:outer y:inner x:inner) if N == 2.
+
+	     Implicitly, this means we have a paradoxical subreg, but such
+	     a check is cheap, so make it anyway.
+
+	     Only applies for vectors of two elements.  */
+	  if (GET_CODE (op0) == VEC_DUPLICATE
+	      && GET_CODE (op1) == SUBREG
+	      && GET_MODE (op1) == GET_MODE (op0)
+	      && GET_MODE (SUBREG_REG (op1)) == GET_MODE (XEXP (op0, 0))
+	      && paradoxical_subreg_p (op1)
+	      && subreg_lowpart_p (op1)
+	      && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
+	      && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
+	      && IN_RANGE (sel, 1, 2))
+	    {
+	      rtx newop0 = XEXP (op0, 0);
+	      rtx newop1 = SUBREG_REG (op1);
+	      if (sel == 2)
+		std::swap (newop0, newop1);
+	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
+	    }
+
 	  /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
 				 (const_int n))
 	     with (vec_concat x y) or (vec_concat y x) depending on value
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
new file mode 100644
index 0000000000000000000000000000000000000000..0541e480d1f8561dbd9b2a56926c8df60d667a54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+
+/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
+
+void bar (double);
+
+void
+foo (double *restrict in, double *restrict in2,
+     double *restrict out1, double *restrict out2)
+{
+  for (int i = 0; i < 1024; i++)
+    {
+      out1[i] = in[i] + 2.0 * in[i+128];
+      out1[i+1] = in[i+1] + 2.0 * in2[i];
+      bar (in[i]);
+    }
+}
+
+/* { dg-final { scan-assembler-not "dup\tv\[0-9\]+.2d, v\[0-9\]+" } } */
+
