Hi,

Here's a work-in-progress patch which fixes many execution failures seen in big-endian mode when -mvectorize-with-neon-quad is in effect (which is soon to be the default, if we stick to the current plan). But it's pretty hairy, and I'm not at all convinced that it isn't working "for the wrong reason" in a couple of places.
I'm mainly posting to gauge opinions on what we should do in big-endian mode. This patch works on the assumption that quad-word vectors in big-endian mode are in "vldm" order (i.e. with constituent double-words in little-endian order: see previous discussions). But that's pretty confusing, leads to less-than-optimal code, and is bound to cause more problems in the future. So I'm not sure how much effort to expend on making it work right, given that we might be throwing that vector ordering away in the future (at least in some cases: see below).

The "problem" patterns are as follows:

* Full-vector shifts: these don't work with big-endian vldm-order quad vectors. For now I've disabled them, although they could potentially be implemented using vtbl (at some cost).

* Widening moves (unpacks) & widening multiplies: when widening from D-reg to Q-reg size, we must swap double-words in the result (I've done this with vext). This seems to work fine, but what "hi" and "lo" refer to is rather muddled (in my head!). Also, these should be expanders instead of emitting multiple assembler insns.

* Narrowing moves: implemented by an "open-coded" permute & vmovn (for 2x D-reg -> D-reg), or 2x vmovn and vrev64.32 for Q-regs (as suggested by Paul). These seem to work fine.

* Reduction operations: when reducing Q-reg values, GCC currently tries to extract the result from the "wrong half" of the reduced vector. The fix in the attached patch is rather dubious, but seems to work (I'd like to understand why better).

We can sort those bits out, but the question is, do we want to go that route? Vectors are used in three quite distinct ways by GCC:

1. By the vectorizer.
2. By the NEON intrinsics.
3. By the "generic vector" support.

For the first of these, I think we can get away with changing the vectorizer to use explicit "array" loads and stores (i.e. vldN/vstN), so that vector registers will hold elements in memory order -- then, all the contortions in the attached patch will be unnecessary.
ABI issues are irrelevant, since vectors are generally "invisible" at the source-code layer, including at ABI boundaries.

For the second, intrinsics, we should do exactly what the user requests: so, vectors are essentially treated as opaque objects. This isn't a problem as such, but it might mean that instruction patterns written using "canonical" RTL for the vectorizer can't be shared with intrinsics when the order of elements matters. (I'm not sure how many patterns this would refer to at present; possibly none.)

The third case would continue to use "vldm" ordering, so if users inadvertently write code such as:

  res = vaddq_u32 (*foo, bar);

instead of writing an explicit vld* intrinsic (for the load of *foo), the result might be different from what they expect. It'd be nice to diagnose such code as erroneous, but that's another issue.

The important observation is that vectors from case 1 and from cases 2/3 never interact: it's quite safe for them to use different element orderings, without extensive changes to GCC infrastructure (i.e., multiple internal representations). I don't think I quite realised this previously.

So, anyway, back to the patch in question. The choices are, I think:

1. Apply as-is (after I've ironed out the wrinkles), and then remove the "ugly" bits at a later point, when vectorizer "array load/store" support is implemented.

2. Apply a version which simply disables all the troublesome patterns until that same support appears.

Apologies if I'm retreading old ground ;-).

(The CANNOT_CHANGE_MODE_CLASS fragment is necessary to generate good code for the quad-word vec_pack_trunc_<mode> pattern. It would eventually be applied as a separate patch.)

Thoughts?

Julian

ChangeLog

gcc/
	* config/arm/arm.h (CANNOT_CHANGE_MODE_CLASS): Allow changing mode
	of vector registers.
	* config/arm/neon.md (vec_shr_<mode>, vec_shl_<mode>): Disable in
	big-endian mode.
	(reduc_splus_<mode>, reduc_smin_<mode>, reduc_smax_<mode>)
	(reduc_umin_<mode>, reduc_umax_<mode>)
	(neon_vec_unpack<US>_lo_<mode>, neon_vec_unpack<US>_hi_<mode>)
	(neon_vec_<US>mult_lo_<mode>, neon_vec_<US>mult_hi_<mode>)
	(vec_pack_trunc_<mode>, neon_vec_pack_trunc_<mode>): Handle
	big-endian mode for quad-word vectors.
Index: gcc/config/arm/arm.h
===================================================================
--- gcc/config/arm/arm.h	(revision 306416)
+++ gcc/config/arm/arm.h	(working copy)
@@ -1302,11 +1302,13 @@ enum reg_class
 /* FPA registers can't do subreg as all values are reformatted to internal
    precision.  VFP registers may only be accessed in the mode they
-   were set.  */
-#define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS)	\
-  (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO)		\
-   ? reg_classes_intersect_p (FPA_REGS, (CLASS))	\
-     || reg_classes_intersect_p (VFP_REGS, (CLASS))	\
+   were set unless they contain NEON vector values.  */
+#define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS)	\
+  (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO)		\
+   ? reg_classes_intersect_p (FPA_REGS, (CLASS))	\
+     || ((!VALID_NEON_DREG_MODE (TO) || !VALID_NEON_QREG_MODE (FROM))	\
+	 && (!VALID_NEON_QREG_MODE (TO) || !VALID_NEON_DREG_MODE (FROM)) \
+	 && reg_classes_intersect_p (VFP_REGS, (CLASS)))	\
   : 0)

 /* We need to define this for LO_REGS on Thumb-1.  Otherwise we can end up
Index: gcc/config/arm/neon.md
===================================================================
--- gcc/config/arm/neon.md	(revision 306416)
+++ gcc/config/arm/neon.md	(working copy)
@@ -1350,11 +1350,14 @@
 ;; shift-count granularity.  That's good enough for the middle-end's current
 ;; needs.

+;; Note that it's not safe to perform such an operation in big-endian mode,
+;; due to element-ordering issues.
+
 (define_expand "vec_shr_<mode>"
   [(match_operand:VDQ 0 "s_register_operand" "")
    (match_operand:VDQ 1 "s_register_operand" "")
    (match_operand:SI 2 "const_multiple_of_8_operand" "")]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   rtx zero_reg;
   HOST_WIDE_INT num_bits = INTVAL (operands[2]);
@@ -1382,7 +1385,7 @@
   [(match_operand:VDQ 0 "s_register_operand" "")
    (match_operand:VDQ 1 "s_register_operand" "")
    (match_operand:SI 2 "const_multiple_of_8_operand" "")]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   rtx zero_reg;
   HOST_WIDE_INT num_bits = INTVAL (operands[2]);
@@ -1587,7 +1590,11 @@
   emit_insn (gen_quad_halves_plus<mode> (step1, operands[1]));
   emit_insn (gen_reduc_splus_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+
+  if (BYTES_BIG_ENDIAN)
+    emit_insn (gen_move_hi_quad_<mode> (operands[0], res_d));
+  else
+    emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));

   DONE;
 })
@@ -1632,7 +1639,11 @@
   emit_insn (gen_quad_halves_smin<mode> (step1, operands[1]));
   emit_insn (gen_reduc_smin_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+
+  if (BYTES_BIG_ENDIAN)
+    emit_insn (gen_move_hi_quad_<mode> (operands[0], res_d));
+  else
+    emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));

   DONE;
 })
@@ -1657,7 +1668,11 @@
   emit_insn (gen_quad_halves_smax<mode> (step1, operands[1]));
   emit_insn (gen_reduc_smax_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+
+  if (BYTES_BIG_ENDIAN)
+    emit_insn (gen_move_hi_quad_<mode> (operands[0], res_d));
+  else
+    emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));

   DONE;
 })
@@ -1682,7 +1697,11 @@
   emit_insn (gen_quad_halves_umin<mode> (step1, operands[1]));
   emit_insn (gen_reduc_umin_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+
+  if (BYTES_BIG_ENDIAN)
+    emit_insn (gen_move_hi_quad_<mode> (operands[0], res_d));
+  else
+    emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));

   DONE;
 })
@@ -1707,7 +1726,11 @@
   emit_insn (gen_quad_halves_umax<mode> (step1, operands[1]));
   emit_insn (gen_reduc_umax_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+
+  if (BYTES_BIG_ENDIAN)
+    emit_insn (gen_move_hi_quad_<mode> (operands[0], res_d));
+  else
+    emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));

   DONE;
 })
@@ -5563,8 +5586,14 @@
 	  (match_operand:VU 1 "register_operand" "w")
 	  (match_operand:VU 2 "vect_par_constant_low" ""))))]
   "TARGET_NEON"
-  "vmovl.<US><V_sz_elem> %q0, %e1"
-  [(set_attr "neon_type" "neon_shift_1")]
+{
+  if (BYTES_BIG_ENDIAN)
+    return "vmovl.<US><V_sz_elem>\t%q0, %f1\;vext.64\t%q0, %q0, %q0, #1";
+  else
+    return "vmovl.<US><V_sz_elem>\t%q0, %e1";
+}
+  [(set_attr "neon_type" "neon_shift_1")
+   (set_attr "length" "8")]
 )

 (define_insn "neon_vec_unpack<US>_hi_<mode>"
@@ -5573,8 +5602,14 @@
 	  (match_operand:VU 1 "register_operand" "w")
 	  (match_operand:VU 2 "vect_par_constant_high" ""))))]
   "TARGET_NEON"
-  "vmovl.<US><V_sz_elem> %q0, %f1"
-  [(set_attr "neon_type" "neon_shift_1")]
+{
+  if (BYTES_BIG_ENDIAN)
+    return "vmovl.<US><V_sz_elem>\t%q0, %e1\;vext.64\t%q0, %q0, %q0, #1";
+  else
+    return "vmovl.<US><V_sz_elem>\t%q0, %f1";
+}
+  [(set_attr "neon_type" "neon_shift_1")
+   (set_attr "length" "8")]
 )

 (define_expand "vec_unpack<US>_hi_<mode>"
@@ -5623,8 +5658,14 @@
 	    (match_operand:VU 3 "register_operand" "w")
 	    (match_dup 2)))))]
   "TARGET_NEON"
-  "vmull.<US><V_sz_elem> %q0, %e1, %e3"
-  [(set_attr "neon_type" "neon_shift_1")]
+{
+  if (BYTES_BIG_ENDIAN)
+    return "vmull.<US><V_sz_elem>\t%q0, %f1, %f3\;vext.64\t%q0, %q0, %q0, #1";
+  else
+    return "vmull.<US><V_sz_elem>\t%q0, %e1, %e3";
+}
+  [(set_attr "neon_type" "neon_shift_1")
+   (set_attr "length" "8")]
 )

 (define_expand "vec_widen_<US>mult_lo_<mode>"
@@ -5657,8 +5698,14 @@
 	    (match_operand:VU 3 "register_operand" "w")
 	    (match_dup 2)))))]
   "TARGET_NEON"
-  "vmull.<US><V_sz_elem> %q0, %f1, %f3"
-  [(set_attr "neon_type" "neon_shift_1")]
+{
+  if (BYTES_BIG_ENDIAN)
+    return "vmull.<US><V_sz_elem>\t%q0, %e1, %e3\;vext.64\t%q0, %q0, %q0, #1";
+  else
+    return "vmull.<US><V_sz_elem>\t%q0, %f1, %f3";
+}
+  [(set_attr "neon_type" "neon_shift_1")
+   (set_attr "length" "8")]
 )

 (define_expand "vec_widen_<US>mult_hi_<mode>"
@@ -5759,24 +5806,55 @@
 }
 )

-; FIXME: These instruction patterns can't be used safely in big-endian mode
-; because the ordering of vector elements in Q registers is different from what
-; the semantics of the instructions require.
-
-(define_insn "vec_pack_trunc_<mode>"
-  [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
+(define_expand "vec_pack_trunc_<mode>"
+  [(set (match_operand:<V_narrow_pack> 0 "register_operand" "")
 	(vec_concat:<V_narrow_pack>
 	 (truncate:<V_narrow>
-	   (match_operand:VN 1 "register_operand" "w"))
+	   (match_operand:VN 1 "register_operand" ""))
 	 (truncate:<V_narrow>
-	   (match_operand:VN 2 "register_operand" "w"))))]
-  "TARGET_NEON && !BYTES_BIG_ENDIAN"
-  "vmovn.i<V_sz_elem>\t%e0, %q1\;vmovn.i<V_sz_elem>\t%f0, %q2"
-  [(set_attr "neon_type" "neon_shift_1")
-   (set_attr "length" "8")]
-)
+	   (match_operand:VN 2 "register_operand" ""))))]
+  "TARGET_NEON"
+{
+  rtx tmp;
+  rtx destlo = simplify_gen_subreg (<V_narrow>mode, operands[0],
+				    <V_narrow_pack>mode, 0);
+  rtx desthi = simplify_gen_subreg (<V_narrow>mode, operands[0],
+				    <V_narrow_pack>mode, 8);
+
+  /* Operand 0 may overlap with operand 2, and the former is clobbered early.
+     Make a copy of the latter if necessary.  */
+
+  if (reg_overlap_mentioned_p (operands[0], operands[2]))
+    tmp = copy_to_reg (operands[2]);
+  else
+    tmp = operands[2];
+
+  if (BYTES_BIG_ENDIAN)
+    {
+      rtx rev = gen_reg_rtx (V4SImode);
+
+      emit_insn (gen_neon_vmovn<mode> (destlo, operands[1], const0_rtx));
+      emit_insn (gen_neon_vmovn<mode> (desthi, tmp, const0_rtx));
+      neon_reinterpret (rev, operands[0]);
+      emit_insn (gen_neon_vrev64v4si (rev, rev, const0_rtx));
+      neon_reinterpret (operands[0], rev);
+    }
+  else
+    {
+      emit_insn (gen_neon_vec_pack_trunc_<mode> (destlo, operands[1]));
+      emit_insn (gen_neon_vec_pack_trunc_<mode> (desthi, tmp));
+    }
+
+  DONE;
+})

 ; For the non-quad case.
+
+; Note that the "truncate" opcode shouldn't be used in big-endian mode, since
+; the relative orderings of quad-word and double-word vectors differ: use of
+; truncate would not imply the necessary permutation.  Use the "opaque"
+; neon_vmovn<mode> pattern in big-endian mode.
+
 (define_insn "neon_vec_pack_trunc_<mode>"
   [(set (match_operand:<V_narrow> 0 "register_operand" "=w")
 	(truncate:<V_narrow> (match_operand:VN 1 "register_operand" "w")))]
@@ -5789,12 +5867,23 @@
   [(match_operand:<V_narrow_pack> 0 "register_operand" "")
    (match_operand:VSHFT 1 "register_operand" "")
    (match_operand:VSHFT 2 "register_operand" "")]
-  "TARGET_NEON && !BYTES_BIG_ENDIAN"
+  "TARGET_NEON"
 {
   rtx tempreg = gen_reg_rtx (<V_DOUBLE>mode);

-  emit_insn (gen_move_lo_quad_<V_double> (tempreg, operands[1]));
-  emit_insn (gen_move_hi_quad_<V_double> (tempreg, operands[2]));
-  emit_insn (gen_neon_vec_pack_trunc_<V_double> (operands[0], tempreg));
+  if (BYTES_BIG_ENDIAN)
+    {
+      /* An open-coded "permute", to switch the order of the hi/lo
+	 double-words constituting the input operands of the
+	 instruction.  */
+      emit_insn (gen_move_lo_quad_<V_double> (tempreg, operands[2]));
+      emit_insn (gen_move_hi_quad_<V_double> (tempreg, operands[1]));
+      emit_insn (gen_neon_vmovn<V_double> (operands[0], tempreg, const0_rtx));
+    }
+  else
+    {
+      emit_insn (gen_move_lo_quad_<V_double> (tempreg, operands[1]));
+      emit_insn (gen_move_hi_quad_<V_double> (tempreg, operands[2]));
+      emit_insn (gen_neon_vec_pack_trunc_<V_double> (operands[0], tempreg));
+    }

   DONE;
 })
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain