Hi,

Here's a work-in-progress patch which fixes many execution failures seen in big-endian mode when -mvectorize-with-neon-quad is in effect (which is soon to be the default, if we stick to the current plan). But it's pretty hairy, and I'm not at all convinced that it isn't working "for the wrong reason" in a couple of places.
I'm mainly posting to gauge opinions on what we should do in big-endian mode. This patch works on the assumption that quad-word vectors in big-endian mode are in "vldm" order (i.e. with constituent double-words in little-endian order: see previous discussions). But that's pretty confusing, leads to less-than-optimal code, and is bound to cause more problems in the future. So I'm not sure how much effort to expend on making it work right, given that we might be throwing that vector ordering away in the future (at least in some cases: see below).

The "problem" patterns are as follows:

* Full-vector shifts: these don't work with big-endian vldm-order quad vectors. For now I've disabled them, although they could potentially be implemented using vtbl (at some cost).

* Widening moves (unpacks) & widening multiplies: when widening from D-reg to Q-reg size, we must swap double-words in the result (I've done this with vext). This seems to work fine, but what "hi" and "lo" refer to is rather muddled (in my head!). Also, these should be expanders instead of emitting multiple assembler insns.

* Narrowing moves: implemented by an "open-coded" permute & vmovn (for 2x D-reg -> D-reg), or 2x vmovn and vrev64.32 for Q-regs (as suggested by Paul). These seem to work fine.

* Reduction operations: when reducing Q-reg values, GCC currently tries to extract the result from the "wrong half" of the reduced vector. The fix in the attached patch is rather dubious, but seems to work (I'd like to understand why better).

We can sort those bits out, but the question is, do we want to go that route? Vectors are used in three quite distinct ways by GCC:

1. By the vectorizer.
2. By the NEON intrinsics.
3. By the "generic vector" support.

For the first of these, I think we can get away with changing the vectorizer to use explicit "array" loads and stores (i.e. vldN/vstN), so that vector registers will hold elements in memory order -- then, all the contortions in the attached patch will be unnecessary.
ABI issues are irrelevant, since vectors are generally "invisible" at the source-code layer, including at ABI boundaries.

For the second, intrinsics, we should do exactly what the user requests: so, vectors are essentially treated as opaque objects. This isn't a problem as such, but it might mean that instruction patterns written using "canonical" RTL for the vectorizer can't be shared with intrinsics when the order of elements matters. (I'm not sure how many patterns this would refer to at present; possibly none.)

The third case would continue to use "vldm" ordering, so if users inadvertently write code such as:

  res = vaddq_u32 (*foo, bar);

instead of writing an explicit vld* intrinsic (for the load of *foo), the result might be different from what they expect. It'd be nice to diagnose such code as erroneous, but that's another issue.

The important observation is that vectors from case 1 and from cases 2/3 never interact: it's quite safe for them to use different element orderings, without extensive changes to GCC infrastructure (i.e., multiple internal representations). I don't think I quite realised this previously.

So, anyway, back to the patch in question. The choices are, I think:

1. Apply as-is (after I've ironed out the wrinkles), and then remove the "ugly" bits at a later point, when vectorizer "array load/store" support is implemented.

2. Apply a version which simply disables all the troublesome patterns until that same support appears.

Apologies if I'm retreading old ground ;-).

(The CANNOT_CHANGE_MODE_CLASS fragment is necessary to generate good code for the quad-word vec_pack_trunc_<mode> pattern. It would eventually be applied as a separate patch.)

Thoughts?

Julian

ChangeLog

gcc/
	* config/arm/arm.h (CANNOT_CHANGE_MODE_CLASS): Allow changing mode
	of vector registers.
	* config/arm/neon.md (vec_shr_<mode>, vec_shl_<mode>): Disable in
	big-endian mode.
	(reduc_splus_<mode>, reduc_smin_<mode>, reduc_smax_<mode>)
	(reduc_umin_<mode>, reduc_umax_<mode>)
	(neon_vec_unpack<US>_lo_<mode>, neon_vec_unpack<US>_hi_<mode>)
	(neon_vec_<US>mult_lo_<mode>, neon_vec_<US>mult_hi_<mode>)
	(vec_pack_trunc_<mode>, neon_vec_pack_trunc_<mode>): Handle
	big-endian mode for quad-word vectors.
Index: gcc/config/arm/arm.h
===================================================================
--- gcc/config/arm/arm.h	(revision 306416)
+++ gcc/config/arm/arm.h	(working copy)
@@ -1302,11 +1302,13 @@ enum reg_class
 /* FPA registers can't do subreg as all values are reformatted to internal
    precision.  VFP registers may only be accessed in the mode they
-   were set.  */
-#define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS)	\
-  (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO)		\
-   ? reg_classes_intersect_p (FPA_REGS, (CLASS))	\
-     || reg_classes_intersect_p (VFP_REGS, (CLASS))	\
+   were set unless they contain NEON vector values.  */
+#define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS)	\
+  (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO)		\
+   ? reg_classes_intersect_p (FPA_REGS, (CLASS))	\
+     || ((!VALID_NEON_DREG_MODE (TO) || !VALID_NEON_QREG_MODE (FROM))	\
+	 && (!VALID_NEON_QREG_MODE (TO) || !VALID_NEON_DREG_MODE (FROM)) \
+	 && reg_classes_intersect_p (VFP_REGS, (CLASS)))	\
   : 0)

 /* We need to define this for LO_REGS on Thumb-1.  Otherwise we can end up
Index: gcc/config/arm/neon.md
===================================================================
--- gcc/config/arm/neon.md	(revision 306416)
+++ gcc/config/arm/neon.md	(working copy)
@@ -1350,11 +1350,14 @@
 ;; shift-count granularity.  That's good enough for the middle-end's current
 ;; needs.

+;; Note that it's not safe to perform such an operation in big-endian mode,
+;; due to element-ordering issues.
+
 (define_expand "vec_shr_<mode>"
   [(match_operand:VDQ 0 "s_register_operand" "")
    (match_operand:VDQ 1 "s_register_operand" "")
    (match_operand:SI 2 "const_multiple_of_8_operand" "")]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   rtx zero_reg;
   HOST_WIDE_INT num_bits = INTVAL (operands[2]);
@@ -1382,7 +1385,7 @@
   [(match_operand:VDQ 0 "s_register_operand" "")
    (match_operand:VDQ 1 "s_register_operand" "")
    (match_operand:SI 2 "const_multiple_of_8_operand" "")]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   rtx zero_reg;
   HOST_WIDE_INT num_bits = INTVAL (operands[2]);
@@ -1587,7 +1590,11 @@
   emit_insn (gen_quad_halves_plus<mode> (step1, operands[1]));
   emit_insn (gen_reduc_splus_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+
+  if (BYTES_BIG_ENDIAN)
+    emit_insn (gen_move_hi_quad_<mode> (operands[0], res_d));
+  else
+    emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));

   DONE;
 })
@@ -1632,7 +1639,11 @@
   emit_insn (gen_quad_halves_smin<mode> (step1, operands[1]));
   emit_insn (gen_reduc_smin_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+
+  if (BYTES_BIG_ENDIAN)
+    emit_insn (gen_move_hi_quad_<mode> (operands[0], res_d));
+  else
+    emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));

   DONE;
 })
@@ -1657,7 +1668,11 @@
   emit_insn (gen_quad_halves_smax<mode> (step1, operands[1]));
   emit_insn (gen_reduc_smax_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+
+  if (BYTES_BIG_ENDIAN)
+    emit_insn (gen_move_hi_quad_<mode> (operands[0], res_d));
+  else
+    emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));

   DONE;
 })
@@ -1682,7 +1697,11 @@
   emit_insn (gen_quad_halves_umin<mode> (step1, operands[1]));
   emit_insn (gen_reduc_umin_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+
+  if (BYTES_BIG_ENDIAN)
+    emit_insn (gen_move_hi_quad_<mode> (operands[0], res_d));
+  else
+    emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));

   DONE;
 })
@@ -1707,7 +1726,11 @@
   emit_insn (gen_quad_halves_umax<mode> (step1, operands[1]));
   emit_insn (gen_reduc_umax_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+
+  if (BYTES_BIG_ENDIAN)
+    emit_insn (gen_move_hi_quad_<mode> (operands[0], res_d));
+  else
+    emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));

   DONE;
 })
@@ -5563,8 +5586,14 @@
 	  (match_operand:VU 1 "register_operand" "w")
 	  (match_operand:VU 2 "vect_par_constant_low" ""))))]
   "TARGET_NEON"
-  "vmovl.<US><V_sz_elem> %q0, %e1"
-  [(set_attr "neon_type" "neon_shift_1")]
+{
+  if (BYTES_BIG_ENDIAN)
+    return "vmovl.<US><V_sz_elem>\t%q0, %f1\;vext.64\t%q0, %q0, %q0, #1";
+  else
+    return "vmovl.<US><V_sz_elem>\t%q0, %e1";
+}
+  [(set_attr "neon_type" "neon_shift_1")
+   (set_attr "length" "8")]
 )

 (define_insn "neon_vec_unpack<US>_hi_<mode>"
@@ -5573,8 +5602,14 @@
 	  (match_operand:VU 1 "register_operand" "w")
 	  (match_operand:VU 2 "vect_par_constant_high" ""))))]
   "TARGET_NEON"
-  "vmovl.<US><V_sz_elem> %q0, %f1"
-  [(set_attr "neon_type" "neon_shift_1")]
+{
+  if (BYTES_BIG_ENDIAN)
+    return "vmovl.<US><V_sz_elem>\t%q0, %e1\;vext.64\t%q0, %q0, %q0, #1";
+  else
+    return "vmovl.<US><V_sz_elem>\t%q0, %f1";
+}
+  [(set_attr "neon_type" "neon_shift_1")
+   (set_attr "length" "8")]
 )

 (define_expand "vec_unpack<US>_hi_<mode>"
@@ -5623,8 +5658,14 @@
 	    (match_operand:VU 3 "register_operand" "w")
 	    (match_dup 2)))))]
   "TARGET_NEON"
-  "vmull.<US><V_sz_elem> %q0, %e1, %e3"
-  [(set_attr "neon_type" "neon_shift_1")]
+{
+  if (BYTES_BIG_ENDIAN)
+    return "vmull.<US><V_sz_elem>\t%q0, %f1, %f3\;vext.64\t%q0, %q0, %q0, #1";
+  else
+    return "vmull.<US><V_sz_elem>\t%q0, %e1, %e3";
+}
+  [(set_attr "neon_type" "neon_shift_1")
+   (set_attr "length" "8")]
 )

 (define_expand "vec_widen_<US>mult_lo_<mode>"
@@ -5657,8 +5698,14 @@
 	    (match_operand:VU 3 "register_operand" "w")
 	    (match_dup 2)))))]
   "TARGET_NEON"
-  "vmull.<US><V_sz_elem> %q0, %f1, %f3"
-  [(set_attr "neon_type" "neon_shift_1")]
+{
+  if (BYTES_BIG_ENDIAN)
+    return "vmull.<US><V_sz_elem>\t%q0, %e1, %e3\;vext.64\t%q0, %q0, %q0, #1";
+  else
+    return "vmull.<US><V_sz_elem>\t%q0, %f1, %f3";
+}
+  [(set_attr "neon_type" "neon_shift_1")
+   (set_attr "length" "8")]
 )

 (define_expand "vec_widen_<US>mult_hi_<mode>"
@@ -5759,24 +5806,55 @@
 }
 )

-; FIXME: These instruction patterns can't be used safely in big-endian mode
-; because the ordering of vector elements in Q registers is different from what
-; the semantics of the instructions require.
-
-(define_insn "vec_pack_trunc_<mode>"
-  [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
+(define_expand "vec_pack_trunc_<mode>"
+  [(set (match_operand:<V_narrow_pack> 0 "register_operand" "")
 	(vec_concat:<V_narrow_pack>
 	 (truncate:<V_narrow>
-	   (match_operand:VN 1 "register_operand" "w"))
+	   (match_operand:VN 1 "register_operand" ""))
 	 (truncate:<V_narrow>
-	   (match_operand:VN 2 "register_operand" "w"))))]
-  "TARGET_NEON && !BYTES_BIG_ENDIAN"
-  "vmovn.i<V_sz_elem>\t%e0, %q1\;vmovn.i<V_sz_elem>\t%f0, %q2"
-  [(set_attr "neon_type" "neon_shift_1")
-   (set_attr "length" "8")]
-)
+	   (match_operand:VN 2 "register_operand" ""))))]
+  "TARGET_NEON"
+{
+  rtx tmp;
+  rtx destlo = simplify_gen_subreg (<V_narrow>mode, operands[0],
+				    <V_narrow_pack>mode, 0);
+  rtx desthi = simplify_gen_subreg (<V_narrow>mode, operands[0],
+				    <V_narrow_pack>mode, 8);
+
+  /* Operand 0 may overlap with operand 2, and the former is clobbered early.
+     Make a copy of the latter if necessary.  */
+
+  if (reg_overlap_mentioned_p (operands[0], operands[2]))
+    tmp = copy_to_reg (operands[2]);
+  else
+    tmp = operands[2];
+
+  if (BYTES_BIG_ENDIAN)
+    {
+      rtx rev = gen_reg_rtx (V4SImode);
+
+      emit_insn (gen_neon_vmovn<mode> (destlo, operands[1], const0_rtx));
+      emit_insn (gen_neon_vmovn<mode> (desthi, tmp, const0_rtx));
+      neon_reinterpret (rev, operands[0]);
+      emit_insn (gen_neon_vrev64v4si (rev, rev, const0_rtx));
+      neon_reinterpret (operands[0], rev);
+    }
+  else
+    {
+      emit_insn (gen_neon_vec_pack_trunc_<mode> (destlo, operands[1]));
+      emit_insn (gen_neon_vec_pack_trunc_<mode> (desthi, tmp));
+    }
+
+  DONE;
+})

 ; For the non-quad case.
+
+; Note that the "truncate" opcode shouldn't be used in big-endian mode, since
+; the relative orderings of quad-word and double-word vectors differ: use of
+; truncate would not imply the necessary permutation.  Use the "opaque"
+; neon_vmovn<mode> pattern in big-endian mode.
+
 (define_insn "neon_vec_pack_trunc_<mode>"
   [(set (match_operand:<V_narrow> 0 "register_operand" "=w")
 	(truncate:<V_narrow> (match_operand:VN 1 "register_operand" "w")))]
@@ -5789,12 +5867,23 @@
   [(match_operand:<V_narrow_pack> 0 "register_operand" "")
    (match_operand:VSHFT 1 "register_operand" "")
    (match_operand:VSHFT 2 "register_operand" "")]
-  "TARGET_NEON && !BYTES_BIG_ENDIAN"
+  "TARGET_NEON"
 {
   rtx tempreg = gen_reg_rtx (<V_DOUBLE>mode);

-  emit_insn (gen_move_lo_quad_<V_double> (tempreg, operands[1]));
-  emit_insn (gen_move_hi_quad_<V_double> (tempreg, operands[2]));
-  emit_insn (gen_neon_vec_pack_trunc_<V_double> (operands[0], tempreg));
+  if (BYTES_BIG_ENDIAN)
+    {
+      /* An open-coded "permute", to switch the order of the hi/lo
+	 double-words constituting the input operands of the
+	 instruction.  */
+      emit_insn (gen_move_lo_quad_<V_double> (tempreg, operands[2]));
+      emit_insn (gen_move_hi_quad_<V_double> (tempreg, operands[1]));
+      emit_insn (gen_neon_vmovn<V_double> (operands[0], tempreg, const0_rtx));
+    }
+  else
+    {
+      emit_insn (gen_move_lo_quad_<V_double> (tempreg, operands[1]));
+      emit_insn (gen_move_hi_quad_<V_double> (tempreg, operands[2]));
+      emit_insn (gen_neon_vec_pack_trunc_<V_double> (operands[0], tempreg));
+    }

   DONE;
 })
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain