Re: [PATCH 1v2/3][vect] Add main vectorized loop unrolling

2021-10-20 Thread Andre Vieira (lists) via Gcc-patches

On 15/10/2021 09:48, Richard Biener wrote:

On Tue, 12 Oct 2021, Andre Vieira (lists) wrote:


Hi Richi,

I think this is what you meant, I now hide all the unrolling cost calculations
in the existing target hooks for costs. I did need to adjust 'finish_cost' to
take the loop_vinfo so the target's implementations are able to set the newly
renamed 'suggested_unroll_factor'.

Also added the checks for the epilogue's VF.

Is this more like what you had in mind?

Not exactly (sorry..).  For the target hook I think we don't want to
pass vec_info but instead another output parameter like the existing
ones.

vect_estimate_min_profitable_iters should then via
vect_analyze_loop_costing and vect_analyze_loop_2 report the unroll
suggestion to vect_analyze_loop which should then, if the suggestion
was > 1, instead of iterating to the next vector mode run again
with a fixed VF (old VF times suggested unroll factor - there's
min_vf in vect_analyze_loop_2 which we should adjust to
the old VF times two for example and maybe store the suggested
factor as hint) - if it succeeds the result will end up in the
list of considered modes (where we now may have more than one
entry for the same mode but a different VF), we probably want to
only consider more unrolling once.

For simplicity I'd probably set min_vf = max_vf = old VF * suggested
factor, thus take the targets request literally.

Richard.
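
For illustration, the retry described above could look roughly like this
inside vect_analyze_loop (a simplified, hypothetical fragment; variable
names invented):

  if (suggested_unroll_factor > 1)
    {
      /* Take the target's request literally: pin the VF for the
         re-analysis to the old VF times the suggested factor.  */
      poly_uint64 fixed_vf
        = LOOP_VINFO_VECT_FACTOR (loop_vinfo) * suggested_unroll_factor;
      min_vf = fixed_vf;
      max_vf = fixed_vf;
      /* Re-run the analysis with the same vector mode; on success the
         result joins the list of considered modes, now with a larger VF.
         Only consider unrolling once.  */
    }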


Hi,

I now pass an output parameter to finish_cost and route it through the 
various calls up to vect_analyze_loop.  I tried to rework 
vect_determine_vectorization_factor and noticed that merely setting 
min_vf and max_vf is not enough: we only use these to check whether the 
vectorization factor is within range, and at that stage we actually only 
use max_vf; 'min_vf' only seems to be used to make sure the 
data_references are valid.  I am not sure my changes are the most 
appropriate here.  For instance, I am pretty sure the checks for max and 
min vf I added in vect_determine_vectorization_factor are currently 
superfluous, as they will pass by design, but I thought they might be 
good future-proofing.


I also changed how we compare against max_vf: rather than relying on 
'MAX_VECTORIZATION', I decided to use estimated_poly_value with 
POLY_VALUE_MAX, to be able to bound it further in case we have knowledge 
of the VL.  I am not entirely sure about the validity of this change; 
maybe we are better off keeping the MAX_VECTORIZATION check in place and 
not making any changes to max_vf for unrolling.
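
Roughly, the idea is the following (a simplified sketch of the intent, not 
the actual hunk):

  /* Bound the unrolled VF by the target's estimate of the largest
     runtime VF rather than by the generic MAX_VECTORIZATION_FACTOR.  */
  if (estimated_poly_value (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
                            * suggested_unroll_factor,
                            POLY_VALUE_MAX) > (HOST_WIDE_INT) max_vf)
    suggested_unroll_factor = 1;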


What do you think?
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
36519ccc5a58abab483c38d0a6c5f039592bfc7f..9b1e01e9b62050d7e34bc55454771e40bdbdb4cb
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -15972,8 +15972,8 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, 
unsigned int body_cost)
 
 /* Implement TARGET_VECTORIZE_FINISH_COST.  */
 static void
-aarch64_finish_cost (void *data, unsigned *prologue_cost,
-unsigned *body_cost, unsigned *epilogue_cost)
+aarch64_finish_cost (void *data, unsigned *prologue_cost, unsigned *body_cost,
+unsigned *epilogue_cost, unsigned *suggested_unroll_factor)
 {
  auto *costs = static_cast<aarch64_vector_costs *> (data);
   *prologue_cost = costs->region[vect_prologue];
@@ -15984,6 +15984,9 @@ aarch64_finish_cost (void *data, unsigned 
*prologue_cost,
   && costs->vec_flags
   && aarch64_use_new_vector_costs_p ())
 *body_cost = aarch64_adjust_body_cost (costs, *body_cost);
+
+  if(suggested_unroll_factor)
+*suggested_unroll_factor = 1;
 }
 
 /* Implement TARGET_VECTORIZE_DESTROY_COST_DATA.  */
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 
afc2674d49da370ae0f5ef277df7e9954f303b8e..a48e43879512793907fef946c1575c3ed7f68092
 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -23048,13 +23048,15 @@ ix86_add_stmt_cost (class vec_info *vinfo, void 
*data, int count,
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
-ix86_finish_cost (void *data, unsigned *prologue_cost,
- unsigned *body_cost, unsigned *epilogue_cost)
+ix86_finish_cost (void *data, unsigned *prologue_cost, unsigned *body_cost,
+ unsigned *epilogue_cost, unsigned *suggested_unroll_factor)
 {
   unsigned *cost = (unsigned *) data;
   *prologue_cost = cost[vect_prologue];
   *body_cost = cost[vect_body];
   *epilogue_cost = cost[vect_epilogue];
+  if (suggested_unroll_factor)
+*suggested_unroll_factor = 1;
 }
 
 /* Implement targetm.vectorize.destroy_cost_data.  */
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 
ad81dfb316dff00cde810d6b1edd31fa49d5c1e8..59d30ad6fcd1758383c52e34a0f90a126c501ec3
 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5551,8 +5551,8 @@ rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
 /* Implement targetm.vectorize.finish_cost.  */
 
 static

Re: FW: [PING] Re: [Patch][GCC][middle-end] - Generate FRINTZ for (double)(int) under -ffast-math on aarch64

2021-10-20 Thread Andre Vieira (lists) via Gcc-patches


On 19/10/2021 00:22, Joseph Myers wrote:

On Fri, 15 Oct 2021, Richard Biener via Gcc-patches wrote:


On Fri, Sep 24, 2021 at 2:59 PM Jirui Wu via Gcc-patches wrote:

Hi,

Ping: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577846.html

The patch is attached as text for ease of use. Is there anything that needs to 
change?

Ok for master? If OK, can it be committed for me, I have no commit rights.

I'm still not sure about the correctness.  I suppose the
flag_fp_int_builtin_inexact && !flag_trapping_math is supposed to guard
against spurious inexact exceptions, shouldn't that be
!flag_fp_int_builtin_inexact || !flag_trapping_math instead?

The following remarks may be relevant here, but are not intended as an
assertion of what is correct in this case.

1. flag_fp_int_builtin_inexact is the more permissive case ("inexact" may
or may not be raised).  All existing uses in back ends are
"flag_fp_int_builtin_inexact || !flag_trapping_math" or equivalent.

2. flag_fp_int_builtin_inexact only applies to certain built-in functions
(as listed in invoke.texi).  It's always unspecified, even in C2X, whether
casts of non-integer values from floating-point to integer types raise
"inexact".  So flag_fp_int_builtin_inexact should not be checked in insn
patterns corresponding to simple casts from floating-point to integer,
only in insn patterns corresponding to the built-in functions listed for
-fno-fp-int-builtin-inexact in invoke.texi (or for operations that combine
such a built-in function with a cast of the *result* to integer type).
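
A small illustration of the distinction (not part of the patch):

  /* Plain cast: whether "inexact" is raised is unspecified, so
     -f(no-)fp-int-builtin-inexact places no requirement here.  */
  int cast_only (double d) { return (int) d; }

  /* The option only governs the listed built-in functions (and insn
     patterns that also fold a cast of their result), e.g.:  */
  double via_builtin (double d) { return __builtin_trunc (d); }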

Hi,

I agree with Joseph: I don't think we should be checking 
flag_fp_int_builtin_inexact here, because we aren't transforming the math 
function 'trunc' but rather a piece of C code that has trunc-like 
semantics.


As for flag_trapping_math, its definition says 'Assume floating point 
operations can trap'.  I assume IFN_TRUNC itself would not trap, since I 
don't think it preserves the overflow behaviour of the intermediate cast 
in the cases where the FP value is bigger than the intermediate integer 
type's range.  So I think we should prevent the transformation if we are 
assuming the FP instructions can trap.


If we don't assume the FP instructions can trap, then I think it's fine 
to ignore the overflow as this behavior is undefined in C.
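
For example (illustrative, not part of the patch):

  double
  f (double x)
  {
    /* For x = 1.0e10 the intermediate cast alone is undefined behaviour
       in C (the value does not fit in int), while trunc (x) would simply
       return 1.0e10.  Under -ffast-math (which implies
       -fno-trapping-math) the proposed match.pd rule may fold this to
       IFN_TRUNC when the target supports it, e.g. frintz on aarch64.  */
    return (double) (int) x;
  }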


I also changed the comment; it is slightly different from your 
suggestion, Richard, in an attempt to be more generic.  Do you still have 
concerns regarding the checks?


Kind regards,
Andrediff --git a/gcc/match.pd b/gcc/match.pd
index 
3ff15bc0de5aba45ade94ca6e47e01fad9a2a314..5bed2e12715ea213813ef8b84fd420475b04d201
 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3606,6 +3606,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 >= inside_prec - !inside_unsignedp)
  (convert @0)))
 
+/* (float_type)(integer_type) x -> trunc (x) if the type of x matches
+   float_type.  Only do the transformation if we do not need to preserve
+   trapping behaviour, so require !flag_trapping_math. */
+#if GIMPLE
+(simplify
+   (float (fix_trunc @0))
+   (if (!flag_trapping_math
+   && types_match (type, TREE_TYPE (@0))
+   && direct_internal_fn_supported_p (IFN_TRUNC, type,
+ OPTIMIZE_FOR_BOTH))
+  (IFN_TRUNC @0)))
+#endif
+
 /* If we have a narrowing conversion to an integral type that is fed by a
BIT_AND_EXPR, we might be able to remove the BIT_AND_EXPR if it merely
masks off bits outside the final type (and nothing else).  */
diff --git a/gcc/testsuite/gcc.target/aarch64/merge_trunc1.c 
b/gcc/testsuite/gcc.target/aarch64/merge_trunc1.c
new file mode 100644
index 
..07217064e2ba54fcf4f5edc440e6ec19ddae66e1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/merge_trunc1.c
@@ -0,0 +1,41 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ffast-math" } */
+
+float
+f1 (float x)
+{
+  int y = x;
+
+  return (float) y;
+}
+
+double
+f2 (double x)
+{
+  long y = x;
+
+  return (double) y;
+}
+
+float
+f3 (double x)
+{
+  int y = x;
+
+  return (float) y;
+}
+
+double
+f4 (float x)
+{
+  int y = x;
+
+  return (double) y;
+}
+
+/* { dg-final { scan-assembler "frintz\\ts\[0-9\]+, s\[0-9\]+" } } */
+/* { dg-final { scan-assembler "frintz\\td\[0-9\]+, d\[0-9\]+" } } */
+/* { dg-final { scan-assembler "fcvtzs\\tw\[0-9\]+, d\[0-9\]+" } } */
+/* { dg-final { scan-assembler "scvtf\\ts\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { scan-assembler "fcvtzs\\tw\[0-9\]+, s\[0-9\]+" } } */
+/* { dg-final { scan-assembler "scvtf\\td\[0-9\]+, w\[0-9\]+" } } */


Re: [Patch][GCC][middle-end] - Lower store and load neon builtins to gimple

2021-10-20 Thread Andre Vieira (lists) via Gcc-patches

On 27/09/2021 12:54, Richard Biener via Gcc-patches wrote:

On Mon, 27 Sep 2021, Jirui Wu wrote:


Hi all,

I now use the type based on the specification of the intrinsic
instead of the type based on the formal argument.

I use signed int vector types because the outputs of the neon builtins
that I am lowering are always signed.  In addition, fcode and stmt
do not have information on whether the result is signed.

Because I am replacing the stmt with new_stmt,
a VIEW_CONVERT_EXPR cast is already in the code if needed.
As a result, the resulting assembly code is correct.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master? If OK can it be committed for me, I have no commit rights.

+   tree temp_lhs = gimple_call_lhs (stmt);
+   aarch64_simd_type_info simd_type
+ = aarch64_simd_types[mem_type];
+   tree elt_ptr_type = build_pointer_type (simd_type.eltype);
+   tree zero = build_zero_cst (elt_ptr_type);
+   gimple_seq stmts = NULL;
+   tree base = gimple_convert (&stmts, elt_ptr_type,
+   args[0]);
+   new_stmt = gimple_build_assign (temp_lhs,
+fold_build2 (MEM_REF,
+TREE_TYPE (temp_lhs),
+base,
+zero));

this now uses the alignment info as on the LHS of the call by using
TREE_TYPE (temp_lhs) as type of the MEM_REF.  So for example

  typedef int foo __attribute__((vector_size(N),aligned(256)));

  foo tem = ld1 (ptr);

will now access *ptr as if it were aligned to 256 bytes.  But I'm sure
the ld1 intrinsic documents the required alignment (either it's the
natural alignment of the vector type loaded or element alignment?).

For element alignment you'd do sth like

   tree access_type = build_aligned_type (vector_type, TYPE_ALIGN
(TREE_TYPE (vector_type)));

for example.

Richard.

Hi,

I'm taking over this patch from Jirui.

I've decided to use the vector type stored in aarch64_simd_type_info, 
since that should always have the correct alignment.
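
Roughly, that amounts to something like this (an illustrative sketch; the 
field access is from memory rather than copied from the patch):

  /* Build the MEM_REF with the intrinsic's own vector type so the access
     uses that type's alignment, not the (possibly over-aligned) type of
     the call's LHS.  */
  tree access_type = aarch64_simd_types[mem_type].itype;
  new_stmt
    = gimple_build_assign (gimple_call_lhs (stmt),
                           fold_build2 (MEM_REF, access_type, base, zero));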


To be fair, though, I do wonder whether this is actually needed right 
now, since the way we cast the inputs and outputs of these __builtins in 
arm_neon.h prevents these issues, I think; but it is more future-proof.  
You could also argue that people could use the __builtins directly, 
though I'd think that would be at their own risk.


Is this OK?

Kind regards,
Andrediff --git a/gcc/config/aarch64/aarch64-builtins.c 
b/gcc/config/aarch64/aarch64-builtins.c
index 
1a507ea59142d0b5977b0167abfe9a58a567adf7..a815e4cfbccab692ca688ba87c71b06c304abbfb
 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -46,6 +46,7 @@
 #include "emit-rtl.h"
 #include "stringpool.h"
 #include "attribs.h"
+#include "gimple-fold.h"
 
 #define v8qi_UP  E_V8QImode
 #define v4hi_UP  E_V4HImode
@@ -2399,11 +2400,65 @@ aarch64_general_fold_builtin (unsigned int fcode, tree 
type,
   return NULL_TREE;
 }
 
+enum aarch64_simd_type
+get_mem_type_for_load_store (unsigned int fcode)
+{
+  switch (fcode)
+  {
+VAR1 (LOAD1, ld1 , 0, LOAD, v8qi)
+VAR1 (STORE1, st1 , 0, STORE, v8qi)
+  return Int8x8_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v16qi)
+VAR1 (STORE1, st1 , 0, STORE, v16qi)
+  return Int8x16_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v4hi)
+VAR1 (STORE1, st1 , 0, STORE, v4hi)
+  return Int16x4_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v8hi)
+VAR1 (STORE1, st1 , 0, STORE, v8hi)
+  return Int16x8_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v2si)
+VAR1 (STORE1, st1 , 0, STORE, v2si)
+  return Int32x2_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v4si)
+VAR1 (STORE1, st1 , 0, STORE, v4si)
+  return Int32x4_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v2di)
+VAR1 (STORE1, st1 , 0, STORE, v2di)
+  return Int64x2_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v4hf)
+VAR1 (STORE1, st1 , 0, STORE, v4hf)
+  return Float16x4_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v8hf)
+VAR1 (STORE1, st1 , 0, STORE, v8hf)
+  return Float16x8_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v4bf)
+VAR1 (STORE1, st1 , 0, STORE, v4bf)
+  return Bfloat16x4_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v8bf)
+VAR1 (STORE1, st1 , 0, STORE, v8bf)
+  return Bfloat16x8_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v2sf)
+VAR1 (STORE1, st1 , 0, STORE, v2sf)
+  return Float32x2_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v4sf)
+VAR1 (STORE1, st1 , 0, STORE, v4sf)
+  return Float32x4_t;
+VAR1 (LOAD1, ld1 , 0, LOAD, v2df)
+VAR1 (STORE1, st1 , 0, STORE, v2df)
+  return Float64x2_t;
+default:
+  gcc_unreachable ();
+  break;
+  }
+}
+
 /* Try to fold STMT, given that it's a call to the built-in function with
subcode FCODE.  Return the new statement on success and null on
failure.  */
 gimple *
-aarch64_general_gimple_fold_builtin (unsigned int fcode, gcall *stmt)

Re: [PATCH 2/3][vect] Consider outside costs earlier for epilogue loops

2021-10-14 Thread Andre Vieira (lists) via Gcc-patches

Hi,

I completely forgot I still had this patch out as well.  I grouped it 
together with the unrolling because that was what motivated the change, 
but it is actually more widely applicable and can be reviewed separately.


On 17/09/2021 16:32, Andre Vieira (lists) via Gcc-patches wrote:

Hi,

This patch changes the order in which we check outside and inside costs 
for epilogue loops.  This is to ensure that a predicated epilogue is more 
likely to be picked over an unpredicated one, since it saves having to 
enter a scalar epilogue loop.


gcc/ChangeLog:

    * tree-vect-loop.c (vect_better_loop_vinfo_p): Change how 
epilogue loop costs are compared.


Re: [arm] Fix MVE addressing modes for VLDR[BHW] and VSTR[BHW]

2021-10-13 Thread Andre Vieira (lists) via Gcc-patches



On 13/10/2021 13:37, Kyrylo Tkachov wrote:

Hi Andre,


@@ -24276,7 +24271,7 @@ arm_print_operand (FILE *stream, rtx x, int code)
else if (code == POST_MODIFY || code == PRE_MODIFY)
  {
asm_fprintf (stream, "[%r", REGNO (XEXP (addr, 0)));
-   postinc_reg = XEXP ( XEXP (x, 1), 1);
+   postinc_reg = XEXP (XEXP (addr, 1), 1);
if (postinc_reg && CONST_INT_P (postinc_reg))
  {
if (code == POST_MODIFY)

this looks like a bug fix that should be separately backported to the branches?
Otherwise, the patch looks ok for trunk to me.
Thanks,
Kyrill

Normally I'd agree with you, but this is specific to the 'E' handling, 
which is MVE-only, and I am pretty sure the existing code would never 
accept POST/PRE_MODIFY codes, so this issue will never trigger before my 
patch.  So I'm not sure it's useful to backport a bugfix for a bug that 
won't trigger, unless we also backport the entire patch, but I suspect we 
don't want to do that?




[arm] Fix MVE addressing modes for VLDR[BHW] and VSTR[BHW]

2021-10-12 Thread Andre Vieira (lists) via Gcc-patches

Hi,

The way we were previously dealing with addressing modes for MVE was 
preventing the use of pre, post and offset addressing modes for the 
normal loads and stores, including widening and narrowing.  This patch 
fixes that and adds tests to ensure we are capable of using all the 
available addressing modes.
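
As an illustration (not one of the new tests), something like the 
following should now be able to use an immediate-offset form for the 
second load (e.g. "vldrb.8 q1, [r0, #16]") instead of a separate pointer 
add:

  #include <arm_mve.h>

  int8x16_t
  load_pair (const int8_t *p)
  {
    return vaddq_s8 (vld1q_s8 (p), vld1q_s8 (p + 16));
  }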

gcc/ChangeLog:
2021-10-12  Andre Vieira  

        * config/arm/arm.c (thumb2_legitimate_address_p): Use VALID_MVE_MODE
        when checking mve addressing modes.
        (mve_vector_mem_operand): Fix the way we handle pre, post and offset
        addressing modes.
        (arm_print_operand): Fix printing of POST_ and PRE_MODIFY.
        * config/arm/mve.md: Use mve_memory_operand predicate everywhere
        where there is a single Ux constraint.

gcc/testsuite/ChangeLog:
2021-10-12  Andre Vieira  

        * gcc.target/arm/mve/mve.exp: Make it test main directory.
        * gcc.target/arm/mve/mve_load_memory_modes.c: New test.
        * gcc.target/arm/mve/mve_store_memory_modes.c: New test.
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 
6c6e77fab666f4aeff023b1f949e3ca0a3545658..d921261633aeff4f92a2e1a6057b00b685dea892
 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -8530,8 +8530,7 @@ thumb2_legitimate_address_p (machine_mode mode, rtx x, 
int strict_p)
   bool use_ldrd;
   enum rtx_code code = GET_CODE (x);
 
-  if (TARGET_HAVE_MVE
-  && (mode == V8QImode || mode == E_V4QImode || mode == V4HImode))
+  if (TARGET_HAVE_MVE && VALID_MVE_MODE (mode))
 return mve_vector_mem_operand (mode, x, strict_p);
 
   if (arm_address_register_rtx_p (x, strict_p))
@@ -13433,53 +13432,49 @@ mve_vector_mem_operand (machine_mode mode, rtx op, 
bool strict)
   || code == PRE_INC || code == POST_DEC)
 {
   reg_no = REGNO (XEXP (op, 0));
-  return (((mode == E_V8QImode || mode == E_V4QImode || mode == E_V4HImode)
-  ? reg_no <= LAST_LO_REGNUM
-  :(reg_no < LAST_ARM_REGNUM && reg_no != SP_REGNUM))
- || (!strict && reg_no >= FIRST_PSEUDO_REGISTER));
-}
-  else if ((code == POST_MODIFY || code == PRE_MODIFY)
-  && GET_CODE (XEXP (op, 1)) == PLUS && REG_P (XEXP (XEXP (op, 1), 1)))
+  return ((mode == E_V8QImode || mode == E_V4QImode || mode == E_V4HImode)
+ ? reg_no <= LAST_LO_REGNUM
+ :(reg_no < LAST_ARM_REGNUM && reg_no != SP_REGNUM))
+   || reg_no >= FIRST_PSEUDO_REGISTER;
+}
+  else if (((code == POST_MODIFY || code == PRE_MODIFY)
+   && GET_CODE (XEXP (op, 1)) == PLUS
+   && XEXP (op, 0) == XEXP (XEXP (op, 1), 0)
+   && REG_P (XEXP (op, 0))
+   && GET_CODE (XEXP (XEXP (op, 1), 1)) == CONST_INT)
+  /* Make sure to only accept PLUS after reload_completed, otherwise
+ this will interfere with auto_inc's pattern detection.  */
+  || (reload_completed && code == PLUS && REG_P (XEXP (op, 0))
+  && GET_CODE (XEXP (op, 1)) == CONST_INT))
 {
   reg_no = REGNO (XEXP (op, 0));
-  val = INTVAL (XEXP ( XEXP (op, 1), 1));
+  if (code == PLUS)
+   val = INTVAL (XEXP (op, 1));
+  else
+   val = INTVAL (XEXP(XEXP (op, 1), 1));
+
   switch (mode)
{
  case E_V16QImode:
-   if (abs (val) <= 127)
- return ((reg_no < LAST_ARM_REGNUM && reg_no != SP_REGNUM)
- || (!strict && reg_no >= FIRST_PSEUDO_REGISTER));
-   return FALSE;
- case E_V8HImode:
- case E_V8HFmode:
-   if (abs (val) <= 255)
- return ((reg_no < LAST_ARM_REGNUM && reg_no != SP_REGNUM)
- || (!strict && reg_no >= FIRST_PSEUDO_REGISTER));
-   return FALSE;
  case E_V8QImode:
  case E_V4QImode:
if (abs (val) <= 127)
- return (reg_no <= LAST_LO_REGNUM
- || (!strict && reg_no >= FIRST_PSEUDO_REGISTER));
+ return (reg_no < LAST_ARM_REGNUM && reg_no != SP_REGNUM)
+   || reg_no >= FIRST_PSEUDO_REGISTER;
return FALSE;
+ case E_V8HImode:
+ case E_V8HFmode:
  case E_V4HImode:
  case E_V4HFmode:
if (val % 2 == 0 && abs (val) <= 254)
- return (reg_no <= LAST_LO_REGNUM
- || (!strict && reg_no >= FIRST_PSEUDO_REGISTER));
+ return reg_no <= LAST_LO_REGNUM
+   || reg_no >= FIRST_PSEUDO_REGISTER;
return FALSE;
  case E_V4SImode:
  case E_V4SFmode:
if (val % 4 == 0 && abs (val) <= 508)
- return ((reg_no < LAST_ARM_REGNUM && reg_no != SP_REGNUM)
- || (!strict && reg_no >= FIRST_PSEUDO_REGISTER));
- 

Re: [PATCH 1v2/3][vect] Add main vectorized loop unrolling

2021-10-12 Thread Andre Vieira (lists) via Gcc-patches

Hi Richi,

I think this is what you meant: I now hide all the unrolling cost 
calculations in the existing target hooks for costs.  I did need to 
adjust 'finish_cost' to take the loop_vinfo so the target's 
implementations are able to set the newly renamed 'suggested_unroll_factor'.


Also added the checks for the epilogue's VF.

Is this more like what you had in mind?


gcc/ChangeLog:

        * config/aarch64/aarch64.c (aarch64_finish_cost): Add class vec_info
        parameter.
        * config/i386/i386.c (ix86_finish_cost): Likewise.
        * config/rs6000/rs6000.c (rs6000_finish_cost): Likewise.
        * doc/tm.texi: Document changes to TARGET_VECTORIZE_FINISH_COST.
        * target.def: Add class vec_info parameter to finish_cost.
        * targhooks.c (default_finish_cost): Likewise.
        * targhooks.h (default_finish_cost): Likewise.
        * tree-vect-loop.c (vect_determine_vectorization_factor): Use
        suggested_unroll_factor to increase vectorization_factor if possible.
        (_loop_vec_info::_loop_vec_info): Add suggested_unroll_factor member.
        (vect_compute_single_scalar_iteration_cost): Adjust call to
        finish_cost.
        (vect_determine_partial_vectors_and_peeling): Ensure unrolled loop
        is not predicated.
        (vect_determine_unroll_factor): New.
        (vect_try_unrolling): New.
        (vect_reanalyze_as_main_loop): Also try to unroll when reanalyzing
        as main loop.
        (vect_analyze_loop): Add call to vect_try_unrolling and check to
        ensure epilogue is either a smaller VF than main loop or uses
        partial vectors and might be of equal VF.
        (vect_estimate_min_profitable_iters): Adjust call to finish_cost.
        (vectorizable_reduction): Make sure to not use single_defuse_cycle
        when unrolling.
        * tree-vect-slp.c (vect_bb_vectorization_profitable_p): Adjust call
        to finish_cost.
        * tree-vectorizer.h (finish_cost): Change to pass new class vec_info
        parameter.


On 01/10/2021 09:19, Richard Biener wrote:

On Thu, 30 Sep 2021, Andre Vieira (lists) wrote:


Hi,



That just forces trying the vector modes we've tried before. Though I might
need to revisit this now I think about it. I'm afraid it might be possible
for
this to generate an epilogue with a vf that is not lower than that of the
main
loop, but I'd need to think about this again.

Either way I don't think this changes the vector modes used for the
epilogue.
But maybe I'm just missing your point here.

Yes, I was refering to the above which suggests that when we vectorize
the main loop with V4SF but unroll then we try vectorizing the
epilogue with V4SF as well (but not unrolled).  I think that's
premature (not sure if you try V8SF if the main loop was V4SF but
unrolled 4 times).

My main motivation for this was because I had a SVE loop that vectorized with
both VNx8HI, then V8HI which beat VNx8HI on cost, then it decided to unroll
V8HI by two and skipped using VNx8HI as a predicated epilogue which would've
been the best choice.

I see, yes - for fully predicated epilogues it makes sense to consider
the same vector mode as for the main loop anyways (independent on
whether we're unrolling or not).  One could argue that with an
unrolled V4SImode main loop a predicated V8SImode epilogue would also
be a good match (but then somehow costing favored the unrolled V4SI
over the V8SI for the main loop...).


So that is why I decided to just 'reset' the vector_mode selection. In a
scenario where you only have the traditional vector modes it might make less
sense.

Just realized I still didn't add any check to make sure the epilogue has a
lower VF than the previous loop, though I'm still not sure that could happen.
I'll go look at where to add that if you agree with this.

As said above, it only needs a lower VF in case the epilogue is not
fully masked - otherwise the same VF would be OK.


I can move it there, it would indeed remove the need for the change to
vect_update_vf_for_slp, the change to
vect_determine_partial_vectors_and_peeling would still be required I think.
It
is meant to disable using partial vectors in an unrolled loop.

Why would we disable the use of partial vectors in an unrolled loop?

The motivation behind that is that the overhead caused by generating
predicates for each iteration will likely be too much for it to be profitable
to unroll. On top of that, when dealing with low iteration count loops, if
executing one predicated iteration would be enough we now still need to
execute all other unrolled predicated iterations, whereas if we keep them
unrolled we skip the unrolled loops.

OK, I guess we're not factoring in costs when deciding on predication
but go for it if it's gernally enabled and possible.

With the proposed scheme we'd then cost the predicated not unrolled
loop against a not predicated unrolled loop which might be a bit
apples vs. oranges also because the target made the unroll decision
based on the data it collected for the predicated loop

Re: [PATCH 1v2/3][vect] Add main vectorized loop unrolling

2021-09-30 Thread Andre Vieira (lists) via Gcc-patches

Hi,



That just forces trying the vector modes we've tried before. Though I might
need to revisit this now I think about it. I'm afraid it might be possible for
this to generate an epilogue with a vf that is not lower than that of the main
loop, but I'd need to think about this again.

Either way I don't think this changes the vector modes used for the epilogue.
But maybe I'm just missing your point here.

Yes, I was refering to the above which suggests that when we vectorize
the main loop with V4SF but unroll then we try vectorizing the
epilogue with V4SF as well (but not unrolled).  I think that's
premature (not sure if you try V8SF if the main loop was V4SF but
unrolled 4 times).


My main motivation for this was that I had an SVE loop that vectorized 
with both VNx8HI and then V8HI, which beat VNx8HI on cost; it then decided 
to unroll V8HI by two and skipped using VNx8HI as a predicated epilogue, 
which would've been the best choice.


So that is why I decided to just 'reset' the vector_mode selection. In a 
scenario where you only have the traditional vector modes it might make 
less sense.


Just realized I still didn't add any check to make sure the epilogue has 
a lower VF than the previous loop, though I'm still not sure that could 
happen. I'll go look at where to add that if you agree with this.



I can move it there, it would indeed remove the need for the change to
vect_update_vf_for_slp, the change to
vect_determine_partial_vectors_and_peeling would still be required I think. It
is meant to disable using partial vectors in an unrolled loop.

Why would we disable the use of partial vectors in an unrolled loop?
The motivation behind that is that the overhead caused by generating 
predicates for each iteration will likely be too much for it to be 
profitable to unroll.  On top of that, when dealing with low iteration 
count loops, if executing one predicated iteration would be enough we now 
still need to execute all the other unrolled predicated iterations, 
whereas if we keep the unrolled loop unpredicated we simply skip it when 
the iteration count is too low.
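
To illustrate the concern with a hand-written ACLE sketch (shape only, 
not vectorizer output): a 2x-unrolled predicated loop computes and 
applies two predicates per iteration even when n is small enough that a 
single predicated vector operation would have been sufficient.

  #include <arm_sve.h>

  void
  scale (float *x, float s, unsigned long n)
  {
    for (unsigned long i = 0; i < n; i += 2 * svcntw ())
      {
        svbool_t p0 = svwhilelt_b32 (i, n);
        svbool_t p1 = svwhilelt_b32 (i + svcntw (), n); /* often all-false */
        svst1 (p0, x + i, svmul_x (p0, svld1 (p0, x + i), s));
        svst1 (p1, x + i + svcntw (),
               svmul_x (p1, svld1 (p1, x + i + svcntw ()), s));
      }
  }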

Sure but I'm suggesting you keep the not unrolled body as one way of
costed vectorization but then if the target says "try unrolling"
re-do the analysis with the same mode but a larger VF.  Just like
we iterate over vector modes you'll now iterate over pairs of
vector mode + VF (unroll factor).  It's not about re-using the costing
it's about using costing that is actually relevant and also to avoid
targets inventing two distinct separate costings - a target (powerpc)
might already compute load/store density and other stuff for the main
costing so it should have an idea whether doubling or triplicating is OK.

Richard.
Sounds good!  I changed the patch to determine the unrolling factor 
later, after all analysis has been done, and to retry the analysis if an 
unrolling factor larger than 1 has been chosen for this loop and 
vector_mode.


gcc/ChangeLog:

        * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR.
        * doc/tm.texi.in: Add entries for TARGET_VECTORIZE_UNROLL_FACTOR.
        * params.opt: Add vect-unroll and vect-unroll-reductions parameters.
        * target.def: Define hook TARGET_VECTORIZE_UNROLL_FACTOR.
        * targhooks.c (default_unroll_factor): New.
        * targhooks.h (default_unroll_factor): Likewise.
        * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
        par_unrolling_factor.
        (vect_determine_partial_vectors_and_peeling): Account for unrolling.
        (vect_determine_unroll_factor): New.
        (vect_try_unrolling): New.
        (vect_reanalyze_as_main_loop): Call vect_try_unrolling when retrying
        a loop_vinfo as a main loop.
        (vect_analyze_loop): Call vect_try_unrolling when vectorizing main
        loops.
        (vect_analyze_loop): Allow for epilogue vectorization when unrolling
        and rewalk vector_mode array for the epilogues.
        (vectorizable_reduction): Disable single_defuse_cycle when unrolling.
        * tree-vectorizer.h (vect_unroll_value): Declare par_unrolling_factor
        as a member of loop_vec_info.
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 
be8148583d8571b0d035b1938db9d056bfd213a8..71ee33a200fcbd37ccd5380321df507ae1e8961f
 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6289,6 +6289,12 @@ allocated by TARGET_VECTORIZE_INIT_COST.  The default 
releases the
 accumulator.
 @end deftypefn
 
+@deftypefn {Target Hook} unsigned TARGET_VECTORIZE_UNROLL_FACTOR (class 
vec_info *@var{vinfo})
+This hook should return the desired vector unrolling factor for a loop with
+@var{vinfo}. The default returns one, which means no unrolling will be
+performed.
+@end deftypefn
+
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_GATHER (const_tree 
@var{mem_vectype}, const_tree @var{index_type}, int @var{scale})
 Target builtin that implements vector gather operation.  @var{mem_vectype}
 is the vector type of the load and @var{index_type} is scalar type of
diff --git 

Re: [PATCH 1/3][vect] Add main vectorized loop unrolling

2021-09-21 Thread Andre Vieira (lists) via Gcc-patches

Hi Richi,

Thanks for the review, see below some questions.

On 21/09/2021 13:30, Richard Biener wrote:

On Fri, 17 Sep 2021, Andre Vieira (lists) wrote:


Hi all,

This patch adds the ability to define a target hook to unroll the main
vectorized loop. It also introduces --param's vect-unroll and
vect-unroll-reductions to control this through a command-line. I found this
useful to experiment and believe can help when tuning, so I decided to leave
it in.
We only unroll the main loop and have disabled unrolling epilogues for now. We
also do not support unrolling of any loop that has a negative step and we do
not support unrolling a loop with any reduction other than a
TREE_CODE_REDUCTION.

Bootstrapped and regression tested on aarch64-linux-gnu as part of the series.

I wonder why we want to change the vector modes used for the epilogue,
we're either making it more likely to need to fall through to the
scalar epilogue or require another vectorized epilogue.
I don't quite understand what you mean by changing the vector modes for 
the epilogue; I don't think we do.

If you are referring to:
      /* If we are unrolling, try all VECTOR_MODES for the epilogue.  */
      if (loop_vinfo->par_unrolling_factor > 1)
        {
      next_vector_mode = vector_modes[0];
      mode_i = 1;

      if (dump_enabled_p ())
        dump_printf_loc (MSG_NOTE, vect_location,
                 "* Re-trying analysis with vector mode"
                 " %s for epilogue with partial vectors.\n",
                 GET_MODE_NAME (next_vector_mode));
      continue;
        }

That just forces trying the vector modes we've tried before. Though I 
might need to revisit this now I think about it. I'm afraid it might be 
possible for this to generate an epilogue with a vf that is not lower 
than that of the main loop, but I'd need to think about this again.


Either way I don't think this changes the vector modes used for the 
epilogue. But maybe I'm just missing your point here.

That said, for simplicity I'd only change the VF of the main loop.

There I wonder why you need to change vect_update_vf_for_slp or
vect_determine_partial_vectors_and_peeling and why it's not enough
to adjust the VF in a single place, I'd do that here:

   /* This is the point where we can re-start analysis with SLP forced off.
*/
start_over:

   /* Now the vectorization factor is final.  */
   poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   gcc_assert (known_ne (vectorization_factor, 0U));

>  call vect_update_vf_for_unroll ()
I can move it there; it would indeed remove the need for the change to 
vect_update_vf_for_slp.  The change to 
vect_determine_partial_vectors_and_peeling would still be required, I 
think; it is meant to disable using partial vectors in an unrolled loop.

note there's also loop->unroll (from #pragma GCC unroll) which we
could include in what you look at in vect_unroll_value.

I don't like add_stmt_cost_for_unroll - how should a target go
and decide based on what it is fed?  You could as well feed it
the scalar body or the vinfo so it can get a shot at all
the vectorizers meta data - but feeding it individual stmt_infos
does not add any meaningful abstraction and thus what's the
point?
I am still working on tuning our backend hook, but the way it works is 
that it estimates how many load, store and general ops the vectorized 
loop will require, based on these statement infos.
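
Very roughly, the kind of bookkeeping involved looks like this (an 
illustrative sketch with invented names, not the actual hook):

  /* Accumulate per-statement operation counts while costing; the target
     later derives an unroll factor from them (e.g. comparing load/store
     density against issue width).  */
  struct unroll_info
  {
    unsigned loads, stores, general_ops;
  };

  static void
  count_stmt_for_unroll (stmt_vec_info stmt_info, unroll_info *ui)
  {
    if (STMT_VINFO_DATA_REF (stmt_info))
      {
        if (DR_IS_READ (STMT_VINFO_DATA_REF (stmt_info)))
          ui->loads++;
        else
          ui->stores++;
      }
    else
      ui->general_ops++;
  }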

I _think_ what would make some sense is when we actually cost
the vector body (with the not unrolled VF) ask the target
"well, how about unrolling this?" because there it has the
chance to look at the actual vector stmts produced (in "cost form").
And if the target answers "yeah - go ahead and try x4" we signal
that to the iteration and have "mode N with x4 unroll" validated and
costed.

So instead of a new target hook amend the finish_cost hook to
produce a suggested unroll value and cost both the unrolled and
not unrolled body.

Sorry for steering in a different direction ;)
The reason we decided to do this early and not after costing is that 
'vect_prune_runtime_alias_test_list' and 
'vect_enhance_data_refs_alignment' require the VF, and if you suddenly 
raise it the alias analysis could become invalid.


An initial implementation did do it later, precisely so that we could 
reuse the cost calculations, and AArch64 already computed these 'ops' 
after Richard Sandiford's patches.

But yeah ... the above kinda led me to rewrite it this way.



Thanks,
Richard.




gcc/ChangeLog:

     * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR
     and TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL.
     * doc/tm.texi.in: Add entries for target hooks above.
     * params.opt: Add vect-unroll and vect-unroll-reductions
parameters.
     * target.def: Define hooks TARGET_VECTORIZE_UNROLL_FACTOR
     and TARGET_VECTORIZE_ADD_STMT_COST_FO

[PATCH 2/3][vect] Consider outside costs earlier for epilogue loops

2021-09-17 Thread Andre Vieira (lists) via Gcc-patches

Hi,

This patch changes the order in which we check outside and inside costs 
for epilogue loops.  This is to ensure that a predicated epilogue is more 
likely to be picked over an unpredicated one, since it saves having to 
enter a scalar epilogue loop.


gcc/ChangeLog:

    * tree-vect-loop.c (vect_better_loop_vinfo_p): Change how 
epilogue loop costs are compared.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 
14f8150d7c262b9422784e0e997ca4387664a20a..038af13a91d43c9f09186d042cf415020ea73a38
 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2881,17 +2881,75 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
return new_simdlen_p;
 }
 
+  loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
+  if (main_loop)
+{
+  poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
+  unsigned HOST_WIDE_INT main_vf;
+  unsigned HOST_WIDE_INT old_factor, new_factor, old_cost, new_cost;
+  /* If we can determine how many iterations are left for the epilogue
+loop, that is if both the main loop's vectorization factor and number
+of iterations are constant, then we use them to calculate the cost of
+the epilogue loop together with a 'likely value' for the epilogues
+vectorization factor.  Otherwise we use the main loop's vectorization
+factor and the maximum poly value for the epilogue's.  If the target
+has not provided with a sensible upper bound poly vectorization
+factors are likely to be favored over constant ones.  */
+  if (main_poly_vf.is_constant (_vf)
+ && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
+   {
+ unsigned HOST_WIDE_INT niters
+   = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
+ HOST_WIDE_INT old_likely_vf
+   = estimated_poly_value (old_vf, POLY_VALUE_LIKELY);
+ HOST_WIDE_INT new_likely_vf
+   = estimated_poly_value (new_vf, POLY_VALUE_LIKELY);
+
+ /* If the epilogue is using partial vectors we account for the
+partial iteration here too.  */
+ old_factor = niters / old_likely_vf;
+ if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo)
+ && niters % old_likely_vf != 0)
+   old_factor++;
+
+ new_factor = niters / new_likely_vf;
+ if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo)
+ && niters % new_likely_vf != 0)
+   new_factor++;
+   }
+  else
+   {
+ unsigned HOST_WIDE_INT main_vf_max
+   = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
+
+ old_factor = main_vf_max / estimated_poly_value (old_vf,
+  POLY_VALUE_MAX);
+ new_factor = main_vf_max / estimated_poly_value (new_vf,
+  POLY_VALUE_MAX);
+
+ /* If the loop is not using partial vectors then it will iterate one
+time less than one that does.  It is safe to subtract one here,
+because the main loop's vf is always at least 2x bigger than that
+of an epilogue.  */
+ if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo))
+   old_factor -= 1;
+ if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo))
+   new_factor -= 1;
+   }
+
+  /* Compute the costs by multiplying the inside costs with the factor and
+add the outside costs for a more complete picture.  The factor is the
+amount of times we are expecting to iterate this epilogue.  */
+  old_cost = old_loop_vinfo->vec_inside_cost * old_factor;
+  new_cost = new_loop_vinfo->vec_inside_cost * new_factor;
+  old_cost += old_loop_vinfo->vec_outside_cost;
+  new_cost += new_loop_vinfo->vec_outside_cost;
+  return new_cost < old_cost;
+}
+
   /* Limit the VFs to what is likely to be the maximum number of iterations,
  to handle cases in which at least one loop_vinfo is fully-masked.  */
-  HOST_WIDE_INT estimated_max_niter;
-  loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
-  unsigned HOST_WIDE_INT main_vf;
-  if (main_loop
-  && LOOP_VINFO_NITERS_KNOWN_P (main_loop)
-  && LOOP_VINFO_VECT_FACTOR (main_loop).is_constant (_vf))
-estimated_max_niter = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
-  else
-estimated_max_niter = likely_max_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
   if (estimated_max_niter != -1)
 {
   if (known_le (estimated_max_niter, new_vf))


[PATCH 1/3][vect] Add main vectorized loop unrolling

2021-09-17 Thread Andre Vieira (lists) via Gcc-patches

Hi all,

This patch adds the ability to define a target hook to unroll the main 
vectorized loop.  It also introduces the --param's vect-unroll and 
vect-unroll-reductions to control this through the command line.  I found 
this useful for experimenting and believe it can help when tuning, so I 
decided to leave it in.
We only unroll the main loop and have disabled unrolling epilogues for 
now.  We also do not support unrolling of any loop that has a negative 
step, and we do not support unrolling a loop with any reduction other 
than a TREE_CODE_REDUCTION.


Bootstrapped and regression tested on aarch64-linux-gnu as part of the 
series.


gcc/ChangeLog:

        * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR
        and TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL.
        * doc/tm.texi.in: Add entries for target hooks above.
        * params.opt: Add vect-unroll and vect-unroll-reductions parameters.
        * target.def: Define hooks TARGET_VECTORIZE_UNROLL_FACTOR
        and TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL.
        * targhooks.c (default_add_stmt_cost_for_unroll): New.
        (default_unroll_factor): Likewise.
        * targhooks.h (default_add_stmt_cost_for_unroll): Likewise.
        (default_unroll_factor): Likewise.
        * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
        par_unrolling_factor.
        (vect_update_vf_for_slp): Use unrolling factor to update
        vectorization factor.
        (vect_determine_partial_vectors_and_peeling): Account for unrolling.
        (vect_determine_unroll_factor): Determine how much to unroll
        vectorized main loop.
        (vect_analyze_loop_2): Call vect_determine_unroll_factor.
        (vect_analyze_loop): Allow for epilogue vectorization when unrolling
        and rewalk vector_mode array for the epilogues.
        (vectorizable_reduction): Disable single_defuse_cycle when unrolling.
        * tree-vectorizer.h (vect_unroll_value): Declare par_unrolling_factor
        as a member of loop_vec_info.
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 
f68f42638a112bed8396fd634bd3fd3c44ce848a..3bc9694d2162055d3db165ef888f35deb676548b
 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6283,6 +6283,19 @@ allocated by TARGET_VECTORIZE_INIT_COST.  The default 
releases the
 accumulator.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL (class 
vec_info *@var{vinfo}, class _stmt_vec_info *@var{stmt_info}, void *@var{data})
+This hook should update the target-specific @var{data} relative
+relative to the statement represented by @var{stmt_vinfo} to be used
+later to determine the unrolling factor for this loop using the current
+vectorization factor.
+@end deftypefn
+
+@deftypefn {Target Hook} unsigned TARGET_VECTORIZE_UNROLL_FACTOR (class 
vec_info *@var{vinfo}, void *@var{data})
+This hook should return the desired vector unrolling factor for a loop with
+@var{vinfo} based on the target-specific @var{data}. The default returns one,
+which means no unrolling will be performed.
+@end deftypefn
+
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_GATHER (const_tree 
@var{mem_vectype}, const_tree @var{index_type}, int @var{scale})
 Target builtin that implements vector gather operation.  @var{mem_vectype}
 is the vector type of the load and @var{index_type} is scalar type of
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 
fdf16b901c537e6a02f630a80a2213d2dcb6d5d6..40f4cb02c34f575439f35070301855ddaf82a21a
 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4195,6 +4195,10 @@ address;  but often a machine-dependent strategy can 
generate better code.
 
 @hook TARGET_VECTORIZE_DESTROY_COST_DATA
 
+@hook TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL
+
+@hook TARGET_VECTORIZE_UNROLL_FACTOR
+
 @hook TARGET_VECTORIZE_BUILTIN_GATHER
 
 @hook TARGET_VECTORIZE_BUILTIN_SCATTER
diff --git a/gcc/params.opt b/gcc/params.opt
index 
f414dc1a61cfa9d5b9ded75e96560fc1f73041a5..00f92d4484797df0dbbad052f45205469cbb2c49
 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1117,4 +1117,12 @@ Controls how loop vectorizer uses partial vectors.  0 
means never, 1 means only
 Common Joined UInteger Var(param_vect_inner_loop_cost_factor) Init(50) 
IntegerRange(1, 1) Param Optimization
 The maximum factor which the loop vectorizer applies to the cost of statements 
in an inner loop relative to the loop being vectorized.
 
+-param=vect-unroll=
+Common Joined UInteger Var(param_vect_unroll) Init(0) IntegerRange(0, 32) 
Param Optimization
+Controls how many times the vectorizer tries to unroll loops.  Also see 
vect-unroll-reductions.
+
+-param=vect-unroll-reductions=
+Common Joined UInteger Var(param_vect_unroll_reductions) Init(0) 
IntegerRange(0, 32) Param Optimization
+Controls how many times the vectorizer tries to unroll loops that contain 
associative reductions.  0 means that such loops should be unrolled vect-unroll 
times.
+
 ; This comment is to ensure we retain the blank 

[PATCH 0/3][vect] Enable vector unrolling of main loop

2021-09-17 Thread Andre Vieira (lists) via Gcc-patches

Hi all,

This patch series enables unrolling of an unpredicated main vectorized 
loop based on a target hook. The epilogue loop will have (at least) half 
the VF of the main loop and can be predicated.


Andre Vieira (3):
[vect] Add main vectorized loop unrolling
[vect] Consider outside costs earlier for epilogue loops
[AArch64] Implement vect_unroll backend hook



Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-06-16 Thread Andre Vieira (lists) via Gcc-patches



On 14/06/2021 11:57, Richard Biener wrote:

On Mon, 14 Jun 2021, Richard Biener wrote:


Indeed. For example a simple
int a[1024], b[1024], c[1024];

void foo(int n)
{
   for (int i = 0; i < n; ++i)
 a[i+1] += c[i+i] ? b[i+1] : 0;
}

should usually see peeling for alignment (though on x86 you need
exotic -march= since cost models generally have equal aligned and
unaligned access costs).  For example with -mavx2 -mtune=atom
we'll see an alignment peeling prologue, a AVX2 vector loop,
a SSE2 vectorized epilogue and a scalar epilogue.  It also
shows the original scalar loop being used in the scalar prologue
and epilogue.

We're not even trying to make the counting IV easily used
across loops (we're not counting scalar iterations in the
vector loops).

Specifically we see

 [local count: 94607391]:
niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
_67 = niters_vector_mult_vf.10_62 + 7;
_64 = (int) niters_vector_mult_vf.10_62;
tmp.11_63 = i_43 + _64;
if (niters.8_45 == niters_vector_mult_vf.10_62)
   goto ; [12.50%]
else
   goto ; [87.50%]

after the maini vect loop, recomputing the original IV (i) rather
than using the inserted canonical IV.  And then the vectorized
epilogue header check doing

 [local count: 93293400]:
# i_59 = PHI 
# _66 = PHI <_67(33), 0(18)>
_96 = (unsigned int) n_10(D);
niters.26_95 = _96 - _66;
_108 = (unsigned int) n_10(D);
_109 = _108 - _66;
_110 = _109 + 4294967295;
if (_110 <= 3)
   goto ; [10.00%]
else
   goto ; [90.00%]

re-computing everything from scratch again (also notice how
the main vect loop guard jumps around the alignment prologue
as well and lands here - and the vectorized epilogue using
unaligned accesses - good!).

That is, I'd expect _much_ easier jobs if we'd manage to
track the number of performed scalar iterations (or the
number of scalar iterations remaining) using the canonical
IV we add to all loops across all of the involved loops.

Richard.



So I am now looking at using an IV that counts scalar iterations rather 
than vector iterations and reusing that through all loops (prologue, 
main loop, vect_epilogue and scalar epilogue).  The first part is easy, 
since that's what we already do for partial vectors or non-constant VFs.  
The latter requires some plumbing and removing a lot of the code in there 
that creates new IVs going from [0, niters - previous iterations].  I 
don't yet have a clear-cut view of how to do this; I first thought of 
keeping track of the 'control' IV in the loop_vinfo, but the prologue 
and scalar epilogues won't have one.  'loop' keeps a control_ivs struct, 
but that is used for overflow detection and only keeps track of what 
looks like a constant 'base' and 'step'.  I am not quite sure how all 
that works, but intuitively it doesn't seem like the right thing to reuse.
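
A hand-written scalar sketch of the structure I am aiming for (not 
generated code): one counter of scalar iterations, reused by every loop 
copy, so each copy starts where the previous one stopped.

  void
  sketch (int *a, int *b, unsigned n, unsigned peel, unsigned vf, unsigned vf2)
  {
    unsigned i = 0;                      /* scalar iterations completed */
    for (; i < peel && i < n; i++)       /* prologue (peeling for alignment) */
      a[i] += b[i];
    for (; i + vf <= n; i += vf)         /* main vectorized loop */
      for (unsigned j = 0; j < vf; j++)
        a[i + j] += b[i + j];
    for (; i + vf2 <= n; i += vf2)       /* vectorized epilogue */
      for (unsigned j = 0; j < vf2; j++)
        a[i + j] += b[i + j];
    for (; i < n; i++)                   /* scalar epilogue */
      a[i] += b[i];
  }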


I'll go hack around and keep you posted on progress.

Regards,
Andre



Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-06-14 Thread Andre Vieira (lists) via Gcc-patches

Hi,


On 20/05/2021 11:22, Richard Biener wrote:

On Mon, 17 May 2021, Andre Vieira (lists) wrote:


Hi,

So this is my second attempt at finding a way to improve how we generate the
vector IV's and teach the vectorizer to share them between main loop and
epilogues. On IRC we discussed my idea to use the loop's control_iv, but that
was a terrible idea and I quickly threw it in the bin. The main problem, that
for some reason I failed to see, was that the control_iv increases by 's' and
the datarefs by 's' * NELEMENTS where 's' is usually 1 and NELEMENTs the
amount of elements we handle per iteration. That means the epilogue loops
would have to start from the last loop's IV * the last loop's NELEMENT's and
that would just cause a mess.

Instead I started to think about creating IV's for the datarefs and what I
thought worked best was to create these in scalar before peeling. That way the
peeling mechanisms takes care of the duplication of these for the vector and
scalar epilogues and it also takes care of adding phi-nodes for the
skip_vector paths.

How does this work for if-converted loops where we use the
non-if-converted scalar loop for (scalar) peeling but the
if-converted scalar loop for vectorized epilogues?  I suppose
you're only adjusting the if-converted copy.

True, I hadn't thought about this :(



These new IV's have two functions:
1) 'vect_create_data_ref_ptr' can use them to:
  a) if it's the main loop: replace the values of the 'initial' value of the
main loop's IV and the initial values in the skip_vector phi-nodes
  b) Update the skip_vector phi-nodes argument for the non-skip path with
the updated vector ptr.

b) means the prologue IV will not be dead there so we actually need
to compute it?  I suppose IVOPTs could be teached to replace an
IV with its final value (based on some other IV) when it's unused?
Or does it already magically do good?

It does not and ...



2) They are used for the scalar epilogue ensuring they share the same
datareference ptr.

There are still a variety of 'hacky' elements here and a lot of testing to be
done, but I hope to be able to clean them away. One of the main issues I had
was that I had to skip a couple of checks and things for the added phi-nodes
and update statements as these do not have stmt_vec_info representation.
Though I'm not sure adding this representation at their creation was much
cleaner... It is something I could play around with but I thought this was a
good moment to ask you for input. For instance, maybe we could do this
transformation before analysis?

Also be aware that because I create a IV for each dataref this leads to
regressions with SVE codegen for instance. NEON is able to use the post-index
addressing mode to increase each dr IV at access time, but SVE can't do this.
For this I don't know if maybe we could try to be smart and create shared
IV's. So rather than make them based on the actual vector ptr, use a shared
sizetype IV that can be shared among dr IV's with the same step. Or maybe this
is something for IVOPTs?

Certainly IVOPTs could decide to use the newly created IVs in the
scalar loops for the DRs therein as well.  But since IVOPTs only
considers a single loop at a time it will probably not pay too
much attention and is only influenced by the out-of-loop uses of
the final values of the IVs.

My gut feeling tells me that whatever we do we'll have to look
into improving IVOPTs to consider multiple loops.


So I redid the IV-sharing and it's looking a lot simpler and neater; 
however, it only shares IVs between vectorized loops and not the scalar 
prologue or epilogue.  I am not certain IVOPTs will be able to deal with 
these, as it has no knowledge of the number of iterations of each 
different loop.  Take for instance a prologue loop that peels for 
alignment and a first main vectorization loop: to be able to reuse the 
IVs from the prologue in the main vectorization loop, it would need to 
know that the initial start address + PEELING_NITERS == the base address 
for the main vectorization loop.


I'll start testing this approach for correctness if there are no 
major concerns.  Though I suspect we will only want to turn this into a 
patch once we have the IVOPTs work done to a point where it at least 
doesn't regress codegen because of shared IVs, and eventually we can look 
at how to solve the sharing between vectorized and scalar loops.


A small nitpick on my own RFC: I will probably move the 'skip_e' to 
outside of the map, as we only need one per loop_vinfo and not one per 
DR.  Initially I didn't even have this skip_e in, but was using the 
creation of a dummy PHI node and then replacing it with the real thing 
later.  Though this made the code simpler, especially when inserting 
the 'init' stmt_list.


Kind regards,
Andre
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 
b317df532a9a92a619de9572378437d09c632ab0..e7d0f1e657b1a0c9bec75799817242e0bc1d8282
 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree

[RFC][ivopts] Generate better code for IVs with uses outside the loop

2021-06-10 Thread Andre Vieira (lists) via Gcc-patches



On 08/06/2021 16:00, Andre Simoes Dias Vieira via Gcc-patches wrote:

Hi Bin,

Thank you for the reply, I have some questions, see below.

On 07/06/2021 12:28, Bin.Cheng wrote:

On Fri, Jun 4, 2021 at 12:35 AM Andre Vieira (lists) via Gcc-patches wrote:

Hi Andre,
I didn't look into the details of the IV sharing RFC.  It seems to me
costing outside uses is trying to generate better code for later code
(epilogue loop here).  The only problem is IVOPTs doesn't know that
the outside use is not in the final form - which will be transformed
by IVOPTs again.

I think this example is not good at describing your problem because it
shows exactly that considering outside use results in better code,
compared to the other two approaches.
I don't quite understand what you are saying here :( What do you mean 
by final form?  It seems to me that costing uses inside and outside the 
loop the same way is wrong, because calculating the IV inside the loop 
has to be done every iteration, whereas if you can resolve it to a 
single update (without an IV) then you can sink it outside the loop.  
This is why I think this example shows that we need to cost these uses 
differently.
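
A hand-written illustration of the point (not IVOPTs output): if the only 
use of the pointer after the loop is its final value, it does not need 
its own increment inside the loop; the value can be recomputed once from 
an IV that is needed anyway.

  /* Costing the outside use like an inside one keeps a second IV alive
     in the loop purely for the return value.  */
  char *
  f_inloop (char *a, unsigned n)
  {
    for (unsigned i = 0; i < n; i += 16)
      a += 16;               /* per-iteration update only for the use below */
    return a;                /* outside use of the IV */
  }

  /* Same result with the outside use sunk: a single add after the loop,
     based on the loop-control IV.  */
  char *
  f_sunk (char *a, unsigned n)
  {
    unsigned i = 0;
    for (; i < n; i += 16)
      ;                      /* loop body without the extra update */
    return a + i;
  }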

2) Is there a cleaner way to generate the optimal 'post-increment' use
for the outside-use variable? I first thought the position in the
candidate might be something I could use or even the var_at_stmt
functionality, but the outside IV has the actual increment of the
variable as it's use, rather than the outside uses. This is this RFC's
main weakness I find.

To answer why IVOPTs behaves like this w/o your two patches. The main
problem is the point IVOPTs rewrites outside use IV - I don't remember
the exact point - but looks like at the end of loop while before
incrementing instruction of main IV.  It's a known issue that outside
use should be costed/re-written on the exit edge along which its value
flows out of loop.  I had a patch a long time ago but discarded it,
because it didn't bring obvious improvement and is complicated in case
of multi-exit edges.
Yeah I haven't looked at multi-exit edges and I understand that 
complicates things. But for now we could disable the special casing of 
outside uses when dealing with multi-exit loops and keep the current 
behavior.


But in general, I am less convinced that any of the two patches is the
right direction solving IV sharing issue between vectorized loop and
epilogue loop.  I would need to read the previous RFC before giving
further comments though.


The previous RFC still has a lot of unanswered questions too, but 
regardless of that, take the following (non-vectorizer) example:


#include <arm_sve.h>
#include <stdint.h>

void bar (char  * __restrict__ a, char * __restrict__ b, char * 
__restrict__ c, unsigned long long n)

{
    svbool_t all_true = svptrue_b8 ();
  unsigned long long i = 0;
    for (; i < (n & ~(svcntb() - 1)); i += svcntb()) {
  svuint8_t va = svld1 (all_true, (uint8_t*)a);
  svuint8_t vb = svld1 (all_true, (uint8_t*)b);
  svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
  a += svcntb();
  b += svcntb();
  c += svcntb();
  }
  svbool_t pred;
  for (; i < (n); i += svcntb()) {
  pred = svwhilelt_b8 (i, n);
  svuint8_t va = svld1 (pred, (uint8_t*)a);
  svuint8_t vb = svld1 (pred, (uint8_t*)b);
  svst1 (pred, (uint8_t *)c, svadd_z (pred, va,vb));
  a += svcntb();
  b += svcntb();
  c += svcntb();
  }


Current IVOPTs will use four IVs for the first loop, when it could do 
with just one. In fact, if you use my patches it will create a single 
IV, sink the uses, and is then able to merge them with the loads and 
stores of the next loop.
I mixed things up here: I think an earlier version of my patch (with 
even more hacks) managed to rewrite these properly, but it looks like 
the current ones are messing things up.
I'll continue to try to understand how this works, as I still think 
IVOPTs should be able to do better.


You mentioned you had a patch you thought might help earlier, but you 
dropped it. Do you still have it lying around anywhere?


I am not saying setting outside costs to 0 is the right thing to do, 
by the way. It is absolutely not! It will break cost considerations 
for other cases. Like I said above, I've been playing around with 
using '!use->outside' as a multiplier for the cost. Unfortunately it 
won't help with the case above, because this seems to choose 
'infinite_cost', as the candidate IV has a lower precision than the 
use IV. I don't quite understand yet how candidates are created, but 
that is something I'm going to look at. I just wanted to show this as 
an example of how IVOPTs does not improve code with multiple loops 
that don't involve the vectorizer.


BR,
Andre




Thanks,
bin


[RFC][ivopts] Generate better code for IVs with uses outside the loop (was Re: [RFC] Implementing detection of saturation and rounding arithmetic)

2021-06-03 Thread Andre Vieira (lists) via Gcc-patches

Streams got crossed there and used the wrong subject ...

On 03/06/2021 17:34, Andre Vieira (lists) via Gcc-patches wrote:

Hi,

This RFC is motivated by the IV sharing RFC in 
https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569502.html and the 
need to have the IVOPTS pass be able to clean up IVs shared between 
multiple loops. When creating a similar problem with C code I noticed 
IVOPTs treated IVs with uses outside the loop differently; this 
didn't even require multiple loops. Take for instance the following 
example using SVE intrinsics:


#include <arm_sve.h>
#include <stdint.h>
extern void use (char *);
void bar (char  * __restrict__ a, char * __restrict__ b, char * 
__restrict__ c, unsigned n)

{
    svbool_t all_true = svptrue_b8 ();
  unsigned i = 0;
  if (n < (UINT_MAX - svcntb() - 1))
    {
    for (; i < n; i += svcntb())
    {
    svuint8_t va = svld1 (all_true, (uint8_t*)a);
    svuint8_t vb = svld1 (all_true, (uint8_t*)b);
    svst1 (all_true, (uint8_t *)c, svadd_z (all_true, 
va,vb));

    a += svcntb();
    b += svcntb();
    c += svcntb();
    }
    }
  use (a);
}

IVOPTs tends to generate a shared IV for SVE memory accesses, as we 
don't have a post-increment addressing mode for SVE loads/stores. If 
we had not included 'use (a);' in this example, IVOPTs would have 
replaced the IVs for a, b and c with a single one, also used for the 
loop control. See:


   [local count: 955630225]:
  # ivtmp.7_8 = PHI 
  va_14 = MEM  [(unsigned char *)a_10(D) + ivtmp.7_8 * 1];
  vb_15 = MEM  [(unsigned char *)b_11(D) + ivtmp.7_8 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_14, vb_15);
  MEM <__SVUint8_t> [(unsigned char *)c_12(D) + ivtmp.7_8 * 1] = _2;
  ivtmp.7_25 = ivtmp.7_8 + POLY_INT_CST [16, 16];
  i_23 = (unsigned int) ivtmp.7_25;
  if (n_9(D) > i_23)
    goto ; [89.00%]
  else
    goto ; [11.00%]

However, due to the 'use (a);' it will create two IVs: one for the 
loop control, b and c, and one for a. See:


  [local count: 955630225]:
  # a_28 = PHI 
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_28];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  a_18 = a_28 + POLY_INT_CST [16, 16];
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  i_8 = (unsigned int) ivtmp.7_24;
  if (n_10(D) > i_8)
    goto ; [89.00%]
  else
    goto ; [11.00%]

With the first patch attached in this RFC, 'no_cost.patch', I tell 
IVOPTs not to cost uses outside of the loop. This makes IVOPTs 
generate a single IV, but unfortunately it decides to create the 
variable for the use inside the loop, and it also seems to use the 
pre-increment value of the shared IV and add the [16,16] to it. See:


   [local count: 955630225]:
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  _8 = (unsigned long) a_11(D);
  _7 = _8 + ivtmp.7_25;
  _6 = _7 + POLY_INT_CST [16, 16];
  a_18 = (char * restrict) _6;
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  i_5 = (unsigned int) ivtmp.7_24;
  if (n_10(D) > i_5)
    goto ; [89.00%]
  else
    goto ; [11.00%]

With the patch 'var_after.patch' I make get_computation_aff_1 use 
'cand->var_after' for outside uses, thus using the post-increment 
variable of the candidate IV. This means I have to insert it in a 
different place and make sure to delete the old use->stmt. I'm sure 
there is a better way to do this within IVOPTs' current framework, but 
I haven't found one yet. See the result:


  [local count: 955630225]:
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  _8 = (unsigned long) a_11(D);
  _7 = _8 + ivtmp.7_24;
  a_18 = (char * restrict) _7;
  i_6 = (unsigned int) ivtmp.7_24;
  if (n_10(D) > i_6)
    goto ; [89.00%]
  else
    goto ; [11.00%]


This is still not optimal, as we are still doing the update inside the 
loop and there is absolutely no need for that. I found that running 
the sink pass would solve it, and it seems someone has added a second 
sink pass, so that saves me a third patch :) See the result after sink2:


   [local count: 955630225]:
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  i_6 = (unsigned int) ivtmp.7_24;
  if (i_6 &

[RFC] Implementing detection of saturation and rounding arithmetic

2021-06-03 Thread Andre Vieira (lists) via Gcc-patches

Hi,

This RFC is motivated by the IV sharing RFC in 
https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569502.html and the 
need to have the IVOPTS pass be able to clean up IVs shared between 
multiple loops. When creating a similar problem with C code I noticed 
IVOPTs treated IVs with uses outside the loop differently; this didn't 
even require multiple loops. Take for instance the following example 
using SVE intrinsics:


#include <arm_sve.h>
#include <stdint.h>
extern void use (char *);
void bar (char  * __restrict__ a, char * __restrict__ b, char * 
__restrict__ c, unsigned n)

{
    svbool_t all_true = svptrue_b8 ();
  unsigned i = 0;
  if (n < (UINT_MAX - svcntb() - 1))
    {
    for (; i < n; i += svcntb())
    {
    svuint8_t va = svld1 (all_true, (uint8_t*)a);
    svuint8_t vb = svld1 (all_true, (uint8_t*)b);
    svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
    a += svcntb();
    b += svcntb();
    c += svcntb();
    }
    }
  use (a);
}

IVOPTs tends to generate a shared IV for SVE memory accesses, as we 
don't have a post-increment addressing mode for SVE loads/stores. If we 
had not included 'use (a);' in this example, IVOPTs would have replaced 
the IVs for a, b and c with a single one, also used for the loop 
control. See:


   [local count: 955630225]:
  # ivtmp.7_8 = PHI 
  va_14 = MEM  [(unsigned char *)a_10(D) + ivtmp.7_8 * 1];
  vb_15 = MEM  [(unsigned char *)b_11(D) + ivtmp.7_8 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_14, vb_15);
  MEM <__SVUint8_t> [(unsigned char *)c_12(D) + ivtmp.7_8 * 1] = _2;
  ivtmp.7_25 = ivtmp.7_8 + POLY_INT_CST [16, 16];
  i_23 = (unsigned int) ivtmp.7_25;
  if (n_9(D) > i_23)
    goto ; [89.00%]
  else
    goto ; [11.00%]

However, due to the 'use (a);' it will create two IVs: one for the 
loop control, b and c, and one for a. See:


  [local count: 955630225]:
  # a_28 = PHI 
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_28];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  a_18 = a_28 + POLY_INT_CST [16, 16];
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  i_8 = (unsigned int) ivtmp.7_24;
  if (n_10(D) > i_8)
    goto ; [89.00%]
  else
    goto ; [11.00%]

With the first patch attached in this RFC, 'no_cost.patch', I tell 
IVOPTs not to cost uses outside of the loop. This makes IVOPTs generate 
a single IV, but unfortunately it decides to create the variable for 
the use inside the loop, and it also seems to use the pre-increment 
value of the shared IV and add the [16,16] to it. See:


   [local count: 955630225]:
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  _8 = (unsigned long) a_11(D);
  _7 = _8 + ivtmp.7_25;
  _6 = _7 + POLY_INT_CST [16, 16];
  a_18 = (char * restrict) _6;
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  i_5 = (unsigned int) ivtmp.7_24;
  if (n_10(D) > i_5)
    goto ; [89.00%]
  else
    goto ; [11.00%]

With the patch 'var_after.patch' I make get_computation_aff_1 use 
'cand->var_after' for outside uses, thus using the post-increment 
variable of the candidate IV. This means I have to insert it in a 
different place and make sure to delete the old use->stmt. I'm sure 
there is a better way to do this within IVOPTs' current framework, but 
I haven't found one yet. See the result:


  [local count: 955630225]:
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  _8 = (unsigned long) a_11(D);
  _7 = _8 + ivtmp.7_24;
  a_18 = (char * restrict) _7;
  i_6 = (unsigned int) ivtmp.7_24;
  if (n_10(D) > i_6)
    goto ; [89.00%]
  else
    goto ; [11.00%]


This is still not optimal, as we are still doing the update inside the 
loop and there is absolutely no need for that. I found that running the 
sink pass would solve it, and it seems someone has added a second sink 
pass, so that saves me a third patch :) See the result after sink2:


   [local count: 955630225]:
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  i_6 = (unsigned int) ivtmp.7_24;
  if (i_6 < n_10(D))
    goto ; [89.00%]
  else
    goto ; [11.00%]

   [local count: 105119324]:
  _8 = (unsigned long) a_11(D);
  _7 = _8 + ivtmp.7_24;
  a_18 = (char * restrict) _7;
  goto ; 

Re: [PATCH][vect] Use main loop's thresholds and vectorization factor to narrow upper_bound of epilogue

2021-06-03 Thread Andre Vieira (lists) via Gcc-patches

Thank you Kewen!!

I will apply this now.

BR,
Andre

On 25/05/2021 09:42, Kewen.Lin wrote:

on 2021/5/24 3:21 PM, Kewen.Lin via Gcc-patches wrote:

Hi Andre,

on 2021/5/24 2:17 PM, Andre Vieira (lists) via Gcc-patches wrote:

Hi,

When vectorizing with --param vect-partial-vector-usage=1 the vectorizer uses 
an unpredicated (all-true predicate for SVE) main loop and a predicated tail 
loop. The way this was implemented seems to mean it re-uses the same 
vector-mode for both loops, which means the tail loop isn't an actual loop but 
only executes one iteration.

This patch uses the knowledge of the conditions to enter an epilogue loop to 
help come up with a potentially more restrictive upper bound.

Regression tested on aarch64-linux-gnu and also ran the testsuite using 
'--param vect-partial-vector-usage=1' detecting no ICEs and no execution 
failures.

Would be good to have this tested for PPC too as I believe they are the main 
users of the --param vect-partial-vector-usage=1 option. Can someone help me 
test (and maybe even benchmark?) this on a PPC target?



Thanks for doing this!  I can test it on Power10, which enables this parameter
by default, and also evaluate its impact on SPEC2017 Ofast/unroll.


Bootstrapped/regtested on powerpc64le-linux-gnu Power10.
SPEC2017 run didn't show any remarkable improvement/degradation.

BR,
Kewen


[PATCH][vect] Use main loop's thresholds and vectorization factor to narrow upper_bound of epilogue

2021-05-24 Thread Andre Vieira (lists) via Gcc-patches

Hi,

When vectorizing with --param vect-partial-vector-usage=1 the vectorizer 
uses an unpredicated (all-true predicate for SVE) main loop and a 
predicated tail loop. The way this was implemented seems to mean it 
re-uses the same vector-mode for both loops, which means the tail loop 
isn't an actual loop but only executes one iteration.


This patch uses the knowledge of the conditions to enter an epilogue 
loop to help come up with a potentially more restrictive upper bound.
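
As a worked example (hedged, with made-up numbers): if the main loop
has a VF of 8 and neither the cost-model threshold nor the versioning
threshold requires more than 8 iterations to enter it, then whenever we
reach the epilogue fewer than 8 scalar iterations can remain.  With an
epilogue VF of 8 and partial vectors that is at most one vector
iteration, i.e. a latch-count upper bound of 0, which is what lets the
predicated epilogue collapse to a single iteration as in the new test
below.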


Regression tested on aarch64-linux-gnu and also ran the testsuite using 
'--param vect-partial-vector-usage=1' detecting no ICEs and no execution 
failures.


Would be good to have this tested for PPC too as I believe they are the 
main users of the --param vect-partial-vector-usage=1 option. Can 
someone help me test (and maybe even benchmark?) this on a PPC target?


Kind regards,
Andre

gcc/ChangeLog:

    * tree-vect-loop.c (vect_transform_loop): Use the main loop's
    various thresholds to narrow the upper bound on epilogue
    iterations.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/sve/part_vect_single_iter_epilog.c: New test.

diff --git 
a/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c 
b/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c
new file mode 100644
index 
..a03229eb55585f637ebd5288fb4c00f8f921d44c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */
+
+void
+foo (short * __restrict__ a, short * __restrict__ b, short * __restrict__ c, 
int n)
+{
+  for (int i = 0; i < n; ++i)
+c[i] = a[i] + b[i];
+}
+
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]+.h, wzr, [xw][0-9]+} 1 
} } */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 
3e973e774af8f9205be893e01ad9263281116885..81e9c5cc42415a0a92b765bc46640105670c4e6b
 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -9723,12 +9723,31 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
*loop_vectorized_call)
   /* In these calculations the "- 1" converts loop iteration counts
  back to latch counts.  */
   if (loop->any_upper_bound)
-loop->nb_iterations_upper_bound
-  = (final_iter_may_be_partial
-? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias_for_lowest,
- lowest_vf) - 1
-: wi::udiv_floor (loop->nb_iterations_upper_bound + bias_for_lowest,
-  lowest_vf) - 1);
+{
+  loop_vec_info main_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+  loop->nb_iterations_upper_bound
+   = (final_iter_may_be_partial
+  ? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias_for_lowest,
+   lowest_vf) - 1
+  : wi::udiv_floor (loop->nb_iterations_upper_bound + bias_for_lowest,
+lowest_vf) - 1);
+  if (main_vinfo)
+   {
+ unsigned int bound;
+ poly_uint64 main_iters
+   = upper_bound (LOOP_VINFO_VECT_FACTOR (main_vinfo),
+  LOOP_VINFO_COST_MODEL_THRESHOLD (main_vinfo));
+ main_iters
+   = upper_bound (main_iters,
+  LOOP_VINFO_VERSIONING_THRESHOLD (main_vinfo));
+ if (can_div_away_from_zero_p (main_iters,
+   LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+   &bound))
+   loop->nb_iterations_upper_bound
+ = wi::umin ((widest_int) (bound - 1),
+ loop->nb_iterations_upper_bound);
+  }
+  }
   if (loop->any_likely_upper_bound)
 loop->nb_iterations_likely_upper_bound
   = (final_iter_may_be_partial


Re: [PATCH][AArch64]: Use UNSPEC_LD1_SVE for all LD1 loads

2021-05-18 Thread Andre Vieira (lists) via Gcc-patches

Hi,

Using aarch64_pred_mov for these was tricky, as it handles both stores 
and loads. Furthermore, there was some concern it might allow a 
predicated mov to end up as a mem -> mem move, or a predicated load to 
be wrongfully reloaded into a full load to a register. So instead we 
decided to let the extending aarch64_load_* patterns accept both 
UNSPEC_LD1_SVE and UNSPEC_PRED_X.


Is this OK for trunk?

Kind regards,
Andre Vieira


gcc/ChangeLog:
2021-05-18  Andre Vieira  

    * config/aarch64/iterators.md (SVE_PRED_LOAD): New iterator.
    (pred_load): New int attribute.
    * config/aarch64/aarch64-sve.md 
(aarch64_load_): 
Use SVE_PRED_LOAD

    enum iterator and corresponding pred_load attribute.
    * config/aarch64/aarch64-sve-builtins-base.cc (expand): Update 
call to code_for_aarch64_load.


gcc/testsuite/ChangeLog:
2021-05-18  Andre Vieira  

    * gcc.target/aarch64/sve/logical_unpacked_and_2.c: Change 
scan-assembly-times to scan-assembly not for superfluous uxtb.

    * gcc.target/aarch64/sve/logical_unpacked_and_3.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_and_4.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_and_6.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_and_7.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_eor_2.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_eor_3.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_eor_4.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_eor_6.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_eor_7.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_orr_2.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_orr_3.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_orr_4.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_orr_6.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_orr_7.c: Likewise.
    * gcc.target/aarch64/sve/ld1_extend.c: New test.
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 
dfdf0e2fd186389cbddcff51ef52f8778d7fdb24..8fd6d3fb3171f56b4ceacaf7ea812bc696117210
 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1123,7 +1123,7 @@ public:
   rtx
   expand (function_expander ) const OVERRIDE
   {
-insn_code icode = code_for_aarch64_load (extend_rtx_code (),
+insn_code icode = code_for_aarch64_load (UNSPEC_LD1_SVE, extend_rtx_code 
(),
 e.vector_mode (0),
 e.memory_vector_mode ());
 return e.use_contiguous_load_insn (icode);
diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index 
7db2938bb84e04d066a7b07574e5cf344a3a8fb6..a5663200d51b95684b4dc0caefd527a525aebd52
 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -1287,7 +1287,7 @@ (define_insn "vec_mask_load_lanes"
 ;; -
 
 ;; Predicated load and extend, with 8 elements per 128-bit block.
-(define_insn_and_rewrite 
"@aarch64_load_"
+(define_insn_and_rewrite 
"@aarch64_load_"
   [(set (match_operand:SVE_HSDI 0 "register_operand" "=w")
(unspec:SVE_HSDI
  [(match_operand: 3 "general_operand" "UplDnm")
@@ -1295,7 +1295,7 @@ (define_insn_and_rewrite 
"@aarch64_load_ 2 "register_operand" "Upl")
(match_operand:SVE_PARTIAL_I 1 "memory_operand" "m")]
-  UNSPEC_LD1_SVE))]
+  SVE_PRED_LOAD))]
  UNSPEC_PRED_X))]
   "TARGET_SVE && (~ & ) == 0"
   "ld1\t%0., %2/z, %1"
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 
fb6e228651eae6a2db8c1ac755885ae7ad9225d6..8c17929cea4c83cc9f80b4cde950407ba4eb0416
 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -2509,6 +2509,10 @@ (define_int_iterator SVE_SHIFT_WIDE [UNSPEC_ASHIFT_WIDE
 
 (define_int_iterator SVE_LDFF1_LDNF1 [UNSPEC_LDFF1 UNSPEC_LDNF1])
 
+(define_int_iterator SVE_PRED_LOAD [UNSPEC_PRED_X UNSPEC_LD1_SVE])
+
+(define_int_attr pred_load [(UNSPEC_PRED_X "_x") (UNSPEC_LD1_SVE "")])
+
 (define_int_iterator SVE2_U32_UNARY [UNSPEC_URECPE UNSPEC_RSQRTE])
 
 (define_int_iterator SVE2_INT_UNARY_NARROWB [UNSPEC_SQXTNB
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/ld1_extend.c 
b/gcc/testsuite/gcc.target/aarch64/sve/ld1_extend.c
new file mode 100644
index 
..7f78cb4b3e4445c4da93b00ae78d6ef6fec1b2de
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/ld1_extend.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 --para

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-05-17 Thread Andre Vieira (lists) via Gcc-patches

Hi,

So this is my second attempt at finding a way to improve how we 
generate the vector IVs and teach the vectorizer to share them between 
the main loop and the epilogues. On IRC we discussed my idea to use the 
loop's control_iv, but that was a terrible idea and I quickly threw it 
in the bin. The main problem, which for some reason I failed to see, 
was that the control_iv increases by 's' and the datarefs by 
's' * NELEMENTS, where 's' is usually 1 and NELEMENTS is the number of 
elements we handle per iteration. That means the epilogue loops would 
have to start from the last loop's IV * the last loop's NELEMENTS, and 
that would just cause a mess.


Instead I started to think about creating IVs for the datarefs, and 
what I thought worked best was to create these in scalar form before 
peeling. That way the peeling mechanism takes care of duplicating them 
for the vector and scalar epilogues, and it also takes care of adding 
phi-nodes for the skip_vector paths.

These new IVs have two functions:
1) 'vect_create_data_ref_ptr' can use them to:
 a) if it's the main loop: replace the 'initial' value of the main 
loop's IV and the initial values in the skip_vector phi-nodes;
 b) update the skip_vector phi-nodes' argument for the non-skip path 
with the updated vector pointer.

2) They are used for the scalar epilogue, ensuring they share the same 
data reference pointer.


There are still a variety of 'hacky' elements here and a lot of 
testing to be done, but I hope to be able to clean them away. One of 
the main issues I had was that I had to skip a couple of checks for 
the added phi-nodes and update statements, as these do not have a 
stmt_vec_info representation.  Though I'm not sure adding this 
representation at their creation would be much cleaner... It is 
something I could play around with, but I thought this was a good 
moment to ask you for input. For instance, maybe we could do this 
transformation before analysis?


Also be aware that because I create an IV for each dataref, this leads 
to regressions with SVE codegen, for instance. NEON is able to use the 
post-index addressing mode to increase each dr IV at access time, but 
SVE can't do this.  For SVE I wonder whether we could be smart and 
create shared IVs: rather than basing them on the actual vector 
pointer, use a shared sizetype IV that can be shared among dr IVs with 
the same step, as sketched below. Or maybe this is something for 
IVOPTs?
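
A hand-written sketch of that shared-IV shape, in the intrinsics style
of the earlier examples (the function name and the assumption that n is
a multiple of svcntb() are mine, not from the patch):

#include <arm_sve.h>
#include <stdint.h>

void add_shared_off (uint8_t *a, uint8_t *b, uint8_t *c,
                     unsigned long n)
{
  svbool_t pg = svptrue_b8 ();
  /* One sizetype offset IV shared by all three data references and by
     the loop control, instead of one pointer IV per dataref.  */
  for (unsigned long off = 0; off < n; off += svcntb ())
    {
      svuint8_t va = svld1 (pg, a + off);
      svuint8_t vb = svld1 (pg, b + off);
      svst1 (pg, c + off, svadd_z (pg, va, vb));
    }
}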


Let me know what ya think!

Kind regards,
Andre
diff --git a/gcc/tree-data-ref.h b/gcc/tree-data-ref.h
index 
8001cc54f518d9d9d1a0fcfe5790d22dae109fb2..939c0a7fefd4355dd75d7646ac2ae63ce23a0e14
 100644
--- a/gcc/tree-data-ref.h
+++ b/gcc/tree-data-ref.h
@@ -174,6 +174,8 @@ struct data_reference
 
   /* Alias information for the data reference.  */
   struct dr_alias alias;
+
+  hash_map *iv_bases;
 };
 
 #define DR_STMT(DR)(DR)->stmt
diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index 
124a7bea6a94161556a6622fa7b113b3cef98bcf..f638bb3e0aa007e0bf7ad8f75fb767d3484b02ce
 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -1475,6 +1475,7 @@ void
 free_data_ref (data_reference_p dr)
 {
   DR_ACCESS_FNS (dr).release ();
+  delete dr->iv_bases;
   free (dr);
 }
 
@@ -1506,6 +1507,7 @@ create_data_ref (edge nest, loop_p loop, tree memref, 
gimple *stmt,
   DR_REF (dr) = memref;
   DR_IS_READ (dr) = is_read;
   DR_IS_CONDITIONAL_IN_STMT (dr) = is_conditional_in_stmt;
+  dr->iv_bases = new hash_map ();
 
   dr_analyze_innermost (_INNERMOST (dr), memref,
nest != NULL ? loop : NULL, stmt);
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index 
86fc118b6befb06233e5e86a01454fd7075075e1..93e14d09763da5034ba97d09b07c94c20fe25a28
 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -24,6 +24,8 @@ typedef void (*transform_callback)(class loop *, void *);
 
 extern void create_iv (tree, tree, tree, class loop *, gimple_stmt_iterator *,
   bool, tree *, tree *);
+extern void create_or_update_iv (tree, tree, tree, class loop *, 
gimple_stmt_iterator *,
+ bool, tree *, tree *, gphi *, bool);
 extern void rewrite_into_loop_closed_ssa_1 (bitmap, unsigned, int,
class loop *);
 extern void rewrite_into_loop_closed_ssa (bitmap, unsigned);
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 
28ae1316fa0eb6939a45d15e893b7386622ba60c..1709e175c382ef5d74c2f628a61c9fffe26f726d
 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -57,9 +57,10 @@ static bitmap_obstack loop_renamer_obstack;
VAR_AFTER (unless they are NULL).  */
 
 void
-create_iv (tree base, tree step, tree var, class loop *loop,
-  gimple_stmt_iterator *incr_pos, bool after,
-  tree *var_before, tree *var_after)
+create_or_update_iv (tree base, tree step, tree var, class loop *loop,
+

[PATCH][AArch64]: Use UNSPEC_LD1_SVE for all LD1 loads

2021-05-14 Thread Andre Vieira (lists) via Gcc-patches

Hi,

I noticed we were missing out on LD1 + UXT combinations in some cases 
and found it was because of inconsistent use of the unspec enum 
UNSPEC_LD1_SVE. The combine pattern for LD1[S][BHWD] uses UNSPEC_LD1_SVE 
whereas one of the LD1 expanders was using UNSPEC_PRED_X. I wasn't sure 
whether to change the UNSPEC_LD1_SVE into UNSPEC_PRED_X as the enum 
doesn't seem to be used for anything in particular, though I decided 
against it for now as it is easier to rename UNSPEC_LD1_SVE to 
UNSPEC_PRED_X if there is no use for it than it is to rename only 
specific instances of UNSPEC_PRED_X.


If there is a firm belief the UNSPEC_LD1_SVE will not be used for 
anything I am also happy to refactor it out.


Bootstrapped and regression tested aarch64-none-linux-gnu.

Is this OK for trunk?

Kind regards,
Andre Vieira

gcc/ChangeLog:
2021-05-14  Andre Vieira  

    * config/aarch64/aarch64-sve.md: Use UNSPEC_LD1_SVE instead of 
UNSPEC_PRED_X.


gcc/testsuite/ChangeLog:
2021-05-14  Andre Vieira  

    * gcc.target/aarch64/sve/logical_unpacked_and_2.c: Remove 
superfluous uxtb.

    * gcc.target/aarch64/sve/logical_unpacked_and_3.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_and_4.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_and_6.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_and_7.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_eor_2.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_eor_3.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_eor_4.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_eor_6.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_eor_7.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_orr_2.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_orr_4.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_orr_6.c: Likewise.
    * gcc.target/aarch64/sve/logical_unpacked_orr_7.c: Likewise.
    * gcc.target/aarch64/sve/ld1_extend.c: New test.

diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index 
7db2938bb84e04d066a7b07574e5cf344a3a8fb6..5fd74fcf3e0a984b5b40b8128ad9354fb899ce5f
 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -747,7 +747,7 @@ (define_insn_and_split "@aarch64_pred_mov"
(unspec:SVE_ALL
  [(match_operand: 1 "register_operand" "Upl, Upl, Upl")
   (match_operand:SVE_ALL 2 "nonimmediate_operand" "w, m, w")]
- UNSPEC_PRED_X))]
+ UNSPEC_LD1_SVE))]
   "TARGET_SVE
&& (register_operand (operands[0], mode)
|| register_operand (operands[2], mode))"
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/ld1_extend.c 
b/gcc/testsuite/gcc.target/aarch64/sve/ld1_extend.c
new file mode 100644
index 
..7f78cb4b3e4445c4da93b00ae78d6ef6fec1b2de
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/ld1_extend.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */
+
+void foo (signed char * __restrict__ a, signed char * __restrict__ b, short * 
__restrict__ c, int n)
+{
+for (int i = 0; i < n; ++i)
+  c[i] = a[i] + b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tld1sb\t} 4 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/logical_unpacked_and_2.c 
b/gcc/testsuite/gcc.target/aarch64/sve/logical_unpacked_and_2.c
index 
08b274512e1c6ce8f5845084a664b2fa0456dafe..cb6029e90ffc815e75092624f611c4631cbd9fd6
 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/logical_unpacked_and_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/logical_unpacked_and_2.c
@@ -11,7 +11,6 @@ f (uint64_t *restrict dst, uint16_t *restrict src1, uint8_t 
*restrict src2)
 
 /* { dg-final { scan-assembler-times {\tld1h\tz[0-9]+\.d,} 2 } } */
 /* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.d,} 2 } } */
-/* { dg-final { scan-assembler-times {\tuxtb\tz[0-9]+\.h,} 1 } } */
 /* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d,} 2 } } */
 /* { dg-final { scan-assembler-times {\tuxth\tz[0-9]+\.d,} 2 } } */
 /* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d,} 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/logical_unpacked_and_3.c 
b/gcc/testsuite/gcc.target/aarch64/sve/logical_unpacked_and_3.c
index 
c823470ca925ee66929475f74fa8d94bc4735594..02fc5460e5ce89c8a3fef611aac561145ddd0f39
 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/logical_unpacked_and_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/logical_unpacked_and_3.c
@@ -11,7 +11,6 @@ f (uint64_t *restrict dst, uint32_t *restrict src1, uint8_t 
*restrict src2)
 
 /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.d,} 2 } } */
 /* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.d,} 2 } } */
-/* { dg-final { scan-assembler-times {\tuxtb\tz[0-9]+\.s,} 1 } } */
 /* { dg-final { s

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-05-05 Thread Andre Vieira (lists) via Gcc-patches



On 05/05/2021 13:34, Richard Biener wrote:

On Wed, 5 May 2021, Andre Vieira (lists) wrote:


I tried to see what IVOPTs would make of this and it is able to analyze the
IVs but it doesn't realize (not even sure it tries) that one IV's end (loop 1)
could be used as the base for the other (loop 2). I don't know if this is
where you'd want such optimizations to be made, on one side I think it would
be great as it would also help with non-vectorized loops as you alluded to.

Hmm, OK.  So there's the first loop that has a looparound jump and thus
we do not always enter the 2nd loop with the first loop final value of the
IV.  But yes, IVOPTs does not try to allocate IVs across multiple loops.
And for a followup transform to catch this it would need to compute
the final value of the IV and then match this up with the initial
value computation.  I suppose FRE could be teached to do this, at
least for very simple cases.
I will admit I am not at all familiar with how FRE works; I know it 
exists, as running it often breaks my vector patches :P But that's 
about all I know.
I will have a look and see if it makes sense from my perspective to 
address it there, because ...



Anyway I diverge. Back to the main question of this patch. How do you suggest
I go about this? Is there a way to make IVOPTS aware of the 'iterate-once' IVs
in the epilogue(s) (both vector and scalar!) and then teach it to merge IV's
if one ends where the other begins?

I don't think we will make that work easily.  So indeed attacking this
in the vectorizer sounds most promising.


The problem I found with my approach is that it only tackles the 
vectorized epilogues, and that leads to regressions. I don't have the 
example at hand, but what I saw happening was that increased register 
pressure led to a spill in the hot path. I believe this was caused by 
the epilogue loop using the updated pointers as the base for its DRs 
(in this case there were three DRs: two loads and one store), while 
the scalar epilogue was still using the original base + niters, since 
this data_reference approach only changes the vectorized epilogues.




  I'll note there's also
the issue of epilogue vectorization and reductions where we seem
to not re-use partially reduced reduction vectors but instead
reduce to a scalar in each step.  That's a related issue - we're
not able to carry forward a (reduction) IV we generated for the
main vector loop to the epilogue loops.  Like for

double foo (double *a, int n)
{
   double sum = 0.;
   for (int i = 0; i < n; ++i)
 sum += a[i];
   return sum;
}

with AVX512 we get three reductions to scalars instead of
a partial reduction from zmm to ymm before the first vectorized
epilogue followed by a reduction from ymm to xmm before the second
(the jump around for the epilogues need to jump to the further
reduction piece obviously).

So I think we want to record IVs we generate (the reduction IVs
are already nicely associated with the stmt-infos), one might
consider to refer to them from the dr_vec_info for example.

It's just going to be "interesting" to wire everything up
correctly with all the jump-arounds we have ...
I have a downstream hack for the reductions, but it only worked for 
partial-vector-usage, as there you have the guarantee it's the same 
vector mode, so you don't need to faff around with half and full 
vectors. Obviously what you are suggesting has much wider applications, 
and not surprisingly I think Richard Sandiford also pointed out to me 
that these are somewhat related and that we might be able to reuse the 
IV creation to manage it all. But I feel like I am currently light 
years away from that.
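
For reference, here is a hand-written sketch of the partial-reduction
chain described above for 'foo' (an illustration only, using AVX-512
intrinsics and compiled with -mavx512f; it is not what the vectorizer
emits, and the loop bounds and scalar tail are simplified):

#include <immintrin.h>

double foo_sketch (const double *a, int n)
{
  int i = 0;
  __m512d acc512 = _mm512_setzero_pd ();
  for (; i + 8 <= n; i += 8)              /* main loop, VF = 8 */
    acc512 = _mm512_add_pd (acc512, _mm512_loadu_pd (a + i));

  /* zmm -> ymm: carry the partial sum into the first epilogue
     instead of reducing all the way to a scalar here.  */
  __m256d acc256 = _mm256_add_pd (_mm512_castpd512_pd256 (acc512),
                                  _mm512_extractf64x4_pd (acc512, 1));
  for (; i + 4 <= n; i += 4)              /* first epilogue, VF = 4 */
    acc256 = _mm256_add_pd (acc256, _mm256_loadu_pd (a + i));

  /* ymm -> xmm: carry the partial sum into the second epilogue.  */
  __m128d acc128 = _mm_add_pd (_mm256_castpd256_pd128 (acc256),
                               _mm256_extractf128_pd (acc256, 1));
  for (; i + 2 <= n; i += 2)              /* second epilogue, VF = 2 */
    acc128 = _mm_add_pd (acc128, _mm_loadu_pd (a + i));

  /* Only now reduce to a scalar, then finish any leftover elements.  */
  double sum = _mm_cvtsd_f64 (_mm_add_sd (acc128,
                                          _mm_unpackhi_pd (acc128,
                                                           acc128)));
  for (; i < n; ++i)
    sum += a[i];
  return sum;
}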


I had started to look at removing the data_reference updating we have 
now and dealing with this in the 'create_iv' calls from 
'vect_create_data_ref_ptr' inside 'vectorizable_{load,store}', but then 
I thought it would be good to discuss it with you first. This will 
require keeping track of the 'end value' of the IV, which, for loops 
where the previous loop can be skipped, means we will need to construct 
a phi-node containing the updated pointer and the initial base. But I'm 
not entirely sure where to keep track of all this. Also, I don't know 
whether I can replace the base address of the data_reference right 
there at the 'create_iv' call: can a data_reference be used multiple 
times in the same loop?


I'll go do a bit more nosing around this idea and the ivmap you 
mentioned before. Let me know if you have any ideas on how this should 
all look, even if it's only 'in an ideal world'.


Andre



On 04/05/2021 10:56, Richard Biener wrote:

On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:


Hi,

The aim of this RFC is to explore a way of cleaning up the codegen around
data_references.  To be specific, I'd like to reuse the main-loop's updated
data_reference as the base_address for the epilogue's corresponding
data_reference, rather than use the niters.  We have found this le

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-05-05 Thread Andre Vieira (lists) via Gcc-patches

Hi Richi,

So I'm trying to look at what IVOPTs does right now and how it might be 
able to help us. Looking at these two code examples:

#include 
#if 0
int foo(short * a, short * b, unsigned int n)
{
    int sum = 0;
    for (unsigned int i = 0; i < n; ++i)
    sum += a[i] + b[i];

    return sum;
}


#else

int bar (short * a, short *b, unsigned int n)
{
    int sum = 0;
    unsigned int i = 0;
    for (; i < (n / 16); i += 1)
    {
    // Iterates [0, 16, .., (n/16 * 16) * 16]
    // Example n = 127,
    // iterates [0, 16, 32, 48, 64, 80, 96, 112]
    sum += a[i*16] + b[i*16];
    }
    for (size_t j =  (size_t) ((n / 16) * 16); j < n; ++j)
    {
    // Iterates [(n/16 * 16) * 16 , (((n/16 * 16) + 1) * 16)... ,n*16]
    // Example n = 127,
    // j starts at (127/16) * 16 = 7 * 16 = 112,
    // So iterates over [112, 113, 114, 115, ..., 127]
    sum += a[j] + b[j];
    }
    return sum;
}
#endif

I compiled the bottom one (bar, kept by the '#if 0') with an 
aarch64-linux-gnu compiler and the following options: '-O3 
-march=armv8-a -fno-tree-vectorize -fdump-tree-ivopts-all 
-fno-unroll-loops'. See the godbolt link here: 
https://godbolt.org/z/MEf6j6ebM


I tried to see what IVOPTs would make of this: it is able to analyze 
the IVs, but it doesn't realize (I'm not even sure it tries) that one 
IV's end value (loop 1) could be used as the base for the other 
(loop 2). I don't know if this is where you'd want such optimizations 
to be made; on one hand I think it would be great, as it would also 
help with non-vectorized loops, as you alluded to.


However, if you compile the top test case (foo, with '#if 1') and let 
the tree-vectorizer have a go, you will see different behaviours for 
different vectorization approaches:
With '-O3 -march=armv8-a', using NEON and epilogue vectorization, it 
seems IVOPTs only picks up on one loop.
If you use '-O3 -march=armv8-a+sve --param vect-partial-vector-usage=1' 
it will detect two loops. This may well be because epilogue 
vectorization 'un-loops' it, since it knows it will only have to do one 
iteration of the vectorized epilogue. vect-partial-vector-usage=1 could 
have done the same, but because we are dealing with polymorphic vector 
modes it fails to. I have a hack that works for 
vect-partial-vector-usage to avoid it, but I think we can probably do 
better and try to reason about boundaries in poly_ints rather than 
integers (TBC).


Anyway I diverge. Back to the main question of this patch. How do you 
suggest I go about this? Is there a way to make IVOPTS aware of the 
'iterate-once' IVs in the epilogue(s) (both vector and scalar!) and then 
teach it to merge IV's if one ends where the other begins?
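
For illustration, this is roughly the IV merging I would like to fall
out of that (a hand-edited version of 'bar' above, not compiler
output): loop 2 starts from loop 1's final IV value instead of
recomputing its base from n.

#include <stddef.h>

int bar_merged (short *a, short *b, unsigned int n)
{
    int sum = 0;
    unsigned int i = 0;
    for (; i < (n / 16); i += 1)
        sum += a[i * 16] + b[i * 16];
    /* Reuse loop 1's final IV: i == n / 16 here, so i * 16 is exactly
       the (n / 16) * 16 that loop 2 previously recomputed.  */
    for (size_t j = (size_t) i * 16; j < n; ++j)
        sum += a[j] + b[j];
    return sum;
}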


On 04/05/2021 10:56, Richard Biener wrote:

On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:


Hi,

The aim of this RFC is to explore a way of cleaning up the codegen around
data_references.  To be specific, I'd like to reuse the main-loop's updated
data_reference as the base_address for the epilogue's corresponding
data_reference, rather than use the niters.  We have found this leads to
better codegen in the vectorized epilogue loops.

The approach in this RFC creates a map of iv_updates which always contains an
updated pointer that is captured in vectorizable_{load,store}; an iv_update may
also contain a skip_edge in case we decide the vectorization can be skipped in
'vect_do_peeling'. During the epilogue update this map of iv_updates is then
checked to see if it contains an entry for a data_reference, and if so it is
used accordingly; if not, we revert back to the old behavior of using the
niters to advance the data_reference.

The motivation for this work is to improve codegen for the option `--param
vect-partial-vector-usage=1` for SVE. We found that one of the main problems
for the codegen here was coming from unnecessary conversions caused by the way
we update the data_references in the epilogue.

This patch passes regression tests on aarch64-linux-gnu, but the codegen is
still not optimal in some cases, specifically those where we have a scalar
epilogue, as that does not use the data_references and will rely on the
gimple scalar code, thus constructing a memory access using the niters again.
This is a limitation for which I haven't quite worked out a solution yet, and
it does cause some minor regressions due to unfortunate spills.

Let me know what you think and if you have ideas of how we can better achieve
this.

Hmm, so the patch adds a kludge to improve the kludge we have in place ;)

I think it might be interesting to create a C testcase mimicing the
update problem without involving the vectorizer.  That way we can
see how the various components involved behave (FRE + ivopts most
specifically).

That said, a cleaner approach to dealing with this would be to
explicitely track the IVs we generate for vectorized DRs, eventually
factoring that out from vectorizable_{store,load} so we can simply
carry over the actual pointer IV final value to the epilogue a

Re: [PATCH 9/9] arm: Auto-vectorization for MVE: vld4/vst4

2021-05-04 Thread Andre Vieira (lists) via Gcc-patches

Hi Christophe,

The series LGTM but you'll need the approval of an arm port maintainer 
before committing. I only did code-review, did not try to build/run tests.


Kind regards,
Andre

On 30/04/2021 15:09, Christophe Lyon via Gcc-patches wrote:

This patch enables MVE vld4/vst4 instructions for auto-vectorization.
We move the existing expanders from neon.md and enable them for MVE,
calling the respective emitter.

2021-03-12  Christophe Lyon  

gcc/
* config/arm/neon.md (vec_load_lanesxi)
(vec_store_lanexoi): Move ...
* config/arm/vec-common.md: here.

gcc/testsuite/
* gcc.target/arm/simd/mve-vld4.c: New test, derived from
slp-perm-3.c
---
  gcc/config/arm/neon.md   |  20 
  gcc/config/arm/vec-common.md |  26 +
  gcc/testsuite/gcc.target/arm/simd/mve-vld4.c | 140 +++
  3 files changed, 166 insertions(+), 20 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vld4.c

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index bc8775c..fb58baf 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -5617,16 +5617,6 @@ (define_insn "neon_vld4"
  (const_string "neon_load4_4reg")))]
  )
  
-(define_expand "vec_load_lanesxi"

-  [(match_operand:XI 0 "s_register_operand")
-   (match_operand:XI 1 "neon_struct_operand")
-   (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
-  "TARGET_NEON"
-{
-  emit_insn (gen_neon_vld4 (operands[0], operands[1]));
-  DONE;
-})
-
  (define_expand "neon_vld4"
[(match_operand:XI 0 "s_register_operand")
 (match_operand:XI 1 "neon_struct_operand")
@@ -5818,16 +5808,6 @@ (define_insn "neon_vst4"
  (const_string "neon_store4_4reg")))]
  )
  
-(define_expand "vec_store_lanesxi"

-  [(match_operand:XI 0 "neon_struct_operand")
-   (match_operand:XI 1 "s_register_operand")
-   (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
-  "TARGET_NEON"
-{
-  emit_insn (gen_neon_vst4 (operands[0], operands[1]));
-  DONE;
-})
-
  (define_expand "neon_vst4"
[(match_operand:XI 0 "neon_struct_operand")
 (match_operand:XI 1 "s_register_operand")
diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
index 7abefea..d46b78d 100644
--- a/gcc/config/arm/vec-common.md
+++ b/gcc/config/arm/vec-common.md
@@ -512,3 +512,29 @@ (define_expand "vec_store_lanesoi"
  emit_insn (gen_mve_vst2q (operands[0], operands[1]));
DONE;
  })
+
+(define_expand "vec_load_lanesxi"
+  [(match_operand:XI 0 "s_register_operand")
+   (match_operand:XI 1 "neon_struct_operand")
+   (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+  "TARGET_NEON || TARGET_HAVE_MVE"
+{
+  if (TARGET_NEON)
+emit_insn (gen_neon_vld4 (operands[0], operands[1]));
+  else
+emit_insn (gen_mve_vld4q (operands[0], operands[1]));
+  DONE;
+})
+
+(define_expand "vec_store_lanesxi"
+  [(match_operand:XI 0 "neon_struct_operand")
+   (match_operand:XI 1 "s_register_operand")
+   (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+  "TARGET_NEON || TARGET_HAVE_MVE"
+{
+  if (TARGET_NEON)
+emit_insn (gen_neon_vst4 (operands[0], operands[1]));
+  else
+emit_insn (gen_mve_vst4q (operands[0], operands[1]));
+  DONE;
+})
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vld4.c 
b/gcc/testsuite/gcc.target/arm/simd/mve-vld4.c
new file mode 100644
index 000..ce3e755
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vld4.c
@@ -0,0 +1,140 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
+/* { dg-add-options arm_v8_1m_mve_fp } */
+/* { dg-additional-options "-O3" } */
+
+#include 
+
+#define M00 100
+#define M10 216
+#define M20 23
+#define M30 237
+#define M01 1322
+#define M11 13
+#define M21 27271
+#define M31 2280
+#define M02 74
+#define M12 191
+#define M22 500
+#define M32 111
+#define M03 134
+#define M13 117
+#define M23 11
+#define M33 771
+
+#define N 128
+
+/* Integer tests.  */
+#define FUNC(SIGN, TYPE, BITS) \
+  void foo_##SIGN##BITS##x (TYPE##BITS##_t *__restrict__ pInput,   \
+   TYPE##BITS##_t *__restrict__ pOutput)   \
+  {\
+unsigned int i;\
+TYPE##BITS##_t  a, b, c, d;
\
+   \
+for (i = 0; i < N / BITS; i++)  \
+  {
\
+   a = *pInput++;  \
+   b = *pInput++;  \
+   c = *pInput++;  \
+   d = *pInput++;  \
+   

Re: [PATCH 7/9] arm: Auto-vectorization for MVE: add __fp16 support to VCMP

2021-05-04 Thread Andre Vieira (lists) via Gcc-patches
It would be good to also add tests for NEON as you also enable auto-vec 
for it. I checked and I do think the necessary 'neon_vc' patterns exist 
for 'VH', so we should be OK there.


On 30/04/2021 15:09, Christophe Lyon via Gcc-patches wrote:

This patch adds __fp16 support to the previous patch that added vcmp
support with MVE. For this we update existing expanders to use VDQWH
iterator, and add a new expander vcond.  In the
process we need to create suitable iterators, and update v_cmp_result
as needed.

2021-04-26  Christophe Lyon  

gcc/
* config/arm/iterators.md (V16): New iterator.
(VH_cvtto): New iterator.
(v_cmp_result): Added V4HF and V8HF support.
* config/arm/vec-common.md (vec_cmp): Use VDQWH.
(vcond): Likewise.
(vcond_mask_): Likewise.
(vcond): New expander.

gcc/testsuite/
* gcc.target/arm/simd/mve-compare-3.c: New test with GCC vectors.
* gcc.target/arm/simd/mve-vcmp-f16.c: New test for
auto-vectorization.
---
  gcc/config/arm/iterators.md   |  6 
  gcc/config/arm/vec-common.md  | 40 ---
  gcc/testsuite/gcc.target/arm/simd/mve-compare-3.c | 38 +
  gcc/testsuite/gcc.target/arm/simd/mve-vcmp-f16.c  | 30 +
  4 files changed, 102 insertions(+), 12 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-compare-3.c
  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vcmp-f16.c

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index a128465..3042baf 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -231,6 +231,9 @@ (define_mode_iterator VU [V16QI V8HI V4SI])
  ;; Vector modes for 16-bit floating-point support.
  (define_mode_iterator VH [V8HF V4HF])
  
+;; Modes with 16-bit elements only.

+(define_mode_iterator V16 [V4HI V4HF V8HI V8HF])
+
  ;; 16-bit floating-point vector modes suitable for moving (includes BFmode).
  (define_mode_iterator VHFBF [V8HF V4HF V4BF V8BF])
  
@@ -571,6 +574,8 @@ (define_mode_attr V_cvtto [(V2SI "v2sf") (V2SF "v2si")

  ;; (Opposite) mode to convert to/from for vector-half mode conversions.
  (define_mode_attr VH_CVTTO [(V4HI "V4HF") (V4HF "V4HI")
(V8HI "V8HF") (V8HF "V8HI")])
+(define_mode_attr VH_cvtto [(V4HI "v4hf") (V4HF "v4hi")
+   (V8HI "v8hf") (V8HF "v8hi")])
  
  ;; Define element mode for each vector mode.

  (define_mode_attr V_elem [(V8QI "QI") (V16QI "QI")
@@ -720,6 +725,7 @@ (define_mode_attr V_cmp_result [(V8QI "V8QI") (V16QI 
"V16QI")
  (define_mode_attr v_cmp_result [(V8QI "v8qi") (V16QI "v16qi")
(V4HI "v4hi") (V8HI  "v8hi")
(V2SI "v2si") (V4SI  "v4si")
+   (V4HF "v4hi") (V8HF  "v8hi")
(DI   "di")   (V2DI  "v2di")
(V2SF "v2si") (V4SF  "v4si")])
  
diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md

index 034b48b..3fd341c 100644
--- a/gcc/config/arm/vec-common.md
+++ b/gcc/config/arm/vec-common.md
@@ -366,8 +366,8 @@ (define_expand "vlshr3"
  (define_expand "vec_cmp"
[(set (match_operand: 0 "s_register_operand")
(match_operator: 1 "comparison_operator"
- [(match_operand:VDQW 2 "s_register_operand")
-  (match_operand:VDQW 3 "reg_or_zero_operand")]))]
+ [(match_operand:VDQWH 2 "s_register_operand")
+  (match_operand:VDQWH 3 "reg_or_zero_operand")]))]
"ARM_HAVE__ARITH
 && !TARGET_REALLY_IWMMXT
 && (! || flag_unsafe_math_optimizations)"
@@ -399,13 +399,13 @@ (define_expand "vec_cmpu"
  ;; element-wise.
  
  (define_expand "vcond"

-  [(set (match_operand:VDQW 0 "s_register_operand")
-   (if_then_else:VDQW
+  [(set (match_operand:VDQWH 0 "s_register_operand")
+   (if_then_else:VDQWH
  (match_operator 3 "comparison_operator"
-   [(match_operand:VDQW 4 "s_register_operand")
-(match_operand:VDQW 5 "reg_or_zero_operand")])
- (match_operand:VDQW 1 "s_register_operand")
- (match_operand:VDQW 2 "s_register_operand")))]
+   [(match_operand:VDQWH 4 "s_register_operand")
+(match_operand:VDQWH 5 "reg_or_zero_operand")])
+ (match_operand:VDQWH 1 "s_register_operand")
+ (match_operand:VDQWH 2 "s_register_operand")))]
"ARM_HAVE__ARITH
 && !TARGET_REALLY_IWMMXT
 && (! || flag_unsafe_math_optimizations)"
@@ -430,6 +430,22 @@ (define_expand "vcond"
DONE;
  })
  
+(define_expand "vcond"

+  [(set (match_operand: 0 "s_register_operand")
+   (if_then_else:
+ (match_operator 3 "comparison_operator"
+   [(match_operand:V16 4 "s_register_operand")
+(match_operand:V16 5 "reg_or_zero_operand")])
+ (match_operand: 1 "s_register_operand")
+ (match_operand: 2 "s_register_operand")))]
+  

Re: [PATCH 6/9] arm: Auto-vectorization for MVE: vcmp

2021-05-04 Thread Andre Vieira (lists) via Gcc-patches

Hi Christophe,

On 30/04/2021 15:09, Christophe Lyon via Gcc-patches wrote:

Since MVE has a different set of vector comparison operators from
Neon, we have to update the expansion to take into account the new
ones, for instance 'NE' for which MVE does not require to use 'EQ'
with the inverted condition.

Conversely, Neon supports comparisons with #0, MVE does not.

For:
typedef long int vs32 __attribute__((vector_size(16)));
vs32 cmp_eq_vs32_reg (vs32 a, vs32 b) { return a == b; }

we now generate:
cmp_eq_vs32_reg:
vldr.64 d4, .L123   @ 8 [c=8 l=4]  *mve_movv4si/8
vldr.64 d5, .L123+8
vldr.64 d6, .L123+16@ 9 [c=8 l=4]  *mve_movv4si/8
vldr.64 d7, .L123+24
vcmp.i32  eq, q0, q1@ 7 [c=16 l=4]  mve_vcmpeqq_v4si
vpsel q0, q3, q2@ 15[c=8 l=4]  mve_vpselq_sv4si
bx  lr  @ 26[c=8 l=4]  *thumb2_return
.L124:
.align  3
.L123:
.word   0
.word   0
.word   0
.word   0
.word   1
.word   1
.word   1
.word   1

For some reason emit_move_insn (zero, CONST0_RTX (cmp_mode)) produces
a pair of vldr instructions instead of a vmov.i32 qX, #0.

I think ideally we would even want:
vpte  eq, q0, q1
vmovt.i32 q0, #0
vmove.i32 q0, #1

But we don't have a way to generate VPT blocks with multiple 
instructions yet, unfortunately, so I guess VPSEL will have to do for 
now.




2021-03-01  Christophe Lyon  

gcc/
* config/arm/arm-protos.h (arm_expand_vector_compare): Update
prototype.
* config/arm/arm.c (arm_expand_vector_compare): Add support for
MVE.
(arm_expand_vcond): Likewise.
* config/arm/iterators.md (supf): Remove VCMPNEQ_S, VCMPEQQ_S,
VCMPEQQ_N_S, VCMPNEQ_N_S.
(VCMPNEQ, VCMPEQQ, VCMPEQQ_N, VCMPNEQ_N): Remove.
* config/arm/mve.md (@mve_vcmpq_): Add '@' prefix.
(@mve_vcmpq_f): Likewise.
(@mve_vcmpq_n_f): Likewise.
(@mve_vpselq_): Likewise.
(@mve_vpselq_f"): Likewise.
* config/arm/neon.md (vec_cmp): Likewise.
(vcond): Likewise.
(vcond): Likewise.
(vcondu): Likewise.
(vcond_mask_): Likewise.
* config/arm/unspecs.md (VCMPNEQ_U, VCMPNEQ_S, VCMPEQQ_S)
(VCMPEQQ_N_S, VCMPNEQ_N_S, VCMPEQQ_U, CMPEQQ_N_U, VCMPNEQ_N_U)
(VCMPGEQ_N_S, VCMPGEQ_S, VCMPGTQ_N_S, VCMPGTQ_S, VCMPLEQ_N_S)
(VCMPLEQ_S, VCMPLTQ_N_S, VCMPLTQ_S, VCMPCSQ_N_U, VCMPCSQ_U)
(VCMPHIQ_N_U, VCMPHIQ_U): Remove.
* config/arm/vec-common.md (vec_cmp): Likewise.
(vcond): Likewise.
(vcond): Likewise.
(vcondu): Likewise.
(vcond_mask_): Likewise.

gcc/testsuite
* gcc.target/arm/simd/mve-compare-1.c: New test with GCC vectors.
* gcc.target/arm/simd/mve-compare-2.c: New test with GCC vectors.
* gcc.target/arm/simd/mve-compare-scalar-1.c: New test with GCC
vectors.
* gcc.target/arm/simd/mve-vcmp-f32.c: New test for
auto-vectorization.
* gcc.target/arm/simd/mve-vcmp.c: New test for auto-vectorization.

add gcc/testsuite/gcc.target/arm/simd/mve-compare-scalar-1.c
---
  gcc/config/arm/arm-protos.h|   2 +-
  gcc/config/arm/arm.c   | 211 -
  gcc/config/arm/iterators.md|   9 +-
  gcc/config/arm/mve.md  |  10 +-
  gcc/config/arm/neon.md |  87 -
  gcc/config/arm/unspecs.md  |  20 --
  gcc/config/arm/vec-common.md   | 107 +++
  gcc/testsuite/gcc.target/arm/simd/mve-compare-1.c  |  80 
  gcc/testsuite/gcc.target/arm/simd/mve-compare-2.c  |  38 
  .../gcc.target/arm/simd/mve-compare-scalar-1.c |  69 +++
  gcc/testsuite/gcc.target/arm/simd/mve-vcmp-f32.c   |  30 +++
  gcc/testsuite/gcc.target/arm/simd/mve-vcmp.c   |  50 +
  12 files changed, 547 insertions(+), 166 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-compare-1.c
  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-compare-2.c
  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-compare-scalar-1.c
  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vcmp-f32.c
  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vcmp.c

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2521541..ffccaa7 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -373,7 +373,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rtx_code, 
rtx, rtx, rtx, rtx,
  extern bool arm_fusion_enabled_p (tune_params::fuse_ops);
  extern bool arm_valid_symbolic_address_p (rtx);
  extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
-extern bool arm_expand_vector_compare (rtx, rtx_code, rtx, rtx, bool);
+extern bool arm_expand_vector_compare (rtx, rtx_code, rtx, rtx, bool, bool);
  #endif /* 

[RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-04-30 Thread Andre Vieira (lists) via Gcc-patches

Hi,

The aim of this RFC is to explore a way of cleaning up the codegen 
around data_references.  To be specific, I'd like to reuse the 
main-loop's updated data_reference as the base_address for the 
epilogue's corresponding data_reference, rather than use the niters.  We 
have found this leads to better codegen in the vectorized epilogue loops.


The approach in this RFC creates a map of iv_updates which always 
contains an updated pointer that is captured in 
vectorizable_{load,store}; an iv_update may also contain a skip_edge 
in case we decide the vectorization can be skipped in 
'vect_do_peeling'. During the epilogue update this map of iv_updates 
is then checked to see if it contains an entry for a data_reference, 
and if so it is used accordingly; if not, we revert back to the old 
behavior of using the niters to advance the data_reference.
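
Roughly, the bookkeeping this introduces looks like the following
sketch (reconstructed from the text and the patch below; the real
declarations may differ slightly):

  /* Per-data_reference record of how the main loop left its IV.  */
  struct iv_update
  {
    tree new_base_addr;  /* pointer value captured in
                            vectorizable_{load,store}  */
    edge skip_edge;      /* set when the main vector loop may be
                            skipped, so the epilogue needs a phi  */
  };

  /* Map from each data_reference to its iv_update.  */
  hash_map<data_reference *, iv_update> iv_updates_map;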


The motivation for this work is to improve codegen for the option 
`--param vect-partial-vector-usage=1` for SVE. We found that one of the 
main problems for the codegen here was coming from unnecessary 
conversions caused by the way we update the data_references in the epilogue.


This patch passes regression tests on aarch64-linux-gnu, but the 
codegen is still not optimal in some cases, specifically those where we 
have a scalar epilogue, as that does not use the data_references and 
will rely on the gimple scalar code, thus constructing a memory access 
using the niters again.  This is a limitation for which I haven't quite 
worked out a solution yet, and it does cause some minor regressions due 
to unfortunate spills.


Let me know what you think and if you have ideas of how we can better 
achieve this.


Kind regards,
Andre Vieira

diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 
c1d6e02194b251f7c940784c291d58c754f07454..ebb71948abe4ca27d495a2707254beb27e385a0d
 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1928,6 +1928,15 @@ vect_gen_prolog_loop_niters (loop_vec_info loop_vinfo,
   return iters_name;
 }
 
+static bool
+maybe_not_zero (tree t)
+{
+  if (!t)
+return false;
+  if (TREE_CODE (t) != INTEGER_CST)
+return true;
+  return !tree_int_cst_equal (t, build_zero_cst (TREE_TYPE (t)));
+}
 
 /* Function vect_update_init_of_dr
 
@@ -1954,6 +1963,76 @@ vect_update_init_of_dr (dr_vec_info *dr_info, tree 
niters, tree_code code)
   dr_info->offset = offset;
 }
 
+static void
+vect_update_base_of_dr (struct data_reference * dr,
+   loop_vec_info epilogue_vinfo, iv_update *iv_update)
+{
+  tree new_base_addr = iv_update->new_base_addr;
+  edge skip_e = iv_update->skip_edge;
+  if (skip_e)
+{
+  /* If we have SKIP_E we need to use the phi-node that joins the IV coming
+from the main loop and the initial IV.  */
+  gimple_seq stmts;
+  tree base_addr = DR_BASE_ADDRESS (dr);
+  tree type = TREE_TYPE (base_addr);
+  gphi *new_phi;
+
+  edge e = EDGE_PRED (skip_e->dest, 0);
+  e = e != skip_e ? e : EDGE_PRED (skip_e->dest, 1);
+
+  base_addr = force_gimple_operand (base_addr, &stmts, true,
+   NULL_TREE);
+  gimple_stmt_iterator gsi = gsi_last_bb (skip_e->src);
+  if (is_gimple_assign (gsi_stmt (gsi))
+ || is_gimple_call (gsi_stmt (gsi)))
+   gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+  else
+   gsi_insert_seq_before (&gsi, stmts, GSI_NEW_STMT);
+
+  /* Make sure NEW_BASE_ADDR and the initial base address use the same
+type.  Not sure why I chose to use DR_BASE_ADDR's type here, probably
+makes more sense to use the NEW_BASE_ADDR's type.  */
+  stmts = NULL;
+  new_base_addr = fold_convert (type, new_base_addr);
+  new_base_addr = force_gimple_operand (new_base_addr, &stmts, true, 
NULL_TREE);
+  gsi = gsi_last_bb (e->src);
+  if (is_gimple_assign (gsi_stmt (gsi))
+ || is_gimple_call (gsi_stmt (gsi)))
+   gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+  else
+   gsi_insert_seq_before (&gsi, stmts, GSI_NEW_STMT);
+
+  new_phi = create_phi_node (make_ssa_name (type), skip_e->dest);
+  add_phi_arg (new_phi, new_base_addr, e, UNKNOWN_LOCATION);
+  add_phi_arg (new_phi, base_addr, skip_e, UNKNOWN_LOCATION);
+
+  new_base_addr = gimple_phi_result (new_phi);
+}
+  else
+{
+  gimple_seq stmts;
+  class loop *loop = LOOP_VINFO_LOOP (epilogue_vinfo);
+  tree type = TREE_TYPE (DR_BASE_ADDRESS (dr));
+  new_base_addr = fold_convert (type, new_base_addr);
+  new_base_addr = force_gimple_operand (new_base_addr, &stmts, true,
+   NULL_TREE);
+  gimple_stmt_iterator gsi
+   = gsi_last_bb (loop_preheader_edge (loop)->src);
+  if (!gsi_stmt (gsi)
+ || is_gimple_assign (gsi_stmt (gsi))
+ || is_gimple_call (gsi_stmt (gsi)))
+   gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+  else
+   gsi_insert_seq_be

Re: [PATCH][PR98791]: IRA: Make sure allocno copy mode's are ordered

2021-03-10 Thread Andre Vieira (lists) via Gcc-patches



On 19/02/2021 15:05, Vladimir Makarov wrote:


On 2021-02-19 5:53 a.m., Andre Vieira (lists) wrote:

Hi,

This patch makes sure that allocno copies are not created for 
unordered modes. The testcases in the PR highlighted a case where an 
allocno copy was being created for:

(insn 121 120 123 11 (parallel [
    (set (reg:VNx2QI 217)
    (vec_duplicate:VNx2QI (subreg/s/v:QI (reg:SI 93 [ _2 
]) 0)))

    (clobber (scratch:VNx16BI))
    ]) 4750 {*vec_duplicatevnx2qi_reg}
 (expr_list:REG_DEAD (reg:SI 93 [ _2 ])
    (nil)))

As the compiler detected that the vec_duplicate_reg pattern 
allowed the input and output operand to be of the same register 
class, it tried to create an allocno copy for these two operands, 
stripping subregs in the process. However, this meant that the copy 
was between VNx2QI and SI, which have unordered mode precisions.


So at compile time we do not know which of the two modes is smaller, 
which is a requirement when updating allocno copy costs.


Regression tested on aarch64-linux-gnu.

Is this OK for trunk (and after a week backport to gcc-10) ?

OK.  Yes, it is wise to wait a bit and see how the patch behaves on 
the trunk before submitting it to gcc-10 branch.  Sometimes such 
changes can have quite unexpected consequences.  But I guess not in 
this case.



Is it OK to backport now? The committed patch applies cleanly and I 
regression tested it on gcc-10 branch for aarch64-linux-gnu.


Kind regards,

Andre



Re: [PATCH][PR98791]: IRA: Make sure allocno copy mode's are ordered

2021-02-22 Thread Andre Vieira (lists) via Gcc-patches

Hi Alex,

On 22/02/2021 10:20, Alex Coplan wrote:

For the testcase, you might want to use the one I posted most recently:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98791#c3
which avoids the dependency on the aarch64-autovec-preference param
(which is in GCC 11 only) as this will simplify backporting.

But if it's preferable to have a testcase without SVE intrinsics for GCC
11 then we should stick with that.
I don't see any problem with having SVE intrinsics in the testcase; 
committed with your other test, as I agree it makes the eventual backport 
easier.


Thanks for pointing that out.
diff --git a/gcc/ira-conflicts.c b/gcc/ira-conflicts.c
index 
2c2234734c3166872d94d94c5960045cb89ff2a8..d83cfc1c1a708ba04f5e01a395721540e31173f0
 100644
--- a/gcc/ira-conflicts.c
+++ b/gcc/ira-conflicts.c
@@ -275,7 +275,10 @@ process_regs_for_copy (rtx reg1, rtx reg2, bool 
constraint_p,
   ira_allocno_t a1 = ira_curr_regno_allocno_map[REGNO (reg1)];
   ira_allocno_t a2 = ira_curr_regno_allocno_map[REGNO (reg2)];
 
-  if (!allocnos_conflict_for_copy_p (a1, a2) && offset1 == offset2)
+  if (!allocnos_conflict_for_copy_p (a1, a2)
+ && offset1 == offset2
+ && ordered_p (GET_MODE_PRECISION (ALLOCNO_MODE (a1)),
+   GET_MODE_PRECISION (ALLOCNO_MODE (a2
{
  cp = ira_add_allocno_copy (a1, a2, freq, constraint_p, insn,
 ira_curr_loop_tree_node);
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr98791.c 
b/gcc/testsuite/gcc.target/aarch64/sve/pr98791.c
new file mode 100644
index 
..cc1f1831afb68ba70016cbe26f8f9273cfceafa8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr98791.c
@@ -0,0 +1,12 @@
+/* PR rtl-optimization/98791  */
+/* { dg-do compile } */
+/* { dg-options "-O -ftree-vectorize" } */
+#include <arm_sve.h>
+extern char a[11];
+extern long b[];
+void f() {
+  for (int d; d < 10; d++) {
+a[d] = svaddv(svptrue_b8(), svdup_u8(0));
+b[d] = 0;
+  }
+}


[PATCH][PR98791]: IRA: Make sure allocno copy mode's are ordered

2021-02-19 Thread Andre Vieira (lists) via Gcc-patches

Hi,

This patch makes sure that allocno copies are not created for unordered 
modes. The testcases in the PR highlighted a case where an allocno copy 
was being created for:

(insn 121 120 123 11 (parallel [
    (set (reg:VNx2QI 217)
    (vec_duplicate:VNx2QI (subreg/s/v:QI (reg:SI 93 [ _2 ]) 
0)))

    (clobber (scratch:VNx16BI))
    ]) 4750 {*vec_duplicatevnx2qi_reg}
 (expr_list:REG_DEAD (reg:SI 93 [ _2 ])
    (nil)))

As the compiler detected that the vec_duplicate_reg pattern 
allowed the input and output operand to be of the same register class, 
it tried to create an allocno copy for these two operands, stripping 
subregs in the process. However, this meant that the copy was between 
VNx2QI and SI, which have unordered mode precisions.


So at compile time we do not know which of the two modes is smaller, 
which is a requirement when updating allocno copy costs.


Regression tested on aarch64-linux-gnu.

Is this OK for trunk (and after a week backport to gcc-10) ?

Regards,
Andre


gcc/ChangeLog:
2021-02-19  Andre Vieira  

    PR rtl-optimization/98791
    * ira-conflicts.c (process_regs_for_copy): Don't create allocno 
copies for unordered modes.


gcc/testsuite/ChangeLog:
2021-02-19  Andre Vieira  

    PR rtl-optimization/98791
    * gcc.target/aarch64/sve/pr98791.c: New test.

diff --git a/gcc/ira-conflicts.c b/gcc/ira-conflicts.c
index 
2c2234734c3166872d94d94c5960045cb89ff2a8..d83cfc1c1a708ba04f5e01a395721540e31173f0
 100644
--- a/gcc/ira-conflicts.c
+++ b/gcc/ira-conflicts.c
@@ -275,7 +275,10 @@ process_regs_for_copy (rtx reg1, rtx reg2, bool 
constraint_p,
   ira_allocno_t a1 = ira_curr_regno_allocno_map[REGNO (reg1)];
   ira_allocno_t a2 = ira_curr_regno_allocno_map[REGNO (reg2)];
 
-  if (!allocnos_conflict_for_copy_p (a1, a2) && offset1 == offset2)
+  if (!allocnos_conflict_for_copy_p (a1, a2)
+ && offset1 == offset2
+ && ordered_p (GET_MODE_PRECISION (ALLOCNO_MODE (a1)),
+   GET_MODE_PRECISION (ALLOCNO_MODE (a2
{
  cp = ira_add_allocno_copy (a1, a2, freq, constraint_p, insn,
 ira_curr_loop_tree_node);
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr98791.c 
b/gcc/testsuite/gcc.target/aarch64/sve/pr98791.c
new file mode 100644
index 
..ee0c7b51602cacd45f9e33acecb1eaa9f9edebf2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr98791.c
@@ -0,0 +1,12 @@
+/* PR rtl-optimization/98791  */
+/* { dg-do compile } */
+/* { dg-options "-O -ftree-vectorize --param=aarch64-autovec-preference=3" } */
+extern char a[], b[];
+short c, d;
+long *e;
+void f() {
+  for (int g; g < c; g += 1) {
+a[g] = d;
+b[g] = e[g];
+  }
+}


[AArch64] PR98657: Fix vec_duplicate creation in SVE's 3

2021-02-17 Thread Andre Vieira (lists) via Gcc-patches

Hi,

This patch prevents generating a vec_duplicate with an illegal predicate.

Regression tested on aarch64-linux-gnu.

OK for trunk?

gcc/ChangeLog:
2021-02-17  Andre Vieira  

    PR target/98657
    * config/aarch64/aarch64-sve.md: Use 'expand_vector_broadcast' 
to emit vec_duplicate's

    in '3' pattern.

gcc/testsuite/ChangeLog:
2021-02-17  Andre Vieira  

    PR target/98657
    * gcc.target/aarch64/sve/pr98657.c: New test.
diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index 
608319600318974b414e47285ee1474b041f0e05..7db2938bb84e04d066a7b07574e5cf344a3a8fb6
 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -4549,10 +4549,8 @@ (define_expand "3"
   }
 else
   {
-   amount = gen_reg_rtx (mode);
-   emit_insn (gen_vec_duplicate (amount,
-   convert_to_mode (mode,
-operands[2], 0)));
+   amount = convert_to_mode (mode, operands[2], 0);
+   amount = expand_vector_broadcast (mode, amount);
   }
 emit_insn (gen_v3 (operands[0], operands[1], amount));
 DONE;
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr98657.c 
b/gcc/testsuite/gcc.target/aarch64/sve/pr98657.c
new file mode 100644
index 
..592af25d7bbc69bc05823d27358f07cd741dbe20
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr98657.c
@@ -0,0 +1,9 @@
+/* PR target/98657  */
+/* { dg-do compile } */
+/* { dg-options "-O3 -msve-vector-bits=256" } */
+extern char a[];
+void b(_Bool c[][18]) {
+  int d;
+  for (int e = 0; e < 23; e++)
+a[e] = 6 >> c[1][d];
+}


Re: PR98974: Fix vectorizable_condition after STMT_VINFO_VEC_STMTS

2021-02-05 Thread Andre Vieira (lists) via Gcc-patches



On 05/02/2021 12:47, Richard Sandiford wrote:

"Andre Vieira (lists)"  writes:

Hi,

As mentioned in the PR, this patch fixes up the nvectors parameter passed to 
vect_get_loop_mask in vectorizable_condition.
Before the STMT_VINFO_VEC_STMTS rework we used to handle each ncopy separately; 
now we gather them all at the same time and don't need to multiply vec_num by 
ncopies.

The reduced testcase I used to illustrate the issue in the PR gives a warning; 
if someone knows how to get rid of that (it's Fortran) I'd include it as a 
testcase for this.

Looks like Richi's since posted one.

Included it.

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 
0bc1cb1c5b4f6c1f0447241b4d31434bf7cca1a4..d07602f6d38f9c51936ac09942599fc5a14f46ab
 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -10237,8 +10237,7 @@ vectorizable_condition (vec_info *vinfo,
{
  unsigned vec_num = vec_oprnds0.length ();
  tree loop_mask
-   = vect_get_loop_mask (gsi, masks, vec_num * ncopies,
- vectype, i);
+   = vect_get_loop_mask (gsi, masks, vec_num, vectype, i);
  tree tmp2 = make_ssa_name (vec_cmp_type);
  gassign *g
= gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare,

Does removing the shadowed vec_num work?  I think that would be less
confusing, and means that the calculation stays in sync with the

Yeah that works too.

Here's a reworked patch.


gcc/ChangeLog:
2021-02-05  Andre Vieira  

    PR middle-end/98974
    * tree-vect-stmts.c (vectorizable_condition): Fix nvectors 
parameter

    for vect_get_loop_mask call.

gcc/testsuite/ChangeLog:
2021-02-05  Andre Vieira  

    * gfortran.dg/pr98974.F90: New test.
diff --git a/gcc/testsuite/gfortran.dg/pr98974.F90 
b/gcc/testsuite/gfortran.dg/pr98974.F90
new file mode 100644
index 
..b3db6a6654a0b36bc567405c70429a5dbe168d1e
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr98974.F90
@@ -0,0 +1,20 @@
+! PR middle-end/98974
+! { dg-do compile { target { aarch64*-*-* } } }
+! { dg-options "-Ofast -mcpu=neoverse-v1" }
+
+module module_foobar
+  integer,parameter :: fp_kind = selected_real_kind(15)
+contains
+ subroutine foobar( foo, ix ,jx ,kx,iy,ky)
+   real, dimension( ix, kx, jx )  :: foo
+   real(fp_kind), dimension( iy, ky, 3 ) :: bar, baz
+   do k=1,ky
+  do i=1,iy
+if ( baz(i,k,1) > 0. ) then
+  bar(i,k,1) = 0
+endif
+foo(i,nk,j) = baz0 *  bar(i,k,1)
+  enddo
+   enddo
+ end
+end
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 
0bc1cb1c5b4f6c1f0447241b4d31434bf7cca1a4..064e5d138ce9a151287662a0caefd9925b0a2920
 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -10235,7 +10235,6 @@ vectorizable_condition (vec_info *vinfo,
 
  if (masks)
{
- unsigned vec_num = vec_oprnds0.length ();
  tree loop_mask
= vect_get_loop_mask (gsi, masks, vec_num * ncopies,
  vectype, i);


PR98974: Fix vectorizable_condition after STMT_VINFO_VEC_STMTS

2021-02-05 Thread Andre Vieira (lists) via Gcc-patches

Hi,

As mentioned in the PR, this patch fixes up the nvectors parameter passed to 
vect_get_loop_mask in vectorizable_condition.
Before the STMT_VINFO_VEC_STMTS rework we used to handle each ncopy separately; 
now we gather them all at the same time and don't need to multiply vec_num by 
ncopies.

The reduced testcase I used to illustrate the issue in the PR gives a warning; 
if someone knows how to get rid of that (it's Fortran) I'd include it as a 
testcase for this.

Bootstrapped and regression tested on aarch64-none-linux-gnu. I don't believe 
that code triggers for other targets, so not sure it makes sense to test on 
others?

Is this OK for trunk? Would you rather wait for the testcase?

gcc/ChangeLog:
2021-02-05  Andre Vieira  

PR middle-end/98974
* tree-vect-stmts.c (vectorizable_condition): Fix nvectors parameter
for vect_get_loop_mask call.

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 
0bc1cb1c5b4f6c1f0447241b4d31434bf7cca1a4..d07602f6d38f9c51936ac09942599fc5a14f46ab
 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -10237,8 +10237,7 @@ vectorizable_condition (vec_info *vinfo,
{
  unsigned vec_num = vec_oprnds0.length ();
  tree loop_mask
-   = vect_get_loop_mask (gsi, masks, vec_num * ncopies,
- vectype, i);
+   = vect_get_loop_mask (gsi, masks, vec_num, vectype, i);
  tree tmp2 = make_ssa_name (vec_cmp_type);
  gassign *g
= gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare,


[AArch64] Fix vector multiplication costs

2021-02-03 Thread Andre Vieira (lists) via Gcc-patches
This patch introduces a vect.mul RTX cost and decouples the vector 
multiplication costing from the scalar one.


After Wilco's "AArch64: Add cost table for Cortex-A76" patch we saw a 
regression in vector codegen, reproducible with the small test added in 
this patch.
Upon further investigation we noticed 'aarch64_rtx_mult_cost' was using 
scalar costs to calculate the cost of vector multiplication, which was 
now lower and preventing 'choose_mult_variant' from making the right 
choice to expand such vector multiplications with constants as shifts and 
subs.  I also added a special case for SSRA to use the default vector 
cost rather than mult; SSRA seems to be costed using 
'aarch64_rtx_mult_cost', which to be fair is quite curious.  I believe we 
should have a better look at 'aarch64_rtx_costs' altogether and 
completely decouple vector and scalar costs, though that is something 
that requires more rewriting than I believe should be done in Stage 4.
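
To illustrate the regression with a hand-written sketch (this is not the 
committed asimd-mul-to-shl-sub.c test): in a loop like the one below the 
vector multiply by a constant should be expanded as a shift and a subtract, 
e.g. x * 7 == (x << 3) - x, but with the scalar-based cost the multiply 
looked too cheap for 'choose_mult_variant' to pick that expansion.

void
mul_by_7 (int *restrict dst, const int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i] * 7;   /* candidate for (src[i] << 3) - src[i] */
}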


I gave all targets a vect.mult cost of 4x the vect.alu cost, with the 
exception of targets with cost 0 for vect.alu, those I gave the cost 4.


Bootstrapped on aarch64.

Is this OK for trunk?

gcc/ChangeLog:

    * config/aarch64/aarch64-cost-tables.h: Add entries for vect.mul.
    * config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Use 
vect.mul for

    vector multiplies and vect.alu for SSRA.
    * config/arm/aarch-common-protos.h (struct vector_cost_table): 
Define

    vect.mul cost field.
    * config/arm/aarch-cost-tables.h: Add entries for vect.mul.
    * config/arm/arm.c: Likewise.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/asimd-mul-to-shl-sub.c: New test.

diff --git a/gcc/config/aarch64/aarch64-cost-tables.h 
b/gcc/config/aarch64/aarch64-cost-tables.h
index 
c309f88cbd56f0d2347996d860c982a3a6744492..dd2e7e7cbb13d24f0b51092270cd7e2d75fabf29
 100644
--- a/gcc/config/aarch64/aarch64-cost-tables.h
+++ b/gcc/config/aarch64/aarch64-cost-tables.h
@@ -123,7 +123,8 @@ const struct cpu_cost_table qdf24xx_extra_costs =
   },
   /* Vector */
   {
-COSTS_N_INSNS (1)  /* alu.  */
+COSTS_N_INSNS (1),  /* alu.  */
+COSTS_N_INSNS (4)   /* mult.  */
   }
 };
 
@@ -227,7 +228,8 @@ const struct cpu_cost_table thunderx_extra_costs =
   },
   /* Vector */
   {
-COSTS_N_INSNS (1)  /* Alu.  */
+COSTS_N_INSNS (1), /* Alu.  */
+COSTS_N_INSNS (4)  /* mult.  */
   }
 };
 
@@ -330,7 +332,8 @@ const struct cpu_cost_table thunderx2t99_extra_costs =
   },
   /* Vector */
   {
-COSTS_N_INSNS (1)  /* Alu.  */
+COSTS_N_INSNS (1), /* Alu.  */
+COSTS_N_INSNS (4)  /* Mult.  */
   }
 };
 
@@ -433,7 +436,8 @@ const struct cpu_cost_table thunderx3t110_extra_costs =
   },
   /* Vector */
   {
-COSTS_N_INSNS (1)  /* Alu.  */
+COSTS_N_INSNS (1), /* Alu.  */
+COSTS_N_INSNS (4)  /* Mult.  */
   }
 };
 
@@ -537,7 +541,8 @@ const struct cpu_cost_table tsv110_extra_costs =
   },
   /* Vector */
   {
-COSTS_N_INSNS (1)  /* alu.  */
+COSTS_N_INSNS (1),  /* alu.  */
+COSTS_N_INSNS (4)   /* mult.  */
   }
 };
 
@@ -640,7 +645,8 @@ const struct cpu_cost_table a64fx_extra_costs =
   },
   /* Vector */
   {
-COSTS_N_INSNS (1)  /* alu.  */
+COSTS_N_INSNS (1),  /* alu.  */
+COSTS_N_INSNS (4)   /* mult.  */
   }
 };
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
b6192e55521004ae70cd13acbdb4dab142216845..146ed8c1b693d7204a754bc4e6d17025e0af544b
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11568,7 +11568,6 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int 
outer, bool speed)
   if (VECTOR_MODE_P (mode))
 {
   unsigned int vec_flags = aarch64_classify_vector_mode (mode);
-  mode = GET_MODE_INNER (mode);
   if (vec_flags & VEC_ADVSIMD)
{
  /* The by-element versions of the instruction have the same costs as
@@ -11582,6 +11581,17 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int 
outer, bool speed)
  else if (GET_CODE (op1) == VEC_DUPLICATE)
op1 = XEXP (op1, 0);
}
+  cost += rtx_cost (op0, mode, MULT, 0, speed);
+  cost += rtx_cost (op1, mode, MULT, 1, speed);
+  if (speed)
+   {
+ if (GET_CODE (x) == MULT)
+   cost += extra_cost->vect.mult;
+ /* This is to catch the SSRA costing currently flowing here.  */
+ else
+   cost += extra_cost->vect.alu;
+   }
+  return cost;
 }
 
   /* Integer multiply/fma.  */
diff --git a/gcc/config/arm/aarch-common-protos.h 
b/gcc/config/arm/aarch-common-protos.h
index 
251de3d61a833a2bb4b77e9211cac7fbc17c0b75..7a9cf3d324c103de74af741abe9ef30b76fea5ce
 100644
--- a/gcc/config/arm/aarch-common-protos.h
+++ b/gcc/config/arm/aarch-common-protos.h
@@ -132,6 +132,7 @@ struct fp_cost_table
 struct vector_cost_table
 {
   const int alu;
+  const int mult;
 };
 
 struct cpu_cost_table
diff --git a/gcc/config/arm/aarch-cost-tables.h 

Re: [PATCH] arm: Fix up neon_vector_mem_operand [PR97528]

2021-02-03 Thread Andre Vieira (lists) via Gcc-patches
The same patch applies cleanly on gcc-8; bootstrapped on 
arm-none-linux-gnueabihf and regressions also came back clean.


Can I also commit it to gcc-8?

Thanks,
Andre

On 02/02/2021 17:36, Kyrylo Tkachov wrote:



-Original Message-
From: Andre Vieira (lists) 
Sent: 02 February 2021 17:27
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov ; ja...@redhat.com
Subject: Re: [PATCH] arm: Fix up neon_vector_mem_operand [PR97528]

Hi,

This is a gcc-9 backport of the PR97528 fix that has been applied to
trunk and gcc-10.
Bootstraped on arm-linux-gnueabihf and regression tested.

OK for gcc-9 branch?

Ok.
Thanks,
Kyrill


2021-02-02  Andre Vieira  

      Backport from mainline
      2020-11-20  Jakub Jelinek  

      PR target/97528
      * config/arm/arm.c (neon_vector_mem_operand): For POST_MODIFY,
require
      first POST_MODIFY operand is a REG and is equal to the first operand
      of PLUS.

      * gcc.target/arm/pr97528.c: New test.

On 20/11/2020 11:25, Kyrylo Tkachov via Gcc-patches wrote:

-Original Message-
From: Jakub Jelinek 
Sent: 19 November 2020 18:57
To: Richard Earnshaw ; Ramana
Radhakrishnan ; Kyrylo Tkachov

Cc: gcc-patches@gcc.gnu.org
Subject: [PATCH] arm: Fix up neon_vector_mem_operand [PR97528]

Hi!

The documentation for POST_MODIFY says:
 Currently, the compiler can only handle second operands of the
 form (plus (reg) (reg)) and (plus (reg) (const_int)), where
 the first operand of the PLUS has to be the same register as
 the first operand of the *_MODIFY.
The following testcase ICEs, because combine just attempts to simplify
things and ends up with
(post_modify (reg1) (plus (mult (reg2) (const_int 4)) (reg1))
but the target predicates accept it, because they only verify
that POST_MODIFY's second operand is PLUS and the second operand
of the PLUS is a REG.

The following patch fixes this by performing further verification that
the POST_MODIFY is in the form it should be.

Bootstrapped/regtested on armv7hl-linux-gnueabi, ok for trunk
and release branches after a while?

Ok.
Thanks,
Kyrill


2020-11-19  Jakub Jelinek  

PR target/97528
* config/arm/arm.c (neon_vector_mem_operand): For
POST_MODIFY, require
first POST_MODIFY operand is a REG and is equal to the first operand
of PLUS.

* gcc.target/arm/pr97528.c: New test.

--- gcc/config/arm/arm.c.jj 2020-11-13 19:00:46.729620560 +0100
+++ gcc/config/arm/arm.c2020-11-18 17:05:44.656867343 +0100
@@ -13429,7 +13429,9 @@ neon_vector_mem_operand (rtx op, int typ
 /* Allow post-increment by register for VLDn */
 if (type == 2 && GET_CODE (ind) == POST_MODIFY
 && GET_CODE (XEXP (ind, 1)) == PLUS
-  && REG_P (XEXP (XEXP (ind, 1), 1)))
+  && REG_P (XEXP (XEXP (ind, 1), 1))
+  && REG_P (XEXP (ind, 0))
+  && rtx_equal_p (XEXP (ind, 0), XEXP (XEXP (ind, 1), 0)))
return true;

 /* Match:
--- gcc/testsuite/gcc.target/arm/pr97528.c.jj   2020-11-18
17:09:58.195053288 +0100
+++ gcc/testsuite/gcc.target/arm/pr97528.c  2020-11-18
17:09:47.839168237 +0100
@@ -0,0 +1,28 @@
+/* PR target/97528 */
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-options "-O1" }  */
+/* { dg-add-options arm_neon } */
+
+#include <arm_neon.h>
+
+typedef __simd64_int16_t T;
+typedef __simd64_uint16_t U;
+unsigned short c;
+int d;
+U e;
+
+void
+foo (void)
+{
+  unsigned short *dst = &c;
+  int g = d, b = 4;
+  U dc = e;
+  for (int h = 0; h < b; h++)
+{
+  unsigned short *i = dst;
+  U j = dc;
+  vst1_s16 ((int16_t *) i, (T) j);
+  dst += g;
+}
+}


Jakub


Re: [PATCH] arm: Fix up neon_vector_mem_operand [PR97528]

2021-02-02 Thread Andre Vieira (lists) via Gcc-patches

Hi,

This is a gcc-9 backport of the PR97528 fix that has been applied to 
trunk and gcc-10.

Bootstrapped on arm-linux-gnueabihf and regression tested.

OK for gcc-9 branch?

2021-02-02  Andre Vieira  

    Backport from mainline
    2020-11-20  Jakub Jelinek  

    PR target/97528
    * config/arm/arm.c (neon_vector_mem_operand): For POST_MODIFY, require
    first POST_MODIFY operand is a REG and is equal to the first operand
    of PLUS.

    * gcc.target/arm/pr97528.c: New test.

On 20/11/2020 11:25, Kyrylo Tkachov via Gcc-patches wrote:



-Original Message-
From: Jakub Jelinek 
Sent: 19 November 2020 18:57
To: Richard Earnshaw ; Ramana
Radhakrishnan ; Kyrylo Tkachov

Cc: gcc-patches@gcc.gnu.org
Subject: [PATCH] arm: Fix up neon_vector_mem_operand [PR97528]

Hi!

The documentation for POST_MODIFY says:
Currently, the compiler can only handle second operands of the
form (plus (reg) (reg)) and (plus (reg) (const_int)), where
the first operand of the PLUS has to be the same register as
the first operand of the *_MODIFY.
The following testcase ICEs, because combine just attempts to simplify
things and ends up with
(post_modify (reg1) (plus (mult (reg2) (const_int 4)) (reg1))
but the target predicates accept it, because they only verify
that POST_MODIFY's second operand is PLUS and the second operand
of the PLUS is a REG.

The following patch fixes this by performing further verification that
the POST_MODIFY is in the form it should be.

Bootstrapped/regtested on armv7hl-linux-gnueabi, ok for trunk
and release branches after a while?

Ok.
Thanks,
Kyrill


2020-11-19  Jakub Jelinek  

PR target/97528
* config/arm/arm.c (neon_vector_mem_operand): For
POST_MODIFY, require
first POST_MODIFY operand is a REG and is equal to the first operand
of PLUS.

* gcc.target/arm/pr97528.c: New test.

--- gcc/config/arm/arm.c.jj 2020-11-13 19:00:46.729620560 +0100
+++ gcc/config/arm/arm.c2020-11-18 17:05:44.656867343 +0100
@@ -13429,7 +13429,9 @@ neon_vector_mem_operand (rtx op, int typ
/* Allow post-increment by register for VLDn */
if (type == 2 && GET_CODE (ind) == POST_MODIFY
&& GET_CODE (XEXP (ind, 1)) == PLUS
-  && REG_P (XEXP (XEXP (ind, 1), 1)))
+  && REG_P (XEXP (XEXP (ind, 1), 1))
+  && REG_P (XEXP (ind, 0))
+  && rtx_equal_p (XEXP (ind, 0), XEXP (XEXP (ind, 1), 0)))
   return true;

/* Match:
--- gcc/testsuite/gcc.target/arm/pr97528.c.jj   2020-11-18
17:09:58.195053288 +0100
+++ gcc/testsuite/gcc.target/arm/pr97528.c  2020-11-18
17:09:47.839168237 +0100
@@ -0,0 +1,28 @@
+/* PR target/97528 */
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-options "-O1" }  */
+/* { dg-add-options arm_neon } */
+
+#include <arm_neon.h>
+
+typedef __simd64_int16_t T;
+typedef __simd64_uint16_t U;
+unsigned short c;
+int d;
+U e;
+
+void
+foo (void)
+{
+  unsigned short *dst = &c;
+  int g = d, b = 4;
+  U dc = e;
+  for (int h = 0; h < b; h++)
+{
+  unsigned short *i = dst;
+  U j = dc;
+  vst1_s16 ((int16_t *) i, (T) j);
+  dst += g;
+}
+}


Jakub
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 
04edd637d43198ad801bb5ada8f1807faae7001e..4679da75dd823778d5a3e37c497ee10793e9c7d7
 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -12730,7 +12730,9 @@ neon_vector_mem_operand (rtx op, int type, bool strict)
   /* Allow post-increment by register for VLDn */
   if (type == 2 && GET_CODE (ind) == POST_MODIFY
   && GET_CODE (XEXP (ind, 1)) == PLUS
-  && REG_P (XEXP (XEXP (ind, 1), 1)))
+  && REG_P (XEXP (XEXP (ind, 1), 1))
+  && REG_P (XEXP (ind, 0))
+  && rtx_equal_p (XEXP (ind, 0), XEXP (XEXP (ind, 1), 0)))
  return true;
 
   /* Match:
diff --git a/gcc/testsuite/gcc.target/arm/pr97528.c 
b/gcc/testsuite/gcc.target/arm/pr97528.c
new file mode 100644
index 
..6cc59f2158c5f8c8dd78e5083ca7ebc4e5f63a44
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pr97528.c
@@ -0,0 +1,28 @@
+/* PR target/97528 */
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-options "-O1" }  */
+/* { dg-add-options arm_neon } */
+
+#include <arm_neon.h>
+
+typedef __simd64_int16_t T;
+typedef __simd64_uint16_t U;
+unsigned short c;
+int d;
+U e;
+
+void
+foo (void)
+{
+  unsigned short *dst = &c;
+  int g = d, b = 4;
+  U dc = e;
+  for (int h = 0; h < b; h++)
+{
+  unsigned short *i = dst;
+  U j = dc;
+  vst1_s16 ((int16_t *) i, (T) j);
+  dst += g;
+}
+}


Re: RFC: ARM MVE and Neon auto-vectorization

2020-12-09 Thread Andre Vieira (lists) via Gcc-patches



On 08/12/2020 13:50, Christophe Lyon via Gcc-patches wrote:

Hi,


My 'vand' patch changes the definition of VDQ so that the relevant
modes are enabled only when !TARGET_HAVE_MVE (V8QI, ...), and this
helps writing a simpler expander.

However, vneg is used by vshr (right-shifts by register are
implemented as left-shift by negation of that register), so the
expander uses something like:

   emit_insn (gen_neg2 (neg, operands[2]));
   if (TARGET_NEON)
   emit_insn (gen_ashl3_signed (operands[0], operands[1], neg));
   else
   emit_insn (gen_mve_vshlq_s (operands[0], operands[1], neg));

which does not work if the iterator has conditional members: the
'else' part is still generated for modes unsupported by MVE.

So I guess my question is:  do we want to enforce implementation
of Neon / MVE common parts? There are already lots of partly
overlapping/duplicate iterators. I have tried to split iterators into
eg VDQ_COMMON_TO_NEON_AND_MVE and VDQ_NEON_ONLY but this means we have
to basically duplicate the expanders which defeats the point...
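
(For concreteness, the C-level operation behind these expanders is a right 
shift by per-element register amounts, something like the illustrative loop 
below; the names are made up.)

#include <stdint.h>

void
vshr_by_reg (int32_t *restrict d, const int32_t *restrict a,
	     const int32_t *restrict b, int n)
{
  /* Neon and MVE implement this as a left shift by the negated amounts.  */
  for (int i = 0; i < n; i++)
    d[i] = a[i] >> b[i];
}
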
Ideally I think we'd want a minimal number of iterators and defines, which 
was the idea behind the conditional iterators disabling 64-bit modes for 
MVE.


Obviously that then breaks the code above.  For this specific case I 
would suggest unifying the define_insns ashl3_{signed,unsigned} and 
mve_vshlq_, as they are very much the same patterns; I also 
don't understand why ashl's signed and unsigned variants are separate.  For 
instance, create an 'ashl3__' or something like that, make 
sure the calls to gen_ashl33_{unsigned,signed} now call 
gen_ashl3__, and have arm_mve_builtins.def use 
ashl3__ instead of this, as it needs to be at the end of 
the name for the builtin construct.  Whether this 'form' would work 
everywhere, I don't know, and I suspect you might find more issues like 
this.  If there are more than you are willing to change right now then 
maybe the easier step forward is to tackle them one at a time, 
and use a new conditional iterator where you've been able to merge NEON 
and MVE patterns.


As a general strategy I think we should try to clean the mess up, but I 
don't think we should try to clean it all up in one go as that will 
probably lead to it not getting done at all. I'm not the maintainer, so 
I'd be curious to see how Kyrill feels about this, but in my opinion we 
should take patches that don't make it less maintainable, so if you can 
clean it up as much as possible, great! Otherwise if its not making the 
mess bigger and its enabling auto-vec then I personally don't see why it 
shouldn't be accepted.

Or we can keep different expanders for Neon and MVE? But we have
already quite a few in vec-common.md.
We can't keep different expanders if they expand the same optab with the 
same modes in the same backend. So we will always have to make NEON and 
MVE work together.


Re: [PATCH 1/7] arm: Auto-vectorization for MVE: vand

2020-11-27 Thread Andre Vieira (lists) via Gcc-patches

Hi Christophe,

On 26/11/2020 15:31, Christophe Lyon wrote:

Hi Andre,

Thanks for the quick feedback.

On Wed, 25 Nov 2020 at 18:17, Andre Simoes Dias Vieira
 wrote:

Hi Christophe,

Thanks for these! See some inline comments.

On 25/11/2020 13:54, Christophe Lyon via Gcc-patches wrote:

This patch enables MVE vandq instructions for auto-vectorization.  MVE
vandq insns in mve.md are modified to use 'and' instead of unspec
expression to support and3.  The and3 expander is added to
vec-common.md

2020-11-12  Christophe Lyon  

   gcc/
   * gcc/config/arm/iterators.md (supf): Remove VANDQ_S and VANDQ_U.
   (VANQ): Remove.
   * config/arm/mve.md (mve_vandq_u): New entry for vand
   instruction using expression and.
   (mve_vandq_s): New expander.
   * config/arm/neon.md (and3): Renamed into and3_neon.
   * config/arm/unspecs.md (VANDQ_S, VANDQ_U): Remove.
   * config/arm/vec-common.md (and3): New expander.

   gcc/testsuite/
   * gcc.target/arm/simd/mve-vand.c: New test.
---
   gcc/config/arm/iterators.md  |  4 +---
   gcc/config/arm/mve.md| 20 
   gcc/config/arm/neon.md   |  2 +-
   gcc/config/arm/unspecs.md|  2 --
   gcc/config/arm/vec-common.md | 15 
   gcc/testsuite/gcc.target/arm/simd/mve-vand.c | 34 

   6 files changed, 66 insertions(+), 11 deletions(-)
   create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vand.c

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 592af35..72039e4 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -1232,8 +1232,7 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") 
(VREV16Q_S "s")
  (VADDLVQ_P_U "u") (VCMPNEQ_U "u") (VCMPNEQ_S "s")
  (VABDQ_M_S "s") (VABDQ_M_U "u") (VABDQ_S "s")
  (VABDQ_U "u") (VADDQ_N_S "s") (VADDQ_N_U "u")
-(VADDVQ_P_S "s") (VADDVQ_P_U "u") (VANDQ_S "s")
-(VANDQ_U "u") (VBICQ_S "s") (VBICQ_U "u")
+(VADDVQ_P_S "s") (VADDVQ_P_U "u") (VBICQ_S "s") (VBICQ_U 
"u")
  (VBRSRQ_N_S "s") (VBRSRQ_N_U "u") (VCADDQ_ROT270_S "s")
  (VCADDQ_ROT270_U "u") (VCADDQ_ROT90_S "s")
  (VCMPEQQ_S "s") (VCMPEQQ_U "u") (VCADDQ_ROT90_U "u")
@@ -1501,7 +1500,6 @@ (define_int_iterator VABDQ [VABDQ_S VABDQ_U])
   (define_int_iterator VADDQ_N [VADDQ_N_S VADDQ_N_U])
   (define_int_iterator VADDVAQ [VADDVAQ_S VADDVAQ_U])
   (define_int_iterator VADDVQ_P [VADDVQ_P_U VADDVQ_P_S])
-(define_int_iterator VANDQ [VANDQ_U VANDQ_S])
   (define_int_iterator VBICQ [VBICQ_S VBICQ_U])
   (define_int_iterator VBRSRQ_N [VBRSRQ_N_U VBRSRQ_N_S])
   (define_int_iterator VCADDQ_ROT270 [VCADDQ_ROT270_S VCADDQ_ROT270_U])
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index ecbaaa9..975eb7d 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -894,17 +894,27 @@ (define_insn "mve_vaddvq_p_"
   ;;
   ;; [vandq_u, vandq_s])
   ;;
-(define_insn "mve_vandq_"
+;; signed and unsigned versions are the same: define the unsigned
+;; insn, and use an expander for the signed one as we still reference
+;; both names from arm_mve.h.
+(define_insn "mve_vandq_u"
 [
  (set (match_operand:MVE_2 0 "s_register_operand" "=w")
- (unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
-(match_operand:MVE_2 2 "s_register_operand" "w")]
-  VANDQ))
+ (and:MVE_2 (match_operand:MVE_2 1 "s_register_operand" "w")
+(match_operand:MVE_2 2 "s_register_operand" "w")))

The predicate on the second operand is more restrictive than the one in
the expander, 'neon_inv_logic_op2'. This should still work with immediates
(well, I checked for integers), but it generates a loop like this:


Right, thanks for catching it.


  vldrw.32	q3, [r0]
  vldr.64	d4, .L8
  vldr.64	d5, .L8+8
  vand	q3, q3, q2
  vstrw.32	q3, [r2]

MVE does support vand with immediates, just like NEON; I suspect you
could just copy the way Neon handles these, and it is probably worth the
little extra effort. You can use dest[i] = a[i] & ~1 as a testcase.
If you don't, it might still be worth expanding the test to make sure
other immediate-type combinations don't trigger an ICE?
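
Something along these lines would do (just a sketch, function and type names 
are up to you):

#include <stdint.h>

void
and_imm (uint32_t *restrict dest, const uint32_t *restrict a, int n)
{
  /* The constant operand should exercise the immediate form of vand/vbic.  */
  for (int i = 0; i < n; i++)
    dest[i] = a[i] & ~1u;
}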

I'm not sure I understand why it loads the constant in two 64-bit chunks
rather than doing a single load, or just something like a vmov or vbic
immediate. Anyhow, that's a worry for another day I guess.

Do you mean something like the attached (on top of this patch)?
I dislike the code duplication in mve_vandq_u which would
become a copy of and3_neon.

Hi Christophe,

Yeah that's what I meant. I agree with the code duplication. The reason 
we still use separate ones is because of the difference in supported 
modes. Maybe the right way around 

Re: [PATCH 3/7] arm: Auto-vectorization for MVE: veor

2020-11-26 Thread Andre Vieira (lists) via Gcc-patches

LGTM,  but please wait for maintainer review.

On 25/11/2020 13:54, Christophe Lyon via Gcc-patches wrote:

This patch enables MVE veorq instructions for auto-vectorization.  MVE
veorq insns in mve.md are modified to use xor instead of unspec
expression to support xor3.  The xor3 expander is added to
vec-common.md

2020-11-12  Christophe Lyon  

gcc/
* config/arm/iterators.md (supf): Remove VEORQ_S and VEORQ_U.
(VEORQ): Remove.
* config/arm/mve.md (mve_veorq_u): New entry for veor
instruction using expression xor.
(mve_veorq_s): New expander.
* config/arm/neon.md (xor3): Renamed into xor3_neon.
* config/arm/unspecs.md (VEORQ_S, VEORQ_U): Remove.
* config/arm/vec-common.md (xor3): New expander.

gcc/testsuite/
* gcc.target/arm/simd/mve-veor.c: Add tests for veor.
---
  gcc/config/arm/iterators.md  |  3 +--
  gcc/config/arm/mve.md| 17 ++
  gcc/config/arm/neon.md   |  2 +-
  gcc/config/arm/unspecs.md|  2 --
  gcc/config/arm/vec-common.md | 15 
  gcc/testsuite/gcc.target/arm/simd/mve-veor.c | 34 
  6 files changed, 63 insertions(+), 10 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-veor.c

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 5fcb7af..0195275 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -1237,7 +1237,7 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") 
(VREV16Q_S "s")
   (VCADDQ_ROT270_U "u") (VCADDQ_ROT90_S "s")
   (VCMPEQQ_S "s") (VCMPEQQ_U "u") (VCADDQ_ROT90_U "u")
   (VCMPEQQ_N_S "s") (VCMPEQQ_N_U "u") (VCMPNEQ_N_S "s")
-  (VCMPNEQ_N_U "u") (VEORQ_S "s") (VEORQ_U "u")
+  (VCMPNEQ_N_U "u")
   (VHADDQ_N_S "s") (VHADDQ_N_U "u") (VHADDQ_S "s")
   (VHADDQ_U "u") (VHSUBQ_N_S "s")  (VHSUBQ_N_U "u")
   (VHSUBQ_S "s") (VMAXQ_S "s") (VMAXQ_U "u") (VHSUBQ_U "u")
@@ -1507,7 +1507,6 @@ (define_int_iterator VCADDQ_ROT90 [VCADDQ_ROT90_U 
VCADDQ_ROT90_S])
  (define_int_iterator VCMPEQQ [VCMPEQQ_U VCMPEQQ_S])
  (define_int_iterator VCMPEQQ_N [VCMPEQQ_N_S VCMPEQQ_N_U])
  (define_int_iterator VCMPNEQ_N [VCMPNEQ_N_U VCMPNEQ_N_S])
-(define_int_iterator VEORQ [VEORQ_U VEORQ_S])
  (define_int_iterator VHADDQ [VHADDQ_S VHADDQ_U])
  (define_int_iterator VHADDQ_N [VHADDQ_N_U VHADDQ_N_S])
  (define_int_iterator VHSUBQ [VHSUBQ_S VHSUBQ_U])
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 0f04044..a5f5d75 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -1204,17 +1204,24 @@ (define_insn "mve_vcmpneq_n_"
  ;;
  ;; [veorq_u, veorq_s])
  ;;
-(define_insn "mve_veorq_"
+(define_insn "mve_veorq_u"
[
 (set (match_operand:MVE_2 0 "s_register_operand" "=w")
-   (unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
-  (match_operand:MVE_2 2 "s_register_operand" "w")]
-VEORQ))
+   (xor:MVE_2 (match_operand:MVE_2 1 "s_register_operand" "w")
+  (match_operand:MVE_2 2 "s_register_operand" "w")))
]
"TARGET_HAVE_MVE"
-  "veor %q0, %q1, %q2"
+  "veor\t%q0, %q1, %q2"
[(set_attr "type" "mve_move")
  ])
+(define_expand "mve_veorq_s"
+  [
+   (set (match_operand:MVE_2 0 "s_register_operand")
+   (xor:MVE_2 (match_operand:MVE_2 1 "s_register_operand")
+  (match_operand:MVE_2 2 "s_register_operand")))
+  ]
+  "TARGET_HAVE_MVE"
+)
  
  ;;

  ;; [vhaddq_n_u, vhaddq_n_s])
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 669c34d..e1263b0 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -747,7 +747,7 @@ (define_insn "bic3_neon"
[(set_attr "type" "neon_logic")]
  )
  
-(define_insn "xor3"

+(define_insn "xor3_neon"
[(set (match_operand:VDQ 0 "s_register_operand" "=w")
(xor:VDQ (match_operand:VDQ 1 "s_register_operand" "w")
 (match_operand:VDQ 2 "s_register_operand" "w")))]
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index f111ad8..78313ea 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -608,7 +608,6 @@ (define_c_enum "unspec" [
VCMPEQQ_S
VCMPEQQ_N_S
VCMPNEQ_N_S
-  VEORQ_S
VHADDQ_S
VHADDQ_N_S
VHSUBQ_S
@@ -653,7 +652,6 @@ (define_c_enum "unspec" [
VCMPEQQ_U
VCMPEQQ_N_U
VCMPNEQ_N_U
-  VEORQ_U
VHADDQ_U
VHADDQ_N_U
VHSUBQ_U
diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
index 413fb07..687134a 100644
--- a/gcc/config/arm/vec-common.md
+++ b/gcc/config/arm/vec-common.md
@@ -202,3 +202,18 @@ (define_expand "ior3"
  (match_operand:VNINOTM1 2 "neon_logic_op2" "")))]
"TARGET_NEON"
  )
+
+(define_expand "xor3"
+  [(set 

Re: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved registers with CMSE

2020-07-20 Thread Andre Vieira (lists)



On 08/07/2020 09:04, Andre Simoes Dias Vieira wrote:


On 07/07/2020 13:43, Christophe Lyon wrote:

Hi,


On Mon, 6 Jul 2020 at 16:31, Andre Vieira (lists)
 wrote:


On 30/06/2020 14:50, Andre Vieira (lists) wrote:

On 29/06/2020 11:15, Christophe Lyon wrote:

On Mon, 29 Jun 2020 at 10:56, Andre Vieira (lists)
 wrote:

On 23/06/2020 21:52, Christophe Lyon wrote:

On Tue, 23 Jun 2020 at 15:28, Andre Vieira (lists)
 wrote:

On 23/06/2020 13:10, Kyrylo Tkachov wrote:

-Original Message-
From: Andre Vieira (lists) 
Sent: 22 June 2020 09:52
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov 
Subject: [PATCH][GCC][Arm] PR target/95646: Do not clobber
callee saved
registers with CMSE

Hi,

As reported in bugzilla when the -mcmse option is used while
compiling
for size (-Os) with a thumb-1 target the generated code will
clear the
registers r7-r10. These however are callee saved and should be
preserved
across ABI boundaries. The reason this happens is because these
registers are made "fixed" when optimising for size with Thumb-1
in a
way to make sure they are not used, as pushing and popping
hi-registers
requires extra moves to and from LO_REGS.

To fix this, this patch uses 'callee_saved_reg_p', which
accounts for
this optimisation, instead of 'call_used_or_fixed_reg_p'. Be
aware of
'callee_saved_reg_p''s definition, as it does still take call 
used

registers into account, which aren't callee_saved in my opinion,
so it
is rather a misnomer; it works to our advantage here though, as it
does
exactly what we need.

Regression tested on arm-none-eabi.

Is this OK for trunk? (Will eventually backport to previous
versions if
stable.)

Ok.
Thanks,
Kyrill

As I was getting ready to push this I noticed I didn't add any
skip-ifs
to prevent this failing with specific target options. So here's 
a new

version with those.

Still OK?


Hi,

This is not sufficient to skip arm-linux-gnueabi* configs built 
with

non-default cpu/fpu.

For instance, with arm-linux-gnueabihf --with-cpu=cortex-a9
--with-fpu=neon-fp16 --with-float=hard
I see:
FAIL: gcc.target/arm/pr95646.c (test for excess errors)
Excess errors:
cc1: error: ARMv8-M Security Extensions incompatible with 
selected FPU

cc1: error: target CPU does not support ARM mode

and the testcase is compiled with -mcpu=cortex-m23 -mcmse -Os

Resending as I don't think my earlier one made it to the lists
(sorry if
you are receiving this double!)

I'm not following this, before I go off and try to reproduce it,
what do
you mean by 'the testcase is compiled with -mcpu=cortex-m23 -mcmse
-Os'?
These are the options you are seeing in the log file? Surely they
should
override the default options? Only thing I can think of is this 
might

need an extra -mfloat-abi=soft to make sure it overrides the default
float-abi.  Could you give that a try?

No it doesn't make a difference alone.

I also had to add:
-mfpu=auto (that clears the above warning)
-mthumb otherwise we now get cc1: error: target CPU does not support
ARM mode

Looks like some effective-target machinery is needed

So I had a look at this.  I was pretty sure that -mfloat-abi=soft
overrides -mfpu=<>, which by and large it does, as in no FP instructions
will be generated; but the error you see comes from a check for the right
number of FP registers, which doesn't take into account whether
'TARGET_HARD_FLOAT' is set or not.  I'll fix this too and use the
check-effective-target for armv8-m.base for this test as it is indeed
a better approach than my bag of skip-ifs.  I'm testing it locally to
make sure my changes don't break anything.

Cheers,
Andre

Hi,

Sorry for the delay. So I changed the test to use the effective-target
machinery as you suggested and I also made sure that you don't get the
"ARMv8-M Security Extensions incompatible with selected FPU" when
-mfloat-abi=soft.
Further changed 'asm' to '__asm__' to avoid failures with '-std=' 
options.


Regression tested on arm-none-eabi.
@Christophe: could you test this for your configuration, shouldn't fail
anymore!


Indeed with your patch I don't see any failure with pr95646.c

Note that it is still unsupported with arm-eabi when running the tests
with -mcpu=cortex-mXX
because the compiler complains that -mcpu=cortex-mXX conflicts with
-march=armv8-m.base,
thus the effective-target test fails.

BTW, is that warning useful/practical? Wouldn't it be more convenient
if the last -mcpu/-march
on the command line was the only one taken into account? (I had a
similar issue when
running tests (libstdc++) getting -march=armv8-m.main+fp from their
multilib environment
and forcing -mcpu=cortex-m33 because it also means '+dsp' and produces
a warning;
I had to use -mcpu=cortex-m33 -march=armv8-m.main+fp+dsp to 
work around this)
Yeah I've been annoyed by that before, also in the context of testing 
multilibs.


Even though I can see how it can be a useful warning though, if you 
are using these in build-systems and you accidentally introduce a new 
(incompatible) -mcpu/-march alo

Re: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved registers with CMSE

2020-07-06 Thread Andre Vieira (lists)


On 30/06/2020 14:50, Andre Vieira (lists) wrote:


On 29/06/2020 11:15, Christophe Lyon wrote:

On Mon, 29 Jun 2020 at 10:56, Andre Vieira (lists)
 wrote:


On 23/06/2020 21:52, Christophe Lyon wrote:

On Tue, 23 Jun 2020 at 15:28, Andre Vieira (lists)
 wrote:

On 23/06/2020 13:10, Kyrylo Tkachov wrote:

-Original Message-
From: Andre Vieira (lists) 
Sent: 22 June 2020 09:52
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov 
Subject: [PATCH][GCC][Arm] PR target/95646: Do not clobber 
callee saved

registers with CMSE

Hi,

As reported in bugzilla when the -mcmse option is used while 
compiling
for size (-Os) with a thumb-1 target the generated code will 
clear the
registers r7-r10. These however are callee saved and should be 
preserved

across ABI boundaries. The reason this happens is because these
registers are made "fixed" when optimising for size with Thumb-1 
in a
way to make sure they are not used, as pushing and popping 
hi-registers

requires extra moves to and from LO_REGS.

To fix this, this patch uses 'callee_saved_reg_p', which 
accounts for
this optimisation, instead of 'call_used_or_fixed_reg_p'. Be 
aware of

'callee_saved_reg_p''s definition, as it does still take call used
registers into account, which aren't callee_saved in my opinion, 
so it
is rather a misnomer; it works to our advantage here though, as it 
does

exactly what we need.

Regression tested on arm-none-eabi.

Is this OK for trunk? (Will eventually backport to previous 
versions if

stable.)

Ok.
Thanks,
Kyrill
As I was getting ready to push this I noticed I didn't add any 
skip-ifs

to prevent this failing with specific target options. So here's a new
version with those.

Still OK?


Hi,

This is not sufficient to skip arm-linux-gnueabi* configs built with
non-default cpu/fpu.

For instance, with arm-linux-gnueabihf --with-cpu=cortex-a9
--with-fpu=neon-fp16 --with-float=hard
I see:
FAIL: gcc.target/arm/pr95646.c (test for excess errors)
Excess errors:
cc1: error: ARMv8-M Security Extensions incompatible with selected FPU
cc1: error: target CPU does not support ARM mode

and the testcase is compiled with -mcpu=cortex-m23 -mcmse -Os
Resending as I don't think my earlier one made it to the lists 
(sorry if

you are receiving this double!)

I'm not following this, before I go off and try to reproduce it, 
what do
you mean by 'the testcase is compiled with -mcpu=cortex-m23 -mcmse 
-Os'?
These are the options you are seeing in the log file? Surely they 
should

override the default options? Only thing I can think of is this might
need an extra -mfloat-abi=soft to make sure it overrides the default
float-abi.  Could you give that a try?

No it doesn't make a difference alone.

I also had to add:
-mfpu=auto (that clears the above warning)
-mthumb otherwise we now get cc1: error: target CPU does not support 
ARM mode


Looks like some effective-target machinery is needed
So I had a look at this.  I was pretty sure that -mfloat-abi=soft 
overrides -mfpu=<>, which by and large it does, as in no FP instructions 
will be generated; but the error you see comes from a check for the right 
number of FP registers, which doesn't take into account whether 
'TARGET_HARD_FLOAT' is set or not.  I'll fix this too and use the 
check-effective-target for armv8-m.base for this test as it is indeed 
a better approach than my bag of skip-ifs.  I'm testing it locally to 
make sure my changes don't break anything.


Cheers,
Andre

Hi,

Sorry for the delay. So I changed the test to use the effective-target 
machinery as you suggested and I also made sure that you don't get the 
"ARMv8-M Security Extensions incompatible with selected FPU" when 
-mfloat-abi=soft.

Further changed 'asm' to '__asm__' to avoid failures with '-std=' options.
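
Roughly what the reworked test looks like (a sketch only; the exact 
directives and body of the committed gcc.target/arm/pr95646.c may differ):

/* { dg-do compile } */
/* { dg-require-effective-target arm_arch_v8m_base_ok } */
/* { dg-add-options arm_arch_v8m_base } */
/* { dg-additional-options "-mcmse -Os" } */

int __attribute__ ((cmse_nonsecure_entry))
foo (void)
{
  /* The interesting part is a scan-assembler check that the callee-saved
     registers r7-r10 are not cleared on return from this entry function.  */
  return 1;
}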

Regression tested on arm-none-eabi.
@Christophe: could you test this for your configuration, shouldn't fail 
anymore!


Is this OK for trunk?

Cheers,
Andre

gcc/ChangeLog:
2020-07-06  Andre Vieira  

    * config/arm/arm.c (arm_options_perform_arch_sanity_checks): Do not
    check +D32 for CMSE if -mfloat-abi=soft

gcc/testsuite/ChangeLog:
2020-07-06  Andre Vieira  

    * gcc.target/arm/pr95646.c: Fix testism.


Christophe



Cheers,
Andre

Christophe


Cheers,
Andre

Cheers,
Andre

gcc/ChangeLog:
2020-06-22  Andre Vieira 

    PR target/95646
    * config/arm/arm.c: 
(cmse_nonsecure_entry_clear_before_return):

Use 'callee_saved_reg_p' instead of
    'call_used_or_fixed_reg_p'.

gcc/testsuite/ChangeLog:
2020-06-22  Andre Vieira 

    PR target/95646
    * gcc.target/arm/pr95646.c: New test.
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 
dac9a6fb5c41ce42cd7a278b417eab25239a043c..38500220bfb2a1ddbbff15eb552451701f7256d5
 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -3834,7 +3834,7 @@ arm_options_perform_arch_sanity_checks (void)
 
   /* We don't clear D16-D31 VFP registers for cmse_nonsecure_call functions
  and ARMv8-

Re: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved registers with CMSE

2020-06-30 Thread Andre Vieira (lists)



On 29/06/2020 11:15, Christophe Lyon wrote:

On Mon, 29 Jun 2020 at 10:56, Andre Vieira (lists)
 wrote:


On 23/06/2020 21:52, Christophe Lyon wrote:

On Tue, 23 Jun 2020 at 15:28, Andre Vieira (lists)
 wrote:

On 23/06/2020 13:10, Kyrylo Tkachov wrote:

-Original Message-
From: Andre Vieira (lists) 
Sent: 22 June 2020 09:52
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov 
Subject: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved
registers with CMSE

Hi,

As reported in bugzilla when the -mcmse option is used while compiling
for size (-Os) with a thumb-1 target the generated code will clear the
registers r7-r10. These however are callee saved and should be preserved
across ABI boundaries. The reason this happens is because these
registers are made "fixed" when optimising for size with Thumb-1 in a
way to make sure they are not used, as pushing and popping hi-registers
requires extra moves to and from LO_REGS.

To fix this, this patch uses 'callee_saved_reg_p', which accounts for
this optimisation, instead of 'call_used_or_fixed_reg_p'. Be aware of
'callee_saved_reg_p''s definition, as it does still take call used
registers into account, which aren't callee_saved in my opinion, so it
is rather a misnomer; it works to our advantage here though, as it does
exactly what we need.

Regression tested on arm-none-eabi.

Is this OK for trunk? (Will eventually backport to previous versions if
stable.)

Ok.
Thanks,
Kyrill

As I was getting ready to push this I noticed I didn't add any skip-ifs
to prevent this failing with specific target options. So here's a new
version with those.

Still OK?


Hi,

This is not sufficient to skip arm-linux-gnueabi* configs built with
non-default cpu/fpu.

For instance, with arm-linux-gnueabihf --with-cpu=cortex-a9
--with-fpu=neon-fp16 --with-float=hard
I see:
FAIL: gcc.target/arm/pr95646.c (test for excess errors)
Excess errors:
cc1: error: ARMv8-M Security Extensions incompatible with selected FPU
cc1: error: target CPU does not support ARM mode

and the testcase is compiled with -mcpu=cortex-m23 -mcmse -Os

Resending as I don't think my earlier one made it to the lists (sorry if
you are receiving this double!)

I'm not following this, before I go off and try to reproduce it, what do
you mean by 'the testcase is compiled with -mcpu=cortex-m23 -mcmse -Os'?
These are the options you are seeing in the log file? Surely they should
override the default options? Only thing I can think of is this might
need an extra -mfloat-abi=soft to make sure it overrides the default
float-abi.  Could you give that a try?

No it doesn't make a difference alone.

I also had to add:
-mfpu=auto (that clears the above warning)
-mthumb otherwise we now get cc1: error: target CPU does not support ARM mode

Looks like some effective-target machinery is needed
So I had a look at this.  I was pretty sure that -mfloat-abi=soft 
overrides -mfpu=<>, which by and large it does, as in no FP instructions 
will be generated; but the error you see comes from a check for the right 
number of FP registers, which doesn't take into account whether 
'TARGET_HARD_FLOAT' is set or not.  I'll fix this too and use the 
check-effective-target for armv8-m.base for this test as it is indeed a 
better approach than my bag of skip-ifs.  I'm testing it locally to make 
sure my changes don't break anything.


Cheers,
Andre


Christophe



Cheers,
Andre

Christophe


Cheers,
Andre

Cheers,
Andre

gcc/ChangeLog:
2020-06-22  Andre Vieira  

PR target/95646
* config/arm/arm.c: (cmse_nonsecure_entry_clear_before_return):
Use 'callee_saved_reg_p' instead of
'call_used_or_fixed_reg_p'.

gcc/testsuite/ChangeLog:
2020-06-22  Andre Vieira  

PR target/95646
* gcc.target/arm/pr95646.c: New test.


Re: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved registers with CMSE

2020-06-29 Thread Andre Vieira (lists)



On 23/06/2020 21:52, Christophe Lyon wrote:

On Tue, 23 Jun 2020 at 15:28, Andre Vieira (lists)
 wrote:

On 23/06/2020 13:10, Kyrylo Tkachov wrote:

-Original Message-
From: Andre Vieira (lists) 
Sent: 22 June 2020 09:52
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov 
Subject: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved
registers with CMSE

Hi,

As reported in bugzilla when the -mcmse option is used while compiling
for size (-Os) with a thumb-1 target the generated code will clear the
registers r7-r10. These however are callee saved and should be preserved
across ABI boundaries. The reason this happens is because these
registers are made "fixed" when optimising for size with Thumb-1 in a
way to make sure they are not used, as pushing and popping hi-registers
requires extra moves to and from LO_REGS.

To fix this, this patch uses 'callee_saved_reg_p', which accounts for
this optimisation, instead of 'call_used_or_fixed_reg_p'. Be aware of
'callee_saved_reg_p''s definition, as it does still take call used
registers into account, which aren't callee_saved in my opinion, so it
is rather a misnomer; it works to our advantage here though, as it does
exactly what we need.

Regression tested on arm-none-eabi.

Is this OK for trunk? (Will eventually backport to previous versions if
stable.)

Ok.
Thanks,
Kyrill

As I was getting ready to push this I noticed I didn't add any skip-ifs
to prevent this failing with specific target options. So here's a new
version with those.

Still OK?


Hi,

This is not sufficient to skip arm-linux-gnueabi* configs built with
non-default cpu/fpu.

For instance, with arm-linux-gnueabihf --with-cpu=cortex-a9
--with-fpu=neon-fp16 --with-float=hard
I see:
FAIL: gcc.target/arm/pr95646.c (test for excess errors)
Excess errors:
cc1: error: ARMv8-M Security Extensions incompatible with selected FPU
cc1: error: target CPU does not support ARM mode

and the testcase is compiled with -mcpu=cortex-m23 -mcmse -Os
I'm not following this, before I go off and try to reproduce it, what do 
you mean by 'the testcase is compiled with -mcpu=cortex-m23 -mcmse -Os'? 
These are the options you are seeing in the log file? Surely they should 
override the default options? Only thing I can think of is this might 
need an extra -mfloat-abi=soft to make sure it overrides the default 
float-abi.  Could you give that a try?


Cheers,
Andre


Christophe


Cheers,
Andre

Cheers,
Andre

gcc/ChangeLog:
2020-06-22  Andre Vieira  

   PR target/95646
   * config/arm/arm.c: (cmse_nonsecure_entry_clear_before_return):
Use 'callee_saved_reg_p' instead of
   'call_used_or_fixed_reg_p'.

gcc/testsuite/ChangeLog:
2020-06-22  Andre Vieira  

   PR target/95646
   * gcc.target/arm/pr95646.c: New test.


Re: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved registers with CMSE

2020-06-29 Thread Andre Vieira (lists)



On 23/06/2020 21:52, Christophe Lyon wrote:

On Tue, 23 Jun 2020 at 15:28, Andre Vieira (lists)
 wrote:

On 23/06/2020 13:10, Kyrylo Tkachov wrote:

-Original Message-
From: Andre Vieira (lists) 
Sent: 22 June 2020 09:52
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov 
Subject: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved
registers with CMSE

Hi,

As reported in bugzilla when the -mcmse option is used while compiling
for size (-Os) with a thumb-1 target the generated code will clear the
registers r7-r10. These however are callee saved and should be preserved
across ABI boundaries. The reason this happens is because these
registers are made "fixed" when optimising for size with Thumb-1 in a
way to make sure they are not used, as pushing and popping hi-registers
requires extra moves to and from LO_REGS.

To fix this, this patch uses 'callee_saved_reg_p', which accounts for
this optimisation, instead of 'call_used_or_fixed_reg_p'. Be aware of
'callee_saved_reg_p''s definition, as it does still take call used
registers into account, which aren't callee_saved in my opinion, so it
is a rather misnoemer, works in our advantage here though as it does
exactly what we need.

Regression tested on arm-none-eabi.

Is this OK for trunk? (Will eventually backport to previous versions if
stable.)

Ok.
Thanks,
Kyrill

As I was getting ready to push this I noticed I didn't add any skip-ifs
to prevent this failing with specific target options. So here's a new
version with those.

Still OK?


Hi,

This is not sufficient to skip arm-linux-gnueabi* configs built with
non-default cpu/fpu.

For instance, with arm-linux-gnueabihf --with-cpu=cortex-a9
--with-fpu=neon-fp16 --with-float=hard
I see:
FAIL: gcc.target/arm/pr95646.c (test for excess errors)
Excess errors:
cc1: error: ARMv8-M Security Extensions incompatible with selected FPU
cc1: error: target CPU does not support ARM mode

and the testcase is compiled with -mcpu=cortex-m23 -mcmse -Os
Resending as I don't think my earlier one made it to the lists (sorry if 
you are receiving this double!)


I'm not following this; before I go off and try to reproduce it, what do 
you mean by 'the testcase is compiled with -mcpu=cortex-m23 -mcmse -Os'? 
Are these the options you are seeing in the log file? Surely they should 
override the default options? The only thing I can think of is that this 
might need an extra -mfloat-abi=soft to make sure it overrides the default 
float-abi.  Could you give that a try?
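
Concretely, the suggestion would amount to something like the following
dg-options line in the testcase (illustrative only; whether it is enough
for your configuration is exactly what I am asking you to check):

/* { dg-options "-mcpu=cortex-m23 -mcmse -mfloat-abi=soft" } */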


Cheers,
Andre


Christophe


Cheers,
Andre

Cheers,
Andre

gcc/ChangeLog:
2020-06-22  Andre Vieira  

   PR target/95646
   * config/arm/arm.c: (cmse_nonsecure_entry_clear_before_return):
Use 'callee_saved_reg_p' instead of
   'call_used_or_fixed_reg_p'.

gcc/testsuite/ChangeLog:
2020-06-22  Andre Vieira  

   PR target/95646
   * gcc.target/arm/pr95646.c: New test.


Re: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved registers with CMSE

2020-06-23 Thread Andre Vieira (lists)

On 23/06/2020 13:10, Kyrylo Tkachov wrote:



-Original Message-
From: Andre Vieira (lists) 
Sent: 22 June 2020 09:52
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov 
Subject: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved
registers with CMSE

Hi,

As reported in bugzilla, when the -mcmse option is used while compiling
for size (-Os) with a Thumb-1 target, the generated code will clear the
registers r7-r10.  These, however, are callee-saved and should be
preserved across ABI boundaries.  The reason this happens is that these
registers are made "fixed" when optimising for size with Thumb-1, in a
way to make sure they are not used, as pushing and popping hi-registers
requires extra moves to and from LO_REGS.

To fix this, this patch uses 'callee_saved_reg_p', which accounts for
this optimisation, instead of 'call_used_or_fixed_reg_p'. Be aware of
'callee_saved_reg_p''s definition, as it does still take call-used
registers into account, which aren't callee-saved in my opinion, so it
is rather a misnomer; it works to our advantage here, though, as it does
exactly what we need.

Regression tested on arm-none-eabi.

Is this OK for trunk? (Will eventually backport to previous versions if
stable.)

Ok.
Thanks,
Kyrill
As I was getting ready to push this I noticed I didn't add any skip-ifs 
to prevent this failing with specific target options. So here's a new 
version with those.


Still OK?

Cheers,
Andre



Cheers,
Andre

gcc/ChangeLog:
2020-06-22  Andre Vieira  

      PR target/95646
      * config/arm/arm.c: (cmse_nonsecure_entry_clear_before_return):
Use 'callee_saved_reg_p' instead of
      'call_used_or_fixed_reg_p'.

gcc/testsuite/ChangeLog:
2020-06-22  Andre Vieira  

      PR target/95646
      * gcc.target/arm/pr95646.c: New test.
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 
6b7ca829f1c8cbe3d427da474b079882dc522e1a..dac9a6fb5c41ce42cd7a278b417eab25239a043c
 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -26960,7 +26960,7 @@ cmse_nonsecure_entry_clear_before_return (void)
continue;
   if (IN_RANGE (regno, IP_REGNUM, PC_REGNUM))
continue;
-  if (call_used_or_fixed_reg_p (regno)
+  if (!callee_saved_reg_p (regno)
  && (!IN_RANGE (regno, FIRST_VFP_REGNUM, LAST_VFP_REGNUM)
  || TARGET_HARD_FLOAT))
bitmap_set_bit (to_clear_bitmap, regno);
diff --git a/gcc/testsuite/gcc.target/arm/pr95646.c 
b/gcc/testsuite/gcc.target/arm/pr95646.c
new file mode 100644
index 
..12d06a0c8c1ed7de1f8d4d15130432259e613a32
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pr95646.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { "-march=*" } 
{ "-march=armv8-m.base" } } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { "-mcpu=*" } { 
"-mcpu=cortex-m23" } } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { "-mfpu=*" } { 
} } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { 
"-mfloat-abi=*" } { "-mfloat-abi=soft" } } */
+/* { dg-options "-mcpu=cortex-m23 -mcmse" } */
+/* { dg-additional-options "-Os" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+int __attribute__ ((cmse_nonsecure_entry))
+foo (void)
+{
+  return 1;
+}
+/* { { dg-final { scan-assembler-not "mov\tr9, r0" } } */
+
+/*
+** __acle_se_bar:
+** mov (r[0-3]), r9
+** push{\1}
+** ...
+** pop {(r[0-3])}
+** mov r9, \2
+** ...
+** bxnslr
+*/
+int __attribute__ ((cmse_nonsecure_entry))
+bar (void)
+{
+  asm ("": : : "r9");
+  return 1;
+}


[PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved registers with CMSE

2020-06-22 Thread Andre Vieira (lists)

Hi,

As reported in bugzilla, when the -mcmse option is used while compiling 
for size (-Os) with a Thumb-1 target, the generated code will clear the 
registers r7-r10.  These, however, are callee-saved and should be 
preserved across ABI boundaries.  The reason this happens is that these 
registers are made "fixed" when optimising for size with Thumb-1, in a 
way to make sure they are not used, as pushing and popping hi-registers 
requires extra moves to and from LO_REGS.


To fix this, this patch uses 'callee_saved_reg_p', which accounts for 
this optimisation, instead of 'call_used_or_fixed_reg_p'. Be aware of 
'callee_saved_reg_p''s definition, as it does still take call-used 
registers into account, which aren't callee-saved in my opinion, so it 
is rather a misnomer; it works to our advantage here, though, as it does 
exactly what we need.


Regression tested on arm-none-eabi.

Is this OK for trunk? (Will eventually backport to previous versions if 
stable.)


Cheers,
Andre

gcc/ChangeLog:
2020-06-22  Andre Vieira  

    PR target/95646
    * config/arm/arm.c: (cmse_nonsecure_entry_clear_before_return): 
Use 'callee_saved_reg_p' instead of
    'call_used_or_fixed_reg_p'.

gcc/testsuite/ChangeLog:
2020-06-22  Andre Vieira  

    PR target/95646
    * gcc.target/arm/pr95646.c: New test.

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 
6b7ca829f1c8cbe3d427da474b079882dc522e1a..dac9a6fb5c41ce42cd7a278b417eab25239a043c
 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -26960,7 +26960,7 @@ cmse_nonsecure_entry_clear_before_return (void)
continue;
   if (IN_RANGE (regno, IP_REGNUM, PC_REGNUM))
continue;
-  if (call_used_or_fixed_reg_p (regno)
+  if (!callee_saved_reg_p (regno)
  && (!IN_RANGE (regno, FIRST_VFP_REGNUM, LAST_VFP_REGNUM)
  || TARGET_HARD_FLOAT))
bitmap_set_bit (to_clear_bitmap, regno);
diff --git a/gcc/testsuite/gcc.target/arm/pr95646.c 
b/gcc/testsuite/gcc.target/arm/pr95646.c
new file mode 100644
index 
..c9fdc37618ccaddcdb597647c7076054af17789a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pr95646.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-mcmse -Os -mcpu=cortex-m23" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+int __attribute__ ((cmse_nonsecure_entry))
+foo (void)
+{
+  return 1;
+}
+/* { { dg-final { scan-assembler-not "mov\tr9, r0" } } */
+
+/*
+** __acle_se_bar:
+** mov (r[0-3]), r9
+** push{\1}
+** ...
+** pop {(r[0-3])}
+** mov r9, \2
+** ...
+** bxnslr
+*/
+int __attribute__ ((cmse_nonsecure_entry))
+bar (void)
+{
+  asm ("": : : "r9");
+  return 1;
+}


Re: [RFC][vect] BB SLP reduction prototype

2020-06-09 Thread Andre Vieira (lists)
The 'you' here is Richi, which Richi is probably aware of, but maybe not 
the rest of the list :')


On 09/06/2020 15:29, Andre Vieira (lists) wrote:

Hi,

So this is my rework of the code you sent me; I have not included the 
'permute' code you included, as I can't figure out what it is meant to 
be doing.  Maybe something to look at later.


I have also included three tests that show it working for some simple 
cases and even a nested one.


Unfortunately it will not handle other simple cases where reassoc 
doesn't put the reduction in the form of:

sum0 = a + b;
sum1 = c + sum0;
...

For instance a testcase I have been looking at is:
unsigned int u32_single_abs_sum (unsigned int * a, unsigned int * b)
{
  unsigned int sub0 = a[0] - b[0];
  unsigned int sub1 = a[1] - b[1];
  unsigned int sub2 = a[2] - b[2];
  unsigned int sub3 = a[3] - b[3];
  unsigned int sum = sub0 + sub1;
  sum += sub2;
  sum += sub3;
  return sum;
}

Unfortunately, the code that reaches slp will look like:
  _1 = *a_10(D);
  _2 = *b_11(D);
  _3 = MEM[(unsigned int *)a_10(D) + 4B];
  _4 = MEM[(unsigned int *)b_11(D) + 4B];
  _5 = MEM[(unsigned int *)a_10(D) + 8B];
  _6 = MEM[(unsigned int *)b_11(D) + 8B];
  _7 = MEM[(unsigned int *)a_10(D) + 12B];
  _8 = MEM[(unsigned int *)b_11(D) + 12B];
  _28 = _1 - _2;
  _29 = _3 + _28;
  _30 = _29 - _4;
  _31 = _5 + _30;
  _32 = _31 - _6;
  _33 = _7 + _32;
  sum_18 = _33 - _8;
  return sum_18;

This doesn't have the expected format I described above...  I am 
wondering how to teach it to support this.  Maybe starting with your 
suggestion of making plus_expr and minus_expr have the same hash, so 
that it groups all these statements together, might be a start, but 
you'd still need to 'rebalance' the tree somehow.  I need to give this 
a bit more thought, but I wanted to share what I have so far.


The code is severely lacking in comments for now btw...

Cheers,
Andre



[RFC][vect] BB SLP reduction prototype

2020-06-09 Thread Andre Vieira (lists)

Hi,

So this is my rework of the code you sent me; I have not included the 
'permute' code you included, as I can't figure out what it is meant to 
be doing.  Maybe something to look at later.


I have also included three tests that show it working for some simple 
cases and even a nested one.


Unfortunately it will not handle other simple cases where reassoc 
doesn't put the reduction in the form of:

sum0 = a + b;
sum1 = c + sum0;
...

For instance a testcase I have been looking at is:
unsigned int u32_single_abs_sum (unsigned int * a, unsigned int * b)
{
  unsigned int sub0 = a[0] - b[0];
  unsigned int sub1 = a[1] - b[1];
  unsigned int sub2 = a[2] - b[2];
  unsigned int sub3 = a[3] - b[3];
  unsigned int sum = sub0 + sub1;
  sum += sub2;
  sum += sub3;
  return sum;
}

Unfortunately, the code that reaches slp will look like:
  _1 = *a_10(D);
  _2 = *b_11(D);
  _3 = MEM[(unsigned int *)a_10(D) + 4B];
  _4 = MEM[(unsigned int *)b_11(D) + 4B];
  _5 = MEM[(unsigned int *)a_10(D) + 8B];
  _6 = MEM[(unsigned int *)b_11(D) + 8B];
  _7 = MEM[(unsigned int *)a_10(D) + 12B];
  _8 = MEM[(unsigned int *)b_11(D) + 12B];
  _28 = _1 - _2;
  _29 = _3 + _28;
  _30 = _29 - _4;
  _31 = _5 + _30;
  _32 = _31 - _6;
  _33 = _7 + _32;
  sum_18 = _33 - _8;
  return sum_18;

This doesn't have the expected format I described above...  I am 
wondering how to teach it to support this.  Maybe starting with your 
suggestion of making plus_expr and minus_expr have the same hash, so 
that it groups all these statements together, might be a start, but 
you'd still need to 'rebalance' the tree somehow (a hand-reassociated 
sketch of the testcase is included below).  I need to give this a bit 
more thought, but I wanted to share what I have so far.
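
For reference, here is a minimal hand-reassociated sketch of the
testcase above, written in the sum0 = a + b; sum1 = c + sum0; shape
described earlier.  This is only an illustration of the input shape the
prototype currently handles, not something reassoc produces today, and
the _reassoc name is made up:

unsigned int
u32_single_abs_sum_reassoc (unsigned int * a, unsigned int * b)
{
  unsigned int sub0 = a[0] - b[0];
  unsigned int sub1 = a[1] - b[1];
  unsigned int sub2 = a[2] - b[2];
  unsigned int sub3 = a[3] - b[3];
  unsigned int sum0 = sub0 + sub1;  /* sum0 = a + b  */
  unsigned int sum1 = sub2 + sum0;  /* sum1 = c + sum0  */
  unsigned int sum2 = sub3 + sum1;  /* and so on  */
  return sum2;
}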


The code is severely lacking in comments for now btw...

Cheers,
Andre

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-reduc-1.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-reduc-1.c
new file mode 100644
index 
..66b53ff9bb1e77414e7493c07ab87d46f4d33651
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-reduc-1.c
@@ -0,0 +1,66 @@
+/* { dg-require-effective-target vect_int } */
+#include 
+#include "tree-vect.h"
+extern int abs (int);
+
+#define ABS4(N)\
+  sum += abs (a[N]);   \
+  sum += abs (a[N+1]); \
+  sum += abs (a[N+2]); \
+  sum += abs (a[N+3]);
+
+#define ABS8(N)  \
+  ABS4(N)\
+  ABS4(N+4)
+
+#define ABS16(N)  \
+  ABS8(N)\
+  ABS8(N+8)
+
+__attribute__ ((noipa)) unsigned char
+u8_single_abs_sum (signed char * a)
+{
+  unsigned char sum = 0;
+  ABS16(0)
+  return sum;
+}
+
+__attribute__ ((noipa)) unsigned short
+u16_single_abs_sum (signed short * a)
+{
+  unsigned short sum = 0;
+  ABS8(0)
+  return sum;
+}
+
+__attribute__ ((noipa)) unsigned int
+u32_single_abs_sum (signed int * a)
+{
+  unsigned int sum = 0;
+  ABS4(0)
+  return sum;
+}
+
+signed char u8[16] = {0, 1, 2, 3, 4, 5, 6, -7, -8, -9, -10, -11, -12, -13,
+   -14, -15};
+signed short u16[8] = {0, 1, 2, 3, 4, -5, -6, -7};
+signed int u32[4] = {-10, -20, 30, 40};
+
+
+int main (void)
+{
+  check_vect ();
+
+  if (u8_single_abs_sum (&(u8[0])) != 120)
+abort ();
+
+  if (u16_single_abs_sum (&(u16[0])) != 28)
+abort ();
+
+  if (u32_single_abs_sum (&(u32[0])) != 100)
+abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 3 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-reduc-2.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-reduc-2.c
new file mode 100644
index 
..298a22cfef687f6634d61bf808a41942c3ce4a85
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-reduc-2.c
@@ -0,0 +1,82 @@
+/* { dg-require-effective-target vect_int } */
+#include 
+#include "tree-vect.h"
+extern int abs (int);
+
+#define ABS4(N)\
+  sum += abs (a[N]);   \
+  sum += abs (a[N+1]); \
+  sum += abs (a[N+2]); \
+  sum += abs (a[N+3]);
+
+#define ABS8(N)  \
+  ABS4(N)\
+  ABS4(N+4)
+
+#define ABS16(N)  \
+  ABS8(N)\
+  ABS8(N+8)
+
+__attribute__ ((noipa)) unsigned char
+u8_double_abs_sum (signed char * a)
+{
+  unsigned char sum = 0;
+  ABS16(0)
+  ABS16(16)
+  return sum;
+}
+
+__attribute__ ((noipa)) unsigned short
+u16_double_abs_sum (signed short * a)
+{
+  unsigned short sum = 0;
+  ABS16(0)
+  return sum;
+}
+
+__attribute__ ((noipa)) unsigned int
+u32_double_abs_sum (signed int * a)
+{
+  unsigned int sum = 0;
+  ABS8(0)
+  return sum;
+}
+
+__attribute__ ((noipa)) unsigned int
+u32_triple_abs_sum (signed int * a)
+{
+  unsigned int sum = 0;
+  ABS8(0)
+  ABS4(8)
+  return sum;
+}
+
+signed char u8[32] = {0, 1, 2, 3, 4, 5, 6, -7, -8, -9, -10, -11, -12, -13,
+ -14, -15, 0, 1, 2, 3, 4, 5, 6, -7, -8, -9, -10, -11, -12, 
-13,
+ -14, -30};
+
+signed short u16[16] = {0, 1, 2, 3, 4, -5, -6, -7, 10, 20, 30, 40, -10, -20,
+  -30, -40};
+signed int u32[16] = {-10, -20, 30, 40, 100, 200, -300, -500, -600, -700, 1000,
+2000};
+

[AArch64][GCC-8][GCC-9] Use __getauxval instead of getauxval in LSE detection code in libgcc

2020-05-28 Thread Andre Vieira (lists)

The patch applies cleanly on gcc-9 and gcc-8.
I bootstrapped this on aarch64-none-linux-gnu and tested 
aarch64-none-elf for both.


Is this OK for those backports?

libgcc/ChangeLog:
2020-05-28  Andre Vieira  

    Backport from mainline.
    2020-05-06  Kyrylo Tkachov  

    * config/aarch64/lse-init.c (init_have_lse_atomics): Use __getauxval
    instead of getauxval.
    (AT_HWCAP): Define.
    (HWCAP_ATOMICS): Define.
    Guard detection on __gnu_linux__.

On 06/05/2020 16:24, Kyrylo Tkachov wrote:



-Original Message-
From: Joseph Myers 
Sent: 06 May 2020 15:46
To: Richard Biener 
Cc: Kyrylo Tkachov ; Florian Weimer
; Szabolcs Nagy ; gcc-
patc...@gcc.gnu.org; Jakub Jelinek 
Subject: Re: [PATCH][AArch64] Use __getauxval instead of getauxval in LSE
detection code in libgcc

On Wed, 6 May 2020, Richard Biener wrote:


Here is the updated patch for the record.
Jakub, richi, is this ok for the GCC 10 branch?

I'll defer to Joseph who is release manager as well.

This version is OK with me.

Thank you Joseph,
I've committed this version to trunk and the gcc-10 branch.
Kyrill


--
Joseph S. Myers
jos...@codesourcery.com


[PATCH][GCC-8][Aarch64]: Backport Force TImode values into even registers

2020-04-29 Thread Andre Vieira (lists)

Hi,

This is a backport from trunk/gcc-9 that I think we need now that we 
have backported the casp LSE instructions.


Bootstrapped and regression tested on aarch64.

Is this OK for gcc-8?

Cheers,
Andre

The LSE CASP instruction requires values to be placed in even
register pairs.  A solution involving two additional register
classes was rejected in favor of the much simpler solution of
simply requiring all TImode values to be aligned.
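
For illustration (this example is not part of the patch): the kind of
code that runs into the requirement is a 16-byte compare-and-swap, which
with LSE can be emitted as a casp such as "casp x0, x1, x2, x3, [x4]",
where both register pairs must start at an even register -- hence
forcing TImode values into even register numbers below.

/* Illustrative only: with LSE available, a 16-byte CAS like this is a
   candidate for casp, whose register pairs must be even/odd aligned.  */
__int128
cas16 (__int128 *p, __int128 expected, __int128 desired)
{
  __atomic_compare_exchange_n (p, &expected, desired, 0,
                               __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
  return expected;
}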

gcc/ChangeLog:
2020-04-29  Andre Vieira  

    Backport from mainline.
    2018-10-31  Richard Henderson 

    * config/aarch64/aarch64.c (aarch64_hard_regno_mode_ok): Force
    16-byte modes held in GP registers to use an even regno.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
5eec1aae54abe04b8320deaf8202621c8e193c01..525deba56ea363a621cccec1a923da241908dd06
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1369,10 +1369,14 @@ aarch64_hard_regno_mode_ok (unsigned regno, 
machine_mode mode)
   if (regno == FRAME_POINTER_REGNUM || regno == ARG_POINTER_REGNUM)
 return mode == Pmode;
 
-  if (GP_REGNUM_P (regno) && known_le (GET_MODE_SIZE (mode), 16))
-return true;
-
-  if (FP_REGNUM_P (regno))
+  if (GP_REGNUM_P (regno))
+{
+  if (known_le (GET_MODE_SIZE (mode), 8))
+   return true;
+  else if (known_le (GET_MODE_SIZE (mode), 16))
+   return (regno & 1) == 0;
+}
+  else if (FP_REGNUM_P (regno))
 {
   if (vec_flags & VEC_STRUCT)
return end_hard_regno (mode, regno) - 1 <= V31_REGNUM;


[PATCH][GCC-8][Aarch64]: Fix for PR target/9481

2020-04-28 Thread Andre Vieira (lists)

Hi,

Backport of PR target/94518: Fix memmodel index in 
aarch64_store_exclusive_pair


This fixes bootstrap with --enable-checking=yes,rtl for aarch64.

OK for gcc-8?

Cheers,
Andre

gcc/ChangeLog:
2020-04-28  Andre Vieira  

    PR target/94814
    Backport from gcc-9.
    2020-04-07  Kyrylo Tkachov  

    PR target/94518
    2019-09-23  Richard Sandiford 

    * config/aarch64/atomics.md (aarch64_store_exclusive_pair): Fix
    memmodel index.

diff --git a/gcc/config/aarch64/atomics.md b/gcc/config/aarch64/atomics.md
index 
1005462ae23aa13dbc3013a255aa189096e33366..0e0b03731922d8e50e8468de94e0ff345d10c32f
 100644
--- a/gcc/config/aarch64/atomics.md
+++ b/gcc/config/aarch64/atomics.md
@@ -752,7 +752,7 @@
  UNSPECV_SX))]
   ""
   {
-enum memmodel model = memmodel_from_int (INTVAL (operands[3]));
+enum memmodel model = memmodel_from_int (INTVAL (operands[4]));
 if (is_mm_relaxed (model) || is_mm_consume (model) || is_mm_acquire 
(model))
   return "stxp\t%w0, %x2, %x3, %1";
 else


[PATCH][GCC][Arm]: Fix bootstrap failure with rtl-checking

2020-04-27 Thread Andre Vieira (lists)

Hi,

The code change that caused this regression was not meant to affect NEON 
code-gen; however, I missed the REG fall-through.  This patch makes sure 
we only take the first operand of the PLUS if addr is indeed a PLUS expr.


I suggest that this code is cleaned up in gcc-11, as I do not think we 
even need the overlap checks: NEON only loads from or stores to FP 
registers, and these can't be used in its addressing modes.


Bootstrapped arm-linux-gnueabihf with '--enable-checking=yes,rtl' for 
armv7-a and armv8-a.


Is this OK for trunk?

gcc/ChangeLog:
2020-04-27  Andre Vieira  

    * config/arm/arm.c (output_move_neon): Only get the first operand,
    if addr is PLUS.

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 
0151bda90d961ae1a001c61cd5e94d6ec67e3aea..74454dddbb948a5d37f502e8e2146a81cb83d58b
 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -20145,7 +20145,8 @@ output_move_neon (rtx *operands)
}
   /* Fall through.  */
 case PLUS:
-  addr = XEXP (addr, 0);
+  if (GET_CODE (addr) == PLUS)
+   addr = XEXP (addr, 0);
   /* Fall through.  */
 case LABEL_REF:
   {


[PATCH][wwwdocs] Add -moutline-atomics for AArch64 on gcc-9 and gcc-8

2020-04-24 Thread Andre Vieira (lists)
Add the backported functionality of -moutline-atomics for AArch64 to the 
gcc-9 and gcc-8 changes.html


Validates. Is this OK?
diff --git a/htdocs/gcc-8/changes.html b/htdocs/gcc-8/changes.html
index 
83dd1bc010a6e4debb76790b3fe62275bf0e0657..83e57db181294110f71a5d59960fb4d3fed7be98
 100644
--- a/htdocs/gcc-8/changes.html
+++ b/htdocs/gcc-8/changes.html
@@ -1394,5 +1394,22 @@ known to be fixed in the 8.4 release. This list might 
not be
 complete (that is, it is possible that some PRs that have been fixed
 are not listed here).
 
+
+GCC 8.5
+
+ Target Specific Changes
+
+AArch64
+  
+
+  The option -moutline-atomics has been added to aid
+  deployment of the Large System Extensions (LSE) on GNU/Linux systems built
+  with a baseline architecture targeting Armv8-A.  When the option is
+  specified code is emitted to detect the presence of LSE instructions at
+  runtime and use them for standard atomic operations.
+  For more information please refer to the documentation.
+
+  
+
 
 
diff --git a/htdocs/gcc-9/changes.html b/htdocs/gcc-9/changes.html
index 
74c7cde72ef5ab8ec059e20a8da3e46907ecd9a3..a2a28a9aeb851cae298e828d2c4b57c6fa414cf4
 100644
--- a/htdocs/gcc-9/changes.html
+++ b/htdocs/gcc-9/changes.html
@@ -1132,5 +1132,21 @@ complete (that is, it is possible that some PRs that 
have been fixed
 are not listed here).
 
 
+GCC 9.4
+
+ Target Specific Changes
+
+AArch64
+  
+
+  The option -moutline-atomics has been added to aid
+  deployment of the Large System Extensions (LSE) on GNU/Linux systems built
+  with a baseline architecture targeting Armv8-A.  When the option is
+  specified code is emitted to detect the presence of LSE instructions at
+  runtime and use them for standard atomic operations.
+  For more information please refer to the documentation.
+
+  
+
 
 


[committed][gcc-9] aarch64: Fix bootstrap with old binutils [PR93053]

2020-04-22 Thread Andre Vieira (lists)

Went ahead and committed the backport to gcc-9.

As reported in the PR, GCC 10 (and also 9.3.1 but not 9.3.0) fails to build
when using older binutils which lack LSE support, because those instructions
are used in libgcc.
Thanks to Kyrylo's hint, the following patches (hopefully) allow it to
build even with older binutils by using the .inst directive if LSE
support isn't available in the assembler.

2020-04-22  Andre Vieira  

    Backport from mainline.
    2020-04-15  Jakub Jelinek  

    PR target/93053
    * configure.ac (LIBGCC_CHECK_AS_LSE): Add HAVE_AS_LSE checking.
    * config/aarch64/lse.S: Include auto-target.h, if HAVE_AS_LSE
    is not defined, use just .arch armv8-a.
    (B, M, N, OPN): Define.
    (COMMENT): New .macro.
    (CAS, CASP, SWP, LDOP): Use .inst directive if HAVE_AS_LSE is not
    defined.  Otherwise, move the operands right after the glue? and
    comment out operands where the macros are used.
    * configure: Regenerated.
    * config.in: Regenerated.

On 22/04/2020 10:59, Kyrylo Tkachov wrote:

Hi Andre,


-Original Message-
From: Andre Vieira (lists) 
Sent: 22 April 2020 09:26
To: Kyrylo Tkachov ; gcc-patches@gcc.gnu.org
Cc: Richard Sandiford ; s...@amazon.com
Subject: Re: [PATCH 0/19][GCC-8] aarch64: Backport outline atomics


On 20/04/2020 09:42, Kyrylo Tkachov wrote:

Hi Andre,


-Original Message-
From: Andre Vieira (lists) 
Sent: 16 April 2020 13:24
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov ; Richard Sandiford
; s...@amazon.com
Subject: [PATCH 0/19][GCC-8] aarch64: Backport outline atomics

Hi,

This series backports all the patches and fixes regarding outline
atomics to the gcc-8 branch.

Bootstrapped the series for aarch64-linux-gnu and regression tested.
Is this OK for gcc-8?

Andre Vieira (19):
aarch64: Add early clobber for aarch64_store_exclusive
aarch64: Simplify LSE cas generation
aarch64: Improve cas generation
aarch64: Improve swp generation
aarch64: Improve atomic-op lse generation
aarch64: Remove early clobber from ATOMIC_LDOP scratch
aarch64: Extend %R for integer registers
aarch64: Implement TImode compare-and-swap
aarch64: Tidy aarch64_split_compare_and_swap
aarch64: Add out-of-line functions for LSE atomics
Add visibility to libfunc constructors
aarch64: Implement -moutline-atomics
Aarch64: Fix shrinkwrapping interactions with atomics (PR92692)
aarch64: Fix store-exclusive in load-operate LSE helpers
aarch64: Configure for sys/auxv.h in libgcc for lse-init.c
aarch64: Fix up aarch64_compare_and_swaphi pattern [PR94368]
aarch64: Fix bootstrap with old binutils [PR93053]

Thanks for putting these together.
Before they can go in we need to get this fix for PR93053 into GCC 9.
Can you please test it on that branch to help Jakub out?
Thanks,
Kyrill

Bootstrapped and regression tested the PR93053 fix from Jakub on gcc-9
branch and it looks good.

Thanks, can you please apply the patch to the gcc-9 branch then? (making sure 
the PR markers are there in the commit message so that Bugzilla is updated).
We can then proceed with the GCC 8 backports.

Kyrill


aarch64: Fix ICE due to aarch64_gen_compare_reg_maybe_ze [PR94435]
re PR target/90724 (ICE with __sync_bool_compare_and_swap with
-march=armv8.2-a+sve)
diff --git a/libgcc/config.in b/libgcc/config.in
index 
59a3d8daf52e72e548d3d9425d6043d5e0c663ad..5be5321d2584392bac1ec3af779cd96823212902
 100644
--- a/libgcc/config.in
+++ b/libgcc/config.in
@@ -10,6 +10,9 @@
*/
 #undef HAVE_AS_CFI_SECTIONS
 
+/* Define to 1 if the assembler supports LSE. */
+#undef HAVE_AS_LSE
+
 /* Define to 1 if the target assembler supports thread-local storage. */
 #undef HAVE_CC_TLS
 
diff --git a/libgcc/config/aarch64/lse.S b/libgcc/config/aarch64/lse.S
index 
c7979382ad7770b61bb1c64d32ba2395963a9d7a..f7f1c19587beaec2ccb6371378d54d50139ba1c9
 100644
--- a/libgcc/config/aarch64/lse.S
+++ b/libgcc/config/aarch64/lse.S
@@ -48,8 +48,14 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If 
not, see
  * separately to minimize code size.
  */
 
+#include "auto-target.h"
+
 /* Tell the assembler to accept LSE instructions.  */
+#ifdef HAVE_AS_LSE
.arch armv8-a+lse
+#else
+   .arch armv8-a
+#endif
 
 /* Declare the symbol gating the LSE implementations.  */
.hidden __aarch64_have_lse_atomics
@@ -58,12 +64,19 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  
If not, see
 #if SIZE == 1
 # define S b
 # define UXT   uxtb
+# define B 0x
 #elif SIZE == 2
 # define S h
 # define UXT   uxth
+# define B 0x4000
 #elif SIZE == 4 || SIZE == 8 || SIZE == 16
 # define S
 # define UXT   mov
+# if SIZE == 4
+#  define B0x8000
+# elif SIZE == 8
+#  define B0xc000
+# endif
 #else
 # error
 #endif
@@ -72,18 +85,26 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  
If not, see
 # define SUFF  _relax
 # define A
 # define L
+# define M 0x00
+# define N 0x00
 #elif MODEL == 2
 # define SUFF  _acq
 # define A a
 

Re: [PATCH 0/19][GCC-8] aarch64: Backport outline atomics

2020-04-22 Thread Andre Vieira (lists)



On 20/04/2020 09:42, Kyrylo Tkachov wrote:

Hi Andre,


-Original Message-
From: Andre Vieira (lists) 
Sent: 16 April 2020 13:24
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov ; Richard Sandiford
; s...@amazon.com
Subject: [PATCH 0/19][GCC-8] aarch64: Backport outline atomics

Hi,

This series backports all the patches and fixes regarding outline
atomics to the gcc-8 branch.

Bootstrapped the series for aarch64-linux-gnu and regression tested.
Is this OK for gcc-8?

Andre Vieira (19):
aarch64: Add early clobber for aarch64_store_exclusive
aarch64: Simplify LSE cas generation
aarch64: Improve cas generation
aarch64: Improve swp generation
aarch64: Improve atomic-op lse generation
aarch64: Remove early clobber from ATOMIC_LDOP scratch
aarch64: Extend %R for integer registers
aarch64: Implement TImode compare-and-swap
aarch64: Tidy aarch64_split_compare_and_swap
aarch64: Add out-of-line functions for LSE atomics
Add visibility to libfunc constructors
aarch64: Implement -moutline-atomics
Aarch64: Fix shrinkwrapping interactions with atomics (PR92692)
aarch64: Fix store-exclusive in load-operate LSE helpers
aarch64: Configure for sys/auxv.h in libgcc for lse-init.c
aarch64: Fix up aarch64_compare_and_swaphi pattern [PR94368]
aarch64: Fix bootstrap with old binutils [PR93053]

Thanks for putting these together.
Before they can go in we need to get this fix for PR93053 into GCC 9.
Can you please test it on that branch to help Jakub out?
Thanks,
Kyrill
Bootstrapped and regression tested the PR93053 fix from Jakub on gcc-9 
branch and it looks good.

aarch64: Fix ICE due to aarch64_gen_compare_reg_maybe_ze [PR94435]
re PR target/90724 (ICE with __sync_bool_compare_and_swap with
-march=armv8.2-a+sve)




Re: [PATCH 0/19][GCC-8] aarch64: Backport outline atomics

2020-04-16 Thread Andre Vieira (lists)

On 16/04/2020 13:24, Andre Vieira (lists) wrote:

Hi,

This series backports all the patches and fixes regarding outline 
atomics to the gcc-8 branch.


Bootstrapped the series for aarch64-linux-gnu and regression tested.
Is this OK for gcc-8?

Andre Vieira (19):
aarch64: Add early clobber for aarch64_store_exclusive
aarch64: Simplify LSE cas generation
aarch64: Improve cas generation
aarch64: Improve swp generation
aarch64: Improve atomic-op lse generation
aarch64: Remove early clobber from ATOMIC_LDOP scratch
aarch64: Extend %R for integer registers
aarch64: Implement TImode compare-and-swap
aarch64: Tidy aarch64_split_compare_and_swap
aarch64: Add out-of-line functions for LSE atomics
Add visibility to libfunc constructors
aarch64: Implement -moutline-atomics
Aarch64: Fix shrinkwrapping interactions with atomics (PR92692)
aarch64: Fix store-exclusive in load-operate LSE helpers
aarch64: Configure for sys/auxv.h in libgcc for lse-init.c
aarch64: Fix up aarch64_compare_and_swaphi pattern [PR94368]
aarch64: Fix bootstrap with old binutils [PR93053]
aarch64: Fix ICE due to aarch64_gen_compare_reg_maybe_ze [PR94435]
re PR target/90724 (ICE with __sync_bool_compare_and_swap with 
-march=armv8.2-a+sve)


Hmm, something went wrong when sending these: I had tried to make the 
N/19 patches reply to this one, but failed, and I was also pretty sure 
I had CC'ed Kyrill and Richard S.


Adding them now.



[PATCH 15/19][GCC-8] aarch64: Configure for sys/auxv.h in libgcc for lse-init.c

2020-04-16 Thread Andre Vieira (lists)

2020-04-16  Andre Vieira 

    Backport from mainline
    2019-09-25  Richard Henderson 

    PR target/91833
    * config/aarch64/lse-init.c: Include auto-target.h.  Disable
    initialization if !HAVE_SYS_AUXV_H.
    * configure.ac (AC_CHECK_HEADERS): Add sys/auxv.h.
    * config.in, configure: Rebuild.

diff --git a/libgcc/config.in b/libgcc/config.in
index 
d634af9d949741e26f5acc2606d40062d491dd8b..59a3d8daf52e72e548d3d9425d6043d5e0c663ad
 100644
--- a/libgcc/config.in
+++ b/libgcc/config.in
@@ -43,6 +43,9 @@
 /* Define to 1 if you have the  header file. */
 #undef HAVE_STRING_H
 
+/* Define to 1 if you have the  header file. */
+#undef HAVE_SYS_AUXV_H
+
 /* Define to 1 if you have the  header file. */
 #undef HAVE_SYS_STAT_H
 
@@ -82,6 +85,11 @@
 /* Define to 1 if the target use emutls for thread-local storage. */
 #undef USE_EMUTLS
 
+/* Enable large inode numbers on Mac OS X 10.5.  */
+#ifndef _DARWIN_USE_64_BIT_INODE
+# define _DARWIN_USE_64_BIT_INODE 1
+#endif
+
 /* Number of bits in a file offset, on hosts where this is settable. */
 #undef _FILE_OFFSET_BITS
 
diff --git a/libgcc/config/aarch64/lse-init.c b/libgcc/config/aarch64/lse-init.c
index 
33d2914747994a1e07dcae906f0352e64045ab02..1a8f4c55213f25c67c8bb8cdc1cc6f1bbe3255cb
 100644
--- a/libgcc/config/aarch64/lse-init.c
+++ b/libgcc/config/aarch64/lse-init.c
@@ -23,12 +23,14 @@ a copy of the GCC Runtime Library Exception along with this 
program;
 see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 <http://www.gnu.org/licenses/>.  */
 
+#include "auto-target.h"
+
 /* Define the symbol gating the LSE implementations.  */
 _Bool __aarch64_have_lse_atomics
   __attribute__((visibility("hidden"), nocommon));
 
 /* Disable initialization of __aarch64_have_lse_atomics during bootstrap.  */
-#ifndef inhibit_libc
+#if !defined(inhibit_libc) && defined(HAVE_SYS_AUXV_H)
 # include 
 
 /* Disable initialization if the system headers are too old.  */
diff --git a/libgcc/configure b/libgcc/configure
old mode 100644
new mode 100755
index 
b2f3f8708441e473b8e2941c4748748b6c7c40b8..7962cd9b87e1eb67037180e110f7d0de145bb2e1
--- a/libgcc/configure
+++ b/libgcc/configure
@@ -641,6 +641,7 @@ infodir
 docdir
 oldincludedir
 includedir
+runstatedir
 localstatedir
 sharedstatedir
 sysconfdir
@@ -729,6 +730,7 @@ datadir='${datarootdir}'
 sysconfdir='${prefix}/etc'
 sharedstatedir='${prefix}/com'
 localstatedir='${prefix}/var'
+runstatedir='${localstatedir}/run'
 includedir='${prefix}/include'
 oldincludedir='/usr/include'
 docdir='${datarootdir}/doc/${PACKAGE_TARNAME}'
@@ -980,6 +982,15 @@ do
   | -silent | --silent | --silen | --sile | --sil)
 silent=yes ;;
 
+  -runstatedir | --runstatedir | --runstatedi | --runstated \
+  | --runstate | --runstat | --runsta | --runst | --runs \
+  | --run | --ru | --r)
+ac_prev=runstatedir ;;
+  -runstatedir=* | --runstatedir=* | --runstatedi=* | --runstated=* \
+  | --runstate=* | --runstat=* | --runsta=* | --runst=* | --runs=* \
+  | --run=* | --ru=* | --r=*)
+runstatedir=$ac_optarg ;;
+
   -sbindir | --sbindir | --sbindi | --sbind | --sbin | --sbi | --sb)
 ac_prev=sbindir ;;
   -sbindir=* | --sbindir=* | --sbindi=* | --sbind=* | --sbin=* \
@@ -1117,7 +1128,7 @@ fi
 for ac_var in  exec_prefix prefix bindir sbindir libexecdir datarootdir \
datadir sysconfdir sharedstatedir localstatedir includedir \
oldincludedir docdir infodir htmldir dvidir pdfdir psdir \
-   libdir localedir mandir
+   libdir localedir mandir runstatedir
 do
   eval ac_val=\$$ac_var
   # Remove trailing slashes.
@@ -1272,6 +1283,7 @@ Fine tuning of the installation directories:
   --sysconfdir=DIRread-only single-machine data [PREFIX/etc]
   --sharedstatedir=DIRmodifiable architecture-independent data [PREFIX/com]
   --localstatedir=DIR modifiable single-machine data [PREFIX/var]
+  --runstatedir=DIR   modifiable per-process data [LOCALSTATEDIR/run]
   --libdir=DIRobject code libraries [EPREFIX/lib]
   --includedir=DIRC header files [PREFIX/include]
   --oldincludedir=DIR C header files for non-gcc [/usr/include]
@@ -4091,7 +4103,7 @@ else
 We can't simply define LARGE_OFF_T to be 9223372036854775807,
 since some C++ compilers masquerading as C compilers
 incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
   && LARGE_OFF_T % 2147483647 == 1)
  ? 1 : -1];
@@ -4137,7 +4149,7 @@ else
 We can't simply define LARGE_OFF_T to be 9223372036854775807,
 since some C++ compilers masquerading as C compilers
 incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + 

[PATCH 19/19][GCC-8] re PR target/90724 (ICE with __sync_bool_compare_and_swap with -march=armv8.2-a+sve)

2020-04-16 Thread Andre Vieira (lists)

2020-04-16  Andre Vieira 

    Backport from mainline
    2019-08-21  Prathamesh Kulkarni 

    PR target/90724
    * config/aarch64/aarch64.c (aarch64_gen_compare_reg_maybe_ze): Force y
    in reg if it fails aarch64_plus_operand predicate.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
6bac63402e508027e77a9f4557cb10c578ea7c2c..0da927be15c339295ef940d6e05a37e95135aa5a
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1574,6 +1574,9 @@ aarch64_gen_compare_reg_maybe_ze (RTX_CODE code, rtx x, 
rtx y,
}
 }
 
+  if (!aarch64_plus_operand (y, y_mode))
+y = force_reg (y_mode, y);
+
   return aarch64_gen_compare_reg (code, x, y);
 }
 


[PATCH 16/19][GCC-8] aarch64: Fix up aarch64_compare_and_swaphi pattern [PR94368]

2020-04-16 Thread Andre Vieira (lists)

2020-04-16  Andre Vieira 

    Backport from mainline
    2020-03-31  Jakub Jelinek 

    PR target/94368
    * config/aarch64/constraints.md (Uph): New constraint.
    * config/aarch64/atomics.md (cas_short_expected_imm): New mode attr.
    (aarch64_compare_and_swap): Use it instead of n in operand 2's
    constraint.

    * gcc.dg/pr94368.c: New test.

diff --git a/gcc/config/aarch64/atomics.md b/gcc/config/aarch64/atomics.md
index 
0ee8d2efac05877d610981b719bd02afdf93a832..1005462ae23aa13dbc3013a255aa189096e33366
 100644
--- a/gcc/config/aarch64/atomics.md
+++ b/gcc/config/aarch64/atomics.md
@@ -38,6 +38,8 @@
 
 (define_mode_attr cas_short_expected_pred
   [(QI "aarch64_reg_or_imm") (HI "aarch64_plushi_operand")])
+(define_mode_attr cas_short_expected_imm
+  [(QI "n") (HI "Uph")])
 
 (define_insn_and_split "aarch64_compare_and_swap"
   [(set (reg:CC CC_REGNUM) ;; bool out
@@ -47,7 +49,8 @@
   (match_operand:SHORT 1 "aarch64_sync_memory_operand" "+Q"))) ;; memory
(set (match_dup 1)
 (unspec_volatile:SHORT
-  [(match_operand:SHORT 2 "" "rn");; 
expected
+  [(match_operand:SHORT 2 ""
+ "r")  ;; expected
(match_operand:SHORT 3 "aarch64_reg_or_zero" "rZ")  ;; desired
(match_operand:SI 4 "const_int_operand");; 
is_weak
(match_operand:SI 5 "const_int_operand");; mod_s
diff --git a/gcc/config/aarch64/constraints.md 
b/gcc/config/aarch64/constraints.md
index 
32a0fa60a198c714f7c0b8b987da6bc26992845d..03626d2faf87e0b038bf3b8602d4feb8ef7d077c
 100644
--- a/gcc/config/aarch64/constraints.md
+++ b/gcc/config/aarch64/constraints.md
@@ -213,6 +213,13 @@
   (and (match_code "const_int")
(match_test "(unsigned) exact_log2 (ival) <= 4")))
 
+(define_constraint "Uph"
+  "@internal
+  A constraint that matches HImode integers zero extendable to
+  SImode plus_operand."
+  (and (match_code "const_int")
+   (match_test "aarch64_plushi_immediate (op, VOIDmode)")))
+
 (define_memory_constraint "Q"
  "A memory address which uses a single base register with no offset."
  (and (match_code "mem")
diff --git a/gcc/testsuite/gcc.dg/pr94368.c b/gcc/testsuite/gcc.dg/pr94368.c
new file mode 100644
index 
..1267b8220983ef1477a8339bdcc6369abaeca592
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94368.c
@@ -0,0 +1,25 @@
+/* PR target/94368 */
+/* { dg-do compile { target fpic } } */
+/* { dg-options "-fpic -O1 -fcommon" } */
+
+int b, c, d, e, f, h;
+short g;
+int foo (int) __attribute__ ((__const__));
+
+void
+bar (void)
+{
+  while (1)
+{
+  while (1)
+   {
+ __atomic_load_n (, 0);
+ if (foo (2))
+   __sync_val_compare_and_swap (, 0, f);
+ b = 1;
+ if (h == e)
+   break;
+   }
+  __sync_val_compare_and_swap (, -1, f);
+}
+}


[PATCH 18/19][GCC-8] aarch64: Fix ICE due to aarch64_gen_compare_reg_maybe_ze [PR94435]

2020-04-16 Thread Andre Vieira (lists)

The following testcase ICEs, because aarch64_gen_compare_reg_maybe_ze emits
invalid RTL.
For y_mode [QH]Imode it expects y to be of that mode (or CONST_INT that fits
into that mode) and x being SImode; for non-CONST_INT y it zero extends y
into SImode and compares that against x, for CONST_INT y it zero extends y
into SImode.  The problem is that when the zero extended constant isn't
usable directly, it forces it into a REG, but with y_mode mode, and then
compares against y.  That is wrong, because it should force it into a SImode
REG and compare that way.

2020-04-16  Andre Vieira 

    Backport from mainline
    2020-04-02  Jakub Jelinek 

    PR target/94435
    * config/aarch64/aarch64.c (aarch64_gen_compare_reg_maybe_ze): For
    y_mode E_[QH]Imode and y being a CONST_INT, change y_mode to SImode.

    * gcc.target/aarch64/pr94435.c: New test.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
21124b5a3479dd388eb767402e080e2181153467..6bac63402e508027e77a9f4557cb10c578ea7c2c
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1556,7 +1556,10 @@ aarch64_gen_compare_reg_maybe_ze (RTX_CODE code, rtx x, 
rtx y,
   if (y_mode == E_QImode || y_mode == E_HImode)
 {
   if (CONST_INT_P (y))
-   y = GEN_INT (INTVAL (y) & GET_MODE_MASK (y_mode));
+   {
+ y = GEN_INT (INTVAL (y) & GET_MODE_MASK (y_mode));
+ y_mode = SImode;
+   }
   else
{
  rtx t, cc_reg;
diff --git a/gcc/testsuite/gcc.target/aarch64/pr94435.c 
b/gcc/testsuite/gcc.target/aarch64/pr94435.c
new file mode 100644
index 
..5713c14d5f90b1d42f92d040e9030ecc03c97d51
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr94435.c
@@ -0,0 +1,25 @@
+/* PR target/94435 */
+/* { dg-do compile } */
+/* { dg-options "-march=armv8-a+nolse -moutline-atomics" } */
+
+int b, c, d, e, f, h;
+short g;
+int foo (int) __attribute__ ((__const__));
+
+void
+bar (void)
+{
+  while (1)
+{
+  while (1)
+   {
+ __atomic_load_n (, 0);
+ if (foo (2))
+   __sync_val_compare_and_swap (, 0, f);
+ b = 1;
+ if (h == e)
+   break;
+   }
+  __sync_val_compare_and_swap (, -1, f);
+}
+}


[PATCH 17/19][GCC-8] aarch64: Fix bootstrap with old binutils [PR93053]

2020-04-16 Thread Andre Vieira (lists)

As reported in the PR, GCC 10 (and also 9.3.1 but not 9.3.0) fails to build
when using older binutils which lack LSE support, because those instructions
are used in libgcc.
Thanks to Kyrylo's hint, the following patches (hopefully) allow it to
build even with older binutils by using the .inst directive if LSE
support isn't available in the assembler.
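
For context, .inst simply emits a raw 32-bit word into the instruction
stream, so even an assembler that does not know the LSE mnemonics will
accept it.  A minimal illustration, not taken from lse.S (0xd503201f is
the encoding of the ordinary AArch64 nop):

/* Illustrative only: lse.S plays the same trick with the LSE encodings
   when HAVE_AS_LSE is not defined.  */
static inline void
nop_via_inst (void)
{
  __asm__ volatile (".inst 0xd503201f");  /* encodes "nop" */
}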

2020-04-16 Andre Vieira 

    Backport from mainline
    2020-04-15  Jakub Jelinek 

    PR target/93053
    * configure.ac (LIBGCC_CHECK_AS_LSE): Add HAVE_AS_LSE checking.
    * config/aarch64/lse.S: Include auto-target.h, if HAVE_AS_LSE
    is not defined, use just .arch armv8-a.
    (B, M, N, OPN): Define.
    (COMMENT): New .macro.
    (CAS, CASP, SWP, LDOP): Use .inst directive if HAVE_AS_LSE is not
    defined.  Otherwise, move the operands right after the glue? and
    comment out operands where the macros are used.
    * configure: Regenerated.
    * config.in: Regenerated.

diff --git a/libgcc/config.in b/libgcc/config.in
index 
59a3d8daf52e72e548d3d9425d6043d5e0c663ad..5be5321d2584392bac1ec3af779cd96823212902
 100644
--- a/libgcc/config.in
+++ b/libgcc/config.in
@@ -10,6 +10,9 @@
*/
 #undef HAVE_AS_CFI_SECTIONS
 
+/* Define to 1 if the assembler supports LSE. */
+#undef HAVE_AS_LSE
+
 /* Define to 1 if the target assembler supports thread-local storage. */
 #undef HAVE_CC_TLS
 
diff --git a/libgcc/config/aarch64/lse.S b/libgcc/config/aarch64/lse.S
index 
c7979382ad7770b61bb1c64d32ba2395963a9d7a..f7f1c19587beaec2ccb6371378d54d50139ba1c9
 100644
--- a/libgcc/config/aarch64/lse.S
+++ b/libgcc/config/aarch64/lse.S
@@ -48,8 +48,14 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If 
not, see
  * separately to minimize code size.
  */
 
+#include "auto-target.h"
+
 /* Tell the assembler to accept LSE instructions.  */
+#ifdef HAVE_AS_LSE
.arch armv8-a+lse
+#else
+   .arch armv8-a
+#endif
 
 /* Declare the symbol gating the LSE implementations.  */
.hidden __aarch64_have_lse_atomics
@@ -58,12 +64,19 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  
If not, see
 #if SIZE == 1
 # define S b
 # define UXT   uxtb
+# define B 0x
 #elif SIZE == 2
 # define S h
 # define UXT   uxth
+# define B 0x4000
 #elif SIZE == 4 || SIZE == 8 || SIZE == 16
 # define S
 # define UXT   mov
+# if SIZE == 4
+#  define B0x8000
+# elif SIZE == 8
+#  define B0xc000
+# endif
 #else
 # error
 #endif
@@ -72,18 +85,26 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  
If not, see
 # define SUFF  _relax
 # define A
 # define L
+# define M 0x00
+# define N 0x00
 #elif MODEL == 2
 # define SUFF  _acq
 # define A a
 # define L
+# define M 0x40
+# define N 0x80
 #elif MODEL == 3
 # define SUFF  _rel
 # define A
 # define L l
+# define M 0x008000
+# define N 0x40
 #elif MODEL == 4
 # define SUFF  _acq_rel
 # define A a
 # define L l
+# define M 0x408000
+# define N 0xc0
 #else
 # error
 #endif
@@ -144,9 +165,13 @@ STARTFNNAME(cas)
JUMP_IF_NOT_LSE 8f
 
 #if SIZE < 16
-#define CASglue4(cas, A, L, S)
+#ifdef HAVE_AS_LSE
+# define CAS   glue4(cas, A, L, S) s(0), s(1), [x2]
+#else
+# define CAS   .inst 0x08a07c41 + B + M
+#endif
 
-   CAS s(0), s(1), [x2]
+   CAS /* s(0), s(1), [x2] */
ret
 
 8: UXT s(tmp0), s(0)
@@ -160,9 +185,13 @@ STARTFNNAME(cas)
 #else
 #define LDXP   glue3(ld, A, xp)
 #define STXP   glue3(st, L, xp)
-#define CASP   glue3(casp, A, L)
+#ifdef HAVE_AS_LSE
+# define CASP  glue3(casp, A, L)   x0, x1, x2, x3, [x4]
+#else
+# define CASP  .inst 0x48207c82 + M
+#endif
 
-   CASPx0, x1, x2, x3, [x4]
+   CASP/* x0, x1, x2, x3, [x4] */
ret
 
 8: mov x(tmp0), x0
@@ -181,12 +210,16 @@ ENDFN NAME(cas)
 #endif
 
 #ifdef L_swp
-#define SWPglue4(swp, A, L, S)
+#ifdef HAVE_AS_LSE
+# define SWP   glue4(swp, A, L, S) s(0), s(0), [x1]
+#else
+# define SWP   .inst 0x38208020 + B + N
+#endif
 
 STARTFNNAME(swp)
JUMP_IF_NOT_LSE 8f
 
-   SWP s(0), s(0), [x1]
+   SWP /* s(0), s(0), [x1] */
ret
 
 8: mov s(tmp0), s(0)
@@ -204,24 +237,32 @@ ENDFN NAME(swp)
 #ifdef L_ldadd
 #define LDNM   ldadd
 #define OP add
+#define OPN0x
 #elif defined(L_ldclr)
 #define LDNM   ldclr
 #define OP bic
+#define OPN0x1000
 #elif defined(L_ldeor)
 #define LDNM   ldeor
 #define OP eor
+#define OPN0x2000
 #elif defined(L_ldset)
 #define LDNM   ldset
 #define OP orr
+#define OPN0x3000
 #else
 #error
 #endif
-#define LDOP   glue4(LDNM, A, L, S)
+#ifdef HAVE_AS_LSE
+# define LDOP  glue4(LDNM, A, L, S)s(0), s(0), [x1]
+#else
+# define LDOP  .inst 0x38200020 + OPN + B + N
+#endif
 
 STARTFNNAME(LDNM)
JUMP_IF_NOT_LSE 8f
 
-   LDOPs(0), s(0), [x1]

[PATCH 13/19][GCC-8] Aarch64: Fix shrinkwrapping interactions with atomics

2020-04-16 Thread Andre Vieira (lists)

2020-04-16  Andre Vieira 

    Backport from mainline
    2020-01-17  Wilco Dijkstra 

    PR target/92692
    * config/aarch64/atomics.md (aarch64_compare_and_swap)
    Use epilogue_completed rather than reload_completed.

diff --git a/gcc/config/aarch64/atomics.md b/gcc/config/aarch64/atomics.md
index 
28a1dbc4231009333c2e766d9d3aead54a491631..0ee8d2efac05877d610981b719bd02afdf93a832
 100644
--- a/gcc/config/aarch64/atomics.md
+++ b/gcc/config/aarch64/atomics.md
@@ -104,7 +104,7 @@
(clobber (match_scratch:SI 7 "="))]
   ""
   "#"
-  "&& reload_completed"
+  "&& epilogue_completed"
   [(const_int 0)]
   {
 aarch64_split_compare_and_swap (operands);


[PATCH 8/19][GCC-8] aarch64: Implement TImode compare-and-swap

2020-04-16 Thread Andre Vieira (lists)

2020-04-16  Andre Vieira 

    Backport from mainline.
    2019-09-19  Richard Henderson 

    * config/aarch64/aarch64.c (aarch64_gen_compare_reg): Add support
    for NE comparison of TImode values.
    (aarch64_emit_load_exclusive): Add support for TImode.
    (aarch64_emit_store_exclusive): Likewise.
    (aarch64_split_compare_and_swap): Disable strong_zero_p for TImode.
    * config/aarch64/atomics.md (atomic_compare_and_swapti):
    Change iterator from ALLI to ALLI_TI.
    (atomic_compare_and_swapti): New.
    (atomic_compare_and_swapti: New.
    (aarch64_load_exclusive_pair): New.
    (aarch64_store_exclusive_pair): New.
    * config/aarch64/iterators.md (ALLI_TI): New iterator.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
317571e018c4f96046799675e042cdfaabb5b94a..09e78313489d266daaca9eba3647f150534893f6
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1517,10 +1517,33 @@ emit_set_insn (rtx x, rtx y)
 rtx
 aarch64_gen_compare_reg (RTX_CODE code, rtx x, rtx y)
 {
-  machine_mode mode = SELECT_CC_MODE (code, x, y);
-  rtx cc_reg = gen_rtx_REG (mode, CC_REGNUM);
+  machine_mode cmp_mode = GET_MODE (x);
+  machine_mode cc_mode;
+  rtx cc_reg;
 
-  emit_set_insn (cc_reg, gen_rtx_COMPARE (mode, x, y));
+  if (cmp_mode == TImode)
+{
+  gcc_assert (code == NE);
+
+  cc_mode = CCmode;
+  cc_reg = gen_rtx_REG (cc_mode, CC_REGNUM);
+
+  rtx x_lo = operand_subword (x, 0, 0, TImode);
+  rtx y_lo = operand_subword (y, 0, 0, TImode);
+  emit_set_insn (cc_reg, gen_rtx_COMPARE (cc_mode, x_lo, y_lo));
+
+  rtx x_hi = operand_subword (x, 1, 0, TImode);
+  rtx y_hi = operand_subword (y, 1, 0, TImode);
+  emit_insn (gen_ccmpdi (cc_reg, cc_reg, x_hi, y_hi,
+gen_rtx_EQ (cc_mode, cc_reg, const0_rtx),
+GEN_INT (AARCH64_EQ)));
+}
+  else
+{
+  cc_mode = SELECT_CC_MODE (code, x, y);
+  cc_reg = gen_rtx_REG (cc_mode, CC_REGNUM);
+  emit_set_insn (cc_reg, gen_rtx_COMPARE (cc_mode, x, y));
+}
   return cc_reg;
 }
 
@@ -14145,40 +14168,54 @@ static void
 aarch64_emit_load_exclusive (machine_mode mode, rtx rval,
 rtx mem, rtx model_rtx)
 {
-  rtx (*gen) (rtx, rtx, rtx);
-
-  switch (mode)
+  if (mode == TImode)
+emit_insn (gen_aarch64_load_exclusive_pair (gen_lowpart (DImode, rval),
+   gen_highpart (DImode, rval),
+   mem, model_rtx));
+  else
 {
-case E_QImode: gen = gen_aarch64_load_exclusiveqi; break;
-case E_HImode: gen = gen_aarch64_load_exclusivehi; break;
-case E_SImode: gen = gen_aarch64_load_exclusivesi; break;
-case E_DImode: gen = gen_aarch64_load_exclusivedi; break;
-default:
-  gcc_unreachable ();
-}
+  rtx (*gen) (rtx, rtx, rtx);
+
+  switch (mode)
+   {
+   case E_QImode: gen = gen_aarch64_load_exclusiveqi; break;
+   case E_HImode: gen = gen_aarch64_load_exclusivehi; break;
+   case E_SImode: gen = gen_aarch64_load_exclusivesi; break;
+   case E_DImode: gen = gen_aarch64_load_exclusivedi; break;
+   default:
+ gcc_unreachable ();
+   }
 
-  emit_insn (gen (rval, mem, model_rtx));
+  emit_insn (gen (rval, mem, model_rtx));
+}
 }
 
 /* Emit store exclusive.  */
 
 static void
 aarch64_emit_store_exclusive (machine_mode mode, rtx bval,
- rtx rval, rtx mem, rtx model_rtx)
+ rtx mem, rtx rval, rtx model_rtx)
 {
-  rtx (*gen) (rtx, rtx, rtx, rtx);
-
-  switch (mode)
+  if (mode == TImode)
+emit_insn (gen_aarch64_store_exclusive_pair
+  (bval, mem, operand_subword (rval, 0, 0, TImode),
+   operand_subword (rval, 1, 0, TImode), model_rtx));
+  else
 {
-case E_QImode: gen = gen_aarch64_store_exclusiveqi; break;
-case E_HImode: gen = gen_aarch64_store_exclusivehi; break;
-case E_SImode: gen = gen_aarch64_store_exclusivesi; break;
-case E_DImode: gen = gen_aarch64_store_exclusivedi; break;
-default:
-  gcc_unreachable ();
-}
+  rtx (*gen) (rtx, rtx, rtx, rtx);
+
+  switch (mode)
+   {
+   case E_QImode: gen = gen_aarch64_store_exclusiveqi; break;
+   case E_HImode: gen = gen_aarch64_store_exclusivehi; break;
+   case E_SImode: gen = gen_aarch64_store_exclusivesi; break;
+   case E_DImode: gen = gen_aarch64_store_exclusivedi; break;
+   default:
+ gcc_unreachable ();
+   }
 
-  emit_insn (gen (bval, rval, mem, model_rtx));
+  emit_insn (gen (bval, mem, rval, model_rtx));
+}
 }
 
 /* Mark the previous jump instruction as unlikely.  */
@@ -14197,16 +14234,6 @@ aarch64_expand_compare_and_swap (rtx operands[])
 {
   rtx bval, rval, mem, oldval, newval, is_weak, mod_s, mod_f, x, cc_reg;
   machine_mode mode, r_mode;
-  typedef rtx (*gen_atomic_cas_fn) (rtx, rtx, rtx, rtx

[PATCH 7/19][GCC-8] aarch64: Extend %R for integer registers

2020-04-16 Thread Andre Vieira (lists)

2020-04-16  Andre Vieira 

    Backport from mainline.
    2019-09-19  Richard Henderson 

    * config/aarch64/aarch64.c (aarch64_print_operand): Allow integer
    registers with %R.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
1068cfd899a759c506e3217e1e2c19cd778b4372..317571e018c4f96046799675e042cdfaabb5b94a
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -6627,7 +6627,7 @@ sizetochar (int size)
  'S/T/U/V':Print a FP/SIMD register name for a register 
list.
The register printed is the FP/SIMD register name
of X + 0/1/2/3 for S/T/U/V.
- 'R':  Print a scalar FP/SIMD register name + 1.
+ 'R':  Print a scalar Integer/FP/SIMD register name + 1.
  'X':  Print bottom 16 bits of integer constant in hex.
  'w/x':Print a general register name or the zero register
(32-bit or 64-bit).
@@ -6813,12 +6813,13 @@ aarch64_print_operand (FILE *f, rtx x, int code)
   break;
 
 case 'R':
-  if (!REG_P (x) || !FP_REGNUM_P (REGNO (x)))
-   {
- output_operand_lossage ("incompatible floating point / vector 
register operand for '%%%c'", code);
- return;
-   }
-  asm_fprintf (f, "q%d", REGNO (x) - V0_REGNUM + 1);
+  if (REG_P (x) && FP_REGNUM_P (REGNO (x)))
+   asm_fprintf (f, "q%d", REGNO (x) - V0_REGNUM + 1);
+  else if (REG_P (x) && GP_REGNUM_P (REGNO (x)))
+   asm_fprintf (f, "x%d", REGNO (x) - R0_REGNUM + 1);
+  else
+   output_operand_lossage ("incompatible register operand for '%%%c'",
+   code);
   break;
 
 case 'X':


[PATCH 14/19][GCC-8] aarch64: Fix store-exclusive in load-operate LSE helpers

2020-04-16 Thread Andre Vieira (lists)

2020-04-16  Andre Vieira 

    Backport from mainline
    2019-09-25  Richard Henderson 

    PR target/91834
    * config/aarch64/lse.S (LDNM): Ensure STXR output does not
    overlap the inputs.

diff --git a/libgcc/config/aarch64/lse.S b/libgcc/config/aarch64/lse.S
index 
a5f6673596c73c497156a6f128799cc43b400504..c7979382ad7770b61bb1c64d32ba2395963a9d7a
 100644
--- a/libgcc/config/aarch64/lse.S
+++ b/libgcc/config/aarch64/lse.S
@@ -227,8 +227,8 @@ STARTFN NAME(LDNM)
 8: mov s(tmp0), s(0)
 0: LDXRs(0), [x1]
OP  s(tmp1), s(0), s(tmp0)
-   STXRw(tmp1), s(tmp1), [x1]
-   cbnzw(tmp1), 0b
+   STXRw(tmp2), s(tmp1), [x1]
+   cbnzw(tmp2), 0b
ret
 
 ENDFN  NAME(LDNM)


[PATCH 10/19][GCC-8] aarch64: Add out-of-line functions for LSE atomics

2020-04-16 Thread Andre Vieira (lists)

This is the libgcc part of the interface -- providing the functions.
Rationale is provided at the top of libgcc/config/aarch64/lse.S.
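
As a rough sketch of how these helpers end up being used (the helper
name below follows the size/memory-model macros visible in lse.S, but
treat the exact symbol as an assumption rather than documentation): with
-moutline-atomics the compiler lowers a builtin atomic to an out-of-line
call, and the helper dispatches on __aarch64_have_lse_atomics at runtime.

/* Sketch only: a 4-byte CAS like this is expected to become a call to a
   helper such as __aarch64_cas4_acq_rel, which uses LSE instructions
   when __aarch64_have_lse_atomics is set and falls back to
   load/store-exclusive loops otherwise.  */
int
cas4 (int *p, int expected, int desired)
{
  __atomic_compare_exchange_n (p, &expected, desired, 0,
                               __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
  return expected;
}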

2020-04-16  Andre Vieira 

    Backport from mainline
    2019-09-19  Richard Henderson 

    * config/aarch64/lse-init.c: New file.
    * config/aarch64/lse.S: New file.
    * config/aarch64/t-lse: New file.
    * config.host: Add t-lse to all aarch64 tuples.

diff --git a/libgcc/config.host b/libgcc/config.host
index 
b12c86267dac9da8da9e1ab4123d5171c3e07f40..e436ade1a68c6cd918d2f370b14d61682cb9fd59
 100644
--- a/libgcc/config.host
+++ b/libgcc/config.host
@@ -337,23 +337,27 @@ aarch64*-*-elf | aarch64*-*-rtems*)
extra_parts="$extra_parts crtbegin.o crtend.o crti.o crtn.o"
extra_parts="$extra_parts crtfastmath.o"
tmake_file="${tmake_file} ${cpu_type}/t-aarch64"
+   tmake_file="${tmake_file} ${cpu_type}/t-lse t-slibgcc-libgcc"
tmake_file="${tmake_file} ${cpu_type}/t-softfp t-softfp t-crtfm"
md_unwind_header=aarch64/aarch64-unwind.h
;;
 aarch64*-*-freebsd*)
extra_parts="$extra_parts crtfastmath.o"
tmake_file="${tmake_file} ${cpu_type}/t-aarch64"
+   tmake_file="${tmake_file} ${cpu_type}/t-lse t-slibgcc-libgcc"
tmake_file="${tmake_file} ${cpu_type}/t-softfp t-softfp t-crtfm"
md_unwind_header=aarch64/freebsd-unwind.h
;;
 aarch64*-*-fuchsia*)
tmake_file="${tmake_file} ${cpu_type}/t-aarch64"
+   tmake_file="${tmake_file} ${cpu_type}/t-lse t-slibgcc-libgcc"
tmake_file="${tmake_file} ${cpu_type}/t-softfp t-softfp"
;;
 aarch64*-*-linux*)
extra_parts="$extra_parts crtfastmath.o"
md_unwind_header=aarch64/linux-unwind.h
tmake_file="${tmake_file} ${cpu_type}/t-aarch64"
+   tmake_file="${tmake_file} ${cpu_type}/t-lse t-slibgcc-libgcc"
tmake_file="${tmake_file} ${cpu_type}/t-softfp t-softfp t-crtfm"
;;
 alpha*-*-linux*)
diff --git a/libgcc/config/aarch64/lse-init.c b/libgcc/config/aarch64/lse-init.c
new file mode 100644
index 
..33d2914747994a1e07dcae906f0352e64045ab02
--- /dev/null
+++ b/libgcc/config/aarch64/lse-init.c
@@ -0,0 +1,45 @@
+/* Out-of-line LSE atomics for AArch64 architecture, Init.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   Contributed by Linaro Ltd.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+<http://www.gnu.org/licenses/>.  */
+
+/* Define the symbol gating the LSE implementations.  */
+_Bool __aarch64_have_lse_atomics
+  __attribute__((visibility("hidden"), nocommon));
+
+/* Disable initialization of __aarch64_have_lse_atomics during bootstrap.  */
+#ifndef inhibit_libc
+# include <sys/auxv.h>
+
+/* Disable initialization if the system headers are too old.  */
+# if defined(AT_HWCAP) && defined(HWCAP_ATOMICS)
+
+static void __attribute__((constructor))
+init_have_lse_atomics (void)
+{
+  unsigned long hwcap = getauxval (AT_HWCAP);
+  __aarch64_have_lse_atomics = (hwcap & HWCAP_ATOMICS) != 0;
+}
+
+# endif /* HWCAP */
+#endif /* inhibit_libc */
diff --git a/libgcc/config/aarch64/lse.S b/libgcc/config/aarch64/lse.S
new file mode 100644
index 
..a5f6673596c73c497156a6f128799cc43b400504
--- /dev/null
+++ b/libgcc/config/aarch64/lse.S
@@ -0,0 +1,235 @@
+/* Out-of-line LSE atomics for AArch64 architecture.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   Contributed by Linaro Ltd.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+Under Section 7 of GPL version 3, you are granted addi

[PATCH 9/19][GCC-8] aarch64: Tidy aarch64_split_compare_and_swap

2020-04-16 Thread Andre Vieira (lists)

2020-04-16  Andre Vieira 

    Backport from mainline.
    2019-09-19  Richard Henderson 

    * config/aarch64/aarch64.c (aarch64_split_compare_and_swap): Unify
    some code paths; use aarch64_gen_compare_reg instead of open-coding.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
09e78313489d266daaca9eba3647f150534893f6..2df5bf3db97d9362155c3c8d9c9d7f14c41b9520
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14359,13 +14359,11 @@ aarch64_split_compare_and_swap (rtx operands[])
   /* Split after prolog/epilog to avoid interactions with shrinkwrapping.  */
   gcc_assert (epilogue_completed);
 
-  rtx rval, mem, oldval, newval, scratch;
+  rtx rval, mem, oldval, newval, scratch, x, model_rtx;
   machine_mode mode;
   bool is_weak;
   rtx_code_label *label1, *label2;
-  rtx x, cond;
   enum memmodel model;
-  rtx model_rtx;
 
   rval = operands[0];
   mem = operands[1];
@@ -14386,7 +14384,7 @@ aarch64_split_compare_and_swap (rtx operands[])
CBNZscratch, .label1
 .label2:
CMP rval, 0.  */
-  bool strong_zero_p = !is_weak && oldval == const0_rtx && mode != TImode;
+  bool strong_zero_p = (!is_weak && oldval == const0_rtx && mode != TImode);
 
   label1 = NULL;
   if (!is_weak)
@@ -14399,26 +14397,20 @@ aarch64_split_compare_and_swap (rtx operands[])
   /* The initial load can be relaxed for a __sync operation since a final
  barrier will be emitted to stop code hoisting.  */
   if (is_mm_sync (model))
-aarch64_emit_load_exclusive (mode, rval, mem,
-GEN_INT (MEMMODEL_RELAXED));
+aarch64_emit_load_exclusive (mode, rval, mem, GEN_INT (MEMMODEL_RELAXED));
   else
 aarch64_emit_load_exclusive (mode, rval, mem, model_rtx);
 
   if (strong_zero_p)
-{
-  x = gen_rtx_NE (VOIDmode, rval, const0_rtx);
-  x = gen_rtx_IF_THEN_ELSE (VOIDmode, x,
-   gen_rtx_LABEL_REF (Pmode, label2), pc_rtx);
-  aarch64_emit_unlikely_jump (gen_rtx_SET (pc_rtx, x));
-}
+x = gen_rtx_NE (VOIDmode, rval, const0_rtx);
   else
 {
-  cond = aarch64_gen_compare_reg_maybe_ze (NE, rval, oldval, mode);
-  x = gen_rtx_NE (VOIDmode, cond, const0_rtx);
-  x = gen_rtx_IF_THEN_ELSE (VOIDmode, x,
-   gen_rtx_LABEL_REF (Pmode, label2), pc_rtx);
-  aarch64_emit_unlikely_jump (gen_rtx_SET (pc_rtx, x));
+  rtx cc_reg = aarch64_gen_compare_reg_maybe_ze (NE, rval, oldval, mode);
+  x = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
 }
+  x = gen_rtx_IF_THEN_ELSE (VOIDmode, x,
+   gen_rtx_LABEL_REF (Pmode, label2), pc_rtx);
+  aarch64_emit_unlikely_jump (gen_rtx_SET (pc_rtx, x));
 
   aarch64_emit_store_exclusive (mode, scratch, mem, newval, model_rtx);
 
@@ -14430,22 +14422,16 @@ aarch64_split_compare_and_swap (rtx operands[])
   aarch64_emit_unlikely_jump (gen_rtx_SET (pc_rtx, x));
 }
   else
-{
-  cond = gen_rtx_REG (CCmode, CC_REGNUM);
-  x = gen_rtx_COMPARE (CCmode, scratch, const0_rtx);
-  emit_insn (gen_rtx_SET (cond, x));
-}
+aarch64_gen_compare_reg (NE, scratch, const0_rtx);
 
   emit_label (label2);
+
   /* If we used a CBNZ in the exchange loop emit an explicit compare with RVAL
  to set the condition flags.  If this is not used it will be removed by
  later passes.  */
   if (strong_zero_p)
-{
-  cond = gen_rtx_REG (CCmode, CC_REGNUM);
-  x = gen_rtx_COMPARE (CCmode, rval, const0_rtx);
-  emit_insn (gen_rtx_SET (cond, x));
-}
+aarch64_gen_compare_reg (NE, rval, const0_rtx);
+
   /* Emit any final barrier needed for a __sync operation.  */
   if (is_mm_sync (model))
 aarch64_emit_post_barrier (model);


[PATCH 12/19][GCC-8] aarch64: Implement -moutline-atomics

2020-04-16 Thread Andre Vieira (lists)

2020-04-16  Andre Vieira 

    Backport from mainline
    2019-09-19  Richard Henderson 

    * config/aarch64/aarch64.opt (-moutline-atomics): New.
    * config/aarch64/aarch64.c (aarch64_atomic_ool_func): New.
    (aarch64_ool_cas_names, aarch64_ool_swp_names): New.
    (aarch64_ool_ldadd_names, aarch64_ool_ldset_names): New.
    (aarch64_ool_ldclr_names, aarch64_ool_ldeor_names): New.
    (aarch64_expand_compare_and_swap): Honor TARGET_OUTLINE_ATOMICS.
    * config/aarch64/atomics.md (atomic_exchange): Likewise.
    (atomic_): Likewise.
    (atomic_fetch_): Likewise.
    (atomic__fetch): Likewise.
    * doc/invoke.texi: Document -moutline-atomics.

    * gcc.target/aarch64/atomic-op-acq_rel.c: Use -mno-outline-atomics.
    * gcc.target/aarch64/atomic-comp-swap-release-acquire.c: Likewise.
    * gcc.target/aarch64/atomic-op-acquire.c: Likewise.
    * gcc.target/aarch64/atomic-op-char.c: Likewise.
    * gcc.target/aarch64/atomic-op-consume.c: Likewise.
    * gcc.target/aarch64/atomic-op-imm.c: Likewise.
    * gcc.target/aarch64/atomic-op-int.c: Likewise.
    * gcc.target/aarch64/atomic-op-long.c: Likewise.
    * gcc.target/aarch64/atomic-op-relaxed.c: Likewise.
    * gcc.target/aarch64/atomic-op-release.c: Likewise.
    * gcc.target/aarch64/atomic-op-seq_cst.c: Likewise.
    * gcc.target/aarch64/atomic-op-short.c: Likewise.
    * gcc.target/aarch64/atomic_cmp_exchange_zero_reg_1.c: Likewise.
    * gcc.target/aarch64/atomic_cmp_exchange_zero_strong_1.c: Likewise.
    * gcc.target/aarch64/sync-comp-swap.c: Likewise.
    * gcc.target/aarch64/sync-op-acquire.c: Likewise.
    * gcc.target/aarch64/sync-op-full.c: Likewise.

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
da68ce0e7d096bf4a512c2b8ef52bf236f8f76f4..0f1dc75a27f3fdd2218e57811e208fc28139ac4a
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -548,4 +548,17 @@ rtl_opt_pass *make_pass_fma_steering (gcc::context *ctxt);
 
 poly_uint64 aarch64_regmode_natural_size (machine_mode);
 
+struct atomic_ool_names
+{
+const char *str[5][4];
+};
+
+rtx aarch64_atomic_ool_func(machine_mode mode, rtx model_rtx,
+   const atomic_ool_names *names);
+extern const atomic_ool_names aarch64_ool_swp_names;
+extern const atomic_ool_names aarch64_ool_ldadd_names;
+extern const atomic_ool_names aarch64_ool_ldset_names;
+extern const atomic_ool_names aarch64_ool_ldclr_names;
+extern const atomic_ool_names aarch64_ool_ldeor_names;
+
 #endif /* GCC_AARCH64_PROTOS_H */
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
2df5bf3db97d9362155c3c8d9c9d7f14c41b9520..21124b5a3479dd388eb767402e080e2181153467
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14227,6 +14227,82 @@ aarch64_emit_unlikely_jump (rtx insn)
   add_reg_br_prob_note (jump, profile_probability::very_unlikely ());
 }
 
+/* We store the names of the various atomic helpers in a 5x4 array.
+   Return the libcall function given MODE, MODEL and NAMES.  */
+
+rtx
+aarch64_atomic_ool_func(machine_mode mode, rtx model_rtx,
+   const atomic_ool_names *names)
+{
+  memmodel model = memmodel_base (INTVAL (model_rtx));
+  int mode_idx, model_idx;
+
+  switch (mode)
+{
+case E_QImode:
+  mode_idx = 0;
+  break;
+case E_HImode:
+  mode_idx = 1;
+  break;
+case E_SImode:
+  mode_idx = 2;
+  break;
+case E_DImode:
+  mode_idx = 3;
+  break;
+case E_TImode:
+  mode_idx = 4;
+  break;
+default:
+  gcc_unreachable ();
+}
+
+  switch (model)
+{
+case MEMMODEL_RELAXED:
+  model_idx = 0;
+  break;
+case MEMMODEL_CONSUME:
+case MEMMODEL_ACQUIRE:
+  model_idx = 1;
+  break;
+case MEMMODEL_RELEASE:
+  model_idx = 2;
+  break;
+case MEMMODEL_ACQ_REL:
+case MEMMODEL_SEQ_CST:
+  model_idx = 3;
+  break;
+default:
+  gcc_unreachable ();
+}
+
+  return init_one_libfunc_visibility (names->str[mode_idx][model_idx],
+ VISIBILITY_HIDDEN);
+}
+
+#define DEF0(B, N) \
+  { "__aarch64_" #B #N "_relax", \
+"__aarch64_" #B #N "_acq", \
+"__aarch64_" #B #N "_rel", \
+"__aarch64_" #B #N "_acq_rel" }
+
+#define DEF4(B)  DEF0(B, 1), DEF0(B, 2), DEF0(B, 4), DEF0(B, 8), \
+{ NULL, NULL, NULL, NULL }
+#define DEF5(B)  DEF0(B, 1), DEF0(B, 2), DEF0(B, 4), DEF0(B, 8), DEF0(B, 16)
+
+static const atomic_ool_names aarch64_ool_cas_names = { { DEF5(cas) } };
+const atomic_ool_names aarch64_ool_swp_names = { { DEF4(swp) } };
+const atomic_ool_names aarch64_ool_ldadd_names = { { DEF4(ldadd) } };
+const atomic_ool_names aarch64_ool_ldset_names = { { DEF4(ldset) } };
+const atomic_ool_names aarch64_ool_ldclr_names = { { DEF4(ldclr) } };
+const atomic_ool_names aarch64_ool_ldeor

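For illustration only (not part of the patch; the function name is hypothetical): following the 5x4 name table above, an acquire exchange on a 32-bit object resolves to the __aarch64_swp4_acq helper when outline atomics are enabled.

#include <stdatomic.h>

/* Illustrative sketch: SImode plus MEMMODEL_ACQUIRE indexes the swp name
   table as "__aarch64_swp4_acq".  */
unsigned int
exchange_acquire (atomic_uint *p, unsigned int v)
{
  return atomic_exchange_explicit (p, v, memory_order_acquire);
}
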
[PATCH 11/19][GCC-8] Add visibility to libfunc constructors

2020-04-16 Thread Andre Vieira (lists)

2020-04-16  Andre Vieira 

    Backport from mainline.
    2018-10-31  Richard Henderson 

    * optabs-libfuncs.c (build_libfunc_function_visibility):
    New, split out from...
    (build_libfunc_function): ... here.
    (init_one_libfunc_visibility): New, split out from ...
    (init_one_libfunc): ... here.

diff --git a/gcc/optabs-libfuncs.h b/gcc/optabs-libfuncs.h
index 
0669ea1fdd7dc666d28fc0407a2288de86b3918b..cf39da36887516193aa789446ef0b6a7c24fb1ef
 100644
--- a/gcc/optabs-libfuncs.h
+++ b/gcc/optabs-libfuncs.h
@@ -63,7 +63,9 @@ void gen_satfract_conv_libfunc (convert_optab, const char *,
 void gen_satfractuns_conv_libfunc (convert_optab, const char *,
   machine_mode, machine_mode);
 
+tree build_libfunc_function_visibility (const char *, symbol_visibility);
 tree build_libfunc_function (const char *);
+rtx init_one_libfunc_visibility (const char *, symbol_visibility);
 rtx init_one_libfunc (const char *);
 rtx set_user_assembler_libfunc (const char *, const char *);
 
diff --git a/gcc/optabs-libfuncs.c b/gcc/optabs-libfuncs.c
index 
bd0df8baa3711febcbdf2745588d5d43519af72b..73a28e9ca7a1e5b1564861071e0923d8b8219d25
 100644
--- a/gcc/optabs-libfuncs.c
+++ b/gcc/optabs-libfuncs.c
@@ -719,10 +719,10 @@ struct libfunc_decl_hasher : ggc_ptr_hash
 /* A table of previously-created libfuncs, hashed by name.  */
 static GTY (()) hash_table *libfunc_decls;
 
-/* Build a decl for a libfunc named NAME.  */
+/* Build a decl for a libfunc named NAME with visibility VIS.  */
 
 tree
-build_libfunc_function (const char *name)
+build_libfunc_function_visibility (const char *name, symbol_visibility vis)
 {
   /* ??? We don't have any type information; pretend this is "int foo ()".  */
   tree decl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL,
@@ -731,7 +731,7 @@ build_libfunc_function (const char *name)
   DECL_EXTERNAL (decl) = 1;
   TREE_PUBLIC (decl) = 1;
   DECL_ARTIFICIAL (decl) = 1;
-  DECL_VISIBILITY (decl) = VISIBILITY_DEFAULT;
+  DECL_VISIBILITY (decl) = vis;
   DECL_VISIBILITY_SPECIFIED (decl) = 1;
   gcc_assert (DECL_ASSEMBLER_NAME (decl));
 
@@ -742,11 +742,19 @@ build_libfunc_function (const char *name)
   return decl;
 }
 
+/* Build a decl for a libfunc named NAME.  */
+
+tree
+build_libfunc_function (const char *name)
+{
+  return build_libfunc_function_visibility (name, VISIBILITY_DEFAULT);
+}
+
 /* Return a libfunc for NAME, creating one if we don't already have one.
-   The returned rtx is a SYMBOL_REF.  */
+   The decl is given visibility VIS.  The returned rtx is a SYMBOL_REF.  */
 
 rtx
-init_one_libfunc (const char *name)
+init_one_libfunc_visibility (const char *name, symbol_visibility vis)
 {
   tree id, decl;
   hashval_t hash;
@@ -763,12 +771,18 @@ init_one_libfunc (const char *name)
 {
   /* Create a new decl, so that it can be passed to
 targetm.encode_section_info.  */
-  decl = build_libfunc_function (name);
+  decl = build_libfunc_function_visibility (name, vis);
   *slot = decl;
 }
   return XEXP (DECL_RTL (decl), 0);
 }
 
+rtx
+init_one_libfunc (const char *name)
+{
+  return init_one_libfunc_visibility (name, VISIBILITY_DEFAULT);
+}
+
 /* Adjust the assembler name of libfunc NAME to ASMSPEC.  */
 
 rtx


[PATCH 2/19][GCC-8] aarch64: Simplify LSE cas generation

2020-04-16 Thread Andre Vieira (lists)

The cas insn is a single insn, and if expanded properly need not
be split after reload.  Use the proper inputs for the insn.
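
For illustration only (not part of the patch, hypothetical function name): with -march=armv8-a+lse, a compare-and-swap such as the one below can now be emitted as a single CAS instruction at expand time instead of being split after reload.

#include <stdatomic.h>

/* Illustrative sketch: with LSE this is a single CAS plus a compare of
   the returned value against the expected one.  */
_Bool
cas_int (atomic_int *p, int expected, int desired)
{
  return atomic_compare_exchange_strong (p, &expected, desired);
}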

2020-04-16  Andre Vieira 

    Backport from mainline.
    2018-10-31  Richard Henderson 

    * config/aarch64/aarch64.c (aarch64_expand_compare_and_swap):
    Force oldval into the rval register for TARGET_LSE; emit the compare
    during initial expansion so that it may be deleted if unused.
    (aarch64_gen_atomic_cas): Remove.
    * config/aarch64/atomics.md (aarch64_compare_and_swap_lse):
    Change = to +r for operand 0; use match_dup for operand 2;
    remove is_weak and mod_f operands as unused.  Drop the split
    and merge with...
    (aarch64_atomic_cas): ... this pattern's output; remove.
    (aarch64_compare_and_swap_lse): Similarly.
    (aarch64_atomic_cas): Similarly.

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
cda2895d28e7496f8fd6c1b365c4bb497b54c323..a03565c3b4e13990dc1a0064f9cbbc38bb109795
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -496,7 +496,6 @@ rtx aarch64_load_tp (rtx);
 
 void aarch64_expand_compare_and_swap (rtx op[]);
 void aarch64_split_compare_and_swap (rtx op[]);
-void aarch64_gen_atomic_cas (rtx, rtx, rtx, rtx, rtx);
 
 bool aarch64_atomic_ldop_supported_p (enum rtx_code);
 void aarch64_gen_atomic_ldop (enum rtx_code, rtx, rtx, rtx, rtx, rtx);
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
20761578fb6051e600299cd58f245774bd457432..c83a9f7ae78d4ed3da6636fce4d1f57c27048756
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14169,17 +14169,19 @@ aarch64_expand_compare_and_swap (rtx operands[])
 {
   rtx bval, rval, mem, oldval, newval, is_weak, mod_s, mod_f, x;
   machine_mode mode, cmp_mode;
-  typedef rtx (*gen_cas_fn) (rtx, rtx, rtx, rtx, rtx, rtx, rtx);
+  typedef rtx (*gen_split_cas_fn) (rtx, rtx, rtx, rtx, rtx, rtx, rtx);
+  typedef rtx (*gen_atomic_cas_fn) (rtx, rtx, rtx, rtx);
   int idx;
-  gen_cas_fn gen;
-  const gen_cas_fn split_cas[] =
+  gen_split_cas_fn split_gen;
+  gen_atomic_cas_fn atomic_gen;
+  const gen_split_cas_fn split_cas[] =
   {
 gen_aarch64_compare_and_swapqi,
 gen_aarch64_compare_and_swaphi,
 gen_aarch64_compare_and_swapsi,
 gen_aarch64_compare_and_swapdi
   };
-  const gen_cas_fn atomic_cas[] =
+  const gen_atomic_cas_fn atomic_cas[] =
   {
 gen_aarch64_compare_and_swapqi_lse,
 gen_aarch64_compare_and_swaphi_lse,
@@ -14238,14 +14240,29 @@ aarch64_expand_compare_and_swap (rtx operands[])
   gcc_unreachable ();
 }
   if (TARGET_LSE)
-gen = atomic_cas[idx];
+{
+  atomic_gen = atomic_cas[idx];
+  /* The CAS insn requires oldval and rval overlap, but we need to
+have a copy of oldval saved across the operation to tell if
+the operation is successful.  */
+  if (mode == QImode || mode == HImode)
+   rval = copy_to_mode_reg (SImode, gen_lowpart (SImode, oldval));
+  else if (reg_overlap_mentioned_p (rval, oldval))
+rval = copy_to_mode_reg (mode, oldval);
+  else
+   emit_move_insn (rval, oldval);
+  emit_insn (atomic_gen (rval, mem, newval, mod_s));
+  aarch64_gen_compare_reg (EQ, rval, oldval);
+}
   else
-gen = split_cas[idx];
-
-  emit_insn (gen (rval, mem, oldval, newval, is_weak, mod_s, mod_f));
+{
+  split_gen = split_cas[idx];
+  emit_insn (split_gen (rval, mem, oldval, newval, is_weak, mod_s, mod_f));
+}
 
   if (mode == QImode || mode == HImode)
-emit_move_insn (operands[1], gen_lowpart (mode, rval));
+rval = gen_lowpart (mode, rval);
+  emit_move_insn (operands[1], rval);
 
   x = gen_rtx_REG (CCmode, CC_REGNUM);
   x = gen_rtx_EQ (SImode, x, const0_rtx);
@@ -14295,42 +14312,6 @@ aarch64_emit_post_barrier (enum memmodel model)
 }
 }
 
-/* Emit an atomic compare-and-swap operation.  RVAL is the destination register
-   for the data in memory.  EXPECTED is the value expected to be in memory.
-   DESIRED is the value to store to memory.  MEM is the memory location.  MODEL
-   is the memory ordering to use.  */
-
-void
-aarch64_gen_atomic_cas (rtx rval, rtx mem,
-   rtx expected, rtx desired,
-   rtx model)
-{
-  rtx (*gen) (rtx, rtx, rtx, rtx);
-  machine_mode mode;
-
-  mode = GET_MODE (mem);
-
-  switch (mode)
-{
-case E_QImode: gen = gen_aarch64_atomic_casqi; break;
-case E_HImode: gen = gen_aarch64_atomic_cashi; break;
-case E_SImode: gen = gen_aarch64_atomic_cassi; break;
-case E_DImode: gen = gen_aarch64_atomic_casdi; break;
-default:
-  gcc_unreachable ();
-}
-
-  /* Move the expected value into the CAS destination register.  */
-  emit_insn (gen_rtx_SET (rval, expected));
-
-  /* Emit the CAS.  */
-  emit_insn (gen (rval, mem, desired, model));
-
-  /* Compare the expected value with the value loaded by the CAS, to establish
- whether the swap

[PATCH 6/19][GCC-8] aarch64: Remove early clobber from ATOMIC_LDOP scratch

2020-04-16 Thread Andre Vieira (lists)

2020-04-16  Andre Vieira 

    Backport from mainline.
    2018-10-31  Richard Henderson 

    * config/aarch64/atomics.md (aarch64_atomic__lse):
    scratch register need not be early-clobber.  Document the reason
    why we cannot use ST.

diff --git a/gcc/config/aarch64/atomics.md b/gcc/config/aarch64/atomics.md
index 
47a8a40c5b82e349b2caf4e48f9f81577f4c3ed3..d740f4a100b1b624eafdb279f38ac1ce9db587dd
 100644
--- a/gcc/config/aarch64/atomics.md
+++ b/gcc/config/aarch64/atomics.md
@@ -263,6 +263,18 @@
   }
 )
 
+;; It is tempting to want to use ST for relaxed and release
+;; memory models here.  However, that is incompatible with the
+;; C++ memory model for the following case:
+;;
+;; atomic_fetch_add(ptr, 1, memory_order_relaxed);
+;; atomic_thread_fence(memory_order_acquire);
+;;
+;; The problem is that the architecture says that ST (and LD
+;; insns where the destination is XZR) are not regarded as a read.
+;; However we also implement the acquire memory barrier with DMB LD,
+;; and so the ST is not blocked by the barrier.
+
 (define_insn "aarch64_atomic__lse"
   [(set (match_operand:ALLI 0 "aarch64_sync_memory_operand" "+Q")
(unspec_volatile:ALLI
@@ -270,7 +282,7 @@
   (match_operand:ALLI 1 "register_operand" "r")
   (match_operand:SI 2 "const_int_operand")]
   ATOMIC_LDOP))
-   (clobber (match_scratch:ALLI 3 "=&r"))]
+   (clobber (match_scratch:ALLI 3 "=r"))]
   "TARGET_LSE"
   {
enum memmodel model = memmodel_from_int (INTVAL (operands[2]));

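For illustration only (not part of the patch, hypothetical function name): the C++ memory model case described in the new comment above looks like this in C11; if the fetch-add were emitted as an ST form (not a read), the following acquire fence, implemented with DMB LD, would not order it.

#include <stdatomic.h>

/* Illustrative sketch of the problematic pattern: a relaxed RMW followed
   by an acquire fence, which must still order the RMW.  */
void
bump_then_acquire (atomic_int *p)
{
  atomic_fetch_add_explicit (p, 1, memory_order_relaxed);
  atomic_thread_fence (memory_order_acquire);
}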

[PATCH 4/19][GCC-8] aarch64: Improve swp generation

2020-04-16 Thread Andre Vieira (lists)

Allow zero as an input; fix constraints; avoid unnecessary split.
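
For illustration only (not part of the patch, hypothetical function name): allowing zero as the input means an exchange of the constant 0, as below, can use XZR directly instead of first forcing 0 into a register.

#include <stdatomic.h>

/* Illustrative sketch: exchanging in zero, e.g. reading and clearing a
   flag in one atomic operation.  */
unsigned int
clear_and_read (atomic_uint *p)
{
  return atomic_exchange_explicit (p, 0, memory_order_acquire);
}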

2020-04-16  Andre Vieira 

    Backport from mainline.
    2018-10-31  Richard Henderson 

    * config/aarch64/aarch64.c (aarch64_emit_atomic_swap): Remove.
    (aarch64_gen_atomic_ldop): Don't call it.
    * config/aarch64/atomics.md (atomic_exchange):
    Use aarch64_reg_or_zero.
    (aarch64_atomic_exchange): Likewise.
    (aarch64_atomic_exchange_lse): Remove split; remove & from
    operand 0; use aarch64_reg_or_zero for input; merge ...
    (aarch64_atomic_swp): ... this and remove.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
b6a6e314153ecf4a7ae1b83cfb64e6192197edc5..bac69474598ff19161b72748505151b0d6185a9b
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14454,27 +14454,6 @@ aarch64_emit_bic (machine_mode mode, rtx dst, rtx s1, 
rtx s2, int shift)
   emit_insn (gen (dst, s2, shift_rtx, s1));
 }
 
-/* Emit an atomic swap.  */
-
-static void
-aarch64_emit_atomic_swap (machine_mode mode, rtx dst, rtx value,
- rtx mem, rtx model)
-{
-  rtx (*gen) (rtx, rtx, rtx, rtx);
-
-  switch (mode)
-{
-case E_QImode: gen = gen_aarch64_atomic_swpqi; break;
-case E_HImode: gen = gen_aarch64_atomic_swphi; break;
-case E_SImode: gen = gen_aarch64_atomic_swpsi; break;
-case E_DImode: gen = gen_aarch64_atomic_swpdi; break;
-default:
-  gcc_unreachable ();
-}
-
-  emit_insn (gen (dst, mem, value, model));
-}
-
 /* Operations supported by aarch64_emit_atomic_load_op.  */
 
 enum aarch64_atomic_load_op_code
@@ -14587,10 +14566,6 @@ aarch64_gen_atomic_ldop (enum rtx_code code, rtx 
out_data, rtx out_result,
  a SET then emit a swap instruction and finish.  */
   switch (code)
 {
-case SET:
-  aarch64_emit_atomic_swap (mode, out_data, src, mem, model_rtx);
-  return;
-
 case MINUS:
   /* Negate the value and treat it as a PLUS.  */
   {
diff --git a/gcc/config/aarch64/atomics.md b/gcc/config/aarch64/atomics.md
index 
b0e84b8addd809598b3e358a265b86582ce96462..6cc14fbf6c103ab19e6c201333a9eba06b90c469
 100644
--- a/gcc/config/aarch64/atomics.md
+++ b/gcc/config/aarch64/atomics.md
@@ -136,7 +136,7 @@
 (define_expand "atomic_exchange"
  [(match_operand:ALLI 0 "register_operand" "")
   (match_operand:ALLI 1 "aarch64_sync_memory_operand" "")
-  (match_operand:ALLI 2 "register_operand" "")
+  (match_operand:ALLI 2 "aarch64_reg_or_zero" "")
   (match_operand:SI 3 "const_int_operand" "")]
   ""
   {
@@ -156,10 +156,10 @@
 
 (define_insn_and_split "aarch64_atomic_exchange"
   [(set (match_operand:ALLI 0 "register_operand" "=");; 
output
-(match_operand:ALLI 1 "aarch64_sync_memory_operand" "+Q")) ;; memory
+(match_operand:ALLI 1 "aarch64_sync_memory_operand" "+Q")) ;; memory
(set (match_dup 1)
 (unspec_volatile:ALLI
-  [(match_operand:ALLI 2 "register_operand" "r")   ;; input
+  [(match_operand:ALLI 2 "aarch64_reg_or_zero" "rZ")   ;; input
(match_operand:SI 3 "const_int_operand" "")];; model
   UNSPECV_ATOMIC_EXCHG))
(clobber (reg:CC CC_REGNUM))
@@ -175,22 +175,25 @@
   }
 )
 
-(define_insn_and_split "aarch64_atomic_exchange_lse"
-  [(set (match_operand:ALLI 0 "register_operand" "=&r")
+(define_insn "aarch64_atomic_exchange_lse"
+  [(set (match_operand:ALLI 0 "register_operand" "=r")
 (match_operand:ALLI 1 "aarch64_sync_memory_operand" "+Q"))
(set (match_dup 1)
 (unspec_volatile:ALLI
-  [(match_operand:ALLI 2 "register_operand" "r")
+  [(match_operand:ALLI 2 "aarch64_reg_or_zero" "rZ")
(match_operand:SI 3 "const_int_operand" "")]
   UNSPECV_ATOMIC_EXCHG))]
   "TARGET_LSE"
-  "#"
-  "&& reload_completed"
-  [(const_int 0)]
   {
-aarch64_gen_atomic_ldop (SET, operands[0], NULL, operands[1],
-operands[2], operands[3]);
-DONE;
+enum memmodel model = memmodel_from_int (INTVAL (operands[3]));
+if (is_mm_relaxed (model))
+  return "swp\t%2, %0, %1";
+else if (is_mm_acquire (model) || is_mm_consume (model))
+  return "swpa\t%2, %0, %1";
+else if (is_mm_release (model))
+  return "swpl\t%2, %0, %1";
+else
+  return "swpal\t%2, %0, %1";
   }
 )
 
@@ -582,28 +585,6 @@
 
 ;; ARMv8.1-A LSE instructions.
 
-;; Atomic swap with memory.
-(define_insn "aarch64_atomic_swp"
- [(set (match_operand:ALLI 0 "register_operand" "+")
-   (match_operand:ALLI 1 "aarch

[PATCH 5/19][GCC-8] aarch64: Improve atomic-op lse generation

2020-04-16 Thread Andre Vieira (lists)

Fix constraints; avoid unnecessary split.  Drop the use of the atomic_op
iterator in favor of the ATOMIC_LDOP iterator; this is simpler and more
logical for ldclr aka bic.
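
For illustration only (not part of the patch, hypothetical function name): indexing on the LD instruction fits LDCLR naturally, since an atomic fetch-and with an inverted mask is exactly a bit-clear.

#include <stdatomic.h>

/* Illustrative sketch: fetch-and with ~mask is the LDCLR (bic) case.  */
unsigned int
clear_bits (atomic_uint *p, unsigned int mask)
{
  return atomic_fetch_and_explicit (p, ~mask, memory_order_relaxed);
}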

2020-04-16  Andre Vieira 

    Backport from mainline.
    2018-10-31  Richard Henderson 

    * config/aarch64/aarch64.c (aarch64_emit_bic): Remove.
    (aarch64_atomic_ldop_supported_p): Remove.
    (aarch64_gen_atomic_ldop): Remove.
    * config/aarch64/atomic.md (atomic_):
    Fully expand LSE operations here.
    (atomic_fetch_): Likewise.
    (atomic__fetch): Likewise.
    (aarch64_atomic__lse): Drop atomic_op iterator
    and use ATOMIC_LDOP instead; use register_operand for the input;
    drop the split and emit insns directly.
    (aarch64_atomic_fetch__lse): Likewise.
    (aarch64_atomic__fetch_lse): Remove.
    (aarch64_atomic_load): Remove.

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
a03565c3b4e13990dc1a0064f9cbbc38bb109795..da68ce0e7d096bf4a512c2b8ef52bf236f8f76f4
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -497,8 +497,6 @@ rtx aarch64_load_tp (rtx);
 void aarch64_expand_compare_and_swap (rtx op[]);
 void aarch64_split_compare_and_swap (rtx op[]);
 
-bool aarch64_atomic_ldop_supported_p (enum rtx_code);
-void aarch64_gen_atomic_ldop (enum rtx_code, rtx, rtx, rtx, rtx, rtx);
 void aarch64_split_atomic_op (enum rtx_code, rtx, rtx, rtx, rtx, rtx, rtx);
 
 bool aarch64_gen_adjusted_ldpstp (rtx *, bool, scalar_mode, RTX_CODE);
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
bac69474598ff19161b72748505151b0d6185a9b..1068cfd899a759c506e3217e1e2c19cd778b4372
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14292,32 +14292,6 @@ aarch64_expand_compare_and_swap (rtx operands[])
   emit_insn (gen_rtx_SET (bval, x));
 }
 
-/* Test whether the target supports using a atomic load-operate instruction.
-   CODE is the operation and AFTER is TRUE if the data in memory after the
-   operation should be returned and FALSE if the data before the operation
-   should be returned.  Returns FALSE if the operation isn't supported by the
-   architecture.  */
-
-bool
-aarch64_atomic_ldop_supported_p (enum rtx_code code)
-{
-  if (!TARGET_LSE)
-return false;
-
-  switch (code)
-{
-case SET:
-case AND:
-case IOR:
-case XOR:
-case MINUS:
-case PLUS:
-  return true;
-default:
-  return false;
-}
-}
-
 /* Emit a barrier, that is appropriate for memory model MODEL, at the end of a
sequence implementing an atomic operation.  */
 
@@ -14435,227 +14409,6 @@ aarch64_split_compare_and_swap (rtx operands[])
 aarch64_emit_post_barrier (model);
 }
 
-/* Emit a BIC instruction.  */
-
-static void
-aarch64_emit_bic (machine_mode mode, rtx dst, rtx s1, rtx s2, int shift)
-{
-  rtx shift_rtx = GEN_INT (shift);
-  rtx (*gen) (rtx, rtx, rtx, rtx);
-
-  switch (mode)
-{
-case E_SImode: gen = gen_and_one_cmpl_lshrsi3; break;
-case E_DImode: gen = gen_and_one_cmpl_lshrdi3; break;
-default:
-  gcc_unreachable ();
-}
-
-  emit_insn (gen (dst, s2, shift_rtx, s1));
-}
-
-/* Operations supported by aarch64_emit_atomic_load_op.  */
-
-enum aarch64_atomic_load_op_code
-{
-  AARCH64_LDOP_PLUS,   /* A + B  */
-  AARCH64_LDOP_XOR,/* A ^ B  */
-  AARCH64_LDOP_OR, /* A | B  */
-  AARCH64_LDOP_BIC /* A & ~B  */
-};
-
-/* Emit an atomic load-operate.  */
-
-static void
-aarch64_emit_atomic_load_op (enum aarch64_atomic_load_op_code code,
-machine_mode mode, rtx dst, rtx src,
-rtx mem, rtx model)
-{
-  typedef rtx (*aarch64_atomic_load_op_fn) (rtx, rtx, rtx, rtx);
-  const aarch64_atomic_load_op_fn plus[] =
-  {
-gen_aarch64_atomic_loadaddqi,
-gen_aarch64_atomic_loadaddhi,
-gen_aarch64_atomic_loadaddsi,
-gen_aarch64_atomic_loadadddi
-  };
-  const aarch64_atomic_load_op_fn eor[] =
-  {
-gen_aarch64_atomic_loadeorqi,
-gen_aarch64_atomic_loadeorhi,
-gen_aarch64_atomic_loadeorsi,
-gen_aarch64_atomic_loadeordi
-  };
-  const aarch64_atomic_load_op_fn ior[] =
-  {
-gen_aarch64_atomic_loadsetqi,
-gen_aarch64_atomic_loadsethi,
-gen_aarch64_atomic_loadsetsi,
-gen_aarch64_atomic_loadsetdi
-  };
-  const aarch64_atomic_load_op_fn bic[] =
-  {
-gen_aarch64_atomic_loadclrqi,
-gen_aarch64_atomic_loadclrhi,
-gen_aarch64_atomic_loadclrsi,
-gen_aarch64_atomic_loadclrdi
-  };
-  aarch64_atomic_load_op_fn gen;
-  int idx = 0;
-
-  switch (mode)
-{
-case E_QImode: idx = 0; break;
-case E_HImode: idx = 1; break;
-case E_SImode: idx = 2; break;
-case E_DImode: idx = 3; break;
-default:
-  gcc_unreachable ();
-}
-
-  switch (code)
-{
-case AARCH64_LDOP_PLUS: gen = plus[idx]; break;
-case AARCH64_LDOP_XOR: gen = eor[idx]; break;
-case AARCH64_LDOP_OR: gen = ior[idx]; break;
-

[PATCH 3/19] aarch64: Improve cas generation

2020-04-16 Thread Andre Vieira (lists)

Do not zero-extend the input to the cas for subword operations;
instead, use the appropriate zero-extending compare insns.
Correct the predicates and constraints for immediate expected operand.
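
For illustration only (not part of the patch, hypothetical function name): a subword compare-and-swap such as the one below is where the expected value used to be zero-extended up front; the comparison of the loaded result is now done with a zero-extending compare instead.

#include <stdatomic.h>

/* Illustrative sketch: a 16-bit CAS; the expected value is no longer
   widened before the operation.  */
_Bool
cas_u16 (atomic_ushort *p, unsigned short expected, unsigned short desired)
{
  return atomic_compare_exchange_strong (p, &expected, desired);
}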

2020-04-16  Andre Vieira 

    Backport from mainline.
    2018-10-31  Richard Henderson 

    * config/aarch64/aarch64.c (aarch64_gen_compare_reg_maybe_ze): New.
    (aarch64_split_compare_and_swap): Use it.
    (aarch64_expand_compare_and_swap): Likewise.  Remove convert_modes;
    test oldval against the proper predicate.
    * config/aarch64/atomics.md (atomic_compare_and_swap):
    Use nonmemory_operand for expected.
    (cas_short_expected_pred): New.
    (aarch64_compare_and_swap): Use it; use "rn" not "rI" to match.
    (aarch64_compare_and_swap): Use "rn" not "rI" for expected.
    * config/aarch64/predicates.md (aarch64_plushi_immediate): New.
    (aarch64_plushi_operand): New.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
c83a9f7ae78d4ed3da6636fce4d1f57c27048756..b6a6e314153ecf4a7ae1b83cfb64e6192197edc5
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1524,6 +1524,33 @@ aarch64_gen_compare_reg (RTX_CODE code, rtx x, rtx y)
   return cc_reg;
 }
 
+/* Similarly, but maybe zero-extend Y if Y_MODE < SImode.  */
+
+static rtx
+aarch64_gen_compare_reg_maybe_ze (RTX_CODE code, rtx x, rtx y,
+  machine_mode y_mode)
+{
+  if (y_mode == E_QImode || y_mode == E_HImode)
+{
+  if (CONST_INT_P (y))
+   y = GEN_INT (INTVAL (y) & GET_MODE_MASK (y_mode));
+  else
+   {
+ rtx t, cc_reg;
+ machine_mode cc_mode;
+
+ t = gen_rtx_ZERO_EXTEND (SImode, y);
+ t = gen_rtx_COMPARE (CC_SWPmode, t, x);
+ cc_mode = CC_SWPmode;
+ cc_reg = gen_rtx_REG (cc_mode, CC_REGNUM);
+ emit_set_insn (cc_reg, t);
+ return cc_reg;
+   }
+}
+
+  return aarch64_gen_compare_reg (code, x, y);
+}
+
 /* Build the SYMBOL_REF for __tls_get_addr.  */
 
 static GTY(()) rtx tls_get_addr_libfunc;
@@ -14167,20 +14194,11 @@ aarch64_emit_unlikely_jump (rtx insn)
 void
 aarch64_expand_compare_and_swap (rtx operands[])
 {
-  rtx bval, rval, mem, oldval, newval, is_weak, mod_s, mod_f, x;
-  machine_mode mode, cmp_mode;
-  typedef rtx (*gen_split_cas_fn) (rtx, rtx, rtx, rtx, rtx, rtx, rtx);
+  rtx bval, rval, mem, oldval, newval, is_weak, mod_s, mod_f, x, cc_reg;
+  machine_mode mode, r_mode;
   typedef rtx (*gen_atomic_cas_fn) (rtx, rtx, rtx, rtx);
   int idx;
-  gen_split_cas_fn split_gen;
   gen_atomic_cas_fn atomic_gen;
-  const gen_split_cas_fn split_cas[] =
-  {
-gen_aarch64_compare_and_swapqi,
-gen_aarch64_compare_and_swaphi,
-gen_aarch64_compare_and_swapsi,
-gen_aarch64_compare_and_swapdi
-  };
   const gen_atomic_cas_fn atomic_cas[] =
   {
 gen_aarch64_compare_and_swapqi_lse,
@@ -14198,36 +14216,19 @@ aarch64_expand_compare_and_swap (rtx operands[])
   mod_s = operands[6];
   mod_f = operands[7];
   mode = GET_MODE (mem);
-  cmp_mode = mode;
 
   /* Normally the succ memory model must be stronger than fail, but in the
  unlikely event of fail being ACQUIRE and succ being RELEASE we need to
  promote succ to ACQ_REL so that we don't lose the acquire semantics.  */
-
   if (is_mm_acquire (memmodel_from_int (INTVAL (mod_f)))
   && is_mm_release (memmodel_from_int (INTVAL (mod_s
 mod_s = GEN_INT (MEMMODEL_ACQ_REL);
 
-  switch (mode)
+  r_mode = mode;
+  if (mode == QImode || mode == HImode)
 {
-case E_QImode:
-case E_HImode:
-  /* For short modes, we're going to perform the comparison in SImode,
-so do the zero-extension now.  */
-  cmp_mode = SImode;
-  rval = gen_reg_rtx (SImode);
-  oldval = convert_modes (SImode, mode, oldval, true);
-  /* Fall through.  */
-
-case E_SImode:
-case E_DImode:
-  /* Force the value into a register if needed.  */
-  if (!aarch64_plus_operand (oldval, mode))
-   oldval = force_reg (cmp_mode, oldval);
-  break;
-
-default:
-  gcc_unreachable ();
+  r_mode = SImode;
+  rval = gen_reg_rtx (r_mode);
 }
 
   switch (mode)
@@ -14245,27 +14246,49 @@ aarch64_expand_compare_and_swap (rtx operands[])
   /* The CAS insn requires oldval and rval overlap, but we need to
 have a copy of oldval saved across the operation to tell if
 the operation is successful.  */
-  if (mode == QImode || mode == HImode)
-   rval = copy_to_mode_reg (SImode, gen_lowpart (SImode, oldval));
-  else if (reg_overlap_mentioned_p (rval, oldval))
-rval = copy_to_mode_reg (mode, oldval);
+  if (reg_overlap_mentioned_p (rval, oldval))
+rval = copy_to_mode_reg (r_mode, oldval);
   else
-   emit_move_insn (rval, oldval);
+   emit_move_insn (rval, gen_lowpart (r_mode, oldval));
+
   emit_insn (atomic_gen (rval, mem, newval, mo

[PATCH 1/19][GCC-8] aarch64: Fix up aarch64_compare_and_swaphi pattern [PR94368]

2020-04-16 Thread Andre Vieira (lists)

gcc/ChangeLog:
2020-04-16  Andre Vieira 

    Backport from mainline.
    2018-07-16  Ramana Radhakrishnan 

    * config/aarch64/atomics.md (aarch64_store_exclusive): Add
    early clobber.

diff --git a/gcc/config/aarch64/atomics.md b/gcc/config/aarch64/atomics.md
index 
686e39ff2ee5940e9e93d0c2b802b46ff9f2c4e4..fba5ec6db5832a184b0323e62041f9c473761bae
 100644
--- a/gcc/config/aarch64/atomics.md
+++ b/gcc/config/aarch64/atomics.md
@@ -530,7 +530,7 @@
 )
 
 (define_insn "aarch64_store_exclusive"
-  [(set (match_operand:SI 0 "register_operand" "=r")
+  [(set (match_operand:SI 0 "register_operand" "=&r")
 (unspec_volatile:SI [(const_int 0)] UNSPECV_SX))
(set (match_operand:ALLI 1 "aarch64_sync_memory_operand" "=Q")
 (unspec_volatile:ALLI


[PATCH 0/19][GCC-8] aarch64: Backport outline atomics

2020-04-16 Thread Andre Vieira (lists)

Hi,

This series backports all the patches and fixes regarding outline 
atomics to the gcc-8 branch.


Bootstrapped the series for aarch64-linux-gnu and regression tested.
Is this OK for gcc-8?

Andre Vieira (19):
aarch64: Add early clobber for aarch64_store_exclusive
aarch64: Simplify LSE cas generation
aarch64: Improve cas generation
aarch64: Improve swp generation
aarch64: Improve atomic-op lse generation
aarch64: Remove early clobber from ATOMIC_LDOP scratch
aarch64: Extend %R for integer registers
aarch64: Implement TImode compare-and-swap
aarch64: Tidy aarch64_split_compare_and_swap
aarch64: Add out-of-line functions for LSE atomics
Add visibility to libfunc constructors
aarch64: Implement -moutline-atomics
Aarch64: Fix shrinkwrapping interactions with atomics (PR92692)
aarch64: Fix store-exclusive in load-operate LSE helpers
aarch64: Configure for sys/auxv.h in libgcc for lse-init.c
aarch64: Fix up aarch64_compare_and_swaphi pattern [PR94368]
aarch64: Fix bootstrap with old binutils [PR93053]
aarch64: Fix ICE due to aarch64_gen_compare_reg_maybe_ze [PR94435]
re PR target/90724 (ICE with __sync_bool_compare_and_swap with 
-march=armv8.2-a+sve)




[PATCH][GCC][Arm]: MVE: Add mve vec_duplicate pattern

2020-04-15 Thread Andre Vieira (lists)

Hi,

This patch fixes an ICE we were seeing due to a missing vec_duplicate 
pattern.


Regression tested on arm-none-eabi.

Is this OK for trunk?

gcc/ChangeLog:
2020-04-15  Andre Vieira  

    * config/arm/mve.md (mve_vec_duplicate): New pattern.
    (V_sz_elem2): Remove unused mode attribute.

gcc/testsuite/ChangeLog:
2020-04-15  Andre Vieira 
    Srinath Parvathaneni 

    * gcc.target/arm/mve/intrinsics/mve_vec_duplicate.c: New test.

diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 
c49c14c4240838ce086f424f58726e2e94cf190e..047b4769a28daebdc0175804c578a0d11830a291
 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -17,8 +17,6 @@
 ;; along with GCC; see the file COPYING3.  If not see
 ;; <http://www.gnu.org/licenses/>.
 
-(define_mode_attr V_sz_elem2 [(V16QI "s8") (V8HI "u16") (V4SI "u32")
- (V2DI "u64")])
 (define_mode_iterator MVE_types [V16QI V8HI V4SI V2DI TI V8HF V4SF V2DF])
 (define_mode_iterator MVE_VLD_ST [V16QI V8HI V4SI V8HF V4SF])
 (define_mode_iterator MVE_0 [V8HF V4SF])
@@ -11301,3 +11299,10 @@ (define_insn "mve_vshlcq_m_"
  "vpst\;vshlct\t%q0, %1, %4"
  [(set_attr "type" "mve_move")
   (set_attr "length" "8")])
+
+(define_insn "*mve_vec_duplicate"
+ [(set (match_operand:MVE_VLD_ST 0 "s_register_operand" "=w")
+   (vec_duplicate:MVE_VLD_ST (match_operand: 1 "general_operand" 
"r")))]
+ "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
+ "vdup.\t%q0, %1"
+ [(set_attr "type" "mve_move")])
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vec_duplicate.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vec_duplicate.c
new file mode 100644
index 
..eda836151b3a16eb54ddebabf185be3cd8980acc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vec_duplicate.c
@@ -0,0 +1,13 @@
+/* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
+/* { dg-add-options arm_v8_1m_mve_fp } */
+/* { dg-additional-options "-O2" } */
+
+#include "arm_mve.h"
+
+float32x4_t a;
+
+void foo (void)
+{
+  a = 1.41176471f - 0.47058824f * a;
+}
+


Re: [PATCH 3/5] testsuite: [arm/mve] Fix mve_move_gpr_to_gpr.c

2020-04-14 Thread Andre Vieira (lists)

On 10/04/2020 14:55, Christophe Lyon via Gcc-patches wrote:

This test can pass with a hard-float toolchain, provided we don't
force -mfloat-abi=softfp.

This patch removes this useless option, as well as -save-temps which
is implied by arm_v8_1m_mve_fp.

Hi Christophe,

LGTM, but you need to wait for maintainer approval.

Cheers,
Andre


2020-04-10  Christophe Lyon  

gcc/tesuite/
* gcc.target/arm/mve/intrinsics/mve_move_gpr_to_gpr.c: Remove
useless options.
---
  gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_move_gpr_to_gpr.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_move_gpr_to_gpr.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_move_gpr_to_gpr.c
index 374bc4d..53300e5 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_move_gpr_to_gpr.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_move_gpr_to_gpr.c
@@ -1,6 +1,6 @@
  /* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
  /* { dg-add-options arm_v8_1m_mve_fp } */
-/* { dg-additional-options "-O2 -mfloat-abi=softfp --save-temps" } */
+/* { dg-additional-options "-O2" } */
  
  #include "arm_mve.h"
  


Re: [PATCH 5/5] testsuite: [arm/mve] Include arm_mve.h in arm_v8_1m_mve_ok

2020-04-14 Thread Andre Vieira (lists)



On 10/04/2020 14:55, Christophe Lyon via Gcc-patches wrote:

Since arm_mve.h includes stdint.h, its use requires the presence of
the right gnu/stubs-*.h, so make sure to include it when checking the
arm_v8_1m_mve_ok_nocache effective target, otherwise we can decide MVE
is supported while it's not really. This makes several tests
unsupported rather than fail.

Hi Christophe,

LGTM, but you need to wait for maintainer approval.

Cheers,
Andre


2020-04-10  Christophe Lyon  

gcc/testsuite/
* lib/target-supports.exp
(check_effective_target_arm_v8_1m_mve_ok_nocache): Include
arm_mve.h.
---
  gcc/testsuite/lib/target-supports.exp | 1 +
  1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/lib/target-supports.exp 
b/gcc/testsuite/lib/target-supports.exp
index 6c8dd01..d16498d 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -4965,6 +4965,7 @@ proc check_effective_target_arm_v8_1m_mve_ok_nocache { } {
#if __ARM_BIG_ENDIAN
#error "MVE intrinsics are not supported in Big-Endian mode."
#endif
+   #include <arm_mve.h>
  } "$flags -mthumb"] } {
  set et_arm_v8_1m_mve_flags "$flags -mthumb --save-temps"
  return 1


Re: [PATCH 4/5] testsuite: [arm/mve] Use dg-add-options arm_v8_1m_mve in MVE tests

2020-04-14 Thread Andre Vieira (lists)

On 10/04/2020 14:55, Christophe Lyon via Gcc-patches wrote:

Several ARM/MVE tests can be compiled even if the toolchain does not
support -mfloat-abi=hard (softfp is OK).

Use dg-add-options arm_v8_1m_mve or arm_v8_1m_mve_fp instead of using
dg-additional-options.

Hi Christophe,

I think a bunch of these tests were initially meant to test the hard float 
ABI with vectors, especially in the MVE integer cases; that is what the 
scan-assemblers are meant to check. However, they seem to pass for 
-mfloat-abi=softfp too, which means these tests don't really verify 
anything. I think it would be good to remove the scan-assembler tests 
for now and improve these tests later with run tests, or some smarter 
function body testing.


I suggest we apply your changes and I can follow up with a patch to 
remove the scan-assemblers for now, if a maintainer agrees with me that is.


Cheers,
Andre

2020-04-10  Christophe Lyon  

gcc/testsuite/
* gcc.target/arm/mve/intrinsics/mve_vector_float.c: Use
arm_v8_1m_mve_fp.
* gcc.target/arm/mve/intrinsics/mve_vector_float1.c: Likewise.
* gcc.target/arm/mve/intrinsics/mve_vector_float2.c: Likewise.
* gcc.target/arm/mve/intrinsics/mve_vector_int.c: Use
arm_v8_1m_mve.
* gcc.target/arm/mve/intrinsics/mve_vector_int1.c: Likewise.
* gcc.target/arm/mve/intrinsics/mve_vector_int2.c: Likewise.
* gcc.target/arm/mve/intrinsics/mve_vector_uint.c: Likewise.
* gcc.target/arm/mve/intrinsics/mve_vector_uint1.c: Likewise.
* gcc.target/arm/mve/intrinsics/mve_vector_uint2.c: Likewise.
---
  gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float.c  | 2 +-
  gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float1.c | 2 +-
  gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float2.c | 2 +-
  gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int.c| 2 +-
  gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int1.c   | 2 +-
  gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int2.c   | 2 +-
  gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_uint.c   | 2 +-
  gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_uint1.c  | 2 +-
  gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_uint2.c  | 2 +-
  9 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float.c
index 881157f..6519b81 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float.c
@@ -1,6 +1,6 @@
  /* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
+/* { dg-add-options arm_v8_1m_mve_fp } */
  /* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=soft" } 
{""} } */
-/* { dg-additional-options "-march=armv8.1-m.main+mve.fp -mfpu=auto 
-mfloat-abi=hard -mthumb --save-temps" } */
  
  #include "arm_mve.h"
  
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float1.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float1.c

index 9515ed6..855e3b8 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float1.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float1.c
@@ -1,6 +1,6 @@
  /* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
+/* { dg-add-options arm_v8_1m_mve_fp } */
  /* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=soft" } 
{""} } */
-/* { dg-additional-options "-march=armv8.1-m.main+mve.fp -mfpu=auto 
-mfloat-abi=hard -mthumb --save-temps" } */
  
  #include "arm_mve.h"
  
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float2.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float2.c

index 3ce8ea3..e3cf8f8 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float2.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float2.c
@@ -1,6 +1,6 @@
  /* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
+/* { dg-add-options arm_v8_1m_mve_fp } */
  /* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=soft" } 
{""} } */
-/* { dg-additional-options "-march=armv8.1-m.main+mve.fp -mfpu=auto 
-mfloat-abi=hard -mthumb --save-temps" } */
  
  #include "arm_mve.h"
  
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int.c

index dab0705..e70cbc1 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int.c
@@ -1,6 +1,6 @@
  /* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-add-options arm_v8_1m_mve } */
  /* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=soft" } 
{""} } */
-/* { dg-additional-options "-march=armv8.1-m.main+mve -mfpu=auto -mfloat-abi=hard 
-mthumb --save-temps" } */
  
  #include "arm_mve.h"
  
diff --git 

Re: [PATCH 1/5] testsuite: [arm] Add arm_softfp_ok and arm_hard_ok effective targets.

2020-04-14 Thread Andre Vieira (lists)

On 10/04/2020 14:55, Christophe Lyon via Gcc-patches wrote:

For arm-linux-gnueabi* targets, a toolchain cannot support the
float-abi opposite to the one it has been configured for: since glibc
does not support such multilibs, we end up lacking gnu/stubs-*.h when
including stdint.h for instance.

This patch introduces two new effective targets to detect whether we
can compile tests with -mfloat-abi=softfp or -mfloat-abi=hard.

This enables to make such tests unsupported rather than fail.

Hi Christophe,

LGTM, but you need to wait for maintainer approval.

Cheers,
Andre

2020-04-10  Christophe Lyon  

gcc/testsuite/
* lib/target-supports.exp (arm_softfp_ok): New effective target.
(arm_hard_ok): Likewise.
---
  gcc/testsuite/lib/target-supports.exp | 20 
  1 file changed, 20 insertions(+)

diff --git a/gcc/testsuite/lib/target-supports.exp 
b/gcc/testsuite/lib/target-supports.exp
index 3758bb3..6c8dd01 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -4739,6 +4739,26 @@ proc check_effective_target_default_branch_protection { 
} {
  return [check_configured_with "enable-standard-branch-protection"]
  }
  
+# Return 1 if this is an ARM target supporting -mfloat-abi=softfp.

+
+proc check_effective_target_arm_softfp_ok { } {
+return [check_no_compiler_messages arm_softfp_ok object {
+   #include <stdint.h>
+   int dummy;
+   int main (void) { return 0; }
+   } "-mfloat-abi=softfp"]
+}
+
+# Return 1 if this is an ARM target supporting -mfloat-abi=hard.
+
+proc check_effective_target_arm_hard_ok { } {
+return [check_no_compiler_messages arm_hard_ok object {
+   #include <stdint.h>
+   int dummy;
+   int main (void) { return 0; }
+   } "-mfloat-abi=hard"]
+}
+
  # Return 1 if the target supports ARMv8.1-M MVE with floating point
  # instructions, 0 otherwise.  The test is valid for ARM.
  # Record the command line options needed.


Re: [PATCH 2/5] testsuite: [arm/mve] Use arm_softfp and arm_hard as needed in MVE tests

2020-04-14 Thread Andre Vieira (lists)

On 10/04/2020 14:55, Christophe Lyon via Gcc-patches wrote:

Some MVE tests explicitly test a -mfloat-abi=hard option, but we need
to check that the toolchain actually supports it (which may not be the
case for arm-linux-gnueabi* targets).

We also make use of dg-add-options arm_v8_1m_mve_fp and arm_v8_1m_mve
instead of duplicating the corresponding options in
dg-additional-options where we keep only -mfloat-abi to override the
option selected by arm_v8_1m_mve_fp.

Hi Christophe,

This sounds good!! Thank you for doing this.

diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu1.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu1.c
index 1462dd4..0fa3afd 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu1.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu1.c
@@ -1,6 +1,8 @@
  /* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
+/* { dg-require-effective-target arm_hard_ok } */
+/* { dg-add-options arm_v8_1m_mve_fp } */
  /* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=soft" } 
{""} } */

I was wondering, do we still need the skip-if with the arm_hard_ok?

-/* { dg-additional-options "-march=armv8.1-m.main+mve.fp -mfloat-abi=hard -mthumb 
-mfpu=auto --save-temps" } */
+/* { dg-additional-options "-mfloat-abi=hard" } */
  
  #include "arm_mve.h"
  
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu2.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu2.c

index d528133..1fca110 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu2.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu2.c
@@ -1,5 +1,7 @@
  /* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
-/* { dg-additional-options "-march=armv8.1-m.main+mve.fp -mfloat-abi=softfp -mthumb 
-mfpu=auto --save-temps" } */
+/* { dg-require-effective-target arm_softfp_ok } */
+/* { dg-add-options arm_v8_1m_mve_fp } */
+/* { dg-additional-options "-mfloat-abi=softfp" } */
  
  #include "arm_mve.h"
  
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fpu1.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fpu1.c

index 59ca724..726f9ec 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fpu1.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fpu1.c
@@ -1,6 +1,8 @@
  /* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_hard_ok } */
+/* { dg-add-options arm_v8_1m_mve } */
  /* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=soft" } 
{""} } */

Same here.

-/* { dg-additional-options "-march=armv8.1-m.main+mve -mfloat-abi=hard -mthumb 
-mfpu=auto --save-temps" } */
+/* { dg-additional-options "-mfloat-abi=hard" } */
  
  #include "arm_mve.h"
  
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fpu2.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fpu2.c

index ce297ea..7f39905 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fpu2.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fpu2.c
@@ -1,6 +1,8 @@
  /* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_softfp_ok } */
+/* { dg-add-options arm_v8_1m_mve } */
  /* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=soft" } 
{""} } */

And here.

-/* { dg-additional-options "-march=armv8.1-m.main+mve -mfloat-abi=softfp -mthumb 
-mfpu=auto --save-temps" } */
+/* { dg-additional-options "-mfloat-abi=softfp" } */
  
  #include "arm_mve.h"
  

LGTM.


[PATCH][GCC][Arm]: MVE: Add C++ polymorphism and fix some more issues

2020-04-07 Thread Andre Vieira (lists)

Hi,

This patch adds C++ polymorphism for the MVE intrinsics, by using the 
native C++ polymorphic functions when C++ is used.

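For illustration only (not part of the patch, hypothetical function names): the same polymorphic intrinsic name resolves on the argument types, and with this change a snippet like the one below builds the same way as C and as C++.

#include "arm_mve.h"

/* Illustrative sketch: vaddq resolves per argument type in both C
   (via _Generic) and, after this patch, C++ (via overloads).  */
int32x4_t add_s32 (int32x4_t a, int32x4_t b) { return vaddq (a, b); }
float32x4_t add_f32 (float32x4_t a, float32x4_t b) { return vaddq (a, b); }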

It also moves the PRESERVE name macro definitions to the right place so 
that the variants without the '__arm_' prefix are not available if we 
define the PRESERVE NAMESPACE macro.


This patch further fixes two testisms that were brought to light by C++ 
testing added in this patch.


Regression tested on arm-none-eabi.

Is this OK for trunk?

gcc/ChangeLog:
2020-04-07  Andre Vieira  

    * config/arm/arm_mve.h: Add C++ polymorphism and fix
    preserve MACROs.

gcc/testsuite/ChangeLog:
2020-04-07  Andre Vieira  

    * g++.target/arm/mve.exp: New.
    * gcc.target/arm/mve/intrinsics/vcmpneq_n_f16: Fix testism.
    * gcc.target/arm/mve/intrinsics/vcmpneq_n_f32: Likewise.

<>


[PATCH][GCC][Arm]: MVE: Fixes for pointers used in intrinsics for c++

2020-04-07 Thread Andre Vieira (lists)

Hi,

This patch fixes the passing of some pointers to builtins that expect 
slightly different types of pointers.  In C this didn't prove an issue, 
but when compiling for C++ gcc complains.


Regression tested on arm-none-eabi.

Is this OK for trunk?

2020-04-07  Andre Vieira  

    * config/arm/arm_mve.h: Cast some pointers to expected types.

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 
44cb0d4a92f7f64d08c17722944d20bd6ea7048a..49c7fb95f17347d283c4df34a6875d686a3e3f09
 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -12897,56 +12897,56 @@ __extension__ extern __inline void
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vstrdq_scatter_offset_p_s64 (int64_t * __base, uint64x2_t __offset, 
int64x2_t __value, mve_pred16_t __p)
 {
-  __builtin_mve_vstrdq_scatter_offset_p_sv2di (__base, __offset, __value, __p);
+  __builtin_mve_vstrdq_scatter_offset_p_sv2di ((__builtin_neon_di *) __base, 
__offset, __value, __p);
 }
 
 __extension__ extern __inline void
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vstrdq_scatter_offset_p_u64 (uint64_t * __base, uint64x2_t __offset, 
uint64x2_t __value, mve_pred16_t __p)
 {
-  __builtin_mve_vstrdq_scatter_offset_p_uv2di (__base, __offset, __value, __p);
+  __builtin_mve_vstrdq_scatter_offset_p_uv2di ((__builtin_neon_di *) __base, 
__offset, __value, __p);
 }
 
 __extension__ extern __inline void
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vstrdq_scatter_offset_s64 (int64_t * __base, uint64x2_t __offset, 
int64x2_t __value)
 {
-  __builtin_mve_vstrdq_scatter_offset_sv2di (__base, __offset, __value);
+  __builtin_mve_vstrdq_scatter_offset_sv2di ((__builtin_neon_di *) __base, 
__offset, __value);
 }
 
 __extension__ extern __inline void
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vstrdq_scatter_offset_u64 (uint64_t * __base, uint64x2_t __offset, 
uint64x2_t __value)
 {
-  __builtin_mve_vstrdq_scatter_offset_uv2di (__base, __offset, __value);
+  __builtin_mve_vstrdq_scatter_offset_uv2di ((__builtin_neon_di *) __base, 
__offset, __value);
 }
 
 __extension__ extern __inline void
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vstrdq_scatter_shifted_offset_p_s64 (int64_t * __base, uint64x2_t 
__offset, int64x2_t __value, mve_pred16_t __p)
 {
-  __builtin_mve_vstrdq_scatter_shifted_offset_p_sv2di (__base, __offset, 
__value, __p);
+  __builtin_mve_vstrdq_scatter_shifted_offset_p_sv2di ((__builtin_neon_di *) 
__base, __offset, __value, __p);
 }
 
 __extension__ extern __inline void
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vstrdq_scatter_shifted_offset_p_u64 (uint64_t * __base, uint64x2_t 
__offset, uint64x2_t __value, mve_pred16_t __p)
 {
-  __builtin_mve_vstrdq_scatter_shifted_offset_p_uv2di (__base, __offset, 
__value, __p);
+  __builtin_mve_vstrdq_scatter_shifted_offset_p_uv2di ((__builtin_neon_di *) 
__base, __offset, __value, __p);
 }
 
 __extension__ extern __inline void
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vstrdq_scatter_shifted_offset_s64 (int64_t * __base, uint64x2_t 
__offset, int64x2_t __value)
 {
-  __builtin_mve_vstrdq_scatter_shifted_offset_sv2di (__base, __offset, 
__value);
+  __builtin_mve_vstrdq_scatter_shifted_offset_sv2di ((__builtin_neon_di *) 
__base, __offset, __value);
 }
 
 __extension__ extern __inline void
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vstrdq_scatter_shifted_offset_u64 (uint64_t * __base, uint64x2_t 
__offset, uint64x2_t __value)
 {
-  __builtin_mve_vstrdq_scatter_shifted_offset_uv2di (__base, __offset, 
__value);
+  __builtin_mve_vstrdq_scatter_shifted_offset_uv2di ((__builtin_neon_di *) 
__base, __offset, __value);
 }
 
 __extension__ extern __inline void
@@ -18968,14 +18968,14 @@ __extension__ extern __inline float16x8_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vldrhq_gather_shifted_offset_f16 (float16_t const * __base, uint16x8_t 
__offset)
 {
-  return __builtin_mve_vldrhq_gather_shifted_offset_fv8hf (__base, __offset);
+  return __builtin_mve_vldrhq_gather_shifted_offset_fv8hf ((__builtin_neon_hi 
*) __base, __offset);
 }
 
 __extension__ extern __inline float16x8_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vldrhq_gather_shifted_offset_z_f16 (float16_t const * __base, uint16x8_t 
__offset, mve_pred16_t __p)
 {
-  return __builtin_mve_vldrhq_gather_shifted_offset_z_fv8hf (__base, __offset, 
__p);
+  return __builtin_mve_vldrhq_gather_shifted_offset_z_fv8hf 
((__builtin_neon_hi *) __base, __offset, __p);
 }
 
 __extension__ extern __inline float32x4_t
@@ -19010,84 +19010,84 @@ __extension__ extern __inline float32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vldrwq_gather_shifted_offset_f32 (float32_t const * __base, uint32x4_t

[PATCH][GCC][Arm]MVE: Fix -Wall testisms

2020-04-07 Thread Andre Vieira (lists)

Hi,

This patch fixes some testisms I found when testing using -Wall/-Werror.

Is this OK for trunk?

gcc/testsuite/ChangeLog:
2020-04-07  Andre Vieira  

    * gcc.target/arm/mve/intrinsics/vuninitializedq_float.c: Likewise.
    * gcc.target/arm/mve/intrinsics/vuninitializedq_float1.c: Likewise.
    * gcc.target/arm/mve/intrinsics/vuninitializedq_int.c: Likewise.
    * gcc.target/arm/mve/intrinsics/vuninitializedq_int1.c: Likewise.

diff --git 
a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_float.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_float.c
index 
3b9c0a740976854e7189ab23a6a8b2764c9b888a..52bad05b6219621ada414dc74ab2deebdd1c93e3
 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_float.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_float.c
@@ -4,11 +4,12 @@
 
 #include "arm_mve.h"
 
+float16x8_t fa;
+float32x4_t fb;
+
 void
 foo ()
 {
-  float16x8_t fa;
-  float32x4_t fb;
   fa = vuninitializedq_f16 ();
   fb = vuninitializedq_f32 ();
 }
diff --git 
a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_float1.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_float1.c
index 
0c94608af41fc30c65b959759704033a76bb879f..c6724a52074c6ce0361fdba66c4add831e8c13db
 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_float1.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_float1.c
@@ -4,13 +4,14 @@
 
 #include "arm_mve.h"
 
+float16x8_t fa, faa;
+float32x4_t fb, fbb;
+
 void
 foo ()
 {
-  float16x8_t fa, faa;
-  float32x4_t fb, fbb;
   fa = vuninitializedq (faa);
   fb = vuninitializedq (fbb);
 }
 
-/* { dg-final { scan-assembler-times "vstrb.8" } */
+/* { dg-final { scan-assembler-times "vstrb.8" 6 } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_int.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_int.c
index 
9ae17e240083a66e7c20c16ae06b99463c213bf9..13a0109a9b5380cd83f48154df231081ddb8f08e
 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_int.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_int.c
@@ -3,18 +3,18 @@
 /* { dg-additional-options "-O0" } */
 
 #include "arm_mve.h"
+int8x16_t a;
+int16x8_t b;
+int32x4_t c;
+int64x2_t d;
+uint8x16_t ua;
+uint16x8_t ub;
+uint32x4_t uc;
+uint64x2_t ud;
 
 void
 foo ()
 {
-  int8x16_t a;
-  int16x8_t b;
-  int32x4_t c;
-  int64x2_t d;
-  uint8x16_t ua;
-  uint16x8_t ub;
-  uint32x4_t uc;
-  uint64x2_t ud;
   a = vuninitializedq_s8 ();
   b = vuninitializedq_s16 ();
   c = vuninitializedq_s32 ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_int1.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_int1.c
index 
e8c1f1019c95af6d871cda9c9142c346ff3b49ae..a321398709e65ee7daadfab9c6089116baccde83
 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_int1.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vuninitializedq_int1.c
@@ -4,17 +4,18 @@
 
 #include "arm_mve.h"
 
+int8x16_t a, aa;
+int16x8_t b, bb;
+int32x4_t c, cc;
+int64x2_t d, dd;
+uint8x16_t ua, uaa;
+uint16x8_t ub, ubb;
+uint32x4_t uc, ucc;
+uint64x2_t ud, udd;
+
 void
 foo ()
 {
-  int8x16_t a, aa;
-  int16x8_t b, bb;
-  int32x4_t c, cc;
-  int64x2_t d, dd;
-  uint8x16_t ua, uaa;
-  uint16x8_t ub, ubb;
-  uint32x4_t uc, ucc;
-  uint64x2_t ud, udd;
   a = vuninitializedq (aa);
   b = vuninitializedq (bb);
   c = vuninitializedq (cc);


[PATCH][GCC][Arm]: MVE: Fix vec extracts to memory

2020-04-07 Thread Andre Vieira (lists)

Hi,

This patch fixes vec extracts to memory, which can arise from code like 
the added testcase. It does so by allowing mem operands in the set of the 
mve_vec_extract patterns; since the only constraint is '=r', the scalar 
value is written to a register and then stored to memory using a scalar 
store pattern.
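
As a rough illustration (the names here are made up for the example, but it 
mirrors the added testcase), this is the kind of code that now goes through 
the relaxed pattern:

#include "arm_mve.h"

uint32x4_t *pv;
uint32_t g;

/* The extracted lane ends up in memory: the pattern now accepts the mem
   destination, the lane is moved into a core register and then written
   out with a scalar store.  */
void
extract_to_mem (void)
{
  g = vgetq_lane_u32 (*pv, 1);
}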


Regression tested on arm-none-eabi.

Is this OK for trunk?

gcc/ChangeLog:
2020-04-07  Andre Vieira  

    * config/arm/mve.md (mve_vec_extract*): Allow memory operands 
in set.


gcc/testsuite/ChangeLog:
2020-04-07  Andre Vieira  

    * gcc.target/arm/mve/intrinsics/mve_vec_extracts_from_memory.c: 
New test.


diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 
3c75f9ebc70d5765a59934b944955c757b6b2195..c49c14c4240838ce086f424f58726e2e94cf190e
 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -10993,7 +10993,7 @@ (define_insn "mve_vld4q"
 ;; [vgetq_lane_u, vgetq_lane_s, vgetq_lane_f])
 ;;
 (define_insn "mve_vec_extract"
- [(set (match_operand: 0 "s_register_operand" "=r")
+ [(set (match_operand: 0 "nonimmediate_operand" "=r")
(vec_select:
 (match_operand:MVE_VLD_ST 1 "s_register_operand" "w")
 (parallel [(match_operand:SI 2 "immediate_operand" "i")])))]
@@ -11011,7 +11011,7 @@ (define_insn "mve_vec_extract"
  [(set_attr "type" "mve_move")])
 
 (define_insn "mve_vec_extractv2didi"
- [(set (match_operand:DI 0 "s_register_operand" "=r")
+ [(set (match_operand:DI 0 "nonimmediate_operand" "=r")
(vec_select:DI
 (match_operand:V2DI 1 "s_register_operand" "w")
 (parallel [(match_operand:SI 2 "immediate_operand" "i")])))]
@@ -11024,7 +11024,7 @@ (define_insn "mve_vec_extractv2didi"
   if (elt == 0)
return "vmov\t%Q0, %R0, %e1";
   else
-   return "vmov\t%J0, %K0, %f1";
+   return "vmov\t%Q0, %R0, %f1";
 }
  [(set_attr "type" "mve_move")])
 
diff --git 
a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vec_extracts_from_memory.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vec_extracts_from_memory.c
new file mode 100644
index 
..12f2f2d38d3c2e189a9c06f21fc63e2c23e2e721
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vec_extracts_from_memory.c
@@ -0,0 +1,40 @@
+/* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
+/* { dg-add-options arm_v8_1m_mve_fp } */
+/* { dg-additional-options "-O3" } */
+
+#include "arm_mve.h"
+
+uint8x16_t *vu8;
+int8x16_t *vs8;
+uint16x8_t *vu16;
+int16x8_t *vs16;
+uint32x4_t *vu32;
+int32x4_t *vs32;
+uint64x2_t *vu64;
+int64x2_t *vs64;
+float16x8_t *vf16;
+float32x4_t *vf32;
+uint8_t u8;
+uint16_t u16;
+uint32_t u32;
+uint64_t u64;
+int8_t s8;
+int16_t s16;
+int32_t s32;
+int64_t s64;
+float16_t f16;
+float32_t f32;
+
+void foo (void)
+{
+  u8 = vgetq_lane (*vu8, 1);
+  u16 = vgetq_lane (*vu16, 1);
+  u32 = vgetq_lane (*vu32, 1);
+  u64 = vgetq_lane (*vu64, 1);
+  s8 = vgetq_lane (*vs8, 1);
+  s16 = vgetq_lane (*vs16, 1);
+  s32 = vgetq_lane (*vs32, 1);
+  s64 = vgetq_lane (*vs64, 1);
+  f16 = vgetq_lane (*vf16, 1);
+  f32 = vgetq_lane (*vf32, 1);
+}


[PATCH][GCC][Arm]: MVE Fix immediate constraints on some vector instructions

2020-04-07 Thread Andre Vieira (lists)

Hi,

This patch fixes the immediate checks on the vcvt and vqshr(u)n[bt] 
instructions.  It also removes 'arm_mve_immediate_check', as the check 
was wrong and its error message is not much better than the constraint 
one, which admittedly isn't great either.
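
To make the new behaviour concrete (a sketch; the accepted ranges are 
inferred from the mve_imm_16/mve_imm_32 predicates used by the constraints, 
so treat the exact bounds as an assumption):

#include "arm_mve.h"

/* In range: the scale immediate may go up to the element size.  */
float16x8_t scale16 (int16x8_t a) { return vcvtq_n_f16_s16 (a, 16); }
float32x4_t scale32 (int32x4_t a) { return vcvtq_n_f32_s32 (a, 32); }

/* An out-of-range immediate, e.g. 17 in the f16 variant, is now rejected
   by the operand constraint at compile time.  */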


Regression tested on arm-none-eabi.

Is this OK for trunk?

gcc/ChangeLog:
2020-04-07  Andre Vieira  

    * config/arm/mve.md (mve_vcvtq_n_to_f_*, mve_vcvtq_n_from_f_*,
    mve_vqshrnbq_n_*, mve_vqshrntq_n_*, mve_vqshrunbq_n_s*,
    mve_vqshruntq_n_s*, mve_vcvtq_m_n_from_f_*, mve_vcvtq_m_n_to_f_*,
    mve_vqshrnbq_m_n_*, mve_vqrshruntq_m_n_s*, mve_vqshrunbq_m_n_s*,
    mve_vqshruntq_m_n_s*): Fixed immediate constraints.

gcc/testsuite/ChangeLog:
2020-04-07  Andre Vieira  

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 
1af9d5cf145f6d01e364a1afd7ceb3df5da86c9a..cd0a49cdb63690d794981a73e1e7e0d47f6d1987
 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -32693,31 +32693,6 @@ arm_simd_check_vect_par_cnst_half_p (rtx op, 
machine_mode mode,
   return true;
 }
 
-/* To check op's immediate values matches the mode of the defined insn.  */
-bool
-arm_mve_immediate_check (rtx op, machine_mode mode, bool val)
-{
-  if (val)
-{
-  if (((GET_CODE (op) == CONST_INT) && (INTVAL (op) <= 7)
-  && (mode == E_V16QImode))
- || ((GET_CODE (op) == CONST_INT) && (INTVAL (op) <= 15)
-  && (mode == E_V8HImode))
- || ((GET_CODE (op) == CONST_INT) && (INTVAL (op) <= 31)
-  && (mode == E_V4SImode)))
-   return true;
-}
-  else
-{
-  if (((GET_CODE (op) == CONST_INT) && (INTVAL (op) <= 7)
-  && (mode == E_V8HImode))
- || ((GET_CODE (op) == CONST_INT) && (INTVAL (op) <= 15)
-  && (mode == E_V4SImode)))
-   return true;
-}
-  return false;
-}
-
 /* Can output mi_thunk for all cases except for non-zero vcall_offset
in Thumb1.  */
 static bool
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 
4a506cc3861534b4ddc30ba8f4f3c4ec28a8cc69..3c75f9ebc70d5765a59934b944955c757b6b2195
 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -401,8 +401,10 @@ (define_int_attr mode1 [(VCTP8Q "8") (VCTP16Q "16") 
(VCTP32Q "32")
(VCTP64Q "64") (VCTP8Q_M "8") (VCTP16Q_M "16")
(VCTP32Q_M "32") (VCTP64Q_M "64")])
 (define_mode_attr MVE_pred2 [(V16QI "mve_imm_8") (V8HI "mve_imm_16")
-(V4SI "mve_imm_32")])
-(define_mode_attr MVE_constraint2 [(V16QI "Rb") (V8HI "Rd") (V4SI "Rf")])
+(V4SI "mve_imm_32")
+(V8HF "mve_imm_16") (V4SF "mve_imm_32")])
+(define_mode_attr MVE_constraint2 [(V16QI "Rb") (V8HI "Rd") (V4SI "Rf")
+   (V8HF "Rd") (V4SF "Rf")])
 (define_mode_attr MVE_LANES [(V16QI "16") (V8HI "8") (V4SI "4")])
 (define_mode_attr MVE_constraint [ (V16QI "Ra") (V8HI "Rc") (V4SI "Re")])
 (define_mode_attr MVE_pred [ (V16QI "mve_imm_7") (V8HI "mve_imm_15")
@@ -1330,7 +1332,7 @@ (define_insn "mve_vcvtq_n_to_f_"
   [
(set (match_operand:MVE_0 0 "s_register_operand" "=w")
(unspec:MVE_0 [(match_operand: 1 "s_register_operand" "w")
-  (match_operand:SI 2 "mve_imm_16" "Rd")]
+  (match_operand:SI 2 "<MVE_pred2>" "<MVE_constraint2>")]
 VCVTQ_N_TO_F))
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
@@ -1389,7 +1391,7 @@ (define_insn "mve_vcvtq_n_from_f_"
   [
(set (match_operand:MVE_5 0 "s_register_operand" "=w")
(unspec:MVE_5 [(match_operand: 1 "s_register_operand" "w")
-  (match_operand:SI 2 "mve_imm_16" "Rd")]
+  (match_operand:SI 2 "<MVE_pred2>" "<MVE_constraint2>")]
 VCVTQ_N_FROM_F))
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
@@ -5484,7 +5486,7 @@ (define_insn "mve_vqshrnbq_n_"
(set (match_operand: 0 "s_register_operand" "=w")
(unspec: [(match_operand: 1 
"s_register_operand" "0")
   (match_operand:MVE_5 2 "s_register_operand" "w")
-  (match_operand:SI 3 "" "")]
+  (match_operand:SI 3 "" "")]
 VQSHRNBQ_N))
   ]
   "TARGET_HAVE_MVE"
@@ -5500,7 +5502,7 @@ (define_insn "mve_vqshrntq_n_"
(set (match_operand: 0 "s

Re: [PATCH][GCC][Arm]: MVE: Fix v[id]wdup's

2020-04-07 Thread Andre Vieira (lists)

On 07/04/2020 11:57, Christophe Lyon wrote:

On Tue, 7 Apr 2020 at 12:40, Andre Vieira (lists)
 wrote:

Hi,

This patch fixes v[id]wdup intrinsics. They had two issues:
1) the predicated versions did not link the incoming inactive vector
parameter to the output
2) The backend didn't enforce the wrap limit operand be in an odd register.

1) was fixed like we did for all other predicated intrinsics
2) requires a temporary hack where we pass the value in the top end of
DImode operand. The proper fix would be to add a register CLASS but this
interacted badly with other existing targets codegen.  We will look to
fix this properly in GCC 11.

Regression tested on arm-none-eabi.


Hi Andre,

How did you find problem 1? I suspect you are using an internal
simulator since qemu does not support MVE yet?
And you probably have runtime tests to exhibit this failure?

Hi Christophe,

I actually found 1) because I was fixing 2). Though yes, I am trying to 
complement testing using an internal simulator and running tests in 
Arm's CMSIS DSP Library (https://github.com/ARM-software/CMSIS_5) that 
use MVE.


Cheers,
Andre

Thanks,

Christophe


Is this OK for trunk?

gcc/ChangeLog:
2020-04-07  Andre Vieira  

  * config/arm/arm_mve.h: Fix v[id]wdup intrinsics.
  * config/arm/mve.md: Fix v[id]wdup patterns.



Re: [PATCH][GCC][Arm]: MVE: Fix constant load pattern

2020-04-07 Thread Andre Vieira (lists)
The diff looks weird, but this only removes the first if 
(TARGET_HAVE_MVE ...) block and updates the variable 'addr', which is 
only used in the consecutive (TARGET_HAVE_MVE ...) blocks. So it doesn't 
change NEON codegen.


It's unfortunate the diff looks so complicated :(

Cheers,
Andre

On 07/04/2020 11:52, Kyrylo Tkachov wrote:



-Original Message-
From: Andre Vieira (lists) 
Sent: 07 April 2020 11:35
To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov 
Subject: [PATCH][GCC][Arm]: MVE: Fix constant load pattern

Hi,

This patch fixes the constant load pattern for MVE, this was not
accounting correctly for label + offset cases.

Added test that ICE'd before and removed the scan assemblers for the
mve_vector* tests as they were too fragile.

Bootstrapped on arm-linux-gnueabihf and regression tested on arm-none-
eabi.

Is this OK for trunk?

This makes me a bit nervous as it touches the common output_move_neon code ☹ 
but it looks like it mostly shuffles things around.
Ok for trunk but please watch out for fallout.
Thanks,
Kyrill


gcc/ChangeLog:
2020-04-07  Andre Vieira  

      * config/arm/arm.c (output_move_neon): Deal with label + offset
cases.
      * config/arm/mve.md (*mve_mov): Handle const vectors.

gcc/testsuite/ChangeLog:
2020-04-07  Andre Vieira  

      * gcc.target/arm/mve/intrinsics/mve_load_from_array.c: New test.
      * gcc.target/arm/mve/intrinsics/mve_vector_float.c: Remove
scan-assembler.
      * gcc.target/arm/mve/intrinsics/mve_vector_float1.c: Likewise.
      * gcc.target/arm/mve/intrinsics/mve_vector_int1.c: Likewise.
      * gcc.target/arm/mve/intrinsics/mve_vector_int2.c: Likewise.


[PATCH][GCC][Arm]: MVE Don't use lsll for 32-bit shifts scalar

2020-04-07 Thread Andre Vieira (lists)

Hi,

After fixing the v[id]wdups by moving the wrap parameter into the top 
end of a DImode operand using a shift, I noticed we were using lsll for 
32-bit shifts of scalars where we don't need to: we can simply do a 
move, which is much better if we don't need to use the bottom part.


We can solve this in a better way, but for now this will do.
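
To spell the observation out (a generic sketch, not the patch itself): 
shifting a 64-bit value left by exactly 32 just moves the low word into the 
high word, so a register move is enough and lsll buys nothing:

#include <stdint.h>

/* After the shift the high word is the old low word and the low word is
   zero, i.e. two core-register moves on the target.  */
static inline uint64_t
shift_left_by_32 (uint64_t x)
{
  return (uint64_t) (uint32_t) x << 32;
}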

Regression tested on arm-none-eabi.

Is this OK for trunk?

2020-04-07  Andre Vieira  

    * config/arm/arm.md (ashldi3): Don't use lsll for constant 32-bit
    shifts.

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 
1a7ea0d701e5677965574d877d0fe4b2f5bc149f..6d5560398dae3d0ace0342b4907542d2a6865f70
 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -4422,7 +4422,8 @@ (define_expand "ashldi3"
 operands[2] = force_reg (SImode, operands[2]);
 
   /* Armv8.1-M Mainline double shifts are not expanded.  */
-  if (arm_reg_or_long_shift_imm (operands[2], GET_MODE (operands[2])))
+  if (arm_reg_or_long_shift_imm (operands[2], GET_MODE (operands[2]))
+ && (REG_P (operands[2]) || INTVAL(operands[2]) != 32))
 {
  if (!reg_overlap_mentioned_p(operands[0], operands[1]))
emit_insn (gen_movdi (operands[0], operands[1]));


[PATCH][GCC][Arm]: MVE: Fix v[id]wdup's

2020-04-07 Thread Andre Vieira (lists)

Hi,

This patch fixes v[id]wdup intrinsics. They had two issues:
1) the predicated versions did not link the incoming inactive vector 
parameter to the output

2) The backend didn't enforce that the wrap limit operand be in an odd register.

1) was fixed like we did for all other predicated intrinsics.
2) requires a temporary hack where we pass the value in the top end of 
the DImode operand. The proper fix would be to add a register CLASS, but 
this interacted badly with other existing targets' codegen.  We will look 
to fix this properly in GCC 11.
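
In other words (a small sketch of the workaround; the helper name is just 
for illustration), the 32-bit wrap limit is packed into the top half of a 
64-bit value before it is handed to the builtin, which is what the 
arm_mve.h changes below do inline:

#include <stdint.h>

/* Put the wrap limit in bits 63..32 so the backend can take it from the
   odd register of the DImode register pair.  */
static inline uint64_t
pack_wrap_limit (uint32_t wrap)
{
  return (uint64_t) wrap << 32;
}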


Regression tested on arm-none-eabi.

Is this OK for trunk?

gcc/ChangeLog:
2020-04-07  Andre Vieira  

    * config/arm/arm_mve.h: Fix v[id]wdup intrinsics.
    * config/arm/mve.md: Fix v[id]wdup patterns.

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 
e31c2e8fdc4f500bf9408d05ad86e151397627f7..47eead71d9515b4103a5b66999a3f9357dc3c3be
 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -13585,29 +13585,33 @@ __extension__ extern __inline uint8x16_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vdwdupq_m_n_u8 (uint8x16_t __inactive, uint32_t __a, uint32_t __b, const 
int __imm, mve_pred16_t __p)
 {
-  return __builtin_mve_vdwdupq_m_n_uv16qi (__inactive, __a, __b, __imm, __p);
+  uint64_t __c = ((uint64_t) __b) << 32;
+  return __builtin_mve_vdwdupq_m_n_uv16qi (__inactive, __a, __c, __imm, __p);
 }
 
 __extension__ extern __inline uint32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vdwdupq_m_n_u32 (uint32x4_t __inactive, uint32_t __a, uint32_t __b, 
const int __imm, mve_pred16_t __p)
 {
-  return __builtin_mve_vdwdupq_m_n_uv4si (__inactive, __a, __b, __imm, __p);
+  uint64_t __c = ((uint64_t) __b) << 32;
+  return __builtin_mve_vdwdupq_m_n_uv4si (__inactive, __a, __c, __imm, __p);
 }
 
 __extension__ extern __inline uint16x8_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vdwdupq_m_n_u16 (uint16x8_t __inactive, uint32_t __a, uint32_t __b, 
const int __imm, mve_pred16_t __p)
 {
-  return __builtin_mve_vdwdupq_m_n_uv8hi (__inactive, __a, __b, __imm, __p);
+  uint64_t __c = ((uint64_t) __b) << 32;
+  return __builtin_mve_vdwdupq_m_n_uv8hi (__inactive, __a, __c, __imm, __p);
 }
 
 __extension__ extern __inline uint8x16_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vdwdupq_m_wb_u8 (uint8x16_t __inactive, uint32_t * __a, uint32_t __b, 
const int __imm, mve_pred16_t __p)
 {
-  uint8x16_t __res =  __builtin_mve_vdwdupq_m_n_uv16qi (__inactive, *__a, __b, 
__imm, __p);
-  *__a = __builtin_mve_vdwdupq_m_wb_uv16qi (__inactive, *__a, __b, __imm, __p);
+  uint64_t __c = ((uint64_t) __b) << 32;
+  uint8x16_t __res =  __builtin_mve_vdwdupq_m_n_uv16qi (__inactive, *__a, __c, 
__imm, __p);
+  *__a = __builtin_mve_vdwdupq_m_wb_uv16qi (__inactive, *__a, __c, __imm, __p);
   return __res;
 }
 
@@ -13615,8 +13619,9 @@ __extension__ extern __inline uint32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vdwdupq_m_wb_u32 (uint32x4_t __inactive, uint32_t * __a, uint32_t __b, 
const int __imm, mve_pred16_t __p)
 {
-  uint32x4_t __res =  __builtin_mve_vdwdupq_m_n_uv4si (__inactive, *__a, __b, 
__imm, __p);
-  *__a = __builtin_mve_vdwdupq_m_wb_uv4si (__inactive, *__a, __b, __imm, __p);
+  uint64_t __c = ((uint64_t) __b) << 32;
+  uint32x4_t __res =  __builtin_mve_vdwdupq_m_n_uv4si (__inactive, *__a, __c, 
__imm, __p);
+  *__a = __builtin_mve_vdwdupq_m_wb_uv4si (__inactive, *__a, __c, __imm, __p);
   return __res;
 }
 
@@ -13624,8 +13629,9 @@ __extension__ extern __inline uint16x8_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vdwdupq_m_wb_u16 (uint16x8_t __inactive, uint32_t * __a, uint32_t __b, 
const int __imm, mve_pred16_t __p)
 {
-  uint16x8_t __res =  __builtin_mve_vdwdupq_m_n_uv8hi (__inactive, *__a, __b, 
__imm, __p);
-  *__a = __builtin_mve_vdwdupq_m_wb_uv8hi (__inactive, *__a, __b, __imm, __p);
+  uint64_t __c = ((uint64_t) __b) << 32;
+  uint16x8_t __res =  __builtin_mve_vdwdupq_m_n_uv8hi (__inactive, *__a, __c, 
__imm, __p);
+  *__a = __builtin_mve_vdwdupq_m_wb_uv8hi (__inactive, *__a, __c, __imm, __p);
   return __res;
 }
 
@@ -13633,29 +13639,33 @@ __extension__ extern __inline uint8x16_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vdwdupq_n_u8 (uint32_t __a, uint32_t __b, const int __imm)
 {
-  return __builtin_mve_vdwdupq_n_uv16qi (__a, __b, __imm);
+  uint64_t __c = ((uint64_t) __b) << 32;
+  return __builtin_mve_vdwdupq_n_uv16qi (__a, __c, __imm);
 }
 
 __extension__ extern __inline uint32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vdwdupq_n_u32 (uint32_t __a, uint32_t __b, const int __imm)
 {
-  return __builtin_mve_vdwdupq_n_uv4si (__a, __b, __imm);
+  uint64_t __c = ((uint64_t) __b) << 32;
+  return __builtin_mve_vdwdup

[PATCH][GCC][Arm]: MVE: Fix constant load pattern

2020-04-07 Thread Andre Vieira (lists)

Hi,

This patch fixes the constant load pattern for MVE, this was not 
accounting correctly for label + offset cases.


Added test that ICE'd before and removed the scan assemblers for the 
mve_vector* tests as they were too fragile.


Bootstrapped on arm-linux-gnueabihf and regression tested on arm-none-eabi.

Is this OK for trunk?

gcc/ChangeLog:
2020-04-07  Andre Vieira  

    * config/arm/arm.c (output_move_neon): Deal with label + offset 
cases.

    * config/arm/mve.md (*mve_mov): Handle const vectors.

gcc/testsuite/ChangeLog:
2020-04-07  Andre Vieira  

    * gcc.target/arm/mve/intrinsics/mve_load_from_array.c: New test.
    * gcc.target/arm/mve/intrinsics/mve_vector_float.c: Remove 
scan-assembler.

    * gcc.target/arm/mve/intrinsics/mve_vector_float1.c: Likewise.
    * gcc.target/arm/mve/intrinsics/mve_vector_int1.c: Likewise.
    * gcc.target/arm/mve/intrinsics/mve_vector_int2.c: Likewise.

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 
d5207e0d8f07f9be5265fc6d175c148c6cdd53cb..1af9d5cf145f6d01e364a1afd7ceb3df5da86c9a
 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -20122,52 +20122,43 @@ output_move_neon (rtx *operands)
  break;
}
   /* Fall through.  */
-case LABEL_REF:
 case PLUS:
+  addr = XEXP (addr, 0);
+  /* Fall through.  */
+case LABEL_REF:
   {
int i;
int overlap = -1;
-   if (TARGET_HAVE_MVE && !BYTES_BIG_ENDIAN
-   && GET_CODE (addr) != LABEL_REF)
+   for (i = 0; i < nregs; i++)
  {
-   sprintf (buff, "v%srw.32\t%%q0, %%1", load ? "ld" : "st");
-   ops[0] = reg;
-   ops[1] = mem;
-   output_asm_insn (buff, ops);
- }
-   else
- {
-   for (i = 0; i < nregs; i++)
+   /* We're only using DImode here because it's a convenient
+  size.  */
+   ops[0] = gen_rtx_REG (DImode, REGNO (reg) + 2 * i);
+   ops[1] = adjust_address (mem, DImode, 8 * i);
+   if (reg_overlap_mentioned_p (ops[0], mem))
  {
-   /* We're only using DImode here because it's a convenient
-  size.  */
-   ops[0] = gen_rtx_REG (DImode, REGNO (reg) + 2 * i);
-   ops[1] = adjust_address (mem, DImode, 8 * i);
-   if (reg_overlap_mentioned_p (ops[0], mem))
- {
-   gcc_assert (overlap == -1);
-   overlap = i;
- }
-   else
- {
-   if (TARGET_HAVE_MVE && GET_CODE (addr) == LABEL_REF)
- sprintf (buff, "v%sr.64\t%%P0, %%1", load ? "ld" : "st");
-   else
- sprintf (buff, "v%sr%%?\t%%P0, %%1", load ? "ld" : "st");
-   output_asm_insn (buff, ops);
- }
+   gcc_assert (overlap == -1);
+   overlap = i;
  }
-   if (overlap != -1)
+   else
  {
-   ops[0] = gen_rtx_REG (DImode, REGNO (reg) + 2 * overlap);
-   ops[1] = adjust_address (mem, SImode, 8 * overlap);
if (TARGET_HAVE_MVE && GET_CODE (addr) == LABEL_REF)
- sprintf (buff, "v%sr.32\t%%P0, %%1", load ? "ld" : "st");
+ sprintf (buff, "v%sr.64\t%%P0, %%1", load ? "ld" : "st");
else
  sprintf (buff, "v%sr%%?\t%%P0, %%1", load ? "ld" : "st");
output_asm_insn (buff, ops);
  }
  }
+   if (overlap != -1)
+ {
+   ops[0] = gen_rtx_REG (DImode, REGNO (reg) + 2 * overlap);
+   ops[1] = adjust_address (mem, SImode, 8 * overlap);
+   if (TARGET_HAVE_MVE && GET_CODE (addr) == LABEL_REF)
+ sprintf (buff, "v%sr.32\t%%P0, %%1", load ? "ld" : "st");
+   else
+ sprintf (buff, "v%sr%%?\t%%P0, %%1", load ? "ld" : "st");
+   output_asm_insn (buff, ops);
+ }
 
 return "";
   }
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 
d1028f4542b4972b4080e46544c86d625d77383a..10abc3fae3709891346b63213afb1fe3754af41a
 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -695,9 +695,9 @@ (define_insn "*mve_mov"
 case 2:
   return "vmov\t%Q0, %R0, %e1  @ \;vmov\t%J0, %K0, %f1";
 case 4:
-  if ((TARGET_HAVE_MVE_FLOAT && VALID_MVE_SF_MODE (mode))
- || (MEM_P (operands[1])
- && GET_CODE (XEXP (operands[1], 0)) == LABEL_REF))
+  if (MEM_P (operands[1])
+ && (GET_CODE (XEXP (operands[1], 0))

Re: [testsuite][arm] Fix cmse-15.c expected output

2020-04-07 Thread Andre Vieira (lists)

On 06/04/2020 16:12, Christophe Lyon via Gcc-patches wrote:

Hi,

While checking Martin's fix for PR ipa/94445, he made me realize that
the cmse-15.c testcase still fails at -Os because ICF means that we
generate
nonsecure2:
 b   nonsecure0

which is OK, but does not match the currently expected
nonsecure2:
...
 bl  __gnu_cmse_nonsecure_call

(see https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543190.html)

The test has already different expectations for v8-M and v8.1-M.

I've decided to try to use check-function-bodies to account for the
different possibilities:
- v8-M vs v8.1-M via two different prefixes
- code generation variants (-0?) via multiple regexps

I've tested that the test now passes with --target-board=-march=armv8-m.main
and --target-board=-march=armv8.1-m.main.

I feel this is a bit too much of a burden for the purpose; maybe there's
a better way of handling all these alternatives (in particular,
there's a lot of duplication, since the expected code for the secure*
functions is the same for v8-M and v8.1-M).

OK?

Thanks,

Christophe

Hi Christophe,

This check-function-bodies functionality is pretty sweet; I assume the 
( A | B ) syntax checks for either of them?
If so, that looks like a good improvement. Ideally we'd also check the 
clearing for the v8.1-M cases, but that wasn't there before either, and 
they would again need splitting for -mfloat-abi=soft+softfp and 
-mfloat-abi=hard.



So yeah this LGTM but you need approval from a port/global maintainer.

Cheers,
Andre


Re: [PATCH][GCC][Arm]: MVE: Fix polymorphism for scalars and constants

2020-04-07 Thread Andre Vieira (lists)

Now with the zipped patch so it reaches the mailing list.

Sorry for that.

On 07/04/2020 09:57, Kyrylo Tkachov wrote:

-Original Message-
From: Andre Vieira (lists) 
Sent: 07 April 2020 09:57
To: Kyrylo Tkachov ; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH][GCC][Arm]: MVE: Fix polymorphism for scalars and
constants

Hi,

I rebased this patch and made some extra fixes.

This patch merges some polymorphic functions that were incorrectly 
separating scalar variants. It also simplifies the way we detect 
scalars and constants in mve_typeid.

I also fixed some polymorphic intrinsics that were splitting off scalar cases.
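
A small usage sketch of what the merge enables (not part of the patch): 
the same polymorphic name now resolves both the vector-vector and the 
vector-scalar forms:

#include "arm_mve.h"

int32x4_t sub_vv (int32x4_t a, int32x4_t b) { return vsubq (a, b); }
/* A scalar second operand resolves to the former vsubq_n variant.  */
int32x4_t sub_vs (int32x4_t a, int32_t b)   { return vsubq (a, b); }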

Regression tested for arm-none-eabi.

Is this OK for trunk?

Ok.
Thanks,
Kyrill


2020-04-07  Andre Vieira  

      * config/arm/arm_mve.h (vsubq_n): Merge with...
      (vsubq): ... this.
      (vmulq_n): Merge with...
      (vmulq): ... this.
      (__ARM_mve_typeid): Simplify scalar and constant detection.

2020-04-07  Andre Vieira  

      * gcc.target/arm/mve/intrinsics/vmulq_n_f16.c: Fix test.
      * gcc.target/arm/mve/intrinsics/vmulq_n_f32.c: Likewise.
      * gcc.target/arm/mve/intrinsics/vmulq_n_s16.c: Likewise.
      * gcc.target/arm/mve/intrinsics/vmulq_n_s32.c: Likewise.
      * gcc.target/arm/mve/intrinsics/vmulq_n_s8.c: Likewise.
      * gcc.target/arm/mve/intrinsics/vmulq_n_u16.c: Likewise.
      * gcc.target/arm/mve/intrinsics/vmulq_n_u32.c: Likewise.
      * gcc.target/arm/mve/intrinsics/vmulq_n_u8.c: Likewise.

On 02/04/2020 10:58, Kyrylo Tkachov wrote:

-Original Message-
From: Andre Vieira (lists) 
Sent: 02 April 2020 09:22
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov 
Subject: [PATCH][GCC][Arm]: MVE: Fix polymorphism for scalars and
constants

Hi,

This patch merges some polymorphic functions that were incorrectly
separating scalar variants. It also simplifies the way we detect
scalars and constants in mve_typeid.

Regression tested for arm-none-eabi.

Is this OK for trunk?

Ok.
Thanks,
Kyrill


2020-04-02  Andre Vieira  

       * config/arm/arm_mve.h (vsubq_n): Merge with...
       (vsubq): ... this.
       (vmulq_n): Merge with...
       (vmulq): ... this.
       (__ARM_mve_typeid): Simplify scalar and constant detection.


[committed][GCC][Arm]: MVE: Fix unintended change to tests

2020-04-03 Thread Andre Vieira (lists)
When committing my last patch I accidentally removed -mfpu=auto from the 
following tests. This puts it back.


2020-04-03  Andre Vieira  

    * gcc.target/arm/mve/intrinsics/mve_vector_float.c: Put 
-mfpu=auto back.

    * gcc.target/arm/mve/intrinsics/mve_vector_float1.c: Likewise.
    * gcc.target/arm/mve/intrinsics/mve_vector_float2.c: Likewise.
    * gcc.target/arm/mve/intrinsics/mve_vector_int.c: Likewise.
    * gcc.target/arm/mve/intrinsics/mve_vector_int1.c: Likewise.
    * gcc.target/arm/mve/intrinsics/mve_vector_int2.c: Likewise.
    * gcc.target/arm/mve/intrinsics/mve_vector_uint.c: Likewise.
    * gcc.target/arm/mve/intrinsics/mve_vector_uint1.c: Likewise.
    * gcc.target/arm/mve/intrinsics/mve_vector_uint2.c: Likewise.

diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float.c
index 
a6f95a63859c8b130f2c63f788cbea766dd8c5b2..9de47e6a1e0214fef26630b7959f11e58809d2c0
 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
 /* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=soft" } {""} 
} */
-/* { dg-additional-options "-march=armv8.1-m.main+mve.fp -mfloat-abi=hard 
-mthumb --save-temps" } */
+/* { dg-additional-options "-march=armv8.1-m.main+mve.fp -mfpu=auto 
-mfloat-abi=hard -mthumb --save-temps" } */
 
 #include "arm_mve.h"
 
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float1.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float1.c
index 
7745eecacca5faa082d3440c50585f62fd34c0af..ba8fb6dd5da4464dd8e58e837d84540acd1d
 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float1.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float1.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
 /* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=soft" } {""} 
} */
-/* { dg-additional-options "-march=armv8.1-m.main+mve.fp -mfloat-abi=hard 
-mthumb --save-temps" } */
+/* { dg-additional-options "-march=armv8.1-m.main+mve.fp -mfpu=auto 
-mfloat-abi=hard -mthumb --save-temps" } */
 
 #include "arm_mve.h"
 
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float2.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float2.c
index 
02653f08359176a245234448e1273fe106e324e3..3ce8ea3b303509df1ecd8096b990ea9b02846c79
 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float2.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_float2.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
 /* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=soft" } {""} 
} */
-/* { dg-additional-options "-march=armv8.1-m.main+mve.fp -mfloat-abi=hard 
-mthumb --save-temps" } */
+/* { dg-additional-options "-march=armv8.1-m.main+mve.fp -mfpu=auto 
-mfloat-abi=hard -mthumb --save-temps" } */
 
 #include "arm_mve.h"
 
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int.c
index 
5c009ba746278e2bc742d5a89ca510f0899b5db2..dab07051bda3b823b2643d8d0c6aa266515a84c2
 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target arm_v8_1m_mve_ok } */
 /* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=soft" } {""} 
} */
-/* { dg-additional-options "-march=armv8.1-m.main+mve -mfloat-abi=hard -mthumb 
--save-temps" } */
+/* { dg-additional-options "-march=armv8.1-m.main+mve -mfpu=auto 
-mfloat-abi=hard -mthumb --save-temps" } */
 
 #include "arm_mve.h"
 
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int1.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int1.c
index 
50f0bd1efa52b4b2aaf505664b3e571309a26bd3..2d2fd116dfcfcdb04220e92090706c362350e8d2
 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int1.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_vector_int1.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target arm_v8_1m_mve_ok } */
 /* { dg-skip-if "Incompatible float ABI" { *-*-* } { "-mfloat-abi=soft" } {""} 
} */
-/* { dg-additional-options "-march=armv8.1-m.main+mve -mfloat-abi=hard -mthumb 
--save-temps" } */
+/* { dg-additional-options "-march=armv8.1-m.main+mve -mfpu=auto 
-mfloat-abi=hard -mthumb --save-temps" } */
 
 #include "arm_mve.h"
 
diff --git a/gcc/testsuit

[PATCH][GCC][Arm]: Do not process rest of MVE header file after unsupported error

2020-04-02 Thread Andre Vieira (lists)

Hi,

This patch makes sure the rest of the header file is not parsed if MVE 
is not supported.  The user should not be including this file if MVE is 
not supported; nevertheless, making sure the rest of the header file is 
not parsed saves the user from a huge error output that would be rather 
useless.
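
For reference, the guard structure after the change looks like this 
(condensed from the diff below):

#if __ARM_BIG_ENDIAN
#error "MVE intrinsics are not supported in Big-Endian mode."
#elif !__ARM_FEATURE_MVE
#error "MVE feature not supported"
#else
/* ... all intrinsic definitions ... */
#endif /* __ARM_FEATURE_MVE  */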


Is this OK for trunk?

gcc/ChangeLog:
2020-04-02  Andre Vieira  

    * config/arm/arm_mve.h: Condition the header file on 
__ARM_FEATURE_MVE.


diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 
f1dcdc2153217e796c58526ba0e5be11be642234..1ce55bd2fc4f5c6a171ffe116d7fd9029e11a619
 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -24,11 +24,9 @@
 
 #if __ARM_BIG_ENDIAN
 #error "MVE intrinsics are not supported in Big-Endian mode."
-#endif
-
-#if !__ARM_FEATURE_MVE
+#elif !__ARM_FEATURE_MVE
 #error "MVE feature not supported"
-#endif
+#else
 
 #include 
 #ifndef  __cplusplus
@@ -27554,4 +27552,5 @@ extern void *__ARM_undef;
 }
 #endif
 
+#endif /* __ARM_FEATURE_MVE  */
 #endif /* _GCC_ARM_MVE_H.  */


[PATCH 1/2] arm: Add earlyclobber to MVE instructions that require it

2020-03-23 Thread Andre Vieira (lists)


Hi,

This patch adds an earlyclobber to the MVE instructions that require it 
and were missing it. These are vrev64 and the 32-bit element variants of 
vcadd, vhcadd, vcmul, vmull[bt] and vqdmull[bt].


Regression tested on arm-none-eabi.

Is this OK for trunk?

Cheers,
Andre

2020-03-23  Andre Vieira  

    * config/arm/mve.md (earlyclobber_32): New mode attribute.
    (mve_vrev64q_*, mve_vcaddq*, mve_vhcaddq_*, mve_vcmulq_*,
 mve_vmull[bt]q_*, mve_vqdmull[bt]q_*): Add appropriate early 
clobbers.
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 
2e28d9d8408127dd52b9d16c772e7f27a47d390a..0cd67962a2641a3be46fe67819e093c0a712751b
 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -411,6 +411,8 @@ (define_mode_attr MVE_B_ELEM [ (V16QI "V16QI") (V8HI 
"V8QI") (V4SI "V4QI")])
 (define_mode_attr MVE_H_ELEM [ (V8HI "V8HI") (V4SI "V4HI")])
 (define_mode_attr V_sz_elem1 [(V16QI "b") (V8HI  "h") (V4SI "w") (V8HF "h")
  (V4SF "w")])
+(define_mode_attr earlyclobber_32 [(V16QI "=w") (V8HI "=w") (V4SI "=&w")
+   (V8HF "=w") (V4SF "=&w")])
 
 (define_int_iterator VCVTQ_TO_F [VCVTQ_TO_F_S VCVTQ_TO_F_U])
 (define_int_iterator VMVNQ_N [VMVNQ_N_U VMVNQ_N_S])
@@ -856,7 +858,7 @@ (define_insn "mve_vrndaq_f"
 ;;
 (define_insn "mve_vrev64q_f"
   [
-   (set (match_operand:MVE_0 0 "s_register_operand" "=w")
+   (set (match_operand:MVE_0 0 "s_register_operand" "=&w")
(unspec:MVE_0 [(match_operand:MVE_0 1 "s_register_operand" "w")]
 VREV64Q_F))
   ]
@@ -967,7 +969,7 @@ (define_insn "mve_vcvtq_to_f_"
 ;;
 (define_insn "mve_vrev64q_"
   [
-   (set (match_operand:MVE_2 0 "s_register_operand" "=w")
+   (set (match_operand:MVE_2 0 "s_register_operand" "=&w")
(unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")]
 VREV64Q))
   ]
@@ -1541,7 +1543,7 @@ (define_insn "mve_vbrsrq_n_"
 ;;
 (define_insn "mve_vcaddq_rot270_"
   [
-   (set (match_operand:MVE_2 0 "s_register_operand" "=w")
+   (set (match_operand:MVE_2 0 "s_register_operand" "<earlyclobber_32>")
(unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
   (match_operand:MVE_2 2 "s_register_operand" "w")]
 VCADDQ_ROT270))
@@ -1556,7 +1558,7 @@ (define_insn "mve_vcaddq_rot270_"
 ;;
 (define_insn "mve_vcaddq_rot90_"
   [
-   (set (match_operand:MVE_2 0 "s_register_operand" "=w")
+   (set (match_operand:MVE_2 0 "s_register_operand" "<earlyclobber_32>")
(unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
   (match_operand:MVE_2 2 "s_register_operand" "w")]
 VCADDQ_ROT90))
@@ -1841,7 +1843,7 @@ (define_insn "mve_vhaddq_"
 ;;
 (define_insn "mve_vhcaddq_rot270_s"
   [
-   (set (match_operand:MVE_2 0 "s_register_operand" "=w")
+   (set (match_operand:MVE_2 0 "s_register_operand" "<earlyclobber_32>")
(unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
   (match_operand:MVE_2 2 "s_register_operand" "w")]
 VHCADDQ_ROT270_S))
@@ -1856,7 +1858,7 @@ (define_insn "mve_vhcaddq_rot270_s"
 ;;
 (define_insn "mve_vhcaddq_rot90_s"
   [
-   (set (match_operand:MVE_2 0 "s_register_operand" "=w")
+   (set (match_operand:MVE_2 0 "s_register_operand" "<earlyclobber_32>")
(unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
   (match_operand:MVE_2 2 "s_register_operand" "w")]
 VHCADDQ_ROT90_S))
@@ -2096,7 +2098,7 @@ (define_insn "mve_vmulhq_"
 ;;
 (define_insn "mve_vmullbq_int_"
   [
-   (set (match_operand: 0 "s_register_operand" "=w")
+   (set (match_operand: 0 "s_register_operand" 
"")
(unspec: [(match_operand:MVE_2 1 "s_register_operand" 
"w")
  (match_operand:MVE_2 2 "s_register_operand" 
"w")]
 VMULLBQ_INT))
@@ -2111,7 +2113,7 @@ (define_insn "mve_vmullbq_int_"
 ;;
 (define_insn "mve_vmulltq_int_"
   [
-   (set (match_operand: 0 "s_register_operand" "=w")
+   (set (match_operand: 0 "s_register_operand" 
"")
(unspec: [(match_operand:MVE_2 1 "s_register_operand" 
"w")
  (match

[PATCH 0/2] arm: Enable assembling when testing MVE

2020-03-23 Thread Andre Vieira (lists)

Hi,

This patch series changes all MVE tests into assembly tests so we check whether 
the generated assembly is syntactically correct.  The first patch of the series 
fixes an issue this caught where the instructions don't allow destination and 
source registers to be the same.

Andre Vieira (2):
arm: Add earlyclobber to MVE instructions that require it
testsuite, arm: Change tests to assemble



[PATCH][GCC][Arm]: Revert changes to {get, set}_fpscr

2020-03-20 Thread Andre Vieira (lists)

Hi,

MVE made changes to {get,set}_fpscr to enable the compiler to optimize 
away unnecessary gets and sets when using these for intrinsics that use 
and/or write the carry bit.  However, these patterns actually get and 
set the full FPSCR register and are used by the fp env intrinsics to 
modify the fp context, so MVE should not be using them.
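
For context, a small sketch (assuming the __builtin_arm_{get,set}_fpscr 
builtins, which expand to these patterns) of the whole-register 
save/restore that the fp environment code relies on and that must not be 
optimized away:

/* Save and later restore the complete FPSCR, including exception flags
   and rounding mode, not just the carry bit MVE cares about.  */
unsigned int
save_fpscr (void)
{
  return __builtin_arm_get_fpscr ();
}

void
restore_fpscr (unsigned int fpscr)
{
  __builtin_arm_set_fpscr (fpscr);
}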

This fixes regressions for gcc.dg/atomic/c11-atomic-exec-5.c

Bootstrapped and tested arm-linux-gnueabihf.

Is this OK for trunk?

gcc/ChangeLog:
2020-03-20  Andre Vieira  

    * config/arm/unspecs.md (UNSPEC_GET_FPSCR): Rename this to ...
    (VUNSPEC_GET_FPSCR): ... this, and move it to vunspec.
    * config/arm/vfp.md: (get_fpscr, set_fpscr): Revert to old 
patterns.


diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 
e76609f79418af38b70746336dd43592a1dc8713..f0b1f465de4b63d624510783576700519044717d
 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -170,7 +170,6 @@ (define_c_enum "unspec" [
   UNSPEC_TORC  ; Used by the intrinsic form of the iWMMXt TORC 
instruction.
   UNSPEC_TORVSC; Used by the intrinsic form of the iWMMXt 
TORVSC instruction.
   UNSPEC_TEXTRC; Used by the intrinsic form of the iWMMXt 
TEXTRC instruction.
-  UNSPEC_GET_FPSCR ; Represent fetch of FPSCR content.
 ])
 
 
@@ -217,6 +216,7 @@ (define_c_enum "unspecv" [
   VUNSPEC_SLX  ; Represent a store-register-release-exclusive.
   VUNSPEC_LDA  ; Represent a store-register-acquire.
   VUNSPEC_STL  ; Represent a store-register-release.
+  VUNSPEC_GET_FPSCR; Represent fetch of FPSCR content.
   VUNSPEC_SET_FPSCR; Represent assign of FPSCR content.
   VUNSPEC_PROBE_STACK_RANGE ; Represent stack range probing.
   VUNSPEC_CDP  ; Represent the coprocessor cdp instruction.
diff --git a/gcc/config/arm/vfp.md b/gcc/config/arm/vfp.md
index 
eb6ae7bea7927c666f36219797d54c0127001bc1..dfb1031431af3ec87d9cccdee35db04e0adffe04
 100644
--- a/gcc/config/arm/vfp.md
+++ b/gcc/config/arm/vfp.md
@@ -2096,9 +2096,8 @@ (define_insn "3"
 
 ;; Write Floating-point Status and Control Register.
 (define_insn "set_fpscr"
-  [(set (reg:SI VFPCC_REGNUM)
-   (unspec_volatile:SI
-[(match_operand:SI 0 "register_operand" "r")] VUNSPEC_SET_FPSCR))]
+  [(unspec_volatile [(match_operand:SI 0 "register_operand" "r")]
+VUNSPEC_SET_FPSCR)]
   "TARGET_VFP_BASE"
   "mcr\\tp10, 7, %0, cr1, cr0, 0\\t @SET_FPSCR"
   [(set_attr "type" "mrs")])
@@ -2106,7 +2105,7 @@ (define_insn "set_fpscr"
 ;; Read Floating-point Status and Control Register.
 (define_insn "get_fpscr"
   [(set (match_operand:SI 0 "register_operand" "=r")
-   (unspec:SI [(reg:SI VFPCC_REGNUM)] UNSPEC_GET_FPSCR))]
+(unspec_volatile:SI [(const_int 0)] VUNSPEC_GET_FPSCR))]
   "TARGET_VFP_BASE"
   "mrc\\tp10, 7, %0, cr1, cr0, 0\\t @GET_FPSCR"
   [(set_attr "type" "mrs")])

