Re: [PATCH v3 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-01-30 Thread Andre Simoes Dias Vieira
Hi Richard,

Thanks for the reviews; I'm making these changes, but just a heads up.

When hardcoding LR_REGNUM like this we need to change the way we compare the 
register in doloop_condition_get. This function currently compares the rtx 
nodes by address, which I think happens to be fine before we assign hard 
registers, as I suspect we always share the rtx node for the same pseudo. When 
assigning hard registers, however, it seems we create copies, so a check like:
`XEXP (inc_src, 0) == reg` will fail for
inc_src: (plus (reg LR) (const_int -n))
reg: (reg LR)

Instead I will replace the '==' comparisons with calls to 'rtx_equal_p (op1, 
op2, NULL)'.
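
For reference, a minimal sketch of the kind of substitution I mean (context
inside doloop_condition_get assumed; the exact call form in the final patch
may differ):

  XEXP (inc_src, 0) == reg               /* pointer identity: breaks once
                                            hard regs are distinct copies  */
  rtx_equal_p (XEXP (inc_src, 0), reg)   /* structural equality instead    */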

Sound good?

Kind regards,
Andre


From: Richard Earnshaw (lists) 
Sent: Tuesday, January 30, 2024 11:36 AM
To: Andre Simoes Dias Vieira; gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov; Stam Markianos-Wright
Subject: Re: [PATCH v3 2/2] arm: Add support for MVE Tail-Predicated Low 
Overhead Loops

On 19/01/2024 14:40, Andre Vieira wrote:
>
> Respin after comments from Kyrill and rebase. I also removed an if-then-else
> construct in arm_mve_check_reg_origin_is_num_elems similar to the other 
> functions
> Kyrill pointed out.
>
> After an earlier comment from Richard Sandiford I also added comments to the
> two tail predication patterns added to explain the need for the unspecs.

[missing ChangeLog]

I'm just going to focus on loop-doloop.c in this reply, I'll respond to the 
other bits in a follow-up.

  2)  (set (reg) (plus (reg) (const_int -1))
- (set (pc) (if_then_else (reg != 0)
-(label_ref (label))
-(pc))).
+(set (pc) (if_then_else (reg != 0)
+(label_ref (label))
+(pc))).

  Some targets (ARM) do the comparison before the branch, as in the
  following form:

- 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-   (set (reg) (plus (reg) (const_int -1)))])
-(set (pc) (if_then_else (cc == NE)
...


This comment is becoming confusing.  Really the text leading up to 3)... should 
be inside 3.  Something like:

  3) Some targets (ARM) do the comparison before the branch, as in the
  following form:

  (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
 (set (reg) (plus (reg) (const_int -1)))])
  (set (pc) (if_then_else (cc == NE)
  (label_ref (label))
  (pc)))])


The same issue on the comment structure also applies to the new point 4...

+  The ARM target also supports a special case of a counter that decrements
+  by `n` and terminating in a GTU condition.  In that case, the compare and
+  branch are all part of one insn, containing an UNSPEC:
+
+  4) (parallel [
+   (set (pc)
+   (if_then_else (gtu (unspec:SI [(plus:SI (reg:SI 14 lr)
+   (const_int -n))])
+  (const_int n-1))
+   (label_ref)
+   (pc)))
+   (set (reg:SI 14 lr)
+(plus:SI (reg:SI 14 lr)
+ (const_int -n)))
+ */

I think this needs a bit more clarification.  Specifically that this construct 
supports a predicated vectorized do loop.  Also, the placement of the unspec 
inside the comparison is ugly and unnecessary.  It should be sufficient to 
have the unspec inside a USE expression, which the mid-end can then ignore 
entirely.  So

(parallel
 [(set (pc) (if_then_else (gtu (plus (reg) (const_int -n))
   (const_int n-1))
  (label_ref) (pc)))
  (set (reg) (plus (reg) (const_int -n)))
  (additional clobbers and uses)])

For Arm, we then add a (use (unspec [(const_int 0)] N)) that is specific to 
this pattern to stop anything else from matching it.
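
For illustration only (a sketch in the same notation as above, where N stands
for whatever arm-specific unspec number is chosen), the combined pattern would
then look roughly like:

(parallel
 [(set (pc) (if_then_else (gtu (plus (reg) (const_int -n))
                               (const_int n-1))
                          (label_ref) (pc)))
  (set (reg) (plus (reg) (const_int -n)))
  (use (unspec [(const_int 0)] N))
  (additional clobbers and uses)])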

Note that we don't need to mention that the register is 'LR' or the modes, 
those are specific to a particular backend, not the generic pattern we want to 
match.

+  || !CONST_INT_P (XEXP (inc_src, 1))
+  || INTVAL (XEXP (inc_src, 1)) >= 0)
 return 0;
+  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));

We can just use '-INTVAL (...)' here; we've verified just above that the 
constant is negative.
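
A minimal sketch of that simplification (context assumed from the hunk quoted
above):

  /* The constant was just checked to be negative, so negating it directly
     gives the positive decrement amount.  */
  int dec_num = -INTVAL (XEXP (inc_src, 1));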

-  if ((XEXP (condition, 0) == reg)
+  /* For the ARM special case of having a GTU: re-form the condition without
+ the unspec for the benefit of the middle-end.  */
+  if (GET_CODE (condition) == GTU)
+{
+  condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src,
+ GEN_INT (dec_num - 1));
+  return condition;
+}

If you make the change I mentioned above, this re-forming isn't needed any 
more, so the arm-specific comment goes away.

-   {
+{
  if (GET

Re: [r13-7135 Regression] FAIL: gcc.dg/vect/vect-simd-clone-18f.c scan-tree-dump-times vect "[\\n\\r] [^\\n]* = foo\\.simdclone" 2 on Linux/x86_64

2023-04-13 Thread Andre Simoes Dias Vieira via Gcc-patches
Hi,

@Andrew: Could you have a look at these? I had a quick look at 17f.c and it 
looks to me like the target selectors aren't specific enough. Unfortunately I 
am not familiar enough with target selectors (or targets for that matter) for 
x86_64. From what I could tell, with -m32 gcc decides to unroll the vectorized 
loop, so you end up with 4 simdclones rather than the 2 the test expects; GCC 
probably uses a different cost structure for -m32 that decides it is profitable 
to unroll?

As for -march=cascadelake, that seems to prevent gcc from using the inbranch 
simdclones altogether, so I suspect that cascadelake doesn't support these 
inbranch simdclones or the vector types it is trying to use.

Kind regards,
Andre


From: haochen.jiang 
Sent: Thursday, April 13, 2023 2:48 AM
To: Andre Simoes Dias Vieira; gcc-regress...@gcc.gnu.org; 
gcc-patches@gcc.gnu.org; haochen.ji...@intel.com
Subject: [r13-7135 Regression] FAIL: gcc.dg/vect/vect-simd-clone-18f.c 
scan-tree-dump-times vect "[\\n\\r] [^\\n]* = foo\\.simdclone" 2 on Linux/x86_64

On Linux/x86_64,

58c8c1b383bc3c286d6527fc6e8fb62463f9a877 is the first bad commit
commit 58c8c1b383bc3c286d6527fc6e8fb62463f9a877
Author: Andre Vieira 
Date:   Tue Apr 11 10:07:43 2023 +0100

if-conv: Restore MASK_CALL conversion [PR10]

caused

FAIL: gcc.dg/vect/vect-simd-clone-16e.c scan-tree-dump-times vect "[\\n\\r] 
[^\\n]* = foo\\.simdclone" 3
FAIL: gcc.dg/vect/vect-simd-clone-16f.c scan-tree-dump-times vect "[\\n\\r] 
[^\\n]* = foo\\.simdclone" 2
FAIL: gcc.dg/vect/vect-simd-clone-17e.c scan-tree-dump-times vect "[\\n\\r] 
[^\\n]* = foo\\.simdclone" 3
FAIL: gcc.dg/vect/vect-simd-clone-17f.c scan-tree-dump-times vect "[\\n\\r] 
[^\\n]* = foo\\.simdclone" 2
FAIL: gcc.dg/vect/vect-simd-clone-18e.c scan-tree-dump-times vect "[\\n\\r] 
[^\\n]* = foo\\.simdclone" 3
FAIL: gcc.dg/vect/vect-simd-clone-18f.c scan-tree-dump-times vect "[\\n\\r] 
[^\\n]* = foo\\.simdclone" 2

with GCC configured with

../../gcc/configure 
--prefix=/export/users/haochenj/src/gcc-bisect/master/master/r13-7135/usr 
--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld 
--with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet --without-isl 
--enable-libmpx x86_64-linux --disable-bootstrap

To reproduce:

$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-16e.c 
--target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-16e.c 
--target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-16f.c 
--target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-16f.c 
--target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-17e.c 
--target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-17e.c 
--target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-17f.c 
--target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-17f.c 
--target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-18e.c 
--target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-18e.c 
--target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-18f.c 
--target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-18f.c 
--target_board='unix{-m32\ -march=cascadelake}'"

(Please do not reply to this email, for question about this report, contact me 
at haochen dot jiang at intel.com)


[PATCH] arm: Make MVE masked stores read memory operand [PR 108177]

2023-01-13 Thread Andre Simoes Dias Vieira via Gcc-patches
Hi,

This patch adds the memory operands of MVE masked stores as input operands to
mimic their 'partial' writes, preventing the erroneous write-after-write
optimizations described in the PR.
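
For illustration, a hypothetical example of the hazard (not the PR's actual
testcase; intrinsic names as in arm_mve.h):

#include <arm_mve.h>

/* The first, unpredicated store must not be deleted as dead: the second,
   predicated store only overwrites the lanes enabled by p, so the inactive
   lanes must keep the values written by the first store.  */
void
f (uint32_t *addr, uint32x4_t a, uint32x4_t b, mve_pred16_t p)
{
  vstrwq_u32 (addr, a);
  vstrwq_p_u32 (addr, b, p);
}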

Regression tested on arm-none-eabi for armv8.1-m.main+mve.fp. 

OK for trunk?

gcc/ChangeLog:

PR target/108177
* config/arm/mve.md (mve_vstrbq_p_, mve_vstrhq_p_fv8hf,
mve_vstrhq_p_, mve_vstrwq_p_v4si): Add memory operand
as input operand.

gcc/testsuite/ChangeLog:

*   gcc.target/arm/mve/pr108177-1-run.c: New test.
*   gcc.target/arm/mve/pr108177-1.c: New test.
*   gcc.target/arm/mve/pr108177-10-run.c: New test.
*   gcc.target/arm/mve/pr108177-10.c: New test.
*   gcc.target/arm/mve/pr108177-11-run.c: New test.
*   gcc.target/arm/mve/pr108177-11.c: New test.
*   gcc.target/arm/mve/pr108177-12-run.c: New test.
*   gcc.target/arm/mve/pr108177-12.c: New test.
*   gcc.target/arm/mve/pr108177-13-run.c: New test.
*   gcc.target/arm/mve/pr108177-13.c: New test.
*   gcc.target/arm/mve/pr108177-14-run.c: New test.
*   gcc.target/arm/mve/pr108177-14.c: New test.
*   gcc.target/arm/mve/pr108177-2-run.c: New test.
*   gcc.target/arm/mve/pr108177-2.c: New test.
*   gcc.target/arm/mve/pr108177-3-run.c: New test.
*   gcc.target/arm/mve/pr108177-3.c: New test.
*   gcc.target/arm/mve/pr108177-4-run.c: New test.
*   gcc.target/arm/mve/pr108177-4.c: New test.
*   gcc.target/arm/mve/pr108177-5-run.c: New test.
*   gcc.target/arm/mve/pr108177-5.c: New test.
*   gcc.target/arm/mve/pr108177-6-run.c: New test.
*   gcc.target/arm/mve/pr108177-6.c: New test.
*   gcc.target/arm/mve/pr108177-7-run.c: New test.
*   gcc.target/arm/mve/pr108177-7.c: New test.
*   gcc.target/arm/mve/pr108177-8-run.c: New test.
*   gcc.target/arm/mve/pr108177-8.c: New test.
*   gcc.target/arm/mve/pr108177-9-run.c: New test.
*   gcc.target/arm/mve/pr108177-9.c: New test.
*   gcc.target/arm/mve/pr108177-main.x: New test include.
*   gcc.target/arm/mve/pr108177.x: New test include.


pr108177.patch
Description: pr108177.patch


[PATCH] arm: Fix MVE's vcmp vector-scalar patterns [PR107987]

2022-12-06 Thread Andre Simoes Dias Vieira via Gcc-patches
Hi,

This patch surrounds the scalar operand of the MVE vcmp patterns with a
vec_duplicate to ensure both operands of the comparison operator have the same
(vector) mode.
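
As an illustration (a hypothetical sketch, not the PR's testcase), the kind of
vector-by-scalar comparison that goes through these patterns is:

#include <arm_mve.h>

/* The scalar operand b is broadcast so both sides of the comparison have
   vector mode.  */
mve_pred16_t
f (uint32x4_t a, uint32_t b)
{
  return vcmpeqq_n_u32 (a, b);
}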

Regression tested on arm-none-eabi. Is this OK for trunk? And a backport to GCC 
12?

gcc/ChangeLog:

PR target/107987
* config/arm/mve.md (mve_vcmpq_n_,
@mve_vcmpq_n_f): Apply vec_duplicate to scalar
operand.

gcc/testsuite/ChangeLog:

* gcc/testsuite/gcc.target/arm/mve/pr107987.c: New test.


pr107987.patch
Description: pr107987.patch


Re: [PATCH v3 06/15] arm: Fix mve_vmvnq_n_ argument mode

2022-01-20 Thread Andre Simoes Dias Vieira via Gcc-patches



On 20/01/2022 09:23, Christophe Lyon wrote:



On Wed, Jan 19, 2022 at 8:03 PM Andre Vieira (lists) via Gcc-patches 
 wrote:



On 13/01/2022 14:56, Christophe Lyon via Gcc-patches wrote:
> The vmvnq_n* intrinsics have [u]int[16|32]_t arguments, so use
>  iterator instead of HI in mve_vmvnq_n_.
>
> 2022-01-13  Christophe Lyon  
>
>       gcc/
>       * config/arm/mve.md (mve_vmvnq_n_): Use V_elem
mode
>       for operand 1.
>
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index 171dd384133..5c3b34dce3a 100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -617,7 +617,7 @@ (define_insn "mve_vcvtaq_"
>   (define_insn "mve_vmvnq_n_"
>     [
>      (set (match_operand:MVE_5 0 "s_register_operand" "=w")
> -     (unspec:MVE_5 [(match_operand:HI 1 "immediate_operand" "i")]
> +     (unspec:MVE_5 [(match_operand: 1
"immediate_operand" "i")]
>        VMVNQ_N))
>     ]
>     "TARGET_HAVE_MVE"

While fixing this it might be good to fix the constraint and predicate,
inspired by "DL" and "neon_inv_logic_op2" respectively. This would avoid
the compiler generating wrong assembly, and instead it would probably
lead to the compiler using a load literal.

I kind of think it would be better to have the intrinsic refuse the
immediate altogether, but it seems for NEON we also use the load literal
approach.


Ha, I thought that patch had been approved at v2 too: 
https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581344.html



Yeah sorry I had not looked at the previous version of these series!

I can put together a follow-up for this then.


Re: [AArch64] Enable generation of FRINTNZ instructions

2021-11-12 Thread Andre Simoes Dias Vieira via Gcc-patches



On 12/11/2021 10:56, Richard Biener wrote:

On Thu, 11 Nov 2021, Andre Vieira (lists) wrote:


Hi,

This patch introduces two IFN's FTRUNC32 and FTRUNC64, the corresponding
optabs and mappings. It also creates a backend pattern to implement them for
aarch64 and a match.pd pattern to idiom recognize these.
These IFN's (and optabs) represent a truncation towards zero, as if performed
by first casting it to a signed integer of 32 or 64 bits and then back to the
same floating point type/mode.
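
For illustration, a sketch (assumed, not taken from the patch) of the idiom
being described, which the new match.pd pattern would recognize:

/* Truncate towards zero via a round trip through a 32-bit signed integer;
   FTRUNC32 represents doing this in a single operation.  */
double
trunc_via_int32 (double x)
{
  return (double) (int) x;
}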

The match.pd pattern chooses to use these, when supported, regardless of
trapping math, since these new patterns mimic the original behavior of
truncating through an integer.

I didn't think any of the existing IFN's represented these. I know it's a bit
late in stage 1, but I thought this might be OK given it's only used by a
single target and should have very little impact on anything else.

Bootstrapped on aarch64-none-linux.

OK for trunk?

On the RTL side ftrunc32/ftrunc64 would probably be better a conversion
optab (with two modes), so not

+OPTAB_D (ftrunc32_optab, "ftrunc$asi2")
+OPTAB_D (ftrunc64_optab, "ftrunc$adi2")

but

OPTAB_CD (ftrunc_shrt_optab, "ftrunc$a$I$b2")

or so?  I know that gets somewhat awkward for the internal function,
but IMHO we shouldn't tie our hands because of that?
I tried doing this originally, but indeed I couldn't find a way to 
correctly tie the internal function to it.


direct_optab_supported_p with multiple types expects those to be of the 
same mode. I see convert_optab_supported_p does, but I don't know how 
that is used...


Any ideas?



Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-06-08 Thread Andre Simoes Dias Vieira via Gcc-patches

Hi Bin,

Thank you for the reply, I have some questions, see below.

On 07/06/2021 12:28, Bin.Cheng wrote:

On Fri, Jun 4, 2021 at 12:35 AM Andre Vieira (lists) via Gcc-patches
 wrote:

Hi Andre,
I didn't look into the details of the IV sharing RFC.  It seems to me
costing outside uses is trying to generate better code for later code
(epilogue loop here).  The only problem is IVOPTs doesn't know that
the outside use is not in the final form - which will be transformed
by IVOPTs again.

I think this example is not good at describing your problem because it
shows exactly that considering outside use results in better code,
compared to the other two approaches.
I don't quite understand what you are saying here :( What do you mean by 
final form? It seems to me that costing uses inside and outside loop the 
same way is wrong because calculating the IV inside the loop has to be 
done every iteration, whereas if you can resolve it to a single update 
(without an IV) then you can sink it outside the loop. This is why I 
think this example shows why we need to cost these uses differently.
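
To illustrate what I mean by sinking, here is a hypothetical sketch (not taken
from the RFC): the returned pointer is the only outside use of the IV, so it
can be computed once after the loop instead of keeping a separate pointer IV
updated every iteration.

int *
f (int *p, int n)
{
  int i;
  for (i = 0; i < n; i++)
    p[i] = 0;
  return p + i;  /* outside use: equivalent to p + n after the loop */
}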

2) Is there a cleaner way to generate the optimal 'post-increment' use
for the outside-use variable? I first thought the position in the
candidate might be something I could use or even the var_at_stmt
functionality, but the outside IV has the actual increment of the
variable as it's use, rather than the outside uses. This is this RFC's
main weakness I find.

To answer why IVOPTs behaves like this w/o your two patches.  The main
problem is the point IVOPTs rewrites outside use IV - I don't remember
the exact point - but looks like at the end of loop while before
incrementing instruction of main IV.  It's a known issue that outside
use should be costed/re-written on the exit edge along which its value
flows out of loop.  I had a patch a long time ago but discarded it,
because it didn't bring obvious improvement and is complicated in case
of multi-exit edges.
Yeah I haven't looked at multi-exit edges and I understand that 
complicates things. But for now we could disable the special casing of 
outside uses when dealing with multi-exit loops and keep the current 
behavior.


But in general, I am less convinced that any of the two patches is the
right direction solving IV sharing issue between vectorized loop and
epilogue loop.  I would need to read the previous RFC before giving
further comments though.


The previous RFC still has a lot of unanswered questions too, but 
regardless of that, take the following (non-vectorizer) example:


#include <arm_sve.h>
#include <stdint.h>

void bar (char * __restrict__ a, char * __restrict__ b,
          char * __restrict__ c, unsigned long long n)
{
  svbool_t all_true = svptrue_b8 ();
  unsigned long long i = 0;
  for (; i < (n & ~(svcntb () - 1)); i += svcntb ()) {
      svuint8_t va = svld1 (all_true, (uint8_t *) a);
      svuint8_t vb = svld1 (all_true, (uint8_t *) b);
      svst1 (all_true, (uint8_t *) c, svadd_z (all_true, va, vb));
      a += svcntb ();
      b += svcntb ();
      c += svcntb ();
  }
  svbool_t pred;
  for (; i < n; i += svcntb ()) {
      pred = svwhilelt_b8 (i, n);
      svuint8_t va = svld1 (pred, (uint8_t *) a);
      svuint8_t vb = svld1 (pred, (uint8_t *) b);
      svst1 (pred, (uint8_t *) c, svadd_z (pred, va, vb));
      a += svcntb ();
      b += svcntb ();
      c += svcntb ();
  }
}


Current IVOPTs will use 4 iterators for the first loop, when it could do 
with just 1. In fact, if you use my patches it will create just a single 
IV and sink the uses and it is then able to merge them with loads & 
stores of the next loop.


I am not saying setting outside costs to 0 is the right thing to do by 
the way. It is absolutely not! It will break cost considerations for 
other cases. Like I said above, I've been playing around with using 
'!use->outside' as a multiplier for the cost. Unfortunately it won't 
help with the case above, because this seems to choose 'infinite_cost' 
because the candidate IV has a lower precision than the use IV. I don't 
quite understand yet how candidates are created, but that is something I'm 
going to try to look at. I just wanted to show this as an example of how 
IVOPTs would not improve code with multiple loops that don't involve the 
vectorizer.


BR,
Andre




Thanks,
bin


Re: [PATCH 1/7] arm: Auto-vectorization for MVE: vand

2020-11-25 Thread Andre Simoes Dias Vieira via Gcc-patches

Hi Christophe,

Thanks for these! See some inline comments.

On 25/11/2020 13:54, Christophe Lyon via Gcc-patches wrote:

This patch enables MVE vandq instructions for auto-vectorization.  MVE
vandq insns in mve.md are modified to use 'and' instead of unspec
expression to support and3.  The and3 expander is added to
vec-common.md

2020-11-12  Christophe Lyon  

gcc/
* gcc/config/arm/iterators.md (supf): Remove VANDQ_S and VANDQ_U.
(VANQ): Remove.
* config/arm/mve.md (mve_vandq_u): New entry for vand
instruction using expression and.
(mve_vandq_s): New expander.
* config/arm/neon.md (and3): Renamed into and3_neon.
* config/arm/unspecs.md (VANDQ_S, VANDQ_U): Remove.
* config/arm/vec-common.md (and3): New expander.

gcc/testsuite/
* gcc.target/arm/simd/mve-vand.c: New test.
---
  gcc/config/arm/iterators.md  |  4 +---
  gcc/config/arm/mve.md| 20 
  gcc/config/arm/neon.md   |  2 +-
  gcc/config/arm/unspecs.md|  2 --
  gcc/config/arm/vec-common.md | 15 
  gcc/testsuite/gcc.target/arm/simd/mve-vand.c | 34 
  6 files changed, 66 insertions(+), 11 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vand.c

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 592af35..72039e4 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -1232,8 +1232,7 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") 
(VREV16Q_S "s")
   (VADDLVQ_P_U "u") (VCMPNEQ_U "u") (VCMPNEQ_S "s")
   (VABDQ_M_S "s") (VABDQ_M_U "u") (VABDQ_S "s")
   (VABDQ_U "u") (VADDQ_N_S "s") (VADDQ_N_U "u")
-  (VADDVQ_P_S "s")   (VADDVQ_P_U "u") (VANDQ_S "s")
-  (VANDQ_U "u") (VBICQ_S "s") (VBICQ_U "u")
+  (VADDVQ_P_S "s")   (VADDVQ_P_U "u") (VBICQ_S "s") (VBICQ_U 
"u")
   (VBRSRQ_N_S "s") (VBRSRQ_N_U "u") (VCADDQ_ROT270_S "s")
   (VCADDQ_ROT270_U "u") (VCADDQ_ROT90_S "s")
   (VCMPEQQ_S "s") (VCMPEQQ_U "u") (VCADDQ_ROT90_U "u")
@@ -1501,7 +1500,6 @@ (define_int_iterator VABDQ [VABDQ_S VABDQ_U])
  (define_int_iterator VADDQ_N [VADDQ_N_S VADDQ_N_U])
  (define_int_iterator VADDVAQ [VADDVAQ_S VADDVAQ_U])
  (define_int_iterator VADDVQ_P [VADDVQ_P_U VADDVQ_P_S])
-(define_int_iterator VANDQ [VANDQ_U VANDQ_S])
  (define_int_iterator VBICQ [VBICQ_S VBICQ_U])
  (define_int_iterator VBRSRQ_N [VBRSRQ_N_U VBRSRQ_N_S])
  (define_int_iterator VCADDQ_ROT270 [VCADDQ_ROT270_S VCADDQ_ROT270_U])
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index ecbaaa9..975eb7d 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -894,17 +894,27 @@ (define_insn "mve_vaddvq_p_"
  ;;
  ;; [vandq_u, vandq_s])
  ;;
-(define_insn "mve_vandq_"
+;; signed and unsigned versions are the same: define the unsigned
+;; insn, and use an expander for the signed one as we still reference
+;; both names from arm_mve.h.
+(define_insn "mve_vandq_u"
[
 (set (match_operand:MVE_2 0 "s_register_operand" "=w")
-   (unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
-  (match_operand:MVE_2 2 "s_register_operand" "w")]
-VANDQ))
+   (and:MVE_2 (match_operand:MVE_2 1 "s_register_operand" "w")
+  (match_operand:MVE_2 2 "s_register_operand" "w")))
The predicate on the second operand is more restrictive than the one in the 
expander, 'neon_inv_logic_op2'. This should still work with immediates (well, 
I checked for integers), but it generates a loop such as:


    vldrw.32    q3, [r0]
    vldr.64 d4, .L8
    vldr.64 d5, .L8+8
    vand    q3, q3, q2
    vstrw.32    q3, [r2]

MVE does support vand with immediates, just like NEON; I suspect you 
could just copy the way NEON handles these, which is probably worth the little 
extra effort. You can use dest[i] = a[i] & ~1 as a testcase, see the sketch 
below.
If you don't, it might still be worth expanding the test to make sure 
other immediate/type combinations don't trigger an ICE?
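
A sketch of that suggested testcase shape (element type and loop form assumed):

/* The constant mask should let the compiler use a vand/vbic with an
   immediate rather than loading the mask from memory.  */
void
f (int *dest, int *a, int n)
{
  for (int i = 0; i < n; i++)
    dest[i] = a[i] & ~1;
}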


I'm not sure I understand why it loads it in two 64-bit chunks rather than 
doing a single load, or just using something like a vmov or vbic immediate. 
Anyhow, that's a worry for another day I guess...

]
"TARGET_HAVE_MVE"
-  "vand %q0, %q1, %q2"
+  "vand\t%q0, %q1, %q2"
[(set_attr "type" "mve_move")
  ])
+(define_expand "mve_vandq_s"
+  [
+   (set (match_operand:MVE_2 0 "s_register_operand")
+   (and:MVE_2 (match_operand:MVE_2 1 "s_register_operand")
+  (match_operand:MVE_2 2 "s_register_operand")))
+  ]
+  "TARGET_HAVE_MVE"
+)
  
  ;;

  ;; [vbicq_s, vbicq_u])
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 2d76769..dc4707d 100644
--- a/gcc/config/arm/neon.md
+++ 

Re: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved registers with CMSE

2020-07-08 Thread Andre Simoes Dias Vieira



On 07/07/2020 13:43, Christophe Lyon wrote:

Hi,


On Mon, 6 Jul 2020 at 16:31, Andre Vieira (lists)
 wrote:


On 30/06/2020 14:50, Andre Vieira (lists) wrote:

On 29/06/2020 11:15, Christophe Lyon wrote:

On Mon, 29 Jun 2020 at 10:56, Andre Vieira (lists)
 wrote:

On 23/06/2020 21:52, Christophe Lyon wrote:

On Tue, 23 Jun 2020 at 15:28, Andre Vieira (lists)
 wrote:

On 23/06/2020 13:10, Kyrylo Tkachov wrote:

-Original Message-
From: Andre Vieira (lists) 
Sent: 22 June 2020 09:52
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov 
Subject: [PATCH][GCC][Arm] PR target/95646: Do not clobber
callee saved
registers with CMSE

Hi,

As reported in bugzilla, when the -mcmse option is used while compiling
for size (-Os) with a Thumb-1 target, the generated code will clear the
registers r7-r10. These however are callee saved and should be preserved
across ABI boundaries. The reason this happens is that these registers
are made "fixed" when optimising for size with Thumb-1, in a way to make
sure they are not used, as pushing and popping hi-registers requires
extra moves to and from LO_REGS.
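
For illustration, a hypothetical entry function showing the kind of code
affected (not the testcase from the PR; the attribute is the one enabled by
-mcmse):

/* Compiled with -Os -mcmse for a Thumb-1 target, the epilogue of this
   function was clearing r7-r10 even though they are callee-saved and must
   survive across the secure/non-secure boundary.  */
int __attribute__ ((cmse_nonsecure_entry))
foo (int a)
{
  return a + 1;
}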

To fix this, this patch uses 'callee_saved_reg_p', which accounts for
this optimisation, instead of 'call_used_or_fixed_reg_p'. Be aware of
'callee_saved_reg_p''s definition, as it does still take call used
registers into account, which aren't callee saved in my opinion, so it
is rather a misnomer; it works to our advantage here though, as it does
exactly what we need.

Regression tested on arm-none-eabi.

Is this OK for trunk? (Will eventually backport to previous
versions if
stable.)

Ok.
Thanks,
Kyrill

As I was getting ready to push this I noticed I didn't add any
skip-ifs
to prevent this failing with specific target options. So here's a new
version with those.

Still OK?


Hi,

This is not sufficient to skip arm-linux-gnueabi* configs built with
non-default cpu/fpu.

For instance, with arm-linux-gnueabihf --with-cpu=cortex-a9
--with-fpu=neon-fp16 --with-float=hard
I see:
FAIL: gcc.target/arm/pr95646.c (test for excess errors)
Excess errors:
cc1: error: ARMv8-M Security Extensions incompatible with selected FPU
cc1: error: target CPU does not support ARM mode

and the testcase is compiled with -mcpu=cortex-m23 -mcmse -Os

Resending as I don't think my earlier one made it to the lists (sorry if
you are receiving this double!)

I'm not following this; before I go off and try to reproduce it, what do
you mean by 'the testcase is compiled with -mcpu=cortex-m23 -mcmse -Os'?
These are the options you are seeing in the log file? Surely they should
override the default options? The only thing I can think of is this might
need an extra -mfloat-abi=soft to make sure it overrides the default
float-abi.  Could you give that a try?

No it doesn't make a difference alone.

I also had to add:
-mfpu=auto (that clears the above warning)
-mthumb otherwise we now get cc1: error: target CPU does not support
ARM mode

Looks like some effective-target machinery is needed

So I had a look at this. I was pretty sure that -mfloat-abi=soft
overrode -mfpu=<>, which by and large it does, in that no FP instructions
will be generated; but the error you see only checks for the right
number of FP registers, without checking whether 'TARGET_HARD_FLOAT' is
set or not. I'll fix this too and use the check-effective-target for
armv8-m.base for this test, as it is indeed a better approach than my bag
of skip-ifs. I'm testing it locally to make sure my changes don't break
anything.

Cheers,
Andre

Hi,

Sorry for the delay. So I changed the test to use the effective-target
machinery as you suggested and I also made sure that you don't get the
"ARMv8-M Security Extensions incompatible with selected FPU" when
-mfloat-abi=soft.
Further changed 'asm' to '__asm__' to avoid failures with '-std=' options.

Regression tested on arm-none-eabi.
@Christophe: could you test this for your configuration, shouldn't fail
anymore!


Indeed with your patch I don't see any failure with pr95646.c

Note that it is still unsupported with arm-eabi when running the tests
with -mcpu=cortex-mXX
because the compiler complains that -mcpu=cortex-mXX conflicts with
-march=armv8-m.base,
thus the effective-target test fails.

BTW, is that warning useful/practical? Wouldn't it be more convenient
if the last -mcpu/-march
on the command line was the only one taken into account? (I had a
similar issue when
running tests (libstdc++) getting -march=armv8-m.main+fp from their
multilib environment
and forcing -mcpu=cortex-m33 because it also means '+dsp' and produces
a warning;
I had to use -mcpu=cortex-m33 -march=armv8-m.main+fp+dsp to work around this)
Yeah I've been annoyed by that before, also in the context of testing 
multilibs.


I can see how it can be a useful warning though, if you are
using these in build systems and you accidentally introduce a new
(incompatible) -mcpu/-march alongside the old one. Though it seems
arbitrary, as we do not warn against multiple -mcpu's or 

Re: [AArch64][PATCH 1/2] Fix addressing printing of LDP/STP

2018-06-25 Thread Andre Simoes Dias Vieira
On 18/06/18 09:08, Andre Simoes Dias Vieira wrote:
> Hi Richard,
> 
> Sorry for the delay, I have been on holidays.  I had a look and I think you 
> are right.  With these changes Umq and Uml seem to have the same 
> functionality though, so I would suggest using only one.  Maybe use a 
> different name for both, removing both Umq and Uml in favour of Umn, where 
> the n indicates it narrows the addressing mode.  How does that sound to you?
> 
> I also had a look at Ump, but that one is used in the parallel pattern for 
> STP/LDP which does not use this "narrowing". So we should leave that one as 
> is.
> 
> Cheers,
> Andre
> 
> 
> From: Richard Sandiford 
> Sent: Thursday, June 14, 2018 12:28:16 PM
> To: Andre Simoes Dias Vieira
> Cc: gcc-patches@gcc.gnu.org; nd
> Subject: Re: [AArch64][PATCH 1/2] Fix addressing printing of LDP/STP
> 
> Andre Simoes Dias Vieira  writes:
>> @@ -5716,10 +5717,17 @@ aarch64_classify_address (struct 
>> aarch64_address_info *info,
>>unsigned int vec_flags = aarch64_classify_vector_mode (mode);
>>bool advsimd_struct_p = (vec_flags == (VEC_ADVSIMD | VEC_STRUCT));
>>bool load_store_pair_p = (type == ADDR_QUERY_LDP_STP
>> + || type == ADDR_QUERY_LDP_STP_N
>>   || mode == TImode
>>   || mode == TFmode
>>   || (BYTES_BIG_ENDIAN && advsimd_struct_p));
>>
>> +  /* If we are dealing with ADDR_QUERY_LDP_STP_N that means the incoming 
>> mode
>> + corresponds to the actual size of the memory being loaded/stored and 
>> the
>> + mode of the corresponding addressing mode is half of that.  */
>> +  if (type == ADDR_QUERY_LDP_STP_N && known_eq (GET_MODE_SIZE (mode), 16))
>> +mode = DFmode;
>> +
>>bool allow_reg_index_p = (!load_store_pair_p
>>   && (known_lt (GET_MODE_SIZE (mode), 16)
>>   || vec_flags == VEC_ADVSIMD
> 
> I don't know whether it matters in practice, but that description also
> applies to Umq, not just Uml.  It might be worth changing it too so
> that things stay consistent.
> 
> Thanks,
> Richard
> 
Hi all,

This is a reworked patch, replacing Umq and Uml with Umn now.

Bootstrapped and tested on aarch64-none-linux-gnu.

Is this OK for trunk?

gcc
2018-06-25  Andre Vieira  

* config/aarch64/aarch64-simd.md (aarch64_simd_mov):
Replace
Umq with Umn.
(store_pair_lanes): Likewise.
* config/aarch64/aarch64-protos.h (aarch64_addr_query_type): Add new
enum value 'ADDR_QUERY_LDP_STP_N'.
* config/aarch64/aarch64.c (aarch64_addr_query_type): Likewise.
(aarch64_print_address_internal): Add declaration.
(aarch64_print_ldpstp_address): Remove.
(aarch64_classify_address): Adapt mode for 'ADDR_QUERY_LDP_STP_N'.
(aarch64_print_operand): Change printing of 'y'.
* config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): Use
new enum value 'ADDR_QUERY_LDP_STP_N', don't hardcode mode and use
'true' rather than '1'.
* gcc/config/aarch64/constraints.md (Uml): Likewise.
(Uml): Rename to Umn.
(Umq): Remove.
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 4ea50acaa59c0b58a213bd1f27fb78b6d8deee96..c03a442107815eed44a3b6bceb386d78a6615483 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -120,6 +120,10 @@ enum aarch64_symbol_type
ADDR_QUERY_LDP_STP
   Query what is valid for a load/store pair.
 
+   ADDR_QUERY_LDP_STP_N
+  Query what is valid for a load/store pair, but narrow the incoming mode
+  for address checking.  This is used for the store_pair_lanes patterns.
+
ADDR_QUERY_ANY
   Query what is valid for at least one memory constraint, which may
   allow things that "m" doesn't.  For example, the SVE LDR and STR
@@ -128,6 +132,7 @@ enum aarch64_symbol_type
 enum aarch64_addr_query_type {
   ADDR_QUERY_M,
   ADDR_QUERY_LDP_STP,
+  ADDR_QUERY_LDP_STP_N,
   ADDR_QUERY_ANY
 };
 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 4e5c42b0f15b863f3088dba4aac450f31ca309bb..7936581947b360a6d6a88ce6523bbb804c3eb89c 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -131,7 +131,7 @@
 
 (define_insn "*aarch64_simd_mov"
   [(set (match_operand:VQ 0 "nonimmediate_operand"
-		"=w, Umq,  m,  w, ?r, ?w, ?r, w")
+		"=w, Umn,  m,  w, ?r, ?w, ?r, w")
 	(match_operand:VQ 1 "general_operand"
 		"m,  Dz, w,  w,  w,  r,  r, Dn"))]
   "TARGET_SIMD
@@ -30

Re: [AArch64][PATCH 2/2] PR target/83009: Relax strict address checking for store pair lanes

2018-06-25 Thread Andre Simoes Dias Vieira
On 14/06/18 12:47, Richard Sandiford wrote:
> Kyrill  Tkachov  writes:
>> Hi Andre,
>> On 07/06/18 18:02, Andre Simoes Dias Vieira wrote:
>>> Hi,
>>>
>>> See below a patch to address PR 83009.
>>>
>>> Tested with aarch64-linux-gnu bootstrap and regtests for c, c++ and fortran.
>>> Ran the adjusted testcase for -mabi=ilp32.
>>>
>>> Is this OK for trunk?
>>>
>>> Cheers,
>>> Andre
>>>
>>> PR target/83009: Relax strict address checking for store pair lanes
>>>
>>> The operand constraint for the memory address of store/load pair lanes was
>>> enforcing strictly hardware registers be allowed as memory addresses.  We 
>>> want
>>> to relax that such that these patterns can be used by combine, prior
>>> to reload.
>>> During register allocation the register constraint will enforce the correct
>>> register is chosen.
>>>
>>> gcc
>>> 2018-06-07  Andre Vieira 
>>>
>>> PR target/83009
>>> * config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): 
>>> Make
>>> address check not strict prior to reload.
>>>
>>> gcc/testsuite
>>> 2018-06-07 Andre Vieira 
>>>
>>> PR target/83009
>>> * gcc/target/aarch64/store_v2vec_lanes.c: Add extra tests.
>>
>> diff --git a/gcc/config/aarch64/predicates.md 
>> b/gcc/config/aarch64/predicates.md
>> index 
>> f0917af8b4cec945ba4e38e4dc670200f8812983..30aa88838671bf343a883077c2b606a035c030dd
>>  100644
>> --- a/gcc/config/aarch64/predicates.md
>> +++ b/gcc/config/aarch64/predicates.md
>> @@ -227,7 +227,7 @@
>>   (define_predicate "aarch64_mem_pair_lanes_operand"
>> (and (match_code "mem")
>>  (match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP (op, 
>> 0),
>> -  true,
>> +  reload_completed,
>>ADDR_QUERY_LDP_STP_N)")))
>>   
>>
>> If you want to enforce strict checking during reload and later then I think 
>> you need to use reload_in_progress || reload_completed ?
> 
> That was the old way, but it would be lra_in_progress now.
> However...
> 
>> I guess that would be equivalent to !can_create_pseudo ().
> 
> We should never see pseudos when reload_completed, so the choice
> shouldn't really matter then.  And I don't think we should use
> lra_in_progress either, since that would make the checks stricter
> before RA has actually happened, which would likely lead to an
> unrecognisable insn ICE if recog is called during one of the LRA
> subpasses.
> 
> So unless we know a reason otherwise, I think this should just
> be "false" (like it already is for aarch64_mem_pair_operand).
> 
> Thanks,
> Richard
> 
Changed it to false.

Bootstrapped and regression tested on aarch64-none-linux-gnu.

Is this OK for trunk?

Cheers,
Andre

gcc
2018-06-25  Andre Vieira  

PR target/83009
* config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand):
Make
address check not strict.

gcc/testsuite
2018-06-25  Andre Vieira  

PR target/83009
* gcc/target/aarch64/store_v2vec_lanes.c: Add extra tests.
diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
index f0917af8b4cec945ba4e38e4dc670200f8812983..e1a022b5c5a371230c71cd1dd944f5b0d4f4dc4c 100644
--- a/gcc/config/aarch64/predicates.md
+++ b/gcc/config/aarch64/predicates.md
@@ -227,7 +227,7 @@
 (define_predicate "aarch64_mem_pair_lanes_operand"
   (and (match_code "mem")
(match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP (op, 0),
-		  true,
+		  false,
 		  ADDR_QUERY_LDP_STP_N)")))
 
 (define_predicate "aarch64_prefetch_operand"
diff --git a/gcc/testsuite/gcc.target/aarch64/store_v2vec_lanes.c b/gcc/testsuite/gcc.target/aarch64/store_v2vec_lanes.c
index 990aea32de6f8239effa95a081950684c6e11386..3296d04da14149d26d19da785663b87bd5ad8994 100644
--- a/gcc/testsuite/gcc.target/aarch64/store_v2vec_lanes.c
+++ b/gcc/testsuite/gcc.target/aarch64/store_v2vec_lanes.c
@@ -22,10 +22,32 @@ construct_lane_2 (long long *y, v2di *z)
   z[2] = x;
 }
 
+void
+construct_lane_3 (double **py, v2df **pz)
+{
+  double *y = *py;
+  v2df *z = *pz;
+  double y0 = y[0] + 1;
+  double y1 = y[1] + 2;
+  v2df x = {y0, y1};
+  z[2] = x;
+}
+
+void
+construct_lane_4 (long long **py, v2di **pz)
+{
+  long long *y = *py;
+  v2di *z = *pz;
+  long long y0 = y[0] + 1;
+  long long y1 = y[1] + 2;
+  v2di x = {y0, y1};

Re: [AArch64][PATCH 1/2] Make AES unspecs commutative

2018-06-21 Thread Andre Simoes Dias Vieira
Hey Kyrill,

I think it should be possible, I'll have a quick look.

Cheers,
Andre


From: Kyrill Tkachov 
Sent: Wednesday, June 20, 2018 9:32:50 AM
To: Andre Simoes Dias Vieira; gcc-patches@gcc.gnu.org
Cc: nd; James Greenhalgh; Richard Earnshaw; Marcus Shawcroft
Subject: Re: [AArch64][PATCH 1/2] Make AES unspecs commutative

Hi Andre,

On 18/06/18 10:38, Andre Simoes Dias Vieira wrote:
> Hi,
>
> This patch teaches the AArch64 backend that the AESE and AESD unspecs are 
> commutative (which correspond to the vaeseq_u8 and vaesdq_u8 intrinsics). 
> This improves register allocation around their corresponding instructions 
> avoiding unnecessary moves.
>
> For instance, with the old patterns code such as:
>
> uint8x16_t
> test0 (uint8x16_t a, uint8x16_t b)
> {
>   uint8x16_t result;
>   result = vaeseq_u8 (a, b);
>   result = vaeseq_u8 (result, a);
>   return result;
> }
>
> would lead to suboptimal register allocation such as:
> test0:
> mov v2.16b, v0.16b
> aese v2.16b, v1.16b
> mov v1.16b, v2.16b
> aese v1.16b, v0.16b
> mov v0.16b, v1.16b
> ret
>
> whereas with the new patterns we see:
> aese v1.16b, v0.16b
> aese v0.16b, v1.16b
> ret
>
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Is this OK for trunk?
>

Nice one!
Do you think we can get an equivalent patch for arm?

Thanks,
Kyrill

> Cheers,
> Andre
>
>
> gcc
> 2018-06-18  Andre Vieira 
>
>
> * config/aarch64/aarch64-simd.md (aarch64_crypto_aesv16qi):
> Make operands of the unspec commutative.
>
> gcc/testsuite
> 2018-06-18 Andre Vieira 
>
> * gcc/target/aarch64/aes_2.c: New test.



[AArch64][PATCH 2/2] Combine AES instructions with xor and zero operands

2018-06-18 Thread Andre Simoes Dias Vieira
Hi,

This patch teaches the AArch64 backend that an AES instruction with a XOR as 
one operand and zero as the other can be simplified by replacing its operands 
with the XOR's operands, thus eliminating the XOR. This is OK because the AES 
instruction itself XORs its input operands.

This will improve code-generation when dealing with code like:
static const uint8x16_t zero = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint8x16_t test0 (uint8x16_t a, uint8x16_t b)
{
  uint8x16_t result = vaeseq_u8 (a ^ b, zero);
  result = vaesdq_u8 (result ^ a, zero);
  return result;
}

Whereas this would lead to the generation of unnecessary 'eor' instructions:
test0:
movi v2.4s, 0
eor v1.16b, v0.16b, v1.16b
aese v1.16b, v2.16b
eor v0.16b, v0.16b, v1.16b
aesd v0.16b, v2.16b
ret

Whereas with this patch we get:
test0:
aese v1.16b, v0.16b
aesd v0.16b, v1.16b
ret

Bootstrapped and tested on aarch64-none-linux-gnu.

Is this OK for trunk?

Cheers,
Andre

gcc
2018-06-18  Andre Vieira  


* config/aarch64/aarch64-simd.md 
(*aarch64_crypto_aesv16qi_xor_combine):
New.

gcc/testsuite
2018-06-18  Andre Vieira  

* gcc.target/aarch64/aes_xor_combine.c: New test.

aes-2.patch
Description: aes-2.patch


[AArch64][PATCH 1/2] Make AES unspecs commutative

2018-06-18 Thread Andre Simoes Dias Vieira
Hi,

This patch teaches the AArch64 backend that the AESE and AESD unspecs are 
commutative (which correspond to the vaeseq_u8 and vaesdq_u8 intrinsics). This 
improves register allocation around their corresponding instructions, avoiding 
unnecessary moves.

For instance, with the old patterns code such as:

uint8x16_t
test0 (uint8x16_t a, uint8x16_t b)
{
  uint8x16_t result;
  result = vaeseq_u8 (a, b);
  result = vaeseq_u8 (result, a);
  return result;
}

would lead to suboptimal register allocation such as:
test0:
mov v2.16b, v0.16b
aese v2.16b, v1.16b
mov v1.16b, v2.16b
aese v1.16b, v0.16b
mov v0.16b, v1.16b
ret

whereas with the new patterns we see:
aese v1.16b, v0.16b
aese v0.16b, v1.16b
ret


Bootstrapped and tested on aarch64-none-linux-gnu.

Is this OK for trunk?

Cheers,
Andre


gcc
2018-06-18  Andre Vieira  


* config/aarch64/aarch64-simd.md (aarch64_crypto_aesv16qi):
Make operands of the unspec commutative.

gcc/testsuite
2018-06-18 Andre Vieira  

* gcc/target/aarch64/aes_2.c: New test.

aes-1.patch
Description: aes-1.patch


[AArch64][PATCH 0/2] Improve codegen for AES instructions

2018-06-18 Thread Andre Simoes Dias Vieira
Hi,

This patch series aims to improve codegen for the AArch64 AES instructions by 
doing two things.

The first is to make the AES unspecs commutative and, as a consequence, make 
the corresponding intrinsics commutative, since the instructions themselves are 
commutative in their inputs.
This will improve register allocation around these instructions. The second 
step is to combine AES instructions of the form 'AES (XOR (a,b), 0)' into 
'AES (a, b)'.

Andre Vieira (2):
Make AES unspecs commutative
Combine AES instructions with xor and zero operands


Cheers,
Andre

Re: [AArch64][PATCH 1/2] Fix addressing printing of LDP/STP

2018-06-18 Thread Andre Simoes Dias Vieira
Hi Richard,

Sorry for the delay, I have been on holidays.  I had a look and I think you are 
right.  With these changes Umq and Uml seem to have the same functionality 
though, so I would suggest using only one.  Maybe use a different name for 
both, removing both Umq and Uml in favour of Umn, where the n indicates it 
narrows the addressing mode.  How does that sound to you?

I also had a look at Ump, but that one is used in the parallel pattern for 
STP/LDP which does not use this "narrowing". So we should leave that one as is.

Cheers,
Andre


From: Richard Sandiford 
Sent: Thursday, June 14, 2018 12:28:16 PM
To: Andre Simoes Dias Vieira
Cc: gcc-patches@gcc.gnu.org; nd
Subject: Re: [AArch64][PATCH 1/2] Fix addressing printing of LDP/STP

Andre Simoes Dias Vieira  writes:
> @@ -5716,10 +5717,17 @@ aarch64_classify_address (struct aarch64_address_info 
> *info,
>unsigned int vec_flags = aarch64_classify_vector_mode (mode);
>bool advsimd_struct_p = (vec_flags == (VEC_ADVSIMD | VEC_STRUCT));
>bool load_store_pair_p = (type == ADDR_QUERY_LDP_STP
> + || type == ADDR_QUERY_LDP_STP_N
>   || mode == TImode
>   || mode == TFmode
>   || (BYTES_BIG_ENDIAN && advsimd_struct_p));
>
> +  /* If we are dealing with ADDR_QUERY_LDP_STP_N that means the incoming mode
> + corresponds to the actual size of the memory being loaded/stored and the
> + mode of the corresponding addressing mode is half of that.  */
> +  if (type == ADDR_QUERY_LDP_STP_N && known_eq (GET_MODE_SIZE (mode), 16))
> +mode = DFmode;
> +
>bool allow_reg_index_p = (!load_store_pair_p
>   && (known_lt (GET_MODE_SIZE (mode), 16)
>   || vec_flags == VEC_ADVSIMD

I don't know whether it matters in practice, but that description also
applies to Umq, not just Uml.  It might be worth changing it too so
that things stay consistent.

Thanks,
Richard


[AArch64][PATCH 1/2] Fix addressing printing of LDP/STP

2018-06-07 Thread Andre Simoes Dias Vieira
Hi,

The address printing for LDP/STP patterns that don't use parallel was not 
working properly when dealing with a post-index addressing mode. The post-index 
address printing uses the mode's size to determine the post-index immediate.  
To fix an earlier issue with range checking of these instructions this mode was 
being hard-coded to DFmode for the operand modifier 'y', which was added for 
this particular pattern.  This was done because the range of LDP/STP for two 
64-bit operands is the same as that of a single 64-bit load/store. Instead of 
hard-coding the mode early on, we introduce a new address query type 
'ADDR_QUERY_LDP_STP_N' to be used for such cases. This will halve the size of 
the mode used for computing the range check, but leave the original mode of the 
operand as is, making sure the post-index printing is correct.

Bootstrapped and tested on aarch64-none-linux-gnu.

Is this OK for trunk?

gcc
2018-06-07  Andre Vieira  

* config/aarch64/aarch64-protos.h (aarch64_addr_query_type): Add new
enum value 'ADDR_QUERY_LDP_STP_N'.
* config/aarch64/aarch64.c (aarch64_addr_query_type): Likewise.
(aarch64_print_address_internal): Add declaration.
(aarch64_print_ldpstp_address): Remove.
(aarch64_classify_address): Adapt mode for 'ADDR_QUERY_LDP_STP_N'.
(aarch64_print_operand): Change printing of 'y'.
* config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): Use
new enum value 'ADDR_QUERY_LDP_STP_N', don't hardcode mode and use
'true' rather than '1'.
* gcc/config/aarch64/constraints.md (Uml): Likewise.

stp-1.patch
Description: stp-1.patch


[AArch64][PATCH 2/2] PR target/83009: Relax strict address checking for store pair lanes

2018-06-07 Thread Andre Simoes Dias Vieira
Hi,

See below a patch to address PR 83009.

Tested with aarch64-linux-gnu bootstrap and regtests for c, c++ and fortran.
Ran the adjusted testcase for -mabi=ilp32.

Is this OK for trunk?

Cheers,
Andre

PR target/83009: Relax strict address checking for store pair lanes

The operand constraint for the memory address of store/load pair lanes was
strictly enforcing that only hardware registers be allowed as memory addresses.
We want to relax that so that these patterns can be used by combine, prior to
reload. During register allocation the register constraint will enforce that
the correct register is chosen.

gcc
2018-06-07  Andre Vieira  

PR target/83009
* config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): Make
address check not strict prior to reload.

gcc/testsuite
2018-06-07 Andre Vieira  

PR target/83009
* gcc/target/aarch64/store_v2vec_lanes.c: Add extra tests.

stp-2.patch
Description: stp-2.patch


[AArch64][PATCH 0/2] PR target/83009: Relax strict address checking for store pair lanes

2018-06-07 Thread Andre Simoes Dias Vieira
Hi,

This patch series is aimed at resolving PR83009, which identified a missing 
optimization opportunity for store pair lanes. The second patch of this series 
was initially reverted because it led to wrong codegen. This was a latent issue 
that was now being hit due to STPs being generated in more cases. The first 
patch in this series remedies that latent issue.

Andre Vieira (2):
Fix addressing printing of LDP/STP
PR target/83009: Relax strict address checking for store pair lanes