Re: [PATCH v3 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
Hi Richard,

Thanks for the reviews. I'm making these changes, but just a heads-up: when hardcoding LR_REGNUM like this we need to change the way we compare the register in doloop_condition_get. This function currently compares the rtx nodes by address, which I think happens to be fine before we assign hard registers, as I suspect we always share the rtx node for the same pseudo, but when assigning registers it seems like we create copies, so things like:

  XEXP (inc_src, 0) == reg

will fail for:

  inc_src: (plus (reg LR) (const_int -n))
  reg:     (reg LR)

Instead I will substitute the operand '==' with calls to 'rtx_equal_p (op1, op2, NULL)'. Sound good?

Kind regards,
Andre

From: Richard Earnshaw (lists)
Sent: Tuesday, January 30, 2024 11:36 AM
To: Andre Simoes Dias Vieira; gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov; Stam Markianos-Wright
Subject: Re: [PATCH v3 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

On 19/01/2024 14:40, Andre Vieira wrote:
>
> Respin after comments from Kyrill and rebase. I also removed an if-then-else
> construct in arm_mve_check_reg_origin_is_num_elems similar to the other
> functions Kyrill pointed out.
>
> After an earlier comment from Richard Sandiford I also added comments to the
> two tail predication patterns added to explain the need for the unspecs.

[missing ChangeLog]

I'm just going to focus on loop-doloop.c in this reply, I'll respond to the other bits in a follow-up.

   2) (set (reg) (plus (reg) (const_int -1))
 -    (set (pc) (if_then_else (reg != 0)
 -                            (label_ref (label))
 -                            (pc))).
 +      (set (pc) (if_then_else (reg != 0)
 +                              (label_ref (label))
 +                              (pc))).

      Some targets (ARM) do the comparison before the branch, as in the
      following form:

 - 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
 -              (set (reg) (plus (reg) (const_int -1)))])
 -    (set (pc) (if_then_else (cc == NE) ...

This comment is becoming confusing.  Really the text leading up to 3)... should be inside 3.
Something like:

   3) Some targets (ARM) do the comparison before the branch, as in the
      following form:

      (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
                 (set (reg) (plus (reg) (const_int -1)))])
      (set (pc) (if_then_else (cc == NE)
                              (label_ref (label))
                              (pc)))

The same issue on the comment structure also applies to the new point 4...

 +   The ARM target also supports a special case of a counter that decrements
 +   by `n` and terminating in a GTU condition.  In that case, the compare and
 +   branch are all part of one insn, containing an UNSPEC:
 +
 +   4) (parallel [
 +        (set (pc)
 +             (if_then_else (gtu (unspec:SI [(plus:SI (reg:SI 14 lr)
 +                                                     (const_int -n))])
 +                                (const_int n-1))
 +                           (label_ref)
 +                           (pc)))
 +        (set (reg:SI 14 lr)
 +             (plus:SI (reg:SI 14 lr)
 +                      (const_int -n)))
 +   */

I think this needs a bit more clarification.  Specifically, that this construct supports a predicated vectorized do loop.  Also, the placement of the unspec inside the comparison is ugly and unnecessary.  It should be sufficient to have the unspec inside a USE expression, which the mid-end can then ignore entirely.  So:

  (parallel [(set (pc) (if_then_else (gtu (plus (reg) (const_int -n))
                                          (const_int n-1))
                                     (label_ref)
                                     (pc)))
             (set (reg) (plus (reg) (const_int -n)))
             (additional clobbers and uses)])

For Arm, we then add a (use (unspec [(const_int 0)] N)) that is specific to this pattern to stop anything else from matching it.  Note that we don't need to mention that the register is 'LR' or the modes, those are specific to a particular backend, not the generic pattern we want to match.

 +      || !CONST_INT_P (XEXP (inc_src, 1))
 +      || INTVAL (XEXP (inc_src, 1)) >= 0)
        return 0;
 +  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));

We can just use '-INTVAL (...)' here, we've verified just above that the constant is negative.

 -  if ((XEXP (condition, 0) == reg)
 +  /* For the ARM special case of having a GTU: re-form the condition without
 +     the unspec for the benefit of the middle-end.
 +   */
 +  if (GET_CODE (condition) == GTU)
 +    {
 +      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src,
 +                                  GEN_INT (dec_num - 1));
 +      return condition;
 +    }

If you make the change I mentioned above, this re-forming isn't needed any more, so the arm-specific comment goes away.

 -  {
 +    {
      if (GET
Re: [r13-7135 Regression] FAIL: gcc.dg/vect/vect-simd-clone-18f.c scan-tree-dump-times vect "[\\n\\r] [^\\n]* = foo\\.simdclone" 2 on Linux/x86_64
Hi,

@Andrew: Could you have a look at these?

I had a quick look at 17f.c and it looks to me like the target selectors aren't specific enough. Unfortunately I am not familiar enough with target selectors (or targets, for that matter) for x86_64. From what I could tell, with -m32 gcc decides to unroll the vectorized loop, so you end up with 4 simdclones rather than the 2 it tests for; GCC probably uses a different cost structure for -m32 that decides it is profitable to unroll? As for -march=cascadelake, that seems to prevent gcc from using the inbranch simdclones altogether, so I suspect that cascadelake doesn't support these inbranch simdclones or the vector types it is trying to use.

Kind regards,
Andre

From: haochen.jiang
Sent: Thursday, April 13, 2023 2:48 AM
To: Andre Simoes Dias Vieira; gcc-regress...@gcc.gnu.org; gcc-patches@gcc.gnu.org; haochen.ji...@intel.com
Subject: [r13-7135 Regression] FAIL: gcc.dg/vect/vect-simd-clone-18f.c scan-tree-dump-times vect "[\\n\\r] [^\\n]* = foo\\.simdclone" 2 on Linux/x86_64

On Linux/x86_64, 58c8c1b383bc3c286d6527fc6e8fb62463f9a877 is the first bad commit

commit 58c8c1b383bc3c286d6527fc6e8fb62463f9a877
Author: Andre Vieira
Date:   Tue Apr 11 10:07:43 2023 +0100

    if-conv: Restore MASK_CALL conversion [PR10]

caused

FAIL: gcc.dg/vect/vect-simd-clone-16e.c scan-tree-dump-times vect "[\\n\\r] [^\\n]* = foo\\.simdclone" 3
FAIL: gcc.dg/vect/vect-simd-clone-16f.c scan-tree-dump-times vect "[\\n\\r] [^\\n]* = foo\\.simdclone" 2
FAIL: gcc.dg/vect/vect-simd-clone-17e.c scan-tree-dump-times vect "[\\n\\r] [^\\n]* = foo\\.simdclone" 3
FAIL: gcc.dg/vect/vect-simd-clone-17f.c scan-tree-dump-times vect "[\\n\\r] [^\\n]* = foo\\.simdclone" 2
FAIL: gcc.dg/vect/vect-simd-clone-18e.c scan-tree-dump-times vect "[\\n\\r] [^\\n]* = foo\\.simdclone" 3
FAIL: gcc.dg/vect/vect-simd-clone-18f.c scan-tree-dump-times vect "[\\n\\r] [^\\n]* = foo\\.simdclone" 2

with GCC configured with ../../gcc/configure
--prefix=/export/users/haochenj/src/gcc-bisect/master/master/r13-7135/usr --enable-clocale=gnu --with-system-zlib --with-demangler-in-ld --with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet --without-isl --enable-libmpx x86_64-linux --disable-bootstrap

To reproduce:

$ cd {build_dir}/gcc && make check RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-16e.c --target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-16e.c --target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-16f.c --target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-16f.c --target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-17e.c --target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-17e.c --target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-17f.c --target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-17f.c --target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-18e.c --target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-18e.c --target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-18f.c --target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="vect.exp=gcc.dg/vect/vect-simd-clone-18f.c --target_board='unix{-m32\ -march=cascadelake}'"

(Please do not reply to this email; for questions about this report, contact me at haochen dot jiang at intel.com)
[PATCH] arm: Make MVE masked stores read memory operand [PR 108177]
Hi,

This patch adds the memory operand of MVE masked stores as input operands to mimic the 'partial' writes, to prevent erroneous write-after-write optimizations as described in the PR.

Regression tested on arm-none-eabi for armv8.1-m.main+mve.fp.

OK for trunk?

gcc/ChangeLog:

	PR target/108177
	* config/arm/mve.md (mve_vstrbq_p_, mve_vstrhq_p_fv8hf,
	mve_vstrhq_p_, mve_vstrwq_p_v4si): Add memory operand as input
	operand.

gcc/testsuite/ChangeLog:

	* gcc.target/arm/mve/pr108177-1-run.c: New test.
	* gcc.target/arm/mve/pr108177-1.c: New test.
	* gcc.target/arm/mve/pr108177-10-run.c: New test.
	* gcc.target/arm/mve/pr108177-10.c: New test.
	* gcc.target/arm/mve/pr108177-11-run.c: New test.
	* gcc.target/arm/mve/pr108177-11.c: New test.
	* gcc.target/arm/mve/pr108177-12-run.c: New test.
	* gcc.target/arm/mve/pr108177-12.c: New test.
	* gcc.target/arm/mve/pr108177-13-run.c: New test.
	* gcc.target/arm/mve/pr108177-13.c: New test.
	* gcc.target/arm/mve/pr108177-14-run.c: New test.
	* gcc.target/arm/mve/pr108177-14.c: New test.
	* gcc.target/arm/mve/pr108177-2-run.c: New test.
	* gcc.target/arm/mve/pr108177-2.c: New test.
	* gcc.target/arm/mve/pr108177-3-run.c: New test.
	* gcc.target/arm/mve/pr108177-3.c: New test.
	* gcc.target/arm/mve/pr108177-4-run.c: New test.
	* gcc.target/arm/mve/pr108177-4.c: New test.
	* gcc.target/arm/mve/pr108177-5-run.c: New test.
	* gcc.target/arm/mve/pr108177-5.c: New test.
	* gcc.target/arm/mve/pr108177-6-run.c: New test.
	* gcc.target/arm/mve/pr108177-6.c: New test.
	* gcc.target/arm/mve/pr108177-7-run.c: New test.
	* gcc.target/arm/mve/pr108177-7.c: New test.
	* gcc.target/arm/mve/pr108177-8-run.c: New test.
	* gcc.target/arm/mve/pr108177-8.c: New test.
	* gcc.target/arm/mve/pr108177-9-run.c: New test.
	* gcc.target/arm/mve/pr108177-9.c: New test.
	* gcc.target/arm/mve/pr108177-main.x: New test include.
	* gcc.target/arm/mve/pr108177.x: New test include.

pr108177.patch
Description: pr108177.patch
[PATCH] arm: Fix MVE's vcmp vector-scalar patterns [PR107987]
Hi,

This patch surrounds the scalar operand of the MVE vcmp patterns with a vec_duplicate to ensure both operands of the comparison operator have the same (vector) mode.

Regression tested on arm-none-eabi. Is this OK for trunk? And a backport to GCC 12?

gcc/ChangeLog:

	PR target/107987
	* config/arm/mve.md (mve_vcmpq_n_, @mve_vcmpq_n_f): Apply
	vec_duplicate to scalar operand.

gcc/testsuite/ChangeLog:

	* gcc.target/arm/mve/pr107987.c: New test.

pr107987.patch
Description: pr107987.patch
Re: [PATCH v3 06/15] arm: Fix mve_vmvnq_n_ argument mode
On 20/01/2022 09:23, Christophe Lyon wrote:

On Wed, Jan 19, 2022 at 8:03 PM Andre Vieira (lists) via Gcc-patches wrote:

On 13/01/2022 14:56, Christophe Lyon via Gcc-patches wrote:
> The vmvnq_n* intrinsics have [u]int[16|32]_t arguments, so use
> iterator instead of HI in mve_vmvnq_n_.
>
> 2022-01-13  Christophe Lyon
>
> gcc/
> 	* config/arm/mve.md (mve_vmvnq_n_): Use V_elem mode
> 	for operand 1.
>
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index 171dd384133..5c3b34dce3a 100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -617,7 +617,7 @@ (define_insn "mve_vcvtaq_"
>  (define_insn "mve_vmvnq_n_"
>    [
>     (set (match_operand:MVE_5 0 "s_register_operand" "=w")
> -	(unspec:MVE_5 [(match_operand:HI 1 "immediate_operand" "i")]
> +	(unspec:MVE_5 [(match_operand: 1 "immediate_operand" "i")]
>  	 VMVNQ_N))
>    ]
>    "TARGET_HAVE_MVE"

While fixing this it might be good to fix the constraint and predicate, inspired by "DL" and "neon_inv_logic_op2" respectively. This would avoid the compiler generating wrong assembly, and instead it would probably lead to the compiler using a load literal. I kind of think it would be better to have the intrinsic refuse the immediate altogether, but it seems for NEON we also use the load-literal approach.

Ha, I thought that patch had been approved at v2 too:
https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581344.html

Yeah, sorry, I had not looked at the previous version of this series! I can put together a follow-up for this then.
Re: [AArch64] Enable generation of FRINTNZ instructions
On 12/11/2021 10:56, Richard Biener wrote:

On Thu, 11 Nov 2021, Andre Vieira (lists) wrote:

Hi,

This patch introduces two IFNs, FTRUNC32 and FTRUNC64, the corresponding optabs and mappings. It also creates a backend pattern to implement them for aarch64 and a match.pd pattern to idiom-recognize these.

These IFNs (and optabs) represent a truncation towards zero, as if performed by first casting it to a signed integer of 32 or 64 bits and then back to the same floating point type/mode. The match.pd pattern chooses to use these, when supported, regardless of trapping math, since these new patterns mimic the original behavior of truncating through an integer.

I didn't think any of the existing IFNs represented these.

I know it's a bit late in stage 1, but I thought this might be OK given it's only used by a single target and should have very little impact on anything else.

Bootstrapped on aarch64-none-linux. OK for trunk?

On the RTL side ftrunc32/ftrunc64 would probably be better a conversion optab (with two modes), so not

+OPTAB_D (ftrunc32_optab, "ftrunc$asi2")
+OPTAB_D (ftrunc64_optab, "ftrunc$adi2")

but

OPTAB_CD (ftrunc_shrt_optab, "ftrunc$a$I$b2")

or so?  I know that gets somewhat awkward for the internal function, but IMHO we shouldn't tie our hands because of that?

I tried doing this originally, but indeed I couldn't find a way to correctly tie the internal function to it. direct_optab_supported_p with multiple types expects those to be of the same mode. I see convert_optab_supported_p does, but I don't know how that is used... Any ideas?
Re: [RFC] Implementing detection of saturation and rounding arithmetic
Hi Bin,

Thank you for the reply, I have some questions, see below.

On 07/06/2021 12:28, Bin.Cheng wrote:

On Fri, Jun 4, 2021 at 12:35 AM Andre Vieira (lists) via Gcc-patches wrote:

Hi Andre,

I didn't look into the details of the IV sharing RFC. It seems to me costing outside uses is trying to generate better code for later code (epilogue loop here). The only problem is IVOPTs doesn't know that the outside use is not in the final form - which will be transformed by IVOPTs again. I think this example is not good at describing your problem because it shows exactly that considering outside use results in better code, compared to the other two approaches.

I don't quite understand what you are saying here :( What do you mean by final form? It seems to me that costing uses inside and outside loop the same way is wrong, because calculating the IV inside the loop has to be done every iteration, whereas if you can resolve it to a single update (without an IV) then you can sink it outside the loop. This is why I think this example shows why we need to cost these uses differently.

2) Is there a cleaner way to generate the optimal 'post-increment' use for the outside-use variable? I first thought the position in the candidate might be something I could use, or even the var_at_stmt functionality, but the outside IV has the actual increment of the variable as its use, rather than the outside uses. This is this RFC's main weakness I find.

To answer why IVOPTs behaves like this w/o your two patches: the main problem is the point IVOPTs rewrites the outside-use IV - I don't remember the exact point - but it looks like at the end of the loop, before the incrementing instruction of the main IV. It's a known issue that an outside use should be costed/re-written on the exit edge along which its value flows out of the loop. I had a patch a long time ago but discarded it, because it didn't bring obvious improvement and is complicated in the case of multi-exit edges.
Yeah, I haven't looked at multi-exit edges and I understand that complicates things. But for now we could disable the special casing of outside uses when dealing with multi-exit loops and keep the current behavior.

But in general, I am less convinced that any of the two patches is the right direction for solving the IV sharing issue between the vectorized loop and the epilogue loop. I would need to read the previous RFC before giving further comments though.

The previous RFC still has a lot of unanswered questions too, but regardless of that, take the following (non-vectorizer) example:

#include <arm_sve.h>
#include <stdint.h>

void bar (char * __restrict__ a, char * __restrict__ b,
          char * __restrict__ c, unsigned long long n)
{
  svbool_t all_true = svptrue_b8 ();
  unsigned long long i = 0;
  for (; i < (n & ~(svcntb () - 1)); i += svcntb ())
    {
      svuint8_t va = svld1 (all_true, (uint8_t *) a);
      svuint8_t vb = svld1 (all_true, (uint8_t *) b);
      svst1 (all_true, (uint8_t *) c, svadd_z (all_true, va, vb));
      a += svcntb ();
      b += svcntb ();
      c += svcntb ();
    }
  svbool_t pred;
  for (; i < (n); i += svcntb ())
    {
      pred = svwhilelt_b8 (i, n);
      svuint8_t va = svld1 (pred, (uint8_t *) a);
      svuint8_t vb = svld1 (pred, (uint8_t *) b);
      svst1 (pred, (uint8_t *) c, svadd_z (pred, va, vb));
      a += svcntb ();
      b += svcntb ();
      c += svcntb ();
    }
}

Current IVOPTs will use 4 iterators for the first loop, when it could do with just 1. In fact, if you use my patches it will create just a single IV and sink the uses, and it is then able to merge them with the loads & stores of the next loop.

I am not saying setting outside costs to 0 is the right thing to do, by the way. It is absolutely not! It will break cost considerations for other cases. Like I said above, I've been playing around with using '!use->outside' as a multiplier for the cost. Unfortunately it won't help with the case above, because this seems to choose 'infinite_cost' because the candidate IV has a lower precision than the use IV.
I don't quite understand yet how candidates are created, but that is something I'm going to try to look at. I just wanted to show this as an example of how IVOPTs would not improve code with multiple loops that don't involve the vectorizer.

BR,
Andre

Thanks,
bin
Re: [PATCH 1/7] arm: Auto-vectorization for MVE: vand
Hi Christophe,

Thanks for these! See some inline comments.

On 25/11/2020 13:54, Christophe Lyon via Gcc-patches wrote:

This patch enables MVE vandq instructions for auto-vectorization. MVE vandq insns in mve.md are modified to use 'and' instead of unspec expression to support and3. The and3 expander is added to vec-common.md.

2020-11-12  Christophe Lyon

gcc/
	* gcc/config/arm/iterators.md (supf): Remove VANDQ_S and VANDQ_U.
	(VANQ): Remove.
	* config/arm/mve.md (mve_vandq_u): New entry for vand
	instruction using expression and.
	(mve_vandq_s): New expander.
	* config/arm/neon.md (and3): Renamed into and3_neon.
	* config/arm/unspecs.md (VANDQ_S, VANDQ_U): Remove.
	* config/arm/vec-common.md (and3): New expander.

gcc/testsuite/
	* gcc.target/arm/simd/mve-vand.c: New test.
---
 gcc/config/arm/iterators.md                  |  4 +---
 gcc/config/arm/mve.md                        | 20
 gcc/config/arm/neon.md                       |  2 +-
 gcc/config/arm/unspecs.md                    |  2 --
 gcc/config/arm/vec-common.md                 | 15
 gcc/testsuite/gcc.target/arm/simd/mve-vand.c | 34
 6 files changed, 66 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vand.c

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 592af35..72039e4 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -1232,8 +1232,7 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		       (VADDLVQ_P_U "u") (VCMPNEQ_U "u") (VCMPNEQ_S "s")
 		       (VABDQ_M_S "s") (VABDQ_M_U "u") (VABDQ_S "s")
 		       (VABDQ_U "u") (VADDQ_N_S "s") (VADDQ_N_U "u")
-		       (VADDVQ_P_S "s") (VADDVQ_P_U "u") (VANDQ_S "s")
-		       (VANDQ_U "u") (VBICQ_S "s") (VBICQ_U "u")
+		       (VADDVQ_P_S "s") (VADDVQ_P_U "u") (VBICQ_S "s") (VBICQ_U "u")
 		       (VBRSRQ_N_S "s") (VBRSRQ_N_U "u") (VCADDQ_ROT270_S "s")
 		       (VCADDQ_ROT270_U "u") (VCADDQ_ROT90_S "s")
 		       (VCMPEQQ_S "s") (VCMPEQQ_U "u") (VCADDQ_ROT90_U "u")
@@ -1501,7 +1500,6 @@ (define_int_iterator VABDQ [VABDQ_S VABDQ_U])
 (define_int_iterator VADDQ_N [VADDQ_N_S VADDQ_N_U])
 (define_int_iterator VADDVAQ [VADDVAQ_S VADDVAQ_U])
 (define_int_iterator VADDVQ_P [VADDVQ_P_U VADDVQ_P_S])
-(define_int_iterator VANDQ [VANDQ_U VANDQ_S])
 (define_int_iterator VBICQ [VBICQ_S VBICQ_U])
 (define_int_iterator VBRSRQ_N [VBRSRQ_N_U VBRSRQ_N_S])
 (define_int_iterator VCADDQ_ROT270 [VCADDQ_ROT270_S VCADDQ_ROT270_U])

diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index ecbaaa9..975eb7d 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -894,17 +894,27 @@ (define_insn "mve_vaddvq_p_")
 ;;
 ;; [vandq_u, vandq_s])
 ;;
-(define_insn "mve_vandq_"
+;; signed and unsigned versions are the same: define the unsigned
+;; insn, and use an expander for the signed one as we still reference
+;; both names from arm_mve.h.
+(define_insn "mve_vandq_u"
   [
    (set (match_operand:MVE_2 0 "s_register_operand" "=w")
-	(unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
-		       (match_operand:MVE_2 2 "s_register_operand" "w")]
-	 VANDQ))
+	(and:MVE_2 (match_operand:MVE_2 1 "s_register_operand" "w")
+		   (match_operand:MVE_2 2 "s_register_operand" "w")))

The predicate on the second operand is more restrictive than the one in the expander, 'neon_inv_logic_op2'. This should still work with immediates - well, I checked for integers; it generates a loop as such:

	vldrw.32	q3, [r0]
	vldr.64		d4, .L8
	vldr.64		d5, .L8+8
	vand		q3, q3, q2
	vstrw.32	q3, [r2]

MVE does support vand's with immediates, just like NEON, so I suspect you could just copy the way Neon handles these; possibly worth the little extra effort. You can use dest[i] = a[i] & ~1 as a testcase. If you don't, it might still be worth expanding the test to make sure other immediate-type combinations don't trigger an ICE?

I'm not sure I understand why it loads the constant in two 64-bit chunks rather than doing a single load, or just something like a vmov or vbic immediate. Anyhow, that's a worry for another day I guess..
   ]
   "TARGET_HAVE_MVE"
-  "vand %q0, %q1, %q2"
+  "vand\t%q0, %q1, %q2"
   [(set_attr "type" "mve_move")
 ])

+(define_expand "mve_vandq_s"
+  [
+   (set (match_operand:MVE_2 0 "s_register_operand")
+	(and:MVE_2 (match_operand:MVE_2 1 "s_register_operand")
+		   (match_operand:MVE_2 2 "s_register_operand")))
+  ]
+  "TARGET_HAVE_MVE"
+)

 ;;
 ;; [vbicq_s, vbicq_u])

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 2d76769..dc4707d 100644
--- a/gcc/config/arm/neon.md
+++
Re: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved registers with CMSE
On 07/07/2020 13:43, Christophe Lyon wrote:

Hi,

On Mon, 6 Jul 2020 at 16:31, Andre Vieira (lists) wrote:

On 30/06/2020 14:50, Andre Vieira (lists) wrote:

On 29/06/2020 11:15, Christophe Lyon wrote:

On Mon, 29 Jun 2020 at 10:56, Andre Vieira (lists) wrote:

On 23/06/2020 21:52, Christophe Lyon wrote:

On Tue, 23 Jun 2020 at 15:28, Andre Vieira (lists) wrote:

On 23/06/2020 13:10, Kyrylo Tkachov wrote:

-Original Message-
From: Andre Vieira (lists)
Sent: 22 June 2020 09:52
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov
Subject: [PATCH][GCC][Arm] PR target/95646: Do not clobber callee saved registers with CMSE

Hi,

As reported in bugzilla, when the -mcmse option is used while compiling for size (-Os) with a thumb-1 target, the generated code will clear the registers r7-r10. These however are callee saved and should be preserved across ABI boundaries. The reason this happens is that these registers are made "fixed" when optimising for size with Thumb-1, to make sure they are not used, as pushing and popping hi-registers requires extra moves to and from LO_REGS.

To fix this, this patch uses 'callee_saved_reg_p', which accounts for this optimisation, instead of 'call_used_or_fixed_reg_p'. Be aware of 'callee_saved_reg_p''s definition, as it does still take call-used registers into account, which aren't callee-saved in my opinion, so it is a rather misnomer; it works in our advantage here though, as it does exactly what we need.

Regression tested on arm-none-eabi.

Is this OK for trunk? (Will eventually backport to previous versions if stable.)

Ok.
Thanks,
Kyrill

As I was getting ready to push this I noticed I didn't add any skip-ifs to prevent this failing with specific target options. So here's a new version with those.

Still OK?

Hi,

This is not sufficient to skip arm-linux-gnueabi* configs built with non-default cpu/fpu.
For instance, with arm-linux-gnueabihf --with-cpu=cortex-a9 --with-fpu=neon-fp16 --with-float=hard I see:

FAIL: gcc.target/arm/pr95646.c (test for excess errors)
Excess errors:
cc1: error: ARMv8-M Security Extensions incompatible with selected FPU
cc1: error: target CPU does not support ARM mode

and the testcase is compiled with -mcpu=cortex-m23 -mcmse -Os.

Resending as I don't think my earlier one made it to the lists (sorry if you are receiving this twice!)

I'm not following this. Before I go off and try to reproduce it, what do you mean by 'the testcase is compiled with -mcpu=cortex-m23 -mcmse -Os'? Are these the options you are seeing in the log file? Surely they should override the default options? The only thing I can think of is that this might need an extra -mfloat-abi=soft to make sure it overrides the default float-abi. Could you give that a try?

No, it doesn't make a difference alone. I also had to add:
-mfpu=auto (that clears the above warning)
-mthumb, otherwise we now get: cc1: error: target CPU does not support ARM mode

Looks like some effective-target machinery is needed.

So I had a look at this. I was pretty sure that -mfloat-abi=soft overrode -mfpu=<>, which by and large it does, as in no FP instructions will be generated, but the error you see only checks for the right number of FP registers, without checking whether 'TARGET_HARD_FLOAT' is set or not. I'll fix this too and use the check-effective-target for armv8-m.base for this test, as it is indeed a better approach than my bag of skip-ifs. I'm testing it locally to make sure my changes don't break anything.

Cheers,
Andre

Hi,

Sorry for the delay. So I changed the test to use the effective-target machinery as you suggested, and I also made sure that you don't get the "ARMv8-M Security Extensions incompatible with selected FPU" error when -mfloat-abi=soft. I further changed 'asm' to '__asm__' to avoid failures with '-std=' options.

Regression tested on arm-none-eabi.
@Christophe: could you test this for your configuration, it shouldn't fail anymore!

Indeed, with your patch I don't see any failure with pr95646.c.

Note that it is still unsupported with arm-eabi when running the tests with -mcpu=cortex-mXX, because the compiler complains that -mcpu=cortex-mXX conflicts with -march=armv8-m.base, thus the effective-target test fails.

BTW, is that warning useful/practical? Wouldn't it be more convenient if the last -mcpu/-march on the command line was the only one taken into account? (I had a similar issue when running tests (libstdc++) getting -march=armv8-m.main+fp from their multilib environment and forcing -mcpu=cortex-m33, because it also means '+dsp' and produces a warning; I had to use -mcpu=cortex-m33 -march=armv8-m.main+fp+dsp to work around this.)

Yeah, I've been annoyed by that before, also in the context of testing multilibs. Even though I can see how it can be a useful warning, if you are using these in build systems and you accidentally introduce a new (incompatible) -mcpu/-march alongside the old one. Though it seems arbitrary, as we do not warn against multiple -mcpu's or
Re: [AArch64][PATCH 1/2] Fix addressing printing of LDP/STP
On 18/06/18 09:08, Andre Simoes Dias Vieira wrote:
> Hi Richard,
>
> Sorry for the delay, I have been on holiday. I had a look and I think you
> are right. With these changes Umq and Uml seem to have the same
> functionality though, so I would suggest using only one. Maybe use a
> different name for both, removing both Umq and Uml in favour of Umn, where
> the n indicates it narrows the addressing mode. How does that sound to you?
>
> I also had a look at Ump, but that one is used in the parallel pattern for
> STP/LDP which does not use this "narrowing". So we should leave that one as
> is.
>
> Cheers,
> Andre
>
> From: Richard Sandiford
> Sent: Thursday, June 14, 2018 12:28:16 PM
> To: Andre Simoes Dias Vieira
> Cc: gcc-patches@gcc.gnu.org; nd
> Subject: Re: [AArch64][PATCH 1/2] Fix addressing printing of LDP/STP
>
> Andre Simoes Dias Vieira writes:
>> @@ -5716,10 +5717,17 @@ aarch64_classify_address (struct aarch64_address_info *info,
>>    unsigned int vec_flags = aarch64_classify_vector_mode (mode);
>>    bool advsimd_struct_p = (vec_flags == (VEC_ADVSIMD | VEC_STRUCT));
>>    bool load_store_pair_p = (type == ADDR_QUERY_LDP_STP
>> +			    || type == ADDR_QUERY_LDP_STP_N
>> 			    || mode == TImode
>> 			    || mode == TFmode
>> 			    || (BYTES_BIG_ENDIAN && advsimd_struct_p));
>>
>> +  /* If we are dealing with ADDR_QUERY_LDP_STP_N that means the incoming mode
>> +     corresponds to the actual size of the memory being loaded/stored and the
>> +     mode of the corresponding addressing mode is half of that.  */
>> +  if (type == ADDR_QUERY_LDP_STP_N && known_eq (GET_MODE_SIZE (mode), 16))
>> +    mode = DFmode;
>> +
>>    bool allow_reg_index_p = (!load_store_pair_p
>> 			    && (known_lt (GET_MODE_SIZE (mode), 16)
>> 				|| vec_flags == VEC_ADVSIMD
>
> I don't know whether it matters in practice, but that description also
> applies to Umq, not just Uml.  It might be worth changing it too so
> that things stay consistent.
>
> Thanks,
> Richard

Hi all,

This is a reworked patch, replacing Umq and Uml with Umn now.
Bootstrapped and tested on aarch64-none-linux-gnu.

Is this OK for trunk?

gcc
2018-06-25  Andre Vieira

	* config/aarch64/aarch64-simd.md (aarch64_simd_mov): Replace
	Umq with Umn.
	(store_pair_lanes): Likewise.
	* config/aarch64/aarch64-protos.h (aarch64_addr_query_type): Add new
	enum value 'ADDR_QUERY_LDP_STP_N'.
	* config/aarch64/aarch64.c (aarch64_addr_query_type): Likewise.
	(aarch64_print_address_internal): Add declaration.
	(aarch64_print_ldpstp_address): Remove.
	(aarch64_classify_address): Adapt mode for 'ADDR_QUERY_LDP_STP_N'.
	(aarch64_print_operand): Change printing of 'y'.
	* config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): Use
	new enum value 'ADDR_QUERY_LDP_STP_N', don't hardcode mode and use
	'true' rather than '1'.
	* gcc/config/aarch64/constraints.md (Uml): Likewise.
	(Uml): Rename to Umn.
	(Umq): Remove.

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 4ea50acaa59c0b58a213bd1f27fb78b6d8deee96..c03a442107815eed44a3b6bceb386d78a6615483 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -120,6 +120,10 @@ enum aarch64_symbol_type
    ADDR_QUERY_LDP_STP
       Query what is valid for a load/store pair.

+   ADDR_QUERY_LDP_STP_N
+      Query what is valid for a load/store pair, but narrow the incoming mode
+      for address checking.  This is used for the store_pair_lanes patterns.
+
    ADDR_QUERY_ANY
       Query what is valid for at least one memory constraint, which may
       allow things that "m" doesn't.
For example, the SVE LDR and STR
@@ -128,6 +132,7 @@ enum aarch64_symbol_type
 enum aarch64_addr_query_type {
   ADDR_QUERY_M,
   ADDR_QUERY_LDP_STP,
+  ADDR_QUERY_LDP_STP_N,
   ADDR_QUERY_ANY
 };

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 4e5c42b0f15b863f3088dba4aac450f31ca309bb..7936581947b360a6d6a88ce6523bbb804c3eb89c 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -131,7 +131,7 @@
 (define_insn "*aarch64_simd_mov"
   [(set (match_operand:VQ 0 "nonimmediate_operand"
-		"=w, Umq, m, w, ?r, ?w, ?r, w")
+		"=w, Umn, m, w, ?r, ?w, ?r, w")
 	(match_operand:VQ 1 "general_operand"
 		"m, Dz, w, w, w, r, r, Dn"))]
   "TARGET_SIMD
@@ -30
Re: [AArch64][PATCH 2/2] PR target/83009: Relax strict address checking for store pair lanes
On 14/06/18 12:47, Richard Sandiford wrote:
> Kyrill Tkachov writes:
>> Hi Andre,
>> On 07/06/18 18:02, Andre Simoes Dias Vieira wrote:
>>> Hi,
>>>
>>> See below a patch to address PR 83009.
>>>
>>> Tested with aarch64-linux-gnu bootstrap and regtests for c, c++ and fortran.
>>> Ran the adjusted testcase for -mabi=ilp32.
>>>
>>> Is this OK for trunk?
>>>
>>> Cheers,
>>> Andre
>>>
>>> PR target/83009: Relax strict address checking for store pair lanes
>>>
>>> The operand constraint for the memory address of store/load pair lanes was
>>> enforcing strictly hardware registers be allowed as memory addresses.  We want
>>> to relax that such that these patterns can be used by combine, prior to reload.
>>> During register allocation the register constraint will enforce the correct
>>> register is chosen.
>>>
>>> gcc
>>> 2018-06-07  Andre Vieira
>>>
>>>	PR target/83009
>>>	* config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): Make
>>>	address check not strict prior to reload.
>>>
>>> gcc/testsuite
>>> 2018-06-07  Andre Vieira
>>>
>>>	PR target/83009
>>>	* gcc/target/aarch64/store_v2vec_lanes.c: Add extra tests.
>>
>> diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
>> index f0917af8b4cec945ba4e38e4dc670200f8812983..30aa88838671bf343a883077c2b606a035c030dd 100644
>> --- a/gcc/config/aarch64/predicates.md
>> +++ b/gcc/config/aarch64/predicates.md
>> @@ -227,7 +227,7 @@
>>  (define_predicate "aarch64_mem_pair_lanes_operand"
>>    (and (match_code "mem")
>>         (match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP (op, 0),
>> -						   true,
>> +						   reload_completed,
>>						   ADDR_QUERY_LDP_STP_N)")))
>>
>> If you want to enforce strict checking during reload and later then I think
>> you need to use reload_in_progress || reload_completed ?
>
> That was the old way, but it would be lra_in_progress now.
> However...
>
>> I guess that would be equivalent to !can_create_pseudo ().
> > We should never see pseudos when reload_completed, so the choice > shouldn't really matter then. And I don't think we should use > lra_in_progress either, since that would make the checks stricter > before RA has actually happened, which would likely lead to an > unrecognisable insn ICE if recog is called during one of the LRA > subpasses. > > So unless we know a reason otherwise, I think this should just > be "false" (like it already is for aarch64_mem_pair_operand). > > Thanks, > Richard > Changed it to false. Bootstrapped and regression testing for aarch64-none-linux-gnu. Is this OK for trunk? Cheers, Andre gcc 2018-06-25 Andre Vieira PR target/83009 * config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): Make address check not strict. gcc/testsuite 2018-06-25 Andre Vieira PR target/83009 * gcc/target/aarch64/store_v2vec_lanes.c: Add extra tests. diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md index f0917af8b4cec945ba4e38e4dc670200f8812983..e1a022b5c5a371230c71cd1dd944f5b0d4f4dc4c 100644 --- a/gcc/config/aarch64/predicates.md +++ b/gcc/config/aarch64/predicates.md @@ -227,7 +227,7 @@ (define_predicate "aarch64_mem_pair_lanes_operand" (and (match_code "mem") (match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP (op, 0), - true, + false, ADDR_QUERY_LDP_STP_N)"))) (define_predicate "aarch64_prefetch_operand" diff --git a/gcc/testsuite/gcc.target/aarch64/store_v2vec_lanes.c b/gcc/testsuite/gcc.target/aarch64/store_v2vec_lanes.c index 990aea32de6f8239effa95a081950684c6e11386..3296d04da14149d26d19da785663b87bd5ad8994 100644 --- a/gcc/testsuite/gcc.target/aarch64/store_v2vec_lanes.c +++ b/gcc/testsuite/gcc.target/aarch64/store_v2vec_lanes.c @@ -22,10 +22,32 @@ construct_lane_2 (long long *y, v2di *z) z[2] = x; } +void +construct_lane_3 (double **py, v2df **pz) +{ + double *y = *py; + v2df *z = *pz; + double y0 = y[0] + 1; + double y1 = y[1] + 2; + v2df x = {y0, y1}; + z[2] = x; +} + +void +construct_lane_4 (long 
long **py, v2di **pz) +{ + long long *y = *py; + v2di *z = *pz; + long long y0 = y[0] + 1; + long long y1 = y[1] + 2; + v2di x = {y0, y1};
Re: [AArch64][PATCH 1/2] Make AES unspecs commutative
Hey Kyrill,

I think it should be possible, I'll have a quick look.

Cheers,
Andre

From: Kyrill Tkachov
Sent: Wednesday, June 20, 2018 9:32:50 AM
To: Andre Simoes Dias Vieira; gcc-patches@gcc.gnu.org
Cc: nd; James Greenhalgh; Richard Earnshaw; Marcus Shawcroft
Subject: Re: [AArch64][PATCH 1/2] Make AES unspecs commutative

Hi Andre,

On 18/06/18 10:38, Andre Simoes Dias Vieira wrote:
> Hi,
>
> This patch teaches the AArch64 backend that the AESE and AESD unspecs are
> commutative (which correspond to the vaeseq_u8 and vaesdq_u8 intrinsics).
> This improves register allocation around their corresponding instructions,
> avoiding unnecessary moves.
>
> For instance, with the old patterns, code such as:
>
> uint8x16_t
> test0 (uint8x16_t a, uint8x16_t b)
> {
>   uint8x16_t result;
>   result = vaeseq_u8 (a, b);
>   result = vaeseq_u8 (result, a);
>   return result;
> }
>
> would lead to suboptimal register allocation such as:
>
> test0:
>         mov     v2.16b, v0.16b
>         aese    v2.16b, v1.16b
>         mov     v1.16b, v2.16b
>         aese    v1.16b, v0.16b
>         mov     v0.16b, v1.16b
>         ret
>
> whereas with the new patterns we see:
>
>         aese    v1.16b, v0.16b
>         aese    v0.16b, v1.16b
>         ret
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Is this OK for trunk?

Nice one! Do you think we can get an equivalent patch for arm?

Thanks,
Kyrill

> Cheers,
> Andre
>
> gcc
> 2018-06-18  Andre Vieira
>
>     * config/aarch64/aarch64-simd.md (aarch64_crypto_aesv16qi):
>     Make operands of the unspec commutative.
>
> gcc/testsuite
> 2018-06-18  Andre Vieira
>
>     * gcc/target/aarch64/aes_2.c: New test.
[AArch64][PATCH 2/2] Combine AES instructions with xor and zero operands
Hi,

This patch teaches the AArch64 backend that AES instructions with a XOR and zero operands can be simplified by replacing the operands of the AES with those of the XOR, thus eliminating the XOR. This is OK because the AES instruction XORs its input operands.

This will improve code generation when dealing with code like:

static const uint8x16_t zero = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint8x16_t
test0 (uint8x16_t a, uint8x16_t b)
{
  uint8x16_t result = vaeseq_u8 (a ^ b, zero);
  result = vaesdq_u8 (result ^ a, zero);
  return result;
}

Previously this would lead to the generation of unnecessary 'eor' instructions:

test0:
        movi    v2.4s, 0
        eor     v1.16b, v0.16b, v1.16b
        aese    v1.16b, v2.16b
        eor     v0.16b, v0.16b, v1.16b
        aesd    v0.16b, v2.16b
        ret

whereas with this patch we get:

test0:
        aese    v1.16b, v0.16b
        aesd    v0.16b, v1.16b
        ret

Bootstrapped and tested on aarch64-none-linux-gnu.

Is this OK for trunk?

Cheers,
Andre

gcc
2018-06-18  Andre Vieira

    * config/aarch64/aarch64-simd.md (*aarch64_crypto_aesv16qi_xor_combine):
    New.

gcc/testsuite
2018-06-18  Andre Vieira

    * gcc.target/aarch64/aes_xor_combine.c: New test.

aes-2.patch
Description: aes-2.patch
[AArch64][PATCH 1/2] Make AES unspecs commutative
Hi,

This patch teaches the AArch64 backend that the AESE and AESD unspecs are commutative (which correspond to the vaeseq_u8 and vaesdq_u8 intrinsics). This improves register allocation around their corresponding instructions, avoiding unnecessary moves.

For instance, with the old patterns, code such as:

uint8x16_t
test0 (uint8x16_t a, uint8x16_t b)
{
  uint8x16_t result;
  result = vaeseq_u8 (a, b);
  result = vaeseq_u8 (result, a);
  return result;
}

would lead to suboptimal register allocation such as:

test0:
        mov     v2.16b, v0.16b
        aese    v2.16b, v1.16b
        mov     v1.16b, v2.16b
        aese    v1.16b, v0.16b
        mov     v0.16b, v1.16b
        ret

whereas with the new patterns we see:

        aese    v1.16b, v0.16b
        aese    v0.16b, v1.16b
        ret

Bootstrapped and tested on aarch64-none-linux-gnu.

Is this OK for trunk?

Cheers,
Andre

gcc
2018-06-18  Andre Vieira

    * config/aarch64/aarch64-simd.md (aarch64_crypto_aesv16qi):
    Make operands of the unspec commutative.

gcc/testsuite
2018-06-18  Andre Vieira

    * gcc/target/aarch64/aes_2.c: New test.

aes-1.patch
Description: aes-1.patch
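For readers unfamiliar with how commutativity is expressed in machine descriptions: in GCC .md constraints, a '%' prefix on an operand tells the compiler that this operand may be swapped with the following one. A sketch of what the pattern might look like with that marker (this is illustrative of the mechanism, not necessarily the exact text of the patch; see aes-1.patch for the real change):

```lisp
;; Sketch only: '%0' marks operand 1 as commutable with operand 2,
;; so RA may swap them, and it also ties operand 1 to the output reg.
(define_insn "aarch64_crypto_aes<aes_op>v16qi"
  [(set (match_operand:V16QI 0 "register_operand" "=w")
        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "%0")
                       (match_operand:V16QI 2 "register_operand" "w")]
         CRYPTO_AES))]
  "TARGET_SIMD && TARGET_CRYPTO"
  "aes<aes_op>\t%0.16b, %2.16b"
  [(set_attr "type" "crypto_aese")]
)
```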
[AArch64][PATCH 0/2] Improve codegen for AES instructions
Hi,

This patch series aims to improve codegen for the AArch64 AES instructions by doing two things.

The first is to make the AES unspecs commutative, and by consequence make the corresponding intrinsics commutative, since the instructions themselves are commutative in their inputs. This will improve register allocation around these instructions.

The second step is to combine AES instructions of the form 'AES (XOR (a, b), 0)' into 'AES (a, b)'.

Andre Vieira (2):
  Make AES unspecs commutative
  Combine AES instructions with xor and zero operands

Cheers,
Andre
Re: [AArch64][PATCH 1/2] Fix addressing printing of LDP/STP
Hi Richard,

Sorry for the delay, I have been on holiday.

I had a look and I think you are right. With these changes Umq and Uml seem to have the same functionality though, so I would suggest using only one. Maybe use a different name for both, removing both Umq and Uml in favour of Umn, where the n indicates it narrows the addressing mode. How does that sound to you?

I also had a look at Ump, but that one is used in the parallel pattern for STP/LDP, which does not use this "narrowing". So we should leave that one as is.

Cheers,
Andre

From: Richard Sandiford
Sent: Thursday, June 14, 2018 12:28:16 PM
To: Andre Simoes Dias Vieira
Cc: gcc-patches@gcc.gnu.org; nd
Subject: Re: [AArch64][PATCH 1/2] Fix addressing printing of LDP/STP

Andre Simoes Dias Vieira writes:
> @@ -5716,10 +5717,17 @@ aarch64_classify_address (struct aarch64_address_info *info,
>    unsigned int vec_flags = aarch64_classify_vector_mode (mode);
>    bool advsimd_struct_p = (vec_flags == (VEC_ADVSIMD | VEC_STRUCT));
>    bool load_store_pair_p = (type == ADDR_QUERY_LDP_STP
> +                            || type == ADDR_QUERY_LDP_STP_N
>                              || mode == TImode
>                              || mode == TFmode
>                              || (BYTES_BIG_ENDIAN && advsimd_struct_p));
>
> +  /* If we are dealing with ADDR_QUERY_LDP_STP_N that means the incoming mode
> +     corresponds to the actual size of the memory being loaded/stored and the
> +     mode of the corresponding addressing mode is half of that.  */
> +  if (type == ADDR_QUERY_LDP_STP_N && known_eq (GET_MODE_SIZE (mode), 16))
> +    mode = DFmode;
> +
>    bool allow_reg_index_p = (!load_store_pair_p
>                              && (known_lt (GET_MODE_SIZE (mode), 16)
>                                  || vec_flags == VEC_ADVSIMD

I don't know whether it matters in practice, but that description also applies to Umq, not just Uml.  It might be worth changing it too so that things stay consistent.

Thanks,
Richard
[AArch64][PATCH 1/2] Fix addressing printing of LDP/STP
Hi,

The address printing for LDP/STP patterns that don't use parallel was not working properly when dealing with a post-index addressing mode. The post-index address printing uses the mode's size to determine the post-index immediate. To fix an earlier issue with the range checking of these instructions, this mode was being hard-coded to DFmode for the operand modifier 'y', which was added for this particular pattern. This was done because the range of LDP/STP for two 64-bit operands is the same as that of a single 64-bit load/store.

Instead of hard-coding the mode early on, we introduce a new address query type 'ADDR_QUERY_LDP_STP_N' to be used for such cases. This will halve the mode used for computing the range check, but leave the original mode of the operand as is, making sure the post-index printing is correct.

Bootstrapped and tested on aarch64-none-linux-gnu.

Is this OK for trunk?

gcc
2018-06-07  Andre Vieira

    * config/aarch64/aarch64-protos.h (aarch64_addr_query_type): Add new
    enum value 'ADDR_QUERY_LDP_STP_N'.
    * config/aarch64/aarch64.c (aarch64_addr_query_type): Likewise.
    (aarch64_print_address_internal): Add declaration.
    (aarch64_print_ldpstp_address): Remove.
    (aarch64_classify_address): Adapt mode for 'ADDR_QUERY_LDP_STP_N'.
    (aarch64_print_operand): Change printing of 'y'.
    * config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): Use
    new enum value 'ADDR_QUERY_LDP_STP_N', don't hardcode mode and use
    'true' rather than '1'.
    * gcc/config/aarch64/constraints.md (Uml): Likewise.

stp-1.patch
Description: stp-1.patch
[AArch64][PATCH 2/2] PR target/83009: Relax strict address checking for store pair lanes
Hi,

See below a patch to address PR 83009.

Tested with aarch64-linux-gnu bootstrap and regtests for c, c++ and fortran. Ran the adjusted testcase for -mabi=ilp32.

Is this OK for trunk?

Cheers,
Andre

PR target/83009: Relax strict address checking for store pair lanes

The operand constraint for the memory address of store/load pair lanes was strictly enforcing that only hardware registers be allowed as memory addresses. We want to relax that so that these patterns can be used by combine, prior to reload. During register allocation the register constraint will enforce that the correct register is chosen.

gcc
2018-06-07  Andre Vieira

    PR target/83009
    * config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): Make
    address check not strict prior to reload.

gcc/testsuite
2018-06-07  Andre Vieira

    PR target/83009
    * gcc/target/aarch64/store_v2vec_lanes.c: Add extra tests.

stp-2.patch
Description: stp-2.patch
[AArch64][PATCH 0/2] PR target/83009: Relax strict address checking for store pair lanes
Hi,

This patch series is aimed at resolving PR 83009, which identified a missing optimization opportunity for store pair lanes.

The second patch of this series was initially reverted because it led to wrong codegen. That codegen was a latent issue that was now being hit due to STPs being generated in more cases. The first patch in this series remedies that latent issue.

Andre Vieira (2):
  Fix addressing printing of LDP/STP
  PR target/83009: Relax strict address checking for store pair lanes