[Vectorizer] Support masking fold left reductions
Hi,

This patch adds support in the vectorizer for masking fold left reductions.
This avoids the need to insert a conditional assignment with some identity
value.

For example, this C code:

double
f (double *restrict x, int n)
{
  double res = 0.0;

  for (int i = 0; i < n; i++)
    {
      res += x[i];
    }

  return res;
}

Produced this for SVE:

   0:	2f00e400	movi	d0, #0x0
   4:	713f	cmp	w1, #0x0
   8:	5400018d	b.le	38
   c:	d282	mov	x2, #0x0	// #0
  10:	93407c21	sxtw	x1, w1
  14:	25f8c002	mov	z2.d, #0
  18:	25e11fe0	whilelo	p0.d, xzr, x1
  1c:	25d8e3e1	ptrue	p1.d
  20:	a5e24001	ld1d	{z1.d}, p0/z, [x0, x2, lsl #3]
  24:	04f0e3e2	incd	x2
  28:	05e2c021	sel	z1.d, p0, z1.d, z2.d
  2c:	25e11c40	whilelo	p0.d, x2, x1
  30:	65d82420	fadda	d0, p1, d0, z1.d
  34:	5461	b.ne	20	// b.any
  38:	d65f03c0	ret

And now I get this:

   0:	2f00e400	movi	d0, #0x0
   4:	713f	cmp	w1, #0x0
   8:	5400012d	b.le	2c
   c:	d282	mov	x2, #0x0	// #0
  10:	93407c21	sxtw	x1, w1
  14:	25e11fe0	whilelo	p0.d, xzr, x1
  18:	a5e24001	ld1d	{z1.d}, p0/z, [x0, x2, lsl #3]
  1c:	04f0e3e2	incd	x2
  20:	65d82020	fadda	d0, p0, d0, z1.d
  24:	25e11c40	whilelo	p0.d, x2, x1
  28:	5481	b.ne	18	// b.any
  2c:	d65f03c0	ret

I've added a new test and run the regression testing. Ok for trunk?

Alejandro

2019-06-12  Alejandro Martinez

gcc/
	* config/aarch64/aarch64-sve.md (mask_fold_left_plus_): Renamed from
	"*fold_left_plus_", updated operands order.
	* doc/md.texi (mask_fold_left_plus_@var{m}): Documented new optab.
	* internal-fn.c (mask_fold_left_direct): New define.
	(expand_mask_fold_left_optab_fn): Likewise.
	(direct_mask_fold_left_optab_supported_p): Likewise.
	* internal-fn.def (MASK_FOLD_LEFT_PLUS): New internal function.
	* optabs.def (mask_fold_left_plus_optab): New optab.
	* tree-vect-loop.c (mask_fold_left_plus_optab): New function to get a
	masked internal_fn for a reduction ifn.
	(vectorize_fold_left_reduction): Add support for masking reductions.

gcc/testsuite/
	* gcc.target/aarch64/sve/fadda_1.c: New test.

mask_fold_left_v3.patch
Description: mask_fold_left_v3.patch
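The difference between the two listings can be sketched at the scalar level. FADDA is a strictly in-order (fold-left) FP add reduction, so before the patch the vectorizer had to select the additive identity into inactive lanes and run the reduction unpredicated; after the patch the reduction itself is predicated. This is an illustrative scalar model, not GCC source — the function names and `active` mask array are assumptions for the sketch:

```c
#include <assert.h>

/* Before the patch: inactive lanes are replaced with the additive
   identity 0.0 via a select, then an unmasked in-order add runs over
   every lane (sel z1.d, p0, ... followed by fadda under ptrue).  */
static double
fold_left_with_identity (const double *x, const int *active, int n)
{
  double res = 0.0;
  for (int i = 0; i < n; i++)
    {
      double lane = active[i] ? x[i] : 0.0;
      res += lane;
    }
  return res;
}

/* After the patch: the fold-left add is itself predicated on the loop
   mask (fadda d0, p0, ...), so inactive lanes are simply skipped.  */
static double
fold_left_masked (const double *x, const int *active, int n)
{
  double res = 0.0;
  for (int i = 0; i < n; i++)
    if (active[i])
      res += x[i];
  return res;
}
```

The masked form saves the `mov z2.d, #0`, the `sel`, and the extra `ptrue` predicate, which is why the second listing is three instructions shorter.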
[PATCH] PR tree-optimization/90681 Fix ICE in vect_slp_analyze_node_operations_1
Hi,

This patch fixes bug 90681, which was caused by trying to SLP-vectorize a
non-grouped load. We fixed it by tweaking the implementation a bit: mark
masked loads as not vectorizable, but support them as a special case, and
then detect them in the test for normal non-grouped loads that was already
there.

The bug reproducer now works and the performance test we added is still
happy.

Alejandro

gcc/ChangeLog:

2019-05-31  Alejandro Martinez

	PR tree-optimization/90681
	* internal-fn.c (mask_load_direct): Mark as non-vectorizable again.
	* tree-vect-slp.c (vect_build_slp_tree_1): Add masked loads as a
	special case for SLP, but fail on non-grouped loads.

gcc/testsuite/ChangeLog:

2019-05-31  Alejandro Martinez

	PR tree-optimization/90681
	* gfortran.dg/vect/pr90681.f: Bug reproducer.

fix.patch
Description: fix.patch
RE: Implement vector average patterns for SVE2
Turns out I was missing a few bits and pieces. Here is the updated patch and
changelog.

Alejandro

2019-05-29  Alejandro Martinez

	* config/aarch64/aarch64-c.c: Added TARGET_SVE2.
	* config/aarch64/aarch64-sve2.md: New file.
	(avg3_floor): New pattern.
	(avg3_ceil): Likewise.
	(*h): Likewise.
	* config/aarch64/aarch64.h: Added AARCH64_ISA_SVE2 and TARGET_SVE2.
	* config/aarch64/aarch64.md: Include aarch64-sve2.md.

2019-05-29  Alejandro Martinez

gcc/testsuite/
	* gcc.target/aarch64/sve2/aarch64-sve2.exp: New file, regression
	driver for AArch64 SVE2.
	* gcc.target/aarch64/sve2/average_1.c: New test.
	* lib/target-supports.exp (check_effective_target_aarch64_sve2): New
	helper.
	(check_effective_target_aarch64_sve1_only): Likewise.
	(check_effective_target_aarch64_sve2_hw): Likewise.
	(check_effective_target_vect_avg_qi): Check for SVE1 only.

> -----Original Message-----
> From: Richard Sandiford
> Sent: 29 May 2019 10:54
> To: Alejandro Martinez Vicente
> Cc: GCC Patches ; nd
> Subject: Re: Implement vector average patterns for SVE2
>
> Alejandro Martinez Vicente writes:
> > Hi,
> >
> > This patch implements the [u]avgM3_floor and [u]avgM3_ceil optabs for
> > SVE2.
> >
> > Alejandro
> >
> > gcc/ChangeLog:
> >
> > 2019-05-28  Alejandro Martinez
> >
> > 	* config/aarch64/aarch64-sve2.md: New file.
> > 	(avg3_floor): New pattern.
> > 	(avg3_ceil): Likewise.
> > 	(*h): Likewise.
> > 	* config/aarch64/aarch64.md: Include aarch64-sve2.md.
> >
> >
> > 2019-05-28  Alejandro Martinez
> >
> > gcc/testsuite/
> > 	* gcc.target/aarch64/sve2/average_1.c: New test.
> > 	* lib/target-supports.exp
> > 	(check_effective_target_aarch64_sve1_only): New helper.
> > 	(check_effective_target_vect_avg_qi): Check for SVE1 only.
>
> OK, thanks, but...
>
> > diff --git gcc/testsuite/lib/target-supports.exp
> > gcc/testsuite/lib/target-supports.exp
> > index f69106d..41431e6 100644
> > --- gcc/testsuite/lib/target-supports.exp
> > +++ gcc/testsuite/lib/target-supports.exp
> > @@ -3308,6 +3308,12 @@ proc check_effective_target_aarch64_sve2 { } {
> >      }]
> > }
> >
> > +# Return 1 if this is an AArch64 target only supporting SVE (not SVE2).
> > +proc check_effective_target_aarch64_sve1_only { } {
> > +    return [expr { [check_effective_target_aarch64_sve]
> > +		   && ![check_effective_target_aarch64_sve2] }]
> > +}
>
> ...it needs check_effective_target_aarch64_sve2 to go in first.
>
> Richard

vavg_sve2_v2.patch
Description: vavg_sve2_v2.patch
Implement vector average patterns for SVE2
Hi,

This patch implements the [u]avgM3_floor and [u]avgM3_ceil optabs for SVE2.

Alejandro

gcc/ChangeLog:

2019-05-28  Alejandro Martinez

	* config/aarch64/aarch64-sve2.md: New file.
	(avg3_floor): New pattern.
	(avg3_ceil): Likewise.
	(*h): Likewise.
	* config/aarch64/aarch64.md: Include aarch64-sve2.md.

2019-05-28  Alejandro Martinez

gcc/testsuite/
	* gcc.target/aarch64/sve2/average_1.c: New test.
	* lib/target-supports.exp (check_effective_target_aarch64_sve1_only):
	New helper.
	(check_effective_target_vect_avg_qi): Check for SVE1 only.

vavg_sve2.patch
Description: vavg_sve2.patch
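For reference, the avg3_floor and avg3_ceil optabs compute element-wise averages of two vectors, rounding down and up respectively, without widening. A scalar model of the unsigned byte case is sketched below (an illustrative sketch, not taken from the patch); the bitwise identities shown are the standard way to express the halving adds without the intermediate overflow that `(x + y) / 2` would hit in 8 bits:

```c
#include <assert.h>
#include <stdint.h>

/* Floor average: equivalent to (x + y) >> 1 computed in a wider type.  */
static uint8_t
uavg_floor (uint8_t x, uint8_t y)
{
  return (x & y) + ((x ^ y) >> 1);
}

/* Ceiling (rounding) average: equivalent to (x + y + 1) >> 1 computed
   in a wider type.  */
static uint8_t
uavg_ceil (uint8_t x, uint8_t y)
{
  return (x | y) - ((x ^ y) >> 1);
}
```

The two forms differ only when `x + y` is odd, which is why the vectorizer needs both the `_floor` and `_ceil` optab entries.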
RE: [Vectorizer] Add SLP support for masked loads
Hi Richards, This is the new version of the patch, addressing your comments. Alejandro > -Original Message- > From: Richard Sandiford > Sent: 08 May 2019 14:36 > To: Richard Biener > Cc: Alejandro Martinez Vicente ; GCC > Patches ; nd > Subject: Re: [Vectorizer] Add SLP support for masked loads > > Richard Biener writes: > > On Fri, Apr 26, 2019 at 3:14 PM Richard Sandiford > > wrote: > >> > >> Alejandro Martinez Vicente > writes: > >> > Hi, > >> > > >> > Current vectorizer doesn't support masked loads for SLP. We should > >> > add that, to allow things like: > >> > > >> > void > >> > f (int *restrict x, int *restrict y, int *restrict z, int n) { > >> > for (int i = 0; i < n; i += 2) > >> > { > >> > x[i] = y[i] ? z[i] : 1; > >> > x[i + 1] = y[i + 1] ? z[i + 1] : 2; > >> > } > >> > } > >> > > >> > to be vectorized using contiguous loads rather than LD2 and ST2. > >> > > >> > This patch was motivated by SVE, but it is completely generic and > >> > should apply to any architecture with masked loads. 
> >> > > >> > After the patch is applied, the above code generates this output > >> > (-march=armv8.2-a+sve -O2 -ftree-vectorize): > >> > > >> > : > >> >0: 717fcmp w3, #0x0 > >> >4: 540002cdb.le5c > >> >8: 51000464sub w4, w3, #0x1 > >> >c: d283mov x3, #0x0// #0 > >> > 10: 9005adrpx5, 0 > >> > 14: 25d8e3e0ptrue p0.d > >> > 18: 53017c84lsr w4, w4, #1 > >> > 1c: 91a5add x5, x5, #0x0 > >> > 20: 11000484add w4, w4, #0x1 > >> > 24: 85c0e0a1ld1rd {z1.d}, p0/z, [x5] > >> > 28: 2598e3e3ptrue p3.s > >> > 2c: d37ff884lsl x4, x4, #1 > >> > 30: 25a41fe2whilelo p2.s, xzr, x4 > >> > 34: d503201fnop > >> > 38: a5434820ld1w{z0.s}, p2/z, [x1, x3, lsl #2] > >> > 3c: 25808c11cmpne p1.s, p3/z, z0.s, #0 > >> > 40: 25808810cmpne p0.s, p2/z, z0.s, #0 > >> > 44: a5434040ld1w{z0.s}, p0/z, [x2, x3, lsl #2] > >> > 48: 05a1c400sel z0.s, p1, z0.s, z1.s > >> > 4c: e5434800st1w{z0.s}, p2, [x0, x3, lsl #2] > >> > 50: 04b0e3e3incwx3 > >> > 54: 25a41c62whilelo p2.s, x3, x4 > >> > 58: 5401b.ne38 // b.any > >> > 5c: d65f03c0ret > >> > > >> > > >> > I tested this patch in an aarch64 machine bootstrapping the > >> > compiler and running the checks. > >> > > >> > Alejandro > >> > > >> > gcc/Changelog: > >> > > >> > 2019-01-16 Alejandro Martinez > >> > > >> > * config/aarch64/aarch64-sve.md (copysign3): New > define_expand. > >> > (xorsign3): Likewise. > >> > internal-fn.c: Marked mask_load_direct and mask_store_direct as > >> > vectorizable. > >> > tree-data-ref.c (data_ref_compare_tree): Fixed comment typo. > >> > tree-vect-data-refs.c (can_group_stmts_p): Allow masked loads to be > >> > combined even if masks different. > >> > (slp_vect_only_p): New function to detect masked loads that are > >> > only > >> > vectorizable using SLP. > >> > (vect_analyze_data_ref_accesses): Mark SLP only vectorizable > >> > groups. > >> > tree-vect-loop.c (vect_dissolve_slp_only_groups): New function to > >> > dissolve SLP-only vectorizable groups when SLP has been discarded. 
> >> > (vect_analyze_loop_2): Call vect_dissolve_slp_only_groups when > needed. > >> > tree-vect-slp.c (vect_get_and_check_slp_defs): Check masked loads > >> > masks. > >> > (vect_build_slp_tree_1): Fixed comment typo. > >> > (vect_build_slp_tree_2): Include masks from masked loads in SLP > tree. > >> > tree-vect-stmts.c (vect_get_vec_defs_for_operand): New function to > get > >> > vec_defs for operand with optional SLP and vectype. > >> >
RE: [Aarch64][SVE] Vectorise sum-of-absolute-differences
Great, committed in rev. 270975 Alejandro > -Original Message- > From: Richard Sandiford > Sent: 07 May 2019 17:18 > To: Alejandro Martinez Vicente > Cc: James Greenhalgh ; GCC Patches patc...@gcc.gnu.org>; nd ; Richard Biener > > Subject: Re: [Aarch64][SVE] Vectorise sum-of-absolute-differences > > Alejandro Martinez Vicente writes: > > Thanks for your comments Richard. I think this patch addresses them. > > Yeah, this is OK to install, thanks. > > Richard > > > > > Alejandro > > > >> -Original Message- > >> From: Richard Sandiford > >> Sent: 07 May 2019 15:46 > >> To: Alejandro Martinez Vicente > >> Cc: James Greenhalgh ; GCC Patches >> patc...@gcc.gnu.org>; nd ; Richard Biener > >> > >> Subject: Re: [Aarch64][SVE] Vectorise sum-of-absolute-differences > >> > >> Alejandro Martinez Vicente > writes: > >> > +;; Helper expander for aarch64_abd_3 to save the callers > >> > +;; the hassle of constructing the other arm of the MINUS. > >> > +(define_expand "abd_3" > >> > + [(use (match_operand:SVE_I 0 "register_operand")) > >> > + (USMAX:SVE_I (match_operand:SVE_I 1 "register_operand") > >> > +(match_operand:SVE_I 2 "register_operand"))] > >> > + "TARGET_SVE" > >> > + { > >> > +rtx other_arm > >> > + = simplify_gen_binary (, mode, operands[1], > >> > +operands[2]); > >> > >> I realise this is just copied from the Advanced SIMD version, but > >> simplify_gen_binary is a bit dangerous here, since we explicitly want > >> an unsimplified with the two operands given. Probably > better as: > >> > >> gen_rtx_ (mode, ...) > >> > >> > +emit_insn (gen_aarch64_abd_3 (operands[0], > operands[1], > >> > + operands[2], other_arm)); > >> > +DONE; > >> > + } > >> > +) > >> > + > >> > +;; Unpredicated integer absolute difference. 
> >> > +(define_expand "aarch64_abd_3" > >> > + [(set (match_operand:SVE_I 0 "register_operand") > >> > +(unspec:SVE_I > >> > + [(match_dup 4) > >> > + (minus:SVE_I > >> > + (USMAX:SVE_I > >> > + (match_operand:SVE_I 1 "register_operand" "w") > >> > + (match_operand:SVE_I 2 "register_operand" "w")) > >> > + (match_operator 3 "aarch64_" > >> > + [(match_dup 1) > >> > +(match_dup 2)]))] > >> > + UNSPEC_MERGE_PTRUE))] > >> > + "TARGET_SVE" > >> > + { > >> > +operands[4] = force_reg (mode, CONSTM1_RTX > >> (mode)); > >> > + } > >> > +) > >> > >> I think we should go directly from abd_3 to the final > >> pattern, so that abd_3 does the force_reg too. This would > make... > >> > >> > +;; Predicated integer absolute difference. > >> > +(define_insn "*aarch64_abd_3" > >> > >> ...this the named pattern, instead of starting with "*". > >> > >> > + [(set (match_operand:SVE_I 0 "register_operand" "=w, ?") > >> > +(unspec:SVE_I > >> > + [(match_operand: 1 "register_operand" "Upl, Upl") > >> > + (minus:SVE_I > >> > + (USMAX:SVE_I > >> > + (match_operand:SVE_I 2 "register_operand" "w, w") > >> > >> Should be "0, w", so that the first alternative ties the input to the > >> output. > >> > >> > + (match_operand:SVE_I 3 "register_operand" "w, w")) > >> > + (match_operator 4 "aarch64_" > >> > + [(match_dup 2) > >> > +(match_dup 3)]))] > >> > + UNSPEC_MERGE_PTRUE))] > >> > + "TARGET_SVE" > >> > + "@ > >> > + abd\t%0., %1/m, %0., %3. > >> > + > >> > movprfx\t%0, %2\;abd\t%0., %1/m, %0., %3. >> >" > >> > + [(set_attr "movprfx" "*,yes")] > >> > +) > >> > + > >> > +;; Emit a sequence t
RE: [Aarch64][SVE] Vectorise sum-of-absolute-differences
Thanks for your comments Richard. I think this patch addresses them. Alejandro > -Original Message- > From: Richard Sandiford > Sent: 07 May 2019 15:46 > To: Alejandro Martinez Vicente > Cc: James Greenhalgh ; GCC Patches patc...@gcc.gnu.org>; nd ; Richard Biener > > Subject: Re: [Aarch64][SVE] Vectorise sum-of-absolute-differences > > Alejandro Martinez Vicente writes: > > +;; Helper expander for aarch64_abd_3 to save the callers ;; > > +the hassle of constructing the other arm of the MINUS. > > +(define_expand "abd_3" > > + [(use (match_operand:SVE_I 0 "register_operand")) > > + (USMAX:SVE_I (match_operand:SVE_I 1 "register_operand") > > + (match_operand:SVE_I 2 "register_operand"))] > > + "TARGET_SVE" > > + { > > +rtx other_arm > > + = simplify_gen_binary (, mode, operands[1], > > +operands[2]); > > I realise this is just copied from the Advanced SIMD version, but > simplify_gen_binary is a bit dangerous here, since we explicitly want an > unsimplified with the two operands given. Probably better as: > > gen_rtx_ (mode, ...) > > > +emit_insn (gen_aarch64_abd_3 (operands[0], operands[1], > > + operands[2], other_arm)); > > +DONE; > > + } > > +) > > + > > +;; Unpredicated integer absolute difference. > > +(define_expand "aarch64_abd_3" > > + [(set (match_operand:SVE_I 0 "register_operand") > > + (unspec:SVE_I > > + [(match_dup 4) > > + (minus:SVE_I > > +(USMAX:SVE_I > > + (match_operand:SVE_I 1 "register_operand" "w") > > + (match_operand:SVE_I 2 "register_operand" "w")) > > +(match_operator 3 "aarch64_" > > + [(match_dup 1) > > + (match_dup 2)]))] > > + UNSPEC_MERGE_PTRUE))] > > + "TARGET_SVE" > > + { > > +operands[4] = force_reg (mode, CONSTM1_RTX > (mode)); > > + } > > +) > > I think we should go directly from abd_3 to the final pattern, so > that abd_3 does the force_reg too. This would make... > > > +;; Predicated integer absolute difference. > > +(define_insn "*aarch64_abd_3" > > ...this the named pattern, instead of starting with "*". 
> > > + [(set (match_operand:SVE_I 0 "register_operand" "=w, ?") > > + (unspec:SVE_I > > + [(match_operand: 1 "register_operand" "Upl, Upl") > > + (minus:SVE_I > > +(USMAX:SVE_I > > + (match_operand:SVE_I 2 "register_operand" "w, w") > > Should be "0, w", so that the first alternative ties the input to the output. > > > + (match_operand:SVE_I 3 "register_operand" "w, w")) > > +(match_operator 4 "aarch64_" > > + [(match_dup 2) > > + (match_dup 3)]))] > > + UNSPEC_MERGE_PTRUE))] > > + "TARGET_SVE" > > + "@ > > + abd\t%0., %1/m, %0., %3. > > + > movprfx\t%0, %2\;abd\t%0., %1/m, %0., %3. >" > > + [(set_attr "movprfx" "*,yes")] > > +) > > + > > +;; Emit a sequence to produce a sum-of-absolute-differences of the > > +inputs in ;; operands 1 and 2. The sequence also has to perform a > > +widening reduction of ;; the difference into a vector and accumulate > > +that into operand 3 before ;; copying that into the result operand 0. > > +;; Perform that with a sequence of: > > +;; MOV ones.b, #1 > > +;; UABDdiff.b, p0/m, op1.b, op2.b > > +;; UDOTop3.s, diff.b, ones.b > > +;; MOV op0, op3 // should be eliminated in later passes. > > +;; The signed version just uses the signed variants of the above > instructions. > > Think it would be clearer if we removed the last line and just used [SU]ABD > instead of UABD, since that's the only sign-dependent part of the operation. > Also think we should SVEize it with MOVPRFX, since a separate MOV should > never be needed: > > ;; MOVones.b, #1 > ;; [SU]ABDdiff.b, ptrue/m, op1.b, op2.b > ;; MOVPRFXop0, op3// If necessary > ;; UDOT op0.s, diff.b, ones.b > > > +(define_expand "sad" > > + [(use (match_operand:SVE_SDI 0 "register_operand")) > > + (unspec: [(use (match_operand: 1 "register_operand")) > > + (use (match_operand: 2 "register_operand"
RE: [Aarch64][SVE] Vectorise sum-of-absolute-differences
Hi,

I updated the patch after the dot product went in. This is the new cover
letter:

This patch adds support to vectorize sums of absolute differences (SAD_EXPR)
using SVE.

Given this input code:

int
sum_abs (uint8_t *restrict x, uint8_t *restrict y, int n)
{
  int sum = 0;

  for (int i = 0; i < n; i++)
    {
      sum += __builtin_abs (x[i] - y[i]);
    }

  return sum;
}

The resulting SVE code is:

   0:	715f	cmp	w2, #0x0
   4:	5400026d	b.le	50
   8:	d283	mov	x3, #0x0	// #0
   c:	93407c42	sxtw	x2, w2
  10:	2538c002	mov	z2.b, #0
  14:	25221fe0	whilelo	p0.b, xzr, x2
  18:	2538c023	mov	z3.b, #1
  1c:	2518e3e1	ptrue	p1.b
  20:	a4034000	ld1b	{z0.b}, p0/z, [x0, x3]
  24:	a4034021	ld1b	{z1.b}, p0/z, [x1, x3]
  28:	0430e3e3	incb	x3
  2c:	0520c021	sel	z1.b, p0, z1.b, z0.b
  30:	25221c60	whilelo	p0.b, x3, x2
  34:	040d0420	uabd	z0.b, p1/m, z0.b, z1.b
  38:	44830402	udot	z2.s, z0.b, z3.b
  3c:	5421	b.ne	20	// b.any
  40:	2598e3e0	ptrue	p0.s
  44:	04812042	uaddv	d2, p0, z2.s
  48:	1e260040	fmov	w0, s2
  4c:	d65f03c0	ret
  50:	1e2703e2	fmov	s2, wzr
  54:	1e260040	fmov	w0, s2
  58:	d65f03c0	ret

Notice how udot is used inside a fully masked loop.

I tested this patch on an aarch64 machine, bootstrapping the compiler and
running the checks.

Alejandro

gcc/ChangeLog:

2019-05-07  Alejandro Martinez

	* config/aarch64/aarch64-sve.md (abd_3): New define_expand.
	(aarch64_abd_3): Likewise.
	(*aarch64_abd_3): New define_insn.
	(sad): New define_expand.
	* config/aarch64/iterators.md: Added MAX_OPP attribute.
	* tree-vect-loop.c (use_mask_by_cond_expr_p): Add SAD_EXPR.
	(build_vect_cond_expr): Likewise.

gcc/testsuite/ChangeLog:

2019-05-07  Alejandro Martinez

	* gcc.target/aarch64/sve/sad_1.c: New test for sum of absolute
	differences.
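The listing decomposes the SAD reduction into UABD (byte absolute differences) followed by UDOT against a vector of all-ones, which widens and accumulates groups of four adjacent bytes into 32-bit lanes; the `sel` before `uabd` copies the x lanes into inactive lanes of y so their difference is zero. A scalar model of the active part of that decomposition (an illustrative sketch — the names and the four-lane accumulator are assumptions, not GCC source):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the uabd + udot + uaddv sequence: each 32-bit
   accumulator lane sums the absolute differences (each multiplied
   by 1, the "ones" vector) of four adjacent byte positions, and a
   final across-vector add combines the lanes.  */
static uint32_t
sad_model (const uint8_t *x, const uint8_t *y, int n)
{
  uint32_t acc[4] = { 0, 0, 0, 0 };            /* z2.s lanes */
  for (int i = 0; i < n; i++)
    {
      uint8_t diff = x[i] > y[i] ? x[i] - y[i]  /* uabd */
                                 : y[i] - x[i];
      acc[(i / 4) % 4] += diff * 1;             /* udot with z3.b = #1 */
    }
  return acc[0] + acc[1] + acc[2] + acc[3];     /* uaddv */
}
```

Splitting the work this way is what lets the loop stay fully masked: UABD is predicated, and UDOT only ever sees zeros in the inactive lanes.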
> -Original Message- > From: gcc-patches-ow...@gcc.gnu.org > On Behalf Of Alejandro Martinez Vicente > Sent: 11 February 2019 15:38 > To: James Greenhalgh > Cc: GCC Patches ; nd ; Richard > Sandiford ; Richard Biener > > Subject: RE: [Aarch64][SVE] Vectorise sum-of-absolute-differences > > > -Original Message- > > From: James Greenhalgh > > Sent: 06 February 2019 17:42 > > To: Alejandro Martinez Vicente > > Cc: GCC Patches ; nd ; Richard > > Sandiford ; Richard Biener > > > > Subject: Re: [Aarch64][SVE] Vectorise sum-of-absolute-differences > > > > On Mon, Feb 04, 2019 at 07:34:05AM -0600, Alejandro Martinez Vicente > > wrote: > > > Hi, > > > > > > This patch adds support to vectorize sum of absolute differences > > > (SAD_EXPR) using SVE. It also uses the new functionality to ensure > > > that the resulting loop is masked. Therefore, it depends on > > > > > > https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00016.html > > > > > > Given this input code: > > > > > > int > > > sum_abs (uint8_t *restrict x, uint8_t *restrict y, int n) { > > > int sum = 0; > > > > > > for (int i = 0; i < n; i++) > > > { > > > sum += __builtin_abs (x[i] - y[i]); > > > } > > > > > > return sum; > > > } > > > > > > The resulting SVE code is: > > > > > > : > > >0: 715fcmp w2, #0x0 > > >4: 5400026db.le50 > > >8: d283mov x3, #0x0// #0 > > >c: 93407c42sxtwx2, w2 > > > 10: 2538c002mov z2.b, #0 > > > 14: 25221fe0whilelo p0.b, xzr, x2 > > > 18: 2538c023mov z3.b, #1 > > > 1c: 2518e3e1ptrue p1.b > > > 20: a4034000ld1b{z0.b}, p0/z, [x0, x3] > > > 24: a4034021ld1b{z1.b}, p0/z, [x1, x3] > > > 28: 0430e3e3incbx3 > > > 2c: 0520c021sel z1.b, p0, z1.b, z0.b > > > 30: 25221c60whilelo p0.b, x3, x2 > > > 34: 040d0420uabdz0.b, p1/m, z0.b, z1.b > > > 38: 44830402udotz2.s, z0.b, z3.b > > > 3c: 5421b.ne20 // b.any > > > 40: 2598e3e0ptrue p0.s > > > 44: 04812042uaddv d2, p0, z2.s > > > 48: 1e260040fmovw0, s2 > > > 4c: d
RE: [Aarch64][SVE] Dot product support
> -Original Message- > From: Richard Sandiford > Sent: 29 April 2019 09:42 > To: Alejandro Martinez Vicente > Cc: GCC Patches ; nd ; Richard > Biener > Subject: Re: [Aarch64][SVE] Dot product support > > Alejandro Martinez Vicente writes: > > @@ -5885,6 +5885,56 @@ is_nonwrapping_integer_induction > (stmt_vec_info stmt_vinfo, struct loop *loop) > > <= TYPE_PRECISION (lhs_type)); > > } > > > > +/* Check if masking can be supported by inserting a condional expression. > > conditional > > > + CODE is the code for the operation. COND_FN is the conditional internal > > + function, if it exists. VECTYPE_IN is the type of the vector > > +input. */ static bool use_mask_by_cond_expr_p (enum tree_code code, > > +internal_fn cond_fn, > > +tree vectype_in) > > +{ > > + if (cond_fn != IFN_LAST > > + && direct_internal_fn_supported_p (cond_fn, vectype_in, > > +OPTIMIZE_FOR_SPEED)) > > +return false; > > + > > + switch (code) > > +{ > > +case DOT_PROD_EXPR: > > + return true; > > + > > +default: > > + return false; > > +} > > +} > > + > > +/* Insert a condional expression to enable masked vectorization. > > +CODE is the > > Same here. > > > + code for the operation. VOP is the array of operands. MASK is the loop > > + mask. GSI is a statement iterator used to place the new conditional > > + expression. */ > > +static void > > +build_vect_cond_expr (enum tree_code code, tree vop[3], tree mask, > > + gimple_stmt_iterator *gsi) > > +{ > > + switch (code) > > +{ > > +case DOT_PROD_EXPR: > > + { > > + tree vectype = TREE_TYPE (vop[1]); > > + tree zero = build_zero_cst (vectype); > > + zero = build_vector_from_val (vectype, zero); > > This last line isn't right -- should just delete it. > > tree zero = build_zero_cst (vectype); > > builds a zero vector in one go. > > OK with those changes, thanks. (This version didn't include the testcase, but > I assume that's because it didn't change from last time.) > Done. I forgot to add the testcase in v2. 
Alejandro > Richard dot_v3.patch Description: dot_v3.patch
RE: [Aarch64][SVE] Dot product support
Hi Richard, This is the updated patch with your comments. In addition to that, I removed vectype_in from the build_vect_cond_expr call, since it wasn't really necessary. Alejandro > -Original Message- > From: Richard Sandiford > Sent: 26 April 2019 14:29 > To: Alejandro Martinez Vicente > Cc: GCC Patches ; nd ; Richard > Biener > Subject: Re: [Aarch64][SVE] Dot product support > > Alejandro Martinez Vicente writes: > > Hi, > > > > This patch does two things. For the general vectorizer, it adds > > support to perform fully masked reductions over expressions that don't > support masking. > > This is achieved by using VEC_COND_EXPR where possible. At the moment > > this is implemented for DOT_PROD_EXPR only, but the framework is there > > to extend it to other expressions. > > > > Related to that, this patch adds support to vectorize dot product > > using SVE. It also uses the new functionality to ensure that the resulting > loop is masked. > > > > Given this input code: > > > > uint32_t > > dotprod (uint8_t *restrict x, uint8_t *restrict y, int n) { > > uint32_t sum = 0; > > > > for (int i = 0; i < n; i++) > > { > > sum += x[i] * y[i]; > > } > > > > return sum; > > } > > > > The resulting SVE code is: > > > > : > >0: 715fcmp w2, #0x0 > >4: 5400024db.le4c > >8: d283mov x3, #0x0// #0 > >c: 93407c42sxtwx2, w2 > > 10: 2538c001mov z1.b, #0 > > 14: 25221fe0whilelo p0.b, xzr, x2 > > 18: 2538c003mov z3.b, #0 > > 1c: d503201fnop > > 20: a4034002ld1b{z2.b}, p0/z, [x0, x3] > > 24: a4034020ld1b{z0.b}, p0/z, [x1, x3] > > 28: 0430e3e3incbx3 > > 2c: 0523c000sel z0.b, p0, z0.b, z3.b > > 30: 25221c60whilelo p0.b, x3, x2 > > 34: 44820401udotz1.s, z0.b, z2.b > > 38: 5441b.ne20 // b.any > > 3c: 2598e3e0ptrue p0.s > > 40: 04812021uaddv d1, p0, z1.s > > 44: 1e260020fmovw0, s1 > > 48: d65f03c0ret > > 4c: 1e2703e1fmovs1, wzr > > 50: 1e260020fmovw0, s1 > > 54: d65f03c0ret > > > > Notice how udot is used inside a fully masked loop. 
> > > > I tested this patch in an aarch64 machine bootstrapping the compiler > > and running the checks. > > > > I admit it is too late to merge this into gcc 9, but I'm posting it > > anyway so it can be considered for gcc 10. > > > > Alejandro > > > > > > gcc/Changelog: > > > > 2019-01-31 Alejandro Martinez > > > > * config/aarch64/aarch64-sve.md (dot_prod): Taken > from SVE > > ACLE branch. > > * config/aarch64/iterators.md: Copied Vetype_fourth, VSI2QI and > vsi2qi from > > SVE ACLE branch. > > * tree-vect-loop.c (use_mask_by_cond_expr_p): New function to > check if a > > VEC_COND_EXPR be inserted to emulate a conditional internal > function. > > (build_vect_cond_expr): Emit the VEC_COND_EXPR. > > (vectorizable_reduction): Use the functions above to vectorize in a > > fully masked loop codes that don't have a conditional internal > > function. > > > > gcc/testsuite/Changelog: > > > > 2019-01-31 Alejandro Martinez > > > > * gcc.target/aarch64/sve/dot_1.c: New test for dot product. > > > > diff --git a/gcc/config/aarch64/aarch64-sve.md > > b/gcc/config/aarch64/aarch64-sve.md > > index 5bb3422..2779a21 100644 > > --- a/gcc/config/aarch64/aarch64-sve.md > > +++ b/gcc/config/aarch64/aarch64-sve.md > > @@ -3128,3 +3128,17 @@ > > DONE; > >} > > ) > > + > > +;; Unpredicated DOT product. > > +(define_insn "dot_prod" > > + [(set (match_operand:SVE_SDI 0 "register_operand" "=w, ?") > > + (plus:SVE_SDI (unspec:SVE_SDI [(match_operand: 1 > "register_operand" "w, w") > > + (match_operand: 2 > "register_operand" "w, w")] > > + DOTPROD) > > + (match_operand:SVE_SDI 3 "register_operand" "0, w")))] > > + "TARGET_SVE" > > + "@ > >
[MAINTAINERS] Add myself to MAINTAINERS
Add myself to write after approval.

Alejandro

Committed to trunk in r269246

Index: MAINTAINERS
===================================================================
--- MAINTAINERS	(revision 269244)
+++ MAINTAINERS	(working copy)
@@ -495,6 +495,7 @@
 Jose E. Marchesi
 Patrick Marlier
 Simon Martin
+Alejandro Martinez
 Ranjit Mathew
 Paulo Matos
 Michael Matz
[AArch64, SVE] Fix vectorized FP converts
Hi,

Some of the narrowing/widening FP converts were missing from SVE. I fixed
most of them, so they can be vectorized. The ones still missing are
int64->fp16 and fp16->int64. I extended the tests to cover the cases that
were missing.

I validated the patch with self-checking and by running the new SVE tests on
an SVE emulator.

Alejandro

gcc/ChangeLog:

2019-02-25  Alejandro Martinez

	* config/aarch64/aarch64-sve.md (aarch64_sve__vnx8hf2,
	aarch64_sve__vnx4sf2): Renamed FP to int patterns.
	(vec_unpack_fix_trunc__, vec_pack_float_): New unpack/pack expanders.
	* config/aarch64/iterators.md (SVE_HSDI): Fix cut-&-paste of SVE_BHSI.
	(VWIDEINT): New iterator.
	(VwideInt): Likewise.

gcc/testsuite/ChangeLog:

2019-02-25  Alejandro Martinez

	* gcc.target/aarch64/sve/fcvt_1.c: New test for fp to fp convert.
	* gcc.target/aarch64/sve/fcvt_1_run.c: Likewise.
	* gcc.target/aarch64/sve/cvtf_signed_1.c: Improved test to cover
	widening and narrowing cases.
	* gcc.target/aarch64/sve/cvtf_signed_1_run.c: Likewise.
	* gcc.target/aarch64/sve/cvtf_unsigned_1.c: Likewise.
	* gcc.target/aarch64/sve/cvtf_unsigned_1_run.c: Likewise.
	* gcc.target/aarch64/sve/fcvtz_signed_1.c: Likewise.
	* gcc.target/aarch64/sve/fcvtz_signed_1_run.c: Likewise.
	* gcc.target/aarch64/sve/fcvtz_unsigned_1.c: Likewise.
	* gcc.target/aarch64/sve/fcvtz_unsigned_1_run.c: Likewise.

cvt_v4.patch
Description: cvt_v4.patch
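As a hedged illustration (these kernels are hypothetical, not taken from the patch's testsuite), the kinds of loops affected are ones where each iteration converts between element widths, so the vectorizer needs the unpack/pack convert expanders this patch adds:

```c
#include <assert.h>
#include <stdint.h>

/* Narrowing FP convert: two fp64 vectors pack into one fp32 vector.  */
static void
narrow_double_to_float (float *dst, const double *src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (float) src[i];
}

/* Widening FP-to-int convert: one fp32 vector unpacks into two int64
   vectors before the truncating convert.  */
static void
widen_float_to_int64 (int64_t *dst, const float *src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (int64_t) src[i];   /* C truncates toward zero */
}
```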
RE: [Aarch64][SVE] Vectorise sum-of-absolute-differences
> -Original Message- > From: James Greenhalgh > Sent: 06 February 2019 17:42 > To: Alejandro Martinez Vicente > Cc: GCC Patches ; nd ; Richard > Sandiford ; Richard Biener > > Subject: Re: [Aarch64][SVE] Vectorise sum-of-absolute-differences > > On Mon, Feb 04, 2019 at 07:34:05AM -0600, Alejandro Martinez Vicente > wrote: > > Hi, > > > > This patch adds support to vectorize sum of absolute differences > > (SAD_EXPR) using SVE. It also uses the new functionality to ensure > > that the resulting loop is masked. Therefore, it depends on > > > > https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00016.html > > > > Given this input code: > > > > int > > sum_abs (uint8_t *restrict x, uint8_t *restrict y, int n) { > > int sum = 0; > > > > for (int i = 0; i < n; i++) > > { > > sum += __builtin_abs (x[i] - y[i]); > > } > > > > return sum; > > } > > > > The resulting SVE code is: > > > > : > >0: 715fcmp w2, #0x0 > >4: 5400026db.le50 > >8: d283mov x3, #0x0// #0 > >c: 93407c42sxtwx2, w2 > > 10: 2538c002mov z2.b, #0 > > 14: 25221fe0whilelo p0.b, xzr, x2 > > 18: 2538c023mov z3.b, #1 > > 1c: 2518e3e1ptrue p1.b > > 20: a4034000ld1b{z0.b}, p0/z, [x0, x3] > > 24: a4034021ld1b{z1.b}, p0/z, [x1, x3] > > 28: 0430e3e3incbx3 > > 2c: 0520c021sel z1.b, p0, z1.b, z0.b > > 30: 25221c60whilelo p0.b, x3, x2 > > 34: 040d0420uabdz0.b, p1/m, z0.b, z1.b > > 38: 44830402udotz2.s, z0.b, z3.b > > 3c: 5421b.ne20 // b.any > > 40: 2598e3e0ptrue p0.s > > 44: 04812042uaddv d2, p0, z2.s > > 48: 1e260040fmovw0, s2 > > 4c: d65f03c0ret > > 50: 1e2703e2fmovs2, wzr > > 54: 1e260040fmovw0, s2 > > 58: d65f03c0ret > > > > Notice how udot is used inside a fully masked loop. > > > > I tested this patch in an aarch64 machine bootstrapping the compiler > > and running the checks. > > This doesn't give us much confidence in SVE coverage; unless you have been > running in an environment using SVE by default? 
> Do you have some set of workloads you could test the compiler against to
> ensure correct operation of the SVE vectorization?

I tested it using an SVE model and a big set of workloads, including SPEC
2000, 2006 and 2017. On the plus side, nothing got broken, but the impact on
performance was very minimal (on average, a tiny gain over the whole set of
workloads).

I still want this patch (and the companion dot product patch) to make it into
the compiler because they are the first steps towards vectorising workloads
using fully masked loops when the target ISA (like SVE) doesn't support
masking in all the operations.

Alejandro

> >
> > I admit it is too late to merge this into gcc 9, but I'm posting it
> > anyway so it can be considered for gcc 10.
>
> Richard Sandiford has the call on whether this patch is OK for trunk now or
> GCC 10. With the minimal testing it has had, I'd be uncomfortable with it
> as a GCC 9 patch. That said, it is a fairly self-contained pattern for the
> compiler and it would be good to see this optimization in GCC 9.
>
> >
> > Alejandro
> >
> >
> > gcc/Changelog:
> >
> > 2019-02-04  Alejandro Martinez
> >
> > 	* config/aarch64/aarch64-sve.md (abd_3): New define_expand.
> > 	(aarch64_abd_3): Likewise.
> > 	(*aarch64_abd_3): New define_insn.
> > 	(sad): New define_expand.
> > 	* config/aarch64/iterators.md: Added MAX_OPP and max_opp attributes.
> > 	Added USMAX iterator.
> > 	* config/aarch64/predicates.md: Added aarch64_smin and aarch64_umin
> > 	predicates.
> > 	* tree-vect-loop.c (use_mask_by_cond_expr_p): Add SAD_EXPR.
> > 	(build_vect_cond_expr): Likewise.
> >
> > gcc/testsuite/Changelog:
> >
> > 2019-02-04  Alejandro Martinez
> >
> > 	* gcc.target/aarch64/sve/sad_1.c: New test for sum of absolute
> > 	differences.
[Aarch64][SVE] Vectorise sum-of-absolute-differences
Hi,

This patch adds support to vectorize sums of absolute differences (SAD_EXPR)
using SVE. It also uses the new functionality to ensure that the resulting
loop is masked. Therefore, it depends on

https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00016.html

Given this input code:

int
sum_abs (uint8_t *restrict x, uint8_t *restrict y, int n)
{
  int sum = 0;

  for (int i = 0; i < n; i++)
    {
      sum += __builtin_abs (x[i] - y[i]);
    }

  return sum;
}

The resulting SVE code is:

   0:	715f	cmp	w2, #0x0
   4:	5400026d	b.le	50
   8:	d283	mov	x3, #0x0	// #0
   c:	93407c42	sxtw	x2, w2
  10:	2538c002	mov	z2.b, #0
  14:	25221fe0	whilelo	p0.b, xzr, x2
  18:	2538c023	mov	z3.b, #1
  1c:	2518e3e1	ptrue	p1.b
  20:	a4034000	ld1b	{z0.b}, p0/z, [x0, x3]
  24:	a4034021	ld1b	{z1.b}, p0/z, [x1, x3]
  28:	0430e3e3	incb	x3
  2c:	0520c021	sel	z1.b, p0, z1.b, z0.b
  30:	25221c60	whilelo	p0.b, x3, x2
  34:	040d0420	uabd	z0.b, p1/m, z0.b, z1.b
  38:	44830402	udot	z2.s, z0.b, z3.b
  3c:	5421	b.ne	20	// b.any
  40:	2598e3e0	ptrue	p0.s
  44:	04812042	uaddv	d2, p0, z2.s
  48:	1e260040	fmov	w0, s2
  4c:	d65f03c0	ret
  50:	1e2703e2	fmov	s2, wzr
  54:	1e260040	fmov	w0, s2
  58:	d65f03c0	ret

Notice how udot is used inside a fully masked loop.

I tested this patch on an aarch64 machine, bootstrapping the compiler and
running the checks.

I admit it is too late to merge this into gcc 9, but I'm posting it anyway
so it can be considered for gcc 10.

Alejandro

gcc/ChangeLog:

2019-02-04  Alejandro Martinez

	* config/aarch64/aarch64-sve.md (abd_3): New define_expand.
	(aarch64_abd_3): Likewise.
	(*aarch64_abd_3): New define_insn.
	(sad): New define_expand.
	* config/aarch64/iterators.md: Added MAX_OPP and max_opp attributes.
	Added USMAX iterator.
	* config/aarch64/predicates.md: Added aarch64_smin and aarch64_umin
	predicates.
	* tree-vect-loop.c (use_mask_by_cond_expr_p): Add SAD_EXPR.
	(build_vect_cond_expr): Likewise.

gcc/testsuite/ChangeLog:

2019-02-04  Alejandro Martinez

	* gcc.target/aarch64/sve/sad_1.c: New test for sum of absolute
	differences.

sad_v1.patch
Description: sad_v1.patch
[Aarch64][SVE] Dot product support
Hi,

This patch does two things. For the general vectorizer, it adds support to
perform fully masked reductions over expressions that don't support masking.
This is achieved by using VEC_COND_EXPR where possible. At the moment this
is implemented for DOT_PROD_EXPR only, but the framework is there to extend
it to other expressions.

Related to that, this patch adds support to vectorize dot product using SVE.
It also uses the new functionality to ensure that the resulting loop is
masked.

Given this input code:

uint32_t
dotprod (uint8_t *restrict x, uint8_t *restrict y, int n)
{
  uint32_t sum = 0;

  for (int i = 0; i < n; i++)
    {
      sum += x[i] * y[i];
    }

  return sum;
}

The resulting SVE code is:

<dotprod>:
   0:	715f	cmp	w2, #0x0
   4:	5400024d	b.le	4c
   8:	d283	mov	x3, #0x0	// #0
   c:	93407c42	sxtw	x2, w2
  10:	2538c001	mov	z1.b, #0
  14:	25221fe0	whilelo	p0.b, xzr, x2
  18:	2538c003	mov	z3.b, #0
  1c:	d503201f	nop
  20:	a4034002	ld1b	{z2.b}, p0/z, [x0, x3]
  24:	a4034020	ld1b	{z0.b}, p0/z, [x1, x3]
  28:	0430e3e3	incb	x3
  2c:	0523c000	sel	z0.b, p0, z0.b, z3.b
  30:	25221c60	whilelo	p0.b, x3, x2
  34:	44820401	udot	z1.s, z0.b, z2.b
  38:	5441	b.ne	20	// b.any
  3c:	2598e3e0	ptrue	p0.s
  40:	04812021	uaddv	d1, p0, z1.s
  44:	1e260020	fmov	w0, s1
  48:	d65f03c0	ret
  4c:	1e2703e1	fmov	s1, wzr
  50:	1e260020	fmov	w0, s1
  54:	d65f03c0	ret

Notice how udot is used inside a fully masked loop.

I tested this patch in an aarch64 machine, bootstrapping the compiler and
running the checks.

I admit it is too late to merge this into gcc 9, but I'm posting it anyway
so it can be considered for gcc 10.

Alejandro

gcc/Changelog:

2019-01-31  Alejandro Martinez

	* config/aarch64/aarch64-sve.md (dot_prod): Taken from SVE ACLE
	branch.
	* config/aarch64/iterators.md: Copied Vetype_fourth, VSI2QI and
	vsi2qi from SVE ACLE branch.
	* tree-vect-loop.c (use_mask_by_cond_expr_p): New function to check
	whether a VEC_COND_EXPR can be inserted to emulate a conditional
	internal function.
	(build_vect_cond_expr): Emit the VEC_COND_EXPR.
	(vectorizable_reduction): Use the functions above to vectorize, in
	a fully masked loop, code that doesn't have a conditional internal
	function.

gcc/testsuite/Changelog:

2019-01-31  Alejandro Martinez

	* gcc.target/aarch64/sve/dot_1.c: New test for dot product.

dot_v1.patch
Description: dot_v1.patch
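The VEC_COND_EXPR emulation can be sketched in scalar C: since DOT_PROD_EXPR has no conditional internal function, the vectorizer selects the multiplication identity (0, held in z3 above) on inactive lanes before the unmasked udot. The lane count and names below are illustrative only:

```c
#include <stdint.h>

#define LANES 8

/* Masked dot-product step: inactive lanes of y are replaced by 0 via a
   select (the VEC_COND_EXPR the patch inserts), so the unconditional
   multiply-accumulate leaves the sum unchanged for those lanes.  */
static uint32_t
masked_dot_step (uint32_t sum, const uint8_t *x, const uint8_t *y,
                 const int *mask)
{
  for (int i = 0; i < LANES; i++)
    {
      uint8_t yi = mask[i] ? y[i] : 0;  /* sel z0.b, p0, z0.b, z3.b */
      sum += (uint32_t) x[i] * yi;      /* udot z1.s, z0.b, z2.b   */
    }
  return sum;
}
```

Any operation with a cheap identity value (0 for add-based reductions, 1 for multiply) can be masked the same way, which is the extension point the patch description mentions.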
RE: [Vectorizer] Add SLP support for masked loads
> -----Original Message-----
> From: Richard Biener
> Sent: 17 January 2019 07:53
> To: Alejandro Martinez Vicente
> Cc: GCC Patches ; nd ; Richard Sandiford
> Subject: Re: [Vectorizer] Add SLP support for masked loads
>
> On Wed, Jan 16, 2019 at 2:37 PM Alejandro Martinez Vicente
> wrote:
> >
> > Hi,
> >
> > Current vectorizer doesn't support masked loads for SLP. We should add
> > that, to allow things like:
> >
> > void
> > f (int *restrict x, int *restrict y, int *restrict z, int n)
> > {
> >   for (int i = 0; i < n; i += 2)
> >     {
> >       x[i] = y[i] ? z[i] : 1;
> >       x[i + 1] = y[i + 1] ? z[i + 1] : 2;
> >     }
> > }
> >
> > to be vectorized using contiguous loads rather than LD2 and ST2.
> >
> > This patch was motivated by SVE, but it is completely generic and
> > should apply to any architecture with masked loads.
> >
> > After the patch is applied, the above code generates this output
> > (-march=armv8.2-a+sve -O2 -ftree-vectorize):
> >
> > <f>:
> >    0:	717f	cmp	w3, #0x0
> >    4:	540002cd	b.le	5c
> >    8:	51000464	sub	w4, w3, #0x1
> >    c:	d283	mov	x3, #0x0	// #0
> >   10:	9005	adrp	x5, 0
> >   14:	25d8e3e0	ptrue	p0.d
> >   18:	53017c84	lsr	w4, w4, #1
> >   1c:	91a5	add	x5, x5, #0x0
> >   20:	11000484	add	w4, w4, #0x1
> >   24:	85c0e0a1	ld1rd	{z1.d}, p0/z, [x5]
> >   28:	2598e3e3	ptrue	p3.s
> >   2c:	d37ff884	lsl	x4, x4, #1
> >   30:	25a41fe2	whilelo	p2.s, xzr, x4
> >   34:	d503201f	nop
> >   38:	a5434820	ld1w	{z0.s}, p2/z, [x1, x3, lsl #2]
> >   3c:	25808c11	cmpne	p1.s, p3/z, z0.s, #0
> >   40:	25808810	cmpne	p0.s, p2/z, z0.s, #0
> >   44:	a5434040	ld1w	{z0.s}, p0/z, [x2, x3, lsl #2]
> >   48:	05a1c400	sel	z0.s, p1, z0.s, z1.s
> >   4c:	e5434800	st1w	{z0.s}, p2, [x0, x3, lsl #2]
> >   50:	04b0e3e3	incw	x3
> >   54:	25a41c62	whilelo	p2.s, x3, x4
> >   58:	5401	b.ne	38	// b.any
> >   5c:	d65f03c0	ret
> >
> > I tested this patch in an aarch64 machine, bootstrapping the compiler
> > and running the checks.
>
> Thanks for implementing this - note this is stage1 material and I will have
> a look when time allows unless Richard beats me to it.
I agree, this is for GCC 10. I'll ping you guys when we're at stage1.

> It might be interesting to note that "non-SLP" code paths are likely to go
> away in GCC 10 to streamline the vectorizer and make further changes easier
> (so you'll see group_size == 1 SLP instances).

Cool, thanks for the heads up.

Alejandro

> There are quite a few other cases missing SLP handling.
>
> Richard.
>
> > Alejandro
> >
> > gcc/Changelog:
> >
> > 2019-01-16  Alejandro Martinez
> >
> > 	* config/aarch64/aarch64-sve.md (copysign3): New define_expand.
> > 	(xorsign3): Likewise.
> > 	internal-fn.c: Marked mask_load_direct and mask_store_direct as
> > 	vectorizable.
> > 	tree-data-ref.c (data_ref_compare_tree): Fixed comment typo.
> > 	tree-vect-data-refs.c (can_group_stmts_p): Allow masked loads to
> > 	be combined even if masks different.
> > 	(slp_vect_only_p): New function to detect masked loads that are
> > 	only vectorizable using SLP.
> > 	(vect_analyze_data_ref_accesses): Mark SLP only vectorizable
> > 	groups.
> > 	tree-vect-loop.c (vect_dissolve_slp_only_groups): New function to
> > 	dissolve SLP-only vectorizable groups when SLP has been discarded.
> > 	(vect_analyze_loop_2): Call vect_dissolve_slp_only_groups when
> > 	needed.
> > 	tree-vect-slp.c (vect_get_and_check_slp_defs): Check masked loads
> > 	masks.
> > 	(vect_build_slp_tree_1): Fixed comment typo.
> > 	(vect_build_slp_tree_2): Include masks from masked loads in SLP
> > 	tree.
> > 	tree-vect-stmts.c (vect_get_vec_defs_for_operand): New function to
> > 	get vec_defs for operand with optional SLP and vectype.
> > 	(vectorizable_load): Allow vectorization of masked loads for SLP
> > 	only.
> > 	tree-vectorizer.h (_stmt_vec_info): Added flag for SLP-only
> > 	vectorizable.
> > 	tree-vectorizer.c (vec_info::new_stmt_vec_info): Likewise.
> >
> > gcc/testsuite/Changelog:
> >
> > 2019-01-16  Alejandro Martinez
> >
> > 	* gcc.target/aarch64/sve/mask_load_slp_1.c: New test for SLP
> > 	vectorized masked loads.
[Vectorizer] Add SLP support for masked loads
Hi,

Current vectorizer doesn't support masked loads for SLP. We should add that,
to allow things like:

void
f (int *restrict x, int *restrict y, int *restrict z, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      x[i] = y[i] ? z[i] : 1;
      x[i + 1] = y[i + 1] ? z[i + 1] : 2;
    }
}

to be vectorized using contiguous loads rather than LD2 and ST2.

This patch was motivated by SVE, but it is completely generic and should
apply to any architecture with masked loads.

After the patch is applied, the above code generates this output
(-march=armv8.2-a+sve -O2 -ftree-vectorize):

<f>:
   0:	717f	cmp	w3, #0x0
   4:	540002cd	b.le	5c
   8:	51000464	sub	w4, w3, #0x1
   c:	d283	mov	x3, #0x0	// #0
  10:	9005	adrp	x5, 0
  14:	25d8e3e0	ptrue	p0.d
  18:	53017c84	lsr	w4, w4, #1
  1c:	91a5	add	x5, x5, #0x0
  20:	11000484	add	w4, w4, #0x1
  24:	85c0e0a1	ld1rd	{z1.d}, p0/z, [x5]
  28:	2598e3e3	ptrue	p3.s
  2c:	d37ff884	lsl	x4, x4, #1
  30:	25a41fe2	whilelo	p2.s, xzr, x4
  34:	d503201f	nop
  38:	a5434820	ld1w	{z0.s}, p2/z, [x1, x3, lsl #2]
  3c:	25808c11	cmpne	p1.s, p3/z, z0.s, #0
  40:	25808810	cmpne	p0.s, p2/z, z0.s, #0
  44:	a5434040	ld1w	{z0.s}, p0/z, [x2, x3, lsl #2]
  48:	05a1c400	sel	z0.s, p1, z0.s, z1.s
  4c:	e5434800	st1w	{z0.s}, p2, [x0, x3, lsl #2]
  50:	04b0e3e3	incw	x3
  54:	25a41c62	whilelo	p2.s, x3, x4
  58:	5401	b.ne	38	// b.any
  5c:	d65f03c0	ret

I tested this patch in an aarch64 machine, bootstrapping the compiler and
running the checks.

Alejandro

gcc/Changelog:

2019-01-16  Alejandro Martinez

	* config/aarch64/aarch64-sve.md (copysign3): New define_expand.
	(xorsign3): Likewise.
	internal-fn.c: Marked mask_load_direct and mask_store_direct as
	vectorizable.
	tree-data-ref.c (data_ref_compare_tree): Fixed comment typo.
	tree-vect-data-refs.c (can_group_stmts_p): Allow masked loads to be
	combined even if masks different.
	(slp_vect_only_p): New function to detect masked loads that are only
	vectorizable using SLP.
	(vect_analyze_data_ref_accesses): Mark SLP only vectorizable groups.
	tree-vect-loop.c (vect_dissolve_slp_only_groups): New function to
	dissolve SLP-only vectorizable groups when SLP has been discarded.
	(vect_analyze_loop_2): Call vect_dissolve_slp_only_groups when
	needed.
	tree-vect-slp.c (vect_get_and_check_slp_defs): Check masked loads
	masks.
	(vect_build_slp_tree_1): Fixed comment typo.
	(vect_build_slp_tree_2): Include masks from masked loads in SLP
	tree.
	tree-vect-stmts.c (vect_get_vec_defs_for_operand): New function to
	get vec_defs for operand with optional SLP and vectype.
	(vectorizable_load): Allow vectorization of masked loads for SLP
	only.
	tree-vectorizer.h (_stmt_vec_info): Added flag for SLP-only
	vectorizable.
	tree-vectorizer.c (vec_info::new_stmt_vec_info): Likewise.

gcc/testsuite/Changelog:

2019-01-16  Alejandro Martinez

	* gcc.target/aarch64/sve/mask_load_slp_1.c: New test for SLP
	vectorized masked loads.

mask_load_slp_1.patch
Description: mask_load_slp_1.patch
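Conceptually, SLP turns the two conditional statements per iteration into one contiguous masked load of z[] blended with the repeating default vector {1, 2, 1, 2, ...}, followed by one contiguous store. A scalar C model of what one vector iteration computes (lane count and names are illustrative):

```c
#define LANES 8

/* SLP view of the loop body above: one contiguous masked load of z[]
   (masked by y[] != 0) blended with the repeating defaults {1, 2},
   then one contiguous store -- no LD2/ST2 lane interleaving needed.  */
static void
masked_slp_step (int *x, const int *y, const int *z)
{
  static const int defaults[2] = { 1, 2 };  /* the ld1rd'd constant */

  for (int i = 0; i < LANES; i++)
    x[i] = y[i] ? z[i] : defaults[i & 1];   /* cmpne + sel + st1w */
}
```

This is the key to the patch: the two statements have different masks (y[i] vs y[i + 1]), so the group is only vectorizable when SLP keeps both masks in the tree, which is what slp_vect_only_p and the dissolve logic track.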
RE: [Aarch64][SVE] Add copysign and xorsign support
Hi,

I updated the patch to address Wilco's comment and style issues.

Alejandro

> -----Original Message-----
> From: Wilco Dijkstra
> Sent: 08 January 2019 16:58
> To: GCC Patches ; Alejandro Martinez Vicente
> Cc: nd ; Richard Sandiford
> Subject: Re: [Aarch64][SVE] Add copysign and xorsign support
>
> Hi Alejandro,
>
> +    emit_move_insn (mask,
> +		     aarch64_simd_gen_const_vector_dup (mode,
> +							HOST_WIDE_INT_M1U
> +							<< bits));
> +
> +    emit_insn (gen_and3 (sign, arg2, mask));
>
> Is there a reason to emit separate moves and then requiring the optimizer
> to combine them? The result of aarch64_simd_gen_const_vector_dup can be
> used directly in the gen_and for all supported floating point types.
>
> Cheers,
> Wilco

copysign_2.patch
Description: copysign_2.patch
[Aarch64][SVE] Add copysign and xorsign support
Hi,

This patch adds support for copysign and xorsign builtins to SVE. With the
new expands, they can be vectorized using bitwise logical operations.

I tested this patch in an aarch64 machine, bootstrapping the compiler and
running the checks.

Alejandro

gcc/Changelog:

2019-01-08  Alejandro Martinez

	* config/aarch64/aarch64-sve.md (copysign3): New define_expand.
	(xorsign3): Likewise.

gcc/testsuite/Changelog:

2019-01-08  Alejandro Martinez

	* gcc.target/aarch64/sve/copysign_1.c: New test for SVE vectorized
	copysign.
	* gcc.target/aarch64/sve/copysign_1_run.c: Likewise.
	* gcc.target/aarch64/sve/xorsign_1.c: New test for SVE vectorized
	xorsign.
	* gcc.target/aarch64/sve/xorsign_1_run.c: Likewise.

copysign.patch
Description: copysign.patch
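For reference, this is the bitwise formulation the expands rely on, written as a scalar sketch (the patch itself emits the corresponding vector AND/ORR/EOR on the floating-point bit patterns; the helper names here are illustrative):

```c
#include <stdint.h>
#include <string.h>

/* copysign(x, y): everything but the sign bit from x, the sign from y.  */
static double
copysign_bits (double x, double y)
{
  const uint64_t sign = UINT64_C (1) << 63;
  uint64_t xb, yb;

  memcpy (&xb, &x, sizeof xb);
  memcpy (&yb, &y, sizeof yb);
  xb = (xb & ~sign) | (yb & sign);  /* BIC/AND + ORR in vector form */
  memcpy (&x, &xb, sizeof x);
  return x;
}

/* xorsign(x, y) == x * copysign(1.0, y): XOR in y's sign bit.  */
static double
xorsign_bits (double x, double y)
{
  const uint64_t sign = UINT64_C (1) << 63;
  uint64_t xb, yb;

  memcpy (&xb, &x, sizeof xb);
  memcpy (&yb, &y, sizeof yb);
  xb ^= yb & sign;                  /* AND + EOR in vector form */
  memcpy (&x, &xb, sizeof x);
  return x;
}
```

Because only integer bit operations are involved, the whole loop vectorizes without any floating-point compares or selects.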
RE: [Patch, Vectorizer, SVE] fmin/fmax builtin reduction support
Richard,

I'm happy to change the name of the helper to code_helper_for_stmt; the new
patch and changelog are included. Regarding the reductions being fold_left,
the FMINNM/FMINNMV instructions are defined in such a way that this is not
necessary (it wouldn't work with FMIN/FMINV).

Alejandro

gcc/Changelog:

2018-12-18  Alejandro Martinez

	* gimple-match.h (code_helper_for_stmt): New function to get a
	code_helper from a statement.
	* internal-fn.def: New reduc_fmax_scal and reduc_fmin_scal optabs
	for ieee fp max/min reductions.
	* optabs.def: Likewise.
	* tree-vect-loop.c (reduction_fn_for_scalar_code): Changed function
	signature to accept code_helper instead of tree_code.  Handle the
	fmax/fmin builtins.
	(needs_fold_left_reduction_p): Likewise.
	(check_reduction_path): Likewise.
	(vect_is_simple_reduction): Use code_helper instead of tree_code.
	Check for supported call-based reductions.  Extend support for both
	assignment-based and call-based reductions.
	(vect_model_reduction_cost): Extend cost-model support to call-based
	reductions (just use MAX expression).
	(get_initial_def_for_reduction): Use code_helper instead of
	tree_code.  Extend support for both assignment-based and call-based
	reductions.
	(vect_create_epilog_for_reduction): Likewise.
	(vectorizable_reduction): Likewise.
	* tree-vectorizer.h: Include gimple-match.h for code_helper.  Use
	code_helper in check_reduction_path signature.
	* config/aarch64/aarch64-sve.md: Added define_expand to capture new
	reduc_fmax_scal and reduc_fmin_scal optabs.
	* config/aarch64/iterators.md: New FMAXMINNMV and fmaxmin_uns
	iterators to support the new define_expand.

gcc/testsuite/Changelog:

2018-12-18  Alejandro Martinez

	* gcc.target/aarch64/sve/reduc_9.c: New test to check SVE-vectorized
	reductions without -ffast-math.
	* gcc.target/aarch64/sve/reduc_10.c: New test to check
	SVE-vectorized builtin reductions without -ffast-math.
-----Original Message-----
From: Richard Biener
Sent: 19 December 2018 12:35
To: Alejandro Martinez Vicente
Cc: GCC Patches ; Richard Sandiford ; nd
Subject: Re: [Patch, Vectorizer, SVE] fmin/fmax builtin reduction support

On Wed, Dec 19, 2018 at 10:33 AM Alejandro Martinez Vicente wrote:
>
> Hi all,
>
> Loops that use the fmin/fmax builtins can be vectorized even without
> -ffast-math using SVE's FMINNM/FMAXNM instructions. This is an example:
>
> double
> f (double *x, int n)
> {
>   double res = 100.0;
>   for (int i = 0; i < n; ++i)
>     res = __builtin_fmin (res, x[i]);
>   return res;
> }
>
> Before this patch, the compiler would generate this code
> (-march=armv8.2-a+sve -O2 -ftree-vectorize):
>
> <f>:
>    0:	713f	cmp	w1, #0x0
>    4:	5400018d	b.le	34
>    8:	51000422	sub	w2, w1, #0x1
>    c:	91002003	add	x3, x0, #0x8
>   10:	d2e80b21	mov	x1, #0x4059
>   14:	9e670020	fmov	d0, x1
>   18:	8b224c62	add	x2, x3, w2, uxtw #3
>   1c:	d503201f	nop
>   20:	fc408401	ldr	d1, [x0],#8
>   24:	1e617800	fminnm	d0, d0, d1
>   28:	eb02001f	cmp	x0, x2
>   2c:	54a1	b.ne	20
>   30:	d65f03c0	ret
>   34:	d2e80b20	mov	x0, #0x4059
>   38:	9e67	fmov	d0, x0
>   3c:	d65f03c0	ret
>
> After this patch, this is the code that gets generated:
>
> <f>:
>    0:	713f	cmp	w1, #0x0
>    4:	5400020d	b.le	44
>    8:	d282	mov	x2, #0x0
>    c:	25d8e3e0	ptrue	p0.d
>   10:	93407c21	sxtw	x1, w1
>   14:	9003	adrp	x3, 0
>   18:	25804001	mov	p1.b, p0.b
>   1c:	9163	add	x3, x3, #0x0
>   20:	85c0e060	ld1rd	{z0.d}, p0/z, [x3]
>   24:	25e11fe0	whilelo	p0.d, xzr, x1
>   28:	a5e24001	ld1d	{z1.d}, p0/z, [x0, x2, lsl #3]
>   2c:	04f0e3e2	incd	x2
>   30:	65c58020	fminnm	z0.d, p0/m, z0.d, z1.d
>   34:	25e11c40	whilelo	p0.d, x2, x1
>   38:	5481	b.ne	28	// b.any
>   3c:	65c52400	fminnmv	d0, p1, z0.d
>   40:	d65f03c0	ret
>   44:	d2e80b20	mov	x0, #0x4059
>   48:	9e67	fmov	d0, x0
>   4c:	d65f03c0	ret
>
> This patch extends the support for reductions to include calls to
> internal functions, in addition to assign statements. For this
> purpose, in most places where a tree_code would be used, a code_helper
> is used instead. The code_helper can hold either a tree_code or a
> combined_fn.
[Patch, Vectorizer, SVE] fmin/fmax builtin reduction support
Hi all,

Loops that use the fmin/fmax builtins can be vectorized even without
-ffast-math using SVE's FMINNM/FMAXNM instructions. This is an example:

double
f (double *x, int n)
{
  double res = 100.0;
  for (int i = 0; i < n; ++i)
    res = __builtin_fmin (res, x[i]);
  return res;
}

Before this patch, the compiler would generate this code
(-march=armv8.2-a+sve -O2 -ftree-vectorize):

<f>:
   0:	713f	cmp	w1, #0x0
   4:	5400018d	b.le	34
   8:	51000422	sub	w2, w1, #0x1
   c:	91002003	add	x3, x0, #0x8
  10:	d2e80b21	mov	x1, #0x4059
  14:	9e670020	fmov	d0, x1
  18:	8b224c62	add	x2, x3, w2, uxtw #3
  1c:	d503201f	nop
  20:	fc408401	ldr	d1, [x0],#8
  24:	1e617800	fminnm	d0, d0, d1
  28:	eb02001f	cmp	x0, x2
  2c:	54a1	b.ne	20
  30:	d65f03c0	ret
  34:	d2e80b20	mov	x0, #0x4059
  38:	9e67	fmov	d0, x0
  3c:	d65f03c0	ret

After this patch, this is the code that gets generated:

<f>:
   0:	713f	cmp	w1, #0x0
   4:	5400020d	b.le	44
   8:	d282	mov	x2, #0x0
   c:	25d8e3e0	ptrue	p0.d
  10:	93407c21	sxtw	x1, w1
  14:	9003	adrp	x3, 0
  18:	25804001	mov	p1.b, p0.b
  1c:	9163	add	x3, x3, #0x0
  20:	85c0e060	ld1rd	{z0.d}, p0/z, [x3]
  24:	25e11fe0	whilelo	p0.d, xzr, x1
  28:	a5e24001	ld1d	{z1.d}, p0/z, [x0, x2, lsl #3]
  2c:	04f0e3e2	incd	x2
  30:	65c58020	fminnm	z0.d, p0/m, z0.d, z1.d
  34:	25e11c40	whilelo	p0.d, x2, x1
  38:	5481	b.ne	28	// b.any
  3c:	65c52400	fminnmv	d0, p1, z0.d
  40:	d65f03c0	ret
  44:	d2e80b20	mov	x0, #0x4059
  48:	9e67	fmov	d0, x0
  4c:	d65f03c0	ret

This patch extends the support for reductions to include calls to internal
functions, in addition to assign statements. For this purpose, in most
places where a tree_code would be used, a code_helper is used instead. The
code_helper can hold either a tree_code or a combined_fn.

This patch implements these tasks:

- Detect a reduction candidate based on a call to an internal function
  (currently only fmin or fmax).

- Process the reduction using code_helper. This means that at several
  places we have to check whether this is an assign-based reduction or a
  call-based reduction.

- Add new internal functions for the fmin/fmax reductions and for
  conditional fmin/fmax. In architectures where ieee fmin/fmax reductions
  are available, it is still possible to vectorize the loop using
  unconditional instructions.

- Update SVE's md to support these new reductions.

- Add new SVE tests to check that the optimal code is being generated.

I tested this patch in an aarch64 machine, bootstrapping the compiler and
running the checks.

Alejandro

gcc/Changelog:

2018-12-18  Alejandro Martinez

	* gimple-match.h (code_helper_for_stmnt): New function to get a
	code_helper from a statement.
	* internal-fn.def: New reduc_fmax_scal and reduc_fmin_scal optabs
	for ieee fp max/min reductions.
	* optabs.def: Likewise.
	* tree-vect-loop.c (reduction_fn_for_scalar_code): Changed function
	signature to accept code_helper instead of tree_code.  Handle the
	fmax/fmin builtins.
	(needs_fold_left_reduction_p): Likewise.
	(check_reduction_path): Likewise.
	(vect_is_simple_reduction): Use code_helper instead of tree_code.
	Check for supported call-based reductions.  Extend support for both
	assignment-based and call-based reductions.
	(vect_model_reduction_cost): Extend cost-model support to call-based
	reductions (just use MAX expression).
	(get_initial_def_for_reduction): Use code_helper instead of
	tree_code.  Extend support for both assignment-based and call-based
	reductions.
	(vect_create_epilog_for_reduction): Likewise.
	(vectorizable_reduction): Likewise.
	* tree-vectorizer.h: Include gimple-match.h for code_helper.  Use
	code_helper in check_reduction_path signature.
	* config/aarch64/aarch64-sve.md: Added define_expand to capture new
	reduc_fmax_scal and reduc_fmin_scal optabs.
	* config/aarch64/iterators.md: New FMAXMINNMV and fmaxmin_uns
	iterators to support the new define_expand.

gcc/testsuite/Changelog:

2018-12-18  Alejandro Martinez

	* gcc.target/aarch64/sve/reduc_9.c: New test to check SVE-vectorized
	reductions without -ffast-math.
	* gcc.target/aarch64/sve/reduc_10.c: New test to check
	SVE-vectorized builtin reductions without -ffast-math.

final.patch
Description: final.patch
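The reason this works without -ffast-math: unlike a plain `<`-based minimum, fmin is associative and commutative even with NaN inputs (when exactly one operand is a quiet NaN, it returns the other operand), so the reduction can be reordered across vector lanes and finished with FMINNMV. A self-contained scalar model of that semantic (my own helper, not the patch's code, since __builtin_fmin behavior is what it imitates):

```c
#include <math.h>

/* Model of fmin's NaN handling: a quiet NaN loses to any number, so
   fmin(NaN, x) == fmin(x, NaN) == x.  A '<'-based min is NOT reorderable,
   because (NaN < x) is false and the result depends on operand order.  */
static double
model_fmin (double a, double b)
{
  if (a != a)           /* a is NaN: return the other operand.  */
    return b;
  if (b != b)           /* b is NaN: return the other operand.  */
    return a;
  return a < b ? a : b;
}

/* Fold two different association orders; for fmin they must agree.  */
static double
fmin3_left (double a, double b, double c)
{
  return model_fmin (model_fmin (a, b), c);
}

static double
fmin3_right (double a, double b, double c)
{
  return model_fmin (a, model_fmin (b, c));
}
```

This is the same property the reply above relies on: FMINNM/FMINNMV implement exactly these NaN-ignoring semantics in hardware, so no fold_left (strictly ordered) reduction is required.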