[Vectorizer] Support masking fold left reductions

2019-06-12 Thread Alejandro Martinez Vicente
Hi,

This patch adds support in the vectorizer for masking fold left reductions.
This avoids the need to insert a conditional assignment with some identity
value.

For example, this C code:

double
f (double *restrict x, int n)
{
  double res = 0.0;
  for (int i = 0; i < n; i++)
{
  res += x[i];
}
  return res;
}

Produced this for SVE:

0000000000000000 <f>:
   0:   2f00e400        movi    d0, #0x0
   4:   713f            cmp     w1, #0x0
   8:   5400018d        b.le    38
   c:   d282            mov     x2, #0x0                // #0
  10:   93407c21        sxtw    x1, w1
  14:   25f8c002        mov     z2.d, #0
  18:   25e11fe0        whilelo p0.d, xzr, x1
  1c:   25d8e3e1        ptrue   p1.d
  20:   a5e24001        ld1d    {z1.d}, p0/z, [x0, x2, lsl #3]
  24:   04f0e3e2        incd    x2
  28:   05e2c021        sel     z1.d, p0, z1.d, z2.d
  2c:   25e11c40        whilelo p0.d, x2, x1
  30:   65d82420        fadda   d0, p1, d0, z1.d
  34:   5461            b.ne    20              // b.any
  38:   d65f03c0        ret

And now I get this:

0000000000000000 <f>:
   0:   2f00e400        movi    d0, #0x0
   4:   713f            cmp     w1, #0x0
   8:   5400012d        b.le    2c
   c:   d282            mov     x2, #0x0                // #0
  10:   93407c21        sxtw    x1, w1
  14:   25e11fe0        whilelo p0.d, xzr, x1
  18:   a5e24001        ld1d    {z1.d}, p0/z, [x0, x2, lsl #3]
  1c:   04f0e3e2        incd    x2
  20:   65d82020        fadda   d0, p0, d0, z1.d
  24:   25e11c40        whilelo p0.d, x2, x1
  28:   5481            b.ne    18              // b.any
  2c:   d65f03c0        ret
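
To make the difference concrete, here is a rough scalar sketch of the two
strategies (illustration only, not taken from the patch; vl stands for the
number of doubles per SVE vector and the inner loops model one vector
iteration each):

double
fold_left_old (double *restrict x, int n, int vl)
{
  double res = 0.0;
  double tmp[256];   /* assumes vl <= 256; purely illustrative */
  for (int i = 0; i < n; i += vl)
    {
      /* Old strategy: blend inactive lanes with the identity value 0.0
         (the "sel z1.d, p0, z1.d, z2.d" above)...  */
      for (int j = 0; j < vl; j++)
        tmp[j] = (i + j < n) ? x[i + j] : 0.0;
      /* ...so that the fold-left add can run under an all-true predicate
         (the "fadda d0, p1, d0, z1.d" above).  */
      for (int j = 0; j < vl; j++)
        res += tmp[j];
    }
  return res;
}

double
fold_left_new (double *restrict x, int n, int vl)
{
  double res = 0.0;
  /* New strategy: the fold-left add is predicated on the loop mask itself
     ("fadda d0, p0, d0, z1.d"), so inactive lanes are simply skipped and
     no identity value or select is needed.  */
  for (int i = 0; i < n; i += vl)
    for (int j = 0; j < vl && i + j < n; j++)
      res += x[i + j];
  return res;
}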

I've added a new test and run the regression testing. Ok for trunk?

Alejandro

2019-06-12  Alejandro Martinez  

gcc/
* config/aarch64/aarch64-sve.md (mask_fold_left_plus_): Renamed
from "*fold_left_plus_", updated operands order.
* doc/md.texi (mask_fold_left_plus_@var{m}): Documented new optab.
* internal-fn.c (mask_fold_left_direct): New define.
(expand_mask_fold_left_optab_fn): Likewise.
(direct_mask_fold_left_optab_supported_p): Likewise.
* internal-fn.def (MASK_FOLD_LEFT_PLUS): New internal function.
* optabs.def (mask_fold_left_plus_optab): New optab.
* tree-vect-loop.c (mask_fold_left_plus_optab): New function to get a
masked internal_fn for a reduction ifn.
(vectorize_fold_left_reduction): Add support for masking reductions.

gcc/testsuite/
* gcc.target/aarch64/sve/fadda_1.c: New test.


mask_fold_left_v3.patch
Description: mask_fold_left_v3.patch


[PATCH] PR tree-optimization/90681 Fix ICE in vect_slp_analyze_node_operations_1

2019-05-31 Thread Alejandro Martinez Vicente
Hi,

This patch fixes bug 90681.  It was caused by trying to SLP-vectorize a
non-grouped load.  We've fixed it by tweaking the implementation a bit: mark
masked loads as not vectorizable, but support them as a special case, and then
detect them in the test for normal non-grouped loads that was already there.

The bug reproducer now works and the performance test we added still passes.

Alejandro

gcc/ChangeLog:

2019-05-31  Alejandro Martinez  

PR tree-optimization/90681
* internal-fn.c (mask_load_direct): Mark as non-vectorizable again.
* tree-vect-slp.c (vect_build_slp_tree_1): Add masked loads as a
special case for SLP, but fail on non-grouped loads.


2019-05-31  Alejandro Martinez  

gcc/testsuite/

PR tree-optimization/90681
* gfortran.dg/vect/pr90681.f: Bug reproducer.


fix.patch
Description: fix.patch


RE: Implement vector average patterns for SVE2

2019-05-29 Thread Alejandro Martinez Vicente
Turns out I was missing a few bits and pieces. Here is the updated patch and 
changelog.

Alejandro


2019-05-29  Alejandro Martinez  

* config/aarch64/aarch64-c.c: Added TARGET_SVE2.
* config/aarch64/aarch64-sve2.md: New file.
(avg3_floor): New pattern.
(avg3_ceil): Likewise.
(*h): Likewise.
* config/aarch64/aarch64.h: Added AARCH64_ISA_SVE2 and TARGET_SVE2.
* config/aarch64/aarch64.md: Include aarch64-sve2.md.


2019-05-29  Alejandro Martinez  

gcc/testsuite/
* gcc.target/aarch64/sve2/aarch64-sve2.exp: New file, regression driver
for AArch64 SVE2.
* gcc.target/aarch64/sve2/average_1.c: New test.
* lib/target-supports.exp (check_effective_target_aarch64_sve2): New
helper.
(check_effective_target_aarch64_sve1_only): Likewise.
(check_effective_target_aarch64_sve2_hw): Likewise.
(check_effective_target_vect_avg_qi): Check for SVE1 only.

> -Original Message-
> From: Richard Sandiford 
> Sent: 29 May 2019 10:54
> To: Alejandro Martinez Vicente 
> Cc: GCC Patches ; nd 
> Subject: Re: Implement vector average patterns for SVE2
> 
> Alejandro Martinez Vicente  writes:
> > Hi,
> >
> > This patch implements the [u]avgM3_floor and [u]avgM3_ceil optabs for
> SVE2.
> >
> > Alejandro
> >
> > gcc/ChangeLog:
> >
> > 2019-05-28  Alejandro Martinez  
> >
> > * config/aarch64/aarch64-sve2.md: New file.
> > (avg3_floor): New pattern.
> > (avg3_ceil): Likewise.
> > (*h): Likewise.
> > * config/aarch64/aarch64.md: Include aarch64-sve2.md.
> >
> >
> > 2019-05-28  Alejandro Martinez  
> >
> > gcc/testsuite/
> > * gcc.target/aarch64/sve2/average_1.c: New test.
> > * lib/target-supports.exp
> (check_effective_target_aarch64_sve1_only):
> > New helper.
> > (check_effective_target_vect_avg_qi): Check for SVE1 only.
> 
> OK, thanks, but...
> 
> > diff --git gcc/testsuite/lib/target-supports.exp
> > gcc/testsuite/lib/target-supports.exp
> > index f69106d..41431e6 100644
> > --- gcc/testsuite/lib/target-supports.exp
> > +++ gcc/testsuite/lib/target-supports.exp
> > @@ -3308,6 +3308,12 @@ proc check_effective_target_aarch64_sve2 { } {
> >  }]
> >  }
> >
> > +# Return 1 if this is an AArch64 target only supporting SVE (not SVE2).
> > +proc check_effective_target_aarch64_sve1_only { } {
> > +return [expr { [check_effective_target_aarch64_sve]
> > +  && ![check_effective_target_aarch64_sve2] }] }
> 
> ...it needs check_effective_target_aarch64_sve2 to go in first.
> 
> Richard


vavg_sve2_v2.patch
Description: vavg_sve2_v2.patch


Implement vector average patterns for SVE2

2019-05-28 Thread Alejandro Martinez Vicente
Hi,

This patch implements the [u]avgM3_floor and [u]avgM3_ceil optabs for SVE2.
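
For reference (not part of the original message), these optabs match
halving-add idioms like the ones below, which SVE2 can implement with single
instructions (halving adds such as UHADD/SHADD and URHADD/SRHADD); the
function names are illustrative only:

#include <stdint.h>

void
avg_floor (uint8_t *restrict out, uint8_t *restrict a,
           uint8_t *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    /* Truncating average: the avg..._floor form.  */
    out[i] = (a[i] + b[i]) >> 1;
}

void
avg_ceil (uint8_t *restrict out, uint8_t *restrict a,
          uint8_t *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    /* Rounding average: the avg..._ceil form.  */
    out[i] = (a[i] + b[i] + 1) >> 1;
}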

Alejandro

gcc/ChangeLog:

2019-05-28  Alejandro Martinez  

* config/aarch64/aarch64-sve2.md: New file.
(avg3_floor): New pattern.
(avg3_ceil): Likewise.
(*h): Likewise.
* config/aarch64/aarch64.md: Include aarch64-sve2.md.


2019-05-28  Alejandro Martinez  

gcc/testsuite/
* gcc.target/aarch64/sve2/average_1.c: New test.
* lib/target-supports.exp (check_effective_target_aarch64_sve1_only):
New helper.
(check_effective_target_vect_avg_qi): Check for SVE1 only.


vavg_sve2.patch
Description: vavg_sve2.patch


RE: [Vectorizer] Add SLP support for masked loads

2019-05-09 Thread Alejandro Martinez Vicente
Hi Richards,

This is the new version of the patch, addressing your comments.

Alejandro

> -Original Message-
> From: Richard Sandiford 
> Sent: 08 May 2019 14:36
> To: Richard Biener 
> Cc: Alejandro Martinez Vicente ; GCC
> Patches ; nd 
> Subject: Re: [Vectorizer] Add SLP support for masked loads
> 
> Richard Biener  writes:
> > On Fri, Apr 26, 2019 at 3:14 PM Richard Sandiford
> >  wrote:
> >>
> >> Alejandro Martinez Vicente 
> writes:
> >> > Hi,
> >> >
> >> > Current vectorizer doesn't support masked loads for SLP. We should
> >> > add that, to allow things like:
> >> >
> >> > void
> >> > f (int *restrict x, int *restrict y, int *restrict z, int n) {
> >> >   for (int i = 0; i < n; i += 2)
> >> > {
> >> >   x[i] = y[i] ? z[i] : 1;
> >> >   x[i + 1] = y[i + 1] ? z[i + 1] : 2;
> >> > }
> >> > }
> >> >
> >> > to be vectorized using contiguous loads rather than LD2 and ST2.
> >> >
> >> > This patch was motivated by SVE, but it is completely generic and
> >> > should apply to any architecture with masked loads.
> >> >
> >> > After the patch is applied, the above code generates this output
> >> > (-march=armv8.2-a+sve -O2 -ftree-vectorize):
> >> >
> >> >  :
> >> >0: 717fcmp w3, #0x0
> >> >4: 540002cdb.le5c 
> >> >8: 51000464sub w4, w3, #0x1
> >> >c: d283mov x3, #0x0// #0
> >> >   10: 9005adrpx5, 0 
> >> >   14: 25d8e3e0ptrue   p0.d
> >> >   18: 53017c84lsr w4, w4, #1
> >> >   1c: 91a5add x5, x5, #0x0
> >> >   20: 11000484add w4, w4, #0x1
> >> >   24: 85c0e0a1ld1rd   {z1.d}, p0/z, [x5]
> >> >   28: 2598e3e3ptrue   p3.s
> >> >   2c: d37ff884lsl x4, x4, #1
> >> >   30: 25a41fe2whilelo p2.s, xzr, x4
> >> >   34: d503201fnop
> >> >   38: a5434820ld1w{z0.s}, p2/z, [x1, x3, lsl #2]
> >> >   3c: 25808c11cmpne   p1.s, p3/z, z0.s, #0
> >> >   40: 25808810cmpne   p0.s, p2/z, z0.s, #0
> >> >   44: a5434040ld1w{z0.s}, p0/z, [x2, x3, lsl #2]
> >> >   48: 05a1c400sel z0.s, p1, z0.s, z1.s
> >> >   4c: e5434800st1w{z0.s}, p2, [x0, x3, lsl #2]
> >> >   50: 04b0e3e3incwx3
> >> >   54: 25a41c62whilelo p2.s, x3, x4
> >> >   58: 5401b.ne38   // b.any
> >> >   5c: d65f03c0ret
> >> >
> >> >
> >> > I tested this patch in an aarch64 machine bootstrapping the
> >> > compiler and running the checks.
> >> >
> >> > Alejandro
> >> >
> >> > gcc/Changelog:
> >> >
> >> > 2019-01-16  Alejandro Martinez  
> >> >
> >> >   * config/aarch64/aarch64-sve.md (copysign3): New
> define_expand.
> >> >   (xorsign3): Likewise.
> >> >   internal-fn.c: Marked mask_load_direct and mask_store_direct as
> >> >   vectorizable.
> >> >   tree-data-ref.c (data_ref_compare_tree): Fixed comment typo.
> >> >   tree-vect-data-refs.c (can_group_stmts_p): Allow masked loads to be
> >> >   combined even if masks different.
> >> >   (slp_vect_only_p): New function to detect masked loads that are 
> >> > only
> >> >   vectorizable using SLP.
> >> >   (vect_analyze_data_ref_accesses): Mark SLP only vectorizable 
> >> > groups.
> >> >   tree-vect-loop.c (vect_dissolve_slp_only_groups): New function to
> >> >   dissolve SLP-only vectorizable groups when SLP has been discarded.
> >> >   (vect_analyze_loop_2): Call vect_dissolve_slp_only_groups when
> needed.
> >> >   tree-vect-slp.c (vect_get_and_check_slp_defs): Check masked loads
> >> >   masks.
> >> >   (vect_build_slp_tree_1): Fixed comment typo.
> >> >   (vect_build_slp_tree_2): Include masks from masked loads in SLP
> tree.
> >> >   tree-vect-stmts.c (vect_get_vec_defs_for_operand): New function to
> get
> >> >   vec_defs for operand with optional SLP and vectype.
> >> >   

RE: [Aarch64][SVE] Vectorise sum-of-absolute-differences

2019-05-07 Thread Alejandro Martinez Vicente
Great, committed in rev. 270975

Alejandro

> -Original Message-
> From: Richard Sandiford 
> Sent: 07 May 2019 17:18
> To: Alejandro Martinez Vicente 
> Cc: James Greenhalgh ; GCC Patches  patc...@gcc.gnu.org>; nd ; Richard Biener
> 
> Subject: Re: [Aarch64][SVE] Vectorise sum-of-absolute-differences
> 
> Alejandro Martinez Vicente  writes:
> > Thanks for your comments Richard. I think this patch addresses them.
> 
> Yeah, this is OK to install, thanks.
> 
> Richard
> 
> >
> > Alejandro
> >
> >> -Original Message-
> >> From: Richard Sandiford 
> >> Sent: 07 May 2019 15:46
> >> To: Alejandro Martinez Vicente 
> >> Cc: James Greenhalgh ; GCC Patches  >> patc...@gcc.gnu.org>; nd ; Richard Biener
> >> 
> >> Subject: Re: [Aarch64][SVE] Vectorise sum-of-absolute-differences
> >>
> >> Alejandro Martinez Vicente 
> writes:
> >> > +;; Helper expander for aarch64_abd_3 to save the callers
> >> > +;; the hassle of constructing the other arm of the MINUS.
> >> > +(define_expand "abd_3"
> >> > +  [(use (match_operand:SVE_I 0 "register_operand"))
> >> > +   (USMAX:SVE_I (match_operand:SVE_I 1 "register_operand")
> >> > +(match_operand:SVE_I 2 "register_operand"))]
> >> > +  "TARGET_SVE"
> >> > +  {
> >> > +rtx other_arm
> >> > +  = simplify_gen_binary (, mode, operands[1],
> >> > +operands[2]);
> >>
> >> I realise this is just copied from the Advanced SIMD version, but
> >> simplify_gen_binary is a bit dangerous here, since we explicitly want
> >> an unsimplified  with the two operands given.  Probably
> better as:
> >>
> >>   gen_rtx_ (mode, ...)
> >>
> >> > +emit_insn (gen_aarch64_abd_3 (operands[0],
> operands[1],
> >> > +   operands[2], other_arm));
> >> > +DONE;
> >> > +  }
> >> > +)
> >> > +
> >> > +;; Unpredicated integer absolute difference.
> >> > +(define_expand "aarch64_abd_3"
> >> > +  [(set (match_operand:SVE_I 0 "register_operand")
> >> > +(unspec:SVE_I
> >> > +  [(match_dup 4)
> >> > +   (minus:SVE_I
> >> > + (USMAX:SVE_I
> >> > +   (match_operand:SVE_I 1 "register_operand" "w")
> >> > +   (match_operand:SVE_I 2 "register_operand" "w"))
> >> > + (match_operator 3 "aarch64_"
> >> > +   [(match_dup 1)
> >> > +(match_dup 2)]))]
> >> > +  UNSPEC_MERGE_PTRUE))]
> >> > +  "TARGET_SVE"
> >> > +  {
> >> > +operands[4] = force_reg (mode, CONSTM1_RTX
> >> (mode));
> >> > +  }
> >> > +)
> >>
> >> I think we should go directly from abd_3 to the final
> >> pattern, so that abd_3 does the force_reg too.  This would
> make...
> >>
> >> > +;; Predicated integer absolute difference.
> >> > +(define_insn "*aarch64_abd_3"
> >>
> >> ...this the named pattern, instead of starting with "*".
> >>
> >> > +  [(set (match_operand:SVE_I 0 "register_operand" "=w, ?")
> >> > +(unspec:SVE_I
> >> > +  [(match_operand: 1 "register_operand" "Upl, Upl")
> >> > +   (minus:SVE_I
> >> > + (USMAX:SVE_I
> >> > +   (match_operand:SVE_I 2 "register_operand" "w, w")
> >>
> >> Should be "0, w", so that the first alternative ties the input to the 
> >> output.
> >>
> >> > +   (match_operand:SVE_I 3 "register_operand" "w, w"))
> >> > + (match_operator 4 "aarch64_"
> >> > +   [(match_dup 2)
> >> > +(match_dup 3)]))]
> >> > +  UNSPEC_MERGE_PTRUE))]
> >> > +  "TARGET_SVE"
> >> > +  "@
> >> > +   abd\t%0., %1/m, %0., %3.
> >> > +
> >>
> movprfx\t%0, %2\;abd\t%0., %1/m, %0., %3. >> >"
> >> > +  [(set_attr "movprfx" "*,yes")]
> >> > +)
> >> > +
> >> > +;; Emit a sequence t

RE: [Aarch64][SVE] Vectorise sum-of-absolute-differences

2019-05-07 Thread Alejandro Martinez Vicente
Thanks for your comments Richard. I think this patch addresses them.

Alejandro

> -Original Message-
> From: Richard Sandiford 
> Sent: 07 May 2019 15:46
> To: Alejandro Martinez Vicente 
> Cc: James Greenhalgh ; GCC Patches  patc...@gcc.gnu.org>; nd ; Richard Biener
> 
> Subject: Re: [Aarch64][SVE] Vectorise sum-of-absolute-differences
> 
> Alejandro Martinez Vicente  writes:
> > +;; Helper expander for aarch64_abd_3 to save the callers ;;
> > +the hassle of constructing the other arm of the MINUS.
> > +(define_expand "abd_3"
> > +  [(use (match_operand:SVE_I 0 "register_operand"))
> > +   (USMAX:SVE_I (match_operand:SVE_I 1 "register_operand")
> > +   (match_operand:SVE_I 2 "register_operand"))]
> > +  "TARGET_SVE"
> > +  {
> > +rtx other_arm
> > +  = simplify_gen_binary (, mode, operands[1],
> > +operands[2]);
> 
> I realise this is just copied from the Advanced SIMD version, but
> simplify_gen_binary is a bit dangerous here, since we explicitly want an
> unsimplified  with the two operands given.  Probably better as:
> 
>   gen_rtx_ (mode, ...)
> 
> > +emit_insn (gen_aarch64_abd_3 (operands[0], operands[1],
> > +  operands[2], other_arm));
> > +DONE;
> > +  }
> > +)
> > +
> > +;; Unpredicated integer absolute difference.
> > +(define_expand "aarch64_abd_3"
> > +  [(set (match_operand:SVE_I 0 "register_operand")
> > +   (unspec:SVE_I
> > + [(match_dup 4)
> > +  (minus:SVE_I
> > +(USMAX:SVE_I
> > +  (match_operand:SVE_I 1 "register_operand" "w")
> > +  (match_operand:SVE_I 2 "register_operand" "w"))
> > +(match_operator 3 "aarch64_"
> > +  [(match_dup 1)
> > +   (match_dup 2)]))]
> > + UNSPEC_MERGE_PTRUE))]
> > +  "TARGET_SVE"
> > +  {
> > +operands[4] = force_reg (mode, CONSTM1_RTX
> (mode));
> > +  }
> > +)
> 
> I think we should go directly from abd_3 to the final pattern, so
> that abd_3 does the force_reg too.  This would make...
> 
> > +;; Predicated integer absolute difference.
> > +(define_insn "*aarch64_abd_3"
> 
> ...this the named pattern, instead of starting with "*".
> 
> > +  [(set (match_operand:SVE_I 0 "register_operand" "=w, ?")
> > +   (unspec:SVE_I
> > + [(match_operand: 1 "register_operand" "Upl, Upl")
> > +  (minus:SVE_I
> > +(USMAX:SVE_I
> > +  (match_operand:SVE_I 2 "register_operand" "w, w")
> 
> Should be "0, w", so that the first alternative ties the input to the output.
> 
> > +  (match_operand:SVE_I 3 "register_operand" "w, w"))
> > +(match_operator 4 "aarch64_"
> > +  [(match_dup 2)
> > +   (match_dup 3)]))]
> > + UNSPEC_MERGE_PTRUE))]
> > +  "TARGET_SVE"
> > +  "@
> > +   abd\t%0., %1/m, %0., %3.
> > +
> movprfx\t%0, %2\;abd\t%0., %1/m, %0., %3. >"
> > +  [(set_attr "movprfx" "*,yes")]
> > +)
> > +
> > +;; Emit a sequence to produce a sum-of-absolute-differences of the
> > +inputs in ;; operands 1 and 2.  The sequence also has to perform a
> > +widening reduction of ;; the difference into a vector and accumulate
> > +that into operand 3 before ;; copying that into the result operand 0.
> > +;; Perform that with a sequence of:
> > +;; MOV ones.b, #1
> > +;; UABDdiff.b, p0/m, op1.b, op2.b
> > +;; UDOTop3.s, diff.b, ones.b
> > +;; MOV op0, op3  // should be eliminated in later passes.
> > +;; The signed version just uses the signed variants of the above
> instructions.
> 
> Think it would be clearer if we removed the last line and just used [SU]ABD
> instead of UABD, since that's the only sign-dependent part of the operation.
> Also think we should SVEize it with MOVPRFX, since a separate MOV should
> never be needed:
> 
> ;; MOVones.b, #1
> ;; [SU]ABDdiff.b, ptrue/m, op1.b, op2.b
> ;; MOVPRFXop0, op3// If necessary
> ;; UDOT   op0.s, diff.b, ones.b
> 
> > +(define_expand "sad"
> > +  [(use (match_operand:SVE_SDI 0 "register_operand"))
> > +   (unspec: [(use (match_operand: 1 "register_operand"))
> > +   (use (match_operand: 2 "register_operand"

RE: [Aarch64][SVE] Vectorise sum-of-absolute-differences

2019-05-07 Thread Alejandro Martinez Vicente
Hi,

I updated the patch after the dot product went in. This is the new cover letter:

This patch adds support to vectorize sum of absolute differences (SAD_EXPR)
using SVE.

Given this input code:

int
sum_abs (uint8_t *restrict x, uint8_t *restrict y, int n)
{
  int sum = 0;

  for (int i = 0; i < n; i++)
{
  sum += __builtin_abs (x[i] - y[i]);
}

  return sum;
}

The resulting SVE code is:

0000000000000000 <sum_abs>:
   0:   715f            cmp     w2, #0x0
   4:   5400026d        b.le    50
   8:   d283            mov     x3, #0x0                // #0
   c:   93407c42        sxtw    x2, w2
  10:   2538c002        mov     z2.b, #0
  14:   25221fe0        whilelo p0.b, xzr, x2
  18:   2538c023        mov     z3.b, #1
  1c:   2518e3e1        ptrue   p1.b
  20:   a4034000        ld1b    {z0.b}, p0/z, [x0, x3]
  24:   a4034021        ld1b    {z1.b}, p0/z, [x1, x3]
  28:   0430e3e3        incb    x3
  2c:   0520c021        sel     z1.b, p0, z1.b, z0.b
  30:   25221c60        whilelo p0.b, x3, x2
  34:   040d0420        uabd    z0.b, p1/m, z0.b, z1.b
  38:   44830402        udot    z2.s, z0.b, z3.b
  3c:   5421            b.ne    20              // b.any
  40:   2598e3e0        ptrue   p0.s
  44:   04812042        uaddv   d2, p0, z2.s
  48:   1e260040        fmov    w0, s2
  4c:   d65f03c0        ret
  50:   1e2703e2        fmov    s2, wzr
  54:   1e260040        fmov    w0, s2
  58:   d65f03c0        ret

Notice how udot is used inside a fully masked loop.
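
For illustration only (not from the patch itself), the decomposition behind
the loop body is: an unsigned absolute difference of the byte inputs, a dot
product with a vector of ones that widens and accumulates groups of four byte
differences into 32-bit lanes, and a final across-vector add.  A scalar
sketch, with vl modelling the number of bytes per SVE vector (assumed <= 256):

#include <stdint.h>

uint32_t
sad_sketch (uint8_t *restrict x, uint8_t *restrict y, int n, int vl)
{
  uint32_t acc[64] = { 0 };   /* models the z2.s accumulator (vl / 4 lanes) */
  for (int i = 0; i < n; i += vl)
    for (int j = 0; j < vl && i + j < n; j++)
      {
        /* uabd: byte-wise absolute difference of the two masked loads.  */
        uint8_t diff = x[i + j] > y[i + j] ? x[i + j] - y[i + j]
                                           : y[i + j] - x[i + j];
        /* udot with the all-ones vector z3.b: each group of four byte
           differences is widened and added into one 32-bit lane.  */
        acc[j / 4] += diff;
      }
  /* uaddv: final reduction of the 32-bit accumulator lanes.  */
  uint32_t sum = 0;
  for (int j = 0; j < 64; j++)
    sum += acc[j];
  return sum;
}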

I tested this patch in an aarch64 machine bootstrapping the compiler and
running the checks.

Alejandro

gcc/Changelog:

2019-05-07  Alejandro Martinez  

* config/aarch64/aarch64-sve.md (abd_3): New define_expand.
(aarch64_abd_3): Likewise.
(*aarch64_abd_3): New define_insn.
(sad): New define_expand.
* config/aarch64/iterators.md: Added MAX_OPP attribute.
* tree-vect-loop.c (use_mask_by_cond_expr_p): Add SAD_EXPR.
(build_vect_cond_expr): Likewise.

gcc/testsuite/Changelog:
 
2019-05-07  Alejandro Martinez  

* gcc.target/aarch64/sve/sad_1.c: New test for sum of absolute
differences.

> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org 
> On Behalf Of Alejandro Martinez Vicente
> Sent: 11 February 2019 15:38
> To: James Greenhalgh 
> Cc: GCC Patches ; nd ; Richard
> Sandiford ; Richard Biener
> 
> Subject: RE: [Aarch64][SVE] Vectorise sum-of-absolute-differences
> 
> > -Original Message-
> > From: James Greenhalgh 
> > Sent: 06 February 2019 17:42
> > To: Alejandro Martinez Vicente 
> > Cc: GCC Patches ; nd ; Richard
> > Sandiford ; Richard Biener
> > 
> > Subject: Re: [Aarch64][SVE] Vectorise sum-of-absolute-differences
> >
> > On Mon, Feb 04, 2019 at 07:34:05AM -0600, Alejandro Martinez Vicente
> > wrote:
> > > Hi,
> > >
> > > This patch adds support to vectorize sum of absolute differences
> > > (SAD_EXPR) using SVE. It also uses the new functionality to ensure
> > > that the resulting loop is masked. Therefore, it depends on
> > >
> > > https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00016.html
> > >
> > > Given this input code:
> > >
> > > int
> > > sum_abs (uint8_t *restrict x, uint8_t *restrict y, int n) {
> > >   int sum = 0;
> > >
> > >   for (int i = 0; i < n; i++)
> > > {
> > >   sum += __builtin_abs (x[i] - y[i]);
> > > }
> > >
> > >   return sum;
> > > }
> > >
> > > The resulting SVE code is:
> > >
> > >  :
> > >0: 715fcmp w2, #0x0
> > >4: 5400026db.le50 
> > >8: d283mov x3, #0x0// #0
> > >c: 93407c42sxtwx2, w2
> > >   10: 2538c002mov z2.b, #0
> > >   14: 25221fe0whilelo p0.b, xzr, x2
> > >   18: 2538c023mov z3.b, #1
> > >   1c: 2518e3e1ptrue   p1.b
> > >   20: a4034000ld1b{z0.b}, p0/z, [x0, x3]
> > >   24: a4034021ld1b{z1.b}, p0/z, [x1, x3]
> > >   28: 0430e3e3incbx3
> > >   2c: 0520c021sel z1.b, p0, z1.b, z0.b
> > >   30: 25221c60whilelo p0.b, x3, x2
> > >   34: 040d0420uabdz0.b, p1/m, z0.b, z1.b
> > >   38: 44830402udotz2.s, z0.b, z3.b
> > >   3c: 5421b.ne20   // b.any
> > >   40: 2598e3e0ptrue   p0.s
> > >   44: 04812042uaddv   d2, p0, z2.s
> > >   48: 1e260040fmovw0, s2
> > >   4c: d

RE: [Aarch64][SVE] Dot product support

2019-04-29 Thread Alejandro Martinez Vicente

> -Original Message-
> From: Richard Sandiford 
> Sent: 29 April 2019 09:42
> To: Alejandro Martinez Vicente 
> Cc: GCC Patches ; nd ; Richard
> Biener 
> Subject: Re: [Aarch64][SVE] Dot product support
> 
> Alejandro Martinez Vicente  writes:
> > @@ -5885,6 +5885,56 @@ is_nonwrapping_integer_induction
> (stmt_vec_info stmt_vinfo, struct loop *loop)
> >   <= TYPE_PRECISION (lhs_type));
> >  }
> >
> > +/* Check if masking can be supported by inserting a condional expression.
> 
> conditional
> 
> > +   CODE is the code for the operation.  COND_FN is the conditional internal
> > +   function, if it exists.  VECTYPE_IN is the type of the vector
> > +input.  */ static bool use_mask_by_cond_expr_p (enum tree_code code,
> > +internal_fn cond_fn,
> > +tree vectype_in)
> > +{
> > +  if (cond_fn != IFN_LAST
> > +  && direct_internal_fn_supported_p (cond_fn, vectype_in,
> > +OPTIMIZE_FOR_SPEED))
> > +return false;
> > +
> > +  switch (code)
> > +{
> > +case DOT_PROD_EXPR:
> > +  return true;
> > +
> > +default:
> > +  return false;
> > +}
> > +}
> > +
> > +/* Insert a condional expression to enable masked vectorization.
> > +CODE is the
> 
> Same here.
> 
> > +   code for the operation.  VOP is the array of operands.  MASK is the loop
> > +   mask.  GSI is a statement iterator used to place the new conditional
> > +   expression.  */
> > +static void
> > +build_vect_cond_expr (enum tree_code code, tree vop[3], tree mask,
> > + gimple_stmt_iterator *gsi)
> > +{
> > +  switch (code)
> > +{
> > +case DOT_PROD_EXPR:
> > +  {
> > +   tree vectype = TREE_TYPE (vop[1]);
> > +   tree zero = build_zero_cst (vectype);
> > +   zero = build_vector_from_val (vectype, zero);
> 
> This last line isn't right -- should just delete it.
> 
> tree zero = build_zero_cst (vectype);
> 
> builds a zero vector in one go.
> 
> OK with those changes, thanks.  (This version didn't include the testcase, but
> I assume that's because it didn't change from last time.)
> 
Done. I forgot to add the testcase in v2.

Alejandro

> Richard


dot_v3.patch
Description: dot_v3.patch


RE: [Aarch64][SVE] Dot product support

2019-04-29 Thread Alejandro Martinez Vicente
Hi Richard,

This is the updated patch with your comments. In addition to that, I removed
vectype_in from the build_vect_cond_expr call, since it wasn't really necessary.

Alejandro

> -Original Message-
> From: Richard Sandiford 
> Sent: 26 April 2019 14:29
> To: Alejandro Martinez Vicente 
> Cc: GCC Patches ; nd ; Richard
> Biener 
> Subject: Re: [Aarch64][SVE] Dot product support
> 
> Alejandro Martinez Vicente  writes:
> > Hi,
> >
> > This patch does two things. For the general vectorizer, it adds
> > support to perform fully masked reductions over expressions that don't
> support masking.
> > This is achieved by using VEC_COND_EXPR where possible.  At the moment
> > this is implemented for DOT_PROD_EXPR only, but the framework is there
> > to extend it to other expressions.
> >
> > Related to that, this patch adds support to vectorize dot product
> > using SVE.  It also uses the new functionality to ensure that the resulting
> loop is masked.
> >
> > Given this input code:
> >
> > uint32_t
> > dotprod (uint8_t *restrict x, uint8_t *restrict y, int n) {
> >   uint32_t sum = 0;
> >
> >   for (int i = 0; i < n; i++)
> > {
> >   sum += x[i] * y[i];
> > }
> >
> >   return sum;
> > }
> >
> > The resulting SVE code is:
> >
> >  :
> >0:   715fcmp w2, #0x0
> >4:   5400024db.le4c 
> >8:   d283mov x3, #0x0// #0
> >c:   93407c42sxtwx2, w2
> >   10:   2538c001mov z1.b, #0
> >   14:   25221fe0whilelo p0.b, xzr, x2
> >   18:   2538c003mov z3.b, #0
> >   1c:   d503201fnop
> >   20:   a4034002ld1b{z2.b}, p0/z, [x0, x3]
> >   24:   a4034020ld1b{z0.b}, p0/z, [x1, x3]
> >   28:   0430e3e3incbx3
> >   2c:   0523c000sel z0.b, p0, z0.b, z3.b
> >   30:   25221c60whilelo p0.b, x3, x2
> >   34:   44820401udotz1.s, z0.b, z2.b
> >   38:   5441b.ne20   // b.any
> >   3c:   2598e3e0ptrue   p0.s
> >   40:   04812021uaddv   d1, p0, z1.s
> >   44:   1e260020fmovw0, s1
> >   48:   d65f03c0ret
> >   4c:   1e2703e1fmovs1, wzr
> >   50:   1e260020fmovw0, s1
> >   54:   d65f03c0ret
> >
> > Notice how udot is used inside a fully masked loop.
> >
> > I tested this patch in an aarch64 machine bootstrapping the compiler
> > and running the checks.
> >
> > I admit it is too late to merge this into gcc 9, but I'm posting it
> > anyway so it can be considered for gcc 10.
> >
> > Alejandro
> >
> >
> > gcc/Changelog:
> >
> > 2019-01-31  Alejandro Martinez  
> >
> > * config/aarch64/aarch64-sve.md (dot_prod): Taken
> from SVE
> > ACLE branch.
> > * config/aarch64/iterators.md: Copied Vetype_fourth, VSI2QI and
> vsi2qi from
> > SVE ACLE branch.
> > * tree-vect-loop.c (use_mask_by_cond_expr_p): New function to
> check if a
> > VEC_COND_EXPR be inserted to emulate a conditional internal
> function.
> > (build_vect_cond_expr): Emit the VEC_COND_EXPR.
> > (vectorizable_reduction): Use the functions above to vectorize in a
> > fully masked loop codes that don't have a conditional internal
> > function.
> >
> > gcc/testsuite/Changelog:
> >
> > 2019-01-31  Alejandro Martinez  
> >
> > * gcc.target/aarch64/sve/dot_1.c: New test for dot product.
> >
> > diff --git a/gcc/config/aarch64/aarch64-sve.md
> > b/gcc/config/aarch64/aarch64-sve.md
> > index 5bb3422..2779a21 100644
> > --- a/gcc/config/aarch64/aarch64-sve.md
> > +++ b/gcc/config/aarch64/aarch64-sve.md
> > @@ -3128,3 +3128,17 @@
> >  DONE;
> >}
> >  )
> > +
> > +;; Unpredicated DOT product.
> > +(define_insn "dot_prod"
> > +  [(set (match_operand:SVE_SDI 0 "register_operand" "=w, ?")
> > +   (plus:SVE_SDI (unspec:SVE_SDI [(match_operand: 1
> "register_operand" "w, w")
> > +  (match_operand: 2
> "register_operand" "w, w")]
> > +  DOTPROD)
> > +   (match_operand:SVE_SDI 3 "register_operand" "0, w")))]
> > +  "TARGET_SVE"
> > +  "@
> >

[MAINTAINERS] Add myself to MAINTAINERS

2019-02-27 Thread Alejandro Martinez Vicente
Add myself to write after approval.

Alejandro

Committed to trunk in r 269246

Index: MAINTAINERS
===
--- MAINTAINERS(revision 269244)
+++ MAINTAINERS(working copy)
@@ -495,6 +495,7 @@
Jose E. Marchesi
Patrick Marlier
Simon Martin
+Alejandro Martinez
Ranjit Mathew
Paulo Matos
Michael Matz


[AArch64, SVE] Fix vectorized FP converts

2019-02-26 Thread Alejandro Martinez Vicente
Hi,

Some of the narrowing/widening FP converts were missing from SVE. I fixed most
of them, so they can be vectorized. The ones missing are int64->fp16 and
fp16->int64.

I extended the tests to cover the cases that were missing.

I validated the patch with self-checking and running the new SVE tests on an
SVE emulator.

Alejandro


gcc/Changelog:

2019-02-25  Alejandro Martinez  

* config/aarch64/aarch64-sve.md
(aarch64_sve__vnx8hf2,
aarch64_sve__vnx4sf2): Renamed FP to int
patterns.
(vec_unpack_fix_trunc__,
vec_pack_float_): New unpack/pack expanders.
* config/aarch64/iterators.md (SVE_HSDI): Fix cut-&-paste of SVE_BHSI.
(VWIDEINT): New iterator.
(VwideInt): Likewise.


gcc/testsuite/Changelog:
 
2019-02-25  Alejandro Martinez  

* gcc.target/aarch64/sve/fcvt_1.c: New test for fp to fp convert.
* gcc.target/aarch64/sve/fcvt_1_run.c: Likewise.
* gcc.target/aarch64/sve/cvtf_signed_1.c: Improved test to cover
widening and narrowing cases.
* gcc.target/aarch64/sve/cvtf_signed_1_run.c: Likewise.
* gcc.target/aarch64/sve/cvtf_unsigned_1.c: Likewise.
* gcc.target/aarch64/sve/cvtf_unsigned_1_run.c: Likewise.
* gcc.target/aarch64/sve/fcvtz_signed_1.c: Likewise.
* gcc.target/aarch64/sve/fcvtz_signed_1_run.c: Likewise.
* gcc.target/aarch64/sve/fcvtz_unsigned_1.c: Likewise.
* gcc.target/aarch64/sve/fcvtz_unsigned_1_run.c: Likewise.



cvt_v4.patch
Description: cvt_v4.patch


RE: [Aarch64][SVE] Vectorise sum-of-absolute-differences

2019-02-11 Thread Alejandro Martinez Vicente
> -Original Message-
> From: James Greenhalgh 
> Sent: 06 February 2019 17:42
> To: Alejandro Martinez Vicente 
> Cc: GCC Patches ; nd ; Richard
> Sandiford ; Richard Biener
> 
> Subject: Re: [Aarch64][SVE] Vectorise sum-of-absolute-differences
> 
> On Mon, Feb 04, 2019 at 07:34:05AM -0600, Alejandro Martinez Vicente
> wrote:
> > Hi,
> >
> > This patch adds support to vectorize sum of absolute differences
> > (SAD_EXPR) using SVE. It also uses the new functionality to ensure
> > that the resulting loop is masked. Therefore, it depends on
> >
> > https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00016.html
> >
> > Given this input code:
> >
> > int
> > sum_abs (uint8_t *restrict x, uint8_t *restrict y, int n) {
> >   int sum = 0;
> >
> >   for (int i = 0; i < n; i++)
> > {
> >   sum += __builtin_abs (x[i] - y[i]);
> > }
> >
> >   return sum;
> > }
> >
> > The resulting SVE code is:
> >
> >  :
> >0:   715fcmp w2, #0x0
> >4:   5400026db.le50 
> >8:   d283mov x3, #0x0// #0
> >c:   93407c42sxtwx2, w2
> >   10:   2538c002mov z2.b, #0
> >   14:   25221fe0whilelo p0.b, xzr, x2
> >   18:   2538c023mov z3.b, #1
> >   1c:   2518e3e1ptrue   p1.b
> >   20:   a4034000ld1b{z0.b}, p0/z, [x0, x3]
> >   24:   a4034021ld1b{z1.b}, p0/z, [x1, x3]
> >   28:   0430e3e3incbx3
> >   2c:   0520c021sel z1.b, p0, z1.b, z0.b
> >   30:   25221c60whilelo p0.b, x3, x2
> >   34:   040d0420uabdz0.b, p1/m, z0.b, z1.b
> >   38:   44830402udotz2.s, z0.b, z3.b
> >   3c:   5421b.ne20   // b.any
> >   40:   2598e3e0ptrue   p0.s
> >   44:   04812042uaddv   d2, p0, z2.s
> >   48:   1e260040fmovw0, s2
> >   4c:   d65f03c0ret
> >   50:   1e2703e2fmovs2, wzr
> >   54:   1e260040fmovw0, s2
> >   58:   d65f03c0ret
> >
> > Notice how udot is used inside a fully masked loop.
> >
> > I tested this patch in an aarch64 machine bootstrapping the compiler
> > and running the checks.
> 
> This doesn't give us much confidence in SVE coverage; unless you have been
> running in an environment using SVE by default? Do you have some set of
> workloads you could test the compiler against to ensure correct operation of
> the SVE vectorization?
> 
I tested it using an SVE model and a big set of workloads, including SPEC 2000,
2006 and 2017. On the plus side, nothing broke. But the impact on performance
was minimal (on average, a tiny gain over the whole set of workloads).

I still want this patch (and the companion dot product patch) to make it into
the compiler because they are the first steps towards vectorising workloads
using fully masked loops when the target ISA (like SVE) doesn't support masking
in all the operations.

Alejandro

> >
> > I admit it is too late to merge this into gcc 9, but I'm posting it
> > anyway so it can be considered for gcc 10.
> 
> Richard Sandiford has the call on whether this patch is OK for trunk now or
> GCC 10. With the minimal testing it has had, I'd be uncomfortable with it as a
> GCC 9 patch. That said, it is a fairly self-contained pattern for the compiler
> and it would be good to see this optimization in GCC 9.
> 
> >
> > Alejandro
> >
> >
> > gcc/Changelog:
> >
> > 2019-02-04  Alejandro Martinez  
> >
> > * config/aarch64/aarch64-sve.md (abd_3): New
> define_expand.
> > (aarch64_abd_3): Likewise.
> > (*aarch64_abd_3): New define_insn.
> > (sad): New define_expand.
> > * config/aarch64/iterators.md: Added MAX_OPP and max_opp
> attributes.
> > Added USMAX iterator.
> > * config/aarch64/predicates.md: Added aarch64_smin and
> aarch64_umin
> > predicates.
> > * tree-vect-loop.c (use_mask_by_cond_expr_p): Add SAD_EXPR.
> > (build_vect_cond_expr): Likewise.
> >
> > gcc/testsuite/Changelog:
> >
> > 2019-02-04  Alejandro Martinez  
> >
> > * gcc.target/aarch64/sve/sad_1.c: New test for sum of absolute
> > differences.
> 



[Aarch64][SVE] Vectorise sum-of-absolute-differences

2019-02-04 Thread Alejandro Martinez Vicente
Hi,

This patch adds support to vectorize sum of absolute differences (SAD_EXPR)
using SVE. It also uses the new functionality to ensure that the resulting loop
is masked. Therefore, it depends on

https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00016.html

Given this input code:

int
sum_abs (uint8_t *restrict x, uint8_t *restrict y, int n)
{
  int sum = 0;

  for (int i = 0; i < n; i++)
{
  sum += __builtin_abs (x[i] - y[i]);
}

  return sum;
}

The resulting SVE code is:

0000000000000000 <sum_abs>:
   0:   715f            cmp     w2, #0x0
   4:   5400026d        b.le    50
   8:   d283            mov     x3, #0x0                // #0
   c:   93407c42        sxtw    x2, w2
  10:   2538c002        mov     z2.b, #0
  14:   25221fe0        whilelo p0.b, xzr, x2
  18:   2538c023        mov     z3.b, #1
  1c:   2518e3e1        ptrue   p1.b
  20:   a4034000        ld1b    {z0.b}, p0/z, [x0, x3]
  24:   a4034021        ld1b    {z1.b}, p0/z, [x1, x3]
  28:   0430e3e3        incb    x3
  2c:   0520c021        sel     z1.b, p0, z1.b, z0.b
  30:   25221c60        whilelo p0.b, x3, x2
  34:   040d0420        uabd    z0.b, p1/m, z0.b, z1.b
  38:   44830402        udot    z2.s, z0.b, z3.b
  3c:   5421            b.ne    20              // b.any
  40:   2598e3e0        ptrue   p0.s
  44:   04812042        uaddv   d2, p0, z2.s
  48:   1e260040        fmov    w0, s2
  4c:   d65f03c0        ret
  50:   1e2703e2        fmov    s2, wzr
  54:   1e260040        fmov    w0, s2
  58:   d65f03c0        ret

Notice how udot is used inside a fully masked loop.

I tested this patch in an aarch64 machine bootstrapping the compiler and
running the checks.

I admit it is too late to merge this into gcc 9, but I'm posting it anyway so
it can be considered for gcc 10.

Alejandro


gcc/Changelog:

2019-02-04  Alejandro Martinez  

* config/aarch64/aarch64-sve.md (abd_3): New define_expand.
(aarch64_abd_3): Likewise.
(*aarch64_abd_3): New define_insn.
(sad): New define_expand.
* config/aarch64/iterators.md: Added MAX_OPP and max_opp attributes.
Added USMAX iterator.
* config/aarch64/predicates.md: Added aarch64_smin and aarch64_umin
predicates.
* tree-vect-loop.c (use_mask_by_cond_expr_p): Add SAD_EXPR.
(build_vect_cond_expr): Likewise.

gcc/testsuite/Changelog:
 
2019-02-04  Alejandro Martinez  

* gcc.target/aarch64/sve/sad_1.c: New test for sum of absolute
differences.


sad_v1.patch
Description: sad_v1.patch


[Aarch64][SVE] Dot product support

2019-02-01 Thread Alejandro Martinez Vicente
Hi,

This patch does two things. For the general vectorizer, it adds support to
perform fully masked reductions over expressions that don't support masking.
This is achieved by using VEC_COND_EXPR where possible.  At the moment this is
implemented for DOT_PROD_EXPR only, but the framework is there to extend it to
other expressions.
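
As an illustration of the VEC_COND_EXPR approach (a sketch, not code from the
patch): dot-product has no conditional internal function, but because a zero
lane contributes nothing to the accumulation, selecting zero into the inactive
lanes of one operand makes the unconditional dot-product safe inside a fully
masked loop.  In scalar form, with vl modelling the number of bytes per vector:

#include <stdint.h>

uint32_t
dotprod_sketch (uint8_t *restrict x, uint8_t *restrict y, int n, int vl)
{
  uint32_t sum = 0;
  for (int i = 0; i < n; i += vl)
    for (int j = 0; j < vl; j++)
      {
        /* VEC_COND_EXPR: keep the lane where the loop mask is true,
           substitute zero otherwise (the "sel z0.b, p0, z0.b, z3.b" in
           the code below; the other operand comes from a masked load and
           is already zero in the inactive lanes).  */
        uint8_t ylane = (i + j < n) ? y[i + j] : 0;
        uint8_t xlane = (i + j < n) ? x[i + j] : 0;
        sum += (uint32_t) xlane * ylane;   /* unconditional udot */
      }
  return sum;
}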

Related to that, this patch adds support to vectorize dot product using SVE.  It
also uses the new functionality to ensure that the resulting loop is masked.

Given this input code:

uint32_t
dotprod (uint8_t *restrict x, uint8_t *restrict y, int n)
{
  uint32_t sum = 0;

  for (int i = 0; i < n; i++)
{
  sum += x[i] * y[i];
}

  return sum;
}

The resulting SVE code is:

0000000000000000 <dotprod>:
   0:   715f            cmp     w2, #0x0
   4:   5400024d        b.le    4c
   8:   d283            mov     x3, #0x0                // #0
   c:   93407c42        sxtw    x2, w2
  10:   2538c001        mov     z1.b, #0
  14:   25221fe0        whilelo p0.b, xzr, x2
  18:   2538c003        mov     z3.b, #0
  1c:   d503201f        nop
  20:   a4034002        ld1b    {z2.b}, p0/z, [x0, x3]
  24:   a4034020        ld1b    {z0.b}, p0/z, [x1, x3]
  28:   0430e3e3        incb    x3
  2c:   0523c000        sel     z0.b, p0, z0.b, z3.b
  30:   25221c60        whilelo p0.b, x3, x2
  34:   44820401        udot    z1.s, z0.b, z2.b
  38:   5441            b.ne    20              // b.any
  3c:   2598e3e0        ptrue   p0.s
  40:   04812021        uaddv   d1, p0, z1.s
  44:   1e260020        fmov    w0, s1
  48:   d65f03c0        ret
  4c:   1e2703e1        fmov    s1, wzr
  50:   1e260020        fmov    w0, s1
  54:   d65f03c0        ret

Notice how udot is used inside a fully masked loop.

I tested this patch in an aarch64 machine bootstrapping the compiler and
running the checks.

I admit it is too late to merge this into gcc 9, but I'm posting it anyway so
it can be considered for gcc 10.

Alejandro


gcc/Changelog:

2019-01-31  Alejandro Martinez  

* config/aarch64/aarch64-sve.md (dot_prod): Taken from SVE
ACLE branch.
* config/aarch64/iterators.md: Copied Vetype_fourth, VSI2QI and vsi2qi 
from
SVE ACLE branch.
* tree-vect-loop.c (use_mask_by_cond_expr_p): New function to check if a
VEC_COND_EXPR be inserted to emulate a conditional internal function.
(build_vect_cond_expr): Emit the VEC_COND_EXPR.
(vectorizable_reduction): Use the functions above to vectorize in a
fully masked loop codes that don't have a conditional internal
function.

gcc/testsuite/Changelog:
 
2019-01-31  Alejandro Martinez  

* gcc.target/aarch64/sve/dot_1.c: New test for dot product.


dot_v1.patch
Description: dot_v1.patch


RE: [Vectorizer] Add SLP support for masked loads

2019-01-17 Thread Alejandro Martinez Vicente
> -Original Message-
> From: Richard Biener 
> Sent: 17 January 2019 07:53
> To: Alejandro Martinez Vicente 
> Cc: GCC Patches ; nd ; Richard
> Sandiford 
> Subject: Re: [Vectorizer] Add SLP support for masked loads
> 
> On Wed, Jan 16, 2019 at 2:37 PM Alejandro Martinez Vicente
>  wrote:
> >
> > Hi,
> >
> > Current vectorizer doesn't support masked loads for SLP. We should add
> > that, to allow things like:
> >
> > void
> > f (int *restrict x, int *restrict y, int *restrict z, int n) {
> >   for (int i = 0; i < n; i += 2)
> > {
> >   x[i] = y[i] ? z[i] : 1;
> >   x[i + 1] = y[i + 1] ? z[i + 1] : 2;
> > }
> > }
> >
> > to be vectorized using contiguous loads rather than LD2 and ST2.
> >
> > This patch was motivated by SVE, but it is completely generic and
> > should apply to any architecture with masked loads.
> >
> > After the patch is applied, the above code generates this output
> > (-march=armv8.2-a+sve -O2 -ftree-vectorize):
> >
> >  :
> >0:   717fcmp w3, #0x0
> >4:   540002cdb.le5c 
> >8:   51000464sub w4, w3, #0x1
> >c:   d283mov x3, #0x0// #0
> >   10:   9005adrpx5, 0 
> >   14:   25d8e3e0ptrue   p0.d
> >   18:   53017c84lsr w4, w4, #1
> >   1c:   91a5add x5, x5, #0x0
> >   20:   11000484add w4, w4, #0x1
> >   24:   85c0e0a1ld1rd   {z1.d}, p0/z, [x5]
> >   28:   2598e3e3ptrue   p3.s
> >   2c:   d37ff884lsl x4, x4, #1
> >   30:   25a41fe2whilelo p2.s, xzr, x4
> >   34:   d503201fnop
> >   38:   a5434820ld1w{z0.s}, p2/z, [x1, x3, lsl #2]
> >   3c:   25808c11cmpne   p1.s, p3/z, z0.s, #0
> >   40:   25808810cmpne   p0.s, p2/z, z0.s, #0
> >   44:   a5434040ld1w{z0.s}, p0/z, [x2, x3, lsl #2]
> >   48:   05a1c400sel z0.s, p1, z0.s, z1.s
> >   4c:   e5434800st1w{z0.s}, p2, [x0, x3, lsl #2]
> >   50:   04b0e3e3incwx3
> >   54:   25a41c62whilelo p2.s, x3, x4
> >   58:   5401b.ne38   // b.any
> >   5c:   d65f03c0ret
> >
> >
> > I tested this patch in an aarch64 machine bootstrapping the compiler
> > and running the checks.
> 
> Thanks for implementing this - note this is stage1 material and I will have a
> look when time allows unless Richard beats me to it.
> 


I agree, this is for GCC 10. I'll ping you guys when we're at stage1.

> It might be interesting to note that "non-SLP" code paths are likely to go
> away in GCC 10 to streamline the vectorizer and make further changes easier
> (so you'll see group_size == 1 SLP instances).
> 

Cool, thanks for the heads up.

Alejandro

> There are quite a few other cases missing SLP handling.
> 
> Richard.
> 
> > Alejandro
> >
> > gcc/Changelog:
> >
> > 2019-01-16  Alejandro Martinez  
> >
> > * config/aarch64/aarch64-sve.md (copysign3): New
> define_expand.
> > (xorsign3): Likewise.
> > internal-fn.c: Marked mask_load_direct and mask_store_direct as
> > vectorizable.
> > tree-data-ref.c (data_ref_compare_tree): Fixed comment typo.
> > tree-vect-data-refs.c (can_group_stmts_p): Allow masked loads to be
> > combined even if masks different.
> > (slp_vect_only_p): New function to detect masked loads that are only
> > vectorizable using SLP.
> > (vect_analyze_data_ref_accesses): Mark SLP only vectorizable groups.
> > tree-vect-loop.c (vect_dissolve_slp_only_groups): New function to
> > dissolve SLP-only vectorizable groups when SLP has been discarded.
> > (vect_analyze_loop_2): Call vect_dissolve_slp_only_groups when
> needed.
> > tree-vect-slp.c (vect_get_and_check_slp_defs): Check masked loads
> > masks.
> > (vect_build_slp_tree_1): Fixed comment typo.
> > (vect_build_slp_tree_2): Include masks from masked loads in SLP 
> > tree.
> > tree-vect-stmts.c (vect_get_vec_defs_for_operand): New function to
> get
> > vec_defs for operand with optional SLP and vectype.
> > (vectorizable_load): Allow vectorizaion of masked loads for SLP 
> > only.
> > tree-vectorizer.h (_stmt_vec_info): Added flag for SLP-only
> > vectorizable.
> > tree-vectorizer.c (vec_info::new_stmt_vec_info): Likewise.
> >
> > gcc/testsuite/Changelog:
> >
> > 2019-01-16  Alejandro Martinez  
> >
> > * gcc.target/aarch64/sve/mask_load_slp_1.c: New test for SLP
> > vectorized masked loads.


[Vectorizer] Add SLP support for masked loads

2019-01-16 Thread Alejandro Martinez Vicente
Hi,

Current vectorizer doesn't support masked loads for SLP. We should add that, to
allow things like:

void
f (int *restrict x, int *restrict y, int *restrict z, int n)
{
  for (int i = 0; i < n; i += 2)
{
  x[i] = y[i] ? z[i] : 1;
  x[i + 1] = y[i + 1] ? z[i + 1] : 2;
}
}

to be vectorized using contiguous loads rather than LD2 and ST2.
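
Conceptually (a sketch, not from the patch), the SLP group lets the two
statements share one contiguous vector access per array instead of splitting
even and odd elements across LD2/ST2; per lane the vectorized body behaves
roughly like this, with vl modelling the number of 32-bit lanes per vector:

void
f_sketch (int *restrict x, int *restrict y, int *restrict z, int n, int vl)
{
  for (int i = 0; i < n; i += vl)
    for (int j = 0; j < vl && i + j < n; j++)
      {
        int yv = y[i + j];                  /* one contiguous load of y */
        int zv = yv ? z[i + j] : 0;         /* one contiguous masked load of z */
        int ev = ((i + j) & 1) ? 2 : 1;     /* broadcast {1, 2, 1, 2, ...} */
        x[i + j] = yv ? zv : ev;            /* select + contiguous store */
      }
}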

This patch was motivated by SVE, but it is completely generic and should apply
to any architecture with masked loads.

After the patch is applied, the above code generates this output
(-march=armv8.2-a+sve -O2 -ftree-vectorize):

0000000000000000 <f>:
   0:   717f            cmp     w3, #0x0
   4:   540002cd        b.le    5c
   8:   51000464        sub     w4, w3, #0x1
   c:   d283            mov     x3, #0x0                // #0
  10:   9005            adrp    x5, 0
  14:   25d8e3e0        ptrue   p0.d
  18:   53017c84        lsr     w4, w4, #1
  1c:   91a5            add     x5, x5, #0x0
  20:   11000484        add     w4, w4, #0x1
  24:   85c0e0a1        ld1rd   {z1.d}, p0/z, [x5]
  28:   2598e3e3        ptrue   p3.s
  2c:   d37ff884        lsl     x4, x4, #1
  30:   25a41fe2        whilelo p2.s, xzr, x4
  34:   d503201f        nop
  38:   a5434820        ld1w    {z0.s}, p2/z, [x1, x3, lsl #2]
  3c:   25808c11        cmpne   p1.s, p3/z, z0.s, #0
  40:   25808810        cmpne   p0.s, p2/z, z0.s, #0
  44:   a5434040        ld1w    {z0.s}, p0/z, [x2, x3, lsl #2]
  48:   05a1c400        sel     z0.s, p1, z0.s, z1.s
  4c:   e5434800        st1w    {z0.s}, p2, [x0, x3, lsl #2]
  50:   04b0e3e3        incw    x3
  54:   25a41c62        whilelo p2.s, x3, x4
  58:   5401            b.ne    38              // b.any
  5c:   d65f03c0        ret


I tested this patch in an aarch64 machine bootstrapping the compiler and
running the checks.

Alejandro

gcc/Changelog:

2019-01-16  Alejandro Martinez  

* config/aarch64/aarch64-sve.md (copysign3): New define_expand.
(xorsign3): Likewise.
internal-fn.c: Marked mask_load_direct and mask_store_direct as
vectorizable.
tree-data-ref.c (data_ref_compare_tree): Fixed comment typo.
tree-vect-data-refs.c (can_group_stmts_p): Allow masked loads to be
combined even if masks different.
(slp_vect_only_p): New function to detect masked loads that are only
vectorizable using SLP.
(vect_analyze_data_ref_accesses): Mark SLP only vectorizable groups.
tree-vect-loop.c (vect_dissolve_slp_only_groups): New function to
dissolve SLP-only vectorizable groups when SLP has been discarded.
(vect_analyze_loop_2): Call vect_dissolve_slp_only_groups when needed.
tree-vect-slp.c (vect_get_and_check_slp_defs): Check masked loads
masks.
(vect_build_slp_tree_1): Fixed comment typo.
(vect_build_slp_tree_2): Include masks from masked loads in SLP tree.
tree-vect-stmts.c (vect_get_vec_defs_for_operand): New function to get
vec_defs for operand with optional SLP and vectype.
(vectorizable_load): Allow vectorizaion of masked loads for SLP only.
tree-vectorizer.h (_stmt_vec_info): Added flag for SLP-only
vectorizable.
tree-vectorizer.c (vec_info::new_stmt_vec_info): Likewise.

gcc/testsuite/Changelog:
 
2019-01-16  Alejandro Martinez  

* gcc.target/aarch64/sve/mask_load_slp_1.c: New test for SLP
vectorized masked loads.


mask_load_slp_1.patch
Description: mask_load_slp_1.patch


RE: [Aarch64][SVE] Add copysign and xorsign support

2019-01-09 Thread Alejandro Martinez Vicente
Hi,

I updated the patch to address Wilco's comment and style issues.

Alejandro


> -Original Message-
> From: Wilco Dijkstra 
> Sent: 08 January 2019 16:58
> To: GCC Patches ; Alejandro Martinez Vicente
> 
> Cc: nd ; Richard Sandiford 
> Subject: Re: [Aarch64][SVE] Add copysign and xorsign support
> 
> Hi Alejandro,
> 
> +emit_move_insn (mask,
> + aarch64_simd_gen_const_vector_dup
> (mode,
> +HOST_WIDE_INT_M1U
> +<< bits));
> +
> +emit_insn (gen_and3 (sign, arg2, mask));
> 
> Is there a reason to emit separate moves and then requiring the optimizer to
> combine them? The result of aarch64_simd_gen_const_vector_dup can be
> used directly in the gen_and for all supported floating point types.
> 
> Cheers,
> Wilco


copysign_2.patch
Description: copysign_2.patch


[Aarch64][SVE] Add copysign and xorsign support

2019-01-08 Thread Alejandro Martinez Vicente
Hi,

This patch adds support for copysign and xorsign builtins to SVE. With the new
expands, they can be vectorized using bitwise logical operations.
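
For reference (a scalar sketch, not taken from the patch), the expansion is
just sign-bit manipulation, which is what makes plain vector AND/ORR/EOR
sufficient; the vectorized versions apply the same masks lane-wise:

#include <stdint.h>
#include <string.h>

double
copysign_sketch (double x, double y)
{
  uint64_t xb, yb, sign = UINT64_C (1) << 63;
  memcpy (&xb, &x, sizeof xb);
  memcpy (&yb, &y, sizeof yb);
  xb = (xb & ~sign) | (yb & sign);   /* clear x's sign bit, copy y's */
  memcpy (&x, &xb, sizeof x);
  return x;
}

double
xorsign_sketch (double x, double y)
{
  uint64_t xb, yb, sign = UINT64_C (1) << 63;
  memcpy (&xb, &x, sizeof xb);
  memcpy (&yb, &y, sizeof yb);
  xb ^= yb & sign;                   /* flip x's sign where y is negative */
  memcpy (&x, &xb, sizeof x);
  return x;
}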

I tested this patch in an aarch64 machine bootstrapping the compiler and
running the checks.

Alejandro

gcc/Changelog:

2019-01-08  Alejandro Martinez  

* config/aarch64/aarch64-sve.md (copysign3): New define_expand.
(xorsign3): Likewise.

gcc/testsuite/Changelog:
 
2019-01-08  Alejandro Martinez  

* gcc.target/aarch64/sve/copysign_1.c: New test for SVE vectorized
copysign.
* gcc.target/aarch64/sve/copysign_1_run.c: Likewise.
* gcc.target/aarch64/sve/xorsign_1.c: New test for SVE vectorized
xorsign.
* gcc.target/aarch64/sve/xorsign_1_run.c: Likewise.



copysign.patch
Description: copysign.patch


RE: [Patch, Vectorizer, SVE] fmin/fmax builtin reduction support

2018-12-19 Thread Alejandro Martinez Vicente
Richard,

I'm happy to change the name of the helper to code_helper_for_stmt; the new
patch and changelog are included. Regarding the reductions being fold_left, the
FMINNM/FMINNMV instructions are defined in such a way that this is not
necessary (it wouldn't work with FMIN/FMINV).

Alejandro

 
gcc/Changelog:
 
2018-12-18  Alejandro Martinez  

* gimple-match.h (code_helper_for_stmt): New function to get a
code_helper from a statement.
* internal-fn.def: New reduc_fmax_scal and reduc_fmin_scal optabs for
ieee fp max/min reductions
* optabs.def: Likewise.
* tree-vect-loop.c (reduction_fn_for_scalar_code): Changed function
signature to accept code_helper instead of tree_code. Handle the
fmax/fmin builtins.
(needs_fold_left_reduction_p): Likewise.
(check_reduction_path): Likewise.
(vect_is_simple_reduction): Use code_helper instead of tree_code. Check
for supported call-based reductions. Extend support for both
assignment-based and call-based reductions.
(vect_model_reduction_cost): Extend cost-model support to call-based
reductions (just use MAX expression).
(get_initial_def_for_reduction): Use code_helper instead of tree_code.
Extend support for both assignment-based and call-based reductions.
(vect_create_epilog_for_reduction): Likewise.
(vectorizable_reduction): Likewise.
* tree-vectorizer.h: include gimple-match.h for code_helper. Use
code_helper in check_reduction_path signature.
* config/aarch64/aarch64-sve.md: Added define_expand to capture new
reduc_fmax_scal and reduc_fmin_scal optabs.
* config/aarch64/iterators.md: New FMAXMINNMV and fmaxmin_uns iterators
to support the new define_expand.
 
gcc/testsuite/Changelog:
 
2018-12-18  Alejandro Martinez  

* gcc.target/aarch64/sve/reduc_9.c: New test to check
SVE-vectorized reductions without -ffast-math.
* gcc.target/aarch64/sve/reduc_10.c: New test to check
SVE-vectorized builtin reductions without -ffast-math.

-Original Message-
From: Richard Biener  
Sent: 19 December 2018 12:35
To: Alejandro Martinez Vicente 
Cc: GCC Patches ; Richard Sandiford 
; nd 
Subject: Re: [Patch, Vectorizer, SVE] fmin/fmax builtin reduction support

On Wed, Dec 19, 2018 at 10:33 AM Alejandro Martinez Vicente 
 wrote:
>
> Hi all,
>
> Loops that use the fmin/fmax builtins can be vectorized even without 
> -ffast-math using SVE's FMINNM/FMAXNM instructions. This is an example:
>
> double
> f (double *x, int n)
> {
>   double res = 100.0;
>   for (int i = 0; i < n; ++i)
> res = __builtin_fmin (res, x[i]);
>   return res;
> }
>
> Before this patch, the compiler would generate this code 
> (-march=armv8.2-a+sve
> -O2 -ftree-vectorize):
>
>  :
>0:   713fcmp w1, #0x0
>4:   5400018db.le34 
>8:   51000422sub w2, w1, #0x1
>c:   91002003add x3, x0, #0x8
>   10:   d2e80b21mov x1, #0x4059
>   14:   9e670020fmovd0, x1
>   18:   8b224c62add x2, x3, w2, uxtw #3
>   1c:   d503201fnop
>   20:   fc408401ldr d1, [x0],#8
>   24:   1e617800fminnm  d0, d0, d1
>   28:   eb02001fcmp x0, x2
>   2c:   54a1b.ne20 
>   30:   d65f03c0ret
>   34:   d2e80b20mov x0, #0x4059
>   38:   9e67fmovd0, x0
>   3c:   d65f03c0ret
>
> After this patch, this is the code that gets generated:
>
>  :
>0:   713fcmp w1, #0x0
>4:   5400020db.le44 
>8:   d282mov x2, #0x0
>c:   25d8e3e0ptrue   p0.d
>   10:   93407c21sxtwx1, w1
>   14:   9003adrpx3, 0 
>   18:   25804001mov p1.b, p0.b
>   1c:   9163add x3, x3, #0x0
>   20:   85c0e060ld1rd   {z0.d}, p0/z, [x3]
>   24:   25e11fe0whilelo p0.d, xzr, x1
>   28:   a5e24001ld1d{z1.d}, p0/z, [x0, x2, lsl #3]
>   2c:   04f0e3e2incdx2
>   30:   65c58020fminnm  z0.d, p0/m, z0.d, z1.d
>   34:   25e11c40whilelo p0.d, x2, x1
>   38:   5481b.ne28   // b.any
>   3c:   65c52400fminnmv d0, p1, z0.d
>   40:   d65f03c0ret
>   44:   d2e80b20mov x0, #0x4059
>   48:   9e67fmovd0, x0
>   4c:   d65f03c0ret
>
> This patch extends the support for reductions to include calls to 
> internal functions, in addition to assign statements. For this 
> purpose, in most places where a tree_code would be used, a code_helper 
> is used instead. The code_helper a

[Patch, Vectorizer, SVE] fmin/fmax builtin reduction support

2018-12-19 Thread Alejandro Martinez Vicente
Hi all,
 
Loops that use the fmin/fmax builtins can be vectorized even without
-ffast-math using SVE's FMINNM/FMAXNM instructions. This is an example:
 
double
f (double *x, int n)
{
  double res = 100.0;
  for (int i = 0; i < n; ++i)
res = __builtin_fmin (res, x[i]);
  return res;
}

Before this patch, the compiler would generate this code (-march=armv8.2-a+sve
-O2 -ftree-vectorize):

0000000000000000 <f>:
   0:   713f            cmp     w1, #0x0
   4:   5400018d        b.le    34
   8:   51000422        sub     w2, w1, #0x1
   c:   91002003        add     x3, x0, #0x8
  10:   d2e80b21        mov     x1, #0x4059
  14:   9e670020        fmov    d0, x1
  18:   8b224c62        add     x2, x3, w2, uxtw #3
  1c:   d503201f        nop
  20:   fc408401        ldr     d1, [x0], #8
  24:   1e617800        fminnm  d0, d0, d1
  28:   eb02001f        cmp     x0, x2
  2c:   54a1            b.ne    20
  30:   d65f03c0        ret
  34:   d2e80b20        mov     x0, #0x4059
  38:   9e67            fmov    d0, x0
  3c:   d65f03c0        ret

After this patch, this is the code that gets generated:

0000000000000000 <f>:
   0:   713f            cmp     w1, #0x0
   4:   5400020d        b.le    44
   8:   d282            mov     x2, #0x0
   c:   25d8e3e0        ptrue   p0.d
  10:   93407c21        sxtw    x1, w1
  14:   9003            adrp    x3, 0
  18:   25804001        mov     p1.b, p0.b
  1c:   9163            add     x3, x3, #0x0
  20:   85c0e060        ld1rd   {z0.d}, p0/z, [x3]
  24:   25e11fe0        whilelo p0.d, xzr, x1
  28:   a5e24001        ld1d    {z1.d}, p0/z, [x0, x2, lsl #3]
  2c:   04f0e3e2        incd    x2
  30:   65c58020        fminnm  z0.d, p0/m, z0.d, z1.d
  34:   25e11c40        whilelo p0.d, x2, x1
  38:   5481            b.ne    28              // b.any
  3c:   65c52400        fminnmv d0, p1, z0.d
  40:   d65f03c0        ret
  44:   d2e80b20        mov     x0, #0x4059
  48:   9e67            fmov    d0, x0
  4c:   d65f03c0        ret

This patch extends the support for reductions to include calls to internal
functions, in addition to assignment statements. For this purpose, in most
places where a tree_code would be used, a code_helper is used instead. The
code_helper can hold either a tree_code or a combined_fn.
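
Conceptually (this sketch is illustrative only and is not GCC's actual
code_helper definition), the wrapper is a small tagged value, so the reduction
code can pass one "code" around whether the reduction statement is an
assignment (PLUS_EXPR, MAX_EXPR, ...) or a call (fmin, fmax):

struct code_helper_sketch
{
  int is_call;   /* 0: holds a tree_code, 1: holds a combined_fn  */
  int code;      /* the underlying enum value                     */
};

static inline struct code_helper_sketch
from_tree_code (int tcode)
{
  struct code_helper_sketch c = { 0, tcode };
  return c;
}

static inline struct code_helper_sketch
from_combined_fn (int cfn)
{
  struct code_helper_sketch c = { 1, cfn };
  return c;
}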

This patch implements these tasks:

- Detect a reduction candidate based on a call to an internal function
  (currently only fmin or fmax).
- Process the reduction using code_helper. This means that at several places
  we have to check whether this is as assign-based reduction or a call-based
  reduction.
- Add new internal functions for the fmin/fmax reductions and for conditional
  fmin/fmax. In architectures where ieee fmin/fmax reductions are available, it
  is still possible to vectorize the loop using unconditional instructions.
- Update SVE's md to support these new reductions.
- Add new SVE tests to check that the optimal code is being generated.

I tested this patch in an aarch64 machine bootstrapping the compiler and
running the checks.
 
Alejandro
 
gcc/Changelog:
 
2018-12-18  Alejandro Martinez  

* gimple-match.h (code_helper_for_stmnt): New function to get a
code_helper from a statement.
* internal-fn.def: New reduc_fmax_scal and reduc_fmin_scal optabs for
ieee fp max/min reductions
* optabs.def: Likewise.
* tree-vect-loop.c (reduction_fn_for_scalar_code): Changed function
signature to accept code_helper instead of tree_code. Handle the
fmax/fmin builtins.
(needs_fold_left_reduction_p): Likewise.
(check_reduction_path): Likewise.
(vect_is_simple_reduction): Use code_helper instead of tree_code. Check
for supported call-based reductions. Extend support for both
assignment-based and call-based reductions.
(vect_model_reduction_cost): Extend cost-model support to call-based
reductions (just use MAX expression).
(get_initial_def_for_reduction): Use code_helper instead of tree_code.
Extend support for both assignment-based and call-based reductions.
(vect_create_epilog_for_reduction): Likewise.
(vectorizable_reduction): Likewise.
* tree-vectorizer.h: include gimple-match.h for code_helper. Use
code_helper in check_reduction_path signature.
* config/aarch64/aarch64-sve.md: Added define_expand to capture new
reduc_fmax_scal and reduc_fmin_scal optabs.
* config/aarch64/iterators.md: New FMAXMINNMV and fmaxmin_uns iterators
to support the new define_expand.
 
gcc/testsuite/Changelog:
 
2018-12-18  Alejandro Martinez  

* gcc.target/aarch64/sve/reduc_9.c: New test to check
SVE-vectorized reductions without -ffast-math.
* gcc.target/aarch64/sve/reduc_10.c: New test to check
SVE-vectorized builtin reductions without -ffast-math.


final.patch
Description: final.patch