Re: [RFC][vect] BB SLP reduction prototype

2020-06-15 Thread Richard Biener
On Tue, 9 Jun 2020, Andre Vieira (lists) wrote:

> Hi,
> 
> So this is my rework of the code you sent me, I have not included the
> 'permute' code you included as I can't figure out what it is meant to be
> doing. Maybe something to look at later.
> 
> I have also included three tests that show it working for some simple cases
> and even a nested one.
> 
> Unfortunately it will not handle other simple cases where reassoc doesn't put
> the reduction in the form of:
> sum0 = a + b;
> sum1 = c + sum0;
> ...
> 
> For instance a testcase I have been looking at is:
> unsigned int u32_single_abs_sum (unsigned int * a, unsigned int * b)
> {
>   unsigned int sub0 = a[0] - b[0];
>   unsigned int sub1 = a[1] - b[1];
>   unsigned int sub2 = a[2] - b[2];
>   unsigned int sub3 = a[3] - b[3];
>   unsigned int sum = sub0 + sub1;
>   sum += sub2;
>   sum += sub3;
>   return sum;
> }
> 
> Unfortunately, the code that reaches slp will look like:
>   _1 = *a_10(D);
>   _2 = *b_11(D);
>   _3 = MEM[(unsigned int *)a_10(D) + 4B];
>   _4 = MEM[(unsigned int *)b_11(D) + 4B];
>   _5 = MEM[(unsigned int *)a_10(D) + 8B];
>   _6 = MEM[(unsigned int *)b_11(D) + 8B];
>   _7 = MEM[(unsigned int *)a_10(D) + 12B];
>   _8 = MEM[(unsigned int *)b_11(D) + 12B];
>   _28 = _1 - _2;
>   _29 = _3 + _28;
>   _30 = _29 - _4;
>   _31 = _5 + _30;
>   _32 = _31 - _6;
>   _33 = _7 + _32;
>   sum_18 = _33 - _8;
>   return sum_18;
> 
> Which doesn't have the format expected as I described above... I am wondering
> how to teach it to support this. Maybe starting with your suggestion of making
> plus_expr and minus_expr have the same hash, so it groups all these statements
> together, might be a start, but you'd still need to 'rebalance' the tree
> somehow. I need to give this a bit more thought, but I wanted to share what
> I have so far.

Yeah, I guess at the point you have the ordered_set of reduction
components you want to do a DFS-like walk (but starting where?)
figuring the "interesting" SSA edges and see which component you
can actually reach (and from which "final" component).  I guess
the final reduction component should be always the latest stmt
(even if you associate as (a[0] + a[1] + a[2]) - (b[0] + b[1] + b[2])).
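To make the target shape concrete, here is a hand-reassociated version of the testcase from the quoted dump: the four subtractions become independent components and the reduction is the plain linear chain the prototype can already match (function and variable names are illustrative, not from the patch):

```c
/* Hand-reassociated form of the u32_single_abs_sum testcase above:
   the four subtractions are independent SLP lanes, and the reduction
   over them is a plain sum0 = a + b; sum1 = c + sum0; ... chain.  */
unsigned int
u32_single_abs_sum_reassoc (unsigned int *a, unsigned int *b)
{
  unsigned int sub0 = a[0] - b[0];
  unsigned int sub1 = a[1] - b[1];
  unsigned int sub2 = a[2] - b[2];
  unsigned int sub3 = a[3] - b[3];
  unsigned int sum0 = sub0 + sub1;
  unsigned int sum1 = sum0 + sub2;
  unsigned int sum2 = sum1 + sub3;
  return sum2;
}
```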

I've attached the current version of the permutation code; originally
that wasn't the interesting piece, but the reorg in SLP discovery was.
But now seeing your vectorize_slp_instance_root_stmt hunk and reflecting
what I intend to do with the permutation node we may want to represent
the reduction as a distinct SLP node reducing from N to 1 lane.
I want to use the permute node in the patch as a way to
do concat / split and lane select as well.  A reduction node would
differ in that it has a reduction operation and that it would reduce
all input lanes (but for _Complex we'd want 4 to 2 lane reductions
as well I guess).  Still the "root stmt" idea wasn't so much to
gobble actual code generation there but to record the output SSA
edges of the SLP graph entry.
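As a scalar-code illustration of the two reduction-node shapes mentioned above, here is a sketch using GNU C vector extensions (not vectorizer internals; the function names are made up):

```c
typedef float v4sf __attribute__ ((vector_size (16)));
typedef float v2sf __attribute__ ((vector_size (8)));

/* N-to-1 lanes: the reduction node consumes all input lanes.  */
float
reduce_all_lanes (v4sf v)
{
  return (v[0] + v[1]) + (v[2] + v[3]);
}

/* 4-to-2 lanes: the _Complex-style case, reducing even lanes
   (e.g. real parts) and odd lanes (imaginary parts) separately.  */
v2sf
reduce_to_two_lanes (v4sf v)
{
  return (v2sf) { v[0] + v[2], v[1] + v[3] };
}
```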

In vect_analyze_slp where you set up things for SLP discovery
you shouldn't need to build a REDUC_GROUP but instead run
SLP discovery on the set of components only and if successful
build the reduction SLP node manually.  IIRC the patch I sent
you last time (with the permutation code) did arrange
vect_analyze_slp to do that for stores for example.  That
way SLP discovery should be unchanged and most of the reduction
validation should happen when analyzing the reduction operation.

For initial discovery in vect_slp_check_for_reductions we probably
want to factor out some common meta-data about (SLP) reductions
so we can share validation and maybe parts of the discovery code
with the loop aware variant [I'd like to get rid of the
REDUC_GROUP chaining for example].

> The code is severely lacking in comments for now btw...

Which is why I refrain from commenting on the actual patch details ;)

As a note for the casual reader - the vect_slp_check_for_reductions
was supposed to be a "cheap" O(n log n) way to discover general 
basic-block vectorization opportunities without "obvious" root
stmts to start greedy searching (currently roots include store groups
and vector typed CTORs).  While Andre only looks for reductions
in those candidates I originally invented the multi-level hashing scheme
to discover vectorization opportunities where both the ultimate vector
input and the output get built/decomposed from/to components.
It's actually simpler to leverage just that and not try to get
the reduction part working I guess ;)  For the reduction example
above it would vectorize the subtraction of a[] and b[] and then
extract all lanes from the result, doing the reduction operation
in scalar code.
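In GNU C vector-extension terms, that fallback would look roughly like this (an illustrative sketch of the expected code shape, not actual vectorizer output):

```c
typedef unsigned int v4si __attribute__ ((vector_size (16)));

/* The subtraction of a[] and b[] is vectorized; all lanes of the
   result are then extracted and reduced in scalar code.  */
unsigned int
u32_single_abs_sum_fallback (unsigned int *a, unsigned int *b)
{
  v4si va = { a[0], a[1], a[2], a[3] };   /* vector built from components */
  v4si vb = { b[0], b[1], b[2], b[3] };
  v4si sub = va - vb;                     /* the only vector operation */
  /* decompose back into components; the reduction stays scalar */
  return sub[0] + sub[1] + sub[2] + sub[3];
}
```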

Richard.

From e8421124df9acda96c9c40853ad7e7d0ce97c115 Mon Sep 17 00:00:00 2001
From: Richard Biener 
Date: Wed, 25 Mar 2020 14:42:49 +0100
Subject: [PATCH] remove SLP_TREE_TWO_OPERATORS

This removes the SLP_TREE_TWO_OPERATORS hack in favor of having
explicit SLP nodes for both computations and the 

Re: [RFC][vect] BB SLP reduction prototype

2020-06-09 Thread Andre Vieira (lists)
The 'you' here is Richi, which Richi is probably aware but maybe not the 
rest of the list :')



[RFC][vect] BB SLP reduction prototype

2020-06-09 Thread Andre Vieira (lists)


diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-reduc-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-reduc-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..66b53ff9bb1e77414e7493c07ab87d46f4d33651
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-reduc-1.c
@@ -0,0 +1,66 @@
+/* { dg-require-effective-target vect_int } */
+#include <stdarg.h>
+#include "tree-vect.h"
+extern int abs (int);
+
+#define ABS4(N)\
+  sum += abs (a[N]);   \
+  sum += abs (a[N+1]); \
+  sum += abs (a[N+2]); \
+  sum += abs (a[N+3]);
+
+#define ABS8(N)  \
+  ABS4(N)\
+  ABS4(N+4)
+
+#define ABS16(N)  \
+  ABS8(N)\
+  ABS8(N+8)
+
+__attribute__ ((noipa)) unsigned char
+u8_single_abs_sum (signed char * a)
+{
+  unsigned char sum = 0;
+  ABS16(0)
+  return sum;
+}
+
+__attribute__ ((noipa)) unsigned short
+u16_single_abs_sum (signed short * a)
+{
+  unsigned short sum = 0;
+  ABS8(0)
+  return sum;
+}
+
+__attribute__ ((noipa)) unsigned int
+u32_single_abs_sum (signed int * a)
+{
+  unsigned int sum = 0;
+  ABS4(0)
+  return sum;
+}
+
+signed char u8[16] = {0, 1, 2, 3, 4, 5, 6, -7, -8, -9, -10, -11, -12, -13,
+   -14, -15};
+signed short u16[8] = {0, 1, 2, 3, 4, -5, -6, -7};
+signed int u32[4] = {-10, -20, 30, 40};
+
+
+int main (void)
+{
+  check_vect ();
+
+  if (u8_single_abs_sum (&(u8[0])) != 120)
+    abort ();
+
+  if (u16_single_abs_sum (&(u16[0])) != 28)
+    abort ();
+
+  if (u32_single_abs_sum (&(u32[0])) != 100)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 3 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-reduc-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-reduc-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..298a22cfef687f6634d61bf808a41942c3ce4a85
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-reduc-2.c
@@ -0,0 +1,82 @@
+/* { dg-require-effective-target vect_int } */
+#include <stdarg.h>
+#include "tree-vect.h"
+extern int abs (int);
+
+#define ABS4(N)\
+  sum += abs (a[N]);   \
+  sum += abs (a[N+1]); \
+  sum += abs (a[N+2]); \
+  sum += abs (a[N+3]);
+
+#define ABS8(N)  \
+  ABS4(N)\
+  ABS4(N+4)
+
+#define ABS16(N)  \
+  ABS8(N)\
+  ABS8(N+8)
+
+__attribute__ ((noipa)) unsigned char
+u8_double_abs_sum (signed char * a)
+{
+  unsigned char sum = 0;
+  ABS16(0)
+  ABS16(16)
+  return sum;
+}
+
+__attribute__ ((noipa)) unsigned short
+u16_double_abs_sum (signed short * a)
+{
+  unsigned short sum = 0;
+  ABS16(0)
+  return sum;
+}
+
+__attribute__ ((noipa)) unsigned int
+u32_double_abs_sum (signed int * a)
+{
+  unsigned int sum = 0;
+  ABS8(0)
+  return sum;
+}
+
+__attribute__ ((noipa)) unsigned int
+u32_triple_abs_sum (signed int * a)
+{
+  unsigned int sum = 0;
+  ABS8(0)
+  ABS4(8)
+  return sum;
+}
+
+signed char u8[32] = {0, 1, 2, 3, 4, 5, 6, -7, -8, -9, -10, -11, -12, -13,
+ -14, -15, 0, 1, 2, 3, 4, 5, 6, -7, -8, -9, -10, -11, -12, -13,
+ -14, -30};
+
+signed short u16[16] = {0, 1, 2, 3, 4, -5, -6, -7, 10, 20, 30, 40, -10, -20,
+  -30, -40};
+signed int u32[16] = {-10, -20, 30, 40, 100, 200, -300, -500, -600, -700, 1000,
+2000};
+