On 16/07/15 12:28, Richard Biener wrote:
On Thu, Jul 16, 2015 at 12:23 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
On Thu, Jul 16, 2015 at 12:19 PM, Thomas Schwinge
<tho...@codesourcery.com> wrote:
Hi Tom!

On Thu, 16 Jul 2015 10:46:00 +0200, Richard Biener <richard.guent...@gmail.com> 
wrote:
On Wed, Jul 15, 2015 at 10:26 PM, Tom de Vries <tom_devr...@mentor.com> wrote:
I tried to parallelize this fortran test-case (based on autopar/outer-1.c),
[...]

So I wondered, why not always use the graphite dependency analysis in
parloops. (Of course you could use -floop-parallelize-all, but that also
changes the heuristic). So I wrote a patch for parloops to use graphite
dependency analysis by default (so without -floop-parallelize-all), but
while testing found out that all the reduction test-cases started failing
because the modifications graphite makes to the code messes up the parloops
reduction analysis.

Then I came up with this patch, which:
- first runs a parloops pass, restricted to reduction loops only,
- then runs graphite dependency analysis
- followed by a normal parloops pass run.

This way, we get to both:
- compile the reduction testcases as before, and
- profit from the better graphite dependency analysis otherwise.

graphite dependence analysis is too slow to be enabled unconditionally.
(read: hours in some simple cases - see bugzilla)

Haha, "cool"!  ;-)

Maybe it is still reasonable to use graphite to analyze the code inside
OpenACC kernels regions -- maybe such code can reasonably be expected to
not have the properties that make its analysis lengthy?  So, Tom, could
you please identify and check such PRs, to get an understanding of what
these properties are?

Like the one in PR62113 or 53852 or 59121.

Btw, it would be nice to handle this case (or at least figure out why we can't)
in GCCs dependence analysis.


I wrote an equivalent test-case in C:
...
$ cat src/gcc/testsuite/gcc.dg/autopar/outer-7.c
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details -fdump-tree-optimized" } */

void abort (void);

#define N 500

int
main (void)
{
  int i, j;
  int x[N][N];
  int *y = &x[0][0];

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      /* y[i * N + j] == x[i][j].  */
      y[i * N + j] = i + j + 3;

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      if (x[i][j] != i + j + 3)
        abort ();

  return 0;
}

/* Check that outer loop is parallelized.  */
/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" } } */
/* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" } } */
...

With -fno-tree-loop-ivcanon to keep original iteration order we get:
...
#(Data Ref:
#  bb: 4
#  stmt: *_15 = _17;
#  ref: *_15;
#  base_object: MEM[(int *)&x];
#  Access function 0: {{0B, +, 2000}_1, +, 4}_4
#)
#(Data Ref:
#  bb: 4
#  stmt: *_15 = _17;
#  ref: *_15;
#  base_object: MEM[(int *)&x];
#  Access function 0: {{0B, +, 2000}_1, +, 4}_4
#)
  access_fn_A: {{0B, +, 2000}_1, +, 4}_4
  access_fn_B: {{0B, +, 2000}_1, +, 4}_4

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  inner loop index: 0
  loop nest: (1 4 )
  distance_vector:   0   0
  distance_vector:   1 -500
  direction_vector:     =    =
  direction_vector:     +    -
)
  FAILED: data dependencies exist across iterations
...

If we replace the y[i * N + j] with x[i][j] we get instead:
...
#(Data Ref:
#  bb: 4
#  stmt: x[i_7][j_8] = _12;
#  ref: x[i_7][j_8];
#  base_object: x;
#  Access function 0: {0, +, 1}_4
#  Access function 1: {0, +, 1}_1
#)
#(Data Ref:
#  bb: 4
#  stmt: x[i_7][j_8] = _12;
#  ref: x[i_7][j_8];
#  base_object: x;
#  Access function 0: {0, +, 1}_4
#  Access function 1: {0, +, 1}_1
#)
  access_fn_A: {0, +, 1}_4
  access_fn_B: {0, +, 1}_4

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  access_fn_A: {0, +, 1}_1
  access_fn_B: {0, +, 1}_1

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  inner loop index: 0
  loop nest: (1 4 )
  distance_vector:   0   0
  direction_vector:     =    =
)
  SUCCESS: may be parallelized
parallelizing outer loop 8
...

Thanks,
- Tom

Reply via email to