[Bug libgomp/81108] OpenMP doacross (omp do/for ordered) performance

2017-06-27 Thread jeff.science at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #10 from Jeff Hammond  ---
Thanks for the feedback.  I agree that it is a huge amount of work to optimize
this.

For what it's worth, GCC and Clang perform about the same.  Unfortunately, I do
not have the means to evaluate IBM XLF, which may have an optimized
implementation of this
(https://rd.springer.com/chapter/10.1007%2F978-3-642-32820-6_23), so I do not
have a good sense of what is achievable here beyond what I can hand-optimize.

I have no objection if you want to close this as invalid.

[Bug libgomp/81108] OpenMP doacross (omp do/for ordered) performance

2017-06-27 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #9 from Jakub Jelinek  ---
With schedule(static), schedule(dynamic), etc., I believe the compiler is not
allowed to do so, at least when it cannot prove the change would be
unobservable.  So, if you have
int cnt = 0;
#pragma omp parallel for schedule(static) ordered(2) collapse(2) \
    firstprivate(cnt)
for (j = ...)
  for (i = ...)
    {
      #pragma omp ordered depend(sink:j,i-1) depend(sink:j-1,i) \
          depend(sink:j-1,i-1)
      arr[i][j] = omp_get_thread_num ();
      arr2[i][j] = cnt++;
      grid[...] = ...;
      #pragma omp ordered depend(source)
    }
then you really can't schedule arbitrarily: the standard specifies the
requirements, and the above code observes both which thread each logical
iteration is assigned to and in what order the iterations run.
The only cases where you have complete freedom are no schedule clause at all
(then what happens is implementation defined) or schedule(auto) (likewise).
But it still requires advanced loop optimizations to find the optimal schedule
for a particular loop (other loops would need different scheduling decisions),
and the schedule would also need to take into account the constraints of the
doacross post/wait operations (it has to ensure threads don't get stuck waiting
for an iteration that hasn't been scheduled yet).  Plus, unless we want complex
scheduling functions in the library, the compiler has to map the logical
iterations of the original loop to the logical iterations of the parallelized
loop, and similarly map the indices.  It is doable, but a huge amount of work.

[Bug libgomp/81108] OpenMP doacross (omp do/for ordered) performance

2017-06-27 Thread jeff.science at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #8 from Jeff Hammond  ---
I tried with schedule(dynamic) and schedule(static,n) for n=1,8,100.  None of
these made a positive difference.  Is that expected?  I'm happy to make any code
changes that don't break correctness and are still readable.

I don't fully understand the semantics of ordered, but is it illegal for the
compiler to generate blocked code that still orders the iteration space?  That
is what I achieve by implementing the blocking explicitly with tasks.  The
following code performs very well, at least with appropriate tiling (mc,nc).

  _Pragma("omp master")
  {
    ...
  }
  for (auto j=...; j>=1; j--) {
    _Pragma("omp for")
    for (auto i=1; i<=j; i++) {
      auto x = n+i-j-1;
      auto y = n-i;
      grid[x*n+y] = grid[(x-1)*n+y] + grid[x*n+(y-1)] -
                    grid[(x-1)*n+(y-1)];
    }
  }
  _Pragma("omp master")
  grid[0*n+0] = -grid[(n-1)*n+(n-1)];

[Bug libgomp/81108] OpenMP doacross (omp do/for ordered) performance

2017-06-27 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #7 from Jakub Jelinek  ---
GCC implements what is required if there is schedule(static), which is the
implementation-defined schedule right now.  That schedule determines how the
iterations are distributed to the different threads, and I don't see how you
could get good performance with that distribution (if you have ideas, feel free
to explain them here).  In order to perform well on this testcase (which doesn't
look very suitable for doacross, because the computation is inexpensive and so
the needed synchronization dominates the execution time), we'd have to use a
different schedule, specific to this exact loop (proceed diagonally from 2,2
to n,m or something like that).

[Bug libgomp/81108] OpenMP doacross (omp do/for ordered) performance

2017-06-15 Thread jeff.science at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #6 from Jeff Hammond  ---
Created attachment 41566
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41566&action=edit
tasks C++

[Bug libgomp/81108] OpenMP doacross (omp do/for ordered) performance

2017-06-15 Thread jeff.science at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #5 from Jeff Hammond  ---
Created attachment 41565
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41565&action=edit
header for C++ codes

[Bug libgomp/81108] OpenMP doacross (omp do/for ordered) performance

2017-06-15 Thread jeff.science at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #4 from Jeff Hammond  ---
Created attachment 41564
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41564&action=edit
doacross C++

[Bug libgomp/81108] OpenMP doacross (omp do/for ordered) performance

2017-06-15 Thread jeff.science at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #3 from Jeff Hammond  ---
Created attachment 41563
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41563&action=edit
sequential C++

[Bug libgomp/81108] OpenMP doacross (omp do/for ordered) performance

2017-06-15 Thread jeff.science at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #2 from Jeff Hammond  ---
Created attachment 41562
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41562&action=edit
tasks Fortran

This code runs faster than serial (assuming blocking is used), showing that
this pattern can be parallelized.

[Bug libgomp/81108] OpenMP doacross (omp do/for ordered) performance

2017-06-15 Thread jeff.science at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

--- Comment #1 from Jeff Hammond  ---
Created attachment 41561
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41561&action=edit
doacross Fortran