https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108
--- Comment #8 from Jeff Hammond <jeff.science at gmail dot com> ---
I tried with schedule(dynamic) and schedule(static,n) for n=1,8,100.  None of
this made a positive difference.  Is that expected?  I'm happy to make any
code changes that don't break correctness and are still readable.

I don't fully understand the semantics of ordered, but would it be illegal
for the compiler to generate blocked code that still respects the ordering of
the iteration space?  I achieve exactly that by implementing the blocking
explicitly with tasks.  The following code performs very well, at least with
appropriate tile sizes (mc,nc):

  _Pragma("omp master")
  {
    for (auto i=1; i<m; i+=mc) {
      for (auto j=1; j<n; j+=nc) {
        // each tile depends on the corner value and on the tiles above
        // and to the left of it
        _Pragma("omp task depend(in:grid[0],grid[(i-mc)*n+j],grid[i*n+(j-nc)],grid[(i-mc)*n+(j-nc)]) depend(out:grid[i*n+j])")
        sweep_tile(i, std::min(m,i+mc), j, std::min(n,j+nc), m, n, grid);
      }
    }
    // lic and ljc come from the enclosing code (presumably the start
    // indices of the last tile); this final task waits for the last tile
    // and wraps the result back around to the corner
    _Pragma("omp task depend(in:grid[(lic-1)*n+(ljc)]) depend(out:grid[0])")
    grid[0*n+0] = -grid[(m-1)*n+(n-1)];
  }
  } // closes the enclosing parallel region (not shown here)

  inline void sweep_tile(size_t startm, size_t endm,
                         size_t startn, size_t endn,
                         size_t m, size_t n, double grid[])
  {
    //_Pragma("omp critical")
    //std::cout << startm << "," << endm << "," << startn << ","
    //          << endn << "," << m << "," << n << std::endl;
    for (auto i=startm; i<endm; i++) {
      for (auto j=startn; j<endn; j++) {
        grid[i*n+j] = grid[(i-1)*n+j] + grid[i*n+(j-1)] - grid[(i-1)*n+(j-1)];
      }
    }
  }

It's not optimal, but parallelizing over the anti-diagonals is a legal
schedule for these loops that runs much faster than the doacross one.  Is the
compiler not allowed to do something like this with ordered loops?

  // sweep the anti-diagonals in the upper-left triangle (x+y = 2 .. n)
  for (auto j=1; j<n; j++) {
    _Pragma("omp for")
    for (auto i=1; i<=j; i++) {
      auto x = i;
      auto y = j-i+1;
      grid[x*n+y] = grid[(x-1)*n+y] + grid[x*n+(y-1)] - grid[(x-1)*n+(y-1)];
    }
  }
  // sweep the remaining anti-diagonals (x+y = n+1 .. 2n-2); the implicit
  // barrier at the end of each omp for orders consecutive diagonals
  for (auto j=n-2; j>=1; j--) {
    _Pragma("omp for")
    for (auto i=1; i<=j; i++) {
      auto x = n+i-j-1;
      auto y = n-i;
      grid[x*n+y] = grid[(x-1)*n+y] + grid[x*n+(y-1)] - grid[(x-1)*n+(y-1)];
    }
  }
  _Pragma("omp master")
  grid[0*n+0] = -grid[(n-1)*n+(n-1)];
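
For reference, the doacross version under discussion looks roughly like this
(a sketch reconstructed from the p2p kernel, not the exact benchmark source;
the schedule clauses mentioned above would go on the omp for):

  _Pragma("omp for ordered(2)")  // schedule(dynamic)/schedule(static,n) tried here
  for (auto i=1; i<m; i++) {
    for (auto j=1; j<n; j++) {
      // wait for the iterations above and to the left
      _Pragma("omp ordered depend(sink:i-1,j) depend(sink:i,j-1)")
      grid[i*n+j] = grid[(i-1)*n+j] + grid[i*n+(j-1)] - grid[(i-1)*n+(j-1)];
      // signal completion of iteration (i,j)
      _Pragma("omp ordered depend(source)")
    }
  }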
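
For anyone who wants to try the anti-diagonal schedule, here is a minimal
self-contained driver (a sketch, not the benchmark code; the grid size and
boundary values are illustrative):

  #include <cstdio>
  #include <vector>

  int main()
  {
    const int n = 1000;                    // illustrative problem size
    std::vector<double> g(n*n, 0.0);
    for (int j=0; j<n; j++) g[j] = j;      // top boundary row
    for (int i=0; i<n; i++) g[i*n] = i;    // left boundary column
    double * grid = g.data();

    _Pragma("omp parallel")
    {
      // anti-diagonals x+y = 2 .. n
      for (int j=1; j<n; j++) {
        _Pragma("omp for")
        for (int i=1; i<=j; i++) {
          auto x = i;
          auto y = j-i+1;
          grid[x*n+y] = grid[(x-1)*n+y] + grid[x*n+(y-1)] - grid[(x-1)*n+(y-1)];
        }
      }
      // anti-diagonals x+y = n+1 .. 2n-2
      for (int j=n-2; j>=1; j--) {
        _Pragma("omp for")
        for (int i=1; i<=j; i++) {
          auto x = n+i-j-1;
          auto y = n-i;
          grid[x*n+y] = grid[(x-1)*n+y] + grid[x*n+(y-1)] - grid[(x-1)*n+(y-1)];
        }
      }
    }

    // with these boundaries the recurrence gives grid[i][j] = i+j,
    // so the corner should be 2*(n-1)
    std::printf("grid[%d][%d] = %f (expected %d)\n",
                n-1, n-1, grid[(n-1)*n+(n-1)], 2*(n-1));
    return 0;
  }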