https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108
--- Comment #8 from Jeff Hammond <jeff.science at gmail dot com> ---
I tried with schedule(dynamic) and schedule(static,n) for n=1,8,100.  None of
this made a positive difference.  Is that expected?  I'm happy to make any
code changes that don't break correctness and are still readable.

I don't fully understand the semantics of ordered, but would it be illegal
for the compiler to generate blocked code that still respects the ordering of
the iteration space?  I achieve exactly that by implementing the blocking
explicitly with tasks.  The following code performs very well, at least with
appropriate tile sizes (mc,nc):

  _Pragma("omp master")
  {
    for (auto i=1; i<m; i+=mc) {
      for (auto j=1; j<n; j+=nc) {
        // each tile depends on the corner value and on the tiles above
        // and to the left of it
        _Pragma("omp task depend(in:grid[0],grid[(i-mc)*n+j],grid[i*n+(j-nc)],grid[(i-mc)*n+(j-nc)]) depend(out:grid[i*n+j])")
        sweep_tile(i, std::min(m,i+mc), j, std::min(n,j+nc), m, n, grid);
      }
    }
    // lic and ljc come from the enclosing code (presumably the start
    // indices of the last tile); this final task waits for the last tile
    // and wraps the result back around to the corner
    _Pragma("omp task depend(in:grid[(lic-1)*n+(ljc)]) depend(out:grid[0])")
    grid[0*n+0] = -grid[(m-1)*n+(n-1)];
  }
  } // closes the enclosing parallel region (not shown here)

  inline void sweep_tile(size_t startm, size_t endm,
                         size_t startn, size_t endn,
                         size_t m, size_t n, double grid[])
  {
    //_Pragma("omp critical")
    //std::cout << startm << "," << endm << "," << startn << ","
    //          << endn << "," << m << "," << n << std::endl;
    for (auto i=startm; i<endm; i++) {
      for (auto j=startn; j<endn; j++) {
        grid[i*n+j] = grid[(i-1)*n+j] + grid[i*n+(j-1)] - grid[(i-1)*n+(j-1)];
      }
    }
  }

It's not optimal, but parallelizing over the anti-diagonals is a legal
schedule for these loops that runs much faster than the doacross one.  Is the
compiler not allowed to do something like this with ordered loops?

  // sweep the anti-diagonals in the upper-left triangle (x+y = 2 .. n)
  for (auto j=1; j<n; j++) {
    _Pragma("omp for")
    for (auto i=1; i<=j; i++) {
      auto x = i;
      auto y = j-i+1;
      grid[x*n+y] = grid[(x-1)*n+y] + grid[x*n+(y-1)] - grid[(x-1)*n+(y-1)];
    }
  }
  // sweep the remaining anti-diagonals (x+y = n+1 .. 2n-2); the implicit
  // barrier at the end of each omp for orders consecutive diagonals
  for (auto j=n-2; j>=1; j--) {
    _Pragma("omp for")
    for (auto i=1; i<=j; i++) {
      auto x = n+i-j-1;
      auto y = n-i;
      grid[x*n+y] = grid[(x-1)*n+y] + grid[x*n+(y-1)] - grid[(x-1)*n+(y-1)];
    }
  }
  _Pragma("omp master")
  grid[0*n+0] = -grid[(n-1)*n+(n-1)];
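
For reference, the doacross version under discussion looks roughly like this
(a sketch reconstructed from the p2p kernel, not the exact benchmark source;
the schedule clauses mentioned above would go on the omp for):

  _Pragma("omp for ordered(2)")  // schedule(dynamic)/schedule(static,n) tried here
  for (auto i=1; i<m; i++) {
    for (auto j=1; j<n; j++) {
      // wait for the iterations above and to the left
      _Pragma("omp ordered depend(sink:i-1,j) depend(sink:i,j-1)")
      grid[i*n+j] = grid[(i-1)*n+j] + grid[i*n+(j-1)] - grid[(i-1)*n+(j-1)];
      // signal completion of iteration (i,j)
      _Pragma("omp ordered depend(source)")
    }
  }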
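
For anyone who wants to try the anti-diagonal schedule, here is a minimal
self-contained driver (a sketch, not the benchmark code; the grid size and
boundary values are illustrative):

  #include <cstdio>
  #include <vector>

  int main()
  {
    const int n = 1000;                    // illustrative problem size
    std::vector<double> g(n*n, 0.0);
    for (int j=0; j<n; j++) g[j] = j;      // top boundary row
    for (int i=0; i<n; i++) g[i*n] = i;    // left boundary column
    double * grid = g.data();

    _Pragma("omp parallel")
    {
      // anti-diagonals x+y = 2 .. n
      for (int j=1; j<n; j++) {
        _Pragma("omp for")
        for (int i=1; i<=j; i++) {
          auto x = i;
          auto y = j-i+1;
          grid[x*n+y] = grid[(x-1)*n+y] + grid[x*n+(y-1)] - grid[(x-1)*n+(y-1)];
        }
      }
      // anti-diagonals x+y = n+1 .. 2n-2
      for (int j=n-2; j>=1; j--) {
        _Pragma("omp for")
        for (int i=1; i<=j; i++) {
          auto x = n+i-j-1;
          auto y = n-i;
          grid[x*n+y] = grid[(x-1)*n+y] + grid[x*n+(y-1)] - grid[(x-1)*n+(y-1)];
        }
      }
    }

    // with these boundaries the recurrence gives grid[i][j] = i+j,
    // so the corner should be 2*(n-1)
    std::printf("grid[%d][%d] = %f (expected %d)\n",
                n-1, n-1, grid[(n-1)*n+(n-1)], 2*(n-1));
    return 0;
  }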