https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108
Bug ID: 81108
Summary: OpenMP doacross (omp do/for ordered) performance
Product: gcc
Version: 7.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: jeff.science at gmail dot com
CC: jakub at gcc dot gnu.org
Target Milestone: ---

Created attachment 41560
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41560&action=edit
sequential code

When I use "doacross" OpenMP parallelism as described in
https://developers.redhat.com/blog/2016/03/22/what-is-new-in-openmp-4-5-3/,
the performance is ~30x worse than sequential execution. This code can be
effectively parallelized by blocking the wavefront, which I have implemented
as well; a sketch of that blocked variant appears at the end of this report.

I see the same behavior in Fortran and C++, so I believe the problem is
largely independent of the front end. Clang has the same issue, which we
determined is caused by the dependence analysis being applied to the
collapsed loop, inhibiting all parallelism, although I don't understand why
this is so much worse than a serial implementation.

If it helps, the project is https://github.com/ParRes/Kernels, although I
will attach all of the code here.

# Sequential

  do j=2,n
    do i=2,m
      grid(i,j) = grid(i-1,j) + grid(i,j-1) - grid(i-1,j-1)
    enddo
  enddo

$ ./p2p 40 1000 1000
Parallel Research Kernels
Fortran Serial pipeline execution on 2D
Number of iterations = 40
Grid sizes           = 1000 1000
Solution validates
Rate (MFlop/s): 866.508739  Avg time (s): 0.002303

# OpenMP "doacross"

  !$omp do ordered(2) collapse(2)
  do j=2,n
    do i=2,m
      !$omp ordered depend(sink:j,i-1) depend(sink:j-1,i) depend(sink:j-1,i-1)
      grid(i,j) = grid(i-1,j) + grid(i,j-1) - grid(i-1,j-1)
      !$omp ordered depend(source)
    enddo
  enddo
  !$omp end do

$ ./p2p-openmp-doacross 40 1000 1000
Parallel Research Kernels
Fortran OpenMP pipeline execution on 2D
Number of threads    = 4
Number of iterations = 40
Grid sizes           = 1000 1000
Solution validates
Rate (MFlop/s): 17.855468  Avg time (s): 0.111787
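
A plausible reason the doacross version is slower than serial, not merely
unsped-up: with collapse(2), each of the roughly one million inner iterations
still executes the ordered post/wait handshake even when the dependences leave
no parallelism to exploit, so the synchronization cost is paid per point. On
that note, the depend(sink:j-1,i-1) clause should be redundant: iteration
(j,i-1) itself waits on (j-1,i-1), so that dependence is implied transitively.
A two-sink variant (a minimal sketch, not taken from the attached code) trims
the per-iteration wait count but keeps the same semantics:

  !$omp do ordered(2) collapse(2)
  do j=2,n
    do i=2,m
      ! (j-1,i-1) finishes before (j,i-1) does, so two sinks suffice
      !$omp ordered depend(sink:j,i-1) depend(sink:j-1,i)
      grid(i,j) = grid(i-1,j) + grid(i,j-1) - grid(i-1,j-1)
      !$omp ordered depend(source)
    enddo
  enddo
  !$omp end do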
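
For reference, a minimal sketch of the blocked-wavefront variant mentioned
above (illustrative only: bs, nbi, nbj, d, ib, jb, and the bounds are my
names here, not the exact PRK code). Blocks on the same anti-diagonal of the
block grid carry no mutual dependences, so each diagonal runs as an ordinary
worksharing loop, and the implicit barrier at !$omp end do orders successive
diagonals:

  bs  = 64                     ! block size: a tuning parameter (assumed)
  nbi = (m-1+bs-1)/bs          ! blocks along i (interior is i=2..m)
  nbj = (n-1+bs-1)/bs          ! blocks along j (interior is j=2..n)
  !$omp parallel private(jb,i0,i1,j0,j1)
  do d = 2, nbi+nbj            ! anti-diagonal index of the block grid
    !$omp do
    do ib = max(1,d-nbj), min(nbi,d-1)
      jb = d - ib              ! block (ib,jb) lies on diagonal d
      i0 = 2 + (ib-1)*bs       ! first interior row of block ib
      i1 = min(m, 1+ib*bs)
      j0 = 2 + (jb-1)*bs       ! first interior column of block jb
      j1 = min(n, 1+jb*bs)
      do j = j0, j1
        do i = i0, i1
          grid(i,j) = grid(i-1,j) + grid(i,j-1) - grid(i-1,j-1)
        enddo
      enddo
    enddo
    !$omp end do               ! implicit barrier: diagonal d done before d+1
  enddo
  !$omp end parallel

Each block depends only on its left and upper neighbors, which sit on the
previous diagonal, so the barrier between diagonals is the only
synchronization; this is what recovers parallelism that the per-point ordered
construct cannot.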