https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108

            Bug ID: 81108
           Summary: OpenMP doacross (omp do/for ordered) performance
           Product: gcc
           Version: 7.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jeff.science at gmail dot com
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Created attachment 41560
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41560&action=edit
sequential code

When I use "doacross" OpenMP parallelism as described in
https://developers.redhat.com/blog/2016/03/22/what-is-new-in-openmp-4-5-3/, the
performance is ~30x worse than sequential execution.

This code can be parallelized effectively by blocking the wavefront, which I
have implemented as well.  I see the same behavior in Fortran and C++, so I
believe the problem is largely independent of the front-end.  Clang has the
same issue; there we determined that the dependence analysis is applied to the
collapsed loop, which inhibits all parallelism, although that alone does not
explain why the doacross version is so much slower than serial execution.

If it helps, the project is https://github.com/ParRes/Kernels, although I will
attach all of the code here.

# Sequential

      do j=2,n
        do i=2,m
          grid(i,j) = grid(i-1,j) + grid(i,j-1) - grid(i-1,j-1)
        enddo
      enddo

$ ./p2p 40 1000 1000
               Parallel Research Kernels
Fortran Serial pipeline execution on 2D 
Number of iterations     =       40
Grid sizes               =     1000    1000
traverse in the m dimension
Solution validates
Rate (MFlop/s):    866.508739 Avg time (s):   0.002303

# OpenMP "doacross"

    !$omp do ordered(2) collapse(2)
    do j=2,n
      do i=2,m
        !$omp ordered depend(sink:j,i-1) depend(sink:j-1,i) &
        !$omp&  depend(sink:j-1,i-1)
        grid(i,j) = grid(i-1,j) + grid(i,j-1) - grid(i-1,j-1)
        !$omp ordered depend(source)
      enddo
    enddo
    !$omp end do

$ ./p2p-openmp-doacross 40 1000 1000

               Parallel Research Kernels
Fortran OpenMP pipeline execution on 2D 
Number of threads        =        4
Number of iterations     =       40
Grid sizes               =     1000    1000
Solution validates
Rate (MFlop/s):     17.855468 Avg time (s):   0.111787
