https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125717

            Bug ID: 125717
           Summary: [8? Regression] PR83064 impact on SPEC2026 pot3d_s
                    benchmark (r8-7827-gbc436e10e0b892)
           Product: gcc
           Version: 17.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dhruvc at gcc dot gnu.org
  Target Milestone: ---

In gcc/fortran/trans-stmt.cc, a comment in gfc_trans_forall_loop () at line 4388
says that do-concurrent loops cannot be annotated with annot_expr_parallel_kind
because the auto-parallelizer is not capable of dealing with it, so they are
annotated with annot_expr_ivdep_kind instead. On AArch64, this causes a 4x
slowdown in the SPEC2026 pot3d_s benchmark.

A hot loop in the application is of the form:

do concurrent (k:<>, j:<>, i:<>)
  ...
enddo

Because the loop is annotated with annot_expr_ivdep_kind, the compiler cannot
prove that the outermost loop is parallelizable and gives up on it. It
parallelizes the j:<> loop instead, and so dispatches to the OpenMP runtime
once for each outer iteration.

If gfc_trans_forall_loop is changed to emit annot_expr_parallel_kind instead,
the outermost loop is parallelized and the performance is recovered. The main
question is whether this change is valid now: has the auto-parallelizer or the
frontend improved to the point where the annotation is correct?

Here's a snippet that demonstrates the issue:

  subroutine ax(n, x, y)
    implicit none
    integer, intent(in) :: n
    real, intent(in)    :: x(n,n)
    real, intent(out)   :: y(n,n)
    integer :: i, j
    do concurrent (j=2:n-1, i=2:n-1)
       y(i,j) = x(i-1,j) + x(i+1,j) + x(i,j-1) + x(i,j+1)
    end do
  end subroutine

Compiling this with annot_expr_ivdep_kind ends up parallelizing the inner loop,
whereas annot_expr_parallel_kind ends up parallelizing the outer loop.

Flags: -O2 -ftree-parallelize-loops=8 -fopt-info-optimized
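
To make the cost difference concrete, here is a rough hand-written sketch of
the two outcomes using explicit OpenMP directives. It is only an illustration,
not what the auto-parallelizer emits. The names ax_variants, ax_outer, ax_inner
and check_variants are made up for this sketch, and the nest order (j outside,
i inside) is an assumption chosen for the example rather than a statement
about which do-concurrent index the frontend makes outermost. Build with
gfortran -fopenmp.

```fortran
! Hypothetical hand-written variants of the stencil above (not compiler output).
module ax_variants
  implicit none
contains
  ! Outer loop parallel: one parallel region covers the whole nest.
  subroutine ax_outer(n, x, y)
    integer, intent(in) :: n
    real, intent(in)    :: x(n,n)
    real, intent(out)   :: y(n,n)
    integer :: i, j
    !$omp parallel do private(i)
    do j = 2, n-1
       do i = 2, n-1
          y(i,j) = x(i-1,j) + x(i+1,j) + x(i,j-1) + x(i,j+1)
       end do
    end do
    !$omp end parallel do
  end subroutine ax_outer

  ! Inner loop parallel: a parallel region is entered n-2 times,
  ! once per iteration of the sequential outer loop.
  subroutine ax_inner(n, x, y)
    integer, intent(in) :: n
    real, intent(in)    :: x(n,n)
    real, intent(out)   :: y(n,n)
    integer :: i, j
    do j = 2, n-1
       !$omp parallel do
       do i = 2, n-1
          y(i,j) = x(i-1,j) + x(i+1,j) + x(i,j-1) + x(i,j+1)
       end do
       !$omp end parallel do
    end do
  end subroutine ax_inner
end module ax_variants

program check_variants
  use ax_variants
  implicit none
  integer, parameter :: n = 64
  real :: x(n,n), y1(n,n), y2(n,n)
  call random_number(x)
  call ax_outer(n, x, y1)
  call ax_inner(n, x, y2)
  ! Only the interior points are assigned by either variant.
  if (any(y1(2:n-1,2:n-1) /= y2(2:n-1,2:n-1))) error stop "variants differ"
  print *, "variants agree"
end program check_variants
```

Both variants compute the same values. The difference is that ax_inner pays
the cost of entering and leaving a parallel region on every outer iteration,
which is the kind of overhead described above for the ivdep case.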

PS: There's also an issue where this fix doesn't survive LTO, because the
`can_be_parallel` member of `class loop` isn't streamed out.
