https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125717
Bug ID: 125717
Summary: [8? Regression] PR83064 impact on SPEC2026 pot3d_s
benchmark (r8-7827-gbc436e10e0b892)
Product: gcc
Version: 17.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: dhruvc at gcc dot gnu.org
Target Milestone: ---
In gcc/fortran/trans-stmt.cc, a comment in gfc_trans_forall_loop () at
line 4388 says that do-concurrent loops cannot be annotated with
annot_expr_parallel_kind because the auto-parallelizer is not capable of
dealing with it. As a result, we are seeing a 4x slowdown in the SPEC2026
pot3d_s benchmark on AArch64.
A hot loop in the application is of the form:
  do concurrent (k:<>, j:<>, i:<>)
    ...
  enddo
Because the loop is annotated with annot_expr_ivdep_kind, the compiler
eventually reaches a point where it cannot prove that the outermost loop is
parallelizable and gives up on it. It parallelizes the j:<> loop instead, and
thus dispatches to the OpenMP runtime once for each outer iteration.
If gfc_trans_forall_loop is changed to emit annot_expr_parallel_kind instead,
the problem goes away and the performance is recovered. The main question at
this point is whether this change is valid now: has the auto-parallelizer or
the frontend improved to the point where the annotation is correct?
Here's a snippet that demonstrates the issue:
subroutine ax(n, x, y)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: x(n,n)
  real, intent(out) :: y(n,n)
  integer :: i, j
  do concurrent (j=2:n-1, i=2:n-1)
    y(i,j) = x(i-1,j) + x(i+1,j) + x(i,j-1) + x(i,j+1)
  end do
end subroutine
Compiling this with annot_expr_ivdep_kind ends up parallelizing the inner loop,
whereas annot_expr_parallel_kind ends up parallelizing the outer loop.
Flags: -O2 -ftree-parallelize-loops=8 -fopt-info-optimized
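To make the difference between the two outcomes concrete, here is a rough
hand-written sketch using explicit OpenMP directives. It is not what the
auto-parallelizer emits; ax_outer, ax_inner and ax_variants are hypothetical
names, and the sketch assumes j is the outer loop of the generated nest (the
report does not spell out the loop order). Build with -fopenmp to enable the
directives; without it they are plain comments and the code runs serially.

```fortran
! Illustration only: hand-written stand-ins for the two parallelization
! outcomes. Names are hypothetical and j is ASSUMED to be the outer loop.
module ax_variants
  implicit none
contains

  ! Outer loop parallelized: one parallel region covers the whole nest.
  subroutine ax_outer(n, x, y)
    integer, intent(in) :: n
    real, intent(in) :: x(n,n)
    real, intent(out) :: y(n,n)
    integer :: i, j
    !$omp parallel do private(i)
    do j = 2, n-1
      do i = 2, n-1
        y(i,j) = x(i-1,j) + x(i+1,j) + x(i,j-1) + x(i,j+1)
      end do
    end do
    !$omp end parallel do
  end subroutine ax_outer

  ! Inner loop parallelized: a parallel region is entered n-2 times,
  ! once per iteration of the sequential j loop.
  subroutine ax_inner(n, x, y)
    integer, intent(in) :: n
    real, intent(in) :: x(n,n)
    real, intent(out) :: y(n,n)
    integer :: i, j
    do j = 2, n-1
      !$omp parallel do
      do i = 2, n-1
        y(i,j) = x(i-1,j) + x(i+1,j) + x(i,j-1) + x(i,j+1)
      end do
      !$omp end parallel do
    end do
  end subroutine ax_inner

end module ax_variants

program compare
  use ax_variants
  implicit none
  integer, parameter :: n = 64
  real :: x(n,n), y1(n,n), y2(n,n)

  call random_number(x)
  call ax_outer(n, x, y1)
  call ax_inner(n, x, y2)

  ! Only the interior is written by either variant, so compare only that.
  if (any(y1(2:n-1,2:n-1) /= y2(2:n-1,2:n-1))) error stop "variants disagree"
  print *, "variants agree"
end program compare
```

Both variants compute the same result; the difference is that ax_inner pays
the runtime dispatch cost n-2 times, whereas ax_outer pays it once.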
PS: There's also an issue where this fix doesn't survive LTO, because the
`can_be_parallel` member of `class loop` isn't streamed out.