[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-10-05 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

Tom de Vries  changed:

   What|Removed |Added

   Target Milestone|--- |11.0

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-10-05 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

Tom de Vries  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #17 from Tom de Vries  ---
Patch with test-case committed, marking resolved-fixed.

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-10-04 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #16 from CVS Commits  ---
The master branch has been updated by Tom de Vries :

https://gcc.gnu.org/g:ab3f4b27abe8abc947e84ef84bfc9a18797c5868

commit r11-3648-gab3f4b27abe8abc947e84ef84bfc9a18797c5868
Author: Tom de Vries 
Date:   Tue Sep 22 16:38:07 2020 +0200

[omp, ftracer] Don't duplicate blocks in SIMT region

When running the libgomp testsuite on x86_64-linux with nvptx accelerator
on
the test-case included in this patch, we run into:
...
FAIL: libgomp.fortran/pr95654.f90 -O3 -fomit-frame-pointer -funroll-loops \
  -fpeel-loops -ftracer -finline-functions  execution test
...

The test-case is a minimal version of this FAIL:
...
FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops
\
  -fpeel-loops -ftracer -finline-functions  execution test
...
but that one has stopped failing at commit c2ebf4f10de "openmp: Add support
for non-rect simd and improve collapsed simd support".

The problem is that ftracer duplicates a block containing
GOMP_SIMT_VOTE_ANY.

That is, before ftracer we have (dropping the GOMP_SIMT_ prefix):
...
bb4(ENTER_ALLOC)
*--+
|   \
|\
| v
| *
v bb8
*<*
bb5(VOTE_ANY)
*-+
| |
| |
| |
| |
| v
| *
v bb7(XCHG_IDX)
*<*
bb6(EXIT)
...

The XCHG_IDX internal-fn does inter-SIMT-lane communication, which for
nvptx
maps onto shfl, an operator which has the requirement that the warp
executing
the operator is convergent.  The warp diverges at bb4, and
reconverges at bb5, and does not diverge by going to bb7, so the shfl is
indeed executed by a convergent warp.

After ftracer, we have:
...
bb4(ENTER_ALLOC)
*--+
|   \
|\
| \
|  \
v   v
*   *
bb5(VOTE_ANY)   bb8(VOTE_ANY)
*   *
|\ /|
| \  ++ |
|  \/   |
|  /\   |
| /  +--v
|/  *
v   bb7(XCHG_IDX)
*<--*
bb6(EXIT)
...

The warp diverges again at bb5, but does not reconverge again before bb6,
so
the shfl is executed by a divergent warp, which causes the FAIL.

Fix this by making ftracer ignore blocks containing ENTER_ALLOC, VOTE_ANY
and
EXIT, effectively treating the SIMT region conservatively.

An argument can be made that the test needs to be added in a more
generic place, like gimple_can_duplicate_bb_p or some such, and that
ftracer
then needs to use the generic test.  But that's a discussion with a much
broader scope, so I'm leaving that for another patch.

Bootstrapped and reg-tested on x86_64-linux.

Build on x86_64-linux with nvptx accelerator, tested with libgomp.

gcc/ChangeLog:

PR fortran/95654
* tracer.c (ignore_bb_p): Ignore GOMP_SIMT_ENTER_ALLOC,
GOMP_SIMT_VOTE_ANY and GOMP_SIMT_EXIT.

libgomp/ChangeLog:

2020-10-05  Tom de Vries  

PR fortran/95654
* testsuite/libgomp.fortran/pr95654.f90: New test.

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-25 Thread burnus at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #15 from Tobias Burnus  ---
See also PR97203 + PR97203, and PR80053.

And the thread:
https://gcc.gnu.org/pipermail/gcc-patches/2020-September/thread.html#554054

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-22 Thread cvs-commit at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #14 from CVS Commits  ---
The master branch has been updated by Tobias Burnus :

https://gcc.gnu.org/g:f74c87f85f510248874cb90ad0b0527b015034b9

commit r11-3363-gf74c87f85f510248874cb90ad0b0527b015034b9
Author: Tobias Burnus 
Date:   Tue Sep 22 19:15:44 2020 +0200

libgomp.fortran/pr66199-5.f90: Make stop codes unique

libgomp/ChangeLog:

PR fortran/95654
* testsuite/libgomp.fortran/pr66199-5.f90: Make stop codes unique.

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-17 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #13 from Tom de Vries  ---
(In reply to Tom de Vries from comment #11)
> My guess at this point, is that duplicating the block with VOTE_ANY has the
> effect that the JIT compiler doesn't recognize control flow divergence
> before XCHG_IDX, and fails to insert the proper barrier.

Turns out, it's not that complicated.

Before ftracer we have:
...
   [local count: 268435456]:
  _30 = _18 + _27;
  _31 = _18 + _28;
  _46 = .GOMP_SIMT_ENTER_ALLOC (0, 1);
  _47 = .GOMP_SIMT_LANE ();
  _48 = (int) _47;
  _49 = _30 + _48;
  if (_31 > _49)
goto ; [87.50%]
  else
goto ; [12.50%]

   [local count: 117440512]:
  ...
  goto ; [100.00%]

   [local count: 134217728]:
  # _54 = PHI <_50(D)(4), _67(8)>
  # _34 = PHI <_49(4), _71(8)>
  _55 = _34 == 63;
  _56 = (int) _55;
  _57 = .GOMP_SIMT_VOTE_ANY (_56);
  if (_57 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 67108864]:
  _58 = .GOMP_SIMT_LAST_LANE (_56);
  _60 = .GOMP_SIMT_XCHG_IDX (_54, _58);
  _61 = _60 + 1;
  goto ; [100.00%]

   [local count: 268435456]:
  # d1_6 = PHI <_61(7), d1_29(D)(5)>
  *_46 ={v} {CLOBBER};
  .GOMP_SIMT_EXIT (_46);
  if (_31 == 32)
goto ; [34.00%]
  else
goto ; [66.00%]
...

At bb4 entry, we have unified control flow (that is, all threads in the warp
execute the same code in lockstep).

That's no longer the case at bb5/bb8.  In team 0, threads 0..15 execute the
loop body (bb8), and threads 16..31 don't.  In team 1, it's the opposite.

However, at bb5 the control flow from bb4 and bb8 joins, so control flow is
once again unified.

Then VOTE_ANY is executed in bb5, with team 1 subsequently going to the block
with XCHG_IDX (bb 7), and team 0, skipping straight to bb6.

After ftracer, we have:
...
   [local count: 16777216]:
  # _54 = PHI <_50(D)(4)>
  # _34 = PHI <_49(4)>
  _55 = _34 == 63;
  _56 = (int) _55;
  _57 = .GOMP_SIMT_VOTE_ANY (_56);
  if (_57 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 117440512]:
  ...
  _80 = _71 == 63;
  _81 = (int) _80;
  _82 = .GOMP_SIMT_VOTE_ANY (_81);
  if (_82 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]
...

Now control flow no longer is unified at bb 5, and consequently it's not in bb7
when executing XCHG_IDX.  And that's the root cause for the failure we're
seeing.

So, one way to handle this it to consider VOTE_ANY as a "join" to the "fork" of
ENTER_ALLOC (which means: don't duplicate, unless you duplicate the pair).

But, after reading this:
...
/* Allocate per-lane storage and begin non-uniform execution region.  */

static void
expand_GOMP_SIMT_ENTER_ALLOC (internal_fn, gcall *stmt)
...
and this:
...
/* Deallocate per-lane storage and leave non-uniform execution region.  */

static void
expand_GOMP_SIMT_EXIT (internal_fn, gcall *stmt)
...
it seems that spot is already taken.

So I wonder, isn't the problem that we do the lastprivate stuff before
SIMT_EXIT. [ Of course after fixing that we might run into SIMT_EXIT being
duplicated by ftracer. But there at least the description of the internal-fn
would make it clear why we don't want to duplicate it. ]

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-17 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #12 from Tom de Vries  ---
(In reply to Tom de Vries from comment #7)
> Minimal example after commit 91347c3bbf7 "Fortran: OpenMP - fix simd with
> (last)private (PR97061)":
> ...
> ! { dg-do run } 
> 
> program main
>   implicit none
>   integer :: d1
>   !$omp target map(from: d1)
> 
>   !$omp teams distribute parallel do simd default(none) lastprivate(d1) 
> 
>   do d1 = 0, 31
>   end do
>   !$omp end target  
> 
>   if (d1 /= 32) stop 3
> end program
> ...

To further reduce: set num_teams to 2, and num_threads to 1:
...
  !$omp teams distribute parallel do simd default(none) lastprivate(d1)
num_teams (2) num_threads (1)
...
Makes it easier to reason about the code.

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-17 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

Tom de Vries  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #11 from Tom de Vries  ---
So, at this point we know that duplicating the BB containing VOTE_ANY causes
problems in executing.  But AFAIU, we do not know why.

Is VOTE_ANY not supposed to be duplicated by design? If so, is there any
documentation of that design, that explains that?

At the nvptx level, VOTE_ANY translates to vote.ballot.b32, which does
cross-lane communication, but has defined behaviour in divergent mode AFAICT.
>From that perspective at least, there's no problem with duplicating VOTE_ANY.

My guess at this point, is that duplicating the block with VOTE_ANY has the
effect that the JIT compiler doesn't recognize control flow divergence before
XCHG_IDX, and fails to insert the proper barrier.

And XCHG_IDX translates to shfl.idx.b32, which has undefined behaviour in
divergent mode.

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-17 Thread burnus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #10 from Tobias Burnus  ---
(In reply to Tobias Burnus from comment #9)
> See also thread at:
>   https://gcc.gnu.org/pipermail/gcc-patches/2020-September/thread.html#554054

Regarding the patch there, the proper way is to adapt can_duplicate_block_p,
which calls the associated cfg_hooks->can_duplicate_block_p.

Currently, there is the stub (in gimple_cfg_hooks)
  tree-cfg.c:  gimple_can_duplicate_bb_p
and (for rtl_cfg_hooks and cfg_layout_rtl_cfg_hooks)
  cfgrtl.c:cfg_layout_can_duplicate_bb_p
the latter avoids duplicate tablejumps and calls
targetm.cannot_copy_insn_p.

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-16 Thread burnus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

Tobias Burnus  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2020-09-16

--- Comment #9 from Tobias Burnus  ---
See also thread at:
  https://gcc.gnu.org/pipermail/gcc-patches/2020-September/thread.html#554054

Mentioned there RELATED PR:
  PR 80053 – similar issue related to label with address taken

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-16 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #8 from Tom de Vries  ---
Created attachment 49228
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49228&action=edit
Dumps for failing test-case (no collapse case)

(In reply to Tom de Vries from comment #7)
> Minimal example after commit 91347c3bbf7 "Fortran: OpenMP - fix simd with
> (last)private (PR97061)":
> ...
> ! { dg-do run } 
> 
> program main
>   implicit none
>   integer :: d1
>   !$omp target map(from: d1)
> 
>   !$omp teams distribute parallel do simd default(none) lastprivate(d1) 
> 
>   do d1 = 0, 31
>   end do
>   !$omp end target  
> 
>   if (d1 /= 32) stop 3
> end program
> ...

Corresponding dumps.

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-16 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #7 from Tom de Vries  ---
Minimal example after commit 91347c3bbf7 "Fortran: OpenMP - fix simd with
(last)private (PR97061)":
...
! { dg-do run } 
program main
  implicit none
  integer :: d1
  !$omp target map(from: d1)
  !$omp teams distribute parallel do simd default(none) lastprivate(d1) 
  do d1 = 0, 31
  end do
  !$omp end target  
  if (d1 /= 32) stop 3
end program
...

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-16 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #6 from Tom de Vries  ---
Created attachment 49227
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49227&action=edit
Dumps for failing test-case

(In reply to Tom de Vries from comment #5)
> Minimal example:
> ...
> ! { dg-do run } 
> 
> program main
>   implicit none
>   integer :: d1, d2
>   !$omp target map(from: d1)
> 
>   !$omp teams distribute parallel do simd default(none) lastprivate(d1) &   
> 
>   !$omp&  collapse(2)   
> 
>   do d1 = 0, 31
> do d2 = 0, 31
> end do
>   end do
>   !$omp end target  
> 
>   if (d1 /= 32) stop 3
> end program
> ...

Corresponding dumps.

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-16 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

Tom de Vries  changed:

   What|Removed |Added

 CC||vries at gcc dot gnu.org

--- Comment #5 from Tom de Vries  ---
Minimal example:
...
! { dg-do run } 
program main
  implicit none
  integer :: d1, d2
  !$omp target map(from: d1)
  !$omp teams distribute parallel do simd default(none) lastprivate(d1) &   
  !$omp&  collapse(2)   
  do d1 = 0, 31
do d2 = 0, 31
end do
  end do
  !$omp end target  
  if (d1 /= 32) stop 3
end program
...

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-16 Thread burnus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #4 from Tobias Burnus  ---
(In reply to Tobias Burnus from comment #3)
> Created attachment 49222 [details]
> Slightly reduced example, compile with gfortran -fopenmp -O1 -ftracer

On the host side, a single BB gets inserted – but more interesting is the
device side:

For the nvptx lto1, all dumped trees are identical until profile_estimate.

The result (the non-64 value) slightly varies, but one can disable disable app
three passes until tree-tracer to get still a fail:

gfortran -fopenmp -ftracer -O1 test.f90
  -foffload='-fdisable-tree-dom3 -fdisable-tree-profile_estimate
-fdisable-tree-fixup_cfg3 -fdisable-tree-strlen1 -fdisable-tree-copyprop5
-fdisable-tree-wrestr^Ct -fdisable-tree-dse3 -fdump-tree-all
-fdisable-tree-cddce3 -fdisable-tree-forwprop4 -fdisable-tree-phiopt4
-fdisable-tree-fab1 -fdisable-tree-dce7 -fdisable-tree-crited1
-fdisable-tree-uncprop1 -fdisable-tree-local-pure-const2 -fdisable-tree-nrv
-fdisable-tree-isel -fdisable-tree-optimized'

Adding -foffload=-fdisable-tree-tracer unsurprisingly makes all tree dumps
identical and yields the expected 64.

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-15 Thread burnus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #3 from Tobias Burnus  ---
Created attachment 49222
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49222&action=edit
Slightly reduced example, compile with gfortran -fopenmp -O1 -ftracer

Some testing; with gfortran -fopenmp -O1 -ftracer it shows something like:
 -31 =?= 64
The expected output is the last value of the loop iteration ("lastprivate"),
i.e. (b-1+1)*2 = 32*2 = 64. – When running the target section on the host
(-foffload=disable) it works, which indicates either a shared-memory vs.
non-shared issue or an nvptx problem.

At a glance, the original dump looks fine; it seems as if either the
  .omp_data_i_28(D)->d1 = d1_9;
  .omp_data_i_28(D)->d2 = d2_11;
is not executed or the PHI handling goes wrong.

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-09-09 Thread burnus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654
Bug 95654 depends on bug 95109, which changed state.

Bug 95109 Summary: [11 regression] ICE in gfortran.dg/gomp/target1.f90 after 
r11-349
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95109

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-06-12 Thread burnus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #2 from Tobias Burnus  ---
(In reply to Thomas Schwinge from comment #1)
> Have you verified that it's the same underlying issue
It's not but would otherwise be a duplicate.

> or do you just want to wait for PR95109 being resolved before analyzing this 
> one
The latter. However, it is probably a separate issue.

> And, if you're not going to work on these items now, is there any time scale?
No – there seems to be currently an exponential growth of work items.

[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test

2020-06-12 Thread tschwinge at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #1 from Thomas Schwinge  ---
Tobias, I see you've unassigned yourself here, and set Depends on: PR95109. 
Have you verified that it's the same underlying issue, or do you just want to
wait for PR95109 being resolved before analyzing this one here, as it might be
the same underlying issue?  And, if you're not going to work on these items
now, is there any time scale?