[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 Tom de Vries changed: What|Removed |Added Target Milestone|--- |11.0
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 Tom de Vries changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #17 from Tom de Vries --- Patch with test-case committed, marking resolved-fixed.
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #16 from CVS Commits --- The master branch has been updated by Tom de Vries : https://gcc.gnu.org/g:ab3f4b27abe8abc947e84ef84bfc9a18797c5868 commit r11-3648-gab3f4b27abe8abc947e84ef84bfc9a18797c5868 Author: Tom de Vries Date: Tue Sep 22 16:38:07 2020 +0200 [omp, ftracer] Don't duplicate blocks in SIMT region When running the libgomp testsuite on x86_64-linux with nvptx accelerator on the test-case included in this patch, we run into: ... FAIL: libgomp.fortran/pr95654.f90 -O3 -fomit-frame-pointer -funroll-loops \ -fpeel-loops -ftracer -finline-functions execution test ... The test-case is a minimal version of this FAIL: ... FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops \ -fpeel-loops -ftracer -finline-functions execution test ... but that one has stopped failing at commit c2ebf4f10de "openmp: Add support for non-rect simd and improve collapsed simd support". The problem is that ftracer duplicates a block containing GOMP_SIMT_VOTE_ANY. That is, before ftracer we have (dropping the GOMP_SIMT_ prefix): ... bb4(ENTER_ALLOC) *--+ | \ |\ | v | * v bb8 *<* bb5(VOTE_ANY) *-+ | | | | | | | | | v | * v bb7(XCHG_IDX) *<* bb6(EXIT) ... The XCHG_IDX internal-fn does inter-SIMT-lane communication, which for nvptx maps onto shfl, an operator which has the requirement that the warp executing the operator is convergent. The warp diverges at bb4, and reconverges at bb5, and does not diverge by going to bb7, so the shfl is indeed executed by a convergent warp. After ftracer, we have: ... bb4(ENTER_ALLOC) *--+ | \ |\ | \ | \ v v * * bb5(VOTE_ANY) bb8(VOTE_ANY) * * |\ /| | \ ++ | | \/ | | /\ | | / +--v |/ * v bb7(XCHG_IDX) *<--* bb6(EXIT) ... The warp diverges again at bb5, but does not reconverge again before bb6, so the shfl is executed by a divergent warp, which causes the FAIL. Fix this by making ftracer ignore blocks containing ENTER_ALLOC, VOTE_ANY and EXIT, effectively treating the SIMT region conservatively. An argument can be made that the test needs to be added in a more generic place, like gimple_can_duplicate_bb_p or some such, and that ftracer then needs to use the generic test. But that's a discussion with a much broader scope, so I'm leaving that for another patch. Bootstrapped and reg-tested on x86_64-linux. Build on x86_64-linux with nvptx accelerator, tested with libgomp. gcc/ChangeLog: PR fortran/95654 * tracer.c (ignore_bb_p): Ignore GOMP_SIMT_ENTER_ALLOC, GOMP_SIMT_VOTE_ANY and GOMP_SIMT_EXIT. libgomp/ChangeLog: 2020-10-05 Tom de Vries PR fortran/95654 * testsuite/libgomp.fortran/pr95654.f90: New test.
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #15 from Tobias Burnus --- See also PR97203 + PR97203, and PR80053. And the thread: https://gcc.gnu.org/pipermail/gcc-patches/2020-September/thread.html#554054
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #14 from CVS Commits --- The master branch has been updated by Tobias Burnus : https://gcc.gnu.org/g:f74c87f85f510248874cb90ad0b0527b015034b9 commit r11-3363-gf74c87f85f510248874cb90ad0b0527b015034b9 Author: Tobias Burnus Date: Tue Sep 22 19:15:44 2020 +0200 libgomp.fortran/pr66199-5.f90: Make stop codes unique libgomp/ChangeLog: PR fortran/95654 * testsuite/libgomp.fortran/pr66199-5.f90: Make stop codes unique.
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #13 from Tom de Vries --- (In reply to Tom de Vries from comment #11) > My guess at this point, is that duplicating the block with VOTE_ANY has the > effect that the JIT compiler doesn't recognize control flow divergence > before XCHG_IDX, and fails to insert the proper barrier. Turns out, it's not that complicated. Before ftracer we have: ... [local count: 268435456]: _30 = _18 + _27; _31 = _18 + _28; _46 = .GOMP_SIMT_ENTER_ALLOC (0, 1); _47 = .GOMP_SIMT_LANE (); _48 = (int) _47; _49 = _30 + _48; if (_31 > _49) goto ; [87.50%] else goto ; [12.50%] [local count: 117440512]: ... goto ; [100.00%] [local count: 134217728]: # _54 = PHI <_50(D)(4), _67(8)> # _34 = PHI <_49(4), _71(8)> _55 = _34 == 63; _56 = (int) _55; _57 = .GOMP_SIMT_VOTE_ANY (_56); if (_57 != 0) goto ; [50.00%] else goto ; [50.00%] [local count: 67108864]: _58 = .GOMP_SIMT_LAST_LANE (_56); _60 = .GOMP_SIMT_XCHG_IDX (_54, _58); _61 = _60 + 1; goto ; [100.00%] [local count: 268435456]: # d1_6 = PHI <_61(7), d1_29(D)(5)> *_46 ={v} {CLOBBER}; .GOMP_SIMT_EXIT (_46); if (_31 == 32) goto ; [34.00%] else goto ; [66.00%] ... At bb4 entry, we have unified control flow (that is, all threads in the warp execute the same code in lockstep). That's no longer the case at bb5/bb8. In team 0, threads 0..15 execute the loop body (bb8), and threads 16..31 don't. In team 1, it's the opposite. However, at bb5 the control flow from bb4 and bb8 joins, so control flow is once again unified. Then VOTE_ANY is executed in bb5, with team 1 subsequently going to the block with XCHG_IDX (bb 7), and team 0, skipping straight to bb6. After ftracer, we have: ... [local count: 16777216]: # _54 = PHI <_50(D)(4)> # _34 = PHI <_49(4)> _55 = _34 == 63; _56 = (int) _55; _57 = .GOMP_SIMT_VOTE_ANY (_56); if (_57 != 0) goto ; [50.00%] else goto ; [50.00%] [local count: 117440512]: ... _80 = _71 == 63; _81 = (int) _80; _82 = .GOMP_SIMT_VOTE_ANY (_81); if (_82 != 0) goto ; [50.00%] else goto ; [50.00%] ... Now control flow no longer is unified at bb 5, and consequently it's not in bb7 when executing XCHG_IDX. And that's the root cause for the failure we're seeing. So, one way to handle this it to consider VOTE_ANY as a "join" to the "fork" of ENTER_ALLOC (which means: don't duplicate, unless you duplicate the pair). But, after reading this: ... /* Allocate per-lane storage and begin non-uniform execution region. */ static void expand_GOMP_SIMT_ENTER_ALLOC (internal_fn, gcall *stmt) ... and this: ... /* Deallocate per-lane storage and leave non-uniform execution region. */ static void expand_GOMP_SIMT_EXIT (internal_fn, gcall *stmt) ... it seems that spot is already taken. So I wonder, isn't the problem that we do the lastprivate stuff before SIMT_EXIT. [ Of course after fixing that we might run into SIMT_EXIT being duplicated by ftracer. But there at least the description of the internal-fn would make it clear why we don't want to duplicate it. ]
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #12 from Tom de Vries --- (In reply to Tom de Vries from comment #7) > Minimal example after commit 91347c3bbf7 "Fortran: OpenMP - fix simd with > (last)private (PR97061)": > ... > ! { dg-do run } > > program main > implicit none > integer :: d1 > !$omp target map(from: d1) > > !$omp teams distribute parallel do simd default(none) lastprivate(d1) > > do d1 = 0, 31 > end do > !$omp end target > > if (d1 /= 32) stop 3 > end program > ... To further reduce: set num_teams to 2, and num_threads to 1: ... !$omp teams distribute parallel do simd default(none) lastprivate(d1) num_teams (2) num_threads (1) ... Makes it easier to reason about the code.
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 Tom de Vries changed: What|Removed |Added CC||rguenth at gcc dot gnu.org --- Comment #11 from Tom de Vries --- So, at this point we know that duplicating the BB containing VOTE_ANY causes problems in executing. But AFAIU, we do not know why. Is VOTE_ANY not supposed to be duplicated by design? If so, is there any documentation of that design, that explains that? At the nvptx level, VOTE_ANY translates to vote.ballot.b32, which does cross-lane communication, but has defined behaviour in divergent mode AFAICT. >From that perspective at least, there's no problem with duplicating VOTE_ANY. My guess at this point, is that duplicating the block with VOTE_ANY has the effect that the JIT compiler doesn't recognize control flow divergence before XCHG_IDX, and fails to insert the proper barrier. And XCHG_IDX translates to shfl.idx.b32, which has undefined behaviour in divergent mode.
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #10 from Tobias Burnus --- (In reply to Tobias Burnus from comment #9) > See also thread at: > https://gcc.gnu.org/pipermail/gcc-patches/2020-September/thread.html#554054 Regarding the patch there, the proper way is to adapt can_duplicate_block_p, which calls the associated cfg_hooks->can_duplicate_block_p. Currently, there is the stub (in gimple_cfg_hooks) tree-cfg.c: gimple_can_duplicate_bb_p and (for rtl_cfg_hooks and cfg_layout_rtl_cfg_hooks) cfgrtl.c:cfg_layout_can_duplicate_bb_p the latter avoids duplicate tablejumps and calls targetm.cannot_copy_insn_p.
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 Tobias Burnus changed: What|Removed |Added Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Last reconfirmed||2020-09-16 --- Comment #9 from Tobias Burnus --- See also thread at: https://gcc.gnu.org/pipermail/gcc-patches/2020-September/thread.html#554054 Mentioned there RELATED PR: PR 80053 – similar issue related to label with address taken
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #8 from Tom de Vries --- Created attachment 49228 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49228&action=edit Dumps for failing test-case (no collapse case) (In reply to Tom de Vries from comment #7) > Minimal example after commit 91347c3bbf7 "Fortran: OpenMP - fix simd with > (last)private (PR97061)": > ... > ! { dg-do run } > > program main > implicit none > integer :: d1 > !$omp target map(from: d1) > > !$omp teams distribute parallel do simd default(none) lastprivate(d1) > > do d1 = 0, 31 > end do > !$omp end target > > if (d1 /= 32) stop 3 > end program > ... Corresponding dumps.
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #7 from Tom de Vries --- Minimal example after commit 91347c3bbf7 "Fortran: OpenMP - fix simd with (last)private (PR97061)": ... ! { dg-do run } program main implicit none integer :: d1 !$omp target map(from: d1) !$omp teams distribute parallel do simd default(none) lastprivate(d1) do d1 = 0, 31 end do !$omp end target if (d1 /= 32) stop 3 end program ...
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #6 from Tom de Vries --- Created attachment 49227 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49227&action=edit Dumps for failing test-case (In reply to Tom de Vries from comment #5) > Minimal example: > ... > ! { dg-do run } > > program main > implicit none > integer :: d1, d2 > !$omp target map(from: d1) > > !$omp teams distribute parallel do simd default(none) lastprivate(d1) & > > !$omp& collapse(2) > > do d1 = 0, 31 > do d2 = 0, 31 > end do > end do > !$omp end target > > if (d1 /= 32) stop 3 > end program > ... Corresponding dumps.
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 Tom de Vries changed: What|Removed |Added CC||vries at gcc dot gnu.org --- Comment #5 from Tom de Vries --- Minimal example: ... ! { dg-do run } program main implicit none integer :: d1, d2 !$omp target map(from: d1) !$omp teams distribute parallel do simd default(none) lastprivate(d1) & !$omp& collapse(2) do d1 = 0, 31 do d2 = 0, 31 end do end do !$omp end target if (d1 /= 32) stop 3 end program ...
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #4 from Tobias Burnus --- (In reply to Tobias Burnus from comment #3) > Created attachment 49222 [details] > Slightly reduced example, compile with gfortran -fopenmp -O1 -ftracer On the host side, a single BB gets inserted – but more interesting is the device side: For the nvptx lto1, all dumped trees are identical until profile_estimate. The result (the non-64 value) slightly varies, but one can disable disable app three passes until tree-tracer to get still a fail: gfortran -fopenmp -ftracer -O1 test.f90 -foffload='-fdisable-tree-dom3 -fdisable-tree-profile_estimate -fdisable-tree-fixup_cfg3 -fdisable-tree-strlen1 -fdisable-tree-copyprop5 -fdisable-tree-wrestr^Ct -fdisable-tree-dse3 -fdump-tree-all -fdisable-tree-cddce3 -fdisable-tree-forwprop4 -fdisable-tree-phiopt4 -fdisable-tree-fab1 -fdisable-tree-dce7 -fdisable-tree-crited1 -fdisable-tree-uncprop1 -fdisable-tree-local-pure-const2 -fdisable-tree-nrv -fdisable-tree-isel -fdisable-tree-optimized' Adding -foffload=-fdisable-tree-tracer unsurprisingly makes all tree dumps identical and yields the expected 64.
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #3 from Tobias Burnus --- Created attachment 49222 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49222&action=edit Slightly reduced example, compile with gfortran -fopenmp -O1 -ftracer Some testing; with gfortran -fopenmp -O1 -ftracer it shows something like: -31 =?= 64 The expected output is the last value of the loop iteration ("lastprivate"), i.e. (b-1+1)*2 = 32*2 = 64. – When running the target section on the host (-foffload=disable) it works, which indicates either a shared-memory vs. non-shared issue or an nvptx problem. At a glance, the original dump looks fine; it seems as if either the .omp_data_i_28(D)->d1 = d1_9; .omp_data_i_28(D)->d2 = d2_11; is not executed or the PHI handling goes wrong.
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 Bug 95654 depends on bug 95109, which changed state. Bug 95109 Summary: [11 regression] ICE in gfortran.dg/gomp/target1.f90 after r11-349 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95109 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #2 from Tobias Burnus --- (In reply to Thomas Schwinge from comment #1) > Have you verified that it's the same underlying issue It's not but would otherwise be a duplicate. > or do you just want to wait for PR95109 being resolved before analyzing this > one The latter. However, it is probably a separate issue. > And, if you're not going to work on these items now, is there any time scale? No – there seems to be currently an exponential growth of work items.
[Bug fortran/95654] nvptx offloading: FAIL: libgomp.fortran/pr66199-5.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654 --- Comment #1 from Thomas Schwinge --- Tobias, I see you've unassigned yourself here, and set Depends on: PR95109. Have you verified that it's the same underlying issue, or do you just want to wait for PR95109 being resolved before analyzing this one here, as it might be the same underlying issue? And, if you're not going to work on these items now, is there any time scale?