[Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #14 from Andrew Pinski ---
Created attachment 62398
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62398&action=edit
Unincluded file which is able to compile with newer compilers

Compiles on trunk with `-march=skylake -D__assert_rtn=__assert_fail`.
Note that the __assert_rtn define is only needed on darwin; it is
unnecessary when compiling against glibc (linux).
Martin Jambor changed:

           What    |Removed |Added
----------------------------------------------------------------
                CC|        |19373742 at buaa dot edu.cn

--- Comment #13 from Martin Jambor ---
*** Bug 110282 has been marked as a duplicate of this bug. ***
--- Comment #12 from CVS Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:b58dc0b803057c0e6032e0d9bd92cd834f72c75c

commit r12-248-gb58dc0b803057c0e6032e0d9bd92cd834f72c75c
Author: Richard Biener
Date:   Tue Apr 27 14:32:27 2021 +0200

    tree-optimization/99912 - delete trivially dead stmts during DSE

    DSE performs a backwards walk over stmts removing stores but it
    leaves removing resulting dead SSA defs to later passes.  This eats
    into its own alias walking budget if the removed stores kept loads
    live.  The following patch adds removal of trivially dead SSA defs
    which helps in this situation and reduces the amount of garbage
    followup passes need to deal with.

    2021-04-28  Richard Biener

            PR tree-optimization/99912
            * tree-ssa-dse.c (dse_dom_walker::m_need_cfg_cleanup): New.
            (dse_dom_walker::todo): Likewise.
            (dse_dom_walker::dse_optimize_stmt): Move VDEF check to the
            caller.
            (dse_dom_walker::before_dom_children): Remove trivially dead
            SSA defs and schedule CFG cleanup if we removed all PHIs in
            a block.
            (pass_dse::execute): Get TODO as computed by the DOM walker
            and return it.  Wipe dominator info earlier.

            * gcc.dg/pr95580.c: Disable DSE.
            * gcc.dg/Wrestrict-8.c: Place a use after each memcpy.
            * c-c++-common/ubsan/overflow-negate-3.c: Make asms volatile
            to prevent them from being removed.
            * c-c++-common/ubsan/overflow-sub-4.c: Likewise.
--- Comment #11 from Erik Schnetter ---
The number of active local variables is likely much larger than the number
of registers, and I expect there to be a lot of spilling.  I hope that the
compiler is clever about changing the order in which expressions are
evaluated to reduce spilling as much as possible.

Because the loop is so large, I split it into two, each calculating about
half of the output variables.  The code here looks at one of the loops.
To simplify the code, each loop still loads all variables (via masked
loads), but may not use all of them.  The unused masked loads do not
surprise me per se, but I expect the compiler to remove them.
--- Comment #10 from Richard Biener ---
So with the latest patches I now see real spilling dominating (oops).

I also see, on the GIMPLE level,

  _64425 = (unsigned long) SR.3210_122492;
  _64416 = _64425 + ivtmp.5307_121062;
  _62971 = (double &) _64416;
  __builtin_ia32_maskloadpd256 (_62971, _61513);

that is, dead masked loads (that's odd).  There's also still some dead /
redundant code from the abstraction to remove.
--- Comment #9 from CVS Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:8d4c374c4419a8751cfae18d6b58169c62dea49f

commit r12-156-g8d4c374c4419a8751cfae18d6b58169c62dea49f
Author: Richard Biener
Date:   Tue Apr 27 14:27:40 2021 +0200

    tree-optimization/99912 - schedule another TODO_remove_unused_locals

    This makes sure to remove unused locals and prune CLOBBERs after the
    first scalar cleanup phase after IPA optimizations.  On the testcase
    in the PR this results in 8000 CLOBBERs removed which in turn
    unleashes more DSE which otherwise hits its walking limit of 256 too
    early on this testcase.

    2021-04-27  Richard Biener

            PR tree-optimization/99912
            * passes.def: Add comment about new TODO_remove_unused_locals.
            * tree-stdarg.c (pass_data_stdarg): Run
            TODO_remove_unused_locals at start.
--- Comment #8 from CVS Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:d8e1f1d24179690fd9c0f63c27b12e030010d9ea

commit r12-155-gd8e1f1d24179690fd9c0f63c27b12e030010d9ea
Author: Richard Biener
Date:   Wed Apr 7 12:09:44 2021 +0200

    tree-optimization/99912 - schedule DSE before SRA

    For the testcase in the PR the main SRA pass is unable to do some
    important scalarizations because dead stores of addresses make the
    candidate variables disqualified.  The following patch adds another
    DSE pass before SRA forming a DCE/DSE pair and moves the DSE pass
    that is currently closely after SRA up to after the next DCE pass,
    forming another DCE/DSE pair now residing after PRE.

    2021-04-07  Richard Biener

            PR tree-optimization/99912
            * passes.def (pass_all_optimizations): Add pass_dse before
            the first pass_dce, move the first pass_dse before the
            pass_dce following pass_pre.

            * gcc.dg/tree-ssa/ldist-33.c: Disable PRE and LIM.
            * gcc.dg/tree-ssa/pr96789.c: Adjust dump file scanned.
            * gcc.dg/tree-ssa/ssa-dse-28.c: Likewise.
            * gcc.dg/tree-ssa/ssa-dse-29.c: Likewise.
--- Comment #7 from Richard Biener ---
I've posted a series of patches that will improve things for GCC 12:

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/567743.html
https://gcc.gnu.org/pipermail/gcc-patches/2021-April/567731.html
https://gcc.gnu.org/pipermail/gcc-patches/2021-April/567744.html

The last will eventually see adjustments and/or a different
implementation approach.
Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------
          Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
  Last reconfirmed|                              |2021-04-07
    Ever confirmed|0                             |1
            Status|UNCONFIRMED                   |ASSIGNED

--- Comment #6 from Richard Biener ---
(In reply to Erik Schnetter from comment #5)
> As you suggested, the problem is probably not caused by register spills,
> but by stores into a struct that are not optimized away.  In this case,
> the respective struct elements are unused in the code.
>
> I traced the results of the first __builtin_ia32_maskloadpd256:
>
>   _63940 = __builtin_ia32_maskloadpd256 (_63955, prephitmp_86203);
>   MEM [(struct mat3 *)&vars + 992B] = _63940;
>   _178613 = .FMA (_63940, _64752, _178609);
>   MEM [(struct mat3 *)&vars + 1312B] = _63940;
>
> The respective struct locations (+ 992B, + 1312B) are indeed not used
> anywhere else.
>
> The struct is of type z4c_vars.  It (and its parent) are defined in
> lines 279837 to 280818.  It is large.
>
> Is there e.g. a parameter I could set to make GCC try harder to avoid
> unnecessary stores?

Yes, there's --param dse-max-alias-queries-per-store=N where setting N to
1 seems to help quite a bit (its default is 256 and it is used to limit
the quadratic complexity of DSE to constant time).

There are still some other temporaries left; notably we have aggregate
copies like

  MEM[(struct vect *)&D.662088 + 96B clique 23 base 1].elts._M_elems[0] =
    MEM[(const struct simd &)&D.662080 clique 23 base 0];

later used as

  SR.3537_174107 = MEM [(const struct vec3 &)&D.662088];

The aggregate copying keeps the stores into D.662080 live (instead of
directly storing into D.662088).  At SRA time we still take the address
of both, so they are not considered for decomposition.  That's from dead
stores to some std::initializer_list.  Thus there's a pass ordering
issue and SRA would benefit from another DSE pass before it.
For fun I tried

diff --git a/gcc/passes.def b/gcc/passes.def
index e9ed3c7bc57..0c8a50e7a07 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -221,6 +221,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_tail_recursion);
       NEXT_PASS (pass_ch);
       NEXT_PASS (pass_lower_complex);
+      NEXT_PASS (pass_dse);
       NEXT_PASS (pass_sra);
       /* The dom pass will also resolve all __builtin_constant_p calls
          that are still there to 0.  This has to be done after some
@@ -236,7 +237,6 @@ along with GCC; see the file COPYING3.  If not see
       /* Identify paths that should never be executed in a conforming
          program and isolate those paths.  */
       NEXT_PASS (pass_isolate_erroneous_paths);
-      NEXT_PASS (pass_dse);
       NEXT_PASS (pass_reassoc, true /* insert_powi_p */);
       NEXT_PASS (pass_dce);
       NEXT_PASS (pass_forwprop);

which helps SRA in this case.  It does need quite some magic to remove
all the C++ abstraction in this code.  There are also lots of stray
CLOBBERs of otherwise unused variables in the code which skew the DSE
alias query parameter limit; we should look at doing
TODO_remove_unused_locals more often (for example after DSE) - that
alone is enough to get almost all the improvements without increasing
any of the DSE walking limits.
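For reference, the --param discussed above goes on the compile command
line like any other GCC option; a sketch using the compile command from
comment #4 (shortened, file path as given there) might look like:

```shell
# Workaround sketch: cap DSE's alias queries per store at 1 (default 256),
# which the comment above reports helps quite a bit on this testcase.
g++ -O3 -march=skylake -std=gnu++17 \
    --param dse-max-alias-queries-per-store=1 \
    -c configs/sim/build/Z4c/rhs.cxx.ii -o rhs.cxx.o
```

This is a per-testcase experiment, not a general recommendation; the
param trades alias-walk effort against dead-store removal.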
--- Comment #5 from Erik Schnetter ---
As you suggested, the problem is probably not caused by register spills,
but by stores into a struct that are not optimized away.  In this case,
the respective struct elements are unused in the code.

I traced the results of the first __builtin_ia32_maskloadpd256:

  _63940 = __builtin_ia32_maskloadpd256 (_63955, prephitmp_86203);
  MEM [(struct mat3 *)&vars + 992B] = _63940;
  _178613 = .FMA (_63940, _64752, _178609);
  MEM [(struct mat3 *)&vars + 1312B] = _63940;

The respective struct locations (+ 992B, + 1312B) are indeed not used
anywhere else.

The struct is of type z4c_vars.  It (and its parent) are defined in
lines 279837 to 280818.  It is large.

Is there e.g. a parameter I could set to make GCC try harder to avoid
unnecessary stores?
--- Comment #4 from Erik Schnetter ---
I build with the compiler options

  /Users/eschnett/src/CarpetX/Cactus/view-compilers/bin/g++ -fopenmp
  -Wall -pipe -g -march=skylake -std=gnu++17 -O3 -fcx-limited-range
  -fexcess-precision=fast -fno-math-errno -fno-rounding-math
  -fno-signaling-nans -funsafe-math-optimizations -c -o
  configs/sim/build/Z4c/rhs.cxx.o configs/sim/build/Z4c/rhs.cxx.ii

One of the kernels in question (the one I describe above) is the C++
lambda in lines 281013 to 281119.  The call to the "noinline" function
ensures that the kernel (and surrounding for loops) is compiled as a
separate function, which produces more efficient code.  The function
"grid.loop_int_device" contains essentially three nested for loops, and
the actual kernel is the C++ lambda in lines 281015 to 281118.

I'll have a look at -fdump-tree-optimized.
Richard Biener changed:

           What    |Removed |Added
----------------------------------------------------------------
                CC|        |rguenth at gcc dot gnu.org
          Keywords|        |missed-optimization
            Target|        |x86_64-*-*

--- Comment #3 from Richard Biener ---
Which function does the loop kernel reside in?  I see you have some
lambdas in Z4c_RHS, done fancy as out-of-line functions, that do look
like they could comprise the actual kernels.  In apply_upwind_diss I see
cases without stack usage.  I'm looking at -O2 -march=skylake compiles.

Note that with C++ it's easy to retain some abstraction and thus
misinterpret stack accesses as spilling where they are aggregates not
eliminated.  For example in one of the lambdas I see

  _61489 = __builtin_ia32_maskloadpd256 (_104487, _61513);
  D.545024[1].elts.car = _61489;
  ...
  MEM[(struct vect *)&D.544982].elts._M_elems[1] =
    MEM[(const struct simd &)&D.545024 + 32];
  ...
  MEM[(struct mat3 *)&vars + 992B] = MEM[(const struct mat3 &)&D.544982];

and D.544982 is later variable-indexed in some MIN/MAX, FMA using code
(instead of using 'vars' there).

Looking at what -fdump-tree-optimized produces is sometimes pointing at
problems.  That said, the code is large, so please point at some source
lines within the important kernel(s) (of the preprocessed source, that
is) and the compile options used.
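The -fdump-tree-optimized suggestion can be applied roughly as follows
(a sketch; the exact dump file name, including the pass number before
"t.optimized", varies across GCC versions and -dumpbase settings):

```shell
# Dump the final GIMPLE after optimization and look for leftover
# aggregate accesses into temporaries like D.544982 -- a sign of retained
# C++ abstraction rather than true register spilling.
g++ -O2 -march=skylake -std=gnu++17 -fdump-tree-optimized \
    -c configs/sim/build/Z4c/rhs.cxx.ii -o rhs.cxx.o
grep -n 'MEM' rhs.cxx.ii.*t.optimized | head
```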
--- Comment #2 from Erik Schnetter ---
I did not describe the scale of the issue.  There are more than just a
few inefficient or unnecessary operations: the loop kernel (a single
basic block) extends from address 0x1240 to 0xbf27 in the attached
disassembled object file.

Out of about 6000 instructions in the loop, 1000 are inefficient (and
likely superfluous) moves that copy one 32-byte stack slot into another,
using 16-byte wide copies.  For example, the stack slot 9376(%rsp) is
written 9 times in the loop kernel, but is read only once.
--- Comment #1 from Erik Schnetter ---
Created attachment 50508
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50508&action=edit
Compressed disassembled object file
