On Tue, Nov 3, 2020 at 4:31 PM Frederik Harwath <frede...@codesourcery.com> wrote:
>
>
> Hi,
>
> as a first step towards enabling the use of Graphite for optimizing
> OpenACC loops this patch moves the OpenACC device lowering after the
> Graphite pass. This means that the device lowering now takes place
> after some crucial optimization passes. Thus new instances of those
> passes are added inside of a new pass pass_oacc_functions which ensures
> that they run on OpenACC functions only. The choice of the new position
> for pass_oacc_device_lower is further constrained by the need to
> execute it before pass_vectorize. This means that
> pass_oacc_device_lower now runs inside of pass_tree_loop. A further
> instance of the pass that handles functions without loops is added
> inside of pass_tree_no_loop. Yet another pass instance that executes if
> optimizations are disabled is included inside of a new
> pass_no_optimizations.
>
> The patch has been bootstrapped on x86_64-linux-gnu and tested with the
> GCC testsuite and with the libgomp testsuite with nvptx and gcn
> offloading.
>
> The patch should have no impact on non-OpenACC user code. However the
> new pass instances have changed the pass instance numbering and hence
> the dump scanning commands in several tests had to be adjusted. I hope

What's on my TODO list (or on the list of things to explore) is to make
the dump file names/suffixes explicit in passes.def, e.g. via

  NEXT_PASS (pass_ccp, true /* nonzero_p */, "oacc")

which would give us a dump named .ccp_oacc or so. Or stick with explicit
numbers by specifying , 5. If only the number is fixed, this could
eventually be done with just tweaks to gen-pass-instances.awk.

Now, what does oacc_device_lower actually do that you need to re-run
complex lowering? What does cunrolli do at this point that the
complete_unroll pass later does not do?

What's special about oacc_device_lower that doesn't also apply to
omp_device_lower?

Is all this targeted at code compiled exclusively for the offload
target? Thus we're in lto1 here? Would it eventually make more sense to
have a completely custom pass pipeline for the offload compilation?
Maybe even per offload target? See how we have a custom pipeline for
-Og (pass_all_optimizations_g).

> that I found all that needed adjustment, but it is well possible that I
> missed some tests that execute for particular targets or non-default
> languages only. The resulting UNRESOLVED tests are usually easily fixed
> by appending a pass number to the name of a pass that previously had no
> number (e.g. "cunrolli" becomes "cunrolli1") or by incrementing the pass
> number (e.g. "dce6" becomes "dce7") in a dump scanning command.
>
> The patch leads to several new unresolved tests in the libgomp testsuite
> which are caused by the combination of torture testing, missing cleanup
> of the offload dump files, and the new pass numbering. If a test that
> uses, for instance, "-foffload=fdump-tree-oaccdevlow" gets compiled with
> "-O0" and afterwards with "-O2", each run of the test executes different
> instances of pass_oacc_device_lower and produces dumps whose names
> differ only in the pass instance number.
> The dump scanning command in
> the second run fails, because the dump files do not get removed after
> the first run and the command consequently matches two different dump
> files. This seems to be a known issue. I am going to submit a patch
> that implements the cleanup of the offload dumps soon.
>
> I have tried to rule out performance regressions by running different
> benchmark suites with nvptx and gcn offloading. Nevertheless, I think
> that it makes sense to keep an eye on OpenACC performance in the near
> future and revisit the optimizations that run on the device lowered
> function if necessary.
>
> Ok to include the patch in master?
>
> Best regards,
> Frederik
>
>
> -----------------
> Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / Germany
> Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung,
> Alexander Walter
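
To illustrate why adding pass instances forces the dump scanning commands to change, here is a rough Python sketch of the numbering behaviour described in the quoted mail. It is a simplified model, not GCC's actual code: it assumes that a pass occurring once keeps its bare name and that a pass occurring several times gets a 1-based instance number appended. The helper name `dump_suffixes` and both pipelines are made up for the example.

```python
from collections import Counter


def dump_suffixes(pipeline):
    """Return the dump suffix of every pass instance, in pipeline order.

    Simplified model (an assumption, not GCC's implementation): a pass
    that occurs once keeps its bare name; a pass that occurs several
    times gets a 1-based instance number appended.
    """
    totals = Counter(pipeline)   # how often each pass occurs overall
    seen = Counter()             # how many instances we have met so far
    suffixes = []
    for name in pipeline:
        seen[name] += 1
        if totals[name] == 1:
            suffixes.append(name)
        else:
            suffixes.append(f"{name}{seen[name]}")
    return suffixes


# Hypothetical pipelines, only meant to show the renaming effect.
before = ["ccp", "cunrolli", "dce", "ccp", "dce"]
after = ["ccp", "cunrolli", "dce", "ccp", "cunrolli", "dce"]

print(dump_suffixes(before))
print(dump_suffixes(after))
```

In this model the single "cunrolli" dump is renamed to "cunrolli1" as soon as a second instance exists, so a scan command that names "cunrolli" no longer finds its dump. Explicit suffixes in passes.def, as suggested above, would keep the existing names stable.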