Hi Richard,

Richard Biener <richard.guent...@gmail.com> writes:
> On Tue, Nov 3, 2020 at 4:31 PM Frederik Harwath
>
> What's on my TODO list (or on the list of things to explore) is to make
> the dump file names/suffixes explicit in passes.def like via
>
>   NEXT_PASS (pass_ccp, true /* nonzero_p */, "oacc")
>
> and we'd get a dump named .ccp_oacc or so.

That would be very helpful for avoiding the drudgery of adapting those
pass numbers!

> Now, what does oacc_device_lower actually do that you need to
> re-run complex lowering?  What does cunrolli do at this point that
> the complete_unroll pass later does not do?
>

Good spot, "cunrolli" seems to be unnecessary.

The complex lowering is necessary to handle the code that gets created
by the OpenACC reduction lowering during oaccdevlow. I have attached a
test case (a reduced version of
libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-flt.c)
which shows that the complex instructions are created by
pass_oacc_device_lower and which leads to an ICE if compiled without
the new complex lowering instance
("-foffload=-fdisable-tree-cplxlower2").

The problem is an unlowered addition. This is from a diff of the dump
of the pass following oaccdevlow1 (ccp4) with disabled and with
enabled tree-cplxlower2:

  <   _91 = VIEW_CONVERT_EXPR<complex float>(_1);
  <   _92 = reduction_var_2 + _91;
  ---
  >   _104 = REALPART_EXPR <VIEW_CONVERT_EXPR<complex float>(_1)>;
  >   _105 = IMAGPART_EXPR <VIEW_CONVERT_EXPR<complex float>(_1)>;
  >   _91 = COMPLEX_EXPR <_104, _105>;
  >   _106 = reduction_var$real_100 + _104;
  >   _107 = reduction_var$imag_101 + _105;
  >   _92 = COMPLEX_EXPR <_106, _107>;

> What's special about oacc_device lower that doesn't also apply
> to omp_device_lower?

The passes do different things. The goal is to optimize OpenACC loops
using Graphite. The relevant lowering of the internal OpenACC function
calls happens in pass_oacc_device_lower.

> Is all this targeted at code compiled exclusively for the offload
> target?  Thus we're in lto1 here?

The OpenACC outlined functions also get compiled for the host.

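
To make the dump diff above a bit more concrete, here is a rough
hand-written C analogue of what the lowered addition computes. This is
only a sketch under my own assumptions: add_unlowered and add_lowered
are hypothetical names used for illustration, not functions from GCC,
and the code mirrors the shape of the GIMPLE in the dump rather than
how the cplxlower pass is implemented.

```c
#include <complex.h>

/* Unlowered form: a single addition on a complex-typed value,
   comparable to "_92 = reduction_var_2 + _91" in the dump.  */
static float _Complex
add_unlowered (float _Complex acc, float _Complex x)
{
  return acc + x;
}

/* Hand-written analogue of the lowered form (hypothetical helper):
   split both operands into real and imaginary parts, add the parts
   separately, then rebuild the complex value.  */
static float _Complex
add_lowered (float _Complex acc, float _Complex x)
{
  /* Corresponds to the REALPART_EXPR / IMAGPART_EXPR statements.  */
  float acc_re = crealf (acc);
  float acc_im = cimagf (acc);
  float x_re = crealf (x);
  float x_im = cimagf (x);

  /* Corresponds to the two scalar additions on the $real/$imag parts.  */
  float sum_re = acc_re + x_re;
  float sum_im = acc_im + x_im;

  /* Corresponds to the final COMPLEX_EXPR; adequate for finite values.  */
  return sum_re + sum_im * I;
}
```

For finite inputs both functions return the same value; the second one
just contains no operation on a complex-typed operand other than
taking it apart and putting it back together.
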
> Does it make eventually more sense to have a completely custom pass
> pipeline for the offload compilation?  Maybe even per offload target?
> See how we have a custom pipeline for -Og (pass_all_optimizations_g).

What would be the main benefits of a separate pipeline? Avoiding
(re-)running passes unnecessarily, fewer unwanted interactions in the
test suite (but your suggestion above regarding the fixed pass names
would also solve this)?

>> Ok to include the patch in master?

Best regards,
Frederik

-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / Germany
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Alexander Walter

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-lowering.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-lowering.c
new file mode 100644
index 00000000000..6879e5aaf25
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-lowering.c
@@ -0,0 +1,50 @@
+/* { dg-additional-options "-foffload=-fdump-tree-cplxlower2" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow1" } */
+/* { dg-do link } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } {""} } */
+
+#include <stdio.h>
+#if !defined(__hppa__) || !defined(__hpux__)
+#include <complex.h>
+#endif
+
+#define N 100
+
+static float _Complex __attribute__ ((noinline))
+sum (float _Complex ary[N])
+{
+  float _Complex reduction_var = 0;
+#pragma acc parallel loop gang reduction(+:reduction_var)
+  for (int ix = 0; ix < N; ix++)
+    reduction_var += ary[ix];
+
+  return reduction_var;
+}
+
+int main (void)
+{
+  float _Complex ary[N];
+  float _Complex result;
+
+  for (int ix = 0; ix < N; ix++)
+    {
+      float frac = ix * (1.0f / 1024) + 1.0f;
+      ary[ix] = frac + frac * 2.0j - 1.0j;
+    }
+
+  result = sum (ary);
+  printf("%.1f%+.1fi\n", creal(result), cimag(result));
+  return 0;
+}
+
+/* { dg-final { scan-offload-tree-dump-times "COMPLEX_EXPR" 1 "oaccdevlow1" } }
+
+   There is just one COMPLEX_EXPR right before oaccdevlow1 ...*/
+
+/* { dg-final { scan-offload-tree-dump-times "GOACC_REDUCTION .*?reduction_var.*?;" 4 "oaccdevlow1" } }
+
+   ... but several IFN_GOACC_REDUCTION calls for the reduction variable which are subsequently lowered ... */
+
+/* { dg-final { scan-offload-tree-dump-times "COMPLEX_EXPR <reduction_var.real_\\d+, reduction_var.imag_\\d+>" 4 "cplxlower2" } }
+
+   ... which introduces new COMPLEX_EXPRs. */