Hi Richard,

Richard Biener <richard.guent...@gmail.com> writes:

> On Tue, Nov 3, 2020 at 4:31 PM Frederik Harwath

> What's on my TODO list (or on the list of things to explore) is to make
> the dump file names/suffixes explicit in passes.def like via
>
>   NEXT_PASS (pass_ccp, true /* nonzero_p */, "oacc")
>
> and we'd get a dump named .ccp_oacc or so.

That would be very helpful for avoiding the drudgery of adapting those
pass numbers!

> Now, what does oacc_device_lower actually do that you need to
> re-run complex lowering?  What does cunrolli do at this point that
> the complete_unroll pass later does not do?
>

Good spot, "cunrolli" seems to be unnecessary.  The complex lowering is
necessary to handle the code that gets created by the OpenACC reduction
lowering during oaccdevlow.  I have attached a test case (a reduced
version of
libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-flt.c) which
shows that the complex instructions are created by
pass_oacc_device_lower and which leads to an ICE if compiled without the
new complex lowering instance ("-foffload=-fdisable-tree-cplxlower2").
The problem is an unlowered complex addition. Here is a diff of the dump
of the pass following oaccdevlow1 (ccp4), with tree-cplxlower2 disabled
("<") and enabled (">"):

<   _91 = VIEW_CONVERT_EXPR<complex float>(_1);
<   _92 = reduction_var_2 + _91;
---
>   _104 = REALPART_EXPR <VIEW_CONVERT_EXPR<complex float>(_1)>;
>   _105 = IMAGPART_EXPR <VIEW_CONVERT_EXPR<complex float>(_1)>;
>   _91 = COMPLEX_EXPR <_104, _105>;
>   _106 = reduction_var$real_100 + _104;
>   _107 = reduction_var$imag_101 + _105;
>   _92 = COMPLEX_EXPR <_106, _107>;
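
For readers less familiar with the GIMPLE above, here is a rough C-level
sketch of what the lowered form computes. The helper name
add_componentwise is made up for illustration and is not part of GCC or
of the patch; __real__ and __imag__ are the GNU C extensions for
accessing the parts of a complex value.

```c
/* Hypothetical helper, not part of GCC or of the patch: adds two complex
   floats one component at a time, roughly mirroring the lowered GIMPLE.  */
static float _Complex
add_componentwise (float _Complex acc, float _Complex val)
{
  float val_re = __real__ val;            /* cf. _104 = REALPART_EXPR <...>  */
  float val_im = __imag__ val;            /* cf. _105 = IMAGPART_EXPR <...>  */
  float sum_re = __real__ acc + val_re;   /* cf. _106 = ...$real_100 + _104  */
  float sum_im = __imag__ acc + val_im;   /* cf. _107 = ...$imag_101 + _105  */

  /* Assemble the result from two scalars, cf. COMPLEX_EXPR <_106, _107>,
     so that no complex-typed addition is needed.  */
  float _Complex result;
  __real__ result = sum_re;
  __imag__ result = sum_im;
  return result;
}
```

Only scalar float additions remain here, which is the point of the
lowering: the single complex-typed addition from the "<" side is what
is left unlowered when cplxlower2 is disabled.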

> What's special about oacc_device lower that doesn't also apply
> to omp_device_lower?

The passes do different things. The goal is to optimize OpenACC
loops using Graphite. The relevant lowering of the internal OpenACC
function calls happens in pass_oacc_device_lower.

> Is all this targeted at code compiled exclusively for the offload
> target?  Thus we're in lto1 here?

The OpenACC outlined functions also get compiled for the host.

> Does it make eventually more sense to have a completely custom pass
> pipeline for the  offload compilation?  Maybe even per offload target?
> See how we have a custom pipeline for -Og (pass_all_optimizations_g).

What would be the main benefits of a separate pipeline? Avoiding
(re-)running passes unnecessarily, fewer unwanted interactions
in the test suite (but your suggestion above regarding the fixed
pass names would also solve this)?

>> Ok to include the patch in master?

Best regards,
Frederik

-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / Germany
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Alexander Walter
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-lowering.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-lowering.c
new file mode 100644
index 00000000000..6879e5aaf25
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-lowering.c
@@ -0,0 +1,50 @@
+/* { dg-additional-options "-foffload=-fdump-tree-cplxlower2" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow1" } */
+/* { dg-do link } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } {""} } */
+
+#include <stdio.h>
+#if !defined(__hppa__) || !defined(__hpux__)
+#include <complex.h>
+#endif
+
+#define N 100
+
+static float _Complex __attribute__ ((noinline))
+sum (float _Complex ary[N])
+{
+  float _Complex reduction_var = 0;
+#pragma acc parallel loop gang reduction(+:reduction_var)
+  for (int ix = 0; ix < N; ix++)
+    reduction_var += ary[ix];
+
+ return reduction_var;
+}
+
+int main (void)
+{
+  float _Complex ary[N];
+  float _Complex result;
+
+  for (int ix = 0; ix < N;  ix++)
+    {
+      float frac = ix * (1.0f / 1024) + 1.0f;
+      ary[ix] = frac + frac * 2.0j - 1.0j;
+    }
+
+  result = sum (ary);
+  printf("%.1f%+.1fi\n", creal(result), cimag(result));
+  return 0;
+}
+
+/* { dg-final { scan-offload-tree-dump-times "COMPLEX_EXPR" 1 "oaccdevlow1" } }
+
+ There is just one COMPLEX_EXPR right before oaccdevlow1 ...*/
+
+/* { dg-final { scan-offload-tree-dump-times "GOACC_REDUCTION .*?reduction_var.*?;" 4 "oaccdevlow1" } }
+
+  ... but several IFN_GOACC_REDUCTION calls for the reduction variable which are subsequently lowered ... */
+
+/* { dg-final { scan-offload-tree-dump-times "COMPLEX_EXPR <reduction_var.real_\\d+, reduction_var.imag_\\d+>" 4  "cplxlower2" } }
+
+ ... which introduces new COMPLEX_EXPRs. */
