Re: Move pass_oacc_device_lower after pass_graphite

2020-11-06 Thread Frederik Harwath

Hi Richard,

Richard Biener  writes:

> On Tue, Nov 3, 2020 at 4:31 PM Frederik Harwath

> What's on my TODO list (or on the list of things to explore) is to make
> the dump file names/suffixes explicit in passes.def like via
>
>   NEXT_PASS (pass_ccp, true /* nonzero_p */, "oacc")
>
> and we'd get a dump named .ccp_oacc or so.

That would be very helpful for avoiding the drudgery of adapting those
pass numbers!

> Now, what does oacc_device_lower actually do that you need to
> re-run complex lowering?  What does cunrolli do at this point that
> the complete_unroll pass later does not do?
>

Good spot, "cunrolli" seems to be unnecessary.  The complex lowering is
necessary to handle the code that gets created by the OpenACC reduction
lowering during oaccdevlow.  I have attached a test case (a reduced
version of
libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-flt.c) which
shows that the complex instructions are created by
pass_oacc_device_lower and which leads to an ICE if compiled without the
new complex lowering instance ("-foffload=-fdisable-tree-cplxlower2").
The problem is an unlowered addition. This is from a diff of the dump of
the pass following oaccdevlow1 (ccp4) with disabled and with enabled
tree-cplxlower2:

<   _91 = VIEW_CONVERT_EXPR(_1);
<   _92 = reduction_var_2 + _91;
---
>   _104 = REALPART_EXPR (_1)>;
>   _105 = IMAGPART_EXPR (_1)>;
>   _91 = COMPLEX_EXPR <_104, _105>;
>   _106 = reduction_var$real_100 + _104;
>   _107 = reduction_var$imag_101 + _105;
>   _92 = COMPLEX_EXPR <_106, _107>;

> What's special about oacc_device lower that doesn't also apply
> to omp_device_lower?

The passes do different things. The goal is to optimize OpenACC
loops using Graphite. The relevant lowering of the internal OpenACC
function calls happens in pass_oacc_device_lower.

> Is all this targeted at code compiled exclusively for the offload
> target?  Thus we're in lto1 here?

The OpenACC outlined functions also get compiled for the host.

> Does it make eventually more sense to have a completely custom pass
> pipeline for the  offload compilation?  Maybe even per offload target?
> See how we have a custom pipeline for -Og (pass_all_optimizations_g).

What would be the main benefits of a separate pipeline? Avoiding
(re-)running passes unneccessarily, less unwanted interactions
in the test suite (but your suggestion above regarding the fixed
pass names would also solve this)?

>> Ok to include the patch in master?

Best regards,
Frederik

-
Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / Germany
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Alexander 
Walter
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-lowering.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-lowering.c
new file mode 100644
index 000..6879e5aaf25
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-lowering.c
@@ -0,0 +1,50 @@
+/* { dg-additional-options "-foffload=-fdump-tree-cplxlower2" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow1" } */
+/* { dg-do link } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } {""} } */
+
+#include 
+#if !defined(__hppa__) || !defined(__hpux__)
+#include 
+#endif
+
+#define N 100
+
+static float _Complex __attribute__ ((noinline))
+sum (float _Complex ary[N])
+{
+  float _Complex reduction_var = 0;
+#pragma acc parallel loop gang reduction(+:reduction_var)
+  for (int ix = 0; ix < N; ix++)
+reduction_var += ary[ix];
+
+ return reduction_var;
+}
+
+int main (void)
+{
+  float _Complex ary[N];
+  float _Complex result;
+
+  for (int ix = 0; ix < N;  ix++)
+{
+  float frac = ix * (1.0f / 1024) + 1.0f;
+  ary[ix] = frac + frac * 2.0j - 1.0j;
+}
+
+  result = sum (ary);
+  printf("%.1f%+.1fi\n", creal(result), cimag(result));
+  return 0;
+}
+
+/* { dg-final { scan-offload-tree-dump-times "COMPLEX_EXPR" 1 "oaccdevlow1" } }
+
+ There is just one COMPLEX_EXPR right before oaccdevlow1 ...*/
+
+/* { dg-final { scan-offload-tree-dump-times "GOACC_REDUCTION .*?reduction_var.*?;" 4 "oaccdevlow1" } }
+
+  ... but several IFN_GOACC_REDUCTION calls for the reduction variable which are subsequently lowered ... */
+
+/* { dg-final { scan-offload-tree-dump-times "COMPLEX_EXPR " 4  "cplxlower2" } }
+
+ ... which introduces new COMPLEX_EXPRs. */


Re: Move pass_oacc_device_lower after pass_graphite

2020-11-06 Thread Richard Biener via Gcc-patches
On Fri, Nov 6, 2020 at 12:18 PM Frederik Harwath
 wrote:
>
>
> Hi Richard,
>
> Richard Biener  writes:
>
> > On Tue, Nov 3, 2020 at 4:31 PM Frederik Harwath
>
> > What's on my TODO list (or on the list of things to explore) is to make
> > the dump file names/suffixes explicit in passes.def like via
> >
> >   NEXT_PASS (pass_ccp, true /* nonzero_p */, "oacc")
> >
> > and we'd get a dump named .ccp_oacc or so.
>
> That would be very helpful for avoiding the drudgery of adapting those
> pass numbers!
>
> > Now, what does oacc_device_lower actually do that you need to
> > re-run complex lowering?  What does cunrolli do at this point that
> > the complete_unroll pass later does not do?
> >
>
> Good spot, "cunrolli" seems to be unnecessary.  The complex lowering is
> necessary to handle the code that gets created by the OpenACC reduction
> lowering during oaccdevlow.  I have attached a test case (a reduced
> version of
> libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-cplx-flt.c) which
> shows that the complex instructions are created by
> pass_oacc_device_lower and which leads to an ICE if compiled without the
> new complex lowering instance ("-foffload=-fdisable-tree-cplxlower2").
> The problem is an unlowered addition. This is from a diff of the dump of
> the pass following oaccdevlow1 (ccp4) with disabled and with enabled
> tree-cplxlower2:
>
> <   _91 = VIEW_CONVERT_EXPR(_1);
> <   _92 = reduction_var_2 + _91;
> ---
> >   _104 = REALPART_EXPR (_1)>;
> >   _105 = IMAGPART_EXPR (_1)>;
> >   _91 = COMPLEX_EXPR <_104, _105>;
> >   _106 = reduction_var$real_100 + _104;
> >   _107 = reduction_var$imag_101 + _105;
> >   _92 = COMPLEX_EXPR <_106, _107>;

I wonder if oacc device lowering could handle this itself rather than
requiring another cplxlower pass for presumably just complex add?

> > What's special about oacc_device lower that doesn't also apply
> > to omp_device_lower?
>
> The passes do different things. The goal is to optimize OpenACC
> loops using Graphite. The relevant lowering of the internal OpenACC
> function calls happens in pass_oacc_device_lower.
>
> > Is all this targeted at code compiled exclusively for the offload
> > target?  Thus we're in lto1 here?
>
> The OpenACC outlined functions also get compiled for the host.
>
> > Does it make eventually more sense to have a completely custom pass
> > pipeline for the  offload compilation?  Maybe even per offload target?
> > See how we have a custom pipeline for -Og (pass_all_optimizations_g).
>
> What would be the main benefits of a separate pipeline? Avoiding
> (re-)running passes unneccessarily, less unwanted interactions
> in the test suite (but your suggestion above regarding the fixed
> pass names would also solve this)?

Mainly to avoid (re-)running passes unneccessarily and more
easily tuning towards offload targets without affecting non-offload
code too much.

Can I somehow make you work on that dump-file idea? ;)

Richard.

> >> Ok to include the patch in master?
>
> Best regards,
> Frederik
>
> -
> Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / Germany
> Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, 
> Alexander Walter


Re: Move pass_oacc_device_lower after pass_graphite

2020-11-04 Thread Richard Biener via Gcc-patches
On Tue, Nov 3, 2020 at 4:31 PM Frederik Harwath
 wrote:
>
>
> Hi,
>
> as a first step towards enabling the use of Graphite for optimizing
> OpenACC loops this patch moves the OpenACC device lowering after the
> Graphite pass.  This means that the device lowering now takes place
> after some crucial optimization passes. Thus new instances of those
> passes are added inside of a new pass pass_oacc_functions which ensures
> that they run on OpenACC functions only. The choice of the new position
> for pass_oacc_device_lower is further constrainted by the need to
> execute it before pass_vectorize.  This means that
> pass_oacc_device_lower now runs inside of pass_tree_loop. A further
> instance of the pass that handles functions without loops is added
> inside of pass_tree_no_loop. Yet another pass instance that executes if
> optimizations are disabled is included inside of a new
> pass_no_optimizations.
>
> The patch has been bootstrapped on x86_64-linux-gnu and tested with the
> GCC testsuite and with the libgomp testsuite with nvptx and gcn
> offloading.
>
> The patch should have no impact on non-OpenACC user code. However the
> new pass instances have changed the pass instance numbering and hence
> the dump scanning commands in several tests had to be adjusted. I hope

What's on my TODO list (or on the list of things to explore) is to make
the dump file names/suffixes explicit in passes.def like via

  NEXT_PASS (pass_ccp, true /* nonzero_p */, "oacc")

and we'd get a dump named .ccp_oacc or so.  Or stick with explicit
numbers by specifying , 5.  If just the number is fixed this could
eventually be done with just tweaks to gen-pass-instances.awk

Now, what does oacc_device_lower actually do that you need to
re-run complex lowering?  What does cunrolli do at this point that
the complete_unroll pass later does not do?

What's special about oacc_device lower that doesn't also apply
to omp_device_lower?

Is all this targeted at code compiled exclusively for the offload
target?  Thus we're in lto1 here?  Does it make eventually more
sense to have a completely custom pass pipeline for the
offload compilation?  Maybe even per offload target?  See how
we have a custom pipeline for -Og (pass_all_optimizations_g).

> that I found all that needed adjustment, but it is well possible that I
> missed some tests that execute for particular targets or non-default
> languages only. The resulting UNRESOLVED tests are usually easily fixed
> by appending a pass number to the name of a pass that previously had no
> number (e.g. "cunrolli" becomes "cunrolli1") or by incrementing the pass
> number (e.g. "dce6" becomes "dce7") in a dump scanning command.
>
> The patch leads to several new unresolved tests in the libgomp testsuite
> which are caused by the combination of torture testing, missing cleanup
> of the offload dump files, and the new pass numbering.  If a test that
> uses, for instance, "-foffload=fdump-tree-oaccdevlow" gets compiled with
> "-O0" and afterwards with "-O2", each run of the test executes different
> instances of pass_oacc_device_lower and produces dumps whose names
> differ only in the pass instance number.  The dump scanning command in
> the second run fails, because the dump files do not get removed after
> the first run and the command consequently matches two different dump
> files.  This seems to be a known issue.  I am going to submit a patch
> that implements the cleanup of the offload dumps soon.
>
> I have tried to rule out performance regressions by running different
> benchmark suites with nvptx and gcn offloading. Nevertheless, I think
> that it makes sense to keep an eye on OpenACC performance in the close
> future and revisit the optimizations that run on the device lowered
> function if necessary.
>
> Ok to include the patch in master?
>
> Best regards,
> Frederik
>
>
> -
> Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / Germany
> Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, 
> Alexander Walter