Re: [PATCH v2] Re: OpenMP: Generate SIMD clones for functions with "declare target"

2022-09-30 Thread Jakub Jelinek via Gcc-patches
On Wed, Sep 21, 2022 at 09:17:18PM -0600, Sandra Loosemore wrote:
> On 9/14/22 12:12, Jakub Jelinek wrote:
> 
> > If it is pure optimization thing and purely keyed on the definition,
> > all the simd clones should be local to the TU, never exported from it.
> 
> OK, here is a revised patch that addresses that.  x86_64 target also
> generates a different set of clones for functions with internal linkage vs
> external so I hacked that to treat these implicit clones in the same way as
> other internal clones.
> 
> There is an existing problem with internal "declare simd" clones in that
> nothing ever DCEs clones that end up not being useful, or does a scan of the
> code in the compilation unit before clone generation to avoid generating
> useless clones in the first place.  I haven't tried to solve that problem,
> but I did attempt to mitigate it for these implicit "declare target" clones
> by tagging the option OPT_LEVELS_2_PLUS_SPEED_ONLY (instead of enabling it
> by default all the time) so the clones are not generated by default at -Os
> and -Og.  I added a couple new test cases to check this.

We've discussed this at Cauldron.  Especially for this patch, but less
urgently for explicit declare simd on non-exported functions (less urgently
just because people don't mark everything declare simd usually) solving the
above is essential.  I don't say it can't be done incrementally, but if the
patch is added to trunk, it needs to be solved before 13 branches.
We need to arrange cgraph to process the declare simd clones after the
callers of the corresponding main function, so that by the time we try to
post-IPA optimize the clones we can see if they were actually used or not
and if not, throw them away.

On the other side, for the implicit declare simd (in explicit case it is
user's choice), maybe it might be useful to actually see if the function clone
is vectorizable before deciding whether to actually make use of it.
Because I doubt it will be a good optimization if we clone it, push
arguments into vectors, then because vectorization failed take it appart,
do a serial loop, create return vector from the scalar results and return.
Though, thinking more about it, for the amdgcn case maybe it is worth even
in that case if we manage to vectorize the caller.  Because if failed
vectorization on admgcn means we perform significantly slower, it can be
helpful to have even partial vectorization, vectorize statements that can
be vectorized and for others use a scalar loop.  Our vectorizer is not
prepared to do that right now I believe (which is why e.g. for
#pragma omp ordered simd we just make the whole loop non-vectorizable,
rather than using a scalar loop for stuff in there and vectorize the rest),
but with this optimization we'd effectively achieve that at least at
function call boundaries (though, only in one direction, if the caller can
be vectorized and callee can't; no optimization if caller can't and callee
could be).

> +/* Helper function for mark_auto_simd_clone; return false if the statement
> +   violates restrictions for an "omp declare simd" function.  Specifically,
> +   the function must not
> +   - throw or call setjmp/longjmp
> +   - write memory that could alias parallel calls
> +   - include openmp directives or calls
> +   - call functions that might do those things */
> +
> +static bool
> +auto_simd_check_stmt (gimple *stmt, tree outer)
> +{
> +  tree decl;
> +
> +  switch (gimple_code (stmt))
> +{
> +case GIMPLE_CALL:
> +  decl = gimple_call_fndecl (stmt);
> +
> +  /* We can't know whether indirect calls are safe.  */
> +  if (decl == NULL_TREE)
> + return false;

What about internal function calls?  Are all of them undesirable, or
some of them?  We do have const / pure ifns, ...
> +
> +  /* Calls to functions that are CONST or PURE are ok.  */
> +  if (gimple_call_flags (stmt) & (ECF_CONST | ECF_PURE))
> + break;
> +
> +  /* Calls to functions that are already marked "omp declare simd" are
> +  OK.  */
> +  if (lookup_attribute ("omp declare simd", DECL_ATTRIBUTES (decl)))
> + break;

You could instead look up the cgraph simd clone info for the function...

> +  /* OpenMP directives are not permitted.  */
> +CASE_GIMPLE_OMP:
> +  return false;

This makes no sense.  The function is called on low GIMPLE during IPA,
there are no GOMP_* statements at this point in the IL, everything has
been expanded.  Most of OpenMP directives though end up calling
libgomp APIs which aren't pure/const and don't have declare simd
attribute...
Exception can be say master construct, or static scheduling nowait
worksharing loop.

> +  /* Conservatively reject all EH-related constructs.  */
> +case GIMPLE_CATCH:
> +case GIMPLE_EH_FILTER:
> +case GIMPLE_EH_MUST_NOT_THROW:
> +case GIMPLE_EH_ELSE:
> +case GIMPLE_EH_DISPATCH:
> +case GIMPLE_RESX:
> +case GIMPLE_TRY:

Most of these won't appear in low gimple either, I think 

[PATCH v2] Re: OpenMP: Generate SIMD clones for functions with "declare target"

2022-09-21 Thread Sandra Loosemore

On 9/14/22 12:12, Jakub Jelinek wrote:


If it is pure optimization thing and purely keyed on the definition,
all the simd clones should be local to the TU, never exported from it.


OK, here is a revised patch that addresses that.  x86_64 target also 
generates a different set of clones for functions with internal linkage 
vs external so I hacked that to treat these implicit clones in the same 
way as other internal clones.


There is an existing problem with internal "declare simd" clones in that 
nothing ever DCEs clones that end up not being useful, or does a scan of 
the code in the compilation unit before clone generation to avoid 
generating useless clones in the first place.  I haven't tried to solve 
that problem, but I did attempt to mitigate it for these implicit 
"declare target" clones by tagging the option 
OPT_LEVELS_2_PLUS_SPEED_ONLY (instead of enabling it by default all the 
time) so the clones are not generated by default at -Os and -Og.  I 
added a couple new test cases to check this.


On 9/14/22 15:45, Thomas Schwinge wrote:

However, OpenACC and OpenMP support may be active at the same time...


+  if (attr == NULL_TREE
+  && flag_openmp_target_simd_clone && !flag_openacc)


..., so '!flag_openacc' is not the right check here.  Instead you'd do
'!oacc_get_fn_attrib (DECL_ATTRIBUTES (node->decl))' (untested) or
similar.


This is fixed now too.

OK to check in?

-SandraFrom dfdb9a2162978b964863f351c814211dca8e9a3f Mon Sep 17 00:00:00 2001
From: Sandra Loosemore 
Date: Thu, 22 Sep 2022 02:16:42 +
Subject: [PATCH] OpenMP: Generate SIMD clones for functions with "declare
 target"

This patch causes the IPA simdclone pass to generate clones for
functions with the "omp declare target" attribute as if they had
"omp declare simd", provided the function appears to be suitable for
SIMD execution.  The filter is conservative, rejecting functions
that write memory or that call other functions not known to be safe.
A new option -fopenmp-target-simd-clone is added to control this
transformation; it's enabled at -O2 and higher.

gcc/ChangeLog:

	* common.opt (fopenmp-target-simd-clone): New option.
	* opts.cc (default_options_table): Add -fopenmp-target-simd-clone.
	* doc/invoke.texi (-fopenmp-target-simd-clone): Document.
	* omp-simd-clone.cc (auto_simd_check_stmt): New function.
	(mark_auto_simd_clone): New function.
	(simd_clone_create): Add force_local argument, make the symbol
	have internal linkage if it is true.
	(expand_simd_clones): Also check for cloneable functions with
	"omp declare target".  Pass explicit_p argument to
	simd_clone.compute_vecsize_and_simdlen target hook.
	* target.def (TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN):
	Add bool explicit_p argument.
	* doc/tm.texi: Regenerated.
	* config/aarch64/aarch64.cc
	(aarch64_simd_clone_compute_vecsize_and_simdlen): Update.
	* config/gcn/gcn.cc
	(gcn_simd_clone_compute_vecsize_and_simdlen): Update.
	* config/i386/i386.cc
	(ix86_simd_clone_compute_vecsize_and_simdlen): Update.

gcc/testsuite/ChangeLog:

	* gcc.dg/gomp/target-simd-clone-1.c: New.
	* gcc.dg/gomp/target-simd-clone-2.c: New.
	* gcc.dg/gomp/target-simd-clone-3.c: New.
	* gcc.dg/gomp/target-simd-clone-4.c: New.
	* gcc.dg/gomp/target-simd-clone-5.c: New.
	* gcc.dg/gomp/target-simd-clone-6.c: New.
---
 gcc/common.opt|   4 +
 gcc/config/aarch64/aarch64.cc |  24 +-
 gcc/config/gcn/gcn.cc |  10 +-
 gcc/config/i386/i386.cc   |  27 +-
 gcc/doc/invoke.texi   |  12 +-
 gcc/doc/tm.texi   |   2 +-
 gcc/omp-simd-clone.cc | 237 --
 gcc/opts.cc   |   1 +
 gcc/target.def|   2 +-
 .../gcc.dg/gomp/target-simd-clone-1.c |  18 ++
 .../gcc.dg/gomp/target-simd-clone-2.c |  18 ++
 .../gcc.dg/gomp/target-simd-clone-3.c |  17 ++
 .../gcc.dg/gomp/target-simd-clone-4.c |  16 ++
 .../gcc.dg/gomp/target-simd-clone-5.c |  13 +
 .../gcc.dg/gomp/target-simd-clone-6.c |  13 +
 15 files changed, 362 insertions(+), 52 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/gomp/target-simd-clone-1.c
 create mode 100644 gcc/testsuite/gcc.dg/gomp/target-simd-clone-2.c
 create mode 100644 gcc/testsuite/gcc.dg/gomp/target-simd-clone-3.c
 create mode 100644 gcc/testsuite/gcc.dg/gomp/target-simd-clone-4.c
 create mode 100644 gcc/testsuite/gcc.dg/gomp/target-simd-clone-5.c
 create mode 100644 gcc/testsuite/gcc.dg/gomp/target-simd-clone-6.c

diff --git a/gcc/common.opt b/gcc/common.opt
index fba90ff6dcb..c735c62a8d4 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2217,6 +2217,10 @@ fomit-frame-pointer
 Common Var(flag_omit_frame_pointer) Optimization
 When possible do not generate stack frames.
 
+fopenmp-target-simd-clone
+Common Var(flag_openmp_target_simd_clone) Optimization
+Generate SIMD clones for