Hi!

On 2021-03-02T04:20:13-0800, Julian Brown <jul...@codesourcery.com> wrote:
> This patch enables worker-partitioning support via gimple rewriting for
> AMD GCN.

Thanks!

> Older (and currently unused) parts of this support are already
> present in the AMD GCN backend: those vestigial parts are enabled or
> updated, as appropriate.

..., and some of that moved into the "openacc: Middle-end
worker-partitioning support" commit.

A few of the test suite changes have already been resolved via other
commits.  And, on the other hand, a few more were necessary now.

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
> @@ -79,13 +79,18 @@ int check (const int *ary, int size, int gp, int wp, int 
> vp)
>       exit = 1;
>        }
>
> +#ifndef ACC_DEVICE_TYPE_radeon
> +  /* AMD GCN uses the autovectorizer for the vector dimension: the use
> +     of a function call in vector-partitioned code in this test is not
> +     currently supported.  */
>    for (ix = 0; ix < vp; ix++)
>      if (vectors[ix] != vectors[0])
>        {
>       printf ("vector %d not used %d times\n", ix, vectors[0]);
>       exit = 1;
>        }
> -
> +#endif
> +
>    return exit;
>  }

I removed this change (disabling 'vectors' checking for
'ACC_DEVICE_TYPE_radeon'), because:

> @@ -132,9 +137,7 @@ int main ()
>    /* AMD GCN uses the autovectorizer for the vector dimension: the use
>       of a function call in vector-partitioned code in this test is not
>       currently supported.  */
> -  /* AMD GCN does not currently support multiple workers.  This should be
> -     set to 16 when that changes.  */
> -  return test_1 (16, 1, 1);
> +  return test_1 (16, 16, 64);
>  #else
>    return test_1 (16, 16, 32);
>  #endif

... if continuing to specify 'vp = 1' for 'ACC_DEVICE_TYPE_radeon', the
above 'vectors' checking can stay as is.  (Similarly done in other test case
files.)

ACK for the 'wp = 1' to 'wp = 16' change, of course.

I found that 'libgomp.oacc-fortran/optional-reduction.f90' and
'libgomp.oacc-fortran/reduction-7.f90' need to be XFAILed for '-O0'.
I suppose that'll get resolved via the forthcoming "openacc:
Reference-typed reduction and private variable rewriting" changes.

Pushed "amdgcn: Enable OpenACC worker partitioning for AMD GCN" to master
branch in commit c408512e1f7ca07e07794dc13fd6dfd9d2d7e998, see attached.


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955
>From c408512e1f7ca07e07794dc13fd6dfd9d2d7e998 Mon Sep 17 00:00:00 2001
From: Julian Brown <jul...@codesourcery.com>
Date: Tue, 2 Mar 2021 04:20:13 -0800
Subject: [PATCH] amdgcn: Enable OpenACC worker partitioning for AMD GCN

	gcc/
	* config/gcn/gcn.c (gcn_init_builtins): Override decls for
	BUILT_IN_GOACC_SINGLE_START, BUILT_IN_GOACC_SINGLE_COPY_START,
	BUILT_IN_GOACC_SINGLE_COPY_END and BUILT_IN_GOACC_BARRIER.
	(gcn_goacc_validate_dims): Turn on worker partitioning unconditionally.
	(gcn_fork_join): Update comment.
	* config/gcn/gcn.opt (flag_worker_partitioning): Remove.
	(macc_experimental_workers): Remove unused option.
	libgomp/
	* plugin/plugin-gcn.c (gcn_exec): Change default number of workers to
	16.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
	[acc_device_radeon]: Update.
	* testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
	[ACC_DEVICE_TYPE_radeon]: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
	[acc_device_radeon]: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/routine-wv-2.c
	[ACC_DEVICE_TYPE_radeon]: Likewise.
	* testsuite/libgomp.oacc-fortran/optional-reduction.f90: XFAIL for
	'openacc_radeon_accel_selected' and '-O0'.
	* testsuite/libgomp.oacc-fortran/reduction-7.f90: Likewise.

Co-Authored-By: Kwok Cheung Yeung <k...@codesourcery.com>
Co-Authored-By: Thomas Schwinge <tho...@codesourcery.com>
---
 gcc/config/gcn/gcn.c                              | 15 +++------------
 gcc/config/gcn/gcn.opt                            |  5 -----
 libgomp/plugin/plugin-gcn.c                       |  3 +--
 .../acc_prof-kernels-1.c                          |  3 ---
 .../libgomp.oacc-c-c++-common/loop-dim-default.c  |  4 +---
 .../libgomp.oacc-c-c++-common/parallel-dims.c     | 12 ++++--------
 .../libgomp.oacc-c-c++-common/routine-wv-2.c      |  7 ++++---
 .../libgomp.oacc-fortran/optional-reduction.f90   |  3 +++
 .../libgomp.oacc-fortran/reduction-7.f90          |  3 +++
 9 files changed, 19 insertions(+), 36 deletions(-)

diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index 87af5d18f42..9df28277498 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -3712,8 +3712,6 @@ gcn_init_builtins (void)
       TREE_NOTHROW (gcn_builtin_decls[i]) = 1;
     }
 
-/* FIXME: remove the ifdef once OpenACC support is merged upstream.  */
-#ifdef BUILT_IN_GOACC_SINGLE_START
   /* These builtins need to take/return an LDS pointer: override the generic
      versions here.  */
 
@@ -3730,7 +3728,6 @@ gcn_init_builtins (void)
 
   set_builtin_decl (BUILT_IN_GOACC_BARRIER,
 		    gcn_builtin_decls[GCN_BUILTIN_ACC_BARRIER], false);
-#endif
 }
 
 /* Implement TARGET_INIT_LIBFUNCS.  */
@@ -5019,11 +5016,7 @@ gcn_goacc_validate_dims (tree decl, int dims[], int fn_level,
 			 unsigned /*used*/)
 {
   bool changed = false;
-
-  /* FIXME: remove -facc-experimental-workers when they're ready.  */
-  int max_workers = flag_worker_partitioning ? 16 : 1;
-
-  gcc_assert (!flag_worker_partitioning);
+  const int max_workers = 16;
 
   /* The vector size must appear to be 64, to the user, unless this is a
      SEQ routine.  The real, internal value is always 1, which means use
@@ -5060,8 +5053,7 @@ gcn_goacc_validate_dims (tree decl, int dims[], int fn_level,
     {
       dims[GOMP_DIM_VECTOR] = GCN_DEFAULT_VECTORS;
       if (dims[GOMP_DIM_WORKER] < 0)
-	dims[GOMP_DIM_WORKER] = (flag_worker_partitioning
-				 ? GCN_DEFAULT_WORKERS : 1);
+	dims[GOMP_DIM_WORKER] = GCN_DEFAULT_WORKERS;
       if (dims[GOMP_DIM_GANG] < 0)
 	dims[GOMP_DIM_GANG] = GCN_DEFAULT_GANGS;
       changed = true;
@@ -5126,8 +5118,7 @@ static bool
 gcn_fork_join (gcall *ARG_UNUSED (call), const int *ARG_UNUSED (dims),
 	       bool ARG_UNUSED (is_fork))
 {
-  /* GCN does not use the fork/join concept invented for NVPTX.
-     Instead we use standard autovectorization.  */
+  /* GCN does not need to expand fork/join markers at the RTL level.  */
   return false;
 }
 
diff --git a/gcc/config/gcn/gcn.opt b/gcc/config/gcn/gcn.opt
index b2b10b0794c..6faacca42bb 100644
--- a/gcc/config/gcn/gcn.opt
+++ b/gcc/config/gcn/gcn.opt
@@ -62,11 +62,6 @@ bool flag_bypass_init_error = false
 mbypass-init-error
 Target RejectNegative Var(flag_bypass_init_error)
 
-bool flag_worker_partitioning = false
-
-macc-experimental-workers
-Target Var(flag_worker_partitioning) Init(0)
-
 int stack_size_opt = -1
 
 mstack-size=
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index f26d7361106..9e7377c91f9 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -3038,8 +3038,7 @@ gcn_exec (struct kernel_info *kernel, size_t mapnum, void **hostaddrs,
      64 gangs matches a typical Fiji device.  */
 
   if (dims[0] == 0) dims[0] = get_cu_count (kernel->agent); /* Gangs.  */
-  /* NOTE: Until support for middle-end worker partitioning is merged, force 'num_workers (1)'.  */
-  if (/*TODO dims[1] == 0*/ true) dims[1] = 1;  /* Workers.  */
+  if (dims[1] == 0) dims[1] = 16; /* Workers.  */
 
   /* The incoming dimensions are expressed in terms of gangs, workers, and
      vectors.  The HSA dimensions are expressed in terms of "work-items",
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
index 6c136c26c93..ad33f72e2fb 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
@@ -93,9 +93,6 @@ static void cb_enqueue_launch_start (acc_prof_info *prof_info, acc_event_info *e
     }
   if (num_workers < 1)
     assert (event_info->launch_event.num_workers >= 1);
-  /* GCN currently enforces 'num_workers (1)'.  */
-  else if (acc_device_type == acc_device_radeon)
-    assert (event_info->launch_event.num_workers == 1);
   else
     {
 #ifdef __OPTIMIZE__
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
index ca771646655..419bc33ad53 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
@@ -132,9 +132,7 @@ int main ()
   /* AMD GCN uses the autovectorizer for the vector dimension: the use
      of a function call in vector-partitioned code in this test is not
      currently supported.  */
-  /* AMD GCN does not currently support multiple workers.  This should be
-     set to 16 when that changes.  */
-  return test_1 (16, 1, 1);
+  return test_1 (16, 16, 1);
 #else
   return test_1 (16, 16, 32);
 #endif
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
index fe0dacd5aac..9392e1d88c5 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
@@ -261,9 +261,8 @@ int main ()
 	}
       else if (acc_on_device (acc_device_radeon))
 	{
-	  /* The GCC GCN back end is limited to num_workers (16).
-	     Temporarily set this to 1 until multiple workers are permitted. */
-	  workers_actual = 1; // 16;
+	  /* The GCC GCN back end is limited to num_workers (16).  */
+	  workers_actual = 16;
 	}
       else
 	__builtin_abort ();
@@ -313,9 +312,8 @@ int main ()
 	}
       else if (acc_on_device (acc_device_radeon))
 	{
-	  /* The GCC GCN back end is limited to num_workers (16).
-	     Temporarily set this to 1 until multiple workers are permitted. */
-	  workers_actual = 1; // 16;
+	  /* The GCC GCN back end is limited to num_workers (16).  */
+	  workers_actual = 16;
 	}
       else
 	__builtin_abort ();
@@ -465,8 +463,6 @@ int main ()
 	}
       else if (acc_on_device (acc_device_radeon))
 	{
-	  /* Temporary setting, until multiple workers are permitted.  */
-	  workers_actual = 1;
 	  /* See above comments about GCN vectors_actual.  */
 	  vectors_actual = 1;
 	}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-wv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-wv-2.c
index 624ec24e437..4f88b1c0779 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-wv-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-wv-2.c
@@ -5,12 +5,13 @@
 #include <openacc.h>
 #include <gomp-constants.h>
 
+#define NUM_WORKERS 16
 #ifdef ACC_DEVICE_TYPE_radeon
-/* Temporarily set this to 1 until multiple workers are permitted.  */
-#define NUM_WORKERS 1
+/* AMD GCN uses the autovectorizer for the vector dimension: the use
+   of a function call in vector-partitioned code in this test is not
+   currently supported.  */
 #define NUM_VECTORS 1
 #else
-#define NUM_WORKERS 16
 #define NUM_VECTORS 32
 #endif
 #define WIDTH 64
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/optional-reduction.f90 b/libgomp/testsuite/libgomp.oacc-fortran/optional-reduction.f90
index 29f92c0d4c3..69b69b66c71 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/optional-reduction.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/optional-reduction.f90
@@ -4,6 +4,9 @@
 
 ! { dg-do run }
 
+!TODO
+! { dg-xfail-run-if TODO { openacc_radeon_accel_selected && { ! __OPTIMIZE__ } } }
+
 program optional_reduction
   implicit none
 
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/reduction-7.f90 b/libgomp/testsuite/libgomp.oacc-fortran/reduction-7.f90
index 8cffac93a22..a8b0c60e420 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/reduction-7.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-7.f90
@@ -1,5 +1,8 @@
 ! { dg-do run }
 
+!TODO
+! { dg-xfail-run-if TODO { openacc_radeon_accel_selected && { ! __OPTIMIZE__ } } }
+
 ! subroutine reduction with private and firstprivate variables
 
 program reduction
-- 
2.30.2

Reply via email to