Re: RE: [PATCH v7] libgfortran: Replace mutex with rwlock

2023-12-21 Thread Lipeng Zhu

Hi Thomas,

On 2023/12/21 19:42, Thomas Schwinge wrote:

Hi!

On 2023-12-13T21:52:29+0100, I wrote:

On 2023-12-12T02:05:26+, "Zhu, Lipeng"  wrote:

On 2023/12/12 1:45, H.J. Lu wrote:

On Sat, Dec 9, 2023 at 7:25 PM Zhu, Lipeng  wrote:

On 2023/12/9 23:23, Jakub Jelinek wrote:

On Sat, Dec 09, 2023 at 10:39:45AM -0500, Lipeng Zhu wrote:

This patch introduces an rwlock and splits read and write access to the
unit_root tree and unit_cache, using the rwlock instead of the mutex, to
increase CPU efficiency. In the get_gfc_unit function, only around 30% of
calls step into the insert_unit function; in most instances the unit can
be found while merely reading the unit_cache or unit_root tree. So
splitting the read and write phases with an rwlock is an approach to make
this path more parallel.

BTW, the IPC metric gains around 9x on our test server with 220 cores.
The benchmark we used is
https://github.com/rwesson/NEAT
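
For reference, the locking pattern here is the usual read-mostly one: look
up under a shared (read) lock and only take the exclusive (write) lock when
an insert is actually needed, re-checking after acquiring it. Below is a
self-contained sketch of that pattern using plain POSIX rwlocks and a toy
array "cache"; it only illustrates the idea, it is not the libgfortran
code, and the names (lookup_unit, find_slot, CACHE_SIZE) are invented for
the example.

#include <pthread.h>
#include <stddef.h>

#define CACHE_SIZE 16

static pthread_rwlock_t cache_rwlock = PTHREAD_RWLOCK_INITIALIZER;
static int cache[CACHE_SIZE];   /* 0 means "empty slot".  */

static int *
find_slot (int n)
{
  for (size_t i = 0; i < CACHE_SIZE; i++)
    if (cache[i] == n)
      return &cache[i];
  return NULL;
}

static int *
lookup_unit (int n)
{
  /* Fast path (the ~70% case from the profile above): readers share the
     lock, so concurrent lookups no longer serialize on a single mutex.  */
  pthread_rwlock_rdlock (&cache_rwlock);
  int *found = find_slot (n);
  pthread_rwlock_unlock (&cache_rwlock);
  if (found)
    return found;

  /* Slow path: take the exclusive lock, re-check (another thread may have
     inserted the entry meanwhile), then insert into a free slot.  */
  pthread_rwlock_wrlock (&cache_rwlock);
  found = find_slot (n);
  if (!found)
    {
      found = find_slot (0);    /* First free slot, if any.  */
      if (found)
        *found = n;
    }
  pthread_rwlock_unlock (&cache_rwlock);
  return found;
}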



Ok for trunk, thanks.



Thanks! Looking forward to it landing on trunk.



Pushed for you.



I've just filed 
"'libgomp.fortran/rwlock_1.f90', 'libgomp.fortran/rwlock_3.f90' execution test 
timeouts".
Would you be able to look into that?


See my update in there.


Grüße
  Thomas




Since I don't have a GCC Bugzilla account, I'll reply in this thread:
should the tests limit themselves to a lower 'OMP_NUM_THREADS', or should
we increase the execution timeout?


However, I can't reproduce the execution timeout failure on either a
POWER9 or a POWER10 machine. I also tried decreasing the CPU frequency
from 2.6 GHz to 800 MHz, and these test cases still run successfully.


> so only a little bit of an improvement of the new "rwlock" libgfortran
> vs. old "mutex" GCC 10 one, curiously.  (But supposedly that depends on
> the hardware or other factors?)


The rwlock can increase the IPC a lot; maybe the effect just isn't obvious
in the wall time you listed.


$ grep ^cpu < /proc/cpuinfo | uniq -c

192 cpu : POWER10 (architected), altivec supported

Native configuration is powerpc64le-unknown-linux-gnu

Schedule of variations:
unix

PASS: libgomp.fortran/rwlock_1.f90   -O0  (test for excess errors)
PASS: libgomp.fortran/rwlock_1.f90   -O0  execution test
PASS: libgomp.fortran/rwlock_1.f90   -O1  (test for excess errors)
PASS: libgomp.fortran/rwlock_1.f90   -O1  execution test
PASS: libgomp.fortran/rwlock_1.f90   -O2  (test for excess errors)
PASS: libgomp.fortran/rwlock_1.f90   -O2  execution test
PASS: libgomp.fortran/rwlock_1.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for 
excess errors)
PASS: libgomp.fortran/rwlock_1.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  execution test

PASS: libgomp.fortran/rwlock_1.f90   -O3 -g  (test for excess errors)
PASS: libgomp.fortran/rwlock_1.f90   -O3 -g  execution test
PASS: libgomp.fortran/rwlock_1.f90   -Os  (test for excess errors)
PASS: libgomp.fortran/rwlock_1.f90   -Os  execution test
PASS: libgomp.fortran/rwlock_2.f90   -O0  (test for excess errors)
PASS: libgomp.fortran/rwlock_2.f90   -O0  execution test
PASS: libgomp.fortran/rwlock_2.f90   -O1  (test for excess errors)
PASS: libgomp.fortran/rwlock_2.f90   -O1  execution test
PASS: libgomp.fortran/rwlock_2.f90   -O2  (test for excess errors)
PASS: libgomp.fortran/rwlock_2.f90   -O2  execution test
PASS: libgomp.fortran/rwlock_2.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for 
excess errors)
PASS: libgomp.fortran/rwlock_2.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  execution test

PASS: libgomp.fortran/rwlock_2.f90   -O3 -g  (test for excess errors)
PASS: libgomp.fortran/rwlock_2.f90   -O3 -g  execution test
PASS: libgomp.fortran/rwlock_2.f90   -Os  (test for excess errors)
PASS: libgomp.fortran/rwlock_2.f90   -Os  execution test
PASS: libgomp.fortran/rwlock_3.f90   -O0  (test for excess errors)
PASS: libgomp.fortran/rwlock_3.f90   -O0  execution test
PASS: libgomp.fortran/rwlock_3.f90   -O1  (test for excess errors)
PASS: libgomp.fortran/rwlock_3.f90   -O1  execution test
PASS: libgomp.fortran/rwlock_3.f90   -O2  (test for excess errors)
PASS: libgomp.fortran/rwlock_3.f90   -O2  execution test
PASS: libgomp.fortran/rwlock_3.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for 
excess errors)
PASS: libgomp.fortran/rwlock_3.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  execution test


Lipeng Zhu



[PATCH] libgfortran: Fix bug when HAVE_ATOMIC_FETCH_ADD is not defined

2023-12-21 Thread Lipeng Zhu
This patch fixes a bug in the dec_waiting_unlocked function when
HAVE_ATOMIC_FETCH_ADD is not defined.
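
For context, the WRLOCK and RWUNLOCK helpers this function used before are
presumably defined elsewhere in libgfortran roughly along the following
lines (only a sketch to show why the fallback needs both an rwlock and a
mutex branch; the actual definitions may differ in detail):

#ifdef __GTHREAD_RWLOCK_INIT
#define WRLOCK(rwlock) __gthread_rwlock_wrlock (rwlock)
#define RWUNLOCK(rwlock) __gthread_rwlock_unlock (rwlock)
#else
/* Without gthread rwlock support, unit_rwlock is a plain mutex.  */
#define WRLOCK(mutex) __gthread_mutex_lock (mutex)
#define RWUNLOCK(mutex) __gthread_mutex_unlock (mutex)
#endif

Since io.h apparently cannot use those macros directly, the fix below
open-codes the equivalent __gthread_* calls.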

libgfortran/ChangeLog:

* io/io.h (dec_waiting_unlocked): Use
__gthread_rwlock_wrlock/__gthread_rwlock_unlock or
__gthread_mutex_lock/__gthread_mutex_unlock functions
to replace WRLOCK and RWUNLOCK macros.

Signed-off-by: Lipeng Zhu 
---
 libgfortran/io/io.h | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index 15daa0995b1..c7f0f7d7d9e 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -1020,9 +1020,15 @@ dec_waiting_unlocked (gfc_unit *u)
 #ifdef HAVE_ATOMIC_FETCH_ADD
   (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
 #else
-  WRLOCK (&unit_rwlock);
+#ifdef __GTHREAD_RWLOCK_INIT
+  __gthread_rwlock_wrlock (&unit_rwlock);
+  u->waiting--;
+  __gthread_rwlock_unlock (&unit_rwlock);
+#else
+  __gthread_mutex_lock (&unit_rwlock);
   u->waiting--;
-  RWUNLOCK (&unit_rwlock);
+  __gthread_mutex_unlock (&unit_rwlock);
+#endif
 #endif
 }
 
-- 
2.39.3





Re: [PATCH v7 5/5] OpenMP/OpenACC: Reorganise OMP map clause handling in gimplify.cc

2023-12-21 Thread Tobias Burnus

Hi Julian,

On 20.12.23 22:29, Julian Brown wrote:

Thanks for the review! Here's a new version of the patch, which hopefully
addresses this round of comments.


Thanks for the patch. LGTM now.

Tobias


On Tue, 19 Dec 2023 16:41:54 +0100
Tobias Burnus  wrote:


On 16.12.23 14:25, Julian Brown wrote:

--- a/gcc/gimplify.cc
+++ b/gcc/gimplify.cc
@@ -10107,6 +10114,20 @@ omp_segregate_mapping_groups (omp_mapping_group *inlist)
 	  ard_tail = &w->next;
 	  break;
 
+	case GOMP_MAP_PRESENT_ALLOC:
+	  *pa_tail = w;
+	  w->next = NULL;
+	  pa_tail = &w->next;
+	  break;
+
+	case GOMP_MAP_PRESENT_FROM:
+	case GOMP_MAP_PRESENT_TO:
+	case GOMP_MAP_PRESENT_TOFROM:
+	  *ptf_tail = w;
+	  w->next = NULL;
+	  ptf_tail = &w->next;
+	  break;
+
+

First, I note that GOMP_MAP_PRESENT_ALLOC and
GOMP_MAP_PRESENT_{FROM,TO,TOFROM} are semantically identical: If the
variable is not present, error termination will happen - otherwise, if
present, no data movement will happen. Hence, they will be changed to
GOMP_MAP_FORCE_PRESENT in gimplify_adjust_omp_clauses.

That's also the reason the old code handled all of them identically.

However, besides a plain 'present' there is also 'always' + 'present'.
Those are different: after the usual 'present' check (abort if not
present), the data copying does happen for GOMP_MAP_ALWAYS_PRESENT_TO,
GOMP_MAP_ALWAYS_PRESENT_FROM and GOMP_MAP_ALWAYS_PRESENT_TOFROM.

(Note that 'always' + 'present' + 'alloc' = GOMP_MAP_PRESENT_ALLOC (w/o
'always'), as already done in the FE.)

Thus, all the 'case' labels from your patch should go into a single group
(possibly with a comment about it). The question is what to do with the
'present, always' cases. I think leaving them under 'default:' is fine,
but I might have missed something.

I've made this change (i.e. grouping all "GOMP_MAP_PRESENT_*" nodes
together), and in fact that restores the dump output for the
gfortran.dg/gomp/map-12.f90 test that needed to be adjusted for the
previous version of the patch (so that hunk has now disappeared).
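
Roughly, the grouped handling discussed above would look like the
following (a sketch only; it keeps the list-tail names from the posted
hunk and may differ from the committed code):

	case GOMP_MAP_PRESENT_ALLOC:
	case GOMP_MAP_PRESENT_FROM:
	case GOMP_MAP_PRESENT_TO:
	case GOMP_MAP_PRESENT_TOFROM:
	  /* All plain "present" kinds: error termination if the data is
	     absent, no data movement otherwise, so one group suffices.  */
	  *pa_tail = w;
	  w->next = NULL;
	  pa_tail = &w->next;
	  break;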


 	default:
 	  *tf_tail = w;
 	  w->next = NULL;
@@ -10118,8 +10139,10 @@ omp_segregate_mapping_groups (omp_mapping_group *inlist)

* * *

@@ -11922,119 +11945,30 @@ gimplify_scan_omp_clauses (tree *list_p, gimple_seq *pre_p,
 	  break;
 	}
 
-  if (code == OMP_TARGET
-      || code == OMP_TARGET_DATA
-      || code == OMP_TARGET_ENTER_DATA
-      || code == OMP_TARGET_EXIT_DATA)
-    {
-      vec<omp_mapping_group> *groups;
-      groups = omp_gather_mapping_groups (list_p);
-      if (groups)
-	{
-	  hash_map<tree_operand_hash_no_se, omp_mapping_group *> *grpmap;
-	  grpmap = omp_index_mapping_groups (groups);
+  vec<omp_mapping_group> *groups = omp_gather_mapping_groups (list_p);
+  hash_map<tree_operand_hash_no_se, omp_mapping_group *> *grpmap = NULL;
+  unsigned grpnum = 0;
+  tree *grp_start_p = NULL, grp_end = NULL_TREE;

...

-  else if (region_type & ORT_ACC)
-    {

I wonder whether it wouldn't be better to call
'omp_gather_mapping_groups' only for 'code == OMP_TARGET...' and for
ORT_ACC (or some subset of OACC_*), given that this function is also
called by gimplify_omp_parallel, gimplify_omp_task, gimplify_omp_for, ...

That would avoid some memory allocation and list_p walking; i.e., the
current approach is not too bad, but the work is also not really needed
for task, parallel, for, ...

I've made that change -- OpenACC uses OMP_CLAUSE_MAP in quite a wide
range of directives, but the new version of the patch lists them
individually anyway, rather than using a catch-all for ORT_ACC regions.
That seems OK, I think.
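
A sketch of the guard being discussed (the local variable name and the
exact set of OpenACC constructs are illustrative, not the committed code):

  /* Only do the group gathering/indexing for constructs that actually
     reorder map clauses here.  */
  bool want_map_groups
    = (code == OMP_TARGET
       || code == OMP_TARGET_DATA
       || code == OMP_TARGET_ENTER_DATA
       || code == OMP_TARGET_EXIT_DATA
       || code == OACC_DATA
       || code == OACC_ENTER_DATA
       || code == OACC_EXIT_DATA
       || code == OACC_HOST_DATA
       /* ... plus the other OpenACC constructs that take map clauses,
	  listed individually in the new patch version ...  */);

  vec<omp_mapping_group> *groups
    = want_map_groups ? omp_gather_mapping_groups (list_p) : NULL;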


@@ -14008,26 +13926,73 @@ gimplify_adjust_omp_clauses (gimple_seq *pre_p, gimple_seq body, tree *list_p,
 	    default:
 	      break;
 	    }
-	  if (code == OMP_TARGET_EXIT_DATA
-	      && OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_ALWAYS_POINTER)
+	  switch (code)
 	    {
+	    case OMP_TARGET:
+	      break;
+	    case OACC_DATA:
+	      if (TREE_CODE (TREE_TYPE (decl)) != ARRAY_TYPE)
+		break;
+	      goto check_firstprivate;
+	    case OACC_ENTER_DATA:
+	    case OACC_EXIT_DATA:
+	    case OMP_TARGET_DATA:
+	    case OMP_TARGET_ENTER_DATA:
+	    case OMP_TARGET_EXIT_DATA:
+	    case OACC_HOST_DATA:
+	    check_firstprivate:
+	      if (OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_FIRSTPRIVATE_POINTER

I think it looks nicer if OACC_HOST_DATA comes before the OMP_* cases so
that all the OACC_* cases are together. (In the old code,
OACC_ENTER_DATA/OACC_EXIT_DATA were treated differently from OMP_* and
OACC_HOST_DATA; your order is a leftover from that code movement/change.)

I've fixed this bit -- which actually doesn't need the goto any more
either, so that's now a fallthrough instead.
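
For illustration, the reordered switch would presumably look roughly like
this (a sketch of the shape being described, with the firstprivate check
elided; not the committed hunk):

	  switch (code)
	    {
	    case OMP_TARGET:
	      break;
	    case OACC_DATA:
	      if (TREE_CODE (TREE_TYPE (decl)) != ARRAY_TYPE)
		break;
	      /* FALLTHRU */
	    case OACC_ENTER_DATA:
	    case OACC_EXIT_DATA:
	    case OACC_HOST_DATA:
	    case OMP_TARGET_DATA:
	    case OMP_TARGET_ENTER_DATA:
	    case OMP_TARGET_EXIT_DATA:
	      if (OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_FIRSTPRIVATE_POINTER
		  /* ... further conditions as in the posted hunk ...  */)
		{
		  /* ... */
		}
	      break;
	    default:
	      break;
	    }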


+	  flags = GOVD_MAP | GOVD_EXPLICIT;
+	  if (OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_ALWAYS_TO
+	      || OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_ALWAYS_TOFROM)
+	    flags |= GOVD_MAP_ALWAYS_TO;

I know that the code has only been moved, but I wonder whether that check
should also include GOMP_MAP_ALWAYS_PRESENT_{TO,TOFROM} in the condition.

I've added it (caveat: without any tests).
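
Presumably the added check now looks something like the following (a guess
at the shape against the hunk quoted above, untested and not quoted from
the new patch):

+	  flags = GOVD_MAP | GOVD_EXPLICIT;
+	  if (OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_ALWAYS_TO
+	      || OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_ALWAYS_TOFROM
+	      || OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_ALWAYS_PRESENT_TO
+	      || OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_ALWAYS_PRESENT_TOFROM)
+	    flags |= GOVD_MAP_ALWAYS_TO;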