RE: [PATCH v4] libgfortran: Replace mutex with rwlock
> Hi Thomas, > > > > > Hi Lipeng, > > > > > May I know any comment or concern on this patch, thanks for your > > > time > > > > > > > Thanks for your patience in getting this reviewed. > > > > A few remarks / questions. > > > > Which strategy is used in this implementation, read-preferring or > > write- preferring? And if read-preferring is used, is there a danger > > of deadlock if people do unreasonable things? > > Maybe you could explain that, also in a comment in the code. > > > > Yes, the implementation use the read-preferring strategy, and comments in > code. > When adding the test cases, I didn’t meet the situation which may cause the > deadlock. > Maybe you can give more guidance about that. > > > Can you add some sort of torture test case(s) which does a lot of > > opening/closing/reading/writing, possibly with asynchronous I/O and/or > > pthreads, to catch possible problems? If there is a system dependency > > or some race condition, chances are that regression testers will catch this. > > > > Sure, as your comments, in the patch V6, I added 3 test cases with OpenMP to > test different cases in concurrency respectively: > 1. find and create unit very frequently to stress read lock and write lock. > 2. only access the unit which exist in cache to stress read lock. > 3. access the same unit in concurrency. > For the third test case, it also help to find a bug: When unit can't be > found in > cache nor unit list in read phase, then threads will try to acquire write > lock to > insert the same unit, this will cause duplicate key error. > To fix this bug, I get the unit from unit list once again before insert in > write lock. > More details you can refer the patch v6. > Could you help to review this update? I really appreciate your assistance. > > With this, the libgfortran parts are OK, unless somebody else has more > > comments, so give this a couple of days. I cannot approve the libgcc > > parts, that would be somebody else (Jakub?) > > > > Best regards > > > > Thomas > > > > Best Regards, > Lipeng Zhu Best Regards, Lipeng Zhu
[PATCH v6] libgfortran: Replace mutex with rwlock
From: Lipeng Zhu This patch try to introduce the rwlock and split the read/write to unit_root tree and unit_cache with rwlock instead of the mutex to increase CPU efficiency. In the get_gfc_unit function, the percentage to step into the insert_unit function is around 30%, in most instances, we can get the unit in the phase of reading the unit_cache or unit_root tree. So split the read/write phase by rwlock would be an approach to make it more parallel. BTW, the IPC metrics can gain around 9x in our test server with 220 cores. The benchmark we used is https://github.com/rwesson/NEAT libgcc/ChangeLog: * gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro (__gthrw): New function (__gthread_rwlock_rdlock): New function (__gthread_rwlock_tryrdlock): New function (__gthread_rwlock_wrlock): New function (__gthread_rwlock_trywrlock): New function (__gthread_rwlock_unlock): New function libgfortran/ChangeLog: * io/async.c (DEBUG_LINE): New * io/async.h (RWLOCK_DEBUG_ADD): New macro (CHECK_RDLOCK): New macro (CHECK_WRLOCK): New macro (TAIL_RWLOCK_DEBUG_QUEUE): New macro (IN_RWLOCK_DEBUG_QUEUE): New macro (RDLOCK): New macro (WRLOCK): New macro (RWUNLOCK): New macro (RD_TO_WRLOCK): New macro (INTERN_RDLOCK): New macro (INTERN_WRLOCK): New macro (INTERN_RWUNLOCK): New macro * io/io.h (internal_proto): Define unit_rwlock * io/transfer.c (st_read_done_worker): Relace unit_lock with unit_rwlock (st_write_done_worker): Relace unit_lock with unit_rwlock * io/unit.c (get_gfc_unit): Relace unit_lock with unit_rwlock (if): Relace unit_lock with unit_rwlock (close_unit_1): Relace unit_lock with unit_rwlock (close_units): Relace unit_lock with unit_rwlock (newunit_alloc): Relace unit_lock with unit_rwlock * io/unix.c (flush_all_units): Relace unit_lock with unit_rwlock --- v1 -> v2: Limit the pthread_rwlock usage in libgcc only when __cplusplus isn't defined. v2 -> v3: Rebase the patch with trunk branch. v3 -> v4: Update the comments. v4 -> v5: Fix typos and code formatter. v5 -> v6: Add unit tests. Reviewed-by: Hongjiu Lu Reviewed-by: Bernhard Reutner-Fischer Reviewed-by: Thomas Koenig Signed-off-by: Zhu, Lipeng --- libgcc/gthr-posix.h | 60 +++ libgfortran/io/async.c| 4 + libgfortran/io/async.h| 154 +- libgfortran/io/io.h | 15 +- libgfortran/io/transfer.c | 8 +- libgfortran/io/unit.c | 116 - libgfortran/io/unix.c | 16 +- .../testsuite/libgomp.fortran/rwlock_1.f90| 33 .../testsuite/libgomp.fortran/rwlock_2.f90| 22 +++ .../testsuite/libgomp.fortran/rwlock_3.f90| 18 ++ 10 files changed, 387 insertions(+), 59 deletions(-) create mode 100644 libgomp/testsuite/libgomp.fortran/rwlock_1.f90 create mode 100644 libgomp/testsuite/libgomp.fortran/rwlock_2.f90 create mode 100644 libgomp/testsuite/libgomp.fortran/rwlock_3.f90 diff --git a/libgcc/gthr-posix.h b/libgcc/gthr-posix.h index aebcfdd9f4c..73283082997 100644 --- a/libgcc/gthr-posix.h +++ b/libgcc/gthr-posix.h @@ -48,6 +48,9 @@ typedef pthread_t __gthread_t; typedef pthread_key_t __gthread_key_t; typedef pthread_once_t __gthread_once_t; typedef pthread_mutex_t __gthread_mutex_t; +#ifndef __cplusplus +typedef pthread_rwlock_t __gthread_rwlock_t; +#endif typedef pthread_mutex_t __gthread_recursive_mutex_t; typedef pthread_cond_t __gthread_cond_t; typedef struct timespec __gthread_time_t; @@ -58,6 +61,9 @@ typedef struct timespec __gthread_time_t; #define __GTHREAD_MUTEX_INIT PTHREAD_MUTEX_INITIALIZER #define __GTHREAD_MUTEX_INIT_FUNCTION __gthread_mutex_init_function +#ifndef __cplusplus +#define __GTHREAD_RWLOCK_INIT PTHREAD_RWLOCK_INITIALIZER +#endif #define __GTHREAD_ONCE_INIT PTHREAD_ONCE_INIT #if defined(PTHREAD_RECURSIVE_MUTEX_INITIALIZER) #define __GTHREAD_RECURSIVE_MUTEX_INIT PTHREAD_RECURSIVE_MUTEX_INITIALIZER @@ -135,6 +141,13 @@ __gthrw(pthread_mutexattr_init) __gthrw(pthread_mutexattr_settype) __gthrw(pthread_mutexattr_destroy) +#ifndef __cplusplus +__gthrw(pthread_rwlock_rdlock) +__gthrw(pthread_rwlock_tryrdlock) +__gthrw(pthread_rwlock_wrlock) +__gthrw(pthread_rwlock_trywrlock) +__gthrw(pthread_rwlock_unlock) +#endif #if defined(_LIBOBJC) || defined(_LIBOBJC_WEAK) /* Objective-C. */ @@ -885,6 +898,53 @@ __gthread_cond_destroy (__gthread_cond_t* __cond) return __gthrw_(pthread_cond_destroy) (__cond); } +#ifndef __cplusplus +static inline int +__gthread_rwlock_rdlock (__gthread_rwlock_t *__rwlock) +{ + if (__gthread_active_p ()) +return __gthrw_(pthread_rwlock_rdlock) (__rwlock); + else +return 0; +} + +static inline int +__gthread_rwlock_tryrdlock (__gthread_rwlock_t *__rwlock) +{ + if (__gthread_active_p ()) +return __gthrw_(pthread_rwlock_tryrdlock) (__rwlock); +
RE: [PATCH v4] libgfortran: Replace mutex with rwlock
Hi Thomas, > > Hi Lipeng, > > > May I know any comment or concern on this patch, thanks for your time > > > > Thanks for your patience in getting this reviewed. > > A few remarks / questions. > > Which strategy is used in this implementation, read-preferring or write- > preferring? And if read-preferring is used, is there a danger of deadlock if > people do unreasonable things? > Maybe you could explain that, also in a comment in the code. > Yes, the implementation use the read-preferring strategy, and comments in code. When adding the test cases, I didn’t meet the situation which may cause the deadlock. Maybe you can give more guidance about that. > Can you add some sort of torture test case(s) which does a lot of > opening/closing/reading/writing, possibly with asynchronous I/O and/or > pthreads, to catch possible problems? If there is a system dependency or > some race condition, chances are that regression testers will catch this. > Sure, as your comments, in the patch V6, I added 3 test cases with OpenMP to test different cases in concurrency respectively: 1. find and create unit very frequently to stress read lock and write lock. 2. only access the unit which exist in cache to stress read lock. 3. access the same unit in concurrency. For the third test case, it also help to find a bug: When unit can't be found in cache nor unit list in read phase, then threads will try to acquire write lock to insert the same unit, this will cause duplicate key error. To fix this bug, I get the unit from unit list once again before insert in write lock. More details you can refer the patch v6. > With this, the libgfortran parts are OK, unless somebody else has more > comments, so give this a couple of days. I cannot approve the libgcc parts, > that would be somebody else (Jakub?) > > Best regards > > Thomas > Best Regards, Lipeng Zhu
Re: [PATCH v4] libgfortran: Replace mutex with rwlock
On 1/1/1970 8:00 AM, Thomas Koenig wrote: Hi Lipeng, May I know any comment or concern on this patch, thanks for your time :) Thanks for your patience in getting this reviewed. A few remarks / questions. Which strategy is used in this implementation, read-preferring or write-preferring? And if read- preferring is used, is there a danger of deadlock if people do unreasonable things? Maybe you could explain that, also in a comment in the code > Can you add some sort of torture test case(s) which does a lot of opening/closing/reading/writing, possibly with asynchronous I/O and/or pthreads, to catch possible problems? If there is a system dependency or some race condition, chances are that regression testers will catch this. Hi Thomas, Thanks for your time for the review. Sure, I will add test case according to your suggestions and update the comment based on the implementation of "read-preferring" strategy. Thanks, Lipeng Zhu With this, the libgfortran parts are OK, unless somebody else has more comments, so give this a couple of days. I cannot approve the libgcc parts, that would be somebody else (Jakub?) Best regards Thomas
Re: [PATCH v4] libgfortran: Replace mutex with rwlock
On 5/16/2023 3:08 PM, Zhu, Lipeng wrote: On 5/9/2023 10:32 AM, Zhu, Lipeng wrote: On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote: On Mon, 8 May 2023 17:44:43 +0800 Lipeng Zhu wrote: This patch try to introduce the rwlock and split the read/write to unit_root tree and unit_cache with rwlock instead of the mutex to increase CPU efficiency. In the get_gfc_unit function, the percentage to step into the insert_unit function is around 30%, in most instances, we can get the unit in the phase of reading the unit_cache or unit_root tree. So split the read/write phase by rwlock would be an approach to make it more parallel. BTW, the IPC metrics can gain around 9x in our test server with 220 cores. The benchmark we used is https://github.com/rwesson/NEAT See commentary typos below. You did not state if you regression tested the patch? I use valgrind --tool=helgrind or --tool=drd to test 'make check-fortran'. Is it necessary to add an additional unit test for this patch? Other than that it LGTM but i cannot approve it. Thank you for your kind help for this patch, is there anything that I can do or can you help to push this patch forward? Hi Bernhard, Is there any other refinement that need I to do for this patch? Thanks. May I know any comment or concern on this patch, thanks for your time :) diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h index ad226c8e856..0033cc74252 100644 --- a/libgfortran/io/async.h +++ b/libgfortran/io/async.h @@ -210,6 +210,128 @@ DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex, mutex); \ Thanks, corrected in Patch v5. } while (0) +#ifdef __GTHREAD_RWLOCK_INIT +#define RWLOCK_DEBUG_ADD(rwlock) do { \ + aio_rwlock_debug *n; \ + n = xmalloc (sizeof(aio_rwlock_debug)); \ Missing space before the open brace: sizeof ( Thanks, corrected in Patch v5. diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c index 82664dc5f98..62f1db21d34 100644 --- a/libgfortran/io/unit.c +++ b/libgfortran/io/unit.c @@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see /* IO locking rules: - UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE. + UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE. + And use the rwlock to spilt read and write phase to UNIT_ROOT tree + and UNIT_CACHE to increase CPU efficiency. s/spilt/split. Maybe: Using an rwlock improves efficiency by allowing us to separate readers and writers of both UNIT_ROOT and UNIT_CACHE. Thanks, corrected in Patch v5. @@ -350,6 +356,17 @@ retry: if (c == 0) break; } + /* We did not find a unit in the cache nor in the unit list, create a new + (locked) unit and insert into the unit list and cache. + Manipulating either or both the unit list and the unit cache requires to + hold a write-lock [for obvious reasons]: + 1. By separating the read/write lock, it will greatly reduce the contention + at the read part, while write part is not always necessary or most + unlikely once the unit hit in cache. + By separating the read/write lock, we will greatly reduce the contention + on the read part, while the write part is unlikely once the unit hits + the cache. + 2. We try to balance the implementation complexity and the performance + gains that fit into current cases we observed by just using a + pthread_rwlock. */ Let's drop 2. Got it, thanks! thanks,
Re: [PATCH v4] libgfortran: Replace mutex with rwlock
On 5/9/2023 10:32 AM, Zhu, Lipeng wrote: On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote: On Mon, 8 May 2023 17:44:43 +0800 Lipeng Zhu wrote: This patch try to introduce the rwlock and split the read/write to unit_root tree and unit_cache with rwlock instead of the mutex to increase CPU efficiency. In the get_gfc_unit function, the percentage to step into the insert_unit function is around 30%, in most instances, we can get the unit in the phase of reading the unit_cache or unit_root tree. So split the read/write phase by rwlock would be an approach to make it more parallel. BTW, the IPC metrics can gain around 9x in our test server with 220 cores. The benchmark we used is https://github.com/rwesson/NEAT See commentary typos below. You did not state if you regression tested the patch? I use valgrind --tool=helgrind or --tool=drd to test 'make check-fortran'. Is it necessary to add an additional unit test for this patch? Other than that it LGTM but i cannot approve it. Thank you for your kind help for this patch, is there anything that I can do or can you help to push this patch forward? Hi Bernhard, Is there any other refinement that need I to do for this patch? Thanks. diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h index ad226c8e856..0033cc74252 100644 --- a/libgfortran/io/async.h +++ b/libgfortran/io/async.h @@ -210,6 +210,128 @@ DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex, mutex); \ Thanks, corrected in Patch v5. } while (0) +#ifdef __GTHREAD_RWLOCK_INIT +#define RWLOCK_DEBUG_ADD(rwlock) do { \ + aio_rwlock_debug *n; \ + n = xmalloc (sizeof(aio_rwlock_debug)); \ Missing space before the open brace: sizeof ( Thanks, corrected in Patch v5. diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c index 82664dc5f98..62f1db21d34 100644 --- a/libgfortran/io/unit.c +++ b/libgfortran/io/unit.c @@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see /* IO locking rules: - UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE. + UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE. + And use the rwlock to spilt read and write phase to UNIT_ROOT tree + and UNIT_CACHE to increase CPU efficiency. s/spilt/split. Maybe: Using an rwlock improves efficiency by allowing us to separate readers and writers of both UNIT_ROOT and UNIT_CACHE. Thanks, corrected in Patch v5. @@ -350,6 +356,17 @@ retry: if (c == 0) break; } + /* We did not find a unit in the cache nor in the unit list, create a new + (locked) unit and insert into the unit list and cache. + Manipulating either or both the unit list and the unit cache requires to + hold a write-lock [for obvious reasons]: + 1. By separating the read/write lock, it will greatly reduce the contention + at the read part, while write part is not always necessary or most + unlikely once the unit hit in cache. + By separating the read/write lock, we will greatly reduce the contention + on the read part, while the write part is unlikely once the unit hits + the cache. + 2. We try to balance the implementation complexity and the performance + gains that fit into current cases we observed by just using a + pthread_rwlock. */ Let's drop 2. Got it, thanks! thanks,
Re: [PATCH v4] libgfortran: Replace mutex with rwlock
On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote: On Mon, 8 May 2023 17:44:43 +0800 Lipeng Zhu wrote: This patch try to introduce the rwlock and split the read/write to unit_root tree and unit_cache with rwlock instead of the mutex to increase CPU efficiency. In the get_gfc_unit function, the percentage to step into the insert_unit function is around 30%, in most instances, we can get the unit in the phase of reading the unit_cache or unit_root tree. So split the read/write phase by rwlock would be an approach to make it more parallel. BTW, the IPC metrics can gain around 9x in our test server with 220 cores. The benchmark we used is https://github.com/rwesson/NEAT See commentary typos below. You did not state if you regression tested the patch? I use valgrind --tool=helgrind or --tool=drd to test 'make check-fortran'. Is it necessary to add an additional unit test for this patch? Other than that it LGTM but i cannot approve it. Thank you for your kind help for this patch, is there anything that I can do or can you help to push this patch forward? diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h index ad226c8e856..0033cc74252 100644 --- a/libgfortran/io/async.h +++ b/libgfortran/io/async.h @@ -210,6 +210,128 @@ DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex, mutex); \ Thanks, corrected in Patch v5. } while (0) +#ifdef __GTHREAD_RWLOCK_INIT +#define RWLOCK_DEBUG_ADD(rwlock) do { \ +aio_rwlock_debug *n; \ +n = xmalloc (sizeof(aio_rwlock_debug));\ Missing space before the open brace: sizeof ( Thanks, corrected in Patch v5. diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c index 82664dc5f98..62f1db21d34 100644 --- a/libgfortran/io/unit.c +++ b/libgfortran/io/unit.c @@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see /* IO locking rules: - UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE. + UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE. + And use the rwlock to spilt read and write phase to UNIT_ROOT tree + and UNIT_CACHE to increase CPU efficiency. s/spilt/split. Maybe: Using an rwlock improves efficiency by allowing us to separate readers and writers of both UNIT_ROOT and UNIT_CACHE. Thanks, corrected in Patch v5. @@ -350,6 +356,17 @@ retry: if (c == 0) break; } + /* We did not find a unit in the cache nor in the unit list, create a new +(locked) unit and insert into the unit list and cache. +Manipulating either or both the unit list and the unit cache requires to +hold a write-lock [for obvious reasons]: +1. By separating the read/write lock, it will greatly reduce the contention + at the read part, while write part is not always necessary or most + unlikely once the unit hit in cache. +By separating the read/write lock, we will greatly reduce the contention +on the read part, while the write part is unlikely once the unit hits +the cache. +2. We try to balance the implementation complexity and the performance + gains that fit into current cases we observed by just using a + pthread_rwlock. */ Let's drop 2. Got it, thanks! thanks,
Re: [PATCH v3] libgfortran: Replace mutex with rwlock
On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote: Hi! [please do not top-post] Sure, thanks for the reminder. On Thu, 20 Apr 2023 21:13:08 +0800 "Zhu, Lipeng" wrote: Hi Bernhard, Thanks for your questions and suggestions. The rwlock could allow multiple threads to have concurrent read-only access to the cache/unit list, only a single writer is allowed. right. Write lock will not be acquired until all read lock are released. So i must have confused rwlock with something else, something that allows self to hold a read-lock and upgrade that to a write-lock, purposely starving all successive incoming readers. I.e. just toggle your RD_TO_WRLOCK impl, here, atomically. This proved to be benefical in some situations in the past. Doesn't seem to work with your rwlock, does it Pthread API doesn't support the upgrade rdlock to wrlock directly, the calling thread may deadlock if at the time the call is made it holds the read-write lock (whether a read or write lock). https://linux.die.net/man/3/pthread_rwlock_wrlock. That is why we implement RD_TO_WRLOCK like this: release read lock firstly and then acquire write lock. And I didn't change the mutex scope when refactor the code, only make a more fine-grained distinction for the read/write cache/unit list. Yes of course, i can see you did that. I complete the comment according to your template, I will insert the comment in the source code in next version patch with other refinement by your suggestions. " Now we did not get a unit in cache and unit list, so we need to create a new unit, and update it to cache and unit list. s/Now/By now/ or s/Now w/W/ and s/get/find/ " We did not find a unit in the cache nor in the unit list, create a new (locked) unit and insert into the unit list and cache. Manipulating either or both the unit list and the unit cache requires to hold a write-lock [for obvious reasons]" Superfluous when talking about pthread_rwlock_wrlock since that implies that even the process acquiring the wrlock has to first release it's very own rdlock. Thanks for the correction, I will update the comments in next v4 patch. Prior to update the cache and list, we need to release all read locks, and then immediately to acquire write lock, thus ensure the exclusive update to the cache and unit list. Either way, we will manipulate the cache and/or the unit list so we must take a write lock now. We don't take the write bit in *addition* to the read lock because: 1. It will needlessly complicate releasing the respective lock; Under pthread_rwlock_wrlock it will deadlock, so that's wrong? Drop that point then? If not, what's your reasoning / observation? Under my lock, you hold the R, additionally take the W and then immediately release the R because you yourself won't read, just write. But mine's not the pthread_rwlock you talk about, admittedly. Yes, pthread_rwlock doesn't support upgrade rdlock to wrlock directly, so we can’t hold rdlock while trying to acquire wrlock. I will drop the first point and update the comment accordingly. 2. By separate the read/write lock, it will greatly reduce the contention at the read part, while write part is not always necessary or most unlikely once the unit hit in cache; We know that. 3. We try to balance the implementation complexity and the performance gains that fit into current cases we observed. .. by just using a pthread_rwlock. And that's the top point iff you keep it at that. That's a fair step, sure. BTW, did you look at the RELEASE semantics, respectively the note that one day (and now is that very day), we might improve on the release semantics? Can of course be incremental AFAIC Not quite understand your question about looking at the RELEASE semantics. Can you be more specific? " If folks agree on this first step then you have my OK with a catchy malloc and the discussion recorded here on the list. A second step would be RELEASE. And, depending on the underlying capabilities of available locks, further tweaks, obviously. PS: and, please, don't top-post thanks, Best Regards, Zhu, Lipeng On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote: On 19 April 2023 09:06:28 CEST, Lipeng Zhu via Fortran wrote: This patch try to introduce the rwlock and split the read/write to unit_root tree and unit_cache with rwlock instead of the mutex to increase CPU efficiency. In the get_gfc_unit function, the percentage to step into the insert_unit function is around 30%, in most instances, we can get the unit in the phase of reading the unit_cache or unit_root tree. So split the read/write phase by rwlock would be an approach to make it more parallel. BTW, the IPC metrics can gain around 9x in our test server with 220 cores. The benchmark we used is https://github.com/rwesson/NEAT +#define RD_TO_WRLOCK(rwlock) \ + RWUNLOCK (rwlock);\ + WRLOCK (rwlock); +#endif + diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c index
Re: [PATCH v3] libgfortran: Replace mutex with rwlock
Hi Bernhard, Thanks for your questions and suggestions. The rwlock could allow multiple threads to have concurrent read-only access to the cache/unit list, only a single writer is allowed. Write lock will not be acquired until all read lock are released. And I didn't change the mutex scope when refactor the code, only make a more fine-grained distinction for the read/write cache/unit list. I complete the comment according to your template, I will insert the comment in the source code in next version patch with other refinement by your suggestions. " Now we did not get a unit in cache and unit list, so we need to create a new unit, and update it to cache and unit list. Prior to update the cache and list, we need to release all read locks, and then immediately to acquire write lock, thus ensure the exclusive update to the cache and unit list. Either way, we will manipulate the cache and/or the unit list so we must take a write lock now. We don't take the write bit in *addition* to the read lock because: 1. It will needlessly complicate releasing the respective lock; 2. By separate the read/write lock, it will greatly reduce the contention at the read part, while write part is not always necessary or most unlikely once the unit hit in cache; 3. We try to balance the implementation complexity and the performance gains that fit into current cases we observed. " Best Regards, Zhu, Lipeng On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote: On 19 April 2023 09:06:28 CEST, Lipeng Zhu via Fortran wrote: This patch try to introduce the rwlock and split the read/write to unit_root tree and unit_cache with rwlock instead of the mutex to increase CPU efficiency. In the get_gfc_unit function, the percentage to step into the insert_unit function is around 30%, in most instances, we can get the unit in the phase of reading the unit_cache or unit_root tree. So split the read/write phase by rwlock would be an approach to make it more parallel. BTW, the IPC metrics can gain around 9x in our test server with 220 cores. The benchmark we used is https://github.com/rwesson/NEAT +#define RD_TO_WRLOCK(rwlock) \ + RWUNLOCK (rwlock);\ + WRLOCK (rwlock); +#endif + diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c index 82664dc5f98..4312c5f36de 100644 --- a/libgfortran/io/unit.c +++ b/libgfortran/io/unit.c @@ -329,7 +335,7 @@ get_gfc_unit (int n, int do_create) int c, created = 0; NOTE ("Unit n=%d, do_create = %d", n, do_create); - LOCK (_lock); + RDLOCK (_rwlock); retry: for (c = 0; c < CACHE_SIZE; c++) @@ -350,6 +356,7 @@ retry: if (c == 0) break; } + RD_TO_WRLOCK (_rwlock); So I'm trying to convince myself why it's safe to unlock and only then take the write lock. Can you please elaborate/confirm why that's ok? I wouldn't mind a comment like We can release the unit and cache read lock now. We might have to allocate a (locked) unit, below in do_create. Either way, we will manipulate the cache and/or the unit list so we have to take a write lock now. We don't take the write bit in *addition* to the read lock because.. (that needlessly complicates releasing the respective locks / it triggers too much contention when we.. / ...?) thanks, if (p == NULL && do_create) { @@ -368,8 +375,8 @@ retry: if (created) { /* Newly created units have their lock held already -from insert_unit. Just unlock UNIT_LOCK and return. */ - UNLOCK (_lock); +from insert_unit. Just unlock UNIT_RWLOCK and return. */ + RWUNLOCK (_rwlock); return p; } @@ -380,7 +387,7 @@ found: if (! TRYLOCK (>lock)) { /* assert (p->closed == 0); */ - UNLOCK (_lock); + RWUNLOCK (_rwlock); return p; } @@ -388,14 +395,14 @@ found: } - UNLOCK (_lock); + RWUNLOCK (_rwlock); if (p != NULL && (p->child_dtio == 0)) { LOCK (>lock); if (p->closed) { - LOCK (_lock); + WRLOCK (_rwlock); UNLOCK (>lock); if (predec_waiting_locked (p) == 0) destroy_unit_mutex (p); @@ -593,8 +600,8 @@ init_units (void) #endif #endif -#ifndef __GTHREAD_MUTEX_INIT - __GTHREAD_MUTEX_INIT_FUNCTION (_lock); +#if (!defined(__GTHREAD_RWLOCK_INIT) && +!defined(__GTHREAD_MUTEX_INIT)) + __GTHREAD_MUTEX_INIT_FUNCTION (_rwlock); #endif if (sizeof (max_offset) == 8)
RE: [PATCH v2] libgfortran: Replace mutex with rwlock
> > Hi Lipeng, > > > > > This patch try to introduce the rwlock and split the read/write to > > > unit_root tree and unit_cache with rwlock instead of the mutex to > > > increase CPU efficiency. In the get_gfc_unit function, the > > > percentage to step into the insert_unit function is around 30%, in > > > most instances, we can get the unit in the phase of reading the > > > unit_cache or unit_root tree. So split the read/write phase by > > > rwlock would be an approach to make it more parallel. > > > > No comment on the code itself, as yet... but I'd like to know how > > throroughly you tested it, using which tools, and on which programs. > > Did you use valgrind --tool=helgrind or --tool=drd? Since it is prone to > > race conditions, did you also test Fortran's asynchronous I/O? > > > > Best regards > > > > Thomas > > Hi Thomas, > > I didn’t use valgrind and make check-fortran succeed in local. > And the tools/programs I used is pts/neatbench > https://openbenchmarking.org/test/pts/neatbench > > Best Regards, > Lipeng Zhu Hi Thomas, Thanks for your comments, I just use valgrind --tool=helgrind or --tool=drd to test make check-fortran, no errors occurred. ==2609780== Helgrind, a thread error detector ==2609780== Copyright (C) 2007-2017, and GNU GPL'd, by OpenWorks LLP et al. ==2609780== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info ==2609780== Command: make check-fortran -j150 ==2609780== ==2609780== ==2609780== Use --history-level=approx or =none to gain increased speed, at ==2609780== the cost of reduced accuracy of conflicting-access information ==2609780== For lists of detected and suppressed errors, rerun with: -s ==2609780== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) ==2976599== drd, a thread error detector ==2976599== Copyright (C) 2006-2020, and GNU GPL'd, by Bart Van Assche. ==2976599== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info ==2976599== Command: make check-fortran -j150 ==2976599== ==2976599== ==2976599== For lists of detected and suppressed errors, rerun with: -s ==2976599== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) BTW, can you give me some guidance of how to test Fortran's asynchronous I/O? Best Regards, Lipeng Zhu
RE: [PATCH v2] libgfortran: Replace mutex with rwlock
> Hi Lipeng, > > This patch try to introduce the rwlock and split the read/write to > > unit_root tree and unit_cache with rwlock instead of the mutex to > > increase CPU efficiency. In the get_gfc_unit function, the percentage > > to step into the insert_unit function is around 30%, in most > > instances, we can get the unit in the phase of reading the unit_cache > > or unit_root tree. So split the read/write phase by rwlock would be an > > approach to make it more parallel. > > No comment on the code itself, as yet... but I'd like to know how throroughly > you tested it, using which tools, and on which programs. > Did you use valgrind --tool=helgrind or --tool=drd? Since it is prone to > race conditions, did you also test Fortran's asynchronous I/O? > > Best regards > > Thomas Hi Thomas, I didn’t use valgrind and make check-fortran succeed in local. And the tools/programs I used is pts/neatbench https://openbenchmarking.org/test/pts/neatbench Best Regards, Lipeng Zhu
[PATCH] libgfortran: Replace mutex with rwlock
> libstdc++ implements shared mutex with pthread_rwlock, which can > libstdc++ conflict > with the pthread_rwlock usage in libgcc. Lipeng, please limit the > pthread_rwlock usage in libgcc only when __cplusplus isn't defined. > > > -- > H.J. Thanks for suggestion, send a V2 patch. -- Lipeng Zhu