I encountered a deadlock in sync_wait_mt().
After investigation, it appears that a first thread executing
wait_sync_update() decrements sync->count just after a second thread in
sync_wait_mt() has performed the test:
if(sync->count <= 0)
return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
After that, there is a narrow window in which the first thread may call
pthread_cond_signal() before the second thread calls pthread_cond_wait().
No thread is waiting at that point, so the signal is lost and the second
thread then blocks forever.
If I protect this test with sync->lock, this window is closed and the problem
does not reproduce.
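
To illustrate why testing under the lock closes the window, here is a minimal
standalone sketch of the pattern. It is not the Open MPI code: demo_sync_t,
demo_sync_update(), demo_sync_wait() and demo_run() are hypothetical names, and
for simplicity the counter is both decremented and tested under the mutex.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for ompi_wait_sync_t; only the fields used here. */
typedef struct {
    int count;                 /* outstanding completions */
    int status;                /* 0 means success */
    pthread_mutex_t lock;
    pthread_cond_t cond;
} demo_sync_t;

/* Completer side: decrement and signal while holding the lock. */
static void demo_sync_update(demo_sync_t *s)
{
    pthread_mutex_lock(&s->lock);
    if (--s->count <= 0)
        pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

/* Waiter side: the predicate is only tested with the lock held, so the
 * completer cannot signal between the test and pthread_cond_wait(). */
static int demo_sync_wait(demo_sync_t *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->count > 0)
        pthread_cond_wait(&s->cond, &s->lock);
    pthread_mutex_unlock(&s->lock);
    return (0 == s->status) ? 0 : -1;
}

static void *demo_completer(void *arg)
{
    demo_sync_update((demo_sync_t *)arg);
    return NULL;
}

/* Runs one waiter against one completer; returns the waiter's result. */
static int demo_run(void)
{
    demo_sync_t s = { .count = 1, .status = 0 };
    pthread_t t;
    int rc;

    assert(0 == pthread_mutex_init(&s.lock, NULL));
    assert(0 == pthread_cond_init(&s.cond, NULL));
    assert(0 == pthread_create(&t, NULL, demo_completer, &s));
    rc = demo_sync_wait(&s);
    pthread_join(t, NULL);
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.lock);
    return rc;
}
```

Calling demo_run() from a main() (built with -pthread) should return 0 whichever
thread runs first, because pthread_cond_wait() releases the lock atomically
with going to sleep.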
To reproduce the problem easily, just add a call to usleep(100) before the call
to pthread_mutex_lock(&sync->lock);
So my proposed patch is:
diff --git a/opal/threads/wait_sync.c b/opal/threads/wait_sync.c
index c9b9137..2f90965 100644
--- a/opal/threads/wait_sync.c
+++ b/opal/threads/wait_sync.c
@@ -25,12 +25,14 @@ static ompi_wait_sync_t* wait_sync_list = NULL;
int sync_wait_mt(ompi_wait_sync_t *sync)
{
- if(sync->count <= 0)
- return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
-
/* lock so nobody can signal us during the list updating */
pthread_mutex_lock(&sync->lock);
+ if(sync->count <= 0) {
+ pthread_mutex_unlock(&sync->lock);
+ return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
+ }
+
/* Insert sync on the list of pending synchronization constructs */
OPAL_THREAD_LOCK(&wait_sync_lock);
if( NULL == wait_sync_list ) {
For performance reasons, it is also possible to keep the first (unlocked) test
in addition to the locked one. That way, if the request has already completed,
we do not spend time taking and releasing the lock.
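
A rough sketch of that variant, under stated assumptions: the names fast_sync_t,
fast_sync_update(), fast_sync_wait() and fast_run() are hypothetical, and the
counter is a C11 atomic_int so that the unlocked read is well defined. This is
only an illustration of the idea, not the actual Open MPI implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for ompi_wait_sync_t. */
typedef struct {
    atomic_int count;          /* outstanding completions */
    int status;                /* 0 means success */
    pthread_mutex_t lock;
    pthread_cond_t cond;
} fast_sync_t;

/* Completer: atomic decrement, then signal under the lock on completion. */
static void fast_sync_update(fast_sync_t *s)
{
    if (atomic_fetch_sub(&s->count, 1) <= 1) {   /* new value is <= 0 */
        pthread_mutex_lock(&s->lock);
        pthread_cond_signal(&s->cond);
        pthread_mutex_unlock(&s->lock);
    }
}

static int fast_sync_wait(fast_sync_t *s)
{
    /* Fast path: already complete, so the lock is never taken. */
    if (atomic_load(&s->count) <= 0)
        return (0 == s->status) ? 0 : -1;

    /* Slow path: re-test under the lock before sleeping, so a signal
     * cannot slip in between the test and pthread_cond_wait(). */
    pthread_mutex_lock(&s->lock);
    while (atomic_load(&s->count) > 0)
        pthread_cond_wait(&s->cond, &s->lock);
    pthread_mutex_unlock(&s->lock);
    return (0 == s->status) ? 0 : -1;
}

static void *fast_completer(void *arg)
{
    fast_sync_update((fast_sync_t *)arg);
    return NULL;
}

/* Runs one waiter against one completer; returns the waiter's result. */
static int fast_run(void)
{
    fast_sync_t s = { .status = 0 };
    pthread_t t;
    int rc;

    atomic_init(&s.count, 1);
    assert(0 == pthread_mutex_init(&s.lock, NULL));
    assert(0 == pthread_cond_init(&s.cond, NULL));
    assert(0 == pthread_create(&t, NULL, fast_completer, &s));
    rc = fast_sync_wait(&s);
    pthread_join(t, NULL);
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.lock);
    return rc;
}
```

The unlocked test is only an optimization here; correctness still relies on the
second test made with the lock held.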