I encountered a deadlock in sync_wait_mt().

After some investigation, it appears that a first thread executing 
wait_sync_update() decrements sync->count just after a second thread in 
sync_wait_mt() has performed this test:

    if(sync->count <= 0)
        return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;

After that, there is a narrow window in which the first thread may call 
pthread_cond_signal() before the second thread calls pthread_cond_wait(). 
The signal is then lost, and the second thread waits forever.

If I protect this test with sync->lock, the window is closed and the problem 
no longer reproduces.
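
To make the window concrete, here is a minimal standalone sketch. It is not 
Open MPI code: race_sync_t, race_complete(), race_wait() and race_demo_run() 
are hypothetical names, and the "tested" and "signaled" flags exist only to 
force the bad interleaving deterministically. It also waits right after the 
test, leaving out the pending-list handling of the real function. A one-second 
pthread_cond_timedwait() stands in for the hang so that the lost signal is 
observable.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical stand-in for ompi_wait_sync_t, plus two demo-only flags. */
typedef struct {
    atomic_int      count;     /* completions still pending */
    atomic_int      tested;    /* waiter has evaluated count */
    atomic_int      signaled;  /* completer has finished signaling */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} race_sync_t;

/* Completer: starts only after the waiter's test, then decrements and signals. */
static void *race_complete(void *arg)
{
    race_sync_t *s = arg;

    while (!atomic_load(&s->tested))
        sched_yield();
    atomic_fetch_sub(&s->count, 1);
    pthread_mutex_lock(&s->lock);
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);
    atomic_store(&s->signaled, 1);
    return NULL;
}

/* Waiter. Returns 0 if woken by the signal, ETIMEDOUT if the signal was missed. */
static int race_wait(race_sync_t *s, int locked_test)
{
    struct timespec deadline;
    int pending, rc = 0;

    if (locked_test)
        pthread_mutex_lock(&s->lock);       /* proposed order: lock, then test */
    pending = atomic_load(&s->count) > 0;
    atomic_store(&s->tested, 1);
    if (!locked_test) {
        /* Original order: let the completer run inside the window. */
        while (!atomic_load(&s->signaled))
            sched_yield();
        pthread_mutex_lock(&s->lock);
    }
    if (pending) {
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 1;               /* timeout stands in for the hang */
        rc = pthread_cond_timedwait(&s->cond, &s->lock, &deadline);
    }
    pthread_mutex_unlock(&s->lock);
    return rc;
}

/* Runs one waiter against one completer and returns the waiter's result. */
static int race_demo_run(int locked_test)
{
    race_sync_t s;
    pthread_t completer;
    int rc;

    atomic_init(&s.count, 1);
    atomic_init(&s.tested, 0);
    atomic_init(&s.signaled, 0);
    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.cond, NULL);
    pthread_create(&completer, NULL, race_complete, &s);
    rc = race_wait(&s, locked_test);
    pthread_join(completer, NULL);
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.lock);
    return rc;
}
```

Barring a spurious wakeup, race_demo_run(0) should return ETIMEDOUT (test 
outside the lock, signal lost) and race_demo_run(1) should return 0 (test 
under the lock, signal delivered).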

To reproduce the problem easily, just add a call to usleep(100) before the 
call to pthread_mutex_lock(&sync->lock);

So my proposed patch is:

diff --git a/opal/threads/wait_sync.c b/opal/threads/wait_sync.c
index c9b9137..2f90965 100644
--- a/opal/threads/wait_sync.c
+++ b/opal/threads/wait_sync.c
@@ -25,12 +25,14 @@ static ompi_wait_sync_t* wait_sync_list = NULL;

 int sync_wait_mt(ompi_wait_sync_t *sync)
 {
-    if(sync->count <= 0)
-        return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
-
     /* lock so nobody can signal us during the list updating */
     pthread_mutex_lock(&sync->lock);

+    if(sync->count <= 0) {
+        pthread_mutex_unlock(&sync->lock);
+        return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
+    }
+
     /* Insert sync on the list of pending synchronization constructs */
     OPAL_THREAD_LOCK(&wait_sync_lock);
     if( NULL == wait_sync_list ) {

For performance reasons, it is also possible to keep the first, unlocked test 
in addition to the locked one. That way, if the request has already completed, 
we do not spend time taking and releasing the lock.
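
A rough sketch of that variant, assuming a simplified structure: fast_sync_t 
and the fast_sync_*() functions are hypothetical names, not the Open MPI ones, 
and the pending-list handling of the real function is left out. It uses the 
usual condition-variable loop rather than the exact control flow of 
sync_wait_mt(). The unlocked test is only a shortcut; the test under the lock 
is the one that decides whether to wait.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for ompi_wait_sync_t; field names are illustrative. */
typedef struct {
    atomic_int      count;   /* completions still pending */
    int             status;  /* 0 on success; set before the last decrement */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} fast_sync_t;

static void fast_sync_init(fast_sync_t *s, int count)
{
    atomic_init(&s->count, count);
    s->status = 0;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
}

/* Completer side: decrement, and signal under the lock on the last one. */
static void fast_sync_update(fast_sync_t *s)
{
    if (atomic_fetch_sub(&s->count, 1) == 1) {
        pthread_mutex_lock(&s->lock);
        pthread_cond_signal(&s->cond);
        pthread_mutex_unlock(&s->lock);
    }
}

/* Thread entry point wrapping fast_sync_update(), for pthread_create(). */
static void *fast_sync_update_thread(void *arg)
{
    fast_sync_update(arg);
    return NULL;
}

/* Waiter side. Returns 0 on success, -1 if status reports an error. */
static int fast_sync_wait(fast_sync_t *s)
{
    /* Fast path: already complete, so skip the lock entirely. */
    if (atomic_load(&s->count) <= 0)
        return (0 == s->status) ? 0 : -1;

    pthread_mutex_lock(&s->lock);
    /* Deciding test: under the lock, the completer cannot signal
     * between this check and pthread_cond_wait(). */
    while (atomic_load(&s->count) > 0)
        pthread_cond_wait(&s->cond, &s->lock);
    pthread_mutex_unlock(&s->lock);

    return (0 == s->status) ? 0 : -1;
}
```

A second call to fast_sync_wait() on an already completed object returns 
through the fast path without touching the mutex.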



_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
