Gleb,

This patch is not correct. The code preventing the registration of the same communicator twice is later in the code (same file in the function ompi_comm_register_cid line 326). Once the function ompi_comm_register_cid is called, we know that each communicator only handle one "communicator creation" function at the same time. Therefore, we want to give priority to the smallest com_id, which is what happens in the code you removed.

Without the condition in the ompi_comm_register_cid (each communicator only get registered once) your comment make sense. However, with the condition your patch allow a dead end situation, while 2 processes try to create communicators in multiple threads, and they will never succeed, simply because they will not order the creation based on the com_id.

  george.



On Sep 11, 2007, at 9:23 AM, g...@osl.iu.edu wrote:

Author: gleb
Date: 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
New Revision: 16088
URL: https://svn.open-mpi.org/trac/ompi/changeset/16088

Log:
The code tries to prevent itself from running for more then one communicator simultaneously, but is doing it incorrectly. If the function is running already for one communicator and it is called from another thread for other communicator
with lower cid the check comm->c_contextid != ompi_comm_lowest_cid()
will fail and the function will be executed for two different communicators by two threads simultaneously. There is nothing in the algorithm that prevent it from been running simultaneously for different communicators as far as I can see, but ompi_comm_unregister_cid() assumes that it is always called for a communicator with the lowest cid and this is not always the case. This patch removes bogus lowest cid check and fix ompi_comm_register_cid() to properly remove cid from
the list.

Text files modified:
   trunk/ompi/communicator/comm_cid.c |    24 ++++++++++++------------
   1 files changed, 12 insertions(+), 12 deletions(-)

Modified: trunk/ompi/communicator/comm_cid.c
====================================================================== ========
--- trunk/ompi/communicator/comm_cid.c  (original)
+++ trunk/ompi/communicator/comm_cid.c 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
@@ -11,6 +11,7 @@
  *                         All rights reserved.
* Copyright (c) 2006-2007 University of Houston. All rights reserved.
  * Copyright (c) 2007      Cisco, Inc.  All rights reserved.
+ * Copyright (c) 2007      Voltaire All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -170,15 +171,6 @@
              * This is the real algorithm described in the doc
              */

-            OPAL_THREAD_LOCK(&ompi_cid_lock);
-            if (comm->c_contextid != ompi_comm_lowest_cid() ) {
- /* if not lowest cid, we do not continue, but sleep and try again */
-                OPAL_THREAD_UNLOCK(&ompi_cid_lock);
-                continue;
-            }
-            OPAL_THREAD_UNLOCK(&ompi_cid_lock);
-
-
             for (i=start; i < mca_pml.pml_max_contextid ; i++) {
flag=ompi_pointer_array_test_and_set_item (&ompi_mpi_communicators,
                                                           i, comm);
@@ -365,10 +357,18 @@

 static int ompi_comm_unregister_cid (uint32_t cid)
 {
-    ompi_comm_reg_t *regcom=NULL;
- opal_list_item_t *item=opal_list_remove_first (&ompi_registered_comms);
+    ompi_comm_reg_t *regcom;
+    opal_list_item_t *item;

-    regcom = (ompi_comm_reg_t *) item;
+    for (item = opal_list_get_first(&ompi_registered_comms);
+         item != opal_list_get_end(&ompi_registered_comms);
+         item = opal_list_get_next(item)) {
+        regcom = (ompi_comm_reg_t *)item;
+        if(regcom->cid == cid) {
+            opal_list_remove_item(&ompi_registered_comms, item);
+            break;
+        }
+    }
     OBJ_RELEASE(regcom);
     return OMPI_SUCCESS;
 }
_______________________________________________
svn-full mailing list
svn-f...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn-full

Attachment: smime.p7s
Description: S/MIME cryptographic signature

Reply via email to