Folks,

currently, the dynamic/intercomm_create test fails if run on one host with an
IB port :
mpirun -np 1 ./intercomm_create
/* misleading error message is
opal/mca/btl/openib/connect/btl_openib_connect_udcm.c:1899:udcm_process_messages]
could not find associated endpoint */

this program spawns one task, then a second one, creates a single
communicator and performs a barrier.

what happens here is :
- tasks 0 and 1 do not use IB as a loopback interface because 
OPAL_MODEX_RECV fails in mca_btl_openib_proc_create()
/* this is ok since the openib modex was sent with PMIX_REMOTE */

but later, task 1 will try to communicate with task 2 via the openib btl.
the reason is that task 1 got the openib modex from task 2 via
ompi_comm_get_rprocs, which is invoked by MPI_Intercomm_create

and this will cause an error with a misleading error message reported by
task 2

i wrote the attached hack to "fix" the issue.
i had to strcmp the host names since at that time, proc->proc_flags is
OPAL_PROC_NON_LOCAL
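
as a rough sketch, the host name test in the hack boils down to something
like the helper below. same_host() is a hypothetical name used only for
illustration, it is not an Open MPI function. unlike the attached patch, it
also treats a NULL local host name as "not the same host", which is an
assumption on my side about what the safe default should be.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* hypothetical helper, not part of Open MPI :
 * returns true only when both host names are known and identical */
static bool same_host(const char *local, const char *peer)
{
    if (NULL == local || NULL == peer) {
        /* unknown host name : do not assume the peer is local */
        return false;
    }
    return 0 == strcmp(local, peer);
}
```

with such a helper, the check in mca_btl_openib_proc_create() would return
NULL when same_host() is true for the local and the remote proc_hostname.
note that a plain string comparison only works if both sides report the
host name in the same form (short name vs fully qualified name).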

i guess several things are not being handled correctly here, could you
please advise a correct way to fix this ?

Cheers,

Gilles



diff --git a/opal/mca/btl/openib/btl_openib_proc.c b/opal/mca/btl/openib/btl_openib_proc.c
index 2d622fe..6e17320 100644
--- a/opal/mca/btl/openib/btl_openib_proc.c
+++ b/opal/mca/btl/openib/btl_openib_proc.c
@@ -12,6 +12,8 @@
  * Copyright (c) 2007-2008 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2006-2007 Voltaire All rights reserved.
  * Copyright (c) 2014      Intel, Inc. All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -159,6 +161,13 @@ mca_btl_openib_proc_t* mca_btl_openib_proc_create(opal_proc_t* proc)
     if (0 == msg_size) {
         return NULL;
     }
+    /* do NOT use ib as a loopback interface */
+    if (NULL != proc->proc_hostname) {
+        char * h = opal_proc_local_get()->proc_hostname;
+        if (strcmp(h, proc->proc_hostname) == 0) {
+            return NULL;
+        }
+    }

     /* Message was packed in btl_openib_component.c; the format is
        listed in a comment in that file */
