Folks,
currently, dynamic/intercomm_create fails when run on a single host with an
IB port:
mpirun -np 1 ./intercomm_create
/* the (misleading) error message is
opal/mca/btl/openib/connect/btl_openib_connect_udcm.c:1899:udcm_process_messages]
could not find associated endpoint */
this program spawns one task, then a second one, creates a single
communicator and performs a barrier.
what happens here is:
- tasks 0 and 1 do not use IB as a loopback interface because
OPAL_MODEX_RECV fails in mca_btl_openib_proc_create()
/* this is ok since the openib modex was sent with PMIX_REMOTE */
- later, however, task 1 tries to communicate with task 2 via the openib btl,
because task 1 got the openib modex of task 2 via
ompi_comm_get_rprocs, which is invoked by MPI_Intercomm_create
- this causes an error, and task 2 reports the misleading error message
quoted above
i wrote the attached hack to "fix" the issue.
i had to strcmp the host names because, at that point, proc->proc_flags is
OPAL_PROC_NON_LOCAL, so the flags cannot be used to detect a local peer.
i guess several things are not being handled correctly here; could you
please advise on the correct way to fix this?
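for illustration only, the host name test used by the hack boils down to the standalone sketch below. same_host() is a hypothetical helper name and is not part of Open MPI; unlike the patch, it also guards against a NULL local host name, which i assume could happen.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT an Open MPI function.
 * Returns true only when both host names are known and identical,
 * i.e. when the peer can be assumed to run on the local node. */
static bool same_host(const char *local_hostname, const char *peer_hostname)
{
    /* An unknown host name gives no evidence that the peer is local. */
    if (NULL == local_hostname || NULL == peer_hostname) {
        return false;
    }
    return 0 == strcmp(local_hostname, peer_hostname);
}
```

in the patch, the two arguments would be opal_proc_local_get()->proc_hostname and proc->proc_hostname, and a true result means the openib btl is skipped for that peer.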
Cheers,
Gilles
diff --git a/opal/mca/btl/openib/btl_openib_proc.c b/opal/mca/btl/openib/btl_openib_proc.c
index 2d622fe..6e17320 100644
--- a/opal/mca/btl/openib/btl_openib_proc.c
+++ b/opal/mca/btl/openib/btl_openib_proc.c
@@ -12,6 +12,8 @@
* Copyright (c) 2007-2008 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -159,6 +161,13 @@ mca_btl_openib_proc_t* mca_btl_openib_proc_create(opal_proc_t* proc)
if (0 == msg_size) {
return NULL;
}
+ /* do NOT use ib as a loopback interface */
+ if (NULL != proc->proc_hostname) {
+ char * h = opal_proc_local_get()->proc_hostname;
+ if (strcmp(h, proc->proc_hostname) == 0) {
+ return NULL;
+ }
+ }
/* Message was packed in btl_openib_component.c; the format is
listed in a comment in that file */