Folks,

currently, the dynamic/intercomm_create program fails if run on a single host with an IB port:

mpirun -np 1 ./intercomm_create

/* the misleading error message is
opal/mca/btl/openib/connect/btl_openib_connect_udcm.c:1899:udcm_process_messages] could not find associated endpoint */

This program spawns one task, then a second one, creates a single communicator and performs a barrier.

What happens here is:
- tasks 0 and 1 do not use IB as a loopback interface because OPAL_MODEX_RECV fails in mca_btl_openib_proc_create()
  /* this is ok since the openib modex was sent with PMIX_REMOTE */
- but later, task 1 tries to communicate with task 2 via the openib btl. The reason is that task 1 got the openib modex from task 2 via ompi_comm_get_rprocs, invoked by MPI_Intercomm_create
- this causes an error, with a misleading error message reported by task 2

I wrote the attached hack to "fix" the issue.
I had to strcmp the host names since, at that time, proc->proc_flags is OPAL_PROC_NON_LOCAL.

I guess several things are not being handled correctly here; could you please advise a correct way to fix this?

Cheers,

Gilles

diff --git a/opal/mca/btl/openib/btl_openib_proc.c b/opal/mca/btl/openib/btl_openib_proc.c
index 2d622fe..6e17320 100644
--- a/opal/mca/btl/openib/btl_openib_proc.c
+++ b/opal/mca/btl/openib/btl_openib_proc.c
@@ -12,6 +12,8 @@
  * Copyright (c) 2007-2008 Cisco Systems, Inc. All rights reserved.
  * Copyright (c) 2006-2007 Voltaire All rights reserved.
  * Copyright (c) 2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -159,6 +161,13 @@ mca_btl_openib_proc_t* mca_btl_openib_proc_create(opal_proc_t* proc)
     if (0 == msg_size) {
         return NULL;
     }
+    /* do NOT use ib as a loopback interface */
+    if (NULL != proc->proc_hostname) {
+        char * h = opal_proc_local_get()->proc_hostname;
+        if (strcmp(h, proc->proc_hostname) == 0) {
+            return NULL;
+        }
+    }
     /* Message was packed in btl_openib_component.c; the format is
        listed in a comment in that file */
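
For illustration only, the host name comparison used by the hack can be isolated as a small standalone helper. This is a minimal sketch, not Open MPI code: the name same_host() is hypothetical, and the NULL check on the local host name is an extra precaution that the hack above does not have (the hack only checks the peer's proc_hostname).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): returns true only when
 * both host names are known and compare equal, i.e. when the peer
 * runs on the local host and IB would be used as a loopback. */
static bool same_host(const char *local_hostname, const char *peer_hostname)
{
    /* An unknown host name on either side is treated as "not the same host". */
    if (NULL == local_hostname || NULL == peer_hostname) {
        return false;
    }
    return 0 == strcmp(local_hostname, peer_hostname);
}
```

In the hack, the equivalent call would be same_host(opal_proc_local_get()->proc_hostname, proc->proc_hostname), with mca_btl_openib_proc_create() returning NULL when it is true. Note that a plain string comparison assumes both sides report the host name in the same form (short name vs. FQDN), which is one more reason to treat this as a workaround rather than a proper fix.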