I notice that BTLs are not checking the return value from ompi_modex_recv() for OPAL_ERR_DATA_VALUE_NOT_FOUND (indicating that the peer process did not publish that modex key). In the BTL context, NOT_FOUND means that the peer process doesn't have this BTL, so the local process should probably mark that peer as unreachable in add_procs().
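To make the intended behavior concrete, here is a minimal, self-contained sketch of an add_procs()-style loop that treats "key not found" as "peer unreachable" rather than as a hard error. Everything named sketch_* is hypothetical and exists only for illustration: the enum values stand in for OMPI_SUCCESS and OPAL_ERR_DATA_VALUE_NOT_FOUND, sketch_modex_recv() is a stub rather than the real ompi_modex_recv(), and a plain bool array stands in for whatever reachability structure a real BTL uses. It is not the code from any actual BTL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for OMPI_SUCCESS, OPAL_ERR_DATA_VALUE_NOT_FOUND, and a generic
 * error. The names and values are assumptions made for this sketch. */
enum sketch_rc {
    SKETCH_SUCCESS = 0,
    SKETCH_ERR_NOT_FOUND = -1,
    SKETCH_ERROR = -2
};

/* Hypothetical per-peer endpoint state. */
struct sketch_proc {
    int peer;
};

/* Stub in place of ompi_modex_recv(): in this sketch, even-numbered peers
 * published the modex key, odd-numbered peers did not, and a negative peer
 * id simulates a real failure. */
enum sketch_rc sketch_modex_recv(int peer)
{
    if (peer < 0) {
        return SKETCH_ERROR;
    }
    return (peer % 2 == 0) ? SKETCH_SUCCESS : SKETCH_ERR_NOT_FOUND;
}

/* Return a status code instead of NULL so the caller can tell
 * "peer does not have this BTL" apart from "something went wrong". */
enum sketch_rc sketch_proc_create(int peer, struct sketch_proc *out)
{
    enum sketch_rc rc = sketch_modex_recv(peer);
    if (SKETCH_SUCCESS != rc) {
        return rc;
    }
    out->peer = peer;
    return SKETCH_SUCCESS;
}

/* Loop over the peers; NOT_FOUND leaves the peer marked unreachable and
 * moves on, while any other failure aborts with that error. */
enum sketch_rc sketch_add_procs(const int *peers, size_t n,
                                struct sketch_proc *procs, bool *reachable)
{
    for (size_t i = 0; i < n; ++i) {
        reachable[i] = false;
        enum sketch_rc rc = sketch_proc_create(peers[i], &procs[i]);
        if (SKETCH_ERR_NOT_FOUND == rc) {
            continue;   /* peer lacks this BTL: unreachable, not an error */
        }
        if (SKETCH_SUCCESS != rc) {
            return rc;  /* genuine error */
        }
        reachable[i] = true;
    }
    return SKETCH_SUCCESS;
}
```

Calling sketch_add_procs() on peers {0, 1, 2} succeeds and leaves only peers 0 and 2 marked reachable; a list containing a negative peer id returns SKETCH_ERROR instead.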
This is on both trunk and the v1.8 branch. The BTLs listed above are not checking/handling ompi_modex_recv() returning OPAL_ERR_DATA_VALUE_NOT_FOUND properly. Most of these BTLs do something like this:

-----
module_add_procs() {
    loop over the peers {
        proc = proc_create(...)
        if (NULL == proc) error!
        ....
    }
}

proc_create(...) {
    if (ompi_modex_recv() != OMPI_SUCCESS) return NULL;
    ...
}
-----

The fix is to make proc_create() return something a bit more expressive so that add_procs() can tell the difference between "error!" and "you can't reach this peer".

I fixed this in the usnic BTL back in late March, but forgot to bring this to everyone's attention -- oops. See https://svn.open-mpi.org/trac/ompi/ticket/4442

--
Jeff Squyres
jsquy...@cisco.com