I notice that BTLs are not checking the return value from ompi_modex_recv() for 
OPAL_ERR_DATA_VALUE_NOT_FOUND (indicating that the peer process didn't put that 
modex key).  In the BTL context, NOT_FOUND means that the peer process doesn't 
have this BTL, so the local process should probably mark that peer as 
unreachable in add_procs().

This is on both trunk and the v1.8 branch.

The BTLs listed above are not checking/handling ompi_modex_recv() returning 
OPAL_ERR_DATA_VALUE_NOT_FOUND properly.  Most of these BTLs do something like 
this:

-----
module_add_procs() {
  loop over the peers {
    proc = proc_create(...)
    if (NULL == proc)
      error!
    ....
  }
}

proc_create(...) {
  if (ompi_modex_recv() != OMPI_SUCCESS)
     return NULL;
  ...
}
-----

The fix is to make proc_create() return something a bit more expressive so that 
add_procs() can tell the difference between "error!" and "you can't reach this 
peer".
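
As a rough illustration of that idea, here is a minimal, self-contained sketch in C. Every sketch_* name and SKETCH_* status code below is a hypothetical stand-in invented for this example. They are not the real Open MPI symbols or values, and sketch_modex_recv() only simulates the three outcomes (success, key not found, hard error). The point is the shape: proc_create() returns a status code and hands the proc back through an out-parameter, so add_procs() can skip a peer on NOT_FOUND but still abort on a real error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical status codes; the real OMPI/OPAL values differ. */
enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERROR = -1,
    SKETCH_ERR_DATA_VALUE_NOT_FOUND = -2
};

/* Hypothetical per-peer state kept by a BTL. */
typedef struct {
    int peer_rank;
} sketch_proc_t;

/* Stand-in for ompi_modex_recv(): negative ranks simulate a hard error,
 * odd ranks simulate a peer that never published this BTL's modex key. */
static int sketch_modex_recv(int peer_rank)
{
    if (peer_rank < 0) {
        return SKETCH_ERROR;
    }
    return (peer_rank % 2) ? SKETCH_ERR_DATA_VALUE_NOT_FOUND : SKETCH_SUCCESS;
}

/* Returns a status code instead of a bare pointer, so the caller can tell
 * "error" apart from "peer does not have this BTL".  *proc_out is non-NULL
 * only when SKETCH_SUCCESS is returned. */
static int sketch_proc_create(int peer_rank, sketch_proc_t **proc_out)
{
    sketch_proc_t *proc;
    int rc;

    assert(NULL != proc_out);
    *proc_out = NULL;

    rc = sketch_modex_recv(peer_rank);
    if (SKETCH_SUCCESS != rc) {
        return rc;  /* NOT_FOUND and real errors both pass through unchanged */
    }

    proc = malloc(sizeof(*proc));
    if (NULL == proc) {
        return SKETCH_ERROR;
    }
    proc->peer_rank = peer_rank;
    *proc_out = proc;
    return SKETCH_SUCCESS;
}

/* Marks each peer reachable (1) or unreachable (0).  Only a real error
 * aborts the loop; cleanup of procs created before the error is left to
 * the caller in this sketch. */
static int sketch_add_procs(const int *peer_ranks, size_t nprocs,
                            sketch_proc_t **procs, unsigned char *reachable)
{
    size_t i;

    assert(NULL != peer_ranks && NULL != procs && NULL != reachable);

    for (i = 0; i < nprocs; ++i) {
        int rc;

        reachable[i] = 0;
        rc = sketch_proc_create(peer_ranks[i], &procs[i]);
        if (SKETCH_ERR_DATA_VALUE_NOT_FOUND == rc) {
            continue;   /* peer lacks this BTL: unreachable, not an error */
        }
        if (SKETCH_SUCCESS != rc) {
            return rc;  /* genuine failure */
        }
        reachable[i] = 1;
    }
    return SKETCH_SUCCESS;
}
```

With peers {0, 1, 2} in this simulation, the call succeeds and peer 1 is simply left marked unreachable, while a peer that triggers the simulated hard error makes sketch_add_procs() return that error.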

I fixed this in the usnic BTL back in late March, but forgot to bring this to 
everyone's attention -- oops.  See 
https://svn.open-mpi.org/trac/ompi/ticket/4442

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
