Good catch. I fixed the TCP BTL (r31753). It is the only BTL I can
test so that's the most I can do here.

However, I never get OPAL_ERR_DATA_VALUE_NOT_FOUND out of the modex
call when the key doesn't exist. I looked in dstore, and the value
one should check for is OPAL_ERR_NOT_FOUND. I guess you might want
to revise the check in the usnic BTL.
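
For illustration, here is a minimal sketch in C of how add_procs()
could tell "this peer is unreachable" apart from a real error. Every
name in it (the sketch_* functions, the SKETCH_* codes, and the rule
that only even-numbered peers published the key) is a hypothetical
stand-in. It is not the actual OMPI/OPAL API, nor the change
committed in r31753:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch only; these are NOT the
 * real OPAL/OMPI constants. */
enum sketch_status {
    SKETCH_SUCCESS       =  0,
    SKETCH_ERR_NOT_FOUND = -1,  /* peer never published the modex key */
    SKETCH_ERR_FATAL     = -2   /* any other failure */
};

struct sketch_proc {
    int peer_id;
    int addr;
};

/* Hypothetical stand-in for the modex lookup. Assumption made up for
 * the example: even peer ids published the key, odd ones did not,
 * and negative ids simulate a genuine failure. */
int sketch_modex_recv(int peer_id, int *addr_out)
{
    if (peer_id < 0) {
        return SKETCH_ERR_FATAL;
    }
    if (peer_id % 2 != 0) {
        return SKETCH_ERR_NOT_FOUND;
    }
    *addr_out = 1000 + peer_id;
    return SKETCH_SUCCESS;
}

/* Return a status code instead of a NULL pointer, so the caller can
 * see why no proc was created. */
int sketch_proc_create(int peer_id, struct sketch_proc *proc)
{
    int addr = 0;
    int rc = sketch_modex_recv(peer_id, &addr);
    if (SKETCH_SUCCESS != rc) {
        return rc;
    }
    proc->peer_id = peer_id;
    proc->addr = addr;
    return SKETCH_SUCCESS;
}

/* Set reachable[i] to 1 for peers this module can talk to and to 0
 * for peers that did not publish the key. Only a genuine error
 * aborts the loop. */
int sketch_add_procs(const int *peer_ids, size_t n, int *reachable)
{
    assert(NULL != peer_ids && NULL != reachable);
    for (size_t i = 0; i < n; ++i) {
        struct sketch_proc proc;
        int rc = sketch_proc_create(peer_ids[i], &proc);
        if (SKETCH_ERR_NOT_FOUND == rc) {
            reachable[i] = 0;   /* peer lacks this BTL: skip it */
            continue;
        }
        if (SKETCH_SUCCESS != rc) {
            return rc;          /* real error: propagate it */
        }
        reachable[i] = 1;
    }
    return SKETCH_SUCCESS;
}
```

With peers {0, 1, 2}, the sketch marks peers 0 and 2 reachable and
peer 1 unreachable, and still returns success. A negative peer id
makes it return the fatal code instead.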

  George.

PS: There is an easy way to test this particular case by using the
MPMD capabilities of mpiexec. As an example, for a quick NetPIPE run
between two processes, one supporting SM and TCP and one supporting
only SM (I ignored self here), you can do:

mpirun -np 1 --mca btl tcp,sm,self ./NPmpi -l 5 -u 5 : \
       -np 1 --mca btl sm,self ./NPmpi -l 5 -u 5


On Tue, May 13, 2014 at 2:09 PM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
> I notice that BTLs are not checking the return value from ompi_modex_recv() 
> for OPAL_ERR_DATA_VALUE_NOT_FOUND (indicating that the peer process didn't 
> put that modex key).  In the BTL context, NOT_FOUND means that that peer 
> process doesn't have this BTL, so this local peer process should probably 
> mark it as unreachable in add_procs().
>
> This is on both trunk and the v1.8 branch.
>
> The BTLs listed above are not checking/handling ompi_modex_recv() returning 
> OPAL_ERR_DATA_VALUE_NOT_FOUND properly.  Most of these BTLs do something like 
> this:
>
> -----
> module_add_procs() {
>   loop over the peers {
>     proc = proc_create(...)
>     if (NULL == proc)
>       error!
>     ....
>   }
> }
>
> proc_create(...) {
>   if (ompi_modex_recv() != OMPI_SUCCESS)
>      return NULL;
>   ...
> }
> -----
>
> The fix is to make proc_create() return something a bit more expressive so 
> that add_procs() can tell the difference between "error!" and "you can't 
> reach this peer".
>
> I fixed this in the usnic BTL back in late March, but forgot to bring this to 
> everyone's attention -- oops.  See 
> https://svn.open-mpi.org/trac/ompi/ticket/4442
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14783.php
