> On Nov 5, 2014, at 6:11 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>
> Elena,
>
> the first case (-mca btl tcp,self) crashing is a bug and i will have a look
> at it.
>
> the second case (-mca btl sm,self) is a feature : the sm btl cannot be used
> with tasks having different jobids (this is the case after a spawn), and
> obviously, self cannot be used either, so the behaviour and error message
> is correct.
> /* i am not aware of any plans to make the sm btl work with tasks from
> different jobids */

That is correct - I’m also unaware of any plans to extend it at this point, though IIRC Nathan at one time mentioned perhaps extending vader for that purpose.
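For context, the cross-jobid situation described above comes from the basic MPI_Comm_spawn pattern that simple_spawn exercises: the spawned children start under a new jobid, so every parent/child exchange crosses jobids. Below is a minimal sketch of that pattern (illustrative only; the real test is orte/test/mpi/simple_spawn.c and differs in detail):

    /* minimal MPI_Comm_spawn sketch -- parent and children end up in
     * different jobids, so parent<->child traffic is exactly the case
     * the sm btl cannot handle */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm parent, intercomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);

        if (MPI_COMM_NULL == parent) {
            /* parent job: spawn 3 copies of this binary into a new jobid */
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 3, MPI_INFO_NULL,
                           0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
            MPI_Barrier(intercomm);     /* collective across the two jobids */
            MPI_Comm_disconnect(&intercomm);
        } else {
            /* child job: synchronize back with the parent job */
            MPI_Barrier(parent);
            MPI_Comm_disconnect(&parent);
        }

        MPI_Finalize();
        return 0;
    }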
> the third case (-mca btl openib,self) is more controversial ...
> i previously posted
> http://www.open-mpi.org/community/lists/devel/2014/10/16136.php
> what happens in your case (simple_spawn) is that the openib modex is sent
> with PMIX_REMOTE, which means the openib btl cannot be used between tasks
> on the same node.
> i am still waiting for some feedback, since i cannot figure out whether
> this is a feature or an undesired side effect / bug

I believe it is a bug - I provided some initial values for the modex scope with the expectation (and request when we committed it) that people would review and modify them as appropriate. I recall setting the openib scope as “remote” only because I wasn’t aware of anyone using it for local comm. Since Mellanox obviously is testing for that case, a scope of PMIX_GLOBAL would be more appropriate.
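To make the scope question concrete: PMIX_REMOTE publishes a modex entry only to peers on other nodes, PMIX_LOCAL only to peers on the same node, and PMIX_GLOBAL to both. The toy function below illustrates that visibility rule; the scope names mirror the real constants, but the function itself is a simplification for illustration, not actual Open MPI code:

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { PMIX_LOCAL, PMIX_REMOTE, PMIX_GLOBAL } pmix_scope_t;

    /* would a given peer ever receive this modex entry? */
    static bool modex_visible(pmix_scope_t scope, bool peer_on_same_node)
    {
        switch (scope) {
        case PMIX_LOCAL:  return peer_on_same_node;   /* same node only   */
        case PMIX_REMOTE: return !peer_on_same_node;  /* other nodes only */
        case PMIX_GLOBAL: return true;                /* everyone         */
        }
        return false;
    }

    int main(void)
    {
        /* openib currently publishes with PMIX_REMOTE, so a same-node peer
         * (parent vs. spawned child on one host) never sees its endpoint
         * info -> "Unreachable"; PMIX_GLOBAL would make it visible */
        printf("PMIX_REMOTE, same node: %d\n", modex_visible(PMIX_REMOTE, true));
        printf("PMIX_GLOBAL, same node: %d\n", modex_visible(PMIX_GLOBAL, true));
        return 0;
    }

This is why switching the openib modex scope to PMIX_GLOBAL is expected to make the same-node spawn case reachable again.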
> the last case (-mca btl ^sm,openib) does make sense to me :
> the tcp and self btls are used and they work just like they should.
>
> bottom line, i will investigate the first crash and wait for feedback
> about the openib btl.
>
> Cheers,
>
> Gilles
>
> On 2014/11/06 1:08, Elena Elkina wrote:
>> Hi,
>>
>> It looks like there is a problem in trunk which reproduces with the
>> simple_spawn test (orte/test/mpi/simple_spawn.c). It seems to be an issue
>> with pmix. It doesn't reproduce with the default set of btls, but it does
>> reproduce when several btls are specified explicitly. For example,
>>
>> salloc -N5 $OMPI_HOME/install/bin/mpirun -np 33 --map-by node -mca coll ^ml
>> -display-map -mca orte_debug_daemons true --leave-session-attached
>> --debug-daemons -mca pml ob1 -mca btl *tcp,self*
>> ./orte/test/mpi/simple_spawn
>>
>> gets
>>
>> simple_spawn: ../../ompi/group/group_init.c:215:
>> ompi_group_increment_proc_count: Assertion `((0xdeafbeedULL << 32) +
>> 0xdeafbeedULL) == ((opal_object_t *) (proc_pointer))->obj_magic_id' failed.
>> [sputnik3.vbench.com:28888] [[41877,0],3] orted_cmd: exit cmd, but proc
>> [[41877,1],2] is alive
>> [sputnik5][[41877,1],29][../../../../../opal/mca/btl/tcp/btl_tcp_endpoint.c:675:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.1.42 failed: Connection refused (111)
>>
>> salloc -N1 $OMPI_HOME/install/bin/mpirun -np 3 --map-by node -mca coll ^ml
>> -display-map -mca orte_debug_daemons true --leave-session-attached
>> --debug-daemons -mca pml ob1 -mca btl *sm,self* ./orte/test/mpi/simple_spawn
>>
>> fails with
>>
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[59481,2],0]) is on host: sputnik1
>> Process 2 ([[59481,1],0]) is on host: sputnik1
>> BTLs attempted: self sm
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> [sputnik1.vbench.com:22156] [[59481,1],2] ORTE_ERROR_LOG: Unreachable in
>> file ../../../../../ompi/mca/dpm/orte/dpm_orte.c at line 485
>>
>> salloc -N1 $OMPI_HOME/install/bin/mpirun -np 3 --map-by node -mca coll ^ml
>> -display-map -mca orte_debug_daemons true --leave-session-attached
>> --debug-daemons -mca pml ob1 -mca btl *openib,self*
>> ./orte/test/mpi/simple_spawn
>>
>> also doesn't work:
>>
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[60046,1],13]) is on host: sputnik4
>> Process 2 ([[60046,2],1]) is on host: sputnik4
>> BTLs attempted: openib self
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> [sputnik4.vbench.com:25476] [[60046,1],3] ORTE_ERROR_LOG: Unreachable in
>> file ../../../../../ompi/mca/dpm/orte/dpm_orte.c at line 485
>>
>> *But* the combination ^sm,openib seems to work.
>>
>> I tried different revisions from the beginning of October; the problem
>> reproduces on all of them.
>>
>> Best regards,
>> Elena
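A side note for anyone triaging the first trace above: the failed assertion in group_init.c is the standard OPAL object magic check compiled into debug builds, so it firing suggests the proc pointer handed to ompi_group_increment_proc_count was never constructed (or was already released) - consistent with the spawn path handing the group a stale ompi_proc_t. A self-contained paraphrase of the check (the real macro lives in opal/class/opal_object.h; the struct here is simplified):

    #include <assert.h>
    #include <stdint.h>

    /* mirrors OPAL_OBJ_MAGIC_ID from opal/class/opal_object.h */
    #define OPAL_OBJ_MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

    /* simplified stand-in: the real opal_object_t also carries a class
     * pointer and a reference count */
    typedef struct {
        uint64_t obj_magic_id;   /* stamped by OBJ_NEW / OBJ_CONSTRUCT */
    } opal_object_t;

    int main(void)
    {
        opal_object_t live  = { OPAL_OBJ_MAGIC_ID };  /* properly constructed */
        opal_object_t stale = { 0 };                  /* never constructed    */

        /* the check from the trace: passes for live objects, aborts for
         * stale ones -- the latter is what simple_spawn tripped over */
        assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) &live)->obj_magic_id);
        (void) stale;
        return 0;
    }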