Re: [OMPI devel] intercomm_create from the ibm test suite hangs
Thanks Ralph!

Cheers,

Gilles

On 2014/08/28 4:52, Ralph Castain wrote:
> Took me a while to track this down, but it is now fixed - a combination of
> several minor errors
>
> Thanks
> Ralph
>
> On Aug 27, 2014, at 4:07 AM, Gilles Gouaillardet wrote:
>
>> Folks,
>>
>> the intercomm_create test case from the ibm test suite can hang under
>> some configurations.
>>
>> basically, it will spawn n tasks in a first communicator, and then n
>> tasks in a second communicator.
>>
>> when i run from node0 :
>> mpirun -np 1 --mca btl tcp,self --mca coll ^ml -host node1,node2 ./intercomm_create
>>
>> the second spawn will hang.
>> a simple workaround is to use 3 hosts :
>> mpirun -np 1 --mca btl tcp,self --mca coll ^ml -host node1,node2,node3 ./intercomm_create
>>
>> the second spawn creates the task on node2.
>> for some reason i cannot fully understand, pmix believes the orteds of
>> nodes node1 and node2 are involved in the allgather.
>> since node1 is not involved whatsoever, the program hangs
>> /* in create_dmns, orte_get_job_data_object(sig->signature[0].jobid)
>> returns jdata with jdata->map->num_nodes = 2 */
>>
>> Cheers,
>>
>> Gilles
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/08/15732.php
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15743.php
Re: [OMPI devel] intercomm_create from the ibm test suite hangs
Took me a while to track this down, but it is now fixed - a combination of
several minor errors

Thanks
Ralph

On Aug 27, 2014, at 4:07 AM, Gilles Gouaillardet wrote:

> Folks,
>
> the intercomm_create test case from the ibm test suite can hang under
> some configurations.
>
> basically, it will spawn n tasks in a first communicator, and then n
> tasks in a second communicator.
>
> when i run from node0 :
> mpirun -np 1 --mca btl tcp,self --mca coll ^ml -host node1,node2 ./intercomm_create
>
> the second spawn will hang.
> a simple workaround is to use 3 hosts :
> mpirun -np 1 --mca btl tcp,self --mca coll ^ml -host node1,node2,node3 ./intercomm_create
>
> the second spawn creates the task on node2.
> for some reason i cannot fully understand, pmix believes the orteds of
> nodes node1 and node2 are involved in the allgather.
> since node1 is not involved whatsoever, the program hangs
> /* in create_dmns, orte_get_job_data_object(sig->signature[0].jobid)
> returns jdata with jdata->map->num_nodes = 2 */
>
> Cheers,
>
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15732.php
[OMPI devel] intercomm_create from the ibm test suite hangs
Folks,

the intercomm_create test case from the ibm test suite can hang under
some configurations.

basically, it will spawn n tasks in a first communicator, and then n
tasks in a second communicator.

when i run from node0 :
mpirun -np 1 --mca btl tcp,self --mca coll ^ml -host node1,node2 ./intercomm_create

the second spawn will hang.
a simple workaround is to use 3 hosts :
mpirun -np 1 --mca btl tcp,self --mca coll ^ml -host node1,node2,node3 ./intercomm_create

the second spawn creates the task on node2.
for some reason i cannot fully understand, pmix believes the orteds of
nodes node1 and node2 are involved in the allgather.
since node1 is not involved whatsoever, the program hangs
/* in create_dmns, orte_get_job_data_object(sig->signature[0].jobid)
returns jdata with jdata->map->num_nodes = 2 */

Cheers,

Gilles
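
For readers of the archive, the hang described above can be pictured with a toy model. This is only a sketch of the reporter's hypothesis, not ORTE or pmix code: the function name `allgather_completes` and both node lists are made up for illustration, and the fix that was eventually committed is described in the thread only as "a combination of several minor errors". The model assumes a daemon-level allgather finishes only once every daemon in the expected set has contributed.

```python
def allgather_completes(expected_daemons, contributing_daemons):
    """Toy model: the collective is done only when no expected daemon is missing.

    Returns (done, missing), where missing is the sorted list of daemons
    the collective would still be waiting for.
    """
    missing = sorted(set(expected_daemons) - set(contributing_daemons))
    return (not missing, missing)


# Hypothetical reconstruction of the two-host run from the report.
# Assumption: the expected set is taken from a job map with num_nodes = 2.
expected = ["node1", "node2"]
# Assumption: only the daemon hosting the second spawn's task contributes.
contributing = ["node2"]

done, missing = allgather_completes(expected, contributing)
print(done, missing)  # prints: False ['node1']
```

In this model the collective waits forever for node1, which matches the reported symptom. It says nothing about why the three-host run avoids the problem.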