Hmmm... I did some digging, and as best I can tell the root cause is that the second job ("b" in the test program) never actually calls connect_accept! It looks like a change may have occurred in Intercomm_create that causes it not to recognize the need to do so.
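
For reference, the shape of that test is roughly the following. This is only a sketch of the pattern being exercised, not the actual ibm test source - the "a"/"b" argv flags, the tag value, and the exact merge/bridge arrangement are assumptions on my part - but it shows the three roles and where job "b" should be driven through connect_accept as part of MPI_Intercomm_create:

    /* sketch of the spawn + 3-way MPI_Intercomm_create pattern; not the ibm test */
    #include <mpi.h>
    #include <string.h>

    #define TAG 201  /* arbitrary; both sides just have to agree */

    int main(int argc, char **argv)
    {
        MPI_Comm parent, ab_inter, ac_inter, ab_intra, abc_inter;

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);

        if (MPI_COMM_NULL == parent) {
            /* task 0: spawn "a", then "b", then bridge them together */
            char *argv_a[] = { "a", NULL };
            char *argv_b[] = { "b", NULL };

            MPI_Comm_spawn(argv[0], argv_a, 1, MPI_INFO_NULL, 0,
                           MPI_COMM_SELF, &ab_inter, MPI_ERRCODES_IGNORE);
            MPI_Comm_spawn(argv[0], argv_b, 1, MPI_INFO_NULL, 0,
                           MPI_COMM_SELF, &ac_inter, MPI_ERRCODES_IGNORE);

            /* local group = {task 0, a}; task 0 is the local leader and
             * reaches "b" (rank 0 of the remote group) over the second
             * spawn intercomm, which serves as the bridge */
            MPI_Intercomm_merge(ab_inter, 0, &ab_intra);
            MPI_Intercomm_create(ab_intra, 0, ac_inter, 0, TAG, &abc_inter);
        } else if (argc > 1 && 0 == strcmp(argv[1], "a")) {
            /* task 1 ("a"): member of the local group; the peer comm is
             * only significant at the local leader (task 0) */
            MPI_Intercomm_merge(parent, 1, &ab_intra);
            MPI_Intercomm_create(ab_intra, 0, MPI_COMM_NULL, 0, TAG, &abc_inter);
        } else {
            /* task 2 ("b"): forms the remote group on its own and reaches
             * the local leader (task 0) through its spawn intercomm --
             * this is the job that should end up in connect_accept */
            MPI_Intercomm_create(MPI_COMM_WORLD, 0, parent, 0, TAG, &abc_inter);
        }

        MPI_Comm_free(&abc_inter);
        MPI_Finalize();
        return 0;
    }

Started with mpirun -np 1, that gives exactly the task 0 / task 1 / task 2 picture Gilles describes below.
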
Can anyone confirm that diagnosis? FWIW: job 1 clearly receives and has all the required info in the correct places - it is ready to provide it to job 2, if/when job 2 actually calls connect_accept.

On May 27, 2014, at 10:13 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Hi Gilles
>
> I concur on the typo and fixed it - thanks for catching it. I'll have to look
> into the problem you reported, as it has been fixed in the past and was
> working the last time I checked it. The info required for this 3-way
> connect/accept is supposed to be in the modex provided by the common
> communicator.
>
> On May 27, 2014, at 3:51 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
>> Folks,
>>
>> While debugging dynamic/intercomm_create from the ibm test suite, I found
>> something odd.
>>
>> I ran *without* any batch manager on a VM (one socket and four cpus):
>>   mpirun -np 1 ./dynamic/intercomm_create
>>
>> It hangs by default.
>> It works with --mca coll ^ml.
>>
>> Basically:
>> - task 0 spawns task 1
>> - task 0 spawns task 2
>> - a communicator is created for the 3 tasks via MPI_Intercomm_create()
>>
>> MPI_Intercomm_create() calls ompi_comm_get_rprocs(), which calls
>> ompi_proc_set_locality().
>>
>> Then, on task 1, ompi_proc_set_locality() calls
>> opal_dstore.fetch(opal_dstore_internal, "task 2"->proc_name, ...), which
>> fails, and this is OK.
>> It then calls
>> opal_dstore.fetch(opal_dstore_nonpeer, "task 2"->proc_name, ...), which
>> fails, and this is *not* OK.
>>
>> /* on task 2, the first fetch for "task 1" fails but the second succeeds */
>>
>> My analysis is that when task 2 was created, it updated its
>> opal_dstore_nonpeer with info from task 1, which had previously been
>> spawned by task 0.
>> When task 1 was spawned, task 2 did not exist yet, so task 1's
>> opal_dstore_nonpeer contains no reference to task 2.
>> But when task 2 was spawned, the opal_dstore_nonpeer of task 1 was not
>> updated, hence the failure.
>>
>> (On task 1, proc_flags of task 2 has an incorrect locality; this likely
>> confuses coll ml and hangs the test.)
>>
>> Should task 1 have received new information when task 2 was spawned?
>> Should task 2 have sent information to task 1 when it was spawned?
>> Should task 1 have (tried to) get fresh information before invoking
>> MPI_Intercomm_create()?
>>
>> Incidentally, I found that ompi_proc_set_locality calls opal_dstore.store
>> with identifier &proc (the argument is &proc->proc_name everywhere else),
>> so this is likely a bug/typo. The attached patch fixes this.
>>
>> Thanks in advance for your feedback,
>>
>> Gilles
>> <proc.patch>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/05/14848.php
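
To make the ordering problem Gilles describes concrete, here is a toy model of the two-level lookup (plain C, deliberately not OMPI internals; the store contents are simply what the spawn ordering above implies): task 1's nonpeer snapshot was taken before task 2 existed and is never refreshed, while task 2's snapshot was taken after task 1 existed - which is exactly the asymmetric fetch failure he reports.

    /* toy model only: each task has an "internal" store (its own job) and a
     * "nonpeer" snapshot of the other jobs known at the time it was launched */
    #include <stdio.h>
    #include <string.h>

    static int fetch(const char *store[], const char *id)
    {
        for (int i = 0; store[i] != NULL; i++) {
            if (0 == strcmp(store[i], id)) {
                return 0;          /* found */
            }
        }
        return -1;                 /* not found */
    }

    int main(void)
    {
        /* task 1 was launched while only task 0 existed; task 2 was
         * launched last and therefore saw both task 0 and task 1 */
        const char *t1_internal[] = { "task 1", NULL };
        const char *t1_nonpeer[]  = { "task 0", NULL };
        const char *t2_internal[] = { "task 2", NULL };
        const char *t2_nonpeer[]  = { "task 0", "task 1", NULL };

        /* the lookups ompi_proc_set_locality() effectively performs
         * during MPI_Intercomm_create(), per Gilles' report above */
        printf("task 1 -> task 2: internal=%d nonpeer=%d  (both fail)\n",
               fetch(t1_internal, "task 2"), fetch(t1_nonpeer, "task 2"));
        printf("task 2 -> task 1: internal=%d nonpeer=%d  (fallback works)\n",
               fetch(t2_internal, "task 1"), fetch(t2_nonpeer, "task 1"));
        return 0;
    }

Any of the three options Gilles lists would amount to getting task 2 into task 1's picture before that fetch happens.
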