Hi Gilles, I concur on the typo and have fixed it - thanks for catching it. I'll have to look into the problem you reported, as it was fixed in the past and was working the last time I checked. The info required for this 3-way connect/accept is supposed to be in the modex provided by the common communicator.
On May 27, 2014, at 3:51 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> Folks,
>
> While debugging the dynamic/intercomm_create test from the ibm test suite, I found something odd.
>
> I ran *without* any batch manager on a VM (one socket and four cpus):
>
> mpirun -np 1 ./dynamic/intercomm_create
>
> It hangs by default; it works with --mca coll ^ml.
>
> Basically:
> - task 0 spawns task 1
> - task 0 spawns task 2
> - a communicator is created for the 3 tasks via MPI_Intercomm_create()
>
> MPI_Intercomm_create() calls ompi_comm_get_rprocs(), which calls ompi_proc_set_locality().
>
> Then, on task 1, ompi_proc_set_locality() calls opal_dstore.fetch(opal_dstore_internal, "task 2"->proc_name, ...), which fails, and this is OK.
> Then it calls opal_dstore.fetch(opal_dstore_nonpeer, "task 2"->proc_name, ...), which also fails, and this is *not* OK.
>
> /* on task 2, the first fetch for "task 1" fails but the second succeeds */
>
> My analysis is that when task 2 was created, it updated its opal_dstore_nonpeer with info from task 1, which had previously been spawned by task 0.
> When task 1 was spawned, task 2 did not exist yet, and hence its opal_dstore_nonpeer contains no reference to task 2.
> But when task 2 was spawned, the opal_dstore_nonpeer of task 1 was not updated, hence the failure.
>
> (On task 1, proc_flags of task 2 has incorrect locality; this likely confuses coll ml and hangs the test.)
>
> Should task 1 have received new information when task 2 was spawned?
> Should task 2 have sent information to task 1 when it was spawned?
> Should task 1 have (tried to) get fresh information before invoking MPI_Intercomm_create()?
>
> Incidentally, I found that ompi_proc_set_locality() calls opal_dstore.store() with identifier &proc, whereas the argument is &proc->proc_name everywhere else, so this is likely a bug/typo. The attached patch fixes this.
>
> Thanks in advance for your feedback,
>
> Gilles
> <proc.patch>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14848.php