Hi Gilles

I concur on the typo and fixed it - thanks for catching it. I'll have to look 
into the problem you reported as it has been fixed in the past, and was working 
last I checked it. The info required for this 3-way connect/accept is supposed 
to be in the modex provided by the common communicator.

On May 27, 2014, at 3:51 AM, Gilles Gouaillardet 
<gilles.gouaillar...@gmail.com> wrote:

> Folks,
> 
> while debugging the dynamic/intercomm_create from the ibm test suite, i found 
> something odd.
> 
> i ran *without* any batch manager on a VM (one socket and four cpus)
> mpirun -np 1 ./dynamic/intercomm_create
> 
> it hangs by default
> it works with --mca coll ^ml
> 
> basically :
> - task 0 spawns task 1
> - task 0 spawns task 2
> - a communicator is created for the 3 tasks via MPI_Intercomm_create()
> 
> MPI_Intercomm_create() calls ompi_comm_get_rprocs() which calls 
> ompi_proc_set_locality()
> 
> then, on task 1, ompi_proc_set_locality() calls
> opal_dstore.fetch(opal_dstore_internal, "task 2"->proc_name, ...) which fails 
> and this is OK
> then 
> opal_dstore.fetch(opal_dstore_nonpeer, "task 2"->proc_name, ...) which fails 
> and this is *not* OK
> 
> /* on task 2, the first fetch for "task 1" fails but the second succeeds */
> 
> my analysis is that when task 2 was created, it updated its 
> opal_dstore_nonpeer with info from "task 1" which was previously spawned by 
> task 0.
> when task 1 was spawned, task 2 did not exist yet and hence 
> opal_dstore_nonpeer contains no reference to task 2.
> but when task 2 was spawned, opal_dstore_nonpeer of task 1 has not been 
> updated, hence the failure
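
The asymmetry described in this analysis can be reproduced with a toy model. This is only a minimal sketch and not Open MPI code: the table `known`, the functions `reset_world`, `spawn_task` and `fetch_nonpeer`, and the `notify_existing` flag are all hypothetical names invented for illustration. It also assumes the parent always learns about the task it spawns.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_TASKS 8

/* Hypothetical stand-in for each task's "nonpeer" store: known[a][b] is
 * true when task a holds data about task b. Not the real opal_dstore. */
static bool known[MAX_TASKS][MAX_TASKS];
static bool alive[MAX_TASKS];

/* Start over with only task 0 running. */
void reset_world(void)
{
    memset(known, 0, sizeof known);
    memset(alive, 0, sizeof alive);
    alive[0] = true;
}

/* Spawn `child` from `parent`. The child learns about every task that
 * already exists, and the parent learns about the child (assumption).
 * Other existing tasks learn about the child only if notify_existing. */
void spawn_task(int parent, int child, bool notify_existing)
{
    assert(parent >= 0 && parent < MAX_TASKS && alive[parent]);
    assert(child >= 0 && child < MAX_TASKS && !alive[child]);
    for (int t = 0; t < MAX_TASKS; t++) {
        if (!alive[t])
            continue;
        known[child][t] = true;
        if (t == parent || notify_existing)
            known[t][child] = true;
    }
    alive[child] = true;
}

/* Model of the second fetch: does `self` hold data about `peer`? */
bool fetch_nonpeer(int self, int peer)
{
    assert(self >= 0 && self < MAX_TASKS && peer >= 0 && peer < MAX_TASKS);
    return known[self][peer];
}
```

Calling `reset_world()`, then `spawn_task(0, 1, false)` and `spawn_task(0, 2, false)`, leaves `fetch_nonpeer(2, 1)` true and `fetch_nonpeer(1, 2)` false, which matches the behaviour reported here. Passing `true` for `notify_existing` models one possible answer to the questions below, namely that task 1 is updated when task 2 is spawned.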
> 
> (on task 1, proc_flags of task 2 has incorrect locality, this likely confuses 
> coll ml and hangs the test)
> 
> should task1 have received new information when task 2 was spawned ?
> should task2 have sent information to task1 when it was spawned ?
> should task1 have (tried to) get fresh information before invoking 
> MPI_Intercomm_create() ?
> 
> incidentally, i found ompi_proc_set_locality calls opal_dstore.store with 
> identifier &proc (the argument is &proc->proc_name everywhere else), so this
> is likely a bug/typo. the attached patch fixes this.
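
To show why `&proc` and `&proc->proc_name` are not interchangeable as identifiers, here is a minimal sketch. It is not the Open MPI source: `identifier_t`, `proc_t`, `key_from` and `keys_match` are hypothetical names, and the sketch assumes a 64-bit identifier, 64-bit pointers, and a store that reads its key through the pointer it is given.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 64-bit process identifier; stands in for the real type. */
typedef uint64_t identifier_t;

typedef struct {
    int flags;               /* placeholder for other fields */
    identifier_t proc_name;  /* the key used everywhere else */
} proc_t;

/* This sketch assumes pointers and identifiers have the same size. */
_Static_assert(sizeof(proc_t *) == sizeof(identifier_t),
               "sketch assumes 64-bit pointers");

/* Read a key the way a store taking an identifier pointer would. */
identifier_t key_from(const void *id)
{
    identifier_t key;
    memcpy(&key, id, sizeof key);
    return key;
}

/* True when the key used to store matches the key used to fetch. */
bool keys_match(proc_t *proc, bool use_patched_call)
{
    identifier_t store_key = use_patched_call
        ? key_from(&proc->proc_name)  /* key is the process name */
        : key_from(&proc);            /* key is the pointer's own bytes */
    identifier_t fetch_key = key_from(&proc->proc_name);
    return store_key == fetch_key;
}
```

Under these assumptions, data stored with `&proc` is filed under the bytes of the pointer variable itself, so a later fetch keyed on `&proc->proc_name` would not find it. Presumably the attached patch makes the store call use the same key as the fetches.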
> 
> Thanks in advance for your feedback,
> 
> Gilles
> <proc.patch>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14848.php