Hmmm... I did some digging, and as best I can tell the root cause is that
the second job ("b" in the test program) never actually calls
connect_accept!  It looks like a change may have gone into Intercomm_create
that causes it not to recognize the need to do so.

Can anyone confirm that diagnosis?

FWIW: job 1 clearly receives all the required info and has it in the correct 
places - it is ready to provide it to job 2, if/when job 2 actually calls 
connect_accept.

On May 27, 2014, at 10:13 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Hi Gilles
> 
> I concur on the typo and have fixed it - thanks for catching it. I'll have to look 
> into the problem you reported, as it has been fixed in the past and was 
> working the last time I checked. The info required for this 3-way connect/accept is 
> supposed to be in the modex provided by the common communicator.
> 
> On May 27, 2014, at 3:51 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
>> Folks,
>> 
>> While debugging dynamic/intercomm_create from the ibm test suite, I 
>> found something odd.
>> 
>> I ran *without* any batch manager on a VM (one socket and four cpus):
>> mpirun -np 1 ./dynamic/intercomm_create
>> 
>> It hangs by default.
>> It works with --mca coll ^ml
>> 
>> Basically, the test does the following (a rough sketch follows the list):
>> - task 0 spawns task 1
>> - task 0 spawns task 2
>> - a communicator is created for the 3 tasks via MPI_Intercomm_create()
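>> 
>> For reference, here is a rough, self-contained sketch of that shape (this is
>> *not* the actual ibm test source - the single-binary layout, the "a"/"b"
>> argv markers, the merge ordering and the tag value are only illustrative
>> assumptions):
>> 
>> #include <mpi.h>
>> 
>> int main(int argc, char *argv[])
>> {
>>     MPI_Comm parent, ab, ac, ab_intra, ac_intra, abc;
>>     int tag = 101;
>> 
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_get_parent(&parent);
>> 
>>     if (MPI_COMM_NULL == parent) {               /* task 0 */
>>         char *arg_a[] = { "a", NULL }, *arg_b[] = { "b", NULL };
>>         MPI_Comm_spawn(argv[0], arg_a, 1, MPI_INFO_NULL, 0,
>>                        MPI_COMM_SELF, &ab, MPI_ERRCODES_IGNORE);
>>         MPI_Comm_spawn(argv[0], arg_b, 1, MPI_INFO_NULL, 0,
>>                        MPI_COMM_SELF, &ac, MPI_ERRCODES_IGNORE);
>>         MPI_Intercomm_merge(ab, 0, &ab_intra);   /* {task 0, task 1} */
>>         MPI_Intercomm_merge(ac, 0, &ac_intra);   /* {task 0, task 2} */
>>         /* group {0,1} on one side, {2} on the other; the two leaders
>>          * (task 0 and task 2) talk over the merged {0,2} intracomm */
>>         MPI_Intercomm_create(ab_intra, 0, ac_intra, 1, tag, &abc);
>>     } else if ('a' == argv[1][0]) {              /* task 1 */
>>         MPI_Intercomm_merge(parent, 1, &ab_intra);
>>         /* not the local leader, so peer_comm/remote_leader are ignored */
>>         MPI_Intercomm_create(ab_intra, 0, MPI_COMM_SELF, 0, tag, &abc);
>>     } else {                                     /* task 2 */
>>         MPI_Intercomm_merge(parent, 1, &ac_intra);
>>         MPI_Intercomm_create(MPI_COMM_SELF, 0, ac_intra, 0, tag, &abc);
>>     }
>> 
>>     MPI_Comm_free(&abc);
>>     MPI_Finalize();
>>     return 0;
>> }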
>> 
>> MPI_Intercomm_create() calls ompi_comm_get_rprocs(), which in turn calls 
>> ompi_proc_set_locality().
>> 
>> Then, on task 1, ompi_proc_set_locality() calls
>> opal_dstore.fetch(opal_dstore_internal, "task 2"->proc_name, ...), which 
>> fails - and that is OK.
>> It then calls
>> opal_dstore.fetch(opal_dstore_nonpeer, "task 2"->proc_name, ...), which fails 
>> as well - and that is *not* OK.
>> 
>> /* on task 2, the first fetch for "task 1" fails but the second succeeds */
>> 
>> My analysis is that when task 2 was created, it updated its 
>> opal_dstore_nonpeer with info from "task 1", which had previously been spawned 
>> by task 0.
>> When task 1 was spawned, task 2 did not exist yet, so task 1's 
>> opal_dstore_nonpeer contains no reference to task 2.
>> But when task 2 was later spawned, task 1's opal_dstore_nonpeer was not 
>> updated - hence the failure.
>> 
>> (On task 1, the proc_flags for task 2 thus carry incorrect locality; this likely 
>> confuses coll ml and hangs the test.)
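>> 
>> To make the timing issue concrete, here is a toy, stand-alone model of that
>> asymmetry (plain C, *not* OMPI code - the string list below merely stands in
>> for opal_dstore_nonpeer):
>> 
>> #include <stdio.h>
>> #include <string.h>
>> 
>> #define MAX_PEERS 4
>> 
>> struct nonpeer_store {
>>     const char *known[MAX_PEERS];   /* peers that existed at spawn time */
>>     int n;
>> };
>> 
>> static void add_peer(struct nonpeer_store *s, const char *name) {
>>     s->known[s->n++] = name;
>> }
>> 
>> static int fetch(const struct nonpeer_store *s, const char *name) {
>>     for (int i = 0; i < s->n; i++)
>>         if (0 == strcmp(s->known[i], name)) return 0;   /* found   */
>>     return -1;                                          /* missing */
>> }
>> 
>> int main(void) {
>>     struct nonpeer_store task1 = { { 0 }, 0 }, task2 = { { 0 }, 0 };
>> 
>>     /* when task 1 is spawned, only task 0 exists */
>>     add_peer(&task1, "task 0");
>> 
>>     /* when task 2 is spawned, tasks 0 and 1 already exist, so it learns
>>      * about both - but nothing pushes the new info back to task 1 */
>>     add_peer(&task2, "task 0");
>>     add_peer(&task2, "task 1");
>> 
>>     printf("task 1 fetch(task 2): %d\n", fetch(&task1, "task 2"));  /* -1 */
>>     printf("task 2 fetch(task 1): %d\n", fetch(&task2, "task 1"));  /*  0 */
>>     return 0;
>> }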
>> 
>> Should task 1 have received new information when task 2 was spawned?
>> Should task 2 have sent information to task 1 when it was spawned?
>> Should task 1 have (tried to) get fresh information before invoking 
>> MPI_Intercomm_create()?
>> 
>> Incidentally, I found that ompi_proc_set_locality calls opal_dstore.store with 
>> identifier &proc (the argument is &proc->proc_name everywhere else), so this
>> is likely a bug/typo. The attached patch fixes this.
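>> 
>> To illustrate why the identifier matters (toy code, *not* the OMPI dstore,
>> and it assumes a 64-bit build where both a pointer and the packed name are 8
>> bytes): passing &proc keys the entry on the bytes of the local pointer
>> variable, whereas every fetch keys on the bytes of the proc name, so the two
>> can never match.
>> 
>> #include <stdio.h>
>> #include <stdint.h>
>> #include <string.h>
>> 
>> typedef struct { uint32_t jobid, vpid; } name_t;          /* 8 bytes */
>> typedef struct { name_t proc_name; int flags; } proc_t;
>> 
>> /* mimic a store that keys entries on the 8 bytes behind the identifier */
>> static uint64_t key_of(const void *id) {
>>     uint64_t k;
>>     memcpy(&k, id, sizeof(k));
>>     return k;
>> }
>> 
>> int main(void) {
>>     proc_t p = { { 42, 3 }, 0 };
>>     proc_t *proc = &p;
>> 
>>     uint64_t stored  = key_of(&proc);             /* the buggy call site   */
>>     uint64_t fetched = key_of(&proc->proc_name);  /* every other call site */
>> 
>>     printf("store key 0x%016llx vs fetch key 0x%016llx -> %s\n",
>>            (unsigned long long)stored, (unsigned long long)fetched,
>>            stored == fetched ? "match" : "mismatch");
>>     return 0;
>> }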
>> 
>> Thanks in advance for your feedback,
>> 
>> Gilles
>> <proc.patch>
> 
