And that was indeed the problem. Fixed, and now the trunk runs cleanly through 
my MTT.

Thanks again!
Ralph

On Aug 25, 2014, at 7:38 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Yeah, that was going to be my first place to look once I finished breakfast 
> :-)
> 
> Thanks!
> Ralph
> 
> On Aug 25, 2014, at 7:32 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
>> Thanks for the explanation
>> 
>> In orte_dt_compare_sig(...), the length passed to memcmp was not multiplied 
>> by sizeof(opal_identifier_t), so only the first value1->sz bytes of the 
>> signature array were compared instead of value1->sz identifiers.
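>> 
>> A rough sketch of what I mean, from memory and untested (the struct and 
>> helper names below are placeholders, not the actual trunk code):
>> 
>>     #include <string.h>    /* memcmp */
>>     #include <stdint.h>
>> 
>>     typedef uint64_t opal_identifier_t;   /* stand-in for the real opal type */
>> 
>>     /* hypothetical signature type, mirroring what orte_dt_compare_sig() compares */
>>     typedef struct {
>>         opal_identifier_t *signature;   /* array of process identifiers */
>>         size_t sz;                      /* number of identifiers, not bytes */
>>     } sig_t;
>> 
>>     static int sig_equal(const sig_t *value1, const sig_t *value2)
>>     {
>>         if (value1->sz != value2->sz) {
>>             return 0;
>>         }
>>         /* the buggy form compared only value1->sz bytes:
>>          *     memcmp(value1->signature, value2->signature, value1->sz)
>>          * the length has to be scaled by the element size: */
>>         return 0 == memcmp(value1->signature, value2->signature,
>>                            value1->sz * sizeof(opal_identifier_t));
>>     }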
>> 
>> Being AFK, I could not test, but that looks like a good suspect.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> Ralph Castain <r...@open-mpi.org> wrote:
>>> Each collective is given a "signature" that is just the array of names for 
>>> all procs involved in the collective. Thus, even though task 0 is involved 
>>> in both of the disconnect barriers, the two collectives should be running 
>>> in isolation from each other.
>>> 
>>> The "tags" are just receive callbacks and have no meaning other than to 
>>> associate a particular callback to a given send/recv pair. It is the 
>>> signature that counts as the daemons are using that to keep the various 
>>> collectives separated.
>>> 
>>> I'll have to take a look at why task 2 is leaving early. The key will be to 
>>> look at that signature to ensure we aren't getting it confused.
>>> 
>>> On Aug 25, 2014, at 1:59 AM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@iferc.org> wrote:
>>> 
>>>> Folks,
>>>> 
>>>> when I run
>>>> mpirun -np 1 ./intercomm_create
>>>> from the ibm test suite, it either:
>>>> - succeeds,
>>>> - hangs, or
>>>> - mpirun crashes (SIGSEGV) soon after printing the following message:
>>>> ORTE_ERROR_LOG: Not found in file
>>>> ../../../src/ompi-trunk/orte/orted/pmix/pmix_server.c at line 566
>>>> 
>>>> Here is what happens.
>>>> 
>>>> First, the test program itself:
>>>> - task 0 spawns task 1: the intercommunicator is ab_inter on task 0 and
>>>> parent on task 1
>>>> - task 0 then spawns task 2: the intercommunicator is ac_inter on task 0
>>>> and parent on task 2
>>>> - several operations follow (merge, barrier, ...)
>>>> - then, without any synchronization:
>>>>   - task 0 calls MPI_Comm_disconnect(ab_inter) followed by
>>>> MPI_Comm_disconnect(ac_inter)
>>>>   - task 1 and task 2 call MPI_Comm_disconnect(parent)
>>>> (a condensed sketch of this flow is shown below)
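>>>> 
>>>> The sketch below only mirrors the spawn/disconnect pattern described
>>>> above; the actual ibm test is more involved, and the program name and
>>>> spawn arguments here are illustrative, not the real test source:
>>>> 
>>>>     #include <mpi.h>
>>>> 
>>>>     int main(int argc, char **argv)
>>>>     {
>>>>         MPI_Comm parent, ab_inter, ac_inter;
>>>>         MPI_Init(&argc, &argv);
>>>>         MPI_Comm_get_parent(&parent);
>>>>         if (MPI_COMM_NULL == parent) {
>>>>             /* task 0: spawn task 1, then task 2 */
>>>>             MPI_Comm_spawn("./intercomm_create", MPI_ARGV_NULL, 1,
>>>>                            MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>>                            &ab_inter, MPI_ERRCODES_IGNORE);
>>>>             MPI_Comm_spawn("./intercomm_create", MPI_ARGV_NULL, 1,
>>>>                            MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>>                            &ac_inter, MPI_ERRCODES_IGNORE);
>>>>             /* ... merge, barrier, etc. ... */
>>>>             /* no synchronization before the disconnects */
>>>>             MPI_Comm_disconnect(&ab_inter);
>>>>             MPI_Comm_disconnect(&ac_inter);
>>>>         } else {
>>>>             /* tasks 1 and 2 */
>>>>             /* ... merge, barrier, etc. ... */
>>>>             MPI_Comm_disconnect(&parent);
>>>>         }
>>>>         MPI_Finalize();
>>>>         return 0;
>>>>     }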
>>>> 
>>>> I applied the attached pmix_debug.patch and ran
>>>> mpirun -np 1 --mca pmix_base_verbose 90 ./intercomm_create
>>>> 
>>>> Basically, tasks 0 and 1 execute a native fence and, in parallel, tasks 0
>>>> and 2 execute a native fence.
>>>> Both fences use the *same* tags on different, though overlapping, sets of
>>>> tasks.
>>>> Bottom line: task 2 leaves its fence *before* task 0 has entered the fence
>>>> (it seems task 1 told task 2 it was OK to leave the fence).
>>>> 
>>>> A simple workaround is to call MPI_Barrier before calling
>>>> MPI_Comm_disconnect.
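>>>> For example (a sketch, using the communicator names from above, on the
>>>> task 0 side):
>>>> 
>>>>     /* synchronize the processes sharing the communicator before
>>>>      * tearing it down */
>>>>     MPI_Barrier(ab_inter);
>>>>     MPI_Comm_disconnect(&ab_inter);
>>>>     MPI_Barrier(ac_inter);
>>>>     MPI_Comm_disconnect(&ac_inter);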
>>>> 
>>>> At this stage, I doubt it is even possible to get this working at the
>>>> pmix level, so the fix might be to have MPI_Comm_disconnect invoke
>>>> MPI_Barrier. The attached comm_disconnect.patch always calls the barrier
>>>> before (indirectly) invoking pmix.
>>>> 
>>>> Could you please comment on this issue?
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> Here are the relevant logs:
>>>> 
>>>> [soleil:00650] [[8110,3],0] pmix:native executing fence on 2 procs
>>>> [[8110,1],0] and [[8110,3],0]
>>>> [soleil:00650] [[8110,3],0]
>>>> [../../../../../../src/ompi-trunk/opal/mca/pmix/native/pmix_native.c:493] 
>>>> post
>>>> send to server
>>>> [soleil:00650] [[8110,3],0] posting recv on tag 5
>>>> [soleil:00650] [[8110,3],0] usock:send_nb: already connected to server -
>>>> queueing for send
>>>> [soleil:00650] [[8110,3],0] usock:send_handler called to send to server
>>>> [soleil:00650] [[8110,3],0] usock:send_handler SENDING TO SERVER
>>>> [soleil:00647] [[8110,2],0] pmix:native executing fence on 2 procs
>>>> [[8110,1],0] and [[8110,2],0]
>>>> [soleil:00647] [[8110,2],0]
>>>> [../../../../../../src/ompi-trunk/opal/mca/pmix/native/pmix_native.c:493] 
>>>> post
>>>> send to server
>>>> [soleil:00647] [[8110,2],0] posting recv on tag 5
>>>> [soleil:00647] [[8110,2],0] usock:send_nb: already connected to server -
>>>> queueing for send
>>>> [soleil:00647] [[8110,2],0] usock:send_handler called to send to server
>>>> [soleil:00647] [[8110,2],0] usock:send_handler SENDING TO SERVER
>>>> [soleil:00650] [[8110,3],0] usock:recv:handler called
>>>> [soleil:00650] [[8110,3],0] usock:recv:handler CONNECTED
>>>> [soleil:00650] [[8110,3],0] usock:recv:handler allocate new recv msg
>>>> [soleil:00650] usock:recv:handler read hdr
>>>> [soleil:00650] [[8110,3],0] usock:recv:handler allocate data region of
>>>> size 14
>>>> [soleil:00650] [[8110,3],0] RECVD COMPLETE MESSAGE FROM SERVER OF 14
>>>> BYTES FOR TAG 5
>>>> [soleil:00650] [[8110,3],0]
>>>> [../../../../../../src/ompi-trunk/opal/mca/pmix/native/usock_sendrecv.c:415]
>>>> post msg
>>>> [soleil:00650] [[8110,3],0] message received 14 bytes for tag 5
>>>> [soleil:00650] [[8110,3],0] checking msg on tag 5 for tag 5
>>>> [soleil:00650] [[8110,3],0] pmix:native recv callback activated with 14
>>>> bytes
>>>> [soleil:00650] [[8110,3],0] pmix:native fence released on 2 procs
>>>> [[8110,1],0] and [[8110,3],0]
>>>> 
>>>> 
>>>> <pmix_debug.patch><comm_disconnect.patch>
>>> 
> 
