And that was indeed the problem - fixed, and the trunk now runs clean through my MTT.

Thanks again!
Ralph
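For reference, the fix boils down to scaling the memcmp length by the element size, roughly as sketched here; the struct layout, field names, and return convention are assumptions for illustration, not the exact trunk code:

/* A signature is an array of opal_identifier_t plus a count; the
 * count is in elements, so the byte length passed to memcmp must be
 * scaled by sizeof(opal_identifier_t). Struct and field names are
 * hypothetical. */
#include <string.h>
#include <stdint.h>

typedef uint64_t opal_identifier_t;   /* assumed: 64-bit proc id */

typedef struct {
    opal_identifier_t *signature;     /* procs involved in the collective */
    size_t sz;                        /* number of entries, NOT bytes */
} sig_t;

int compare_sig(const sig_t *value1, const sig_t *value2)
{
    if (value1->sz != value2->sz) {
        return value1->sz < value2->sz ? -1 : 1;
    }
    /* the buggy version passed value1->sz as the length, comparing
     * only the first sz BYTES - so two different signatures sharing
     * a common prefix (e.g. both starting with [[8110,1],0])
     * compared as equal */
    return memcmp(value1->signature, value2->signature,
                  value1->sz * sizeof(opal_identifier_t));
}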
On Aug 25, 2014, at 7:38 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Yeah, that was going to be my first place to look once I finished breakfast :-)
> 
> Thanks!
> Ralph
> 
> On Aug 25, 2014, at 7:32 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> 
>> Thanks for the explanation.
>> 
>> In orte_dt_compare_sig(...), memcmp did not multiply value1->sz by sizeof(opal_identifier_t).
>> 
>> Being AFK, I could not test, but that looks like a good suspect.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> Ralph Castain <r...@open-mpi.org> wrote:
>>> Each collective is given a "signature" that is just the array of names for all procs involved in the collective. Thus, even though task 0 is involved in both of the disconnect barriers, the two collectives should be running in isolation from each other.
>>> 
>>> The "tags" are just receive callbacks and have no meaning other than to associate a particular callback with a given send/recv pair. It is the signature that counts, as the daemons use it to keep the various collectives separated.
>>> 
>>> I'll have to take a look at why task 2 is leaving early. The key will be to look at that signature to ensure we aren't getting it confused.
>>> 
>>> On Aug 25, 2014, at 1:59 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>> 
>>>> Folks,
>>>> 
>>>> When I run
>>>>   mpirun -np 1 ./intercomm_create
>>>> from the ibm test suite, it either:
>>>> - succeeds
>>>> - hangs
>>>> - mpirun crashes (SIGSEGV) soon after writing the following message:
>>>>   ORTE_ERROR_LOG: Not found in file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server.c at line 566
>>>> 
>>>> Here is what happens.
>>>> 
>>>> First, the test program itself:
>>>> - task 0 spawns task 1: the intercommunicator is ab_inter on task 0 and parent on task 1
>>>> - then task 0 spawns task 2: the intercommunicator is ac_inter on task 0 and parent on task 2
>>>> - then several operations (merge, barrier, ...)
>>>> - and then, without any synchronization:
>>>>   - task 0 calls MPI_Comm_disconnect(ab_inter) and then MPI_Comm_disconnect(ac_inter)
>>>>   - tasks 1 and 2 call MPI_Comm_disconnect(parent)
>>>> 
>>>> I applied the attached pmix_debug.patch and ran
>>>>   mpirun -np 1 --mca pmix_base_verbose 90 ./intercomm_create
>>>> 
>>>> Basically, tasks 0 and 1 execute a native fence and, in parallel, tasks 0 and 2 execute a native fence. They both use the *same* tags on different though overlapping sets of tasks. Bottom line: task 2 leaves the fence *before* task 0 has entered it (it seems task 1 told task 2 it is OK to leave the fence).
>>>> 
>>>> A simple workaround is to call MPI_Barrier before calling MPI_Comm_disconnect.
>>>> 
>>>> At this stage, I doubt it is even possible to get this working at the pmix level, so the fix might be to have MPI_Comm_disconnect invoke MPI_Barrier. The attached comm_disconnect.patch always calls the barrier before (indirectly) invoking pmix.
>>>> 
>>>> Could you please comment on this issue?
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
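For context, the pattern Gilles describes, with his MPI_Barrier workaround folded in, looks roughly like the sketch below; this is the shape of the test, not the actual ibm test suite source:

/* Task 0 spawns two singleton children and later disconnects from
 * both; the barrier before each disconnect is the workaround that
 * keeps the two disconnect fences from overlapping. Sketch only. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, ab_inter, ac_inter;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* task 0: spawn task 1, then task 2 */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &ab_inter, MPI_ERRCODES_IGNORE);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &ac_inter, MPI_ERRCODES_IGNORE);

        /* ... merges, barriers, etc. ... */

        /* workaround: synchronize before each disconnect */
        MPI_Barrier(ab_inter);
        MPI_Comm_disconnect(&ab_inter);
        MPI_Barrier(ac_inter);
        MPI_Comm_disconnect(&ac_inter);
    } else {
        /* tasks 1 and 2 */
        MPI_Barrier(parent);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}

Without the two explicit MPI_Barrier calls, nothing orders the two disconnect fences relative to each other, which is what exposed the signature mix-up.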
>>>> Here are the relevant logs:
>>>> 
>>>> [soleil:00650] [[8110,3],0] pmix:native executing fence on 2 procs [[8110,1],0] and [[8110,3],0]
>>>> [soleil:00650] [[8110,3],0] [../../../../../../src/ompi-trunk/opal/mca/pmix/native/pmix_native.c:493] post send to server
>>>> [soleil:00650] [[8110,3],0] posting recv on tag 5
>>>> [soleil:00650] [[8110,3],0] usock:send_nb: already connected to server - queueing for send
>>>> [soleil:00650] [[8110,3],0] usock:send_handler called to send to server
>>>> [soleil:00650] [[8110,3],0] usock:send_handler SENDING TO SERVER
>>>> [soleil:00647] [[8110,2],0] pmix:native executing fence on 2 procs [[8110,1],0] and [[8110,2],0]
>>>> [soleil:00647] [[8110,2],0] [../../../../../../src/ompi-trunk/opal/mca/pmix/native/pmix_native.c:493] post send to server
>>>> [soleil:00647] [[8110,2],0] posting recv on tag 5
>>>> [soleil:00647] [[8110,2],0] usock:send_nb: already connected to server - queueing for send
>>>> [soleil:00647] [[8110,2],0] usock:send_handler called to send to server
>>>> [soleil:00647] [[8110,2],0] usock:send_handler SENDING TO SERVER
>>>> [soleil:00650] [[8110,3],0] usock:recv:handler called
>>>> [soleil:00650] [[8110,3],0] usock:recv:handler CONNECTED
>>>> [soleil:00650] [[8110,3],0] usock:recv:handler allocate new recv msg
>>>> [soleil:00650] usock:recv:handler read hdr
>>>> [soleil:00650] [[8110,3],0] usock:recv:handler allocate data region of size 14
>>>> [soleil:00650] [[8110,3],0] RECVD COMPLETE MESSAGE FROM SERVER OF 14 BYTES FOR TAG 5
>>>> [soleil:00650] [[8110,3],0] [../../../../../../src/ompi-trunk/opal/mca/pmix/native/usock_sendrecv.c:415] post msg
>>>> [soleil:00650] [[8110,3],0] message received 14 bytes for tag 5
>>>> [soleil:00650] [[8110,3],0] checking msg on tag 5 for tag 5
>>>> [soleil:00650] [[8110,3],0] pmix:native recv callback activated with 14 bytes
>>>> [soleil:00650] [[8110,3],0] pmix:native fence released on 2 procs [[8110,1],0] and [[8110,3],0]
>>>> 
>>>> <pmix_debug.patch><comm_disconnect.patch>
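To see why a byte-length compare conflates the two fences above, here is a toy sketch (hypothetical names, not actual ORTE code) of a daemon keying in-flight collectives by signature. With sz bytes instead of sz elements, the signature {[[8110,1],0], [[8110,3],0]} matches the tracker already created for {[[8110,1],0], [[8110,2],0]}, so task 2 can be released by the completion of the other fence - the early release seen in the logs:

#include <stdlib.h>
#include <string.h>
#include <stdint.h>

typedef uint64_t opal_identifier_t;

typedef struct coll_tracker {
    struct coll_tracker *next;
    opal_identifier_t *sig;   /* procs in the collective */
    size_t sz;                /* number of identifiers, not bytes */
    size_t reported;          /* procs that have entered so far */
} coll_tracker_t;

static coll_tracker_t *trackers = NULL;

/* find-or-create the tracker matching a signature */
static coll_tracker_t *get_tracker(const opal_identifier_t *sig, size_t sz)
{
    coll_tracker_t *t;

    for (t = trackers; NULL != t; t = t->next) {
        if (t->sz == sz &&
            /* the fix: compare sz elements, not sz bytes */
            0 == memcmp(t->sig, sig, sz * sizeof(opal_identifier_t))) {
            return t;
        }
    }
    t = calloc(1, sizeof(*t));
    t->sig = malloc(sz * sizeof(opal_identifier_t));
    memcpy(t->sig, sig, sz * sizeof(opal_identifier_t));
    t->sz = sz;
    t->next = trackers;
    trackers = t;
    return t;
}

/* a proc enters the fence; returns 1 when all procs have reported
 * and everyone matched to this tracker gets released */
static int fence_enter(const opal_identifier_t *sig, size_t sz)
{
    coll_tracker_t *t = get_tracker(sig, sz);
    return ++t->reported == t->sz;
}

With the buggy compare, get_tracker returns whichever existing tracker merely shares the leading bytes of the signature, so the two disconnect fences collapse into one.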