Yeah, that was going to be my first place to look once I finished breakfast :-)
Thanks!
Ralph

On Aug 25, 2014, at 7:32 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> Thanks for the explanation.
>
> In orte_dt_compare_sig(...), memcmp did not multiply value1->sz by
> sizeof(opal_identifier_t).
>
> Being afk, I could not test, but that looks like a good suspect.
>
> Cheers,
>
> Gilles
>
> Ralph Castain <r...@open-mpi.org> wrote:
>> Each collective is given a "signature" that is just the array of names for
>> all procs involved in the collective. Thus, even though task 0 is involved
>> in both of the disconnect barriers, the two collectives should be running
>> in isolation from each other.
>>
>> The "tags" are just receive callbacks and have no meaning other than to
>> associate a particular callback with a given send/recv pair. It is the
>> signature that counts, as the daemons use it to keep the various
>> collectives separated.
>>
>> I'll have to take a look at why task 2 is leaving early. The key will be
>> to look at that signature to ensure we aren't getting it confused.
>>
>> On Aug 25, 2014, at 1:59 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>>> Folks,
>>>
>>> When I run
>>>     mpirun -np 1 ./intercomm_create
>>> from the ibm test suite, it either:
>>> - succeeds
>>> - hangs
>>> - mpirun crashes (SIGSEGV) soon after writing the following message:
>>>     ORTE_ERROR_LOG: Not found in file
>>>     ../../../src/ompi-trunk/orte/orted/pmix/pmix_server.c at line 566
>>>
>>> Here is what happens.
>>>
>>> First, the test program itself:
>>> task 0 spawns task 1: the intercommunicator is ab_inter on task 0 and
>>> parent on task 1.
>>> Then task 0 spawns task 2: the intercommunicator is ac_inter on task 0
>>> and parent on task 2.
>>> Then several operations (merge, barrier, ...)
>>> and then, without any synchronization:
>>> - task 0 calls MPI_Comm_disconnect(ab_inter) and then
>>>   MPI_Comm_disconnect(ac_inter)
>>> - tasks 1 and 2 call MPI_Comm_disconnect(parent)
>>>
>>> I applied the attached pmix_debug.patch and ran
>>>     mpirun -np 1 --mca pmix_base_verbose 90 ./intercomm_create
>>>
>>> Basically, tasks 0 and 1 execute a native fence and, in parallel, tasks 0
>>> and 2 execute a native fence.
>>> Both fences use the *same* tags on different though overlapping sets of
>>> tasks.
>>> Bottom line: task 2 leaves the fence *before* task 0 has entered it
>>> (it seems task 1 told task 2 it is ok to leave the fence).
>>>
>>> A simple workaround is to call MPI_Barrier before calling
>>> MPI_Comm_disconnect.
>>>
>>> At this stage, I doubt it is even possible to get this working at the
>>> pmix level, so the fix might be to have MPI_Comm_disconnect invoke
>>> MPI_Barrier. The attached comm_disconnect.patch always calls the barrier
>>> before (indirectly) invoking pmix.
>>>
>>> Could you please comment on this issue?
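[The workaround described above can be sketched as follows. This is a minimal illustration, not the actual comm_disconnect.patch; the communicator names (ab_inter, ac_inter, parent) are taken from the test description:]

```c
#include <mpi.h>

/* Sketch of the proposed fix: barrier on the intercommunicator first,
 * so every process in both groups has arrived before any of them enters
 * the pmix fence that backs MPI_Comm_disconnect. This removes the race
 * between the two overlapping disconnect fences. */
static void disconnect_with_barrier(MPI_Comm *intercomm)
{
    MPI_Barrier(*intercomm);        /* completes only once both groups enter */
    MPI_Comm_disconnect(intercomm); /* sets *intercomm to MPI_COMM_NULL */
}
```

[Under this sketch, task 0 would call disconnect_with_barrier(&ab_inter) and then disconnect_with_barrier(&ac_inter), while tasks 1 and 2 call it on parent. No test harness is given since the snippet needs an MPI launcher to run.]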
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> Here are the relevant logs:
>>>
>>> [soleil:00650] [[8110,3],0] pmix:native executing fence on 2 procs [[8110,1],0] and [[8110,3],0]
>>> [soleil:00650] [[8110,3],0] [../../../../../../src/ompi-trunk/opal/mca/pmix/native/pmix_native.c:493] post send to server
>>> [soleil:00650] [[8110,3],0] posting recv on tag 5
>>> [soleil:00650] [[8110,3],0] usock:send_nb: already connected to server - queueing for send
>>> [soleil:00650] [[8110,3],0] usock:send_handler called to send to server
>>> [soleil:00650] [[8110,3],0] usock:send_handler SENDING TO SERVER
>>> [soleil:00647] [[8110,2],0] pmix:native executing fence on 2 procs [[8110,1],0] and [[8110,2],0]
>>> [soleil:00647] [[8110,2],0] [../../../../../../src/ompi-trunk/opal/mca/pmix/native/pmix_native.c:493] post send to server
>>> [soleil:00647] [[8110,2],0] posting recv on tag 5
>>> [soleil:00647] [[8110,2],0] usock:send_nb: already connected to server - queueing for send
>>> [soleil:00647] [[8110,2],0] usock:send_handler called to send to server
>>> [soleil:00647] [[8110,2],0] usock:send_handler SENDING TO SERVER
>>> [soleil:00650] [[8110,3],0] usock:recv:handler called
>>> [soleil:00650] [[8110,3],0] usock:recv:handler CONNECTED
>>> [soleil:00650] [[8110,3],0] usock:recv:handler allocate new recv msg
>>> [soleil:00650] usock:recv:handler read hdr
>>> [soleil:00650] [[8110,3],0] usock:recv:handler allocate data region of size 14
>>> [soleil:00650] [[8110,3],0] RECVD COMPLETE MESSAGE FROM SERVER OF 14 BYTES FOR TAG 5
>>> [soleil:00650] [[8110,3],0] [../../../../../../src/ompi-trunk/opal/mca/pmix/native/usock_sendrecv.c:415] post msg
>>> [soleil:00650] [[8110,3],0] message received 14 bytes for tag 5
>>> [soleil:00650] [[8110,3],0] checking msg on tag 5 for tag 5
>>> [soleil:00650] [[8110,3],0] pmix:native recv callback activated with 14 bytes
>>> [soleil:00650] [[8110,3],0] pmix:native fence released on 2 procs [[8110,1],0] and [[8110,3],0]
>>>
>>> <pmix_debug.patch><comm_disconnect.patch>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15701.php
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15702.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15703.php