Good catch! Glad it is now fixed - we can move r32498 across to 1.8.2 as well
On Aug 10, 2014, at 10:56 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Thanks Ralph!
>
> That was necessary but not sufficient:
>
> orte_errmgr_base_abort calls orte_session_dir_finalize at
> errmgr_base_fns.c:219, which removes the proc session dir.
> Then orte_errmgr_base_abort (indirectly) calls orte_ess_base_app_abort
> at line 227.
>
> So first the proc session dir is removed, and then the "aborted" empty
> file is created in the previously removed directory (and since there is
> no error check, the failure goes unnoticed).
> As a consequence, the code you added in r32460 does not get executed.
>
> I committed r32498 to fix this.
> It simply does not call orte_session_dir_finalize in the first place
> (which is sufficient but might not be necessary ...)
>
> Cheers,
>
> Gilles
>
> On 2014/08/09 1:27, Ralph Castain wrote:
>> Committed a fix for this in r32460 - see if I got it!
>>
>> On Aug 8, 2014, at 4:02 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>>> Folks,
>>>
>>> here is the description of a hang I briefly mentioned a few days ago.
>>>
>>> With the trunk (I did not check 1.8 ...), simply run on one node:
>>> mpirun -np 2 --mca btl sm,self ./abort
>>>
>>> (the abort test is taken from the IBM test suite: process 0 calls
>>> MPI_Abort while process 1 enters an infinite loop)
>>>
>>> There is a race condition: sometimes it hangs, sometimes it aborts
>>> nicely as expected.
>>> When the hang occurs, both abort processes have exited and mpirun
>>> waits forever.
>>>
>>> I made some investigations and now have a better idea of what happens
>>> (but I am still clueless about how to fix it).
>>>
>>> When process 0 aborts, it:
>>> - closes the tcp socket connected to mpirun
>>> - closes the pipe connected to mpirun
>>> - sends SIGCHLD to mpirun
>>>
>>> Then on mpirun:
>>> when SIGCHLD is received, the handler basically writes 17 (the signal
>>> number) to a socketpair.
>>> Then libevent returns from a poll, and here is the race condition,
>>> basically:
>>> if revents is non-zero for all three fds (socket, pipe and socketpair),
>>> the program aborts nicely;
>>> if revents is non-zero for both the socket and the pipe but zero for
>>> the socketpair, then mpirun hangs.
>>>
>>> I dug a bit deeper and found that when the event on the socketpair is
>>> processed, it ends up calling odls_base_default_wait_local_proc.
>>> If proc->state is 5 (aka ORTE_PROC_STATE_REGISTERED), the program
>>> aborts nicely,
>>> *but* if proc->state is 6 (aka ORTE_PROC_STATE_IOF_COMPLETE), the
>>> program hangs.
>>>
>>> Another way to put this: when the program aborts nicely, the call
>>> sequence is
>>> odls_base_default_wait_local_proc
>>> proc_errors(vpid=0)
>>> proc_errors(vpid=0)
>>> proc_errors(vpid=1)
>>> proc_errors(vpid=1)
>>>
>>> whereas when the program hangs, the call sequence is
>>> proc_errors(vpid=0)
>>> odls_base_default_wait_local_proc
>>> proc_errors(vpid=0)
>>> proc_errors(vpid=1)
>>> proc_errors(vpid=1)
>>>
>>> I will resume this on Monday unless someone can fix it in the
>>> meantime :-)
>>>
>>> Cheers,
>>>
>>> Gilles
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/08/15552.php
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/08/15560.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15601.php