Committed a fix for this in r32460 - see if I got it!

On Aug 8, 2014, at 4:02 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
> Folks,
>
> here is the description of a hang I briefly mentioned a few days ago.
>
> With the trunk (I did not check 1.8 ...), simply run on one node:
> mpirun -np 2 --mca btl sm,self ./abort
>
> (the abort test is taken from the ibm test suite: process 0 calls
> MPI_Abort while process 1 enters an infinite loop)
>
> There is a race condition: sometimes it hangs, sometimes it aborts
> nicely as expected. When the hang occurs, both abort processes have
> exited and mpirun waits forever.
>
> I did some investigation and now have a better idea of what happens
> (but I am still clueless about how to fix it).
>
> When process 0 aborts, it:
> - closes the tcp socket connected to mpirun
> - closes the pipe connected to mpirun
> - sends SIGCHLD to mpirun
>
> Then on the mpirun side: when SIGCHLD is received, the handler
> basically writes 17 (the signal number) to a socketpair. libevent then
> returns from a poll, and here is the race condition:
> - if revents is non-zero for all three fds (socket, pipe and
>   socketpair), then the program aborts nicely
> - if revents is non-zero for both the socket and the pipe but is zero
>   for the socketpair, then mpirun hangs
>
> I dug a bit deeper and found that when the event on the socketpair is
> processed, it ends up calling odls_base_default_wait_local_proc.
> If proc->state is 5 (aka ORTE_PROC_STATE_REGISTERED), the program
> aborts nicely, *but* if proc->state is 6 (aka
> ORTE_PROC_STATE_IOF_COMPLETE), the program hangs.
>
> Another way to put this: when the program aborts nicely, the call
> sequence is
> odls_base_default_wait_local_proc
> proc_errors(vpid=0)
> proc_errors(vpid=0)
> proc_errors(vpid=1)
> proc_errors(vpid=1)
>
> When the program hangs, the call sequence is
> proc_errors(vpid=0)
> odls_base_default_wait_local_proc
> proc_errors(vpid=0)
> proc_errors(vpid=1)
> proc_errors(vpid=1)
>
> I will resume this on Monday unless someone can fix it in the
> meantime :-)
>
> Cheers,
>
> Gilles
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15552.php