Committed a fix for this in r32460 - see if I got it!

On Aug 8, 2014, at 4:02 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> 
wrote:

> Folks,
> 
> Here is the description of a hang I briefly mentioned a few days ago.
> 
> With the trunk (I did not check 1.8 ...), simply run on one node:
> mpirun -np 2 --mca btl sm,self ./abort
> 
> (the abort test is taken from the IBM test suite: process 0 calls
> MPI_Abort while process 1 enters an infinite loop)
> 
> There is a race condition: sometimes it hangs, sometimes it aborts
> nicely as expected.
> When the hang occurs, both abort processes have exited and mpirun waits
> forever.
> 
> I investigated and now have a better idea of what happens
> (but I am still clueless about how to fix this).
> 
> When process 0 aborts:
> - the TCP socket connected to mpirun is closed
> - the pipe connected to mpirun is closed
> - SIGCHLD is sent to mpirun
> 
> Then, on mpirun:
> when SIGCHLD is received, the handler basically writes 17 (the signal
> number) to a socketpair.
> libevent then returns from a poll, and here is the race condition:
> - if revents is nonzero for all three fds (socket, pipe and socketpair),
>   the program aborts nicely
> - if revents is nonzero for both the socket and the pipe but zero for the
>   socketpair, mpirun hangs
> 
> I dug a bit deeper and found that when the event on the socketpair is
> processed, it ends up calling
> odls_base_default_wait_local_proc.
> If proc->state is 5 (aka ORTE_PROC_STATE_REGISTERED), then the program
> aborts nicely,
> *but* if proc->state is 6 (aka ORTE_PROC_STATE_IOF_COMPLETE), then the
> program hangs.
> 
> Another way to put this is that
> when the program aborts nicely, the call sequence is
> odls_base_default_wait_local_proc
> proc_errors(vpid=0)
> proc_errors(vpid=0)
> proc_errors(vpid=1)
> proc_errors(vpid=1)
> 
> When the program hangs, the call sequence is
> proc_errors(vpid=0)
> odls_base_default_wait_local_proc
> proc_errors(vpid=0)
> proc_errors(vpid=1)
> proc_errors(vpid=1)
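
The two orderings above can be replayed with a toy model. This is only a
sketch to make the ordering dependency concrete, not ORTE code: all demo_*
and EV_* names are hypothetical, and the rule that the first proc_errors call
moves the state from 5 to 6 is an assumption inferred from the report, not
something confirmed in the source tree. Only the mapping "state 5 at wait
time aborts nicely, state 6 hangs" comes from the observations above.

```c
#include <assert.h>
#include <stddef.h>

/* Values 5 and 6 mirror the states quoted in the report; the names are
 * hypothetical stand-ins, not the real ORTE identifiers. */
enum demo_state   { DEMO_REGISTERED = 5, DEMO_IOF_COMPLETE = 6 };
enum demo_event   { EV_PROC_ERRORS, EV_WAIT_LOCAL_PROC };
enum demo_outcome { DEMO_ABORTS, DEMO_HANGS };

/* Replay a sequence of events for one process (vpid 0) and report the
 * outcome predicted by the state seen when the wait handler runs. */
enum demo_outcome demo_replay(const enum demo_event *events, size_t n)
{
    enum demo_state state = DEMO_REGISTERED;
    size_t i;

    assert(events != NULL);
    for (i = 0; i < n; i++) {
        if (events[i] == EV_PROC_ERRORS) {
            /* ASSUMPTION: an error callback that runs before the wait
             * handler advances the state to IOF_COMPLETE. */
            state = DEMO_IOF_COMPLETE;
        } else {
            /* Observation from the report: state 5 leads to a clean
             * abort, state 6 leads to a hang. */
            return state == DEMO_REGISTERED ? DEMO_ABORTS : DEMO_HANGS;
        }
    }
    /* ASSUMPTION: if the wait handler never runs, treat it as a hang. */
    return DEMO_HANGS;
}
```

Feeding it the first sequence (wait handler first) yields DEMO_ABORTS, and
the second one (proc_errors first) yields DEMO_HANGS, which matches the
behaviour reported above.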
> 
> I will resume this on Monday unless someone can fix it in the
> meantime :-)
> 
> Cheers,
> 
> Gilles
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15552.php
