Good catch! Glad it is now fixed - we can move r32498 across to 1.8.2 as well


On Aug 10, 2014, at 10:56 PM, Gilles Gouaillardet 
<gilles.gouaillar...@iferc.org> wrote:

> Thanks Ralph !
> 
> This was necessary but not sufficient:
> 
> orte_errmgr_base_abort calls orte_session_dir_finalize at
> errmgr_base_fns.c:219, which removes the proc session dir.
> Then orte_errmgr_base_abort (indirectly) calls orte_ess_base_app_abort
> at line 227.
> 
> So the proc session dir is removed first, and then the empty "aborted"
> file is created in the previously removed directory (with no error
> check, so the failure goes unnoticed).
> As a consequence, the code you added in r32460 does not get executed.
> 
> I committed r32498 to fix this.
> It simply does not call orte_session_dir_finalize in the first place
> (which is sufficient but might not be necessary ...)
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/09 1:27, Ralph Castain wrote:
>> Committed a fix for this in r32460 - see if I got it!
>> 
>> On Aug 8, 2014, at 4:02 AM, Gilles Gouaillardet 
>> <gilles.gouaillar...@iferc.org> wrote:
>> 
>>> Folks,
>>> 
>>> Here is a description of a hang I briefly mentioned a few days ago.
>>> 
>>> With the trunk (I did not check 1.8 ...), simply run on one node:
>>> mpirun -np 2 --mca btl sm,self ./abort
>>> 
>>> (The abort test is taken from the IBM test suite: process 0 calls
>>> MPI_Abort while process 1 enters an infinite loop.)
>>> 
>>> There is a race condition: sometimes it hangs, sometimes it aborts
>>> nicely as expected.
>>> When the hang occurs, both aborted processes have exited and mpirun
>>> waits forever.
>>> 
>>> I investigated and now have a better idea of what happens
>>> (but I am still clueless about how to fix this).
>>> 
>>> When process 0 aborts, it:
>>> - closes the TCP socket connected to mpirun
>>> - closes the pipe connected to mpirun
>>> - sends SIGCHLD to mpirun
>>> 
>>> Then on mpirun:
>>> when SIGCHLD is received, the handler basically writes 17 (the signal
>>> number) to a socketpair.
>>> libevent then returns from poll, and here is the race condition:
>>> if revents is non-zero for all three fds (socket, pipe, and
>>> socketpair), the program aborts nicely;
>>> if revents is non-zero for the socket and pipe but zero for the
>>> socketpair, mpirun hangs.
>>> 
>>> I dug a bit deeper and found that when the event on the socketpair
>>> is processed, it ends up calling
>>> odls_base_default_wait_local_proc.
>>> If proc->state is 5 (aka ORTE_PROC_STATE_REGISTERED), the program
>>> aborts nicely,
>>> *but* if proc->state is 6 (aka ORTE_PROC_STATE_IOF_COMPLETE), the
>>> program hangs.
>>> 
>>> Another way to put this: when the program aborts nicely, the call
>>> sequence is
>>> odls_base_default_wait_local_proc
>>> proc_errors(vpid=0)
>>> proc_errors(vpid=0)
>>> proc_errors(vpid=1)
>>> proc_errors(vpid=1)
>>> 
>>> When the program hangs, the call sequence is
>>> proc_errors(vpid=0)
>>> odls_base_default_wait_local_proc
>>> proc_errors(vpid=0)
>>> proc_errors(vpid=1)
>>> proc_errors(vpid=1)
>>> 
>>> I will resume this on Monday unless someone fixes it in the
>>> meantime :-)
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/08/15552.php
> 
