[OMPI devel] ibm abort test hangs on one node

2014-08-08 Thread Gilles Gouaillardet
Folks,

here is the description of a hang i briefly mentioned a few days ago.

with the trunk (i did not check 1.8 ...) simply run on one node :
mpirun -np 2 --mca btl sm,self ./abort

(the abort test is taken from the ibm test suite : process 0 calls
MPI_Abort while process 1 enters an infinite loop)

there is a race condition : sometimes it hangs, sometimes it aborts
nicely as expected.
when the hang occurs, both abort processes have exited and mpirun waits
forever

i investigated a bit and now have a better idea of what happens
(but i still have no clue how to fix this)

when process 0 aborts, it :
- closes the tcp socket connected to mpirun
- closes the pipe connected to mpirun
- exits, which causes SIGCHLD to be delivered to mpirun

then on mpirun :
when SIGCHLD is received, the handler basically writes 17 (the signal
number) to a socketpair.
then libevent will return from a poll and here is the race condition,
basically :
if revents is non zero for the three fds (socket, pipe and socketpair)
then the program will abort nicely
if revents is non zero for both socket and pipe but is zero for the
socketpair, then mpirun will hang
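
To make the mechanism above concrete, here is a minimal standalone sketch in
plain POSIX C. It is not libevent or ORTE code: the names sig_fds, on_sigchld
and first_poll_after_child_exit are hypothetical, and only one pipe is used to
stand in for the socket and pipe mentioned above. A SIGCHLD handler forwards
the signal number through a socketpair, and the parent reports which fds had
non-zero revents on the first poll() after the child went away.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* socketpair used as a self-pipe: the handler writes [1], poll() watches [0] */
static int sig_fds[2];

/* Async-signal-safe handler: forward the signal number as a single byte. */
static void on_sigchld(int signo)
{
    int saved_errno = errno;
    unsigned char byte = (unsigned char)signo;
    ssize_t rc = write(sig_fds[1], &byte, 1);
    (void)rc;
    errno = saved_errno;
}

/*
 * Fork a child that closes its pipe ends and exits, then report what the
 * first successful poll() saw. Returns a bitmask (bit 0: the pipe had
 * revents, bit 1: the socketpair had revents), 0 if poll() timed out or
 * failed, or -1 on setup failure.
 */
int first_poll_after_child_exit(void)
{
    int pipe_fds[2];
    struct pollfd pfds[2];
    struct sigaction sa, old_sa;
    pid_t pid;
    int n, mask = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sig_fds) != 0) return -1;
    if (pipe(pipe_fds) != 0) return -1;

    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, &old_sa) != 0) return -1;

    pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {               /* child: drop the pipe, then die */
        close(pipe_fds[0]);
        close(pipe_fds[1]);
        _exit(1);
    }
    close(pipe_fds[1]);           /* parent keeps only the read end */

    pfds[0].fd = pipe_fds[0];
    pfds[0].events = POLLIN;
    pfds[1].fd = sig_fds[0];
    pfds[1].events = POLLIN;
    do {                          /* poll() is not restarted by SA_RESTART */
        pfds[0].revents = 0;
        pfds[1].revents = 0;
        n = poll(pfds, 2, 5000);
    } while (n < 0 && errno == EINTR);

    if (n > 0) {
        if (pfds[0].revents != 0) mask |= 1;
        if (pfds[1].revents != 0) mask |= 2;
    }

    waitpid(pid, NULL, 0);              /* reap the child */
    sigaction(SIGCHLD, &old_sa, NULL);  /* restore before closing the fds */
    close(pipe_fds[0]);
    close(sig_fds[0]);
    close(sig_fds[1]);
    return mask;
}
```

Calling first_poll_after_child_exit() repeatedly from a small main() and
printing the result should show that the value depends on timing: the pipe can
already be reported while the socketpair is not yet readable, which is the
kind of window described above. Which value shows up more often is machine
dependent.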

i dug a bit deeper and found that when the event on the socketpair is
processed, it will end up calling
odls_base_default_wait_local_proc.
if proc->state is 5 (aka ORTE_PROC_STATE_REGISTERED), then the program
will abort nicely
*but* if proc->state is 6 (aka ORTE_PROC_STATE_IOF_COMPLETE), then the
program will hang

another way to put this is that
when the program aborts nicely, the call sequence is
odls_base_default_wait_local_proc
proc_errors(vpid=0)
proc_errors(vpid=0)
proc_errors(vpid=1)
proc_errors(vpid=1)

when the program hangs, the call sequence is
proc_errors(vpid=0)
odls_base_default_wait_local_proc
proc_errors(vpid=0)
proc_errors(vpid=1)
proc_errors(vpid=1)

i will resume this on Monday unless someone can fix this in the
meantime :-)

Cheers,

Gilles


Re: [OMPI devel] ibm abort test hangs on one node

2014-08-08 Thread Ralph Castain
Committed a fix for this in r32460 - see if I got it!

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15552.php



Re: [OMPI devel] ibm abort test hangs on one node

2014-08-11 Thread Gilles Gouaillardet
Thanks Ralph !

this was necessary but not sufficient :

orte_errmgr_base_abort calls orte_session_dir_finalize at
errmgr_base_fns.c:219
that will remove the proc session dir
then, orte_errmgr_base_abort (indirectly) calls orte_ess_base_app_abort
at line 227

first, the proc session dir is removed
then the "aborted" empty file is created in the previously removed directory
(and there is no error check, so the failure goes unnoticed)
as a consequence, the code you added in r32460 does not get executed.

i committed r32498 to fix this.
it simply does not call orte_session_dir_finalize in the first place
(which is sufficient but might not be necessary ...)
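
The ordering problem described above can be reproduced outside of Open MPI
with a small sketch. This is not the ORTE code: drop_abort_marker,
marker_after_cleanup and marker_before_cleanup are hypothetical names, and a
temporary directory under /tmp (assumed to exist and be writable) stands in
for the proc session dir. The only point is that creating the "aborted" file
after the directory was removed fails, and that the failure is visible as soon
as the return value of open() is checked.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Create an empty "aborted" marker file inside dir and report the outcome
 * instead of ignoring it. Returns 0 on success, or the errno value of the
 * failed open().
 */
int drop_abort_marker(const char *dir)
{
    char path[1024];
    int fd;

    assert(dir != NULL);
    if (snprintf(path, sizeof(path), "%s/aborted", dir) >= (int)sizeof(path))
        return ENAMETOOLONG;
    fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}

/* Problematic order: remove the directory, then try to drop the marker. */
int marker_after_cleanup(void)
{
    char dir[] = "/tmp/sessdir_XXXXXX";   /* hypothetical stand-in directory */

    if (mkdtemp(dir) == NULL || rmdir(dir) != 0)
        return -1;
    return drop_abort_marker(dir);        /* fails: the directory is gone */
}

/* Working order: drop the marker while the directory still exists. */
int marker_before_cleanup(void)
{
    char dir[] = "/tmp/sessdir_XXXXXX";   /* hypothetical stand-in directory */
    char path[1024];
    int rc;

    if (mkdtemp(dir) == NULL)
        return -1;
    rc = drop_abort_marker(dir);
    snprintf(path, sizeof(path), "%s/aborted", dir);
    unlink(path);                         /* tidy up the files of this sketch */
    rmdir(dir);
    return rc;
}
```

In this sketch marker_after_cleanup() returns ENOENT because the parent
directory no longer exists, while marker_before_cleanup() returns 0. Checking
the result of the file creation would have made the failure noticeable instead
of silently skipping whatever depends on the marker.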

Cheers,

Gilles

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15560.php



Re: [OMPI devel] ibm abort test hangs on one node

2014-08-11 Thread Ralph Castain
Good catch! Glad it is now fixed - we can move r32498 across to 1.8.2 as well


> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15601.php