Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Wesley Bland
Committed in r24815.

On Thursday, June 23, 2011 at 4:19 PM, Ralph Castain wrote:

> 
> On Jun 23, 2011, at 2:14 PM, Wesley Bland wrote:
> > Maybe before the ORTED saw the signal, it detected a communication failure 
> > and reacted to that. 
> 
> Quite possible. However, remember that procs local to mpirun (in most 
> environments) directly receive the ctrl-c instead of the orted getting a cmd 
> from mpirun to kill them. Thus, they "abort_by_signal" rather than "terminate 
> by cmd".
> 
> I've had this problem a lot on my Mac, in particular. The ctrl-c is seen 
> directly by the procs, so the abort code path is totally different.
> 
> 
> > Either way, I haven't had any trouble being able to ctrl-c out of my 
> > applications. I'll go ahead and comment the code out of the HNP and if we 
> > want to put it back later, it will be there.
> > 
> > On Thursday, June 23, 2011 at 4:05 PM, Ralph Castain wrote:
> > 
> > > 
> > > On Jun 23, 2011, at 1:59 PM, Wesley Bland wrote:
> > > > I don't see any code in the orted errmgr that deals with the state 
> > > > ORTE_PROC_STATE_ABORTED_BY_SIG; however, the HNP does deal with that 
> > > > state.
> > > 
> > > Like I said, the orted just passes it along - as it does with all failure 
> > > states.
> > > 
> > > > 
> > > > The discussion Josh and I were having was whether or not to remove the 
> > > > code dealing with ORTE_PROC_STATE_ABORTED_BY_SIG from the HNP, so that 
> > > > the processes running on that node can also be aborted by a kill signal 
> > > > while allowing the rest of the job to keep running.
> > > 
> > > I don't see any reason to treat that state any differently than all the 
> > > other failure states. However, be careful - if someone -wants- to kill 
> > > the job, then we need to ensure they can do so - i.e., if mpirun 
> > > sigterms/sigkills a proc, we don't want it auto-recovering or we'll never 
> > > ctrl-c out of mpirun.
> > > 
> > > In my branch, I have a special code for procs terminated deliberately by 
> > > mpirun - pretty sure I put that code back into the trunk, but I don't 
> > > believe the trunk errmgr modules know what to do with it 
> > > (TERMINATED_BY_CMD).
> > > 
> > > You might need to add some code for that case.
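
(A hedged illustration of the distinction above: ORTE_PROC_STATE_ABORTED_BY_SIG is the state named in this thread, while the second state name and the helper are assumed for the sketch and are not taken from the branch.)

    switch (state) {
    case ORTE_PROC_STATE_ABORTED_BY_SIG:
        /* unexpected failure: let the resilience logic decide whether
         * the rest of the job keeps running */
        handle_proc_failure(proc);   /* hypothetical helper */
        break;
    case ORTE_PROC_STATE_TERMINATED_BY_CMD:  /* assumed spelling of "terminated by cmd" */
        /* mpirun deliberately killed this proc (e.g. on ctrl-c):
         * do not auto-recover, just let the job tear down */
        break;
    default:
        break;
    }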
> > > > 
> > > > On Thursday, June 23, 2011 at 3:54 PM, Ralph Castain wrote:
> > > > 
> > > > > I'm not entirely sure what that means. The orteds certainly detect 
> > > > > and mark that a local proc aborted by signal - the orted errmgr just 
> > > > > sends a note back to the HNP notifying it of the situation rather 
> > > > > than responding to it directly.
> > > > > 
> > > > > I don't believe the HNP does anything different when responding to a 
> > > > > local proc's abort-by-signal vs getting a message from an orted, does 
> > > > > it?
> > > > > 
> > > > > What is it you want the HNP/orted to do? I haven't dug that deeply 
> > > > > into your branch
> > > > > 
> > > > > On Jun 23, 2011, at 1:47 PM, Josh Hursey wrote:
> > > > > 
> > > > > > I would mention this to Ralph to be sure (CC'ed). I bet that you can
> > > > > > push this change in with the rest so that mpirun hosting a failed
> > > > > > process works.
> > > > > > 
> > > > > > Ralph, what do you think?
> > > > > > 
> > > > > > -- Josh
> > > > > > 
> > > > > > On Thu, Jun 23, 2011 at 3:29 PM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
> > > > > > > There is still one problem that you'll notice when you run your 
> > > > > > > tests. The
> > > > > > > HNP errmgr catches "aborted by signal" while the orteds don't. I 
> > > > > > > wasn't sure
> > > > > > > if this had a purpose that I wasn't aware of so I left that in 
> > > > > > > there. It's a
> > > > > > > simple matter of removing the code to make the behavior the same 
> > > > > > > on the HNP
> > > > > > > as the orteds, but I don't want to remove something like that if 
> > > > > > > it's going
> > > > > > > to cause problems for someone else.
> > > > > > > 
> > > > > > > On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:
> > > > > > > 
> > > > > > > So I finally got a chance to test the branch this morning. I 
> > > > > > > cannot
> > > > > > > get it to work. Maybe I'm doing something wrong, missing some MCA
> > > > > > > parameter?
> > > > > > > 
> > > > > > > -
> > > > > > > [jjhursey@smoky-login1 resilient-orte] hg summary
> > > > > > > parent: 2:c550cf6ed6a2 tip
> > > > > > > Newest version. Synced with trunk r24785.
> > > > > > > branch: default
> > > > > > > commit: 1 modified, 8097 unknown
> > > > > > > update: (current)
> > > > > > > -
> > > > > > > (the 1 modified was the test program attached)
> > > > > > > 
> > > > > > > Attached is a modified version of the orte_abort.c program found 
> > > > > > > in
> > > > > > > ${top}/orte/test/system. This program is ORTE only, and registers 
> > > > > > > the
> > > > > > > errmgr callback to trigger correct termination. You will need to
> > > > > > > configure Open MPI 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Josh Hursey
Ga - what a rookie mistake :)

I tested the patched test and it works as advertised for the small
scale tests I used before. So I'm good with this going in today.

Thanks,
Josh

On Thu, Jun 23, 2011 at 3:34 PM, Wesley Bland  wrote:
> Right. Sorry I misspoke.
>
> On Thursday, June 23, 2011 at 3:32 PM, Ralph Castain wrote:
>
> Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem
> of "not giving up the thread". The problem was that Josh's test never called
> progress. It would have been equally okay to simply call
> "opal_event_dispatch" while waiting for the callback.
> All applications have to cycle the progress engine.
>
> On Jun 23, 2011, at 1:18 PM, Wesley Bland wrote:
>
> Josh,
> There were a couple of bugs that I cleared up in my most recent checkin, but
> I also needed to modify your test. The callback for the application layer
> errmgr actually occurs in the application layer. Your test was never giving
> up the thread to the ORTE application event loop to receive its message from
> the ORTED. I changed your while loop to an ORTE_PROGRESSED_WAIT and that
> fixed the problem.
> Try running the attached code with the modifications and see if that clears
> up the problem. It did for me.
> Thanks,
> Wesley
>
> On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:
>
> So I finally got a chance to test the branch this morning. I cannot
> get it to work. Maybe I'm doing something wrong, missing some MCA
> parameter?
>
> -
> [jjhursey@smoky-login1 resilient-orte] hg summary
> parent: 2:c550cf6ed6a2 tip
> Newest version. Synced with trunk r24785.
> branch: default
> commit: 1 modified, 8097 unknown
> update: (current)
> -
> (the 1 modified was the test program attached)
>
> Attached is a modified version of the orte_abort.c program found in
> ${top}/orte/test/system. This program is ORTE only, and registers the
> errmgr callback to trigger correct termination. You will need to
> configure Open MPI with '--with-devel-headers' to build this. But then
> you can compile with:
> ortecc -g orte_abort.c -o orte_abort
>
> These are the configure options that I used:
> --with-devel-headers --enable-binaries --disable-io-romio
> --enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
> F77=gfortran FC=gfortran
>
>
> If the HNP has no processes on it - I get a hang:
> ---
> mpirun -np 4 --nolocal orte_abort
> orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
> orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
> orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
> mpirun: killing job...
>
> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file errmgr_hnp.c at line 824
> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file orted/orted_comm.c at line 1341
> mpirun: abort is already in progress...hit ctrl-c again to forcibly
> terminate
>
> [jjhursey@smoky14 system] echo $?
> 1
> ---
>
> If the HNP has processes on it, but not the one that aborted - I get a hang:
> ---
> [jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
> orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
> orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
> orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
> mpirun: killing job...
>
> [smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file errmgr_hnp.c at line 824
> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file orted/orted_comm.c at line 1341
> mpirun: abort is already in progress...hit ctrl-c again to forcibly
> terminate
>
> [jjhursey@smoky14 system] echo $?
> 1
> 
>
> If the HNP has processes on it, and it is the one that aborted - I get
> immediate return, but no callback:
> 
> [jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
> orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
> orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
> orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
> 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Wesley Bland
Right. Sorry I misspoke.

On Thursday, June 23, 2011 at 3:32 PM, Ralph Castain wrote:

> Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem 
> of "not giving up the thread". The problem was that Josh's test never called 
> progress. It would have been equally okay to simply call 
> "opal_event_dispatch" while waiting for the callback.
> 
> All applications have to cycle the progress engine.
> 
> 
> On Jun 23, 2011, at 1:18 PM, Wesley Bland wrote:
> > Josh,
> > 
> > There were a couple of bugs that I cleared up in my most recent checkin, 
> > but I also needed to modify your test. The callback for the application 
> > layer errmgr actually occurs in the application layer. Your test was never 
> > giving up the thread to the ORTE application event loop to receive its 
> > message from the ORTED. I changed your while loop to an 
> > ORTE_PROGRESSED_WAIT and that fixed the problem.
> > 
> > Try running the attached code with the modifications and see if that clears 
> > up the problem. It did for me.
> > 
> > Thanks,
> > Wesley
> > 
> > On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:
> > 
> > > So I finally got a chance to test the branch this morning. I cannot
> > > get it to work. Maybe I'm doing something wrong, missing some MCA
> > > parameter?
> > > 
> > > -
> > > [jjhursey@smoky-login1 resilient-orte] hg summary
> > > parent: 2:c550cf6ed6a2 tip
> > >  Newest version. Synced with trunk r24785.
> > > branch: default
> > > commit: 1 modified, 8097 unknown
> > > update: (current)
> > > -
> > > (the 1 modified was the test program attached)
> > > 
> > > Attached is a modified version of the orte_abort.c program found in
> > > ${top}/orte/test/system. This program is ORTE only, and registers the
> > > errmgr callback to trigger correct termination. You will need to
> > > configure Open MPI with '--with-devel-headers' to build this. But then
> > > you can compile with:
> > >  ortecc -g orte_abort.c -o orte_abort
> > > 
> > > These are the configure options that I used:
> > >  --with-devel-headers --enable-binaries --disable-io-romio
> > > --enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
> > > F77=gfortran FC=gfortran
> > > 
> > > 
> > > If the HNP has no processes on it - I get a hang:
> > > ---
> > > mpirun -np 4 --nolocal orte_abort
> > > orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
> > > orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
> > > orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
> > > orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
> > > orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
> > > mpirun: killing job...
> > > 
> > > [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> > > past end of buffer in file errmgr_hnp.c at line 824
> > > [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> > > past end of buffer in file orted/orted_comm.c at line 1341
> > > mpirun: abort is already in progress...hit ctrl-c again to forcibly 
> > > terminate
> > > 
> > > [jjhursey@smoky14 system] echo $?
> > > 1
> > > ---
> > > 
> > > If the HNP has processes on it, but not the one that aborted - I get a 
> > > hang:
> > > ---
> > > [jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
> > > orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
> > > orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
> > > orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
> > > orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
> > > orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
> > > mpirun: killing job...
> > > 
> > > [smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
> > > readv failed: Connection reset by peer (104)
> > > [smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
> > > readv failed: Connection reset by peer (104)
> > > [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> > > past end of buffer in file errmgr_hnp.c at line 824
> > > [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> > > past end of buffer in file orted/orted_comm.c at line 1341
> > > mpirun: abort is already in progress...hit ctrl-c again to forcibly 
> > > terminate
> > > 
> > > [jjhursey@smoky14 system] echo $?
> > > 1
> > > 
> > > 
> > > If the HNP has processes on it, and it is the one that aborted - I get
> > > immediate return, but no callback:
> > > 
> > > [jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
> > > orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
> > > orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
> > > 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Ralph Castain
Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem of 
"not giving up the thread". The problem was that Josh's test never called 
progress. It would have been equally okay to simply call "opal_event_dispatch" 
while waiting for the callback.

All applications have to cycle the progress engine.
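
(For illustration, "cycling the progress engine" while waiting for the errmgr callback amounts to something like the sketch below; callback_fired is a made-up flag that the registered callback would set, and opal_progress() is the standard OPAL progress call.)

    volatile int callback_fired = 0;   /* set from the registered errmgr callback */

    while (!callback_fired) {
        opal_progress();   /* lets ORTE receive and deliver the orted's message */
    }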


On Jun 23, 2011, at 1:18 PM, Wesley Bland wrote:

> Josh,
> 
> There were a couple of bugs that I cleared up in my most recent checkin, but 
> I also needed to modify your test. The callback for the application layer 
> errmgr actually occurs in the application layer. Your test was never giving 
> up the thread to the ORTE application event loop to receive its message from 
> the ORTED. I changed your while loop to an ORTE_PROGRESSED_WAIT and that 
> fixed the problem.
> 
> Try running the attached code with the modifications and see if that clears 
> up the problem. It did for me.
> 
> Thanks,
> Wesley
> On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:
> 
>> So I finally got a chance to test the branch this morning. I cannot
>> get it to work. Maybe I'm doing something wrong, missing some MCA
>> parameter?
>> 
>> -
>> [jjhursey@smoky-login1 resilient-orte] hg summary
>> parent: 2:c550cf6ed6a2 tip
>> Newest version. Synced with trunk r24785.
>> branch: default
>> commit: 1 modified, 8097 unknown
>> update: (current)
>> -
>> (the 1 modified was the test program attached)
>> 
>> Attached is a modified version of the orte_abort.c program found in
>> ${top}/orte/test/system. This program is ORTE only, and registers the
>> errmgr callback to trigger correct termination. You will need to
>> configure Open MPI with '--with-devel-headers' to build this. But then
>> you can compile with:
>> ortecc -g orte_abort.c -o orte_abort
>> 
>> These are the configure options that I used:
>> --with-devel-headers --enable-binaries --disable-io-romio
>> --enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
>> F77=gfortran FC=gfortran
>> 
>> 
>> If the HNP has no processes on it - I get a hang:
>> ---
>> mpirun -np 4 --nolocal orte_abort
>> orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
>> orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
>> orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
>> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
>> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
>> mpirun: killing job...
>> 
>> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
>> past end of buffer in file errmgr_hnp.c at line 824
>> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
>> past end of buffer in file orted/orted_comm.c at line 1341
>> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>> 
>> [jjhursey@smoky14 system] echo $?
>> 1
>> ---
>> 
>> If the HNP has processes on it, but not the one that aborted - I get a hang:
>> ---
>> [jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
>> orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
>> orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
>> orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
>> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
>> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
>> mpirun: killing job...
>> 
>> [smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
>> readv failed: Connection reset by peer (104)
>> [smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
>> readv failed: Connection reset by peer (104)
>> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
>> past end of buffer in file errmgr_hnp.c at line 824
>> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
>> past end of buffer in file orted/orted_comm.c at line 1341
>> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>> 
>> [jjhursey@smoky14 system] echo $?
>> 1
>> 
>> 
>> If the HNP has processes on it, and it is the one that aborted - I get
>> immediate return, but no callback:
>> 
>> [jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
>> orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
>> orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
>> orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
>> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
>> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
>> [jjhursey@smoky14 system] echo $?
>> 3
>> 
>> 
>> Any ideas on what I might be doing wrong?
>> 
>> I tried with both calling 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Wesley Bland
Josh,

There were a couple of bugs that I cleared up in my most recent checkin, but I 
also needed to modify your test. The callback for the application layer errmgr 
actually occurs in the application layer. Your test was never giving up the 
thread to the ORTE application event loop to receive its message from the 
ORTED. I changed your while loop to an ORTE_PROGRESSED_WAIT and that fixed the 
problem.
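
(Roughly, the change amounts to the sketch below. Here 'finished' stands for the flag the callback sets; the exact arguments of ORTE_PROGRESSED_WAIT should be checked against the branch - the assumption is a (flag, counter, limit) form that spins opal_progress() until the flag is set.)

    /* before: busy-waits without ever progressing ORTE, so the callback
     * message from the orted can never be delivered */
    while (0 == finished) { }

    /* after: drive the progress engine while waiting for the callback */
    ORTE_PROGRESSED_WAIT(finished, 0, 1);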

Try running the attached code with the modifications and see if that clears up 
the problem. It did for me.

Thanks,
Wesley

On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:

> So I finally got a chance to test the branch this morning. I cannot
> get it to work. Maybe I'm doing something wrong, missing some MCA
> parameter?
> 
> -
> [jjhursey@smoky-login1 resilient-orte] hg summary
> parent: 2:c550cf6ed6a2 tip
>  Newest version. Synced with trunk r24785.
> branch: default
> commit: 1 modified, 8097 unknown
> update: (current)
> -
> (the 1 modified was the test program attached)
> 
> Attached is a modified version of the orte_abort.c program found in
> ${top}/orte/test/system. This program is ORTE only, and registers the
> errmgr callback to trigger correct termination. You will need to
> configure Open MPI with '--with-devel-headers' to build this. But then
> you can compile with:
>  ortecc -g orte_abort.c -o orte_abort
> 
> These are the configure options that I used:
>  --with-devel-headers --enable-binaries --disable-io-romio
> --enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
> F77=gfortran FC=gfortran
> 
> 
> If the HNP has no processes on it - I get a hang:
> ---
> mpirun -np 4 --nolocal orte_abort
> orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
> orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
> orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
> mpirun: killing job...
> 
> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file errmgr_hnp.c at line 824
> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file orted/orted_comm.c at line 1341
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
> 
> [jjhursey@smoky14 system] echo $?
> 1
> ---
> 
> If the HNP has processes on it, but not the one that aborted - I get a hang:
> ---
> [jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
> orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
> orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
> orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
> mpirun: killing job...
> 
> [smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file errmgr_hnp.c at line 824
> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file orted/orted_comm.c at line 1341
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
> 
> [jjhursey@smoky14 system] echo $?
> 1
> 
> 
> If the HNP has processes on it, and it is the one that aborted - I get
> immediate return, but no callback:
> 
> [jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
> orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
> orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
> orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
> [jjhursey@smoky14 system] echo $?
> 3
> 
> 
> Any ideas on what I might be doing wrong?
> 
> I tried with both calling 'orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid,
> NULL);' and 'kill(getpid(), SIGKILL);' and got the same behavior.
> 
> -- Josh
> 
> 
> 
> > On Thu, Jun 23, 2011 at 9:58 AM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
> > Last reminder (I hope). RFC goes in at COB today.
> > Wesley
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Josh Hursey
So I finally got a chance to test the branch this morning. I cannot
get it to work. Maybe I'm doing something wrong, missing some MCA
parameter?

-
[jjhursey@smoky-login1 resilient-orte] hg summary
parent: 2:c550cf6ed6a2 tip
 Newest version. Synced with trunk r24785.
branch: default
commit: 1 modified, 8097 unknown
update: (current)
-
(the 1 modified was the test program attached)

Attached is a modified version of the orte_abort.c program found in
${top}/orte/test/system. This program is ORTE only, and registers the
errmgr callback to trigger correct termination. You will need to
configure Open MPI with '--with-devel-headers' to build this. But then
you can compile with:
  ortecc -g orte_abort.c -o orte_abort

These are the configure options that I used:
 --with-devel-headers --enable-binaries --disable-io-romio
--enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
F77=gfortran FC=gfortran


If the HNP has no processes on it - I get a hang:
---
mpirun -np 4 --nolocal orte_abort
orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
mpirun: killing job...

[smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file errmgr_hnp.c at line 824
[smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file orted/orted_comm.c at line 1341
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

[jjhursey@smoky14 system] echo $?
1
---

If the HNP has processes on it, but not the one that aborted - I get a hang:
---
[jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
mpirun: killing job...

[smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
readv failed: Connection reset by peer (104)
[smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
readv failed: Connection reset by peer (104)
[smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file errmgr_hnp.c at line 824
[smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file orted/orted_comm.c at line 1341
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

[jjhursey@smoky14 system] echo $?
1


If the HNP has processes on it, and it is the one that aborted - I get
immediate return, but no callback:

[jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
[jjhursey@smoky14 system] echo $?
3


Any ideas on what I might be doing wrong?

I tried with both calling 'orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid,
NULL);' and 'kill(getpid(), SIGKILL);' and got the same behavior.

-- Josh



On Thu, Jun 23, 2011 at 9:58 AM, Wesley Bland  wrote:
> Last reminder (I hope). RFC goes in at COB today.
> Wesley
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
/* -*- C -*-
 *
 * $HEADER$
 *
 * A program that just spins, with vpid 3 aborting - provides mechanism for testing
 * abnormal program termination
 */

#include <stdio.h>
#include <unistd.h>

#include "orte/runtime/runtime.h"
#include "orte/util/proc_info.h"
#include "orte/util/name_fns.h"
#include "orte/runtime/orte_globals.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/grpcomm/grpcomm.h"
#include "opal/class/opal_pointer_array.h"

static pid_t pid;
static char hostname[500];
static int finished = 0;

void my_errhandler_runtime_callback(opal_pointer_array_t *procs);

void my_errhandler_runtime_callback(opal_pointer_array_t *procs) {
   printf("orte_abort: Name %s Host: %s Pid %ld "
   "-- In callback\n",
   

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Wesley Bland
Last reminder (I hope). RFC goes in at COB today.

Wesley 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-18 Thread Josh Hursey
Sounds good. Thanks.

On Sat, Jun 18, 2011 at 9:31 AM, Wesley Bland  wrote:
> That's fine. Let's say Thursday COB is now the timeout.
>
> On Jun 18, 2011 9:10 AM, "Joshua Hursey"  wrote:
>> Cool. Then can we hold off pushing this into the trunk for a couple days
>> until I get a chance to test it? Monday COB does not give me much time since
>> we just got the new patch on Friday COB (the RFC gave us 2 weeks to review
>> the original patch). Would waiting until next Thursday/Friday COB be too
>> disruptive? That should give me and maybe Ralph enough time to test and send
>> any further feedback.
>>
>> Thanks,
>> Josh
>>
>> On Jun 17, 2011, at 5:59 PM, Wesley Bland wrote:
>>
>>> I believe that it does. I made quite a few changes in the last checkin
>>> though I didn't run your specific test this afternoon. I'll be able to try
>>> it later this evening but it should be easy to test now that it's synced
>>> with the trunk again.
>>>
>>> On Jun 17, 2011 5:32 PM, "Josh Hursey"  wrote:
>>> > Does this include a fix for the problem I reported with mpirun-hosted
>>> > processes?
>>> >
>>> > If not I would ask that we hold off on putting it into the trunk
>>> > until that particular bug is addressed. From my experience, tackling
>>> > this particular issue requires some code refactoring, which should
>>> > probably be done once in the trunk instead of two possibly disruptive
>>> > commits.
>>> >
>>> > -- Josh
>>> >
>>> > On Fri, Jun 17, 2011 at 5:18 PM, Wesley Bland 
>>> > wrote:
>>> >> This is a reminder that the Resilient ORTE RFC is set to go into the
>>> >> trunk
>>> >> on Monday at COB.
>>> >> I've updated the code with a few of the changes that were mentioned on
>>> >> and
>>> >> off the list (moved code out of orted_comm.c, errmgr_set_callback
>>> >> returns
>>> >> previous callback, post_startup function, corrected normal termination
>>> >> issues). Please take another look at it if you have any interest. The
>>> >> code
>>> >> can be found here:
>>> >> https://bitbucket.org/wesbland/resilient-orte/
>>> >> Thanks,
>>> >> Wesley Bland
>>> >
>>> >
>>> >
>>> > --
>>> > Joshua Hursey
>>> > Postdoctoral Research Associate
>>> > Oak Ridge National Laboratory
>>> > http://users.nccs.gov/~jjhursey
>>
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-18 Thread Wesley Bland
That's fine. Let's say Thursday COB is now the timeout.
On Jun 18, 2011 9:10 AM, "Joshua Hursey"  wrote:
> Cool. Then can we hold off pushing this into the trunk for a couple days
until I get a chance to test it? Monday COB does not give me much time since
we just got the new patch on Friday COB (the RFC gave us 2 weeks to review
the original patch). Would waiting until next Thursday/Friday COB be too
disruptive? That should give me and maybe Ralph enough time to test and send
any further feedback.
>
> Thanks,
> Josh
>
> On Jun 17, 2011, at 5:59 PM, Wesley Bland wrote:
>
>> I believe that it does. I made quite a few changes in the last checkin
though I didn't run your specific test this afternoon. I'll be able to try
it later this evening but it should be easy to test now that it's synced
with the trunk again.
>>
>> On Jun 17, 2011 5:32 PM, "Josh Hursey"  wrote:
>> > Does this include a fix for the problem I reported with mpirun-hosted
processes?
>> >
>> > If not I would ask that we hold off on putting it into the trunk
>> > until that particular bug is addressed. From my experience, tackling
>> > this particular issue requires some code refactoring, which should
>> > probably be done once in the trunk instead of two possibly disruptive
>> > commits.
>> >
>> > -- Josh
>> >
>> > On Fri, Jun 17, 2011 at 5:18 PM, Wesley Bland 
wrote:
>> >> This is a reminder that the Resilient ORTE RFC is set to go into the
trunk
>> >> on Monday at COB.
>> >> I've updated the code with a few of the changes that were mentioned on
and
>> >> off the list (moved code out of orted_comm.c, errmgr_set_callback
returns
>> >> previous callback, post_startup function, corrected normal termination
>> >> issues). Please take another look at it if you have any interest. The
code
>> >> can be found here:
>> >> https://bitbucket.org/wesbland/resilient-orte/
>> >> Thanks,
>> >> Wesley Bland
>> >
>> >
>> >
>> > --
>> > Joshua Hursey
>> > Postdoctoral Research Associate
>> > Oak Ridge National Laboratory
>> > http://users.nccs.gov/~jjhursey
>


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-18 Thread Joshua Hursey
Cool. Then can we hold off pushing this into the trunk for a couple days until 
I get a chance to test it? Monday COB does not give me much time since we just 
got the new patch on Friday COB (the RFC gave us 2 weeks to review the original 
patch). Would waiting until next Thursday/Friday COB be too disruptive? That 
should give me and maybe Ralph enough time to test and send any further 
feedback.

Thanks,
Josh

On Jun 17, 2011, at 5:59 PM, Wesley Bland wrote:

> I believe that it does. I made quite a few changes in the last checkin though 
> I didn't run your specific test this afternoon. I'll be able to try it later 
> this evening but it should be easy to test now that it's synced with the 
> trunk again.
> 
> On Jun 17, 2011 5:32 PM, "Josh Hursey"  wrote:
> > Does this include a fix for the problem I reported with mpirun-hosted 
> > processes?
> > 
> > If not I would ask that we hold off on putting it into the trunk
> > until that particular bug is addressed. From my experience, tackling
> > this particular issue requires some code refactoring, which should
> > probably be done once in the trunk instead of two possibly disruptive
> > commits.
> > 
> > -- Josh
> > 
> > On Fri, Jun 17, 2011 at 5:18 PM, Wesley Bland  wrote:
> >> This is a reminder that the Resilient ORTE RFC is set to go into the trunk
> >> on Monday at COB.
> >> I've updated the code with a few of the changes that were mentioned on and
> >> off the list (moved code out of orted_comm.c, errmgr_set_callback returns
> >> previous callback, post_startup function, corrected normal termination
> >> issues). Please take another look at it if you have any interest. The code
> >> can be found here:
> >> https://bitbucket.org/wesbland/resilient-orte/
> >> Thanks,
> >> Wesley Bland
> > 
> > 
> > 
> > -- 
> > Joshua Hursey
> > Postdoctoral Research Associate
> > Oak Ridge National Laboratory
> > http://users.nccs.gov/~jjhursey




Re: [OMPI devel] RFC: Resilient ORTE

2011-06-17 Thread Wesley Bland
I believe that it does. I made quite a few changes in the last checkin
though I didn't run your specific test this afternoon. I'll be able to try
it later this evening but it should be easy to test now that it's synced
with the trunk again.
On Jun 17, 2011 5:32 PM, "Josh Hursey"  wrote:
> Does this include a fix for the problem I reported with mpirun-hosted
processes?
>
> If not I would ask that we hold off on putting it into the trunk
> until that particular bug is addressed. From my experience, tackling
> this particular issue requires some code refactoring, which should
> probably be done once in the trunk instead of two possibly disruptive
> commits.
>
> -- Josh
>
> On Fri, Jun 17, 2011 at 5:18 PM, Wesley Bland  wrote:
>> This is a reminder that the Resilient ORTE RFC is set to go into the
trunk
>> on Monday at COB.
>> I've updated the code with a few of the changes that were mentioned on
and
>> off the list (moved code out of orted_comm.c, errmgr_set_callback returns
>> previous callback, post_startup function, corrected normal termination
>> issues). Please take another look at it if you have any interest. The
code
>> can be found here:
>> https://bitbucket.org/wesbland/resilient-orte/
>> Thanks,
>> Wesley Bland
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-17 Thread Josh Hursey
Does this include a fix for the problem I reported with mpirun-hosted processes?

If not I would ask that we hold off on putting it into the trunk
until that particular bug is addressed. From my experience, tackling
this particular issue requires some code refactoring, which should
probably be done once in the trunk instead of two possibly disruptive
commits.

-- Josh

On Fri, Jun 17, 2011 at 5:18 PM, Wesley Bland  wrote:
> This is a reminder that the Resilient ORTE RFC is set to go into the trunk
> on Monday at COB.
> I've updated the code with a few of the changes that were mentioned on and
> off the list (moved code out of orted_comm.c, errmgr_set_callback returns
> previous callback, post_startup function, corrected normal termination
> issues). Please take another look at it if you have any interest. The code
> can be found here:
> https://bitbucket.org/wesbland/resilient-orte/
> Thanks,
> Wesley Bland



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
No issue - just trying to get ahead of the game instead of running into an 
issue later.

We can leave it for now.

On Jun 10, 2011, at 2:47 PM, Josh Hursey wrote:

> We could, but we could also just replace the callback. I will never
> want to use it in my scenario, and if I did then I could just call it
> directly instead of relying on the errmgr to do the right thing. So
> why complicate the errmgr with additional complexity for something
> that we don't need at the moment?
> 
> On Fri, Jun 10, 2011 at 4:40 PM, Ralph Castain  wrote:
>> So why not have the callback return an int, and your callback returns "go no 
>> further"?
>> 
>> 
>> On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:
>> 
>>> Yeah I do not want the default fatal callback in OMPI. I want to
>>> replace it with something that allows OMPI to continue running when
>>> there are process failures (if the error handlers associated with the
>>> communicators permit such an action). So having the default fatal
>>> callback called after mine would not be useful, since I do not want
>>> the fatal action.
>>> 
>>> As long as I can replace that callback, or selectively get rid of it
>>> then I'm ok.
>>> 
>>> 
>>> On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain  wrote:
 
 On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
 
> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
>> 
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>> 
>>> 
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>> 
 Well, you're way too trusty. ;)
>>> 
>>> It's the midwestern boy in me :)
>> 
>> Still need to shake that corn out of your head... :-)
>> 
>>> 
 
 This only works if all components play the game, and even then it is 
 difficult if you want to allow components to deregister themselves 
 in the middle of the execution. The problem is that a callback will be 
 the "previous" callback for some component, and when you want to remove a 
 callback you have to inform the "next" component on the callback 
 chain to change its previous.
>>> 
>>> This is a fair point. I think hiding the ordering of callbacks in the 
>>> errmgr could be dangerous since it takes control from the upper layers, 
>>> but, conversely, trusting the upper layers to 'do the right thing' with 
>>> the previous callback is probably too optimistic, esp. for layers that 
>>> are not designed together.
>>> 
>>> To that I would suggest that you leave the code as is - registering a 
>>> callback overwrites the existing callback. That will allow me to 
>>> replace the default OMPI callback when I am able to in MPI_Init, and, 
>>> if I need to, swap back in the default version at MPI_Finalize.
>>> 
>>> Does that sound like a reasonable way forward on this design point?
>> 
>> It doesn't solve the problem that George alluded to - just because you 
>> overwrite the callback, it doesn't mean that someone else won't 
>> overwrite you when their component initializes. Only the last one wins - 
>> the rest of you lose.
>> 
>> I'm not sure how you guarantee that you win, which is why I'm unclear 
>> how this callback can really work unless everyone agrees that only one 
>> place gets it. Put that callback in a base function of a new error 
>> handling framework, and then let everyone create components within that 
>> for handling desired error responses?
> 
> Yep, that is a problem, but one that we can deal with in the immediate
> case. Since OMPI is the only layer registering the callback, when I
> replace it in OMPI I will have to make sure that no other place in
> OMPI replaces the callback.
> 
> If at some point we need more than one callback above ORTE then we may
> want to revisit this point. But since we only have one layer on top of
> ORTE, it is the responsibility of that layer to be internally
> consistent with regard to which callback it wants to be triggered.
> 
> If the layers above ORTE want more than one callback I would suggest
> that that layer design some mechanism for coordinating these multiple
> - possibly conflicting - callbacks (by the way this is policy
> management, which can get complex fast as you add more interested
> parties). Meaning that if OMPI wanted multiple callbacks to be active
> at the same time, then OMPI would create a mechanism for managing
> these callbacks, not ORTE. ORTE should just have one callback provided
> to the upper layer, and keep it -simple-. If the upper layer wants to
> toy around with something more complex it must manage the complexity
> instead of artificially pushing it down to the ORTE layer.
 
 I was thinking some more about this, and wonder if we aren't 
 over-complicating the 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
We could, but we could also just replace the callback. I will never
want to use it in my scenario, and if I did then I could just call it
directly instead of relying on the errmgr to do the right thing. So
why complicate the errmgr with additional complexity for something
that we don't need at the moment?

On Fri, Jun 10, 2011 at 4:40 PM, Ralph Castain  wrote:
> So why not have the callback return an int, and your callback returns "go no 
> further"?
>
>
> On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:
>
>> Yeah I do not want the default fatal callback in OMPI. I want to
>> replace it with something that allows OMPI to continue running when
>> there are process failures (if the error handlers associated with the
>> communicators permit such an action). So having the default fatal
>> callback called after mine would not be useful, since I do not want
>> the fatal action.
>>
>> As long as I can replace that callback, or selectively get rid of it
>> then I'm ok.
>>
>>
>> On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain  wrote:
>>>
>>> On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
>>>
 On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
>
> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>
>>
>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>
>>> Well, you're way too trusty. ;)
>>
>> It's the midwestern boy in me :)
>
> Still need to shake that corn out of your head... :-)
>
>>
>>>
>>> This only works if all components play the game, and even then it is 
>>> difficult if you want to allow components to deregister themselves 
>>> in the middle of the execution. The problem is that a callback will be 
>>> the "previous" callback for some component, and when you want to remove a 
>>> callback you have to inform the "next" component on the callback chain 
>>> to change its previous.
>>
>> This is a fair point. I think hiding the ordering of callbacks in the 
>> errmgr could be dangerous since it takes control from the upper layers, 
>> but, conversely, trusting the upper layers to 'do the right thing' with 
>> the previous callback is probably too optimistic, esp. for layers that 
>> are not designed together.
>>
>> To that I would suggest that you leave the code as is - registering a 
>> callback overwrites the existing callback. That will allow me to replace 
>> the default OMPI callback when I am able to in MPI_Init, and, if I need 
>> to, swap back in the default version at MPI_Finalize.
>>
>> Does that sound like a reasonable way forward on this design point?
>
> It doesn't solve the problem that George alluded to - just because you 
> overwrite the callback, it doesn't mean that someone else won't overwrite 
> you when their component initializes. Only the last one wins - the rest 
> of you lose.
>
> I'm not sure how you guarantee that you win, which is why I'm unclear how 
> this callback can really work unless everyone agrees that only one place 
> gets it. Put that callback in a base function of a new error handling 
> framework, and then let everyone create components within that for 
> handling desired error responses?

 Yep, that is a problem, but one that we can deal with in the immediate
 case. Since OMPI is the only layer registering the callback, when I
 replace it in OMPI I will have to make sure that no other place in
 OMPI replaces the callback.

 If at some point we need more than one callback above ORTE then we may
 want to revisit this point. But since we only have one layer on top of
 ORTE, it is the responsibility of that layer to be internally
 consistent with regard to which callback it wants to be triggered.

 If the layers above ORTE want more than one callback I would suggest
 that that layer design some mechanism for coordinating these multiple
 - possibly conflicting - callbacks (by the way this is policy
 management, which can get complex fast as you add more interested
 parties). Meaning that if OMPI wanted multiple callbacks to be active
 at the same time, then OMPI would create a mechanism for managing
 these callbacks, not ORTE. ORTE should just have one callback provided
 to the upper layer, and keep it -simple-. If the upper layer wants to
 toy around with something more complex it must manage the complexity
 instead of artificially pushing it down to the ORTE layer.
>>>
>>> I was thinking some more about this, and wonder if we aren't 
>>> over-complicating the question.
>>>
>>> Do you need to actually control the sequence of callbacks, or just ensure 
>>> that your callback gets called prior to the default one that calls abort?
>>>
>>> Meeting the latter requirement is trivial - subsequent calls to 
>>> register_callback get pushed onto the top of the 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
So why not have the callback return an int, and your callback returns "go no 
further"?


On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:

> Yeah I do not want the default fatal callback in OMPI. I want to
> replace it with something that allows OMPI to continue running when
> there are process failures (if the error handlers associated with the
> communicators permit such an action). So having the default fatal
> callback called after mine would not be useful, since I do not want
> the fatal action.
> 
> As long as I can replace that callback, or selectively get rid of it
> then I'm ok.
> 
> 
> On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain  wrote:
>> 
>> On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
>> 
>>> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
 
 On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
 
> 
> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
> 
>> Well, you're way too trusty. ;)
> 
> It's the midwestern boy in me :)
 
 Still need to shake that corn out of your head... :-)
 
> 
>> 
>> This only works if all components play the game, and even then it is 
>> difficult if you want to allow components to deregister themselves in 
>> the middle of the execution. The problem is that a callback will be 
>> the "previous" callback for some component, and when you want to remove a callback 
>> you have to inform the "next" component on the callback chain to change 
>> its previous.
> 
> This is a fair point. I think hiding the ordering of callbacks in the 
> errmgr could be dangerous since it takes control from the upper layers, 
> but, conversely, trusting the upper layers to 'do the right thing' with 
> the previous callback is probably too optimistic, esp. for layers that 
> are not designed together.
> 
> To that I would suggest that you leave the code as is - registering a 
> callback overwrites the existing callback. That will allow me to replace 
> the default OMPI callback when I am able to in MPI_Init, and, if I need 
> to, swap back in the default version at MPI_Finalize.
> 
> Does that sound like a reasonable way forward on this design point?
 
 It doesn't solve the problem that George alluded to - just because you 
 overwrite the callback, it doesn't mean that someone else won't overwrite 
 you when their component initializes. Only the last one wins - the rest of 
 you lose.
 
 I'm not sure how you guarantee that you win, which is why I'm unclear how 
 this callback can really work unless everyone agrees that only one place 
 gets it. Put that callback in a base function of a new error handling 
 framework, and then let everyone create components within that for 
 handling desired error responses?
>>> 
>>> Yep, that is a problem, but one that we can deal with in the immediate
>>> case. Since OMPI is the only layer registering the callback, when I
>>> replace it in OMPI I will have to make sure that no other place in
>>> OMPI replaces the callback.
>>> 
>>> If at some point we need more than one callback above ORTE then we may
>>> want to revisit this point. But since we only have one layer on top of
>>> ORTE, it is the responsibility of that layer to be internally
>>> consistent with regard to which callback it wants to be triggered.
>>> 
>>> If the layers above ORTE want more than one callback I would suggest
>>> that that layer design some mechanism for coordinating these multiple
>>> - possibly conflicting - callbacks (by the way this is policy
>>> management, which can get complex fast as you add more interested
>>> parties). Meaning that if OMPI wanted multiple callbacks to be active
>>> at the same time, then OMPI would create a mechanism for managing
>>> these callbacks, not ORTE. ORTE should just have one callback provided
>>> to the upper layer, and keep it -simple-. If the upper layer wants to
>>> toy around with something more complex it must manage the complexity
>>> instead of artificially pushing it down to the ORTE layer.
>> 
>> I was thinking some more about this, and wonder if we aren't 
>> over-complicating the question.
>> 
>> Do you need to actually control the sequence of callbacks, or just ensure 
>> that your callback gets called prior to the default one that calls abort?
>> 
>> Meeting the latter requirement is trivial - subsequent calls to 
>> register_callback get pushed onto the top of the callback list. Since the 
>> default one always gets registered first (which we can ensure since it 
>> occurs in MPI_Init), it will always be at the bottom of the callback list 
>> and hence called last.
>> 
>> Keeping that list in ORTE is simple and probably the right place to do it.
>> 
>> However, if you truly want to control the callback order in detail - then 
>> yeah, that should go up in  OMPI. I sure don't want to write all that code 
>> :-)
>> 
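
(For illustration, the list behavior described above - newest callback pushed on top, the default abort callback registered first in MPI_Init sitting at the bottom and therefore called last - could be sketched as follows; the item type and function name are hypothetical, while opal_list_t and opal_list_prepend() are the real OPAL list primitives.)

    static opal_list_t errmgr_callbacks;   /* newest callback at the head */

    void errmgr_register_callback(errmgr_cb_item_t *cb)   /* hypothetical item type */
    {
        /* push on top: later registrations run before earlier ones, so the
         * default abort callback (registered first) always runs last */
        opal_list_prepend(&errmgr_callbacks, &cb->super);
    }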

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
Yeah I do not want the default fatal callback in OMPI. I want to
replace it with something that allows OMPI to continue running when
there are process failures (if the error handlers associated with the
communicators permit such an action). So having the default fatal
callback called after mine would not be useful, since I do not want
the fatal action.

As long as I can replace that callback, or selectively get rid of it
then I'm ok.
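
(For illustration, the replace-and-restore flow could look roughly like this; the registration call and types are placeholders based on the RFC's "errmgr_set_callback returns previous callback" description, not the branch's actual API.)

    /* in MPI_Init: swap in the resilience handler and remember the default
     * fatal one that was registered first */
    previous_cb = orte_errmgr_set_callback(ompi_resilience_cb);

    /* in MPI_Finalize: put the default fatal handler back */
    (void) orte_errmgr_set_callback(previous_cb);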


On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain  wrote:
>
> On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
>
>> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
>>>
>>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>>

 On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:

> Well, you're way too trusty. ;)

 It's the midwestern boy in me :)
>>>
>>> Still need to shake that corn out of your head... :-)
>>>

>
> This only works if all components play the game, and even then it is 
> difficult if you want to allow components to deregister themselves in the 
> middle of the execution. The problem is that a callback will be the "previous" callback 
> for some component, and when you want to remove a callback you have 
> to inform the "next" component on the callback chain to change its 
> previous.

 This is a fair point. I think hiding the ordering of callbacks in the 
 errmgr could be dangerous since it takes control from the upper layers, 
 but, conversely, trusting the upper layers to 'do the right thing' with 
 the previous callback is probably too optimistic, esp. for layers that are 
 not designed together.

 To that I would suggest that you leave the code as is - registering a 
 callback overwrites the existing callback. That will allow me to replace 
 the default OMPI callback when I am able to in MPI_Init, and, if I need 
 to, swap back in the default version at MPI_Finalize.

 Does that sound like a reasonable way forward on this design point?
>>>
>>> It doesn't solve the problem that George alluded to - just because you 
>>> overwrite the callback, it doesn't mean that someone else won't overwrite 
>>> you when their component initializes. Only the last one wins - the rest of 
>>> you lose.
>>>
>>> I'm not sure how you guarantee that you win, which is why I'm unclear how 
>>> this callback can really work unless everyone agrees that only one place 
>>> gets it. Put that callback in a base function of a new error handling 
>>> framework, and then let everyone create components within that for handling 
>>> desired error responses?
>>
>> Yep, that is a problem, but one that we can deal with in the immediate
>> case. Since OMPI is the only layer registering the callback, when I
>> replace it in OMPI I will have to make sure that no other place in
>> OMPI replaces the callback.
>>
>> If at some point we need more than one callback above ORTE then we may
>> want to revisit this point. But since we only have one layer on top of
>> ORTE, it is the responsibility of that layer to be internally
>> consistent with regard to which callback it wants to be triggered.
>>
>> If the layers above ORTE want more than one callback I would suggest
>> that that layer design some mechanism for coordinating these multiple
>> - possibly conflicting - callbacks (by the way this is policy
>> management, which can get complex fast as you add more interested
>> parties). Meaning that if OMPI wanted multiple callbacks to be active
>> at the same time, then OMPI would create a mechanism for managing
>> these callbacks, not ORTE. ORTE should just have one callback provided
>> to the upper layer, and keep it -simple-. If the upper layer wants to
>> toy around with something more complex it must manage the complexity
>> instead of artificially pushing it down to the ORTE layer.
>
> I was thinking some more about this, and wonder if we aren't 
> over-complicating the question.
>
> Do you need to actually control the sequence of callbacks, or just ensure 
> that your callback gets called prior to the default one that calls abort?
>
> Meeting the latter requirement is trivial - subsequent calls to 
> register_callback get pushed onto the top of the callback list. Since the 
> default one always gets registered first (which we can ensure since it occurs 
> in MPI_Init), it will always be at the bottom of the callback list and hence 
> called last.
>
> Keeping that list in ORTE is simple and probably the right place to do it.
>
> However, if you truly want to control the callback order in detail - then 
> yeah, that should go up in  OMPI. I sure don't want to write all that code :-)
>
>
>>
>> -- Josh
>>

 -- Josh

>
> george.
>
> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>
>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>> -
>> 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:

> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
>> 
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>> 
>>> 
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>> 
 Well, you're way too trusty. ;)
>>> 
>>> It's the midwestern boy in me :)
>> 
>> Still need to shake that corn out of your head... :-)
>> 
>>> 
 
 This only works if all components play the game, and even then it is 
 difficult if you want to allow components to deregister themselves in the 
 middle of the execution. The problem is that a callback will be the "previous" 
 for some component, and that when you want to remove a callback you have 
 to inform the "next" component on the callback chain to change its 
 "previous".
>>> 
>>> This is a fair point. I think hiding the ordering of callbacks in the 
>>> errmgr could be dangerous since it takes control from the upper layers, 
>>> but, conversely, trusting the upper layers to 'do the right thing' with the 
>>> previous callback is probably too optimistic, esp. for layers that are not 
>>> designed together.
>>> 
>>> To that I would suggest that you leave the code as is - registering a 
>>> callback overwrites the existing callback. That will allow me to replace 
>>> the default OMPI callback when I am able to in MPI_Init, and, if I need to, 
>>> swap back in the default version at MPI_Finalize.
>>> 
>>> Does that sound like a reasonable way forward on this design point?
>> 
>> It doesn't solve the problem that George alluded to - just because you 
>> overwrite the callback, it doesn't mean that someone else won't overwrite 
>> you when their component initializes. Only the last one wins - the rest of 
>> you lose.
>> 
>> I'm not sure how you guarantee that you win, which is why I'm unclear how 
>> this callback can really work unless everyone agrees that only one place 
>> gets it. Put that callback in a base function of a new error handling 
>> framework, and then let everyone create components within that for handling 
>> desired error responses?
> 
> Yep, that is a problem, but one that we can deal with in the immediate
> case. Since OMPI is the only layer registering the callback, when I
> replace it in OMPI I will have to make sure that no other place in
> OMPI replaces the callback.
> 
> If at some point we need more than one callback above ORTE then we may
> want to revisit this point. But since we only have one layer on top of
> ORTE, it is the responsibility of that layer to be internally
> consistent with regard to which callback it wants to be triggered.
> 
> If the layers above ORTE want more than one callback I would suggest
> that that layer design some mechanism for coordinating these multiple
> - possibly conflicting - callbacks (by the way this is policy
> management, which can get complex fast as you add more interested
> parties). Meaning that if OMPI wanted multiple callbacks to be active
> at the same time, then OMPI would create a mechanism for managing
> these callbacks, not ORTE. ORTE should just have one callback provided
> to the upper layer, and keep it -simple-. If the upper layer wants to
> toy around with something more complex it must manage the complexity
> instead of artificially pushing it down to the ORTE layer.

I was thinking some more about this, and wonder if we aren't over-complicating 
the question.

Do you need to actually control the sequence of callbacks, or just ensure that 
your callback gets called prior to the default one that calls abort?

Meeting the latter requirement is trivial - subsequent calls to 
register_callback get pushed onto the top of the callback list. Since the 
default one always gets registered first (which we can ensure since it occurs 
in MPI_Init), it will always be at the bottom of the callback list and hence 
called last.

Keeping that list in ORTE is simple and probably the right place to do it.

However, if you truly want to control the callback order in detail - then yeah, 
that should go up in  OMPI. I sure don't want to write all that code :-)
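
For reference, a minimal sketch of the list being described, with hypothetical names (register_fault_callback and invoke_fault_callbacks are illustrations, not symbols from the patch); only orte_process_name_t is taken from the real code base:

-
#include <stdlib.h>

/* orte_process_name_t comes from the usual ORTE headers */
typedef void (*fault_cbfunc_t)(orte_process_name_t *proc);

typedef struct cb_item {
    fault_cbfunc_t  cb;
    struct cb_item *next;
} cb_item_t;

static cb_item_t *cb_stack = NULL;   /* head = most recently registered */

/* Each new registration is pushed on top of the earlier ones. */
static void register_fault_callback(fault_cbfunc_t cb)
{
    cb_item_t *item = (cb_item_t*) malloc(sizeof(cb_item_t));
    item->cb = cb;
    item->next = cb_stack;
    cb_stack = item;
}

/* On a fault, walk the stack top-down.  The default abort callback that
 * MPI_Init registers first sits at the bottom, so it runs last. */
static void invoke_fault_callbacks(orte_process_name_t *proc)
{
    cb_item_t *item;
    for (item = cb_stack; NULL != item; item = item->next) {
        item->cb(proc);
    }
}
-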


> 
> -- Josh
> 
>>> 
>>> -- Josh
>>> 
 
 george.
 
 On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
 
> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
> -
> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
> -
> 
> Which is a callback that just calls abort (which is what we want to do
> by default):
> -
> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
> }
> -
> 
> This is what I want to replace. I do -not- want ompi to abort just
> because a process failed. So I need a way to replace or remove this
> callback, and put in my own callback that 'does the right thing'.
> 
> The current patch allows me to overwrite the callback when I 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 7:01 AM, Josh Hursey wrote:

> On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain  wrote:
>> 
>> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:
>> 
>>> Another problem with this patch, that I mentioned to Wesley and George
>>> off list, is that it does not handle the case when mpirun/HNP is also
>>> hosting processes that might fail. In my testing of the patch it
>>> worked fine if mpirun/HNP was -not- hosting any processes, but once it
>>> had to host processes then unexpected behavior occurred when a process
>>> failed. So for those just listening to this thread, Wesley is working
>>> on a revised patch to address this problem that he will post when it
>>> is ready.
>> 
>> See my other response to the patch - I think we need to understand why we 
>> are storing state in multiple places as it can create unexpected behavior 
>> when things are out-of-sync.
>> 
>> 
>>> 
>>> 
>>> As far as the RML issue, doesn't the ORTE state machine branch handle
>>> that case? If it does, then let's push the solution to that problem
>>> until that branch comes around instead of solving it twice.
>> 
>> No, it doesn't - in fact, it's what breaks the current method. Because we no 
>> longer allow event recursion, the RML message never gets out of the app. 
>> Hence my question.
>> 
>> I honestly don't think we need to have orte be aware of the distinction 
>> between "aborted by cmd" and "aborted by signal" as the only diff is in the 
>> error message. There ought to be some other way of resolving this?
> 
> MPI_Abort will need to tell ORTE which processes should be 'aborted by
> signal' along with the calling process. So there needs to be a
> mechanism for that as well. Not sure if I have a good solution to
> this in mind just yet.

Ah yes - that would require a communication anyway.

> 
> A thought though, in the state machine version, the process calling
> MPI_Abort could post a message to the processing thread and return
> from the callback. The callback would have a check at the bottom to
> determine if MPI_Abort was triggered within the callback, and just
> sleep. The processing thread would progress the RML message and once
> finished call exit(). This implies that the application process has a
> separate processing thread. But I think we might be able to post the
> RML message in the callback, then wait for it to complete outside of
> the callback before returning control to the user. :/ Interesting.

Could work, though it does require a thread. You would have to be tricky about 
it, though, as it is possible the call to "abort" could occur in an event 
handler. If you block in that handler waiting for the message to be sent, it 
will never leave, as the RML uses the event lib to trigger the actual send.

I may have a solution to the latter problem. For similar reasons, I've had to 
change the errmgr so it doesn't immediately process errors - otherwise, its 
actions become constrained by the question of "am I in an event handler or 
not". To remove the uncertainty, I'm rigging it so that all errmgr processing 
is done in an event - basically, reporting an error causes the errmgr to push 
the error into a pipe, which triggers an event that actually processes it.

Only way I could deal with the uncertainty. So if that mechanism is in place, 
the only thing you would have to do is (a) call abort, and then (b) cycle 
opal_progress until the errmgr.abort function callback occurred. Of course, we 
would then have to modify the errmgr so that abort took a callback function 
that it called when the app is free to exit.

No perfect solution, I fear.
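
To make the two pieces concrete, here is a minimal sketch of that pattern - report_error, process_error_event, and app_abort are hypothetical names; only opal_progress() (opal/runtime/opal_progress.h) is from the real tree, and the wiring of the pipe into the event library is elided:

-
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>
#include "opal/runtime/opal_progress.h"

static int error_pipe[2];                  /* created once with pipe() at errmgr init */
static volatile bool abort_cb_fired = false;

/* Called wherever the error is detected - possibly inside an event handler.
 * It only writes to the pipe, so it never recurses into the event library. */
static void report_error(int errcode)
{
    (void) write(error_pipe[1], &errcode, sizeof(errcode));
}

/* Registered as a read event on error_pipe[0]; runs later from the event
 * loop, where it is safe to send RML messages, update state, etc. */
static void process_error_event(void)
{
    int errcode;
    (void) read(error_pipe[0], &errcode, sizeof(errcode));
    /* ... actual errmgr processing would go here ... */
    abort_cb_fired = true;                 /* stand-in for the abort callback */
}

/* The "(a) call abort, then (b) cycle opal_progress" part of the proposal. */
static void app_abort(int errcode)
{
    report_error(errcode);
    while (!abort_cb_fired) {
        opal_progress();                   /* drive the event library until the
                                              errmgr's abort callback has run */
    }
    exit(errcode);
}
-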



> 
> -- Josh
> 
>> 
>> 
>>> 
>>> -- Josh
>>> 
>>> 
>>> On Fri, Jun 10, 2011 at 8:22 AM, Ralph Castain  wrote:
 Something else you might want to address in here: the current code sends 
 an RML message from the proc calling abort to its local daemon telling the 
 daemon that we are exiting due to the app calling "abort". We needed to do 
 this because we wanted to flag the proc termination as one induced by the 
 app itself as opposed to something like a segfault or termination by 
 signal.
 
 However, the problem is that the app may be calling abort from within an 
 event handler. Hence, the RML send (which is currently blocking) will 
 never complete once we no longer allow event lib recursion (coming soon). 
 If we use a non-blocking send, then we can't know for sure that the 
 message has been sent before we terminate.
 
 What we need is a non-messaging way of communicating that this was an 
 ordered abort as opposed to a segfault or other failure. Prior to the 
 current method, we had the app drop a file that the daemon looked for as 
 an "abort  marker", but that was ugly as it sometimes caused us to not 
 properly cleanup the session directory tree.
 
 I'm open to suggestion - perhaps it isn't actually all that critical for 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain  wrote:
>
> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:
>
>> Another problem with this patch, that I mentioned to Wesley and George
>> off list, is that it does not handle the case when mpirun/HNP is also
>> hosting processes that might fail. In my testing of the patch it
>> worked fine if mpirun/HNP was -not- hosting any processes, but once it
>> had to host processes then unexpected behavior occurred when a process
>> failed. So for those just listening to this thread, Wesley is working
>> on a revised patch to address this problem that he will post when it
>> is ready.
>
> See my other response to the patch - I think we need to understand why we are 
> storing state in multiple places as it can create unexpected behavior when 
> things are out-of-sync.
>
>
>>
>>
>> As far as the RML issue, doesn't the ORTE state machine branch handle
>> that case? If it does, then let's push the solution to that problem
>> until that branch comes around instead of solving it twice.
>
> No, it doesn't - in fact, it's what breaks the current method. Because we no 
> longer allow event recursion, the RML message never gets out of the app. 
> Hence my question.
>
> I honestly don't think we need to have orte be aware of the distinction 
> between "aborted by cmd" and "aborted by signal" as the only diff is in the 
> error message. There ought to be some other way of resolving this?

MPI_Abort will need to tell ORTE which processes should be 'aborted by
signal' along with the calling process. So there needs to be a
mechanism for that as well. Not sure if I have a good solution to
this in mind just yet.

A thought though, in the state machine version, the process calling
MPI_Abort could post a message to the processing thread and return
from the callback. The callback would have a check at the bottom to
determine if MPI_Abort was triggered within the callback, and just
sleep. The processing thread would progress the RML message and once
finished call exit(). This implies that the application process has a
separate processing thread. But I think we might be able to post the
RML message in the callback, then wait for it to complete outside of
the callback before returning control to the user. :/ Interesting.

-- Josh

>
>
>>
>> -- Josh
>>
>>
>> On Fri, Jun 10, 2011 at 8:22 AM, Ralph Castain  wrote:
>>> Something else you might want to address in here: the current code sends an 
>>> RML message from the proc calling abort to its local daemon telling the 
>>> daemon that we are exiting due to the app calling "abort". We needed to do 
>>> this because we wanted to flag the proc termination as one induced by the 
>>> app itself as opposed to something like a segfault or termination by signal.
>>>
>>> However, the problem is that the app may be calling abort from within an 
>>> event handler. Hence, the RML send (which is currently blocking) will never 
>>> complete once we no longer allow event lib recursion (coming soon). If we 
>>> use a non-blocking send, then we can't know for sure that the message has 
>>> been sent before we terminate.
>>>
>>> What we need is a non-messaging way of communicating that this was an 
>>> ordered abort as opposed to a segfault or other failure. Prior to the 
>>> current method, we had the app drop a file that the daemon looked for as an 
>>> "abort  marker", but that was ugly as it sometimes caused us to not 
>>> properly cleanup the session directory tree.
>>>
>>> I'm open to suggestion - perhaps it isn't actually all that critical for us 
>>> to distinguish "aborted by call to abort" from "aborted by signal", and we 
>>> can just have the app commit suicide via self-imposed SIGKILL? It is only 
>>> the message output  to the user at the end of the job that differs - and 
>>> since MPI_Abort already provides a message indicating "we called abort", is 
>>> it really necessary that we have orte aware of that distinction?
>>>
>>>
>>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>>

 On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:

> Well, you're way too trusty. ;)

 It's the midwestern boy in me :)

>
> This only works if all components play the game, and even then it is 
> difficult if you want to allow components to deregister themselves in the 
> middle of the execution. The problem is that a callback will be the "previous" 
> for some component, and that when you want to remove a callback you have 
> to inform the "next" component on the callback chain to change its 
> "previous".

 This is a fair point. I think hiding the ordering of callbacks in the 
 errmgr could be dangerous since it takes control from the upper layers, 
 but, conversely, trusting the upper layers to 'do the right thing' with 
 the previous callback is probably too optimistic, esp. for layers that are 
 not designed together.

 To that I would 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:

> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
>> 
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>> 
>>> 
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>> 
 Well, you're way too trusty. ;)
>>> 
>>> It's the midwestern boy in me :)
>> 
>> Still need to shake that corn out of your head... :-)
>> 
>>> 
 
 This only works if all components play the game, and even then it is 
 difficult if you want to allow components to deregister themselves in the 
 middle of the execution. The problem is that a callback will be the "previous" 
 for some component, and that when you want to remove a callback you have 
 to inform the "next" component on the callback chain to change its 
 "previous".
>>> 
>>> This is a fair point. I think hiding the ordering of callbacks in the 
>>> errmgr could be dangerous since it takes control from the upper layers, 
>>> but, conversely, trusting the upper layers to 'do the right thing' with the 
>>> previous callback is probably too optimistic, esp. for layers that are not 
>>> designed together.
>>> 
>>> To that I would suggest that you leave the code as is - registering a 
>>> callback overwrites the existing callback. That will allow me to replace 
>>> the default OMPI callback when I am able to in MPI_Init, and, if I need to, 
>>> swap back in the default version at MPI_Finalize.
>>> 
>>> Does that sound like a reasonable way forward on this design point?
>> 
>> It doesn't solve the problem that George alluded to - just because you 
>> overwrite the callback, it doesn't mean that someone else won't overwrite 
>> you when their component initializes. Only the last one wins - the rest of 
>> you lose.
>> 
>> I'm not sure how you guarantee that you win, which is why I'm unclear how 
>> this callback can really work unless everyone agrees that only one place 
>> gets it. Put that callback in a base function of a new error handling 
>> framework, and then let everyone create components within that for handling 
>> desired error responses?
> 
> Yep, that is a problem, but one that we can deal with in the immediate
> case. Since OMPI is the only layer registering the callback, when I
> replace it in OMPI I will have to make sure that no other place in
> OMPI replaces the callback.
> 
> If at some point we need more than one callback above ORTE then we may
> want to revisit this point. But since we only have one layer on top of
> ORTE, it is the responsibility of that layer to be internally
> consistent with regard to which callback it wants to be triggered.
> 
> If the layers above ORTE want more than one callback I would suggest
> that that layer design some mechanism for coordinating these multiple
> - possibly conflicting - callbacks (by the way this is policy
> management, which can get complex fast as you add more interested
> parties). Meaning that if OMPI wanted multiple callbacks to be active
> at the same time, then OMPI would create a mechanism for managing
> these callbacks, not ORTE. ORTE should just have one callback provided
> to the upper layer, and keep it -simple-. If the upper layer wants to
> toy around with something more complex it must manage the complexity
> instead of artificially pushing it down to the ORTE layer.

I agree - I was just proposing one way of doing that in the MPI layer so you 
wouldn't have to play policeman on the rest of the code base to ensure nobody 
else inserts a callback without realizing they overwrote yours. I can envision, 
for example, UTK wanting to do something different from you, and perhaps 
committing a callback that unintentionally overrode you.

Up to you...just making a suggestion.


> 
> -- Josh
> 
>>> 
>>> -- Josh
>>> 
 
 george.
 
 On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
 
> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
> -
> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
> -
> 
> Which is a callback that just calls abort (which is what we want to do
> by default):
> -
> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
> }
> -
> 
> This is what I want to replace. I do -not- want ompi to abort just
> because a process failed. So I need a way to replace or remove this
> callback, and put in my own callback that 'does the right thing'.
> 
> The current patch allows me to overwrite the callback when I call:
> -
> orte_errmgr.set_fault_callback(&my_callback);
> -
> Which is fine with me.
> 
> At the point I do not want my_callback to be active any more (say in
> MPI_Finalize) I would like to replace it with the old callback. To do
> so, with the patch's interface, I would have to know what the previous
> callback was 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:

> Another problem with this patch, that I mentioned to Wesley and George
> off list, is that it does not handle the case when mpirun/HNP is also
> hosting processes that might fail. In my testing of the patch it
> worked fine if mpirun/HNP was -not- hosting any processes, but once it
> had to host processes then unexpected behavior occurred when a process
> failed. So for those just listening to this thread, Wesley is working
> on a revised patch to address this problem that he will post when it
> is ready.

See my other response to the patch - I think we need to understand why we are 
storing state in multiple places as it can create unexpected behavior when 
things are out-of-sync.


> 
> 
> As far as the RML issue, doesn't the ORTE state machine branch handle
> that case? If it does, then let's push the solution to that problem
> until that branch comes around instead of solving it twice.

No, it doesn't - in fact, it's what breaks the current method. Because we no 
longer allow event recursion, the RML message never gets out of the app. Hence 
my question.

I honestly don't think we need to have orte be aware of the distinction between 
"aborted by cmd" and "aborted by signal" as the only diff is in the error 
message. There ought to be some other way of resolving this?


> 
> -- Josh
> 
> 
> On Fri, Jun 10, 2011 at 8:22 AM, Ralph Castain  wrote:
>> Something else you might want to address in here: the current code sends an 
>> RML message from the proc calling abort to its local daemon telling the 
>> daemon that we are exiting due to the app calling "abort". We needed to do 
>> this because we wanted to flag the proc termination as one induced by the 
>> app itself as opposed to something like a segfault or termination by signal.
>> 
>> However, the problem is that the app may be calling abort from within an 
>> event handler. Hence, the RML send (which is currently blocking) will never 
>> complete once we no longer allow event lib recursion (coming soon). If we 
>> use a non-blocking send, then we can't know for sure that the message has 
>> been sent before we terminate.
>> 
>> What we need is a non-messaging way of communicating that this was an 
>> ordered abort as opposed to a segfault or other failure. Prior to the 
>> current method, we had the app drop a file that the daemon looked for as an 
>> "abort  marker", but that was ugly as it sometimes caused us to not properly 
>> cleanup the session directory tree.
>> 
>> I'm open to suggestion - perhaps it isn't actually all that critical for us 
>> to distinguish "aborted by call to abort" from "aborted by signal", and we 
>> can just have the app commit suicide via self-imposed SIGKILL? It is only 
>> the message output  to the user at the end of the job that differs - and 
>> since MPI_Abort already provides a message indicating "we called abort", is 
>> it really necessary that we have orte aware of that distinction?
>> 
>> 
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>> 
>>> 
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>> 
 Well, you're way too trusty. ;)
>>> 
>>> It's the midwestern boy in me :)
>>> 
 
 This only works if all components play the game, and even then it is 
 difficult if you want to allow components to deregister themselves in the 
 middle of the execution. The problem is that a callback will be the "previous" 
 for some component, and that when you want to remove a callback you have 
 to inform the "next" component on the callback chain to change its 
 "previous".
>>> 
>>> This is a fair point. I think hiding the ordering of callbacks in the 
>>> errmgr could be dangerous since it takes control from the upper layers, 
>>> but, conversely, trusting the upper layers to 'do the right thing' with the 
>>> previous callback is probably too optimistic, esp. for layers that are not 
>>> designed together.
>>> 
>>> To that I would suggest that you leave the code as is - registering a 
>>> callback overwrites the existing callback. That will allow me to replace 
>>> the default OMPI callback when I am able to in MPI_Init, and, if I need to, 
>>> swap back in the default version at MPI_Finalize.
>>> 
>>> Does that sound like a reasonable way forward on this design point?
>>> 
>>> -- Josh
>>> 
 
 george.
 
 On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
 
> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
> -
> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
> -
> 
> Which is a callback that just calls abort (which is what we want to do
> by default):
> -
> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
> }
> -
> 
> This is what I want to replace. I do -not- want ompi to abort just
> because a 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
Another problem with this patch, that I mentioned to Wesley and George
off list, is that it does not handle the case when mpirun/HNP is also
hosting processes that might fail. In my testing of the patch it
worked fine if mpirun/HNP was -not- hosting any processes, but once it
had to host processes then unexpected behavior occurred when a process
failed. So for those just listening to this thread, Wesley is working
on a revised patch to address this problem that he will post when it
is ready.


As far as the RML issue, doesn't the ORTE state machine branch handle
that case? If it does, then let's push the solution to that problem
until that branch comes around instead of solving it twice.

-- Josh


On Fri, Jun 10, 2011 at 8:22 AM, Ralph Castain  wrote:
> Something else you might want to address in here: the current code sends an 
> RML message from the proc calling abort to its local daemon telling the 
> daemon that we are exiting due to the app calling "abort". We needed to do 
> this because we wanted to flag the proc termination as one induced by the app 
> itself as opposed to something like a segfault or termination by signal.
>
> However, the problem is that the app may be calling abort from within an 
> event handler. Hence, the RML send (which is currently blocking) will never 
> complete once we no longer allow event lib recursion (coming soon). If we use 
> a non-blocking send, then we can't know for sure that the message has been 
> sent before we terminate.
>
> What we need is a non-messaging way of communicating that this was an ordered 
> abort as opposed to a segfault or other failure. Prior to the current method, 
> we had the app drop a file that the daemon looked for as an "abort  marker", 
> but that was ugly as it sometimes caused us to not properly cleanup the 
> session directory tree.
>
> I'm open to suggestion - perhaps it isn't actually all that critical for us 
> to distinguish "aborted by call to abort" from "aborted by signal", and we 
> can just have the app commit suicide via self-imposed SIGKILL? It is only the 
> message output  to the user at the end of the job that differs - and since 
> MPI_Abort already provides a message indicating "we called abort", is it 
> really necessary that we have orte aware of that distinction?
>
>
> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>
>>
>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>
>>> Well, you're way too trusty. ;)
>>
>> It's the midwestern boy in me :)
>>
>>>
>>> This only works if all components play the game, and even then it is 
>>> difficult if you want to allow components to deregister themselves in the 
>>> middle of the execution. The problem is that a callback will be the "previous" 
>>> for some component, and that when you want to remove a callback you have to 
>>> inform the "next" component on the callback chain to change its "previous".
>>
>> This is a fair point. I think hiding the ordering of callbacks in the errmgr 
>> could be dangerous since it takes control from the upper layers, but, 
>> conversely, trusting the upper layers to 'do the right thing' with the 
>> previous callback is probably too optimistic, esp. for layers that are not 
>> designed together.
>>
>> To that I would suggest that you leave the code as is - registering a 
>> callback overwrites the existing callback. That will allow me to replace the 
>> default OMPI callback when I am able to in MPI_Init, and, if I need to, swap 
>> back in the default version at MPI_Finalize.
>>
>> Does that sound like a reasonable way forward on this design point?
>>
>> -- Josh
>>
>>>
>>> george.
>>>
>>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>>
 So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
 -
 orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
 -

 Which is a callback that just calls abort (which is what we want to do
 by default):
 -
 void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
 }
 -

 This is what I want to replace. I do -not- want ompi to abort just
 because a process failed. So I need a way to replace or remove this
 callback, and put in my own callback that 'does the right thing'.

 The current patch allows me to overwrite the callback when I call:
 -
 orte_errmgr.set_fault_callback(&my_callback);
 -
 Which is fine with me.

 At the point I do not want my_callback to be active any more (say in
 MPI_Finalize) I would like to replace it with the old callback. To do
 so, with the patch's interface, I would have to know what the previous
 callback was and do:
 -
 orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
 -

 This comes at a slight maintenance burden since now there will be two
 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
Okay, finally have time to sit down and review this. It looks pretty much 
identical to what was done in ORCM - we just kept "epoch" separate from the 
process name, and use multicast to notify all procs that someone failed. I do 
have a few questions/comments about your proposed patch:

1. I note that in some places you just set peer_name.epoch = proc_name.epoch, 
and in others you make the assignment by calling a new API 
orte_ess.proc_get_epoch(_name). Ditto for proc_set_epoch. What are the 
rules for when each method should be used? Which leads to...

2. I'm puzzled as to why you are storing process state and epoch number in the 
modex as well as in the process name and orte_proc_t struct. This creates a bit 
of a race condition as the two will be out-of-sync for some (probably small) 
period of time, and looks like unnecessary duplication. Is there some reason 
for doing this? We are trying to eliminate duplicate storage because of the 
data confusion and memory issues, hence my question.

3. As a follow-on to #2, I am bothered that we now have the ESS storing proc 
state. That isn't the functional purpose of the ESS - that's a PLM function. Is 
there some reason for doing this in the ESS? Why aren't we just looking at the 
orte_proc_t for that proc and using its state field? I guess I can understand 
if you want to get that via an API (instead of having code to look up the proc_t 
in multiple places), but then let's put it in the PLM please. I note that it is 
only used in the binomial routing code, so why not just put a static function 
in there to get the state of a proc rather than creating another API? (A sketch 
of such a helper follows below, after point 6.)

4. ess_base_open.c: the default orte_ess module appears to be missing an entry 
for proc_set_epoch.

5. I really don't think that notification of proc failure belongs in the 
orted_comm - messages notifying of proc failure should be received in the 
errmgr. This gives people who want to handle things differently (e.g., orcm) 
the ability to create their own errmgr component(s) for daemons and HNP that 
send the messages over their desired messaging system, decide how they want to 
respond, etc. Putting it in orted_comm forces everyone to use only this one 
method, which conflicts with allowing freedom for others to explore alternative 
methods, and frankly, I don't see any strong reason that outweighs that 
limitation.

6. I don't think this errmgr_fault_callback registration is going to work, per 
my response to Josh's RFC. I'll leave the discussion in that thread.
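
Regarding the static-function alternative in point 3, such a helper might look roughly like this - a sketch that assumes the usual orte_get_job_data_object()/orte_proc_t lookup (header path and constant names from memory) and that the daemon actually has an orte_proc_t for the peer in question:

-
#include "orte/runtime/orte_globals.h"   /* orte_get_job_data_object(), orte_job_t,
                                            orte_proc_t - paths assumed */

/* Kept static inside the binomial routed component, per the suggestion,
 * instead of adding a new ESS API. */
static orte_proc_state_t get_proc_state(orte_process_name_t *proc)
{
    orte_job_t  *jdata;
    orte_proc_t *pdata;

    if (NULL == (jdata = orte_get_job_data_object(proc->jobid))) {
        return ORTE_PROC_STATE_UNDEF;
    }
    pdata = (orte_proc_t*) opal_pointer_array_get_item(jdata->procs, proc->vpid);
    if (NULL == pdata) {
        return ORTE_PROC_STATE_UNDEF;
    }
    return pdata->state;
}
-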


On Jun 6, 2011, at 1:00 PM, George Bosilca wrote:

> WHAT: Allow the runtime to handle fail-stop failures for both runtime 
> (daemon) and application-level processes. This patch extends the 
> orte_process_name_t structure with a field to store the process epoch (the 
> number of times it has died so far), and adds an application failure 
> notification callback function to be registered in the runtime. 
> 
> WHY: Necessary to correctly implement the error handling in the MPI 2.2 
> standard. In addition, such a resilient runtime is a cornerstone for any 
> level of fault tolerance support we want to provide in the future (such as 
> the MPI-3 Run-Through Stabilization or FT-MPI).
> 
> WHEN:
> 
> WHERE: Patch attached to this email, based on trunk r24747.
> TIMEOUT: 2 weeks from now, on Monday 20 June.
> 
> --
> 
> MORE DETAILS:
> 
> Currently the infrastructure required to enable any kind of fault tolerance 
> development in Open MPI (with the exception of the checkpoint/restart) is 
> missing. However, before developing any fault tolerant support at the 
> application (MPI) level, we need to have a resilient runtime. The changes in 
> this patch address this lack of support and would allow anyone to implement a 
> fault tolerance protocol at the MPI layer without having to worry about the 
> ORTE stabilization.
> 
> This patch will allow the runtime to drop any dead daemons, and re-route all 
> communications around the holes in order to __ALWAYS__ deliver a message as 
> long as the destination process is alive. The application is informed (via a 
> callback) about the loss of the processes with the same jobid. In this patch 
> we do not address the MPI_ERROR_RETURN type of failures, we focused on the 
> MPI_ERROR_ABORT ones. Moreover, we empowered the application level with the 
> decision, instead of taking it down in the runtime.
> 
> NEW STUFF:
> 
> Epoch - A counter that tracks the number of times a process has been detected 
> to have terminated, either from a failure or an expected termination. After 
> the termination is detected, the HNP coordinates all other process’s 
> knowledge of the new epoch. Each ORTED will know the epoch of the other 
> processes in the job, but it will not actually store anything until the 
> epochs change. 
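
In terms of the data structure, the change being described is roughly the following (a sketch assuming the existing jobid/vpid fields; the exact field type and name are whatever the patch defines):

-
typedef uint32_t orte_epoch_t;        /* assumed width, for illustration only */

typedef struct {
    orte_jobid_t jobid;               /* unchanged */
    orte_vpid_t  vpid;                /* unchanged */
    orte_epoch_t epoch;               /* new: bumped each time this process is
                                         declared terminated, so a restarted rank
                                         is distinguishable from the failed
                                         incarnation it replaces */
} orte_process_name_t;
-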
> 
> Run-Through Stabilization - When an ORTED (or HNP) detects that another 
> process has terminated, it repairs the routing layer and informs the HNP. The 
> HNP tells all other 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
>
> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>
>>
>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>
>>> Well, you're way too trusty. ;)
>>
>> It's the midwestern boy in me :)
>
> Still need to shake that corn out of your head... :-)
>
>>
>>>
>>> This only works if all components play the game, and even then it is 
>>> difficult if you want to allow components to deregister themselves in the 
>>> middle of the execution. The problem is that a callback will be the "previous" 
>>> for some component, and that when you want to remove a callback you have to 
>>> inform the "next" component on the callback chain to change its "previous".
>>
>> This is a fair point. I think hiding the ordering of callbacks in the errmgr 
>> could be dangerous since it takes control from the upper layers, but, 
>> conversely, trusting the upper layers to 'do the right thing' with the 
>> previous callback is probably too optimistic, esp. for layers that are not 
>> designed together.
>>
>> To that I would suggest that you leave the code as is - registering a 
>> callback overwrites the existing callback. That will allow me to replace the 
>> default OMPI callback when I am able to in MPI_Init, and, if I need to, swap 
>> back in the default version at MPI_Finalize.
>>
>> Does that sound like a reasonable way forward on this design point?
>
> It doesn't solve the problem that George alluded to - just because you 
> overwrite the callback, it doesn't mean that someone else won't overwrite you 
> when their component initializes. Only the last one wins - the rest of you 
> lose.
>
> I'm not sure how you guarantee that you win, which is why I'm unclear how 
> this callback can really work unless everyone agrees that only one place gets 
> it. Put that callback in a base function of a new error handling framework, 
> and then let everyone create components within that for handling desired 
> error responses?

Yep, that is a problem, but one that we can deal with in the immediate
case. Since OMPI is the only layer registering the callback, when I
replace it in OMPI I will have to make sure that no other place in
OMPI replaces the callback.

If at some point we need more than one callback above ORTE then we may
want to revisit this point. But since we only have one layer on top of
ORTE, it is the responsibility of that layer to be internally
consistent with regard to which callback it wants to be triggered.

If the layers above ORTE want more than one callback I would suggest
that that layer design some mechanism for coordinating these multiple
- possibly conflicting - callbacks (by the way this is policy
management, which can get complex fast as you add more interested
parties). Meaning that if OMPI wanted multiple callbacks to be active
at the same time, then OMPI would create a mechanism for managing
these callbacks, not ORTE. ORTE should just have one callback provided
to the upper layer, and keep it -simple-. If the upper layer wants to
toy around with something more complex it must manage the complexity
instead of artificially pushing it down to the ORTE layer.

-- Josh

>>
>> -- Josh
>>
>>>
>>> george.
>>>
>>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>>
 So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
 -
 orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
 -

 Which is a callback that just calls abort (which is what we want to do
 by default):
 -
 void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
 }
 -

 This is what I want to replace. I do -not- want ompi to abort just
 because a process failed. So I need a way to replace or remove this
 callback, and put in my own callback that 'does the right thing'.

 The current patch allows me to overwrite the callback when I call:
 -
 orte_errmgr.set_fault_callback(&my_callback);
 -
 Which is fine with me.

 At the point I do not want my_callback to be active any more (say in
 MPI_Finalize) I would like to replace it with the old callback. To do
 so, with the patch's interface, I would have to know what the previous
 callback was and do:
 -
 orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
 -

 This comes at a slight maintenance burden since now there will be two
 places in the code that must explicitly reference
 'ompi_errhandler_runtime_callback' - if it ever changed then both
 sites would have to be updated.


 If you use the 'sigaction-like' interface then upon registration I
 would get the previous handler back (which would point to
 'ompi_errhandler_runtime_callback'), and I can store it for later:
 -
 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
Something else you might want to address in here: the current code sends an RML 
message from the proc calling abort to its local daemon telling the daemon that 
we are exiting due to the app calling "abort". We needed to do this because we 
wanted to flag the proc termination as one induced by the app itself as opposed 
to something like a segfault or termination by signal.

However, the problem is that the app may be calling abort from within an event 
handler. Hence, the RML send (which is currently blocking) will never complete 
once we no longer allow event lib recursion (coming soon). If we use a 
non-blocking send, then we can't know for sure that the message has been sent 
before we terminate.

What we need is a non-messaging way of communicating that this was an ordered 
abort as opposed to a segfault or other failure. Prior to the current method, 
we had the app drop a file that the daemon looked for as an "abort  marker", 
but that was ugly as it sometimes caused us to not properly cleanup the session 
directory tree.
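
For readers who did not see that older code, the "abort marker" scheme amounted to roughly this (hypothetical file name; "session_dir" stands in for however the proc and daemon locate the proc's session directory):

-
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* App side: drop a marker before exiting so the daemon can tell an ordered
 * abort from a crash. */
static void drop_abort_marker(const char *session_dir)
{
    char path[4096];
    FILE *fp;

    snprintf(path, sizeof(path), "%s/aborted", session_dir);
    if (NULL != (fp = fopen(path, "w"))) {
        fclose(fp);
    }
}

/* Daemon side: when the child exits, check for the marker to decide how to
 * report the termination. */
static bool proc_called_abort(const char *session_dir)
{
    char path[4096];
    snprintf(path, sizeof(path), "%s/aborted", session_dir);
    return 0 == access(path, F_OK);
}
-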

I'm open to suggestion - perhaps it isn't actually all that critical for us to 
distinguish "aborted by call to abort" from "aborted by signal", and we can 
just have the app commit suicide via self-imposed SIGKILL? It is only the 
message output  to the user at the end of the job that differs - and since 
MPI_Abort already provides a message indicating "we called abort", is it really 
necessary that we have orte aware of that distinction?


On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:

> 
> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
> 
>> Well, you're way too trusty. ;)
> 
> It's the midwestern boy in me :)
> 
>> 
>> This only works if all components play the game, and even then it is 
>> difficult if you want to allow components to deregister themselves in the 
>> middle of the execution. The problem is that a callback will be the "previous" 
>> for some component, and that when you want to remove a callback you have to 
>> inform the "next" component on the callback chain to change its "previous".
> 
> This is a fair point. I think hiding the ordering of callbacks in the errmgr 
> could be dangerous since it takes control from the upper layers, but, 
> conversely, trusting the upper layers to 'do the right thing' with the 
> previous callback is probably too optimistic, esp. for layers that are not 
> designed together.
> 
> To that I would suggest that you leave the code as is - registering a 
> callback overwrites the existing callback. That will allow me to replace the 
> default OMPI callback when I am able to in MPI_Init, and, if I need to, swap 
> back in the default version at MPI_Finalize.
> 
> Does that sound like a reasonable way forward on this design point?
> 
> -- Josh
> 
>> 
>> george.
>> 
>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>> 
>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>> -
>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>> -
>>> 
>>> Which is a callback that just calls abort (which is what we want to do
>>> by default):
>>> -
>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>> }
>>> -
>>> 
>>> This is what I want to replace. I do -not- want ompi to abort just
>>> because a process failed. So I need a way to replace or remove this
>>> callback, and put in my own callback that 'does the right thing'.
>>> 
>>> The current patch allows me to overwrite the callback when I call:
>>> -
>>> orte_errmgr.set_fault_callback(&my_callback);
>>> -
>>> Which is fine with me.
>>> 
>>> At the point I do not want my_callback to be active any more (say in
>>> MPI_Finalize) I would like to replace it with the old callback. To do
>>> so, with the patch's interface, I would have to know what the previous
>>> callback was and do:
>>> -
>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>> -
>>> 
>>> This comes at a slight maintenance burden since now there will be two
>>> places in the code that must explicitly reference
>>> 'ompi_errhandler_runtime_callback' - if it ever changed then both
>>> sites would have to be updated.
>>> 
>>> 
>>> If you use the 'sigaction-like' interface then upon registration I
>>> would get the previous handler back (which would point to
>>> 'ompi_errhandler_runtime_callback'), and I can store it for later:
>>> -
>>> orte_errmgr.set_fault_callback(&my_callback, prev_callback);
>>> -
>>> 
>>> And when it comes time to deregister my callback all I need to do is
>>> replace it with the previous callback - which I have a reference to,
>>> but do not need the explicit name of (passing NULL as the second
>>> argument tells the registration function that I don't care about the
>>> current callback):
>>> -
>>> orte_errmgr.set_fault_callback(&prev_callback, NULL);
>>> -
>>> 
>>> 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:

> 
> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
> 
>> Well, you're way too trusty. ;)
> 
> It's the midwestern boy in me :)

Still need to shake that corn out of your head... :-)

> 
>> 
>> This only works if all components play the game, and even then it is 
>> difficult if you want to allow components to deregister themselves in the 
>> middle of the execution. The problem is that a callback will be the "previous" 
>> for some component, and that when you want to remove a callback you have to 
>> inform the "next" component on the callback chain to change its "previous".
> 
> This is a fair point. I think hiding the ordering of callbacks in the errmgr 
> could be dangerous since it takes control from the upper layers, but, 
> conversely, trusting the upper layers to 'do the right thing' with the 
> previous callback is probably too optimistic, esp. for layers that are not 
> designed together.
> 
> To that I would suggest that you leave the code as is - registering a 
> callback overwrites the existing callback. That will allow me to replace the 
> default OMPI callback when I am able to in MPI_Init, and, if I need to, swap 
> back in the default version at MPI_Finalize.
> 
> Does that sound like a reasonable way forward on this design point?

It doesn't solve the problem that George alluded to - just because you 
overwrite the callback, it doesn't mean that someone else won't overwrite you 
when their component initializes. Only the last one wins - the rest of you lose.

I'm not sure how you guarantee that you win, which is why I'm unclear how this 
callback can really work unless everyone agrees that only one place gets it. 
Put that callback in a base function of a new error handling framework, and 
then let everyone create components within that for handling desired error 
responses?
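
A rough sketch of that shape - the base owns the single ORTE-level callback and fans out to whatever response components registered with it (hypothetical names, a plain array instead of the real MCA machinery):

-
#define MAX_ERR_COMPONENTS 8

typedef struct {
    const char *name;
    void (*proc_fault)(orte_process_name_t *proc);  /* component's response */
} err_response_component_t;

static err_response_component_t *err_components[MAX_ERR_COMPONENTS];
static int num_err_components = 0;

/* Components register here during their open/init, instead of fighting over
 * orte_errmgr.set_fault_callback(). */
static int err_base_register(err_response_component_t *comp)
{
    if (num_err_components >= MAX_ERR_COMPONENTS) {
        return -1;
    }
    err_components[num_err_components++] = comp;
    return 0;
}

/* This one base function is what gets handed to ORTE as THE fault callback;
 * the policy about who reacts, and in what order, lives up here. */
static void err_base_fault_callback(orte_process_name_t *proc)
{
    int i;
    for (i = 0; i < num_err_components; ++i) {
        err_components[i]->proc_fault(proc);
    }
}
-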


> 
> -- Josh
> 
>> 
>> george.
>> 
>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>> 
>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>> -
>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>> -
>>> 
>>> Which is a callback that just calls abort (which is what we want to do
>>> by default):
>>> -
>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>> }
>>> -
>>> 
>>> This is what I want to replace. I do -not- want ompi to abort just
>>> because a process failed. So I need a way to replace or remove this
>>> callback, and put in my own callback that 'does the right thing'.
>>> 
>>> The current patch allows me to overwrite the callback when I call:
>>> -
>>> orte_errmgr.set_fault_callback(&my_callback);
>>> -
>>> Which is fine with me.
>>> 
>>> At the point I do not want my_callback to be active any more (say in
>>> MPI_Finalize) I would like to replace it with the old callback. To do
>>> so, with the patch's interface, I would have to know what the previous
>>> callback was and do:
>>> -
>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>> -
>>> 
>>> This comes at a slight maintenance burden since now there will be two
>>> places in the code that must explicitly reference
>>> 'ompi_errhandler_runtime_callback' - if it ever changed then both
>>> sites would have to be updated.
>>> 
>>> 
>>> If you use the 'sigaction-like' interface then upon registration I
>>> would get the previous handler back (which would point to
>>> 'ompi_errhandler_runtime_callback'), and I can store it for later:
>>> -
>>> orte_errmgr.set_fault_callback(&my_callback, prev_callback);
>>> -
>>> 
>>> And when it comes time to deregister my callback all I need to do is
>>> replace it with the previous callback - which I have a reference to,
>>> but do not need the explicit name of (passing NULL as the second
>>> argument tells the registration function that I don't care about the
>>> current callback):
>>> -
>>> orte_errmgr.set_fault_callback(&prev_callback, NULL);
>>> -
>>> 
>>> 
>>> So the API in the patch is fine, and I can work with it. I just
>>> suggested that it might be slightly better to return the previous
>>> callback (as is done in other standard interfaces - e.g., sigaction)
>>> in case we wanted to do something with it later.
>>> 
>>> 
>>> What seems to be proposed now is making the errmgr keep a list of all
>>> registered callbacks and call them in some order. This seems odd, and
>>> definitely more complex. Maybe it was just not well explained.
>>> 
>>> Maybe that is just the "computer scientist" in me :)
>>> 
>>> -- Josh
>>> 
>>> 
>>> On Thu, Jun 9, 2011 at 1:05 PM, Ralph Castain  wrote:
 You mean you want the abort API to point somewhere else, without using a 
 new
 component?
 Perhaps a telecon would help resolve this quicker? I'm available tomorrow 
 or
 anytime next week, if 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Joshua Hursey

On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:

> Well, you're way too trusty. ;)

It's the midwestern boy in me :)

> 
> This only works if all components play the game, and even then it is 
> difficult if you want to allow components to deregister themselves in the 
> middle of the execution. The problem is that a callback will be the "previous" 
> for some component, and that when you want to remove a callback you have to 
> inform the "next" component on the callback chain to change its "previous".

This is a fair point. I think hiding the ordering of callbacks in the errmgr 
could be dangerous since it takes control from the upper layers, but, 
conversely, trusting the upper layers to 'do the right thing' with the previous 
callback is probably too optimistic, esp. for layers that are not designed 
together.

To that I would suggest that you leave the code as is - registering a callback 
overwrites the existing callback. That will allow me to replace the default 
OMPI callback when I am able to in MPI_Init, and, if I need to, swap back in 
the default version at MPI_Finalize.

Does that sound like a reasonable way forward on this design point?
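
In code, the arrangement being suggested comes down to something like this (my_ft_callback is a placeholder; set_fault_callback and ompi_errhandler_runtime_callback are the names already quoted in this thread):

-
/* In MPI_Init, after the default handler has been installed: */
orte_errmgr.set_fault_callback(&my_ft_callback);   /* overwrite the default abort behavior */

/* ... application runs; my_ft_callback handles reported process faults ... */

/* In MPI_Finalize, hand control back to the default abort behavior: */
orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
-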

-- Josh

> 
>  george.
> 
> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
> 
>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>> -
>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>> -
>> 
>> Which is a callback that just calls abort (which is what we want to do
>> by default):
>> -
>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>   ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>> }
>> -
>> 
>> This is what I want to replace. I do -not- want ompi to abort just
>> because a process failed. So I need a way to replace or remove this
>> callback, and put in my own callback that 'does the right thing'.
>> 
>> The current patch allows me to overwrite the callback when I call:
>> -
>> orte_errmgr.set_fault_callback(&my_callback);
>> -
>> Which is fine with me.
>> 
>> At the point I do not want my_callback to be active any more (say in
>> MPI_Finalize) I would like to replace it with the old callback. To do
>> so, with the patch's interface, I would have to know what the previous
>> callback was and do:
>> -
>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>> -
>> 
>> This comes at a slight maintenance burden since now there will be two
>> places in the code that must explicitly reference
>> 'ompi_errhandler_runtime_callback' - if it ever changed then both
>> sites would have to be updated.
>> 
>> 
>> If you use the 'sigaction-like' interface then upon registration I
>> would get the previous handler back (which would point to
>> 'ompi_errhandler_runtime_callback'), and I can store it for later:
>> -
>> orte_errmgr.set_fault_callback(&my_callback, prev_callback);
>> -
>> 
>> And when it comes time to deregister my callback all I need to do is
>> replace it with the previous callback - which I have a reference to,
>> but do not need the explicit name of (passing NULL as the second
>> argument tells the registration function that I don't care about the
>> current callback):
>> -
>> orte_errmgr.set_fault_callback(&prev_callback, NULL);
>> -
>> 
>> 
>> So the API in the patch is fine, and I can work with it. I just
>> suggested that it might be slightly better to return the previous
>> callback (as is done in other standard interfaces - e.g., sigaction)
>> in case we wanted to do something with it later.
>> 
>> 
>> What seems to be proposed now is making the errmgr keep a list of all
>> registered callbacks and call them in some order. This seems odd, and
>> definitely more complex. Maybe it was just not well explained.
>> 
>> Maybe that is just the "computer scientist" in me :)
>> 
>> -- Josh
>> 
>> 
>> On Thu, Jun 9, 2011 at 1:05 PM, Ralph Castain  wrote:
>>> You mean you want the abort API to point somewhere else, without using a new
>>> component?
>>> Perhaps a telecon would help resolve this quicker? I'm available tomorrow or
>>> anytime next week, if that helps.
>>> 
>>> On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey  wrote:
 
 As long as there is the ability to remove and replace a callback I'm
 fine. I personally think that forcing the errmgr to track ordering of
 callback registration makes it a more complex solution, but as long as
 it works.
 
 In particular I need to replace the default 'abort' errmgr call in
 OMPI with something else. If both are called, then this does not help
 me at all - since the abort behavior will be activated either before
 or after my callback. So can you explain how I would do that with the
 current or the proposed interface?
 
 -- Josh
 
 On Thu, Jun 9, 2011 at 12:54 PM, Ralph Castain  wrote:
> I agree - 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Josh Hursey
So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
-
orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
-

Which is a callback that just calls abort (which is what we want to do
by default):
-
void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
}
-

This is what I want to replace. I do -not- want ompi to abort just
because a process failed. So I need a way to replace or remove this
callback, and put in my own callback that 'does the right thing'.

The current patch allows me to overwrite the callback when I call:
-
orte_errmgr.set_fault_callback(&my_callback);
-
Which is fine with me.

At the point I do not want my_callback to be active any more (say in
MPI_Finalize) I would like to replace it with the old callback. To do
so, with the patch's interface, I would have to know what the previous
callback was and do:
-
orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
-

This comes at a slight maintenance burden since now there will be two
places in the code that must explicitly reference
'ompi_errhandler_runtime_callback' - if it ever changed then both
sites would have to be updated.


If you use the 'sigaction-like' interface then upon registration I
would get the previous handler back (which would point to
'ompi_errhandler_runtime_callback'), and I can store it for later:
-
orte_errmgr.set_fault_callback(&my_callback, prev_callback);
-

And when it comes time to deregister my callback all I need to do is
replace it with the previous callback - which I have a reference to,
but do not need the explicit name of (passing NULL as the second
argument tells the registration function that I don't care about the
current callback):
-
orte_errmgr.set_fault_callback(&prev_callback, NULL);
-


So the API in the patch is fine, and I can work with it. I just
suggested that it might be slightly better to return the previous
callback (as is done in other standard interfaces - e.g., sigaction)
in case we wanted to do something with it later.
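
On the errmgr side, the sigaction-style variant would only need something like the following (a sketch, not the patch's code; whether the previous handler comes back through an out parameter or a return value is a detail):

-
typedef void (*orte_errmgr_fault_callback_fn_t)(orte_process_name_t *proc);

static orte_errmgr_fault_callback_fn_t current_fault_cb = NULL;

/* Install new_cb.  If prev_cb is non-NULL, hand back whatever was installed
 * before, so the caller can restore it later without knowing its name. */
int set_fault_callback(orte_errmgr_fault_callback_fn_t new_cb,
                       orte_errmgr_fault_callback_fn_t *prev_cb)
{
    if (NULL != prev_cb) {
        *prev_cb = current_fault_cb;
    }
    current_fault_cb = new_cb;
    return ORTE_SUCCESS;
}
-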


What seems to be proposed now is making the errmgr keep a list of all
registered callbacks and call them in some order. This seems odd, and
definitely more complex. Maybe it was just not well explained.


Maybe that is just the "computer scientist" in me :)

-- Josh


On Thu, Jun 9, 2011 at 1:05 PM, Ralph Castain  wrote:
> You mean you want the abort API to point somewhere else, without using a new
> component?
> Perhaps a telecon would help resolve this quicker? I'm available tomorrow or
> anytime next week, if that helps.
>
> On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey  wrote:
>>
>> As long as there is the ability to remove and replace a callback I'm
>> fine. I personally think that forcing the errmgr to track ordering of
>> callback registration makes it a more complex solution, but as long as
>> it works.
>>
>> In particular I need to replace the default 'abort' errmgr call in
>> OMPI with something else. If both are called, then this does not help
>> me at all - since the abort behavior will be activated either before
>> or after my callback. So can you explain how I would do that with the
>> current or the proposed interface?
>>
>> -- Josh
>>
>> On Thu, Jun 9, 2011 at 12:54 PM, Ralph Castain  wrote:
>> > I agree - let's not get overly complex unless we can clearly articulate
>> > a
>> > requirement to do so.
>> >
>> > On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca 
>> > wrote:
>> >>
>> >> This will require exactly opposite registration and de-registration
>> >> order,
>> >> or no de-registration at all (aka no way to unload a component). Or
>> >> some
>> >> even more complex code to deal with internally.
>> >>
>> >> If the error manager handles the callbacks it can use the registration
>> >> ordering (which is what the approach can do), and can enforce that
>> >> all callbacks will be called. I would rather prefer this approach.
>> >>
>> >>  george.
>> >>
>> >> On Jun 9, 2011, at 08:36 , Josh Hursey wrote:
>> >>
>> >> > I would prefer returning the previous callback instead of relying on
>> >> > the errmgr to get the ordering right. Additionally, when I want to
>> >> > unregister (or replace) a callback it is easier to do that with a
>> >> > single interface than to introduce a new one to remove a particular
>> >> > callback.
>> >> > Register:
>> >> >  ompi_errmgr.set_fault_callback(my_callback, prev_callback);
>> >> > Deregister:
>> >> >  ompi_errmgr.set_fault_callback(prev_callback, old_callback);
>> >> > or to eliminate all callbacks (if you needed that for some reason):
>> >> >  ompi_errmgr.set_fault_callback(NULL, old_callback);
>> >>
>> >>

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Ralph Castain
You mean you want the abort API to point somewhere else, without using a new
component?

Perhaps a telecon would help resolve this quicker? I'm available tomorrow or
anytime next week, if that helps.

On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey  wrote:

> As long as there is the ability to remove and replace a callback I'm
> fine. I personally think that forcing the errmgr to track ordering of
> callback registration makes it a more complex solution, but as long as
> it works.
>
> In particular I need to replace the default 'abort' errmgr call in
> OMPI with something else. If both are called, then this does not help
> me at all - since the abort behavior will be activated either before
> or after my callback. So can you explain how I would do that with the
> current or the proposed interface?
>
> -- Josh
>
> On Thu, Jun 9, 2011 at 12:54 PM, Ralph Castain  wrote:
> > I agree - let's not get overly complex unless we can clearly articulate a
> > requirement to do so.
> >
> > On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca 
> > wrote:
> >>
> >> This will require exactly opposite registration and de-registration
> order,
> >> or no de-registration at all (aka no way to unload a component). Or some
> >> even more complex code to deal with internally.
> >>
> >> If the error manager handles the callbacks it can use the registration
> >> ordering (which is what the approach can do), and can enforce that
> >> all callbacks will be called. I would rather prefer this approach.
> >>
> >>  george.
> >>
> >> On Jun 9, 2011, at 08:36 , Josh Hursey wrote:
> >>
> >> > I would prefer returning the previous callback instead of relying on
> >> > the errmgr to get the ordering right. Additionally, when I want to
> >> > unregister (or replace) a callback it is easier to do that with a
> >> > single interface than to introduce a new one to remove a particular
> >> > callback.
> >> > Register:
> >> >  ompi_errmgr.set_fault_callback(my_callback, prev_callback);
> >> > Deregister:
> >> >  ompi_errmgr.set_fault_callback(prev_callback, old_callback);
> >> > or to eliminate all callbacks (if you needed that for some reason):
> >> >  ompi_errmgr.set_fault_callback(NULL, old_callback);
> >>
> >>
> >
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
>


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Josh Hursey
As long as there is the ability to remove and replace a callback I'm
fine. I personally think that forcing the errmgr to track ordering of
callback registration makes it a more complex solution, but as long as
it works.

In particular I need to replace the default 'abort' errmgr call in
OMPI with something else. If both are called, then this does not help
me at all - since the abort behavior will be activated either before
or after my callback. So can you explain how I would do that with the
current or the proposed interface?

-- Josh

On Thu, Jun 9, 2011 at 12:54 PM, Ralph Castain  wrote:
> I agree - let's not get overly complex unless we can clearly articulate a
> requirement to do so.
>
> On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca 
> wrote:
>>
>> This will require exactly opposite registration and de-registration order,
>> or no de-registration at all (aka no way to unload a component). Or some
>> even more complex code to deal with internally.
>>
>> If the error manager handles the callbacks it can use the registration
>> ordering (which is what the approach can do), and can enforce that
>> all callbacks will be called. I would rather prefer this approach.
>>
>>  george.
>>
>> On Jun 9, 2011, at 08:36 , Josh Hursey wrote:
>>
>> > I would prefer returning the previous callback instead of relying on
>> > the errmgr to get the ordering right. Additionally, when I want to
>> > unregister (or replace) a callback it is easier to do that with a
>> > single interface than to introduce a new one to remove a particular
>> > callback.
>> > Register:
>> >  ompi_errmgr.set_fault_callback(my_callback, prev_callback);
>> > Deregister:
>> >  ompi_errmgr.set_fault_callback(prev_callback, old_callback);
>> > or to eliminate all callbacks (if you needed that for some reason):
>> >  ompi_errmgr.set_fault_callback(NULL, old_callback);
>>
>>
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey



Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Ralph Castain
I agree - let's not get overly complex unless we can clearly articulate a
requirement to do so.

On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca wrote:

> This will require exactly opposite registration and de-registration order,
> or no de-registration at all (aka no way to unload a component). Or some
> even more complex code to deal with internally.
>
> If the error manager handles the callbacks it can use the registration
> ordering (which is what the approach can do), and can enforce that
> all callbacks will be called. I would rather prefer this approach.
>
>  george.
>
> On Jun 9, 2011, at 08:36 , Josh Hursey wrote:
>
> > I would prefer returning the previous callback instead of relying on
> > the errmgr to get the ordering right. Additionally, when I want to
> > unregister (or replace) a callback it is easier to do that with a
> > single interface than to introduce a new one to remove a particular
> > callback.
> > Register:
> >  ompi_errmgr.set_fault_callback(my_callback, prev_callback);
> > Deregister:
> >  ompi_errmgr.set_fault_callback(prev_callback, old_callback);
> > or to eliminate all callbacks (if you needed that for some reason):
> >  ompi_errmgr.set_fault_callback(NULL, old_callback);
>
>
>


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread George Bosilca
This will require exactly opposite registration and de-registration order, or 
no de-registration at all (aka no way to unload a component). Or some even more 
complex code to deal with internally.

If the error manager handles the callbacks it can use the registration ordering
(which is what the approach can do), and can enforce that all
callbacks will be called. I would rather prefer this approach.

  george.

On Jun 9, 2011, at 08:36 , Josh Hursey wrote:

> I would prefer returning the previous callback instead of relying on
> the errmgr to get the ordering right. Additionally, when I want to
> unregister (or replace) a callback it is easier to do that with a
> single interface than to introduce a new one to remove a particular
> callback.
> Register:
>  ompi_errmgr.set_fault_callback(my_callback, prev_callback);
> Deregister:
>  ompi_errmgr.set_fault_callback(prev_callback, old_callback);
> or to eliminate all callbacks (if you needed that for some reason):
>  ompi_errmgr.set_fault_callback(NULL, old_callback);




Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Josh Hursey
On Wed, Jun 8, 2011 at 5:37 PM, Wesley Bland  wrote:
> On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote:
>
> - orte_errmgr.post_startup() starts the persistent RML message. There
> does not seem to be a shutdown version of this (to deregister the RML
> message at orte_finalize time). Was this intentional, or just missed?
>
>  I just missed that one. I've added that into the code now.

Cool.

>
> - in the orte_errmgr.set_fault_callback: it would be nice if it
> returned the previous callback, so you could layer more than one
> 'thing' on top of ORTE and have them chain in a sigaction-like manner.
>
>  Again, you are correct. Rather than just returning the previous callback
> (if any) I think it makes more sense to maintain a list of callbacks and
> have the errmgr call them directly. That way applications/ompi layers don't
> have to worry about calling another callback function.

I would prefer returning the previous callback instead of relying on
the errmgr to get the ordering right. Additionally, when I want to
unregister (or replace) a callback it is easier to do that with a
single interface than to introduce a new one to remove a particular
callback.
Register:
  ompi_errmgr.set_fault_callback(my_callback, prev_callback);
Deregister:
  ompi_errmgr.set_fault_callback(prev_callback, old_callback);
or to eliminate all callbacks (if you needed that for some reason):
  ompi_errmgr.set_fault_callback(NULL, old_callback);
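
And if a layer really does want chaining rather than replacement, it can build
that itself on top of the same interface; a rough sketch, assuming the previous
callback was captured at registration time (names illustrative only):

  static orte_errmgr_fault_callback_t prev_callback = NULL;

  static void my_callback(orte_process_name_t *proc)
  {
      /* Handle the failure of 'proc' for this layer first... */

      /* ...then chain to whoever was registered before us (for example the
       * default OMPI abort callback), sigaction-style. */
      if (NULL != prev_callback) {
          prev_callback(proc);
      }
  }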



>
> - orte_process_info.max_procs: this seems to be only used in the
> binomial routed, but I was a bit unclear about its purpose. Can you
> describe what it does, and how it is used?
>
> I use this to determine how many processes were in the job before we started
> having failures. This helps me preserve the structure of the tree as much as
> possible rather than completely reorganizing the routing layer every time a
> process fails.

Sounds fine, I was just curious.

Reorganizing the routing layer after every process failure has some
race issues with multiple rolling failures, so preserving the original
routing tree and rerouting is probably best for this situation. We can
revisit this later for more performance-preserving techniques, but that is not
really something that needs to be addressed now.

>
> - in orted_comm.c: you process the ORTE_PROCESS_FAILED_NOTIFICATION
> message here. Why not push all of that logic into the errmgr
> components? It is not a big deal, just curious.
>
> Most of the actual logic that handles the processing of the error messages
> is pushed into the errmgr component. The code you see in orted_comm.c is
> almost all parsing and resending the list of dead processes to the
> appropriate modules. That code will have to be in there no matter what.
> I've updated the code and checked it into a bitbucket repository which can
> be found here:
> https://bitbucket.org/wesbland/resilient-orte/

Awesome. Thanks,
Josh

> Please let me know of any more comments,
> Wesley
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey



Re: [OMPI devel] RFC: Resilient ORTE

2011-06-08 Thread George Bosilca
Well well well, that wasn't supposed to go on the mailing list ;)

 george


On Jun 8, 2011, at 17:43 , George Bosilca wrote:

> Hey if you want to push to the extreme the logic of the "computer scientist" 
> you were talking about in my office, then return the previous callback and 
> let the upper layer do the right thing. Suppose they don't screw up for once 
> ...
> 
>  george.
> 
> On Jun 8, 2011, at 17:37 , Wesley Bland wrote:
> 
>> Again, you are correct. Rather than just returning the previous callback (if 
>> any) I think it makes more sense to maintain a list of callbacks and have 
>> the errmgr call them directly. That way applications/ompi layers don't have 
>> to worry about calling another callback function.
>>> 
> 




Re: [OMPI devel] RFC: Resilient ORTE

2011-06-08 Thread Wesley Bland
On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote:

- orte_errmgr.post_startup() starts the persistent RML message. There
does not seem to be a shutdown version of this (to deregister the RML
message at orte_finalize time). Was this intentional, or just missed?

 I just missed that one. I've added that into the code now.

- in the orte_errmgr.set_fault_callback: it would be nice if it
returned the previous callback, so you could layer more than one
'thing' on top of ORTE and have them chain in a sigaction-like manner.

 Again, you are correct. Rather than just returning the previous callback
(if any) I think it makes more sense to maintain a list of callbacks and
have the errmgr call them directly. That way applications/ompi layers don't
have to worry about calling another callback function.
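
A rough sketch of what I have in mind, just to illustrate the idea (a fixed-size
table here instead of an opal_list_t, and none of these names are the actual
code that will end up in the patch):

  #define MAX_FAULT_CALLBACKS 8
  static orte_errmgr_fault_callback_t fault_callbacks[MAX_FAULT_CALLBACKS];
  static int num_fault_callbacks = 0;

  int orte_errmgr_register_fault_callback(orte_errmgr_fault_callback_t cb)
  {
      if (num_fault_callbacks >= MAX_FAULT_CALLBACKS) {
          return ORTE_ERR_OUT_OF_RESOURCE;
      }
      fault_callbacks[num_fault_callbacks++] = cb;
      return ORTE_SUCCESS;
  }

  /* Called by the errmgr when a failure notification arrives: every
   * registered callback is invoked, in registration order. */
  static void invoke_fault_callbacks(orte_process_name_t *proc)
  {
      int i;
      for (i = 0; i < num_fault_callbacks; ++i) {
          fault_callbacks[i](proc);
      }
  }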

- orte_process_info.max_procs: this seems to be only used in the
binomial routed, but I was a bit unclear about its purpose. Can you
describe what it does, and how it is used?

I use this to determine how many processes were in the job before we started
having failures. This helps me preserve the structure of the tree as much as
possible rather than completely reorganizing the routing layer every time a
process fails.
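
As a simplified illustration of the idea (the helpers here are made up, not the
actual routed/binomial code): the tree relationships are still computed over the
original max_procs, and a dead rank is simply skipped by climbing to the next
live ancestor.

  /* The tree shape is fixed by the original job size (max_procs); only the
   * choice of next hop changes once a rank is known to be dead. */
  static orte_vpid_t next_live_ancestor(orte_vpid_t vpid)
  {
      orte_vpid_t hop = binomial_parent(vpid);     /* parent in the original tree */
      while (ORTE_VPID_INVALID != hop && !proc_is_alive(hop)) {
          hop = binomial_parent(hop);              /* route around the hole */
      }
      return hop;
  }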

- in orted_comm.c: you process the ORTE_PROCESS_FAILED_NOTIFICATION
message here. Why not push all of that logic into the errmgr
components? It is not a big deal, just curious.

Most of the actual logic that handles the processing of the error messages
is pushed into the errmgr component. The code you see in orted_comm.c is
almost all parsing and resending the list of dead processes to the
appropriate modules. That code will have to be in there no matter what.

I've updated the code and checked it into a bitbucket repository which can
be found here:

https://bitbucket.org/wesbland/resilient-orte/

Please let me know of any more comments,
Wesley


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Thanks - that helps!


On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland  wrote:

>  Definitely we are targeting ORTED failures here. If an ORTED fails then
> any other ORTEDs connected to it will notice and report the failure. Of
> course if the failure is an application process then the ORTED on that node will be
> the only one to detect it.
>
> Also, if an ORTED is lost, all of the applications running underneath it
> are also lost because we have no way to communicate with them anymore.
>
> On Tuesday, June 7, 2011 at 3:14 PM, Ralph Castain wrote:
>
> Quick question: could you please clarify this statement:
>
> ...because more than one ORTED could (and often will) detect the failure.
>
>
> I don't understand how this can be true, except for detecting an ORTED
> failure. Only one orted can detect an MPI process failure, unless you have
> now involved orted's in MPI communications (and I don't believe you did). If
> the HNP directs another orted to restart that proc, and then that
> incarnation fails, then the epoch number -should- increment again, shouldn't
> it?
>
> So are you concerned (re having the HNP mark a proc down multiple times)
> about orted failure detection? In that case, I agree that you can have
> multiple failure detections - we dealt with it differently in orcm, but I
> have no issue with doing it another way. Just helps to know what problem you
> are trying to solve.
>
>


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Josh Hursey
I looked through the patch a bit more today and had a few notes/questions.
- orte_errmgr.post_startup() starts the persistent RML message. There
does not seem to be a shutdown version of this (to deregister the RML
message at orte_finalize time; see the sketch after this list). Was this
intentional, or just missed?
- in the orte_errmgr.set_fault_callback: it would be nice if it
returned the previous callback, so you could layer more than one
'thing' on top of ORTE and have them chain in a sigaction-like manner.
- orte_process_info.max_procs: this seems to be only used in the
binomial routed, but I was a bit unclear about its purpose. Can you
describe what it does, and how it is used?
- in orted_comm.c: you process the ORTE_PROCESS_FAILED_NOTIFICATION
message here. Why not push all of that logic into the errmgr
components? It is not a big deal, just curious.
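
For the first item, something along these lines is what I was picturing for the
missing shutdown half (the tag and function names are placeholders, not the
patch's code):

  /* Counterpart to post_startup(): cancel the persistent receive so that
   * orte_finalize() leaves nothing registered with the RML. */
  void orte_errmgr_base_app_pre_shutdown(void)
  {
      /* Assumes post_startup() posted something like:
       *   orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, ORTE_RML_TAG_FAILURE_NOTICE,
       *                           ORTE_RML_PERSISTENT, failure_notice_recv, NULL);
       */
      orte_rml.recv_cancel(ORTE_NAME_WILDCARD, ORTE_RML_TAG_FAILURE_NOTICE);
  }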

I'll probably send more notes after some more digging and testing of
the code. But the patch is looking good. Good work!

-- Josh

On Tue, Jun 7, 2011 at 10:51 AM, Josh Hursey  wrote:
> I briefly looked over the patch. Excluding the epochs (which we don't
> need now, but will soon) it looks similar to what I have setup on my
> MPI run-through stabilization branch - so it should support that work
> nicely. I'll try to test it this week and send back any other
> comments.
>
> Good work.
>
> Thanks,
> Josh
>
> On Tue, Jun 7, 2011 at 10:46 AM, Wesley Bland  wrote:
>> This could certainly work alongside another ORCM or any other fault
>> detection/prediction/recovery mechanism. Most of the code is just dedicated
>> to keeping the epoch up to date and tracking the status of the processes.
>> The underlying idea was to provide a way for the application to decide what
>> its fault policy would be rather than trying to dictate one in the runtime.
>> If any other layer wanted to register a callback function with this code, it
>> could do anything it wanted to on top of it.
>> Wesley
>>
>> On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
>>
>> I'm on travel this week, but will look this over when I return. From the
>> description, it sounds nearly identical to what we did in ORCM, so I expect
>> there won't be many issues. You do get some race conditions that the new
>> state machine code should help resolve.
>> Only difference I can quickly see is that we chose not to modify the process
>> name structure, keeping the "epoch" (we called it "incarnation") as a
>> separate value. Since we aren't terribly concerned about backward
>> compatibility, I don't consider this a significant issue - but something the
>> community should recognize.
>> My main concern will be to ensure that the new code contains enough
>> flexibility to allow integration with other layers such as ORCM without
>> creating potential conflict over "double protection" - i.e., if the layer
>> above ORTE wants to provide a certain level of fault protection, then ORTE
>> needs to get out of the way.
>>
>> On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca  wrote:
>>
>> WHAT: Allow the runtime to handle fail-stop failures for both runtime
>> (daemons) or application level processes. This patch extends the
>> orte_process_name_t structure with a field to store the process epoch (the
>> number of times it died so far), and adds an application failure notification
>> callback function to be registered in the runtime.
>>
>> WHY: Necessary to correctly implement the error handling in the MPI 2.2
>> standard. In addition, such a resilient runtime is a cornerstone for any
>> level of fault tolerance support we want to provide in the future (such as
>> the MPI-3 Run-Through Stabilization or FT-MPI).
>>
>> WHEN:
>>
>> WHERE: Patch attached to this email, based on trunk r24747.
>> TIMEOUT: 2 weeks from now, on Monday 20 June.
>>
>> --
>>
>> MORE DETAILS:
>>
>> Currently the infrastructure required to enable any kind of fault tolerance
>> development in Open MPI (with the exception of the checkpoint/restart) is
>> missing. However, before developing any fault tolerant support at the
>> application (MPI) level, we need to have a resilient runtime. The changes in
>> this patch address this lack of support and would allow anyone to implement
>> a fault tolerance protocol at the MPI layer without having to worry about
>> the ORTE stabilization.
>>
>> This patch will allow the runtime to drop any dead daemons, and re-route all
>> communications around the holes in order to __ALWAYS__ deliver a message as
>> long as the destination process is alive. The application is informed (via a
>> callback) about the loss of the processes with the same jobid. In this patch
>> we do not address the MPI_ERROR_RETURN type of failures, we focused on the
>> MPI_ERROR_ABORT ones. Moreover, we empowered the application level with the
>> decision, instead of taking it down in the runtime.
>>
>> NEW STUFF:
>>
>> Epoch - A counter that tracks the number of times a process has been
>> detected to 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
Definitely we are targeting ORTED failures here. If an ORTED fails then any
other ORTEDs connected to it will notice and report the failure. Of course if
the failure is an application process then the ORTED on that node will be the only one
to detect it.

Also, if an ORTED is lost, all of the applications running underneath it are 
also lost because we have no way to communicate with them anymore.

On Tuesday, June 7, 2011 at 3:14 PM, Ralph Castain wrote:

> Quick question: could you please clarify this statement:
> 
> > ...because more than one ORTED could (and often will) detect the failure. 
> 
> I don't understand how this can be true, except for detecting an ORTED 
> failure. Only one orted can detect an MPI process failure, unless you have 
> now involved orted's in MPI communications (and I don't believe you did). If 
> the HNP directs another orted to restart that proc, and then that incarnation 
> fails, then the epoch number -should- increment again, shouldn't it? 
> 
> So are you concerned (re having the HNP mark a proc down multiple times) 
> about orted failure detection? In that case, I agree that you can have 
> multiple failure detections - we dealt with it differently in orcm, but I 
> have no issue with doing it another way. Just helps to know what problem you 
> are trying to solve. 
> 
> 
> 
> 



Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Quick question: could you please clarify this statement:

...because more than one ORTED could (and often will) detect the failure.
>

I don't understand how this can be true, except for detecting an ORTED
failure. Only one orted can detect an MPI process failure, unless you have
now involved orted's in MPI communications (and I don't believe you did). If
the HNP directs another orted to restart that proc, and then that
incarnation fails, then the epoch number -should- increment again, shouldn't
it?

So are you concerned (re having the HNP mark a proc down multiple times)
about orted failure detection? In that case, I agree that you can have
multiple failure detections - we dealt with it differently in orcm, but I
have no issue with doing it another way. Just helps to know what problem you
are trying to solve.


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Ah - thanks! That really helped clarify things. Much appreciated.

Will look at the patch in this light...

On Tue, Jun 7, 2011 at 1:00 PM, Wesley Bland  wrote:

>
> Perhaps it would help if you folks could provide a little explanation about
> how you use epoch? While the value sounds similar, your explanations are
> beginning to sound very different from what we are doing and/or had
> envisioned.
>
> I'm not sure how you can talk about an epoch being too high or too low,
> unless you are envisioning an overall system where procs try to maintain
> some global notion of the value - which sounds like a race condition begging
> to cause problems.
>
>
> When we say epoch we mean a value that is stored locally. When a failure is
> detected the detector notifies the HNP who notifies everyone else. Thus
> everyone will _eventually_ receive the notification that the process has
> failed. It may take a while for you to receive the notification, but in the
> meantime you will behave normally. When you do receive the notification that
> the failure occurred, you update your local copy of the epoch.
>
> This is similar to the definition of the "perfect" failure detector that
> Josh references. It doesn't matter if you don't find out about the failure
> immediately, as long as you find out about it eventually. If you aren't
> actually in the same jobid as the failed process you might never find out
> about the failure because it does not apply to you.
>
> Are you then thinking that MPI processes are going to detect failure
> instead of local orteds?? Right now, no MPI process would ever report
> failure of a peer - the orted detects failure using the sigchild and reports
> it. What mechanism would the MPI procs use, and how would that be more
> reliable than sigchild??
>
> Definitely not. ORTEDs are the processes that detect and report the
> failures. They can detect the failure of other ORTEDs or of applications.
> Basically anything to which they have a connection.
>
>
> So right now the HNP can -never- receive more than one failure report at a
> time for a process. The only issue we've been working is that there are
> several pathways for reporting that error - e.g., if the orted detects the
> process fails and reports it, and then the orted itself fails, we can get
> multiple failure events back at the HNP before we respond to the first one.
>
> Not the same issue as having MPI procs reporting failures...
>
> This is where the epoch becomes necessary. When reporting a failure, you
> tell the HNP which process failed by name, including the epoch. Thus the HNP
> will not mark a process as having failed twice (thus incrementing the epoch
> twice and notifying everyone about the failure twice). The HNP might receive
> multiple notifications because more than one ORTED could (and often will)
> detect the failure. It is easier to have the HNP decide what is a failure
> and what is a duplicate rather than have the ORTEDs reach some consensus
> about the fact that a process has failed. Much less overhead this way.
>
>
> I'm not sure what ORCM does in this respect, but I don't know of anything in
> ORTE that would track this data other than the process state and that
> doesn't keep track of anything beyond one failure (which admittedly isn't an
> issue until we implement process recovery).
>
>
> We aren't having any problems with process recovery and process state -
> without tracking epochs. We only track "incarnations" so that we can pass it
> down to the apps, which use that info to guide their restart.
>
> Could you clarify why you are having a problem in this regard? Might help
> to better understand your proposed changes.
>
> I think we're talking about the same thing here. The only difference is
> that I'm not looking at the ORCM code so I don't have the "incarnations".
>
>


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
> 
> Perhaps it would help if you folks could provide a little explanation about 
> how you use epoch? While the value sounds similar, your explanations are 
> beginning to sound very different from what we are doing and/or had 
> envisioned. 
> 
> I'm not sure how you can talk about an epoch being too high or too low, 
> unless you are envisioning an overall system where procs try to maintain some 
> global notion of the value - which sounds like a race condition begging to 
> cause problems. 
> 
> 
> 
> 

When we say epoch we mean a value that is stored locally. When a failure is 
detected the detector notifies the HNP who notifies everyone else. Thus 
everyone will _eventually_ receive the notification that the process has 
failed. It may take a while for you to receive the notification, but in the 
meantime you will behave normally. When you do receive the notification that 
the failure occurred, you update your local copy of the epoch.

This is similar to the definition of the "perfect" failure detector that Josh 
references. It doesn't matter if you don't find out about the failure immediately,
as long as you find out about it eventually. If you aren't actually in the same 
jobid as the failed process you might never find out about the failure because 
it does not apply to you.
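
In code terms it is little more than a local table that gets bumped when the
notification arrives; a minimal sketch (the epoch field on the name and the
helper names are assumptions for illustration only):

  /* Purely local view: no global agreement is ever attempted. */
  static uint32_t local_epoch_of(orte_process_name_t *proc);              /* lookup */
  static void     set_local_epoch(orte_process_name_t *proc, uint32_t e); /* update */

  /* Invoked whenever the HNP's failure notification finally reaches us. */
  static void on_failure_notification(orte_process_name_t *failed_proc)
  {
      if (failed_proc->epoch > local_epoch_of(failed_proc)) {
          /* Our information was stale: catch up to the reported epoch. */
          set_local_epoch(failed_proc, failed_proc->epoch);
      }
      /* Until this message arrived we simply behaved normally. */
  }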
> Are you then thinking that MPI processes are going to detect failure instead 
> of local orteds?? Right now, no MPI process would ever report failure of a 
> peer - the orted detects failure using the sigchild and reports it. What 
> mechanism would the MPI procs use, and how would that be more reliable than 
> sigchild??
> 
> 
> 

Definitely not. ORTEDs are the processes that detect and report the failures. 
They can detect the failure of other ORTEDs or of applications. Basically 
anything to which they have a connection.
> 
> So right now the HNP can -never- receive more than one failure report at a 
> time for a process. The only issue we've been working is that there are 
> several pathways for reporting that error - e.g., if the orted detects the 
> process fails and reports it, and then the orted itself fails, we can get 
> multiple failure events back at the HNP before we respond to the first one. 
> 
> Not the same issue as having MPI procs reporting failures...
This is where the epoch becomes necessary. When reporting a failure, you tell 
the HNP which process failed by name, including the epoch. Thus the HNP will 
not mark a process as having failed twice (thus incrementing the epoch twice
and notifying everyone about the failure twice). The HNP might receive multiple 
notifications because more than one ORTED could (and often will) detect the 
failure. It is easier to have the HNP decide what is a failure and what is a 
duplicate rather than have the ORTEDs reach some consensus about the fact that 
a process has failed. Much less overhead this way.
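
A sketch of the HNP-side filtering that this enables (again, the helper names
are only for illustration, not the code in the patch):

  static void hnp_report_failure(orte_process_name_t *failed_proc)
  {
      /* The report names the failed process *including* its epoch. */
      if (failed_proc->epoch < current_epoch_of(failed_proc)) {
          return;   /* another ORTED already reported this incarnation */
      }
      mark_proc_failed_and_bump_epoch(failed_proc);   /* first report wins */
      notify_all_daemons_of_failure(failed_proc);
  }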
> > 
> > I'm not sure what ORCM does in this respect, but I don't know of anything in
> > ORTE that would track this data other than the process state and that 
> > doesn't keep track of anything beyond one failure (which admittedly isn't 
> > an issue until we implement process recovery). 
> 
> We aren't having any problems with process recovery and process state - 
> without tracking epochs. We only track "incarnations" so that we can pass it 
> down to the apps, which use that info to guide their restart. 
> 
> Could you clarify why you are having a problem in this regard? Might help to 
> better understand your proposed changes.
I think we're talking about the same thing here. The only difference is that 
I'm not looking at the ORCM code so I don't have the "incarnations".




Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca wrote:

>
> On Jun 7, 2011, at 12:14 , Ralph Castain wrote:
>
> > But the epoch is process-unique - i.e., it is the number of times that
> this specific process has been started, which differs per proc since we
> don't restart all the procs every time one fails.
>
> Yes the epoch is per process, but it is distributed among all participants.
> The difficulty here is to make sure the global view of the processes
> converges toward a common value of the epoch for each process.
>

Sounds racy...is it actually necessary to have a global agreement on epoch?
Per my other note, perhaps we really need a primer on this epoch concept.



>
> > So if I look at the epoch of the proc sending me a message, I really
> can't check it against my own value as the comparison is meaningless. All I
> really can do is check to see if it changed from the last time I heard from
> that proc, which would tell me that the proc has been restarted in the
> interim.
>
> I fail to understand your statement here. However, comparing message epoch
> is critical to ensure the correct behavior. It ensures we do not react to
> old messages (that were floating in the system for some obscure reasons),
> and that we have the right contact information for a specific peer (on the
> correct epoch).
>

Again, maybe we need a better understanding of what you mean by epoch -
clearly, there is misunderstanding of what you are proposing to do.

I'm leery of anything that requires a general consensus as it creates a lot
of race conditions - might work under certain circumstances, but we've been
burned by that approach too many times.



>  george.
>
>
>
>


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
On Tue, Jun 7, 2011 at 10:35 AM, Wesley Bland  wrote:

>
>  On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote:
>
>
>
> On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland  wrote:
>
>  To address your concerns about putting the epoch in the process name
> structure, putting it in there rather than in a separately maintained list
> simplifies things later.
>
>
> Not really concerned - I was just noting we had done it a tad differently,
> but nothing important.
>
>
>
> For example, during communication you need to attach the epoch to each of
> your messages so they can be tracked later. If a process dies while the
> message is in flight, or you need to cancel your communication, you need to
> be able to find the matching message to the matching epoch. If the epoch
> isn't in the process name, then you have to modify the message header for
> each type of message to include that information. Each process not only
> needs to know what the current version of the epoch is from its own
> perspective, but also from the perspective of whomever is sending the
> message.
>
>
> But the epoch is process-unique - i.e., it is the number of times that this
> specific process has been started, which differs per proc since we don't
> restart all the procs every time one fails. So if I look at the epoch of the
> proc sending me a message, I really can't check it against my own value as
> the comparison is meaningless. All I really can do is check to see if it
> changed from the last time I heard from that proc, which would tell me that
> the proc has been restarted in the interim.
>
> But that is the point of the epoch. It prevents communication with a failed
> process. If the epoch is too low, you know you're
> communicating with an old process and you need to drop the message. If it is
> too high, you know that the process has been restarted and you need to
> update your known epoch.
>
> Maybe I'm misunderstanding what you're saying?
>

Perhaps it would help if you folks could provide a little explanation about
how you use epoch? While the value sounds similar, your explanations are
beginning to sound very different from what we are doing and/or had
envisioned.

I'm not sure how you can talk about an epoch being too high or too low,
unless you are envisioning an overall system where procs try to maintain
some global notion of the value - which sounds like a race condition begging
to cause problems.


>
>
> This is also true for things like reporting failures. To prevent duplicate
> notifications you would need to include your epoch in all the notifications
> so no one marks a process as failing twice.
>
>
> I'm not sure of the relevance here. We handle this without problem right
> now (at least, within orcm - haven't looked inside orte yet to see what
> needs to be brought back, if anything) without an epoch - and the state
> machine will resolve the remaining race conditions, which really don't
> pertain to epoch anyway.
>
> An example here might be if a process fails and two other processes detect
> it. By marking which version of the process failed, the HNP knows that it is
> one failure detected by two processes rather than two failures being
> detected in quick succession.
>

Are you then thinking that MPI processes are going to detect failure instead
of local orteds?? Right now, no MPI process would ever report failure of a
peer - the orted detects failure using the sigchild and reports it. What
mechanism would the MPI procs use, and how would that be more reliable than
sigchild??

So right now the HNP can -never- receive more than one failure report at a
time for a process. The only issue we've been working is that there are
several pathways for reporting that error - e.g., if the orted detects the
process fails and reports it, and then the orted itself fails, we can get
multiple failure events back at the HNP before we respond to the first one.

Not the same issue as having MPI procs reporting failures...



>
> I'm not sure what ORCM does in this respect, but I don't know of anything in
> ORTE that would track this data other than the process state and that
> doesn't keep track of anything beyond one failure (which admittedly isn't an
> issue until we implement process recovery).
>

We aren't having any problems with process recovery and process state -
without tracking epochs. We only track "incarnations" so that we can pass it
down to the apps, which use that info to guide their restart.

Could you clarify why you are having a problem in this regard? Might help to
better understand your proposed changes.


>
>
> Really the point is that by changing the process name, you prevent the need
> to pack the epoch each time you have any sort of communication. All that
> work is done along with packing the rest of the structure.
>
>
> No argument - I don't mind having the value in the name. Makes no
> difference to me.
>
>
>  On Tuesday, June 7, 2011 at 11:21 AM, 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread George Bosilca

On Jun 7, 2011, at 12:14 , Ralph Castain wrote:

> But the epoch is process-unique - i.e., it is the number of times that this 
> specific process has been started, which differs per proc since we don't 
> restart all the procs every time one fails.

Yes the epoch is per process, but it is distributed among all participants. The 
difficulty here is to make sure the global view of the processes converges 
toward a common value of the epoch for each process. 

> So if I look at the epoch of the proc sending me a message, I really can't 
> check it against my own value as the comparison is meaningless. All I really 
> can do is check to see if it changed from the last time I heard from that 
> proc, which would tell me that the proc has been restarted in the interim.

I fail to understand your statement here. However, comparing message epoch is 
critical to ensure the correct behavior. It ensures we do not react to old
messages (that were floating in the system for some obscure reasons), and that 
we have the right contact information for a specific peer (on the correct 
epoch).

  george.





Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland


On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote:

> 
> 
> On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland  (mailto:wbl...@eecs.utk.edu)> wrote:
> > To address your concerns about putting the epoch in the process name
> > structure, putting it in there rather than in a separately maintained list 
> > simplifies things later. 
> 
> Not really concerned - I was just noting we had done it a tad differently, 
> but nothing important.
> 
> > 
> > For example, during communication you need to attach the epoch to each of 
> > your messages so they can be tracked later. If a process dies while the 
> > message is in flight, or you need to cancel your communication, you need to 
> > be able to find the matching message to the matching epoch. If the epoch 
> > isn't in the process name, then you have to modify the message header
> > for each type of message to include that information. Each process not only 
> > needs to know what the current version of the epoch is from its own
> > perspective, but also from the perspective of whomever is sending the 
> > message. 
> 
> But the epoch is process-unique - i.e., it is the number of times that this 
> specific process has been started, which differs per proc since we don't 
> restart all the procs every time one fails. So if I look at the epoch of the 
> proc sending me a message, I really can't check it against my own value as 
> the comparison is meaningless. All I really can do is check to see if it 
> changed from the last time I heard from that proc, which would tell me that 
> the proc has been restarted in the interim.
But that is the point of the epoch. It prevents communication with a failed 
process. If the epoch is too low, you know you're communicating
with an old process and you need to drop the message. If it is too high, you 
know that the process has been restarted and you need to update your known 
epoch.
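
On the receive side the comparison is just a three-way check against the locally
known epoch (same assumed helpers and field name as in the earlier sketch; this
is illustration, not the patch's code):

  /* Returns 0 if the message should be dropped, 1 if it should be processed. */
  static int message_epoch_ok(orte_process_name_t *sender)
  {
      uint32_t known = local_epoch_of(sender);

      if (sender->epoch < known) {
          /* Message from an old, failed incarnation: drop it. */
          return 0;
      }
      if (sender->epoch > known) {
          /* The peer was restarted and we missed the notification: catch up. */
          set_local_epoch(sender, sender->epoch);
      }
      /* Epochs now agree: process the message normally. */
      return 1;
  }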

Maybe I'm misunderstanding what you're saying?
> 
> > 
> > This is also true for things like reporting failures. To prevent duplicate 
> > notifications you would need to include your epoch in all the notifications 
> > so no one marks a process as failing twice. 
> 
> I'm not sure of the relevance here. We handle this without problem right now 
> (at least, within orcm - haven't looked inside orte yet to see what needs to 
> be brought back, if anything) without an epoch - and the state machine will 
> resolve the remaining race conditions, which really don't pertain to epoch 
> anyway.
An example here might be if a process fails and two other processes detect it. 
By marking which version of the process failed, the HNP knows that it is one 
failure detected by two processes rather than two failures being detected in 
quick succession.

I'm not sure what ORCM does in this respect, but I don't know of anything in
ORTE that would track this data other than the process state and that doesn't 
keep track of anything beyond one failure (which admittedly isn't an issue 
until we implement process recovery).
> 
> > 
> > Really the point is that by changing the process name, you prevent the need 
> > to pack the epoch each time you have any sort of communication. All that 
> > work is done along with packing the rest of the structure. 
> 
> No argument - I don't mind having the value in the name. Makes no difference 
> to me.
> 
> > 
> > On Tuesday, June 7, 2011 at 11:21 AM, Ralph Castain wrote:
> > 
> > > Thanks for the explanation - as I said, I won't have time to really 
> > > review the patch this week, but appreciate the info. I don't really 
> > > expect to see a conflict as George had discussed this with me previously.
> > > 
> > > I know I'll have merge conflicts with my state machine branch, which 
> > > would be ready for commit in the same time frame, but I'll hold off on 
> > > that one and deal with the merge issues on my side. 
> > > 
> > > 
> > > 
> > > On Tue, Jun 7, 2011 at 8:46 AM, Wesley Bland  > > (mailto:wbl...@eecs.utk.edu)> wrote:
> > > > This could certainly work alongside another ORCM or any other fault 
> > > > detection/prediction/recovery mechanism. Most of the code is just 
> > > > dedicated to keeping the epoch up to date and tracking the status of 
> > > > the processes. The underlying idea was to provide a way for the 
> > > > application to decide what its fault policy would be rather than trying 
> > > > to dictate one in the runtime. If any other layer wanted to register a 
> > > > callback function with this code, it could do anything it wanted to on 
> > > > top of it. 
> > > > 
> > > > Wesley
> > > > 
> > > > On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
> > > > 
> > > > > I'm on travel this week, but will look this over when I return. From 
> > > > > the description, it sounds nearly identical to what we did in ORCM, 
> > > > > so I expect there won't be many issues. You do get some race 
> > > > > conditions that the new state 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
To address your concerns about putting the epoch in the process name structure,
putting it in there rather than in a separately maintained list simplifies 
things later. 

For example, during communication you need to attach the epoch to each of your 
messages so they can be tracked later. If a process dies while the message is 
in flight, or you need to cancel your communication, you need to be able to 
find the matching message to the matching epoch. If the epoch isn't in the 
process name, then you have to modify the message header for each type of
message to include that information. Each process not only needs to know what 
the current version of the epoch is from its own perspective, but also from
the perspective of whomever is sending the message.
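
Concretely, the name just grows one field; roughly (the exact field name and
type in the patch may differ, this only shows the shape):

  typedef struct {
      orte_jobid_t jobid;   /* existing */
      orte_vpid_t  vpid;    /* existing */
      orte_epoch_t epoch;   /* new: number of times this process has been
                             * detected to have terminated */
  } orte_process_name_t;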

This is also true for things like reporting failures. To prevent duplicate 
notifications you would need to include your epoch in all the notifications so 
no one marks a process as failing twice.

Really the point is that by changing the process name, you prevent the need to 
pack the epoch each time you have any sort of communication. All that work is 
done along with packing the rest of the structure. 
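
The practical effect is that any code which already packs a name picks up the
epoch for free, assuming the ORTE_NAME dss support is extended along with the
structure. For example:

  opal_buffer_t *buffer = OBJ_NEW(opal_buffer_t);
  orte_process_name_t proc_name = *ORTE_PROC_MY_NAME;
  int32_t count = 1;

  /* One pack call now carries jobid, vpid and epoch together... */
  opal_dss.pack(buffer, &proc_name, 1, ORTE_NAME);

  /* ...and one unpack call recovers all three, with no extra header field. */
  opal_dss.unpack(buffer, &proc_name, &count, ORTE_NAME);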

On Tuesday, June 7, 2011 at 11:21 AM, Ralph Castain wrote:

> Thanks for the explanation - as I said, I won't have time to really review 
> the patch this week, but appreciate the info. I don't really expect to see a 
> conflict as George had discussed this with me previously.
> 
> I know I'll have merge conflicts with my state machine branch, which would be 
> ready for commit in the same time frame, but I'll hold off on that one and 
> deal with the merge issues on my side.
> 
> 
> 
> On Tue, Jun 7, 2011 at 8:46 AM, Wesley Bland  (mailto:wbl...@eecs.utk.edu)> wrote:
> > This could certainly work alongside another ORCM or any other fault 
> > detection/prediction/recovery mechanism. Most of the code is just dedicated 
> > to keeping the epoch up to date and tracking the status of the processes. 
> > The underlying idea was to provide a way for the application to decide what 
> > its fault policy would be rather than trying to dictate one in the runtime. 
> > If any other layer wanted to register a callback function with this code, 
> > it could do anything it wanted to on top of it. 
> > 
> > Wesley
> > 
> > On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
> > 
> > > I'm on travel this week, but will look this over when I return. From the 
> > > description, it sounds nearly identical to what we did in ORCM, so I 
> > > expect there won't be many issues. You do get some race conditions that 
> > > the new state machine code should help resolve.
> > > 
> > > Only difference I can quickly see is that we chose not to modify the 
> > > process name structure, keeping the "epoch" (we called it "incarnation") 
> > > as a separate value. Since we aren't terribly concerned about backward 
> > > compatibility, I don't consider this a significant issue - but something 
> > > the community should recognize. 
> > > 
> > > My main concern will be to ensure that the new code contains enough 
> > > flexibility to allow integration with other layers such as ORCM without 
> > > creating potential conflict over "double protection" - i.e., if the layer 
> > > above ORTE wants to provide a certain level of fault protection, then 
> > > ORTE needs to get out of the way. 
> > > 
> > > 
> > > On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca  > > (mailto:bosi...@eecs.utk.edu)> wrote:
> > > >  WHAT: Allow the runtime to handle fail-stop failures for both runtime 
> > > > (daemons) or application level processes. This patch extends the 
> > > > orte_process_name_t structure with a field to store the process epoch 
> > > > (the number of times it died so far), and adds an application failure
> > > > notification callback function to be registered in the runtime.
> > > > 
> > > >  WHY: Necessary to correctly implement the error handling in the MPI 
> > > > 2.2 standard. In addition, such a resilient runtime is a cornerstone 
> > > > for any level of fault tolerance support we want to provide in the 
> > > > future (such as the MPI-3 Run-Through Stabilization or FT-MPI).
> > > > 
> > > >  WHEN:
> > > > 
> > > >  WHERE: Patch attached to this email, based on trunk r24747.
> > > >  TIMEOUT: 2 weeks from now, on Monday 20 June.
> > > > 
> > > >  --
> > > > 
> > > >  MORE DETAILS:
> > > > 
> > > >  Currently the infrastructure required to enable any kind of fault 
> > > > tolerance development in Open MPI (with the exception of the 
> > > > checkpoint/restart) is missing. However, before developing any fault 
> > > > tolerant support at the application (MPI) level, we need to have a 
> > > > resilient runtime. The changes in this patch address this lack of 
> > > > support and would allow anyone to implement a fault tolerance protocol 
> > > > at the MPI layer without having to worry about the ORTE stabilization.
> > > > 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Thanks for the explanation - as I said, I won't have time to really review
the patch this week, but appreciate the info. I don't really expect to see a
conflict as George had discussed this with me previously.

I know I'll have merge conflicts with my state machine branch, which would
be ready for commit in the same time frame, but I'll hold off on that one
and deal with the merge issues on my side.



On Tue, Jun 7, 2011 at 8:46 AM, Wesley Bland  wrote:

>  This could certainly work alongside another ORCM or any other fault
> detection/prediction/recovery mechanism. Most of the code is just dedicated
> to keeping the epoch up to date and tracking the status of the processes.
> The underlying idea was to provide a way for the application to decide what
> its fault policy would be rather than trying to dictate one in the runtime.
> If any other layer wanted to register a callback function with this code, it
> could do anything it wanted to on top of it.
>
> Wesley
>
> On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
>
> I'm on travel this week, but will look this over when I return. From the
> description, it sounds nearly identical to what we did in ORCM, so I expect
> there won't be many issues. You do get some race conditions that the new
> state machine code should help resolve.
>
> Only difference I can quickly see is that we chose not to modify the
> process name structure, keeping the "epoch" (we called it "incarnation") as
> a separate value. Since we aren't terribly concerned about backward
> compatibility, I don't consider this a significant issue - but something the
> community should recognize.
>
> My main concern will be to ensure that the new code contains enough
> flexibility to allow integration with other layers such as ORCM without
> creating potential conflict over "double protection" - i.e., if the layer
> above ORTE wants to provide a certain level of fault protection, then ORTE
> needs to get out of the way.
>
>
> On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca wrote:
>
> WHAT: Allow the runtime to handle fail-stop failures for both runtime
> (daemons) or application level processes. This patch extends the
> orte_process_name_t structure with a field to store the process epoch (the
> number of times it died so far), and adds an application failure notification
> callback function to be registered in the runtime.
>
> WHY: Necessary to correctly implement the error handling in the MPI 2.2
> standard. In addition, such a resilient runtime is a cornerstone for any
> level of fault tolerance support we want to provide in the future (such as
> the MPI-3 Run-Through Stabilization or FT-MPI).
>
> WHEN:
>
> WHERE: Patch attached to this email, based on trunk r24747.
> TIMEOUT: 2 weeks from now, on Monday 20 June.
>
> --
>
> MORE DETAILS:
>
> Currently the infrastructure required to enable any kind of fault tolerance
> development in Open MPI (with the exception of the checkpoint/restart) is
> missing. However, before developing any fault tolerant support at the
> application (MPI) level, we need to have a resilient runtime. The changes in
> this patch address this lack of support and would allow anyone to implement
> a fault tolerance protocol at the MPI layer without having to worry about
> the ORTE stabilization.
>
> This patch will allow the runtime to drop any dead daemons, and re-route
> all communications around the holes in order to __ALWAYS__ deliver a message
> as long as the destination process is alive. The application is informed
> (via a callback) about the loss of the processes with the same jobid. In
> this patch we do not address the MPI_ERROR_RETURN type of failures, we
> focused on the MPI_ERROR_ABORT ones. Moreover, we empowered the application
> level with the decision, instead of taking it down in the runtime.
>
> NEW STUFF:
>
> Epoch - A counter that tracks the number of times a process has been
> detected to have terminated, either from a failure or an expected
> termination. After the termination is detected, the HNP coordinates all
> other processes’ knowledge of the new epoch. Each ORTED will know the epoch
> of the other processes in the job, but it will not actually store anything
> until the epochs change.
>
> Run-Through Stabilization - When an ORTED (or HNP) detects that another
> process has terminated, it repairs the routing layer and informs the HNP.
> The HNP tells all other processes about the failure so they can also repair
> their routing layers and update their internal bookkeeping. The processes do
> not abort after the termination is detected.
>
> Callback Function - When the HNP tells all the ORTEDs about the failures,
> they tell the ORTE layers within the applications. The application level
> ORTE layers have a callback function that they use to inform the OMPI layer
> about the error. Currently the OMPI errhandler code fills in this callback
> function so it is informed when there is an error and 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
I'm on travel this week, but will look this over when I return. From the
description, it sounds nearly identical to what we did in ORCM, so I expect
there won't be many issues. You do get some race conditions that the new
state machine code should help resolve.

Only difference I can quickly see is that we chose not to modify the process
name structure, keeping the "epoch" (we called it "incarnation") as a
separate value. Since we aren't terribly concerned about backward
compatibility, I don't consider this a significant issue - but something the
community should recognize.

My main concern will be to ensure that the new code contains enough
flexibility to allow integration with other layers such as ORCM without
creating potential conflict over "double protection" - i.e., if the layer
above ORTE wants to provide a certain level of fault protection, then ORTE
needs to get out of the way.


On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca  wrote:

> WHAT: Allow the runtime to handle fail-stop failures for both runtime
> (daemons) or application level processes. This patch extends the
> orte_process_name_t structure with a field to store the process epoch (the
> number of times it died so far), and adds an application failure notification
> callback function to be registered in the runtime.
>
> WHY: Necessary to correctly implement the error handling in the MPI 2.2
> standard. In addition, such a resilient runtime is a cornerstone for any
> level of fault tolerance support we want to provide in the future (such as
> the MPI-3 Run-Through Stabilization or FT-MPI).
>
> WHEN:
>
> WHERE: Patch attached to this email, based on trunk r24747.
> TIMEOUT: 2 weeks from now, on Monday 20 June.
>
> --
>
> MORE DETAILS:
>
> Currently the infrastructure required to enable any kind of fault tolerance
> development in Open MPI (with the exception of the checkpoint/restart) is
> missing. However, before developing any fault tolerant support at the
> application (MPI) level, we need to have a resilient runtime. The changes in
> this patch address this lack of support and would allow anyone to implement
> a fault tolerance protocol at the MPI layer without having to worry about
> the ORTE stabilization.
>
> This patch will allow the runtime to drop any dead daemons, and re-route
> all communications around the holes in order to __ALWAYS__ deliver a message
> as long as the destination process is alive. The application is informed
> (via a callback) about the loss of the processes with the same jobid. In
> this patch we do not address the MPI_ERROR_RETURN type of failures; we
> focused on the MPI_ERROR_ABORT ones. Moreover, we empowered the application
> level with the decision, instead of having the runtime take the application down.
>
> NEW STUFF:
>
> Epoch - A counter that tracks the number of times a process has been
> detected to have terminated, either from a failure or an expected
> termination. After the termination is detected, the HNP coordinates all
> other processes’ knowledge of the new epoch. Each ORTED will know the epoch
> of the other processes in the job, but it will not actually store anything
> until the epochs change.
>
> Run-Through Stabilization - When an ORTED (or HNP) detects that another
> process has terminated, it repairs the routing layer and informs the HNP.
> The HNP tells all other processes about the failure so they can also repair
> their routing layers and update their internal bookkeeping. The processes do
> not abort after the termination is detected.
>
> Callback Function - When the HNP tells all the ORTEDs about the failures,
> they tell the ORTE layers within the applications. The application level
> ORTE layers have a callback function that they use to inform the OMPI layer
> about the error. Currently the OMPI errhandler code fills in this callback
> function so it is informed when there is an error and it aborts (to maintain
> the current default behavior of MPI). This callback function can also be
> used in an ORTE only application to perform application based fault
> tolerance (ABFT) and allow the application to continue.
>
> NECESSARY FOR IMPLEMENTATION:
>
> Epoch - The orte_process_name_t struct now has a field for epoch. This
> means that whenever sending a message, the most current version of the epoch
> needs to be in this field. This is a simple lookup using the function in
> orte/util/nidmap.c: orte_util_lookup_epoch(). In the orte/orted/orted_comm.c
> code, there is a check to make sure that it isn’t trying to send messages to
> a process that has already terminated (don’t send to a process with an epoch
> less than the current epoch). Make sure that if you are sending a message,
> you have the most up-to-date data here.
>
> Routing - So far, only the binomial routing layer has been updated to use
> the new resilience features. To modify other routing layers to be able to
> continue running after a process failure, they need to be able to detect
> which processes are 

[OMPI devel] RFC: Resilient ORTE

2011-06-06 Thread George Bosilca
WHAT: Allow the runtime to handle fail-stop failures for both runtime (daemons) 
or application level processes. This patch extends the orte_process_name_t 
structure with a field to store the process epoch (the number of times it died 
so far), and adds an application failure notification callback function to be 
registered in the runtime. 

WHY: Necessary to correctly implement the error handling in the MPI 2.2 
standard. In addition, such a resilient runtime is a cornerstone for any level 
of fault tolerance support we want to provide in the future (such as the MPI-3 
Run-Through Stabilization or FT-MPI).

WHEN:

WHERE: Patch attached to this email, based on trunk r24747.
TIMEOUT: 2 weeks from now, on Monday 20 June.

--

MORE DETAILS:

Currently the infrastructure required to enable any kind of fault tolerance 
development in Open MPI (with the exception of the checkpoint/restart) is 
missing. However, before developing any fault tolerant support at the 
application (MPI) level, we need to have a resilient runtime. The changes in 
this patch address this lack of support and would allow anyone to implement a 
fault tolerance protocol at the MPI layer without having to worry about the 
ORTE stabilization.

This patch will allow the runtime to drop any dead daemons, and re-route all 
communications around the holes in order to __ALWAYS__ deliver a message as 
long as the destination process is alive. The application is informed (via a 
callback) about the loss of the processes with the same jobid. In this patch we 
do not address the MPI_ERROR_RETURN type of failures; we focused on the 
MPI_ERROR_ABORT ones. Moreover, we empowered the application level with the 
decision, instead of having the runtime take the application down.

NEW STUFF:

Epoch - A counter that tracks the number of times a process has been detected 
to have terminated, either from a failure or an expected termination. After the 
termination is detected, the HNP coordinates all other processes’ knowledge of 
the new epoch. Each ORTED will know the epoch of the other processes in the 
job, but it will not actually store anything until the epochs change. 
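For orientation, here is a rough sketch of what the extended name could look
like (as noted in the WHAT section, the epoch lives directly in
orte_process_name_t). The field and type names below are guesses based on this
description, not copied from the patch:

    #include <stdint.h>

    /* Stand-ins for the real ORTE typedefs, so the sketch stands alone. */
    typedef uint32_t orte_jobid_t;
    typedef uint32_t orte_vpid_t;
    typedef uint32_t orte_epoch_t;   /* assumed type of the new epoch field */

    typedef struct orte_process_name_t {
        orte_jobid_t jobid;   /* job the process belongs to                 */
        orte_vpid_t  vpid;    /* rank of the process within that job        */
        orte_epoch_t epoch;   /* times this jobid/vpid has been seen to die */
    } orte_process_name_t;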

Run-Through Stabilization - When an ORTED (or HNP) detects that another process 
has terminated, it repairs the routing layer and informs the HNP. The HNP tells 
all other processes about the failure so they can also repair their routing 
layers and update their internal bookkeeping. The processes do not abort after 
the termination is detected.

Callback Function - When the HNP tells all the ORTEDs about the failures, they 
tell the ORTE layers within the applications. The application level ORTE layers 
have a callback function that they use to inform the OMPI layer about the 
error. Currently the OMPI errhandler code fills in this callback function so it 
is informed when there is an error and it aborts (to maintain the current 
default behavior of MPI). This callback function can also be used in an ORTE 
only application to perform application based fault tolerance (ABFT) and allow 
the application to continue.

NECESSARY FOR IMPLEMENTATION:

Epoch - The orte_process_name_t struct now has a field for epoch. This means 
that whenever sending a message, the most current version of the epoch needs to 
be in this field. This is a simple lookup using the function in 
orte/util/nidmap.c: orte_util_lookup_epoch(). In the orte/orted/orted_comm.c 
code, there is a check to make sure that it isn’t trying to send messages to a 
process that has already terminated (don’t send to a process with an epoch less 
than the current epoch). Make sure that if you are sending a message, you have 
the most up-to-date data here.
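As a rough illustration of that rule, the sketch below refreshes a peer name's
epoch before a send and refuses to send to a peer whose stored epoch shows it
has already terminated. It assumes orte_util_lookup_epoch() takes the process
name and returns the current epoch, and that building inside the ORTE tree
provides the name and epoch types; the helper itself is invented for
illustration:

    #include <stdbool.h>
    #include "orte/util/nidmap.h"   /* orte_util_lookup_epoch(), per the text */

    /* Hypothetical helper used before handing a message to the messaging
     * layer. */
    static bool peer_ready_for_send(orte_process_name_t *peer)
    {
        orte_epoch_t current = orte_util_lookup_epoch(peer);

        /* "Don't send to a process with an epoch less than the current
         * epoch": the incarnation we were addressing is already gone. */
        if (peer->epoch < current) {
            return false;
        }

        /* Otherwise make sure the name carries the newest epoch. */
        peer->epoch = current;
        return true;
    }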

Routing - So far, only the binomial routing layer has been updated to use the 
new resilience features. To modify other routing layers to be able to continue 
running after a process failure, they need to be able to detect which processes 
are not currently running and route around them. The errmgr gives the routing 
layer two chances to do this. First it calls delete_route for each process that 
fails, then it calls update_routing_tree after it has appropriately marked each 
process. Before either of these things happens, the epoch and process state have 
already been updated, so the routing layer can use this data to determine which 
processes are alive and which are dead. A convenience function has been added 
to orte/util/nidmap.h called orte_util_proc_is_running() which allows the 
ORTEDs to determine the status of a process. Keep in mind that a process is also 
not running if it hasn’t started up yet, so it is wise to check the epoch as well 
(to make sure that it isn’t ORTE_EPOCH_MIN) to confirm that you’re actually 
detecting an error and not just noticing that an ORTED hasn’t finished starting.
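A sketch of the kind of check a routing component could make while repairing
its tree. It assumes orte_util_proc_is_running() returns a boolean and
orte_util_lookup_epoch() returns the current epoch for a name; the helper name
is invented for illustration:

    #include <stdbool.h>
    #include "orte/util/nidmap.h"   /* orte_util_proc_is_running(), lookup */

    /* Hypothetical helper: has this peer actually failed, or has its
     * ORTED simply not finished starting yet? */
    static bool peer_has_failed(orte_process_name_t *peer)
    {
        if (orte_util_proc_is_running(peer)) {
            return false;    /* alive: keep it in the routing tree */
        }

        /* Not running, but an epoch of ORTE_EPOCH_MIN means it never
         * started, which is not a failure. */
        if (ORTE_EPOCH_MIN == orte_util_lookup_epoch(peer)) {
            return false;
        }

        return true;         /* ran before and is gone now: route around it */
    }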

Callback - If you want to implement some sort of fault tolerance on top of this 
code, use the callback function in the errmgr framework. There is a new 
function in the errmgr code called
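Since the name of that errmgr function is cut off here, the sketch below is a
placeholder from top to bottom: the registration call, the callback signature,
and what the callback receives are all invented purely to show where
application-based fault tolerance would hook in:

    #include <stdio.h>

    /* Placeholder name type so the sketch stands alone; the real code
     * would use orte_process_name_t. */
    typedef struct { unsigned int jobid, vpid, epoch; } proc_name_t;
    typedef void (*fault_callback_fn_t)(proc_name_t *failed);

    /* Stand-in for the (truncated) errmgr registration function. */
    static void register_fault_callback(fault_callback_fn_t cb)
    {
        /* The real errmgr would store cb and invoke it whenever the HNP
         * reports a failed process. */
        (void)cb;
    }

    /* ABFT-style handler: record the loss and keep running instead of
     * aborting, which is the OMPI default behavior described above. */
    static void my_fault_handler(proc_name_t *failed)
    {
        fprintf(stderr, "lost process %u.%u (epoch %u), continuing\n",
                failed->jobid, failed->vpid, failed->epoch);
    }

    int main(void)
    {
        register_fault_callback(my_fault_handler);
        /* ... run the ORTE-only application; the callback fires on failures ... */
        return 0;
    }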