Committed in r24815.

On Thursday, June 23, 2011 at 4:19 PM, Ralph Castain wrote:

> 
> On Jun 23, 2011, at 2:14 PM, Wesley Bland wrote:
> > Maybe before the ORTED saw the signal, it detected a communication failure 
> > and reacted to that. 
> 
> Quite possible. However, remember that procs local to mpirun (in most 
> environments) directly receive the ctrl-c instead of the orted getting a cmd 
> from mpirun to kill them. Thus, they "abort_by_signal" rather than "terminate 
> by cmd".
> 
> I've had this problem a lot on my Mac, in particular. The ctrl-c is seen 
> directly by the procs, so the abort code path is totally different.
> 
> 
> > Either way, I haven't had any trouble being able to ctrl-c out of my 
> > applications. I'll go ahead and comment the code out of the HNP and if we 
> > want to put it back later, it will be there.
> > 
> > On Thursday, June 23, 2011 at 4:05 PM, Ralph Castain wrote:
> > 
> > > 
> > > On Jun 23, 2011, at 1:59 PM, Wesley Bland wrote:
> > > > > I don't see any code in the orted errmgr that deals with the state 
> > > > > ORTE_PROC_STATE_ABORTED_BY_SIG; however, the HNP does deal with that 
> > > > > state.
> > > 
> > > Like I said, the orted just passes it along - as it does with all failure 
> > > states.
> > > 
> > > > 
> > > > The discussion Josh and I were having was whether or not to remove the 
> > > > code dealing with ORTE_PROC_STATE_ABORTED_BY_SIG from the HNP so that 
> > > > the processes running on that node can also be aborted by a kill signal 
> > > > and allow the rest of the job to run.
> > > 
> > > I don't see any reason to treat that state any differently than all the 
> > > other failure states. However, be careful - if someone -wants- to kill 
> > > the job, then we need to ensure they can do so - i.e., if mpirun 
> > > sigterms/sigkills a proc, we don't want it auto-recovering or we'll never 
> > > ctrl-c out of mpirun.
> > > 
> > > In my branch, I have a special code for procs terminated deliberately by 
> > > mpirun - pretty sure I put that code back into the trunk, but I don't 
> > > believe the trunk errmgr modules know what to do with it 
> > > (TERMINATED_BY_CMD).
> > > 
> > > You might need to add some code for that case.
> > > > 
> > > > On Thursday, June 23, 2011 at 3:54 PM, Ralph Castain wrote:
> > > > 
> > > > > I'm not entirely sure what that means. The orteds certainly detect 
> > > > > and mark that a local proc aborted by signal - the orted errmgr just 
> > > > > sends a note back to the HNP notifying it of the situation rather 
> > > > > than responding to it directly.
> > > > > 
> > > > > I don't believe the HNP does anything different when responding to a 
> > > > > local proc's abort-by-signal vs getting a message from an orted, does 
> > > > > it?
> > > > > 
> > > > > > What is it you want the HNP/orted to do? I haven't dug that deeply 
> > > > > > into your branch.
> > > > > 
> > > > > On Jun 23, 2011, at 1:47 PM, Josh Hursey wrote:
> > > > > 
> > > > > > I would mention this to Ralph to be sure (CC'ed). I bet that you can
> > > > > > push this change in with the rest so that mpirun hosting a failed
> > > > > > process works.
> > > > > > 
> > > > > > Ralph, what do you think?
> > > > > > 
> > > > > > -- Josh
> > > > > > 
> > > > > > > On Thu, Jun 23, 2011 at 3:29 PM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
> > > > > > > > There is still one problem that you'll notice when you run your tests. The
> > > > > > > > HNP errmgr catches "aborted by signal" while the orteds don't. I wasn't sure
> > > > > > > > if this had a purpose that I wasn't aware of so I left that in there. It's a
> > > > > > > > simple matter of removing the code to make the behavior the same on the HNP
> > > > > > > > as the orteds, but I don't want to remove something like that if it's going
> > > > > > > > to cause problems for someone else.
> > > > > > > 
> > > > > > > On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:
> > > > > > > 
> > > > > > > > So I finally got a chance to test the branch this morning. I cannot
> > > > > > > > get it to work. Maybe I'm doing something wrong, or missing some MCA
> > > > > > > > parameter?
> > > > > > > 
> > > > > > > -------------------------
> > > > > > > [jjhursey@smoky-login1 resilient-orte] hg summary
> > > > > > > parent: 2:c550cf6ed6a2 tip
> > > > > > > Newest version. Synced with trunk r24785.
> > > > > > > branch: default
> > > > > > > commit: 1 modified, 8097 unknown
> > > > > > > update: (current)
> > > > > > > -------------------------
> > > > > > > (the 1 modified was the test program attached)
> > > > > > > 
> > > > > > > > Attached is a modified version of the orte_abort.c program found in
> > > > > > > > ${top}/orte/test/system. This program is ORTE only, and registers the
> > > > > > > > errmgr callback to trigger correct termination. You will need to
> > > > > > > > configure Open MPI with '--with-devel-headers' to build this. But then
> > > > > > > > you can compile with:
> > > > > > > ortecc -g orte_abort.c -o orte_abort
> > > > > > > 
> > > > > > > These are the configure options that I used:
> > > > > > > --with-devel-headers --enable-binaries --disable-io-romio
> > > > > > > --enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
> > > > > > > F77=gfortran FC=gfortran
> > > > > > > 
> > > > > > > 
> > > > > > > If the HNP has no processes on it - I get a hang:
> > > > > > > -------------------------------
> > > > > > > mpirun -np 4 --nolocal orte_abort
> > > > > > > > orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
> > > > > > > > orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
> > > > > > > > orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
> > > > > > > > orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
> > > > > > > > orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
> > > > > > > > mpirun: killing job...
> > > > > > > > 
> > > > > > > > [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file errmgr_hnp.c at line 824
> > > > > > > > [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file orted/orted_comm.c at line 1341
> > > > > > > > mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
> > > > > > > 
> > > > > > > [jjhursey@smoky14 system] echo $?
> > > > > > > 1
> > > > > > > -------------------------------
> > > > > > > 
> > > > > > > If the HNP has processes on it, but not the one that aborted - I 
> > > > > > > get a hang:
> > > > > > > -------------------------------
> > > > > > > [jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
> > > > > > > > orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
> > > > > > > > orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
> > > > > > > > orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
> > > > > > > > orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
> > > > > > > > orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
> > > > > > > > mpirun: killing job...
> > > > > > > > 
> > > > > > > > [smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> > > > > > > > [smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> > > > > > > > [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file errmgr_hnp.c at line 824
> > > > > > > > [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file orted/orted_comm.c at line 1341
> > > > > > > > mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
> > > > > > > 
> > > > > > > [jjhursey@smoky14 system] echo $?
> > > > > > > 1
> > > > > > > --------------------------------
> > > > > > > 
> > > > > > > If the HNP has processes on it, and it is the one that aborted - 
> > > > > > > I get
> > > > > > > immediate return, but no callback:
> > > > > > > --------------------------------
> > > > > > > [jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
> > > > > > > > orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
> > > > > > > > orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
> > > > > > > > orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
> > > > > > > > orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
> > > > > > > > orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
> > > > > > > [jjhursey@smoky14 system] echo $?
> > > > > > > 3
> > > > > > > --------------------------------
> > > > > > > 
> > > > > > > Any ideas on what I might be doing wrong?
> > > > > > > 
> > > > > > > > I tried calling both 'orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid, NULL);'
> > > > > > > > and 'kill(getpid(), SIGKILL);' and got the same behavior.
> > > > > > > 
> > > > > > > -- Josh
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > On Thu, Jun 23, 2011 at 9:58 AM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
> > > > > > > 
> > > > > > > > Last reminder (I hope). RFC goes in at COB today.
> > > > > > > Wesley
> > > > > > > _______________________________________________
> > > > > > > devel mailing list
> > > > > > > > de...@open-mpi.org
> > > > > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > --
> > > > > > > Joshua Hursey
> > > > > > > Postdoctoral Research Associate
> > > > > > > Oak Ridge National Laboratory
> > > > > > > http://users.nccs.gov/~jjhursey
> > > > > > > 
> > > > > > > Attachments:
> > > > > > > - orte_abort.c
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Joshua Hursey
> > > > > > Postdoctoral Research Associate
> > > > > > Oak Ridge National Laboratory
> > > > > > http://users.nccs.gov/~jjhursey
> > > > 
> > > 
> > 
> 

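For anyone following the thread, here is a minimal C sketch of the distinction Ralph describes: a proc that dies from a signal it received on its own ("aborted by signal") versus one the launcher deliberately killed ("terminated by cmd"). It is only an illustration under stated assumptions, not the errmgr code. Every name in it (sketch_state_t, sketch_classify, sketch_should_recover, sketch_run_child) is hypothetical, and the recovery policy is my reading of the discussion rather than what the trunk does. The only real calls are POSIX fork(), kill(), pause(), and waitpid().

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical states for this sketch; these are NOT the ORTE constants. */
typedef enum {
    SKETCH_EXITED_OK,
    SKETCH_EXITED_NONZERO,
    SKETCH_ABORTED_BY_SIG,    /* child died from a signal we did not send */
    SKETCH_TERMINATED_BY_CMD  /* child died from a signal the launcher sent */
} sketch_state_t;

/* Classify a waitpid() status. killed_by_launcher is nonzero when the
 * launcher itself signalled the child before reaping it. */
static sketch_state_t sketch_classify(int status, int killed_by_launcher)
{
    if (WIFSIGNALED(status)) {
        return killed_by_launcher ? SKETCH_TERMINATED_BY_CMD
                                  : SKETCH_ABORTED_BY_SIG;
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        return SKETCH_EXITED_OK;
    }
    return SKETCH_EXITED_NONZERO;
}

/* Assumed policy: treat abort-by-signal like any other failure, but never
 * try to recover a proc the launcher killed on purpose (otherwise ctrl-c
 * could never end the job). */
static int sketch_should_recover(sketch_state_t s)
{
    return s == SKETCH_ABORTED_BY_SIG || s == SKETCH_EXITED_NONZERO;
}

/* Fork one child. If launcher_kills is zero the child SIGKILLs itself, as in
 * the kill(getpid(), SIGKILL) test; otherwise the parent SIGTERMs it. */
static sketch_state_t sketch_run_child(int launcher_kills)
{
    pid_t pid = fork();
    if (pid < 0) {
        return SKETCH_EXITED_NONZERO;  /* fork failed; reported as a failure */
    }
    if (pid == 0) {
        if (!launcher_kills) {
            kill(getpid(), SIGKILL);
        }
        for (;;) {
            pause();  /* wait here until a signal terminates the child */
        }
    }
    if (launcher_kills) {
        kill(pid, SIGTERM);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return SKETCH_EXITED_NONZERO;
    }
    return sketch_classify(status, launcher_kills);
}
```

Calling sketch_run_child(0) models the self-kill case and sketch_run_child(1) models the launcher-initiated kill. The point of keeping a separate "terminated by cmd" state is that the waitpid() status alone cannot tell the two apart, so the launcher has to remember that it sent the signal.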