Josh,

There were a couple of bugs that I cleared up in my most recent check-in, but I
also needed to modify your test. The callback for the application-layer errmgr
runs in the application process itself, and your test never gave up the thread
to the ORTE application event loop, so the process could never receive its
message from the ORTED. I changed your while loop to an ORTE_PROGRESSED_WAIT,
and that fixed the problem.
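
To illustrate the difference, here is a minimal, self-contained sketch in plain
C. It does not use any ORTE headers, and every name in it (progress_once,
errmgr_callback, wait_with_progress, and so on) is a hypothetical stand-in
rather than the actual ORTE API or the contents of the attached test. It only
shows why a loop that never drives the event loop cannot see the callback:

```c
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from ORTE. */
static bool callback_fired = false;  /* set by the errmgr-style callback     */
static int  pending_messages = 0;    /* messages queued by the "daemon"      */

/* Stand-in for the application-layer error-manager callback. */
static void errmgr_callback(void)
{
    callback_fired = true;
}

/* Stand-in for one pass of the event loop: deliver one queued message. */
static void progress_once(void)
{
    if (pending_messages > 0) {
        pending_messages--;
        errmgr_callback();
    }
}

/* Reset the flag and queue one failure notification, as a daemon might. */
void simulate_daemon_notification(void)
{
    callback_fired = false;
    pending_messages = 1;
}

/* Problem pattern: spin on the flag without ever driving the event loop.
 * Bounded by max_spins so the sketch terminates instead of hanging. */
bool wait_without_progress(int max_spins)
{
    for (int i = 0; i < max_spins && !callback_fired; i++) {
        /* nothing here gives the event loop a chance to run */
    }
    return callback_fired;
}

/* Fixed pattern: each iteration yields to the event loop, so the queued
 * message is delivered and the callback can set the flag. */
bool wait_with_progress(int max_spins)
{
    for (int i = 0; i < max_spins && !callback_fired; i++) {
        progress_once();
    }
    return callback_fired;
}
```

In this sketch, after simulate_daemon_notification(), wait_without_progress()
gives up with the flag still unset, while wait_with_progress() returns true on
its first iteration. The real test relies on ORTE's own progress mechanism
through the macro rather than a hand-written loop like this one.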

Try running the attached code with the modifications and see if that clears up 
the problem. It did for me.

Thanks,
Wesley

On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:

> So I finally got a chance to test the branch this morning. I cannot
> get it to work. Maybe I'm doing something wrong, or missing some MCA
> parameter?
> 
> -------------------------
> [jjhursey@smoky-login1 resilient-orte] hg summary
> parent: 2:c550cf6ed6a2 tip
>  Newest version. Synced with trunk r24785.
> branch: default
> commit: 1 modified, 8097 unknown
> update: (current)
> -------------------------
> (the 1 modified was the test program attached)
> 
> Attached is a modified version of the orte_abort.c program found in
> ${top}/orte/test/system. This program is ORTE only, and registers the
> errmgr callback to trigger correct termination. You will need to
> configure Open MPI with '--with-devel-headers' to build this. But then
> you can compile with:
>  ortecc -g orte_abort.c -o orte_abort
> 
> These are the configure options that I used:
>  --with-devel-headers --enable-binaries --disable-io-romio
> --enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
> F77=gfortran FC=gfortran
> 
> 
> If the HNP has no processes on it - I get a hang:
> -------------------------------
> mpirun -np 4 --nolocal orte_abort
> orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
> orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
> orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
> mpirun: killing job...
> 
> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file errmgr_hnp.c at line 824
> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file orted/orted_comm.c at line 1341
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
> 
> [jjhursey@smoky14 system] echo $?
> 1
> -------------------------------
> 
> If the HNP has processes on it, but not the one that aborted - I get a hang:
> -------------------------------
> [jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
> orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
> orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
> orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
> mpirun: killing job...
> 
> [smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file errmgr_hnp.c at line 824
> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file orted/orted_comm.c at line 1341
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
> 
> [jjhursey@smoky14 system] echo $?
> 1
> --------------------------------
> 
> If the HNP has processes on it, and it is the one that aborted - I get
> immediate return, but no callback:
> --------------------------------
> [jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
> orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
> orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
> orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
> [jjhursey@smoky14 system] echo $?
> 3
> --------------------------------
> 
> Any ideas on what I might be doing wrong?
> 
> I tried with both calling 'orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid,
> NULL);' and 'kill(getpid(), SIGKILL);' and got the same behavior.
> 
> -- Josh
> 
> 
> 
> On Thu, Jun 23, 2011 at 9:58 AM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
> > Last reminder (I hope). RFC goes in at COB today.
> > Wesley
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> 
> Attachments: 
> - orte_abort.c
> 


Attachment: orte_abort.c
Description: Binary data
