Ga - what a rookie mistake :) I tested the patched test and it works as advertised for the small scale tests I used before. So I'm good with this going in today.
Thanks,
Josh

On Thu, Jun 23, 2011 at 3:34 PM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
> Right. Sorry I misspoke.
>
> On Thursday, June 23, 2011 at 3:32 PM, Ralph Castain wrote:
>
> Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem
> of "not giving up the thread". The problem was that Josh's test never called
> progress. It would have been equally okay to simply call
> "opal_event_dispatch" while waiting for the callback.
> All applications have to cycle the progress engine.
>
> On Jun 23, 2011, at 1:18 PM, Wesley Bland wrote:
>
> Josh,
> There were a couple of bugs that I cleared up in my most recent checkin, but
> I also needed to modify your test. The callback for the application layer
> errmgr actually occurs in the application layer. Your test was never giving
> up the thread to the ORTE application event loop to receive its message from
> the ORTED. I changed your while loop to an ORTE_PROGRESSED_WAIT and that
> fixed the problem.
> Try running the attached code with the modifications and see if that clears
> up the problem. It did for me.
> Thanks,
> Wesley
>
> On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:
>
> So I finally got a chance to test the branch this morning. I cannot
> get it to work. Maybe I'm doing something wrong, or missing some MCA
> parameter?
>
> -------------------------
> [jjhursey@smoky-login1 resilient-orte] hg summary
> parent: 2:c550cf6ed6a2 tip
>  Newest version. Synced with trunk r24785.
> branch: default
> commit: 1 modified, 8097 unknown
> update: (current)
> -------------------------
> (the 1 modified was the test program attached)
>
> Attached is a modified version of the orte_abort.c program found in
> ${top}/orte/test/system. This program is ORTE only, and registers the
> errmgr callback to trigger correct termination. You will need to
> configure Open MPI with '--with-devel-headers' to build this.
> But then you can compile with:
> ortecc -g orte_abort.c -o orte_abort
>
> These are the configure options that I used:
> --with-devel-headers --enable-binaries --disable-io-romio
> --enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
> F77=gfortran FC=gfortran
>
>
> If the HNP has no processes on it - I get a hang:
> -------------------------------
> mpirun -np 4 --nolocal orte_abort
> orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
> orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
> orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
> mpirun: killing job...
>
> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file errmgr_hnp.c at line 824
> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file orted/orted_comm.c at line 1341
> mpirun: abort is already in progress...hit ctrl-c again to forcibly
> terminate
>
> [jjhursey@smoky14 system] echo $?
> 1
> -------------------------------
>
> If the HNP has processes on it, but not the one that aborted - I get a hang:
> -------------------------------
> [jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
> orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
> orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
> orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
> mpirun: killing job...
>
> [smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file errmgr_hnp.c at line 824
> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file orted/orted_comm.c at line 1341
> mpirun: abort is already in progress...hit ctrl-c again to forcibly
> terminate
>
> [jjhursey@smoky14 system] echo $?
> 1
> --------------------------------
>
> If the HNP has processes on it, and it is the one that aborted - I get
> immediate return, but no callback:
> --------------------------------
> [jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
> orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
> orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
> orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
> [jjhursey@smoky14 system] echo $?
> 3
> --------------------------------
>
> Any ideas on what I might be doing wrong?
>
> I tried with both calling 'orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid,
> NULL);' and 'kill(getpid(), SIGKILL);' and got the same behavior.
>
> -- Josh
>
>
>
> On Thu, Jun 23, 2011 at 9:58 AM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
>
> Last reminder (I hope). RFC goes in at COB today.
> Wesley
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
> Attachments:
> - orte_abort.c

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
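
For readers following the thread, here is a minimal sketch of the point Ralph makes above: in a single-threaded, event-driven runtime a callback can only fire while the application is cycling the progress engine. This is not ORTE code. The queue and the names post_event, progress_once, errmgr_callback, wait_without_progress and wait_with_progress are all hypothetical stand-ins invented for illustration; in the real test the waiting was done with ORTE_PROGRESSED_WAIT (or, as Ralph notes, could have been done by calling opal_event_dispatch).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an event library. NOT the ORTE/OPAL API. */
typedef void (*event_cb_t)(void *arg);

typedef struct {
    event_cb_t cb;
    void *arg;
} pending_event_t;

#define MAX_PENDING 8
static pending_event_t pending[MAX_PENDING];
static size_t n_pending = 0;

/* Queue an event, as a runtime would when a message arrives. */
static void post_event(event_cb_t cb, void *arg)
{
    assert(n_pending < MAX_PENDING);
    pending[n_pending].cb = cb;
    pending[n_pending].arg = arg;
    n_pending++;
}

/* One pass of the progress engine: run every callback queued so far.
 * Returns the number of callbacks that were run. */
static int progress_once(void)
{
    pending_event_t batch[MAX_PENDING];
    size_t count = n_pending;

    /* Copy the queue first so a callback may safely post new events. */
    for (size_t i = 0; i < count; i++) {
        batch[i] = pending[i];
    }
    n_pending = 0;

    for (size_t i = 0; i < count; i++) {
        batch[i].cb(batch[i].arg);
    }
    return (int)count;
}

/* Hypothetical application-level callback: just records that it ran. */
static void errmgr_callback(void *arg)
{
    *(bool *)arg = true;
}

/* The mistake: spin on the flag without ever driving the event loop.
 * Bounded by max_spins so the sketch terminates instead of hanging. */
static bool wait_without_progress(const bool *flag, int max_spins)
{
    for (int i = 0; i < max_spins && !*flag; i++) {
        /* nothing here delivers the queued event */
    }
    return *flag;
}

/* The fix: call progress on every iteration while waiting. */
static bool wait_with_progress(const bool *flag, int max_spins)
{
    for (int i = 0; i < max_spins && !*flag; i++) {
        progress_once();
    }
    return *flag;
}
```

With an event posted via post_event(errmgr_callback, &flag), wait_without_progress never sees the flag change because nothing runs the callback, while wait_with_progress sees it on the first iteration. No threads are involved in either case, which is the distinction Ralph draws between "not giving up the thread" and simply never calling progress.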