So I finally got a chance to test the branch this morning. I cannot get it to work. Maybe I'm doing something wrong, or missing some MCA parameter?
-------------------------
[jjhursey@smoky-login1 resilient-orte] hg summary
parent: 2:c550cf6ed6a2 tip
 Newest version. Synced with trunk r24785.
branch: default
commit: 1 modified, 8097 unknown
update: (current)
-------------------------
(the 1 modified was the test program attached)

Attached is a modified version of the orte_abort.c program found in ${top}/orte/test/system. This program is ORTE only, and registers the errmgr callback to trigger correct termination. You will need to configure Open MPI with '--with-devel-headers' to build this. But then you can compile with:
  ortecc -g orte_abort.c -o orte_abort

These are the configure options that I used:
  --with-devel-headers --enable-binaries --disable-io-romio --enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++ F77=gfortran FC=gfortran

If the HNP has no processes on it - I get a hang:
-------------------------------
mpirun -np 4 --nolocal orte_abort
orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
mpirun: killing job...

[smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file errmgr_hnp.c at line 824
[smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file orted/orted_comm.c at line 1341
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

[jjhursey@smoky14 system] echo $?
1
-------------------------------

If the HNP has processes on it, but not the one that aborted - I get a hang:
-------------------------------
[jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
mpirun: killing job...

[smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file errmgr_hnp.c at line 824
[smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file orted/orted_comm.c at line 1341
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

[jjhursey@smoky14 system] echo $?
1
--------------------------------

If the HNP has processes on it, and it is the one that aborted - I get immediate return, but no callback:
--------------------------------
[jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
[jjhursey@smoky14 system] echo $?
3
--------------------------------

Any ideas on what I might be doing wrong?
I tried both 'orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid, NULL);' and 'kill(getpid(), SIGKILL);' and got the same behavior with each.

-- Josh

On Thu, Jun 23, 2011 at 9:58 AM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
> Last reminder (I hope). RFC goes in a COB today.
> Wesley
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey

/* -*- C -*-
 *
 * $HEADER$
 *
 * A program that just spins, with vpid 3 aborting - provides mechanism for testing
 * abnormal program termination
 */

#include <stdio.h>
#include <stdlib.h>   /* exit() */
#include <signal.h>   /* kill(), SIGKILL */
#include <unistd.h>

#include "orte/runtime/runtime.h"
#include "orte/util/proc_info.h"
#include "orte/util/name_fns.h"
#include "orte/runtime/orte_globals.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/grpcomm/grpcomm.h"
#include "opal/class/opal_pointer_array.h"

static pid_t pid;
static char hostname[500];
static int finished = 0;   /* set to 1 by the fault callback */

void my_errhandler_runtime_callback(opal_pointer_array_t *procs);

/* Fault callback registered with the errmgr: report, then let main() exit its loop. */
void my_errhandler_runtime_callback(opal_pointer_array_t *procs) {
    printf("orte_abort: Name %s Host: %s Pid %ld "
           "-- In callback\n",
           ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), hostname, (long)pid);
    fflush(NULL);
    sleep(1);
    finished = 1;
#if 0
    orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid, NULL);
#endif
}

int main(int argc, char* argv[])
{
    int i, rc;
    int flags;

    flags = ORTE_PROC_NON_MPI;
    /* flags = ORTE_PROC_MPI; */
    if (0 > (rc = orte_init(&argc, &argv, flags))) {
        fprintf(stderr, "orte_abort: couldn't init orte - error code %d\n", rc);
        return rc;
    }
    pid = getpid();
    gethostname(hostname, 500);

    printf("orte_abort: Name %s Host: %s Pid %ld "
           "-- Initalized\n",
           ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), hostname, (long)pid);

    orte_errmgr.set_fault_callback(my_errhandler_runtime_callback);

    orte_grpcomm.barrier();

    /* Spin until the callback fires; vpid 3 (or vpid 0 in small jobs) aborts. */
    i = 0;
    while(finished == 0 ) {
        ++i;
        if( i > 10000 &&
            (ORTE_PROC_MY_NAME->vpid == 3 ||
             (orte_process_info.num_procs <= 3 && ORTE_PROC_MY_NAME->vpid == 0)) ) {
            printf("orte_abort: Name %s Host: %s Pid %ld "
                   "-- Calling Abort\n",
                   ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), hostname, (long)pid);
#if 1
            orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid, NULL);
#else
            kill(getpid(), SIGKILL);
#endif
        }
    }

    printf("orte_abort: Name %s Host: %s Pid %ld "
           "-- Finish\n",
           ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), hostname, (long)pid);

    if (ORTE_SUCCESS != orte_finalize()) {
        fprintf(stderr, "Failed orte_finalize\n");
        exit(1);
    }
    return 0;
}