So I finally got a chance to test the branch this morning. I cannot
get it to work. Maybe I'm doing something wrong, or missing some MCA
parameter?

-------------------------
[jjhursey@smoky-login1 resilient-orte] hg summary
parent: 2:c550cf6ed6a2 tip
 Newest version. Synced with trunk r24785.
branch: default
commit: 1 modified, 8097 unknown
update: (current)
-------------------------
(the 1 modified was the test program attached)

Attached is a modified version of the orte_abort.c program found in
${top}/orte/test/system. This program is ORTE only, and registers the
errmgr callback to trigger correct termination. You will need to
configure Open MPI with '--with-devel-headers' to build this; after
that you can compile it with:
  ortecc -g    orte_abort.c   -o orte_abort

These are the configure options that I used:
 --with-devel-headers --enable-binaries --disable-io-romio
--enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
F77=gfortran FC=gfortran


If the HNP has no processes on it - I get a hang:
-------------------------------
mpirun -np 4 --nolocal orte_abort
orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
mpirun: killing job...

[smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file errmgr_hnp.c at line 824
[smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file orted/orted_comm.c at line 1341
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

[jjhursey@smoky14 system] echo $?
1
-------------------------------

If the HNP has processes on it, but not the one that aborted - I get a hang:
-------------------------------
[jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
mpirun: killing job...

[smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
readv failed: Connection reset by peer (104)
[smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
readv failed: Connection reset by peer (104)
[smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file errmgr_hnp.c at line 824
[smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file orted/orted_comm.c at line 1341
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

[jjhursey@smoky14 system] echo $?
1
--------------------------------

If the HNP has processes on it, and it is the one that aborted - I get
immediate return, but no callback:
--------------------------------
[jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
[jjhursey@smoky14 system] echo $?
3
--------------------------------

Any ideas on what I might be doing wrong?

I tried both calling 'orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid,
NULL);' and 'kill(getpid(), SIGKILL);', and got the same behavior
either way.
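
For reference when reading the 'echo $?' values above, here is a minimal
POSIX-only sketch of how a parent process can tell a child that exited
with a code apart from one that was killed by SIGKILL. It makes no ORTE
calls and says nothing about how mpirun or the HNP derive their own exit
status; the names run_child, describe and end_mode are hypothetical and
exist only for this illustration.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical selector: how the child should terminate. */
enum end_mode { END_EXIT_CODE, END_SIGKILL };

/*
 * Fork a child that terminates in the requested way and store the raw
 * wait status in *status. Returns 0 on success, -1 if fork() or
 * waitpid() fails.
 */
int run_child(enum end_mode mode, int code, int *status)
{
    pid_t child = fork();

    if (child < 0) {
        return -1;
    }
    if (child == 0) {
        if (mode == END_SIGKILL) {
            /* SIGKILL cannot be caught, so the child ends here. */
            kill(getpid(), SIGKILL);
        }
        /* Skip atexit handlers and stdio flushing in the child. */
        _exit(code);
    }
    if (waitpid(child, status, 0) != child) {
        return -1;
    }
    return 0;
}

/* Print a one-line description of a wait status. */
void describe(int status)
{
    if (WIFEXITED(status)) {
        printf("child exited with code %d\n", WEXITSTATUS(status));
    } else if (WIFSIGNALED(status)) {
        printf("child killed by signal %d\n", WTERMSIG(status));
    } else {
        printf("child ended in some other way\n");
    }
}
```

Calling run_child(END_EXIT_CODE, 3, &st) followed by describe(st), and
then the same with END_SIGKILL, shows the two cases side by side. A
launcher sitting between the child and the shell is free to map either
case to whatever exit status it likes, so this only illustrates the
lowest layer.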

-- Josh



On Thu, Jun 23, 2011 at 9:58 AM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
> Last reminder (I hope). RFC goes in a COB today.
> Wesley



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
/* -*- C -*-
 *
 * $HEADER$
 *
 * A program that just spins, with vpid 3 aborting - provides mechanism for testing
 * abnormal program termination
 */

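#include <stdio.h>
#include <stdlib.h>   /* exit() */
#include <signal.h>   /* kill(), SIGKILL */
#include <sys/types.h> /* pid_t */
#include <unistd.h>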

#include "orte/runtime/runtime.h"
#include "orte/util/proc_info.h"
#include "orte/util/name_fns.h"
#include "orte/runtime/orte_globals.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/grpcomm/grpcomm.h"
#include "opal/class/opal_pointer_array.h"

static pid_t pid;
static char hostname[500];
/* Set by the fault callback; volatile so the spin loop re-reads it. */
static volatile int finished = 0;

void my_errhandler_runtime_callback(opal_pointer_array_t *procs);

void my_errhandler_runtime_callback(opal_pointer_array_t *procs) {
   printf("orte_abort: Name %s Host: %s Pid %ld "
           "-- In callback\n",
           ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
           hostname, (long)pid);
   fflush(NULL);
   sleep(1);
   finished = 1;
#if 0
   orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid, NULL);
#endif
}

int main(int argc, char* argv[])
{
    int i, rc;
    int flags;

    flags = ORTE_PROC_NON_MPI;
    /* flags = ORTE_PROC_MPI; */
    if (0 > (rc = orte_init(&argc, &argv, flags))) {
        fprintf(stderr, "orte_abort: couldn't init orte - error code %d\n", rc);
        return rc;
    }
    pid = getpid();
    gethostname(hostname, sizeof(hostname));

    printf("orte_abort: Name %s Host: %s Pid %ld "
           "-- Initalized\n",
           ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
           hostname, (long)pid);

    orte_errmgr.set_fault_callback(my_errhandler_runtime_callback);

    orte_grpcomm.barrier();

    i = 0;
    while(finished == 0 ) {
        ++i;

        if( i > 10000 && 
            (ORTE_PROC_MY_NAME->vpid == 3 || 
             (orte_process_info.num_procs <= 3 && ORTE_PROC_MY_NAME->vpid == 0)) ) {

            printf("orte_abort: Name %s Host: %s Pid %ld "
                   "-- Calling Abort\n",
                   ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                   hostname, (long)pid);
#if 1
            orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid, NULL);
#else
            kill(getpid(), SIGKILL);
#endif
        }
    }

    printf("orte_abort: Name %s Host: %s Pid %ld "
           "-- Finish\n",
           ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
           hostname, (long)pid);

    if (ORTE_SUCCESS != orte_finalize()) {
        fprintf(stderr, "Failed orte_finalize\n");
        exit(1);
    }

    return 0;
}
