Ralph, I tried again after the merge and found the same behaviour, though the internals are very different.
I run without any batch manager. From node0:

mpirun -np 1 --mca btl tcp,self -host node1 ./abort

exits with exit code zero :-(

Short story: I applied pmix.2.patch and that fixed my problem.
Could you please review this?

Long story: I initially applied pmix.1.patch and it solved my problem.
Then I ran

mpirun -np 1 --mca btl openib,self -host node1 ./abort

and I was back to square one: the exit code is zero.
So I used the debugger and was unable to reproduce the issue
(one more race condition, yeah!).
Finally, I wrote pmix.2.patch, which fixed my issue, and realized that
pmix.1.patch was no longer needed.

Currently, and assuming pmix.2.patch is correct, I cannot tell whether
pmix.1.patch is needed or not, since that part of the code is no longer
executed.

I also found a hang, within one node, with the following trivial program:

int main (int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 3;
}

From node0:

$ mpirun -np 1 ./test
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------

AND THE PROGRAM HANGS

*but*

$ mpirun -np 1 -host node1 ./test
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[22080,1],0]
  Exit code:    3
--------------------------------------------------------------------------

returns with exit code 3.

Then I found a strange behaviour with a hello world program if only the
self btl is used:

$ mpirun -np 1 --mca btl self ./hw
[helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
[helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in file
../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at line 722

The program returns with exit code zero, but displays an error message.

Cheers,

Gilles

On 2014/08/21 6:21, Ralph Castain wrote:
> I'm aware of the problem, but it will be fixed when the PMIx branch is merged
> later this week.
>
> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Folks,
>>
>> let's look at the following trivial test program:
>>
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main (int argc, char * argv[]) {
>>     int rank, size;
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     printf ("I am %d/%d and i abort\n", rank, size);
>>     MPI_Abort(MPI_COMM_WORLD, 2);
>>     printf ("%d/%d aborted !\n", rank, size);
>>     return 3;
>> }
>>
>> and let's run mpirun (trunk) on node0 and ask the MPI tasks to run on
>> node1, with two tasks or more:
>>
>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> I am 1/2 and i abort
>> I am 0/2 and i abort
>> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
>> mpi-abort
>> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>>
>> node0 $ echo $?
>> 0
>>
>> The exit status of mpirun is zero
>> /* this is why the MPI_Errhandler_fatal_c test fails in MTT */
>>
>> Now if we run only one task:
>>
>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>> I am 0/1 and i abort
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 15884 on
>> node node1 exiting improperly. There are three reasons this could occur:
>>
>> 1. this process did not call "init" before exiting, but others in
>> the job did. This can cause a job to hang indefinitely while it waits
>> for all processes to call "init". By rule, if one process calls "init",
>> then ALL processes must call "init" prior to termination.
>>
>> 2. this process called "init", but exited without calling "finalize".
>> By rule, all processes that call "init" MUST call "finalize" prior to
>> exiting or it will be considered an "abnormal termination"
>>
>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>> orte_create_session_dirs is set to false. In this case, the run-time cannot
>> detect that the abort call was an abnormal termination. Hence, the only
>> error message you will receive is this one.
>>
>> This may have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>>
>> You can avoid this message by specifying -quiet on the mpirun command line.
>> --------------------------------------------------------------------------
>> node0 $ echo $?
>> 1
>>
>> The program displayed a misleading error message and mpirun exited with
>> error code 1
>> /* i would have expected 2, or 3 in the worst case scenario */
>>
>> I dug into it a bit and found a kind of race condition in orted (running
>> on node1).
>> Basically, when the process dies, it writes stuff into the Open MPI session
>> directory and exits.
>> Exiting sends a SIGCHLD to orted and closes the socket/pipe connected to
>> orted.
>> On orted, the loss of the connection is generally processed before the
>> SIGCHLD by libevent, and as a consequence the exit code is not correctly
>> set (i.e. it is left at zero).
>> I did not see any kind of communication between the MPI task and orted
>> (except writing a file in the Open MPI session directory) as I would have
>> expected
>> /* but this was just my initial guess, the truth is I do not know what
>> is supposed to happen */
>>
>> I wrote the attached abort.patch to basically get it working.
>> I highly suspect this is not the right thing to do, so I did not commit it.
>>
>> It works fine with two tasks or more.
>> With only one task, mpirun displays a misleading error message but the
>> exit status is OK.
>>
>> Could someone (Ralph?) have a look at this?
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>> I am 1/2 and i abort
>> I am 0/2 and i abort
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> [node0:00920] 1 more process has sent help message help-mpi-api.txt /
>> mpi-abort
>> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>> node0 $ echo $?
>> 2
>>
>>
>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>> I am 0/1 and i abort
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing the job to be terminated. The first process to do so was:
>>
>>   Process name: [[7955,1],0]
>>   Exit code:    2
>> --------------------------------------------------------------------------
>> node0 $ echo $?
>> 2
>>
>>
>> <abort.patch>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/08/15666.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15672.php
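For illustration only, here is a standalone sketch of the ordering issue
described in the quoted mail above. This is NOT the orted code, just a
minimal libevent 2.x program I put together as an assumption of how the
two events can race: the child exits, the parent sees both a closed pipe
and a SIGCHLD, and libevent may deliver the connection-loss callback first.

/*
 * Standalone illustration only, not the orted code: when a child exits,
 * the parent gets both a pipe EOF and a SIGCHLD, and the pipe event is
 * often processed first. Assumes libevent 2.x.
 */
#include <event2/event.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_sigchld(evutil_socket_t sig, short what, void *arg)
{
    int status;
    (void)sig; (void)what;
    /* only here does the real exit code become available */
    if (waitpid(-1, &status, WNOHANG) > 0) {
        printf("SIGCHLD handled, raw status 0x%x\n", status);
    }
    event_base_loopexit((struct event_base *)arg, NULL);
}

static void on_pipe_event(evutil_socket_t fd, short what, void *arg)
{
    char buf[1];
    (void)what; (void)arg;
    if (read(fd, buf, 1) == 0) {
        /* if the proc were finalized here, its exit code would still be 0 */
        printf("connection loss handled (possibly before SIGCHLD)\n");
    }
}

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0) return 1;

    pid_t pid = fork();
    if (pid == 0) {          /* child: mimic a task exiting with code 3 */
        sleep(1);            /* give the parent time to register its events */
        _exit(3);
    }
    close(fds[1]);           /* parent keeps only the read end */

    struct event_base *base = event_base_new();
    struct event *sigev = evsignal_new(base, SIGCHLD, on_sigchld, base);
    struct event *rdev  = event_new(base, fds[0], EV_READ, on_pipe_event, NULL);
    event_add(sigev, NULL);
    event_add(rdev, NULL);

    event_base_dispatch(base);   /* typically prints the pipe message first */

    event_free(rdev);
    event_free(sigev);
    event_base_free(base);
    return 0;
}

On my side the pipe callback usually fires first, which matches what I
observed in orted; again, this is only a sketch, the real daemon logic is
much more involved.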
Index: orte/mca/errmgr/default_hnp/errmgr_default_hnp.c
===================================================================
--- orte/mca/errmgr/default_hnp/errmgr_default_hnp.c    (revision 32577)
+++ orte/mca/errmgr/default_hnp/errmgr_default_hnp.c    (working copy)
@@ -10,6 +10,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC.
  *                         All rights reserved.
  * Copyright (c) 2014      Intel, Inc. All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -522,13 +524,20 @@
                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                              ORTE_NAME_PRINT(proc), pptr->exit_code));
         if (!ORTE_FLAG_TEST(jdata, ORTE_JOB_FLAG_ABORTED)) {
+            int ret;
             jdata->state = ORTE_JOB_STATE_CALLED_ABORT;
             /* point to the first proc to cause the problem */
             orte_set_attribute(&jdata->attributes, ORTE_JOB_ABORTED_PROC, ORTE_ATTR_LOCAL, pptr, OPAL_PTR);
             /* retain the object so it doesn't get free'd */
             OBJ_RETAIN(pptr);
             ORTE_FLAG_SET(jdata, ORTE_JOB_FLAG_ABORTED);
-            ORTE_UPDATE_EXIT_STATUS(pptr->exit_code);
+            /* decode the pptr->exit_code */
+            if (WIFSIGNALED(pptr->exit_code)) { /* died on signal */
+                ret = WTERMSIG(pptr->exit_code);
+            } else {
+                ret = WEXITSTATUS(pptr->exit_code);
+            }
+            ORTE_UPDATE_EXIT_STATUS(ret);
             /* abnormal termination - abort, but only do it once
              * to avoid creating a lot of confusion */
             default_hnp_abort(jdata);
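The second hunk above decodes the raw wait status (as reported by waitpid)
before handing it to ORTE_UPDATE_EXIT_STATUS. Outside of ORTE, the same
decoding boils down to the standard <sys/wait.h> macros; a minimal
standalone sketch (not ORTE code), which prints 3:

/* Minimal standalone sketch (not ORTE code) of the status decoding the
 * hunk above performs, using the standard <sys/wait.h> macros. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        _exit(3);                       /* child mimics "return 3" */
    }

    int status;
    waitpid(pid, &status, 0);

    int ret;
    if (WIFSIGNALED(status)) {          /* child was killed by a signal */
        ret = WTERMSIG(status);
    } else {                            /* child exited normally */
        ret = WEXITSTATUS(status);
    }
    printf("decoded exit code: %d\n", ret);   /* prints 3 */
    return 0;
}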
Index: orte/orted/pmix/pmix_server_sendrecv.c
===================================================================
--- orte/orted/pmix/pmix_server_sendrecv.c    (revision 32577)
+++ orte/orted/pmix/pmix_server_sendrecv.c    (working copy)
@@ -14,6 +14,8 @@
  * Copyright (c) 2009      Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2011      Oak Ridge National Labs.  All rights reserved.
  * Copyright (c) 2013-2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -673,6 +675,7 @@
             proc->exit_code = ret;
             ORTE_FLAG_SET(proc, ORTE_PROC_FLAG_ABORT);
             ORTE_UPDATE_EXIT_STATUS(ret);
+            ORTE_ACTIVATE_PROC_STATE(&proc->name, ORTE_PROC_STATE_ABORTED);
         }
     }
     /* we will let the ODLS report this to errmgr when the proc exits, so
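For completeness, a tiny harness to double-check the end result. This is a
sketch only: it assumes the ./abort test program and node1 from the mails
above, runs the same mpirun command, and decodes mpirun's exit status the
same way echo $? does.

/* Sketch only: run the mpirun command from the mails above and decode
 * its exit status; assumes ./abort and node1 exist as described. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

int main(void)
{
    int status = system("mpirun -np 1 --mca btl tcp,self -host node1 ./abort");
    if (status == -1) {
        perror("system");
        return 1;
    }
    if (WIFEXITED(status)) {
        printf("mpirun exit code: %d\n", WEXITSTATUS(status));  /* expect 2 */
    } else if (WIFSIGNALED(status)) {
        printf("mpirun killed by signal %d\n", WTERMSIG(status));
    }
    return 0;
}

With the fix in place the expected output is "mpirun exit code: 2", matching
the echo $? output quoted above.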