You might want to try again with the current head of trunk, as something seems off in what you are seeing - more below.
On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Ralph,
>
> i tried again after the merge and found the same behaviour, though the
> internals are very different.
>
> i run without any batch manager
>
> from node0:
> mpirun -np 1 --mca btl tcp,self -host node1 ./abort
>
> exit with exit code zero :-(

Hmmm...it works fine for me, without your patch:

07:35:41 $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
Hello, World, I am 0 of 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 2.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 24382 on node bend002 exited
on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
07:35:56 $ showcode
130

> short story : i applied pmix.2.patch and that fixed my problem
> could you please review this ?
>
> long story :
> i initially applied pmix.1.patch and it solved my problem
> then i ran
> mpirun -np 1 --mca btl openib,self -host node1 ./abort
> and i came back to square one : exit code is zero
> so i used the debugger and was unable to reproduce the issue
> (one more race condition, yeah !)
> finally, i wrote pmix.2.patch, fixed my issue and realized that
> pmix.1.patch was no longer needed.
> currently, and assuming pmix.2.patch is correct, i cannot tell whether
> pmix.1.patch is needed or not, since this part of the code is no longer
> executed.
> i also found one hang with the following trivial program within one node :
>
> int main (int argc, char *argv[]) {
>     MPI_Init(&argc, &argv);
>     MPI_Finalize();
>     return 3;
> }
>
> from node0 :
> $ mpirun -np 1 ./test
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
>
> AND THE PROGRAM HANGS

This also works fine for me:

07:37:27 $ mpirun -n 1 ./mpi_no_op
07:37:36 $ cat mpi_no_op.c
/* -*- C -*-
 *
 * $HEADER$
 *
 * The most basic of MPI applications
 */

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}

> *but*
> $ mpirun -np 1 -host node1 ./test
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>     Process name: [[22080,1],0]
>     Exit code:    3
> --------------------------------------------------------------------------
>
> return with exit code 3.

Likewise here - works just fine for me.

> then i found a strange behaviour with helloworld if only the self btl is
> used :
> $ mpirun -np 1 --mca btl self ./hw
> [helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
> [helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
> file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
> line 722
>
> the program returns with exit code zero, but displays an error message.
> Cheers,
>
> Gilles
>
> On 2014/08/21 6:21, Ralph Castain wrote:
>> I'm aware of the problem, but it will be fixed when the PMIx branch is
>> merged later this week.
>>
>> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>>> Folks,
>>>
>>> let's look at the following trivial test program :
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>>
>>> int main (int argc, char * argv[]) {
>>>     int rank, size;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     printf ("I am %d/%d and i abort\n", rank, size);
>>>     MPI_Abort(MPI_COMM_WORLD, 2);
>>>     printf ("%d/%d aborted !\n", rank, size);
>>>     return 3;
>>> }
>>>
>>> and let's run mpirun (trunk) on node0 and ask the mpi task to run on
>>> node1, with two tasks or more :
>>>
>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 2.
>>>
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> I am 1/2 and i abort
>>> I am 0/2 and i abort
>>> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
>>> mpi-abort
>>> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>> all help / error messages
>>>
>>> node0 $ echo $?
>>> 0
>>>
>>> the exit status of mpirun is zero
>>> /* this is why the MPI_Errhandler_fatal_c test fails in mtt */
>>>
>>> now if we run only one task :
>>>
>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>> I am 0/1 and i abort
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 2.
>>>
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun has exited due to process rank 0 with PID 15884 on
>>> node node1 exiting improperly. There are three reasons this could occur:
>>>
>>> 1. this process did not call "init" before exiting, but others in
>>> the job did. This can cause a job to hang indefinitely while it waits
>>> for all processes to call "init". By rule, if one process calls "init",
>>> then ALL processes must call "init" prior to termination.
>>>
>>> 2. this process called "init", but exited without calling "finalize".
>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>> exiting or it will be considered an "abnormal termination"
>>>
>>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>>> orte_create_session_dirs is set to false. In this case, the run-time cannot
>>> detect that the abort call was an abnormal termination. Hence, the only
>>> error message you will receive is this one.
>>>
>>> This may have caused other processes in the application to be
>>> terminated by signals sent by mpirun (as reported here).
>>>
>>> You can avoid this message by specifying -quiet on the mpirun command line.
>>>
>>> --------------------------------------------------------------------------
>>> node0 $ echo $?
>>> 1
>>>
>>> the program displayed a misleading error message and mpirun exited with
>>> error code 1
>>> /* i would have expected 2, or 3 in the worst case scenario */
>>>
>>> i dug into it a bit and found a kind of race condition in orted (running
>>> on node1).
>>> basically, when the process dies, it writes stuff in the openmpi session
>>> directory and exits.
>>> exiting sends a SIGCHLD to orted and closes the socket/pipe connected to
>>> orted.
>>> on orted, the loss of connection is generally processed before the
>>> SIGCHLD by libevent, and as a consequence, the exit code is not
>>> correctly set (e.g. it is left at zero).
>>> i did not see any kind of communication between the mpi task and orted
>>> (except writing a file in the openmpi session directory) as i would have
>>> expected
>>> /* but this was just my initial guess, the truth is i do not know what
>>> is supposed to happen */
>>>
>>> i wrote the attached abort.patch to basically get it working.
>>> i highly suspect this is not the right thing to do, so i did not commit it.
>>>
>>> it works fine with two tasks or more.
>>> with only one task, mpirun displays a misleading error message but the
>>> exit status is ok.
>>>
>>> could someone (Ralph ?) have a look at this ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>> I am 1/2 and i abort
>>> I am 0/2 and i abort
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 2.
>>>
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> [node0:00920] 1 more process has sent help message help-mpi-api.txt /
>>> mpi-abort
>>> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>> all help / error messages
>>> node0 $ echo $?
>>> 2
>>>
>>>
>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>> I am 0/1 and i abort
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 2.
>>>
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun detected that one or more processes exited with non-zero status,
>>> thus causing the job to be terminated. The first process to do so was:
>>>
>>>     Process name: [[7955,1],0]
>>>     Exit code:    2
>>> --------------------------------------------------------------------------
>>> node0 $ echo $?
>>> 2
>>>
>>>
>>> <abort.patch>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/08/15666.php
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/08/15672.php
>
> <pmix.1.patch><pmix.2.patch>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15689.php