Thanks, Ralph! I confirm that all my test cases pass now :-)

FYI, I committed r32592 in order to fix a parsing bug on 32-bit platforms
(hence the MTT failures on trunk on x86).
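The usual suspect when a parse only breaks on 32-bit platforms is a width
assumption. As a generic sketch of that class of bug (made-up input and
variable names; this is not the r32592 diff itself):

#include <errno.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

int main (void) {
    const char *input = "4294967296";   /* 2^32: needs more than 32 bits */

    /* on ILP32 platforms long is only 32 bits wide, so this saturates
     * to LONG_MAX and sets errno to ERANGE instead of parsing the value */
    errno = 0;
    long narrow = strtol(input, NULL, 10);
    printf("strtol : %ld (errno=%d)\n", narrow, errno);

    /* a fixed-width 64-bit type parses the same on x86 and x86_64 */
    int64_t wide = strtoll(input, NULL, 10);
    printf("strtoll: %" PRId64 "\n", wide);
    return 0;
}

On x86_64 both calls print 4294967296; on an ILP32 x86 build the first one
prints 2147483647 with errno set to ERANGE.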
Cheers,

Gilles

On 2014/08/23 4:59, Ralph Castain wrote:
> I think these are fixed now - at least, your test cases all pass for me
>
>
> On Aug 22, 2014, at 9:12 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> On Aug 22, 2014, at 9:06 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>>> Ralph,
>>>
>>> Will do on Monday.
>>>
>>> About the first test: in my case, echo $? returns 0.
>> My "showcode" is just an alias for the echo
>>
>>> I noticed this confusing message in your output:
>>> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on
>>> signal 0 (Unknown signal 0).
>> I'll take a look at why that happened
>>
>>> About the second test, please note my test program ends with return 3;
>>> whereas your mpi_no_op.c ends with return 0;
>> I didn't see that little cuteness - sigh
>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> Ralph Castain <r...@open-mpi.org> wrote:
>>> You might want to try again with the current head of trunk, as something
>>> seems off in what you are seeing - more below
>>>
>>>
>>> On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>
>>>> Ralph,
>>>>
>>>> I tried again after the merge and found the same behaviour, though the
>>>> internals are very different.
>>>>
>>>> I run without any batch manager.
>>>>
>>>> From node0:
>>>> mpirun -np 1 --mca btl tcp,self -host node1 ./abort
>>>> exits with exit code zero :-(
>>> Hmmm... it works fine for me, without your patch:
>>>
>>> 07:35:41 $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
>>> Hello, World, I am 0 of 1
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 2.
>>>
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on
>>> signal 0 (Unknown signal 0).
>>> --------------------------------------------------------------------------
>>> 07:35:56 $ showcode
>>> 130
>>>
>>>> Short story: I applied pmix.2.patch and that fixed my problem.
>>>> Could you please review it?
>>>>
>>>> Long story:
>>>> I initially applied pmix.1.patch and it solved my problem. Then I ran
>>>> mpirun -np 1 --mca btl openib,self -host node1 ./abort
>>>> and I was back to square one: the exit code was zero.
>>>> So I used the debugger and was unable to reproduce the issue
>>>> (one more race condition, yeah!).
>>>> Finally, I wrote pmix.2.patch, which fixed my issue, and realized that
>>>> pmix.1.patch was no longer needed.
>>>> Currently, and assuming pmix.2.patch is correct, I cannot tell whether
>>>> pmix.1.patch is needed or not, since that part of the code is no longer
>>>> executed.
>>>>
>>>> I also found one hang with the following trivial program within one node:
>>>>
>>>> #include <mpi.h>
>>>>
>>>> int main (int argc, char *argv[]) {
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Finalize();
>>>>     return 3;
>>>> }
>>>>
>>>> From node0:
>>>> $ mpirun -np 1 ./test
>>>> -------------------------------------------------------
>>>> Primary job terminated normally, but 1 process returned
>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>>
>>>> AND THE PROGRAM HANGS
>>> This also works fine for me:
>>>
>>> 07:37:27 $ mpirun -n 1 ./mpi_no_op
>>> 07:37:36 $ cat mpi_no_op.c
>>> /* -*- C -*-
>>>  *
>>>  * $HEADER$
>>>  *
>>>  * The most basic of MPI applications
>>>  */
>>>
>>> #include <stdio.h>
>>> #include "mpi.h"
>>>
>>> int main(int argc, char* argv[])
>>> {
>>>     MPI_Init(&argc, &argv);
>>>
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>>
>>>
>>>> *but*
>>>> $ mpirun -np 1 -host node1 ./test
>>>> -------------------------------------------------------
>>>> Primary job terminated normally, but 1 process returned
>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun detected that one or more processes exited with non-zero status,
>>>> thus causing the job to be terminated. The first process to do so was:
>>>>
>>>>   Process name: [[22080,1],0]
>>>>   Exit code:    3
>>>> --------------------------------------------------------------------------
>>>>
>>>> and returns with exit code 3.
>>> Likewise here - works just fine for me
>>>
>>>
>>>> Then I found a strange behaviour with helloworld if only the self btl
>>>> is used:
>>>>
>>>> $ mpirun -np 1 --mca btl self ./hw
>>>> [helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
>>>> [helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
>>>> file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
>>>> line 722
>>>>
>>>> The program returns with exit code zero, but displays an error message.
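>>>>
>>>> For what it's worth, that kind of message arises when each packed item
>>>> carries a type tag and the unpack side verifies the tag before reading
>>>> the payload. Here is a generic sketch (not the actual OPAL dss code;
>>>> the tag values are just illustrative):
>>>>
>>>> #include <stdio.h>
>>>> #include <string.h>
>>>>
>>>> enum { TYPE_INT32 = 3, TYPE_BUFFER = 12 };  /* illustrative tag values */
>>>>
>>>> struct buffer { unsigned char data[64]; size_t used, pos; };
>>>>
>>>> static void pack_tagged(struct buffer *b, unsigned char tag,
>>>>                         const void *payload, size_t len) {
>>>>     b->data[b->used++] = tag;                 /* tag first ...      */
>>>>     memcpy(b->data + b->used, payload, len);  /* ... then payload   */
>>>>     b->used += len;
>>>> }
>>>>
>>>> static int unpack_int32(struct buffer *b, int *v) {
>>>>     unsigned char tag = b->data[b->pos++];
>>>>     if (tag != TYPE_INT32) {                  /* the mismatch check */
>>>>         fprintf(stderr, "got type %d when expecting type %d\n",
>>>>                 tag, TYPE_INT32);
>>>>         return -1;
>>>>     }
>>>>     memcpy(v, b->data + b->pos, sizeof *v);
>>>>     b->pos += sizeof *v;
>>>>     return 0;
>>>> }
>>>>
>>>> int main (void) {
>>>>     struct buffer b = { {0}, 0, 0 };
>>>>     unsigned char blob[4] = {0};
>>>>     int v;
>>>>     pack_tagged(&b, TYPE_BUFFER, blob, sizeof blob); /* sender packs one type */
>>>>     return unpack_int32(&b, &v) == 0 ? 0 : 1;        /* receiver expects another */
>>>> }
>>>>
>>>> Running it prints exactly that kind of mismatch and exits non-zero.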
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/08/21 6:21, Ralph Castain wrote:
>>>>> I'm aware of the problem, but it will be fixed when the PMIx branch is
>>>>> merged later this week.
>>>>>
>>>>> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet
>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>
>>>>>> Folks,
>>>>>>
>>>>>> Let's look at the following trivial test program:
>>>>>>
>>>>>> #include <mpi.h>
>>>>>> #include <stdio.h>
>>>>>>
>>>>>> int main (int argc, char *argv[]) {
>>>>>>     int rank, size;
>>>>>>     MPI_Init(&argc, &argv);
>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>     printf ("I am %d/%d and i abort\n", rank, size);
>>>>>>     MPI_Abort(MPI_COMM_WORLD, 2);
>>>>>>     printf ("%d/%d aborted !\n", rank, size);
>>>>>>     return 3;
>>>>>> }
>>>>>>
>>>>>> and let's run mpirun (trunk) on node0 and ask the MPI task to run on
>>>>>> node1, with two tasks or more:
>>>>>>
>>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>>> with errorcode 2.
>>>>>>
>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>> You may or may not see output from other processes, depending on
>>>>>> exactly when Open MPI kills them.
>>>>>> --------------------------------------------------------------------------
>>>>>> I am 1/2 and i abort
>>>>>> I am 0/2 and i abort
>>>>>> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
>>>>>> mpi-abort
>>>>>> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>>>> all help / error messages
>>>>>>
>>>>>> node0 $ echo $?
>>>>>> 0
>>>>>>
>>>>>> The exit status of mpirun is zero
>>>>>> /* this is why the MPI_Errhandler_fatal_c test fails in MTT */.
>>>>>>
>>>>>> Now if we run only one task:
>>>>>>
>>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>>>>> I am 0/1 and i abort
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>>> with errorcode 2.
>>>>>>
>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>> You may or may not see output from other processes, depending on
>>>>>> exactly when Open MPI kills them.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun has exited due to process rank 0 with PID 15884 on
>>>>>> node node1 exiting improperly. There are three reasons this could occur:
>>>>>>
>>>>>> 1. this process did not call "init" before exiting, but others in
>>>>>> the job did. This can cause a job to hang indefinitely while it waits
>>>>>> for all processes to call "init". By rule, if one process calls "init",
>>>>>> then ALL processes must call "init" prior to termination.
>>>>>>
>>>>>> 2. this process called "init", but exited without calling "finalize".
>>>>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>>>>> exiting or it will be considered an "abnormal termination"
>>>>>>
>>>>>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>>>>>> orte_create_session_dirs is set to false. In this case, the run-time cannot
>>>>>> detect that the abort call was an abnormal termination. Hence, the only
>>>>>> error message you will receive is this one.
>>>>>>
>>>>>> This may have caused other processes in the application to be
>>>>>> terminated by signals sent by mpirun (as reported here).
>>>>>>
>>>>>> You can avoid this message by specifying -quiet on the mpirun command
>>>>>> line.
>>>>>> --------------------------------------------------------------------------
>>>>>> node0 $ echo $?
>>>>>> 1
>>>>>>
>>>>>> The program displayed a misleading error message, and mpirun exited with
>>>>>> error code 1
>>>>>> /* I would have expected 2, or 3 in the worst-case scenario */.
>>>>>>
>>>>>> I dug into it a bit and found a kind of race condition in orted (running
>>>>>> on node1).
>>>>>> Basically, when the process dies, it writes some state into the Open MPI
>>>>>> session directory and exits.
>>>>>> Exiting sends a SIGCHLD to orted and closes the socket/pipe connected
>>>>>> to orted.
>>>>>> On orted, the loss of connection is generally processed by libevent
>>>>>> before the SIGCHLD,
>>>>>> and as a consequence the exit code is not correctly set (i.e. it is left
>>>>>> at zero).
>>>>>> I did not see any kind of communication between the MPI task and orted
>>>>>> (except writing a file in the Open MPI session directory), as I would
>>>>>> have expected
>>>>>> /* but this was just my initial guess; the truth is I do not know what
>>>>>> is supposed to happen */.
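>>>>>>
>>>>>> Here is a minimal sketch of the ordering problem I suspect (plain C,
>>>>>> not the actual orted/libevent code; all names are made up). The point
>>>>>> is that the exit code must only be committed once *both* the EOF on
>>>>>> the child's socket/pipe and the SIGCHLD have been processed, since
>>>>>> the two events can be delivered in either order:
>>>>>>
>>>>>> #include <stdbool.h>
>>>>>> #include <stdio.h>
>>>>>> #include <sys/wait.h>
>>>>>>
>>>>>> struct child_state {
>>>>>>     bool saw_eof;      /* socket/pipe to the child was closed */
>>>>>>     bool saw_sigchld;  /* waitpid() status was collected */
>>>>>>     int  exit_code;
>>>>>> };
>>>>>>
>>>>>> static void maybe_report(struct child_state *c) {
>>>>>>     /* report only once both events have arrived: reporting on EOF
>>>>>>      * alone would leave exit_code at its default of zero */
>>>>>>     if (c->saw_eof && c->saw_sigchld)
>>>>>>         printf("child exited with code %d\n", c->exit_code);
>>>>>> }
>>>>>>
>>>>>> static void on_eof(struct child_state *c) {
>>>>>>     c->saw_eof = true;
>>>>>>     maybe_report(c);
>>>>>> }
>>>>>>
>>>>>> static void on_sigchld(struct child_state *c, int wstatus) {
>>>>>>     if (WIFEXITED(wstatus))
>>>>>>         c->exit_code = WEXITSTATUS(wstatus);
>>>>>>     c->saw_sigchld = true;
>>>>>>     maybe_report(c);
>>>>>> }
>>>>>>
>>>>>> int main (void) {
>>>>>>     /* simulate both delivery orders; each reports exit code 2 once
>>>>>>      * (0x0200 is "exited with status 2" in the usual Linux wait
>>>>>>      * status encoding) */
>>>>>>     struct child_state a = { false, false, 0 };
>>>>>>     on_eof(&a);
>>>>>>     on_sigchld(&a, 0x0200);
>>>>>>     struct child_state b = { false, false, 0 };
>>>>>>     on_sigchld(&b, 0x0200);
>>>>>>     on_eof(&b);
>>>>>>     return 0;
>>>>>> }
>>>>>>
>>>>>> With something like that in place, orted would report exit code 2 for
>>>>>> the abort test regardless of which event libevent processes first.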
>>>>>>
>>>>>> I wrote the attached abort.patch to basically get it working.
>>>>>> I highly suspect this is not the right thing to do, so I did not
>>>>>> commit it.
>>>>>>
>>>>>> It works fine with two tasks or more.
>>>>>> With only one task, mpirun displays a misleading error message, but the
>>>>>> exit status is OK.
>>>>>>
>>>>>> Could someone (Ralph?) have a look at this?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>>
>>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>>>>> I am 1/2 and i abort
>>>>>> I am 0/2 and i abort
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>>> with errorcode 2.
>>>>>>
>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>> You may or may not see output from other processes, depending on
>>>>>> exactly when Open MPI kills them.
>>>>>> --------------------------------------------------------------------------
>>>>>> [node0:00920] 1 more process has sent help message help-mpi-api.txt /
>>>>>> mpi-abort
>>>>>> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>>>> all help / error messages
>>>>>> node0 $ echo $?
>>>>>> 2
>>>>>>
>>>>>>
>>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>>>>> I am 0/1 and i abort
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>>> with errorcode 2.
>>>>>>
>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>> You may or may not see output from other processes, depending on
>>>>>> exactly when Open MPI kills them.
>>>>>> --------------------------------------------------------------------------
>>>>>> -------------------------------------------------------
>>>>>> Primary job terminated normally, but 1 process returned
>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>> -------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun detected that one or more processes exited with non-zero status,
>>>>>> thus causing the job to be terminated. The first process to do so was:
>>>>>>
>>>>>>   Process name: [[7955,1],0]
>>>>>>   Exit code:    2
>>>>>> --------------------------------------------------------------------------
>>>>>> node0 $ echo $?
>>>>>> 2
>>>>>>
>>>>>>
>>>>>> <abort.patch>
>>>> <pmix.1.patch><pmix.2.patch>