On Aug 22, 2014, at 9:06 AM, Gilles Gouaillardet 
<gilles.gouaillar...@gmail.com> wrote:

> Ralph,
> 
> Will do on Monday
> 
> About the first test, in my case echo $? returns 0

My "showcode" is just an alias for echoing the exit status.
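
(The exact alias definition wasn't shown in the thread; functionally it is presumably something like the sketch below, written as a shell function since aliases don't expand in non-interactive shells:)

```shell
# Hypothetical stand-in for the "showcode" alias: print the exit
# status of the previously executed command.
showcode() { echo "$?"; }

false      # a command that exits with status 1
showcode   # prints: 1
```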

> I noticed this confusing message in your output :
> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
> signal 0 (Unknown signal 0).

I'll take a look at why that happened

> 
> About the second test, please note my test program ends with return 3;
> whereas your mpi_no_op.c ends with return 0;

I didn't see that little cuteness - sigh

> 
> Cheers,
> 
> Gilles
> 
> Ralph Castain <r...@open-mpi.org> wrote:
> You might want to try again with current head of trunk as something seems off 
> in what you are seeing - more below
> 
> 
> On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
> 
>> Ralph,
>> 
>> I tried again after the merge and found the same behaviour, though the
>> internals are very different.
>> 
>> I run without any batch manager.
>> 
>> From node0:
>> mpirun -np 1 --mca btl tcp,self -host node1 ./abort
>> 
>> mpirun exits with exit code zero :-(
> 
> Hmmm...it works fine for me, without your patch:
> 
> 07:35:41  $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
> Hello, World, I am 0 of 1
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
> with errorcode 2.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
> signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> 07:35:56  $ showcode
> 130
> 
>> 
>> short story: I applied pmix.2.patch and that fixed my problem.
>> Could you please review it?
>> 
>> long story:
>> I initially applied pmix.1.patch and it solved my problem.
>> Then I ran
>> mpirun -np 1 --mca btl openib,self -host node1 ./abort
>> and I was back to square one: the exit code was zero.
>> So I used the debugger and was unable to reproduce the issue
>> (one more race condition, yeah!).
>> Finally, I wrote pmix.2.patch, which fixed my issue, and realized that
>> pmix.1.patch was no longer needed.
>> Currently, and assuming pmix.2.patch is correct, I cannot tell whether
>> pmix.1.patch is needed or not,
>> since that part of the code is no longer executed.
>> 
>> I also found one hang with the following trivial program within one node:
>> 
>> #include <mpi.h>
>> 
>> int main (int argc, char *argv[]) {
>>     MPI_Init(&argc, &argv);
>>     MPI_Finalize();
>>     return 3;
>> }
>> 
>> from node0 :
>> $ mpirun -np 1 ./test
>> -------------------------------------------------------
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> 
>> AND THE PROGRAM HANGS
> 
> This also works fine for me:
> 
> 07:37:27  $ mpirun -n 1 ./mpi_no_op
> 07:37:36  $ cat mpi_no_op.c
> /* -*- C -*-
>  *
>  * $HEADER$
>  *
>  * The most basic of MPI applications
>  */
> 
> #include <stdio.h>
> #include "mpi.h"
> 
> int main(int argc, char* argv[])
> {
>     MPI_Init(&argc, &argv);
> 
>     MPI_Finalize();
>     return 0;
> }
> 
> 
>> 
>> *but*
>> $ mpirun -np 1 -host node1 ./test
>> -------------------------------------------------------
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing
>> the job to be terminated. The first process to do so was:
>> 
>>  Process name: [[22080,1],0]
>>  Exit code:    3
>> --------------------------------------------------------------------------
>> 
>> mpirun returns with exit code 3.
> 
> Likewise here - works just fine for me
> 
> 
>> 
>> Then I found a strange behaviour with helloworld if only the self btl is
>> used:
>> $ mpirun -np 1 --mca btl self ./hw
>> [helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
>> [helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
>> file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
>> line 722
>> 
>> The program returns with exit code zero, but displays an error message.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/08/21 6:21, Ralph Castain wrote:
>>> I'm aware of the problem, but it will be fixed when the PMIx branch is 
>>> merged later this week.
>>> 
>>> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@iferc.org> wrote:
>>> 
>>>> Folks,
>>>> 
>>>> Let's look at the following trivial test program:
>>>> 
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>> 
>>>> int main (int argc, char * argv[]) {
>>>>   int rank, size;
>>>>   MPI_Init(&argc, &argv);
>>>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>   printf ("I am %d/%d and i abort\n", rank, size);
>>>>   MPI_Abort(MPI_COMM_WORLD, 2);
>>>>   printf ("%d/%d aborted !\n", rank, size);
>>>>   return 3;
>>>> }
>>>> 
>>>> Let's run mpirun (trunk) on node0 and ask the MPI tasks to run on
>>>> node1.
>>>> With two tasks or more:
>>>> 
>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>> with errorcode 2.
>>>> 
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>> I am 1/2 and i abort
>>>> I am 0/2 and i abort
>>>> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
>>>> mpi-abort
>>>> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>> all help / error messages
>>>> 
>>>> node0 $ echo $?
>>>> 0
>>>> 
>>>> The exit status of mpirun is zero.
>>>> /* this is why the MPI_Errhandler_fatal_c test fails in mtt */
>>>> 
>>>> Now if we run only one task:
>>>> 
>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>>> I am 0/1 and i abort
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>> with errorcode 2.
>>>> 
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun has exited due to process rank 0 with PID 15884 on
>>>> node node1 exiting improperly. There are three reasons this could occur:
>>>> 
>>>> 1. this process did not call "init" before exiting, but others in
>>>> the job did. This can cause a job to hang indefinitely while it waits
>>>> for all processes to call "init". By rule, if one process calls "init",
>>>> then ALL processes must call "init" prior to termination.
>>>> 
>>>> 2. this process called "init", but exited without calling "finalize".
>>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>>> exiting or it will be considered an "abnormal termination"
>>>> 
>>>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>>>> orte_create_session_dirs is set to false. In this case, the run-time cannot
>>>> detect that the abort call was an abnormal termination. Hence, the only
>>>> error message you will receive is this one.
>>>> 
>>>> This may have caused other processes in the application to be
>>>> terminated by signals sent by mpirun (as reported here).
>>>> 
>>>> You can avoid this message by specifying -quiet on the mpirun command line.
>>>> 
>>>> --------------------------------------------------------------------------
>>>> node0 $ echo $?
>>>> 1
>>>> 
>>>> The program displayed a misleading error message and mpirun exited with
>>>> error code 1.
>>>> /* I would have expected 2, or 3 in the worst case scenario */
>>>> 
>>>> 
>>>> I dug into it a bit and found a kind of race condition in orted (running
>>>> on node1).
>>>> Basically, when the process dies, it writes stuff in the openmpi session
>>>> directory and exits.
>>>> Exiting sends a SIGCHLD to orted and closes the socket/pipe connected to
>>>> orted.
>>>> On orted, the loss of connection is generally processed by libevent before
>>>> the SIGCHLD,
>>>> and as a consequence, the exit code is not correctly set (i.e. it is
>>>> left at zero).
>>>> I did not see any kind of communication between the MPI task and orted
>>>> (except writing a file in the openmpi session directory) as I would have
>>>> expected.
>>>> /* but this was just my initial guess; the truth is I do not know what
>>>> is supposed to happen */
>>>> 
>>>> I wrote the attached abort.patch to basically get it working.
>>>> I highly suspect this is not the right thing to do, so I did not commit it.
>>>> 
>>>> It works fine with two tasks or more.
>>>> With only one task, mpirun displays a misleading error message but the
>>>> exit status is OK.
>>>> 
>>>> Could someone (Ralph?) have a look at this?
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> 
>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>>> I am 1/2 and i abort
>>>> I am 0/2 and i abort
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>> with errorcode 2.
>>>> 
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>> [node0:00920] 1 more process has sent help message help-mpi-api.txt /
>>>> mpi-abort
>>>> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>> all help / error messages
>>>> node0 $ echo $?
>>>> 2
>>>> 
>>>> 
>>>> 
>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>>> I am 0/1 and i abort
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>> with errorcode 2.
>>>> 
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>> -------------------------------------------------------
>>>> Primary job  terminated normally, but 1 process returned
>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun detected that one or more processes exited with non-zero status,
>>>> thus causing
>>>> the job to be terminated. The first process to do so was:
>>>> 
>>>> Process name: [[7955,1],0]
>>>> Exit code:    2
>>>> --------------------------------------------------------------------------
>>>> node0 $ echo $?
>>>> 2
>>>> 
>>>> 
>>>> 
>>>> <abort.patch>_______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/devel/2014/08/15666.php
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/08/15672.php
>> 
>> <pmix.1.patch><pmix.2.patch>_______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/08/15689.php
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15692.php
