You might want to try again with the current head of trunk, as something seems 
off in what you are seeing - more below.


On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet 
<gilles.gouaillar...@iferc.org> wrote:

> Ralph,
> 
> I tried again after the merge and found the same behaviour, though the
> internals are very different.
> 
> I ran without any batch manager.
> 
> From node0:
> mpirun -np 1 --mca btl tcp,self -host node1 ./abort
> 
> exits with exit code zero :-(

Hmmm...it works fine for me, without your patch:

07:35:41  $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
Hello, World, I am 0 of 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 2.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
07:35:56  $ showcode
130

> 
> Short story: I applied pmix.2.patch and that fixed my problem.
> Could you please review it?
> 
> Long story:
> I initially applied pmix.1.patch and it solved my problem.
> Then I ran
> mpirun -np 1 --mca btl openib,self -host node1 ./abort
> and I was back to square one: the exit code was zero.
> So I used the debugger and was unable to reproduce the issue
> (one more race condition, yeah!).
> Finally, I wrote pmix.2.patch, which fixed my issue, and realized that
> pmix.1.patch was no longer needed.
> Currently, and assuming pmix.2.patch is correct, I cannot tell whether
> pmix.1.patch is needed or not,
> since that part of the code is no longer executed.
> 
> I also found a hang with the following trivial program within one node:
> 
> #include <mpi.h>
> 
> int main (int argc, char *argv[]) {
>     MPI_Init(&argc, &argv);
>     MPI_Finalize();
>     return 3;
> }
> 
> From node0:
> $ mpirun -np 1 ./test
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> 
> AND THE PROGRAM HANGS

This also works fine for me:

07:37:27  $ mpirun -n 1 ./mpi_no_op
07:37:36  $ cat mpi_no_op.c
/* -*- C -*-
 *
 * $HEADER$
 *
 * The most basic of MPI applications
 */

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Finalize();
    return 0;
}


> 
> *but*
> $ mpirun -np 1 -host node1 ./test
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
> 
>  Process name: [[22080,1],0]
>  Exit code:    3
> --------------------------------------------------------------------------
> 
> It returns with exit code 3.

Likewise here - it works just fine for me.


> 
> Then I found a strange behaviour with helloworld if only the self btl is
> used:
> $ mpirun -np 1 --mca btl self ./hw
> [helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
> [helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
> file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
> line 722
> 
> The program returns with exit code zero, but displays an error message.
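
For anyone unfamiliar with that error: the dss layer stores a type tag
alongside each packed datum, and unpack checks that tag against the type
the caller asked for. Here is a minimal, self-contained illustration of
the idea - NOT the OPAL code, and the tag values below are made up - that
produces the same kind of message when the two sides disagree about the
message layout:

#include <stdio.h>
#include <string.h>

enum { TYPE_INT32 = 3, TYPE_BLOB = 12 };   /* illustrative tag values only */

static unsigned char buf[64];
static size_t pos;

static void pack_int32(int v, int tag)
{
    buf[pos++] = (unsigned char) tag;      /* type tag first */
    memcpy(buf + pos, &v, sizeof v);       /* then the payload */
    pos += sizeof v;
}

static int unpack_int32(int *v, int expected)
{
    int tag = buf[pos++];
    if (tag != expected) {                 /* the "Pack data mismatch" case */
        fprintf(stderr, "unpack: got type %d when expecting type %d\n",
                tag, expected);
        return 1;
    }
    memcpy(v, buf + pos, sizeof *v);
    pos += sizeof *v;
    return 0;
}

int main(void)
{
    int out;
    pack_int32(42, TYPE_BLOB);             /* one side packs one type...   */
    pos = 0;                               /* rewind, as a receiver would  */
    return unpack_int32(&out, TYPE_INT32); /* ...the other expects another */
}

So a report like the one above usually means the two peers disagree about
what was packed at that point in the exchange, not that the bytes were
corrupted in flight.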
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/21 6:21, Ralph Castain wrote:
>> I'm aware of the problem, but it will be fixed when the PMIx branch is 
>> merged later this week.
>> 
>> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet 
>> <gilles.gouaillar...@iferc.org> wrote:
>> 
>>> Folks,
>>> 
>>> Let's look at the following trivial test program:
>>> 
>>> #include <mpi.h>
>>> #include <stdio.h>
>>> 
>>> int main (int argc, char * argv[]) {
>>>   int rank, size;
>>>   MPI_Init(&argc, &argv);
>>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>   printf ("I am %d/%d and i abort\n", rank, size);
>>>   MPI_Abort(MPI_COMM_WORLD, 2);
>>>   printf ("%d/%d aborted !\n", rank, size);
>>>   return 3;
>>> }
>>> 
>>> Let's run mpirun (trunk) on node0 and ask the MPI tasks to run on
>>> node1.
>>> With two tasks or more:
>>> 
>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 2.
>>> 
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> I am 1/2 and i abort
>>> I am 0/2 and i abort
>>> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
>>> mpi-abort
>>> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>> all help / error messages
>>> 
>>> node0 $ echo $?
>>> 0
>>> 
>>> The exit status of mpirun is zero.
>>> /* This is why the MPI_Errhandler_fatal_c test fails in MTT. */
>>> 
>>> Now, if we run only one task:
>>> 
>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>> I am 0/1 and i abort
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 2.
>>> 
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun has exited due to process rank 0 with PID 15884 on
>>> node node1 exiting improperly. There are three reasons this could occur:
>>> 
>>> 1. this process did not call "init" before exiting, but others in
>>> the job did. This can cause a job to hang indefinitely while it waits
>>> for all processes to call "init". By rule, if one process calls "init",
>>> then ALL processes must call "init" prior to termination.
>>> 
>>> 2. this process called "init", but exited without calling "finalize".
>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>> exiting or it will be considered an "abnormal termination"
>>> 
>>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>>> orte_create_session_dirs is set to false. In this case, the run-time cannot
>>> detect that the abort call was an abnormal termination. Hence, the only
>>> error message you will receive is this one.
>>> 
>>> This may have caused other processes in the application to be
>>> terminated by signals sent by mpirun (as reported here).
>>> 
>>> You can avoid this message by specifying -quiet on the mpirun command line.
>>> 
>>> --------------------------------------------------------------------------
>>> node0 $ echo $?
>>> 1
>>> 
>>> The program displayed a misleading error message, and mpirun exited with
>>> error code 1.
>>> /* I would have expected 2, or 3 in the worst-case scenario. */
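
As an aside, the propagation Gilles expects can be sketched in a few
lines. This is NOT mpirun's actual logic, just the convention under
discussion: the launcher reaps the child and forwards its exit status,
mapping death-by-signal to the usual 128+signo shell convention.

#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return 1;                        /* fork failed */
    if (pid == 0) {
        execl("./abort", "./abort", (char *) NULL);  /* the MPI task */
        _exit(127);                      /* exec failed */
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        return WEXITSTATUS(status);      /* e.g. 2 from MPI_Abort, or 3 */
    return 128 + WTERMSIG(status);       /* child was killed by a signal */
}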
>>> 
>>> 
>>> I dug into this a bit and found a kind of race condition in orted (running
>>> on node1).
>>> Basically, when the process dies, it writes data into the Open MPI session
>>> directory and exits.
>>> Exiting sends a SIGCHLD to orted and closes the socket/pipe connected to
>>> orted.
>>> On orted, the loss of the connection is generally processed by libevent
>>> before the SIGCHLD,
>>> and as a consequence the exit code is not correctly set (i.e. it is
>>> left at zero).
>>> I did not see any kind of communication between the MPI task and orted
>>> (except writing a file in the Open MPI session directory) as I would have
>>> expected.
>>> /* But this was just my initial guess; the truth is I do not know what
>>> is supposed to happen. */
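
The ordering Gilles describes is easy to reproduce outside of ORTE. Below
is a minimal sketch of one way out - NOT the orted code, just the pattern:
treat the lost connection as a hint only, and don't finalize the child's
state until waitpid() has returned the authoritative exit status, i.e.
wait for BOTH events regardless of which one the event loop delivers first.

#include <poll.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int pipefd[2];
    if (pipe(pipefd) != 0)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {                     /* child: drop the pipe, exit 3 */
        close(pipefd[0]);
        close(pipefd[1]);
        _exit(3);
    }
    close(pipefd[1]);

    struct pollfd pfd = { .fd = pipefd[0], .events = POLLIN };
    int eof_seen = 0, reaped = 0, status = 0;

    while (!eof_seen || !reaped) {
        if (!eof_seen && poll(&pfd, 1, 10) > 0) {
            char c;
            if (read(pipefd[0], &c, 1) == 0)
                eof_seen = 1;           /* connection lost: a hint, no more */
        }
        if (!reaped && waitpid(pid, &status, WNOHANG) == pid)
            reaped = 1;                 /* only now is the exit code known */
    }
    printf("child exit code: %d\n", WEXITSTATUS(status));  /* always 3 */
    return 0;
}

Either ordering of the EOF and the SIGCHLD then yields the same recorded
exit code.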
>>> 
>>> I wrote the attached abort.patch to basically get things working.
>>> I highly suspect this is not the right thing to do, so I did not commit it.
>>> 
>>> It works fine with two tasks or more.
>>> With only one task, mpirun displays a misleading error message, but the
>>> exit status is correct.
>>> 
>>> Could someone (Ralph?) have a look at this?
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> 
>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>> I am 1/2 and i abort
>>> I am 0/2 and i abort
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 2.
>>> 
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> [node0:00920] 1 more process has sent help message help-mpi-api.txt /
>>> mpi-abort
>>> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>> all help / error messages
>>> node0 $ echo $?
>>> 2
>>> 
>>> 
>>> 
>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>> I am 0/1 and i abort
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 2.
>>> 
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> -------------------------------------------------------
>>> Primary job  terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun detected that one or more processes exited with non-zero status,
>>> thus causing
>>> the job to be terminated. The first process to do so was:
>>> 
>>> Process name: [[7955,1],0]
>>> Exit code:    2
>>> --------------------------------------------------------------------------
>>> node0 $ echo $?
>>> 2
>>> 
>>> 
>>> 
>>> <abort.patch>
> 
> <pmix.1.patch><pmix.2.patch>
