Ralph,

I tried again after the merge and found the same behaviour, though the
internals are very different.

I run without any batch manager.

from node0:
$ mpirun -np 1 --mca btl tcp,self -host node1 ./abort

mpirun exits with exit code zero :-(

short story: I applied pmix.2.patch and that fixed my problem.
Could you please review it?

long story:
I initially applied pmix.1.patch and it solved my problem.
Then I ran
$ mpirun -np 1 --mca btl openib,self -host node1 ./abort
and was back to square one: the exit code was zero.
So I used a debugger and was unable to reproduce the issue
(one more race condition, yeah!).
Finally, I wrote pmix.2.patch, which fixed my issue, and realized that
pmix.1.patch was no longer needed.
Currently, and assuming pmix.2.patch is correct, I cannot tell whether
pmix.1.patch is still needed, since that part of the code is no longer
executed.

I also found a hang with the following trivial program within one node:

#include <mpi.h>

int main (int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 3;
}
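
(built with something like mpicc test.c -o test; mpirun should simply
reap the non-zero status and return 3, as the off-node run below does)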

from node0:
$ mpirun -np 1 ./test
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------

AND MPIRUN HANGS

*but*
$ mpirun -np 1 -host node1 ./test
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[22080,1],0]
  Exit code:    3
--------------------------------------------------------------------------

mpirun returns with exit code 3.

Then I found a strange behaviour with helloworld if only the self btl is
used:
$ mpirun -np 1 --mca btl self ./hw
[helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
[helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
line 722

The program returns with exit code zero but displays an error message.
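
For context on the errmgr hunk at the end of this mail: the patch treats
pptr->exit_code as a raw wait status, hence the WIFSIGNALED/WEXITSTATUS
decoding before ORTE_UPDATE_EXIT_STATUS. Here is a minimal standalone
sketch of that decoding (plain C, not OMPI code; the hard-coded exit(3)
just mimics ./test):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main (void) {
    pid_t pid = fork();
    if (pid == 0) {
        exit(3);                    /* child: mimic ./test returning 3 */
    }
    int status;
    waitpid(pid, &status, 0);
    /* same decode as in the errmgr hunk below */
    if (WIFSIGNALED(status)) {      /* died on signal */
        printf("killed by signal %d\n", WTERMSIG(status));
    } else {                        /* normal exit */
        printf("exit code %d\n", WEXITSTATUS(status));
    }
    return 0;
}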

Cheers,

Gilles

On 2014/08/21 6:21, Ralph Castain wrote:
> I'm aware of the problem, but it will be fixed when the PMIx branch is merged 
> later this week.
>
> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Folks,
>>
>> let's look at the following trivial test program:
>>
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main (int argc, char * argv[]) {
>>    int rank, size;
>>    MPI_Init(&argc, &argv);
>>    MPI_Comm_size(MPI_COMM_WORLD, &size);
>>    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>    printf ("I am %d/%d and i abort\n", rank, size);
>>    MPI_Abort(MPI_COMM_WORLD, 2);
>>    printf ("%d/%d aborted !\n", rank, size);
>>    return 3;
>> }
>>
>> and let's run mpirun (trunk) on node0 and ask the MPI tasks to run on
>> node1.
>> With two tasks or more:
>>
>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> I am 1/2 and i abort
>> I am 0/2 and i abort
>> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
>> mpi-abort
>> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>>
>> node0 $ echo $?
>> 0
>>
>> The exit status of mpirun is zero.
>> /* this is why the MPI_Errhandler_fatal_c test fails in mtt */
>>
>> Now, if we run only one task:
>>
>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>> I am 0/1 and i abort
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 15884 on
>> node node1 exiting improperly. There are three reasons this could occur:
>>
>> 1. this process did not call "init" before exiting, but others in
>> the job did. This can cause a job to hang indefinitely while it waits
>> for all processes to call "init". By rule, if one process calls "init",
>> then ALL processes must call "init" prior to termination.
>>
>> 2. this process called "init", but exited without calling "finalize".
>> By rule, all processes that call "init" MUST call "finalize" prior to
>> exiting or it will be considered an "abnormal termination"
>>
>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>> orte_create_session_dirs is set to false. In this case, the run-time cannot
>> detect that the abort call was an abnormal termination. Hence, the only
>> error message you will receive is this one.
>>
>> This may have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>>
>> You can avoid this message by specifying -quiet on the mpirun command line.
>>
>> --------------------------------------------------------------------------
>> node0 $ echo $?
>> 1
>>
>> The program displayed a misleading error message, and mpirun exited with
>> exit code 1.
>> /* I would have expected 2, or 3 in the worst-case scenario */
>>
>>
>> I dug into this a bit and found a kind of race condition in orted (running
>> on node1).
>> Basically, when the process dies, it writes stuff into the Open MPI session
>> directory and exits.
>> Exiting sends a SIGCHLD to orted and closes the socket/pipe connected to
>> orted.
>> In orted, the loss of the connection is generally processed by libevent
>> before the SIGCHLD,
>> and as a consequence the exit code is not correctly set (i.e. it is
>> left at zero).
>> I did not see any kind of communication between the MPI task and orted
>> (except writing a file into the Open MPI session directory), as I would
>> have expected.
>> /* but this was just my initial guess; the truth is I do not know what
>> is supposed to happen */
>>
>> I wrote the attached abort.patch to basically get things working.
>> I highly suspect this is not the right thing to do, so I did not commit it.
>>
>> It works fine with two tasks or more.
>> With only one task, mpirun displays a misleading error message, but the
>> exit status is OK.
>>
>> Could someone (Ralph?) have a look at this?
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>> I am 1/2 and i abort
>> I am 0/2 and i abort
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> [node0:00920] 1 more process has sent help message help-mpi-api.txt /
>> mpi-abort
>> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>> node0 $ echo $?
>> 2
>>
>>
>>
>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>> I am 0/1 and i abort
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> -------------------------------------------------------
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing
>> the job to be terminated. The first process to do so was:
>>
>>  Process name: [[7955,1],0]
>>  Exit code:    2
>> --------------------------------------------------------------------------
>> node0 $ echo $?
>> 2
>>
>>
>>
>> <abort.patch>

Index: orte/mca/errmgr/default_hnp/errmgr_default_hnp.c
===================================================================
--- orte/mca/errmgr/default_hnp/errmgr_default_hnp.c    (revision 32577)
+++ orte/mca/errmgr/default_hnp/errmgr_default_hnp.c    (working copy)
@@ -10,6 +10,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC.
  *                         All rights reserved.
  * Copyright (c) 2014      Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -522,13 +524,20 @@
                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                              ORTE_NAME_PRINT(proc), pptr->exit_code));
         if (!ORTE_FLAG_TEST(jdata, ORTE_JOB_FLAG_ABORTED)) {
+            int ret;
             jdata->state = ORTE_JOB_STATE_CALLED_ABORT;
             /* point to the first proc to cause the problem */
             orte_set_attribute(&jdata->attributes, ORTE_JOB_ABORTED_PROC, 
ORTE_ATTR_LOCAL, pptr, OPAL_PTR);
             /* retain the object so it doesn't get free'd */
             OBJ_RETAIN(pptr);
             ORTE_FLAG_SET(jdata, ORTE_JOB_FLAG_ABORTED);
-            ORTE_UPDATE_EXIT_STATUS(pptr->exit_code);
+            /* decode the pptr->exit_code */
+            if (WIFSIGNALED(pptr->exit_code)) { /* died on signal */
+                ret = WTERMSIG(pptr->exit_code);
+            } else {
+                ret = WEXITSTATUS(pptr->exit_code);
+            }
+            ORTE_UPDATE_EXIT_STATUS(ret);
             /* abnormal termination - abort, but only do it once
              * to avoid creating a lot of confusion */
             default_hnp_abort(jdata);
Index: orte/orted/pmix/pmix_server_sendrecv.c
===================================================================
--- orte/orted/pmix/pmix_server_sendrecv.c      (revision 32577)
+++ orte/orted/pmix/pmix_server_sendrecv.c      (working copy)
@@ -14,6 +14,8 @@
  * Copyright (c) 2009      Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2011      Oak Ridge National Labs.  All rights reserved.
  * Copyright (c) 2013-2014 Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -673,6 +675,7 @@
                 proc->exit_code = ret;
                 ORTE_FLAG_SET(proc, ORTE_PROC_FLAG_ABORT);
                 ORTE_UPDATE_EXIT_STATUS(ret);
+                ORTE_ACTIVATE_PROC_STATE(&proc->name, ORTE_PROC_STATE_ABORTED);
             }
         }
         /* we will let the ODLS report this to errmgr when the proc exits, so
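
/* note for reviewers: if I read my own race analysis right, the
pmix_server hunk activates the ABORTED proc state as soon as the abort
message is received, so the errmgr learns about the abort even when the
connection-closed event beats the SIGCHLD, and the errmgr hunk decodes
the raw wait status before it becomes mpirun's exit status */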
