Ralph,
I tried again after the merge and found the same behaviour, though the
internals are very different.
I am running without any batch manager.
From node0:
mpirun -np 1 --mca btl tcp,self -host node1 ./abort
mpirun exits with exit code zero :-(
Short story: I applied pmix.2.patch and that fixed my problem.
Could you please review it?
Long story:
I initially applied pmix.1.patch and it solved my problem.
Then I ran
mpirun -np 1 --mca btl openib,self -host node1 ./abort
and I was back to square one: the exit code was zero.
So I used the debugger and was unable to reproduce the issue
(one more race condition, yeah!).
Finally, I wrote pmix.2.patch, which fixed my issue, and realized that
pmix.1.patch was no longer needed.
Currently, and assuming pmix.2.patch is correct, I cannot tell whether
pmix.1.patch is still needed, since that part of the code is no longer
executed.
I also found a hang with the following trivial program on a single node:
#include <mpi.h>

int main (int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 3;
}
From node0:
$ mpirun -np 1 ./test
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
AND THE PROGRAM HANGS
*but*
$ mpirun -np 1 -host node1 ./test
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:
Process name: [[22080,1],0]
Exit code: 3
--------------------------------------------------------------------------
and mpirun returns with exit code 3.
Then I found a strange behaviour with a hello world program if only the
self btl is used:
$ mpirun -np 1 --mca btl self ./hw
[helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
[helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
line 722
The program returns with exit code zero, but displays an error message.
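For reference, ./hw is nothing more than a plain MPI hello world; a minimal
sketch of the kind of program I mean (the exact source should not matter):

#include <mpi.h>
#include <stdio.h>

/* plain MPI hello world (a sketch; the exact source should not matter) */
int main (int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf ("Hello, I am %d/%d\n", rank, size);
    MPI_Finalize();
    return 0;
}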
Cheers,
Gilles
On 2014/08/21 6:21, Ralph Castain wrote:
> I'm aware of the problem, but it will be fixed when the PMIx branch is merged
> later this week.
>
> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet
> <[email protected]> wrote:
>
>> Folks,
>>
>> let's look at the following trivial test program :
>>
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main (int argc, char * argv[]) {
>> int rank, size;
>> MPI_Init(&argc, &argv);
>> MPI_Comm_size(MPI_COMM_WORLD, &size);
>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> printf ("I am %d/%d and i abort\n", rank, size);
>> MPI_Abort(MPI_COMM_WORLD, 2);
>> printf ("%d/%d aborted !\n", rank, size);
>> return 3;
>> }
>>
>> and let's run mpirun (trunk) on node0 and ask the MPI tasks to run on
>> node1.
>> With two tasks or more:
>>
>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> I am 1/2 and i abort
>> I am 0/2 and i abort
>> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
>> mpi-abort
>> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>>
>> node0 $ echo $?
>> 0
>>
>> the exit status of mpirun is zero
>> /* this is why the MPI_Errhandler_fatal_c test fails in mtt */
>>
>> now if we run only one task :
>>
>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>> I am 0/1 and i abort
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 15884 on
>> node node1 exiting improperly. There are three reasons this could occur:
>>
>> 1. this process did not call "init" before exiting, but others in
>> the job did. This can cause a job to hang indefinitely while it waits
>> for all processes to call "init". By rule, if one process calls "init",
>> then ALL processes must call "init" prior to termination.
>>
>> 2. this process called "init", but exited without calling "finalize".
>> By rule, all processes that call "init" MUST call "finalize" prior to
>> exiting or it will be considered an "abnormal termination"
>>
>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>> orte_create_session_dirs is set to false. In this case, the run-time cannot
>> detect that the abort call was an abnormal termination. Hence, the only
>> error message you will receive is this one.
>>
>> This may have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>>
>> You can avoid this message by specifying -quiet on the mpirun command line.
>>
>> --------------------------------------------------------------------------
>> node0 $ echo $?
>> 1
>>
>> the program displayed a misleading error message and mpirun exited with
>> error code 1
>> /* i would have expected 2, or 3 in the worst case scenario */
>>
>>
>> I dug into it a bit and found a kind of race condition in orted (running
>> on node1).
>> Basically, when the process dies, it writes its state into the Open MPI
>> session directory and exits.
>> Exiting sends a SIGCHLD to orted and closes the socket/pipe connected to
>> orted.
>> On orted, the loss of the connection is generally processed by libevent
>> before the SIGCHLD, and as a consequence the exit code is not correctly
>> set (i.e. it is left at zero).
>> I did not see any kind of communication between the MPI task and orted
>> (other than writing a file into the session directory), which I would
>> have expected
>> /* but that was just my initial guess; the truth is I do not know what
>> is supposed to happen */
>>
>> I wrote the attached abort.patch to basically get things working.
>> I highly suspect this is not the right thing to do, so I did not commit it.
>>
>> It works fine with two tasks or more.
>> With only one task, mpirun displays a misleading error message, but the
>> exit status is OK.
>>
>> could someone (Ralph ?) have a look at this ?
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>> I am 1/2 and i abort
>> I am 0/2 and i abort
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> [node0:00920] 1 more process has sent help message help-mpi-api.txt /
>> mpi-abort
>> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>> node0 $ echo $?
>> 2
>>
>>
>>
>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>> I am 0/1 and i abort
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing
>> the job to be terminated. The first process to do so was:
>>
>> Process name: [[7955,1],0]
>> Exit code: 2
>> --------------------------------------------------------------------------
>> node0 $ echo $?
>> 2
>>
>>
>>
>> <abort.patch>
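To illustrate the race I described above (the loss of the connection being
processed by libevent before the SIGCHLD), here is a small standalone sketch,
completely independent of the orted code, that just sets up the two events
the same way; it shows the mechanism only, not the actual orted internals:

#include <event2/event.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_eof (evutil_socket_t fd, short what, void *arg)
{
    /* the child closed its end of the socketpair */
    printf("socket EOF processed\n");
}

static void on_sigchld (evutil_socket_t sig, short what, void *arg)
{
    int status = 0;
    waitpid(-1, &status, WNOHANG);
    printf("SIGCHLD processed, child exit code %d\n", WEXITSTATUS(status));
    event_base_loopexit((struct event_base *)arg, NULL);
}

int main (void)
{
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

    /* register both events *before* forking so the SIGCHLD cannot be missed */
    struct event_base *base = event_base_new();
    struct event *ev_read = event_new(base, sv[0], EV_READ, on_eof, base);
    struct event *ev_sig  = evsignal_new(base, SIGCHLD, on_sigchld, base);
    event_add(ev_read, NULL);
    event_add(ev_sig, NULL);

    pid_t pid = fork();
    if (pid == 0) {
        /* child: mimic the MPI task dying; close the socket and exit */
        close(sv[0]);
        close(sv[1]);
        exit(3);
    }
    close(sv[1]);               /* parent keeps only its own end */

    /* typically the EOF callback fires before the SIGCHLD one */
    event_base_dispatch(base);
    return 0;
}

If the exit code is only set from the SIGCHLD path, the EOF being processed
first is exactly the window in which it can be lost.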
Index: orte/mca/errmgr/default_hnp/errmgr_default_hnp.c
===================================================================
--- orte/mca/errmgr/default_hnp/errmgr_default_hnp.c (revision 32577)
+++ orte/mca/errmgr/default_hnp/errmgr_default_hnp.c (working copy)
@@ -10,6 +10,8 @@
* Copyright (c) 2011-2013 Los Alamos National Security, LLC.
* All rights reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -522,13 +524,20 @@
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(proc), pptr->exit_code));
if (!ORTE_FLAG_TEST(jdata, ORTE_JOB_FLAG_ABORTED)) {
+ int ret;
jdata->state = ORTE_JOB_STATE_CALLED_ABORT;
/* point to the first proc to cause the problem */
orte_set_attribute(&jdata->attributes, ORTE_JOB_ABORTED_PROC,
ORTE_ATTR_LOCAL, pptr, OPAL_PTR);
/* retain the object so it doesn't get free'd */
OBJ_RETAIN(pptr);
ORTE_FLAG_SET(jdata, ORTE_JOB_FLAG_ABORTED);
- ORTE_UPDATE_EXIT_STATUS(pptr->exit_code);
+ /* decode the pptr->exit_code */
+ if (WIFSIGNALED(pptr->exit_code)) { /* died on signal */
+ ret = WTERMSIG(pptr->exit_code);
+ } else {
+ ret = WEXITSTATUS(pptr->exit_code);
+ }
+ ORTE_UPDATE_EXIT_STATUS(ret);
/* abnormal termination - abort, but only do it once
* to avoid creating a lot of confusion */
default_hnp_abort(jdata);
Index: orte/orted/pmix/pmix_server_sendrecv.c
===================================================================
--- orte/orted/pmix/pmix_server_sendrecv.c (revision 32577)
+++ orte/orted/pmix/pmix_server_sendrecv.c (working copy)
@@ -14,6 +14,8 @@
* Copyright (c) 2009 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2011 Oak Ridge National Labs. All rights reserved.
* Copyright (c) 2013-2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -673,6 +675,7 @@
proc->exit_code = ret;
ORTE_FLAG_SET(proc, ORTE_PROC_FLAG_ABORT);
ORTE_UPDATE_EXIT_STATUS(ret);
+ ORTE_ACTIVATE_PROC_STATE(&proc->name, ORTE_PROC_STATE_ABORTED);
}
}
/* we will let the ODLS report this to errmgr when the proc exits, so
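FWIW, the exit_code decoding added in the errmgr_default_hnp.c hunk follows
the standard wait status convention (this assumes pptr->exit_code really holds
a raw wait status at that point); a tiny standalone illustration of that
decoding, unrelated to the patch itself:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main (void)
{
    pid_t pid = fork();
    if (pid == 0) {
        exit(3);               /* child: non-zero exit code, like ./test above */
    }

    int status = 0;
    waitpid(pid, &status, 0);

    /* same decoding as in the errmgr hunk: signal number if the child was
     * killed by a signal, plain exit code otherwise */
    if (WIFSIGNALED(status)) {
        printf("child died on signal %d\n", WTERMSIG(status));
    } else {
        printf("child exited with code %d\n", WEXITSTATUS(status));
    }
    return 0;
}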