Thanks, Ralph!

I confirm all my test cases pass now :-)

FYI, I committed r32592 in order to fix a parsing bug on 32-bit platforms
(hence the mtt failures on trunk on x86).
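
The actual change is in r32592; purely as a hypothetical illustration of
this class of bug (not the real fix), here is how parsing a 64-bit value
with a plain "long" silently goes wrong on platforms where long is 32 bits,
while the fixed-width <inttypes.h> routines stay correct:

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *s = "5000000000";  /* > 2^32: does not fit in a 32-bit long */
    long bad = strtol(s, NULL, 10);          /* clamped to LONG_MAX on 32-bit */
    uint64_t good = (uint64_t) strtoumax(s, NULL, 10);  /* 64-bit everywhere */
    printf("long: %ld  uint64_t: %" PRIu64 "\n", bad, good);
    return 0;
}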

Cheers,

Gilles


On 2014/08/23 4:59, Ralph Castain wrote:
> I think these are fixed now - at least, your test cases all pass for me
>
>
> On Aug 22, 2014, at 9:12 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> On Aug 22, 2014, at 9:06 AM, Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>>> Ralph,
>>>
>>> Will do on Monday
>>>
>>> About the first test, in my case echo $? returns 0
>> My "showcode" is just an alias for the echo
>>
>>> I noticed this confusing message in your output:
>>> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
>>> signal 0 (Unknown signal 0).
>> I'll take a look at why that happened
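>>
>> (For reference, a hypothetical illustration, not the actual ORTE code:
>> an "exited on signal 0" message typically shows up when a waitpid()
>> status is reported as a signal without first checking WIFSIGNALED(),
>> because WTERMSIG() yields 0 for a normal exit.)
>>
>> #include <sys/types.h>
>> #include <sys/wait.h>
>> #include <unistd.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(void) {
>>     pid_t pid = fork();
>>     if (pid == 0) exit(7);       /* child exits normally with code 7 */
>>     int status;
>>     waitpid(pid, &status, 0);
>>     /* buggy pattern: report the signal field unconditionally */
>>     printf("exited on signal %d\n", WTERMSIG(status));  /* prints 0 */
>>     /* correct pattern: check how the child terminated first */
>>     if (WIFSIGNALED(status))
>>         printf("killed by signal %d\n", WTERMSIG(status));
>>     else if (WIFEXITED(status))
>>         printf("exited with code %d\n", WEXITSTATUS(status));
>>     return 0;
>> }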
>>
>>> About the second test, please note my test program returns 3;
>>> whereas your mpi_no_op.c returns 0;
>> I didn't see that little cuteness - sigh
>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> Ralph Castain <r...@open-mpi.org> wrote:
>>> You might want to try again with current head of trunk as something seems 
>>> off in what you are seeing - more below
>>>
>>>
>>> On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>
>>>> Ralph,
>>>>
>>>> I tried again after the merge and found the same behaviour, though the
>>>> internals are very different.
>>>>
>>>> I run without any batch manager.
>>>>
>>>> from node0:
>>>> mpirun -np 1 --mca btl tcp,self -host node1 ./abort
>>>>
>>>> It exits with exit code zero :-(
>>> Hmmm...it works fine for me, without your patch:
>>>
>>> 07:35:41  $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
>>> Hello, World, I am 0 of 1
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
>>> with errorcode 2.
>>>
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
>>> signal 0 (Unknown signal 0).
>>> --------------------------------------------------------------------------
>>> 07:35:56  $ showcode
>>> 130
>>>
>>>> Short story: I applied pmix.2.patch and that fixed my problem.
>>>> Could you please review it?
>>>>
>>>> Long story:
>>>> I initially applied pmix.1.patch and it solved my problem.
>>>> Then I ran
>>>> mpirun -np 1 --mca btl openib,self -host node1 ./abort
>>>> and came back to square one: the exit code was zero.
>>>> So I used the debugger and was unable to reproduce the issue
>>>> (one more race condition, yeah!).
>>>> Finally, I wrote pmix.2.patch, which fixed my issue, and realized that
>>>> pmix.1.patch was no longer needed.
>>>> Currently, and assuming pmix.2.patch is correct, I cannot tell whether
>>>> pmix.1.patch is needed or not,
>>>> since that part of the code is no longer executed.
>>>>
>>>> I also found a hang with the following trivial program within one node:
>>>>
>>>> #include <mpi.h>
>>>>
>>>> int main (int argc, char *argv[]) {
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Finalize();
>>>>     return 3;
>>>> }
>>>>
>>>> from node0:
>>>> $ mpirun -np 1 ./test
>>>> -------------------------------------------------------
>>>> Primary job  terminated normally, but 1 process returned
>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>>
>>>> AND THE PROGRAM HANGS
>>> This also works fine for me:
>>>
>>> 07:37:27  $ mpirun -n 1 ./mpi_no_op
>>> 07:37:36  $ cat mpi_no_op.c
>>> /* -*- C -*-
>>>  *
>>>  * $HEADER$
>>>  *
>>>  * The most basic of MPI applications
>>>  */
>>>
>>> #include <stdio.h>
>>> #include "mpi.h"
>>>
>>> int main(int argc, char* argv[])
>>> {
>>>     MPI_Init(&argc, &argv);
>>>
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>>
>>>
>>>> *but*
>>>> $ mpirun -np 1 -host node1 ./test
>>>> -------------------------------------------------------
>>>> Primary job  terminated normally, but 1 process returned
>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun detected that one or more processes exited with non-zero status,
>>>> thus causing
>>>> the job to be terminated. The first process to do so was:
>>>>
>>>>  Process name: [[22080,1],0]
>>>>  Exit code:    3
>>>> --------------------------------------------------------------------------
>>>>
>>>> This returns with exit code 3.
>>> Likewise here - works just fine for me
>>>
>>>
>>>> Then I found a strange behaviour with helloworld if only the self btl is
>>>> used:
>>>> $ mpirun -np 1 --mca btl self ./hw
>>>> [helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
>>>> [helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
>>>> file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
>>>> line 722
>>>>
>>>> The program returns with exit code zero, but displays an error message.
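>>>>
>>>> (A generic sketch of this failure class -- not the OPAL dss API itself:
>>>> when every packed value carries a type tag, the unpack side must bail
>>>> out as soon as the next tag in the buffer is not the one the caller
>>>> expects, which is exactly the message above.)
>>>>
>>>> #include <stdio.h>
>>>> #include <string.h>
>>>>
>>>> enum { TYPE_INT32 = 3, TYPE_STRING = 12 };  /* tags mirror "3" / "12" */
>>>>
>>>> /* pack: write a one-byte type tag, then the payload */
>>>> static size_t pack_int32(unsigned char *buf, int v) {
>>>>     buf[0] = TYPE_INT32;
>>>>     memcpy(buf + 1, &v, sizeof v);
>>>>     return 1 + sizeof v;
>>>> }
>>>>
>>>> /* unpack: verify the tag before touching the payload */
>>>> static int unpack_int32(const unsigned char *buf, int *v) {
>>>>     if (buf[0] != TYPE_INT32) {
>>>>         fprintf(stderr, "unpack: got type %d when expecting type %d\n",
>>>>                 buf[0], TYPE_INT32);
>>>>         return -1;                          /* pack data mismatch */
>>>>     }
>>>>     memcpy(v, buf + 1, sizeof *v);
>>>>     return 0;
>>>> }
>>>>
>>>> int main(void) {
>>>>     unsigned char buf[16];
>>>>     pack_int32(buf, 42);
>>>>     buf[0] = TYPE_STRING;   /* simulate a sender that packed another type */
>>>>     int v;
>>>>     return unpack_int32(buf, &v) ? 1 : 0;   /* reports the mismatch */
>>>> }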
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/08/21 6:21, Ralph Castain wrote:
>>>>> I'm aware of the problem, but it will be fixed when the PMIx branch is 
>>>>> merged later this week.
>>>>>
>>>>> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet 
>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>
>>>>>> Folks,
>>>>>>
>>>>>> Let's look at the following trivial test program:
>>>>>>
>>>>>> #include <mpi.h>
>>>>>> #include <stdio.h>
>>>>>>
>>>>>> int main (int argc, char * argv[]) {
>>>>>>   int rank, size;
>>>>>>   MPI_Init(&argc, &argv);
>>>>>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>   printf ("I am %d/%d and i abort\n", rank, size);
>>>>>>   MPI_Abort(MPI_COMM_WORLD, 2);
>>>>>>   printf ("%d/%d aborted !\n", rank, size);
>>>>>>   return 3;
>>>>>> }
>>>>>>
>>>>>> and let's run mpirun (trunk) on node0, asking the MPI tasks to run on
>>>>>> node1.
>>>>>> With two tasks or more:
>>>>>>
>>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>>> with errorcode 2.
>>>>>>
>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>> You may or may not see output from other processes, depending on
>>>>>> exactly when Open MPI kills them.
>>>>>> --------------------------------------------------------------------------
>>>>>> I am 1/2 and i abort
>>>>>> I am 0/2 and i abort
>>>>>> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
>>>>>> mpi-abort
>>>>>> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>>>> all help / error messages
>>>>>>
>>>>>> node0 $ echo $?
>>>>>> 0
>>>>>>
>>>>>> The exit status of mpirun is zero.
>>>>>> /* this is why the MPI_Errhandler_fatal_c test fails in mtt */
>>>>>>
>>>>>> Now, if we run only one task:
>>>>>>
>>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>>>>> I am 0/1 and i abort
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>>> with errorcode 2.
>>>>>>
>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>> You may or may not see output from other processes, depending on
>>>>>> exactly when Open MPI kills them.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun has exited due to process rank 0 with PID 15884 on
>>>>>> node node1 exiting improperly. There are three reasons this could occur:
>>>>>>
>>>>>> 1. this process did not call "init" before exiting, but others in
>>>>>> the job did. This can cause a job to hang indefinitely while it waits
>>>>>> for all processes to call "init". By rule, if one process calls "init",
>>>>>> then ALL processes must call "init" prior to termination.
>>>>>>
>>>>>> 2. this process called "init", but exited without calling "finalize".
>>>>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>>>>> exiting or it will be considered an "abnormal termination"
>>>>>>
>>>>>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>>>>>> orte_create_session_dirs is set to false. In this case, the run-time 
>>>>>> cannot
>>>>>> detect that the abort call was an abnormal termination. Hence, the only
>>>>>> error message you will receive is this one.
>>>>>>
>>>>>> This may have caused other processes in the application to be
>>>>>> terminated by signals sent by mpirun (as reported here).
>>>>>>
>>>>>> You can avoid this message by specifying -quiet on the mpirun command 
>>>>>> line.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> node0 $ echo $?
>>>>>> 1
>>>>>>
>>>>>> The program displayed a misleading error message and mpirun exited with
>>>>>> error code 1.
>>>>>> /* I would have expected 2, or 3 in the worst-case scenario */
>>>>>>
>>>>>>
>>>>>> I dug into it a bit and found a kind of race condition in orted (running
>>>>>> on node1).
>>>>>> Basically, when the process dies, it writes stuff into the openmpi session
>>>>>> directory and exits.
>>>>>> Exiting sends a SIGCHLD to orted and closes the socket/pipe connected to
>>>>>> orted.
>>>>>> On orted, the loss of connection is generally processed before the
>>>>>> SIGCHLD by libevent,
>>>>>> and as a consequence the exit code is not correctly set (i.e. it is
>>>>>> left at zero).
>>>>>> I did not see any kind of communication between the MPI task and orted
>>>>>> (except writing a file in the openmpi session directory) as I would have
>>>>>> expected
>>>>>> /* but this was just my initial guess; the truth is I do not know what
>>>>>> is supposed to happen */
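>>>>>>
>>>>>> As a minimal libevent sketch of one possible fix (hypothetical names,
>>>>>> not the actual orted code): defer the "child terminated" decision
>>>>>> until BOTH the SIGCHLD and the connection-close events have been
>>>>>> processed, so the exit code is harvested whatever the delivery order:
>>>>>>
>>>>>> #include <event2/event.h>
>>>>>> #include <sys/types.h>
>>>>>> #include <sys/wait.h>
>>>>>> #include <signal.h>
>>>>>> #include <unistd.h>
>>>>>> #include <stdio.h>
>>>>>>
>>>>>> static struct event_base *base;
>>>>>> static struct event *pip;
>>>>>> static pid_t child;
>>>>>> static int exit_code = 0;
>>>>>> static int saw_sigchld = 0, saw_hangup = 0;
>>>>>>
>>>>>> static void maybe_done(void) {
>>>>>>     /* only declare the child dead once BOTH events have arrived */
>>>>>>     if (saw_sigchld && saw_hangup)
>>>>>>         event_base_loopbreak(base);
>>>>>> }
>>>>>>
>>>>>> static void on_sigchld(evutil_socket_t sig, short what, void *arg) {
>>>>>>     int status;
>>>>>>     (void)sig; (void)what; (void)arg;
>>>>>>     if (waitpid(child, &status, WNOHANG) == child) {
>>>>>>         if (WIFEXITED(status))
>>>>>>             exit_code = WEXITSTATUS(status);  /* harvest the real code */
>>>>>>         saw_sigchld = 1;
>>>>>>         maybe_done();
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> static void on_pipe(evutil_socket_t fd, short what, void *arg) {
>>>>>>     char buf[64];
>>>>>>     (void)what; (void)arg;
>>>>>>     if (read(fd, buf, sizeof(buf)) <= 0) {  /* EOF: child closed it */
>>>>>>         saw_hangup = 1;
>>>>>>         event_del(pip);          /* stop polling the dead connection */
>>>>>>         maybe_done();
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> int main(void) {
>>>>>>     int fds[2];
>>>>>>     if (pipe(fds) != 0) return 1;
>>>>>>     child = fork();
>>>>>>     if (child == 0) {            /* child: exit with code 3 */
>>>>>>         close(fds[0]);
>>>>>>         _exit(3);
>>>>>>     }
>>>>>>     close(fds[1]);
>>>>>>     base = event_base_new();
>>>>>>     struct event *sig = evsignal_new(base, SIGCHLD, on_sigchld, NULL);
>>>>>>     pip = event_new(base, fds[0], EV_READ | EV_PERSIST, on_pipe, NULL);
>>>>>>     event_add(sig, NULL);
>>>>>>     event_add(pip, NULL);
>>>>>>     event_base_dispatch(base);
>>>>>>     printf("child exit code: %d\n", exit_code);  /* 3, never 0 */
>>>>>>     return 0;
>>>>>> }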
>>>>>>
>>>>>> I wrote the attached abort.patch to basically get things working.
>>>>>> I highly suspect this is not the right thing to do, so I did not
>>>>>> commit it.
>>>>>>
>>>>>> It works fine with two tasks or more.
>>>>>> With only one task, mpirun displays a misleading error message but the
>>>>>> exit status is OK.
>>>>>>
>>>>>> Could someone (Ralph?) have a look at this?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>>
>>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>>>>> I am 1/2 and i abort
>>>>>> I am 0/2 and i abort
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>>> with errorcode 2.
>>>>>>
>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>> You may or may not see output from other processes, depending on
>>>>>> exactly when Open MPI kills them.
>>>>>> --------------------------------------------------------------------------
>>>>>> [node0:00920] 1 more process has sent help message help-mpi-api.txt /
>>>>>> mpi-abort
>>>>>> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>>>> all help / error messages
>>>>>> node0 $ echo $?
>>>>>> 2
>>>>>>
>>>>>>
>>>>>>
>>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>>>>> I am 0/1 and i abort
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>>> with errorcode 2.
>>>>>>
>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>> You may or may not see output from other processes, depending on
>>>>>> exactly when Open MPI kills them.
>>>>>> --------------------------------------------------------------------------
>>>>>> -------------------------------------------------------
>>>>>> Primary job  terminated normally, but 1 process returned
>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>> -------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun detected that one or more processes exited with non-zero status,
>>>>>> thus causing
>>>>>> the job to be terminated. The first process to do so was:
>>>>>>
>>>>>> Process name: [[7955,1],0]
>>>>>> Exit code:    2
>>>>>> --------------------------------------------------------------------------
>>>>>> node0 $ echo $?
>>>>>> 2
>>>>>>
>>>>>>
>>>>>>
>>>>>> <abort.patch>
>>>> <pmix.1.patch><pmix.2.patch>
