Ralph,

There seem to be some problems after this commit. The hello_world application 
(the MPI flavor) completes and I get all the output, but in addition I get a 
nice message stating that my MPI application didn't call MPI_Init.

[bosilca@dancer c]$ mpirun -np 8 --mca pml ob1 ./hello
Hello, world, I am 5 of 8 on node04
Hello, world, I am 7 of 8 on node04
Hello, world, I am 0 of 8 on node03
Hello, world, I am 1 of 8 on node03
Hello, world, I am 3 of 8 on node03
Hello, world, I am 6 of 8 on node04
Hello, world, I am 2 of 8 on node03
Hello, world, I am 4 of 8 on node04
--------------------------------------------------------------------------
mpirun has exited due to process rank 6 with PID 15398 on
node node04 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
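
For completeness, the test is just a plain MPI hello world; a minimal 
equivalent looks like this (a sketch -- not necessarily the exact source 
of my ./hello):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank, size, len;
      char name[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Get_processor_name(name, &len);

      /* One line per rank, matching the output above */
      printf("Hello, world, I am %d of %d on %s\n", rank, size, name);

      MPI_Finalize();
      return 0;
  }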

  george.

On Dec 17, 2009, at 14:42, Ralph Castain wrote:

> Okay, this "feature" has now been added to the devel trunk with r22329.
> 
> Feel free to test it and let me know of problems. I have a test code for it 
> in orte/test/mpi/early_abort.c
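> 
> For anyone without the trunk handy, the scenario that test exercises is 
> roughly the following (a sketch under my assumptions -- see the actual 
> early_abort.c in the tree for the real test; the OMPI_COMM_WORLD_RANK 
> peek is an Open MPI implementation detail, not portable MPI):
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
> 
> int main(int argc, char *argv[])
> {
>     int rank, size;
> 
>     /* The rank isn't known through MPI before MPI_Init, so peek at
>        the environment that mpirun sets up for each process. */
>     char *r = getenv("OMPI_COMM_WORLD_RANK");
>     if (r != NULL && atoi(r) == 0) {
>         /* Rank 0 exits cleanly BEFORE calling MPI_Init -- the case
>            that used to hang the job. */
>         return 0;
>     }
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     printf("Rank %d of %d made it into MPI\n", rank, size);
>     MPI_Finalize();
>     return 0;
> }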
> 
> On Dec 16, 2009, at 11:27 AM, Jeff Squyres wrote:
> 
>> I think I understand what you're saying:
>> 
>> - it's ok to abort during MPI_INIT (we can rationalize it as invoking the 
>> default error handler)
>> - we should only abort during MPI functions
>> 
>> Is that right?  If so, I agree with your interpretation.  :-)  ...with one 
>> addition: it's ok to abort before MPI_INIT, because the MPI spec makes no 
>> guarantees about what happens before MPI_INIT.
>> 
>> Specifically, I'd argue that if you "mpirun -np N a.out" and at least 1 
>> process calls MPI_INIT, then it is reasonable for OMPI to expect there to be 
>> N MPI_INIT's.  If any process exits without calling MPI_INIT -- regardless 
>> of that process' exit status -- it should be treated as an error.
>> 
>> Don't forget that we have a barrier in MPI_INIT (in most cases), so aborting 
>> when ORTE detects that a) at least one process has called MPI_INIT, and b) 
>> at least one process has exited without calling MPI_INIT, is acceptable to 
>> me.  It's also consistent with the first point above, because all the other 
>> processes are either stuck in MPI_INIT (either at the barrier or getting 
>> there) or haven't yet entered MPI_INIT -- and the MPI spec makes no 
>> guarantees about what happens before MPI_INIT.
>> 
>> Does that make sense?
>> 
>> 
>> 
>> On Dec 16, 2009, at 10:06 AM, George Bosilca wrote:
>> 
>>> There are two citations from the MPI standard that I would like to highlight.
>>> 
>>>> All MPI programs must contain exactly one call to an MPI initialization 
>>>> routine: MPI_INIT or MPI_INIT_THREAD.
>>> 
>>>> One goal of MPI is to achieve source code portability. By this we mean 
>>>> that a program written using MPI and complying with the relevant language 
>>>> standards is portable as written, and must not require any source code 
>>>> changes when moved from one system to another. This explicitly does not 
>>>> say anything about how an MPI program is started or launched from the 
>>>> command line, nor what the user must do to set up the environment in which 
>>>> an MPI program will run. However, an implementation may require some setup 
>>>> to be performed before other MPI routines may be called. To provide for 
>>>> this, MPI includes an initialization routine MPI_INIT.
>>> 
>>> While these two statements do not necessarily clarify the original question, 
>>> they point toward an acceptable solution. Before returning from MPI_Init 
>>> (which we don't have to assume is collective), any "MPI-like" process can be 
>>> killed without problems (we can even claim that we invoke the default error 
>>> handler). For processes that have successfully returned from MPI_Init, I 
>>> guess the next MPI call will have to trigger the error handler, and these 
>>> processes should be allowed to exit gracefully.
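>>> 
>>> Concretely, a process that wants that graceful exit could do something 
>>> like this (a sketch, assuming the failure is indeed reported on the 
>>> next MPI call):
>>> 
>>> #include <stdlib.h>
>>> #include <mpi.h>
>>> 
>>> int main(int argc, char *argv[])
>>> {
>>>     MPI_Init(&argc, &argv);
>>> 
>>>     /* The default handler is MPI_ERRORS_ARE_FATAL; switch to
>>>        MPI_ERRORS_RETURN so we can react to the failure ourselves. */
>>>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>> 
>>>     if (MPI_Barrier(MPI_COMM_WORLD) != MPI_SUCCESS) {
>>>         /* A peer died before reaching this point: clean up locally
>>>            and exit gracefully instead of being killed. */
>>>         exit(1);
>>>     }
>>> 
>>>     MPI_Finalize();
>>>     return 0;
>>> }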
>>> 
>>> So, while it is clear that the best approach is to allow even a misbehaving 
>>> application to terminate, it is better if we follow what MPI describes as a 
>>> "high quality implementation".
>>> 
>>> george.
>>> 
>>> 
>>> On Dec 15, 2009, at 23:17, Ralph Castain wrote:
>>> 
>>>> Understandable - and we can count on your patch in the near future, then? 
>>>> :-)
>>>> 
>>>> On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
>>>> 
>>>>> My 0.02 USD says that for pragmatic reasons one should attempt to 
>>>>> terminate the job in this case, regardless of one's opinion of this 
>>>>> unusual application behavior.
>>>>> 
>>>>> -Paul
>>>>> 
>>>>> Ralph Castain wrote:
>>>>>> Hi folks
>>>>>> 
>>>>>> In case you didn't follow this on the user list, we had a question come 
>>>>>> up about proper OMPI behavior. Basically, the user has an application 
>>>>>> where one process decides it should cleanly terminate prior to calling 
>>>>>> MPI_Init, but all the others go ahead and enter MPI_Init. The 
>>>>>> application hangs since we don't detect the one proc's exit as an 
>>>>>> abnormal termination (no segfault, and it didn't call MPI_Init so it 
>>>>>> isn't required to call MPI_Finalize prior to termination).
>>>>>> 
>>>>>> I can probably come up with a way to detect this scenario and abort it. 
>>>>>> But before I spend the effort chasing this down, my question to you MPI 
>>>>>> folks is:
>>>>>> 
>>>>>> What -should- OMPI do in this situation? We have never previously 
>>>>>> detected such behavior - was this an oversight, or is this simply a 
>>>>>> "bad" application?
>>>>>> 
>>>>>> Thanks
>>>>>> Ralph
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>> Future Technologies Group                 Tel: +1-510-495-2352
>>>>> HPC Research Department                   Fax: +1-510-486-6900
>>>>> Lawrence Berkeley National Laboratory    
>>>> 
>>> 
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> 
> 

