Well, that stinks! I'll take another look at it - it was working for me,
but...been there before!

On Dec 17, 2009, at 1:13 PM, George Bosilca wrote:

> Ralph,
> 
> There seem to be some problems after this commit. The hello_world
> application (the MPI flavor) completes and I get all the output, but in
> addition I get a nice message stating that my MPI application didn't call
> MPI_Init.
> 
> [bosilca@dancer c]$ mpirun -np 8 --mca pml ob1 ./hello
> Hello, world, I am 5 of 8 on node04
> Hello, world, I am 7 of 8 on node04
> Hello, world, I am 0 of 8 on node03
> Hello, world, I am 1 of 8 on node03
> Hello, world, I am 3 of 8 on node03
> Hello, world, I am 6 of 8 on node04
> Hello, world, I am 2 of 8 on node03
> Hello, world, I am 4 of 8 on node04
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 6 with PID 15398 on
> node node04 exiting improperly. There are two reasons this could occur:
> 
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
> 
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
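> 
> For reference, the hello I'm running is essentially the textbook MPI
> hello-world -- a sketch of it below (not necessarily the exact source,
> but note that it does call both MPI_Init and MPI_Finalize):
> 
>     /* hello.c -- minimal MPI hello-world sketch */
>     #include <stdio.h>
>     #include <mpi.h>
> 
>     int main(int argc, char **argv)
>     {
>         int rank, size, len;
>         char name[MPI_MAX_PROCESSOR_NAME];
> 
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>         MPI_Get_processor_name(name, &len);
>         printf("Hello, world, I am %d of %d on %s\n", rank, size, name);
>         MPI_Finalize();
>         return 0;
>     }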
> 
>  george.
> 
> On Dec 17, 2009, at 14:42 , Ralph Castain wrote:
> 
>> Okay, this "feature" has now been added to the devel trunk with r22329.
>> 
>> Feel free to test it and let me know of any problems. I have a test code for it 
>> in orte/test/mpi/early_abort.c
>> 
>> On Dec 16, 2009, at 11:27 AM, Jeff Squyres wrote:
>> 
>>> I think I understand what you're saying:
>>> 
>>> - it's ok to abort during MPI_INIT (we can rationalize it as the default 
>>> error handler)
>>> - we should only abort during MPI functions
>>> 
>>> Is that right?  If so, I agree with your interpretation.  :-)  ...with one 
>>> addition: it's ok to abort before MPI_INIT, because the MPI spec makes no 
>>> guarantees about what happens before MPI_INIT.
>>> 
>>> Specifically, I'd argue that if you "mpirun -np N a.out" and at least 1 
>>> process calls MPI_INIT, then it is reasonable for OMPI to expect there to 
>>> be N MPI_INIT's.  If any process exits without calling MPI_INIT -- 
>>> regardless of that process' exit status -- it should be treated as an error.
>>> 
>>> Don't forget that we have a barrier in MPI_INIT (in most cases), so 
>>> aborting when ORTE detects that a) at least one process has called 
>>> MPI_INIT, and b) at least one process has exited without calling MPI_INIT, 
>>> is acceptable to me.  It's also consistent with the first point above, 
>>> because all the other processes are either stuck in MPI_INIT (either at 
>>> the barrier or getting there) or haven't yet entered MPI_INIT -- and the 
>>> MPI spec makes no guarantees about what happens before MPI_INIT.
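>>> 
>>> In pseudo-C, the rule is just the following (purely hypothetical sketch;
>>> the names are made up and are not ORTE APIs):
>>> 
>>>     #include <stdbool.h>
>>> 
>>>     /* Hypothetical sketch of the proposed abort rule, not ORTE code. */
>>>     bool should_abort_job(bool any_proc_called_init,
>>>                           bool any_proc_exited_without_init)
>>>     {
>>>         /* If at least one proc has called MPI_INIT and at least one
>>>          * proc has exited without ever calling it, treat the job as
>>>          * erroneous and abort rather than hang at the init barrier. */
>>>         return any_proc_called_init && any_proc_exited_without_init;
>>>     }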
>>> 
>>> Does that make sense?
>>> 
>>> 
>>> 
>>> On Dec 16, 2009, at 10:06 AM, George Bosilca wrote:
>>> 
>>>> There are two citations from the MPI standard that I would like to 
>>>> highlight.
>>>> 
>>>>> All MPI programs must contain exactly one call to an MPI initialization 
>>>>> routine: MPI_INIT or MPI_INIT_THREAD.
>>>> 
>>>>> One goal of MPI is to achieve source code portability. By this we mean 
>>>>> that a program written using MPI and complying with the relevant language 
>>>>> standards is portable as written, and must not require any source code 
>>>>> changes when moved from one system to another. This explicitly does not 
>>>>> say anything about how an MPI program is started or launched from the 
>>>>> command line, nor what the user must do to set up the environment in 
>>>>> which an MPI program will run. However, an implementation may require 
>>>>> some setup to be performed before other MPI routines may be called. To 
>>>>> provide for this, MPI includes an initialization routine MPI_INIT.
>>>> 
>>>> While these two statements do not necessarily clarify the original 
>>>> question, they highlight an acceptable solution. Before exiting the 
>>>> MPI_Init function (which we don't have to assume is collective), any 
>>>> "MPI-like" process can be killed without problems (we can even claim that 
>>>> we call the default error handler). For those that successfully exited 
>>>> MPI_Init, I guess the next MPI call will have to trigger the error handler, 
>>>> and these processes should be allowed to gracefully exit.
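>>>> 
>>>> From the application side, a sketch of that path (the MPI calls are
>>>> standard, the control flow is only illustrative, and it assumes the
>>>> application replaces the default MPI_ERRORS_ARE_FATAL handler):
>>>> 
>>>>     #include <stdio.h>
>>>>     #include <stdlib.h>
>>>>     #include <mpi.h>
>>>> 
>>>>     int main(int argc, char **argv)
>>>>     {
>>>>         MPI_Init(&argc, &argv);
>>>> 
>>>>         /* Ask MPI to return error codes instead of aborting us. */
>>>>         MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>>> 
>>>>         /* A later MPI call is where a peer's early exit would surface
>>>>          * as an error, letting this process clean up and exit. */
>>>>         if (MPI_Barrier(MPI_COMM_WORLD) != MPI_SUCCESS) {
>>>>             fprintf(stderr, "peer failure detected, exiting\n");
>>>>             return EXIT_FAILURE;
>>>>         }
>>>> 
>>>>         MPI_Finalize();
>>>>         return 0;
>>>>     }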
>>>> 
>>>> So, while it is clear that the best approach is to allow even a bad 
>>>> application to terminate, it is better if we follow what MPI describes as 
>>>> a "high quality implementation".
>>>> 
>>>> george.
>>>> 
>>>> 
>>>> On Dec 15, 2009, at 23:17 , Ralph Castain wrote:
>>>> 
>>>>> Understandable - and we can count on your patch in the near future, then? 
>>>>> :-)
>>>>> 
>>>>> On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
>>>>> 
>>>>>> My 0.02USD says that for pragmatic reasons one should attempt to 
>>>>>> terminate the job in this case, regardless of one's opinion of this 
>>>>>> unusual application behavior.
>>>>>> 
>>>>>> -Paul
>>>>>> 
>>>>>> Ralph Castain wrote:
>>>>>>> Hi folks
>>>>>>> 
>>>>>>> In case you didn't follow this on the user list, we had a question come 
>>>>>>> up about proper OMPI behavior. Basically, the user has an application 
>>>>>>> where one process decides it should cleanly terminate prior to calling 
>>>>>>> MPI_Init, but all the others go ahead and enter MPI_Init. The 
>>>>>>> application hangs since we don't detect the one proc's exit as an 
>>>>>>> abnormal termination (no segfault, and it didn't call MPI_Init so it 
>>>>>>> isn't required to call MPI_Finalize prior to termination).
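>>>>>>> 
>>>>>>> In code, the pattern is roughly the following (a sketch, not the
>>>>>>> user's actual application; the precondition check is made up):
>>>>>>> 
>>>>>>>     #include <stdlib.h>
>>>>>>>     #include <mpi.h>
>>>>>>> 
>>>>>>>     int main(int argc, char **argv)
>>>>>>>     {
>>>>>>>         /* One process decides, based on some application-specific
>>>>>>>          * check (stubbed here with a made-up environment variable),
>>>>>>>          * to terminate cleanly before ever calling MPI_Init. */
>>>>>>>         if (getenv("SOME_LOCAL_PRECONDITION") == NULL)
>>>>>>>             exit(0);   /* clean exit: no MPI_Init, no MPI_Finalize */
>>>>>>> 
>>>>>>>         /* Everyone else enters MPI_Init and waits there, so the job
>>>>>>>          * hangs: the early exit is not detected as abnormal. */
>>>>>>>         MPI_Init(&argc, &argv);
>>>>>>>         MPI_Finalize();
>>>>>>>         return 0;
>>>>>>>     }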
>>>>>>> 
>>>>>>> I can probably come up with a way to detect this scenario and abort it. 
>>>>>>> But before I spend the effort chasing this down, my question to you MPI 
>>>>>>> folks is:
>>>>>>> 
>>>>>>> What -should- OMPI do in this situation? We have never previously 
>>>>>>> detected such behavior - was this an oversight, or is this simply a 
>>>>>>> "bad" application?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Ralph
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>>> Future Technologies Group                 Tel: +1-510-495-2352
>>>>>> HPC Research Department                   Fax: +1-510-486-6900
>>>>>> Lawrence Berkeley National Laboratory    
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> 
>>> 
>> 
>> 
> 
> 

