Makes perfect sense.
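
For reference, a minimal sketch of the scenario Ralph describes at the bottom 
of this thread (this is not the user's actual code; the pre-init rank check 
via OMPI_COMM_WORLD_RANK is just one launcher-specific way an application 
might decide to bail out early, assumed here for illustration):

    /* Minimal repro sketch -- not the user's actual code. One rank exits
     * "cleanly" before MPI_Init; every other rank enters MPI_Init and
     * waits for it, so the job hangs. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        /* Open MPI exports the rank in the environment before MPI_Init;
         * relying on this is launcher-specific and only for illustration. */
        const char *rank_str = getenv("OMPI_COMM_WORLD_RANK");

        if (rank_str != NULL && atoi(rank_str) == 0) {
            /* Exits with status 0, never calls MPI_Init, and is therefore
             * not required to call MPI_Finalize. */
            return EXIT_SUCCESS;
        }

        MPI_Init(&argc, &argv);
        printf("rank %s entered MPI_Init\n", rank_str ? rank_str : "?");
        MPI_Finalize();
        return EXIT_SUCCESS;
    }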

  george.

On Dec 16, 2009, at 13:27 , Jeff Squyres wrote:

> I think I understand what you're saying:
> 
> - it's ok to abort during MPI_INIT (we can rationalize it as invoking the 
> default error handler)
> - we should only abort during MPI functions
> 
> Is that right?  If so, I agree with your interpretation.  :-)  ...with one 
> addition: it's ok to abort before MPI_INIT, because the MPI spec makes no 
> guarantees about what happens before MPI_INIT.
> 
> Specifically, I'd argue that if you "mpirun -np N a.out" and at least 1 
> process calls MPI_INIT, then it is reasonable for OMPI to expect there to be 
> N MPI_INIT's.  If any process exits without calling MPI_INIT -- regardless of 
> that process' exit status -- it should be treated as an error.
> 
> Don't forget that we have a barrier in MPI_INIT (in most cases), so aborting 
> when ORTE detects that a) at least one process has called MPI_INIT, and b) at 
> least one process has exited without calling MPI_INIT, is acceptable to me.  
> It's also consistent with the first point above, because all the other 
> processes are either stuck in MPI_INIT (at the barrier or on their way to 
> it) or haven't yet entered MPI_INIT -- and the MPI spec makes no guarantees 
> about what happens before MPI_INIT.
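
A rough sketch of that rule -- the state tracking and names below are 
invented for illustration and are not ORTE's actual internals -- is: abort 
only once (a) some process has reached MPI_INIT and (b) some process has 
exited without reaching it.

    /* Hypothetical illustration of the abort rule above; the struct and
     * function names are invented, not ORTE code. */
    #include <stdbool.h>

    struct proc_state {
        bool reached_mpi_init;  /* process reported reaching MPI_Init */
        bool exited;            /* launcher has reaped the process */
        int  exit_status;       /* deliberately ignored for this decision */
    };

    static bool should_abort_job(const struct proc_state *procs, int nprocs)
    {
        bool someone_in_init = false;
        bool someone_exited_before_init = false;

        for (int i = 0; i < nprocs; i++) {
            if (procs[i].reached_mpi_init)
                someone_in_init = true;
            if (procs[i].exited && !procs[i].reached_mpi_init)
                someone_exited_before_init = true;
        }

        /* Requiring both conditions leaves purely non-MPI jobs (where no
         * process ever calls MPI_Init) untouched. */
        return someone_in_init && someone_exited_before_init;
    }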
> 
> Does that make sense?
> 
> 
> 
> On Dec 16, 2009, at 10:06 AM, George Bosilca wrote:
> 
>> There are two citations from the MPI standard that I would like to highlight.
>> 
>>> All MPI programs must contain exactly one call to an MPI initialization 
>>> routine: MPI_INIT or MPI_INIT_THREAD.
>> 
>>> One goal of MPI is to achieve source code portability. By this we mean that 
>>> a program written using MPI and complying with the relevant language 
>>> standards is portable as written, and must not require any source code 
>>> changes when moved from one system to another. This explicitly does not say 
>>> anything about how an MPI program is started or launched from the command 
>>> line, nor what the user must do to set up the environment in which an MPI 
>>> program will run. However, an implementation may require some setup to be 
>>> performed before other MPI routines may be called. To provide for this, MPI 
>>> includes an initialization routine MPI_INIT.
>> 
>> While these two statements do not necessarily clarify the original question, 
>> they highlight an acceptable solution. Before exiting the MPI_Init function 
>> (which we don't have to assume is collective), any "MPI-like" process can be 
>> killed without problems (we can even claim that we invoke the default error 
>> handler). For those that successfully exited MPI_Init, I guess the next MPI 
>> call will have to trigger the error handler, and these processes should be 
>> allowed to exit gracefully.
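
A small sketch of that second case, for ranks that did get through MPI_Init. 
Whether a peer dying before MPI_Init is actually surfaced as an error on the 
next call is implementation-dependent; with the default MPI_ERRORS_ARE_FATAL 
handler the failing call would simply abort, while installing 
MPI_ERRORS_RETURN lets the rank observe an error code instead:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Replace the default MPI_ERRORS_ARE_FATAL handler so errors come
         * back as return codes rather than aborting immediately. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rc = MPI_Barrier(MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len = 0;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "barrier failed: %s\n", msg);
        }

        MPI_Finalize();
        return 0;
    }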
>> 
>> So, while it is clear that the best approach is to let even a badly behaved 
>> application terminate, it is better if we follow what MPI describes as a 
>> "high quality implementation".
>> 
>>  george.
>> 
>> 
>> On Dec 15, 2009, at 23:17 , Ralph Castain wrote:
>> 
>>> Understandable - and we can count on your patch in the near future, then? 
>>> :-)
>>> 
>>> On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
>>> 
>>>> My 0.02USD says that for pragmatic reasons one should attempt to terminate 
>>>> the job in this case, regardless of one's opinion of this unusual 
>>>> application behavior.
>>>> 
>>>> -Paul
>>>> 
>>>> Ralph Castain wrote:
>>>>> Hi folks
>>>>> 
>>>>> In case you didn't follow this on the user list, we had a question come 
>>>>> up about proper OMPI behavior. Basically, the user has an application 
>>>>> where one process decides it should cleanly terminate prior to calling 
>>>>> MPI_Init, but all the others go ahead and enter MPI_Init. The application 
>>>>> hangs since we don't detect the one proc's exit as an abnormal 
>>>>> termination (no segfault, and it didn't call MPI_Init so it isn't 
>>>>> required to call MPI_Finalize prior to termination).
>>>>> 
>>>>> I can probably come up with a way to detect this scenario and abort it. 
>>>>> But before I spend the effort chasing this down, my question to you MPI 
>>>>> folks is:
>>>>> 
>>>>> What -should- OMPI do in this situation? We have never previously 
>>>>> detected such behavior - was this an oversight, or is this simply a "bad" 
>>>>> application?
>>>>> 
>>>>> Thanks
>>>>> Ralph
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>> Future Technologies Group                 Tel: +1-510-495-2352
>>>> HPC Research Department                   Fax: +1-510-486-6900
>>>> Lawrence Berkeley National Laboratory    
>>> 
>>> 
>> 
>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

