I think I understand what you're saying:

- it's ok to abort during MPI_INIT (we can rationalize it as invoking the 
default error handler)
- we should only abort during MPI functions

Is that right?  If so, I agree with your interpretation.  :-)  ...with one 
addition: it's ok to abort before MPI_INIT, because the MPI spec makes no 
guarantees about what happens before MPI_INIT.

Specifically, I'd argue that if you "mpirun -np N a.out" and at least 1 process 
calls MPI_INIT, then it is reasonable for OMPI to expect there to be N 
MPI_INIT's.  If any process exits without calling MPI_INIT -- regardless of 
that process's exit status -- it should be treated as an error.
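
To make the failure mode concrete, here's a minimal reproducer (a sketch; it 
assumes OMPI exports OMPI_COMM_WORLD_RANK into each process's environment -- 
any pre-MPI_INIT condition would do just as well):

    /* Minimal reproducer sketch: rank 0 exits cleanly before MPI_Init,
     * all other ranks enter MPI_Init and hang in its barrier.
     * Run as: mpirun -np 4 ./a.out */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        /* Assumption: OMPI_COMM_WORLD_RANK is set in each process's
         * environment; any pre-MPI_Init decision would do here. */
        const char *r = getenv("OMPI_COMM_WORLD_RANK");
        if (r != NULL && atoi(r) == 0) {
            fprintf(stderr, "rank 0: clean exit before MPI_Init\n");
            return 0;   /* exit status 0 -- not detected as abnormal */
        }

        MPI_Init(&argc, &argv);   /* the remaining N-1 procs block here */
        MPI_Finalize();
        return 0;
    }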

Don't forget that we have a barrier in MPI_INIT (in most cases), so aborting 
when ORTE detects that a) at least one process has called MPI_INIT, and b) at 
least one process has exited without calling MPI_INIT, is acceptable to me.  
It's also consistent with the first point above, because all the other 
processes are either stuck in MPI_INIT (at the barrier or on the way to it) 
or haven't yet entered MPI_INIT -- and the MPI spec makes no guarantees about 
what happens before MPI_INIT.
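
In pseudocode, the check amounts to nothing more than this (a self-contained 
sketch with made-up names -- not actual ORTE code, just the condition spelled 
out):

    /* Sketch of the proposed rule: abort iff at least one proc has
     * reached MPI_INIT and at least one has exited without it,
     * regardless of the exit status. (Hypothetical types/names.) */
    #include <stdio.h>

    struct job_state {
        int procs_reached_mpi_init;     /* called MPI_INIT so far  */
        int procs_exited_without_init;  /* exited before MPI_INIT  */
    };

    static int should_abort_job(const struct job_state *job)
    {
        return job->procs_reached_mpi_init > 0 &&
               job->procs_exited_without_init > 0;
    }

    int main(void)
    {
        struct job_state job = { 3, 1 };  /* e.g. N=4: 3 in init, 1 gone */
        printf("abort job? %s\n", should_abort_job(&job) ? "yes" : "no");
        return 0;
    }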

Does that make sense?



On Dec 16, 2009, at 10:06 AM, George Bosilca wrote:

> There are two citations from the MPI standard that I would like to highlight.
> 
> > All MPI programs must contain exactly one call to an MPI initialization 
> > routine: MPI_INIT or MPI_INIT_THREAD.
> 
> > One goal of MPI is to achieve source code portability. By this we mean that 
> > a program written using MPI and complying with the relevant language 
> > standards is portable as written, and must not require any source code 
> > changes when moved from one system to another. This explicitly does not say 
> > anything about how an MPI program is started or launched from the command 
> > line, nor what the user must do to set up the environment in which an MPI 
> > program will run. However, an implementation may require some setup to be 
> > performed before other MPI routines may be called. To provide for this, MPI 
> > includes an initialization routine MPI_INIT.
> 
> While these two statements do not necessarily clarify the original question, 
> they point toward an acceptable solution. Before exiting the MPI_Init 
> function (which we don't have to assume is collective), any "MPI-like" 
> process can be killed without problems (we can even claim that we invoked 
> the default error handler). For those that have successfully exited 
> MPI_Init, I guess the next MPI call will have to trigger the error handler, 
> and those processes should be allowed to exit gracefully.
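> 
> From the application side, a minimal sketch of what that could look like 
> (plain MPI, nothing OMPI-specific; whether the next call really reports the 
> failure is up to the implementation):
> 
>     /* Sketch: after MPI_Init, replace the default MPI_ERRORS_ARE_FATAL
>      * handler so a failed call returns an error code instead of
>      * aborting, letting this process exit gracefully. */
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <mpi.h>
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> 
>         /* First MPI call after init; if a peer has died, the
>          * implementation may report the error here. */
>         if (MPI_Barrier(MPI_COMM_WORLD) != MPI_SUCCESS) {
>             fprintf(stderr, "peer failure detected, exiting cleanly\n");
>             exit(1);   /* local cleanup would go here */
>         }
> 
>         MPI_Finalize();
>         return 0;
>     }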
> 
> So, while it is clear that the best approach is to allow even a bad 
> application to terminate, it is better if we follow what MPI describes as a 
> "high quality implementation".
> 
>   george.
> 
> 
> On Dec 15, 2009, at 23:17, Ralph Castain wrote:
> 
> > Understandable - and we can count on your patch in the near future, then? 
> > :-)
> >
> > On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
> >
> >> My 0.02USD says that for pragmatic reasons one should attempt to terminate 
> >> the job in this case, regardless of one's opinion of this unusual 
> >> application behavior.
> >>
> >> -Paul
> >>
> >> Ralph Castain wrote:
> >>> Hi folks
> >>>
> >>> In case you didn't follow this on the user list, we had a question come 
> >>> up about proper OMPI behavior. Basically, the user has an application 
> >>> where one process decides it should cleanly terminate prior to calling 
> >>> MPI_Init, but all the others go ahead and enter MPI_Init. The application 
> >>> hangs since we don't detect the one proc's exit as an abnormal 
> >>> termination (no segfault, and it didn't call MPI_Init so it isn't 
> >>> required to call MPI_Finalize prior to termination).
> >>>
> >>> I can probably come up with a way to detect this scenario and abort it. 
> >>> But before I spend the effort chasing this down, my question to you MPI 
> >>> folks is:
> >>>
> >>> What -should- OMPI do in this situation? We have never previously 
> >>> detected such behavior - was this an oversight, or is this simply a "bad" 
> >>> application?
> >>>
> >>> Thanks
> >>> Ralph
> >>>
> >>>
> >>
> >>
> >> --
> >> Paul H. Hargrove                          phhargr...@lbl.gov
> >> Future Technologies Group                 Tel: +1-510-495-2352
> >> HPC Research Department                   Fax: +1-510-486-6900
> >> Lawrence Berkeley National Laboratory    
> >


-- 
Jeff Squyres
jsquy...@cisco.com

