I think I understand you're saying:

- it's ok to abort during MPI_INIT (we can rationalize it as the default error handler)
- we should only abort during MPI functions
Is that right? If so, I agree with your interpretation. :-)

...with one addition: it's ok to abort before MPI_INIT, because the MPI spec makes no guarantees about what happens before MPI_INIT.

Specifically, I'd argue that if you "mpirun -np N a.out" and at least 1 process calls MPI_INIT, then it is reasonable for OMPI to expect there to be N MPI_INIT's. If any process exits without calling MPI_INIT -- regardless of that process's exit status -- it should be treated as an error.

Don't forget that we have a barrier in MPI_INIT (in most cases), so aborting when ORTE detects that a) at least one process has called MPI_INIT, and b) at least one process has exited without calling MPI_INIT, is acceptable to me. It's also consistent with the first point above, because all the other processes are either stuck in MPI_INIT (either at the barrier or getting there) or haven't yet entered MPI_INIT -- and the MPI spec makes no guarantees about what happens before MPI_INIT.

Does that make sense?

On Dec 16, 2009, at 10:06 AM, George Bosilca wrote:

> There are two citations from the MPI standard that I would like to highlight.
>
> > All MPI programs must contain exactly one call to an MPI initialization routine: MPI_INIT or MPI_INIT_THREAD.
>
> > One goal of MPI is to achieve source code portability. By this we mean that a program written using MPI and complying with the relevant language standards is portable as written, and must not require any source code changes when moved from one system to another. This explicitly does not say anything about how an MPI program is started or launched from the command line, nor what the user must do to set up the environment in which an MPI program will run. However, an implementation may require some setup to be performed before other MPI routines may be called. To provide for this, MPI includes an initialization routine MPI_INIT.
>
> While these two statements do not necessarily clarify the original question, they highlight an acceptable solution. Before exiting the MPI_Init function (which we don't have to assume is collective), any "MPI-like" process can be killed without problems (we can even claim that we call the default error handler). For those that successfully exited MPI_Init, I guess the next MPI call will have to trigger the error handler, and those processes should be allowed to gracefully exit.
>
> So, while it is clear that the best approach is to allow even a bad application to terminate, it is better if we follow what MPI describes as a "high quality implementation".
>
>   george.
>
> On Dec 15, 2009, at 23:17 , Ralph Castain wrote:
>
> > Understandable - and we can count on your patch in the near future, then? :-)
> >
> > On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
> >
> >> My 0.02USD says that for pragmatic reasons one should attempt to terminate the job in this case, regardless of one's opinion of this unusual application behavior.
> >>
> >> -Paul
> >>
> >> Ralph Castain wrote:
> >>> Hi folks
> >>>
> >>> In case you didn't follow this on the user list, we had a question come up about proper OMPI behavior. Basically, the user has an application where one process decides it should cleanly terminate prior to calling MPI_Init, but all the others go ahead and enter MPI_Init.
> >>> The application hangs since we don't detect the one proc's exit as an abnormal termination (no segfault, and it didn't call MPI_Init so it isn't required to call MPI_Finalize prior to termination).
> >>>
> >>> I can probably come up with a way to detect this scenario and abort it. But before I spend the effort chasing this down, my question to you MPI folks is:
> >>>
> >>> What -should- OMPI do in this situation? We have never previously detected such behavior - was this an oversight, or is this simply a "bad" application?
> >>>
> >>> Thanks
> >>> Ralph
> >>
> >> --
> >> Paul H. Hargrove                        phhargr...@lbl.gov
> >> Future Technologies Group               Tel: +1-510-495-2352
> >> HPC Research Department                 Fax: +1-510-486-6900
> >> Lawrence Berkeley National Laboratory

-- 
Jeff Squyres
jsquy...@cisco.com
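
For reference, here is a minimal sketch (not taken from the thread) of the kind of application Ralph describes: one process exits cleanly before MPI_Init while the others enter MPI_Init and block in its barrier. Selecting the rank before MPI_Init via the OMPI_COMM_WORLD_RANK environment variable is an Open MPI-specific assumption; any other pre-MPI_Init decision would trigger the same behavior. Launch with something like "mpirun -np 4 ./a.out".

/* Hypothetical reproducer of the reported scenario; not part of the thread. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Decide before MPI_Init whether this process participates.
     * OMPI_COMM_WORLD_RANK is assumed here as an Open MPI-specific
     * environment variable; the decision could come from anywhere. */
    const char *rank_str = getenv("OMPI_COMM_WORLD_RANK");
    int rank = rank_str ? atoi(rank_str) : 0;

    if (rank == 0) {
        /* "Clean" termination before MPI_Init: exit status 0, no segfault,
         * and no obligation to call MPI_Finalize. */
        printf("rank %d: exiting before MPI_Init\n", rank);
        return EXIT_SUCCESS;
    }

    /* The remaining processes enter MPI_Init and, in most builds, block
     * in its barrier because rank 0 never arrives. */
    MPI_Init(&argc, &argv);
    printf("rank %d: past MPI_Init\n", rank);
    MPI_Finalize();
    return EXIT_SUCCESS;
}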