Makes perfect sense.

george.
On Dec 16, 2009, at 13:27, Jeff Squyres wrote:

> I think I understand what you're saying:
>
> - it's ok to abort during MPI_INIT (we can rationalize it as the default
> error handler)
> - we should only abort during MPI functions
>
> Is that right? If so, I agree with your interpretation. :-) ...with one
> addition: it's ok to abort before MPI_INIT, because the MPI spec makes no
> guarantees about what happens before MPI_INIT.
>
> Specifically, I'd argue that if you "mpirun -np N a.out" and at least 1
> process calls MPI_INIT, then it is reasonable for OMPI to expect there to be
> N MPI_INITs. If any process exits without calling MPI_INIT -- regardless of
> that process' exit status -- it should be treated as an error.
>
> Don't forget that we have a barrier in MPI_INIT (in most cases), so aborting
> when ORTE detects that a) at least one process has called MPI_INIT, and b) at
> least one process has exited without calling MPI_INIT, is acceptable to me.
> It's also acceptable to the first point above, because all the other
> processes are either stuck in MPI_INIT (either at the barrier or getting
> there) or haven't yet entered MPI_INIT -- and the MPI spec makes no
> guarantees about what happens before MPI_INIT.
>
> Does that make sense?
>
> On Dec 16, 2009, at 10:06 AM, George Bosilca wrote:
>
>> There are two citations from the MPI standard that I would like to
>> highlight.
>>
>>> All MPI programs must contain exactly one call to an MPI initialization
>>> routine: MPI_INIT or MPI_INIT_THREAD.
>>
>>> One goal of MPI is to achieve source code portability. By this we mean that
>>> a program written using MPI and complying with the relevant language
>>> standards is portable as written, and must not require any source code
>>> changes when moved from one system to another. This explicitly does not say
>>> anything about how an MPI program is started or launched from the command
>>> line, nor what the user must do to set up the environment in which an MPI
>>> program will run. However, an implementation may require some setup to be
>>> performed before other MPI routines may be called. To provide for this, MPI
>>> includes an initialization routine MPI_INIT.
>>
>> While these two statements do not necessarily clarify the original question,
>> they highlight an acceptable solution. Before exiting the MPI_Init function
>> (which we don't have to assume is collective), any "MPI-like" process can be
>> killed without problems (we can even claim that we call the default error
>> handler). For those that successfully exited MPI_Init, I guess the next MPI
>> call will have to trigger the error handler, and these processes should be
>> allowed to gracefully exit.
>>
>> So, while it is clear that the best approach is to allow even bad
>> applications to terminate, it is better if we follow what MPI describes as a
>> "high quality implementation".
>>
>> george.
>>
>> On Dec 15, 2009, at 23:17, Ralph Castain wrote:
>>
>>> Understandable - and we can count on your patch in the near future, then?
>>> :-)
>>>
>>> On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
>>>
>>>> My 0.02USD says that for pragmatic reasons one should attempt to terminate
>>>> the job in this case, regardless of one's opinion of this unusual
>>>> application behavior.
>>>>
>>>> -Paul
>>>>
>>>> Ralph Castain wrote:
>>>>> Hi folks
>>>>>
>>>>> In case you didn't follow this on the user list, we had a question come
>>>>> up about proper OMPI behavior.
>>>>> Basically, the user has an application where one process decides it
>>>>> should cleanly terminate prior to calling MPI_Init, but all the others
>>>>> go ahead and enter MPI_Init. The application hangs, since we don't
>>>>> detect the one proc's exit as an abnormal termination (no segfault, and
>>>>> it didn't call MPI_Init, so it isn't required to call MPI_Finalize prior
>>>>> to termination).
>>>>>
>>>>> I can probably come up with a way to detect this scenario and abort it.
>>>>> But before I spend the effort chasing this down, my question to you MPI
>>>>> folks is:
>>>>>
>>>>> What -should- OMPI do in this situation? We have never previously
>>>>> detected such behavior - was this an oversight, or is this simply a
>>>>> "bad" application?
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>
>>>> --
>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>> Future Technologies Group                 Tel: +1-510-495-2352
>>>> HPC Research Department                   Fax: +1-510-486-6900
>>>> Lawrence Berkeley National Laboratory
>
> --
> Jeff Squyres
> jsquy...@cisco.com
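
For reference, here is a minimal sketch of the application pattern Ralph
describes: one process exits cleanly before MPI_Init while the rest enter it
and hang. Picking rank 0 as the early-exiting process, and reading the
OMPI_COMM_WORLD_RANK environment variable that OMPI's mpirun exports, are
illustrative assumptions, not details from the user's code.

/* early_exit.c -- minimal reproducer for the hang discussed above.
 * Build:  mpicc early_exit.c -o early_exit
 * Run:    mpirun -np 4 ./early_exit
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Decide, before MPI_Init, whether this process has anything to do.
     * OMPI's mpirun places OMPI_COMM_WORLD_RANK in each process'
     * environment; using it (and singling out rank 0) is a stand-in for
     * whatever test the real application performs. */
    const char *rank = getenv("OMPI_COMM_WORLD_RANK");
    if (rank != NULL && atoi(rank) == 0) {
        fprintf(stderr, "rank %s: exiting cleanly before MPI_Init\n", rank);
        return EXIT_SUCCESS;  /* clean exit, status 0; no MPI_Finalize owed */
    }

    /* The remaining processes enter MPI_Init and, in builds where it
     * contains a barrier, block there waiting for a process that will
     * never arrive -- hence the hang. */
    MPI_Init(&argc, &argv);
    printf("rank %s: past MPI_Init\n", rank ? rank : "?");
    MPI_Finalize();
    return EXIT_SUCCESS;
}

Under the behavior discussed in this thread, the desired outcome would be for
ORTE to notice that one process exited without calling MPI_Init while others
did call it, and abort the job rather than leave it hung.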