Should have it now in r22332. Thanks again

Ralph
On Dec 17, 2009, at 1:13 PM, George Bosilca wrote:

> Ralph,
>
> There seem to be some problems after this commit. The hello_world
> application (the MPI flavor) completes; I get all the output, but in
> addition I have a nice message stating that my MPI application didn't
> call MPI_Init.
>
> [bosilca@dancer c]$ mpirun -np 8 --mca pml ob1 ./hello
> Hello, world, I am 5 of 8 on node04
> Hello, world, I am 7 of 8 on node04
> Hello, world, I am 0 of 8 on node03
> Hello, world, I am 1 of 8 on node03
> Hello, world, I am 3 of 8 on node03
> Hello, world, I am 6 of 8 on node04
> Hello, world, I am 2 of 8 on node03
> Hello, world, I am 4 of 8 on node04
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 6 with PID 15398 on
> node node04 exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> george.
>
> On Dec 17, 2009, at 14:42 , Ralph Castain wrote:
>
>> Okay, this "feature" has now been added to the devel trunk with r22329.
>>
>> Feel free to test it and let me know of problems. I have a test code
>> for it in orte/test/mpi/early_abort.c
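For readers who want to reproduce the failure mode being discussed, here is
a minimal sketch of the scenario. It is illustrative only, not the actual
orte/test/mpi/early_abort.c from the repository, and it assumes the
OMPI_COMM_WORLD_RANK environment variable that Open MPI's mpirun exports to
launched processes: one rank exits cleanly before MPI_Init while the
remaining ranks initialize normally, which previously left the job hanging.

    /* Illustrative sketch only -- not the actual early_abort.c test.
     * One process exits cleanly before MPI_Init; the others initialize
     * normally and, without the fix, can hang waiting for the missing
     * peer. Build with mpicc and run with, e.g., mpirun -np 4. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        /* Pick a victim before MPI_Init, using the rank that mpirun
         * exports in the environment (an Open MPI convention). */
        const char *r = getenv("OMPI_COMM_WORLD_RANK");
        if (r != NULL && atoi(r) == 0) {
            printf("rank 0: exiting cleanly before MPI_Init\n");
            return 0;   /* clean exit: no segfault, no MPI_Init */
        }

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello, world, I am %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }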
>> On Dec 16, 2009, at 11:27 AM, Jeff Squyres wrote:
>>
>>> I think I understand what you're saying:
>>>
>>> - it's ok to abort during MPI_INIT (we can rationalize it as the
>>>   default error handler)
>>> - we should only abort during MPI functions
>>>
>>> Is that right? If so, I agree with your interpretation. :-) ...with
>>> one addition: it's ok to abort before MPI_INIT, because the MPI spec
>>> makes no guarantees about what happens before MPI_INIT.
>>>
>>> Specifically, I'd argue that if you "mpirun -np N a.out" and at least
>>> 1 process calls MPI_INIT, then it is reasonable for OMPI to expect
>>> there to be N MPI_INIT's. If any process exits without calling
>>> MPI_INIT -- regardless of that process' exit status -- it should be
>>> treated as an error.
>>>
>>> Don't forget that we have a barrier in MPI_INIT (in most cases), so
>>> aborting when ORTE detects that a) at least one process has called
>>> MPI_INIT, and b) at least one process has exited without calling
>>> MPI_INIT, is acceptable to me. It's also acceptable to the first
>>> point above, because all the other processes are either stuck in
>>> MPI_INIT (either at the barrier or getting there) or haven't yet
>>> entered MPI_INIT -- and the MPI spec makes no guarantees about what
>>> happens before MPI_INIT.
>>>
>>> Does that make sense?
>>>
>>> On Dec 16, 2009, at 10:06 AM, George Bosilca wrote:
>>>
>>>> There are two citations from the MPI standard that I would like to
>>>> highlight.
>>>>
>>>>> All MPI programs must contain exactly one call to an MPI
>>>>> initialization routine: MPI_INIT or MPI_INIT_THREAD.
>>>>
>>>>> One goal of MPI is to achieve source code portability. By this we
>>>>> mean that a program written using MPI and complying with the
>>>>> relevant language standards is portable as written, and must not
>>>>> require any source code changes when moved from one system to
>>>>> another. This explicitly does not say anything about how an MPI
>>>>> program is started or launched from the command line, nor what the
>>>>> user must do to set up the environment in which an MPI program will
>>>>> run. However, an implementation may require some setup to be
>>>>> performed before other MPI routines may be called. To provide for
>>>>> this, MPI includes an initialization routine MPI_INIT.
>>>>
>>>> While these two statements do not necessarily clarify the original
>>>> question, they highlight an acceptable solution. Before exiting the
>>>> MPI_Init function (which we don't have to assume is collective), any
>>>> "MPI-like" process can be killed without problems (we can even claim
>>>> that we call the default error handler). For those that successfully
>>>> exited MPI_Init, I guess the next MPI call will have to trigger the
>>>> error handler, and these processes should be allowed to gracefully
>>>> exit.
>>>>
>>>> So, while it is clear that the best approach is to allow even a bad
>>>> application to terminate, it is better if we follow what MPI
>>>> describes as a "high quality implementation".
>>>>
>>>> george.
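To make George's error-handler point concrete, here is a small sketch. It
is my own illustration, not code from the thread, and MPI-2-era
implementations make few guarantees about recovery after a peer failure:
installing MPI_ERRORS_RETURN in place of the default MPI_ERRORS_ARE_FATAL
handler lets a process observe a failed MPI call as an error code and exit
gracefully rather than being aborted outright.

    /* Illustrative sketch: with MPI_ERRORS_RETURN installed on
     * MPI_COMM_WORLD, a failing MPI call returns an error code instead
     * of aborting the job, so the process can clean up and exit.
     * Intended to be run with exactly two processes: mpirun -np 2 */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, rc, token = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Replace the default MPI_ERRORS_ARE_FATAL handler. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        /* If the peer died (e.g. never reached this point), the call
         * below may fail; MPI_ERRORS_RETURN hands the error code back
         * to us instead of killing the process. */
        if (rank == 0) {
            rc = MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else {
            rc = MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);
        }
        if (rc != MPI_SUCCESS) {
            fprintf(stderr, "rank %d: MPI call failed (rc=%d), exiting\n",
                    rank, rc);
            MPI_Finalize();
            return 1;
        }

        MPI_Finalize();
        return 0;
    }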
>>>> On Dec 15, 2009, at 23:17 , Ralph Castain wrote:
>>>>
>>>>> Understandable - and we can count on your patch in the near future,
>>>>> then? :-)
>>>>>
>>>>> On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
>>>>>
>>>>>> My 0.02 USD says that for pragmatic reasons one should attempt to
>>>>>> terminate the job in this case, regardless of one's opinion of this
>>>>>> unusual application behavior.
>>>>>>
>>>>>> -Paul
>>>>>>
>>>>>> Ralph Castain wrote:
>>>>>>> Hi folks
>>>>>>>
>>>>>>> In case you didn't follow this on the user list, we had a question
>>>>>>> come up about proper OMPI behavior. Basically, the user has an
>>>>>>> application where one process decides it should cleanly terminate
>>>>>>> prior to calling MPI_Init, but all the others go ahead and enter
>>>>>>> MPI_Init. The application hangs since we don't detect the one
>>>>>>> proc's exit as an abnormal termination (no segfault, and it didn't
>>>>>>> call MPI_Init, so it isn't required to call MPI_Finalize prior to
>>>>>>> termination).
>>>>>>>
>>>>>>> I can probably come up with a way to detect this scenario and
>>>>>>> abort it. But before I spend the effort chasing this down, my
>>>>>>> question to you MPI folks is:
>>>>>>>
>>>>>>> What -should- OMPI do in this situation? We have never previously
>>>>>>> detected such behavior - was this an oversight, or is this simply
>>>>>>> a "bad" application?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ralph
>>>>>>
>>>>>> --
>>>>>> Paul H. Hargrove                  phhargr...@lbl.gov
>>>>>> Future Technologies Group         Tel: +1-510-495-2352
>>>>>> HPC Research Department           Fax: +1-510-486-6900
>>>>>> Lawrence Berkeley National Laboratory
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com