Hi Christof,

Sorry if I missed this, but it sounds like you are saying that one of your procs abnormally terminates, and we are failing to kill the remaining job? Is that correct?
If so, I just did some work that might relate to that problem that is pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528

Would you be able to try that?

Ralph

> On Dec 7, 2016, at 9:37 AM, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:
>
> Hello,
>
> On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
>>> On Dec 7, 2016, at 10:07 AM, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:
>>>
>>> I really think the hang is a consequence of unclean termination (in the
>>> sense that the non-root ranks are not terminated) and probably not the
>>> cause, in my interpretation of what I see. Would you have any suggestion
>>> to catch signals sent between orterun (mpirun) and the child tasks?
>>
>> Do you know where in the code the termination call is? Is it actually
>> calling mpi_abort(), or just doing something ugly like calling Fortran
>> "stop"? If the latter, would that explain a possible hang?
>
> Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The wannier90
> input contains an error: a restart is requested, but the wannier90.chk file
> with the restart information is missing.
>
> "
> Exiting.......
> Error: restart requested but wannier90.chk file not found
> "
>
> So it must terminate.
>
> The termination happens in libwannier.a, source file io.F90:
>
> write(stdout,*) 'Exiting.......'
> write(stdout, '(1x,a)') trim(error_msg)
> close(stdout)
> stop "wannier90 error: examine the output/error file for details"
>
> So it calls stop, as you assumed.
>
>> Presumably someone here can comment on what the standard says about the
>> validity of terminating without mpi_abort.
>
> Well, probably stop is not a good way to terminate then.
>
> My main point was the change relative to 1.10 anyway :-)
>
>> Actually, if you're willing to share enough input files to reproduce, I
>> could take a look.
>> I just recompiled our VASP with openmpi 2.0.1 to fix a crash that was
>> apparently addressed by some change in the memory allocator in a recent
>> version of openmpi. Just e-mail me if that's the case.
>
> I think that is no longer necessary? In principle it is no problem, but
> it is at the end of a (small) GW calculation, the Si tutorial example.
> So the mail would be a bit larger due to the WAVECAR.
>
>> Noam
>>
>> ____________
>> ||
>> |U.S. NAVAL|
>> |_RESEARCH_|
>> LABORATORY
>>
>> Noam Bernstein, Ph.D.
>> Center for Materials Physics and Technology
>> U.S. Naval Research Laboratory
>> T +1 202 404 8628  F +1 202 404 7546
>> https://www.nrl.navy.mil
>
> --
> Dr. rer. nat. Christof Köhler    email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS       phone: +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12    fax:   +49-(0)421-218-62770
> 28359 Bremen
>
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
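For what it's worth, the usual fix on the library side is to call MPI_Abort instead of a bare Fortran "stop", so that all ranks are torn down rather than just the one that hit the error. Below is a minimal sketch of what that could look like in an error routine like the one quoted from io.F90 above; the subroutine name and surrounding details are illustrative, not the actual wannier90 code, and it assumes the library is linked against MPI:

    ! Hypothetical sketch: terminating a parallel run cleanly on error.
    ! A plain "stop" ends only the local rank, which can leave the other
    ! ranks (and mpirun) hanging; MPI_Abort kills the whole job.
    subroutine io_error_abort(error_msg)
      use mpi
      implicit none
      character(len=*), intent(in) :: error_msg
      integer :: ierr
      logical :: initialized

      write(*,*) 'Exiting.......'
      write(*,'(1x,a)') trim(error_msg)

      ! Only call MPI_Abort if MPI is actually up; otherwise fall back
      ! to stop so the routine is still safe in serial builds.
      call MPI_Initialized(initialized, ierr)
      if (initialized) then
         call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
      else
         stop 'error: examine the output/error file for details'
      end if
    end subroutine io_error_abort

Whether mpirun should also detect and clean up after a bare stop is exactly what the PR referenced above is about.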
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users