Hello everybody,

I tried it with the nightly build and directly with the v2.0.2 branch
from git, which according to the log should contain that patch:
commit d0b97d7a408b87425ca53523de369da405358ba2
Merge: ac8c019 b9420bb
Author: Jeff Squyres <jsquy...@users.noreply.github.com>
Date:   Wed Dec 7 18:24:46 2016 -0500

    Merge pull request #2528 from rhc54/cmr20x/signals

Unfortunately it changes nothing. The root rank stops, and all other
ranks (and mpirun) just stay, the remaining ranks spinning at 100 % CPU,
apparently waiting in that allreduce. The stack trace looks a bit more
interesting (are git builds always debug builds?), so I include it at
the very bottom just in case.

Off-list, Gilles Gouaillardet suggested setting breakpoints at exit,
__exit etc. to try to catch signals. Would that be useful? I need a
moment to figure out how to do this, but I can definitely try; a rough
sketch of what I have in mind is below the stack trace.

One remark: during "make install" from the git repo I see a

WARNING! Common symbols found:
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_precision
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte

I have never noticed this before.

Best Regards

Christof

Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
#0  0x00002af84e4c669d in poll () from /lib64/libc.so.6
#1  0x00002af850517496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#2  0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3  0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207
#4  0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144, requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
#5  0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, module=0xdee69e0) at base/coll_base_allreduce.c:225
#6  0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
#7  0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2, count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1) at pallreduce.c:107
#8  0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005", recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0, datatype=0xffffffffffffffff, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at pallreduce_f.c:87
#9  0x000000000045ecc6 in m_sum_i_ ()
#10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ ()
#11 0x00000000004325ff in vamp () at main.F:2640
#12 0x000000000040de1e in main ()
#13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
#14 0x000000000040dd29 in _start ()
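Here is the rough sketch mentioned above, assuming plain gdb attached to
the root rank before the wannier90 error triggers (the PID is of course
just a placeholder):

    $ gdb -p <pid of root rank>            # attach before the error occurs
    (gdb) break exit                       # Fortran STOP ends up in libc exit()
    (gdb) break _exit
    (gdb) handle SIGTERM stop print pass   # also stop if mpirun/orterun signals us
    (gdb) continue
    (gdb) backtrace                        # once stopped: who exits, what arrives?

Attaching the same way to one of the surviving ranks should then show
whether any signal from mpirun arrives there at all.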
On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org wrote:
> Hi Christof
>
> Sorry if I missed this, but it sounds like you are saying that one of
> your procs abnormally terminates, and we are failing to kill the
> remaining job? Is that correct?
>
> If so, I just did some work that might relate to that problem that is
> pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528
>
> Would you be able to try that?
>
> Ralph
>
> > On Dec 7, 2016, at 9:37 AM, Christof Koehler
> > <christof.koeh...@bccms.uni-bremen.de> wrote:
> >
> > Hello,
> >
> > On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> >>> On Dec 7, 2016, at 10:07 AM, Christof Koehler
> >>> <christof.koeh...@bccms.uni-bremen.de> wrote:
> >>>
> >>> I really think the hang is a consequence of unclean termination
> >>> (in the sense that the non-root ranks are not terminated) and
> >>> probably not the cause, in my interpretation of what I see.
> >>> Would you have any suggestion how to catch signals sent between
> >>> orterun (mpirun) and the child tasks?
> >>
> >> Do you know where in the code the termination call is? Is it
> >> actually calling mpi_abort(), or just doing something ugly like
> >> calling Fortran "stop"? If the latter, would that explain a
> >> possible hang?
> >
> > Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The
> > wannier90 input contains an error, a restart is requested, and the
> > wannier90.chk file with the restart information is missing:
> >
> > "
> > Exiting.......
> > Error: restart requested but wannier90.chk file not found
> > "
> >
> > So it must terminate.
> >
> > The termination happens in libwannier.a, source file io.F90:
> >
> >     write(stdout,*) 'Exiting.......'
> >     write(stdout, '(1x,a)') trim(error_msg)
> >     close(stdout)
> >     stop "wannier90 error: examine the output/error file for details"
> >
> > So it calls stop, as you assumed.
> >
> >> Presumably someone here can comment on what the standard says
> >> about the validity of terminating without mpi_abort.
> >
> > Well, probably stop is not a good way to terminate then.
> >
> > My main point was the change relative to 1.10 anyway :-)
> >
> >> Actually, if you're willing to share enough input files to
> >> reproduce, I could take a look. I just recompiled our VASP with
> >> openmpi 2.0.1 to fix a crash that was apparently addressed by some
> >> change in the memory allocator in a recent version of openmpi.
> >> Just e-mail me if that's the case.
> >
> > I think that is no longer necessary? In principle it is no problem,
> > but it occurs at the end of a (small) GW calculation, the Si
> > tutorial example, so the mail would be a bit larger due to the
> > WAVECAR file.
> >
> >> Noam
> >>
> >> ____________
> >> ||
> >> |U.S. NAVAL|
> >> |_RESEARCH_|
> >> LABORATORY
> >>
> >> Noam Bernstein, Ph.D.
> >> Center for Materials Physics and Technology
> >> U.S. Naval Research Laboratory
> >> T +1 202 404 8628  F +1 202 404 7546
> >> https://www.nrl.navy.mil
> >
> > --
> > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> > Universitaet Bremen/ BCCMS          phone: +49-(0)421-218-62334
> > Am Fallturm 1/ TAB/ Raum 3.12       fax:   +49-(0)421-218-62770
> > 28359 Bremen
> >
> > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/

--
Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax:   +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
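P.S.: Regarding the stop versus mpi_abort question quoted above, below
is a minimal sketch of what a cleaner error exit in the library could
look like. This is purely my illustration (the subroutine name is made
up, not actual wannier90 code), and a real implementation would
presumably check MPI_Initialized first, since libwannier can also be
used serially:

    subroutine mpi_safe_stop(error_msg)
       use mpi               ! assumed; wannier90 itself may use include 'mpif.h'
       implicit none
       character(len=*), intent(in) :: error_msg
       integer :: ierr
       write(*,'(1x,a)') trim(error_msg)
       ! MPI_Abort asks the runtime to terminate *all* ranks of the
       ! communicator, so mpirun and the other ranks cannot be left
       ! hanging in a pending allreduce, as happens with a plain "stop".
       call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
    end subroutine mpi_safe_stop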
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users