Hello everybody,

I tried it with the nightly and with the 2.0.2 branch directly from git
which, according to the log, should contain that patch:

commit d0b97d7a408b87425ca53523de369da405358ba2
Merge: ac8c019 b9420bb
Author: Jeff Squyres <jsquy...@users.noreply.github.com>
Date:   Wed Dec 7 18:24:46 2016 -0500
    Merge pull request #2528 from rhc54/cmr20x/signals

Unfortunately, it changes nothing. The root rank stops while all other
ranks (and mpirun) keep running, the remaining ranks at 100 % CPU, apparently
waiting in that allreduce. The stack trace looks a bit more interesting (is a
build from git always a debug build?), so I include it at the very bottom
just in case.

Off-list, Gilles Gouaillardet suggested setting breakpoints at exit, __exit
etc. to try to catch signals. Would that be useful? I need a moment to figure
out how to do this, but I can definitely try.
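
If I understand the suggestion correctly, it would be something along these
lines, attaching gdb to the root rank before it reaches the error (the PID is
of course just a placeholder, and I am not yet sure which signals to watch):

   $ gdb -p <pid of root rank>
   (gdb) break exit
   (gdb) break _exit
   (gdb) handle SIGTERM stop print
   (gdb) continue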

One remark: during "make install" from the git repo I see a

WARNING!  Common symbols found:
          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex
          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex
          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_precision
          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer
          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real
          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint
          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band
          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor
          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor
          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte

I have never noticed this before.


Best Regards

Christof

Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
#0  0x00002af84e4c669d in poll () from /lib64/libc.so.6
#1  0x00002af850517496 in poll_dispatch () from 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#2  0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3  0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207
#4  0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144, 
requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
#5  0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling 
(sbuf=0xdecbae0,
rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, 
module=0xdee69e0) at base/coll_base_allreduce.c:225
#6  0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed 
(sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, 
module=0x1) at coll_tuned_decision_fixed.c:66
#7  0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2, 
count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1) at pallreduce.c:107
#8  0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005", 
recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0, 
datatype=0xffffffffffffffff, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at 
pallreduce_f.c:87
#9  0x000000000045ecc6 in m_sum_i_ ()
#10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ ()
#11 0x00000000004325ff in vamp () at main.F:2640
#12 0x000000000040de1e in main ()
#13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
#14 0x000000000040dd29 in _start ()
    
On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org wrote:
> Hi Christof
> 
> Sorry if I missed this, but it sounds like you are saying that one of your 
> procs abnormally terminates, and we are failing to kill the remaining job? Is 
> that correct?
> 
> If so, I just did some work that might relate to that problem that is pending 
> in PR #2528: https://github.com/open-mpi/ompi/pull/2528
> 
> Would you be able to try that?
> 
> Ralph
> 
> > On Dec 7, 2016, at 9:37 AM, Christof Koehler 
> > <christof.koeh...@bccms.uni-bremen.de> wrote:
> > 
> > Hello,
> > 
> > On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> >>> On Dec 7, 2016, at 10:07 AM, Christof Koehler 
> >>> <christof.koeh...@bccms.uni-bremen.de> wrote:
> >>>> 
> >>> I really think the hang is a consequence of
> >>> unclean termination (in the sense that the non-root ranks are not
> >>> terminated) and probably not the cause, in my interpretation of what I
> >>> see. Would you have any suggestion to catch signals sent between orterun
> >>> (mpirun) and the child tasks ?
> >> 
> >> Do you know where in the code the termination call is?  Is it actually 
> >> calling mpi_abort(), or just doing something ugly like calling fortran 
> >> “stop”?  If the latter, would that explain a possible hang?
> > Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The wannier90
> > input contains an error: a restart is requested, but the wannier90.chk file
> > with the restart information is missing.
> > "
> > Exiting.......
> > Error: restart requested but wannier90.chk file not found
> > "
> > So it must terminate.
> > 
> > The termination happens in the libwannier.a, source file io.F90:
> > 
> > write(stdout,*)  'Exiting.......'
> > write(stdout, '(1x,a)') trim(error_msg)
> > close(stdout)
> > stop "wannier90 error: examine the output/error file for details"
> > 
> > So it calls stop, as you assumed.
> > 
> >> Presumably someone here can comment on what the standard says about the 
> >> validity of terminating without mpi_abort.
> > 
> > Well, probably stop is not a good way to terminate then.
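
For illustration, roughly what I have in mind as an alternative (just a
sketch, the subroutine name is a placeholder and I have not tried to build it
against libwannier):

   ! Hypothetical variant of the error exit in io.F90: notify MPI so that
   ! the whole job is torn down instead of only the local rank exiting.
   subroutine io_error_abort(error_msg, stdout)
     use mpi
     implicit none
     character(len=*), intent(in) :: error_msg
     integer, intent(in)          :: stdout
     integer :: ierr
     write(stdout,*) 'Exiting.......'
     write(stdout,'(1x,a)') trim(error_msg)
     close(stdout)
     ! MPI_Abort makes mpirun kill all ranks, so nobody is left
     ! waiting in a pending allreduce.
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
   end subroutine io_error_abort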
> > 
> > My main point was the change relative to 1.10 anyway :-) 
> > 
> > 
> >> 
> >> Actually, if you’re willing to share enough input files to reproduce, I 
> >> could take a look.  I just recompiled our VASP with openmpi 2.0.1 to fix a 
> >> crash that was apparently addressed by some change in the memory allocator 
> >> in a recent version of openmpi.  Just e-mail me if that’s the case.
> > 
> > I think that is no longer necessary? In principle it is no problem, but it
> > is at the end of a (small) GW calculation, the Si tutorial example.
> > So the mail would be a bit larger due to the WAVECAR.
> > 
> > 
> >> 
> >>                                                                    Noam
> >> 
> >> 
> >> Noam Bernstein, Ph.D.
> >> Center for Materials Physics and Technology
> >> U.S. Naval Research Laboratory
> >> T +1 202 404 8628  F +1 202 404 7546
> >> https://www.nrl.navy.mil
> > 
> > -- 
> > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> > Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> > Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> > 28359 Bremen  
> > 
> > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> 

-- 
Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
28359 Bremen  

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/


_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
