As best I can determine, mpirun catches SIGTERM just fine and will hit the 
procs with SIGCONT, followed by SIGTERM and then SIGKILL. It then waits for 
the remote daemons to complete after they hit their own procs with the same 
sequence.
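
To make the order of that sequence concrete, here is a minimal Python sketch 
of the same escalation against a single child process. It is only an 
illustration, not the Open MPI code: the function name terminate_escalating 
and the grace_seconds parameter are made up for this example, and the 
grace period mpirun really uses is not something I am quoting here.

```python
import signal
import subprocess
import sys


def terminate_escalating(proc, grace_seconds=1.0):
    """Send SIGCONT, then SIGTERM, then SIGKILL if the child is still alive.

    Hypothetical helper for illustration only (POSIX systems).
    Returns the names of the signals that were actually sent.
    """
    sent = []
    # SIGCONT first, so that a stopped child is able to act on SIGTERM.
    for sig in (signal.SIGCONT, signal.SIGTERM):
        if proc.poll() is None:  # child has not exited yet
            proc.send_signal(sig)
            sent.append(sig.name)
    try:
        # Give the child a grace period to exit on its own.
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # Still running: SIGKILL cannot be caught or ignored.
        proc.kill()
        sent.append("SIGKILL")
        proc.wait()
    return sent
```

A child that ignores SIGTERM only goes away at the SIGKILL step, while a 
child that has already exited receives no signal at all.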


> On Dec 8, 2016, at 5:18 AM, Christof Koehler 
> <christof.koeh...@bccms.uni-bremen.de> wrote:
> 
> Hello  again,
> 
> I am still not sure about breakpoints. But I did a "catch signal" in
> gdb, with gdb sessions attached to the two vasp processes and to mpirun.
> 
> When the root rank exits, I see in the gdb attached to it
> [Thread 0x2b2787df8700 (LWP 2457) exited]
> [Thread 0x2b277f483180 (LWP 2455) exited]
> [Inferior 1 (process 2455) exited normally]
> 
> In the gdb attached to mpirun
> Catchpoint 1 (signal SIGCHLD), 0x00002b16560f769d in poll () from /lib64/libc.so.6
> 
> In the gdb attached to the second rank I see no output.
> 
> Issuing "continue" in the gdb session attached to mpirun does not lead
> to anything new as far as I can tell.
> 
> The stack trace of mpirun after that (Ctrl-C'ed to stop it again) is
> #0  0x00002b16560f769d in poll () from /lib64/libc.so.6
> #1  0x00002b1654b3a496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #2  0x00002b1654b32fa5 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #3  0x0000000000406311 in orterun (argc=7, argv=0x7ffdabfbebc8) at orterun.c:1071
> #4  0x00000000004037e0 in main (argc=7, argv=0x7ffdabfbebc8) at main.c:13
> 
> So there is a signal and mpirun does nothing with it ?
> 
> Cheers
> 
> Christof
> 
> 
> On Thu, Dec 08, 2016 at 12:39:06PM +0100, Christof Koehler wrote:
>> Hello,
>> 
>> On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote:
>>> Christof,
>>> 
>>> 
>>> There is something really odd with this stack trace.
>>> count is zero, and some pointers do not point to valid addresses (!)
>> Yes, I assumed it was interesting :-) Note that the program is compiled
>> with -O2 -fp-model source, so optimization is on. I can try with -O0
>> or with gcc/gfortran (will take a moment) to make sure the problem does
>> not come from that.
>> 
>>> 
>>> in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
>>> the stack has been corrupted inside MPI_Allreduce(), or that you are not
>>> using the library you think you use
>>> pmap <pid> will show you which lib is used
>> The pmap of the survivor is at the very end of this mail.
>> 
>>> 
>>> btw, this was not started with
>>> mpirun --mca coll ^tuned ...
>>> right ?
>> This is correct, not started with "mpirun --mca coll ^tuned". Using it
>> does not change anything.
>> 
>>> 
>>> just to make it clear ...
>>> a task from your program bluntly issues a fortran STOP, and this is kind of
>>> a feature.
>> Yes. The library where the stop occurs is/was written for serial use as
>> far as I can tell. As I mentioned, it is not our code but this one
>> http://www.wannier.org/ (Version 1.2) linked into https://www.vasp.at/,
>> which should be a working combination.
>> 
>>> the *only* issue is mpirun does not kill the other MPI tasks and mpirun
>>> never completes.
>>> did i get it right ?
>> Yes ! So it is not a really big problem IMO. Just a bit nasty if this
>> would happen with a job in the queueing system.
>> 
>> Best Regards
>> 
>> Christof
>> 
>> Note: git branch 2.0.2 of openmpi was configured and installed (make
>> install) with
>> ./configure CC=icc CXX=icpc FC=ifort F77=ifort FFLAGS="-O1 -fp-model
>> precise" CFLAGS="-O1 -fp-model precise" CXXFLAGS="-O1 -fp-model precise"
>> FCFLAGS="-O1 -fp-model precise" --with-psm2 --with-tm
>> --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
>> --prefix=/cluster/mpi/openmpi/2.0.2/intel2016
>> 
>> The OS is Centos 7, relatively current :-) with current Omni-Path driver
>> package from Intel (10.2).
>> 
>> vasp is linked against Intel MKL Lapack/Blas, self-compiled scalapack
>> (trunk 206) and FFTW 3.3.5. FFTW and scalapack are statically linked. And of
>> course the libwannier.a version 1.2 is statically linked.
>> 
>> pmap -p of the survivor
>> 
>> 32282:   /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
>> 0000000000400000  65200K r-x-- 
>> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
>> 00000000045ab000    100K r---- 
>> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
>> 00000000045c4000   2244K rw--- 
>> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
>> 00000000047f5000 100900K rw---   [ anon ]
>> 000000000bfaa000    684K rw---   [ anon ]
>> 000000000c055000     20K rw---   [ anon ]
>> 000000000c05a000    424K rw---   [ anon ]
>> 000000000c0c4000     68K rw---   [ anon ]
>> 000000000c0d5000  25384K rw---   [ anon ]
>> 00002b17e34f6000    132K r-x-- /usr/lib64/ld-2.17.so
>> 00002b17e3517000      4K rw---   [ anon ]
>> 00002b17e3518000     28K rw-s- /dev/infiniband/uverbs0
>> 00002b17e3523000     88K rw---   [ anon ]
>> 00002b17e3539000    772K rw-s- /dev/infiniband/uverbs0
>> 00002b17e35fa000    772K rw-s- /dev/infiniband/uverbs0
>> 00002b17e36bb000    196K rw-s- /dev/infiniband/uverbs0
>> 00002b17e36ec000     28K rw-s- /dev/infiniband/uverbs0
>> 00002b17e36f3000     20K rw-s- /dev/infiniband/uverbs0
>> 00002b17e3717000      4K r---- /usr/lib64/ld-2.17.so
>> 00002b17e3718000      4K rw--- /usr/lib64/ld-2.17.so
>> 00002b17e3719000      4K rw---   [ anon ]
>> 00002b17e371a000     88K r-x-- /usr/lib64/libpthread-2.17.so
>> 00002b17e3730000   2048K ----- /usr/lib64/libpthread-2.17.so
>> 00002b17e3930000      4K r---- /usr/lib64/libpthread-2.17.so
>> 00002b17e3931000      4K rw--- /usr/lib64/libpthread-2.17.so
>> 00002b17e3932000     16K rw---   [ anon ]
>> 00002b17e3936000   1028K r-x-- /usr/lib64/libm-2.17.so
>> 00002b17e3a37000   2044K ----- /usr/lib64/libm-2.17.so
>> 00002b17e3c36000      4K r---- /usr/lib64/libm-2.17.so
>> 00002b17e3c37000      4K rw--- /usr/lib64/libm-2.17.so
>> 00002b17e3c38000     12K r-x-- /usr/lib64/libdl-2.17.so
>> 00002b17e3c3b000   2044K ----- /usr/lib64/libdl-2.17.so
>> 00002b17e3e3a000      4K r---- /usr/lib64/libdl-2.17.so
>> 00002b17e3e3b000      4K rw--- /usr/lib64/libdl-2.17.so
>> 00002b17e3e3c000    184K r-x-- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
>> 00002b17e3e6a000   2044K ----- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
>> 00002b17e4069000      4K r---- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
>> 00002b17e406a000      4K rw--- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
>> 00002b17e406b000     36K r-x-- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
>> 00002b17e4074000   2044K ----- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
>> 00002b17e4273000      4K r---- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
>> 00002b17e4274000      4K rw--- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
>> 00002b17e4275000    396K r-x-- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
>> 00002b17e42d8000   2044K ----- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
>> 00002b17e44d7000      4K r---- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
>> 00002b17e44d8000      4K rw--- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
>> 00002b17e44d9000   1948K r-x-- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
>> 00002b17e46c0000   2044K ----- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
>> 00002b17e48bf000     12K r---- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
>> 00002b17e48c2000    104K rw--- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
>> 00002b17e48dc000     76K rw---   [ anon ]
>> 00002b17e48ef000    948K r-x-- /usr/lib64/libc-2.17.so
>> 00002b17e49dc000      4K r-x-- /usr/lib64/libc-2.17.so
>> 00002b17e49dd000     12K r-x-- /usr/lib64/libc-2.17.so
>> 00002b17e49e0000      4K r-x-- /usr/lib64/libc-2.17.so
>> 00002b17e49e1000     20K r-x-- /usr/lib64/libc-2.17.so
>> 00002b17e49e6000      8K r-x-- /usr/lib64/libc-2.17.so
>> 00002b17e49e8000    760K r-x-- /usr/lib64/libc-2.17.so
>> 00002b17e4aa6000   2048K ----- /usr/lib64/libc-2.17.so
>> 00002b17e4ca6000     16K r---- /usr/lib64/libc-2.17.so
>> 00002b17e4caa000      8K rw--- /usr/lib64/libc-2.17.so
>> 00002b17e4cac000     20K rw---   [ anon ]
>> 00002b17e4cb1000     84K r-x-- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
>> 00002b17e4cc6000   2044K ----- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
>> 00002b17e4ec5000      4K r---- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
>> 00002b17e4ec6000      4K rw--- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
>> 00002b17e4ec7000    452K r-x-- /usr/lib64/libpsm2.so.2.1
>> 00002b17e4f38000   2044K ----- /usr/lib64/libpsm2.so.2.1
>> 00002b17e5137000      4K r---- /usr/lib64/libpsm2.so.2.1
>> 00002b17e5138000      8K rw--- /usr/lib64/libpsm2.so.2.1
>> 00002b17e513a000      4K rw---   [ anon ]
>> 00002b17e513b000   1344K r-x-- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
>> 00002b17e528b000   2044K ----- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
>> 00002b17e548a000      8K r---- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
>> 00002b17e548c000     44K rw--- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
>> 00002b17e5497000     12K rw---   [ anon ]
>> 00002b17e549a000    480K r-x-- /usr/lib64/libtorque.so.2.0.0
>> 00002b17e5512000   2044K ----- /usr/lib64/libtorque.so.2.0.0
>> 00002b17e5711000      8K r---- /usr/lib64/libtorque.so.2.0.0
>> 00002b17e5713000      8K rw--- /usr/lib64/libtorque.so.2.0.0
>> 00002b17e5715000   6704K rw---   [ anon ]
>> 00002b17e5da1000   1404K r-x-- /usr/lib64/libxml2.so.2.9.1
>> 00002b17e5f00000   2044K ----- /usr/lib64/libxml2.so.2.9.1
>> 00002b17e60ff000     32K r---- /usr/lib64/libxml2.so.2.9.1
>> 00002b17e6107000      8K rw--- /usr/lib64/libxml2.so.2.9.1
>> 00002b17e6109000      8K rw---   [ anon ]
>> 00002b17e610b000     84K r-x-- /usr/lib64/libz.so.1.2.7
>> 00002b17e6120000   2044K ----- /usr/lib64/libz.so.1.2.7
>> 00002b17e631f000      4K r---- /usr/lib64/libz.so.1.2.7
>> 00002b17e6320000      4K rw--- /usr/lib64/libz.so.1.2.7
>> 00002b17e6321000   1784K r-x-- /usr/lib64/libcrypto.so.1.0.1e
>> 00002b17e64df000   2048K ----- /usr/lib64/libcrypto.so.1.0.1e
>> 00002b17e66df000    104K r---- /usr/lib64/libcrypto.so.1.0.1e
>> 00002b17e66f9000     48K rw--- /usr/lib64/libcrypto.so.1.0.1e
>> 00002b17e6705000     16K rw---   [ anon ]
>> 00002b17e6709000    396K r-x-- /usr/lib64/libssl.so.1.0.1e
>> 00002b17e676c000   2044K ----- /usr/lib64/libssl.so.1.0.1e
>> 00002b17e696b000     16K r---- /usr/lib64/libssl.so.1.0.1e
>> 00002b17e696f000     28K rw--- /usr/lib64/libssl.so.1.0.1e
>> 00002b17e6976000   1572K r-x-- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
>> 00002b17e6aff000   2044K ----- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
>> 00002b17e6cfe000     20K r---- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
>> 00002b17e6d03000     56K rw--- 
>> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
>> 00002b17e6d11000    552K rw---   [ anon ]
>> 00002b17e6d9b000     84K r-x-- /usr/lib64/librdmacm.so.1.0.0
>> 00002b17e6db0000   2044K ----- /usr/lib64/librdmacm.so.1.0.0
>> 00002b17e6faf000      4K r---- /usr/lib64/librdmacm.so.1.0.0
>> 00002b17e6fb0000      4K rw--- /usr/lib64/librdmacm.so.1.0.0
>> 00002b17e6fb1000      4K rw---   [ anon ]
>> 00002b17e6fb2000     68K r-x-- /usr/lib64/libibverbs.so.1.0.0
>> 00002b17e6fc3000   2044K ----- /usr/lib64/libibverbs.so.1.0.0
>> 00002b17e71c2000      4K r---- /usr/lib64/libibverbs.so.1.0.0
>> 00002b17e71c3000      4K rw--- /usr/lib64/libibverbs.so.1.0.0
>> 00002b17e71c4000     40K r-x-- /usr/lib64/libnuma.so.1
>> 00002b17e71ce000   2048K ----- /usr/lib64/libnuma.so.1
>> 00002b17e73ce000      4K r---- /usr/lib64/libnuma.so.1
>> 00002b17e73cf000      4K rw--- /usr/lib64/libnuma.so.1
>> 00002b17e73d0000     32K r-x-- /usr/lib64/libpciaccess.so.0.11.1
>> 00002b17e73d8000   2048K ----- /usr/lib64/libpciaccess.so.0.11.1
>> 00002b17e75d8000      4K r---- /usr/lib64/libpciaccess.so.0.11.1
>> 00002b17e75d9000      4K rw--- /usr/lib64/libpciaccess.so.0.11.1
>> 00002b17e75da000     28K r-x-- /usr/lib64/librt-2.17.so
>> 00002b17e75e1000   2044K ----- /usr/lib64/librt-2.17.so
>> 00002b17e77e0000      4K r---- /usr/lib64/librt-2.17.so
>> 00002b17e77e1000      4K rw--- /usr/lib64/librt-2.17.so
>> 00002b17e77e2000      8K r-x-- /usr/lib64/libutil-2.17.so
>> 00002b17e77e4000   2044K ----- /usr/lib64/libutil-2.17.so
>> 00002b17e79e3000      4K r---- /usr/lib64/libutil-2.17.so
>> 00002b17e79e4000      4K rw--- /usr/lib64/libutil-2.17.so
>> 00002b17e79e5000    152K r-x-- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
>> 00002b17e7a0b000   2044K ----- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
>> 00002b17e7c0a000      4K r---- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
>> 00002b17e7c0b000      8K rw--- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
>> 00002b17e7c0d000     24K rw---   [ anon ]
>> 00002b17e7c13000   1288K r-x-- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
>> 00002b17e7d55000   2044K ----- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
>> 00002b17e7f54000     12K r---- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
>> 00002b17e7f57000     12K rw--- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
>> 00002b17e7f5a000    116K rw---   [ anon ]
>> 00002b17e7f77000   2696K r-x-- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
>> 00002b17e8219000   2044K ----- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
>> 00002b17e8418000     24K r---- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
>> 00002b17e841e000    340K rw--- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
>> 00002b17e8473000    420K r-x-- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
>> 00002b17e84dc000   2048K ----- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
>> 00002b17e86dc000      4K r---- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
>> 00002b17e86dd000      4K rw--- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
>> 00002b17e86de000      4K rw---   [ anon ]
>> 00002b17e86df000  13124K r-x-- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
>> 00002b17e93b0000   2048K ----- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
>> 00002b17e95b0000    220K r---- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
>> 00002b17e95e7000     20K rw--- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
>> 00002b17e95ec000   1304K r-x-- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
>> 00002b17e9732000   2048K ----- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
>> 00002b17e9932000     12K r---- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
>> 00002b17e9935000     12K rw--- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
>> 00002b17e9938000    296K rw---   [ anon ]
>> 00002b17e9982000   1464K r-x-- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
>> 00002b17e9af0000   2044K ----- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
>> 00002b17e9cef000      4K r---- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
>> 00002b17e9cf0000     16K rw--- 
>> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
>> 00002b17e9cf4000     16K r-x-- /usr/lib64/libuuid.so.1.3.0
>> 00002b17e9cf8000   2044K ----- /usr/lib64/libuuid.so.1.3.0
>> 00002b17e9ef7000      4K r---- /usr/lib64/libuuid.so.1.3.0
>> 00002b17e9ef8000      4K rw--- /usr/lib64/libuuid.so.1.3.0
>> 00002b17e9ef9000    932K r-x-- /usr/lib64/libstdc++.so.6.0.19
>> 00002b17e9fe2000   2048K ----- /usr/lib64/libstdc++.so.6.0.19
>> 00002b17ea1e2000     32K r---- /usr/lib64/libstdc++.so.6.0.19
>> 00002b17ea1ea000      8K rw--- /usr/lib64/libstdc++.so.6.0.19
>> 00002b17ea1ec000     84K rw---   [ anon ]
>> 00002b17ea201000    144K r-x-- /usr/lib64/liblzma.so.5.0.99
>> 00002b17ea225000   2044K ----- /usr/lib64/liblzma.so.5.0.99
>> 00002b17ea424000      4K r---- /usr/lib64/liblzma.so.5.0.99
>> 00002b17ea425000      4K rw--- /usr/lib64/liblzma.so.5.0.99
>> 00002b17ea426000    292K r-x-- /usr/lib64/libgssapi_krb5.so.2.2
>> 00002b17ea46f000   2048K ----- /usr/lib64/libgssapi_krb5.so.2.2
>> 00002b17ea66f000      4K r---- /usr/lib64/libgssapi_krb5.so.2.2
>> 00002b17ea670000      8K rw--- /usr/lib64/libgssapi_krb5.so.2.2
>> 00002b17ea672000    852K r-x-- /usr/lib64/libkrb5.so.3.3
>> 00002b17ea747000   2048K ----- /usr/lib64/libkrb5.so.3.3
>> 00002b17ea947000     52K r---- /usr/lib64/libkrb5.so.3.3
>> 00002b17ea954000     12K rw--- /usr/lib64/libkrb5.so.3.3
>> 00002b17ea957000     12K r-x-- /usr/lib64/libcom_err.so.2.1
>> 00002b17ea95a000   2044K ----- /usr/lib64/libcom_err.so.2.1
>> 00002b17eab59000      4K r---- /usr/lib64/libcom_err.so.2.1
>> 00002b17eab5a000      4K rw--- /usr/lib64/libcom_err.so.2.1
>> 00002b17eab5b000    188K r-x-- /usr/lib64/libk5crypto.so.3.1
>> 00002b17eab8a000   2044K ----- /usr/lib64/libk5crypto.so.3.1
>> 00002b17ead89000      8K r---- /usr/lib64/libk5crypto.so.3.1
>> 00002b17ead8b000      4K rw--- /usr/lib64/libk5crypto.so.3.1
>> 00002b17ead8c000      4K rw---   [ anon ]
>> 00002b17ead8d000    284K r-x-- /usr/lib64/libnl-route-3.so.200.16.1
>> 00002b17eadd4000   2044K ----- /usr/lib64/libnl-route-3.so.200.16.1
>> 00002b17eafd3000     12K r---- /usr/lib64/libnl-route-3.so.200.16.1
>> 00002b17eafd6000     16K rw--- /usr/lib64/libnl-route-3.so.200.16.1
>> 00002b17eafda000      8K rw---   [ anon ]
>> 00002b17eafdc000    104K r-x-- /usr/lib64/libnl-3.so.200.16.1
>> 00002b17eaff6000   2044K ----- /usr/lib64/libnl-3.so.200.16.1
>> 00002b17eb1f5000      8K r---- /usr/lib64/libnl-3.so.200.16.1
>> 00002b17eb1f7000      4K rw--- /usr/lib64/libnl-3.so.200.16.1
>> 00002b17eb1f8000     52K r-x-- /usr/lib64/libkrb5support.so.0.1
>> 00002b17eb205000   2048K ----- /usr/lib64/libkrb5support.so.0.1
>> 00002b17eb405000      4K r---- /usr/lib64/libkrb5support.so.0.1
>> 00002b17eb406000      4K rw--- /usr/lib64/libkrb5support.so.0.1
>> 00002b17eb407000     12K r-x-- /usr/lib64/libkeyutils.so.1.5
>> 00002b17eb40a000   2044K ----- /usr/lib64/libkeyutils.so.1.5
>> 00002b17eb609000      4K r---- /usr/lib64/libkeyutils.so.1.5
>> 00002b17eb60a000      4K rw--- /usr/lib64/libkeyutils.so.1.5
>> 00002b17eb60b000     88K r-x-- /usr/lib64/libresolv-2.17.so
>> 00002b17eb621000   2048K ----- /usr/lib64/libresolv-2.17.so
>> 00002b17eb821000      4K r---- /usr/lib64/libresolv-2.17.so
>> 00002b17eb822000      4K rw--- /usr/lib64/libresolv-2.17.so
>> 00002b17eb823000      8K rw---   [ anon ]
>> 00002b17eb825000    132K r-x-- /usr/lib64/libselinux.so.1
>> 00002b17eb846000   2048K ----- /usr/lib64/libselinux.so.1
>> 00002b17eba46000      4K r---- /usr/lib64/libselinux.so.1
>> 00002b17eba47000      4K rw--- /usr/lib64/libselinux.so.1
>> 00002b17eba48000      8K rw---   [ anon ]
>> 00002b17eba4a000    384K r-x-- /usr/lib64/libpcre.so.1.2.0
>> 00002b17ebaaa000   2044K ----- /usr/lib64/libpcre.so.1.2.0
>> 00002b17ebca9000      4K r---- /usr/lib64/libpcre.so.1.2.0
>> 00002b17ebcaa000      4K rw--- /usr/lib64/libpcre.so.1.2.0
>> 00002b17ebcab000      4K -----   [ anon ]
>> 00002b17ebcac000   3352K rw---   [ anon ]
>> 00002b17ec000000    132K rw---   [ anon ]
>> 00002b17ec021000  65404K -----   [ anon ]
>> 00002b17f0000000      4K -----   [ anon ]
>> 00002b17f0001000   2048K rw---   [ anon ]
>> 00002b17f0201000     16K r-x-- /usr/lib64/libhfi1verbs-rdmav2.so
>> 00002b17f0205000   2044K ----- /usr/lib64/libhfi1verbs-rdmav2.so
>> 00002b17f0404000      4K r---- /usr/lib64/libhfi1verbs-rdmav2.so
>> 00002b17f0405000      4K rw--- /usr/lib64/libhfi1verbs-rdmav2.so
>> 00002b17f0406000      4K rw---   [ anon ]
>> 00002b17f0407000   4096K rw---   [ anon ]
>> 00002b17f0807000   1032K rw---   [ anon ]
>> 00002b17f0909000   4100K rw-s- 
>> /tmp/openmpi-sessions-12001@node109_0/52426/1/1/vader_segment.node109.1
>> 00002b17f0d0a000   4236K rw-s- /dev/shm/psm2_shm.1200100000001a17100200
>> 00002b17f112d000    132K rw---   [ anon ]
>> 00002b17f114e000   4236K rw-s- /dev/shm/psm2_shm.1200100000000a17100000 
>> (deleted)
>> 00002b17f1571000   8628K rw---   [ anon ]
>> 00002b17f4000000    132K rw---   [ anon ]
>> 00002b17f4021000  65404K -----   [ anon ]
>> 00002b17f9e85000   9164K rw---   [ anon ]
>> 00007ffd8b021000  31316K rw---   [ stack ]
>> 00007ffd8cfa4000      8K r-x--   [ anon ]
>> ffffffffff600000      4K r-x--   [ anon ]
>> total           539352K
>> 
>> 
>> 
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On Thursday, December 8, 2016, Christof Koehler <
>>> christof.koeh...@bccms.uni-bremen.de> wrote:
>>> 
>>>> Hello everybody,
>>>> 
>>>> I tried it with the nightly and the direct 2.0.2 branch from git which
>>>> according to the log should contain that patch
>>>> 
>>>> commit d0b97d7a408b87425ca53523de369da405358ba2
>>>> Merge: ac8c019 b9420bb
>>>> Author: Jeff Squyres <jsquy...@users.noreply.github.com>
>>>> Date:   Wed Dec 7 18:24:46 2016 -0500
>>>>    Merge pull request #2528 from rhc54/cmr20x/signals
>>>> 
>>>> Unfortunately it changes nothing. The root rank stops and all other
>>>> ranks (and mpirun) just stay, the remaining ranks at 100 % CPU waiting
>>>> apparently in that allreduce. The stack trace looks a bit more
>>>> interesting (git is always debug build ?), so I include it at the very
>>>> bottom just in case.
>>>> 
>>>> Off-list, Gilles Gouaillardet suggested setting breakpoints at exit,
>>>> __exit etc. to try to catch signals. Would that be useful ? I need a
>>>> moment to figure out how to do this, but I can definitely try.
>>>> 
>>>> Some remark: During "make install" from the git repo I see a
>>>> 
>>>> WARNING!  Common symbols found:
>>>>          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex
>>>>          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex
>>>>          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_precision
>>>>          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer
>>>>          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real
>>>>          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint
>>>>          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band
>>>>          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor
>>>>          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor
>>>>          mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte
>>>> 
>>>> I have never noticed this before.
>>>> 
>>>> 
>>>> Best Regards
>>>> 
>>>> Christof
>>>> 
>>>> Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
>>>> #0  0x00002af84e4c669d in poll () from /lib64/libc.so.6
>>>> #1  0x00002af850517496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
>>>> #2  0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
>>>> #3  0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207
>>>> #4  0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144, requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
>>>> #5  0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, module=0xdee69e0) at base/coll_base_allreduce.c:225
>>>> #6  0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
>>>> #7  0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2, count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1) at pallreduce.c:107
>>>> #8  0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005", recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0, datatype=0xffffffffffffffff, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at pallreduce_f.c:87
>>>> #9  0x000000000045ecc6 in m_sum_i_ ()
>>>> #10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ ()
>>>> #11 0x00000000004325ff in vamp () at main.F:2640
>>>> #12 0x000000000040de1e in main ()
>>>> #13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
>>>> #14 0x000000000040dd29 in _start ()
>>>> 
>>>> On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org wrote:
>>>>> Hi Christof
>>>>> 
>>>>> Sorry if I missed this, but it sounds like you are saying that one of
>>>>> your procs abnormally terminates, and we are failing to kill the remaining
>>>>> job? Is that correct?
>>>>> 
>>>>> If so, I just did some work that might relate to that problem that is
>>>>> pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528
>>>>> 
>>>>> Would you be able to try that?
>>>>> 
>>>>> Ralph
>>>>> 
>>>>>> On Dec 7, 2016, at 9:37 AM, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
>>>>>>>> On Dec 7, 2016, at 10:07 AM, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:
>>>>>>>> 
>>>>>>>> I really think the hang is a consequence of
>>>>>>>> unclean termination (in the sense that the non-root ranks are not
>>>>>>>> terminated) and probably not the cause, in my interpretation of what I
>>>>>>>> see. Would you have any suggestion to catch signals sent between orterun
>>>>>>>> (mpirun) and the child tasks ?
>>>>>>> 
>>>>>>> Do you know where in the code the termination call is?  Is it
>>>>>>> actually calling mpi_abort(), or just doing something ugly like calling
>>>>>>> fortran “stop”?  If the latter, would that explain a possible hang?
>>>>>> Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The
>>>>>> wannier90 input contains an error: a restart is requested and the
>>>>>> wannier90.chk file with the restart information is missing.
>>>>>> "
>>>>>> Exiting.......
>>>>>> Error: restart requested but wannier90.chk file not found
>>>>>> "
>>>>>> So it must terminate.
>>>>>> 
>>>>>> The termination happens in the libwannier.a, source file io.F90:
>>>>>> 
>>>>>> write(stdout,*)  'Exiting.......'
>>>>>> write(stdout, '(1x,a)') trim(error_msg)
>>>>>> close(stdout)
>>>>>> stop "wannier90 error: examine the output/error file for details"
>>>>>> 
>>>>>> So it calls stop  as you assumed.
>>>>>> 
>>>>>>> Presumably someone here can comment on what the standard says about
>>>>>>> the validity of terminating without mpi_abort.
>>>>>> 
>>>>>> Well, probably stop is not a good way to terminate then.
>>>>>> 
>>>>>> My main point was the change relative to 1.10 anyway :-)
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Actually, if you’re willing to share enough input files to reproduce,
>>>>>>> I could take a look.  I just recompiled our VASP with openmpi 2.0.1 to fix
>>>>>>> a crash that was apparently addressed by some change in the memory
>>>>>>> allocator in a recent version of openmpi.  Just e-mail me if that’s the
>>>>>>> case.
>>>>>> 
>>>>>> I think that is no longer necessary ? In principle it is no problem, but
>>>>>> it is at the end of a (small) GW calculation, the Si tutorial example.
>>>>>> So the mail would be a bit larger due to the WAVECAR.
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Noam
>>>>>>> 
>>>>>>> 
>>>>>>> ____________
>>>>>>> ||
>>>>>>> |U.S. NAVAL|
>>>>>>> |_RESEARCH_|
>>>>>>> LABORATORY
>>>>>>> Noam Bernstein, Ph.D.
>>>>>>> Center for Materials Physics and Technology
>>>>>>> U.S. Naval Research Laboratory
>>>>>>> T +1 202 404 8628  F +1 202 404 7546
>>>>>>> https://www.nrl.navy.mil <https://www.nrl.navy.mil/>
>>>>>> 
>>>>>> --
>>>>>> Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
>>>>>> Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
>>>>>> Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
>>>>>> 28359 Bremen
>>>>>> 
>>>>>> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users@lists.open-mpi.org
>>>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>>>> 
>>>> 
>>>> 
>> 
> 
> 
> 
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users