Hello,

On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote:
> Christof,
> 
> 
> There is something really odd with this stack trace.
> count is zero, and some pointers do not point to valid addresses (!)
Yes, I assumed it was interesting :-) Note that the program is compiled
with -O2 -fp-model source, so optimization is on. I can try with -O0
or with gcc/gfortran (this will take a moment) to rule out the
optimization as the cause.

> 
> in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
> the stack has been corrupted inside MPI_Allreduce(), or that you are not
> using the library you think you use
> pmap <pid> will show you which lib is used
The pmap of the survivor is at the very end of this mail.
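
A full pmap listing is long; for the "which library is used" question it
is usually enough to look at the distinct shared objects. The following is
a minimal sketch, not part of the original report: the helper name
mapped_libs is made up for illustration, and it assumes pmap text in the
format shown at the end of this mail (including paths that were wrapped
onto a line of their own).

```shell
#!/bin/sh
# Hypothetical helper: read pmap output on stdin and print each mapped
# shared object once. Only fields that start with "/" and contain ".so"
# are kept, so device nodes and shared-memory segments are skipped, and
# a path that sits alone on a wrapped line is still picked up.
mapped_libs() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\// && $i ~ /\.so/) print $i }' | sort -u
}

# Usage (PID is a placeholder for the surviving rank):
#   pmap -p "$PID" | mapped_libs | grep -i mpi
```

If more than one Open MPI installation shows up in that list, the
"not the library you think you use" explanation becomes more likely.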

> 
> btw, this was not started with
> mpirun --mca coll ^tuned ...
> right ?
This is correct, it was not started with "mpirun --mca coll ^tuned". Using
it does not change anything.

> 
> just to make it clear ...
> a task from your program bluntly issues a fortran STOP, and this is kind of
> a feature.
Yes. The library where the STOP occurs is/was written for serial use as
far as I can tell. As I mentioned, it is not our code but this one
http://www.wannier.org/ (Version 1.2) linked into https://www.vasp.at/,
which should be a working combination.

> the *only* issue is mpirun does not kill the other MPI tasks and mpirun
> never completes.
> did i get it right ?
Yes! So it is not a really big problem IMO. Just a bit nasty if this
happens with a job in the queueing system.
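
One way to keep such a hang from blocking a batch slot is to put a
wall-clock limit around mpirun in the job script. This is only a sketch
under assumptions: it relies on the GNU coreutils "timeout" command, the
function name run_with_limit and the 30 second grace period are made up
for illustration, and it bounds mpirun itself; whether leftover ranks are
cleaned up afterwards still depends on the launcher and the batch system.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with a wall-clock limit.
# GNU timeout sends TERM when the limit is reached and KILL after the
# grace period given with --kill-after. Its exit status is 124 when the
# command was stopped by the TERM at the limit.
run_with_limit() {
    limit=$1
    shift
    timeout --kill-after=30 "$limit" "$@"
}

# Usage in a job script (limit, rank count and binary are placeholders):
#   run_with_limit 2h mpirun -np 16 vasp-mpi-sca
```

The limit would normally be set a little below the walltime requested
from the queueing system, so the job script can still report the failure.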

Best Regards

Christof

Note: git branch 2.0.2 of openmpi was configured and installed (make
install) with
./configure CC=icc CXX=icpc FC=ifort F77=ifort \
    FFLAGS="-O1 -fp-model precise" CFLAGS="-O1 -fp-model precise" \
    CXXFLAGS="-O1 -fp-model precise" FCFLAGS="-O1 -fp-model precise" \
    --with-psm2 --with-tm --with-hwloc=internal --enable-static \
    --enable-orterun-prefix-by-default \
    --prefix=/cluster/mpi/openmpi/2.0.2/intel2016

The OS is CentOS 7, relatively current :-), with the current Omni-Path
driver package from Intel (10.2).

vasp is linked against Intel MKL LAPACK/BLAS, self-compiled ScaLAPACK
(trunk 206) and FFTW 3.3.5. FFTW and ScaLAPACK are statically linked. And
of course libwannier.a version 1.2 is statically linked as well.

pmap -p of the survivor

32282:   /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
0000000000400000  65200K r-x-- 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
00000000045ab000    100K r---- 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
00000000045c4000   2244K rw--- 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
00000000047f5000 100900K rw---   [ anon ]
000000000bfaa000    684K rw---   [ anon ]
000000000c055000     20K rw---   [ anon ]
000000000c05a000    424K rw---   [ anon ]
000000000c0c4000     68K rw---   [ anon ]
000000000c0d5000  25384K rw---   [ anon ]
00002b17e34f6000    132K r-x-- /usr/lib64/ld-2.17.so
00002b17e3517000      4K rw---   [ anon ]
00002b17e3518000     28K rw-s- /dev/infiniband/uverbs0
00002b17e3523000     88K rw---   [ anon ]
00002b17e3539000    772K rw-s- /dev/infiniband/uverbs0
00002b17e35fa000    772K rw-s- /dev/infiniband/uverbs0
00002b17e36bb000    196K rw-s- /dev/infiniband/uverbs0
00002b17e36ec000     28K rw-s- /dev/infiniband/uverbs0
00002b17e36f3000     20K rw-s- /dev/infiniband/uverbs0
00002b17e3717000      4K r---- /usr/lib64/ld-2.17.so
00002b17e3718000      4K rw--- /usr/lib64/ld-2.17.so
00002b17e3719000      4K rw---   [ anon ]
00002b17e371a000     88K r-x-- /usr/lib64/libpthread-2.17.so
00002b17e3730000   2048K ----- /usr/lib64/libpthread-2.17.so
00002b17e3930000      4K r---- /usr/lib64/libpthread-2.17.so
00002b17e3931000      4K rw--- /usr/lib64/libpthread-2.17.so
00002b17e3932000     16K rw---   [ anon ]
00002b17e3936000   1028K r-x-- /usr/lib64/libm-2.17.so
00002b17e3a37000   2044K ----- /usr/lib64/libm-2.17.so
00002b17e3c36000      4K r---- /usr/lib64/libm-2.17.so
00002b17e3c37000      4K rw--- /usr/lib64/libm-2.17.so
00002b17e3c38000     12K r-x-- /usr/lib64/libdl-2.17.so
00002b17e3c3b000   2044K ----- /usr/lib64/libdl-2.17.so
00002b17e3e3a000      4K r---- /usr/lib64/libdl-2.17.so
00002b17e3e3b000      4K rw--- /usr/lib64/libdl-2.17.so
00002b17e3e3c000    184K r-x-- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
00002b17e3e6a000   2044K ----- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
00002b17e4069000      4K r---- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
00002b17e406a000      4K rw--- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
00002b17e406b000     36K r-x-- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
00002b17e4074000   2044K ----- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
00002b17e4273000      4K r---- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
00002b17e4274000      4K rw--- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
00002b17e4275000    396K r-x-- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
00002b17e42d8000   2044K ----- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
00002b17e44d7000      4K r---- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
00002b17e44d8000      4K rw--- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
00002b17e44d9000   1948K r-x-- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
00002b17e46c0000   2044K ----- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
00002b17e48bf000     12K r---- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
00002b17e48c2000    104K rw--- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
00002b17e48dc000     76K rw---   [ anon ]
00002b17e48ef000    948K r-x-- /usr/lib64/libc-2.17.so
00002b17e49dc000      4K r-x-- /usr/lib64/libc-2.17.so
00002b17e49dd000     12K r-x-- /usr/lib64/libc-2.17.so
00002b17e49e0000      4K r-x-- /usr/lib64/libc-2.17.so
00002b17e49e1000     20K r-x-- /usr/lib64/libc-2.17.so
00002b17e49e6000      8K r-x-- /usr/lib64/libc-2.17.so
00002b17e49e8000    760K r-x-- /usr/lib64/libc-2.17.so
00002b17e4aa6000   2048K ----- /usr/lib64/libc-2.17.so
00002b17e4ca6000     16K r---- /usr/lib64/libc-2.17.so
00002b17e4caa000      8K rw--- /usr/lib64/libc-2.17.so
00002b17e4cac000     20K rw---   [ anon ]
00002b17e4cb1000     84K r-x-- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00002b17e4cc6000   2044K ----- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00002b17e4ec5000      4K r---- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00002b17e4ec6000      4K rw--- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00002b17e4ec7000    452K r-x-- /usr/lib64/libpsm2.so.2.1
00002b17e4f38000   2044K ----- /usr/lib64/libpsm2.so.2.1
00002b17e5137000      4K r---- /usr/lib64/libpsm2.so.2.1
00002b17e5138000      8K rw--- /usr/lib64/libpsm2.so.2.1
00002b17e513a000      4K rw---   [ anon ]
00002b17e513b000   1344K r-x-- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
00002b17e528b000   2044K ----- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
00002b17e548a000      8K r---- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
00002b17e548c000     44K rw--- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
00002b17e5497000     12K rw---   [ anon ]
00002b17e549a000    480K r-x-- /usr/lib64/libtorque.so.2.0.0
00002b17e5512000   2044K ----- /usr/lib64/libtorque.so.2.0.0
00002b17e5711000      8K r---- /usr/lib64/libtorque.so.2.0.0
00002b17e5713000      8K rw--- /usr/lib64/libtorque.so.2.0.0
00002b17e5715000   6704K rw---   [ anon ]
00002b17e5da1000   1404K r-x-- /usr/lib64/libxml2.so.2.9.1
00002b17e5f00000   2044K ----- /usr/lib64/libxml2.so.2.9.1
00002b17e60ff000     32K r---- /usr/lib64/libxml2.so.2.9.1
00002b17e6107000      8K rw--- /usr/lib64/libxml2.so.2.9.1
00002b17e6109000      8K rw---   [ anon ]
00002b17e610b000     84K r-x-- /usr/lib64/libz.so.1.2.7
00002b17e6120000   2044K ----- /usr/lib64/libz.so.1.2.7
00002b17e631f000      4K r---- /usr/lib64/libz.so.1.2.7
00002b17e6320000      4K rw--- /usr/lib64/libz.so.1.2.7
00002b17e6321000   1784K r-x-- /usr/lib64/libcrypto.so.1.0.1e
00002b17e64df000   2048K ----- /usr/lib64/libcrypto.so.1.0.1e
00002b17e66df000    104K r---- /usr/lib64/libcrypto.so.1.0.1e
00002b17e66f9000     48K rw--- /usr/lib64/libcrypto.so.1.0.1e
00002b17e6705000     16K rw---   [ anon ]
00002b17e6709000    396K r-x-- /usr/lib64/libssl.so.1.0.1e
00002b17e676c000   2044K ----- /usr/lib64/libssl.so.1.0.1e
00002b17e696b000     16K r---- /usr/lib64/libssl.so.1.0.1e
00002b17e696f000     28K rw--- /usr/lib64/libssl.so.1.0.1e
00002b17e6976000   1572K r-x-- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
00002b17e6aff000   2044K ----- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
00002b17e6cfe000     20K r---- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
00002b17e6d03000     56K rw--- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
00002b17e6d11000    552K rw---   [ anon ]
00002b17e6d9b000     84K r-x-- /usr/lib64/librdmacm.so.1.0.0
00002b17e6db0000   2044K ----- /usr/lib64/librdmacm.so.1.0.0
00002b17e6faf000      4K r---- /usr/lib64/librdmacm.so.1.0.0
00002b17e6fb0000      4K rw--- /usr/lib64/librdmacm.so.1.0.0
00002b17e6fb1000      4K rw---   [ anon ]
00002b17e6fb2000     68K r-x-- /usr/lib64/libibverbs.so.1.0.0
00002b17e6fc3000   2044K ----- /usr/lib64/libibverbs.so.1.0.0
00002b17e71c2000      4K r---- /usr/lib64/libibverbs.so.1.0.0
00002b17e71c3000      4K rw--- /usr/lib64/libibverbs.so.1.0.0
00002b17e71c4000     40K r-x-- /usr/lib64/libnuma.so.1
00002b17e71ce000   2048K ----- /usr/lib64/libnuma.so.1
00002b17e73ce000      4K r---- /usr/lib64/libnuma.so.1
00002b17e73cf000      4K rw--- /usr/lib64/libnuma.so.1
00002b17e73d0000     32K r-x-- /usr/lib64/libpciaccess.so.0.11.1
00002b17e73d8000   2048K ----- /usr/lib64/libpciaccess.so.0.11.1
00002b17e75d8000      4K r---- /usr/lib64/libpciaccess.so.0.11.1
00002b17e75d9000      4K rw--- /usr/lib64/libpciaccess.so.0.11.1
00002b17e75da000     28K r-x-- /usr/lib64/librt-2.17.so
00002b17e75e1000   2044K ----- /usr/lib64/librt-2.17.so
00002b17e77e0000      4K r---- /usr/lib64/librt-2.17.so
00002b17e77e1000      4K rw--- /usr/lib64/librt-2.17.so
00002b17e77e2000      8K r-x-- /usr/lib64/libutil-2.17.so
00002b17e77e4000   2044K ----- /usr/lib64/libutil-2.17.so
00002b17e79e3000      4K r---- /usr/lib64/libutil-2.17.so
00002b17e79e4000      4K rw--- /usr/lib64/libutil-2.17.so
00002b17e79e5000    152K r-x-- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
00002b17e7a0b000   2044K ----- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
00002b17e7c0a000      4K r---- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
00002b17e7c0b000      8K rw--- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
00002b17e7c0d000     24K rw---   [ anon ]
00002b17e7c13000   1288K r-x-- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
00002b17e7d55000   2044K ----- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
00002b17e7f54000     12K r---- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
00002b17e7f57000     12K rw--- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
00002b17e7f5a000    116K rw---   [ anon ]
00002b17e7f77000   2696K r-x-- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
00002b17e8219000   2044K ----- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
00002b17e8418000     24K r---- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
00002b17e841e000    340K rw--- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
00002b17e8473000    420K r-x-- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
00002b17e84dc000   2048K ----- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
00002b17e86dc000      4K r---- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
00002b17e86dd000      4K rw--- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
00002b17e86de000      4K rw---   [ anon ]
00002b17e86df000  13124K r-x-- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
00002b17e93b0000   2048K ----- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
00002b17e95b0000    220K r---- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
00002b17e95e7000     20K rw--- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
00002b17e95ec000   1304K r-x-- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
00002b17e9732000   2048K ----- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
00002b17e9932000     12K r---- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
00002b17e9935000     12K rw--- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
00002b17e9938000    296K rw---   [ anon ]
00002b17e9982000   1464K r-x-- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
00002b17e9af0000   2044K ----- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
00002b17e9cef000      4K r---- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
00002b17e9cf0000     16K rw--- 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
00002b17e9cf4000     16K r-x-- /usr/lib64/libuuid.so.1.3.0
00002b17e9cf8000   2044K ----- /usr/lib64/libuuid.so.1.3.0
00002b17e9ef7000      4K r---- /usr/lib64/libuuid.so.1.3.0
00002b17e9ef8000      4K rw--- /usr/lib64/libuuid.so.1.3.0
00002b17e9ef9000    932K r-x-- /usr/lib64/libstdc++.so.6.0.19
00002b17e9fe2000   2048K ----- /usr/lib64/libstdc++.so.6.0.19
00002b17ea1e2000     32K r---- /usr/lib64/libstdc++.so.6.0.19
00002b17ea1ea000      8K rw--- /usr/lib64/libstdc++.so.6.0.19
00002b17ea1ec000     84K rw---   [ anon ]
00002b17ea201000    144K r-x-- /usr/lib64/liblzma.so.5.0.99
00002b17ea225000   2044K ----- /usr/lib64/liblzma.so.5.0.99
00002b17ea424000      4K r---- /usr/lib64/liblzma.so.5.0.99
00002b17ea425000      4K rw--- /usr/lib64/liblzma.so.5.0.99
00002b17ea426000    292K r-x-- /usr/lib64/libgssapi_krb5.so.2.2
00002b17ea46f000   2048K ----- /usr/lib64/libgssapi_krb5.so.2.2
00002b17ea66f000      4K r---- /usr/lib64/libgssapi_krb5.so.2.2
00002b17ea670000      8K rw--- /usr/lib64/libgssapi_krb5.so.2.2
00002b17ea672000    852K r-x-- /usr/lib64/libkrb5.so.3.3
00002b17ea747000   2048K ----- /usr/lib64/libkrb5.so.3.3
00002b17ea947000     52K r---- /usr/lib64/libkrb5.so.3.3
00002b17ea954000     12K rw--- /usr/lib64/libkrb5.so.3.3
00002b17ea957000     12K r-x-- /usr/lib64/libcom_err.so.2.1
00002b17ea95a000   2044K ----- /usr/lib64/libcom_err.so.2.1
00002b17eab59000      4K r---- /usr/lib64/libcom_err.so.2.1
00002b17eab5a000      4K rw--- /usr/lib64/libcom_err.so.2.1
00002b17eab5b000    188K r-x-- /usr/lib64/libk5crypto.so.3.1
00002b17eab8a000   2044K ----- /usr/lib64/libk5crypto.so.3.1
00002b17ead89000      8K r---- /usr/lib64/libk5crypto.so.3.1
00002b17ead8b000      4K rw--- /usr/lib64/libk5crypto.so.3.1
00002b17ead8c000      4K rw---   [ anon ]
00002b17ead8d000    284K r-x-- /usr/lib64/libnl-route-3.so.200.16.1
00002b17eadd4000   2044K ----- /usr/lib64/libnl-route-3.so.200.16.1
00002b17eafd3000     12K r---- /usr/lib64/libnl-route-3.so.200.16.1
00002b17eafd6000     16K rw--- /usr/lib64/libnl-route-3.so.200.16.1
00002b17eafda000      8K rw---   [ anon ]
00002b17eafdc000    104K r-x-- /usr/lib64/libnl-3.so.200.16.1
00002b17eaff6000   2044K ----- /usr/lib64/libnl-3.so.200.16.1
00002b17eb1f5000      8K r---- /usr/lib64/libnl-3.so.200.16.1
00002b17eb1f7000      4K rw--- /usr/lib64/libnl-3.so.200.16.1
00002b17eb1f8000     52K r-x-- /usr/lib64/libkrb5support.so.0.1
00002b17eb205000   2048K ----- /usr/lib64/libkrb5support.so.0.1
00002b17eb405000      4K r---- /usr/lib64/libkrb5support.so.0.1
00002b17eb406000      4K rw--- /usr/lib64/libkrb5support.so.0.1
00002b17eb407000     12K r-x-- /usr/lib64/libkeyutils.so.1.5
00002b17eb40a000   2044K ----- /usr/lib64/libkeyutils.so.1.5
00002b17eb609000      4K r---- /usr/lib64/libkeyutils.so.1.5
00002b17eb60a000      4K rw--- /usr/lib64/libkeyutils.so.1.5
00002b17eb60b000     88K r-x-- /usr/lib64/libresolv-2.17.so
00002b17eb621000   2048K ----- /usr/lib64/libresolv-2.17.so
00002b17eb821000      4K r---- /usr/lib64/libresolv-2.17.so
00002b17eb822000      4K rw--- /usr/lib64/libresolv-2.17.so
00002b17eb823000      8K rw---   [ anon ]
00002b17eb825000    132K r-x-- /usr/lib64/libselinux.so.1
00002b17eb846000   2048K ----- /usr/lib64/libselinux.so.1
00002b17eba46000      4K r---- /usr/lib64/libselinux.so.1
00002b17eba47000      4K rw--- /usr/lib64/libselinux.so.1
00002b17eba48000      8K rw---   [ anon ]
00002b17eba4a000    384K r-x-- /usr/lib64/libpcre.so.1.2.0
00002b17ebaaa000   2044K ----- /usr/lib64/libpcre.so.1.2.0
00002b17ebca9000      4K r---- /usr/lib64/libpcre.so.1.2.0
00002b17ebcaa000      4K rw--- /usr/lib64/libpcre.so.1.2.0
00002b17ebcab000      4K -----   [ anon ]
00002b17ebcac000   3352K rw---   [ anon ]
00002b17ec000000    132K rw---   [ anon ]
00002b17ec021000  65404K -----   [ anon ]
00002b17f0000000      4K -----   [ anon ]
00002b17f0001000   2048K rw---   [ anon ]
00002b17f0201000     16K r-x-- /usr/lib64/libhfi1verbs-rdmav2.so
00002b17f0205000   2044K ----- /usr/lib64/libhfi1verbs-rdmav2.so
00002b17f0404000      4K r---- /usr/lib64/libhfi1verbs-rdmav2.so
00002b17f0405000      4K rw--- /usr/lib64/libhfi1verbs-rdmav2.so
00002b17f0406000      4K rw---   [ anon ]
00002b17f0407000   4096K rw---   [ anon ]
00002b17f0807000   1032K rw---   [ anon ]
00002b17f0909000   4100K rw-s- 
/tmp/openmpi-sessions-12001@node109_0/52426/1/1/vader_segment.node109.1
00002b17f0d0a000   4236K rw-s- /dev/shm/psm2_shm.1200100000001a17100200
00002b17f112d000    132K rw---   [ anon ]
00002b17f114e000   4236K rw-s- /dev/shm/psm2_shm.1200100000000a17100000 
(deleted)
00002b17f1571000   8628K rw---   [ anon ]
00002b17f4000000    132K rw---   [ anon ]
00002b17f4021000  65404K -----   [ anon ]
00002b17f9e85000   9164K rw---   [ anon ]
00007ffd8b021000  31316K rw---   [ stack ]
00007ffd8cfa4000      8K r-x--   [ anon ]
ffffffffff600000      4K r-x--   [ anon ]
 total           539352K



> 
> Cheers,
> 
> Gilles
> 
> On Thursday, December 8, 2016, Christof Koehler <
> christof.koeh...@bccms.uni-bremen.de> wrote:
> 
> > Hello everybody,
> >
> > I tried it with the nightly and the direct 2.0.2 branch from git which
> > according to the log should contain that patch
> >
> > commit d0b97d7a408b87425ca53523de369da405358ba2
> > Merge: ac8c019 b9420bb
> > Author: Jeff Squyres <jsquy...@users.noreply.github.com>
> > Date:   Wed Dec 7 18:24:46 2016 -0500
> >     Merge pull request #2528 from rhc54/cmr20x/signals
> >
> > Unfortunately it changes nothing. The root rank stops and all other
> > ranks (and mpirun) just stay, the remaining ranks at 100 % CPU waiting
> > apparently in that allreduce. The stack trace looks a bit more
> > interesting (git is always debug build ?), so I include it at the very
> > bottom just in case.
> >
> > Off-list Gilles Gouaillardet suggested to set breakpoints at exit,
> > __exit etc. to try to catch signals. Would that be useful ? I need a
> > moment to figure out how to do this, but I can definitively try.
> >
> > Some remark: During "make install" from the git repo I see a
> >
> > WARNING!  Common symbols found:
> >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex
> >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex
> >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_precision
> >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer
> >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real
> >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint
> >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band
> >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor
> >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor
> >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte
> >
> > I have never noticed this before.
> >
> >
> > Best Regards
> >
> > Christof
> >
> > Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
> > #0  0x00002af84e4c669d in poll () from /lib64/libc.so.6
> > #1  0x00002af850517496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> > #2  0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> > #3  0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207
> > #4  0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144, requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
> > #5  0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, module=0xdee69e0) at base/coll_base_allreduce.c:225
> > #6  0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
> > #7  0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2, count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1) at pallreduce.c:107
> > #8  0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005", recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0, datatype=0xffffffffffffffff, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at pallreduce_f.c:87
> > #9  0x000000000045ecc6 in m_sum_i_ ()
> > #10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ ()
> > #11 0x00000000004325ff in vamp () at main.F:2640
> > #12 0x000000000040de1e in main ()
> > #13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
> > #14 0x000000000040dd29 in _start ()
> >
> > On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org wrote:
> > > Hi Christof
> > >
> > > Sorry if I missed this, but it sounds like you are saying that one of
> > > your procs abnormally terminates, and we are failing to kill the
> > > remaining job? Is that correct?
> > >
> > > If so, I just did some work that might relate to that problem that is
> > > pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528
> > >
> > > Would you be able to try that?
> > >
> > > Ralph
> > >
> > > > On Dec 7, 2016, at 9:37 AM, Christof Koehler
> > > > <christof.koeh...@bccms.uni-bremen.de> wrote:
> > > >
> > > > Hello,
> > > >
> > > > On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> > > >>> On Dec 7, 2016, at 10:07 AM, Christof Koehler
> > > >>> <christof.koeh...@bccms.uni-bremen.de> wrote:
> > > >>>>
> > > >>> I really think the hang is a consequence of
> > > >>> unclean termination (in the sense that the non-root ranks are not
> > > >>> terminated) and probably not the cause, in my interpretation of
> > > >>> what I see. Would you have any suggestion to catch signals sent
> > > >>> between orterun (mpirun) and the child tasks ?
> > > >>
> > > >> Do you know where in the code the termination call is?  Is it
> > > >> actually calling mpi_abort(), or just doing something ugly like
> > > >> calling fortran “stop”?  If the latter, would that explain a
> > > >> possible hang?
> > > > Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The
> > > > wannier90 input contains an error: a restart is requested and the
> > > > wannier90.chk file with the restart information is missing.
> > > > "
> > > > Exiting.......
> > > > Error: restart requested but wannier90.chk file not found
> > > > "
> > > > So it must terminate.
> > > >
> > > > The termination happens in the libwannier.a, source file io.F90:
> > > >
> > > > write(stdout,*)  'Exiting.......'
> > > > write(stdout, '(1x,a)') trim(error_msg)
> > > > close(stdout)
> > > > stop "wannier90 error: examine the output/error file for details"
> > > >
> > > > So it calls stop  as you assumed.
> > > >
> > > >> Presumably someone here can comment on what the standard says about
> > > >> the validity of terminating without mpi_abort.
> > > >
> > > > Well, probably stop is not a good way to terminate then.
> > > >
> > > > My main point was the change relative to 1.10 anyway :-)
> > > >
> > > >
> > > >>
> > > >> Actually, if you’re willing to share enough input files to reproduce,
> > > >> I could take a look.  I just recompiled our VASP with openmpi 2.0.1
> > > >> to fix a crash that was apparently addressed by some change in the
> > > >> memory allocator in a recent version of openmpi.  Just e-mail me if
> > > >> that’s the case.
> > > >
> > > > I think that is no longer necessary ? In principle it is no problem but
> > > > it is at the end of a (small) GW calculation, the Si tutorial example.
> > > > So the mail would be a bit larger due to the WAVECAR.
> > > >
> > > >
> > > >>
> > > >>
> > > >> Noam
> > > >>
> > > >>
> > > >> ____________
> > > >> ||
> > > >> |U.S. NAVAL|
> > > >> |_RESEARCH_|
> > > >> LABORATORY
> > > >> Noam Bernstein, Ph.D.
> > > >> Center for Materials Physics and Technology
> > > >> U.S. Naval Research Laboratory
> > > >> T +1 202 404 8628  F +1 202 404 7546
> > > >> https://www.nrl.navy.mil
> > > >
> > > > --
> > > > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> > > > Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> > > > Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> > > > 28359 Bremen
> > > >
> > > > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> > > > _______________________________________________
> > > > users mailing list
> > > > users@lists.open-mpi.org
> > > > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> > >
> >
> > --
> > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> > Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> > Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> > 28359 Bremen
> >
> > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> >

-- 
Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
28359 Bremen  

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
