As best I can determine, mpirun catches SIGTERM just fine and will hit the procs with SIGCONT, followed by SIGTERM and then SIGKILL. It will then wait for the remote daemons to complete after they hit their procs with the same sequence.
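For anyone following along, the escalation described above can be sketched roughly as follows. This is only an illustration in Python, not the actual mpirun/orterun code: the function name `terminate_with_escalation` and the grace period are assumptions made up for this example. SIGCONT goes first so that a stopped process wakes up and can act on the SIGTERM; SIGKILL is the fallback if the process is still there after the grace period.

```python
import signal
import subprocess
import sys


def terminate_with_escalation(proc, grace_seconds=2.0):
    """Send SIGCONT, then SIGTERM, and fall back to SIGKILL.

    `proc` is a subprocess.Popen object. Returns the name of the last
    signal sent. Hypothetical helper, not an Open MPI API.
    """
    proc.send_signal(signal.SIGCONT)   # wake the process if it was stopped
    proc.send_signal(signal.SIGTERM)   # polite request to exit
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGKILL)  # cannot be caught or ignored
        proc.wait()                       # reap the child
        return "SIGKILL"


if __name__ == "__main__":
    # Child that ignores SIGTERM, so the escalation has to reach SIGKILL.
    child_code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    child = subprocess.Popen([sys.executable, "-c", child_code],
                             stdout=subprocess.PIPE)
    child.stdout.readline()  # wait until the child has installed SIG_IGN
    print(terminate_with_escalation(child, grace_seconds=0.5))
    child.stdout.close()
```

A daemon on a remote node would apply the same sequence to its local procs and then report back, which is the step mpirun waits on.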

> On Dec 8, 2016, at 5:18 AM, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:
>
> Hello again,
>
> I am still not sure about breakpoints. But I did a "catch signal" in
> gdb, gdb's were attached to the two vasp processes and mpirun.
>
> When the root rank exits I see in the gdb attaching to it
> [Thread 0x2b2787df8700 (LWP 2457) exited]
> [Thread 0x2b277f483180 (LWP 2455) exited]
> [Inferior 1 (process 2455) exited normally]
>
> In the gdb attached to the mpirun
> Catchpoint 1 (signal SIGCHLD), 0x00002b16560f769d in poll () from /lib64/libc.so.6
>
> In the gdb attached to the second rank I see no output.
>
> Issuing "continue" in the gdb session attached to mpirun does not lead
> to anything new as far as I can tell.
>
> The stack trace of the mpirun after that (Ctrl-C'ed to stop it again) is
> #0 0x00002b16560f769d in poll () from /lib64/libc.so.6
> #1 0x00002b1654b3a496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #2 0x00002b1654b32fa5 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #3 0x0000000000406311 in orterun (argc=7, argv=0x7ffdabfbebc8) at orterun.c:1071
> #4 0x00000000004037e0 in main (argc=7, argv=0x7ffdabfbebc8) at main.c:13
>
> So there is a signal and mpirun does nothing with it ?
>
> Cheers
>
> Christof
>
>
> On Thu, Dec 08, 2016 at 12:39:06PM +0100, Christof Koehler wrote:
>> Hello,
>>
>> On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote:
>>> Christof,
>>>
>>>
>>> There is something really odd with this stack trace.
>>> count is zero, and some pointers do not point to valid addresses (!)
>> Yes, I assumed it was interesting :-) Note that the program is compiled
>> with -O2 -fp-model source, so optimization is on. I can try with -O0
>> or the gcc/gfortran ( will take a moment) to make sure it is not a
>> problem from that.
>>
>>>
>>> in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
>>> the stack has been corrupted inside MPI_Allreduce(), or that you are not
>>> using the library you think you use
>>> pmap <pid> will show you which lib is used
>> The pmap of the survivor is at the very end of this mail.
>>
>>>
>>> btw, this was not started with
>>> mpirun --mca coll ^tuned ...
>>> right ?
>> This is correct, not started with "mpirun --mca coll ^tuned". Using it
>> does not change anything.
>>
>>>
>>> just to make it clear ...
>>> a task from your program bluntly issues a fortran STOP, and this is kind of
>>> a feature.
>> Yes. The library where the stack occurs is/was written for serial use as
>> far as I can tell. As I mentioned, it is not our code but this one
>> http://www.wannier.org/ (Version 1.2) linked into https://www.vasp.at/ which
>> should be a working combination.
>>
>>> the *only* issue is mpirun does not kill the other MPI tasks and mpirun
>>> never completes.
>>> did i get it right ?
>> Yes ! So it is not a really big problem IMO. Just a bit nasty if this
>> would happen with a job in the queueing system.
>>
>> Best Regards
>>
>> Christof
>>
>> Note: git branch 2.0.2 of openmpi was configured and installed (make
>> install) with
>> ./configure CC=icc CXX=icpc FC=ifort F77=ifort FFLAGS="-O1 -fp-model
>> precise" CFLAGS="-O1 -fp-model precise" CXXFLAGS="-O1 -fp-model precise"
>> FCFLAGS="-O1 -fp-model precise" --with-psm2 --with-tm
>> --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
>> --prefix=/cluster/mpi/openmpi/2.0.2/intel2016
>>
>> The OS is Centos 7, relatively current :-) with current Omni-Path driver
>> package from Intel (10.2).
>>
>> vasp is linked against Intel MKL Lapack/Blas, self compiled scalapack
>> (trunk 206) and FFTW 3.3.5. FFTW and scalapack statically linked. And of
>> course the libwannier.a version 1.2 statically linked.
>> >> pmap -p of the survivor >> >> 32282: /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca >> 0000000000400000 65200K r-x-- >> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca >> 00000000045ab000 100K r---- >> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca >> 00000000045c4000 2244K rw--- >> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca >> 00000000047f5000 100900K rw--- [ anon ] >> 000000000bfaa000 684K rw--- [ anon ] >> 000000000c055000 20K rw--- [ anon ] >> 000000000c05a000 424K rw--- [ anon ] >> 000000000c0c4000 68K rw--- [ anon ] >> 000000000c0d5000 25384K rw--- [ anon ] >> 00002b17e34f6000 132K r-x-- /usr/lib64/ld-2.17.so >> 00002b17e3517000 4K rw--- [ anon ] >> 00002b17e3518000 28K rw-s- /dev/infiniband/uverbs0 >> 00002b17e3523000 88K rw--- [ anon ] >> 00002b17e3539000 772K rw-s- /dev/infiniband/uverbs0 >> 00002b17e35fa000 772K rw-s- /dev/infiniband/uverbs0 >> 00002b17e36bb000 196K rw-s- /dev/infiniband/uverbs0 >> 00002b17e36ec000 28K rw-s- /dev/infiniband/uverbs0 >> 00002b17e36f3000 20K rw-s- /dev/infiniband/uverbs0 >> 00002b17e3717000 4K r---- /usr/lib64/ld-2.17.so >> 00002b17e3718000 4K rw--- /usr/lib64/ld-2.17.so >> 00002b17e3719000 4K rw--- [ anon ] >> 00002b17e371a000 88K r-x-- /usr/lib64/libpthread-2.17.so >> 00002b17e3730000 2048K ----- /usr/lib64/libpthread-2.17.so >> 00002b17e3930000 4K r---- /usr/lib64/libpthread-2.17.so >> 00002b17e3931000 4K rw--- /usr/lib64/libpthread-2.17.so >> 00002b17e3932000 16K rw--- [ anon ] >> 00002b17e3936000 1028K r-x-- /usr/lib64/libm-2.17.so >> 00002b17e3a37000 2044K ----- /usr/lib64/libm-2.17.so >> 00002b17e3c36000 4K r---- /usr/lib64/libm-2.17.so >> 00002b17e3c37000 4K rw--- /usr/lib64/libm-2.17.so >> 00002b17e3c38000 12K r-x-- /usr/lib64/libdl-2.17.so >> 00002b17e3c3b000 2044K ----- /usr/lib64/libdl-2.17.so >> 00002b17e3e3a000 4K r---- /usr/lib64/libdl-2.17.so >> 00002b17e3e3b000 4K rw--- /usr/lib64/libdl-2.17.so >> 00002b17e3e3c000 184K r-x-- >> 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0 >> 00002b17e3e6a000 2044K ----- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0 >> 00002b17e4069000 4K r---- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0 >> 00002b17e406a000 4K rw--- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0 >> 00002b17e406b000 36K r-x-- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0 >> 00002b17e4074000 2044K ----- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0 >> 00002b17e4273000 4K r---- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0 >> 00002b17e4274000 4K rw--- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0 >> 00002b17e4275000 396K r-x-- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0 >> 00002b17e42d8000 2044K ----- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0 >> 00002b17e44d7000 4K r---- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0 >> 00002b17e44d8000 4K rw--- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0 >> 00002b17e44d9000 1948K r-x-- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1 >> 00002b17e46c0000 2044K ----- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1 >> 00002b17e48bf000 12K r---- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1 >> 00002b17e48c2000 104K rw--- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1 >> 00002b17e48dc000 76K rw--- [ anon ] >> 00002b17e48ef000 948K r-x-- /usr/lib64/libc-2.17.so >> 00002b17e49dc000 4K r-x-- /usr/lib64/libc-2.17.so >> 00002b17e49dd000 12K r-x-- /usr/lib64/libc-2.17.so >> 00002b17e49e0000 4K r-x-- /usr/lib64/libc-2.17.so >> 00002b17e49e1000 20K r-x-- /usr/lib64/libc-2.17.so >> 00002b17e49e6000 8K r-x-- /usr/lib64/libc-2.17.so >> 00002b17e49e8000 760K r-x-- /usr/lib64/libc-2.17.so >> 
00002b17e4aa6000 2048K ----- /usr/lib64/libc-2.17.so >> 00002b17e4ca6000 16K r---- /usr/lib64/libc-2.17.so >> 00002b17e4caa000 8K rw--- /usr/lib64/libc-2.17.so >> 00002b17e4cac000 20K rw--- [ anon ] >> 00002b17e4cb1000 84K r-x-- /usr/lib64/libgcc_s-4.8.5-20150702.so.1 >> 00002b17e4cc6000 2044K ----- /usr/lib64/libgcc_s-4.8.5-20150702.so.1 >> 00002b17e4ec5000 4K r---- /usr/lib64/libgcc_s-4.8.5-20150702.so.1 >> 00002b17e4ec6000 4K rw--- /usr/lib64/libgcc_s-4.8.5-20150702.so.1 >> 00002b17e4ec7000 452K r-x-- /usr/lib64/libpsm2.so.2.1 >> 00002b17e4f38000 2044K ----- /usr/lib64/libpsm2.so.2.1 >> 00002b17e5137000 4K r---- /usr/lib64/libpsm2.so.2.1 >> 00002b17e5138000 8K rw--- /usr/lib64/libpsm2.so.2.1 >> 00002b17e513a000 4K rw--- [ anon ] >> 00002b17e513b000 1344K r-x-- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0 >> 00002b17e528b000 2044K ----- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0 >> 00002b17e548a000 8K r---- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0 >> 00002b17e548c000 44K rw--- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0 >> 00002b17e5497000 12K rw--- [ anon ] >> 00002b17e549a000 480K r-x-- /usr/lib64/libtorque.so.2.0.0 >> 00002b17e5512000 2044K ----- /usr/lib64/libtorque.so.2.0.0 >> 00002b17e5711000 8K r---- /usr/lib64/libtorque.so.2.0.0 >> 00002b17e5713000 8K rw--- /usr/lib64/libtorque.so.2.0.0 >> 00002b17e5715000 6704K rw--- [ anon ] >> 00002b17e5da1000 1404K r-x-- /usr/lib64/libxml2.so.2.9.1 >> 00002b17e5f00000 2044K ----- /usr/lib64/libxml2.so.2.9.1 >> 00002b17e60ff000 32K r---- /usr/lib64/libxml2.so.2.9.1 >> 00002b17e6107000 8K rw--- /usr/lib64/libxml2.so.2.9.1 >> 00002b17e6109000 8K rw--- [ anon ] >> 00002b17e610b000 84K r-x-- /usr/lib64/libz.so.1.2.7 >> 00002b17e6120000 2044K ----- /usr/lib64/libz.so.1.2.7 >> 00002b17e631f000 4K r---- /usr/lib64/libz.so.1.2.7 >> 00002b17e6320000 4K rw--- /usr/lib64/libz.so.1.2.7 >> 00002b17e6321000 1784K r-x-- 
/usr/lib64/libcrypto.so.1.0.1e >> 00002b17e64df000 2048K ----- /usr/lib64/libcrypto.so.1.0.1e >> 00002b17e66df000 104K r---- /usr/lib64/libcrypto.so.1.0.1e >> 00002b17e66f9000 48K rw--- /usr/lib64/libcrypto.so.1.0.1e >> 00002b17e6705000 16K rw--- [ anon ] >> 00002b17e6709000 396K r-x-- /usr/lib64/libssl.so.1.0.1e >> 00002b17e676c000 2044K ----- /usr/lib64/libssl.so.1.0.1e >> 00002b17e696b000 16K r---- /usr/lib64/libssl.so.1.0.1e >> 00002b17e696f000 28K rw--- /usr/lib64/libssl.so.1.0.1e >> 00002b17e6976000 1572K r-x-- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0 >> 00002b17e6aff000 2044K ----- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0 >> 00002b17e6cfe000 20K r---- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0 >> 00002b17e6d03000 56K rw--- >> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0 >> 00002b17e6d11000 552K rw--- [ anon ] >> 00002b17e6d9b000 84K r-x-- /usr/lib64/librdmacm.so.1.0.0 >> 00002b17e6db0000 2044K ----- /usr/lib64/librdmacm.so.1.0.0 >> 00002b17e6faf000 4K r---- /usr/lib64/librdmacm.so.1.0.0 >> 00002b17e6fb0000 4K rw--- /usr/lib64/librdmacm.so.1.0.0 >> 00002b17e6fb1000 4K rw--- [ anon ] >> 00002b17e6fb2000 68K r-x-- /usr/lib64/libibverbs.so.1.0.0 >> 00002b17e6fc3000 2044K ----- /usr/lib64/libibverbs.so.1.0.0 >> 00002b17e71c2000 4K r---- /usr/lib64/libibverbs.so.1.0.0 >> 00002b17e71c3000 4K rw--- /usr/lib64/libibverbs.so.1.0.0 >> 00002b17e71c4000 40K r-x-- /usr/lib64/libnuma.so.1 >> 00002b17e71ce000 2048K ----- /usr/lib64/libnuma.so.1 >> 00002b17e73ce000 4K r---- /usr/lib64/libnuma.so.1 >> 00002b17e73cf000 4K rw--- /usr/lib64/libnuma.so.1 >> 00002b17e73d0000 32K r-x-- /usr/lib64/libpciaccess.so.0.11.1 >> 00002b17e73d8000 2048K ----- /usr/lib64/libpciaccess.so.0.11.1 >> 00002b17e75d8000 4K r---- /usr/lib64/libpciaccess.so.0.11.1 >> 00002b17e75d9000 4K rw--- /usr/lib64/libpciaccess.so.0.11.1 >> 00002b17e75da000 28K r-x-- /usr/lib64/librt-2.17.so >> 00002b17e75e1000 2044K ----- 
/usr/lib64/librt-2.17.so >> 00002b17e77e0000 4K r---- /usr/lib64/librt-2.17.so >> 00002b17e77e1000 4K rw--- /usr/lib64/librt-2.17.so >> 00002b17e77e2000 8K r-x-- /usr/lib64/libutil-2.17.so >> 00002b17e77e4000 2044K ----- /usr/lib64/libutil-2.17.so >> 00002b17e79e3000 4K r---- /usr/lib64/libutil-2.17.so >> 00002b17e79e4000 4K rw--- /usr/lib64/libutil-2.17.so >> 00002b17e79e5000 152K r-x-- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5 >> 00002b17e7a0b000 2044K ----- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5 >> 00002b17e7c0a000 4K r---- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5 >> 00002b17e7c0b000 8K rw--- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5 >> 00002b17e7c0d000 24K rw--- [ anon ] >> 00002b17e7c13000 1288K r-x-- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5 >> 00002b17e7d55000 2044K ----- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5 >> 00002b17e7f54000 12K r---- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5 >> 00002b17e7f57000 12K rw--- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5 >> 00002b17e7f5a000 116K rw--- [ anon ] >> 00002b17e7f77000 2696K r-x-- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so >> 00002b17e8219000 2044K ----- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so >> 00002b17e8418000 24K r---- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so >> 00002b17e841e000 340K rw--- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so >> 00002b17e8473000 
420K r-x-- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5 >> 00002b17e84dc000 2048K ----- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5 >> 00002b17e86dc000 4K r---- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5 >> 00002b17e86dd000 4K rw--- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5 >> 00002b17e86de000 4K rw--- [ anon ] >> 00002b17e86df000 13124K r-x-- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so >> 00002b17e93b0000 2048K ----- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so >> 00002b17e95b0000 220K r---- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so >> 00002b17e95e7000 20K rw--- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so >> 00002b17e95ec000 1304K r-x-- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5 >> 00002b17e9732000 2048K ----- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5 >> 00002b17e9932000 12K r---- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5 >> 00002b17e9935000 12K rw--- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5 >> 00002b17e9938000 296K rw--- [ anon ] >> 00002b17e9982000 1464K r-x-- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so >> 00002b17e9af0000 2044K ----- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so >> 00002b17e9cef000 4K r---- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so >> 
00002b17e9cf0000 16K rw--- >> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so >> 00002b17e9cf4000 16K r-x-- /usr/lib64/libuuid.so.1.3.0 >> 00002b17e9cf8000 2044K ----- /usr/lib64/libuuid.so.1.3.0 >> 00002b17e9ef7000 4K r---- /usr/lib64/libuuid.so.1.3.0 >> 00002b17e9ef8000 4K rw--- /usr/lib64/libuuid.so.1.3.0 >> 00002b17e9ef9000 932K r-x-- /usr/lib64/libstdc++.so.6.0.19 >> 00002b17e9fe2000 2048K ----- /usr/lib64/libstdc++.so.6.0.19 >> 00002b17ea1e2000 32K r---- /usr/lib64/libstdc++.so.6.0.19 >> 00002b17ea1ea000 8K rw--- /usr/lib64/libstdc++.so.6.0.19 >> 00002b17ea1ec000 84K rw--- [ anon ] >> 00002b17ea201000 144K r-x-- /usr/lib64/liblzma.so.5.0.99 >> 00002b17ea225000 2044K ----- /usr/lib64/liblzma.so.5.0.99 >> 00002b17ea424000 4K r---- /usr/lib64/liblzma.so.5.0.99 >> 00002b17ea425000 4K rw--- /usr/lib64/liblzma.so.5.0.99 >> 00002b17ea426000 292K r-x-- /usr/lib64/libgssapi_krb5.so.2.2 >> 00002b17ea46f000 2048K ----- /usr/lib64/libgssapi_krb5.so.2.2 >> 00002b17ea66f000 4K r---- /usr/lib64/libgssapi_krb5.so.2.2 >> 00002b17ea670000 8K rw--- /usr/lib64/libgssapi_krb5.so.2.2 >> 00002b17ea672000 852K r-x-- /usr/lib64/libkrb5.so.3.3 >> 00002b17ea747000 2048K ----- /usr/lib64/libkrb5.so.3.3 >> 00002b17ea947000 52K r---- /usr/lib64/libkrb5.so.3.3 >> 00002b17ea954000 12K rw--- /usr/lib64/libkrb5.so.3.3 >> 00002b17ea957000 12K r-x-- /usr/lib64/libcom_err.so.2.1 >> 00002b17ea95a000 2044K ----- /usr/lib64/libcom_err.so.2.1 >> 00002b17eab59000 4K r---- /usr/lib64/libcom_err.so.2.1 >> 00002b17eab5a000 4K rw--- /usr/lib64/libcom_err.so.2.1 >> 00002b17eab5b000 188K r-x-- /usr/lib64/libk5crypto.so.3.1 >> 00002b17eab8a000 2044K ----- /usr/lib64/libk5crypto.so.3.1 >> 00002b17ead89000 8K r---- /usr/lib64/libk5crypto.so.3.1 >> 00002b17ead8b000 4K rw--- /usr/lib64/libk5crypto.so.3.1 >> 00002b17ead8c000 4K rw--- [ anon ] >> 00002b17ead8d000 284K r-x-- /usr/lib64/libnl-route-3.so.200.16.1 >> 00002b17eadd4000 2044K ----- 
/usr/lib64/libnl-route-3.so.200.16.1 >> 00002b17eafd3000 12K r---- /usr/lib64/libnl-route-3.so.200.16.1 >> 00002b17eafd6000 16K rw--- /usr/lib64/libnl-route-3.so.200.16.1 >> 00002b17eafda000 8K rw--- [ anon ] >> 00002b17eafdc000 104K r-x-- /usr/lib64/libnl-3.so.200.16.1 >> 00002b17eaff6000 2044K ----- /usr/lib64/libnl-3.so.200.16.1 >> 00002b17eb1f5000 8K r---- /usr/lib64/libnl-3.so.200.16.1 >> 00002b17eb1f7000 4K rw--- /usr/lib64/libnl-3.so.200.16.1 >> 00002b17eb1f8000 52K r-x-- /usr/lib64/libkrb5support.so.0.1 >> 00002b17eb205000 2048K ----- /usr/lib64/libkrb5support.so.0.1 >> 00002b17eb405000 4K r---- /usr/lib64/libkrb5support.so.0.1 >> 00002b17eb406000 4K rw--- /usr/lib64/libkrb5support.so.0.1 >> 00002b17eb407000 12K r-x-- /usr/lib64/libkeyutils.so.1.5 >> 00002b17eb40a000 2044K ----- /usr/lib64/libkeyutils.so.1.5 >> 00002b17eb609000 4K r---- /usr/lib64/libkeyutils.so.1.5 >> 00002b17eb60a000 4K rw--- /usr/lib64/libkeyutils.so.1.5 >> 00002b17eb60b000 88K r-x-- /usr/lib64/libresolv-2.17.so >> 00002b17eb621000 2048K ----- /usr/lib64/libresolv-2.17.so >> 00002b17eb821000 4K r---- /usr/lib64/libresolv-2.17.so >> 00002b17eb822000 4K rw--- /usr/lib64/libresolv-2.17.so >> 00002b17eb823000 8K rw--- [ anon ] >> 00002b17eb825000 132K r-x-- /usr/lib64/libselinux.so.1 >> 00002b17eb846000 2048K ----- /usr/lib64/libselinux.so.1 >> 00002b17eba46000 4K r---- /usr/lib64/libselinux.so.1 >> 00002b17eba47000 4K rw--- /usr/lib64/libselinux.so.1 >> 00002b17eba48000 8K rw--- [ anon ] >> 00002b17eba4a000 384K r-x-- /usr/lib64/libpcre.so.1.2.0 >> 00002b17ebaaa000 2044K ----- /usr/lib64/libpcre.so.1.2.0 >> 00002b17ebca9000 4K r---- /usr/lib64/libpcre.so.1.2.0 >> 00002b17ebcaa000 4K rw--- /usr/lib64/libpcre.so.1.2.0 >> 00002b17ebcab000 4K ----- [ anon ] >> 00002b17ebcac000 3352K rw--- [ anon ] >> 00002b17ec000000 132K rw--- [ anon ] >> 00002b17ec021000 65404K ----- [ anon ] >> 00002b17f0000000 4K ----- [ anon ] >> 00002b17f0001000 2048K rw--- [ anon ] >> 00002b17f0201000 16K r-x-- 
/usr/lib64/libhfi1verbs-rdmav2.so
>> 00002b17f0205000 2044K ----- /usr/lib64/libhfi1verbs-rdmav2.so
>> 00002b17f0404000 4K r---- /usr/lib64/libhfi1verbs-rdmav2.so
>> 00002b17f0405000 4K rw--- /usr/lib64/libhfi1verbs-rdmav2.so
>> 00002b17f0406000 4K rw--- [ anon ]
>> 00002b17f0407000 4096K rw--- [ anon ]
>> 00002b17f0807000 1032K rw--- [ anon ]
>> 00002b17f0909000 4100K rw-s- /tmp/openmpi-sessions-12001@node109_0/52426/1/1/vader_segment.node109.1
>> 00002b17f0d0a000 4236K rw-s- /dev/shm/psm2_shm.1200100000001a17100200
>> 00002b17f112d000 132K rw--- [ anon ]
>> 00002b17f114e000 4236K rw-s- /dev/shm/psm2_shm.1200100000000a17100000 (deleted)
>> 00002b17f1571000 8628K rw--- [ anon ]
>> 00002b17f4000000 132K rw--- [ anon ]
>> 00002b17f4021000 65404K ----- [ anon ]
>> 00002b17f9e85000 9164K rw--- [ anon ]
>> 00007ffd8b021000 31316K rw--- [ stack ]
>> 00007ffd8cfa4000 8K r-x-- [ anon ]
>> ffffffffff600000 4K r-x-- [ anon ]
>> total 539352K
>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Thursday, December 8, 2016, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:
>>>
>>>> Hello everybody,
>>>>
>>>> I tried it with the nightly and the direct 2.0.2 branch from git which
>>>> according to the log should contain that patch
>>>>
>>>> commit d0b97d7a408b87425ca53523de369da405358ba2
>>>> Merge: ac8c019 b9420bb
>>>> Author: Jeff Squyres <jsquy...@users.noreply.github.com>
>>>> Date: Wed Dec 7 18:24:46 2016 -0500
>>>> Merge pull request #2528 from rhc54/cmr20x/signals
>>>>
>>>> Unfortunately it changes nothing. The root rank stops and all other
>>>> ranks (and mpirun) just stay, the remaining ranks at 100 % CPU waiting
>>>> apparently in that allreduce. The stack trace looks a bit more
>>>> interesting (git is always debug build ?), so I include it at the very
>>>> bottom just in case.
>>>>
>>>> Off-list Gilles Gouaillardet suggested to set breakpoints at exit,
>>>> __exit etc. to try to catch signals. Would that be useful ? I need a
>>>> moment to figure out how to do this, but I can definitely try.
>>>>
>>>> Some remark: During "make install" from the git repo I see a
>>>>
>>>> WARNING! Common symbols found:
>>>> mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex
>>>> mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex
>>>> mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_precision
>>>> mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer
>>>> mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real
>>>> mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint
>>>> mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band
>>>> mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor
>>>> mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor
>>>> mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte
>>>>
>>>> I have never noticed this before.
>>>>
>>>>
>>>> Best Regards
>>>>
>>>> Christof
>>>>
>>>> Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
>>>> #0 0x00002af84e4c669d in poll () from /lib64/libc.so.6
>>>> #1 0x00002af850517496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
>>>> #2 0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
>>>> #3 0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207
>>>> #4 0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144, requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
>>>> #5 0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, module=0xdee69e0) at base/coll_base_allreduce.c:225
>>>> #6 0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
>>>> #7 0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2, count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1) at pallreduce.c:107
>>>> #8 0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005", recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0, datatype=0xffffffffffffffff, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at pallreduce_f.c:87
>>>> #9 0x000000000045ecc6 in m_sum_i_ ()
>>>> #10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ ()
>>>> #11 0x00000000004325ff in vamp () at main.F:2640
>>>> #12 0x000000000040de1e in main ()
>>>> #13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
>>>> #14 0x000000000040dd29 in _start ()
>>>>
>>>> On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org wrote:
>>>>> Hi Christof
>>>>>
>>>>> Sorry if I missed this, but it sounds like you are saying that one of
>>>>> your procs abnormally terminates, and we are failing to kill the
>>>>> remaining job? Is that correct?
>>>>>
>>>>> If so, I just did some work that might relate to that problem that is
>>>>> pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528
>>>>>
>>>>> Would you be able to try that?
>>>>>
>>>>> Ralph
>>>>>
>>>>>> On Dec 7, 2016, at 9:37 AM, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
>>>>>>>> On Dec 7, 2016, at 10:07 AM, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:
>>>>>>>>>
>>>>>>>> I really think the hang is a consequence of
>>>>>>>> unclean termination (in the sense that the non-root ranks are not
>>>>>>>> terminated) and probably not the cause, in my interpretation of what I
>>>>>>>> see. Would you have any suggestion to catch signals sent between orterun
>>>>>>>> (mpirun) and the child tasks ?
>>>>>>>
>>>>>>> Do you know where in the code the termination call is? Is it
>>>>>>> actually calling mpi_abort(), or just doing something ugly like calling
>>>>>>> fortran "stop"? If the latter, would that explain a possible hang?
>>>>>> Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The
>>>>>> wannier90 input contains an error, a restart is requested and the
>>>>>> wannier90.chk file with the restart information is missing.
>>>>>> "
>>>>>> Exiting.......
>>>>>> Error: restart requested but wannier90.chk file not found
>>>>>> "
>>>>>> So it must terminate.
>>>>>>
>>>>>> The termination happens in the libwannier.a, source file io.F90:
>>>>>>
>>>>>> write(stdout,*) 'Exiting.......'
>>>>>> write(stdout, '(1x,a)') trim(error_msg)
>>>>>> close(stdout)
>>>>>> stop "wannier90 error: examine the output/error file for details"
>>>>>>
>>>>>> So it calls stop as you assumed.
>>>>>>
>>>>>>> Presumably someone here can comment on what the standard says about
>>>>>>> the validity of terminating without mpi_abort.
>>>>>>
>>>>>> Well, probably stop is not a good way to terminate then.
>>>>>>
>>>>>> My main point was the change relative to 1.10 anyway :-)
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Actually, if you're willing to share enough input files to reproduce,
>>>>>>> I could take a look. I just recompiled our VASP with openmpi 2.0.1 to fix
>>>>>>> a crash that was apparently addressed by some change in the memory
>>>>>>> allocator in a recent version of openmpi. Just e-mail me if that's the
>>>>>>> case.
>>>>>>
>>>>>> I think that is no longer necessary ? In principle it is no problem but
>>>>>> it is at the end of a (small) GW calculation, the Si tutorial example.
>>>>>> So the mail would be a bit larger due to the WAVECAR.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Noam
>>>>>>>
>>>>>>> Noam Bernstein, Ph.D.
>>>>>>> Center for Materials Physics and Technology
>>>>>>> U.S. Naval Research Laboratory
>>>>>>> T +1 202 404 8628 F +1 202 404 7546
>>>>>>> https://www.nrl.navy.mil
>>>>>>
>>>>>> --
>>>>>> Dr. rer. nat. Christof Köhler email: c.koeh...@bccms.uni-bremen.de
>>>>>> Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
>>>>>> Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
>>>>>> 28359 Bremen
>>>>>>
>>>>>> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users@lists.open-mpi.org
>>>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

_______________________________________________ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users