On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
wrote:
Christof,
There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)
In Open MPI, MPI_Allreduce(..., count=0, ...) is a no-op, so that suggests the
stack has been corrupted inside MPI_Allreduce(), or that you are not using the
library you think you are using.
pmap <pid> will show you which libraries are actually loaded.
By the way, this was not started with
mpirun --mca coll ^tuned ...
right?
Just to make it clear:
a task from your program bluntly issues a Fortran STOP, and this is (kind of) a
feature.
The *only* issue is that mpirun does not kill the other MPI tasks and never
completes.
Did I get that right?
I just ran across very similar behavior in VASP (which we just switched over to
openmpi 2.0.1), also in an allreduce + STOP combination (some nodes call one,
others call the other), and I discovered several interesting things.
The most important one is that when MPI is active, the preprocessor converts (via a
#define in symbol.inc) a Fortran STOP into a call to m_exit() (defined in mpi.F),
which is a wrapper around mpi_finalize. So in my case some processes in the
communicator call mpi_finalize, others call mpi_allreduce. I’m not really
surprised this hangs, because I think the correct thing to replace STOP with is
mpi_abort, not mpi_finalize. If you know where the STOP is called, you can
check the preprocessed equivalent file (.f90 instead of .F), and see if it’s
actually been replaced with a call to m_exit. I’m planning to test whether
replacing m_exit with m_stop in symbol.inc gives more sensible behavior, i.e.
program termination when the original source file executes a STOP.
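To make the mechanism concrete, here is a rough sketch (not VASP's actual source;
the exact macro text in symbol.inc and the wrapper bodies in mpi.F differ between
versions) of the #define and of the two wrappers in question:

    ! symbol.inc (sketch, shown as a comment): under MPI the preprocessor
    ! rewrites STOP into a call to the m_exit wrapper, roughly
    !     #ifdef MPI
    !     #define STOP CALL m_exit ; stop
    !     #endif
    !
    ! mpi.F (sketch of the two wrappers):

    subroutine m_exit()
       use mpi
       implicit none
       integer :: ierr
       ! collective shutdown: every rank must call this, otherwise ranks
       ! still inside a collective (e.g. mpi_allreduce) wait forever
       call MPI_Finalize(ierr)
    end subroutine m_exit

    subroutine m_stop(msg)
       use mpi
       implicit none
       character(*), intent(in) :: msg
       integer :: ierr
       ! hard abort: terminates the whole job even if only one rank gets here
       write (*, *) msg
       call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
    end subroutine m_stop

The key difference is that MPI_Finalize is effectively collective (it completes
only once every connected process has called it), while MPI_Abort tears down the
whole job from a single rank, which is why the latter looks like the right
translation for STOP.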
I’m assuming that a mix of mpi_allreduce and mpi_finalize is really expected to
hang, but just in case that’s surprising, here are my stack traces (with a minimal
reproducer of the pattern after them):
hung in collective:
(gdb) where
#0 0x00002b8d5a095ec6 in opal_progress () from
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.20
#1 0x00002b8d59b3a36d in ompi_request_default_wait_all () from
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#2 0x00002b8d59b8107c in ompi_coll_base_allreduce_intra_recursivedoubling ()
from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3 0x00002b8d59b495ac in PMPI_Allreduce () from
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#4 0x00002b8d598e4027 in pmpi_allreduce__ () from
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#5 0x0000000000414077 in m_sum_i (comm=..., ivec=warning: Range for type
(null) has invalid bounds 1..-12884901892
[the same gdb warning repeated 7 times in total]
..., n=2) at mpi.F:989
#6 0x0000000000daac54 in full_kpoints::set_indpw_full (grid=..., wdes=...,
kpoints_f=...) at mkpoints_full.F:1099
#7 0x0000000001441654 in set_indpw_fock (t_info=..., p=warning: Range for type
(null) has invalid bounds 1..-1
[the same gdb warning repeated 7 times in total]
..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address 0x1
) at fock.F:1669
#8 fock::setup_fock (t_info=..., p=warning: Range for type (null) has invalid
bounds 1..-1
[the same gdb warning repeated 7 times in total]
..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address 0x1
) at fock.F:1413
#9 0x0000000002976478 in vamp () at main.F:2093
#10 0x0000000000412f9e in main ()
#11 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
#12 0x0000000000412ea9 in _start ()
hung in mpi_finalize:
#0 0x000000383a4acbdd in nanosleep () from /lib64/libc.so.6
#1 0x000000383a4e1d94 in usleep () from /lib64/libc.so.6
#2 0x00002b11db1e0ae7 in ompi_mpi_finalize () from
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3 0x00002b11daf8b399 in pmpi_finalize__ () from
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#4 0x00000000004199c5 in m_exit () at mpi.F:375
#5 0x0000000000dab17f in full_kpoints::set_indpw_full (grid=..., wdes=Cannot
resolve DW_OP_push_object_address for a missing object
) at mkpoints_full.F:1065
#6 0x0000000001441654 in set_indpw_fock (t_info=..., p=Cannot resolve
DW_OP_push_object_address for a missing object
) at fock.F:1669
#7 fock::setup_fock (t_info=..., p=Cannot resolve DW_OP_push_object_address
for a missing object
) at fock.F:1413
#8 0x0000000002976478 in vamp () at main.F:2093
#9 0x0000000000412f9e in main ()
#10 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
#11 0x0000000000412ea9 in _start ()
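For completeness, the hang is reproducible outside VASP with a toy program along
these lines (hypothetical test code, not from VASP), in which one rank finalizes,
mimicking the preprocessed STOP, while the others enter an allreduce:

    program finalize_vs_allreduce
       use mpi
       implicit none
       integer :: ierr, rank, sendval, recvval
       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       if (rank == 0) then
          ! what the preprocessed STOP amounts to: this rank finalizes and exits
          call MPI_Finalize(ierr)
          stop
       end if
       ! the remaining ranks wait in the collective for rank 0, which never arrives
       sendval = rank
       call MPI_Allreduce(sendval, recvval, 1, MPI_INTEGER, MPI_SUM, &
                          MPI_COMM_WORLD, ierr)
       call MPI_Finalize(ierr)
    end program finalize_vs_allreduce

With 2 or more ranks, rank 0 sits in (or returns from) MPI_Finalize while the
others spin in opal_progress inside the allreduce, which matches the two traces
above.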
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil