Christof,

Ralph fixed the issue; in the meantime, the patch can be downloaded manually from
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/2552.patch


Cheers,


Gilles



On 12/9/2016 5:39 PM, Christof Koehler wrote:
Hello,

just to describe our case: libwannier.a is a "third party"
library which is built separately and then just linked in, so the vasp
preprocessor never touches it. As far as I can see, no preprocessing of
the f90 source is involved in the libwannier build process.

I finally managed to set a breakpoint at the program exit of the root
rank:

(gdb) bt
#0  0x00002b7ccd2e4220 in _exit () from /lib64/libc.so.6
#1  0x00002b7ccd25ee2b in __run_exit_handlers () from /lib64/libc.so.6
#2  0x00002b7ccd25eeb5 in exit () from /lib64/libc.so.6
#3  0x000000000407298d in for_stop_core ()
#4  0x00000000012fad41 in w90_io_mp_io_error_ ()
#5  0x0000000001302147 in w90_parameters_mp_param_read_ ()
#6  0x00000000012f49c6 in wannier_setup_ ()
#7  0x0000000000e166a8 in mlwf_mp_mlwf_wannier90_ ()
#8  0x00000000004319ff in vamp () at main.F:2640
#9  0x000000000040d21e in main ()
#10 0x00002b7ccd247b15 in __libc_start_main () from /lib64/libc.so.6
#11 0x000000000040d129 in _start ()

So for_stop_core is apparently called? Of course it sits below vasp's main()
process, so additional things might happen which are not visible here. Is
SIGCHLD (as observed when catching signals in mpirun) the signal expected
after a for_stop_core?
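
If it helps, a trivial single-rank test of that code path (just a sketch; with
ifort a plain STOP shows up as for_stop_core in the backtrace, as above) would
be something like:

  ! minimal test: STOP goes through the Fortran runtime (for_stop_core with ifort)
  ! and ends the process via exit(); the parent, e.g. mpirun, then sees the child
  ! terminate, which is where a SIGCHLD would come from
  program stoptest
    stop 'stop test'
  end program stoptest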

Thank you very much for investigating this!

Cheers

Christof

On Thu, Dec 08, 2016 at 03:15:47PM -0500, Noam Bernstein wrote:
On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> 
wrote:

Christof,


There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)

In Open MPI, MPI_Allreduce(..., count=0, ...) is a no-op, so that suggests that
the stack has been corrupted inside MPI_Allreduce(), or that you are not using
the library you think you are using; pmap <pid> will show you which library is in use.

By the way, this was not started with
mpirun --mca coll ^tuned ...
right?

Just to make it clear ...
a task from your program bluntly issues a Fortran STOP, and this is kind of a
feature. The *only* issue is that mpirun does not kill the other MPI tasks and
mpirun never completes.
Did I get it right?

I just ran across very similar behavior in VASP (which we just switched over to
Open MPI 2.0.1), also in an allreduce + STOP combination (some nodes call one,
others call the other), and I discovered several interesting things.

The most important thing is that when MPI is active, the preprocessor converts (via a
#define in symbol.inc) a Fortran STOP into a call to m_exit() (defined in mpi.F),
which is a wrapper around mpi_finalize.  So in my case some processes in the
communicator call mpi_finalize, others call mpi_allreduce.  I’m not really
surprised this hangs, because I think the correct thing to replace STOP with is
mpi_abort, not mpi_finalize.  If you know where the STOP is called, you can
check the preprocessed equivalent file (.f90 instead of .F) and see if it has
actually been replaced with a call to m_exit.  I’m planning to test whether
replacing m_exit with m_stop in symbol.inc gives more sensible behavior, i.e.
program termination when the original source file executes a STOP.
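
For illustration only (this is my own sketch, not the actual contents of
symbol.inc or mpi.F), the two wrappers I have in mind look roughly like this;
m_exit() mimics the current mpi_finalize-based behavior, and an mpi_abort-based
m_stop() is what I would expect to terminate the job cleanly:

  ! sketch only -- the real VASP routines differ
  subroutine m_exit()
    use mpi
    integer :: ierr
    ! if other ranks are still inside a collective (e.g. MPI_Allreduce),
    ! this call can block, which matches the hang in the traces below
    call MPI_Finalize(ierr)
    stop
  end subroutine m_exit

  subroutine m_stop(msg)
    use mpi
    character(len=*), intent(in) :: msg
    integer :: ierr
    write(*,*) 'fatal: ', msg
    ! MPI_Abort makes mpirun tear down all ranks instead of waiting forever
    call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end subroutine m_stop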

I’m assuming that a mix of mpi_allreduce and mpi_finalize is really expected to 
hang, but just in case that’s surprising, here are my stack traces:


hung in collective:

(gdb) where
#0  0x00002b8d5a095ec6 in opal_progress () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.20
#1  0x00002b8d59b3a36d in ompi_request_default_wait_all () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#2  0x00002b8d59b8107c in ompi_coll_base_allreduce_intra_recursivedoubling () 
from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3  0x00002b8d59b495ac in PMPI_Allreduce () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#4  0x00002b8d598e4027 in pmpi_allreduce__ () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#5  0x0000000000414077 in m_sum_i (comm=..., ivec=warning: Range for type 
(null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
..., n=2) at mpi.F:989
#6  0x0000000000daac54 in full_kpoints::set_indpw_full (grid=..., wdes=..., 
kpoints_f=...) at mkpoints_full.F:1099
#7  0x0000000001441654 in set_indpw_fock (t_info=..., p=warning: Range for type 
(null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address 0x1
) at fock.F:1669
#8  fock::setup_fock (t_info=..., p=warning: Range for type (null) has invalid 
bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address 0x1
) at fock.F:1413
#9  0x0000000002976478 in vamp () at main.F:2093
#10 0x0000000000412f9e in main ()
#11 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
#12 0x0000000000412ea9 in _start ()

hung in mpi_finalize:

#0  0x000000383a4acbdd in nanosleep () from /lib64/libc.so.6
#1  0x000000383a4e1d94 in usleep () from /lib64/libc.so.6
#2  0x00002b11db1e0ae7 in ompi_mpi_finalize () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3  0x00002b11daf8b399 in pmpi_finalize__ () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#4  0x00000000004199c5 in m_exit () at mpi.F:375
#5  0x0000000000dab17f in full_kpoints::set_indpw_full (grid=..., wdes=Cannot 
resolve DW_OP_push_object_address for a missing object
) at mkpoints_full.F:1065
#6  0x0000000001441654 in set_indpw_fock (t_info=..., p=Cannot resolve 
DW_OP_push_object_address for a missing object
) at fock.F:1669
#7  fock::setup_fock (t_info=..., p=Cannot resolve DW_OP_push_object_address 
for a missing object
) at fock.F:1413
#8  0x0000000002976478 in vamp () at main.F:2093
#9  0x0000000000412f9e in main ()
#10 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
#11 0x0000000000412ea9 in _start ()
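
And in case anyone wants to poke at this outside of VASP, here is a toy program
(my own sketch, nothing to do with the real VASP sources) that should reproduce
the same mismatch: rank 0 goes straight to mpi_finalize while the other ranks
enter mpi_allreduce. I would expect it to hang the same way, with rank 0 stuck
in ompi_mpi_finalize and the rest in the allreduce.

  ! toy reproducer sketch: build with mpifort, run with e.g. mpirun -np 4 ./a.out
  program finalize_vs_allreduce
    use mpi
    integer :: rank, ierr, sendv, recvv
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    if (rank == 0) then
      ! mimics the STOP -> m_exit() path: finalize while peers are in a collective
      call MPI_Finalize(ierr)
      stop
    else
      sendv = rank
      call MPI_Allreduce(sendv, recvv, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)
      call MPI_Finalize(ierr)
    end if
  end program finalize_vs_allreduce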



Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil


_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
