Hello everybody,

I am observing a deadlock in an allreduce with Open MPI 2.0.1 on a single
node. A stack trace (pstack) of one rank is below, showing the program (VASP
5.3.5) and the two PSM2 progress threads. However:

In fact, the VASP input is not valid, and the program should abort at the
point where it hangs. It does abort when using MVAPICH 2.2. With Open MPI
2.0.1 it just deadlocks in some allreduce operation. The job was originally
started with 20 ranks; when it hangs, only 19 are left. Judging from the
PIDs, I would assume it is the master rank that is missing. So this looks
like a failure to terminate cleanly.

With Open MPI 1.10 I instead get a clean
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 18789 on node node109
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Any ideas what to try? Of course, in this situation it may well be the
program's fault. Still, given the observed difference between 2.0.1 and 1.10
(and MVAPICH), this might be interesting to someone.

Best Regards

Christof


Thread 3 (Thread 0x2ad362577700 (LWP 4629)):
#0  0x00002ad35b1562c3 in epoll_wait () from /lib64/libc.so.6
#1  0x00002ad35d114f42 in epoll_dispatch () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3  0x00002ad35d16e996 in progress_engine () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4  0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00002ad35b155ced in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x2ad362778700 (LWP 4640)):
#0  0x00002ad35b14b69d in poll () from /lib64/libc.so.6
#1  0x00002ad35d11dc42 in poll_dispatch () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3  0x00002ad35d0c61d1 in progress_engine () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4  0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00002ad35b155ced in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2ad35978d040 (LWP 4609)):
#0  0x00002ad35b14b69d in poll () from /lib64/libc.so.6
#1  0x00002ad35d11dc42 in poll_dispatch () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3  0x00002ad35d0c28cf in opal_progress () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4  0x00002ad35adce8d8 in ompi_request_wait_completion () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#5  0x00002ad35adce838 in mca_pml_cm_recv () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#6  0x00002ad35ad4da42 in ompi_coll_base_allreduce_intra_recursivedoubling () 
from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#7  0x00002ad35ad52906 in ompi_coll_tuned_allreduce_intra_dec_fixed () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#8  0x00002ad35ad1f0f4 in PMPI_Allreduce () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#9  0x00002ad35aa99c38 in pmpi_allreduce__ () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi_mpifh.so.20
#10 0x000000000045f8c6 in m_sum_i_ ()
#11 0x0000000000e1ce69 in mlwf_mp_mlwf_wannier90_ ()
#12 0x00000000004331ff in vamp () at main.F:2640
#13 0x000000000040ea1e in main ()
#14 0x00002ad35b080b15 in __libc_start_main () from /lib64/libc.so.6
#15 0x000000000040e929 in _start ()


-- 
Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
28359 Bremen  

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/

