Christof, can you please try again with

    mpirun --mca btl tcp,self --mca pml ob1 ...

That will help figure out whether pml/cm and/or mtl/psm2 is involved or not. If that still causes a crash, can you then please try

    mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...

That will help figure out whether coll/tuned is involved or not. coll/tuned is known not to correctly handle collectives invoked with different but matching signatures (e.g. some tasks invoke the collective with one vector datatype covering N elements, and other tasks invoke the same collective with N individual elements).
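For illustration, here is a minimal sketch in C of that calling pattern (hypothetical code, not taken from VASP; the values and file name are made up). The fprintf lines also show the kind of per-rank information (count and datatype) that is worth collecting:

    /* allreduce_sketch.c -- hypothetical sketch of "different but matching
     * signatures": rank 0 passes one element of a contiguous datatype covering
     * N ints, all other ranks pass N separate MPI_INT elements. The data
     * exchanged is the same, but coll/tuned is known to mishandle this pattern. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 4

    int main(int argc, char **argv)
    {
        int rank, in[N] = {1, 1, 1, 1}, out[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* one element of a derived type that covers N ints */
            MPI_Datatype vec;
            MPI_Type_contiguous(N, MPI_INT, &vec);
            MPI_Type_commit(&vec);
            fprintf(stderr, "rank %d: count=1, datatype=contiguous(%d x MPI_INT)\n",
                    rank, N);
            MPI_Allreduce(in, out, 1, vec, MPI_SUM, MPI_COMM_WORLD);
            MPI_Type_free(&vec);
        } else {
            /* N individual ints, same type signature as rank 0 */
            fprintf(stderr, "rank %d: count=%d, datatype=MPI_INT\n", rank, N);
            MPI_Allreduce(in, out, N, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Running such a reproducer with and without --mca coll ^tuned is a quick way to confirm whether coll/tuned is the component at fault.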
If everything fails, can you describe how MPI_Allreduce is invoked (number of tasks, datatype, number of elements)?

Cheers,

Gilles

On Wed, Dec 7, 2016 at 7:38 PM, Christof Koehler
<christof.koeh...@bccms.uni-bremen.de> wrote:
> Hello everybody,
>
> I am observing a deadlock in MPI_Allreduce with Open MPI 2.0.1 on a single
> node. A stack trace (pstack) of one rank is below, showing the program
> (vasp 5.3.5) and the two psm2 progress threads. However:
>
> In fact, the vasp input is not OK and the program should abort at the point
> where it hangs. It does so when using mvapich 2.2. With Open MPI 2.0.1 it
> just deadlocks in some allreduce operation. Originally it was started with
> 20 ranks; when it hangs, only 19 are left. From the PIDs I would assume it
> is the master rank which is missing. So this looks like a failure to
> terminate.
>
> With 1.10 I get a clean
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 18789 on node node109
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Any ideas what to try? Of course, in this situation it may well be the
> program. Still, given the observed difference between 2.0.1 and 1.10 (and
> mvapich), this might be interesting to someone.
>
> Best Regards
>
> Christof
>
>
> Thread 3 (Thread 0x2ad362577700 (LWP 4629)):
> #0  0x00002ad35b1562c3 in epoll_wait () from /lib64/libc.so.6
> #1  0x00002ad35d114f42 in epoll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #3  0x00002ad35d16e996 in progress_engine () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #4  0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
> #5  0x00002ad35b155ced in clone () from /lib64/libc.so.6
>
> Thread 2 (Thread 0x2ad362778700 (LWP 4640)):
> #0  0x00002ad35b14b69d in poll () from /lib64/libc.so.6
> #1  0x00002ad35d11dc42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #3  0x00002ad35d0c61d1 in progress_engine () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #4  0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
> #5  0x00002ad35b155ced in clone () from /lib64/libc.so.6
>
> Thread 1 (Thread 0x2ad35978d040 (LWP 4609)):
> #0  0x00002ad35b14b69d in poll () from /lib64/libc.so.6
> #1  0x00002ad35d11dc42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #3  0x00002ad35d0c28cf in opal_progress () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #4  0x00002ad35adce8d8 in ompi_request_wait_completion () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #5  0x00002ad35adce838 in mca_pml_cm_recv () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #6  0x00002ad35ad4da42 in ompi_coll_base_allreduce_intra_recursivedoubling () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #7  0x00002ad35ad52906 in ompi_coll_tuned_allreduce_intra_dec_fixed () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #8  0x00002ad35ad1f0f4 in PMPI_Allreduce () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #9  0x00002ad35aa99c38 in pmpi_allreduce__ () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi_mpifh.so.20
> #10 0x000000000045f8c6 in m_sum_i_ ()
> #11 0x0000000000e1ce69 in mlwf_mp_mlwf_wannier90_ ()
> #12 0x00000000004331ff in vamp () at main.F:2640
> #13 0x000000000040ea1e in main ()
> #14 0x00002ad35b080b15 in __libc_start_main () from /lib64/libc.so.6
> #15 0x000000000040e929 in _start ()
>
>
> --
> Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS          phone: +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12       fax:   +49-(0)421-218-62770
> 28359 Bremen
>
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users