Christof,

Can you please try again with

mpirun --mca btl tcp,self --mca pml ob1 ...

That will help figure out whether pml/cm and/or mtl/psm2 is involved.
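As a sanity check, you can also add the standard Open MPI verbosity
parameter to confirm which pml is actually selected at runtime:

mpirun --mca btl tcp,self --mca pml ob1 --mca pml_base_verbose 10 ...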


If that still crashes or hangs, can you please try

mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...

That will help figure out whether coll/tuned is involved.

coll/tuned is known not to correctly handle collectives invoked with
different but matching type signatures
(e.g. some tasks invoke the collective with one vector of N elements,
while other tasks invoke the same collective with N individual elements);
see the sketch below.
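For illustration only, here is a minimal sketch (my own, untested, not
taken from VASP) of such a different-but-matching invocation: even ranks
pass N MPI_INT elements, odd ranks pass one element of a contiguous
derived datatype covering the same N ints, so the type signatures match
but the (count, datatype) pairs differ:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { N = 4 };
    int rank, in[N], out[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < N; i++) in[i] = rank + i;

    if (rank % 2 == 0) {
        /* even ranks: N individual MPI_INT elements */
        MPI_Allreduce(in, out, N, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    } else {
        /* odd ranks: one element of a contiguous datatype of N ints;
         * same type signature as the even ranks' call */
        MPI_Datatype vec;
        MPI_Type_contiguous(N, MPI_INT, &vec);
        MPI_Type_commit(&vec);
        MPI_Allreduce(in, out, 1, vec, MPI_SUM, MPI_COMM_WORLD);
        MPI_Type_free(&vec);
    }

    if (rank == 0) printf("out[0] = %d\n", out[0]);
    MPI_Finalize();
    return 0;
}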


If everything fails, can you describe how MPI_Allreduce is invoked
(number of tasks, datatype, number of elements)?



Cheers,

Gilles

On Wed, Dec 7, 2016 at 7:38 PM, Christof Koehler
<christof.koeh...@bccms.uni-bremen.de> wrote:
> Hello everybody,
>
> I am observing a deadlock in MPI_Allreduce with Open MPI 2.0.1 on a single
> node. A stack trace (from pstack) of one rank is below, showing the program
> (VASP 5.3.5) and the two psm2 progress threads. However:
>
> In fact, the VASP input is not OK and it should abort at the point where
> it hangs; it does when using MVAPICH 2.2. With Open MPI 2.0.1 it just
> deadlocks in some allreduce operation. Originally it was started with 20
> ranks; when it hangs, only 19 are left. From the PIDs I would assume it
> is the master rank which is missing. So this looks like a failure to
> terminate.
>
> With 1.10 I get a clean
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 18789 on node node109
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Any ideas what to try? Of course, in this situation it may well be the
> program's fault. Still, given the observed difference between 2.0.1 and
> 1.10 (and MVAPICH), this might be interesting to someone.
>
> Best Regards
>
> Christof
>
>
> Thread 3 (Thread 0x2ad362577700 (LWP 4629)):
> #0  0x00002ad35b1562c3 in epoll_wait () from /lib64/libc.so.6
> #1  0x00002ad35d114f42 in epoll_dispatch () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #3  0x00002ad35d16e996 in progress_engine () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #4  0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
> #5  0x00002ad35b155ced in clone () from /lib64/libc.so.6
> Thread 2 (Thread 0x2ad362778700 (LWP 4640)):
> #0  0x00002ad35b14b69d in poll () from /lib64/libc.so.6
> #1  0x00002ad35d11dc42 in poll_dispatch () from
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #3  0x00002ad35d0c61d1 in progress_engine () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #4  0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
> #5  0x00002ad35b155ced in clone () from /lib64/libc.so.6
> Thread 1 (Thread 0x2ad35978d040 (LWP 4609)):
> #0  0x00002ad35b14b69d in poll () from /lib64/libc.so.6
> #1  0x00002ad35d11dc42 in poll_dispatch () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #3  0x00002ad35d0c28cf in opal_progress () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #4  0x00002ad35adce8d8 in ompi_request_wait_completion () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #5  0x00002ad35adce838 in mca_pml_cm_recv () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #6  0x00002ad35ad4da42 in ompi_coll_base_allreduce_intra_recursivedoubling () 
> from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #7  0x00002ad35ad52906 in ompi_coll_tuned_allreduce_intra_dec_fixed () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #8  0x00002ad35ad1f0f4 in PMPI_Allreduce () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #9  0x00002ad35aa99c38 in pmpi_allreduce__ () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi_mpifh.so.20
> #10 0x000000000045f8c6 in m_sum_i_ ()
> #11 0x0000000000e1ce69 in mlwf_mp_mlwf_wannier90_ ()
> #12 0x00000000004331ff in vamp () at main.F:2640
> #13 0x000000000040ea1e in main ()
> #14 0x00002ad35b080b15 in __libc_start_main () from /lib64/libc.so.6
> #15 0x000000000040e929 in _start ()
>
>
> --
> Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> 28359 Bremen
>
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
>