Hello,

Thank you for the fast answer.

On Wed, Dec 07, 2016 at 08:23:43PM +0900, Gilles Gouaillardet wrote:
> Christoph,
> 
> can you please try again with
> 
> mpirun --mca btl tcp,self --mca pml ob1 ...

mpirun -n 20 --mca btl tcp,self --mca pml ob1 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

It still deadlocks/hangs; the options have no effect.

> mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...

mpirun -n 20 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

It still deadlocks/hangs; the options have no effect. There is, however, additional output:

wannier90 error: examine the output/error file for details
[node109][[55572,1],16][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],8][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],4][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],1][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],2][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

Please note: the "wannier90 error: examine the output/error file for
details" message is expected; there is in fact an error in the input
file, and the program is supposed to terminate at this point.

However, with mvapich2 and openmpi 1.10.4 it terminates completely,
i.e. I get my shell prompt back. Whether a segfault is involved with
mvapich2 (as is apparently the case with openmpi 1.10.4, judging from
the termination message) I do not know. I tried

export MV2_DEBUG_SHOW_BACKTRACE=1
mpirun -n 20  /cluster/vasp/5.3.5/intel2016/mvapich2-2.2/bin/vasp-mpi

but did not get any indication of a problem (segfault); the last lines
of output are

 calculate QP shifts <psi_nk| G(iteration)W_0 |psi_nk>: iteration 1
 writing wavefunctions
wannier90 error: examine the output/error file for details
node109 14:00 /scratch/ckoe/gw %

The last line is my shell prompt.

> 
> if everything fails, can you describe how MPI_Allreduce is invoked ?
> /* number of tasks, datatype, number of elements */

Difficult; this is not our code in the first place [1], and the problem
occurs when using an ("officially" supported) third-party library [2].

From the stack trace of the hanging process, the vasp routine which
calls allreduce is "m_sum_i_", located in the mpi.F source file.
Allreduce is called as

CALL MPI_ALLREDUCE( MPI_IN_PLACE, ivec(1), n, MPI_INTEGER, &
         &                MPI_SUM, COMM%MPI_COMM, ierror )

n and ivec(1) are of data type integer. The run originally used 20
ranks; I have now tried 2 ranks as well and it hangs, too. With one (!)
rank

mpirun -n 1 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

I of course get a shell prompt back. 
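
In case it is useful, below is a minimal standalone sketch (my own
construction, not VASP code; the program name, vector length and fill
values are made up) of the same kind of in-place integer allreduce,
which could be used to check whether the hang is reproducible outside
of vasp:

program allreduce_test
  use mpi
  implicit none
  ! arbitrary length, the real size used by vasp is unknown to me
  integer, parameter :: n = 1024
  integer :: ivec(n), ierror, rank
  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  ivec = rank + 1   ! dummy data
  ! same pattern as in m_sum_i_: in-place integer sum over the communicator
  call MPI_ALLREDUCE(MPI_IN_PLACE, ivec(1), n, MPI_INTEGER, &
       &             MPI_SUM, MPI_COMM_WORLD, ierror)
  if (rank == 0) print *, 'ivec(1) =', ivec(1)
  call MPI_FINALIZE(ierror)
end program allreduce_test

Compiled with mpifort and started with the same mpirun options as
above, it should at least show whether the bare allreduce is affected.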

I then started it normally in the shell with 2 ranks

mpirun -n 2 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

and attached gdb to the rank with the lowest pid (3478). I do not get a
prompt back (it hangs), the second rank 3479 is still at 100 % CPU, and
mpirun is still a process I can see with "ps", but gdb says
(gdb) continue     <- that is where I attached it !
Continuing.
[Thread 0x2b8366806700 (LWP 3480) exited]
[Thread 0x2b835da1c040 (LWP 3478) exited]
[Inferior 1 (process 3478) exited normally]
(gdb) bt
No stack.

So, as far as gdb is concerned, the rank with the lowest pid (which is
gone, while the other rank is still eating CPU time) terminated
normally?

I hope this helps. I have only very basic experience with debuggers
(I never really needed them) and even less with using them in parallel.
I could try to capture the contents of ivec, but I do not think that
would be helpful? If you need them I can try of course; I have no idea
how large the vector is.


Best Regards

Christof

[1] https://www.vasp.at/
[2] http://www.wannier.org/, Old version 1.2
> 
> 
> 
> Cheers,
> 
> Gilles
> 
> On Wed, Dec 7, 2016 at 7:38 PM, Christof Koehler
> <christof.koeh...@bccms.uni-bremen.de> wrote:
> > Hello everybody,
> >
> > I am observing a deadlock in allreduce with openmpi 2.0.1 on a single
> > node. A stack trace (pstack) of one rank is below, showing the program (vasp
> > 5.3.5) and the two psm2 progress threads. However:
> >
> > In fact, the vasp input is not ok and it should abort at the point where
> > it hangs. It does when using mvapich 2.2. With openmpi 2.0.1 it just
> > deadlocks in some allreduce operation. Originally it was started with 20
> > ranks; when it hangs there are only 19 left. From the PIDs I would
> > assume it is the master rank which is missing. So, this looks like a
> > failure to terminate.
> >
> > With 1.10 I get a clean
> > --------------------------------------------------------------------------
> > mpiexec noticed that process rank 0 with PID 18789 on node node109
> > exited on signal 11 (Segmentation fault).
> > --------------------------------------------------------------------------
> >
> > Any ideas what to try ? Of course in this situation it may well be the
> > program. Still, with the observed difference between 2.0.1 and 1.10 (and
> > mvapich) this might be interesting to someone.
> >
> > Best Regards
> >
> > Christof
> >
> >
> > Thread 3 (Thread 0x2ad362577700 (LWP 4629)):
> > #0  0x00002ad35b1562c3 in epoll_wait () from /lib64/libc.so.6
> > #1  0x00002ad35d114f42 in epoll_dispatch () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > #2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > #3  0x00002ad35d16e996 in progress_engine () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > #4  0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
> > #5  0x00002ad35b155ced in clone () from /lib64/libc.so.6
> > Thread 2 (Thread 0x2ad362778700 (LWP 4640)):
> > #0  0x00002ad35b14b69d in poll () from /lib64/libc.so.6
> > #1  0x00002ad35d11dc42 in poll_dispatch () from
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > #2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > #3  0x00002ad35d0c61d1 in progress_engine () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > #4  0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
> > #5  0x00002ad35b155ced in clone () from /lib64/libc.so.6
> > Thread 1 (Thread 0x2ad35978d040 (LWP 4609)):
> > #0  0x00002ad35b14b69d in poll () from /lib64/libc.so.6
> > #1  0x00002ad35d11dc42 in poll_dispatch () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > #2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > #3  0x00002ad35d0c28cf in opal_progress () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > #4  0x00002ad35adce8d8 in ompi_request_wait_completion () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> > #5  0x00002ad35adce838 in mca_pml_cm_recv () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> > #6  0x00002ad35ad4da42 in ompi_coll_base_allreduce_intra_recursivedoubling 
> > () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> > #7  0x00002ad35ad52906 in ompi_coll_tuned_allreduce_intra_dec_fixed () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> > #8  0x00002ad35ad1f0f4 in PMPI_Allreduce () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> > #9  0x00002ad35aa99c38 in pmpi_allreduce__ () from 
> > /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi_mpifh.so.20
> > #10 0x000000000045f8c6 in m_sum_i_ ()
> > #11 0x0000000000e1ce69 in mlwf_mp_mlwf_wannier90_ ()
> > #12 0x00000000004331ff in vamp () at main.F:2640
> > #13 0x000000000040ea1e in main ()
> > #14 0x00002ad35b080b15 in __libc_start_main () from /lib64/libc.so.6
> > #15 0x000000000040e929 in _start ()

-- 
Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
28359 Bremen  

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
