Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Noam Bernstein
> On Dec 7, 2016, at 12:37 PM, Christof Koehler wrote:
>
>> Presumably someone here can comment on what the standard says about the
>> validity of terminating without mpi_abort.
>
> Well, probably stop is not a good way to terminate then.

My
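For readers following along: the clean way for Fortran code such as VASP to bail out on a fatal error is MPI_Abort, which asks the runtime to terminate every rank, whereas a bare stop ends only the calling rank. A minimal sketch of that pattern (illustrative only, not VASP's actual code):

    program abort_on_error
       use mpi
       implicit none
       integer :: ierr, rank
       logical :: input_is_bad
       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       input_is_bad = .true.   ! stand-in for the failing input check
       if (rank == 0 .and. input_is_bad) then
          ! a bare STOP here would end only rank 0 and could leave the
          ! other ranks blocked in a collective; MPI_Abort tears down
          ! the whole job
          call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
       end if
       call MPI_Finalize(ierr)
    end program abort_on_error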

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread r...@open-mpi.org
Hi Christof, sorry if I missed this, but it sounds like you are saying that one of your procs abnormally terminates, and we are failing to kill the remaining job? Is that correct? If so, I just did some work that might relate to that problem that is pending in PR #2528:

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello,

On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> > On Dec 7, 2016, at 10:07 AM, Christof Koehler wrote:
> >
> > I really think the hang is a consequence of unclean termination (in the
> > sense that the non-root ranks are not

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Noam Bernstein
> On Dec 7, 2016, at 10:07 AM, Christof Koehler wrote:
>
> I really think the hang is a consequence of unclean termination (in the
> sense that the non-root ranks are not terminated) and probably not the
> cause, in my interpretation of what I see.

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello,

On Wed, Dec 07, 2016 at 11:07:49PM +0900, Gilles Gouaillardet wrote:
> Christof,
>
> out of curiosity, can you run
> dmesg
> and see if you find some tasks killed by the oom-killer?

Definitely not the oom-killer. It is a really tiny example. I checked the machine's logfile and dmesg.
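For reference, oom-killer activity is easy to spot in the kernel log; generic commands (not from the thread), assuming a syslog-style distribution:

    dmesg | grep -i 'killed process'
    grep -i 'out of memory' /var/log/messages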

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello again,

Attaching gdb to mpirun, the backtrace when it hangs is:

(gdb) bt
#0  0x2b039f74169d in poll () from /usr/lib64/libc.so.6
#1  0x2b039e1a9c42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2  0x2b039e1a2751 in
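The attach itself is plain gdb usage, nothing Open MPI specific; a sketch, assuming a single mpirun on the node:

    gdb -p $(pgrep -o mpirun)      # attach to the running mpirun
    (gdb) bt                       # backtrace of the main thread
    (gdb) thread apply all bt      # include helper/progress threads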

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello, thank you for the fast answer.

On Wed, Dec 07, 2016 at 08:23:43PM +0900, Gilles Gouaillardet wrote:
> Christoph,
>
> can you please try again with
>
> mpirun --mca btl tcp,self --mca pml ob1 ...

mpirun -n 20 --mca btl tcp,self --mca pml ob1

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Gilles Gouaillardet
Christoph,

can you please try again with

mpirun --mca btl tcp,self --mca pml ob1 ...

that will help figure out whether pml/cm and/or mtl/psm2 is involved or not. If that causes a crash, then can you please try

mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...

that will help
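Spelled out as full command lines (with ./vasp standing in for the actual binary, which the thread does not show):

    # force the ob1 PML over the TCP BTL, bypassing mtl/psm2
    mpirun -n 20 --mca btl tcp,self --mca pml ob1 ./vasp

    # additionally take the tuned collective component out of the picture
    mpirun -n 20 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ./vasp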

[OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello everybody,

I am observing a deadlock in allreduce with Open MPI 2.0.1 on a single node. A stack trace (pstack) of one rank is below, showing the program (VASP 5.3.5) and the two psm2 progress threads. However, in fact the VASP input is not ok and it should abort at the point where it
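The shape of the failure can be reduced to a few lines of Fortran (a hypothetical distillation, not VASP code): the root rank hits a bare stop on the bad input while the remaining ranks are already inside MPI_Allreduce, so unless the runtime tears the job down they wait forever:

    program stop_hang
       use mpi
       implicit none
       integer :: ierr, rank, val, res
       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       val = rank
       ! rank 0 exits without MPI_Abort, mimicking the bad-input STOP
       if (rank == 0) stop 'bad input'
       ! all other ranks block here waiting for rank 0's contribution
       call MPI_Allreduce(val, res, 1, MPI_INTEGER, MPI_SUM, &
                          MPI_COMM_WORLD, ierr)
       call MPI_Finalize(ierr)
    end program stop_hang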