Re: [OMPI users] Abort/ Deadlock issue in allreduce (Gilles Gouaillardet)

2016-12-12 Thread Christof Koehler
--- > Message: 1 > Date: Mon, 12 Dec 2016 09:32:25 +0900 > From: Gilles Gouaillardet <gil...@rist.or.jp> > To: users@lists.open-mpi.org > Subject: Re: [OMPI users] Abort/ Deadlock issue in allreduce > Message-ID: <831688

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-11 Thread Gilles Gouaillardet
Christof, Ralph fixed the issue; meanwhile, the patch can be manually downloaded at https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/2552.patch Cheers, Gilles On 12/9/2016 5:39 PM, Christof Koehler wrote: Hello, our case is: the libwannier.a is a "third party" library
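Applying that patch by hand to an Open MPI source tree would look roughly like this; the directory name is a placeholder for wherever your source checkout lives:

    $ wget https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/2552.patch
    $ cd openmpi-2.0.1          # your Open MPI source directory (assumption)
    $ patch -p1 < ../2552.patch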

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-09 Thread Noam Bernstein
> On Dec 9, 2016, at 3:39 AM, Christof Koehler > wrote: > > Hello, > > our case is: the libwannier.a is a "third party" > library which is built separately and then just linked in. So the vasp > preprocessor never touches it. As far as I can see no

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-09 Thread Christof Koehler
Hello, our case is: the libwannier.a is a "third party" library which is built separately and then just linked in. So the vasp preprocessor never touches it. As far as I can see, no preprocessing of the f90 source is involved in the libwannier build process. I finally managed to set a breakpoint

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-09 Thread Gilles Gouaillardet
Folks, the problem is indeed pretty trivial to reproduce; I opened https://github.com/open-mpi/ompi/issues/2550 (and included a reproducer). Cheers, Gilles On Fri, Dec 9, 2016 at 5:15 AM, Noam Bernstein wrote: > On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet >
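The reproducer attached to the issue is not quoted in this thread; as a rough sketch of the failure mode being discussed (program and variable names invented here), one rank stopping early while the rest enter the collective looks like:

    program allreduce_hang
      use mpi
      implicit none
      integer :: ierr, rank, sendval, recvval
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      ! rank 0 terminates without MPI_Abort/MPI_Finalize, as vasp's STOP does
      if (rank == 0) stop 1
      sendval = rank
      ! with Open MPI 2.0.1 the surviving ranks block here forever
      call MPI_Allreduce(sendval, recvval, 1, MPI_INTEGER, MPI_SUM, &
                         MPI_COMM_WORLD, ierr)
      call MPI_Finalize(ierr)
    end program allreduce_hang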

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Noam Bernstein
> On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet > wrote: > > Christof, > > > There is something really odd with this stack trace. > count is zero, and some pointers do not point to valid addresses (!) > > in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op,

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread r...@open-mpi.org
As best I can determine, mpirun catches SIGTERM just fine and will hit the procs with SIGCONT, followed by SIGTERM and then SIGKILL. It will then wait for the remote daemons to complete after they hit their procs with the same sequence. > On Dec 8, 2016, at 5:18 AM, Christof Koehler >

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Christof Koehler
Hello again, I am still not sure about breakpoints. But I did a "catch signal" in gdb; gdb instances were attached to the two vasp processes and to mpirun. When the root rank exits I see in the gdb attached to it [Thread 0x2b2787df8700 (LWP 2457) exited] [Thread 0x2b277f483180 (LWP 2455) exited]
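For reference, the attach-and-catch sequence described here looks roughly like this; the PID is a placeholder:

    $ gdb -p <pid>          # attach to a running vasp rank (or to mpirun)
    (gdb) catch signal      # stop whenever any signal is delivered
    (gdb) continue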

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Christof Koehler
Hello, On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote: > Christof, > > > There is something really odd with this stack trace. > count is zero, and some pointers do not point to valid addresses (!) Yes, I assumed it was interesting :-) Note that the program is compiled with

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Gilles Gouaillardet
Christof, There is something really odd with this stack trace. count is zero, and some pointers do not point to valid addresses (!) in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that the stack has been corrupted inside MPI_Allreduce(), or that you are not using the

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Christof Koehler
Hello everybody, I tried it with the nightly and the direct 2.0.2 branch from git, which according to the log should contain that patch: commit d0b97d7a408b87425ca53523de369da405358ba2 Merge: ac8c019 b9420bb Author: Jeff Squyres Date: Wed Dec 7 18:24:46 2016
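Checking that a given commit is actually in the checked-out branch can be done with something like the following (the hash is the one quoted above):

    $ git log --oneline | grep d0b97d7
    # or list the branches that contain it
    $ git branch --contains d0b97d7a408b87425ca53523de369da405358ba2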

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Noam Bernstein
> On Dec 7, 2016, at 12:37 PM, Christof Koehler > wrote: > > >> Presumably someone here can comment on what the standard says about the >> validity of terminating without mpi_abort. > > Well, probably stop is not a good way to terminate then. > > My
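A hedged sketch of the usual alternative: on an error path, call MPI_Abort instead of a bare STOP so the whole job is torn down rather than leaving other ranks blocked. The routine name die_all is invented here for illustration:

    subroutine die_all(msg)
      use mpi
      implicit none
      character(len=*), intent(in) :: msg
      integer :: ierr
      write (*, '(a)') msg
      ! MPI_Abort terminates every rank attached to the communicator;
      ! a bare STOP only ends the calling rank
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
    end subroutine die_all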

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread r...@open-mpi.org
Hi Christof Sorry if I missed this, but it sounds like you are saying that one of your procs abnormally terminates, and we are failing to kill the remaining job? Is that correct? If so, I just did some work that might relate to that problem; it is pending in PR #2528:

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello, On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote: > > On Dec 7, 2016, at 10:07 AM, Christof Koehler > > wrote: > >> > > I really think the hang is a consequence of > > unclean termination (in the sense that the non-root ranks are not >

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Noam Bernstein
> On Dec 7, 2016, at 10:07 AM, Christof Koehler > wrote: >> > I really think the hang is a consequence of > unclean termination (in the sense that the non-root ranks are not > terminated) and probably not the cause, in my interpretation of what I > see.

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello, On Wed, Dec 07, 2016 at 11:07:49PM +0900, Gilles Gouaillardet wrote: > Christof, > > out of curiosity, can you run > dmesg > and see if you find some tasks killed by the oom-killer ? Definitely not the oom-killer. It is a really tiny example. I checked the machine's logfile and dmesg. >

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello again, attaching gdb to mpirun, the backtrace when it hangs is (gdb) bt
#0 0x2b039f74169d in poll () from /usr/lib64/libc.so.6
#1 0x2b039e1a9c42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x2b039e1a2751 in

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello, thank you for the quick answer. On Wed, Dec 07, 2016 at 08:23:43PM +0900, Gilles Gouaillardet wrote: > Christoph, > > can you please try again with > > mpirun --mca btl tcp,self --mca pml ob1 ... mpirun -n 20 --mca btl tcp,self --mca pml ob1

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Gilles Gouaillardet
Christoph, can you please try again with mpirun --mca btl tcp,self --mca pml ob1 ... That will help figure out whether pml/cm and/or mtl/psm2 is involved or not. If that causes a crash, then can you please try mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ... That will help
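Spelled out, the two suggested diagnostic runs look like this; ./your_app stands in for the actual vasp invocation:

    # force the ob1 PML over the TCP BTL, bypassing pml/cm and mtl/psm2
    $ mpirun --mca btl tcp,self --mca pml ob1 ./your_app
    # additionally exclude the tuned collective component
    $ mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ./your_app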

[OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello everybody, I am observing a deadlock in allreduce with openmpi 2.0.1 on a single node. A stack trace (pstack) of one rank is below, showing the program (vasp 5.3.5) and the two psm2 progress threads. However: in fact, the vasp input is not ok and it should abort at the point where it
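For reference, such a per-rank stack trace can be taken from a running process with (the PID is a placeholder):

    $ pstack <pid-of-vasp-rank>    # prints the stack of every thread in that process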