> On Dec 7, 2016, at 12:37 PM, Christof Koehler wrote:
>
>> Presumably someone here can comment on what the standard says about the
>> validity of terminating without mpi_abort.
>
> Well, probably stop is not a good way to terminate then.
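For illustration, a minimal sketch (hypothetical code, not VASP's actual
error handling) of the alternative to a bare "stop":

   subroutine die(msg)
      use mpi
      implicit none
      character(*), intent(in) :: msg
      integer :: ierr
      write (*, *) trim(msg)
      ! Unlike a bare "stop", MPI_Abort asks the runtime (mpirun) to
      ! terminate every rank of the job, not just the calling one.
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
   end subroutine die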
Hi Christof
Sorry if I missed this, but it sounds like you are saying that one of your
procs abnormally terminates, and we are failing to kill the remaining job? Is
that correct?
If so, I just did some work that might address that problem; it is pending
in PR #2528:
Hello,
On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> > On Dec 7, 2016, at 10:07 AM, Christof Koehler wrote:
> >
> > I really think the hang is a consequence of unclean termination (in the
> > sense that the non-root ranks are not terminated) and probably not the
> > cause, in my interpretation of what I see.
Hello,
On Wed, Dec 07, 2016 at 11:07:49PM +0900, Gilles Gouaillardet wrote:
> Christof,
>
> out of curiosity, can you run
> dmesg
> and see if you find some tasks killed by the oom-killer?
Definitely not the oom-killer. It is a really tiny example. I checked
the machine's logfile and dmesg.
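For reference, a typical check is something like

   dmesg | grep -iE "out of memory|killed process"

which would show any oom-killer activity.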
Hello again,
after attaching gdb to mpirun, the backtrace when it hangs is:
(gdb) bt
#0 0x2b039f74169d in poll () from /usr/lib64/libc.so.6
#1 0x2b039e1a9c42 in poll_dispatch () from
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x2b039e1a2751 in
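(For reference, attaching can be done with something like
"gdb -p $(pgrep mpirun)", then typing "bt" at the (gdb) prompt.)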
Hello,
thank you for the quick answer.
On Wed, Dec 07, 2016 at 08:23:43PM +0900, Gilles Gouaillardet wrote:
> Christof,
>
> can you please try again with
>
> mpirun --mca btl tcp,self --mca pml ob1 ...
mpirun -n 20 --mca btl tcp,self --mca pml ob1
Christof,
can you please try again with
mpirun --mca btl tcp,self --mca pml ob1 ...
that will help figure out whether pml/cm and/or mtl/psm2 is involved or not.
if that causes a crash, then can you please try
mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...
that will help figure out whether coll/tuned is involved or not.
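For reference, which pml component is actually selected can be confirmed
with Open MPI's verbosity switches, e.g.

   mpirun --mca pml_base_verbose 10 --mca btl tcp,self --mca pml ob1 ...

which makes the pml framework log its component selection at startup.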
Hello everybody,
I am observing a deadlock in allreduce with openmpi 2.0.1 on a single
node. A stack trace (pstack) of one rank is below, showing the program (vasp
5.3.5) and the two psm2 progress threads. However:
In fact, the vasp input is not ok and it should abort at the point where
it
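To make the suspected failure mode concrete, here is a minimal sketch
(hypothetical, not the actual vasp code) of the pattern: rank 0 detects the
bad input and stops, while the remaining ranks enter the allreduce and wait
forever:

   program hang_demo
      use mpi
      implicit none
      integer :: ierr, rank, val
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      ! only rank 0 exits here; without MPI_Abort nothing tells the
      ! runtime to tear down the other ranks
      if (rank == 0) stop 1
      val = rank
      ! ranks 1..n-1 block here forever waiting for rank 0
      call MPI_Allreduce(MPI_IN_PLACE, val, 1, MPI_INTEGER, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)
      call MPI_Finalize(ierr)
   end program hang_demo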