Okay, this exposed the problem. The issue is that "ib0" on the two machines is
defined on two completely different IP subnets:
linuxbmc0008: 134.61.202.7
linuxscc004: 192.168.222.4
The OOB doesn't think those two are directly reachable by each other as the
IP/subnet-mask don't match - we
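To illustrate the kind of check described above, here is a minimal sketch that tests whether two interface addresses fall in the same IP network. It is not Open MPI's OOB code. The helper name `same_subnet` is hypothetical, and the /24 prefix length is an assumption, since the thread does not state the actual netmasks.

```python
import ipaddress

def same_subnet(addr_a, addr_b, prefix_len):
    """Return True if both addresses lie in the same network for prefix_len."""
    # ip_interface() accepts host bits; .network masks them off.
    net_a = ipaddress.ip_interface(f"{addr_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{prefix_len}").network
    return net_a == net_b

# ib0 addresses quoted in this thread; the /24 prefix length is assumed.
print(same_subnet("134.61.202.7", "192.168.222.4", 24))
```

With these two addresses the check fails for any realistic prefix length, because they already differ in the first octet.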
Attached is the output from openmpi/1.7.5a1r30708:
$ $MPI_BINDIR/mpiexec -mca oob_tcp_if_include ib0 -mca oob_base_verbose 100 -H
linuxscc004 -np 1 hostname 2>&1 | tee oob_base_verbose-linuxbmc0008-175a1r29587.txt
Well, some 5 lines added.
(The ib0 on linuxscc004 is not reachable from linuxbmc00
Could you please give the nightly 1.7.5 tarball a try using the same cmd line
options and send me the output? I see the problem, but am trying to understand
how it happens. I've added a bunch of diagnostic statements that should help me
track it down.
Thanks
Ralph
On Feb 12, 2014, at 1:26 AM,
As said, the change in behaviour is new in 1.7.4 - all previous versions
worked. Moreover, setting "-mca oob_tcp_if_include ib0" is a workaround for
older versions of Open MPI for a timeout of some 60 seconds when starting the same
command (which is still successful); or for infinite waiting
I've added better error messages in the trunk, scheduled to move over to 1.7.5.
I don't see anything in the code that would explain why we don't pick up and use
ib0 if it is present and specified in if_include - we should be doing it.
For now, can you run this with "-mca oob_base_verbose 100" on
Dear Open MPI developer,
I.
we see peculiar behaviour in the new 1.7.4 version of Open MPI which is a change
from previous versions:
- when calling "mpiexec", it returns "1" and exits silently.
The behaviour is reproducible; well, not that easily reproducible.
We have multiple InfiniBand islands i