I've added better error messages in the trunk, scheduled to move over to 1.7.5. I don't see anything in the code that would explain why we don't pick up and use ib0 if it is present and specified in if_include - we should be doing it.
For now, can you run this with "-mca oob_base_verbose 100" on your cmd line and send me the output? Might help debug the behavior.

Thanks
Ralph

On Feb 11, 2014, at 1:22 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:

> Dear Open MPI developer,
>
> I.
> we see peculiar behaviour in the new 1.7.4 version of Open MPI which is a
> change to previous versions:
> - when calling "mpiexec", it returns "1" and exits silently.
>
> The behaviour is reproducible; well, not that easily reproducible.
>
> We have multiple InfiniBand islands in our cluster. All nodes are
> passwordless reachable from each other in some way; some via IPoIB, for
> some routing you also have to use ethernet cards and IB/TCP gateways.
>
> One island (b) is configured to use the IB card as the main TCP interface. In
> this island, the variable OMPI_MCA_oob_tcp_if_include is set to "ib0" (*)
>
> Another island (h) is configured in a convenient way: IB cards are also here
> and may be used for IPoIB in the island, but the "main interface" used for
> DNS and Hostname binds is eth0.
>
> When calling 'mpiexec' from (b) to start a process on (h), and the Open MPI
> version is 1.7.4, and OMPI_MCA_oob_tcp_if_include is set to "ib0", mpiexec
> just exits with return value "1" and no error/warning.
>
> When OMPI_MCA_oob_tcp_if_include is unset it works pretty fine.
>
> All previous versions of Open MPI (1.6.x, 1.7.3) did not have this
> behaviour; so this is specific to v1.7.4. See log below.
>
> You ask why the hell we start MPI processes on another IB island? Because our
> front-end nodes are in the island (b) but we sometimes need to start
> something also on island (h), which worked perfectly until 1.7.4.
>
>
> (*) This is another Spaghetti Western long story. In short, we set
> OMPI_MCA_oob_tcp_if_include to 'ib0' in the subcluster where the IB card is
> configured to be the main network interface, in order to stop Open MPI trying
> to connect via (possibly unconfigured) ethernet cards - which sometimes led
> to endless waiting.
> Cf. http://www.open-mpi.org/community/lists/users/2011/11/17824.php
>
> ------------------------------------------------------------------------------
> pk224850@cluster:~[523]$ module switch $_LAST_MPI openmpi/1.7.3
> Unloading openmpi 1.7.3 [ OK ]
> Loading openmpi 1.7.3 for intel compiler [ OK ]
> pk224850@cluster:~[524]$ $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname ; echo $?
> linuxscc004.rz.RWTH-Aachen.DE
> 0
> pk224850@cluster:~[525]$ module switch $_LAST_MPI openmpi/1.7.4
> Unloading openmpi 1.7.3 [ OK ]
> Loading openmpi 1.7.4 for intel compiler [ OK ]
> pk224850@cluster:~[526]$ $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname ; echo $?
> 1
> pk224850@cluster:~[527]$
> ------------------------------------------------------------------------------
>
>
> II.
> During some experiments with envvars and v1.7.4, we got the messages below.
>
> --------------------------------------------------------------------------
> Sorry! You were supposed to get help about:
>     no-included-found
> But I couldn't open the help file:
>     /opt/MPI/openmpi-1.7.4/linux/intel/share/openmpi/help-oob-tcp.txt: No such file or directory. Sorry!
> --------------------------------------------------------------------------
> [linuxc2.rz.RWTH-Aachen.DE:13942] [[63331,0],0] ORTE_ERROR_LOG: Not available in file ess_hnp_module.c at line 314
> --------------------------------------------------------------------------
>
> Reproducing:
> $MPI_BINDIR/mpiexec -mca oob_tcp_if_include ib0 -H linuxscc004 -np 1 hostname
>
> *from a node with no 'ib0' card*, i.e. without InfiniBand. Yes, this is
> a bad idea, and 1.7.3 gave a more understandable "you are doing it wrong" message:
> --------------------------------------------------------------------------
> None of the networks specified to be included for out-of-band communications
> could be found:
>
>   Value given: ib0
>
> Please revise the specification and try again.
> --------------------------------------------------------------------------
>
>
> No idea why the file share/openmpi/help-oob-tcp.txt has not been installed
> in 1.7.4, as we compile this version in pretty much the same way as previous
> versions.
>
>
> Best,
> Paul Kapinos
>
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, IT Center
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915
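
Regarding case II (launching with oob_tcp_if_include naming an interface the local node does not have): a minimal sketch of a pre-flight check that could be run before mpiexec. It assumes a Linux node where every network interface shows up as an entry under /sys/class/net. The helper name missing_ifaces is made up for illustration and is not part of Open MPI. Entries that are not plain interface names would simply be reported as missing by this check.

```shell
#!/bin/sh
# Sketch only: report interfaces from an if_include-style, comma-separated
# list that do not exist on this node. Assumes Linux sysfs (/sys/class/net).

# missing_ifaces LIST [SYSFS_DIR]   (hypothetical helper, not an Open MPI tool)
# Prints one line per listed name that has no entry in SYSFS_DIR.
missing_ifaces() {
    list=$1
    dir=${2:-/sys/class/net}
    old_ifs=$IFS
    IFS=,                      # split the list on commas
    for name in $list; do
        [ -e "$dir/$name" ] || printf '%s\n' "$name"
    done
    IFS=$old_ifs
}

# Warn if the environment variable from the report names a missing interface.
if [ -n "${OMPI_MCA_oob_tcp_if_include:-}" ]; then
    missing=$(missing_ifaces "$OMPI_MCA_oob_tcp_if_include")
    if [ -n "$missing" ]; then
        echo "not present on $(uname -n): $missing" >&2
    fi
fi
```

No output means every listed interface exists locally. This only covers the node where the check runs, so it says nothing about why the launch from island (b) to island (h) fails in case I.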