I've added better error messages in the trunk, scheduled to move over to 1.7.5. 
I don't see anything in the code that would explain why we don't pick up and use 
ib0 if it is present and specified in if_include - we should be doing it.

For now, can you run this with "-mca oob_base_verbose 100" on your command line 
and send me the output? It might help debug the behavior.
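
Something like the following should do, reusing the command from your log below. The file name oob_debug.log is only a placeholder, and this assumes $MPI_BINDIR is still set by your module environment:

```shell
# Same command as in the log below, with OOB verbosity turned up.
# oob_debug.log is a placeholder name; 2>&1 also captures stderr,
# which is where most of the diagnostic output is expected to go.
$MPI_BINDIR/mpiexec -mca oob_base_verbose 100 -H linuxscc004 -np 1 hostname > oob_debug.log 2>&1
echo "exit code: $?"
```

The output can be long; sending the whole file is fine.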

Thanks
Ralph

On Feb 11, 2014, at 1:22 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:

> Dear Open MPI developer,
> 
> I.
> we see peculiar behaviour in the new 1.7.4 version of Open MPI which is a 
> change from previous versions:
> - when calling "mpiexec", it returns "1" and exits silently.
> 
> The behaviour is reproducible, though not that easily.
> 
> We have multiple InfiniBand islands in our cluster. All nodes can reach each 
> other without a password in some way: some via IPoIB, while for some routes 
> you also have to use ethernet cards and IB/TCP gateways.
> 
> One island (b) is configured to use the IB card as the main TCP interface. In 
> this island, the variable OMPI_MCA_oob_tcp_if_include is set to "ib0" (*)
> 
> Another island (h) is configured in the conventional way: IB cards are also 
> present here and may be used for IPoIB within the island, but the "main 
> interface" used for DNS and hostname binds is eth0.
> 
> When calling 'mpiexec' from (b) to start a process on (h) with Open MPI 
> version 1.7.4 and OMPI_MCA_oob_tcp_if_include set to "ib0", mpiexec 
> just exits with return value "1" and no error or warning.
> 
> When OMPI_MCA_oob_tcp_if_include is unset, it works fine.
> 
> All previous versions of Open MPI (1.6.x, 1.7.3) did not have this 
> behaviour, so it is specific to v1.7.4. See log below.
> 
> You may ask why on earth we start MPI processes on another IB island. Because 
> our front-end nodes are in island (b), but we sometimes need to start 
> something on island (h) as well, which worked perfectly until 1.7.4.
> 
> 
> (*) This is another long Spaghetti Western story. In short, we set 
> OMPI_MCA_oob_tcp_if_include to 'ib0' in the subcluster where the IB card is 
> configured to be the main network interface, in order to stop Open MPI from 
> trying to connect via (possibly unconfigured) ethernet cards, which sometimes 
> led to endless waiting.
> Cf. http://www.open-mpi.org/community/lists/users/2011/11/17824.php
> 
> ------------------------------------------------------------------------------
> pk224850@cluster:~[523]$ module switch $_LAST_MPI openmpi/1.7.3 
> Unloading openmpi 1.7.3                         [ OK ]
> Loading openmpi 1.7.3 for intel compiler                         [ OK ]
> pk224850@cluster:~[524]$ $MPI_BINDIR/mpiexec  -H linuxscc004 -np 1 hostname ; 
> echo $?
> linuxscc004.rz.RWTH-Aachen.DE
> 0
> pk224850@cluster:~[525]$ module switch $_LAST_MPI openmpi/1.7.4 
> Unloading openmpi 1.7.3                         [ OK ]
> Loading openmpi 1.7.4 for intel compiler                         [ OK ]
> pk224850@cluster:~[526]$ $MPI_BINDIR/mpiexec  -H linuxscc004 -np 1 hostname ; 
> echo $?
> 1
> pk224850@cluster:~[527]$
> ------------------------------------------------------------------------------
> 
> II.
> During some experiments with environment variables and v1.7.4, we got the 
> messages below.
> 
> --------------------------------------------------------------------------
> Sorry!  You were supposed to get help about:
>    no-included-found
> But I couldn't open the help file:
>    /opt/MPI/openmpi-1.7.4/linux/intel/share/openmpi/help-oob-tcp.txt: No such 
> file or directory.  Sorry!
> --------------------------------------------------------------------------
> [linuxc2.rz.RWTH-Aachen.DE:13942] [[63331,0],0] ORTE_ERROR_LOG: Not available 
> in file ess_hnp_module.c at line 314
> --------------------------------------------------------------------------
> 
> Reproducing:
> $MPI_BINDIR/mpiexec  -mca oob_tcp_if_include ib0   -H linuxscc004 -np 1 
> hostname
> 
> *from a node with no 'ib0' card*, i.e. without InfiniBand. Yes, this is 
> a bad idea, and 1.7.3 told us more understandably that we were doing the 
> wrong thing:
> --------------------------------------------------------------------------
> None of the networks specified to be included for out-of-band communications
> could be found:
> 
>  Value given: ib0
> 
> Please revise the specification and try again.
> --------------------------------------------------------------------------
> 
> 
> No idea why the file share/openmpi/help-oob-tcp.txt has not been installed 
> in 1.7.4, as we compile this version in pretty much the same way as previous 
> versions.
> 
> Best,
> Paul Kapinos
> 
> -- 
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, IT Center
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915
> 
