As I said, the change in behaviour is new in 1.7.4; all previous versions worked. Moreover, setting "-mca oob_tcp_if_include ib0" is a workaround in older versions of Open MPI for a 60-second timeout when starting the same command (which still succeeds), or for infinite waiting in some cases.


Attached are logs of the commands:
$ export | grep OMPI | tee export_OMPI-linuxbmc0008.txt

$ $MPI_BINDIR/mpiexec -mca oob_tcp_if_include ib0 -mca oob_base_verbose 100 -H linuxscc004 -np 1 hostname 2>&1 | tee oob_base_verbose-linuxbmc0008-173.txt

(and -174; the suffixes correspond to versions 1.7.3 and 1.7.4, respectively)


$ ifconfig 2>&1 | tee ifconfig-linuxbmc0008.txt

(and -linuxscc004 for the second node; linuxscc004 is in the (h) fabric, and 'mpiexec' was called from node linuxbmc0008, which is in the (b) fabric, where 'ib0' is configured to be the main interface)

and the OMPI environment on linuxbmc0008. Maybe you can see something from this.

Best
Paul


On 02/11/14 20:29, Ralph Castain wrote:
I've added better error messages in the trunk, scheduled to move over to 1.7.5.
I don't see anything in the code that would explain why we don't pick up and use
ib0 if it is present and specified in if_include - we should be doing it.

For now, can you run this with "-mca oob_base_verbose 100" on your cmd line and 
send me the output? Might help debug the behavior.

Thanks
Ralph

On Feb 11, 2014, at 1:22 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:

Dear Open MPI developer,

I.
We see peculiar behaviour in the new 1.7.4 version of Open MPI, which is a
change from previous versions:
- when calling "mpiexec", it returns "1" and exits silently.

The behaviour is reproducible, though not easily.

We have multiple InfiniBand islands in our cluster. All nodes can reach each
other without a password in one way or another: some via IPoIB; for others the
routing also goes through Ethernet cards and IB/TCP gateways.

One island (b) is configured to use the IB card as the main TCP interface. In this
island, the variable OMPI_MCA_oob_tcp_if_include is set to "ib0" (*).

Another island (h) is configured in the conventional way: IB cards are present
here too and may be used for IPoIB within the island, but the "main interface"
used for DNS and hostname binding is eth0.

When 'mpiexec' is called from (b) to start a process on (h), the Open MPI
version is 1.7.4, and OMPI_MCA_oob_tcp_if_include is set to "ib0", mpiexec just
exits with return value "1" and no error or warning.

When OMPI_MCA_oob_tcp_if_include is unset, it works fine.
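
A silent failure of this kind is easier to spot when the exit status is always printed, and the comparison is easier when the variable is removed for one command only. The following is a minimal sketch, not part of the original report: the helper names run_and_report and run_without_if_include are hypothetical, and "env -u" (as provided by GNU coreutils) is assumed to be available.

```shell
# Hypothetical helper: run a command and always report its exit status on
# stderr, so that a silent failure (no output, non-zero status) is visible.
# The "|| rc=$?" form keeps the helper usable under "set -e".
run_and_report() {
    rc=0
    "$@" || rc=$?
    echo "exit status $rc: $*" >&2
    return "$rc"
}

# Hypothetical helper: run the same command with OMPI_MCA_oob_tcp_if_include
# removed from that command's environment only; the calling shell keeps it.
run_without_if_include() {
    run_and_report env -u OMPI_MCA_oob_tcp_if_include "$@"
}
```

With these definitions, "run_and_report $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname" and "run_without_if_include $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname" can be run one after the other to compare the two cases.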

All previous versions of Open MPI (1.6.x, 1.7.3) did not show this
behaviour, so it is specific to v1.7.4. See the log below.

You may ask why on earth we start MPI processes on another IB island. Because
our front-end nodes are in island (b), but we sometimes need to start something
on island (h) as well, which worked perfectly until 1.7.4.


(*) This is another long Spaghetti Western story. In short, we set
OMPI_MCA_oob_tcp_if_include to 'ib0' in the subcluster where the IB card is
configured to be the main network interface, in order to stop Open MPI from
trying to connect via (possibly unconfigured) Ethernet cards, which sometimes
leads to endless waiting.
Cf. http://www.open-mpi.org/community/lists/users/2011/11/17824.php

------------------------------------------------------------------------------
pk224850@cluster:~[523]$ module switch $_LAST_MPI openmpi/1.7.3
Unloading openmpi 1.7.3                         [ OK ]
Loading openmpi 1.7.3 for intel compiler                         [ OK ]
pk224850@cluster:~[524]$ $MPI_BINDIR/mpiexec  -H linuxscc004 -np 1 hostname ; echo $?
linuxscc004.rz.RWTH-Aachen.DE
0
pk224850@cluster:~[525]$ module switch $_LAST_MPI openmpi/1.7.4
Unloading openmpi 1.7.3                         [ OK ]
Loading openmpi 1.7.4 for intel compiler                         [ OK ]
pk224850@cluster:~[526]$ $MPI_BINDIR/mpiexec  -H linuxscc004 -np 1 hostname ; echo $?
1
pk224850@cluster:~[527]$
------------------------------------------------------------------------------








II.
During some experiments with environment variables and v1.7.4, I got the messages below.

--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    no-included-found
But I couldn't open the help file:
    /opt/MPI/openmpi-1.7.4/linux/intel/share/openmpi/help-oob-tcp.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
[linuxc2.rz.RWTH-Aachen.DE:13942] [[63331,0],0] ORTE_ERROR_LOG: Not available in file ess_hnp_module.c at line 314
--------------------------------------------------------------------------

Reproducing:
$MPI_BINDIR/mpiexec  -mca oob_tcp_if_include ib0   -H linuxscc004 -np 1 hostname

*from a node with no 'ib0' card*, i.e. without InfiniBand. Yes, this is a bad idea,
and 1.7.3 said "you are doing the wrong thing" in a more understandable way:
--------------------------------------------------------------------------
None of the networks specified to be included for out-of-band communications
could be found:

  Value given: ib0

Please revise the specification and try again.
--------------------------------------------------------------------------
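
A guard of the following kind could avoid requesting an interface that the local node does not have. This is only a sketch under assumptions: it relies on Linux listing every network interface under /sys/class/net, and the helper names iface_exists and set_oob_if_include are hypothetical. It checks the local node only, so it would not catch the cross-island case from part I.

```shell
# Hypothetical helper: succeed if a network interface with the given name
# exists. Assumes Linux, where each interface appears under /sys/class/net.
# The optional second argument (sysfs root) exists only to make the check
# testable without real interfaces.
iface_exists() {
    [ -e "${2:-/sys/class/net}/$1" ]
}

# Hypothetical helper: restrict Open MPI's out-of-band TCP traffic to the
# given interface only if that interface is present on this node.
set_oob_if_include() {
    if iface_exists "$1" "${2:-/sys/class/net}"; then
        export OMPI_MCA_oob_tcp_if_include="$1"
    else
        echo "interface $1 not found; OMPI_MCA_oob_tcp_if_include left unchanged" >&2
        return 1
    fi
}
```

A site-wide environment script could then call "set_oob_if_include ib0" instead of exporting the variable unconditionally.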


No idea why the file share/openmpi/help-oob-tcp.txt has not been installed in
1.7.4, as we compile this version in pretty much the same way as previous versions.
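
To see which help files are actually missing from an installation, a small check like the one below may help. It is a sketch only: the helper name missing_help_files is hypothetical, and the share/openmpi layout is taken from the path in the error message above.

```shell
# Hypothetical helper: print every help file that is missing or unreadable
# below <prefix>/share/openmpi. Returns non-zero if at least one is missing.
missing_help_files() {
    prefix=$1
    shift
    rc=0
    for name in "$@"; do
        if [ ! -r "$prefix/share/openmpi/$name" ]; then
            echo "$prefix/share/openmpi/$name"
            rc=1
        fi
    done
    return "$rc"
}
```

For the installation in question the call would be "missing_help_files /opt/MPI/openmpi-1.7.4/linux/intel help-oob-tcp.txt"; running it against the 1.7.3 prefix as well would show whether the two installations differ.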




Best,
Paul Kapinos

--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, IT Center
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915




[linuxbmc0008.rz.RWTH-Aachen.DE:17040] mca: base: components_open: Looking for oob components
[linuxbmc0008.rz.RWTH-Aachen.DE:17040] mca: base: components_open: opening oob components
[linuxbmc0008.rz.RWTH-Aachen.DE:17040] mca: base: components_open: found loaded component tcp
[linuxbmc0008.rz.RWTH-Aachen.DE:17040] mca: base: components_open: component tcp register function successful
[linuxbmc0008.rz.RWTH-Aachen.DE:17040] mca: base: components_open: component tcp open function successful
[linuxscc004.rz.RWTH-Aachen.DE:22548] mca: base: components_open: Looking for oob components
[linuxscc004.rz.RWTH-Aachen.DE:22548] mca: base: components_open: opening oob components
[linuxscc004.rz.RWTH-Aachen.DE:22548] mca: base: components_open: found loaded component tcp
[linuxscc004.rz.RWTH-Aachen.DE:22548] mca: base: components_open: component tcp register function successful
[linuxscc004.rz.RWTH-Aachen.DE:22548] mca: base: components_open: component tcp open function successful
linuxscc004.rz.RWTH-Aachen.DE
[linuxbmc0008.rz.RWTH-Aachen.DE:17040] mca: base: close: component tcp closed
[linuxbmc0008.rz.RWTH-Aachen.DE:17040] mca: base: close: unloading component tcp

[linuxbmc0008.rz.RWTH-Aachen.DE:18524] mca: base: components_register: registering oob components
[linuxbmc0008.rz.RWTH-Aachen.DE:18524] mca: base: components_register: found loaded component tcp
[linuxbmc0008.rz.RWTH-Aachen.DE:18524] mca: base: components_register: component tcp register function successful
[linuxbmc0008.rz.RWTH-Aachen.DE:18524] mca: base: components_open: opening oob components
[linuxbmc0008.rz.RWTH-Aachen.DE:18524] mca: base: components_open: found loaded component tcp
[linuxbmc0008.rz.RWTH-Aachen.DE:18524] mca: base: components_open: component tcp open function successful
[linuxbmc0008.rz.RWTH-Aachen.DE:18524] mca_oob_base_init: tcp module selected
[linuxscc004.rz.RWTH-Aachen.DE:23865] mca: base: components_register: registering oob components
[linuxscc004.rz.RWTH-Aachen.DE:23865] mca: base: components_register: found loaded component tcp
[linuxscc004.rz.RWTH-Aachen.DE:23865] mca: base: components_register: component tcp register function successful
[linuxscc004.rz.RWTH-Aachen.DE:23865] mca: base: components_open: opening oob components
[linuxscc004.rz.RWTH-Aachen.DE:23865] mca: base: components_open: found loaded component tcp
[linuxscc004.rz.RWTH-Aachen.DE:23865] mca: base: components_open: component tcp open function successful
[linuxscc004.rz.RWTH-Aachen.DE:23865] mca_oob_base_init: tcp module selected
linuxscc004.rz.RWTH-Aachen.DE
[linuxscc004.rz.RWTH-Aachen.DE:23865] mca: base: close: component tcp closed
[linuxscc004.rz.RWTH-Aachen.DE:23865] mca: base: close: unloading component tcp
[linuxbmc0008.rz.RWTH-Aachen.DE:18524] mca: base: close: component tcp closed
[linuxbmc0008.rz.RWTH-Aachen.DE:18524] mca: base: close: unloading component tcp

[linuxbmc0008.rz.RWTH-Aachen.DE:17217] mca: base: components_register: registering oob components
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] mca: base: components_register: found loaded component tcp
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] mca: base: components_register: component tcp register function successful
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] mca: base: components_open: opening oob components
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] mca: base: components_open: found loaded component tcp
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] mca: base: components_open: component tcp open function successful
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] mca:oob:select: checking available component tcp
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] mca:oob:select: Querying component [tcp]
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] oob:tcp: component_available called
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] [[21954,0],0] oob:tcp:init rejecting interface lo (not in include list)
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] [[21954,0],0] oob:tcp:init rejecting interface eth0 (not in include list)
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] [[21954,0],0] TCP STARTUP
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] [[21954,0],0] attempting to bind to IPv4 port 0
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] [[21954,0],0] assigned IPv4 port 1721
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] mca:oob:select: Adding component to end
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] mca:oob:select: Found 1 active transports
[linuxscc004.rz.RWTH-Aachen.DE:22754] mca: base: components_register: registering oob components
[linuxscc004.rz.RWTH-Aachen.DE:22754] mca: base: components_register: found loaded component tcp
[linuxscc004.rz.RWTH-Aachen.DE:22754] mca: base: components_register: component tcp register function successful
[linuxscc004.rz.RWTH-Aachen.DE:22754] mca: base: components_open: opening oob components
[linuxscc004.rz.RWTH-Aachen.DE:22754] mca: base: components_open: found loaded component tcp
[linuxscc004.rz.RWTH-Aachen.DE:22754] mca: base: components_open: component tcp open function successful
[linuxscc004.rz.RWTH-Aachen.DE:22754] mca:oob:select: checking available component tcp
[linuxscc004.rz.RWTH-Aachen.DE:22754] mca:oob:select: Querying component [tcp]
[linuxscc004.rz.RWTH-Aachen.DE:22754] oob:tcp: component_available called
[linuxscc004.rz.RWTH-Aachen.DE:22754] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[linuxscc004.rz.RWTH-Aachen.DE:22754] [[21954,0],1] oob:tcp:init rejecting interface lo (not in include list)
[linuxscc004.rz.RWTH-Aachen.DE:22754] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[linuxscc004.rz.RWTH-Aachen.DE:22754] [[21954,0],1] oob:tcp:init rejecting interface eth0 (not in include list)
[linuxscc004.rz.RWTH-Aachen.DE:22754] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[linuxscc004.rz.RWTH-Aachen.DE:22754] [[21954,0],1] TCP STARTUP
[linuxscc004.rz.RWTH-Aachen.DE:22754] [[21954,0],1] attempting to bind to IPv4 port 0
[linuxscc004.rz.RWTH-Aachen.DE:22754] [[21954,0],1] assigned IPv4 port 19676
[linuxscc004.rz.RWTH-Aachen.DE:22754] mca:oob:select: Adding component to end
[linuxscc004.rz.RWTH-Aachen.DE:22754] mca:oob:select: Found 1 active transports
[linuxscc004.rz.RWTH-Aachen.DE:22754] [[21954,0],1]: set_addr to uri 1438777344.0;tcp://134.61.202.7:47366
[linuxscc004.rz.RWTH-Aachen.DE:22754] [[21954,0],1] oob:tcp: working peer [[21954,0],0] address tcp://134.61.202.7:47366
[linuxscc004.rz.RWTH-Aachen.DE:22754] [[21954,0],1] NO MODULE AT KINDEX 2 FOR ADDRESS 134.61.202.7
[linuxscc004.rz.RWTH-Aachen.DE:22754] [[21954,0],1] oob:base:send to target [[21954,0],0]
[linuxscc004.rz.RWTH-Aachen.DE:22754] [[21954,0],1] oob:base:send no path to target [[21954,0],0]
[linuxscc004.rz.RWTH-Aachen.DE:22754] [[21954,0],1] TCP SHUTDOWN
[linuxscc004.rz.RWTH-Aachen.DE:22754] mca: base: close: component tcp closed
[linuxscc004.rz.RWTH-Aachen.DE:22754] mca: base: close: unloading component tcp
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] [[21954,0],0] TCP SHUTDOWN
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] mca: base: close: component tcp closed
[linuxbmc0008.rz.RWTH-Aachen.DE:17217] mca: base: close: unloading component tcp

LAST_COMPILER=intel
OMPI_MCA_btl='^tcp'
OMPI_MCA_btl_openib_ib_timeout=24
OMPI_MCA_btl_tcp_if_include=ib0
OMPI_MCA_oob_tcp_if_include=ib0
_COMPILER_OPENMPI=intel
_LAST_COMPILER_MAJORVERSION=14
_LAST_COMPILER_MINORVERSION=0
_LAST_COMPILER_REVISION=1.106
_LIST_LAST_COMPILER=intel
_LIST__LAST_COMPILER_MAJORVERSION=14
_LIST__LAST_COMPILER_MINORVERSION=0
_LIST__LAST_COMPILER_REVISION=1.106

Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly.
Ifconfig is obsolete! For replacement check ip.
eth0      Link encap:Ethernet  HWaddr 08:00:38:36:DC:22  
          inet addr:134.61.224.17  Bcast:134.61.239.255  Mask:255.255.240.0
          inet6 addr: fe80::a00:38ff:fe36:dc22/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:386719 errors:0 dropped:0 overruns:0 frame:0
          TX packets:234 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:32558542 (31.0 MiB)  TX bytes:14805 (14.4 KiB)
          Memory:c3b60000-c3b80000 

ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:134.61.202.7  Bcast:134.61.223.255  Mask:255.255.224.0
          inet6 addr: fe80::a00:3800:137:3c34/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:22986346 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10037639 errors:0 dropped:496 overruns:0 carrier:0
          collisions:0 txqueuelen:256 
          RX bytes:29169627021 (27.1 GiB)  TX bytes:11033130652 (10.2 GiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:683084 errors:0 dropped:0 overruns:0 frame:0
          TX packets:683084 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1311389621 (1.2 GiB)  TX bytes:1311389621 (1.2 GiB)

Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly.
Ifconfig is obsolete! For replacement check ip.
eth0      Link encap:Ethernet  HWaddr 00:0A:E4:87:A2:EF  
          inet addr:134.61.220.73  Bcast:134.61.223.255  Mask:255.255.224.0
          inet6 addr: fe80::20a:e4ff:fe87:a2ef/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:62454819 errors:0 dropped:920 overruns:0 frame:0
          TX packets:57964287 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:54096582054 (50.3 GiB)  TX bytes:71079599155 (66.1 GiB)

ib0       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:192.168.222.4  Bcast:192.168.255.255  Mask:255.255.0.0
          inet6 addr: fe80::205:ad00:c:75ad/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4322 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1241 errors:0 dropped:6 overruns:0 carrier:0
          collisions:0 txqueuelen:256 
          RX bytes:576711 (563.1 KiB)  TX bytes:195096 (190.5 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:120079 errors:0 dropped:0 overruns:0 frame:0
          TX packets:120079 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:106113288 (101.1 MiB)  TX bytes:106113288 (101.1 MiB)
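
One possible reading of the failing log (the run that ends with "no path to target") is that linuxscc004 would reach the peer address 134.61.202.7 through its eth0, which the include list rejected, rather than through its ib0. The sketch below checks only the subnet arithmetic behind that reading; it is not taken from the Open MPI code. It assumes that the second ifconfig listing above belongs to linuxscc004 (the first listing's ib0 address matches the URI in the log), and the helper names ip_to_int and same_subnet are hypothetical.

```shell
# Convert a dotted-quad IPv4 address to an integer. The body runs in a
# subshell so that changing IFS and the positional parameters stays local.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed if two IPv4 addresses fall into the same subnet under a netmask.
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}

peer=134.61.202.7   # address in the URI that linuxscc004 received

# Interfaces from the second ifconfig listing: name, address, netmask.
for entry in "eth0 134.61.220.73 255.255.224.0" \
             "ib0 192.168.222.4 255.255.0.0"; do
    set -- $entry
    if same_subnet "$2" "$peer" "$3"; then
        echo "$1: same subnet as $peer"
    else
        echo "$1: different subnet from $peer"
    fi
done
```

Under these assumptions the loop reports eth0 as being in the same subnet as 134.61.202.7 and ib0 as being in a different one, which would be consistent with the "NO MODULE AT KINDEX 2" line, since kernel index 2 is the rejected eth0 in that log.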
