[OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-17 Thread Tony Ladd via users
I would very much appreciate some advice on how to debug this problem. I 
am trying to get Open MPI to work on my reconfigured cluster, which I am 
upgrading from CentOS 5 to Ubuntu 18.04. The problem is that a simple job 
using Intel's IMB message-passing benchmark will not run on any of the 
new clients (4 so far).


mpirun -np 2 IMB-MPI1

just hangs: no printout, no messages in syslog. I left it for an hour and 
it remained in the same state.


On the other hand, the same code runs fine on the server (see the 
attached outfoam). Comparing the two, it seems the client version hangs 
while trying to load the openib BTL (it works with tcp,self or 
vader,self); the exact invocations are below.
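
To be explicit, the transport is selected with the standard --mca btl 
syntax; these are the kinds of commands I mean:

mpirun --mca btl tcp,self -np 2 IMB-MPI1     # works on the clients
mpirun --mca btl vader,self -np 2 IMB-MPI1   # works (shared memory, single node)
mpirun --mca btl openib,self -np 2 IMB-MPI1  # hangs on the clients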


Digging a bit more, I found the --mca btl_base_verbose option, and with 
it I can see a difference between the two cases.
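
The command was along these lines on both machines (the verbosity level 
is arbitrary; any large value gives the same detail):

mpirun --mca btl openib,self --mca btl_base_verbose 100 -np 2 IMB-MPI1

The relevant lines of output: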


On the server: ibv_obj->logical_index=1, my_obj->logical_index=0

On the client: ibv_obj->type is set to NULL. I don't believe this is a 
good sign, but I don't understand what it means. My guess is that openib 
is not being initialized in this case.


The server (foam) is a Supermicro machine with an X10DAi motherboard and 
2x E5-2630 CPUs (10-core).


The client (f34) is a Dell R410 server with 2x E5620 CPUs (4-core). The 
outputs from ompi_info are attached.


They are both running Ubuntu 18.04 with the latest updates. I installed 
openmpi-bin 2.1.1-8. Both boxes have Mellanox ConnectX-2 cards with the 
latest firmware (2.9.1000). I have checked that the cards send and 
receive packets over the IB protocols and that they pass the Mellanox 
diagnostics.
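
For reference, the checks were of this kind, using the standard OFED 
utilities (foam's LID of 26 comes from the ibv_devinfo output attached 
below):

ibstat        # port state, LID, and rate
ibping -S     # run on foam, starts the ping responder
ibping 26     # run on f34, pings foam's LID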


I did notice that the Mellanox card has the PCI address 81:00.0 on the 
server but 03:00.0 on the client. I am not sure of the significance of 
this.
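
(The addresses are simply what lspci reports on each box:

lspci | grep -i mellanox

Presumably 81:00.0 just means the card sits behind the second socket's 
PCIe root on the dual-socket server.)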


Any help anyone can offer would be much appreciated. I am stuck.

Thanks

Tony

--

Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514

 Package: Open MPI buildd@lcy01-amd64-009 Distribution
Open MPI: 2.1.1
  Open MPI repo revision: v2.1.0-100-ga2fdb5b
   Open MPI release date: May 10, 2017
Open RTE: 2.1.1
  Open RTE repo revision: v2.1.0-100-ga2fdb5b
   Open RTE release date: May 10, 2017
OPAL: 2.1.1
  OPAL repo revision: v2.1.0-100-ga2fdb5b
   OPAL release date: May 10, 2017
 MPI API: 3.1.0
Ident string: 2.1.1
  Prefix: /usr
 Configured architecture: x86_64-pc-linux-gnu
  Configure host: lcy01-amd64-009
   Configured by: buildd
   Configured on: Mon Feb  5 19:59:59 UTC 2018
  Configure host: lcy01-amd64-009
Built by: buildd
Built on: Mon Feb  5 20:05:56 UTC 2018
  Built host: lcy01-amd64-009
  C bindings: yes
C++ bindings: yes
 Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
   Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to 
limitations in the gfortran compiler, does not support the following: array 
subsections, direct passthru (where possible) to underlying Open MPI's C 
functionality
  Fort mpi_f08 subarrays: no
   Java bindings: yes
  Wrapper compiler rpath: disabled
  C compiler: gcc
 C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
  C compiler version: 7.3.0
C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
   Fort compiler: gfortran
   Fort compiler abs: /usr/bin/gfortran
 Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
  Fort optional args: yes
  Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
   Fort STORAGE_SIZE: yes
  Fort BIND(C) (all): yes
  Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
   Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
  Fort PROTECTED: yes
   Fort ABSTRACT: yes
   Fort ASYNCHRONOUS: yes
  Fort PROCEDURE: yes
 Fort USE...ONLY: yes
   Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
 Fort MPI_SIZEOF: yes
 C profiling: yes
   C++ profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
  C++ exceptions: no
  Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, 
OMPI progress: no, ORTE progress: yes, Event lib: yes)
   Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
 MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
  dl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
 MPI I/O support: yes
   MPI_WTIME support: native
 Symbol vis. support: yes
   Host topology support: yes
  MPI extensions: affinity, 

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-17 Thread Tony Ladd via users
My apologies - I did not read the FAQs carefully enough. With regard to 
item 14:


1. openib

2. Ubuntu supplied drivers etc.

3. Ubuntu 18.04  4.15.0-112-generic

4. opensm-3.3.5_mlnx-0.1.g6b18e73

5. Attached

6. Attached

7. unlimited on foam and 16384 on f34

I changed the ulimit to unlimited on f34 but it did not help.
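
The ulimit in question is the locked-memory limit (ulimit -l), which the 
openib BTL needs for registering memory. I raised it in the standard 
way, with something like this in /etc/security/limits.conf on f34, 
followed by a fresh login:

*  soft  memlock  unlimited
*  hard  memlock  unlimited

If the MPI processes are launched through a daemon, the daemon may need 
its own limit raised as well (e.g. LimitMEMLOCK=infinity for a systemd 
unit).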

Tony

--

Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514

foam:root(ib)> ibv_devinfo
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.9.1000
node_guid:  0002:c903:000f:666e
sys_image_guid: 0002:c903:000f:6671
vendor_id:  0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id:   MT_0D90110009
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid:   26
port_lmc:   0x00
link_layer: InfiniBand

root@f34:/home/tladd# ibv_devinfo 
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.9.1000
node_guid:  0002:c903:000a:af92
sys_image_guid: 0002:c903:000a:af95
vendor_id:  0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id:   MT_0D90110009
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid:   32
port_lmc:   0x00
link_layer: InfiniBand

root@f34:/home/tladd# ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.1.2.34  netmask 255.255.255.0  broadcast 10.1.2.255
        inet6 fe80::862b:2bff:fe18:3729  prefixlen 64  scopeid 0x20<link>
        ether 84:2b:2b:18:37:29  txqueuelen 1000  (Ethernet)
        RX packets 1015244  bytes 146716710 (146.7 MB)
        RX errors 0  dropped 234903  overruns 0  frame 0
        TX packets 176298  bytes 17106041 (17.1 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 10.2.2.34  netmask 255.255.255.0  broadcast 10.2.2.255
        inet6 fe80::202:c903:a:af93  prefixlen 64  scopeid 0x20<link>
        unspec 80-00-02-08-FE-80-00-00-00-00-00-00-00-00-00-00  txqueuelen 256  (UNSPEC)
        RX packets 289257  bytes 333876570 (333.8 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 140385  bytes 324882131 (324.8 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 317853  bytes 21490738 (21.4 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 317853  bytes 21490738 (21.4 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

foam:root(ib)> ifconfig
enp4s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.1.2.251  netmask 255.255.255.0  broadcast 10.1.2.255
        inet6 fe80::ae1f:6bff:feb1:7f02  prefixlen 64  scopeid 0x20<link>
        ether ac:1f:6b:b1:7f:02  txqueuelen 1000  (Ethernet)
        RX packets 1092343  bytes 98282221 (98.2 MB)
        RX errors 0  dropped 176607  overruns 0  frame 0
        TX packets 248746  bytes 206951391 (206.9 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0xf040-f047

enp5s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.2  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::ae1f:6bff:feb1:7f03  prefixlen 64  scopeid 0x20<link>
        ether ac:1f:6b:b1:7f:03  txqueuelen 1000  (Ethernet)
        RX packets 1039387  bytes 87199457 (87.1 MB)
        RX errors 0  dropped 187625  overruns 0  frame 0
        TX packets 5884980  bytes 8649612519 (8.6 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0xf030-f037

enp6s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.227.121.95  netmask 255.255.255.0  broadcast 10.227.121.255
        inet6 fe80::6a05:caff:febd:397c  prefixlen 64  scopeid 0x20<link>
        ether 68:05:ca:bd:39:7c  txqueuelen 1000  (Ethernet)