I would very much appreciate some advice on how to debug this problem. I
am trying to get OpenMPI to work on my reconfigured cluster, which I am
upgrading from CentOS 5 to Ubuntu 18. The problem is that a simple job
using Intel's IMB message-passing test code will not run on any of the
new clients (4 so far).
mpirun -np 2 IMB-MPI1
just hangs - no printout, no messages in syslog. I left it for an hour
and it remained in the same state.
On the other hand the same code runs fine on the server (see outfoam).
Comparing the two, it seems the client version hangs while trying to
load the openib module (it works with tcp,self or vader,self).
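For completeness, the runs that do complete look roughly like this (the exact flags are reconstructed from memory, so treat this as a sketch rather than a verbatim transcript):

```shell
# Force the TCP transport instead of InfiniBand -- this combination runs:
mpirun --mca btl tcp,self -np 2 IMB-MPI1

# The shared-memory (vader) transport also works for two ranks on one node:
mpirun --mca btl vader,self -np 2 IMB-MPI1
```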
Digging a bit more I found the --mca btl_base_verbose option. Now I can
see a difference between the two cases:
On the server: ibv_obj->logical_index=1, my_obj->logical_index=0
On the client: ibv_obj->type is set to NULL. I don't believe this is a
good sign, but I don't understand what it means. My guess is that openib
is not being initialized in this case.
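For reference, the verbose output quoted above came from a run along these lines (again reconstructed, not a verbatim transcript):

```shell
# Select the openib BTL explicitly and turn up BTL debugging output:
mpirun --mca btl openib,self --mca btl_base_verbose 100 -np 2 IMB-MPI1
```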
The server (foam) is a SuperMicro server with an X10DAi motherboard and
2x E5-2630 CPUs (10 core).
The client (f34) is a Dell R410 server with 2x E5620 CPUs (4 core). The
outputs from ompi_info are attached.
They are both running Ubuntu 18.04 with the latest updates. I installed
openmpi-bin 2.1.1-8. Both boxes have Mellanox ConnectX-2 cards with the
latest firmware (2.9.1000). I have checked that the cards send and
receive packets using the IB protocols and pass the Mellanox diagnostics.
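The checks I ran were along these lines, using the standard tools from the infiniband-diags / ibverbs-utils packages (my notes, not a transcript):

```shell
# Port state, rate and link layer for each HCA:
ibstat

# Device and port attributes as seen by libibverbs:
ibv_devinfo -v

# Simple RC ping-pong between the two hosts (start the server side first):
ibv_rc_pingpong            # on foam
ibv_rc_pingpong foam       # on f34
```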
I did notice that the Mellanox card is at PCI address 81:00.0 on the
server but 03:00.0 on the client. I am not sure of the significance of
this.
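The PCI addresses came from lspci; assuming the standard tool, something like:

```shell
# Shows the Mellanox HCA and its bus address (81:00.0 vs 03:00.0 here):
lspci | grep -i mellanox
```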
Any help anyone can offer would be much appreciated. I am stuck.
Thanks
Tony
--
Tony Ladd
Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA
Email: tladd-"(AT)"-che.ufl.edu
Web: http://ladd.che.ufl.edu
Tel: (352)-392-6509
FAX: (352)-392-9514
Package: Open MPI buildd@lcy01-amd64-009 Distribution
Open MPI: 2.1.1
Open MPI repo revision: v2.1.0-100-ga2fdb5b
Open MPI release date: May 10, 2017
Open RTE: 2.1.1
Open RTE repo revision: v2.1.0-100-ga2fdb5b
Open RTE release date: May 10, 2017
OPAL: 2.1.1
OPAL repo revision: v2.1.0-100-ga2fdb5b
OPAL release date: May 10, 2017
MPI API: 3.1.0
Ident string: 2.1.1
Prefix: /usr
Configured architecture: x86_64-pc-linux-gnu
Configure host: lcy01-amd64-009
Configured by: buildd
Configured on: Mon Feb 5 19:59:59 UTC 2018
Configure host: lcy01-amd64-009
Built by: buildd
Built on: Mon Feb 5 20:05:56 UTC 2018
Built host: lcy01-amd64-009
C bindings: yes
C++ bindings: yes
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
limitations in the gfortran compiler, does not support the following: array
subsections, direct passthru (where possible) to underlying Open MPI's C
functionality
Fort mpi_f08 subarrays: no
Java bindings: yes
Wrapper compiler rpath: disabled
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C compiler family name: GNU
C compiler version: 7.3.0
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fort compiler: gfortran
Fort compiler abs: /usr/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
C profiling: yes
C++ profiling: yes
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
OMPI progress: no, ORTE progress: yes, Event lib: yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: yes
mpirun default --prefix: no
MPI I/O support: yes
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
MPI extensions: affinity,