Hi Åke,

On 02-12-2021 14:18, Åke Sandgren wrote:
On 12/2/21 2:06 PM, Ole Holm Nielsen wrote:
These are updated observations of running OpenMPI codes with an
Omni-Path network fabric on AlmaLinux 8.5:

Using the foss-2021b toolchain and OpenMPI/4.1.1-GCC-11.2.0 my trivial
MPI test code works correctly:

$ ml OpenMPI
$ ml

Currently Loaded Modules:
   1) GCCcore/11.2.0                     9) hwloc/2.5.0-GCCcore-11.2.0
   2) zlib/1.2.11-GCCcore-11.2.0        10) OpenSSL/1.1
   3) binutils/2.37-GCCcore-11.2.0      11) libevent/2.1.12-GCCcore-11.2.0
   4) GCC/11.2.0                        12) UCX/1.11.2-GCCcore-11.2.0
   5) numactl/2.0.14-GCCcore-11.2.0     13) libfabric/1.13.2-GCCcore-11.2.0
   6) XZ/5.2.5-GCCcore-11.2.0           14) PMIx/4.1.0-GCCcore-11.2.0
   7) libxml2/2.9.10-GCCcore-11.2.0     15) OpenMPI/4.1.1-GCC-11.2.0
   8) libpciaccess/0.16-GCCcore-11.2.0

$ mpicc mpi_test.c
$ mpirun -n 2 a.out

(null): There are 2 processes
(null): Rank  1:  d008
(null): Rank  0:  d008


I also tried the OpenMPI/4.1.0-GCC-10.2.0 module, but this still gives
the error messages:

$ ml OpenMPI/4.1.0-GCC-10.2.0
$ ml

Currently Loaded Modules:
   1) GCCcore/10.2.0                     8) libpciaccess/0.16-GCCcore-10.2.0
   2) zlib/1.2.11-GCCcore-10.2.0         9) hwloc/2.2.0-GCCcore-10.2.0
   3) binutils/2.35-GCCcore-10.2.0      10) libevent/2.1.12-GCCcore-10.2.0
   4) GCC/10.2.0                        11) UCX/1.9.0-GCCcore-10.2.0
   5) numactl/2.0.13-GCCcore-10.2.0     12) libfabric/1.11.0-GCCcore-10.2.0
   6) XZ/5.2.5-GCCcore-10.2.0           13) PMIx/3.1.5-GCCcore-10.2.0
   7) libxml2/2.9.10-GCCcore-10.2.0     14) OpenMPI/4.1.0-GCC-10.2.0

$ mpicc mpi_test.c
$ mpirun -n 2 a.out
[1638449983.577933] [d008:910356:0]       ib_iface.c:966  UCX  ERROR
ibv_create_cq(cqe=4096) failed: Operation not supported
[1638449983.577827] [d008:910355:0]       ib_iface.c:966  UCX  ERROR
ibv_create_cq(cqe=4096) failed: Operation not supported
[d008.nifl.fysik.dtu.dk:910355] pml_ucx.c:273  Error: Failed to create
UCP worker
[d008.nifl.fysik.dtu.dk:910356] pml_ucx.c:273  Error: Failed to create
UCP worker

(null): There are 2 processes
(null): Rank  0:  d008
(null): Rank  1:  d008

Conclusion: The foss-2021b toolchain with OpenMPI/4.1.1-GCC-11.2.0 seems
to be required on systems with an Omni-Path network fabric on AlmaLinux
8.5.  Perhaps the newer UCX/1.11.2-GCCcore-11.2.0 is really what's
needed, compared to UCX/1.9.0-GCCcore-10.2.0 from foss-2020b.

Does anyone have comments on this?

I think UCX is the problem here, in combination with libfabric. Write a
hook that upgrades the UCX version to 1.11-something if it's older than
that, or that pins that specific version if you have older-and-working
versions.
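
For EasyBuild users following along, the kind of hook Åke describes
could be sketched roughly as below in a hooks.py file. This is only a
hedged sketch, not something tested on this setup: the 1.11.2 target
version is just an example, and the helper names (MIN_UCX,
version_tuple) are made up here.

```python
# Sketch of an EasyBuild parse hook that bumps old UCX dependency
# versions to a newer one that works with Omni-Path on EL8.5.
# Assumption: dependencies are (name, version, ...) tuples, as in
# easyconfig files; MIN_UCX = '1.11.2' is just an example target.

MIN_UCX = '1.11.2'

def version_tuple(version):
    """Turn a dotted version string into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split('.'))

def parse_hook(ec, *args, **kwargs):
    """Replace UCX dependencies older than MIN_UCX with MIN_UCX."""
    deps = ec['dependencies']
    for idx, dep in enumerate(deps):
        name, version = dep[0], dep[1]
        if name == 'UCX' and version_tuple(version) < version_tuple(MIN_UCX):
            # keep any trailing tuple elements (versionsuffix, toolchain)
            deps[idx] = (name, MIN_UCX) + tuple(dep[2:])
```

Pointing eb at this file with --hooks=hooks.py would then rewrite e.g.
a UCX/1.9.0 dependency to 1.11.2 while the easyconfig is parsed, and
leave easyconfigs already at 1.11.x alone.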

You are right: the Omni-Path nodes have libfabric packages from two sources, the EL8.5 BaseOS and the latest Cornelis/Intel Omni-Path drivers:

$ rpm -qa | grep libfabric
libfabric-verbs-1.10.0-2.x86_64
libfabric-1.12.1-1.el8.x86_64
libfabric-devel-1.12.1-1.el8.x86_64
libfabric-psm2-1.10.0-2.x86_64

The 1.12 packages are from EL8.5, and 1.10 packages are from Cornelis.

Regarding UCX, I was first using the trusted foss-2020b toolchain, which includes UCX/1.9.0-GCCcore-10.2.0. I guess that we shouldn't mess with the toolchains?

The foss-2021b toolchain includes the newer UCX 1.11, which seems to solve this particular problem.

Can we make any best-practice recommendations from these observations?

Thanks,
Ole
