On 12/2/21 2:06 PM, Ole Holm Nielsen wrote:
> These are updated observations of running OpenMPI codes with an
> Omni-Path network fabric on AlmaLinux 8.5:
>
> Using the foss-2021b toolchain and OpenMPI/4.1.1-GCC-11.2.0 my trivial
> MPI test code works correctly:
>
> $ ml OpenMPI
> $ ml
>
> Currently Loaded Modules:
> 1) GCCcore/11.2.0 9) hwloc/2.5.0-GCCcore-11.2.0
> 2) zlib/1.2.11-GCCcore-11.2.0 10) OpenSSL/1.1
> 3) binutils/2.37-GCCcore-11.2.0 11) libevent/2.1.12-GCCcore-11.2.0
> 4) GCC/11.2.0 12) UCX/1.11.2-GCCcore-11.2.0
> 5) numactl/2.0.14-GCCcore-11.2.0 13) libfabric/1.13.2-GCCcore-11.2.0
> 6) XZ/5.2.5-GCCcore-11.2.0 14) PMIx/4.1.0-GCCcore-11.2.0
> 7) libxml2/2.9.10-GCCcore-11.2.0 15) OpenMPI/4.1.1-GCC-11.2.0
> 8) libpciaccess/0.16-GCCcore-11.2.0
>
> $ mpicc mpi_test.c
> $ mpirun -n 2 a.out
>
> (null): There are 2 processes
>
> (null): Rank 1: d008
>
> (null): Rank 0: d008
>
>
> I also tried the OpenMPI/4.1.0-GCC-10.2.0 module, but this still gives
> the error messages:
>
> $ ml OpenMPI/4.1.0-GCC-10.2.0
> $ ml
>
> Currently Loaded Modules:
> 1) GCCcore/10.2.0 8) libpciaccess/0.16-GCCcore-10.2.0
> 2) zlib/1.2.11-GCCcore-10.2.0 9) hwloc/2.2.0-GCCcore-10.2.0
> 3) binutils/2.35-GCCcore-10.2.0 10) libevent/2.1.12-GCCcore-10.2.0
> 4) GCC/10.2.0 11) UCX/1.9.0-GCCcore-10.2.0
> 5) numactl/2.0.13-GCCcore-10.2.0 12) libfabric/1.11.0-GCCcore-10.2.0
> 6) XZ/5.2.5-GCCcore-10.2.0 13) PMIx/3.1.5-GCCcore-10.2.0
> 7) libxml2/2.9.10-GCCcore-10.2.0 14) OpenMPI/4.1.0-GCC-10.2.0
>
> $ mpicc mpi_test.c
> $ mpirun -n 2 a.out
> [1638449983.577933] [d008:910356:0] ib_iface.c:966 UCX ERROR ibv_create_cq(cqe=4096) failed: Operation not supported
> [1638449983.577827] [d008:910355:0] ib_iface.c:966 UCX ERROR ibv_create_cq(cqe=4096) failed: Operation not supported
> [d008.nifl.fysik.dtu.dk:910355] pml_ucx.c:273 Error: Failed to create UCP worker
> [d008.nifl.fysik.dtu.dk:910356] pml_ucx.c:273 Error: Failed to create UCP worker
>
> (null): There are 2 processes
>
> (null): Rank 0: d008
>
> (null): Rank 1: d008
>
> Conclusion: The foss-2021b toolchain with OpenMPI/4.1.1-GCC-11.2.0 seems
> to be required on systems with an Omni-Path network fabric on AlmaLinux
> 8.5. Perhaps the newer UCX/1.11.2-GCCcore-11.2.0 is really what's
> needed, compared to UCX/1.9.0-GCCcore-10.2.0 from foss-2020b.
>
> Does anyone have comments on this?
I think the problem here is UCX in combination with libfabric. Write a
hook that upgrades UCX to 1.11-something whenever the requested version
is older than roughly 1.11, or that replaces only the specific failing
version if you have older versions that still work.
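
The version check such a hook needs could look roughly like the sketch
below. It is only an illustration: the function names, the (name,
version) tuple layout of the dependency list and the 1.11.2 threshold
(taken from the working foss-2021b stack quoted above) are assumptions,
and the wiring into an actual EasyBuild hook is not shown.

```python
# Hypothetical helper for a site hook: bump UCX dependencies that are too old.
# MIN_UCX is an assumption, taken from the foss-2021b stack that worked above.
MIN_UCX = "1.11.2"


def version_key(version):
    """Turn '1.9.0' into (1, 9, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def upgrade_ucx(dependencies, min_version=MIN_UCX):
    """Return a copy of the dependency list with any older UCX bumped.

    Each dependency is assumed to be a tuple starting with (name, version);
    any extra tuple elements are carried over unchanged.
    """
    upgraded = []
    for dep in dependencies:
        name, version = dep[0], dep[1]
        if name == "UCX" and version_key(version) < version_key(min_version):
            dep = (name, min_version) + tuple(dep[2:])
        upgraded.append(dep)
    return upgraded


if __name__ == "__main__":
    deps_2020b = [("hwloc", "2.2.0"), ("UCX", "1.9.0"), ("libfabric", "1.11.0")]
    print(upgrade_ucx(deps_2020b))
```

The comparison is done on integer tuples because a plain string
comparison would rank "1.9.0" above "1.11.2". The replacement UCX
version also has to be available for the older GCCcore, which this
sketch does not check.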
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se