On 12/2/21 2:06 PM, Ole Holm Nielsen wrote:
> These are updated observations of running OpenMPI codes with an
> Omni-Path network fabric on AlmaLinux 8.5:
> 
> Using the foss-2021b toolchain and OpenMPI/4.1.1-GCC-11.2.0 my trivial
> MPI test code works correctly:
> 
> $ ml OpenMPI
> $ ml
> 
> Currently Loaded Modules:
>   1) GCCcore/11.2.0                     9) hwloc/2.5.0-GCCcore-11.2.0
>   2) zlib/1.2.11-GCCcore-11.2.0        10) OpenSSL/1.1
>   3) binutils/2.37-GCCcore-11.2.0      11) libevent/2.1.12-GCCcore-11.2.0
>   4) GCC/11.2.0                        12) UCX/1.11.2-GCCcore-11.2.0
>   5) numactl/2.0.14-GCCcore-11.2.0     13) libfabric/1.13.2-GCCcore-11.2.0
>   6) XZ/5.2.5-GCCcore-11.2.0           14) PMIx/4.1.0-GCCcore-11.2.0
>   7) libxml2/2.9.10-GCCcore-11.2.0     15) OpenMPI/4.1.1-GCC-11.2.0
>   8) libpciaccess/0.16-GCCcore-11.2.0
> 
> $ mpicc mpi_test.c
> $ mpirun -n 2 a.out
> 
> (null): There are 2 processes
> 
> (null): Rank  1:  d008
> 
> (null): Rank  0:  d008
> 
> 
> I also tried the OpenMPI/4.1.0-GCC-10.2.0 module, but this still gives
> the error messages:
> 
> $ ml OpenMPI/4.1.0-GCC-10.2.0
> $ ml
> 
> Currently Loaded Modules:
>   1) GCCcore/10.2.0                     8) libpciaccess/0.16-GCCcore-10.2.0
>   2) zlib/1.2.11-GCCcore-10.2.0         9) hwloc/2.2.0-GCCcore-10.2.0
>   3) binutils/2.35-GCCcore-10.2.0      10) libevent/2.1.12-GCCcore-10.2.0
>   4) GCC/10.2.0                        11) UCX/1.9.0-GCCcore-10.2.0
>   5) numactl/2.0.13-GCCcore-10.2.0     12) libfabric/1.11.0-GCCcore-10.2.0
>   6) XZ/5.2.5-GCCcore-10.2.0           13) PMIx/3.1.5-GCCcore-10.2.0
>   7) libxml2/2.9.10-GCCcore-10.2.0     14) OpenMPI/4.1.0-GCC-10.2.0
> 
> $ mpicc mpi_test.c
> $ mpirun -n 2 a.out
> [1638449983.577933] [d008:910356:0]       ib_iface.c:966  UCX  ERROR ibv_create_cq(cqe=4096) failed: Operation not supported
> [1638449983.577827] [d008:910355:0]       ib_iface.c:966  UCX  ERROR ibv_create_cq(cqe=4096) failed: Operation not supported
> [d008.nifl.fysik.dtu.dk:910355] pml_ucx.c:273  Error: Failed to create UCP worker
> [d008.nifl.fysik.dtu.dk:910356] pml_ucx.c:273  Error: Failed to create UCP worker
> 
> (null): There are 2 processes
> 
> (null): Rank  0:  d008
> 
> (null): Rank  1:  d008
> 
> Conclusion: The foss-2021b toolchain with OpenMPI/4.1.1-GCC-11.2.0 seems
> to be required on systems with an Omni-Path network fabric on AlmaLinux
> 8.5.  Perhaps the newer UCX/1.11.2-GCCcore-11.2.0 is really what's
> needed, compared to UCX/1.9.0-GCCcore-10.2.0 from foss-2020b.
> 
> Does anyone have comments on this?

I think the problem here is UCX, in combination with libfabric. Write an
EasyBuild hook that upgrades UCX to 1.11-something whenever the requested
version is older than roughly 1.11, or one that replaces only that
specific version (1.9.0) if you have even older versions that work.
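
A rough sketch of such a hook, not something I have tested on your
setup. The target version 1.11.2 is taken from your working foss-2021b
build; the helper names version_key and bump_ucx are made up for
illustration. It also assumes that the dependencies are still plain
(name, version, ...) tuples when the parse hook runs, so check that
against your EasyBuild version before relying on it:

```python
# hooks.py: sketch of an EasyBuild hooks module (helper names are hypothetical)

MIN_UCX = "1.11.2"  # assumption: the UCX version that worked with foss-2021b


def version_key(version):
    """Turn '1.9.0' into (1, 9, 0); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in version.split("."))


def bump_ucx(dependencies, minimum=MIN_UCX):
    """Return a copy of the dependency list with any older UCX replaced by `minimum`."""
    bumped = []
    for dep in dependencies:
        name, version = dep[0], dep[1]
        if name == "UCX" and version_key(version) < version_key(minimum):
            # keep any extra fields (versionsuffix, toolchain) untouched
            dep = (name, minimum) + tuple(dep[2:])
        bumped.append(dep)
    return bumped


def parse_hook(ec, *args, **kwargs):
    """Parse hook: rewrite the UCX dependency of OpenMPI easyconfigs."""
    # assumption: ec['dependencies'] holds raw tuples at this stage
    if ec.name == "OpenMPI":
        ec["dependencies"] = bump_ucx(ec["dependencies"])
```

Point EasyBuild at the file with the --hooks option. To replace only the
one broken version instead, compare against "1.9.0" with == rather than
using <. Either way, an easyconfig for the newer UCX has to exist for the
GCCcore version in question.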

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
