Hi, I'm trying to build an OpenMPI 5.0.3 environment on a Cray EX HPC system with Slingshot 10 support.
Generally speaking, there were no error messages while building OpenMPI, and make
check also didn't report any failures.
However, when I tested the OpenMPI environment with a simple 'hello world' MPI Fortran
code, it printed the error messages below and caught signal 11 in libucs when I specified
'-mca btl ofi':
No components were able to be opened in the btl framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded
Host: x3001c027b4n0
Framework: btl
-----------------------------------------------------------------------------------------------------
Caught signal 11 ( Segmentation fault: address not mapped to object at address
(nil))
/project/app/ucx/1.12.1/lib/libucs.so.0 (ucs_handle_error+0x134)
This confused me: I am not sure whether OpenMPI was built successfully with full
Slingshot 10 support, or whether it is running over Slingshot 10 properly.
Here is the build environment, on a Cray EX HPC system running SLES 15 SP3:
OpenMPI 5.0.3 + Intel 2022.0.2 + UCX 1.12.1 + libfabric
1.11.0.4.125-SSHOT2.0.0 + mlnx-ofed 5.5.1
Here are my configure options:
--enable-mpi-fortran \
--enable-shared \
--with-pic \
--with-ofi=/opt/cray/libfabric/1.11.0.4.125 \
--with-ofi-libdir=/opt/cray/libfabric/1.11.0.4.125/lib64 \
--with-ucx=/project/app/ucx/1.12.1 \
--with-pmix=internal \
--with-slingshot \
--with-pbs \
--with-tm=/opt/pbs \
--with-singularity=/project/app/singularity/3.10.3 \
--with-lustre=/usr \
CC=icc \
FC=ifort \
CXX=icpc
Here is the output of lspci on the compute nodes:
03:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
24:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
Here is what confuses me:
1. After configuration completed, the pmix summary didn't indicate that Slingshot
support was turned on for the transports.
2. config.log didn't show any checks for Slingshot during the
MCA checks; it only showed that --with-slingshot was passed as an
argument.
3. Looking further into the configure scripts, the only one that
checks for Slingshot support is 3rd-party/openpmix/src/mca/pnet/sshot/configure.m4,
but it looks like it is never called, as config.log doesn't show any checks
for the relevant dependencies, such as CXI and JANSSON. I also believe
that the CXI library is not installed on the machine.
Here are my questions:
1. How can I tell from ompi_info, ucx_info, or other information whether OpenMPI
was built successfully with full Slingshot 10 support?
2. Is the CXI library just an optional package for OpenMPI to get Slingshot 10
support?
3. Which mpirun arguments (cma, pmi, etc.) should be used to make
sure an MPI application runs over Slingshot 10 properly?
4. Which OpenMPI parameters can be used to double-check runtime
information over Slingshot 10?
5. Which OpenMPI parameters can be used to tune performance over
Slingshot 10?
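To make question 1 concrete, this is the kind of check I have in mind: a small sketch that filters ompi_info output down to the btl, mtl and pml components, so one can see whether ofi and ucx appear. I am assuming the usual "MCA <framework>: <component>" line layout, and the function name is hypothetical. Whether the presence of these components actually proves Slingshot 10 support is exactly what I am unsure about.

```shell
#!/bin/sh
# Keep only the btl/mtl/pml component lines from ompi_info-style text on stdin.
# ASSUMPTION: component lines look like "MCA btl: ofi (MCA v..., ...)".
list_transport_components() {
    grep -E 'MCA (btl|mtl|pml):'
}

# Usage (needs ompi_info from the OpenMPI install under test on PATH):
#   ompi_info | list_transport_components
```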
I have also attached the output of 'ompi_info -a' and 'ucx_info -d' for your reference.
I appreciate your time and comments.
Regards
Jerry
<<attachment: ompi_ucx_info.zip>>
