Hi Greg,

Oh yes, that’s not good about rdmacm.

Yes, the OFED looks pretty old.

Did you by any chance apply that patch? I generated it for a sysadmin here 
who needed to keep maintaining Open MPI 3.1.6 but also had to upgrade to a 
newer RHEL release, and Open MPI wasn’t compiling after the RHEL upgrade.

Howard


From: "Fischer, Greg A." <fisch...@westinghouse.com>
Date: Thursday, October 14, 2021 at 1:47 PM
To: "Pritchard Jr., Howard" <howa...@lanl.gov>, Open MPI Users 
<users@lists.open-mpi.org>
Cc: "Fischer, Greg A." <fisch...@westinghouse.com>
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 
errno says Success"

I added --enable-mt and re-installed UCX. Same result. (I didn’t re-compile 
OpenMPI.)

A conspicuous warning I see in my UCX configure output is:

checking for rdma_establish in -lrdmacm... no
configure: WARNING: RDMACM requested but librdmacm is not found or does not 
provide rdma_establish() API

The version of librdmacm we have comes from 
librdmacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64, which seems to date from 
mid-2017. I wonder if that’s too old?
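
One way to confirm whether the installed librdmacm exports rdma_establish() is 
to look at its dynamic symbol table. A minimal sketch, assuming the library 
lives at /usr/lib64/librdmacm.so (both that path and the has_symbol helper are 
assumptions, not something from the configure output above):

```shell
# Sketch: report whether a shared library exports a given symbol.
# has_symbol is a hypothetical helper; the library path below is an assumption.
has_symbol() {
    # nm -D lists the dynamic symbol table; grep -w matches the name as a whole word
    nm -D "$1" 2>/dev/null | grep -qw "$2"
}

lib=/usr/lib64/librdmacm.so
if has_symbol "$lib" rdma_establish; then
    echo "rdma_establish found in $lib"
else
    echo "rdma_establish not found in $lib"
fi
```

If the symbol is missing, that would be consistent with the configure warning 
and would point at the librdmacm package rather than at the UCX build itself.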

Greg

From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Thursday, October 14, 2021 3:31 PM
To: Fischer, Greg A. <fisch...@westinghouse.com>; Open MPI Users 
<users@lists.open-mpi.org>
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 
errno says Success"

[External Email]
Hi Greg,

I think the UCX PML may be discomfited by the lack of thread safety.

Could you try using contrib/configure-release-mt in your UCX folder? You want 
to add --enable-mt.
That’s the difference that stands out between your configure output and the 
one I usually get when building on a MLNX ConnectX-5 cluster with 
MLNX_OFED_LINUX-4.5-1.0.1.0.

Here’s the output from one of my UCX configs:

configure: =========================================================
configure: UCX build configuration:
configure:         Build prefix:   <foobar>/ucx_testing/ucx/test_install
configure:    Configuration dir:   ${prefix}/etc/ucx
configure:   Preprocessor flags:   -DCPU_FLAGS="" -I${abs_top_srcdir}/src 
-I${abs_top_builddir} -I${abs_top_builddir}/src
configure:           C compiler:   
/users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/gcc
 -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers 
-Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels 
-Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch 
-Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length 
-Wnested-externs -Wshadow -Werror=declaration-after-statement
configure:         C++ compiler:   
/users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/g++
 -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers 
-Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels 
-Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure:         Multi-thread:   enabled
configure:         NUMA support:   disabled
configure:            MPI tests:   disabled
configure:          VFS support:   no
configure:        Devel headers:   no
configure: io_demo CUDA support:   no
configure:             Bindings:   < >
configure:          UCS modules:   < >
configure:          UCT modules:   < ib cma knem >
configure:         CUDA modules:   < >
configure:         ROCM modules:   < >
configure:           IB modules:   < >
configure:          UCM modules:   < >
configure:         Perf modules:   < >
configure: =========================================================
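
The line to compare in this summary is "Multi-thread". A minimal sketch for 
checking it in a saved copy of the configure output, assuming the output was 
captured to a file named configure.out (the file name and the mt_enabled 
helper are assumptions):

```shell
# Sketch: check a saved UCX configure summary for multi-thread support.
# mt_enabled is a hypothetical helper; it reads the configure output on stdin.
mt_enabled() {
    grep -Eq 'Multi-thread:[[:space:]]+enabled'
}

# Assumed capture step: ./contrib/configure-release-mt ... 2>&1 | tee configure.out
if [ -f configure.out ]; then
    if mt_enabled < configure.out; then
        echo "UCX configured with multi-thread support"
    else
        echo "UCX configured without multi-thread support"
    fi
fi
```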


Howard

From: "Fischer, Greg A." <fisch...@westinghouse.com>
Date: Thursday, October 14, 2021 at 12:46 PM
To: "Pritchard Jr., Howard" <howa...@lanl.gov>, Open MPI Users 
<users@lists.open-mpi.org>
Cc: "Fischer, Greg A." <fisch...@westinghouse.com>
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 
errno says Success"

Thanks, Howard.

I downloaded a current version of UCX (1.11.2) and installed it with OpenMPI 
4.1.1. When I specify “-mca pml ucx” for a simple two-process benchmark 
problem, I get:

--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      bl1311
  Framework: pml
--------------------------------------------------------------------------
[bl1311:20168] PML ucx cannot be selected
[bl1311:20169] PML ucx cannot be selected
------------------------------------------------------------

I’ve attached my ucx_info -d output, as well as the ucx configuration 
information. I’m not sure I follow everything on the UCX FAQ page, but it seems 
like everything is being routed over TCP, which is probably not what I want. 
Any thoughts as to what I might be doing wrong?
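
A quick way to see which transports UCX actually detected is to reduce the 
ucx_info -d output to its transport names. A minimal sketch, assuming each 
transport is reported on a line containing "Transport:" (list_transports is a 
hypothetical helper):

```shell
# Sketch: summarize which transports ucx_info -d reports.
# list_transports is a hypothetical helper; it reads ucx_info -d output on stdin.
list_transports() {
    # keep only the transport lines, take the last field, drop duplicates
    grep 'Transport:' | awk '{print $NF}' | sort -u
}

# Only query UCX if ucx_info is on the PATH
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | list_transports
fi
```

If only tcp and shared-memory style transports show up, with no InfiniBand 
transports listed, that would match the impression that everything is being 
routed over TCP.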

Thanks,
Greg

From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Wednesday, October 13, 2021 12:28 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Fischer, Greg A. <fisch...@westinghouse.com>
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 
errno says Success"

Hi Greg,

It’s the aging of the openib btl.

You may be able to apply the attached patch.  Note the 3.1.x release stream is 
no longer supported.

You may want to try using the 4.1.1 release, in which case you’ll want to use 
UCX.

Howard


From: users <users-boun...@lists.open-mpi.org> on behalf of "Fischer, Greg A. 
via users" <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Wednesday, October 13, 2021 at 10:06 AM
To: "users@lists.open-mpi.org" <users@lists.open-mpi.org>
Cc: "Fischer, Greg A." <fisch...@westinghouse.com>
Subject: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno 
says Success"


Hello,

I have compiled OpenMPI 3.1.6 from source on SLES12-SP3, and I am seeing the 
following errors when I try to use the openib btl:

WARNING: There was an error initializing an OpenFabrics device.

  Local host:   bl1308
  Local device: mlx4_0
--------------------------------------------------------------------------
[bl1308][[44866,1],5][../../../../../openmpi-3.1.6/opal/mca/btl/openib/btl_openib_component.c:1671:init_one_device]
 error obtaining device attributes for mlx4_0 errno says Success

I have disabled UCX ("--without-ucx") because the UCX installation we have 
seems to be too out-of-date. ofed_info says "MLNX_OFED_LINUX-4.1-1.0.2.0". I've 
attached the detailed output of ofed_info and ompi_info.

This issue seems similar to Issue #7461 
(https://github.com/open-mpi/ompi/issues/7461), which I don't see a resolution 
for.

Does anyone know what the likely explanation is? Is the version of OFED on the 
system badly out-of-sync with contemporary OpenMPI?

Thanks,
Greg


________________________________

This e-mail may contain proprietary information of the sending organization. 
Any unauthorized or improper disclosure, copying, distribution, or use of the 
contents of this e-mail and attached document(s) is prohibited. The information 
contained in this e-mail and attached document(s) is intended only for the 
personal and private use of the recipient(s) named above. If you have received 
this communication in error, please notify the sender immediately by email and 
delete the original e-mail and attached document(s).

