Ryan - I suspect what Sergey was trying to say was that you need to ensure OMPI 
doesn't try to use the OpenIB driver, or at least that it doesn't attempt to 
initialize it. Try adding

OMPI_MCA_pml=ucx

to your environment.
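As a minimal sketch of what that could look like in a bash session before calling srun: the first export is the setting suggested above; the second is an assumption on my part, using the standard MCA exclusion syntax to keep the openib BTL from being opened at all. Neither has been tested on this cluster.

```shell
# Select the UCX PML explicitly (the setting suggested above).
export OMPI_MCA_pml=ucx

# Assumption, not from the thread: exclude the openib BTL so Open MPI
# never tries to initialize it. A leading ^ means "everything except".
export OMPI_MCA_btl='^openib'

# Show the MCA settings that launched MPI processes will inherit.
env | grep '^OMPI_MCA_'
```

Both variables must be exported (not just assigned) so that srun passes them on to the MPI ranks.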


On Jul 29, 2021, at 1:56 AM, Sergey Oblomov via users 
<users@lists.open-mpi.org> wrote:

Hi
 This issue arises from the OpenIB BTL; it is not related to UCX.
 From: users <users-boun...@lists.open-mpi.org> on behalf of Ryan Novosielski 
via users <users@lists.open-mpi.org>
Date: Thursday, 29 July 2021, 08:25
To: users@lists.open-mpi.org <users@lists.open-mpi.org>
Cc: Ryan Novosielski <novos...@rutgers.edu>
Subject: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There 
was an error initializing an OpenFabrics device."

Hi there,

I'm new to UCX; I started using it after building OpenMPI without it, running 
tests, and getting a warning. I installed UCX from the distribution:

[novosirj@amarel-test2 ~]$ rpm -qa ucx
ucx-1.5.2-1.el7.x86_64

…and rebuilt OpenMPI. Built fine. However, I’m getting some pretty unhelpful 
messages about not using the IB card. I looked around the internet some and set 
a couple of environment variables to get a little more information:

export OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
export OMPI_MCA_pml_ucx_verbose=100

Here’s what happens:

[novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
srun: job 13993927 queued and waiting for resources
srun: job 13993927 has been allocated resources
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   gpu004
 Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   gpu004
 Local device: mlx4_0
--------------------------------------------------------------------------
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
Hello world from processor gpu004.amarel.rutgers.edu, rank 0 out of 2 processors
Hello world from processor gpu004.amarel.rutgers.edu, rank 1 out of 2 processors

Here’s the output of a couple more commands that seem to be recommended when 
looking into this:

[novosirj@gpu004 ~]$ ucx_info -d
#
# Memory domain: self
#            component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 8 bytes
#
#   Transport: self
#
#   Device: self
#
#      capabilities:
#            bandwidth: 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8k
#             am_bcopy: <= 8k
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: tcp
#            component: tcp
#
#   Transport: tcp
#
#   Device: eno1
#
#      capabilities:
#            bandwidth: 113.16 MB/sec
#              latency: 5776 nsec
#             overhead: 50000 nsec
#             am_bcopy: <= 8k
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#   Device: ib0
#
#      capabilities:
#            bandwidth: 6239.81 MB/sec
#              latency: 5210 nsec
#             overhead: 50000 nsec
#             am_bcopy: <= 8k
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#
# Memory domain: ib/mlx4_0
#            component: ib
#             register: unlimited, cost: 90 nsec
#           remote key: 16 bytes
#           local memory handle is required for zcopy
#
#   Transport: rc
#
#   Device: mlx4_0:1
#
#      capabilities:
#            bandwidth: 6433.22 MB/sec
#              latency: 900 nsec + 1 * N
#             overhead: 75 nsec
#            put_short: <= 88
#            put_bcopy: <= 8k
#            put_zcopy: <= 1g, up to 6 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 2k
#            get_bcopy: <= 8k
#            get_zcopy: 33..1g, up to 6 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 2k
#             am_short: <= 87
#             am_bcopy: <= 8191
#             am_zcopy: <= 8191, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 2k
#            am header: <= 127
#               domain: device
#           connection: to ep
#             priority: 10
#       device address: 3 bytes
#           ep address: 4 bytes
#       error handling: peer failure
#
#
#   Transport: ud
#
#   Device: mlx4_0:1
#
#      capabilities:
#            bandwidth: 6433.22 MB/sec
#              latency: 910 nsec
#             overhead: 105 nsec
#             am_short: <= 172
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 7 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4k
#            am header: <= 3984
#           connection: to ep, to iface
#             priority: 10
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure
#
#
# Memory domain: rdmacm
#            component: rdmacm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >
#
# Memory domain: sysv
#            component: sysv
#             allocate: unlimited
#           remote key: 32 bytes
#
#   Transport: mm
#
#   Device: sysv
#
#      capabilities:
#            bandwidth: 6911.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 92
#             am_bcopy: <= 8k
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: none
#
#
# Memory domain: posix
#            component: posix
#             allocate: unlimited
#           remote key: 37 bytes
#
#   Transport: mm
#
#   Device: posix
#
#      capabilities:
#            bandwidth: 6911.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 92
#             am_bcopy: <= 8k
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: none
#
#
# Memory domain: cma
#            component: cma
#             register: unlimited, cost: 9 nsec
#
#   Transport: cma
#
#   Device: cma
#
#      capabilities:
#            bandwidth: 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 400 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: none
#

[novosirj@gpu004 ~]$ ucx_info -p -u t
#
# UCP context
#
#            md 0  :  self
#            md 1  :  tcp
#            md 2  :  ib/mlx4_0
#            md 3  :  rdmacm
#            md 4  :  sysv
#            md 5  :  posix
#            md 6  :  cma
#
#      resource 0  :  md 0  dev 0  flags -- self/self
#      resource 1  :  md 1  dev 1  flags -- tcp/eno1
#      resource 2  :  md 1  dev 2  flags -- tcp/ib0
#      resource 3  :  md 2  dev 3  flags -- rc/mlx4_0:1
#      resource 4  :  md 2  dev 3  flags -- ud/mlx4_0:1
#      resource 5  :  md 3  dev 4  flags -s rdmacm/sockaddr
#      resource 6  :  md 4  dev 5  flags -- mm/sysv
#      resource 7  :  md 5  dev 6  flags -- mm/posix
#      resource 8  :  md 6  dev 7  flags -- cma/cma
#
# memory: 0.84MB, file descriptors: 2
# create time: 5.032 ms
#

Thanks for any help you can offer. What am I missing?

--
#BlackLivesMatter
____
|| \\UTGERS,      |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
    `'

