Hi,

This warning comes from the openib BTL; it is not related to UCX.

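A minimal sketch of how one might act on that, assuming the openib BTL is not needed on this system: Open MPI reads MCA parameters from `OMPI_MCA_*` environment variables, and a leading `^` in a component list excludes the named components. The launch line in the comment is the one from the quoted message. This only keeps the openib BTL from being initialized; the quoted log also shows the UCX PML closing with "support level is none", which is a separate matter that this setting does not address.

```shell
# Assumption: nothing in this job relies on the openib BTL.
# Exclude it so Open MPI never tries to initialize it (and never prints the warning).
export OMPI_MCA_btl='^openib'

# Then launch as before, for example (command taken from the quoted message):
# srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
```

The same setting can be given per run as `--mca btl ^openib` when launching with `mpirun` instead of `srun`.
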
From: users <users-boun...@lists.open-mpi.org> on behalf of Ryan Novosielski via users <users@lists.open-mpi.org>
Date: Thursday, 29 July 2021, 08:25
To: users@lists.open-mpi.org <users@lists.open-mpi.org>
Cc: Ryan Novosielski <novos...@rutgers.edu>
Subject: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

Hi there,

New to using UCX, as a result of having built OpenMPI without it and running tests and getting warned. Installed UCX from the distribution:

[novosirj@amarel-test2 ~]$ rpm -qa ucx
ucx-1.5.2-1.el7.x86_64

…and rebuilt OpenMPI. Built fine. However, I’m getting some pretty unhelpful messages about not using the IB card. I looked around the internet some and set a couple of environment variables to get a little more information:

OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
export OMPI_MCA_pml_ucx_verbose=100

Here’s what happens:

[novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
srun: job 13993927 queued and waiting for resources
srun: job 13993927 has been allocated resources
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   gpu004
  Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   gpu004
  Local device: mlx4_0
--------------------------------------------------------------------------
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
Hello world from processor gpu004.amarel.rutgers.edu, rank 0 out of 2 processors
Hello world from processor gpu004.amarel.rutgers.edu, rank 1 out of 2 processors

Here’s the output of a couple more commands that seem to be recommended when looking into this:

[novosirj@gpu004 ~]$ ucx_info -d
#
# Memory domain: self
#   component: self
#   register: unlimited, cost: 0 nsec
#   remote key: 8 bytes
#
#   Transport: self
#
#   Device: self
#
#     capabilities:
#       bandwidth: 6911.00 MB/sec
#       latency: 0 nsec
#       overhead: 10 nsec
#       put_short: <= 4294967295
#       put_bcopy: unlimited
#       get_bcopy: unlimited
#       am_short: <= 8k
#       am_bcopy: <= 8k
#       domain: cpu
#       atomic_add: 32, 64 bit
#       atomic_and: 32, 64 bit
#       atomic_or: 32, 64 bit
#       atomic_xor: 32, 64 bit
#       atomic_fadd: 32, 64 bit
#       atomic_fand: 32, 64 bit
#       atomic_for: 32, 64 bit
#       atomic_fxor: 32, 64 bit
#       atomic_swap: 32, 64 bit
#       atomic_cswap: 32, 64 bit
#       connection: to iface
#       priority: 0
#       device address: 0 bytes
#       iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: tcp
#   component: tcp
#
#   Transport: tcp
#
#   Device: eno1
#
#     capabilities:
#       bandwidth: 113.16 MB/sec
#       latency: 5776 nsec
#       overhead: 50000 nsec
#       am_bcopy: <= 8k
#       connection: to iface
#       priority: 1
#       device address: 4 bytes
#       iface address: 2 bytes
#       error handling: none
#
#   Device: ib0
#
#     capabilities:
#       bandwidth: 6239.81 MB/sec
#       latency: 5210 nsec
#       overhead: 50000 nsec
#       am_bcopy: <= 8k
#       connection: to iface
#       priority: 1
#       device address: 4 bytes
#       iface address: 2 bytes
#       error handling: none
#
#
# Memory domain: ib/mlx4_0
#   component: ib
#   register: unlimited, cost: 90 nsec
#   remote key: 16 bytes
#   local memory handle is required for zcopy
#
#   Transport: rc
#
#   Device: mlx4_0:1
#
#     capabilities:
#       bandwidth: 6433.22 MB/sec
#       latency: 900 nsec + 1 * N
#       overhead: 75 nsec
#       put_short: <= 88
#       put_bcopy: <= 8k
#       put_zcopy: <= 1g, up to 6 iov
#       put_opt_zcopy_align: <= 512
#       put_align_mtu: <= 2k
#       get_bcopy: <= 8k
#       get_zcopy: 33..1g, up to 6 iov
#       get_opt_zcopy_align: <= 512
#       get_align_mtu: <= 2k
#       am_short: <= 87
#       am_bcopy: <= 8191
#       am_zcopy: <= 8191, up to 5 iov
#       am_opt_zcopy_align: <= 512
#       am_align_mtu: <= 2k
#       am header: <= 127
#       domain: device
#       connection: to ep
#       priority: 10
#       device address: 3 bytes
#       ep address: 4 bytes
#       error handling: peer failure
#
#
#   Transport: ud
#
#   Device: mlx4_0:1
#
#     capabilities:
#       bandwidth: 6433.22 MB/sec
#       latency: 910 nsec
#       overhead: 105 nsec
#       am_short: <= 172
#       am_bcopy: <= 4088
#       am_zcopy: <= 4088, up to 7 iov
#       am_opt_zcopy_align: <= 512
#       am_align_mtu: <= 4k
#       am header: <= 3984
#       connection: to ep, to iface
#       priority: 10
#       device address: 3 bytes
#       iface address: 3 bytes
#       ep address: 6 bytes
#       error handling: peer failure
#
#
# Memory domain: rdmacm
#   component: rdmacm
#   supports client-server connection establishment via sockaddr
#   < no supported devices found >
#
# Memory domain: sysv
#   component: sysv
#   allocate: unlimited
#   remote key: 32 bytes
#
#   Transport: mm
#
#   Device: sysv
#
#     capabilities:
#       bandwidth: 6911.00 MB/sec
#       latency: 80 nsec
#       overhead: 10 nsec
#       put_short: <= 4294967295
#       put_bcopy: unlimited
#       get_bcopy: unlimited
#       am_short: <= 92
#       am_bcopy: <= 8k
#       domain: cpu
#       atomic_add: 32, 64 bit
#       atomic_and: 32, 64 bit
#       atomic_or: 32, 64 bit
#       atomic_xor: 32, 64 bit
#       atomic_fadd: 32, 64 bit
#       atomic_fand: 32, 64 bit
#       atomic_for: 32, 64 bit
#       atomic_fxor: 32, 64 bit
#       atomic_swap: 32, 64 bit
#       atomic_cswap: 32, 64 bit
#       connection: to iface
#       priority: 0
#       device address: 8 bytes
#       iface address: 16 bytes
#       error handling: none
#
#
# Memory domain: posix
#   component: posix
#   allocate: unlimited
#   remote key: 37 bytes
#
#   Transport: mm
#
#   Device: posix
#
#     capabilities:
#       bandwidth: 6911.00 MB/sec
#       latency: 80 nsec
#       overhead: 10 nsec
#       put_short: <= 4294967295
#       put_bcopy: unlimited
#       get_bcopy: unlimited
#       am_short: <= 92
#       am_bcopy: <= 8k
#       domain: cpu
#       atomic_add: 32, 64 bit
#       atomic_and: 32, 64 bit
#       atomic_or: 32, 64 bit
#       atomic_xor: 32, 64 bit
#       atomic_fadd: 32, 64 bit
#       atomic_fand: 32, 64 bit
#       atomic_for: 32, 64 bit
#       atomic_fxor: 32, 64 bit
#       atomic_swap: 32, 64 bit
#       atomic_cswap: 32, 64 bit
#       connection: to iface
#       priority: 0
#       device address: 8 bytes
#       iface address: 16 bytes
#       error handling: none
#
#
# Memory domain: cma
#   component: cma
#   register: unlimited, cost: 9 nsec
#
#   Transport: cma
#
#   Device: cma
#
#     capabilities:
#       bandwidth: 11145.00 MB/sec
#       latency: 80 nsec
#       overhead: 400 nsec
#       put_zcopy: unlimited, up to 16 iov
#       put_opt_zcopy_align: <= 1
#       put_align_mtu: <= 1
#       get_zcopy: unlimited, up to 16 iov
#       get_opt_zcopy_align: <= 1
#       get_align_mtu: <= 1
#       connection: to iface
#       priority: 0
#       device address: 8 bytes
#       iface address: 4 bytes
#       error handling: none
#

[novosirj@gpu004 ~]$ ucx_info -p -u t
#
# UCP context
#
#   md 0 : self
#   md 1 : tcp
#   md 2 : ib/mlx4_0
#   md 3 : rdmacm
#   md 4 : sysv
#   md 5 : posix
#   md 6 : cma
#
#   resource 0 : md 0 dev 0 flags -- self/self
#   resource 1 : md 1 dev 1 flags -- tcp/eno1
#   resource 2 : md 1 dev 2 flags -- tcp/ib0
#   resource 3 : md 2 dev 3 flags -- rc/mlx4_0:1
#   resource 4 : md 2 dev 3 flags -- ud/mlx4_0:1
#   resource 5 : md 3 dev 4 flags -s rdmacm/sockaddr
#   resource 6 : md 4 dev 5 flags -- mm/sysv
#   resource 7 : md 5 dev 6 flags -- mm/posix
#   resource 8 : md 6 dev 7 flags -- cma/cma
#
# memory: 0.84MB, file descriptors: 2
# create time: 5.032 ms
#

Thanks for any help you can offer. What am I missing?

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'