Thanks, Ralph. This /does/ change things, but not very much. I was not under the impression that I needed to do that, since when I ran a build without UCX support, it warned me that the openib method is deprecated. Does OpenMPI no longer use either by default, so that I need to request UCX explicitly? Seems strange.

Anyhow, I’ve got some variables defined still, in addition to your suggestion, for verbosity:

[novosirj@amarel-test2 ~]$ env | grep ^OMPI
OMPI_MCA_pml=ucx
OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
OMPI_MCA_pml_ucx_verbose=100

Here goes:

[novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
srun: job 13995650 queued and waiting for resources
srun: job 13995650 has been allocated resources
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

Local host: gpu004
Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

Local host: gpu004
Local device: mlx4_0
--------------------------------------------------------------------------
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
[gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
[gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
[gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
[gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

Host: gpu004
Framework: pml
--------------------------------------------------------------------------
[gpu004.amarel.rutgers.edu:29823] PML ucx cannot be selected
[gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
[gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
[gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
[gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

Host: gpu004
Framework: pml
--------------------------------------------------------------------------
[gpu004.amarel.rutgers.edu:29824] PML ucx cannot be selected
slurmstepd: error: *** STEP 13995650.0 ON gpu004 CANCELLED AT 2021-07-29T11:31:19 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: gpu004: tasks 0-1: Exited with exit code 1

--
#BlackLivesMatter
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

> On Jul 29, 2021, at 8:34 AM, Ralph Castain via users
> <users@lists.open-mpi.org> wrote:
>
> Ryan - I suspect what Sergey was trying to say was that you need to ensure
> OMPI doesn't try to use the OpenIB driver, or at least that it doesn't
> attempt to initialize it. Try adding
>
> OMPI_MCA_pml=ucx
>
> to your environment.
>
>
>> On Jul 29, 2021, at 1:56 AM, Sergey Oblomov via users
>> <users@lists.open-mpi.org> wrote:
>>
>> Hi
>>
>> This issue arrives from BTL OpenIB, not related to UCX
>>
>> From: users <users-boun...@lists.open-mpi.org> on behalf of Ryan Novosielski
>> via users <users@lists.open-mpi.org>
>> Date: Thursday, 29 July 2021, 08:25
>> To: users@lists.open-mpi.org <users@lists.open-mpi.org>
>> Cc: Ryan Novosielski <novos...@rutgers.edu>
>> Subject: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There
>> was an error initializing an OpenFabrics device."
>>
>> Hi there,
>>
>> New to using UCX, as a result of having built OpenMPI without it and running
>> tests and getting warned. Installed UCX from the distribution:
>>
>> [novosirj@amarel-test2 ~]$ rpm -qa ucx
>> ucx-1.5.2-1.el7.x86_64
>>
>> …and rebuilt OpenMPI. Built fine. However, I’m getting some pretty unhelpful
>> messages about not using the IB card. I looked around the internet some and
>> set a couple of environment variables to get a little more information:
>>
>> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>> export OMPI_MCA_pml_ucx_verbose=100
>>
>> Here’s what happens:
>>
>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX
>> ./mpihello-gcc-8-openmpi-4.0.6
>> srun: job 13993927 queued and waiting for resources
>> srun: job 13993927 has been allocated resources
>> --------------------------------------------------------------------------
>> WARNING: There was an error initializing an OpenFabrics device.
>>
>> Local host: gpu004
>> Local device: mlx4_0
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> WARNING: There was an error initializing an OpenFabrics device.
>>
>> Local host: gpu004
>> Local device: mlx4_0
>> --------------------------------------------------------------------------
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL
>> memory hooks as external events
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197
>> mca_pml_ucx_open: UCX version 1.5.2
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL
>> memory hooks as external events
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197
>> mca_pml_ucx_open: UCX version 1.5.2
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self:
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1:
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self:
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0:
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304
>> rc/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304
>> ud/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv:
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix:
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma:
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support
>> level is none
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1:
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0:
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304
>> rc/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304
>> ud/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv:
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix:
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma:
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support
>> level is none
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
>> [gpu004.amarel.rutgers.edu:02326]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL
>> memory hooks as external events
>> [gpu004.amarel.rutgers.edu:02327]
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL
>> memory hooks as external events
>> Hello world from processor gpu004.amarel.rutgers.edu, rank 0 out of 2
>> processors
>> Hello world from processor gpu004.amarel.rutgers.edu, rank 1 out of 2
>> processors
>>
>> Here’s the output of a couple more commands that seem to be recommended when
>> looking into this:
>>
>> [novosirj@gpu004 ~]$ ucx_info -d
>> #
>> # Memory domain: self
>> # component: self
>> # register: unlimited, cost: 0 nsec
>> # remote key: 8 bytes
>> #
>> # Transport: self
>> #
>> # Device: self
>> #
>> # capabilities:
>> # bandwidth: 6911.00 MB/sec
>> # latency: 0 nsec
>> # overhead: 10 nsec
>> # put_short: <= 4294967295
>> # put_bcopy: unlimited
>> # get_bcopy: unlimited
>> # am_short: <= 8k
>> # am_bcopy: <= 8k
>> # domain: cpu
>> # atomic_add: 32, 64 bit
>> # atomic_and: 32, 64 bit
>> # atomic_or: 32, 64 bit
>> # atomic_xor: 32, 64 bit
>> # atomic_fadd: 32, 64 bit
>> # atomic_fand: 32, 64 bit
>> # atomic_for: 32, 64 bit
>> # atomic_fxor: 32, 64 bit
>> # atomic_swap: 32, 64 bit
>> # atomic_cswap: 32, 64 bit
>> # connection: to iface
>> # priority: 0
>> # device address: 0 bytes
>> # iface address: 8 bytes
>> # error handling: none
>> #
>> #
>> # Memory domain: tcp
>> # component: tcp
>> #
>> # Transport: tcp
>> #
>> # Device: eno1
>> #
>> # capabilities:
>> # bandwidth: 113.16 MB/sec
>> # latency: 5776 nsec
>> # overhead: 50000 nsec
>> # am_bcopy: <= 8k
>> # connection: to iface
>> # priority: 1
>> # device address: 4 bytes
>> # iface address: 2 bytes
>> # error handling: none
>> #
>> # Device: ib0
>> #
>> # capabilities:
>> # bandwidth: 6239.81 MB/sec
>> # latency: 5210 nsec
>> # overhead: 50000 nsec
>> # am_bcopy: <= 8k
>> # connection: to iface
>> # priority: 1
>> # device address: 4 bytes
>> # iface address: 2 bytes
>> # error handling: none
>> #
>> #
>> # Memory domain: ib/mlx4_0
>> # component: ib
>> # register: unlimited, cost: 90 nsec
>> # remote key: 16 bytes
>> # local memory handle is required for zcopy
>> #
>> # Transport: rc
>> #
>> # Device: mlx4_0:1
>> #
>> # capabilities:
>> # bandwidth: 6433.22 MB/sec
>> # latency: 900 nsec + 1 * N
>> # overhead: 75 nsec
>> # put_short: <= 88
>> # put_bcopy: <= 8k
>> # put_zcopy: <= 1g, up to 6 iov
>> # put_opt_zcopy_align: <= 512
>> # put_align_mtu: <= 2k
>> # get_bcopy: <= 8k
>> # get_zcopy: 33..1g, up to 6 iov
>> # get_opt_zcopy_align: <= 512
>> # get_align_mtu: <= 2k
>> # am_short: <= 87
>> # am_bcopy: <= 8191
>> # am_zcopy: <= 8191, up to 5 iov
>> # am_opt_zcopy_align: <= 512
>> # am_align_mtu: <= 2k
>> # am header: <= 127
>> # domain: device
>> # connection: to ep
>> # priority: 10
>> # device address: 3 bytes
>> # ep address: 4 bytes
>> # error handling: peer failure
>> #
>> #
>> # Transport: ud
>> #
>> # Device: mlx4_0:1
>> #
>> # capabilities:
>> # bandwidth: 6433.22 MB/sec
>> # latency: 910 nsec
>> # overhead: 105 nsec
>> # am_short: <= 172
>> # am_bcopy: <= 4088
>> # am_zcopy: <= 4088, up to 7 iov
>> # am_opt_zcopy_align: <= 512
>> # am_align_mtu: <= 4k
>> # am header: <= 3984
>> # connection: to ep, to iface
>> # priority: 10
>> # device address: 3 bytes
>> # iface address: 3 bytes
>> # ep address: 6 bytes
>> # error handling: peer failure
>> #
>> #
>> # Memory domain: rdmacm
>> # component: rdmacm
>> # supports client-server connection establishment via sockaddr
>> # < no supported devices found >
>> #
>> # Memory domain: sysv
>> # component: sysv
>> # allocate: unlimited
>> # remote key: 32 bytes
>> #
>> # Transport: mm
>> #
>> # Device: sysv
>> #
>> # capabilities:
>> # bandwidth: 6911.00 MB/sec
>> # latency: 80 nsec
>> # overhead: 10 nsec
>> # put_short: <= 4294967295
>> # put_bcopy: unlimited
>> # get_bcopy: unlimited
>> # am_short: <= 92
>> # am_bcopy: <= 8k
>> # domain: cpu
>> # atomic_add: 32, 64 bit
>> # atomic_and: 32, 64 bit
>> # atomic_or: 32, 64 bit
>> # atomic_xor: 32, 64 bit
>> # atomic_fadd: 32, 64 bit
>> # atomic_fand: 32, 64 bit
>> # atomic_for: 32, 64 bit
>> # atomic_fxor: 32, 64 bit
>> # atomic_swap: 32, 64 bit
>> # atomic_cswap: 32, 64 bit
>> # connection: to iface
>> # priority: 0
>> # device address: 8 bytes
>> # iface address: 16 bytes
>> # error handling: none
>> #
>> #
>> # Memory domain: posix
>> # component: posix
>> # allocate: unlimited
>> # remote key: 37 bytes
>> #
>> # Transport: mm
>> #
>> # Device: posix
>> #
>> # capabilities:
>> # bandwidth: 6911.00 MB/sec
>> # latency: 80 nsec
>> # overhead: 10 nsec
>> # put_short: <= 4294967295
>> # put_bcopy: unlimited
>> # get_bcopy: unlimited
>> # am_short: <= 92
>> # am_bcopy: <= 8k
>> # domain: cpu
>> # atomic_add: 32, 64 bit
>> # atomic_and: 32, 64 bit
>> # atomic_or: 32, 64 bit
>> # atomic_xor: 32, 64 bit
>> # atomic_fadd: 32, 64 bit
>> # atomic_fand: 32, 64 bit
>> # atomic_for: 32, 64 bit
>> # atomic_fxor: 32, 64 bit
>> # atomic_swap: 32, 64 bit
>> # atomic_cswap: 32, 64 bit
>> # connection: to iface
>> # priority: 0
>> # device address: 8 bytes
>> # iface address: 16 bytes
>> # error handling: none
>> #
>> #
>> # Memory domain: cma
>> # component: cma
>> # register: unlimited, cost: 9 nsec
>> #
>> # Transport: cma
>> #
>> # Device: cma
>> #
>> # capabilities:
>> # bandwidth: 11145.00 MB/sec
>> # latency: 80 nsec
>> # overhead: 400 nsec
>> # put_zcopy: unlimited, up to 16 iov
>> # put_opt_zcopy_align: <= 1
>> # put_align_mtu: <= 1
>> # get_zcopy: unlimited, up to 16 iov
>> # get_opt_zcopy_align: <= 1
>> # get_align_mtu: <= 1
>> # connection: to iface
>> # priority: 0
>> # device address: 8 bytes
>> # iface address: 4 bytes
>> # error handling: none
>> #
>>
>> [novosirj@gpu004 ~]$ ucx_info -p -u t
>> #
>> # UCP context
>> #
>> # md 0 : self
>> # md 1 : tcp
>> # md 2 : ib/mlx4_0
>> # md 3 : rdmacm
>> # md 4 : sysv
>> # md 5 : posix
>> # md 6 : cma
>> #
>> # resource 0 : md 0 dev 0 flags -- self/self
>> # resource 1 : md 1 dev 1 flags -- tcp/eno1
>> # resource 2 : md 1 dev 2 flags -- tcp/ib0
>> # resource 3 : md 2 dev 3 flags -- rc/mlx4_0:1
>> # resource 4 : md 2 dev 3 flags -- ud/mlx4_0:1
>> # resource 5 : md 3 dev 4 flags -s rdmacm/sockaddr
>> # resource 6 : md 4 dev 5 flags -- mm/sysv
>> # resource 7 : md 5 dev 6 flags -- mm/posix
>> # resource 8 : md 6 dev 7 flags -- cma/cma
>> #
>> # memory: 0.84MB, file descriptors: 2
>> # create time: 5.032 ms
>> #
>>
>> Thanks for any help you can offer. What am I missing?
>>
>> --
>> #BlackLivesMatter
>> ____
>> || \\UTGERS, |---------------------------*O*---------------------------
>> ||_// the State | Ryan Novosielski - novos...@rutgers.edu
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
>> `'
>>
>
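
A side note for anyone reading the verbose output in this thread: the "did not match transport list" and "support level is" lines can be tallied per process with a short script. What follows is only a sketch in Python. The function name summarize, the two regular expressions, and the SAMPLE text (three lines copied from the 29823 process in the log above) are illustrative assumptions, not part of Open MPI or UCX, and the patterns only cover the message shapes seen in this thread.

```python
import re
from collections import defaultdict

# Hypothetical patterns: they match only the two common_ucx.c message
# shapes that appear in the pml_ucx_verbose output quoted in this thread.
REJECTED = re.compile(
    r"\[(?P<host>[^\]:]+):(?P<pid>\d+)\]\s+\S*common_ucx\.c:\d+\s+"
    r"(?P<resource>\S+): did not match transport list"
)
LEVEL = re.compile(
    r"\[(?P<host>[^\]:]+):(?P<pid>\d+)\]\s+\S*common_ucx\.c:\d+\s+"
    r"support level is (?P<level>\w+)"
)


def summarize(log_text):
    """Group rejected UCX resources and the reported support level by PID."""
    summary = defaultdict(lambda: {"rejected": [], "level": None})
    for m in REJECTED.finditer(log_text):
        summary[m.group("pid")]["rejected"].append(m.group("resource"))
    for m in LEVEL.finditer(log_text):
        summary[m.group("pid")]["level"] = m.group("level")
    return dict(summary)


# Three lines taken from the unquoted log (process 29823).
SAMPLE = """\
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
"""

if __name__ == "__main__":
    for pid, info in summarize(SAMPLE).items():
        print(pid, info["level"], ", ".join(info["rejected"]))
```

The script expects unquoted log text with each message on one line, so the ">>"-quoted and re-wrapped copy of the log would need its quote markers stripped and its lines rejoined first.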