So it _is_ UCX that is the problem! Try using OMPI_MCA_pml=ob1 instead
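[Editor's note: the suggested workaround is just an environment change before relaunching. A minimal sketch, assuming a bash-like shell; the srun invocation mirrors the one used elsewhere in this thread, and the partition/reservation names are site-specific. The `^openib` exclusion is an optional extra to silence the deprecated openib BTL, not something Ralph asked for.]

```shell
# Select the ob1 PML instead of UCX for the next MPI run.
export OMPI_MCA_pml=ob1        # point-to-point messaging layer: ob1, not ucx
export OMPI_MCA_btl=^openib    # optionally also exclude the deprecated openib BTL
echo "pml=$OMPI_MCA_pml btl=$OMPI_MCA_btl"
# Then relaunch, e.g.:
# srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
```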
> On Jul 29, 2021, at 8:33 AM, Ryan Novosielski <novos...@rutgers.edu> wrote:
>
> Thanks, Ralph. This /does/ change things, but not very much. I was not under the impression that I needed to do that, since when I ran without having built against UCX, it warned me about the openib method being deprecated. By default, does OpenMPI not use either anymore, so that I need to ask for UCX specifically? Seems strange.
>
> Anyhow, I’ve still got some variables defined, in addition to your suggestion, for verbosity:
>
> [novosirj@amarel-test2 ~]$ env | grep ^OMPI
> OMPI_MCA_pml=ucx
> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
> OMPI_MCA_pml_ucx_verbose=100
>
> Here goes:
>
> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
> srun: job 13995650 queued and waiting for resources
> srun: job 13995650 has been allocated resources
> --------------------------------------------------------------------------
> WARNING: There was an error initializing an OpenFabrics device.
>
>   Local host:   gpu004
>   Local device: mlx4_0
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> WARNING: There was an error initializing an OpenFabrics device.
>
>   Local host:   gpu004
>   Local device: mlx4_0
> --------------------------------------------------------------------------
> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
> --------------------------------------------------------------------------
> No components were able to be opened in the pml framework.
>
> This typically means that either no components of this type were
> installed, or none of the installed components can be loaded.
> Sometimes this means that shared libraries required by these
> components are unable to be found/loaded.
>
>   Host:      gpu004
>   Framework: pml
> --------------------------------------------------------------------------
> [gpu004.amarel.rutgers.edu:29823] PML ucx cannot be selected
> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
> --------------------------------------------------------------------------
> No components were able to be opened in the pml framework.
>
> This typically means that either no components of this type were
> installed, or none of the installed components can be loaded.
> Sometimes this means that shared libraries required by these
> components are unable to be found/loaded.
>
>   Host:      gpu004
>   Framework: pml
> --------------------------------------------------------------------------
> [gpu004.amarel.rutgers.edu:29824] PML ucx cannot be selected
> slurmstepd: error: *** STEP 13995650.0 ON gpu004 CANCELLED AT 2021-07-29T11:31:19 ***
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: gpu004: tasks 0-1: Exited with exit code 1
>
> --
> #BlackLivesMatter
>  ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>      `'
>
>> On Jul 29, 2021, at 8:34 AM, Ralph Castain via users <users@lists.open-mpi.org> wrote:
>>
>> Ryan - I suspect what Sergey was trying to say was that you need to ensure OMPI doesn't try to use the OpenIB driver, or at least that it doesn't attempt to initialize it. Try adding
>>
>> OMPI_MCA_pml=ucx
>>
>> to your environment.
>>
>>
>>> On Jul 29, 2021, at 1:56 AM, Sergey Oblomov via users <users@lists.open-mpi.org> wrote:
>>>
>>> Hi
>>>
>>> This issue arrives from BTL OpenIB, not related to UCX
>>>
>>> From: users <users-boun...@lists.open-mpi.org> on behalf of Ryan Novosielski via users <users@lists.open-mpi.org>
>>> Date: Thursday, 29 July 2021, 08:25
>>> To: users@lists.open-mpi.org <users@lists.open-mpi.org>
>>> Cc: Ryan Novosielski <novos...@rutgers.edu>
>>> Subject: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."
>>>
>>> Hi there,
>>>
>>> New to using UCX, as a result of having built OpenMPI without it, running tests, and getting warned. Installed UCX from the distribution:
>>>
>>> [novosirj@amarel-test2 ~]$ rpm -qa ucx
>>> ucx-1.5.2-1.el7.x86_64
>>>
>>> …and rebuilt OpenMPI. Built fine. However, I’m getting some pretty unhelpful messages about not using the IB card. I looked around the internet some and set a couple of environment variables to get a little more information:
>>>
>>> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>>> export OMPI_MCA_pml_ucx_verbose=100
>>>
>>> Here’s what happens:
>>>
>>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
>>> srun: job 13993927 queued and waiting for resources
>>> srun: job 13993927 has been allocated resources
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>>   Local host:   gpu004
>>>   Local device: mlx4_0
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>>   Local host:   gpu004
>>>   Local device: mlx4_0
>>> --------------------------------------------------------------------------
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>> Hello world from processor gpu004.amarel.rutgers.edu, rank 0 out of 2 processors
>>> Hello world from processor gpu004.amarel.rutgers.edu, rank 1 out of 2 processors
>>>
>>> Here’s the output of a couple more commands that seem to be recommended when looking into this:
>>>
>>> [novosirj@gpu004 ~]$ ucx_info -d
>>> #
>>> # Memory domain: self
>>> #     component: self
>>> #     register: unlimited, cost: 0 nsec
>>> #     remote key: 8 bytes
>>> #
>>> #   Transport: self
>>> #
>>> #   Device: self
>>> #
>>> #   capabilities:
>>> #     bandwidth: 6911.00 MB/sec
>>> #     latency: 0 nsec
>>> #     overhead: 10 nsec
>>> #     put_short: <= 4294967295
>>> #     put_bcopy: unlimited
>>> #     get_bcopy: unlimited
>>> #     am_short: <= 8k
>>> #     am_bcopy: <= 8k
>>> #     domain: cpu
>>> #     atomic_add: 32, 64 bit
>>> #     atomic_and: 32, 64 bit
>>> #     atomic_or: 32, 64 bit
>>> #     atomic_xor: 32, 64 bit
>>> #     atomic_fadd: 32, 64 bit
>>> #     atomic_fand: 32, 64 bit
>>> #     atomic_for: 32, 64 bit
>>> #     atomic_fxor: 32, 64 bit
>>> #     atomic_swap: 32, 64 bit
>>> #     atomic_cswap: 32, 64 bit
>>> #     connection: to iface
>>> #     priority: 0
>>> #     device address: 0 bytes
>>> #     iface address: 8 bytes
>>> #     error handling: none
>>> #
>>> #
>>> # Memory domain: tcp
>>> #     component: tcp
>>> #
>>> #   Transport: tcp
>>> #
>>> #   Device: eno1
>>> #
>>> #   capabilities:
>>> #     bandwidth: 113.16 MB/sec
>>> #     latency: 5776 nsec
>>> #     overhead: 50000 nsec
>>> #     am_bcopy: <= 8k
>>> #     connection: to iface
>>> #     priority: 1
>>> #     device address: 4 bytes
>>> #     iface address: 2 bytes
>>> #     error handling: none
>>> #
>>> #   Device: ib0
>>> #
>>> #   capabilities:
>>> #     bandwidth: 6239.81 MB/sec
>>> #     latency: 5210 nsec
>>> #     overhead: 50000 nsec
>>> #     am_bcopy: <= 8k
>>> #     connection: to iface
>>> #     priority: 1
>>> #     device address: 4 bytes
>>> #     iface address: 2 bytes
>>> #     error handling: none
>>> #
>>> #
>>> # Memory domain: ib/mlx4_0
>>> #     component: ib
>>> #     register: unlimited, cost: 90 nsec
>>> #     remote key: 16 bytes
>>> #     local memory handle is required for zcopy
>>> #
>>> #   Transport: rc
>>> #
>>> #   Device: mlx4_0:1
>>> #
>>> #   capabilities:
>>> #     bandwidth: 6433.22 MB/sec
>>> #     latency: 900 nsec + 1 * N
>>> #     overhead: 75 nsec
>>> #     put_short: <= 88
>>> #     put_bcopy: <= 8k
>>> #     put_zcopy: <= 1g, up to 6 iov
>>> #     put_opt_zcopy_align: <= 512
>>> #     put_align_mtu: <= 2k
>>> #     get_bcopy: <= 8k
>>> #     get_zcopy: 33..1g, up to 6 iov
>>> #     get_opt_zcopy_align: <= 512
>>> #     get_align_mtu: <= 2k
>>> #     am_short: <= 87
>>> #     am_bcopy: <= 8191
>>> #     am_zcopy: <= 8191, up to 5 iov
>>> #     am_opt_zcopy_align: <= 512
>>> #     am_align_mtu: <= 2k
>>> #     am header: <= 127
>>> #     domain: device
>>> #     connection: to ep
>>> #     priority: 10
>>> #     device address: 3 bytes
>>> #     ep address: 4 bytes
>>> #     error handling: peer failure
>>> #
>>> #
>>> #   Transport: ud
>>> #
>>> #   Device: mlx4_0:1
>>> #
>>> #   capabilities:
>>> #     bandwidth: 6433.22 MB/sec
>>> #     latency: 910 nsec
>>> #     overhead: 105 nsec
>>> #     am_short: <= 172
>>> #     am_bcopy: <= 4088
>>> #     am_zcopy: <= 4088, up to 7 iov
>>> #     am_opt_zcopy_align: <= 512
>>> #     am_align_mtu: <= 4k
>>> #     am header: <= 3984
>>> #     connection: to ep, to iface
>>> #     priority: 10
>>> #     device address: 3 bytes
>>> #     iface address: 3 bytes
>>> #     ep address: 6 bytes
>>> #     error handling: peer failure
>>> #
>>> #
>>> # Memory domain: rdmacm
>>> #     component: rdmacm
>>> #     supports client-server connection establishment via sockaddr
>>> #   < no supported devices found >
>>> #
>>> # Memory domain: sysv
>>> #     component: sysv
>>> #     allocate: unlimited
>>> #     remote key: 32 bytes
>>> #
>>> #   Transport: mm
>>> #
>>> #   Device: sysv
>>> #
>>> #   capabilities:
>>> #     bandwidth: 6911.00 MB/sec
>>> #     latency: 80 nsec
>>> #     overhead: 10 nsec
>>> #     put_short: <= 4294967295
>>> #     put_bcopy: unlimited
>>> #     get_bcopy: unlimited
>>> #     am_short: <= 92
>>> #     am_bcopy: <= 8k
>>> #     domain: cpu
>>> #     atomic_add: 32, 64 bit
>>> #     atomic_and: 32, 64 bit
>>> #     atomic_or: 32, 64 bit
>>> #     atomic_xor: 32, 64 bit
>>> #     atomic_fadd: 32, 64 bit
>>> #     atomic_fand: 32, 64 bit
>>> #     atomic_for: 32, 64 bit
>>> #     atomic_fxor: 32, 64 bit
>>> #     atomic_swap: 32, 64 bit
>>> #     atomic_cswap: 32, 64 bit
>>> #     connection: to iface
>>> #     priority: 0
>>> #     device address: 8 bytes
>>> #     iface address: 16 bytes
>>> #     error handling: none
>>> #
>>> #
>>> # Memory domain: posix
>>> #     component: posix
>>> #     allocate: unlimited
>>> #     remote key: 37 bytes
>>> #
>>> #   Transport: mm
>>> #
>>> #   Device: posix
>>> #
>>> #   capabilities:
>>> #     bandwidth: 6911.00 MB/sec
>>> #     latency: 80 nsec
>>> #     overhead: 10 nsec
>>> #     put_short: <= 4294967295
>>> #     put_bcopy: unlimited
>>> #     get_bcopy: unlimited
>>> #     am_short: <= 92
>>> #     am_bcopy: <= 8k
>>> #     domain: cpu
>>> #     atomic_add: 32, 64 bit
>>> #     atomic_and: 32, 64 bit
>>> #     atomic_or: 32, 64 bit
>>> #     atomic_xor: 32, 64 bit
>>> #     atomic_fadd: 32, 64 bit
>>> #     atomic_fand: 32, 64 bit
>>> #     atomic_for: 32, 64 bit
>>> #     atomic_fxor: 32, 64 bit
>>> #     atomic_swap: 32, 64 bit
>>> #     atomic_cswap: 32, 64 bit
>>> #     connection: to iface
>>> #     priority: 0
>>> #     device address: 8 bytes
>>> #     iface address: 16 bytes
>>> #     error handling: none
>>> #
>>> #
>>> # Memory domain: cma
>>> #     component: cma
>>> #     register: unlimited, cost: 9 nsec
>>> #
>>> #   Transport: cma
>>> #
>>> #   Device: cma
>>> #
>>> #   capabilities:
>>> #     bandwidth: 11145.00 MB/sec
>>> #     latency: 80 nsec
>>> #     overhead: 400 nsec
>>> #     put_zcopy: unlimited, up to 16 iov
>>> #     put_opt_zcopy_align: <= 1
>>> #     put_align_mtu: <= 1
>>> #     get_zcopy: unlimited, up to 16 iov
>>> #     get_opt_zcopy_align: <= 1
>>> #     get_align_mtu: <= 1
>>> #     connection: to iface
>>> #     priority: 0
>>> #     device address: 8 bytes
>>> #     iface address: 4 bytes
>>> #     error handling: none
>>> #
>>>
>>> [novosirj@gpu004 ~]$ ucx_info -p -u t
>>> #
>>> # UCP context
>>> #
>>> #     md 0 : self
>>> #     md 1 : tcp
>>> #     md 2 : ib/mlx4_0
>>> #     md 3 : rdmacm
>>> #     md 4 : sysv
>>> #     md 5 : posix
>>> #     md 6 : cma
>>> #
>>> #   resource 0 : md 0 dev 0 flags -- self/self
>>> #   resource 1 : md 1 dev 1 flags -- tcp/eno1
>>> #   resource 2 : md 1 dev 2 flags -- tcp/ib0
>>> #   resource 3 : md 2 dev 3 flags -- rc/mlx4_0:1
>>> #   resource 4 : md 2 dev 3 flags -- ud/mlx4_0:1
>>> #   resource 5 : md 3 dev 4 flags -s rdmacm/sockaddr
>>> #   resource 6 : md 4 dev 5 flags -- mm/sysv
>>> #   resource 7 : md 5 dev 6 flags -- mm/posix
>>> #   resource 8 : md 6 dev 7 flags -- cma/cma
>>> #
>>> # memory: 0.84MB, file descriptors: 2
>>> # create time: 5.032 ms
>>> #
>>>
>>> Thanks for any help you can offer. What am I missing?
>>>
>>> --
>>> #BlackLivesMatter
>>>  ____
>>> || \\UTGERS,     |---------------------------*O*---------------------------
>>> ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>>> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>>>      `'
>>>
>>
>
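[Editor's postscript: for anyone hitting the same "did not match transport list" / "support level is none" output, UCX can be made to explain its own transport selection on the next run. A minimal sketch; UCX_TLS and UCX_LOG_LEVEL are standard UCX environment variables, but the transport list below is only a guess suited to the mlx4_0 device shown above, not a verified fix.]

```shell
# Ask UCX to log why each transport is accepted or rejected,
# and restrict it to transports the mlx4 hardware plausibly supports.
export UCX_LOG_LEVEL=debug      # verbose logs when UCX opens devices/transports
export UCX_TLS=rc,ud,sm,self    # assumed transport list for mlx4; adjust per site
echo "UCX_TLS=$UCX_TLS UCX_LOG_LEVEL=$UCX_LOG_LEVEL"
```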