Hi Jeffrey,
I would suggest trying to debug what may be going wrong with UCX on your DGX
box.
The UCX FAQ lists several things to try:
https://openucx.readthedocs.io/en/master/faq.html
I’d suggest setting the UCX_LOG_LEVEL environment variable to info or debug and
seeing if UCX says so
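For reference, a minimal sketch of the UCX_LOG_LEVEL suggestion above, assuming the job is launched with Open MPI's mpirun. The rank count and the application path are hypothetical placeholders, not values from this thread, and the script only prints the command unless the last line is uncommented.

```shell
#!/bin/sh
# Raise UCX verbosity. "info" is fairly quiet; "debug" is much more detailed.
export UCX_LOG_LEVEL=info

# Hypothetical placeholders: substitute the real rank count and binary.
NP=2
APP=./your_mpi_app

# Open MPI's mpirun forwards a named environment variable to all ranks with -x.
CMD="mpirun -x UCX_LOG_LEVEL -np $NP $APP"

# Print the command for inspection first.
echo "$CMD"

# Uncomment to actually launch the job:
# eval "$CMD"
```

If info does not show anything useful, changing the value to debug and rerunning should give a much longer trace of what UCX is doing at startup.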
Good afternoon MPI fans of all ages,
Yet again, I'm getting an error that I'm having trouble interpreting. This
time, I'm trying to run ior. I've done it a thousand times but not on an
NVIDIA DGX A100 with multiple NICs.
The ultimate command is the following:
/cm/shared/apps/openmpi4/gcc/4.1.5/
Hi Gilles,
Thanks for your assistance.
I tried the recommended settings but got an error saying “sm” is no longer
available in Open MPI 3.0+, and to use “vader” instead. I then tried with
“--mca pml ob1 --mca btl self,vader” but ended up with the original error:
[podman-ci-rocky-8.8:09900] MC