Re: [OMPI users] [EXTERNAL] Helping interpreting error output

2024-04-16 Thread Pritchard Jr., Howard via users
Hi Jeffrey, I would suggest trying to debug what may be going wrong with UCX on your DGX box. The UCX FAQ (https://openucx.readthedocs.io/en/master/faq.html) lists several things to try. I’d suggest setting the UCX_LOG_LEVEL environment variable to info or debug and see if UCX says so
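
As a rough sketch of the suggestion above, the variable could be set in the launching shell and forwarded to the ranks. The application name `./my_app` and the rank count are placeholders, not taken from the thread, and the launch command is only printed here rather than executed.

```shell
# Raise UCX verbosity for this shell session; "info" is less noisy than "debug".
export UCX_LOG_LEVEL=info

# Hypothetical launch line: ./my_app and the rank count are placeholders.
# Open MPI's -x option forwards the named environment variable to the ranks.
LAUNCH_CMD="mpirun -np 2 -x UCX_LOG_LEVEL ./my_app"

# Print the command instead of running it, so the sketch is safe to paste.
echo "UCX_LOG_LEVEL=${UCX_LOG_LEVEL}"
echo "would run: ${LAUNCH_CMD}"
```

Switching `info` to `debug` should produce considerably more output, so it may be worth starting with `info` and a small rank count.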

[OMPI users] Helping interpreting error output

2024-04-16 Thread Jeffrey Layton via users
Good afternoon MPI fans of all ages, Yet again, I'm getting an error that I'm having trouble interpreting. This time, I'm trying to run ior. I've done it a thousand times but not on an NVIDIA DGX A100 with multiple NICs. The ultimate command is the following: /cm/shared/apps/openmpi4/gcc/4.1.5/

Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available processors)" when running multiple jobs concurrently

2024-04-16 Thread Greg Samonds via users
Hi Gilles, Thanks for your assistance. I tried the recommended settings but got an error saying “sm” is no longer available in Open MPI 3.0+, and to use “vader” instead. I then tried with “--mca pml ob1 --mca btl self,vader” but ended up with the original error: [podman-ci-rocky-8.8:09900] MC