Hi Jeffrey,

I would suggest debugging what may be going wrong with UCX on your DGX box. The UCX FAQ lists several things to try: https://openucx.readthedocs.io/en/master/faq.html

Start by setting the UCX_LOG_LEVEL environment variable to info or debug and see whether UCX reports anything about what is going wrong. Also add --mca plm_base_verbose 10 to the mpirun command line.
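For example, something along these lines (a sketch, not a drop-in command; -x just forwards the variable to the remote ranks, and plm_base_verbose 10 replaces the 0 you already pass):

  # export UCX's logging to every rank and bump PLM launcher verbosity;
  # keep the rest of your original command line unchanged
  UCX_LOG_LEVEL=debug \
  /cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun -x UCX_LOG_LEVEL \
      --mca plm_base_verbose 10 \
      <your remaining mpirun options> /home/bcm/bin/bin/ior <ior args>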
Have you used DGX boxes with only a single NIC successfully?

Howard

From: users <users-boun...@lists.open-mpi.org> on behalf of Jeffrey Layton via users <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Tuesday, April 16, 2024 at 12:30 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Jeffrey Layton <layto...@gmail.com>
Subject: [EXTERNAL] [OMPI users] Helping interpreting error output

Good afternoon MPI fans of all ages,

Yet again, I'm getting an error that I'm having trouble interpreting. This time I'm trying to run ior. I've run it a thousand times, but never on an NVIDIA DGX A100 with multiple NICs. The full command is the following:

/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' -np 4 -map-by ppr:4:node --allow-run-as-root --mca btl_openib_warn_default_gid_prefix 0 --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 --mca plm_base_verbose 0 --mca plm rsh /home/bcm/bin/bin/ior -w -r -z -e -C -t 1m -b 1g -s 1000 -o /mnt/test

These MPI options were suggested to me. The error I get is the following:

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host:      dgx-02
Framework: pml
Component: ucx
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[dgx-02:2399932] *** An error occurred in MPI_Init
[dgx-02:2399932] *** reported by process [2099773441,3]
[dgx-02:2399932] *** on a NULL communicator
[dgx-02:2399932] *** Unknown error
[dgx-02:2399932] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dgx-02:2399932] ***    and potentially your MPI job)

My first inclination was that it couldn't find UCX, so I loaded the ucx module and re-ran it. I get the exact same error message. I'm still checking whether the ucx module gets loaded when I run via Slurm (mdtest ran without issue there).

Any thoughts?

Thanks!

Jeff
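(A footnote for anyone hitting the same "pml / ucx" not-found error: a couple of quick checks, sketched against the install path from Jeff's command. The plugin location below is the conventional one for Open MPI 4.x builds and may differ on other systems.)

  # does this Open MPI build know about the UCX PML at all?
  /cm/shared/apps/openmpi4/gcc/4.1.5/bin/ompi_info | grep -i ucx

  # if the component file exists, check that its shared-library
  # dependencies resolve (this is what "unable to be opened" often means)
  ldd /cm/shared/apps/openmpi4/gcc/4.1.5/lib/openmpi/mca_pml_ucx.so | grep "not found"

  # confirm the UCX runtime itself is visible after "module load ucx"
  ucx_info -v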