Hi Jeffrey,

I would suggest trying to debug what may be going wrong with UCX on your DGX 
box.

There are several things to try in the UCX FAQ: 
https://openucx.readthedocs.io/en/master/faq.html

I’d suggest setting the UCX_LOG_LEVEL environment variable to info or debug and 
seeing whether UCX reports anything about what’s going wrong.
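
For example, you can export the variable to all ranks with mpirun's -x option (a sketch; the "..." stands for the rest of your existing command line):

mpirun -x UCX_LOG_LEVEL=debug ...

Alternatively, export UCX_LOG_LEVEL=debug in your shell before launching.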

Also add --mca plm_base_verbose 10 to the mpirun command line.
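
Applied to the command line from your mail below, that would look something like this (a sketch; note that your current command already sets --mca plm_base_verbose 0, so change that value to 10 rather than adding a second instance):

/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun -x UCX_LOG_LEVEL=debug \
    --mca plm_base_verbose 10 --mca btl '^openib' \
    ... [rest of your original options and the ior command]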

Have you used DGX boxes with only a single NIC successfully?

Howard


From: users <users-boun...@lists.open-mpi.org> on behalf of Jeffrey Layton via 
users <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Tuesday, April 16, 2024 at 12:30 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Jeffrey Layton <layto...@gmail.com>
Subject: [EXTERNAL] [OMPI users] Helping interpreting error output

Good afternoon MPI fans of all ages,

Yet again, I'm getting an error that I'm having trouble interpreting. This 
time I'm trying to run ior. I've run it a thousand times, but never on an NVIDIA 
DGX A100 with multiple NICs.

The ultimate command is the following:


/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' \
    -np 4 -map-by ppr:4:node --allow-run-as-root \
    --mca btl_openib_warn_default_gid_prefix 0 \
    --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 \
    --mca plm_base_verbose 0 --mca plm rsh \
    /home/bcm/bin/bin/ior -w -r -z -e -C -t 1m -b 1g -s 1000 -o /mnt/test


It was suggested to me to use these MPI options. The error I get is the 
following.

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      dgx-02
Framework: pml
Component: ucx
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[dgx-02:2399932] *** An error occurred in MPI_Init
[dgx-02:2399932] *** reported by process [2099773441,3]
[dgx-02:2399932] *** on a NULL communicator
[dgx-02:2399932] *** Unknown error
[dgx-02:2399932] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,
[dgx-02:2399932] ***    and potentially your MPI job)


My first inclination was that it couldn't find UCX, so I loaded the ucx module and 
re-ran it, but I get the exact same error message. I'm still checking whether the 
ucx module gets loaded when I run via Slurm; mdtest, for what it's worth, ran 
without issue.

Any thoughts?

Thanks!

Jeff





