Good afternoon MPI fans of all ages,

Yet again, I'm getting an error that I'm having trouble interpreting. This
time, I'm trying to run ior. I've done it a thousand times, but never on an
NVIDIA DGX A100 with multiple NICs.

The ultimate command is the following:


/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun \
  --mca btl '^openib' \
  -np 4 -map-by ppr:4:node \
  --allow-run-as-root \
  --mca btl_openib_warn_default_gid_prefix 0 \
  --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 \
  --mca plm_base_verbose 0 --mca plm rsh \
  /home/bcm/bin/bin/ior -w -r -z -e -C -t 1m -b 1g -s 1000 -o /mnt/test


These MPI options were suggested to me. The error I get is the following:

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      dgx-02
Framework: pml
Component: ucx
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[dgx-02:2399932] *** An error occurred in MPI_Init
[dgx-02:2399932] *** reported by process [2099773441,3]
[dgx-02:2399932] *** on a NULL communicator
[dgx-02:2399932] *** Unknown error
[dgx-02:2399932] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[dgx-02:2399932] ***    and potentially your MPI job)


My first inclination was that it couldn't find UCX, so I loaded that module
and re-ran it, but I get the exact same error message. I'm still checking
whether the ucx module gets loaded when I run via Slurm; mdtest, at least,
ran without issue.
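In case it helps, here's the check I plan to run next. It's just a sketch: the install path is taken from the mpirun command above, and whether this build ships the UCX PML plugin at all is exactly the open question. The fallback flags in the comment are standard Open MPI 4.x MCA parameters.

```shell
# Does this Open MPI build actually contain the UCX PML plugin?
# (Path is from the mpirun command above.)
MPI_HOME=/cm/shared/apps/openmpi4/gcc/4.1.5
PLUGIN="$MPI_HOME/lib/openmpi/mca_pml_ucx.so"

if [ -e "$PLUGIN" ]; then
    echo "pml/ucx plugin present"
    # If it exists but still won't open, a missing libucp/libucs on the
    # loader path is the usual cause:
    # ldd "$PLUGIN" | grep "not found"
else
    echo "pml/ucx plugin missing"
fi

# Possible workaround while I sort that out: skip UCX entirely and fall
# back to the ob1 PML over shared memory/TCP, e.g.:
#   mpirun --mca pml ob1 --mca btl self,vader,tcp ...
```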

Any thoughts?

Thanks!

Jeff
