Hello Gilles,
thanks for your response. I'm testing with 20 tasks, each using 8
threads. When using a single node or only a few nodes, we do not see
this issue either.
Attached are the slurm script that was used (it also reports the
environment variables) and the output logs from three different runs:
with srun, with mpirun, and with the mpirun --mca variant you suggested.
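For reference, the launch part of the batch script looks roughly like this
(the attachment has the exact paths and test-suite arguments; the
OMP_NUM_THREADS line is my own addition here to illustrate the 8 threads per task):

    #!/bin/bash
    #SBATCH --ntasks=20            # 20 MPI tasks (this run got 2 per node on 10 nodes)
    #SBATCH --cpus-per-task=8      # 8 cores/threads per task
    module load openmpi/4.1.3d     # the Lmod-provided installation under test
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
    srun ./mpi_test_suite          # launching via srun is where the errors appear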
It is correct that we do not see this issue when running with mpirun;
the errors are only observed when running with srun.
Moreover, I noticed that fewer tests are performed when using mpirun.
From that we conclude that the issue is related to the Slurm/Open MPI
interaction.
Switching from srun to mpirun also has some negative implications
w.r.t. scheduling and robustness. Therefore, we would like to start the
job with srun.
Cheers,
Alois
On 5/3/22 at 12:52, Gilles Gouaillardet via users wrote:
Alois,
Thanks for the report.
FWIW, I am not seeing any errors on my Mac with Open MPI from brew (4.1.3)
How many MPI tasks are you running?
Can you please confirm you can reproduce the error with
mpirun -np <n> ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective
Also, can you try the same command with
mpirun --mca pml ob1 --mca btl tcp,self ...
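i.e. combined with the command above, something along the lines of
(with <n> again being your number of MPI tasks):

    mpirun --mca pml ob1 --mca btl tcp,self -np <n> ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective

so that the ob1 PML and the tcp/self BTLs are used instead of UCX.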
Cheers,
Gilles
On Tue, May 3, 2022 at 7:08 PM Alois Schlögl via users wrote:
Within our cluster (debian10/slurm16, debian11/slurm20) with
InfiniBand, we have several instances of openmpi installed through
the Lmod module system. When testing the openmpi installations with
the mpi-test-suite 1.1 [1], it shows errors like these:
...
(Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
(Rank:0) tst_test_array[46]:Allreduce Sum
(Rank:0) tst_test_array[47]:Alltoall
Number of failed tests: 130
Summary of failed tests:
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX (27) number of values:1000
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000
...
when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The number of
errors may vary, but the first errors are always about
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
When testing with openmpi/3.1.3, the tests run successfully and there
are no failed tests.
Typically, the openmpi/4.1.x installation is configured with
./configure --prefix=${PREFIX} \
--with-ucx=$UCX_HOME \
--enable-orterun-prefix-by-default \
--enable-mpi-cxx \
--with-hwloc \
--with-pmi \
--with-pmix \
--with-cuda=$CUDA_HOME \
--with-slurm
but I've also tried different compilation options, including w/ and w/o
--enable-mpi1-compatibility, w/ and w/o ucx, and using hwloc either from
the OS or compiled from source. I could not identify any pattern.
Therefore, I'd like to ask you what the issue might be. Specifically,
I would like to know:
- Am I right in assuming that the mpi-test-suite [1] is suitable for
testing openmpi?
- What are possible causes for these types of errors?
- What would you recommend for debugging these issues? (See the sketch
just below for the kind of thing I have in mind.)
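For instance (the specific MCA settings here are just my guess, not
something I know to be the right knobs): would forcing a particular
transport via environment variables for the srun launch be a reasonable
first step, e.g.

    # force the ob1 PML and restrict the BTLs to tcp + self (bypassing UCX/openib)
    export OMPI_MCA_pml=ob1
    export OMPI_MCA_btl=tcp,self
    srun ./mpi_test_suite

or is there a better way to narrow down which component causes the
failures?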
Kind regards,
Alois
[1] https://github.com/open-mpi/mpi-test-suite/t
Attachment: job-mpi-test3.sh (application/shellscript)
delta197
/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/bin/ompi_info
running on 20*8 cores with 20 MPI-tasks and 8 threads
SHELL=/bin/bash
SLURM_JOB_USER=schloegl
SLURM_TASKS_PER_NODE=2(x10)
SLURM_JOB_UID=10103
SLURM_TASK_PID=50793
PKG_CONFIG_PATH=/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/lib/pkgconfig:/mnt/nfs/clustersw/Debian/bullseye/hwloc/2.7.1/lib/pkgconfig:/mnt/nfs/clustersw/shared/cuda/11.2.2/pkgconfig
SLURM_LOCALID=0
SLURM_SUBMIT_DIR=/nfs/scistore16/jonasgrp/schloegl/slurm
HOSTNAME=delta197
LANGUAGE=en_US:en
SLURMD_NODENAME=delta197
_ModuleTable002_=ewpmbiA9ICIvbW50L25mcy9jbHVzdGVyc3cvRGViaWFuL2J1bGxzZXllL21vZHVsZWZpbGVzL0NvcmUvaHdsb2MvMi43LjEubHVhIiwKZnVsbE5hbWUgPSAiaHdsb2MvMi43LjEiLApsb2FkT3JkZXIgPSAzLApwcm9wVCA9IHt9LApzdGFja0RlcHRoID0gMSwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gImh3bG9jLzIuNy4xIiwKd1YgPSAiMDAwMDAwMDAyLjAwMDAwMDAwNy4wMDAwMDAwMDEuKnpmaW5hbCIsCn0sCm9wZW5tcGkgPSB7CmZuID0gIi9tbnQvbmZzL2NsdXN0ZXJzdy9EZWJpYW4vYnVsbHNleWUvbW9kdWxlZmlsZXMvQ29yZS9vcGVubXBpLzQuMS4zZC5sdWEiLApmdWxsTmFtZSA9ICJvcGVubXBpLzQuMS4zZCIsCmxvYWRPcmRlciA9IDQsCnByb3BUID0g
MPICC=/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/bin/mpicc
__LMOD_REF_COUNT_MODULEPATH=/mnt/nfs/clustersw/Debian/bullseye/modulefiles/MPI/openmpi/4.1.3d:1;/mnt/nfs/clustersw/Debian/bullseye/modulefiles/Linux:1;/mnt/nfs/clustersw/Debian/bullseye/modulefiles/Core:1;/mnt/nfs/clustersw/Debian/bullseye/lmod/lmod/modulefiles/Core:1
OMPI_MCA_btl=self,openib
_ModuleTab